We argue that current benchmarks for evaluating software engineering agents (e.g., SWE-Bench Verified) do not adequately reflect how developers interact with agents in real-world IDE environments, leading to an overestimation of agent capabilities. This paper proposes a novel framework that transforms existing benchmark tasks into realistic, real-world user queries. We apply the framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and SWE-Bench C#, and confirm that agent performance degrades on the transformed queries. This work presents a new paradigm for evaluating conversational, chat-based software engineering agents.