In this paper, we propose an evaluation paradigm that transforms existing QA datasets into structured adversarial debates, addressing problems that plague current QA benchmarks, such as data contamination, memorization, and rising dataset-creation costs. One model defends the ground-truth answer while another constructs and defends an alternative answer, and an adjudicator model that does not know the correct answer renders the verdict. The protocol raises difficulty through multiple rounds of argumentation, limits the benefit of memorization, and keeps maintenance costs low by reusing existing QA items. Our main contributions are a pipeline that converts QA tasks into debate-based evaluations and a public benchmark built from a subset of MMLU-Pro questions. Experimental results confirm the robustness of the method and its resistance to data contamination, showing that a Llama 3.1 model fine-tuned on the test questions performs poorly in the debates. In addition, we show that even weak adjudicator models can distinguish stronger debaters, suggesting that improved systems can be evaluated cost-effectively. In conclusion, our framework emphasizes that “pretraining on the test set alone is not enough” and presents a sustainable way to measure the genuine reasoning ability of advanced language models.
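To make the protocol concrete, the following is a minimal sketch of the debate loop described above: a defender argues for the ground-truth answer, a challenger constructs and argues for an alternative, and a blind adjudicator picks a winner after a fixed number of rounds. The `Ask` callables, prompt wording, and function names are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the debate-style evaluation loop (illustrative only).
# The `Ask` callables and prompt templates are assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Callable, List

Ask = Callable[[str], str]  # maps a prompt string to a model's text response


@dataclass
class DebateRecord:
    question: str
    correct_answer: str
    alternative_answer: str
    transcript: List[str] = field(default_factory=list)


def run_debate(question: str, correct_answer: str,
               defender: Ask, challenger: Ask, adjudicator: Ask,
               rounds: int = 3) -> bool:
    """Run a multi-round debate; return True if the blind adjudicator
    sides with the defender's (ground-truth) answer."""
    # The challenger first constructs a plausible alternative answer.
    alternative = challenger(
        f"Question: {question}\n"
        f"Propose a plausible answer that differs from: {correct_answer}"
    )
    record = DebateRecord(question, correct_answer, alternative)

    for _ in range(rounds):
        context = "\n".join(record.transcript)
        # Defender argues for the ground-truth answer.
        record.transcript.append("Defender: " + defender(
            f"{question}\nDefend the answer '{correct_answer}'.\n"
            f"Debate so far:\n{context}"))
        # Challenger argues for the alternative answer.
        record.transcript.append("Challenger: " + challenger(
            f"{question}\nDefend the answer '{alternative}'.\n"
            f"Debate so far:\n" + "\n".join(record.transcript)))

    # The adjudicator sees both answers and the transcript,
    # but is never told which answer is the ground truth.
    verdict = adjudicator(
        f"Question: {question}\n"
        f"Answer A: {correct_answer}\nAnswer B: {alternative}\n"
        "Debate transcript:\n" + "\n".join(record.transcript) +
        "\nWhich answer is correct? Reply with 'A' or 'B'.")
    return verdict.strip().upper().startswith("A")
```

In practice, one would also randomize which answer is labeled A or B to avoid position bias in the adjudicator; the fixed labeling here is only to keep the sketch short.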