This paper points out the limitations of multiple-choice assessment and proposes 'answer matching', a generative evaluation method. Multiple-choice assessment is objective and easy to automate, but it has a key weakness: models can often identify the correct option without even reading the question, by exploiting patterns in the choices themselves. Answer matching, in contrast, has the model generate an answer in free form and then uses a recent language model to judge whether that answer matches the reference answer. Measuring agreement with human grading on the MMLU-Pro and GPQA-Diamond datasets shows that answer matching reaches accuracy close to inter-human agreement even when a small matcher model is used, whereas multiple-choice assessment and reference-free LLM judging show much lower agreement with human evaluation. The improvement from answer matching is not merely conceptual: when free-form responses are scored by answer matching, the rankings of various models change substantially. The paper therefore discusses how the evaluation ecosystem can transition from multiple-choice assessment to answer matching.
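The following is a minimal sketch of the answer-matching protocol as described above, not the authors' implementation. It assumes two placeholder callables, `generate` (the evaluated model) and `judge` (the matcher model), each taking a prompt string and returning text; the prompt wording is purely illustrative.

```python
def answer_matching(question: str, reference_answer: str, generate, judge) -> bool:
    """Score one item: the candidate model answers freely, then a (possibly
    small) matcher model judges whether the answer matches the reference."""
    # 1. The evaluated model produces a free-form answer (no options shown).
    candidate = generate(f"Question: {question}\nAnswer concisely.")

    # 2. A language model compares the free-form answer to the reference answer.
    verdict = judge(
        "Does the candidate answer express the same final answer as the reference?\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly 'yes' or 'no'."
    )
    return verdict.strip().lower().startswith("yes")


def accuracy(items, generate, judge) -> float:
    """Aggregate answer-matching accuracy over (question, reference) pairs."""
    results = [answer_matching(q, ref, generate, judge) for q, ref in items]
    return sum(results) / len(results)
```

Because the matcher only has to verify equivalence against a known reference answer, this step is far more constrained than open-ended LLM judging, which is consistent with the paper's finding that even small matcher models agree closely with human graders.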