Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Created by
  • Haebom

Authors

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

Outline

This paper points out the limitations of multiple-choice evaluation and proposes 'answer matching', a generative evaluation method. Multiple-choice evaluation is objective and easy to automate, but it allows models to identify the correct option without even seeing the question. In answer matching, by contrast, the model generates a free-form answer, and a recent language model judges whether it matches the reference answer. Measuring agreement with human grading on the MMLU-Pro and GPQA-Diamond datasets, the authors find that answer matching reaches accuracy close to inter-annotator agreement, even when a small grader model is used, whereas multiple-choice evaluation and LLM-as-a-judge evaluation without reference answers agree poorly with human grading. This is not merely a conceptual concern: model rankings change significantly when free-form responses are evaluated with answer matching. The paper therefore discusses how to shift the evaluation ecosystem from multiple choice to answer matching.
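To make the contrast with multiple choice concrete, below is a minimal Python sketch of the answer-matching setup described above: the model under test generates a free-form answer without seeing any options, and a separate grader language model compares it against the reference answer. The `generate` and `judge` callables and the judge prompt are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of answer matching (illustrative, not the paper's exact code).
# `generate` and `judge` are assumed callables wrapping any chat-style LLM API.

JUDGE_PROMPT = """You are grading a free-form answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: MATCH or NO_MATCH."""


def answer_matching_score(examples, generate, judge):
    """Score a model with answer matching.

    examples: list of dicts with 'question' and 'reference' keys.
    generate: callable(question) -> free-form answer from the model under test.
    judge:    callable(prompt) -> text from the grader LLM (can be a small model).
    """
    correct = 0
    for ex in examples:
        # Free-form generation: the model never sees answer options.
        candidate = generate(ex["question"])
        verdict = judge(JUDGE_PROMPT.format(
            question=ex["question"],
            reference=ex["reference"],
            candidate=candidate,
        ))
        correct += verdict.strip().upper().startswith("MATCH")
    return correct / len(examples)
```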

Takeaways, Limitations

Takeaways:
  • Clearly presents the limitations of multiple-choice evaluation and experimentally demonstrates the advantages of answer matching, a generative evaluation method.
  • Answer matching enables more accurate and reliable language model evaluation.
  • Addresses the problems of existing multiple-choice evaluation and points toward a paradigm shift in language model evaluation.
  • Demonstrates that answer matching remains accurate even when small-scale language models are used as graders.
Limitations:
  • The computational cost of answer matching may be higher than that of multiple-choice evaluation.
  • Evaluation accuracy may be affected by the quality and quantity of the reference answers.
  • Evaluation results may vary depending on the performance of the grader language model used for answer matching.
  • Further research is needed to determine whether answer matching is applicable to all types of questions.