Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluation of Large Language Models via Coupled Token Generation

Created by
  • Haebom

Authors

Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez

Outline

This paper highlights that state-of-the-art large language models (LLMs) rely on internal randomization and can therefore return different responses to the same prompt, which complicates their evaluation and ranking. To control for this randomness, the authors develop a causal model of coupled autoregressive generation, in which different LLMs generate responses from the same source of randomness. Experiments with multiple models from the Llama, Mistral, and Qwen families on benchmark datasets show that coupled autoregressive generation reaches the same evaluation conclusions as conventional (independent) autoregressive generation while requiring up to 75% fewer samples. Surprisingly, however, pairwise comparisons of two or more models can yield different rankings under coupled versus conventional generation: pairwise win rates on prompts from the LMSYS Chatbot Arena platform differ between the two protocols. This suggests that apparent model superiority under conventional evaluation protocols may be confounded by the randomization of the generation process rather than reflecting actual model performance.
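The core idea of coupled generation can be illustrated with a small sketch: at each decoding step, a single shared noise value drives token sampling for every model being compared, so any disagreement between their outputs comes from the models' distributions rather than from independent random draws. The snippet below is a minimal, hypothetical illustration using inverse-CDF sampling with a shared uniform variable; the paper's causal model is more general, and the function name and toy logits here are assumptions, not the authors' implementation.

```python
import numpy as np

def coupled_sample_step(logits_per_model, u):
    """Sample one token from each model's next-token distribution using the
    SAME uniform noise u (inverse-CDF coupling). With a shared u, differences
    between the sampled tokens reflect differences in the models'
    distributions, not independent randomness."""
    tokens = []
    for logits in logits_per_model:
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        cdf = np.cumsum(probs)
        # First index whose cumulative probability reaches u.
        tokens.append(min(int(np.searchsorted(cdf, u)), len(probs) - 1))
    return tokens

# Toy example: two hypothetical "models" with slightly different logits.
rng = np.random.default_rng(0)
logits_a = np.array([2.0, 1.0, 0.5])
logits_b = np.array([1.8, 1.2, 0.5])
for _ in range(3):
    u = rng.uniform()  # one shared noise draw per generation step
    print(coupled_sample_step([logits_a, logits_b], u))
```

Conventional (independent) sampling would instead draw a fresh u for each model at every step; that extra, model-independent variance is exactly what the coupled protocol removes from the comparison.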

Takeaways, Limitations

Takeaways:
Highlights the need to control for the influence of internal randomization when evaluating large language models.
Shows that coupled autoregressive generation reaches the same conclusions as conventional generation with substantially fewer samples.
Suggests that results obtained with existing evaluation protocols may not accurately reflect models' actual performance.
Presents a new method for making LLM evaluation and ranking more reliable.
Limitations:
Further research is needed to determine whether the proposed causal model is applicable to all types of LLMs.
Further analysis is needed to determine the cause of the ranking differences observed in pairwise evaluations.
The range and variety of LLMs used in the experiments may be limited.
The computational cost of coupled autoregressive generation may not be fully addressed.