Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Train-before-Test Harmonizes Language Model Rankings

Created by
  • Haebom

Authors

Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

Outline

Existing language model benchmarks yield conflicting model rankings, making model selection and comparison difficult. This paper proposes comparing models by their potential using a "train-before-test" approach, which applies identical benchmark-specific fine-tuning to each model before evaluation. Through extensive experiments on 24 benchmarks and 61 models, the authors show that model rankings based on train-before-test are consistent across benchmarks. Furthermore, train-before-test restores the relationship between perplexity and downstream task performance that is lost in conventional evaluation, and reveals that model potential is governed by a single latent factor. The authors recommend adopting train-before-test as a standard component of LLM benchmarking.
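The protocol described above can be sketched in a few lines. This is only an illustrative toy, not the authors' code: `fine_tune` and `evaluate` are hypothetical stand-ins backed by a placeholder score table, and all model names and scores are invented to show how fine-tuning each model on a benchmark's train split before evaluation can change (and harmonize) the resulting ranking.

```python
# Toy sketch of the train-before-test protocol (NOT the authors' implementation).
# Model names, benchmark names, and scores are placeholders for illustration.

def fine_tune(model, train_split):
    """Stand-in for benchmark-specific fine-tuning: tags the model with the split."""
    return f"{model}+ft:{train_split}"

def evaluate(candidate, test_split, scores):
    """Stand-in for benchmark evaluation, backed by a toy score table."""
    return scores[(candidate, test_split)]

def rank_models(models, benchmark, scores, train_first=True):
    """Rank models on a benchmark, optionally fine-tuning each one first."""
    results = {}
    for m in models:
        candidate = fine_tune(m, f"{benchmark}-train") if train_first else m
        results[m] = evaluate(candidate, f"{benchmark}-test", scores)
    # Higher score ranks first.
    return sorted(models, key=lambda m: results[m], reverse=True)

# Toy scores: model B wins under direct evaluation, but A has higher
# potential once both receive identical benchmark-specific fine-tuning.
toy_scores = {
    ("A", "b1-test"): 0.4, ("B", "b1-test"): 0.6,
    ("A+ft:b1-train", "b1-test"): 0.8, ("B+ft:b1-train", "b1-test"): 0.7,
}

print(rank_models(["A", "B"], "b1", toy_scores, train_first=False))  # ['B', 'A']
print(rank_models(["A", "B"], "b1", toy_scores, train_first=True))   # ['A', 'B']
```

The key point of the protocol is that every model gets the identical fine-tuning budget, so differences in the final scores reflect model potential rather than how well each model's out-of-the-box formatting or instruction style happens to match the benchmark.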

Takeaways, Limitations

Takeaways:
Train-before-test ensures consistency in model potential rankings.
Train-before-test restores the relationship between perplexity and downstream performance.
Train-before-test reveals single-factor dominance of model potential.
Train-before-test is proposed as a fundamental component of LLM benchmarking.
Limitations:
The paper does not explicitly state any limitations.