Existing language model benchmarks provide conflicting model rankings, making model selection and comparison difficult. This paper compares model potential using a "train-before-test" approach, which applies identical benchmark-specific fine-tuning to each model before evaluation. Through extensive experiments on 24 benchmarks and 61 models, we demonstrate that rankings of model potential based on train-before-test are consistent across benchmarks. Furthermore, train-before-test restores the relationship between perplexity and downstream task performance, a relationship that is lost under conventional evaluation, and reveals that model potential is governed by a single latent factor. We recommend train-before-test as a fundamental element of LLM benchmarking.
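The procedure can be summarized as: fine-tune every model on the same benchmark-specific training data, score the fine-tuned models on the held-out test split, and compare the resulting rankings across benchmarks. The sketch below is not the paper's code; it is a minimal illustration that assumes user-supplied `fine_tune` and `evaluate` callables (hypothetical stand-ins for any fine-tuning and scoring pipeline) and uses Spearman correlation to quantify ranking agreement between two benchmarks.

```python
# Minimal sketch of train-before-test ranking (assumed interfaces:
# fine_tune(model, train_split) -> adapted model,
# evaluate(model, test_split) -> scalar score; both are hypothetical).
from scipy.stats import spearmanr


def rank_models(models, benchmark, fine_tune, evaluate, train_before_test=True):
    """Rank models on one benchmark, optionally fine-tuning each model first."""
    scores = {}
    for name, model in models.items():
        candidate = (
            fine_tune(model, benchmark["train"]) if train_before_test else model
        )
        scores[name] = evaluate(candidate, benchmark["test"])
    # Best score first.
    return sorted(scores, key=scores.get, reverse=True)


def ranking_agreement(models, bench_a, bench_b, fine_tune, evaluate):
    """Spearman correlation between the model rankings induced by two benchmarks."""
    order_a = rank_models(models, bench_a, fine_tune, evaluate)
    order_b = rank_models(models, bench_b, fine_tune, evaluate)
    ranks_a = [order_a.index(m) for m in models]
    ranks_b = [order_b.index(m) for m in models]
    return spearmanr(ranks_a, ranks_b).correlation
```

Under this framing, the paper's central claim is that `ranking_agreement` is substantially higher with `train_before_test=True` than without it.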