Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Created by
  • Haebom

Author

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

Outline

This paper highlights the risk that publicly released large language model (LLM) benchmarks can be unintentionally (or intentionally) absorbed into future LLM training or model selection, contaminating the models being evaluated. Existing mitigations, such as keeping benchmarks private or having participants submit their models or predictions, depend on trust in a single institution and still leave room for overfitting through repeated queries. This paper instead proposes publishing the benchmark in a form that enables open evaluation of LLMs without revealing the full ground-truth answers. The core idea is to inject randomness into the answers: for each question, several logically correct answers are prepared, and only one of them, chosen at random, is included as the official answer. This lowers the best achievable accuracy on the benchmark (the Bayes accuracy), which both protects the true answers and yields a test for data contamination: since even a perfect model cannot exceed the Bayes accuracy in expectation, accuracy significantly above it is strong evidence of contamination. Experiments show that this method accurately detects data contamination across a variety of benchmarks, models, and training methods.
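To make the construction concrete, here is a minimal sketch in Python. The questions, answer sets, and field names are hypothetical illustrations, not the paper's actual data; the mechanism follows the idea above: for each question with k logically correct answers, publish one chosen uniformly at random, so even a perfect model matches the published answer with probability at most 1/k.

```python
import random

# Hypothetical benchmark items: each question admits several logically
# correct answers, only one of which will be published as "the" answer.
items = [
    {"question": "Name a prime number between 10 and 20.",
     "valid_answers": ["11", "13", "17", "19"]},
    {"question": "Name a planet larger than Earth.",
     "valid_answers": ["Jupiter", "Saturn", "Uranus", "Neptune"]},
]

rng = random.Random(0)  # fixed seed so the published benchmark is reproducible

# Publish one uniformly sampled answer per question; the others stay secret.
benchmark = [
    {"question": it["question"],
     "answer": rng.choice(it["valid_answers"]),
     "num_valid": len(it["valid_answers"])}
    for it in items
]

# With the answer drawn uniformly from k valid options, even a perfect model
# matches the published answer with probability 1/k, so the best achievable
# (Bayes) accuracy of the benchmark is the average of 1/k over questions.
bayes_accuracy = sum(1 / it["num_valid"] for it in benchmark) / len(benchmark)
print(f"Bayes accuracy: {bayes_accuracy:.3f}")  # 0.250 here, since k = 4 throughout
```

A model that has merely memorized the published answer key will match it at a rate far above this bound, which is exactly the signal exploited for contamination detection.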

Takeaways, Limitations

Takeaways:
An effective mitigation for the model contamination problem caused by publishing LLM benchmarks on the Internet.
A new method for publicly evaluating LLMs without fully disclosing the benchmark answers.
A data contamination detection technique based on the Bayes accuracy (see the code sketch after the Limitations list below).
Validation of contamination detection performance across a variety of benchmarks, models, and training methods.
Limitations:
The effectiveness of the proposed method may depend on the benchmark's design and on how many logically correct answers each question admits.
Accuracy above the Bayes accuracy is not always attributable to data contamination; statistical fluctuation or other causes may also contribute.
The experiments cover specific datasets and models, and further research is needed to establish generalizability.
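As one plausible way to turn the Bayes-accuracy bound into a concrete contamination check, a one-sided binomial test asks how unlikely the observed number of matches would be if the model were, at best, a perfect but uncontaminated model. This is a sketch under that assumption; the counts below are made up for illustration, and the paper's exact test statistic may differ.

```python
from scipy.stats import binomtest

def contamination_pvalue(num_correct: int, num_questions: int,
                         bayes_accuracy: float) -> float:
    """One-sided binomial test: probability of seeing num_correct or more
    matches if the model's true accuracy is at most the Bayes accuracy."""
    return binomtest(num_correct, num_questions,
                     p=bayes_accuracy, alternative="greater").pvalue

# Hypothetical numbers: a model matches 320 of 1000 published answers on a
# benchmark whose Bayes accuracy is 0.25.
p = contamination_pvalue(320, 1000, 0.25)
print(f"p-value: {p:.2e}")  # a very small p-value flags likely contamination
```

In this made-up run the observed rate (0.32) sits more than five standard deviations above the Bayes accuracy of 0.25, so the test would flag likely contamination.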