Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

The Ever-Evolving Science Exam

Created by
  • Haebom

Authors

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

Outline

This paper introduces the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to assess the scientific understanding of foundation models. Developed to address the risk of data leakage and the inefficiency of large-scale testing, EESE consists of two parts: a private EESE-Pool of over 100,000 problem-answer pairs spanning five domains and more than 500 subdomains, and EESE itself, a periodically refreshed subset of 500 problems sampled from the pool for leak-proof, low-cost evaluation. Experiments on 32 models demonstrate that EESE effectively distinguishes model strengths and weaknesses across both scientific and cognitive dimensions.
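The periodic-sampling idea behind the benchmark can be illustrated with a minimal sketch: the large pool stays private, and a small subset (here 500 items) is drawn for each evaluation round, which limits leakage of test items and keeps evaluation cheap. The function names, item fields, and the stratification-by-subdomain logic below are assumptions made for illustration, not the authors' actual pipeline.

```python
import random
from datetime import date

def sample_exam(pool, per_round=500, seed=None):
    """Draw a fixed-size exam from the private pool for one evaluation round.

    Hypothetical sketch: items are grouped by subdomain so the released
    subset stays broad rather than clustering in a few fields.
    """
    rng = random.Random(seed)
    by_subdomain = {}
    for item in pool:
        by_subdomain.setdefault(item["subdomain"], []).append(item)

    exam = []
    subdomains = list(by_subdomain)
    while len(exam) < per_round and subdomains:
        sub = rng.choice(subdomains)
        candidates = by_subdomain[sub]
        exam.append(candidates.pop(rng.randrange(len(candidates))))
        if not candidates:
            subdomains.remove(sub)
    return exam

# Example usage (illustrative data): refresh the released exam each round,
# seeding by date so a new subset is drawn and previously released items
# can be retired from the private pool.
private_pool = [
    {"id": i, "subdomain": f"sub_{i % 500}", "question": "...", "answer": "..."}
    for i in range(100_000)
]
current_exam = sample_exam(private_pool, per_round=500, seed=str(date.today()))
print(len(current_exam))  # 500
```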

Takeaways, Limitations

Takeaways:
Dynamic benchmark design that reduces data leakage risk and improves evaluation efficiency.
Building a comprehensive set of assessment data covering a wide range of scientific disciplines and subfields.
Providing a robust, scalable, and future-proof solution for evaluating the scientific capabilities of models.
Validation of the effectiveness of the benchmark through experiments on various models.
Limitations:
Limited accessibility, since the EESE-Pool is kept private.
Continuous management and updates are required to maintain the reliability of the benchmark.
Further research is needed to determine the generalizability of the assessment results.