Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

The Ever-Evolving Science Exam

Created by
  • Haebom

Author

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

Outline

This paper introduces the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to assess the scientific understanding of foundation models while addressing the risks of data leakage and the inefficiency of large-scale testing. The approach has two components: a private EESE-Pool of over 100,000 expertly constructed scientific instances (question-answer pairs) spanning five domains and over 500 subdomains, and EESE itself, a periodically updated 500-instance subset sampled from the pool to enable leakage-resilient, low-cost evaluation. Experiments on 32 open-source and closed-source models demonstrate that EESE effectively distinguishes model strengths and weaknesses across both scientific domains and cognitive dimensions.
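
The pool-plus-rotating-subset idea can be illustrated with a minimal sketch: periodically draw a small, dated exam from a large private pool and score a model on it, so that no fixed test set is exposed long enough to leak into training data. All function names and data structures below are illustrative assumptions, not the authors' released code.

```python
import random

def sample_exam(pool, k=500, seed=None):
    """Draw a k-instance exam from the full QA pool; rotating the sample each
    evaluation round is what makes the setup leakage-resilient."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

def evaluate(model_answer_fn, exam):
    """Fraction of exam questions the model answers correctly (exact match)."""
    correct = sum(
        model_answer_fn(item["question"]).strip() == item["answer"].strip()
        for item in exam
    )
    return correct / len(exam)

if __name__ == "__main__":
    # Toy pool standing in for the (private) 100K-instance EESE-Pool.
    pool = [{"question": f"Q{i}?", "answer": f"A{i}"} for i in range(10_000)]
    exam = sample_exam(pool, k=500, seed=2024)   # periodic 500-instance subset
    dummy_model = lambda q: "A" + q[1:-1]        # placeholder "model"
    print(f"accuracy: {evaluate(dummy_model, exam):.3f}")
```

Evaluating only a 500-instance subset also keeps per-round cost low compared with running models over the entire pool.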

Takeaways, Limitations

Takeaways:
Provides a benchmark design with the reliability, scalability, and future compatibility required to evaluate the scientific understanding of foundation models.
Reduces the risk of data leakage and enables efficient evaluation.
Supports comparison and analysis of the scientific capabilities of a wide range of models.
Limitations:
The EESE-Pool is kept private, which may limit accessibility.
Ongoing updates and maintenance of the benchmark require continued resources.
The benchmark may be concentrated in particular scientific fields and subfields, which could limit generalizability to other areas.