This paper introduces the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to assess the scientific understanding of foundation models. To mitigate the risk of data leakage and the inefficiency of large-scale testing, the benchmark comprises a private EESE-Pool of over 100,000 problem-answer pairs spanning five domains and more than 500 subdomains, together with EESE itself, a periodically updated subset of 500 problems drawn from the pool for leak-proof, low-cost evaluation. Experiments on 32 models demonstrate that EESE effectively distinguishes model strengths and weaknesses across both scientific and cognitive domains.
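As a rough illustration of the design described above, the sketch below shows one way a periodically refreshed 500-problem evaluation subset could be drawn from a large private pool. This is a hypothetical example, not the authors' implementation: the function name `sample_eval_subset`, the data layout, and the seed-per-release scheme are assumptions made only to make the idea concrete.

```python
import random

def sample_eval_subset(pool, k=500, period_seed=None):
    """Draw a fresh evaluation subset of k problems from a private pool.

    `pool` is a list of problem-answer records; `period_seed` can be derived
    from the release period so each refresh yields a different but
    reproducible subset (hypothetical scheme, for illustration only).
    """
    rng = random.Random(period_seed)
    return rng.sample(pool, k)

# Example: refresh the public evaluation subset for a new cycle.
private_pool = [{"id": i, "question": f"Q{i}", "answer": f"A{i}"} for i in range(100_000)]
eese_subset = sample_eval_subset(private_pool, k=500, period_seed="2025-Q3")
print(len(eese_subset))  # 500 problems for this evaluation cycle
```

Periodic re-sampling of this kind is what would keep the released subset small (cheap to run) while making it hard for any fixed set of problems to leak into training data.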
Takeaways, Limitations
• Takeaways:
    ◦ A dynamic benchmark design that reduces the risk of data leakage and improves evaluation efficiency.
    ◦ A comprehensive pool of assessment data covering a wide range of scientific disciplines and subfields.
    ◦ A robust, scalable, and future-proof approach to evaluating the scientific capabilities of foundation models.
    ◦ Validation of the benchmark's effectiveness through experiments on a diverse set of models.
• Limitations:
    ◦ Limited accessibility, since the EESE-Pool is kept private.
    ◦ Continuous maintenance and periodic updates are required to keep the benchmark reliable.
    ◦ Further research is needed to establish the generalizability of the evaluation results.