This paper introduces the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to assess the scientific understanding of foundation models. EESE was developed to address the risks of data leakage and the inefficiencies of large-scale testing. The benchmark comprises two parts: a private EESE-Pool of over 100,000 expertly constructed scientific instances (question-answer pairs) spanning five domains and over 500 subdomains, and EESE itself, a periodically refreshed 500-instance subset drawn from the pool for leak-proof, low-cost evaluation. Experiments on 32 open-source and closed-source models show that EESE effectively distinguishes model strengths and weaknesses across both scientific domains and cognitive dimensions.
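The pool-plus-rotating-subset design can be pictured as a simple sampling procedure: a large private pool is maintained, and each evaluation cycle a fresh, small subset is drawn and released. The sketch below is a minimal illustration only, not the authors' construction pipeline; the item fields ("domain", "question", "answer"), the stratified draw, and the per-cycle seed are assumptions made for this example.

```python
# Minimal sketch (not the authors' code) of drawing a periodically refreshed
# evaluation subset from a private pool, spread roughly evenly across domains.
import random
from collections import defaultdict

def sample_eval_subset(pool, subset_size=500, seed=None):
    """Draw a subset from the pool, stratified approximately by domain."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for item in pool:
        by_domain[item["domain"]].append(item)

    per_domain = max(1, subset_size // len(by_domain))
    subset = []
    for items in by_domain.values():
        rng.shuffle(items)
        subset.extend(items[:per_domain])

    # Top up or trim to the exact target size.
    remaining = [i for i in pool if i not in subset]
    rng.shuffle(remaining)
    subset.extend(remaining[: max(0, subset_size - len(subset))])
    return subset[:subset_size]

# Example: refresh the released subset each cycle with a new seed, so that
# previously released questions do not accumulate in model training data.
pool = [{"domain": f"d{i % 5}", "question": f"q{i}", "answer": f"a{i}"} for i in range(2000)]
eese_round_1 = sample_eval_subset(pool, seed=1)
eese_round_2 = sample_eval_subset(pool, seed=2)
```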
Takeaways, Limitations
• Takeaways:
◦ A benchmark design that offers the reliability, scalability, and forward compatibility needed to evaluate the scientific understanding of foundation models.
◦ Reduces the risk of data leakage and enables efficient, low-cost evaluation.
◦ Supports comparison and analysis of the scientific capabilities of different models.
• Limitations:
◦ EESE-Pool is kept private, which may limit accessibility.
◦ Ongoing updates and maintenance of the benchmark require continued resources.
◦ The benchmark may concentrate on particular scientific domains and subdomains, which can limit generalizability to other fields.