Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

Created by
  • Haebom

Authors

Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

BoxingGym: A Benchmark for Scientific Agents

Outline

This paper introduces BoxingGym, a benchmark for systematically evaluating how LLM-based scientific agents propose scientific models, design experiments, and revise models in light of data, capabilities central to AI's goal of understanding the world and explaining scientific theories. The benchmark comprises ten environments grounded in real scientific domains ranging from psychology to ecology. Experimental design ability is assessed with Expected Information Gain (EIG), a standard metric quantifying how much an experiment is expected to reduce uncertainty about a model's parameters, while model discovery is assessed with explanation-based evaluation and prediction error. The results show that current LLMs such as GPT-4o struggle with both experimental design and model discovery, and that augmenting them with explicit statistical models does not significantly improve performance.
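To make the EIG metric concrete, here is a minimal sketch (not taken from the paper) that computes the expected information gain of a single coin-flip experiment under a discrete uniform prior over the coin's bias. The Bernoulli setup, function names, and grid are illustrative assumptions; EIG is the prior entropy minus the expected posterior entropy after observing the outcome.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def eig_bernoulli(prior, thetas):
    """EIG of one coin flip: prior entropy minus expected posterior entropy.

    prior  -- probabilities over a grid of candidate biases
    thetas -- the grid of candidate biases theta = P(heads)
    """
    p_heads = np.sum(prior * thetas)                    # marginal P(y = heads)
    post_h = prior * thetas / p_heads                   # posterior given heads
    post_t = prior * (1 - thetas) / (1 - p_heads)       # posterior given tails
    expected_post_entropy = (p_heads * entropy(post_h)
                             + (1 - p_heads) * entropy(post_t))
    return entropy(prior) - expected_post_entropy

thetas = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(thetas) / len(thetas)              # uniform prior
print(eig_bernoulli(prior, thetas))                     # positive: the flip is informative
```

An agent choosing among candidate experiments would, in this framing, prefer the one with the highest EIG, i.e. the observation expected to shrink its uncertainty the most.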

Takeaways, Limitations

Takeaways:
Presents a new benchmark for evaluating the experimental design and model discovery capabilities of LLM-based scientific agents.
Quantitatively assesses experimental design capability using EIG.
Evaluates model discovery capability through explanation-based evaluation and prediction error.
Provides ten environments based on diverse real-world scientific fields.
Clearly documents the limitations of current LLMs.
Limitations:
Current LLM performance is low and requires substantial further improvement.
Adding an explicit statistical model does not improve performance.