
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Created by
  • Haebom

Author

Yimeng Chen, Piotr Piękos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber

Outline

PhysGym is a novel benchmark and simulation platform for evaluating the scientific discovery capabilities of agents based on large language models (LLMs). It focuses on assessing how agents cope with varying environmental complexity and how they utilize prior knowledge. A key feature of PhysGym is its fine-grained control over the level of prior knowledge provided to the agent. The benchmark consists of interactive physics simulations in which the agent must actively probe the environment, sequentially collect data under experimental constraints, and formulate hypotheses about the underlying physical laws. It provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. Results from baseline LLMs demonstrate how capability varies across levels of prior knowledge and task complexity.
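The interaction loop described above can be sketched with a toy example: an agent runs experiments against a hidden physical law under a limited budget, then proposes a hypothesis. All class and function names below are illustrative assumptions, not the actual PhysGym API.

```python
class ToyPhysicsEnv:
    """Toy environment with a hidden law: distance = 0.5 * g * t**2 (free fall).
    Illustrative only; not the real PhysGym interface."""

    def __init__(self, g=9.81, budget=10):
        self.g = g
        self.budget = budget  # maximum number of experiments allowed
        self.used = 0

    def experiment(self, t):
        """Run one experiment (consumes budget); returns the observed distance."""
        if self.used >= self.budget:
            raise RuntimeError("experiment budget exhausted")
        self.used += 1
        return 0.5 * self.g * t ** 2


def naive_agent(env):
    """Sequentially collect data, then estimate g assuming d = 0.5 * g * t**2."""
    data = [(t, env.experiment(t)) for t in (1.0, 2.0, 3.0)]
    # Average the per-observation estimates implied by the assumed form.
    return sum(d / (0.5 * t ** 2) for t, d in data) / len(data)


env = ToyPhysicsEnv()
print(round(naive_agent(env), 2))  # → 9.81 in this noise-free setting
```

In the real benchmark the hypothesis would be a symbolic expression proposed by an LLM rather than a hard-coded functional form, and the budget constraint is what forces the agent to choose experiments strategically.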

Takeaways, Limitations

Takeaways:
Provides a new benchmark for systematically evaluating the scientific reasoning abilities of LLM-based agents.
Enables quantitative analysis of the influence of prior knowledge.
Enables comparison of agent performance across problem complexity and levels of prior knowledge.
Provides standardized evaluation protocols and metrics.
Limitations:
The scope of the current benchmark and the diversity of its simulation environments may be limited.
The setting differs from the real scientific discovery process.
Further research is needed on the objectivity and validity of the evaluation metrics.
Additional work may be needed to assess the generalization ability of LLM agents.