Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Created by
  • Haebom

Author

Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yair Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng

CritPt: Complex Research using Integrated Thinking - Physics Test

Outline

This paper asks whether LLMs can effectively reason about complex, open problems in cutting-edge physics research, and what kinds of reasoning tasks physicists actually want LLMs to support. To this end, the authors present CritPt (Complex Research using Integrated Thinking - Physics Test), the first benchmark designed to test unpublished, research-level reasoning tasks. CritPt consists of 71 complex research problems spanning modern physics, including condensed matter, quantum physics, atomic, molecular, and optical physics, astrophysics, high-energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics. Each problem was created by physicists and is evaluated through an automated scoring pipeline. While current SOTA LLMs show initial promise on individual checkpoints (smaller subtasks), they remain far from reliably solving full-scale research problems: even when equipped with a coding tool, GPT-5 (high) achieves an accuracy of only about 10%. CritPt highlights the significant gap between the capabilities of current models and the needs of real-world physics research, and provides a foundation for developing scientifically grounded AI tools.

Takeaways, Limitations

Takeaways:
CritPt provides a new benchmark for assessing LLMs' ability to tackle complex problems drawn from real-world physics research.
Current SOTA LLMs still struggle with full-scale physics research problems.
The benchmark lays a foundation for developing scientifically grounded AI tools and suggests directions for doing so.
Limitations:
Current LLM performance on the benchmark is low, limiting practical usefulness in research.
CritPt's problems rely on the knowledge of physics experts, so problem creation and evaluation require significant expertise.
Model accuracy remains low, and further research is needed to improve the models.