Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Created by
  • Haebom

Author

Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng

Outline

CMPhysBench is a new benchmark for evaluating the performance of large language models (LLMs) in condensed matter physics. It consists of over 520 graduate-level questions covering key subfields and fundamental theoretical frameworks of condensed matter physics, including magnetism, superconductivity, and strongly correlated systems. It focuses solely on computational problems, requiring LLMs to independently generate complete solutions and thereby probing a deep understanding of the problem-solving process. Furthermore, it leverages a tree-based representation of equations to introduce the Scalable Expression Edit Distance (SEED) score, which provides precise, non-binary partial credit and more accurately assesses the similarity between a prediction and the correct answer. The results show that even the best-performing model, Grok-4, achieves an average SEED score of only 36 and an accuracy of only 28% on CMPhysBench, revealing a significant capability gap, particularly in practical and cutting-edge areas of condensed matter physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
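The idea behind SEED is that two expressions are compared as trees, so a nearly correct answer earns partial credit instead of a binary pass/fail. The exact SEED algorithm is defined in the paper; the sketch below is only a minimal stand-in that illustrates the principle, using a Levenshtein distance over preorder traversals of hand-built expression trees (the tuple encoding and the `similarity` helper are assumptions for illustration, not the paper's implementation).

```python
# Illustrative sketch of a partial-credit score over expression trees.
# NOT the paper's SEED metric: we approximate tree similarity with a
# Levenshtein distance over preorder node-label sequences.

def preorder(node):
    """Flatten an expression tree (label, *children) into a label sequence."""
    if isinstance(node, tuple):
        label, *children = node
        seq = [label]
        for c in children:
            seq.extend(preorder(c))
        return seq
    return [node]  # leaf: a symbol or a number

def levenshtein(a, b):
    """Classic edit distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def similarity(pred, gold):
    """Partial score in [0, 1]; 1 means the trees are identical."""
    p, g = preorder(pred), preorder(gold)
    return 1 - levenshtein(p, g) / max(len(p), len(g))

# Gold answer E = h * nu vs. a near-miss prediction E = (h * nu) / 2:
gold = ("mul", "h", "nu")
pred = ("div", ("mul", "h", "nu"), "2")
print(similarity(pred, gold))  # 0.6: partial credit, not a flat zero
```

A binary exact-match metric would score the near-miss prediction 0; a tree-based distance instead rewards it for containing the correct subexpression, which is what makes fine-grained comparison across models possible.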

Takeaways, Limitations

Takeaways: CMPhysBench is a new benchmark that rigorously assesses the performance of LLMs in condensed matter physics, and it reveals a significant gap in their ability to solve condensed matter physics problems. The SEED score enables precise, partial-credit evaluation, and the open code and dataset facilitate ongoing research and development.
Limitations: Even the current best-performing model shows low accuracy on CMPhysBench, suggesting that further research is needed to improve LLMs' understanding of condensed matter physics. The benchmark's questions may not span every area of condensed matter physics, and the computational cost of the SEED score may be high.