Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Created by
  • Haebom

Author

Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng

Outline

CMPhysBench is a new benchmark designed to evaluate the performance of large language models (LLMs) in condensed matter physics. It consists of over 520 graduate-level questions covering key subfields and fundamental theoretical frameworks of condensed matter physics, including magnetism, superconductivity, and strongly correlated systems. The benchmark focuses on calculation problems that require LLMs to independently generate comprehensive solutions, ensuring a deep understanding of the problem-solving process. It also introduces the Scalable Expression Edit Distance (SEED) score, which uses a tree-based representation of expressions to award fine-grained (non-binary) partial credit and to measure the similarity between a prediction and the ground-truth answer more accurately. The results show that even the best-performing model, Grok-4, achieves only an average SEED score of 36 and an accuracy of 28% on CMPhysBench, revealing a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
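To make the idea of a tree-based partial-credit metric concrete, here is a minimal sketch of comparing two symbolic answers as expression trees. This is not the paper's SEED implementation; the parser choice (sympy), the simplified recursive distance, the function names, and the normalization are all assumptions made purely for illustration.

```python
# Illustrative sketch only: a tree-based partial-credit score for symbolic answers.
# NOT the paper's SEED metric; parsing (sympy), the simplified recursive distance,
# and the normalization are assumptions for demonstration purposes.
import sympy as sp

def tree_size(expr):
    """Count the nodes of a sympy expression tree."""
    return 1 + sum(tree_size(arg) for arg in expr.args)

def tree_distance(a, b):
    """A simplified recursive distance between two expression trees.
    (A full tree edit distance, e.g. Zhang-Shasha, is more involved.)"""
    if a == b:
        return 0
    if not a.args or not b.args:
        # Mismatched leaves (different symbols or numbers).
        return max(tree_size(a), tree_size(b))
    if a.func != b.func or len(a.args) != len(b.args):
        # Mismatched operators: charge for the whole larger subtree.
        return max(tree_size(a), tree_size(b))
    # Same operator: recurse into the children.
    return sum(tree_distance(x, y) for x, y in zip(a.args, b.args))

def partial_credit(pred_str, gold_str):
    """Map the tree distance to a score in [0, 100]."""
    pred, gold = sp.sympify(pred_str), sp.sympify(gold_str)
    dist = tree_distance(pred, gold)
    norm = max(tree_size(pred), tree_size(gold))
    return max(0.0, 100.0 * (1 - dist / norm))

print(partial_credit("3*k*T/2", "3*k*T/2"))  # exact match -> 100.0
print(partial_credit("k*T/2", "3*k*T/2"))    # structurally close -> partial credit
```

The design idea this sketch tries to mirror is that an answer differing from the ground truth by only a small subtree (e.g., a wrong prefactor) still earns most of the credit instead of being marked simply wrong, which is what binary exact-match scoring would do.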

Takeaways, Limitations

Takeaways: We present a new benchmark (CMPhysBench) that can accurately evaluate the performance of LLMs in condensed matter physics. The SEED score enables a more fine-grained performance evaluation than binary correctness. The results reveal significant limitations in current LLMs' ability to solve condensed matter physics problems. The publicly released code and dataset will facilitate continued research and development.
Limitations: The current benchmark focuses solely on calculation problems and may not fully reflect other aspects of condensed matter physics (e.g., conceptual understanding and theoretical analysis). The difficulty and scope of the benchmark questions need to be expanded further. Because the evaluation covers a specific set of LLMs, further research is needed to determine how well the findings generalize to other types of models.