This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models
Created by: Haebom
Authors: Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma
Outline
PhysUniBench is a large-scale multimodal benchmark designed to assess physics problem-solving at the undergraduate level. It contains 3,304 problems, each paired with a visual diagram, spanning eight major physics subfields and including both open-ended and multiple-choice questions; difficulty is rated through an iterative model-in-the-loop process. Evaluation shows that even the best-performing models struggle with multi-step problems and problems requiring precise diagram interpretation, with GPT-4o mini achieving only about 34.2% accuracy. The benchmark aims to advance AI for science and to encourage the development of models with stronger physical reasoning, problem-solving, and multimodal understanding.
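To make the evaluation setup concrete, below is a minimal sketch of how one might score a multimodal model on benchmark items of this kind. The field names (`image`, `question`, `answer`) and the `query_model` function are illustrative assumptions, not the benchmark's actual data schema or interface.

```python
import json

def query_model(image_path: str, question: str) -> str:
    """Placeholder for a call to a multimodal model (e.g., via an API).
    Returns the model's answer as a string."""
    raise NotImplementedError("Plug in your MLLM call here.")

def evaluate(benchmark_file: str) -> float:
    """Score a model on PhysUniBench-style items and return accuracy."""
    with open(benchmark_file, encoding="utf-8") as f:
        items = json.load(f)  # assumed: a list of problem dicts

    correct = 0
    for item in items:
        # Each item is assumed to pair a diagram with a question text.
        prediction = query_model(item["image"], item["question"])
        # Exact match works for multiple-choice labels; open-ended
        # answers generally need a more tolerant comparison.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1

    return correct / len(items)
```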
Takeaways, Limitations
• Takeaways: Provides a new benchmark for comprehensively evaluating undergraduate-level physics problem-solving, clearly exposes the limits of physical reasoning in current state-of-the-art multimodal large language models (MLLMs), and suggests new research directions for AI in the sciences.
• Limitations: Potential subjectivity in the benchmark's problem formulation and difficulty assessment; the models evaluated so far may not be representative of all MLLMs; alignment with actual physics curricula has not been fully verified.