Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

Created by
  • Haebom

Authors

Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye

Outline

This paper presents HiPhO, a new benchmark built from high school physics Olympiad problems. HiPhO compiles 13 recent Olympiad exams from 2024-2025, covering problem types ranging from text-only to diagram-based. It grades solutions at both the answer and step level using official marking schemes that mirror human judging criteria, and assigns gold, silver, and bronze medals based on official medal thresholds, enabling direct comparison between models and human contestants. An evaluation of 30 state-of-the-art (M)LLMs shows that most open-source MLLMs remain at or below the bronze level, while open-source LLMs show promising progress with occasional gold medals; closed-source reasoning MLLMs, despite achieving 6-12 gold medals, still fall significantly short of a perfect score.
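To make the evaluation scheme concrete, here is a minimal Python sketch of how step-level partial credit and threshold-based medal assignment could work. The step descriptions, point values, and cutoffs are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of HiPhO-style grading: step-level partial credit
# is summed per exam, then the total is mapped to a medal via official
# score thresholds. All numbers below are made up for illustration.

from dataclasses import dataclass

@dataclass
class Step:
    description: str
    max_points: float
    awarded: float  # points a grader assigns to the model's work on this step

def exam_score(steps: list[Step]) -> float:
    """Sum step-level partial credit, mirroring an official marking scheme."""
    return sum(min(s.awarded, s.max_points) for s in steps)

def assign_medal(score: float, thresholds: dict[str, float]) -> str:
    """Map a total exam score to a medal using official cutoffs."""
    for medal in ("gold", "silver", "bronze"):  # check highest tier first
        if score >= thresholds[medal]:
            return medal
    return "no medal"

# Illustrative usage with invented values:
steps = [
    Step("set up energy conservation", 2.0, 2.0),
    Step("solve for final velocity", 3.0, 1.5),  # partial credit
]
thresholds = {"gold": 4.5, "silver": 3.0, "bronze": 2.0}
print(assign_medal(exam_score(steps), thresholds))  # -> "silver"
```

This mirrors the paper's two key design choices as summarized above: partial credit at the step level rather than answer-only scoring, and medal tiers that allow direct comparison with human contestants.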

Takeaways, Limitations

Takeaways:
Introduces HiPhO, the first human-aligned evaluation benchmark built from high school physics Olympiads.
Clearly demonstrates the gap in physics reasoning capability between open-source and closed-source models.
Presents a new standard for improving the physics reasoning ability of (M)LLMs.
Enables broad evaluation across diverse problem types, from text-only to diagram-based physics problems.
Allows model performance to be compared directly with that of human contestants via medal-based scoring.
Limitations:
The number and variety of Olympiad exams included in the benchmark may be limited.
Perfect alignment with human judging criteria may be difficult to achieve.
It is unclear whether the performance advantage of closed-source models stems from the models themselves or from differences in data access and training strategies.
Even the best models still fall well short of a perfect score.