Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Created by
  • Haebom

Authors

Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai

Outline

This paper presents Scientists' First Exam (SFE), a new benchmark for effectively evaluating the complex multimodal reasoning that is increasingly important in scientific discovery. SFE is organized into three levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning, and comprises 66 multimodal tasks with 830 expert-verified VQA pairs across five domains. The current state-of-the-art models GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, showing that there is substantial room for improvement in MLLM performance on scientific tasks. This study is expected to contribute to the advancement of AI-driven scientific discovery.

Takeaways, Limitations

Takeaways:
Introduces SFE, a new benchmark for assessing scientific multimodal reasoning capabilities.
Goes beyond the limitations of existing benchmarks by comprehensively assessing scientific cognitive abilities.
Quantifies the scientific reasoning ability of state-of-the-art MLLMs and highlights the room for improvement.
Contributes to the advancement of AI-driven scientific discovery research.
Limitations:
The SFE benchmark covers only five domains, so its generalizability to other fields requires further study.
The variety of question types in the current benchmark needs to be expanded.
Further research is needed on assessing more diverse and complex scientific problem-solving skills.