This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
Created by
Haebom
Author
Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
Outline
In this paper, we evaluate scientific AI agents on Humanity's Last Exam (HLE), an extremely difficult benchmark, in pursuit of the long-standing goal of accelerating scientific discovery with AI agents. To this end, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools. X-Master treats code as an interaction language, drawing on built-in Python libraries and custom tools to augment its reasoning. We further scale its capabilities through X-Masters, a distributed and stacked agent workflow that systematically increases the breadth and depth of reasoning. X-Masters achieves a new state-of-the-art score of 32.1% on HLE, surpassing results from OpenAI and Google DeepMind and becoming the first to break the 30% mark. This study advances our understanding of complex task solving and provides valuable experience for future model training.
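To make "code as an interaction language" concrete, the following is a minimal sketch of one way such a loop could work, not the authors' implementation. The `[code]`/`[result]` markers, the function names, and the `toy_model` stub are all assumptions made for illustration; a real agent would call a language model and run the code in a sandbox.

```python
import contextlib
import io
import re

# Hypothetical delimiters for model-written code; the paper's format may differ.
CODE_BLOCK = re.compile(r"\[code\](.*?)\[/code\]", re.DOTALL)


def run_code(source: str, namespace: dict) -> str:
    """Execute model-written Python and return whatever it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)  # illustration only: no sandboxing here
    except Exception as exc:
        return f"{buffer.getvalue()}Error: {exc!r}"
    return buffer.getvalue()


def agent_loop(question: str, model, max_turns: int = 4) -> str:
    """Alternate between model text and code execution until no code is emitted."""
    context = question
    namespace: dict = {}  # shared so later snippets can reuse earlier variables
    for _ in range(max_turns):
        reply = model(context)
        context += "\n" + reply
        match = CODE_BLOCK.search(reply)
        if match is None:
            return reply  # the model answered without requesting execution
        result = run_code(match.group(1), namespace)
        context += f"\n[result]{result}[/result]"
    return context  # turn budget exhausted; return the transcript so far


def toy_model(context: str) -> str:
    """Hypothetical stand-in for an LLM: computes once, then answers."""
    if "[result]" not in context:
        return "I should compute this.[code]import math\nprint(math.factorial(10))[/code]"
    value = context.split("[result]")[-1].split("[/result]")[0].strip()
    return f"The answer is {value}."


answer = agent_loop("What is 10!?", toy_model)
print(answer)
```

The point of the sketch is that the agent needs no fixed tool schema: anything reachable from Python (a standard-library module here, a custom tool in a fuller system) becomes available simply because the model can write code that calls it.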
Takeaways, Limitations
• Takeaways:
  ◦ We present a novel approach that combines tool-augmented reasoning agents with distributed workflows to significantly improve AI performance on scientific problems.
  ◦ We demonstrate the agent's capabilities by achieving state-of-the-art performance on HLE, a demanding benchmark.
  ◦ The study deepens understanding of complex task solving and provides valuable experience for developing future AI models.
  ◦ The solution is released as open source to encourage use and further development by other researchers.
• Limitations:
  ◦ X-Master's performance gains may be specific to one benchmark (HLE); generalization to other scientific problems requires further study.
  ◦ Further analysis is needed on the efficiency and scalability of the distributed workflow in X-Masters.
  ◦ The limitations of HLE itself (e.g., it may not cover the full scope of scientific discovery) may affect the performance evaluation of X-Master.
  ◦ Because the type and quality of the tools and libraries used significantly affect X-Master's performance, bias in tool selection should be considered.
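The multi-agent workflow referred to above can be pictured roughly as follows. This is a hedged sketch of one reading of "breadth and depth", not the X-Masters pipeline itself: several solvers run in parallel on the same question (breadth), and later stages revisit and choose among their outputs (depth). The stage names, the stub agents, and the majority-vote selector are all assumptions; majority voting in particular is a simple stand-in for whatever selection step the paper uses.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Agent = Callable[[str], str]


def scatter(question: str, solvers: List[Agent]) -> List[str]:
    """Breadth: run independent solver agents in parallel on the same question."""
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        return list(pool.map(lambda solve: solve(question), solvers))


def refine(candidates: List[str], reviser: Agent) -> List[str]:
    """Depth: a later stage revisits each earlier candidate."""
    return [reviser(candidate) for candidate in candidates]


def select(candidates: List[str]) -> str:
    """Final stage: majority vote, a simple stand-in for a selector agent."""
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical stub agents; a real system would call a language model here.
solvers = [
    lambda q: " 42",
    lambda q: "42 ",
    lambda q: "41",
]
final = select(refine(scatter("toy question", solvers), reviser=str.strip))
print(final)
```

Each extra solver or stage multiplies the number of model calls, which is why the efficiency and scalability of such a workflow is listed as an open question.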