Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Created by
  • Haebom

Author

Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang

Outline

This paper proposes FutureX, a live benchmark for evaluating the future-prediction capabilities of large language model (LLM) agents. FutureX supports real-time daily updates and prevents data contamination through an automated pipeline, covering dynamic tasks that demand expert-level predictive ability across diverse domains, including politics, economics, and finance. By evaluating 25 LLM/agent models, the authors comprehensively assess agents' adaptive reasoning and performance in dynamic environments, accounting for factors such as reasoning ability, search capability, and external tool integration. They also provide an in-depth analysis of agents' failure modes and the underlying causes of performance degradation (e.g., vulnerability to fake web pages, the temporal validity of information). The goal is to establish a dynamic, contamination-free evaluation standard for developing LLM agents capable of expert-level complex reasoning and predictive thinking.

Takeaways, Limitations

Takeaways:
  • Presents FutureX, the largest real-time benchmark for evaluating the predictive capabilities of LLM agents.
  • Prevents data contamination through real-time updates and an automated data-collection pipeline.
  • Provides comprehensive performance evaluation and failure-mode analysis across a range of LLM/agent models (covering reasoning, search, and external tool integration).
  • Sets a new standard for developing LLM agents with expert-level future-prediction capabilities.
Limitations:
  • Lacks specific figures on the scale and diversity of the benchmark's data.
  • Lacks detailed discussion of limitations beyond fake-webpage vulnerability and temporal validity.
  • Does not mention long-term maintenance and management plans for the FutureX benchmark.