Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Created by
  • Haebom

Authors

Zihao Sun, Ling Chen

Overview

To address the challenges of evaluating autonomous web agents based on large language models (LLMs), this paper presents WebArXiv, a static, time-invariant benchmark built on the arXiv platform. WebArXiv supports reproducible and reliable evaluation by using fixed web snapshots, deterministic ground truths, and standardized action paths. The paper identifies a common failure mode, "Rigid History Reflection," in which agents over-rely on their past interaction history, and proposes a lightweight dynamic reflection mechanism that selectively retrieves relevant past steps during decision-making. Ten state-of-the-art web agents are evaluated on WebArXiv to show the performance differences between agents and to validate the proposed reflection strategy.
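As an illustration of what "selectively retrieving relevant past steps" could look like, here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how relevance is scored, so the `Step` structure, the word-overlap (Jaccard) score, and the `select_relevant_steps` function with its `k` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One past interaction step (hypothetical structure, not from the paper)."""
    index: int
    observation: str
    action: str


def _tokens(text: str) -> set[str]:
    # Crude whitespace tokenization; a real agent might use embeddings instead.
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in for relevance."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_relevant_steps(
    history: list[Step], current_observation: str, k: int = 2
) -> list[Step]:
    """Return up to k past steps most similar to the current observation.

    Steps with zero overlap are dropped, and the result is returned in
    chronological order so the agent reads it as a coherent trace.
    """
    scored = [(jaccard(s.observation, current_observation), s) for s in history]
    relevant = [(score, s) for score, s in scored if score > 0.0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    top = [s for _, s in relevant[:k]]
    return sorted(top, key=lambda s: s.index)
```

The contrast with "Rigid History Reflection" is that a rigid agent would pass the full `history` into every decision, whereas this sketch passes only the few steps that resemble the current page.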

Takeaways, Limitations

Takeaways:
We present WebArXiv, a static and time-invariant web agent benchmark based on arXiv, enabling reproducible and reliable evaluation.
We identify "Rigid History Reflection," a common failure mode of web agents, and propose an effective lightweight dynamic reflection mechanism to address it.
The evaluation of ten state-of-the-art web agents shows clear performance differences between them.
Limitations:
Because WebArXiv is limited to the arXiv platform, it may not reflect the diversity of other websites.
Further research is needed on the generalization performance of the proposed dynamic reflection mechanism.
The evaluation covers ten web agents, so the range of agent types it represents may be limited.