To address the challenges of evaluating autonomous web agents based on large language models (LLMs), this paper presents WebArXiv, a static and time-invariant benchmark built on the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by using fixed web snapshots, deterministic ground truths, and standardized action paths. We identify a common failure mode, "Rigid History Reflection," in which agents over-rely on their past interaction history, and propose a lightweight dynamic reflection mechanism that selectively retrieves relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv, demonstrating clear inter-agent performance differences and validating the effectiveness of our proposed reflection strategy.
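To give a concrete sense of the idea behind dynamic reflection, the following is a minimal illustrative sketch, not the paper's actual implementation: rather than conditioning the agent on its full interaction history, only the past steps most relevant to the current observation are retained. The names (`Step`, `select_relevant_steps`, `token_overlap`) and the token-overlap scoring are hypothetical stand-ins; a real system might instead use embedding similarity or model-based relevance scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One past interaction step: the action taken and the observation it produced."""
    action: str
    observation: str


def token_overlap(a: str, b: str) -> float:
    """Crude relevance score: Jaccard overlap of lowercase tokens.
    A stand-in for a learned similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def select_relevant_steps(history: List[Step], current_observation: str, k: int = 3) -> List[Step]:
    """Keep only the k past steps most relevant to the current page state,
    instead of replaying the entire interaction history to the agent."""
    scored = sorted(
        history,
        key=lambda s: token_overlap(s.observation, current_observation),
        reverse=True,
    )
    return scored[:k]


if __name__ == "__main__":
    history = [
        Step("click('Search')", "arXiv search page with query box"),
        Step("type('quantum computing')", "results list: papers on quantum computing"),
        Step("click('Advanced Search')", "advanced search form with date filters"),
    ]
    current = "results list filtered by date, quantum computing papers from 2023"
    for step in select_relevant_steps(history, current, k=2):
        print(step.action, "->", step.observation)
```

Under these assumptions, the selected steps would be injected into the agent's prompt in place of the full history, which is what mitigates the over-reliance described as Rigid History Reflection.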