BEARCUBS is a benchmark of 111 questions for evaluating the information-seeking abilities of web agents in real-world web environments. Unlike existing benchmarks, it uses live web pages and requires diverse modes of interaction (e.g., video comprehension, 3D navigation). Each question has a concise ground-truth answer and a human-verified browsing trajectory, enabling transparent evaluation. Human studies show that the questions are solvable but challenging (84.7% human accuracy), with insufficient domain knowledge and overlooked details being the most common causes of failure. ChatGPT Agent achieved 65.8% accuracy, substantially outperforming other agents, but reaching human-level performance will require finer-grained control, more complex data filtering, and faster execution. BEARCUBS will be periodically updated and maintained.