This paper presents MoNaCo, a novel benchmark for evaluating the information-seeking capabilities of agents based on large language models (LLMs). Unlike existing QA benchmarks, MoNaCo consists of 1,315 time-consuming natural-language questions that require tens or even hundreds of intermediate steps for humans to solve. MoNaCo was built through a decomposed annotation pipeline in which a large number of time-consuming, real-world questions were collected and manually answered. Evaluating state-of-the-art LLMs on MoNaCo reveals that their F1 scores reach at most 61.2%, hampered by low recall and hallucinations, which highlights the limitations of LLM-based agents on complex, large-scale, real-world information-seeking tasks. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available.