Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Created by
  • Haebom

Authors

Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty

Outline

This paper presents MoNaCo, a new benchmark for evaluating the information-seeking capabilities of agents built on large language models (LLMs). Unlike existing QA benchmarks, MoNaCo consists of 1,315 time-consuming natural-language questions that would require dozens or even hundreds of intermediate steps for a human to answer. The benchmark is built with a decomposed annotation pipeline that collects such time-consuming real-world questions and answers them manually. Evaluating state-of-the-art LLMs on MoNaCo shows that the best model reaches only 61.2% answer F1, held back by low recall and hallucinated answers, highlighting how far LLM-based agents remain from handling complex, broad real-world information-seeking tasks. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly released.
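For intuition about the headline number: questions in MoNaCo typically have list-style gold answers, so performance is reported as an answer F1 that trades off recall (missed gold items) against precision (hallucinated items). The sketch below is only an assumed, simplified set-based F1 for illustration; MoNaCo's released evaluation code may use different normalization or answer matching.

```python
# Minimal sketch (assumption): set-based answer F1 for list-style answers,
# comparing predicted answer items against gold items after light normalization.
def normalize(item: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return item.strip().strip(".,").lower()

def answer_f1(predicted: list[str], gold: list[str]) -> float:
    pred_set = {normalize(p) for p in predicted}
    gold_set = {normalize(g) for g in gold}
    if not pred_set or not gold_set:
        return 0.0
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)   # penalizes hallucinated items
    recall = true_pos / len(gold_set)      # penalizes missing items
    return 2 * precision * recall / (precision + recall)

# Example: one hallucinated item and one missing gold item -> F1 ≈ 0.667
print(answer_f1(["Paris", "Berlin", "Oslo"], ["Paris", "Berlin", "Madrid"]))
```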

Takeaways, Limitations

Takeaways:
  • Introduces MoNaCo, a new benchmark that addresses limitations of existing QA benchmarks.
  • Provides a performance evaluation of LLM-based agents on complex, time-consuming real-world information-seeking tasks.
  • Exposes the recall and hallucination problems of LLM-based agents.
  • Offers an effective resource for tracking the progress of LLM agents.
  • Enables follow-up research by releasing the MoNaCo benchmark, code, prompts, and model predictions.
Limitations:
  • MoNaCo's questions may not fully represent every type of time-consuming real-world information-seeking task.
  • Scaling the benchmark may be limited by its reliance on manual annotation.
  • F1 alone, as the evaluation metric, may not comprehensively capture every aspect of an LLM agent's behavior.