Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Created by
  • Haebom

Authors

Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou

Outline

PRELUDE is a benchmark for evaluating long-context understanding: given a prequel story written for a character, the task is to judge whether it is consistent with the canonical narrative of the original novel. Because prequels are not part of the original text, assessing their validity requires retrieving and integrating indirectly related information from across the book, demanding more global comprehension and deeper reasoning than existing benchmarks. In 88% of instances, evidence must be drawn from multiple parts of the narrative. State-of-the-art LLMs, even when combined with RAG, in-domain training, and a commercial DeepResearch service, underperform humans by more than 15%. Further human studies show that models frequently reach correct answers through flawed reasoning, leaving a gap of more than 30% in reasoning accuracy compared with humans. These findings highlight substantial room for improvement in long-context understanding and reasoning.
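To make the task concrete, below is a minimal, hypothetical sketch of a retrieve-then-judge pipeline of the kind the experiments describe (RAG plus an LLM answering a consistency question). All names here (Instance, retrieve_passages, judge_consistency, the llm callable) are illustrative assumptions, not the authors' released code or the paper's exact method.

# Hypothetical sketch of scoring one PRELUDE-style instance with
# retrieval + an LLM judge. Not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Instance:
    character: str           # character whose prequel is being judged
    prequel: str             # candidate prequel story (not in the novel)
    novel_chunks: list[str]  # the original novel, split into passages

def retrieve_passages(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Toy lexical retriever: rank passages by word overlap with the query.
    # A real system would use dense embeddings; this keeps the sketch
    # self-contained and runnable.
    q = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def judge_consistency(inst: Instance, llm) -> bool:
    # `llm` is any callable mapping a prompt string to a text response,
    # e.g. llm=lambda p: my_model.generate(p)  (hypothetical).
    evidence = retrieve_passages(inst.prequel, inst.novel_chunks)
    prompt = (
        "Character: " + inst.character + "\n"
        "Candidate prequel:\n" + inst.prequel + "\n\n"
        "Evidence passages from the novel:\n" + "\n---\n".join(evidence) + "\n\n"
        "Does the prequel contradict the novel? Answer CONSISTENT or CONTRADICTORY."
    )
    answer = llm(prompt)
    # Naive parsing of the judge's verdict.
    return answer.strip().upper().startswith("CONSISTENT")

Note that the toy retriever issues a single query built from the prequel text; the paper's observation that 88% of instances require evidence from multiple parts of the narrative is precisely what makes such single-query retrieval insufficient in practice.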

Takeaways, Limitations

Takeaways: Introduces PRELUDE, a new benchmark for evaluating long-context understanding and reasoning. The experiments expose the limitations of existing methodologies and clearly demonstrate the shortcomings of state-of-the-art LLMs on long-context comprehension, while offering insight into how human and model reasoning processes differ.
Limitations: Specific details on the scale and diversity of the PRELUDE dataset are lacking, as is information on the number of human participants and their selection criteria. The analysis of the models' reasoning processes is also somewhat limited.