Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Created by
  • Haebom

Authors

Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

Outline

SealQA is a new benchmark for evaluating search-augmented language models on fact-seeking questions where web search results are conflicting, noisy, or unhelpful. SealQA comes in three flavors: (1) Seal-0, the main benchmark; (2) Seal-Hard, which assesses broader factual accuracy and reasoning ability; and (3) LongSeal, which tests long-context, multi-document reasoning in a "needle-in-a-haystack" setting. The evaluation shows that even state-of-the-art LLMs perform poorly across all SealQA variants. In particular, on Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning effort. Advanced reasoning models such as DeepSeek-R1-671B and o3-mini also prove highly vulnerable to noisy search results. Furthermore, increasing test-time compute does not yield reliable gains for o3-mini, o4-mini, and o3; performance often plateaus or even declines. Although recent models are less affected by the "lost in the middle" problem, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To encourage future research, the authors release SealQA at huggingface.co/datasets/vtllms/sealqa.
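Since the dataset is published on the Hugging Face Hub, a quick way to explore it is with the `datasets` library. The sketch below is illustrative only: the config name "seal_0" and the split name are assumptions, not confirmed by the paper summary, so check the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual configuration, split, and field names.

```python
# Minimal sketch: loading SealQA from the Hugging Face Hub.
# Assumption: the repo exposes a config named "seal_0" with a "test" split;
# verify the real names on the dataset card before relying on this.
from datasets import load_dataset

seal_0 = load_dataset("vtllms/sealqa", name="seal_0", split="test")

# Inspect one record; each example is expected to pair a fact-seeking
# question with its reference answer(s), though field names may differ.
example = seal_0[0]
print(example)
```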

Takeaways, Limitations

Takeaways: Provides a new benchmark that exposes the shortcomings of current state-of-the-art search-augmented language models on fact-seeking questions, particularly their difficulty handling noisy search results and finding relevant information among large numbers of documents. Offers directions for future research and a concrete yardstick for model improvement. Improves research accessibility by publicly releasing the SealQA dataset.
Limitations: The benchmark focuses on specific types of questions and search environments, which may limit how broadly its results generalize. Further analysis is needed to understand why increasing test-time compute does not translate into better performance. Despite improvements on the "lost in the middle" problem, models still cannot reliably identify relevant documents when many distractors are present.