SealQA is a new challenge benchmark for evaluating search-augmented language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0, the main benchmark; (2) Seal-Hard, which assesses factual accuracy and reasoning ability; and (3) LongSeal, which tests long-context, multi-document reasoning in “needle-in-a-haystack” settings. Our evaluation shows that even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning effort. Advanced reasoning models such as DeepSeek-R1-671B and o3-mini prove highly vulnerable to noisy search results. Moreover, increasing test-time compute does not yield reliable gains for o3-mini, o4-mini, or o3; performance often plateaus or even declines. While recent models are less affected by the “lost in the middle” problem, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.