Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations

Created by
  • Haebom

Authors

Mahjabin Nahar, Eun-Ju Lee, Jin Won Park, Dongwon Lee

Outline

This study ran an online experiment (N=560) to investigate whether people use web search results to verify content generated by large language models (LLMs) and thereby detect inaccurate content, or "hallucinations." It compared two conditions in which search results accompanied the LLM-generated content, static (fixed search results provided by the LLM) and dynamic (participant-driven search), against a control condition (no search results). The authors analyzed how accurate participants perceived the LLM-generated content to be (genuine, minor hallucinations, severe hallucinations), how confident they were in their accuracy assessments, and how they evaluated the LLM overall. Participants in both the static and dynamic conditions rated hallucinated content as less accurate and viewed the LLM more negatively than participants in the control condition. However, participants in the dynamic condition rated genuine content as more accurate and reported higher overall confidence in their assessments. The study closes by discussing the practical implications of integrating web search capabilities into LLMs in real-world settings.

Takeaways, Limitations

Takeaways:
The findings suggest that integrating web search functionality may help users detect hallucinations in LLM output.
In particular, user-driven dynamic search was associated with better accuracy assessments of genuine content and with higher confidence in those assessments.
The study highlights the importance of integrating web search functionality in the practical use of LLMs.
Limitations:
Because the study was an online experiment, its results may differ from behavior in real-world use.
Further review may be needed to determine whether the sample size (N=560) is sufficient.
Further research is needed to determine whether the findings generalize across different LLMs and search engines.
The impact of search result quality or bias may not have been analyzed sufficiently.