Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Scaling Truth: The Confidence Paradox in AI Fact-Checking

Created by
  • Haebom

Author

Ihsan A. Qazi, Zohaib Khan, Abdullah Ghani, Agha A. Raza, Zafar A. Qazi, Wassay Sajjad, Ayesha Ali, Asher Javaid, Muhammad Abdullah Sohail, Abdul H. Azeemi

Outline

This paper systematically evaluates nine large language models (LLMs) on 5,000 claims assessed by 174 professional fact-checking organizations in 47 languages. The models span several categories: open- and closed-source, various sizes, various architectures, and reasoning-based. To test generalization, the evaluation uses four prompting strategies that reflect how both ordinary citizens and expert fact-checkers interact with such systems, along with claims published after the models' training data cutoffs. Drawing on more than 240,000 human annotations, the authors find a pattern resembling the Dunning-Kruger effect: smaller models express high confidence despite lower accuracy, while larger models achieve high accuracy but express lower confidence. This poses a risk of systematic bias in information verification, especially when smaller models are used by under-resourced organizations. The performance gap is most pronounced for claims in languages other than English and claims from the Global South, potentially exacerbating existing information inequalities. These findings establish a multilingual benchmark for future research and provide a policy rationale for ensuring equitable access to reliable AI-assisted fact-checking.
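To make the reported confidence-accuracy mismatch concrete, below is a minimal Python sketch of how such a gap could be measured per model. It assumes per-claim records containing a model name, the model's verdict, its self-reported confidence, and a gold label; the field names and toy data are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch: measuring the confidence-accuracy gap per model.
# Field names (model, verdict, confidence, gold_label) are hypothetical;
# the paper's actual data schema and metrics may differ.
from collections import defaultdict

def confidence_accuracy_gap(records):
    """For each model, compare mean self-reported confidence (0-1)
    against mean accuracy. A positive gap indicates overconfidence
    (the Dunning-Kruger-like pattern reported for smaller models);
    a negative gap indicates underconfidence."""
    by_model = defaultdict(lambda: {"conf": 0.0, "correct": 0, "n": 0})
    for r in records:
        stats = by_model[r["model"]]
        stats["conf"] += r["confidence"]
        stats["correct"] += int(r["verdict"] == r["gold_label"])
        stats["n"] += 1
    return {
        m: {
            "mean_confidence": s["conf"] / s["n"],
            "accuracy": s["correct"] / s["n"],
            "gap": s["conf"] / s["n"] - s["correct"] / s["n"],
        }
        for m, s in by_model.items()
        if s["n"] > 0
    }

# Example usage with toy records:
records = [
    {"model": "small-llm", "verdict": "false", "gold_label": "true", "confidence": 0.95},
    {"model": "small-llm", "verdict": "true", "gold_label": "true", "confidence": 0.90},
    {"model": "large-llm", "verdict": "true", "gold_label": "true", "confidence": 0.60},
    {"model": "large-llm", "verdict": "false", "gold_label": "false", "confidence": 0.55},
]
for model, metrics in confidence_accuracy_gap(records).items():
    print(model, metrics)
```

On real data one would likely use a proper calibration metric such as expected calibration error, but even this simple mean gap captures the direction of miscalibration the paper describes.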

Takeaways, Limitations

Takeaways:
  • Provides a multilingual benchmark of fact-checking performance across a diverse set of LLMs.
  • Reveals an inverse pattern: smaller models pair high confidence with low accuracy, while larger models pair high accuracy with lower confidence.
  • Identifies a risk of systematic bias when under-resourced organizations rely on smaller models for fact-checking.
  • Highlights the performance gap for claims in non-English languages and from the Global South.
  • Provides a basis for policymaking to ensure equitable access to AI-assisted fact-checking.
Limitations:
  • The 5,000 claims used in the study may not fully represent all types of information and all languages.
  • The analysis of other factors affecting LLM performance (e.g., data quality, training methods) is limited.
  • Changes in LLM performance need to be tracked over the long term.