Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Humans overrely on overconfident language models, across languages

Created by
  • Haebom

Authors

Neil Rathi, Dan Jurafsky, Kaitlyn Zhou

Outline

This paper emphasizes the importance of cross-linguistic calibration, i.e., ensuring that the responses of large language models (LLMs) deployed in multiple languages accurately convey uncertainty and limitations. Building on prior work showing that LLMs are linguistically overconfident in English, which leads users to overrely on confidently phrased outputs, the authors assess LLM safety in a global context by studying the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages. They find that the risk of overreliance is high in all languages. By analyzing the distribution of epistemic markers in LLM generations, they show that LLMs are overconfident across languages but also sensitive to linguistic variation (e.g., Japanese elicits the most uncertainty markers, while German and Chinese elicit the most certainty markers). Measuring user reliance across languages, they further find that users in all languages rely heavily on confidently generated LLM outputs, but that reliance behavior differs by language (e.g., reliance on expressions of uncertainty is much higher in Japanese than in English). These results indicate a high risk of overconfidence in multilingual LLMs and highlight the importance of culturally and linguistically contextualized model safety evaluations.
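To make the epistemic-marker analysis concrete, below is a minimal sketch of how one might tally certainty and uncertainty markers in LLM outputs per language. This is not the authors' actual pipeline: the marker lexicons, the `marker_distribution` helper, and the sample outputs are hypothetical placeholders, and the naive substring matching shown here would need proper tokenization (especially for Japanese and Chinese) in practice.

```python
from collections import Counter

# Hypothetical, minimal marker lexicons; the paper's real lexicons
# are language-specific and far more extensive.
MARKERS = {
    "en": {"certain": ["definitely", "certainly"], "uncertain": ["maybe", "possibly"]},
    "ja": {"certain": ["確かに"], "uncertain": ["かもしれない", "たぶん"]},
}

def marker_distribution(generations, lang):
    """Tally certainty vs. uncertainty markers over a list of LLM outputs."""
    counts = Counter()
    for text in generations:
        for label, markers in MARKERS[lang].items():
            counts[label] += sum(text.count(m) for m in markers)
    return counts

# Hypothetical example outputs:
outputs_en = ["The answer is definitely 42.", "It is possibly related to X, maybe Y."]
print(marker_distribution(outputs_en, "en"))  # Counter({'uncertain': 2, 'certain': 1})
```

Comparing such per-language counts is the kind of analysis behind observations like Japanese generations carrying the most uncertainty markers while German and Chinese carry the most certainty markers.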

Takeaways, Limitations

Takeaways:
Emphasizes the importance of cross-linguistic calibration of multilingual LLM responses.
The risk of LLM overconfidence, and of user overreliance on LLM outputs, is high in all languages.
LLM overconfidence and user reliance both vary across languages.
Points to the need for model safety evaluations that account for cultural and linguistic context.
Limitations:
The analysis covers only five languages.
The summary lacks a detailed explanation of how user reliance was measured.
Certain cultural factors may not be fully accounted for.
The specific types and sizes of the LLMs evaluated are not described.