Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

Created by
  • Haebom

Author

Ken Tsui

Self-Correction Blind Spot

Outline

Large language models (LLMs) have transformed AI, but they still make mistakes and can explore unproductive reasoning paths. The ability to self-correct is essential for deploying LLMs in safety-critical applications. This study uncovers a systematic failure of LLMs to correct errors in their own outputs, termed the "self-correction blind spot": models successfully correct errors presented in external input while failing to correct identical errors in their own generations. To investigate this, the authors present Self-Correction Bench, an evaluation framework that measures the phenomenon through controlled error injection at three complexity levels. Testing 14 open non-reasoning models, they find an average blind spot rate of 64.5%. Several lines of evidence suggest that this limitation is related to the composition of training data: human demonstration data rarely contain error-correction sequences, whereas models trained with reinforcement learning (RL) learn to correct errors through outcome feedback. Notably, appending a minimal "Wait" prompt reduced blind spots by 89.3%, suggesting the capability exists but requires triggering. The study highlights an important limitation likely shaped by training distributions and presents a practical approach to improving the reliability of LLMs.
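To make the evaluation idea concrete, below is a minimal sketch of how a blind spot rate could be measured, assuming a generic chat-model interface. The function names, prompts, and checker are illustrative placeholders, not the benchmark's actual implementation.

```python
# Minimal sketch (not the paper's code): inject the same error once as the
# model's own prior output and once as external input, then compare how often
# each gets corrected. `chat()` and `contains_correction()` are hypothetical
# placeholders for a chat-model API and an answer checker.

def chat(messages):
    """Placeholder for a call to a chat LLM; returns the model's reply text."""
    raise NotImplementedError

def contains_correction(reply, correct_answer):
    """Placeholder check: does the reply flag the error and give the fix?"""
    return correct_answer in reply

def blind_spot_rate(cases):
    """Fraction of cases corrected when the error is external
    but not when it appears in the model's own (injected) output."""
    blind = 0
    for question, wrong_answer, correct_answer in cases:
        # Condition 1: the error is injected as the model's own prior turn,
        # and the model is asked to continue from it.
        self_ctx = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": wrong_answer},
        ]
        # Condition 2: the identical error is presented as external input.
        external_ctx = [
            {"role": "user",
             "content": f"{question}\nSomeone answered: {wrong_answer}. Is this correct?"},
        ]
        corrected_self = contains_correction(chat(self_ctx), correct_answer)
        corrected_external = contains_correction(chat(external_ctx), correct_answer)
        if corrected_external and not corrected_self:
            blind += 1
    return blind / len(cases)

# Example case: a simple arithmetic slip.
cases = [("What is 17 * 24?", "17 * 24 = 398", "408")]
```

The key design point is that the injected error is identical in both conditions; only its attribution (the model's own output vs. an external source) changes, so any gap in correction rates isolates the blind spot.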

Takeaways, Limitations

Takeaways:
Identifies a fundamental limitation of LLM self-correction: the "self-correction blind spot."
Introduces the Self-Correction Bench evaluation framework for measuring it.
Suggests that training data, especially human demonstration data that rarely contains error-correction sequences, may contribute to this phenomenon.
Shows that a simple intervention such as appending a "Wait" prompt can significantly reduce the blind spot (see the sketch after this section).
Offers a practical approach to improving LLM reliability in safety-critical settings.
Limitations:
Findings may not generalize beyond the specific models and training data evaluated.
The mechanism behind the effect of the "Wait" prompt is not fully understood.
The evaluation focuses solely on non-reasoning models; applicability to reasoning models is unknown.
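As referenced in the Takeaways above, the reported intervention is strikingly simple. The sketch below assumes a completion-style generation interface; `continue_from` is a hypothetical placeholder, not an API from the paper.

```python
# Minimal sketch (assumed interface, not the paper's implementation) of the
# "Wait" intervention: append a short cue after the erroneous output and let
# the model keep generating.

def continue_from(prefix):
    """Placeholder for continuing generation from the given text prefix."""
    raise NotImplementedError

prefix = "Q: What is 17 * 24?\nA: 17 * 24 = 398"

baseline = continue_from(prefix)               # often continues without noticing the error
with_wait = continue_from(prefix + "\nWait,")  # the cue tends to trigger re-checking
```

The intervention changes nothing about the model itself; it only adds a token that, per the paper's results, triggers a correction behavior the model already has.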