Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A

Created by
  • Haebom

Author

Benjamin Plaut, Nguyen X. Khanh, Tu Trinh

Outline

This paper analyzes 15 large language models (LLMs) and finds that the maximum softmax probability (MSP) of LLMs fine-tuned for chat is consistently miscalibrated on multiple-choice Q&A. Nevertheless, the MSP can still carry useful uncertainty information. The authors hypothesize that incorrect answers are associated with smaller MSPs than correct answers, and rigorous statistical testing shows that this holds for models that perform well on the underlying Q&A task. They also find a strong directional correlation between Q&A accuracy and how well the MSP predicts correctness, but no correlation between Q&A accuracy and calibration error. This suggests that within the current fine-tuning paradigm, improving LLM performance is likely to improve correctness prediction rather than calibration. Finally, the paper presents experiments showing that selectively abstaining from answers with low MSP can improve performance.
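The abstention mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold value is arbitrary, and `answer_or_abstain` is a hypothetical helper name.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def answer_or_abstain(choice_logits, threshold=0.6):
    """Return (predicted choice index, MSP), or (None, MSP) to abstain
    when the maximum softmax probability falls below the threshold."""
    probs = softmax(np.asarray(choice_logits, dtype=float))
    msp = float(probs.max())
    if msp < threshold:
        return None, msp       # reject: the model is too uncertain
    return int(probs.argmax()), msp
```

For example, logits of `[2.0, 0.1, 0.0, -1.0]` give a confident prediction for choice 0, while the flatter `[0.2, 0.1, 0.0, -0.1]` yields a low MSP and an abstention.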

Takeaways, Limitations

Takeaways:
Even when an LLM's MSP is miscalibrated on multiple-choice Q&A, it can still provide useful information for predicting whether an answer is correct.
As LLM performance improves, correctness prediction is likely to improve, but calibration is not.
The MSP can be used to improve performance through a rejection strategy; even a small amount of labeled data suffices to set an effective MSP threshold.
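One way to pick such a threshold from a small labeled set is a simple grid search, sketched below. The scoring rule (+1 for an answered correct question, -1 for an answered incorrect one, 0 for an abstention) and the function name `pick_threshold` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def pick_threshold(msps, correct):
    """Grid-search an MSP threshold on a small labeled validation set.

    msps:    per-question maximum softmax probabilities
    correct: whether each question was answered correctly
    Scoring (an assumed objective): answered-correct +1,
    answered-incorrect -1, abstained 0.
    """
    msps = np.asarray(msps, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    best_t, best_score = 0.0, -np.inf
    # Only the observed MSP values can change the answered set,
    # so they are the only candidate thresholds worth trying.
    for t in np.unique(msps):
        answered = msps >= t
        score = np.sum(answered & correct) - np.sum(answered & ~correct)
        if score > best_score:
            best_t, best_score = float(t), score
    return best_t
```

With MSPs `[0.9, 0.8, 0.4, 0.3]` and correctness `[True, True, False, False]`, the search settles on a threshold of 0.8, answering only the two confident (and correct) questions.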
Limitations:
The analysis is limited to multiple-choice Q&A tasks.
Further research is needed to determine the generalizability of MSP-based response rejection strategies.
Further research is needed on various LLM architectures and fine-tuning methods.