Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PiCSAR: Probabilistic Confidence Selection And Ranking

Created by
  • Haebom

Authors

Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

Outline

This paper proposes Probabilistic Confidence Selection and Ranking (PiCSAR), a scoring method for best-of-n sampling that improves the accuracy of large language models (LLMs) and large reasoning models (LRMs). Best-of-n sampling needs a scoring function that can pick out a correct reasoning chain without access to the ground-truth answer. PiCSAR scores each candidate generation by the joint log-likelihood of its reasoning and its final answer, which decomposes naturally into a reasoning confidence term and an answer confidence term. It outperforms baseline methods across benchmarks (+10.18 on MATH500 and +9.81 on AIME2025), and it does so with at least 2x fewer samples in 16 of 20 comparisons. Analysis shows that correct reasoning chains have significantly higher reasoning confidence and answer confidence, which supports the effectiveness of PiCSAR.
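The scoring rule described above can be sketched in a few lines. By the chain rule, the joint log-likelihood of a reasoning chain r and answer y given a prompt x splits as log p(r, y | x) = log p(r | x) + log p(y | x, r), which matches the reasoning confidence and answer confidence terms. The sketch below is only an illustration under stated assumptions: the Candidate container, the function names, and the idea that per-token log-probabilities are already available for each sampled generation are all hypothetical, and the paper may compute or normalize each term differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: Sequence[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # log p(r | x): sum of token log-probs over the reasoning chain.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # log p(y | x, r): sum of token log-probs over the final answer.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Joint log-likelihood as the sum of the two confidence terms.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n selection: keep the candidate with the highest joint score.
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=joint_score)


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    samples = [
        Candidate("41", [-1.0, -0.5], [-0.25]),
        Candidate("42", [-0.5, -0.25], [-0.125]),
    ]
    print(select_best(samples).answer)
```

No reference answer appears anywhere in the scoring, which is the property the summary emphasizes: selection depends only on the model's own likelihoods.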

Takeaways, Limitations

Takeaways:
Proposes PiCSAR, a novel scoring method that significantly improves the efficiency of best-of-n sampling.
Effectively identifies correct reasoning chains without access to ground-truth answers.
Demonstrates superior performance and sample efficiency compared to existing methods on various benchmarks.
Supports the effectiveness of PiCSAR through an analysis of reasoning confidence and answer confidence.
Limitations:
Generalization to problem types or models beyond the presented benchmarks requires further study.
PiCSAR's confidence calculation may be biased toward certain problem types.
The performance limits of PiCSAR on problems with complex reasoning chains still need to be verified.