Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Probabilistic Optimality for Inference-time Scaling

Created by
  • Haebom

Authors

Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei

Outline

This paper presents a probabilistic framework for inference-time scaling (ITS) that improves the reasoning performance of large language models (LLMs). It moves beyond conventional heuristic-based parallel sampling by establishing a theoretical foundation for optimal inference-time scaling under the assumption that parallel samples are independent and identically distributed (i.i.d.). By modeling the probability distribution underlying a best-of-N selection strategy, the authors derive a theoretical lower bound on the number of samples required to reach a target performance level. Building on this bound, they develop OptScale, an algorithm that dynamically determines the optimal sample count: a language-model-based predictor estimates the probabilistic prior parameters, and OptScale then selects the minimum number of samples that satisfies a predefined performance threshold and confidence level. Extensive experiments on mathematical reasoning benchmarks such as MATH-500, GSM8K, AIME, and AMC show that OptScale significantly reduces sampling overhead while maintaining or exceeding state-of-the-art reasoning performance. The paper thus provides both a theoretical foundation and a practical solution for the efficient deployment of LLMs on complex reasoning tasks, and the source code is publicly available.
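
To make the sample-count lower bound concrete, the sketch below works through the simplest i.i.d. best-of-N case: if a single sample is correct with probability p, then at least one of N samples is correct with probability 1 - (1 - p)^N, so the smallest N meeting a confidence target follows from a logarithm. This is only an illustration of that argument, not the paper's actual OptScale implementation (which estimates the prior parameters with a language-model-based predictor); the function name, interface, and the example value of p are hypothetical.

```python
import math


def min_samples_for_target(p_correct: float, confidence: float = 0.95) -> int:
    """Smallest N such that at least one of N i.i.d. samples is correct
    with probability >= confidence.

    Solves 1 - (1 - p)^N >= confidence, i.e.
    N >= log(1 - confidence) / log(1 - p).
    """
    if p_correct >= 1.0:
        return 1  # a single sample already suffices
    if p_correct <= 0.0:
        raise ValueError("per-sample success probability must be positive")
    n = math.log(1.0 - confidence) / math.log(1.0 - p_correct)
    return max(1, math.ceil(n))


# Hypothetical usage: suppose the predictor estimates a 30% chance that any
# single sampled solution is correct, and we want 95% confidence of at least
# one correct sample among the N drawn.
print(min_samples_for_target(0.30, 0.95))  # -> 9
```

The point of the sketch is the direction of the dependence: the harder the problem (smaller p) or the stricter the confidence requirement, the larger the minimum N, which is what lets OptScale spend samples only where they are needed.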

Takeaways, Limitations

Takeaways:
  • Provides the first theoretical foundation for inference-time scaling of LLMs.
  • Introduces the OptScale algorithm, which reduces compute cost by calculating the minimum number of samples needed to reach a target performance level.
  • Matches or exceeds state-of-the-art performance on mathematical reasoning benchmarks.
  • Improves reproducibility and usability through publicly released source code.
Limitations:
  • The method assumes parallel samples are independent and identically distributed; performance may degrade when real data violates this assumption.
  • OptScale's performance depends on the accuracy of the language-model-based predictor.
  • Experiments cover only mathematical reasoning benchmarks; generalization to other types of tasks requires further study.