Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Created by
  • Haebom

Author

Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang

Outline

This study aims to improve the performance of large language models (LLMs) through test-time scaling, focusing in particular on the efficiency of Best-of-N (BoN) sampling. To address the GPU memory overhead of BoN sampling and its reliance on an external reward model, the authors propose Self-Truncation Best-of-N (ST-BoN), which exploits the early consistency of the model's internal states to identify the most promising path and prune inefficient paths without fully generating all N samples. ST-BoN reduces computational cost by 70-80% while matching Full-BoN's performance, and can improve accuracy by 3-4 points at the same cost.
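The self-truncation idea described above can be sketched in miniature. The paper's method works on the model's actual internal hidden states during early decoding; the snippet below is only an illustrative stand-in, assuming hypothetical per-candidate "early state" vectors and using mean pairwise cosine similarity as the consistency score (the real scoring function is not specified in this summary).

```python
# Toy sketch of ST-BoN-style early self-truncation.
# Assumption: each of the N sampled paths yields an "early hidden state"
# vector; the path most consistent with the others is kept, the rest pruned.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def self_truncate(early_states, keep=1):
    """Score each candidate by its mean similarity to all other candidates'
    early states, then keep only the top `keep` paths for full decoding."""
    n = len(early_states)
    scores = []
    for i in range(n):
        sims = [cosine(early_states[i], early_states[j])
                for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:keep], scores

# Four hypothetical candidate paths: three roughly agree, one is an outlier.
states = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.90],  # inconsistent path -> lowest score, pruned early
]
kept, scores = self_truncate(states, keep=1)
print(kept, scores)
```

Pruning all but the most "consensual" path early is what avoids decoding all N samples to completion, which is where the reported compute and memory savings would come from.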

Takeaways, Limitations

Takeaways:
  • ST-BoN improves the efficiency of BoN sampling, significantly reducing computational cost while maintaining performance comparable to Full-BoN.
  • Improves resource efficiency, reducing GPU memory usage by over 80% and inference latency by 50%.
  • Eliminates the associated cost and complexity by not requiring an external reward model.
  • Demonstrates the potential to improve accuracy at the same computational cost.
Limitations:
  • The summary lacks detail on the experimental setup and on the size and type of the LLMs used.
  • Further research is needed on how well ST-BoN generalizes to other language models and tasks.
  • More information is needed on how ST-BoN compares with other sampling techniques.
  • The specific method for measuring early consistency of the model's internal states is not described in detail.