This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
This paper proposes a method to improve the performance of large language models through inference-time alignment. While conventional Best-of-N (BoN) sampling incurs high computational cost, the proposed TreeBoN integrates a speculative tree-search strategy into BoN to reduce that cost while maintaining high output quality. TreeBoN uses token-level rewards derived from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. Evaluations on the AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets show that TreeBoN outperforms conventional BoN, reaching a 65% win rate on TutorEval.
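The core idea, expand partial responses in layers, score them with a reward, and prune weak branches instead of fully generating N independent samples, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_fn` and `reward_fn` are hypothetical stand-ins for the language model's continuation sampler and the DPO-derived token-level reward, and the width/keep/depth parameters are illustrative.

```python
import random

def tree_bon(prompt, sample_fn, reward_fn, width=4, keep=2, depth=3):
    """Sketch of a TreeBoN-style search: at each layer, extend every
    surviving partial response with `width` sampled continuations,
    score the candidates with a reward function, and keep only the
    best `keep` paths before expanding further."""
    beams = [prompt]
    for _ in range(depth):
        candidates = []
        for partial in beams:
            for _ in range(width):
                # sample_fn would call the LLM for a continuation segment
                candidates.append(partial + sample_fn(partial))
        # prune low-reward paths (stand-in for DPO token-level rewards)
        candidates.sort(key=reward_fn, reverse=True)
        beams = candidates[:keep]
    # return the highest-reward completed response
    return max(beams, key=reward_fn)

# Toy usage: "tokens" are single characters, reward counts 'a's.
random.seed(0)
best = tree_bon("Q:", lambda p: random.choice("ab"),
                lambda s: s.count("a"), width=3, keep=2, depth=4)
```

Compared with plain BoN, which pays for N full-length generations, this scheme spends the sampling budget only on prefixes that the reward model already favors, which is where the computational saving comes from.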
Takeaways, Limitations
•
Takeaways:
◦
We present TreeBoN, an efficient new framework for inference-time alignment.
◦
Maintains high output quality while reducing computational cost compared to conventional BoN.
◦
Performs well across diverse datasets, achieving a high 65% win rate on TutorEval.
◦
Token-level DPO rewards effectively guide tree expansion and pruning.
•
Limitations:
◦
TreeBoN's performance gains may be limited to the tested datasets and models; experiments with a broader range of models and datasets are needed.
◦
Because TreeBoN relies on DPO-derived rewards, its performance may depend on the quality of the underlying DPO model.
◦
Due to the complexity of the tree-search strategy, computational cost may still be high in some settings; further research is needed to determine optimal tree-search parameters.