Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

Created by
  • Haebom

Author

Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu, TingTing Gao

Outline

This paper presents a method that uses test-time reinforcement learning (TTRL) to improve the complex reasoning capability of large language models (LLMs). To address the high inference cost and early-stage estimation bias of existing TTRL methods, the authors propose two entropy-based strategies that improve the exploration-exploitation balance: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). On the AIME 2024 benchmark with the Llama3.1-8B model, the proposed method achieves a 68% relative improvement in Pass@1 over the baseline while reducing inference token usage by 60%. This shows that the method effectively optimizes the trade-off between inference efficiency, exploration diversity, and estimation robustness, advancing unsupervised reinforcement learning for open-ended reasoning tasks.
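The summary does not spell out how EAR reshapes advantages, but the general idea of entropy-based advantage reshaping can be sketched as follows: add a clipped bonus proportional to the policy's token-level entropy, so high-uncertainty (exploratory) tokens are penalized less. The function names and the coefficients `beta` and `max_bonus` are illustrative assumptions, not the paper's actual formulation or values.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a token-level probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reshape_advantage(advantage, probs, beta=0.1, max_bonus=1.0):
    """Hypothetical entropy-based advantage reshaping: add a clipped
    entropy bonus so that tokens sampled under high uncertainty
    (exploration) retain a larger effective advantage."""
    bonus = min(beta * token_entropy(probs), max_bonus)
    return advantage + bonus
```

Under this sketch, a token drawn from a near-deterministic distribution receives almost no bonus, while a token drawn from a flat distribution receives a bonus up to the `max_bonus` cap, nudging the policy toward continued exploration during test-time training.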

Takeaways, Limitations

Takeaways:
  • Demonstrates that the efficiency and performance of TTRL can be improved simultaneously through entropy-based mechanisms.
  • Presents a practical way to improve LLM reasoning ability even with limited resources.
  • Extends the potential of unsupervised reinforcement learning in open-ended reasoning tasks.
Limitations:
  • The performance gains may be limited to the specific model and benchmark evaluated.
  • Generalization to other types of reasoning tasks and other LLMs remains to be verified.
  • Further research is needed on optimal parameter settings for the entropy-based mechanisms.
👍