Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

Created by
  • Haebom

Authors

Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu

Outline

This paper presents a method that uses Test-Time Reinforcement Learning (TTRL) to improve the complex reasoning ability of large language models (LLMs). To address the high inference cost and overconfidence of existing TTRL, the authors propose two entropy-based strategies that improve the exploration-exploitation balance: Entropy Branch-Tree Majority Rollout (ETMR) and Entropy-Based Advantage Reconfiguration (EAR). Applied to the Llama3.1-8B model, the approach improves the Pass@1 metric on the AIME 2024 benchmark by 68% while consuming only 60% of the rollout token budget, showing that it effectively balances inference efficiency, diversity, and estimation robustness.
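
To make the two mechanisms concrete, here is a minimal Python sketch that reads the names at face value: ETMR is taken to fork a rollout at high-entropy decoding steps and majority-vote the resulting answers, and EAR to add an entropy-scaled term to the group-relative advantage. This is an illustrative assumption, not the authors' implementation; the toy model `toy_next_token`, the threshold `h_branch`, the fork width `n_fork`, and the additive formula in `ear_advantages` are all hypothetical.

```python
import math
import random
from collections import Counter

def entropy(dist):
    """Shannon entropy (in nats) of a {token: prob} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def toy_next_token(prefix):
    """Hypothetical stand-in for an LLM's next-token distribution."""
    rng = random.Random(prefix)  # deterministic per prefix
    tokens = ["A", "B", "C", "<eos>"]
    weights = [rng.random() for _ in tokens]
    total = sum(weights)
    return {t: w / total for t, w in zip(tokens, weights)}

def etmr_rollouts(prompt, max_len=8, h_branch=1.2, n_fork=2, max_paths=8):
    """ETMR sketch: decode greedily, but fork the trajectory at
    high-entropy steps so the token budget is spent where the model is
    uncertain; majority-vote the completions for a pseudo-label."""
    paths, finished = [prompt], []
    for _ in range(max_len):
        next_paths = []
        for p in paths:
            dist = toy_next_token(p)
            if entropy(dist) > h_branch and len(next_paths) < max_paths:
                # high uncertainty: branch into the n_fork most likely tokens
                tops = sorted(dist, key=dist.get, reverse=True)[:n_fork]
            else:
                tops = [max(dist, key=dist.get)]
            for t in tops:
                if t == "<eos>":
                    finished.append(p)
                else:
                    next_paths.append(p + t)
        paths = next_paths
        if not paths:
            break
    finished.extend(paths)
    # majority vote over completions acts as the TTRL pseudo-label
    answer, _ = Counter(f[len(prompt):] for f in finished).most_common(1)[0]
    return finished, answer

def ear_advantages(rewards, entropies, beta=0.1):
    """EAR sketch: add an entropy-scaled bonus to the group-relative
    advantage so exploratory (high-entropy) rollouts are not
    over-penalized by the majority-vote reward."""
    mean_r = sum(rewards) / len(rewards)
    mean_h = sum(entropies) / len(entropies)
    return [(r - mean_r) + beta * (h - mean_h)
            for r, h in zip(rewards, entropies)]

if __name__ == "__main__":
    completions, pseudo_label = etmr_rollouts("Q:")
    print(len(completions), "rollouts; majority answer:", repr(pseudo_label))
    rewards = [1.0 if c[len("Q:"):] == pseudo_label else 0.0
               for c in completions]
    # next-token entropy at the end of each rollout, as a rough proxy
    # for per-rollout uncertainty
    ents = [entropy(toy_next_token(c)) for c in completions]
    print(ear_advantages(rewards, ents))
```

The intuition the sketch tries to capture: branching only where entropy is high spends the rollout token budget on uncertain steps (exploration), while the entropy bonus in the advantage keeps the majority-vote pseudo-reward from collapsing the policy onto overconfident answers (exploitation).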

Takeaways, Limitations

Takeaways:
Presents novel entropy-based mechanisms (ETMR and EAR) that improve the efficiency and performance of TTRL.
Significant performance gain on the AIME 2024 benchmark (68% improvement in the Pass@1 metric).
Reduced inference cost: only 60% of the rollout token budget is needed.
Entropy-based strategies improve the exploration-exploitation balance and mitigate the overconfidence problem.
Contributes to the advancement of unsupervised reinforcement learning for open-domain reasoning tasks.
Limitations:
Further experiments are needed to evaluate the generalization performance of the proposed method.
Applicability to other LLMs needs to be verified.
Research is needed on optimal parameter settings for the entropy-based mechanisms.
The reported gains are demonstrated only on AIME 2024 and may be benchmark-specific; the same effect should be confirmed on other benchmarks.