Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

Created by
  • Haebom

Authors

Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu

Outline

This paper focuses on improving exploration strategies to enhance the reasoning performance of large language models (LLMs) under Reinforcement Learning with Verifiable Rewards (RLVR). To address the premature convergence and entropy collapse problems of existing RLVR methods, the authors propose a Curiosity-Driven Exploration (CDE) framework that leverages the model's intrinsic curiosity. Two signals serve as curiosity measures: the actor's perplexity over its generated responses, and the variance of value estimates from a multi-head critic architecture; both are added as exploration bonuses within the RLVR framework. Theoretical analysis shows that the actor-based bonus penalizes overconfident errors and promotes diversity among correct answers, while the critic-based bonus connects to classical count-based exploration bonuses in RL. Experiments show an improvement of approximately 3 points over standard RLVR on the AIME benchmark. The paper also analyzes the calibration collapse mechanism within RLVR, shedding light on a common LLM failure mode.
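The two curiosity bonuses described above can be sketched in a few lines. This is a minimal illustration of the general idea, not the paper's exact formulation: `alpha` and `beta` are hypothetical scaling coefficients, and the inputs (per-token log-probabilities from the actor, per-head value estimates from the critic) are assumed to be given.

```python
import math
import statistics

def actor_perplexity_bonus(token_logprobs, alpha=0.1):
    # Perplexity of the sampled response under the actor policy:
    # high perplexity = the model is uncertain about its own output,
    # so the response earns a larger exploration bonus.
    ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
    return alpha * ppl

def critic_variance_bonus(head_values, beta=0.1):
    # Disagreement across the critic's value heads serves as an
    # epistemic-uncertainty signal, analogous to count-based bonuses.
    return beta * statistics.pvariance(head_values)

def shaped_reward(verifiable_reward, token_logprobs, head_values,
                  alpha=0.1, beta=0.1):
    # Verifiable (rule-checked) reward plus both curiosity bonuses.
    return (verifiable_reward
            + actor_perplexity_bonus(token_logprobs, alpha)
            + critic_variance_bonus(head_values, beta))
```

Under this sketch, a response the actor was unsure about (more negative log-probabilities) or one the critic heads disagree on receives a larger shaped reward, which is what counteracts premature convergence and entropy collapse.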

Takeaways, Limitations

Takeaways:
Presents the Curiosity-Driven Exploration (CDE) framework as an effective exploration strategy for improving LLM reasoning ability.
Provides a novel exploration-bonus design with theoretical analysis, leveraging both actor and critic curiosity signals.
Experimentally verifies performance gains over standard RLVR on the AIME benchmark.
Deepens understanding of LLM failure modes through analysis of the calibration collapse mechanism in RLVR.
Limitations:
The reported performance gains are limited to the AIME benchmark; generalization to other benchmarks and tasks remains to be shown.
The definition and tuning of the curiosity signals may require further study.
A deeper analysis of the calibration collapse mechanism, and concrete remedies for it, are still needed.