Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

No Free Lunch: Rethinking Internal Feedback for LLM Reasoning

Created by
  • Haebom

Authors

Yanzhi Zhang, Zhaoxi Zhang, Haoxiang Guan, Yilin Cheng, Yitong Duan, Chen Wang, Yue Wang, Shuxin Zheng, Jiyan He

Outline

In this paper, we present Reinforcement Learning from Internal Feedback (RLIF), a reinforcement learning approach that relies solely on model-internal signals and requires no external rewards. Using unsupervised reward surrogates such as token-level entropy, path-level entropy, and self-confidence, we attempt to improve the reasoning performance of base LLMs on mathematical reasoning benchmarks. In the early stages of training, RLIF matches or exceeds RLVR techniques, but performance degrades as training progresses, especially for models that are already instruction-tuned. We explain this training behavior through an analysis of model weight mixing and provide practical guidelines for incorporating internal feedback signals into LLM training.
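
The sketch below illustrates what such internal reward surrogates might look like in practice; it is not the paper's code. The function names, the PyTorch implementation, and the way the scalar reward is formed are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' implementation): computing the kind of
# model-internal reward surrogates described above from output logits.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the next-token distribution.
    logits: (seq_len, vocab_size) -> returns (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def path_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Path-level entropy: here, the mean token entropy over the sequence."""
    return token_entropy(logits).mean()

def self_confidence(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Average log-probability the model assigns to its own sampled tokens,
    one simple notion of the model's confidence in its output."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return chosen.mean()

# Example: score a sampled trajectory by low entropy / high confidence.
# In an RLIF-style loop, a scalar like this would stand in for the external
# (verifier-based) reward used by RLVR.
logits = torch.randn(12, 32000)        # fake (seq_len, vocab_size) logits
tokens = logits.argmax(dim=-1)         # pretend these are the sampled tokens
reward = -path_entropy(logits)         # or: self_confidence(logits, tokens)
print(float(reward))
```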

Takeaways, Limitations

Takeaways:
  • RLIF demonstrates the potential to improve the reasoning performance of LLMs without external supervision.
  • In the early training phase, RLIF can match or outperform existing RLVR techniques.
  • The paper presents practical guidelines for integrating internal feedback signals into LLM training.
Limitations:
  • As training progresses, performance degrades and can fall below the initial level.
  • The benefit is minimal for models that are already instruction-tuned.
  • The effectiveness of RLIF varies greatly depending on the model and the training stage.