
Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

The Invisible Leash: Why RLVR May Not Escape Its Origin

Created by
  • Haebom

Authors

Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi

Outline

This paper asks whether reinforcement learning with verifiable rewards (RLVR), a method widely used to improve performance on complex reasoning tasks, actually expands a model's reasoning scope or merely amplifies high-reward outputs the base model already knows how to produce, trading exploration for precision. Through theoretical and empirical investigation, the study offers new insight into RLVR's potential limits. Theoretically, the authors argue that RLVR is constrained by the support of the base model: it cannot sample solutions with zero initial probability, and it acts as a conservative reweighting mechanism that can restrict the discovery of entirely novel solutions. They also identify an entropy-reward trade-off: RLVR improves precision, but by progressively narrowing exploration it may overlook correct yet underrepresented solutions. Extensive experiments show that RLVR consistently improves pass@1, yet under larger sampling budgets the shrinkage of empirical support generally outweighs its expansion, and the tuned model fails to recover answers that were previously accessible to the base model. Interestingly, although RLVR occasionally increases token-level entropy, raising uncertainty at each generation step, it decreases answer-level entropy: these seemingly more exploratory paths ultimately converge to a smaller set of unique answers. Taken together, the results point to RLVR's limitations in extending the reasoning horizon, and suggest that future algorithmic innovations, such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions, may be needed to break these invisible constraints.
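One way to make the support constraint concrete is the closed form of the KL-regularized objective that underlies many RLVR-style fine-tuning setups. The sketch below is a standard derivation shown for illustration, not necessarily the paper's exact formalization; here π_base is the base policy, r the verifiable reward, β the regularization strength, and Z(x) a normalizer.

```latex
% Sketch (standard result, not necessarily the paper's exact theorem):
% the optimum of a KL-regularized objective preserves the base model's support.
\[
\pi^{\star} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{y \sim \pi(\cdot\mid x)}\bigl[r(x,y)\bigr]
\;-\; \beta\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{base}}(\cdot\mid x)\bigr)
\]
\[
\pi^{\star}(y\mid x) \;=\;
\frac{\pi_{\mathrm{base}}(y\mid x)\,\exp\!\bigl(r(x,y)/\beta\bigr)}{Z(x)},
\qquad
\pi_{\mathrm{base}}(y\mid x)=0 \;\Longrightarrow\; \pi^{\star}(y\mid x)=0 .
\]
```

Under this view, reweighting by exp(r/β) can concentrate probability on high-reward answers the base model can already reach, but it cannot place mass on answers outside the base model's support, which is the "invisible leash" the title refers to.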

Takeaways, Limitations

Takeaways: RLVR improves pass@1 performance, but its dependence on the support of the base model limits its ability to discover new solutions; the identified entropy-reward trade-off makes this limitation concrete (a toy numerical sketch follows this list).
Limitations: RLVR cannot move beyond the support of the base model and may overlook underrepresented correct answers; under larger sampling budgets, the reduction in empirical support outweighs its expansion; additional exploration or search mechanisms are needed to find new solutions.
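The following is a minimal toy sketch of the token-level vs. answer-level entropy observation, with made-up numbers for illustration only (this is not the paper's experiment): flatter per-step token distributions can coexist with fewer unique final answers.

```python
# Toy illustration: per-step token entropy can rise while answer-level entropy
# falls, because different token paths collapse onto the same final answer.
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Each "model" is a list of (path_probability, per-step token distributions, final_answer).
# Base model: two near-deterministic paths ending in two different answers.
base_paths = [
    (0.5, [[0.9, 0.1], [0.9, 0.1]], "A"),
    (0.5, [[0.9, 0.1], [0.9, 0.1]], "B"),
]
# RLVR-tuned model: flatter per-step distributions (higher token entropy),
# but both paths now terminate in the same answer.
rlvr_paths = [
    (0.5, [[0.6, 0.4], [0.6, 0.4]], "A"),
    (0.5, [[0.6, 0.4], [0.6, 0.4]], "A"),
]

def token_and_answer_entropy(paths):
    # Mean per-step token entropy, weighted by path probability.
    tok = sum(w * sum(entropy(d) for d in dists) / len(dists)
              for w, dists, _ in paths)
    # Entropy over the induced distribution of final answers.
    answer_probs = defaultdict(float)
    for w, _, answer in paths:
        answer_probs[answer] += w
    return tok, entropy(answer_probs.values())

print("base:", token_and_answer_entropy(base_paths))  # ~0.47 bits/token, 1.0 bit over answers
print("rlvr:", token_and_answer_entropy(rlvr_paths))  # ~0.97 bits/token, 0.0 bits over answers
```

In this toy setup the RLVR-style model looks more uncertain at every generation step, yet every sampled path converges to the single answer "A", mirroring the paper's observation that higher token-level entropy does not imply a broader set of unique answers.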