Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

The Invisible Leash: Why RLVR May or May Not Escape Its Origin

Created by
  • Haebom

Authors

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi

Outline

Recent advances in LLMs have demonstrated that Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for solving complex reasoning problems. This study investigates whether current RLVR methods actually expand a model's reasoning boundary or merely improve accuracy by amplifying high-reward outputs that the base model already knows. It finds that, under current training conditions, RLVR can operate as a support-constrained optimization mechanism: it remains bounded by the base model's initial distribution, which can limit the discovery of entirely novel solutions. An analysis of the entropy-reward tradeoff further shows that current RLVR methods can improve accuracy while narrowing exploration and overlooking correct but underrepresented answers. Experiments confirm that RLVR consistently improves pass@1, but under larger sampling budgets the shrinkage of empirical support generally outweighs its expansion, and the trained model fails to recover answers that were previously accessible to the base model. Notably, even when token-level entropy increases, producing greater uncertainty at each generation step, answer-level entropy decreases, indicating that these seemingly uncertain paths ultimately converge on a smaller set of distinct answers.
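
To make the last observation concrete, here is a minimal sketch (not from the paper) of how the two entropy measures can diverge. The per-step token distributions and final answers below are hypothetical stand-ins for a batch of sampled generations; the function names are illustrative, not the authors' code.

```python
import math
from collections import Counter

def token_level_entropy(step_distributions):
    """Mean Shannon entropy of the per-step next-token distributions."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(entropy(p) for p in step_distributions) / len(step_distributions)

def answer_level_entropy(final_answers):
    """Shannon entropy of the empirical distribution over final answers."""
    counts = Counter(final_answers)
    n = len(final_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical data: each generation step looks uncertain...
per_step_probs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.50, 0.25, 0.15, 0.10],
    [0.35, 0.35, 0.20, 0.10],
]
# ...yet the sampled paths collapse onto few distinct final answers.
sampled_answers = ["42", "42", "42", "42", "42", "42", "17", "42"]

print(f"token-level entropy:  {token_level_entropy(per_step_probs):.3f}")   # ~1.2 (high)
print(f"answer-level entropy: {answer_level_entropy(sampled_answers):.3f}")  # ~0.4 (low)
```

High average uncertainty at each step is compatible with low diversity in the final answers, which is exactly the convergence pattern the study reports.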

Takeaways, Limitations

Current RLVR methods may be fundamentally limited in how far they can extend a model's reasoning scope.
RLVR remains bounded by the base model's initial distribution, which can limit the discovery of entirely novel solutions.
Current RLVR methods improve accuracy but narrow exploration, potentially overlooking correct yet underrepresented answers.
Under large sampling budgets, empirical support can shrink rather than expand; the sketch after this list shows how this appears in pass@k.
Future algorithmic innovations (e.g., explicit search mechanisms, hybrid strategies) may overcome these limitations.
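
As a companion to the sampling-budget point above, here is a minimal sketch (not from the paper) of the standard unbiased pass@k estimator of Chen et al. (2021), which is commonly used for such comparisons. The success counts are hypothetical and simply illustrate how a pass@1 gain can coexist with lost coverage at large k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset contains a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: the base model is rarely correct (5/100), while an
# RLVR-tuned model is far more accurate (40/100) when the answer lies inside
# its support, but never correct (0/100) when that support has been lost.
print(pass_at_k(100, 40, 1))   # RLVR pass@1  -> 0.40 (clear gain)
print(pass_at_k(100, 5, 1))    # base pass@1  -> 0.05
print(pass_at_k(100, 5, 64))   # base pass@64 -> ~0.995: rare answers surface
print(pass_at_k(100, 0, 64))   # RLVR pass@64 -> 0.00 where support shrank
```

Under a small budget (k = 1) the tuned model dominates, but at a large budget the base model's broader support lets it recover answers the tuned model can no longer produce, matching the tradeoff described above.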