Daily Arxiv

This page curates AI-related papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ReDit: Reward Dithering for Improved LLM Policy Optimization

Created by
  • Haebom

Authors

Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu

Outline

DeepSeek-R1 improved the reasoning ability of LLMs through a rule-based reward system, but such discrete reward functions can lead to gradient anomalies, unstable optimization, and slow convergence. ReDit addresses this by adding simple random noise to the discrete reward signal. The perturbed rewards provide continuous exploratory gradients throughout training, enabling smoother gradient updates and faster convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore new policies and escape local optima. Experiments on various tasks demonstrate the effectiveness and efficiency of ReDit: on average, it matches the performance of vanilla GRPO with only about 10% of the training steps, and achieves about 4% better performance than vanilla GRPO when trained for a similar duration. Visualizations confirm that ReDit significantly alleviates the gradient problems, and a theoretical analysis further supports these advantages.
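Below is a minimal sketch of the core idea, assuming a rule-based 0/1 correctness reward and zero-mean Gaussian noise (used here for concreteness; the summary only says "simple random noise"). The function names, the noise scale, and the simplified GRPO-style group-advantage computation are illustrative, not the authors' implementation.

```python
import numpy as np

def dither_rewards(rewards, noise_std=0.05, rng=None):
    """Add zero-mean Gaussian noise to discrete (e.g. 0/1) rule-based rewards.

    The noise turns a flat, discrete reward landscape into a continuous one,
    which is the smoothing effect ReDit relies on.
    """
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    return rewards + rng.normal(loc=0.0, scale=noise_std, size=rewards.shape)

def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within a group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: a group of 4 sampled completions, only one of which is correct.
raw = np.array([0.0, 0.0, 1.0, 0.0])         # rule-based 0/1 reward
dithered = dither_rewards(raw, noise_std=0.05)
adv = group_advantages(dithered)              # weights for policy-gradient updates
```

In this sketch, keeping the noise scale small relative to the reward gap preserves the ranking between correct and incorrect completions, while groups whose raw rewards are identical (e.g. all wrong) no longer collapse to all-zero advantages, which gives the policy a nonzero exploratory gradient.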

Takeaways, Limitations

Takeaways:
• ReDit effectively addresses the gradient anomalies, unstable optimization, and slow convergence caused by discrete reward functions.
• ReDit achieves performance comparable to vanilla GRPO with only about 10% of the training steps, and about 4% better performance when trained for a similar duration.
• Experiments show that perturbing rewards with random noise improves the model's exploration and helps it escape local optima (see the sketch above).
• Theoretical analysis supports the benefits of ReDit.
Limitations:
• The performance gains of ReDit may be limited to certain types of tasks or models.
• Further research is needed on possible negative effects of the injected noise on training.
• Further research is needed on the optimal distribution and magnitude of the random noise.