Daily Arxiv

This page collects papers on artificial intelligence published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Created by
  • Haebom

Author

Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia

Outline

This paper addresses alignment, which is essential for the safe deployment of large language models (LLMs). It points out shortcomings of existing reward-based and reward-free approaches and proposes DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning) to tackle two problems: imbalanced safety datasets and static reward models. DR-IRL first trains category-specific reward models via inverse reinforcement learning (IRL) on a balanced safety dataset covering seven harmful categories. It then applies dynamic reward scaling, which adjusts rewards according to task difficulty, within Group Relative Policy Optimization (GRPO). Experiments across various benchmarks and LLMs show that DR-IRL outperforms existing methods in improving usefulness while maintaining safety.
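To make the core mechanism concrete, here is a minimal sketch of difficulty-scaled rewards inside a GRPO-style group-relative advantage computation. The function name, tensor shapes, and the specific scaling rule (1 + difficulty) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def grpo_advantages_with_dynamic_scaling(rewards: torch.Tensor,
                                         difficulty: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages with a difficulty-dependent reward scale.

    rewards    : (num_prompts, group_size) scores from a category-specific
                 reward model for a group of sampled responses per prompt.
    difficulty : (num_prompts,) estimated task difficulty in [0, 1].
    The (1 + difficulty) scaling rule is an illustrative assumption.
    """
    scale = (1.0 + difficulty).unsqueeze(-1)              # up-weight harder prompts
    scaled = rewards * scale
    # Standard GRPO-style normalization within each group of sampled responses.
    mean = scaled.mean(dim=-1, keepdim=True)
    std = scaled.std(dim=-1, keepdim=True).clamp_min(1e-6)
    return (scaled - mean) / std

# Example: one prompt, four sampled responses, high estimated difficulty.
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.1]])
difficulty = torch.tensor([0.8])
print(grpo_advantages_with_dynamic_scaling(rewards, difficulty))
```

In GRPO these per-response advantages then weight the policy-gradient update, so harder prompts contribute proportionally larger reward signals.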

Takeaways, Limitations

Takeaways:
Presents DR-IRL, which effectively addresses the problems of imbalanced safety datasets and static reward models.
Improves both safety and usefulness through dynamic reward scaling that accounts for task difficulty.
Demonstrates superior performance over existing methods across various benchmarks and LLMs.
Presents an effective safety alignment strategy based on inverse reinforcement learning (IRL) and category-specific reward models.
Limitations:
Further research is needed on the generalization performance of the proposed DR-IRL.
Extensibility to risk types beyond the seven harmful categories needs to be examined.
The difficulty-adjustment method, which relies on text-encoder cosine similarity and reward gaps, needs further analysis (see the sketch after this list).
Dependence on specific benchmarks and LLMs needs to be addressed to establish generality.
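As a hedged illustration of the difficulty-estimation idea referenced above, the sketch below combines a text-encoder cosine similarity with a reward gap into a single score. The equal weighting and the intuition that semantically close response pairs with small reward gaps are "harder" are assumptions for illustration, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def estimate_difficulty(chosen_emb: torch.Tensor,
                        rejected_emb: torch.Tensor,
                        reward_chosen: float,
                        reward_rejected: float) -> float:
    """Hypothetical difficulty score for a preference pair.

    chosen_emb / rejected_emb : text-encoder embeddings of the two responses.
    Assumption: pairs whose responses are semantically close and whose reward
    gap is small are harder to distinguish, hence higher difficulty.
    """
    sim = F.cosine_similarity(chosen_emb, rejected_emb, dim=-1).item()
    gap = abs(reward_chosen - reward_rejected)
    # Equal weighting of the two signals is arbitrary in this sketch.
    return 0.5 * sim + 0.5 * (1.0 / (1.0 + gap))

# Example with random embeddings standing in for a real text encoder.
emb_a, emb_b = torch.randn(768), torch.randn(768)
print(estimate_difficulty(emb_a, emb_b, reward_chosen=0.7, reward_rejected=0.4))
```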