This paper presents rule-based rewards as a promising strategy for improving reinforcement learning from human feedback (RLHF). Whereas existing approaches typically rely on manual rule engineering, the authors propose AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule operates in three stages: it uses a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chains of those interpretations, and synthesizes them into a unified rule set. Given the finalized rule set, a language-model verifier computes the fraction of rules each output satisfies, and this metric serves as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule yields a 28.6% improvement in length-controlled win rate on AlpacaEval2.0 and a 6.1% improvement in second-turn performance on an MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. The analysis shows that the extracted rules align well with the dataset preferences, and that AutoRule exhibits less reward hacking than the learned reward model alone when trained over two episodes. Finally, case studies show that the extracted rules capture distinct qualities valued by different datasets. The extracted rules are provided in the appendix, and the code is publicly available at https://github.com/cxcscmu/AutoRule.
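To make the reward computation concrete, the sketch below shows one plausible way the rule-satisfaction metric could be combined with a learned reward model score. The `verify` callable (standing in for the language-model verifier), the `learned_rm` callable, and the linear combination with a `weight` coefficient are illustrative assumptions, not the paper's actual implementation; the summary only states that the fraction of satisfied rules is used as an auxiliary reward alongside the learned reward model.

```python
from typing import Callable, List


def rule_reward(output: str, rules: List[str],
                verify: Callable[[str, str], bool]) -> float:
    """Fraction of extracted rules that the output satisfies.

    `verify(rule, output)` is a placeholder for a language-model
    verifier that returns True if the output complies with the rule;
    its prompt and backing model are assumptions, not specified here.
    """
    if not rules:
        return 0.0
    satisfied = sum(verify(rule, output) for rule in rules)
    return satisfied / len(rules)


def combined_reward(output: str, rules: List[str],
                    learned_rm: Callable[[str], float],
                    verify: Callable[[str, str], bool],
                    weight: float = 0.5) -> float:
    """Learned reward model score plus a rule-based auxiliary reward.

    The additive combination and the `weight` value are hypothetical;
    they simply illustrate using rule satisfaction as an auxiliary
    signal during policy optimization (e.g., inside a GRPO loop).
    """
    return learned_rm(output) + weight * rule_reward(output, rules, verify)
```

In an actual training setup, `combined_reward` would be evaluated per sampled completion and fed to the policy-optimization step in place of the learned reward model score alone.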