Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning

Created by
  • Haebom

Authors

Tevin Wang, Chenyan Xiong

Outline

Rule-based rewards are a promising strategy for improving reinforcement learning from human feedback (RLHF), but existing approaches typically rely on manual rule engineering. This paper proposes AutoRule, a fully automated method that extracts rules from preference feedback and formulates them into rule-based rewards. AutoRule operates in three steps: it uses a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chains produced during that interpretation, and synthesizes them into a unified rule set. Given the final rule set, a language-model verifier computes the fraction of rules each output satisfies, and this metric serves as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule yields a 28.6% improvement in length-controlled win rate on AlpacaEval 2.0 and a 6.1% improvement in second-turn performance on an MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Analysis shows that the extracted rules align well with the dataset preferences and that AutoRule reduces reward hacking relative to the learned reward model when run for two episodes. Finally, case studies show that the extracted rules capture distinctive features that matter in different datasets. The extracted rules are provided in the appendix, and the code is publicly available at https://github.com/cxcscmu/AutoRule .
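As a rough illustration of how the rule-based auxiliary reward might be computed and combined with the learned reward model, the sketch below shows the shape of the calculation. It is a minimal sketch under assumptions, not the paper's implementation: the verifier interface, the prompt wording, and the mixing weight `rule_weight` are hypothetical placeholders.

```python
# Minimal sketch of an AutoRule-style auxiliary reward (illustrative only).
# `verifier` stands in for a language-model judge and `rm_score` for the
# learned reward model's output; both are assumed interfaces, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AutoRuleReward:
    rules: List[str]                  # unified rule set extracted from preference data
    verifier: Callable[[str], str]    # LM verifier: prompt -> "yes"/"no" judgment
    rule_weight: float = 0.5          # hypothetical mixing coefficient

    def rule_satisfaction(self, prompt: str, response: str) -> float:
        """Fraction of rules the response satisfies, as judged by the verifier."""
        satisfied = 0
        for rule in self.rules:
            judgment = self.verifier(
                f"Rule: {rule}\nPrompt: {prompt}\nResponse: {response}\n"
                "Does the response satisfy the rule? Answer yes or no."
            )
            satisfied += judgment.strip().lower().startswith("yes")
        return satisfied / max(len(self.rules), 1)

    def total_reward(self, prompt: str, response: str, rm_score: float) -> float:
        """Learned reward plus the rule-based auxiliary reward used in policy optimization."""
        return rm_score + self.rule_weight * self.rule_satisfaction(prompt, response)
```

In the paper, this kind of combined signal drives GRPO-style policy optimization; the actual rule set, verifier prompts, and weighting follow the authors' setup rather than this sketch.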

Takeaways, Limitations

Takeaways:
  • AutoRule automatically generates rule-based rewards from human preference feedback, removing the need for manual rule engineering.
  • It improves RLHF performance: a 28.6% gain in length-controlled win rate on AlpacaEval 2.0 and a 6.1% gain in second-turn MT-Bench performance over a GRPO baseline.
  • The extracted rules align well with dataset preferences and help reduce reward hacking.
  • Case studies suggest the method can extract rules that capture features specific to different datasets.
  • The public code release supports reproducibility and further extension.
Limitations:
  • AutoRule's performance may depend on the specific language model (Llama-3-8B) and datasets used.
  • The quality of rule extraction may be limited by the capability of the reasoning model.
  • Additional experiments with more diverse datasets and language models are needed.
  • Further research is needed on the interpretability and explainability of the extracted rules.