Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

RLTHF: Targeted Human Feedback for LLM Alignment

Created by
  • Haebom

Author

Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra

Outline

This paper proposes RLTHF, a human-AI hybrid framework, to address the high cost of reinforcement learning from human feedback (RLHF) and the limited generalization of AI feedback when aligning large language models (LLMs) with user preferences. RLTHF combines LLM-based initial alignment with selective human annotation, reaching the alignment quality of full human annotation with minimal effort. It uses the reward model's reward distribution to identify hard samples the LLM has misclassified, then iteratively refines the alignment by retaining samples the LLM labels correctly and applying strategic human corrections to the rest. On the HH-RLHF and TL;DR datasets, RLTHF matches the alignment of full human annotation with only 6-7% of the human annotation effort. Moreover, models trained on RLTHF's curated dataset outperform models trained on the fully human-annotated dataset on downstream tasks, highlighting RLTHF's effectiveness.
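To make the selection step concrete, the short Python sketch below illustrates the general idea: score each LLM-labeled preference pair with a reward model and route the lowest-margin (most ambiguous) pairs to human annotators, keeping the LLM labels for the rest. This is an illustrative approximation under stated assumptions, not the paper's actual algorithm; the PreferencePair class, reward_fn signature, and select_for_human_review function are hypothetical names introduced here.

# Minimal sketch of RLTHF-style selective annotation (illustrative only, not the authors' code).
# Assumes an already-trained reward model exposed as reward_fn(prompt, response) -> float;
# all names below are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the LLM judge preferred
    rejected: str  # response the LLM judge rejected
    human_verified: bool = False


def select_for_human_review(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],
    budget: int,
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Split LLM-labeled pairs into (send_to_humans, keep_llm_labels).

    Pairs with the smallest reward margin (chosen score minus rejected score)
    are the ones the reward model is least confident about, a proxy for
    samples the LLM judge may have misclassified; these are routed to human
    annotators up to the given budget, while the rest keep their LLM labels.
    """
    margins = [
        reward_fn(p.prompt, p.chosen) - reward_fn(p.prompt, p.rejected)
        for p in pairs
    ]
    order = sorted(range(len(pairs)), key=lambda i: margins[i])
    send_to_humans = [pairs[i] for i in order[:budget]]
    keep_llm_labels = [pairs[i] for i in order[budget:]]
    return send_to_humans, keep_llm_labels

In the full iterative loop described in the paper, the human-corrected pairs would be merged back into the training set, the reward model retrained, and the selection repeated in the next round.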

Takeaways, Limitations

Takeaways:
  • Presents a novel approach that effectively addresses the high cost of RLHF.
  • Achieves high-quality model alignment with minimal human effort.
  • Models trained with RLTHF outperform models trained on fully human-annotated data.
  • Demonstrates the effectiveness of a hybrid approach that combines the strengths of LLMs with human expertise.
Limitations:
  • RLTHF's performance may depend on the accuracy of the reward model; a weaker reward model could reduce RLTHF's efficiency.
  • Evaluation is limited to two datasets (HH-RLHF, TL;DR), so further research is needed on generalization to other datasets and tasks.
  • There is little detailed analysis of which types of errors the LLM makes and which of them are corrected by humans.
  • Further research is needed to optimize RLTHF's selective human annotation strategy.