Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Created by
  • Haebom

Author

Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Outline

This paper proposes CANOE, a novel framework for improving the contextual faithfulness of large language models (LLMs), i.e., keeping their responses grounded in the given context. CANOE synthesizes diverse short-form question-answering (QA) data without human annotation to produce high-quality, easily verifiable training data. Building on this, the authors propose Dual-GRPO, a rule-based reinforcement learning method that uses three rule-based rewards derived from the synthesized short-form QA data to optimize both short-form and long-form response generation simultaneously. Dual-GRPO avoids the manual labeling otherwise needed to train a reward model and prevents over-optimizing short-form generation at the expense of long-form quality. Experimental results show that CANOE significantly improves the faithfulness of LLMs across 11 different tasks, outperforming state-of-the-art LLMs such as GPT-4o and OpenAI o1.
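The summary does not spell out the three reward rules, but the core idea of a rule-based reward over verifiable short-form QA data can be sketched as follows (the function name, normalization, and scoring rule here are illustrative assumptions, not the paper's actual implementation):

```python
def rule_based_reward(prediction: str, gold_answers: list[str]) -> float:
    """Illustrative rule-based reward for synthetic short-form QA.

    Returns 1.0 if the model's prediction exactly matches any reference
    answer after light normalization, else 0.0. Because the synthetic QA
    answers are short and verifiable, no learned reward model is needed.
    The real CANOE/Dual-GRPO rewards combine three such rules.
    """
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences do not change the reward.
        return " ".join(s.lower().strip().split())

    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(a) for a in gold_answers) else 0.0
```

In a GRPO-style setup, such a reward would score each sampled completion in a group, and the policy is updated toward completions with above-average reward, with no human preference labels involved.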

Takeaways, Limitations

Takeaways:
Presents an effective framework (CANOE) for improving the faithfulness of LLMs without human annotation.
Efficiently optimizes both short-form and long-form response generation using rule-based reinforcement learning.
Demonstrates faithfulness gains that surpass state-of-the-art LLMs.
Shows versatility across a variety of downstream tasks.
Limitations:
Dependence on synthetic data quality: the diversity and quality of the synthesized QA data may affect CANOE's performance.
Generalizability of rule-based rewards: rules tuned for specific tasks may degrade performance when applied to other tasks.
Scalability of the proposed method: applicability to larger datasets and more complex tasks remains to be verified.