This paper proposes CANOE, a framework for improving the contextual faithfulness of large language models (LLMs) without human annotation. CANOE first synthesizes diverse short-form question-answering (QA) data to obtain high-quality, easily verifiable training data. It then applies Dual-GRPO, a rule-based reinforcement learning method that uses three rule-based rewards derived from the synthesized short-form QA data to optimize short-form and long-form response generation simultaneously. Dual-GRPO thereby avoids the manual labeling otherwise needed to train a reward model and prevents over-optimizing short-form generation at the expense of long-form quality. Experimental results show that CANOE significantly improves the faithfulness of LLMs across 11 different tasks, outperforming state-of-the-art LLMs such as GPT-4o and OpenAI o1.
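Because the synthesized short-form QA data is verifiable by simple string matching, a rule-based reward can stand in for a learned reward model. The sketch below only illustrates that general idea; the function names and normalization steps are assumptions for illustration, not CANOE's actual reward rules.

```python
# Illustrative sketch: a rule-based reward for synthesized short-form QA,
# where correctness is checked by string matching rather than a learned reward model.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (common QA normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rule_based_reward(model_answer: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the model's short-form answer matches any synthesized gold answer, else 0.0."""
    pred = normalize(model_answer)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


# Example: a verifiable reward signal obtained without any human-labeled reward data.
print(rule_based_reward("The Eiffel Tower", ["Eiffel Tower"]))  # 1.0
```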
Takeaways, Limitations
• Takeaways:
  ◦ Presents an effective framework (CANOE) that improves the faithfulness of LLMs without human annotation.
  ◦ Efficiently optimizes both short-form and long-form response generation using rule-based reinforcement learning.
  ◦ Achieves faithfulness gains that surpass state-of-the-art LLMs.
  ◦ Demonstrates versatility across a variety of downstream tasks.
• Limitations:
  ◦ Dependence on synthetic data quality: the diversity and quality of the synthesized data can affect CANOE's performance.
  ◦ Generalizability of rule-based rewards: rules optimized for specific tasks may degrade performance when applied to other tasks.
  ◦ Scalability of the proposed method: applicability to larger datasets and more complex tasks remains to be verified.