Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Post-Completion Learning for Language Models

Created by
  • Haebom

Authors

Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang

Outline

This paper proposes Post-Completion Learning (PCL), a novel training framework that exploits the sequence space after the model's output is complete, overcoming the limitation that conventional language model training stops at the end-of-sequence (EOS) token. PCL strengthens reasoning and self-evaluation by having the model continue to generate a self-evaluation and a reward prediction after it finishes its answer, while inference remains efficient because generation still stops at the completion point. This is realized through a white-box reinforcement learning scheme: the model scores its own outputs according to reward rules, and these scores are supervised to stay consistent with the reward function. To optimize reasoning and evaluation capabilities jointly, the authors implement dual-track SFT and combine it with RL for multi-objective hybrid optimization. Experiments on multiple datasets and models show consistent improvements over standard SFT and RL baselines.
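To make the mechanism more concrete, here is a minimal Python sketch of the post-completion idea described above. The marker names, the TrainingExample fields, the loss weights, and the exact loss decomposition are illustrative assumptions for this summary, not the paper's actual implementation.

```python
# Minimal sketch of post-completion training (hypothetical names and weights;
# the paper's exact data format, reward rules, and losses are assumptions).

from dataclasses import dataclass

EOS = "<eos>"        # completion point: inference stops here
POST_EOS = "<post>"  # hypothetical marker opening the post-completion segment

@dataclass
class TrainingExample:
    prompt: str
    answer: str          # the normal model output, ends at EOS
    self_eval: str       # post-completion self-evaluation text
    reward_pred: float   # the model's predicted reward for its own answer
    reward_true: float   # score assigned by the rule-based reward function

def build_training_sequence(ex: TrainingExample) -> str:
    """The training sequence extends past EOS with a self-evaluation and a
    reward prediction; at inference time, generation still stops at EOS."""
    return (
        f"{ex.prompt}{ex.answer}{EOS}"
        f"{POST_EOS}{ex.self_eval} reward={ex.reward_pred:.2f}"
    )

def hybrid_loss(sft_answer_loss: float,
                sft_eval_loss: float,
                rl_loss: float,
                reward_pred: float,
                reward_true: float,
                w=(1.0, 0.5, 0.5, 0.5)) -> float:
    """Multi-objective hybrid optimization (weights are illustrative):
    dual-track SFT (answer track + evaluation track), an RL term, and an
    alignment term that supervises the model's predicted reward against
    the rule-based reward function (the 'white-box' consistency signal)."""
    align_loss = (reward_pred - reward_true) ** 2
    w_ans, w_eval, w_rl, w_align = w
    return (w_ans * sft_answer_loss + w_eval * sft_eval_loss
            + w_rl * rl_loss + w_align * align_loss)

# Example with dummy values:
ex = TrainingExample(
    prompt="Q: 2+2? A:", answer=" 4", self_eval=" The answer is correct.",
    reward_pred=0.9, reward_true=1.0,
)
print(build_training_sequence(ex))
print(hybrid_loss(1.2, 0.8, 0.3, ex.reward_pred, ex.reward_true))
```

The key design point reflected in the sketch is that the post-completion segment exists only in training: the evaluation and reward-prediction tokens after EOS provide extra supervision, while deployed inference pays no additional cost because decoding terminates at the completion point.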

Takeaways, Limitations

Takeaways:
  • Presents PCL, a new framework that overcomes a limitation of conventional language model training.
  • Improves the model's reasoning and self-evaluation capabilities.
  • Improves output quality while keeping inference efficient.
  • Presents a multi-objective hybrid optimization method that combines the strengths of SFT and RL.
  • Shows consistent performance improvements across diverse datasets and models.
Limitations:
  • Further research is needed to determine the generalization performance of the proposed method.
  • Results are presented only for specific datasets and models, requiring broader experimentation.
  • The complexity and computational cost of the white-box reinforcement learning scheme should be considered.
  • Further research is needed on the subjectivity of reward function design and the associated optimization issues.