Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Post-Completion Learning for Language Models

Created by
  • Haebom

Authors

Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang

Outline

This paper proposes Post-Completion Learning (PCL), a training framework that exploits the sequence space after the model's output is complete, overcoming the limitation of conventional language model training, which stops at the end-of-sequence (<eos>) token. During training, PCL has the model continue past the completion point to generate a self-evaluation and a reward prediction, strengthening both reasoning and self-assessment; at inference time, generation still stops at the completion point, so decoding efficiency is unchanged. Using a white-box reinforcement learning approach, the model evaluates its own output according to reward rules, and its predicted scores are supervised to align with the reward function. PCL combines dual-track SFT and RL training that jointly optimize reasoning and evaluation, achieving multi-objective hybrid optimization. Experiments on various datasets and models show consistent improvements over standard SFT and RL methods.
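To make the idea concrete, below is a minimal sketch of the post-completion setup in Python. The completion marker name (<|done|>), the exact-match reward rule, and the squared-error alignment loss are illustrative assumptions rather than the paper's exact formulation, and the dual-track SFT + RL training loop is not reproduced here.

```python
# Sketch of the post-completion idea (hypothetical names; not the paper's exact recipe).
# During training, the target sequence continues past the completion marker with a
# self-evaluation and a predicted reward; at inference, generation stops at the marker.

COMPLETION_MARKER = "<|done|>"  # assumed stand-in for the terminal/eos token


def build_training_target(reasoning: str, answer: str,
                          self_eval: str, predicted_reward: float) -> str:
    """Concatenate the normal output with a post-completion segment."""
    return (
        f"{reasoning}\nAnswer: {answer} {COMPLETION_MARKER}\n"
        f"Self-eval: {self_eval}\nReward: {predicted_reward:.2f}"
    )


def rule_based_reward(answer: str, reference: str) -> float:
    """Toy white-box reward rule: exact-match correctness."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def reward_alignment_loss(predicted: float, rule_reward: float) -> float:
    """Supervise the model's self-predicted reward toward the rule-based score."""
    return (predicted - rule_reward) ** 2


def inference_postprocess(generated: str) -> str:
    """At inference time, truncate at the completion marker so decoding cost is unchanged."""
    return generated.split(COMPLETION_MARKER, 1)[0].strip()


if __name__ == "__main__":
    target = build_training_target(
        reasoning="2 + 2 = 4, so the sum is 4.",
        answer="4",
        self_eval="The arithmetic is correct and the answer matches the question.",
        predicted_reward=1.0,
    )
    print(target)
    print("alignment loss:", reward_alignment_loss(1.0, rule_based_reward("4", "4")))
    print("inference output:", inference_postprocess(target))
```

The key design point the sketch illustrates is that only the training target is extended; the post-completion segment is cut off at inference, so the extra supervision comes at no additional generation cost.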

Takeaways, Limitations

Takeaways:
Presents a new training framework (PCL) that overcomes a limitation of existing language model training and improves performance.
Offers an effective method for improving reasoning and self-evaluation capabilities simultaneously.
Shows that the post-completion sequence space can be used to make training more effective without adding inference cost.
Demonstrates consistent performance improvements across diverse datasets and models.
Limitations:
Further research is needed on the generalization performance of the proposed method.
Extensive experimentation with various types of language models and datasets is needed.
Reward function design is complex and the resulting optimization is difficult.
Because the approach relies on white-box reinforcement learning, it also requires a detailed understanding of the model's internal workings.