Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Selective Preference Optimization via Token-Level Reward Function Estimation

Created by
  • Haebom

Author

Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou

Outline

This paper proposes Selective Preference Optimization (SePO), a novel selective alignment strategy for large language models. Unlike existing token-level alignment methods, which either optimize all tokens or rely on complex and costly key-token selection schemes, SePO centers on efficient key-token selection. It introduces the first token selection method based on Direct Preference Optimization (DPO): an oracle model is trained to estimate a token-level reward function over the target data. The method works with existing alignment datasets that carry only response-level annotations and enables cost-efficient token selection with a small oracle model and modest training data. The estimated reward function scores every token in the target dataset, and only the selected key tokens are used to supervise the target policy model through a contrastive objective that requires no reference model. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competing baselines while optimizing only 30% of the tokens in the target dataset. In weak-to-strong generalization settings, a weak oracle model effectively supervises strong policy models with up to 16.8× more parameters. SePO also selects useful key tokens from out-of-distribution data, further improving the strong policy model and mitigating overfitting.
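The summary above describes a concrete pipeline: a DPO-trained oracle yields token-level rewards, the highest-scoring ~30% of tokens are kept, and the policy is trained with a reference-free contrastive loss over only those tokens. The sketch below illustrates that pipeline under stated assumptions; the function names (`token_rewards`, `select_key_tokens`, `sepo_style_loss`), the β values, the selection rule, and the loss form are illustrative stand-ins, not the paper's exact formulation.

```python
# Minimal sketch of a SePO-style pipeline in PyTorch.
# All formulas and names here are illustrative assumptions.
import torch


def token_rewards(oracle_logps: torch.Tensor,
                  ref_logps: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Estimate per-token rewards from a DPO-trained oracle.

    oracle_logps, ref_logps: [batch, seq_len] log-probabilities of the
    response tokens under the oracle and its reference model.
    """
    return beta * (oracle_logps - ref_logps)


def select_key_tokens(rewards: torch.Tensor,
                      keep_ratio: float = 0.3,
                      largest: bool = True) -> torch.Tensor:
    """Boolean mask keeping the top `keep_ratio` tokens per sequence
    (highest rewards for chosen responses, lowest for rejected ones)."""
    k = max(1, int(keep_ratio * rewards.size(-1)))
    idx = rewards.topk(k, dim=-1, largest=largest).indices
    mask = torch.zeros_like(rewards, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask


def sepo_style_loss(policy_logps_chosen: torch.Tensor,
                    policy_logps_rejected: torch.Tensor,
                    mask_chosen: torch.Tensor,
                    mask_rejected: torch.Tensor,
                    beta: float = 2.0) -> torch.Tensor:
    """Reference-free contrastive objective over the selected tokens only:
    raise the policy's average log-prob on key chosen tokens and lower it
    on key rejected tokens."""
    chosen = (policy_logps_chosen * mask_chosen).sum(-1) / mask_chosen.sum(-1).clamp(min=1)
    rejected = (policy_logps_rejected * mask_rejected).sum(-1) / mask_rejected.sum(-1).clamp(min=1)
    return -torch.nn.functional.logsigmoid(beta * (chosen - rejected)).mean()


# Toy usage with random log-probabilities standing in for real model outputs.
if __name__ == "__main__":
    B, T = 4, 32
    oracle_c, ref_c = torch.randn(B, T), torch.randn(B, T)
    oracle_r, ref_r = torch.randn(B, T), torch.randn(B, T)
    policy_c, policy_r = torch.randn(B, T), torch.randn(B, T)

    mask_c = select_key_tokens(token_rewards(oracle_c, ref_c), largest=True)
    mask_r = select_key_tokens(token_rewards(oracle_r, ref_r), largest=False)
    print(sepo_style_loss(policy_c, policy_r, mask_c, mask_r))
```

Because the oracle is only used to score and rank tokens, not to generate supervision targets, it can be much smaller than the policy it supervises, which is what makes the weak-to-strong setting reported in the paper plausible in practice.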

Takeaways and Limitations

Takeaways:
Addresses the inefficiency and noise of existing token-level alignment methods through efficient key-token selection.
Introduces a DPO-based token selection method that relies only on response-level annotations, making it applicable to a wide range of existing alignment datasets.
Achieves cost-effective token selection with a small oracle model and modest training data.
Demonstrates experimentally that a weak oracle model can effectively supervise a much stronger policy model.
Improves strong policy models and mitigates overfitting by selecting key tokens from out-of-distribution data.
Shows experimentally verified performance gains over competing methods.
Limitations:
Strong dependence on the quality of the DPO-trained oracle model: if the oracle estimates rewards poorly, SePO's performance degrades as well.
The generalization of the key-token selection strategy needs further study; the selected tokens may be over-fitted to specific datasets or tasks.
The scalability of the method and its applicability to other model architectures also require further investigation.