Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Created by
  • Haebom

Authors

Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun

Outline

This paper proposes LongMab-PO, a novel framework that combines Multi-Armed Bandit (MAB) sampling with Direct Preference Optimization (DPO) to improve the long-context modeling performance of LLMs. To address the limitations of existing fine-tuning methods based on synthetic data, namely low response diversity and factual inconsistency, LongMab-PO treats chunks of the long context as bandit arms and uses MAB to select the most informative chunks, from which it generates high-quality, diverse responses that serve as DPO training data. By iteratively selecting context chunks through MAB and updating each chunk's score based on reward feedback for the generated responses, the method concentrates generation on the most relevant parts of the context while collecting diverse, high-quality responses. Experimental results show that LongMab-PO achieves state-of-the-art performance on long-context reasoning benchmarks and produces preference data pairs with significantly higher diversity and quality than existing methods. The source code and data will be made publicly available.
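To make the sampling loop concrete, here is a minimal sketch of how chunk-level bandit selection could drive response collection. It assumes a UCB-style score, a fixed number of rounds, and placeholder `generate` (LLM call) and `reward` (response-scoring) functions; the paper's actual bandit strategy, chunking, and reward design may differ.

```python
import math

def mab_chunk_sampling(chunks, question, generate, reward, rounds=8, k=3):
    """Sketch of MAB-guided chunk sampling: each context chunk is an arm.

    `generate` and `reward` are hypothetical placeholders for an LLM call
    and a response-scoring function; they are not the paper's API.
    """
    pulls = [0] * len(chunks)      # how often each chunk has been selected
    value = [0.0] * len(chunks)    # running mean reward per chunk
    responses = []

    for t in range(1, rounds + 1):
        # UCB-style score: exploit chunks with high mean reward,
        # explore chunks that have rarely been selected.
        scores = [
            (value[i] + math.sqrt(2 * math.log(t) / pulls[i]))
            if pulls[i] > 0 else float("inf")
            for i in range(len(chunks))
        ]
        picked = sorted(range(len(chunks)),
                        key=lambda i: scores[i], reverse=True)[:k]

        # Generate a response from the selected chunks and score it.
        context = "\n\n".join(chunks[i] for i in picked)
        answer = generate(context, question)
        r = reward(answer, question)
        responses.append((answer, r))

        # Reward feedback updates the scores of the pulled arms.
        for i in picked:
            pulls[i] += 1
            value[i] += (r - value[i]) / pulls[i]

    return responses
```

Responses collected this way can then be ranked by reward, with high- and low-scoring ones paired as chosen/rejected examples for DPO training.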

Takeaways, Limitations

Takeaways:
Presents a novel approach that overcomes the limitations of synthetic-data-based LLM fine-tuning by combining MAB and DPO.
Achieves state-of-the-art performance on long-context reasoning tasks.
Improves LLM performance by generating high-quality, diverse preference data pairs.
The planned public release of code and data will support reproducibility and follow-up research.
Limitations:
The effectiveness of the proposed method may be limited to specific benchmarks. Generalization to other types of long-context tasks requires further research.
The effectiveness of MAB depends heavily on the design of the reward function, and finding the optimal reward function can be difficult.
DPO can be computationally expensive and difficult to apply to large datasets.
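Regarding the last point, the standard DPO objective needs log-probabilities of each chosen and rejected response under both the trainable policy and a frozen reference model, so every preference pair costs additional forward passes. The sketch below uses the standard formulation with illustrative argument names; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-probability margins of chosen over rejected responses,
    # under the trainable policy and the frozen reference model.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO objective: push the policy margin above the reference margin.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```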