This paper proposes LongMab-PO, a novel framework that combines Multi-Armed Bandit (MAB) exploration with Direct Preference Optimization (DPO) to improve long-context modeling performance. To address the limitations of existing fine-tuning methods based on synthetic data, such as low response diversity and factual inconsistency, we use MAB to select the most informative chunks of a long context and generate high-quality, diverse responses from them. These responses are then used as training data for DPO. By iteratively selecting context chunks with MAB and updating their scores from reward feedback on the generated responses, the framework concentrates generation on the most relevant parts of the context while collecting high-quality, diverse responses. Experimental results demonstrate that LongMab-PO achieves state-of-the-art performance on long-context reasoning benchmarks and produces preference data pairs with significantly higher diversity and quality than existing methods. The source code and data will be made publicly available.
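To make the chunk-selection loop concrete, the sketch below shows a generic UCB1-style bandit over context chunks that updates each chunk's score from reward feedback. The abstract does not specify LongMab-PO's actual bandit rule, reward model, or response-collection pipeline, so the exploration formula, the `reward_fn` stub, and all function names here are illustrative assumptions rather than the paper's implementation.

```python
import math
import random


def ucb_chunk_selection(num_chunks, num_rounds, reward_fn, c=1.0):
    """Iteratively pick context chunks with a UCB1-style bandit and
    update each chunk's running score from reward feedback."""
    counts = [0] * num_chunks     # times each chunk was selected
    values = [0.0] * num_chunks   # running mean reward per chunk
    collected = []                # (chunk_id, reward) pairs gathered over rounds

    for t in range(1, num_rounds + 1):
        # Try every chunk once before applying the UCB rule.
        untried = [i for i, n in enumerate(counts) if n == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(
                range(num_chunks),
                key=lambda i: values[i] + c * math.sqrt(math.log(t) / counts[i]),
            )

        # In the described framework, the reward would come from scoring a
        # response generated with the selected chunk; here it is a stub.
        reward = reward_fn(arm)

        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        collected.append((arm, reward))

    return values, collected


if __name__ == "__main__":
    # Toy reward: chunk 2 is the most "informative" one on average.
    toy_reward = lambda i: random.gauss(0.8 if i == 2 else 0.3, 0.1)
    scores, samples = ucb_chunk_selection(num_chunks=5, num_rounds=50, reward_fn=toy_reward)
    print("estimated chunk scores:", [round(s, 2) for s in scores])
```

In the full pipeline, the rewards would be assigned to responses generated from the selected chunks, and the highest- and lowest-scoring responses could then be paired as DPO preference data; the toy reward above only stands in for that scoring step.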