Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

Created by
  • Haebom

Author

Yang Wu, Huayi Zhang, Yizheng Jiao, Lin Ma, Xiaozhong Liu, Jinhong Yu, Dongyu Zhang, Dezhi Yu, Wei Xu

Outline

This paper focuses on the problem of data selection for task-specific instruction fine-tuning of large-scale language models (LLMs). Existing methods primarily rely on similarity measures to select training data that matches the test data distribution. However, the authors note that the instruction fine-tuning loss (cross-entropy loss for next-token prediction) in LLMs does not exhibit a monotonic relationship with actual task performance. To address this discrepancy, they present Reward-Oriented Instruction Data Selection (ROSE), a novel method that optimizes data selection for task-specific instruction fine-tuning by using the pairwise preference loss as a reward signal. ROSE selects the most relevant training data points by applying an influence formulation to approximate how each training data point affects a small set of preference validation examples. Experimental results show that ROSE achieves results competitive with fine-tuning on the entire training dataset while selecting only 5% of the training data, outperforming existing state-of-the-art data selection methods. Qualitative analysis further confirms that the method generalizes robustly across multiple benchmark datasets and diverse model architectures.
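To make the core idea concrete, below is a minimal, self-contained sketch of reward-oriented influence scoring: per-example training-loss gradients are compared against the gradient of a pairwise preference (reward) loss on a small validation set, and the best-aligned training examples are kept. This is an illustrative approximation under simplifying assumptions (a tiny linear model, synthetic data, first-order gradient dot products); it is not the authors' exact implementation, and all names below are hypothetical.

```python
# Sketch of ROSE-style influence scoring (assumptions: tiny linear model,
# synthetic data, first-order influence via gradient dot products).
import torch
import torch.nn as nn

torch.manual_seed(0)

dim, n_train, n_val, k = 16, 200, 8, 10   # k = number of training examples to keep

model = nn.Linear(dim, 1)                 # stand-in for an LLM's trainable parameters

def flat_grad(loss):
    """Return the gradient of `loss` w.r.t. the model as one flat vector."""
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

# --- Training side: per-example instruction-tuning (cross-entropy-style) gradients
x_train = torch.randn(n_train, dim)
y_train = torch.randint(0, 2, (n_train, 1)).float()
bce = nn.BCEWithLogitsLoss()

train_grads = []
for i in range(n_train):
    loss_i = bce(model(x_train[i:i + 1]), y_train[i:i + 1])
    train_grads.append(flat_grad(loss_i))
train_grads = torch.stack(train_grads)    # shape: (n_train, n_params)

# --- Validation side: pairwise preference (reward) loss gradient
# Each validation item is a (chosen, rejected) pair; the loss is
# -log sigmoid(score_chosen - score_rejected), a standard pairwise preference loss.
x_chosen = torch.randn(n_val, dim)
x_rejected = torch.randn(n_val, dim)
pref_loss = -nn.functional.logsigmoid(model(x_chosen) - model(x_rejected)).mean()
val_grad = flat_grad(pref_loss)           # shape: (n_params,)

# --- Influence approximation: alignment between training and validation gradients
scores = train_grads @ val_grad            # higher = more helpful for the target task
selected = torch.topk(scores, k).indices   # keep the top-k training examples
print("selected indices:", selected.tolist())
```

In practice, the same scoring idea would be applied with gradients from a pretrained LLM (often via low-rank or random projections to keep the vectors manageable) and with a small preference validation set for the target task; the sketch only shows the selection logic.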

Takeaways, Limitations

Takeaways:
Presents ROSE, an efficient data selection method for task-specific instruction fine-tuning.
Addresses a limitation of existing methods: the discrepancy between the instruction fine-tuning loss and actual task performance.
Achieves performance comparable to fine-tuning on the full dataset while using only a small fraction of the data.
Demonstrates robust performance across diverse benchmark datasets and model architectures.
Limitations:
The performance of ROSE may depend on the quality of the preference validation set.
Generalization to other tasks and model architectures requires broader validation.
Further research is needed to determine whether using the pairwise preference loss as a reward signal is always optimal.