Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Created by
  • Haebom

Author

Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen

Outline

As real-world applications increasingly require long context windows, continued pretraining on long-context data followed by supervised fine-tuning (SFT) has become a common recipe for large language models (LLMs). While prior work has extensively studied the effect of data length in continued pretraining, its effect in SFT remains unclear. This study systematically investigates how the length of SFT data affects LLM performance on short-context tasks. Counterintuitively, long-context SFT improves short-context performance, in contrast to the degradation typically observed with long-context pretraining. To uncover the underlying mechanism, the authors analyze two key components, multi-head attention (MHA) and the feed-forward network (FFN), and show that both independently benefit from long-context SFT. Examining their interaction further reveals a knowledge preference bias: long-context SFT favors contextual knowledge, while short-context SFT favors parametric knowledge, so relying solely on long-context SFT is suboptimal. Finally, the authors show that hybrid training mitigates this bias, providing explainable guidance for fine-tuning LLMs.
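The paper itself does not include code; purely as an illustration of the hybrid-training idea, one might mix long- and short-context SFT examples according to a target ratio. The function name and the `long_ratio` parameter below are hypothetical, not from the paper:

```python
import random

def mix_sft_data(short_examples, long_examples, long_ratio=0.5, seed=0):
    """Sketch of a hybrid SFT mix: sample long- and short-context
    examples so that roughly `long_ratio` of the result is long-context.
    This is an illustrative recipe, not the paper's actual procedure."""
    rng = random.Random(seed)
    total = len(short_examples) + len(long_examples)
    n_long = min(len(long_examples), int(total * long_ratio))
    n_short = min(len(short_examples), total - n_long)
    mixed = rng.sample(long_examples, n_long) + rng.sample(short_examples, n_short)
    rng.shuffle(mixed)  # interleave so batches see both lengths
    return mixed
```

The `long_ratio` knob corresponds to the hybrid ratio that the paper's Limitations note as needing further tuning analysis.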

Takeaways, Limitations

Takeaways:
Long-context SFT improves LLM performance on short-context tasks, contrary to the degradation typically observed with long-context pretraining.
Both MHA and FFN independently benefit from long-context SFT.
Long-context SFT biases the model toward contextual knowledge, while short-context SFT biases it toward parametric knowledge.
Hybrid training mitigates this knowledge preference bias and offers explainable guidance for fine-tuning LLMs.
Limitations:
This study may be limited to a specific type of LLM and dataset. Further research is needed on a wider range of LLMs and datasets.
Optimal hybrid-training strategies remain to be determined; in-depth analysis of parameter choices, such as the hybrid ratio, is lacking.
A more in-depth mechanism analysis of the causes of knowledge preference bias is needed.