Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Created by
  • Haebom

Authors

Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen

Outline

Large Language Models (LLMs) have demonstrated impressive performance on natural language processing (NLP) tasks. As real-world applications increasingly demand long context windows, continual pretraining on long-context data followed by supervised fine-tuning (SFT) has become a common approach. While the impact of data length has been studied extensively for continual pretraining, its impact on SFT remains unclear. In this study, we systematically investigate how the length of SFT data affects LLM performance on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the degradation typically observed with long-context pretraining. To uncover the underlying mechanism, we analyze the two main components, Multi-Head Attention (MHA) and the Feed-Forward Network (FFN), separately, and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge-preference bias: long-context SFT promotes reliance on contextual knowledge, while short-context SFT favors parametric knowledge, implying that relying solely on long-context SFT is suboptimal. Finally, we demonstrate that hybrid training mitigates these biases, providing interpretable guidance for fine-tuning LLMs.
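
The summary does not specify the exact hybrid-training recipe, but the core idea of mixing short- and long-context examples in a single SFT corpus can be sketched as follows. This is a minimal illustration: the function name, example schema, and the 50/50 mixing ratio are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of "hybrid" SFT data mixing: combine long-context and
# short-context examples in one training set so the model is not pushed
# exclusively toward contextual or parametric knowledge.
# The helper name, example schema, and default ratio are illustrative
# assumptions, not details from the paper.
import random
from typing import Dict, List


def build_hybrid_sft_set(
    short_examples: List[Dict[str, str]],
    long_examples: List[Dict[str, str]],
    n_total: int,
    long_fraction: float = 0.5,  # hypothetical ratio; the paper's optimal mix may differ
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample a mixed SFT dataset containing both short- and long-context examples."""
    rng = random.Random(seed)
    n_long = min(int(n_total * long_fraction), len(long_examples))
    n_short = min(n_total - n_long, len(short_examples))
    mixed = rng.sample(long_examples, n_long) + rng.sample(short_examples, n_short)
    rng.shuffle(mixed)  # interleave so each batch sees both context lengths
    return mixed


if __name__ == "__main__":
    # Toy data: short QA pairs vs. QA pairs wrapped in a long document context.
    short_data = [
        {"prompt": f"Q{i}: short question {i}?", "response": f"A{i}"} for i in range(100)
    ]
    long_data = [
        {"prompt": "<long document>\n" + f"Q{i}: answer from the document?", "response": f"A{i}"}
        for i in range(100)
    ]
    hybrid = build_hybrid_sft_set(short_data, long_data, n_total=50)
    n_long_ctx = sum("<long document>" in ex["prompt"] for ex in hybrid)
    print(len(hybrid), "examples,", n_long_ctx, "long-context")
```

In practice the mixed set would then be passed to a standard SFT trainer; the main design choice is the mixing ratio, and the paper's hybrid-training result suggests including both regimes rather than relying on long-context data alone.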

Takeaways, Limitations

Long-context SFT can improve short-context task performance.
Both MHA and FFN independently benefit from long-context SFT.
Long-context SFT biases the model toward contextual knowledge, while short-context SFT biases it toward parametric knowledge.
Hybrid training, mixing long- and short-context data, mitigates these biases.
The tasks evaluated may cover only a narrow range, so generalizability to other task types requires further investigation.