Large Language Models (LLMs) have demonstrated impressive performance on natural language processing (NLP) tasks. With the increasing demand for long context windows in real-world applications, continual pretraining on long-context data followed by supervised fine-tuning (SFT) has become a common approach. While the impact of data length has been extensively studied for continual pretraining, its impact on SFT remains unclear. In this study, we systematically investigate how the length of SFT data affects LLM performance on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, in contrast to the degradation typically observed with long-context pretraining. To elucidate the underlying mechanism of this phenomenon, we analyze the two main components, Multi-Head Attention (MHA) and the Feed-Forward Network (FFN), separately, and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, suggesting that relying exclusively on long-context SFT is suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, providing explainable guidance for fine-tuning LLMs.
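To give a concrete sense of what such hybrid training could look like in practice, the sketch below mixes short- and long-context SFT examples at a configurable ratio before fine-tuning. This is a minimal illustrative sketch, not the recipe used in this work; the function name `mix_sft_data`, the `long_ratio` parameter, and the shuffling scheme are assumptions introduced here for illustration only.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def mix_sft_data(
    short_examples: Iterable[T],
    long_examples: Iterable[T],
    long_ratio: float = 0.5,
    seed: int = 0,
) -> Iterator[T]:
    """Yield a random interleaving of short- and long-context SFT examples.

    `long_ratio` is the probability of drawing the next example from the
    long-context pool; once one pool is exhausted, the remainder of the
    other pool is yielded unchanged. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)
    short_buf, long_buf = list(short_examples), list(long_examples)
    rng.shuffle(short_buf)
    rng.shuffle(long_buf)
    while short_buf and long_buf:
        pool = long_buf if rng.random() < long_ratio else short_buf
        yield pool.pop()
    yield from short_buf
    yield from long_buf


# Usage: build a mixed SFT corpus with roughly 30% long-context examples.
# mixed_corpus = list(mix_sft_data(short_pool, long_pool, long_ratio=0.3))
```

The mixing ratio is the key knob here: tilting it entirely toward long-context data would reproduce the contextual-knowledge bias described above, while a blend of both lengths is what the hybrid-training result suggests.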