In large language models (LLMs), continuous pretraining on long-context data followed by supervised fine-tuning (SFT) has become a common approach, as real-world applications increasingly require long context windows. While previous research has extensively investigated the impact of data length in continuous pretraining, its impact on SFT remains unclear. This study systematically investigates the effect of SFT data length on the performance of LLMs in short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the degradation typically observed with long-context pretraining. To elucidate the underlying mechanism of this phenomenon, we deconstruct two key components, the multi-head attention (MHA) and the feed-forward network (FFN), and demonstrate that both independently benefit from long-context SFT. Furthermore, we investigate their interaction and reveal a knowledge preference bias: long-context SFT favors contextual knowledge, while short-context SFT favors parametric knowledge, so relying solely on long-context SFT is suboptimal. Finally, we show that hybrid training mitigates this bias, providing explainable guidance for fine-tuning LLMs.
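
To make the abstract's two manipulations concrete, the sketch below is a minimal, hypothetical illustration rather than the paper's implementation: it assumes a generic PyTorch transformer block, and the names (`ToyTransformerBlock`, `freeze_all_but`) and sequence lengths are invented for demonstration. It shows (1) isolating the MHA or FFN sub-module by freezing the other's parameters during SFT, and (2) mixing short- and long-context batches as a stand-in for hybrid training.

```python
# Minimal sketch (assumed PyTorch setup, not the paper's code) of component-wise
# SFT and hybrid short/long-context data mixing.
import torch
import torch.nn as nn


class ToyTransformerBlock(nn.Module):
    """A single pre-norm transformer block standing in for one LLM layer."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.mha(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x


def freeze_all_but(model: nn.Module, component: str) -> None:
    """Keep only parameters whose name contains `component` ('mha' or 'ffn') trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = component in name


block = ToyTransformerBlock()

# Component-wise SFT: train only the attention weights, then only the FFN weights.
freeze_all_but(block, "mha")
print("Trainable for MHA-only SFT:",
      [n for n, p in block.named_parameters() if p.requires_grad])

freeze_all_but(block, "ffn")
print("Trainable for FFN-only SFT:",
      [n for n, p in block.named_parameters() if p.requires_grad])

# Hybrid SFT data: short- and long-context examples mixed in one training run,
# represented here only by their (batch, seq_len, d_model) shapes.
short_batch = torch.randn(4, 128, 64)    # short-context examples (length is illustrative)
long_batch = torch.randn(4, 1024, 64)    # long-context examples (length is illustrative)
for batch in (short_batch, long_batch):
    out = block(batch)
    print("Processed context length", batch.shape[1], "->", tuple(out.shape))
```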