This paper aims to generate realistic, speech-synchronized facial movements for natural-looking, speech-driven 3D facial animation. Existing methods focus on minimizing a reconstruction loss that aligns each frame with the ground truth. However, because these frame-by-frame approaches do not account for articulatory co-operation between neighboring phonemes, they disrupt the continuity of facial movements and often produce shaky, unnatural results. To address this, we propose a novel context-aware loss function that explicitly models the impact of phonetic context on phoneme transitions. By incorporating phoneme-articulatory co-operation weights, we adaptively assign importance to facial movements according to their dynamic changes over time, yielding smoother, more perceptually consistent animation. Extensive experiments demonstrate that replacing conventional reconstruction losses with the proposed loss improves both quantitative metrics and visual quality. These results highlight the importance of explicitly modeling the phonetic-context dependence of phonemes when synthesizing natural-looking, speech-driven 3D facial animation.
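The abstract does not state the loss in closed form; as a rough illustration only, the sketch below shows one way a dynamics-weighted reconstruction loss of this flavor could look in PyTorch. The per-frame weight here, derived from the magnitude of ground-truth motion between consecutive frames, is merely a stand-in for the paper's phoneme-articulatory co-operation weights; the function name, tensor layout, and weighting scheme are all assumptions, not the authors' formulation.

```python
import torch


def context_aware_loss(pred, gt, eps=1e-6):
    """Hypothetical sketch of a dynamics-weighted reconstruction loss.

    pred, gt: (T, V, 3) tensors of predicted / ground-truth facial vertex
    positions over T frames. Frames whose ground-truth vertices move faster
    (e.g. around phoneme transitions) receive larger weights, so errors in
    regions of rapid facial motion are penalized more heavily than in a
    plain frame-wise mean-squared error.
    """
    # Per-frame magnitude of ground-truth motion (forward difference).
    velocity = torch.zeros(gt.shape[0], device=gt.device)
    velocity[1:] = (gt[1:] - gt[:-1]).norm(dim=-1).mean(dim=-1)

    # Normalize so the weights average to one, keeping the overall loss
    # scale comparable to an unweighted reconstruction loss.
    weights = velocity / (velocity.mean() + eps)

    per_frame_mse = ((pred - gt) ** 2).mean(dim=(1, 2))  # shape (T,)
    return (weights * per_frame_mse).mean()
```

In this sketch the weighting is purely geometric (motion magnitude); the paper's actual weights are described as depending on phonetic context, so a faithful implementation would condition them on the phoneme sequence rather than on vertex velocity alone.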