This paper addresses the generation of realistic, high-fidelity 3D facial animation for expressive avatar systems in human-computer interaction and accessibility. To overcome the limitations of existing methods, which operate purely in the mesh domain and therefore overlook the perceptual quality of the rendered output, this paper proposes VisualSpeaker, a novel method that uses photorealistic differentiable rendering, supervised by visual speech recognition, to improve 3D facial animation. Its core contribution is a perceptual lip-reading loss, obtained by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition (VASR) model during training. Evaluation on the MEAD dataset shows that VisualSpeaker improves the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. In particular, this perceptual focus supports accurate mouthings, the mouth shapes that provide essential cues for distinguishing similar manual signs in sign language avatars.
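
Below is a minimal sketch of how such a perceptual lip-reading loss could be wired up, assuming a PyTorch setting. The `avatar.render` interface, the `vasr_model` callable, and the choice of a CTC-style sequence objective are all hypothetical stand-ins for illustration; the abstract does not specify these interfaces or the exact training objective.

```python
import torch
import torch.nn.functional as F

def perceptual_lip_reading_loss(pred_vertices, avatar, vasr_model,
                                target_tokens, target_lengths):
    """Hypothetical perceptual lip-reading loss.

    Assumed interfaces: `avatar.render` differentiably renders a
    (T, 3, H, W) video from per-frame mesh vertices; `vasr_model`
    maps video to per-frame token logits. The VASR parameters are
    assumed frozen (requires_grad=False) so that only the animation
    model receives gradient updates.
    """
    # Differentiable rendering: gradients flow from the rendered
    # pixels back to the predicted mesh vertices.
    frames = avatar.render(pred_vertices)              # (T, 3, H, W)

    # Frozen VASR: (1, T, C) logits -> (T, 1, C) log-probs, the
    # layout expected by torch's CTC loss.
    logits = vasr_model(frames.unsqueeze(0))           # (1, T, C)
    log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)

    # Score the rendered video against the ground-truth transcript;
    # a CTC objective is one plausible choice, not confirmed here.
    input_lengths = torch.full((1,), log_probs.size(0), dtype=torch.long)
    return F.ctc_loss(log_probs, target_tokens, input_lengths,
                      target_lengths)
```

In training, a perceptual term of this kind would typically be added to a standard geometric loss (e.g., a per-vertex error on the lip region) with a tunable weight; that combination is an assumption here, not a detail stated in the abstract.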