In this paper, we propose a human-centric, real-time speech-to-sign animation framework that addresses key limitations of existing end-to-end sign language animation systems: limited naturalness, impoverished facial and body expressions, and a lack of user control. The framework consists of (1) a streaming Conformer encoder paired with an autoregressive Transformer-MDN decoder that generates synchronized upper-body and facial motion, (2) a transparent, editable JSON intermediate representation that allows both deaf users and experts to inspect and modify each sign segment, and (3) a human-in-the-loop optimization process that refines the model based on user edits and evaluations. Deployed in Unity3D, the system achieves an average per-frame inference time of 13 ms and an end-to-end latency of 103 ms on an RTX 4070. Key contributions include a JSON-centric editing mechanism for fine-grained, sign-level personalization and the first application of an MDN-based feedback loop for continuous model adaptation. In a study with 20 deaf signers and 5 professional interpreters, the system achieved a 13-point improvement in SUS score over the baseline, a 6.7-point reduction in cognitive load, and significant gains in perceived naturalness and reliability (p < .001). This work establishes a scalable, explainable AI paradigm for accessible sign language technology.
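To make the editable JSON intermediate representation concrete, the minimal Python sketch below illustrates what a single sign-segment record and a human edit might look like. The field names (gloss, start_ms, handshape, intensity, and so on) are illustrative assumptions for this sketch, not the schema actually used in the paper.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List

# Hypothetical sketch of one editable sign segment in a JSON intermediate
# representation; field names and units are assumptions, not the paper's schema.
@dataclass
class SignSegment:
    gloss: str                      # sign-language gloss label for the segment
    start_ms: int                   # segment onset relative to the utterance, in ms
    end_ms: int                     # segment offset, in ms
    handshape: str                  # dominant-hand handshape identifier
    facial_expression: str          # coarse facial-expression tag
    intensity: float = 1.0          # user-adjustable scaling of movement amplitude
    notes: List[str] = field(default_factory=list)  # free-form reviewer comments

# A deaf user or interpreter could inspect the serialized segment, adjust a
# field, and feed the edit back into the human-in-the-loop optimization step.
segment = SignSegment(
    gloss="HELLO",
    start_ms=0,
    end_ms=620,
    handshape="B-flat",
    facial_expression="neutral-smile",
)
segment.intensity = 0.8             # example human edit: soften the movement
segment.notes.append("reduce amplitude; too emphatic for a greeting")

print(json.dumps(asdict(segment), indent=2))
```

In a layout like this, each edit is a small, self-describing JSON diff, which is what would allow the feedback loop to attribute user corrections to individual sign segments rather than to whole utterances.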