Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Human-Centered Editable Speech-to-Sign-Language Generation via Streaming Conformer-Transformer and Resampling Hook

Created by
  • Haebom

Author

Yingchao Li

Outline

This paper proposes a human-centered, real-time speech-to-sign-animation framework that addresses the limitations of existing end-to-end sign language animation systems: limited naturalness, restricted facial and body expressiveness, and a lack of user control. The framework consists of (1) a streaming Conformer encoder paired with an autoregressive Transformer-MDN decoder that generates synchronized upper-body and facial motion, (2) a transparent, editable JSON intermediate representation that lets both deaf users and experts inspect and modify each sign segment, and (3) a human-in-the-loop optimization stage that refines the model based on user edits and ratings. Deployed in Unity3D, the system achieves an average per-frame inference time of 13 ms and an end-to-end latency of 103 ms on an RTX 4070. Key contributions include a JSON-centric editing mechanism for fine-grained, sign-level personalization and the first application of an MDN-based feedback loop for continuous model adaptation. In a study with 20 deaf signers and 5 professional interpreters, the system achieved a 13-point SUS improvement over the baseline, a 6.7-point reduction in cognitive load, and significant gains in naturalness and reliability (p < .001). The work establishes a scalable, explainable-AI paradigm for accessible sign language technology.
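The paper does not spell out the schema of its JSON intermediate representation, but the idea of an inspectable, per-segment record that users can override is easy to illustrate. The sketch below is a hypothetical example: the field names (`gloss`, `duration_ms`, `intensity`, etc.) and the `apply_edit` helper are illustrative assumptions, not the authors' actual format.

```python
# A minimal sketch of an editable JSON intermediate representation for sign
# segments. All field names and values are illustrative assumptions, not the
# paper's actual schema.
import json

segment = {
    "gloss": "THANK-YOU",          # sign-level label shown to the editor
    "start_ms": 1200,              # onset in the output timeline
    "duration_ms": 650,            # how long the sign is held
    "hands": {"dominant": "flat-B", "non_dominant": "rest"},
    "facial": {"eyebrows": "raised", "mouthing": "thank you"},
    "intensity": 0.8,              # scales motion amplitude at synthesis time
}

def apply_edit(segment: dict, field: str, value) -> dict:
    """Return a copy of the segment with one field overridden by the user."""
    return {**segment, field: value}

# A deaf user or interpreter could, e.g., slow a sign down before re-synthesis:
print(json.dumps(apply_edit(segment, "duration_ms", 900), indent=2))
```

Because the representation sits between recognition and animation, an edit like the one above can be re-rendered without retraining, which is what makes sign-level personalization cheap for the user.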

Takeaways, Limitations

Takeaways:
  • An efficient framework for generating natural, real-time sign language animation.
  • A JSON-based editing mechanism that makes the system customizable and explainable.
  • Continuous model improvement and user engagement via an MDN-based feedback loop (see the sketch after this list).
  • Improved communication accessibility and reduced cognitive load for deaf and hard-of-hearing users.
  • High-speed processing (13 ms per-frame inference, 103 ms end-to-end latency).
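The summary names a Transformer-MDN decoder but gives no implementation details. A mixture density network (MDN) head typically parameterizes a Gaussian mixture over the next pose frame, which is what makes sampling diverse yet controllable motion possible. Below is a minimal PyTorch sketch under that assumption; the class name, dimensions, and diagonal-Gaussian choice are illustrative, not the authors' code.

```python
# A minimal sketch of an MDN output head of the kind a Transformer-MDN decoder
# could use to predict pose frames. Assumed design: K diagonal Gaussians over
# a D-dimensional pose vector; not the paper's actual implementation.
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, hidden_dim: int, pose_dim: int, n_components: int):
        super().__init__()
        self.pose_dim = pose_dim
        self.n_components = n_components
        # One linear layer emits mixture logits, means, and log-stddevs.
        self.proj = nn.Linear(hidden_dim, n_components * (1 + 2 * pose_dim))

    def forward(self, h: torch.Tensor):
        k, d = self.n_components, self.pose_dim
        out = self.proj(h)
        logits = out[..., :k]                                  # mixture weights (pre-softmax)
        mu = out[..., k:k + k * d].reshape(*h.shape[:-1], k, d)
        log_sigma = out[..., k + k * d:].reshape(*h.shape[:-1], k, d)
        return logits, mu, log_sigma

    @torch.no_grad()
    def sample(self, h: torch.Tensor) -> torch.Tensor:
        """Draw one pose frame per decoder state."""
        logits, mu, log_sigma = self.forward(h)
        comp = torch.distributions.Categorical(logits=logits).sample()
        idx = comp[..., None, None].expand(*comp.shape, 1, self.pose_dim)
        mu_c = torch.gather(mu, -2, idx).squeeze(-2)
        sigma_c = torch.gather(log_sigma, -2, idx).squeeze(-2).exp()
        return mu_c + sigma_c * torch.randn_like(mu_c)
```

Training such a head minimizes the mixture negative log-likelihood; one plausible reading of the paper's MDN-based feedback loop is that user edits are converted into corrected pose targets and replayed through that same loss for continuous adaptation.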
Limitations:
  • The current system covers only upper-body and facial motion; lower-body movement is not modeled.
  • The range of supported sign languages and signing styles requires further investigation.
  • Model training and generalization could be improved by leveraging larger-scale datasets.
  • The JSON editing mechanism needs usability improvements and a more intuitive interface.