Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Created by
  • Haebom

Authors

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Outline

This paper notes that emotion control in existing Text-to-Speech (TTS) systems remains coarse and limited, and proposes EmoSteer-TTS, a novel method that enables fine-grained control over speech emotion (transformation, interpolation, and erasure) without any training. EmoSteer-TTS alters the emotional tone of synthesized speech by modifying the internal activations of a flow-matching-based TTS model. The authors develop an efficient, training-free algorithm consisting of activation extraction, emotion token retrieval, and inference-time steering, which is compatible with a variety of pre-trained models. By constructing an emotional speech dataset covering diverse speakers, they derive effective steering vectors. Experimental results demonstrate fine-grained, interpretable, and continuous control of speech emotion that surpasses existing state-of-the-art (SOTA) methods. This is the first method to achieve fine-grained, continuous emotion control entirely without training.
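The general idea of inference-time activation steering can be sketched as follows. This is a minimal illustration in PyTorch, assuming a layer that outputs hidden states of shape (time, dim); the helper names, hook placement, and mean-difference construction of the steering vector are assumptions for illustration only, not the authors' actual EmoSteer-TTS algorithm (which additionally involves emotion token search within a flow-matching TTS model).

```python
# Minimal sketch of activation steering for emotion control (assumed form,
# not the authors' implementation). Assumes a PyTorch model whose chosen
# layer outputs hidden states of shape (time, dim).
import torch
import torch.nn as nn


def collect_activations(layer: nn.Module, run_model) -> torch.Tensor:
    """Run the model once and capture the chosen layer's output via a hook."""
    captured = []
    handle = layer.register_forward_hook(lambda m, i, o: captured.append(o.detach()))
    run_model()                        # e.g. model(text, reference_audio)
    handle.remove()
    return captured[0]                 # (time, dim) hidden states


def build_steering_vector(emo_acts: list[torch.Tensor],
                          neu_acts: list[torch.Tensor]) -> torch.Tensor:
    """Steering vector as the mean activation difference between emotional
    and neutral reference speech (assumed construction)."""
    emo_mean = torch.stack([a.mean(dim=0) for a in emo_acts]).mean(dim=0)
    neu_mean = torch.stack([a.mean(dim=0) for a in neu_acts]).mean(dim=0)
    return emo_mean - neu_mean


def add_steering_hook(layer: nn.Module, v: torch.Tensor, alpha: float):
    """Shift the layer's output by alpha * v during synthesis.
    alpha scales emotion intensity; alpha = 0 leaves the model unchanged,
    and a negative alpha moves toward erasing the emotion."""
    return layer.register_forward_hook(lambda m, i, o: o + alpha * v)
```

Because the steering vector is simply added with a continuous scale factor at inference time, no gradient updates are needed, which is what makes this kind of control training-free and tunable.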

Takeaways, Limitations

Takeaways:
A novel method is presented that enables fine-grained, continuous control of speech emotion without training.
Development of an efficient algorithm that can be easily integrated into existing TTS models.
Demonstrated excellent performance on various pre-trained TTS models.
Provides interpretable and intuitive emotional control.
Limitations:
The effectiveness of the proposed method may depend on the specific type of TTS model (flow-matching-based).
Further research is needed on generalization performance across a variety of emotional expressions.
Performance may be affected by the scope and quality of the constructed emotional speech dataset.
Further evaluation of robustness and generalization performance in real-world applications is needed.