Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper remains with its authors and their institutions; when sharing, please cite the source.

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Created by
  • Haebom

Authors

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

VSSFlow: Unified Video-to-Sound and Visual Text-to-Speech Generation with Flow Matching

Outline

This paper presents VSSFlow, a unified flow-matching framework that handles both Video-to-Sound (V2S) generation and Visual Text-to-Speech (VisualTTS) in a single model. VSSFlow introduces a condition aggregation mechanism that accommodates the differing condition types of the two tasks, using cross-attention and self-attention layers whose inductive biases suit the characteristics of each condition. The two tasks are trained jointly end to end, which yields significant performance gains without requiring complex training strategies.
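The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of one plausible reading of the condition aggregation mechanism: one condition stream is concatenated with the audio latents and processed by self-attention, while another is injected through cross-attention. The module names, tensor shapes, and the assignment of condition types to the two attention paths are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConditionAggregationBlock(nn.Module):
    """Illustrative transformer block: audio latents are concatenated with one
    condition stream (self-attention path) and attend to a second condition
    stream (cross-attention path). All names here are hypothetical."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latents, concat_cond, cross_cond):
        # Self-attention over the concatenation of audio latents and one
        # condition stream (e.g., text/phoneme tokens for VisualTTS).
        x = torch.cat([latents, concat_cond], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention from that sequence to the other condition stream
        # (e.g., frame-level video features for V2S).
        x = x + self.cross_attn(self.norm2(x), cross_cond, cross_cond)[0]
        x = x + self.ffn(self.norm3(x))
        # Keep only the positions corresponding to the audio latents.
        return x[:, : latents.shape[1]]
```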

Takeaways, Limitations

Proposes a new flow-matching framework that unifies V2S and VisualTTS.
Handles different condition types effectively by leveraging cross-attention and self-attention.
Improves performance and training stability through end-to-end joint learning (see the sketch after this list).
Achieves state-of-the-art performance on V2S and VisualTTS benchmarks.
Limitations are not specifically mentioned; they do not appear in the abstract, so they cannot be inferred.
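For reference, end-to-end joint training with flow matching typically reduces to a single velocity-regression objective shared across tasks. The sketch below illustrates this under the standard rectified-flow / conditional flow-matching formulation; `model`, `conditions`, and the data-loader interleaving are hypothetical stand-ins, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, audio_latent, conditions, optimizer):
    """One illustrative flow-matching update: regress the model's velocity
    field toward (x1 - x0) along a straight interpolation path."""
    x1 = audio_latent                      # data sample (target audio latent)
    x0 = torch.randn_like(x1)              # Gaussian noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1          # point on the straight path
    v_target = x1 - x0                     # constant velocity of that path
    v_pred = model(x_t, t.flatten(), **conditions)  # hypothetical signature
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Joint learning: mix or alternate batches from the V2S and VisualTTS
# datasets so both condition types train the same velocity-field model, e.g.:
# for batch in interleave(v2s_loader, visualtts_loader):
#     flow_matching_step(model, batch["audio_latent"], batch["conditions"], optimizer)
```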