Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Created by
  • Haebom

Authors

Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

Outline

Kling-Foley is a large-scale multimodal video-to-audio generation model that synthesizes high-quality audio synchronized with video. It introduces a multimodal diffusion transformer to model the interactions among the video, audio, and text modalities, and strengthens alignment by combining a visual semantic representation module with an audio-visual synchronization module. These modules align video conditions with latent audio elements at the frame level, improving both semantic alignment and audio-visual synchrony. Combined with text conditions, this integrated approach enables accurate generation of sound effects that match the video. The authors also propose a universal latent audio codec that achieves high-quality modeling across diverse scenarios, including sound effects, speech, songs, and music, and apply a stereo rendering method that gives the synthesized audio spatial presence. To address the incomplete taxonomies and annotations of existing open-source benchmarks, they open-source an industry-grade benchmark, Kling-Audio-Eval. Trained with a flow matching objective, Kling-Foley achieves new state-of-the-art performance among open-source models in distribution matching, semantic alignment, temporal alignment, and audio quality.
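To make the flow matching objective mentioned above concrete, here is a minimal sketch under common rectified-flow conventions; it is not the authors' code, and `model`, `video_feats`, and `text_feats` are hypothetical names standing in for the multimodal diffusion transformer and its conditioning inputs.

```python
# Minimal flow matching sketch (assumptions, not Kling-Foley's implementation):
# the model regresses the velocity that transports noise to data along the
# straight path x_t = (1 - t) * noise + t * data, conditioned on per-frame
# video features and text features.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, audio_latents, video_feats, text_feats):
    # audio_latents: (batch, frames, dim) latent audio from the codec.
    noise = torch.randn_like(audio_latents)
    # One random time step per sample, broadcast over frames and channels.
    t = torch.rand(audio_latents.size(0), 1, 1, device=audio_latents.device)
    x_t = (1 - t) * noise + t * audio_latents
    target_velocity = audio_latents - noise  # d/dt of the straight path
    # Hypothetical model signature: predicts velocity given noisy latents,
    # the time step, and frame-aligned video plus text conditions.
    pred_velocity = model(x_t, t.squeeze(-1).squeeze(-1),
                          video=video_feats, text=text_feats)
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference time, such a model would be sampled by integrating the predicted velocity field from noise to data with an ODE solver; the frame-level video conditioning is what enforces temporal alignment during this process.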

Takeaways, Limitations

Takeaways:
  • Introduces a new state-of-the-art model for high-quality video-synchronized audio synthesis.
  • Improves alignment through a multimodal diffusion transformer plus dedicated semantic and synchronization modules.
  • Develops a universal latent audio codec covering diverse audio types (sound effects, speech, songs, music).
  • Releases Kling-Audio-Eval, an industry-grade audio-visual benchmark.
Limitations:
  • Existing open-source benchmarks suffer from incomplete taxonomies and annotations (partially addressed by the release of Kling-Audio-Eval).
  • The computing resources required for training and inference are not explicitly reported.
  • The model's generalization performance requires further evaluation.