Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Created by
  • Haebom

Authors

Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

Outline

This paper addresses Foley synthesis: generating high-quality audio that is semantically and temporally aligned with a video, here by building on a pre-trained audio generative model. Existing ControlNet-based Foley synthesis methods rely on hand-crafted temporal conditions. To overcome this limitation, the paper proposes SpecMaskFoley, which applies ControlNet to a pre-trained SpecMaskGIT model. A frequency-aware temporal feature aligner resolves the mismatch between the temporal features of the video and the time-frequency features of SpecMaskGIT, so that a single ControlNet branch is sufficient. SpecMaskFoley is reported to outperform existing from-scratch models, which strengthens the case for ControlNet-based Foley synthesis.
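The summary does not describe how the frequency-aware temporal feature aligner works internally, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that per-frame video features are linearly resampled to the spectrogram's time resolution and that a per-band embedding is added so each frequency band receives a distinct condition. The function name `align_video_to_spectrogram`, the parameter `freq_embed`, and both design choices are hypothetical.

```python
import numpy as np


def align_video_to_spectrogram(video_feats, n_time, n_freq, freq_embed):
    """Map (T_video, D) video features onto an (n_freq, n_time, D) grid.

    Hypothetical sketch: linear resampling in time plus an additive
    per-frequency-band embedding. Not taken from the paper.
    """
    t_video, dim = video_feats.shape
    assert freq_embed.shape == (n_freq, dim)

    # Resample every feature channel from T_video frames to n_time steps.
    src = np.linspace(0.0, 1.0, t_video)
    dst = np.linspace(0.0, 1.0, n_time)
    resampled = np.stack(
        [np.interp(dst, src, video_feats[:, d]) for d in range(dim)],
        axis=1,
    )  # shape: (n_time, D)

    # Broadcast over frequency bands and add a band-specific embedding,
    # giving one conditioning vector per time-frequency position.
    return resampled[None, :, :] + freq_embed[:, None, :]


# Example: 4 video frames with 2-dim features, aligned to a 3 x 7 grid.
video = np.arange(8, dtype=float).reshape(4, 2)
bands = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
grid = align_video_to_spectrogram(video, n_time=7, n_freq=3, freq_embed=bands)
print(grid.shape)
```

In a ControlNet-style setup, a grid like this would be fed to the trainable control branch while the pre-trained generator stays frozen; the learned counterpart of `freq_embed` is what would make the condition frequency-aware under this assumption.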

Takeaways, Limitations

Takeaways:
Makes Foley synthesis more efficient by leveraging a pre-trained model.
Extends the usability of ControlNet, achieving superior performance without complex conditioning mechanisms.
Opens new possibilities for ControlNet-based Foley synthesis research by outperforming existing from-scratch models.
Resolves the mismatch between temporal video features and time-frequency features with a frequency-aware temporal feature aligner.
Limitations:
The performance of the proposed method may be limited to certain benchmark datasets.
Additional evaluation of generalization across different types of video and audio is needed.
The approach may inherit architectural limitations from the SpecMaskGIT model it builds on.