Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

A Training-Free Approach for Music Style Transfer with Latent Diffusion Models

Created by
  • Haebom

Authors

Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Shinjae Yoo, Yuewei Lin, Jiook Cha

Outline

This paper proposes Stylus, a novel training-free framework that performs music style transfer by directly manipulating the self-attention layers of a pre-trained latent diffusion model (LDM). Operating in the mel-spectrogram domain, Stylus transfers musical style by replacing the key and value representations of the content audio with those of a stylistic reference, without any fine-tuning. It further incorporates query preservation, CFG-inspired guidance scaling, multi-style interpolation, and phase-preserving reconstruction to improve stylization quality and controllability. Stylus significantly improves perceptual quality and structure preservation over prior work while remaining lightweight and easy to deploy. The study highlights the potential of diffusion-based attention manipulation for efficient, high-fidelity, and interpretable music generation without training.
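The core mechanism described above (keeping the content audio's queries while swapping in the style reference's keys and values, then scaling the result in a CFG-like way) can be sketched as follows. This is a minimal single-head NumPy illustration under assumed tensor shapes; the function and parameter names (e.g. `gamma`) are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def styled_attention(q_content, k_content, v_content, k_style, v_style, gamma=1.5):
    """Query-preserving style injection with CFG-inspired scaling (sketch).

    All inputs are (tokens, dim) arrays for a single attention head.
    gamma is a hypothetical guidance scale: gamma=0 recovers plain
    content self-attention, gamma=1 gives the fully styled output.
    """
    d = q_content.shape[-1]
    # Baseline: ordinary self-attention over the content features.
    out_c = softmax(q_content @ k_content.T / np.sqrt(d)) @ v_content
    # Style injection: the (preserved) content queries attend to
    # keys/values taken from the style reference instead.
    out_s = softmax(q_content @ k_style.T / np.sqrt(d)) @ v_style
    # CFG-inspired guidance: extrapolate from content toward style.
    return out_c + gamma * (out_s - out_c)
```

Setting `gamma > 1` extrapolates past the styled output, analogous to how classifier-free guidance amplifies the conditional direction relative to the unconditional one.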

Takeaways, Limitations

Takeaways:
Music style transfer is possible without any additional training by leveraging pre-trained diffusion models.
Perceptual quality and structure preservation are improved over existing methods.
The framework is lightweight and easy to deploy.
Query preservation, CFG-inspired guidance scaling, and related techniques improve stylization quality and controllability.
The results demonstrate the utility of diffusion-based attention manipulation.
Limitations:
The code will be released only after the paper is accepted.
Transfer performance across a wider range of music genres and styles still needs evaluation.
Comparative analysis against other music generation models is needed.
There is no quantitative ablation of individual components, such as CFG-inspired guidance scaling.