Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Created by
  • Haebom

Author

Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang

Outline

MuseControlLite is a lightweight mechanism for fine-tuning text-to-music generation models to follow various time-varying musical attributes and reference audio signals. The main finding of the paper is that positional embeddings, which have rarely been used in text-to-music generation models, are critical when the condition of interest is a function of time. Using melody control as an example, the authors show that simply adding rotary position embeddings to a separate cross-attention layer raises control accuracy from 56.6% to 61.1%, while requiring 6.75x fewer learnable parameters than a state-of-the-art fine-tuning mechanism built on the same pre-trained Stable Audio Open diffusion transformer. Evaluations across various forms of musical attribute control, audio inpainting, and audio outpainting demonstrate improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a much lower fine-tuning cost (only 85M learnable parameters). Source code, model checkpoints, and demo examples are available at https://musecontrollite.github.io/web/.
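To illustrate the core idea, below is a minimal, single-head PyTorch sketch of applying rotary position embeddings to the queries and keys of a cross-attention layer, so that the score between a latent frame and a time-varying condition frame depends on their relative positions in time. This is not the authors' implementation; the function names, shapes, and single-head formulation are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, dim), dim must be even.
    # Rotate the two halves of each feature vector by a position-dependent
    # angle (GPT-NeoX-style rotary position embedding).
    _, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()          # each: (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rotary_cross_attention(latents, condition, w_q, w_k, w_v):
    # latents:   (batch, n_frames, dim) -- the model's audio latent sequence
    # condition: (batch, n_frames, dim) -- a time-varying control (e.g. melody)
    # Rotating both queries and keys makes each attention score a function of
    # the relative time offset between a latent frame and a condition frame.
    q = apply_rope(latents @ w_q)
    k = apply_rope(condition @ w_k)
    v = condition @ w_v
    return F.scaled_dot_product_attention(q, k, v)

# Hypothetical shapes, for illustration only.
latents = torch.randn(2, 216, 64)
melody  = torch.randn(2, 216, 64)
w_q, w_k, w_v = (torch.randn(64, 64) / 64 ** 0.5 for _ in range(3))
out = rotary_cross_attention(latents, melody, w_q, w_k, w_v)  # (2, 216, 64)
```

Training only such added conditioner layers while keeping the pre-trained diffusion backbone frozen is one plausible way the learnable-parameter count stays small, consistent with the 85M figure reported above.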

Takeaways, Limitations

Takeaways:
Proposes MuseControlLite, a lightweight mechanism for precisely controlling text-to-music generation models with time-varying musical attributes.
Reveals the importance of positional embeddings for time-varying conditions, improving the control accuracy of text-to-music generation models.
Achieves improved control performance with significantly fewer learnable parameters (85M) than existing methods.
Demonstrates strong performance across various musical attribute controls, audio inpainting, and audio outpainting.
Source code, model checkpoints, and demo examples are publicly available.
Limitations:
The experimental results may be specific to the single pre-trained model used (Stable Audio Open).
Generalization to other text-to-music generation models or to more complex musical attributes requires further study.
Further analysis is needed to determine whether the performance gains come solely from positional embeddings or whether other factors also contribute.