Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

Created by
  • Haebom

Authors

Or Tal, Felix Kreuk, Yossi Adi

Outline

This paper systematically compares two major modeling paradigms for text-to-music generation: autoregressive decoding and conditional flow-matching. The authors train models for both paradigms from scratch, using the same dataset, the same training configuration, and a similar underlying architecture, so that differences in results can be attributed to the paradigm itself. The models are evaluated on generation quality, robustness to inference settings, scalability, adherence to text and temporally aligned conditioning, and editing capability via audio inpainting. The comparison offers practical insight into the strengths, weaknesses, and trade-offs of each paradigm, and can inform the design and training of future text-to-music generation systems.
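To make the difference between the two paradigms concrete, here is a minimal sketch of how sampling typically proceeds in each. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy stand-ins for trained networks, the straight-line probability path, and the Euler solver are all hypothetical choices, and text conditioning is omitted for brevity.

```python
import random

def sample_autoregressive(next_token_probs, length, rng):
    """Generate tokens one at a time; each step conditions on the full prefix."""
    tokens = []
    for _ in range(length):
        probs = next_token_probs(tokens)  # distribution over the vocabulary
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
    return tokens

def sample_flow_matching(velocity, dim, num_steps, rng):
    """Integrate dx/dt = velocity(x, t) from noise (t=0) to a sample (t=1)
    with fixed-size Euler steps. All dimensions are updated in parallel."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for a trained autoregressive model over 4 tokens:
# it simply favors repeating the previous token.
def toy_next_token_probs(prefix):
    vocab_size = 4
    if not prefix:
        return [1.0 / vocab_size] * vocab_size
    return [0.7 if i == prefix[-1] else 0.1 for i in range(vocab_size)]

# Hypothetical stand-in for a trained velocity network. For a straight-line
# path x_t = (1 - t) * x0 + t * x1, the velocity is (x1 - x_t) / (1 - t).
TARGET = [1.0, -1.0, 0.5]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]

rng = random.Random(0)
tokens = sample_autoregressive(toy_next_token_probs, length=16, rng=rng)
latent = sample_flow_matching(toy_velocity, dim=3, num_steps=8, rng=rng)
```

In this sketch the autoregressive sampler needs one model call per generated token, while the flow-matching sampler needs one call per solver step regardless of output length, which is why settings such as the number of steps are among the inference choices a comparison like this one has to examine.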

Takeaways, Limitations

Takeaways:
A controlled comparison of autoregressive decoding and conditional flow-matching yields insights into the design of text-to-music generation models.
The strengths and weaknesses of each paradigm are shown along several evaluation axes: generation quality, robustness to inference settings, scalability, conditioning adherence, and inpainting-based editing.
The findings can guide design and training strategies for future text-to-music generation systems.
The trade-offs involved in choosing a modeling paradigm are laid out explicitly.
Limitations:
The analysis relies on a single dataset and architecture, so further research is needed to establish how well the findings generalize.
Only two paradigms are compared; other modeling paradigms are not considered and warrant further study.
The evaluation metrics have limitations, and some aspects of the evaluation are subjective.