This paper presents a systematic comparison and analysis of two major modeling paradigms in text-to-music generation: auto-regressive decoding and conditional flow-matching. Using the same dataset, the same training configuration, and a similar underlying architecture, we trained models for both paradigms from scratch and evaluated their performance across several axes: generation quality, robustness to inference-time settings, scalability, adherence to text and temporal alignment conditions, and editing capabilities via audio inpainting. The comparison offers practical insights into the strengths, weaknesses, and trade-offs of each paradigm, and can inform the future design and training of text-to-music generation systems.
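As brief background (these are the standard textbook formulations, not details taken from this paper): an auto-regressive model factorizes the distribution over a discrete audio-token sequence $x_{1:T}$ and is trained by maximum likelihood given text conditioning $c$,

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t},\, c\right),$$

whereas conditional flow-matching regresses a velocity field along a path from noise $x_0$ to data $x_1$, e.g. the linear interpolation $x_\tau = (1-\tau)\,x_0 + \tau\,x_1$ with flow time $\tau \in [0, 1]$,

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{\tau,\, x_0,\, x_1}\,\big\lVert v_\theta(x_\tau, \tau, c) - (x_1 - x_0) \big\rVert^2.$$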