Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Diffusion Beats Autoregressive in Data-Constrained Settings

Created by
  • Haebom

Authors

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

Outline

While autoregressive (AR) models have long dominated the field of large-scale language models, diffusion-based language models have recently emerged as a promising alternative. In this paper, we systematically study masked diffusion models in data-constrained settings and find that diffusion models significantly outperform autoregressive models when computational resources are abundant but data is scarce. Diffusion models benefit from repeated passes over the same data, continuing to reduce validation loss and achieving superior performance on downstream tasks. This advantage can be interpreted as implicit data augmentation: masked diffusion exposes the model to diverse token orderings and prediction tasks, unlike the fixed left-to-right factorization of autoregressive models. We propose a new scaling law for diffusion models and derive a closed-form expression for the critical compute threshold beyond which diffusion models begin to outperform autoregressive models. These results suggest that diffusion models are an attractive alternative to the traditional autoregressive paradigm when computational resources, rather than data, are the bottleneck.
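To make the implicit-data-augmentation point concrete, here is a minimal Python sketch (not the authors' code; the toy sequence, masking ratios, and helper names are illustrative assumptions) contrasting the fixed left-to-right AR objective with masked-diffusion-style random masking:

```python
# Illustrative sketch (not from the paper): how one training sequence turns into
# prediction tasks under a fixed left-to-right AR objective vs. masked diffusion.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

def ar_examples(seq):
    """Autoregressive objective: always predict token t from the prefix seq[:t]."""
    return [(seq[:t], seq[t]) for t in range(1, len(seq))]

def masked_diffusion_example(seq, rng):
    """Masked-diffusion-style objective: sample a masking ratio, hide a random
    subset of positions, and predict the hidden tokens from the visible ones."""
    ratio = rng.uniform(0.1, 0.9)  # random corruption level (assumed schedule)
    masked = sorted(rng.sample(range(len(seq)), max(1, int(ratio * len(seq)))))
    visible = [tok if i not in masked else "[MASK]" for i, tok in enumerate(seq)]
    targets = {i: seq[i] for i in masked}
    return visible, targets

rng = random.Random(0)

# The AR prediction tasks are identical on every pass over the data.
print("AR (same every epoch):")
for prefix, target in ar_examples(tokens):
    print(f"  {prefix} -> {target}")

# The masked-diffusion tasks differ on every pass over the same sequence.
print("Masked diffusion (changes every epoch):")
for epoch in range(3):
    visible, targets = masked_diffusion_example(tokens, rng)
    print(f"  epoch {epoch}: {visible} -> {targets}")
```

Every pass over the same sequence yields a fresh set of masked-prediction tasks, which is one way to read the claim that diffusion models extract more value from repeated data.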

Takeaways, Limitations

Takeaways: Diffusion models outperform autoregressive models when computational resources are abundant and data is scarce. The implicit data augmentation effect of masked diffusion exposes the model to a variety of token orderings and prediction tasks. The paper also provides an analysis of diffusion scaling laws and the critical compute threshold (see the sketch after this list).
Limitations: The study is restricted to a specific data-constrained setting, and further research is needed to determine how well the findings generalize to other data distributions or tasks. The reported critical compute threshold is derived for a particular setup and may differ in other settings.
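The critical compute threshold can be stated schematically as follows (an abstract definition for intuition, not the paper's closed-form expression; $C$ denotes training compute, $U$ the amount of unique data, and $L_{\text{AR}}, L_{\text{diff}}$ the respective validation losses):

$$C^{*}(U) \;=\; \min\{\, C : L_{\text{diff}}(C, U) \le L_{\text{AR}}(C, U) \,\}$$

For a fixed budget of unique data $U$, diffusion becomes the preferred choice once available compute exceeds $C^{*}(U)$; the paper derives a closed-form expression for this crossover point from its fitted scaling laws.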