While autoregressive (AR) models have long dominated large-scale language modeling, diffusion-based language models have recently emerged as a promising alternative. In this paper, we systematically study masked diffusion models in data-constrained settings, where compute is abundant but data is scarce, and find that they significantly outperform AR models in this regime. Diffusion models make far better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse set of token orderings and prediction tasks, unlike the fixed left-to-right factorization of AR models. We further propose a new scaling law for diffusion models and derive a closed-form expression for the critical compute threshold beyond which diffusion models outperform AR models. These results suggest that, when compute rather than data is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
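To make the contrast concrete, the sketch below (in PyTorch; not code from the paper) compares the fixed left-to-right AR objective with a generic masked-diffusion objective in which the mask rate and mask positions are resampled at every training step. The vocabulary size, `MASK_ID`, toy model, tensor shapes, and the standard 1/t ELBO-style weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

VOCAB, MASK_ID = 1000, 999   # assumed toy vocabulary size; last id reserved for [MASK]


def ar_loss(model, tokens):
    """Autoregressive NLL: every pass over the data poses the same
    left-to-right next-token prediction task."""
    logits = model(tokens[:, :-1])                                   # (B, L-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))


def masked_diffusion_loss(model, tokens):
    """Masked-diffusion NLL: a fresh mask rate and mask pattern are drawn
    every step, so repeated epochs over the same data yield new prediction
    tasks (the 'implicit data augmentation' effect)."""
    b, L = tokens.shape
    t = torch.rand(b, 1).clamp(min=1e-3)                             # per-sequence mask rate
    mask = torch.rand(b, L) < t                                      # random positions to corrupt
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)                                        # (B, L, VOCAB)
    per_tok = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return ((per_tok * mask) / t).sum(dim=1).mean() / L              # 1/t weighting, as in standard masked-diffusion ELBOs


# Toy usage: any sequence-to-logits model works; here a trivial embedding + linear head.
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB))
tokens = torch.randint(0, MASK_ID, (4, 32))                          # keep MASK_ID out of the data
print(ar_loss(model, tokens).item(), masked_diffusion_loss(model, tokens).item())
```

The key design difference is visible in `masked_diffusion_loss`: because the corruption is resampled each step, a sequence seen many times still produces new conditioning contexts and prediction targets, whereas `ar_loss` presents an identical task on every repetition.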