This paper presents a study that uses a diffusion model to generate tile-based game levels conditioned on text. Unlike previous work, which focused on unconditional level generation, this study targets level generation from text input. To this end, we present a strategy for automatically assigning captions to existing level datasets, along with methods for training the diffusion model using either a pre-trained text encoder or a newly trained simple Transformer model. We evaluate the diversity and playability of the generated levels and compare our approach with existing unconditional diffusion models, generative adversarial networks (GANs), the Five-Dollar Model, and MarioGPT. In particular, we show that the diffusion model paired with the simple Transformer outperforms models that use complex text encoders while requiring less training time, suggesting that relying on a large language model is unnecessary. Finally, we provide a GUI that allows users to assemble longer levels from the generated level fragments.
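
To illustrate the general idea of conditioning a tile-level denoiser on a caption embedding, the following is a minimal sketch, not the paper's actual architecture; the class name `TileDenoiser`, the layer sizes, and the broadcast-and-concatenate conditioning scheme are illustrative assumptions.

```python
# Minimal sketch (illustrative only): a denoiser over one-hot tile grids
# conditioned on a caption embedding produced by some text encoder.
import torch
import torch.nn as nn

class TileDenoiser(nn.Module):
    """Predicts noise for a one-hot tile grid, conditioned on a text embedding."""
    def __init__(self, num_tile_types=10, text_dim=128, hidden=64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(num_tile_types + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, num_tile_types, 3, padding=1),
        )

    def forward(self, noisy_level, text_emb):
        # Broadcast the projected caption embedding over the level grid and
        # concatenate it with the noisy tile channels before denoising.
        b, _, h, w = noisy_level.shape
        cond = self.text_proj(text_emb).view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.net(torch.cat([noisy_level, cond], dim=1))

# Shape check with random tensors standing in for a noisy level and a caption embedding.
model = TileDenoiser()
noisy_level = torch.randn(4, 10, 14, 14)   # batch of noisy one-hot tile grids
caption_emb = torch.randn(4, 128)          # caption embeddings from a text encoder
noise_pred = model(noisy_level, caption_emb)
print(noise_pred.shape)                    # torch.Size([4, 10, 14, 14])
```

In practice, the caption embedding could come either from a frozen pre-trained text encoder or from a small Transformer trained jointly with the denoiser, which is the comparison the paper reports.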