Despite recent advances in text-to-video generation with large-scale transformer-based diffusion models, existing models generate only short clips and remain limited for real-time, interactive applications. In this paper, we propose StreamDiT, a model for real-time streaming video generation. StreamDiT is trained with flow matching over a moving buffer, and mixed training with different partitioning schemes of the buffered frames improves content consistency and visual quality. We adopt adaLN DiT-based modeling with varying time embedding and window attention, and train a StreamDiT model with 4 billion parameters. In addition, we propose a multistep distillation method tailored to StreamDiT, which performs sampling distillation within each chunking interval; the reduced number of function evaluations enables real-time performance (16 FPS at 512p resolution). We verify performance with quantitative metrics and human evaluation, and demonstrate the model's potential for real-time applications such as streaming generation, interactive generation, and video-to-video translation.
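To make the moving-buffer idea concrete, the following is a minimal sketch, not the paper's actual implementation: buffered frames sit at staggered noise levels, every step denoises the whole buffer once, emits the now-clean head frame to the stream, and appends a fresh noise frame at the tail. The `denoiser` call signature, tensor shapes, and the uniform (per-frame) partitioning are assumptions for illustration; the paper's chunk-wise partitioning schemes and the distillation stage are not shown.

```python
import torch

def stream_with_moving_buffer(denoiser, text_emb, buffer_size=8, num_frames=64,
                              frame_shape=(16, 32, 32), device="cpu"):
    """Hedged sketch of streaming generation with a moving buffer (flow matching)."""
    # Staggered flow-matching times per buffer slot: head is one step from clean,
    # tail is pure noise (t = 1). These levels are fixed per position.
    t_levels = torch.arange(1, buffer_size + 1, device=device) / buffer_size
    dt = 1.0 / buffer_size  # time decrement applied at every streaming step
    buffer = torch.randn(buffer_size, *frame_shape, device=device)

    outputs = []
    for _ in range(num_frames):
        # One Euler step of the flow for every frame in the buffer.
        # `denoiser` is a hypothetical stand-in for the trained DiT that
        # predicts the velocity field given noisy frames, times, and text.
        v = denoiser(buffer, t_levels, text_emb)
        buffer = buffer - dt * v

        # The head frame has reached t = 0 (clean): emit it to the output stream.
        outputs.append(buffer[0])

        # Shift the buffer by one slot and append fresh noise at the tail,
        # so each position keeps its fixed noise level for the next step.
        new_noise = torch.randn(1, *frame_shape, device=device)
        buffer = torch.cat([buffer[1:], new_noise], dim=0)

    return torch.stack(outputs)
```

With a distilled model, the per-step cost of the `denoiser` call drops (fewer function evaluations per emitted frame), which is what makes the 16 FPS streaming rate attainable in this scheme.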