Autoregressive video models offer distinct advantages over bidirectional diffusion models for creating interactive video content and supporting streaming applications of arbitrary duration. We present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling efficient inference through iterative sampling and parallel token generation within each frame. To address the challenge of real-time video generation, we adapt consistency distillation to video models, enabling efficient inference with few sampling steps, and propose predictive sampling, which exploits the observation that adjacent frames often share the same action input. Through experiments on a large-scale action-conditioned video generation benchmark, we demonstrate that NFD outperforms autoregressive baselines in both visual quality and sampling efficiency, achieving, for the first time, autoregressive video generation at over 30 frames per second on an A100 GPU with a 310-million-parameter model.
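
To make the block-wise causal attention pattern concrete, the sketch below constructs the corresponding attention mask: tokens attend bidirectionally within their own frame and causally to all earlier frames. This is a minimal illustration under assumed names and tensor layout (the function `blockwise_causal_mask`, the frame and token counts), not the paper's implementation.

```python
import torch

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Build a block-wise causal attention mask (True = may attend).

    Tokens within the same frame attend to one another bidirectionally,
    while across frames a token may only attend to frames at or before
    its own frame. Layout and sizes here are illustrative assumptions.
    """
    # Frame index of every token in the flattened (frame-major) sequence.
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Token i may attend to token j iff frame(j) <= frame(i).
    mask = frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)
    return mask

# Example: 3 frames of 4 tokens each -> a 12x12 boolean mask where each
# 4x4 diagonal block is fully True (intra-frame bidirectional attention).
mask = blockwise_causal_mask(num_frames=3, tokens_per_frame=4)
```

Such a mask would be passed to a standard attention layer in place of the usual token-level causal mask, which is what allows all tokens of the next frame to be sampled in parallel while preserving causality across frames.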