This paper presents a framework for real-time interactive digital human video generation. To address the high computational cost and limited controllability of existing methods, we propose an autoregressive video generation method capable of low-latency inference. With minimal modifications to a large language model (LLM), our method accepts various conditional encodings, including audio, pose, and text, and outputs spatially and semantically consistent representations that guide the denoising process of a diffusion model. We construct a large-scale conversation dataset of approximately 20,000 hours for model training and introduce a deep compressive autoencoder with a compression ratio of up to 64x to effectively reduce the long-term inference load of the autoregressive model. Our approach demonstrates low latency, high efficiency, and fine-grained multimodal controllability across a range of experiments, including two-way conversation, multilingual human synthesis, and interactive world models.
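To make the effect of the compression ratio concrete, the following sketch works out how a 64x deep compressive autoencoder shrinks the token stream an autoregressive model must process over a long clip. The resolution, clip length, and one-token-per-latent-element assumption are illustrative choices for this example, not values taken from the paper.

```python
def token_count(frames: int, height: int, width: int, compression: int) -> int:
    """Tokens the autoregressive model attends over for one clip.

    Assumes (hypothetically) that each element of the compressed latent
    becomes one token, so a compression ratio r cuts the per-frame token
    count by a factor of r.
    """
    return frames * (height * width) // compression

# Assumed example clip: 16 frames at 512x512 resolution.
frames, h, w = 16, 512, 512
baseline = token_count(frames, h, w, 1)     # uncompressed pixel-level tokens
compressed = token_count(frames, h, w, 64)  # with a 64x compression ratio

print(baseline, compressed, baseline // compressed)
# 4194304 65536 64
```

At these assumed settings, the 64x autoencoder reduces the sequence from roughly 4.2M to 65K tokens per clip, which is the mechanism by which the long-term inference load of the autoregressive model stays tractable.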