In this paper, we propose RealGeneral, a novel framework built on video models that unifies diverse image generation tasks within a single architecture. Existing approaches either rely on task-specific datasets and large-scale training or adapt pre-trained image models to each task, and both strategies generalize poorly. RealGeneral instead exploits the temporal correlation modeling ability of video models by reframing image generation as a conditional frame prediction task: the condition image and the target image are treated as consecutive frames of a short video. The framework comprises a unified conditional embedding module for multi-modal alignment and a unified stream DiT block that mitigates cross-modal interference. Experiments show that RealGeneral improves subject similarity by 14.5% on customized (subject-driven) generation and image quality by 10% on a Canny-to-image generation task.
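The conditional-frame-prediction framing can be illustrated with a minimal shape-level sketch (assumed, not the paper's implementation): the condition image is placed as the first frame and a noisy target as the second, so a video model's temporal attention can propagate conditioning information across frames.

```python
import numpy as np

def frames_from_condition(cond_img: np.ndarray, noisy_target: np.ndarray) -> np.ndarray:
    """Stack a condition image and a noisy target along a new temporal
    axis, producing a 2-frame 'video' of shape (T, H, W, C).

    A video diffusion model would then denoise only the second frame,
    conditioned on the first via its temporal layers (a sketch of the
    general idea, not RealGeneral's exact pipeline)."""
    assert cond_img.shape == noisy_target.shape, "frames must share a shape"
    return np.stack([cond_img, noisy_target], axis=0)

# Example: a Canny edge map as the condition frame, Gaussian noise as the target.
cond = np.zeros((64, 64, 3), dtype=np.float32)
noise = np.random.randn(64, 64, 3).astype(np.float32)
video = frames_from_condition(cond, noise)
print(video.shape)  # (2, 64, 64, 3)
```

The key point of the reformulation is that no task-specific head is needed: swapping the condition frame (edge map, subject photo, etc.) changes the task without changing the model.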