Although unified multimodal understanding and generation models have made notable progress in image generation, a significant gap remains in instruction following and detail preservation compared with systems that tightly couple understanding with generation, such as GPT-4. Motivated by this gap, we explore how to improve text-to-image (T2I) generation through interleaving reasoning. We propose an Interleaving Reasoning Generation (IRG) framework that alternates between text-based reasoning and image synthesis: IRG first produces a text-based reasoning process that guides the generation of an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving the intended semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two goals: strengthening the initial reasoning-and-generation stage, and ensuring high-quality textual reflection together with faithful realization of those refinements in the subsequent image. Using the IRGL-300K dataset, organized into six decomposed learning modes, we start from a unified base model that natively emits interleaved text-image outputs. Training proceeds in two stages: we first build robust reasoning and reflection capabilities, then efficiently tune the IRG pipeline on full thinking-image trajectory data. Experiments show absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, along with substantial improvements in visual quality and detail fidelity. The code, model weights, and dataset will be released publicly.
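As a rough illustration of the interleaved reason-generate-reflect loop described above, the following Python sketch outlines one possible control flow. The model interface (`reason`, `generate_image`, `reflect`) is a hypothetical placeholder standing in for a unified multimodal model that emits interleaved text and image outputs; it is not the paper's actual API.

```python
# Minimal sketch of an interleaving reasoning generation loop, assuming a
# hypothetical `model` object exposing reason / generate_image / reflect.

def interleaving_reasoning_generation(model, prompt, max_rounds=2):
    """Alternate text-based reasoning with image synthesis, refining each round."""
    # Step 1: think first, then synthesize an initial image guided by the plan.
    reasoning = model.reason(prompt)                  # text-based planning of the scene
    image = model.generate_image(prompt, reasoning)   # initial image from the reasoning

    for _ in range(max_rounds):
        # Step 2: reflect in text on the current image (details, quality, aesthetics).
        reflection = model.reflect(prompt, reasoning, image)
        if reflection.is_satisfactory:                # stop when no further edits are proposed
            break
        # Step 3: re-synthesize conditioned on the reflection, preserving semantics
        # while improving fine-grained details and visual quality.
        image = model.generate_image(prompt, reflection.text, prior_image=image)
        reasoning = reflection.text

    return image
```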