Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Interleaving Reasoning for Better Text-to-Image Generation

Created by
  • Haebom

Author

Wenxuan Huang, Shuang Chen, Zheyong

Outline

This paper notes that despite advances in the image generation capabilities of unified multimodal understanding and generation models, a significant gap remains in instruction following and detail preservation compared with systems that tightly couple understanding and generation, such as GPT-4o. To close this gap, the paper explores improving text-to-image (T2I) generation through interleaved reasoning and proposes the Interleaving Reasoning Generation (IRG) framework, which alternates between text-based reasoning and image synthesis. IRG first produces a chain of text-based reasoning and generates an initial image, then reflects on the result to refine details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, the paper proposes Interleaving Reasoning Generation Learning (IRGL), which aims to strengthen the initial reasoning and generation stage and to ensure high-quality textual reflection and its faithful realization in the subsequent image. Using the IRGL-300K dataset, organized into six decomposed learning modes, training starts from a unified base model that natively emits interleaved text-image outputs; a two-stage regimen first builds robust reasoning and reflection capabilities and then efficiently tunes the IRG pipeline on full thought-to-image trajectory data. Experiments show absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, along with substantial improvements in visual quality and detail fidelity. The code, model weights, and dataset will be released publicly.
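The alternating think, generate, reflect loop described above can be sketched as follows. This is only an illustrative outline of the interleaved inference procedure; the function names (`reason`, `synthesize`, `reflect`) are hypothetical placeholders for calls into a unified multimodal model, not the paper's actual API:

```python
def reason(prompt):
    """Placeholder for text-based reasoning: plan the image before synthesis."""
    return f"plan for: {prompt}"

def synthesize(plan):
    """Placeholder for image synthesis conditioned on the current text."""
    return f"image({plan})"

def reflect(prompt, image):
    """Placeholder for reflection: critique the image against the prompt."""
    return f"refine details of {image} to better match '{prompt}'"

def interleaved_t2i(prompt, rounds=2):
    """Alternate reasoning and synthesis: an initial think-then-generate step,
    then reflect-and-refine rounds that preserve semantics while improving
    detail, visual quality, and aesthetics."""
    plan = reason(prompt)           # initial text-based reasoning
    image = synthesize(plan)        # first image generated from that plan
    trajectory = [plan, image]      # full thought-to-image trajectory
    for _ in range(rounds - 1):
        critique = reflect(prompt, image)  # textual reflection on the result
        image = synthesize(critique)       # refined image from the critique
        trajectory += [critique, image]
    return image, trajectory
```

The returned trajectory mirrors the "thought-to-image trajectory data" the paper tunes on: each round appends one text step and one image step.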

Takeaways, Limitations

Takeaways:
Presents a novel T2I generation framework (IRG) based on interleaved reasoning and verifies its effectiveness.
Achieves state-of-the-art performance on benchmarks including GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
Improves visual quality and detail fidelity.
Supports reproducibility and follow-up research through the planned release of the code, model weights, and IRGL-300K dataset.
Limitations:
Further research is needed on the generalization ability of the proposed method.
The model may be biased toward generating certain types of images.
Training on the large-scale dataset requires substantial computing resources.