Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Created by
  • Haebom

Authors

Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty

Outline

This paper proposes an efficient method for generating images with embedded text. Existing methods are resource-intensive and difficult to run efficiently on both CPU and GPU platforms. The proposed two-stage pipeline uses reinforcement learning (RL) to generate optimized text layouts quickly, then passes them to a diffusion-based image synthesis model. The RL-based approach significantly accelerates the bounding box prediction step and reduces overlaps between boxes, enabling efficient execution on both CPU and GPU. Compared to TextDiffuser-2, the method significantly reduces execution time and increases flexibility while maintaining or exceeding the quality of text layout and image synthesis. On the MARIOEval benchmark, it achieves OCR and CLIPScore metrics close to those of state-of-the-art models while being 97.64% faster and requiring only 2 MB of memory.
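To make the idea of "reducing overlaps" between predicted text boxes concrete, here is a minimal sketch of how an overlap penalty could be computed for a candidate layout. This is an illustration only: the summary does not describe the paper's actual reward function, and the names `Box`, `intersection_area`, `total_overlap`, and `layout_reward` are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed box format (not taken from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes, or 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def total_overlap(boxes: List[Box]) -> float:
    """Sum of pairwise intersection areas over every pair of boxes."""
    return sum(intersection_area(a, b) for a, b in combinations(boxes, 2))


def layout_reward(boxes: List[Box]) -> float:
    """Hypothetical reward: a layout with less overlap scores higher."""
    return -total_overlap(boxes)
```

An RL agent that places boxes one at a time could receive a signal like `layout_reward` after each placement, so that layouts with non-overlapping text regions are preferred. A real reward would likely include further terms, such as keeping boxes inside the canvas.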

Takeaways, Limitations

Takeaways:
Reinforcement learning significantly improves the speed and efficiency of generating images that contain text.
The method runs efficiently on both CPU and GPU platforms.
It maintains or exceeds the text layout and image synthesis quality of TextDiffuser-2.
It runs with a small memory footprint (2 MB).
It achieves OCR and CLIPScore results close to those of state-of-the-art models on the MARIOEval benchmark.
Limitations:
The paper does not explicitly discuss the limitations of the proposed method.
Performance and stability in real-world applications still need to be validated.
Generalization to varied text styles and complex layouts still needs to be evaluated.
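As a rough aid for reading the "97.64% faster" figure, the sketch below converts a percentage reduction in runtime into a speedup factor. It assumes the figure means a 97.64% reduction in execution time relative to the baseline, which the summary does not state explicitly. The function name `speedup_from_reduction` is hypothetical.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert a runtime reduction (in percent) into a speedup factor.

    Assumes new_time = old_time * (1 - reduction_percent / 100),
    so speedup = old_time / new_time.
    """
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Under the stated assumption, a 97.64% reduction is roughly a 42x speedup.
factor = speedup_from_reduction(97.64)
print(f"{factor:.1f}x")
```

Under a different reading of "faster" (for example, throughput increased by 97.64%), the factor would be about 1.98x instead, so the original paper should be consulted for the exact definition.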