
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Created by
  • Haebom

Authors

Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao

Overview

This paper proposes RealGeneral, a framework built on video models that unifies diverse image generation tasks within a single model. Existing approaches either depend on task-specific datasets and large-scale training or modify a pre-trained image model separately for each task, and both routes generalize poorly. RealGeneral instead uses the temporal correlation modeling of video models to reframe image generation as conditional frame prediction: the condition inputs serve as earlier frames and the image to be generated is the frame to predict. The framework includes a Unified Conditional Embedding module for multi-modal alignment and a Unified Stream DiT block that mitigates cross-modal interference. In the reported experiments, RealGeneral improves subject similarity by 14.5% on customized generation and image quality by 10% on Canny-to-image generation.
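The conditional frame prediction framing above can be illustrated with a small toy sketch. This is not the authors' implementation: every name here (`build_clip`, `denoise_step`, `copy_last_context`) is hypothetical, the frames are plain nested lists rather than latents, and the placeholder predictor stands in for the video model. It only shows the data layout, with clean condition frames followed by a noisy target frame, and only the target frame being updated.

```python
import random


def make_frame(height, width, value):
    """Constant-valued frame, used here as a stand-in for a condition image."""
    return [[value for _ in range(width)] for _ in range(height)]


def build_clip(condition_frames, height, width, seed=0):
    """Pack clean condition frames followed by one noisy target frame.

    Returns (clip, is_target); is_target[i] marks the frames to be generated.
    """
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    clip = list(condition_frames) + [noise]
    is_target = [False] * len(condition_frames) + [True]
    return clip, is_target


def denoise_step(clip, is_target, predict_frame, step_size=0.5):
    """One toy update: target frames move toward the prediction,
    condition frames are passed through unchanged."""
    context = [f for f, t in zip(clip, is_target) if not t]
    new_clip = []
    for frame, target in zip(clip, is_target):
        if not target:
            new_clip.append(frame)
            continue
        pred = predict_frame(context, frame)
        new_clip.append([
            [(1 - step_size) * x + step_size * p for x, p in zip(row, prow)]
            for row, prow in zip(frame, pred)
        ])
    return new_clip


def copy_last_context(context, noisy_frame):
    """Placeholder predictor (assumption): a real system would call a video model."""
    return context[-1]


condition = make_frame(2, 2, 1.0)
clip, is_target = build_clip([condition], height=2, width=2)
clip = denoise_step(clip, is_target, copy_last_context, step_size=1.0)
```

In this layout the task is selected by what is placed in the condition frames (for example a Canny edge map or a subject reference image), which loosely parallels how prompts act as context in LLM in-context learning.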

Takeaways, Limitations

Takeaways:
The paper presents an approach that unifies various image generation tasks using a video model.
It treats image generation in a way similar to in-context learning in LLMs, with the condition images serving as context.
It reports gains over existing models on customized generation (14.5% higher subject similarity) and Canny-to-image generation (10% higher image quality).
The Unified Conditional Embedding module aligns multiple modalities, and the Unified Stream DiT block mitigates interference between them.
Limitations:
Further validation of the generalization performance of the proposed model is needed.
The possibility of overfitting for specific tasks cannot be ruled out.
Because it is based on a video model, the availability of video data may affect performance.
There is a lack of analysis of the model's complexity and computational cost.