Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Created by
  • Haebom

Author

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu

Outline

TokenFlow is a novel, unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Previous research attempted to integrate these two tasks with a single vector quantization (VQ) encoder trained on a reconstruction objective. However, understanding and generation require fundamentally different granularities of visual information, and this mismatch creates a significant trade-off that degrades performance, especially on multimodal understanding tasks. TokenFlow addresses this challenge with a dual-codebook architecture that decouples semantic and pixel-level feature learning while keeping the two aligned through a shared mapping mechanism. This design provides direct access, via a single shared index, both to the high-level semantic representations crucial for understanding and to the fine-grained visual features essential for generation. Extensive experiments demonstrate TokenFlow's superiority across multiple dimensions: it is the first to outperform LLaVA-1.5 13B on understanding tasks with discrete visual inputs, with an average improvement of 7.2%; it achieves a strong reconstruction FID of 0.63 at 384×384 resolution; and it reaches state-of-the-art autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, comparable to SDXL.
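The shared-index idea can be sketched in code. The following is a minimal, illustrative sketch (not the paper's actual implementation; codebook sizes, dimensions, and the weighted-distance rule are assumptions for illustration): two codebooks share one index space, so quantizing an encoder output to a single discrete token simultaneously selects a semantic entry and a pixel-level entry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper)
K = 512                 # shared codebook size
d_sem, d_pix = 32, 8    # semantic / pixel embedding dimensions

# Two codebooks that share a single index space
semantic_codebook = rng.normal(size=(K, d_sem))
pixel_codebook = rng.normal(size=(K, d_pix))

def quantize(z_sem, z_pix, w_sem=1.0, w_pix=1.0):
    """Pick the one shared index that minimizes a weighted sum of
    distances to both codebooks, so the same token indexes a
    semantic entry and a pixel-level entry."""
    d_s = ((semantic_codebook - z_sem) ** 2).sum(axis=1)
    d_p = ((pixel_codebook - z_pix) ** 2).sum(axis=1)
    idx = int(np.argmin(w_sem * d_s + w_pix * d_p))
    # One index -> both feature types
    return idx, semantic_codebook[idx], pixel_codebook[idx]

# Example: quantize one spatial position's (hypothetical) encoder outputs
z_sem = rng.normal(size=d_sem)
z_pix = rng.normal(size=d_pix)
idx, e_sem, e_pix = quantize(z_sem, z_pix)
```

An understanding head would consume `e_sem` while a pixel decoder consumes `e_pix`, both addressed by the same discrete token `idx`.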

Takeaways, Limitations

Takeaways:
Presents a novel architecture for a unified image tokenizer serving both multimodal understanding and generation.
The dual-codebook architecture effectively supports semantic understanding and fine-grained image generation at the same time.
Surpasses the previous best-performing model (LLaVA-1.5 13B) on understanding with discrete visual inputs, with an average improvement of 7.2%.
Achieves strong image reconstruction (FID 0.63 at 384×384) and autoregressive image generation (GenEval 0.55 at 256×256).
Limitations:
The paper does not explicitly discuss its limitations or future research directions.
Details on dependencies on specific datasets and on hardware/compute requirements are limited.