Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Created by
  • Haebom

Authors

Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan

Outline

TokLIP is a visual tokenizer that "semantizes" vector-quantized (VQ) tokens by infusing them with CLIP-level semantics, addressing the high training overhead and weak comprehension performance that stem from the lack of high-level semantics in standard VQ tokens. It enables end-to-end multimodal autoregressive training while reusing existing VQ tokens: a low-level discrete VQ tokenizer is combined with a ViT-based token encoder that captures high-level continuous semantics. Unlike prior methods that discretize high-level features (e.g., VILA-U), TokLIP decouples the training objectives for comprehension and generation, so advanced VQ tokenizers can be applied directly without custom quantization operations. Experiments show that TokLIP achieves strong data efficiency, equipping visual tokens with high-level semantic understanding while also improving low-level generative capability, which makes it well suited to autoregressive transformers for both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
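The dual-branch design described above can be pictured with a minimal PyTorch sketch. Everything below is illustrative: the class names, layer sizes, and mean-pooling are assumptions for exposition, not the paper's implementation (see the linked repository for that). The sketch shows only the data flow: a frozen VQ tokenizer yields discrete ids for autoregressive generation, while a ViT-style token encoder turns the same token embeddings into CLIP-level continuous features for comprehension.

import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    """Stand-in for an off-the-shelf VQ tokenizer (frozen and reused as-is)."""
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 256px image -> 16x16 token grid

    @torch.no_grad()
    def forward(self, images):
        feats = self.encoder(images).flatten(2).transpose(1, 2)        # (B, L, D) patch features
        dists = torch.cdist(feats, self.codebook.weight.unsqueeze(0))  # distance to every codebook entry
        ids = dists.argmin(dim=-1)                                     # (B, L) discrete token ids
        return ids, self.codebook(ids)                                 # ids feed the AR generator

class TokLIPSketch(nn.Module):
    """Semantic branch: a ViT-style encoder that 'semantizes' VQ token embeddings."""
    def __init__(self, vq, dim=256, clip_dim=512, depth=4, heads=8):
        super().__init__()
        self.vq = vq
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.token_encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, clip_dim)  # project into a CLIP-level semantic space

    def forward(self, images):
        ids, embeds = self.vq(images)     # low-level discrete tokens, kept for generation
        sem = self.token_encoder(embeds)  # high-level continuous semantics
        sem = self.proj(sem.mean(dim=1))  # pooled image-level feature for comprehension
        return ids, sem

model = TokLIPSketch(ToyVQTokenizer())
ids, sem = model(torch.randn(2, 3, 256, 256))
print(ids.shape, sem.shape)  # torch.Size([2, 256]) torch.Size([2, 512])

Decoupling, in this picture, means the ids branch can be trained with a generation objective and the sem branch with a comprehension objective, without forcing the high-level features through a quantizer.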

Takeaways, Limitations

Takeaways:
  • Overcomes the limitations of existing token-based multimodal models by incorporating high-level semantics.
  • Achieves strong data efficiency while simultaneously improving low-level generation and high-level semantic understanding.
  • Enables end-to-end multimodal autoregressive training by directly reusing existing VQ tokenizers (a hedged sketch of the decoupled losses follows this list).
  • Applies effectively to autoregressive transformers for both comprehension and generation tasks.
Limitations:
  • The paper does not explicitly state its limitations; further experiments and comparative studies would be needed to identify them.
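As a complement to the takeaways, the decoupled objectives can be sketched as two independent losses: a CLIP-style symmetric contrastive loss on the pooled semantic features (comprehension) and a next-token cross-entropy over the discrete VQ ids (generation). The function names, the unweighted sum, and the temperature below are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled image semantics and text features (CLIP-style)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def next_token_loss(logits, vq_ids):
    """Autoregressive cross-entropy over discrete VQ ids, shifted by one position."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           vq_ids[:, 1:].reshape(-1))

# Illustrative call with random tensors standing in for model outputs.
B, L, V, D = 2, 256, 8192, 512
total = contrastive_loss(torch.randn(B, D), torch.randn(B, D)) \
      + next_token_loss(torch.randn(B, L, V), torch.randint(V, (B, L)))
print(total.item())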