TokLIP is a visual tokenizer that semanticizes vector-quantized (VQ) tokens by incorporating CLIP-level semantics, addressing the high training cost and weak comprehension performance that stem from the lack of high-level semantics in standard VQ tokens. It enables end-to-end multimodal autoregressive training while reusing existing VQ tokens. Architecturally, TokLIP couples a low-level discrete VQ tokenizer with a ViT-based token encoder that captures high-level continuous semantics. Unlike prior approaches that discretize high-level features (e.g., VILA-U), TokLIP disentangles the training objectives for comprehension and generation, so advanced VQ tokenizers can be applied directly without tailored quantization operations. Experiments show that TokLIP is highly data-efficient, equipping visual tokens with high-level semantic understanding while enhancing low-level generative capability, making it well suited to autoregressive Transformers for both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP .
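To make the architecture concrete, below is a minimal PyTorch sketch of the core idea: discrete VQ token ids go in, and a ViT-style encoder lifts them into continuous, CLIP-level semantic features, while the same discrete ids remain available for autoregressive generation. All class names, module choices, and dimensions here are illustrative assumptions, not the official TokLIP implementation.

```python
import torch
import torch.nn as nn

class ToyTokLIP(nn.Module):
    """Hypothetical sketch of the TokLIP idea: semanticize discrete VQ
    tokens into continuous CLIP-level features. Sizes are illustrative."""

    def __init__(self, codebook_size=8192, embed_dim=512, num_layers=6, num_heads=8):
        super().__init__()
        # Embedding table for the low-level discrete VQ codebook ids.
        self.code_embed = nn.Embedding(codebook_size, embed_dim)
        # ViT-style token encoder: contextualizes discrete codes into
        # high-level continuous semantics.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Projection head; in TokLIP-like training this would be aligned
        # to a CLIP text tower with a contrastive objective.
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, vq_ids):
        # vq_ids: (batch, num_tokens) integer codes from a VQ tokenizer
        x = self.code_embed(vq_ids)      # discrete ids -> embeddings
        x = self.encoder(x)              # per-token continuous semantics
        sem = self.proj(x.mean(dim=1))   # pooled global semantic feature
        return x, sem

# Usage: the discrete ids still drive next-token generation, while `sem`
# supports comprehension objectives, reflecting the disentangled training.
ids = torch.randint(0, 8192, (2, 256))   # e.g. a 16x16 grid of VQ codes
tokens, semantics = ToyTokLIP()(ids)
print(tokens.shape, semantics.shape)     # (2, 256, 512) and (2, 512)
```

The key design point this sketch mirrors is that quantization stays purely low-level: the semantic encoder operates on top of the VQ ids rather than discretizing high-level features, which is why an off-the-shelf VQ tokenizer can be dropped in without custom quantization operations.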