TokenFlow is a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior work has attempted to unify these two tasks with a single reconstruction-targeted vector quantization (VQ) encoder. However, we observe that understanding and generation require fundamentally different granularities of visual information, which introduces a critical trade-off and degrades performance, particularly on multimodal understanding tasks. TokenFlow addresses this challenge with an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment through a shared mapping mechanism. This design provides direct access, via a shared index, to both the high-dimensional semantic representations crucial for understanding and the fine-grained visual features essential for generation. Extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving an average improvement of 7.2%. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Additionally, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, comparable to SDXL.
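The dual-codebook idea can be illustrated with a minimal sketch (names, dimensions, and the weighted combination of the two distance terms are illustrative assumptions, not the paper's exact formulation): a semantic codebook and a pixel-level codebook are queried jointly, and a single shared index selects the aligned entry in each, so understanding and generation branches read from the same discrete token.

```python
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    """Sketch of a dual-codebook quantizer with a shared index.

    One codebook stores high-dimensional semantic entries, the other
    low-dimensional pixel-level entries; a single index, chosen by a
    joint distance over both spaces, retrieves both so the two feature
    streams stay aligned.
    """

    def __init__(self, num_codes=8192, sem_dim=768, pix_dim=8,
                 w_sem=1.0, w_pix=1.0):
        super().__init__()
        self.sem_codebook = nn.Embedding(num_codes, sem_dim)
        self.pix_codebook = nn.Embedding(num_codes, pix_dim)
        self.w_sem, self.w_pix = w_sem, w_pix

    def forward(self, z_sem, z_pix):
        # z_sem: (N, sem_dim) features from a semantic encoder
        # z_pix: (N, pix_dim) features from a pixel-level encoder
        d_sem = torch.cdist(z_sem, self.sem_codebook.weight)  # (N, num_codes)
        d_pix = torch.cdist(z_pix, self.pix_codebook.weight)  # (N, num_codes)
        # Shared index: jointly minimize the (weighted) distance in both spaces.
        idx = torch.argmin(self.w_sem * d_sem + self.w_pix * d_pix, dim=1)
        q_sem = self.sem_codebook(idx)  # fed to the understanding branch
        q_pix = self.pix_codebook(idx)  # fed to the generation / pixel decoder
        return idx, q_sem, q_pix
```

Because both branches are addressed by the same index, the tokenizer exposes a single discrete token stream while still serving granularity-appropriate features to each task.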