Token-based video representation is a promising approach for enabling large language models (LLMs) to interpret video content. Existing token reduction techniques (e.g., pruning and merging) tend to disrupt essential positional embeddings and rely on continuous visual tokens sampled from adjacent pixels with similar spatio-temporal positions. In this paper, we introduce a novel challenge, Extreme Short Token Reduction, which aims to represent a complete video with a minimal set of discrete tokens. To this end, we propose VQToken, a neural discrete token representation framework that learns a compact codebook by applying adaptive vector quantization to continuous ViT embeddings and preserves spatio-temporal positions via a token hash function. VQToken compresses sequences to 0.07% of their original length while incurring only a 0.66% accuracy drop on the NextQA-MC benchmark, and it achieves comparable performance on ActNet-QA, Long Video Benchmark, and VideoMME. By introducing the Token Information Density (TokDense) metric and formulating fixed-length and adaptive-length subtasks, we achieve state-of-the-art results in both settings. Our approach dramatically reduces theoretical complexity, increases information density, substantially cuts the token count, and enables efficient video LLMs in resource-constrained environments.
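To make the core idea concrete, the sketch below illustrates the kind of vector quantization the abstract describes: continuous ViT token embeddings are clustered into a small discrete codebook, so a video is summarized by a few code vectors plus per-token indices. The names `vit_tokens` and `codebook_size` and the k-means-style routine are assumptions for illustration only, not the paper's VQToken implementation (which additionally uses an adaptive codebook and a spatio-temporal token hash).

```python
import torch

def vq_compress(vit_tokens: torch.Tensor, codebook_size: int = 256, iters: int = 10):
    """Illustrative sketch: quantize continuous ViT token embeddings of shape (N, D)
    into a small discrete codebook via k-means-style vector quantization.
    This is a hypothetical stand-in, not the paper's VQToken method."""
    N, D = vit_tokens.shape
    # Initialize the codebook from randomly chosen token embeddings.
    codebook = vit_tokens[torch.randperm(N)[:codebook_size]].clone()
    for _ in range(iters):
        # Assign each token to its nearest codebook entry.
        dists = torch.cdist(vit_tokens, codebook)   # (N, K) pairwise distances
        assign = dists.argmin(dim=1)                # (N,) code indices
        # Update each codebook entry as the mean of its assigned tokens.
        for k in range(codebook.shape[0]):
            members = vit_tokens[assign == k]
            if members.numel() > 0:
                codebook[k] = members.mean(dim=0)
    # The video is now represented by `codebook` plus per-token indices; a hash of
    # spatio-temporal positions could map indices back to their locations.
    return codebook, assign
```

In this toy setting, a clip of tens of thousands of ViT tokens collapses to a few hundred code vectors, which is the spirit of the extreme reduction ratios reported above.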