Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

Created by
  • Haebom

Author

Haichao Zhang, Yun Fu

Outline

This paper addresses token-based video representation, a promising approach for enabling large language models (LLMs) to interpret video content. Existing token reduction techniques (e.g., pruning and merging) tend to disrupt essential positional embeddings and rely on continuous visual tokens sampled from adjacent pixels with similar spatio-temporal positions. In this paper, we introduce a new task, Extreme Short Token Reduction, which aims to represent an entire video with a minimal set of discrete tokens. To this end, we propose VQToken, a neural discrete token representation framework that learns a compact codebook by applying adaptive vector quantization to continuous ViT embeddings and preserves spatio-temporal positions via a token hash function. VQToken compresses sequences to 0.07% of their original length while incurring only a 0.66% accuracy drop on the NextQA-MC benchmark, and it achieves comparable performance on ActNet-QA, Long Video Benchmark, and VideoMME. By introducing the Token Information Density (TokDense) metric and formulating fixed-length and adaptive-length subtasks, we achieve state-of-the-art results in both settings. This approach dramatically reduces theoretical complexity, increases information density, significantly reduces the number of tokens, and enables efficient video LLMs in resource-constrained environments.
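The core mechanism, vector quantization of continuous ViT token embeddings into a small discrete codebook, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, codebook size, embedding dimension, and the straight-through trick are assumptions chosen for illustration, and the position-preserving token hash function is only hinted at in the comments.

```python
import torch
import torch.nn as nn

class SimpleVectorQuantizer(nn.Module):
    """Toy vector quantizer: maps each continuous ViT token embedding to the
    nearest entry of a small learnable codebook (illustrative, not VQToken)."""

    def __init__(self, num_codes: int = 256, dim: int = 768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim) continuous ViT embeddings of a video
        b, n, d = tokens.shape
        flat = tokens.reshape(-1, d)                      # (b*n, d)
        dists = torch.cdist(flat, self.codebook.weight)   # L2 distance to each code
        indices = dists.argmin(dim=-1)                    # discrete token ids
        quantized = self.codebook(indices).view(b, n, d)
        # straight-through estimator so gradients still reach the encoder
        quantized = tokens + (quantized - tokens).detach()
        return quantized, indices.view(b, n)

# Illustration: 16 frames x 196 patches = 3136 continuous tokens -> discrete ids.
# The extreme reduction comes from keeping only the distinct code ids (plus a
# hash of their spatio-temporal positions) instead of all 3136 embeddings.
video_tokens = torch.randn(1, 16 * 196, 768)
_, ids = SimpleVectorQuantizer()(video_tokens)
compact_ids = torch.unique(ids)
```

In the actual framework the codebook is learned adaptively and a token hash function records where in space and time each code occurred; the sketch above only shows the quantization step that turns continuous embeddings into a small set of discrete tokens.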

Takeaways, Limitations

Takeaways:
We significantly improve the efficiency of video LLMs by introducing the new task of extreme short token reduction and proposing the VQToken framework.
We achieve far more aggressive compression (to 0.07% of the original token sequence) than existing methods while keeping performance degradation minimal (see the short arithmetic sketch after this list).
We introduce the Token Information Density (TokDense) metric, providing a new criterion for quantitatively evaluating the efficiency of video token representations.
We demonstrate the feasibility of efficient video LLMs in resource-constrained environments.
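To make the headline numbers concrete, the short sketch below works through the compression arithmetic. The density ratio at the end is only an illustrative stand-in (accuracy retained per token kept); the paper's exact TokDense formula is not reproduced here, and the 16 × 196 token count is an assumed example.

```python
# Illustrative arithmetic only; the paper's TokDense definition may differ.
original_tokens = 16 * 196        # assumed example: 16 frames x 196 ViT patches
kept_fraction = 0.0007            # 0.07% of the original sequence length
kept_tokens = max(1, round(original_tokens * kept_fraction))   # ~2 tokens
accuracy_drop = 0.0066            # 0.66% drop reported on NextQA-MC

# Toy "information density": accuracy retained per token kept, relative to
# using all tokens (NOT the paper's exact TokDense metric).
baseline_density = 1.0 / original_tokens
compressed_density = (1.0 - accuracy_drop) / kept_tokens
print(f"kept {kept_tokens} of {original_tokens} tokens; "
      f"density gain ≈ {compressed_density / baseline_density:.0f}x")
```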
Limitations:
The reported performance may be specific to the evaluated benchmark datasets; additional experiments on more diverse types of video data are needed.
Further analysis is needed on how the codebook size and the design of the token hash function affect VQToken's performance.
Extreme token reduction may cause some information loss; this calls for quantitative analysis and research into mitigation strategies.