Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Dense Video Understanding with Gated Residual Tokenization

Created by
  • Haebom

Author

Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu

Outline

This paper addresses Dense Video Understanding (DVU), i.e., high-frame-rate video understanding. Existing Video Large Language Models (VLLMs) rely on low-frame-rate sampling and therefore cannot exploit dense temporal information. The proposed two-stage framework, Gated Residual Tokenization (GRT), reduces tokenization time and token overhead to make high-FPS understanding practical. GRT consists of motion-compensated inter-gated tokenization, which skips static regions, and semantic-scene intra-tokenization merging, which merges redundant tokens within a scene, yielding sub-linear growth in token count and computation. The paper also proposes DIVE (Dense Information Video Evaluation), a new benchmark for dense temporal reasoning. Experiments show that GRT outperforms larger VLLM baselines and that its performance improves as FPS increases.
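The sketch below illustrates the two GRT stages described above in simplified form: an inter-frame gate that drops patch tokens with little residual motion, followed by an intra-scene merge of semantically similar tokens. All function names, thresholds, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two GRT stages; names and thresholds are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def inter_gated_tokenize(prev_tokens, curr_tokens, motion_thresh=0.1):
    """Stage 1 (inter-frame gating): keep only patch tokens whose residual
    against the previous frame exceeds a motion threshold. Static patches
    are skipped, so token count grows sub-linearly with frame rate."""
    residual = (curr_tokens - prev_tokens).norm(dim=-1)        # (num_patches,)
    gate = residual > motion_thresh * prev_tokens.norm(dim=-1)
    return curr_tokens[gate], gate

def intra_scene_merge(tokens, sim_thresh=0.9):
    """Stage 2 (intra-scene merging): greedily merge consecutive tokens whose
    cosine similarity exceeds a threshold, averaging merged tokens."""
    merged = []
    for tok in tokens:
        if merged and F.cosine_similarity(tok, merged[-1], dim=0) > sim_thresh:
            merged[-1] = (merged[-1] + tok) / 2                 # fold into previous token
        else:
            merged.append(tok.clone())
    return torch.stack(merged) if merged else tokens[:0]

# Toy usage: two consecutive frames, each already patch-tokenized to (64, 256).
prev = torch.randn(64, 256)
curr = prev.clone()
curr[:8] += torch.randn(8, 256)          # only 8 patches actually change
kept, gate = inter_gated_tokenize(prev, curr)
compact = intra_scene_merge(kept)
print(f"patches kept: {gate.sum().item()}/64, after merging: {compact.shape[0]}")
```

In this toy example only the changed patches survive the gate, which is the mechanism the paper credits for keeping token growth sub-linear as FPS increases.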

Takeaways, Limitations

Takeaways:
DVU and GRT are presented as an efficient approach to high-frame-rate video understanding.
Highlights the importance of dense temporal information for video understanding.
Introduces DIVE, a new benchmark for dense temporal reasoning over high-frame-rate video.
Experimentally demonstrates that GRT's performance improves as FPS increases.
Limitations:
GRT's performance improvements may be limited to the specific benchmark (DIVE); validation of its generalization on other types of video datasets is needed.
Because the DIVE benchmark is newly proposed, comparative analysis against existing benchmarks may be lacking.
Detailed analysis of GRT's computational complexity and memory usage may be lacking.
Generalization to various types of high-frame-rate videos may not be sufficiently validated.