Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Created by
  • Haebom

Author

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

Outline

In this paper, we propose a training-free spatio-temporal token merging (STTM) method to address the computational cost of video large language models (LLMs). Exploiting the local spatial and temporal redundancy of video data, which prior work has overlooked, we first perform coarse-to-fine multi-granular spatial token merging over a quadtree structure, and then directed pairwise merging along the temporal dimension. This decomposed merging approach outperforms existing token reduction methods on six video question-answering benchmarks, achieving a 2x speed-up with only a 0.5% accuracy drop when 50% of the tokens are retained, and a 3x speed-up with only about a 2% accuracy drop when 30% are retained. Moreover, because the merging is query-agnostic, STTM allows the KV cache to be reused across different questions about the same video.
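To make the two merging stages concrete, here is a minimal sketch of the decomposed procedure, written from this summary rather than the authors' code. It assumes a quadtree over each frame's square token grid and cosine similarity as the redundancy measure; the thresholds, the mean-pooling merge rule, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def quadtree_merge(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Multi-granular spatial merging over one frame's (S, S, D) token grid.

    If every token in a square block is close to the block mean, the whole
    block collapses into a single coarse token; otherwise the block splits
    into four quadrants and recursion continues. S must be a power of two.
    """
    size = tokens.shape[0]
    flat = tokens.reshape(-1, tokens.shape[-1])
    mean = flat.mean(dim=0, keepdim=True)
    sims = F.cosine_similarity(flat, mean.expand_as(flat), dim=-1)
    if size == 1 or sims.min() >= threshold:
        return mean  # (1, D): one merged token for the whole block
    half = size // 2
    quads = [tokens[r:r + half, c:c + half] for r in (0, half) for c in (0, half)]
    return torch.cat([quadtree_merge(q, threshold) for q in quads], dim=0)


def temporal_merge(prev: torch.Tensor, curr: torch.Tensor,
                   threshold: float = 0.95) -> torch.Tensor:
    """Directed pairwise merging across time.

    Each current-frame token is compared against the previous (anchor)
    frame; tokens whose best match exceeds the threshold are treated as
    redundant and absorbed, so only sufficiently novel tokens survive.
    """
    sims = F.cosine_similarity(curr.unsqueeze(1), prev.unsqueeze(0), dim=-1)
    best, _ = sims.max(dim=1)          # best anchor match per current token
    return curr[best < threshold]      # keep only sufficiently novel tokens


# Toy usage: a 16x16 grid of 64-dim tokens, second frame nearly identical.
frame0 = torch.randn(16, 16, 64)
frame1 = frame0 + 0.01 * torch.randn_like(frame0)
anchors = quadtree_merge(frame0)                        # spatially merged tokens
survivors = temporal_merge(anchors, quadtree_merge(frame1))
print(anchors.shape, survivors.shape)                   # second frame shrinks sharply
```

The spatial pass runs first so that the temporal pass compares already-coarsened tokens; since neither pass looks at the question, the resulting token set depends only on the video.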

Takeaways, Limitations

Takeaways:
We present a novel training-free token merging method that effectively addresses the computational cost of video LLMs.
It achieves better speed-accuracy trade-offs than existing token reduction methods.
Query-agnostic merging enables KV cache reuse across questions on the same video, providing additional efficiency gains (see the sketch after this list).
Limitations:
The reported gains may be limited to the specific video question-answering benchmarks evaluated.
Generalization to other types of video data and to other LLM architectures remains to be verified.
Further research is needed to optimize the quadtree structure and the pairwise merging strategy.
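Because merging depends only on the video and not on the question, the key/value projections of the merged video tokens can be computed once and reused for every question. The toy single-head attention layer below illustrates that reuse pattern; it is not the paper's implementation, and all names, shapes, and the omitted causal masking are illustrative assumptions.

```python
import torch


class ToyAttention(torch.nn.Module):
    """Single-head attention that accepts an external video KV cache."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.q = torch.nn.Linear(d, d)
        self.k = torch.nn.Linear(d, d)
        self.v = torch.nn.Linear(d, d)

    def prefill(self, video_tokens: torch.Tensor):
        # Keys/values of the merged video tokens are question-independent,
        # so this runs once per video.
        return self.k(video_tokens), self.v(video_tokens)

    def forward(self, question_tokens: torch.Tensor, kv_cache):
        k_vid, v_vid = kv_cache
        k = torch.cat([k_vid, self.k(question_tokens)], dim=0)
        v = torch.cat([v_vid, self.v(question_tokens)], dim=0)
        q = self.q(question_tokens)
        # Causal masking within the question is omitted for brevity.
        attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v


layer = ToyAttention()
video_tokens = torch.randn(128, 64)     # merged, query-agnostic video tokens
cache = layer.prefill(video_tokens)     # computed once per video
for _ in range(3):                      # reused across three questions
    question = torch.randn(10, 64)
    _ = layer(question, cache)
```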