Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Created by
  • Haebom

Author

Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

Outline

This paper proposes SpecVLM, a training-free speculative decoding (SD) framework for efficient decoding of video large language models (Vid-LLMs). While Vid-LLMs show strong performance in video content understanding, their dense video token representations incur substantial memory and computational overhead. SpecVLM speeds up decoding through staged video token pruning while minimizing information loss: the authors find that the draft model's speculation accuracy is insensitive to video token pruning, so up to 90% of video tokens can be pruned while accuracy is maintained. Pruning proceeds in two stages: the first stage selects information-rich tokens guided by the attention signals of the target (verifier) model, and the second stage prunes the remaining redundant tokens in a spatially uniform manner. Experiments demonstrate decoding speedups of up to 2.68x on LLaVA-OneVision-72B and up to 2.11x on Qwen2.5-VL-32B.
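The two-stage pruning idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the split between the stages, and the use of a flat per-token attention score are all assumptions, and uniform index striding stands in for the paper's spatially uniform pruning.

```python
import numpy as np

def prune_video_tokens(tokens, attention, keep_ratio=0.1, stage1_ratio=0.05):
    """Two-stage video token pruning (hypothetical sketch, not SpecVLM's code).

    tokens:    (N, D) array of video token embeddings
    attention: (N,) per-token attention scores from the target (verifier) model
    keep_ratio: fraction of tokens kept overall (0.1 ~ "prune up to 90%")
    """
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))

    # Stage 1: keep the most information-rich tokens, ranked by the
    # target model's attention signal.
    n_stage1 = max(1, int(n * stage1_ratio))
    top_idx = np.argsort(attention)[-n_stage1:]

    # Stage 2: from the remaining tokens, keep a uniformly strided subset
    # (a stand-in for spatially uniform pruning of redundant tokens).
    rest = np.setdiff1d(np.arange(n), top_idx)
    n_stage2 = max(0, n_keep - n_stage1)
    stride = max(1, len(rest) // max(1, n_stage2))
    uniform_idx = rest[::stride][:n_stage2]

    keep = np.sort(np.concatenate([top_idx, uniform_idx]))
    return tokens[keep], keep
```

The draft model would then speculate on this pruned token sequence, while the unpruned target model verifies the speculated tokens as in standard speculative decoding.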

Takeaways, Limitations

Takeaways:
We present an efficient training-free speculative decoding framework that dramatically improves the decoding speed of Vid-LLMs.
Video token pruning can save memory and computational resources.
It works effectively even on large models such as LLaVA-OneVision-72B and Qwen2.5-VL-32B.
The publicly released code improves reproducibility and usability.
Limitations:
The effectiveness of the proposed method may be limited to specific Vid-LLM models and video understanding benchmarks.
The optimal pruning strategy may vary depending on the model and dataset.
Experiments with more diverse video datasets and models are needed.
Further analysis of potential accuracy degradation from speculative decoding is needed.