Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Created by
  • Haebom

Authors

Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, Mengye Ren

Outline

This paper addresses the heavy memory and compute cost that multimodal large language models (MLLMs) incur when processing long videos. Existing long-context MLLMs must store and attend over a key-value (KV) cache that grows with the length of the visual context. Existing visual compression methods either need to encode the entire visual context before compressing or need access to the question in advance, which makes them impractical. To address this, the paper proposes StreamMem, a query-agnostic KV cache memory mechanism that encodes new video frames in a streaming manner and compresses the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory. This enables efficient question answering (QA) in memory-constrained long-video scenarios. Evaluations on three long-video understanding benchmarks and two streaming video QA benchmarks show that StreamMem achieves state-of-the-art performance among query-agnostic KV cache compression methods and is competitive with query-aware compression methods.
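The mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the query tokens are built, which layers are compressed, or whether tokens are merged rather than dropped. The sketch assumes plain top-k eviction, and the names `compress`, `stream`, `proxy_queries`, and `budget` are hypothetical. Each cached visual token is scored by the average attention it receives from a small set of question-independent query vectors, and only the highest-scoring tokens are kept, so the memory never exceeds a fixed size.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress(memory, proxy_queries, budget):
    """Keep the `budget` (key, value) pairs that get the most proxy attention.

    memory: list of (key, value) pairs, oldest first.
    proxy_queries: question-independent query vectors (an assumption of this
        sketch; the summary does not specify how such tokens are chosen).
    """
    if len(memory) <= budget:
        return memory
    dim = len(memory[0][0])
    scores = [0.0] * len(memory)
    for q in proxy_queries:
        # Scaled dot-product attention of one proxy query over all cached keys.
        logits = [dot(q, key) / math.sqrt(dim) for key, _ in memory]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight / len(proxy_queries)
    # Select the top-scoring indices, then restore temporal order.
    top = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    keep = sorted(top[:budget])
    return [memory[i] for i in keep]


def stream(frames, proxy_queries, budget):
    """Process frames one at a time; memory size never exceeds `budget`
    after each step, regardless of video length."""
    memory = []
    for frame_tokens in frames:
        memory.extend(frame_tokens)  # append this frame's (key, value) pairs
        memory = compress(memory, proxy_queries, budget)
    return memory
```

Because the scores never depend on the user's question, the compressed memory can be built before any question arrives and reused across several questions, which is the property the summary calls query-agnostic.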

Takeaways, Limitations

Takeaways:
Provides an efficient solution to memory-constrained long-video understanding through a query-agnostic KV cache compression mechanism.
Suggests that real-time or near-real-time long-video understanding applications are feasible through streaming video processing.
Shows competitive performance compared with query-aware compression methods.
Achieves state-of-the-art performance among query-agnostic methods on long-video QA and streaming video QA benchmarks.
Limitations:
The evaluation is limited to specific benchmarks (three long-video understanding and two streaming video QA benchmarks); generalization to other types of long-video datasets requires further research.
Compression may discard information; the extent and impact of this loss need further analysis.
The design and selection of the generic query tokens are not explained in detail; further research is needed to determine the optimal query token design.