This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
This paper addresses the limitations of multimodal large language models (MLLMs) in efficiently processing and understanding long videos. Existing long-context MLLMs incur substantial memory and computational overhead when storing and attending over key-value (KV) caches for long visual contexts. Meanwhile, existing visual compression methods either require encoding the entire visual context before compressing or require access to the questions in advance, making them impractical in streaming settings. To address this, the authors propose StreamMem, a query-agnostic KV cache memory mechanism that encodes new video frames in a streaming fashion and compresses the KV cache using attention scores between visual tokens and generic question tokens, maintaining a fixed-size KV memory that enables efficient question answering (QA) in memory-constrained long-video scenarios. Evaluations on three long video understanding benchmarks and two streaming video QA benchmarks show that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression methods.
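The mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compress_kv_cache`, the array shapes, and the use of random numbers as stand-ins for attention scores are all assumptions made for the sketch. It shows only the core idea of keeping a fixed-size KV memory by retaining the visual tokens that receive the highest attention from proxy (generic) question tokens.

```python
import numpy as np

def compress_kv_cache(keys, values, attn_scores, budget):
    """Keep only the top-`budget` cached KV entries by attention score.

    keys, values: (n_tokens, d) arrays for cached visual tokens.
    attn_scores: (n_tokens,) attention mass each cached visual token
        receives from generic question tokens (stand-in input here;
        StreamMem derives such scores inside the model).
    budget: fixed KV memory size.
    """
    if len(keys) <= budget:
        return keys, values
    # Indices of the highest-scoring tokens, kept in original temporal order.
    top = np.sort(np.argsort(attn_scores)[-budget:])
    return keys[top], values[top]

# Streaming loop sketch: encode each incoming frame, append its KV pairs,
# then re-compress so memory never exceeds the fixed budget.
rng = np.random.default_rng(0)
d, tokens_per_frame, budget = 8, 16, 32
mem_k = np.empty((0, d))
mem_v = np.empty((0, d))
for frame in range(5):
    new_k = rng.normal(size=(tokens_per_frame, d))  # stand-in frame keys
    new_v = rng.normal(size=(tokens_per_frame, d))  # stand-in frame values
    mem_k = np.concatenate([mem_k, new_k])
    mem_v = np.concatenate([mem_v, new_v])
    scores = rng.random(len(mem_k))  # stand-in for real attention scores
    mem_k, mem_v = compress_kv_cache(mem_k, mem_v, scores, budget)

print(mem_k.shape)  # memory stays at the budget: (32, 8)
```

Because the scores come from generic question tokens rather than the user's actual question, compression can run during streaming, before any question is known, which is the query-agnostic property the paper emphasizes.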
Takeaways, Limitations
•
Takeaways:
◦
StreamMem offers an efficient solution to memory-constrained long video understanding via a query-agnostic KV cache compression mechanism.
◦
Streaming video processing opens the door to real-time or near-real-time long video understanding applications.
◦
Despite being query-agnostic, it achieves performance competitive with query-aware compression methods.
◦
Achieves state-of-the-art performance on long video QA and streaming video QA benchmarks.
•
Limitations:
◦
Evaluation of StreamMem is limited to specific benchmarks; its generalization to other types of long video datasets requires further study.
◦
Information may be lost during compression, and the extent and impact of this loss need further analysis.
◦
The design and selection of the generic question tokens are not explained in detail; further research is needed to determine the optimal query token design.