Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Created by
  • Haebom

Author

Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu

Outline

This paper presents Balanced-VLLM (B-VLLM), a framework for long video understanding with Vision Large Language Models (VLLMs). Existing VLLMs typically either downsample frames or reduce the number of visual tokens per frame, losing temporal or spatial information respectively. B-VLLM addresses this with a text-conditioned adaptive frame selection module, a temporal frame token merging technique, a spatial token sampling module, and a merging strategy, so that task-relevant spatio-temporal cues are exploited while the number of visual tokens stays within the VLLM's context-window length. Experimental results show that B-VLLM achieves superior performance on various video understanding benchmarks.
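
The sketch below illustrates the general idea described above (select frames conditioned on the text query, then merge visual tokens to fit a fixed budget); it is not the authors' implementation, and the function names, shapes, and pooling scheme are illustrative assumptions.

```python
# Minimal sketch of text-conditioned frame selection + token merging under a
# visual-token budget. All names and shapes are hypothetical.
import numpy as np

def select_frames(frame_feats, text_feat, max_frames):
    """Score each frame feature against the text embedding and keep the top ones."""
    # frame_feats: (T, D) per-frame features; text_feat: (D,) query embedding
    scores = frame_feats @ text_feat / (
        np.linalg.norm(frame_feats, axis=1) * np.linalg.norm(text_feat) + 1e-8
    )
    keep = np.argsort(scores)[::-1][:max_frames]
    return np.sort(keep)  # keep the selected frames in temporal order

def merge_tokens(tokens, budget):
    """Average-pool groups of adjacent visual tokens until they fit the budget."""
    # tokens: (N, D) visual tokens gathered from the selected frames
    if tokens.shape[0] <= budget:
        return tokens
    group = int(np.ceil(tokens.shape[0] / budget))
    pad = (-tokens.shape[0]) % group
    padded = np.concatenate([tokens, np.repeat(tokens[-1:], pad, axis=0)])
    return padded.reshape(-1, group, tokens.shape[1]).mean(axis=1)

# Toy usage: 64 frames with 16 tokens each, squeezed into a 256-token budget.
rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 256))      # per-frame features
text = rng.normal(size=(256,))           # text-query embedding
tokens = rng.normal(size=(64, 16, 256))  # per-frame visual tokens

idx = select_frames(frames, text, max_frames=32)
visual = merge_tokens(tokens[idx].reshape(-1, 256), budget=256)
print(visual.shape)  # (256, 256): within the context-window budget
```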

Takeaways, Limitations

Takeaways:
Significantly improves the efficiency of VLLM-based long video understanding.
Minimizes the loss of task-relevant information through text-conditioned adaptive frame selection and token merging strategies.
Achieves superior performance over existing methods on various video understanding benchmarks.
The open-source code release improves reproducibility.
Limitations:
A detailed analysis of the computational complexity of the proposed method is lacking.
Performance may be biased toward certain types of video data.
Additional experiments on more diverse and complex video understanding tasks are needed.