Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

EventVL: Understand Event Streams via Multimodal Large Language Model

Created by
  • Haebom

Authors

Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong

Outline

This paper observes that existing event-based Vision-Language Models (VLMs) struggle to capture the meaning and context of event streams because they rely on CLIP and focus solely on traditional perception tasks. To overcome these limitations, the authors propose EventVL, which they describe as the first generative event-based multimodal large language model (MLLM) framework for explicit semantic understanding. To bridge the data gap for cross-modal semantic connections, they annotate a large dataset of approximately 1.4 million high-quality event-image/video-text pairs. They then design an event spatiotemporal representation that dynamically aggregates and segments the event stream so that more of its information is used. They also introduce dynamic semantic alignment to enrich and complement the sparse semantic space of events. Experimental results show that EventVL significantly outperforms existing MLLM baselines on both event caption generation and scene description generation.
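
To make the idea of segmenting and aggregating an event stream concrete, here is a minimal sketch in Python. It is not EventVL's representation, which this summary does not specify; it assumes a common generic scheme in which events are (x, y, timestamp, polarity) tuples, the stream is split into equal-length time windows, and polarities are summed per pixel. The function name events_to_frames and all parameters are hypothetical.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp, polarity), with polarity in {-1, +1}.
Event = Tuple[int, int, float, int]


def events_to_frames(
    events: List[Event], width: int, height: int, num_bins: int
) -> List[List[List[int]]]:
    """Split the stream into num_bins equal time windows and sum polarities per pixel."""
    frames = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return frames

    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero if all timestamps match

    for x, y, t, p in events:
        # Map the timestamp to a window index; clamp the latest event into the last window.
        b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        frames[b][y][x] += p
    return frames


# Four toy events on a 2x2 sensor, split into two time windows.
events = [(0, 0, 0.0, 1), (1, 0, 0.1, -1), (0, 0, 0.2, 1), (1, 1, 0.9, 1)]
frames = events_to_frames(events, width=2, height=2, num_bins=2)
```

Each entry of frames is a height-by-width grid for one time window. Fixed, equal windows are the simplest choice; a dynamic scheme like the one the paper describes would presumably choose the segmentation adaptively.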

Takeaways, Limitations

Takeaways:
  • Proposes EventVL, a new framework that improves the semantic understanding of event-based VLMs.
  • Builds and publishes a large-scale event-image/video-text dataset.
  • Makes effective use of event semantics through an event spatiotemporal representation and dynamic semantic alignment.
  • Outperforms existing models on event caption generation and scene description generation.
  • Contributes to the development of event-based vision.
Limitations:
  • The generalization performance and potential bias of the proposed dataset need further analysis.
  • The robustness of EventVL across varied event types and complex scenarios remains to be evaluated.
  • Further research is needed to reduce computational cost and model size.