Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Created by
  • Haebom

Authors

Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Outline

Long-form video data is extremely dense and high-dimensional. Text-based summaries of video content offer a far more concise way to represent query-relevant information than raw video, and such text representations can be readily processed by state-of-the-art large language models (LLMs), enabling reasoning over video content to answer complex natural-language queries.

To handle this density, we rely on a video captioner that incrementally builds a text-based memory while operating on short video chunks, where spatiotemporal modeling remains computationally feasible. We explore methods for improving the quality of the resulting activity log of short video captions. Because video captions tend to focus primarily on human actions, while questions may concern other aspects of the scene, we use vision-language models (VLMs) to add static scene descriptions to the memory.

Our video understanding system combines the LaViLa video captioner with an LLM to answer questions about videos. We first explore various methods for segmenting videos into meaningful chunks that more accurately reflect the structure of the video content. We then integrate static scene descriptions into the captioning pipeline using the LLaVA VLM, producing more detailed and complete caption logs and expanding the range of questions that can be answered from the text memory. Finally, we fine-tune the LaViLa video captioner to generate both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to running separate captioning models for the two tasks. The resulting model, a controllable hybrid captioner, can alternate between different caption types based on special input tokens that signal scene changes detected in the video.
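To make the control flow concrete, here is a minimal, runnable sketch of the chunk-wise captioning loop and the downstream question answering. The helpers (detect_scene_change, hybrid_captioner, llm_answer) and the control-token strings are hypothetical placeholders standing in for the paper's scene-change detector, fine-tuned LaViLa model, and downstream LLM; this is an illustration of the idea, not the authors' implementation.

```python
from typing import List

# Hypothetical control tokens; the paper describes special input tokens
# but does not name them, so these strings are assumptions.
SCENE_TOKEN = "<scene>"
ACTION_TOKEN = "<action>"

def detect_scene_change(prev, chunk) -> bool:
    """Stub scene-change detector; a real one might threshold the
    visual-feature distance between consecutive chunks."""
    return prev is None  # trivially fires on the first chunk only

def hybrid_captioner(chunk: str, control: str) -> str:
    """Stub standing in for the fine-tuned LaViLa hybrid captioner:
    the control token selects which caption type is generated."""
    kind = "Scene" if control == SCENE_TOKEN else "Action"
    return f"{kind} caption for {chunk}"

def llm_answer(prompt: str) -> str:
    """Stub standing in for any instruction-tuned LLM."""
    return "answer derived from the caption log"

def build_text_memory(chunks: List[str]) -> List[str]:
    """Incrementally build the text-based memory over short chunks."""
    memory: List[str] = []
    prev = None
    for chunk in chunks:
        if detect_scene_change(prev, chunk):
            # On a detected scene change, request a static scene
            # description in addition to the usual action caption.
            memory.append(hybrid_captioner(chunk, SCENE_TOKEN))
        memory.append(hybrid_captioner(chunk, ACTION_TOKEN))
        prev = chunk
    return memory

def answer_question(memory: List[str], question: str) -> str:
    """Answer a natural-language query over the caption log."""
    prompt = ("Video log:\n" + "\n".join(memory)
              + f"\n\nQuestion: {question}\nAnswer:")
    return llm_answer(prompt)

if __name__ == "__main__":
    memory = build_text_memory(["chunk_0", "chunk_1", "chunk_2"])
    print(answer_question(memory, "What does the kitchen look like?"))
```

Keeping the memory as plain text is what lets an off-the-shelf LLM perform the final reasoning step without any video-specific architecture.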

Takeaways, Limitations

Text-based summaries of video content enable an LLM to answer complex queries about long videos.
Captioning videos with LaViLa and adding static scene information via a VLM (LLaVA) improves the accuracy and completeness of the caption log.
Fine-tuning the LaViLa captioner to generate both action and scene captions makes the pipeline more efficient than running two separate captioning models.
The controllable hybrid captioner switches between caption types via special input tokens triggered by detected scene changes.
Limitations are not specifically mentioned in the summary.