Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

COLT: Enhancing Video Large Language Models with Continual Tool Usage

Created by
  • Haebom

Authors

Yuyang Liu, Xinyuan Shi, Xiaondan Liang

Outline

Building on advances in video understanding driven by large language models (LLMs), this paper studies how video LLMs can use pre-trained expert models (tools). Existing methods either prompt closed-source LLMs or fine-tune for tool use through instruction tuning, but they assume a fixed tool repository and struggle to generalize to real-world settings where tool data keeps evolving and arriving as a stream. To address this, the paper proposes COLT (continual tool usage), a method that enhances open-source video LLMs by automatically acquiring tool-use ability from a continuous stream of tools without "forgetting" previously learned ones. COLT uses a learnable tool codebook as a tool-specific memory system and dynamically selects relevant tools based on the similarity between the user instruction and the tool features in the codebook. To unlock the tool-use potential of video LLMs, the authors also introduce VideoToolBench, a video-centric tool-use instruction tuning dataset. COLT achieves state-of-the-art performance on existing video LLM benchmarks and on VideoToolBench.
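The similarity-based tool selection described above can be illustrated with a small sketch. This is not the authors' implementation: the class ToolCodebook, the tool names, and the three-dimensional toy vectors are all hypothetical, and the features here are fixed by hand, whereas the paper describes a learnable codebook. The sketch only assumes that each tool has one feature vector, that selection ranks tools by cosine similarity to the embedded user request, and that a newly arriving tool is added as a separate entry.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToolCodebook:
    """Hypothetical codebook: one feature vector per tool, keyed by tool name."""
    entries: dict = field(default_factory=dict)

    def add_tool(self, name, feature):
        # Assumption: a new tool gets its own entry and existing entries are
        # left untouched, so earlier tools are not overwritten.
        if name in self.entries:
            raise ValueError(f"tool {name!r} already registered")
        self.entries[name] = list(feature)

    def select(self, instruction_feature, top_k=1):
        # Rank tools by similarity between the request and each codebook entry.
        scored = [(cosine(instruction_feature, f), n) for n, f in self.entries.items()]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [n for _, n in scored[:top_k]]


codebook = ToolCodebook()
codebook.add_tool("object_tracker", [1.0, 0.0, 0.0])      # hypothetical tool
codebook.add_tool("speech_recognizer", [0.0, 1.0, 0.0])   # hypothetical tool

instruction = [0.9, 0.1, 0.0]  # toy embedding of a tracking-style request
before = codebook.select(instruction)

# A new tool arrives later in the stream.
codebook.add_tool("ocr_reader", [0.0, 0.0, 1.0])          # hypothetical tool
after = codebook.select(instruction)
```

In this toy setup the same request selects the same tool before and after the new tool is registered, because adding an entry does not modify the earlier ones. How COLT actually trains the codebook and prevents forgetting is specified in the paper, not here.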

Takeaways, Limitations

Takeaways:
Presents a video LLM framework that can learn and use new tools effectively in a continuously changing real-world environment.
Addresses the "forgetting" of previously learned tools by using a learnable tool codebook.
Introduces VideoToolBench, a new tool-use instruction tuning dataset.
Achieves state-of-the-art performance on existing benchmarks and on VideoToolBench.
Limitations:
Further review of the size and diversity of the VideoToolBench dataset is needed.
Further evaluation of generalization performance in real-world environments is needed.
Further research is needed on its applicability to different types of video data and tools.
Analysis of COLT's computational cost and efficiency is needed.