This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Created by
Haebom
Authors
Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Outline
This paper proposes the Deep Video Discovery (DVD) agent, which applies an agentic search strategy to long-form video understanding, a task in which answering questions is difficult because long videos have high temporal and spatial complexity. Unlike existing video agents, which follow a fixed workflow, the DVD agent emphasizes autonomy and uses a set of search-centric tools over a multi-granular video database. It relies on the reasoning ability of an LLM to plan based on its current observation state, strategically select tools, set appropriate parameters for each action, and iteratively refine its internal reasoning in light of the information it has gathered. Comprehensive evaluations on several long-form video understanding benchmarks demonstrate the advantage of this system design; in particular, the agent achieves state-of-the-art (SOTA) results on the LVBench dataset, outperforming prior work by a large margin. The paper also reports ablation studies and a tool analysis that offer insights for developing intelligent agents for long-form video understanding. The code is open source ( https://github.com/microsoft/DeepVideoDiscovery ).
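The plan, select-tool, set-parameters, and refine cycle described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the DVD implementation: the caption dictionary, the tool names `search_clips` and `inspect_clip`, and the rule-based `plan` function (standing in for the LLM's reasoning) are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical toy "video database": clip id -> caption (assumption, not from the paper).
CLIPS: Dict[int, str] = {
    0: "a man enters the kitchen",
    1: "he chops onions on a board",
    2: "a dog runs through the garden",
}


def search_clips(query: str) -> List[int]:
    """Hypothetical coarse search tool: ids of clips whose caption shares a word with the query."""
    words = query.lower().split()
    return [i for i, cap in CLIPS.items() if any(w in cap.split() for w in words)]


def inspect_clip(clip_id: int) -> str:
    """Hypothetical fine-grained tool: return the full caption of one clip."""
    return CLIPS[clip_id]


TOOLS: Dict[str, Callable[..., Any]] = {
    "search_clips": search_clips,
    "inspect_clip": inspect_clip,
}


@dataclass
class AgentState:
    question: str
    # Each observation is a (tool name, tool result) pair gathered so far.
    observations: List[Tuple[str, Any]] = field(default_factory=list)


def plan(state: AgentState) -> Tuple[str, Dict[str, Any]]:
    """Rule-based stand-in for LLM reasoning: choose the next action and its parameters."""
    if not state.observations:
        return "search_clips", {"query": state.question}
    last_tool, last_result = state.observations[-1]
    if last_tool == "search_clips" and last_result:
        return "inspect_clip", {"clip_id": last_result[0]}
    if last_tool == "inspect_clip":
        return "answer", {"text": last_result}
    return "answer", {"text": "not found"}


def run_agent(question: str, max_steps: int = 5) -> str:
    """Iterate: observe state, pick a tool, call it, record the result, until an answer is produced."""
    state = AgentState(question)
    for _ in range(max_steps):
        action, params = plan(state)
        if action == "answer":
            return params["text"]
        result = TOOLS[action](**params)
        state.observations.append((action, result))
    return "step budget exhausted"
```

Calling `run_agent("dog")` first searches the toy database, then inspects the matching clip, and finally answers with its caption. In a real agent, `plan` would be replaced by an LLM call that reads the accumulated observations and decides the next tool and parameters itself, which is the autonomy the summary contrasts with a fixed workflow.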