
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

Created by
  • Haebom

Authors

Xinxin Dong, Baoyun Peng, Haokai Ma, Yufei Wang, Zixuan Dong, Fei Hu, Xiaodong Wang

Outline

In this paper, we propose LeAdQA, a novel method for video question answering (VideoQA) that identifies key moments in long videos and reasons over their causal relationships to answer semantically complex questions. To overcome the arbitrary frame sampling of existing methods and their inability to model causal-temporal structure, LeAdQA leverages a large language model (LLM) to refine question-option pairs and make their temporal focus explicit. Guided by the refined queries, a temporal grounding model precisely localizes the most relevant video segments, and an adaptive fusion mechanism integrates them to maximize relevance. Finally, a multimodal LLM (MLLM) generates accurate, contextually grounded answers from this visual evidence. Experiments on the NExT-QA, IntentQA, and NExT-GQA datasets demonstrate that LeAdQA achieves state-of-the-art performance on complex reasoning tasks.
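
The paragraph above describes a four-stage flow (LLM query refinement → temporal grounding → adaptive fusion → MLLM answering). The sketch below only illustrates how those pieces connect; it is not the authors' implementation. All objects and method names (llm.generate, grounder.localize, mllm.sample_frames, mllm.answer) are hypothetical placeholders, and the fusion step is simplified to score thresholding plus interval merging.

```python
# Minimal sketch of a LeAdQA-style pipeline, assuming hypothetical model wrappers.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedSpan:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    score: float   # relevance of the segment to the refined query

def refine_query(llm, question: str, options: List[str]) -> str:
    """Step 1: use an LLM to rewrite the question-option pair so that
    its temporal focus (which moment matters, and why) is explicit."""
    prompt = (
        "Rewrite the question so the relevant moment in the video is explicit.\n"
        f"Question: {question}\nOptions: {options}"
    )
    return llm.generate(prompt)  # hypothetical LLM interface

def ground_intervals(grounder, video_path: str, query: str) -> List[GroundedSpan]:
    """Step 2: a temporal grounding model localizes the video segments
    most relevant to the refined query."""
    return grounder.localize(video_path, query)  # hypothetical grounder interface

def fuse_spans(spans: List[GroundedSpan], threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Step 3: simplified stand-in for adaptive fusion -- keep high-scoring
    spans and merge the ones that overlap."""
    kept = sorted((s for s in spans if s.score >= threshold), key=lambda s: s.start)
    merged: List[Tuple[float, float]] = []
    for s in kept:
        if merged and s.start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], s.end))
        else:
            merged.append((s.start, s.end))
    return merged

def answer_question(mllm, video_path: str, intervals, question: str, options: List[str]) -> str:
    """Step 4: feed only frames sampled from the grounded intervals to the MLLM,
    together with the original question and answer options."""
    frames = mllm.sample_frames(video_path, intervals)  # hypothetical MLLM interface
    return mllm.answer(frames=frames, question=question, options=options)

def leadqa_pipeline(llm, grounder, mllm, video_path, question, options) -> str:
    refined = refine_query(llm, question, options)
    spans = ground_intervals(grounder, video_path, refined)
    intervals = fuse_spans(spans)
    return answer_question(mllm, video_path, intervals, question, options)
```

In the paper itself the grounding and fusion stages are learned components; the sketch above only conveys the order in which refinement, grounding, fusion, and answering are chained.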

Takeaways, Limitations

Takeaways:
Refining questions with an LLM to sharpen their temporal focus contributes to improved VideoQA performance.
Precise visual evidence grounding improves key-moment identification and causal reasoning.
The adaptive fusion mechanism integrates relevant information and processes it efficiently.
Achieves state-of-the-art performance on diverse VideoQA datasets.
Limitations:
Overall performance may depend on the quality of the LLM.
There may be bias toward certain types of questions or videos.
The accuracy of the temporal grounding model affects overall system performance.
Further analysis of computational cost is needed.