Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Dense Video Captioning using Graph-based Sentence Summarization

Created by
  • Haebom

Authors

Zhiwang Zhang, Dong Xu, Wanli Ouyang, Luping Zhou

Outline

This paper addresses dense video captioning, the task of detecting and captioning all events in a long, untrimmed video. Existing methods do not sufficiently explore scene evolution within event temporal proposals, so they perform poorly when scenes and objects change over relatively long proposals. To address this, the authors propose a graph-based partition-and-summarization (GPaS) framework. In the 'partition' step, the whole event proposal is divided into short video segments so that captions can be generated at a finer level; in the 'summarization' step, the generated sentences, which carry rich descriptive detail for each segment, are summarized into a single sentence. Focusing on the summarization step, the framework exploits relationships between semantic words: it treats semantic words as nodes in a graph and learns their interactions by coupling a Graph Convolutional Network (GCN) with Long Short-Term Memory (LSTM), aided by visual cues. Two GCN-LSTM Interaction (GLI) modules are proposed to integrate the GCN and LSTM seamlessly. The effectiveness of the method is demonstrated through extensive comparisons with state-of-the-art methods on the ActivityNet Captions and YouCook II datasets.
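The core idea of the summarization step, word nodes refined by a graph convolution and then consumed by an LSTM decoder, can be illustrated with a minimal NumPy sketch. This is not the paper's actual GLI module: the adjacency matrix, embedding sizes, and random weights below are invented for illustration, and the GCN here is the standard symmetrically normalized variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A, H, W):
    # Symmetrically normalized graph convolution with self-loops:
    # H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    # One standard LSTM cell update; gates stacked as [input, forget, cell, output].
    z = Wx @ x + Wh @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy setup: 5 word nodes drawn from segment captions, 8-dim word
# embeddings, 6-dim LSTM hidden state (all sizes are hypothetical).
n_words, d_in, d_h = 5, 8, 6
A = np.zeros((n_words, n_words))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]:  # hypothetical word-word edges
    A[u, v] = A[v, u] = 1.0

H = rng.normal(size=(n_words, d_in))      # word-node embeddings
W_gcn = rng.normal(size=(d_in, d_h))      # GCN projection weights
Wx = rng.normal(size=(4 * d_h, d_h))      # LSTM input weights
Wh = rng.normal(size=(4 * d_h, d_h))      # LSTM recurrent weights
b = np.zeros(4 * d_h)

# 1) Graph step: refine each word embedding using its neighbors.
H_ref = gcn_layer(A, H, W_gcn)

# 2) Sequence step: feed the refined word features through the LSTM,
#    whose final hidden state stands in for the summary sentence state.
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(n_words):
    h, c = lstm_step(H_ref[t], h, c, Wx, Wh, b)
```

In the actual GPaS framework the two components interact within the proposed GLI modules rather than running in this simple graph-then-sequence pipeline, and the node features are additionally conditioned on visual cues.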

Takeaways, Limitations

Takeaways:
Presents a novel framework (GPaS) for dense video captioning that is robust to scene and object changes in long videos.
Proposes a summarization technique that effectively exploits relationships between semantic words by combining a GCN and an LSTM.
Achieves state-of-the-art performance on the ActivityNet Captions and YouCook II datasets.
Limitations:
No analysis of the computational complexity of the proposed GPaS framework.
Further evaluation of generalization to other types of video data is needed.
Little detail on parameter tuning and the optimization process for the GCN-LSTM interaction.