Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Created by
  • Haebom

Authors

Seojeong Park, Jiho Choi, Kyungjune Baek, Hyunjung Shim

Outline

This paper studies video Moment Retrieval (MR), the task of localizing the moment in a video that matches a natural language query. Demand for MR is growing as more information retrieval takes place on platforms such as YouTube. Recent DETR-based models have improved performance, but they still struggle to localize short moments accurately. The authors trace this to two problems. First, short moments lack feature diversity; to address this, they propose MomentMix, which combines two data augmentation strategies, ForegroundMix and BackgroundMix. Second, center-location prediction is inaccurate for short moments; to address this, they propose a Length-Aware Decoder that takes length information into account through a novel bipartite matching process. Experiments on benchmark datasets show that the proposed method outperforms existing DETR-based methods and is effective at localizing short moments. It achieves state-of-the-art R1 and mAP on the QVHighlights dataset and state-of-the-art R1@0.7 on the TACoS and Charades-STA datasets.
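To make the R1@0.7 metric concrete, the following is a minimal sketch of how recall at rank 1 under a temporal IoU threshold is commonly computed. It is not taken from the paper's code: the names `temporal_iou` and `recall_at_1` are hypothetical, and the sketch assumes one ground-truth moment per query, each given as a (start, end) pair in seconds.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); assumed representation


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span],
                iou_threshold: float = 0.7) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth span per query (a simplification).
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(10.0, 20.0), (5.0, 6.0)]
    gts = [(11.0, 20.0), (5.5, 7.5)]
    print(recall_at_1(preds, gts, iou_threshold=0.7))
```

The same sketch also hints at why short moments are hard under this metric. Shifting a prediction by one second against a 2-second ground truth, for example `temporal_iou((6.0, 8.0), (5.0, 7.0))`, gives an IoU of 1/3, while the same one-second shift against a 20-second ground truth, `temporal_iou((1.0, 21.0), (0.0, 20.0))`, gives 19/21, which still clears the 0.7 threshold.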

Takeaways, Limitations

Takeaways:
The paper presents a new data augmentation technique (MomentMix) and a Length-Aware Decoder, both of which improve the accuracy of short-moment retrieval.
The method achieves state-of-the-art performance on the QVHighlights, TACoS, and Charades-STA datasets.
The paper analyzes two problems specific to short moments, the lack of feature diversity and inaccurate center-location prediction, and proposes a solution for each.
The code is released as open source, which supports reproducibility and extension of the research.
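As an illustration of what length-aware bipartite matching could look like, the sketch below restricts each ground-truth moment to decoder queries assigned to the same length bucket before solving a one-to-one assignment. This is an assumption made for illustration, not the paper's confirmed procedure: the two-bucket split, the `short_max` threshold, the L1-only cost, and the function names are all hypothetical, and a brute-force search stands in for the Hungarian algorithm that DETR-style models typically use.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Moment = Tuple[float, float]  # (center, length), normalized to [0, 1]; assumed


def length_bucket(length: float, short_max: float = 0.1) -> str:
    """Hypothetical two-way split; the threshold is an assumption."""
    return "short" if length <= short_max else "long"


def pair_cost(pred: Moment, gt: Moment) -> float:
    """L1 distance on (center, length). Real models usually add IoU and
    classification terms; they are omitted here for brevity."""
    return abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])


def length_aware_match(preds: List[Moment], query_buckets: List[str],
                       gts: List[Moment]) -> Dict[int, int]:
    """Return the min-cost one-to-one map gt_index -> query_index in which
    every ground truth is paired with a query of its own length bucket.

    Brute force over permutations, so only suitable for a handful of
    moments. Returns {} if no bucket-respecting assignment exists.
    """
    best_cost, best = float("inf"), {}
    for perm in permutations(range(len(preds)), len(gts)):
        # Skip assignments that pair a ground truth with a wrong-bucket query.
        if any(query_buckets[q] != length_bucket(gts[g][1])
               for g, q in enumerate(perm)):
            continue
        cost = sum(pair_cost(preds[q], gts[g]) for g, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, dict(enumerate(perm))
    return best


if __name__ == "__main__":
    preds = [(0.50, 0.05), (0.52, 0.40), (0.20, 0.08)]
    query_buckets = ["short", "long", "short"]
    gts = [(0.51, 0.06), (0.55, 0.45)]
    print(length_aware_match(preds, query_buckets, gts))
```

In this toy setup the short ground truth can only be matched to queries 0 or 2, so a long-bucket query that happens to sit nearby never competes for it. The intent of such a restriction would be to let some queries specialize in short moments; whether the paper implements it this way should be checked against the released code.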
Limitations:
The method's effectiveness may be limited to the benchmark datasets used; experiments on a wider variety of datasets are needed.
The added complexity of the Length-Aware Decoder may increase computational cost.
Generalization to more diverse and complex video data still needs to be evaluated.