
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Created by
  • Haebom

Author

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

Outline

In this paper, we propose DaMO, a data-efficient video LLM designed for accurate temporal reasoning and multimodal understanding. DaMO is centered on a temporal-aware Fuseformer, a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and acoustic information. Computational efficiency is further improved by a global residual that reduces spatial redundancy while preserving essential semantic detail. DaMO is trained with a four-stage progressive training paradigm that successively equips the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. We also contribute several existing datasets augmented with LLM-generated temporal grounding QA pairs. Comprehensive experiments on temporal grounding and video QA benchmarks show that DaMO outperforms prior methods, particularly on tasks that demand precise temporal alignment and reasoning.
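The dual-stream idea can be illustrated with a toy sketch. Note that this is not the paper's actual Fuseformer: the stage function, fusion rule, and all dimensions below are hypothetical stand-ins, meant only to show the shape of a hierarchical per-modality encoder followed by fusion with a global residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T frames, d-dim features.
T, d = 16, 32
visual = rng.standard_normal((T, d))  # per-frame visual features
audio = rng.standard_normal((T, d))   # per-frame acoustic features

def temporal_stage(x, window=2):
    """Toy temporal encoder: average adjacent frames, halving the
    sequence length -- a stand-in for one hierarchical stage."""
    n = x.shape[0] - x.shape[0] % window
    return x[:n].reshape(n // window, window, -1).mean(axis=1)

def fuse(v, a, w):
    """Toy fusion: concatenate the two streams and project back to d,
    then add a global residual (mean-pooled input) so coarse semantic
    content survives the reduction in spatial/temporal resolution."""
    fused = np.concatenate([v, a], axis=-1) @ w            # (T', d)
    global_residual = (v + a).mean(axis=0, keepdims=True)  # (1, d)
    return fused + global_residual

w = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

# Two hierarchical stages: each refines per-modality temporal structure
# before fusion, progressively shortening the sequence (16 -> 8 -> 4).
v1, a1 = temporal_stage(visual), temporal_stage(audio)
v2, a2 = temporal_stage(v1), temporal_stage(a1)
out = fuse(v2, a2, w)
print(out.shape)  # (4, 32)
```

The point of the sketch is the ordering: temporal structure is modeled inside each modality first, and cross-modal fusion happens on the already-compressed representations, which is where the data and compute savings come from.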

Takeaways, Limitations

Takeaways:
DaMO, a data-efficient video LLM, improves accurate temporal reasoning and multimodal understanding.
The temporal-aware Fuseformer architecture and the four-stage progressive training paradigm are shown to be effective.
New datasets augmented with temporal grounding QA pairs are provided.
Performance exceeds existing methods on tasks requiring precise temporal alignment and reasoning.
Limitations:
Lack of in-depth analysis of the proposed four-stage training paradigm and the contribution of each stage.
Lack of evaluation of generalization across different video types and complexities.
Further research is needed on applicability and limitations in practical deployments.
Lack of assessment of the quality and reliability of the LLM-generated datasets.