Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Created by
  • Haebom

Authors

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

Outline

In this paper, we propose DaMO, a data-efficient video LLM designed for accurate temporal reasoning and multimodal understanding. At its core is a temporal-aware Fuseformer with a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and fuses complementary visual and audio information. To improve computational efficiency, DaMO integrates global residuals that reduce spatial redundancy while retaining essential semantic details. We train DaMO with a four-stage progressive training paradigm that incrementally equips the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. We also provide several datasets augmented with GPT-generated temporal grounding QA pairs for tasks that require temporal supervision. In comprehensive experiments on temporal grounding and video QA benchmarks, DaMO consistently outperforms previous methods, especially on tasks requiring accurate temporal alignment and reasoning.
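To make "temporal grounding" concrete: such tasks ask a model to predict the time span in a video that matches a query, and a prediction is commonly scored by its temporal intersection-over-union (IoU) with the annotated span. The sketch below is only an illustration of that common metric; this summary does not state which metrics the paper reports, and the function names and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments (hypothetical helper)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the annotation meets the threshold.

    The 0.5 default is an assumed value for illustration only.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a predicted span of 0 to 10 seconds against an annotated span of 5 to 15 seconds overlaps for 5 seconds out of a 15-second union, so it would not count as a hit at a 0.5 threshold.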

Takeaways, Limitations

Takeaways:
DaMO, a data-efficient video LLM, achieves accurate temporal reasoning and multimodal understanding even under limited supervision.
A temporal-aware Fuseformer and a four-stage progressive training paradigm improve temporal reasoning performance.
GPT-based data augmentation supplies temporal grounding QA pairs for tasks that require temporal supervision.
DaMO outperforms existing methods on tasks requiring temporal alignment and reasoning.
Limitations:
Further analysis of the generalization performance of the proposed method is needed.
Robustness assessment for different video types is needed.
Performance evaluation for more complex and longer videos is needed.
Detailed analysis of computational cost and memory usage is required.