In this paper, we propose DaMO, a data-efficient video LLM specifically designed for accurate temporal inference and multimodal understanding. DaMO is centered on a temporal-aware fuseformer, a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and acoustic information. We improve computational efficiency by incorporating global residuals that preserve essential semantic details while reducing spatial redundancy. Furthermore, we train DaMO via a four-step progressive training paradigm that incrementally equips the model with multimodal alignment, semantic grounding, and temporal inference capabilities. We also contribute several datasets that augment existing ones with LLM-generated temporal grounding QA pairs. Comprehensive experimental results on temporal grounding and video QA benchmarks demonstrate that DaMO outperforms previous methods, especially on tasks that require accurate temporal alignment and inference.
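Since the fuseformer is described here only at a high level, the following is a minimal, illustrative sketch of how a hierarchical dual-stream fusion stage with a global residual might be realized in PyTorch. All names (e.g., FuseformerBlock, temporal_pool) and design details are assumptions for illustration, not DaMO's actual implementation.

```python
# Illustrative sketch only: a hypothetical hierarchical dual-stream block with
# cross-modal fusion and a global residual, loosely mirroring the described design.
import torch
import torch.nn as nn


class FuseformerBlock(nn.Module):
    """One hierarchical stage: per-modality temporal encoding, temporal
    downsampling, cross-modal fusion, and a global residual that carries
    the pooled input semantics forward."""

    def __init__(self, dim: int = 256, heads: int = 4, pool: int = 2):
        super().__init__()
        # Independent temporal encoders for the visual and audio streams.
        self.visual_enc = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.audio_enc = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Temporal downsampling reduces sequence length stage by stage.
        self.temporal_pool = nn.AvgPool1d(kernel_size=pool, stride=pool)
        # Cross-attention fuses audio evidence into the visual stream.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def _pool(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> pooled along the time axis.
        return self.temporal_pool(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # Per-modality temporal modeling.
        v = self.visual_enc(visual)
        a = self.audio_enc(audio)
        # Hierarchical temporal reduction within each stream.
        v, a = self._pool(v), self._pool(a)
        # Cross-modal fusion: visual tokens attend to audio tokens.
        fused, _ = self.cross_attn(query=v, key=a, value=a)
        # Global residual: pooled input features skip past the fusion step,
        # preserving coarse semantics while redundancy is reduced.
        v = self.norm(fused + self._pool(visual))
        return v, a


if __name__ == "__main__":
    block = FuseformerBlock()
    vis = torch.randn(2, 32, 256)    # (batch, video frames, feature dim)
    aud = torch.randn(2, 32, 256)    # (batch, audio frames, feature dim)
    v_out, a_out = block(vis, aud)
    print(v_out.shape, a_out.shape)  # both torch.Size([2, 16, 256])
```

Stacking several such stages would progressively shorten the temporal axis while the global residual keeps coarse semantics available to later stages, which is one plausible reading of the efficiency claim above.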