In this paper, we propose DaMO, a data-efficient video LLM designed for accurate temporal inference and multimodal understanding. DaMO is built around a temporal-aware Fuseformer with a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and fuses complementary visual and acoustic information. It also incorporates global residuals to reduce spatial redundancy while preserving essential semantic detail, thereby improving computational efficiency. We train DaMO with a four-stage progressive paradigm that incrementally equips the model with multimodal alignment, semantic grounding, and temporal inference capabilities. In addition, we contribute several datasets augmented with GPT-generated temporal grounding QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks show that DaMO consistently outperforms prior methods, particularly on tasks demanding precise temporal alignment and inference.
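To make the dual-stream fusion and global-residual ideas concrete, the sketch below shows one possible fusion block in PyTorch. The module layout, dimensions, and attention arrangement are illustrative assumptions only, not DaMO's actual Fuseformer implementation.

```python
# Minimal sketch of a dual-stream fusion block with a global residual.
# All names (DualStreamFusionBlock, d_model, n_heads) are hypothetical.
import torch
import torch.nn as nn


class DualStreamFusionBlock(nn.Module):
    """Models temporal dynamics per modality, fuses audio into the visual
    stream with cross-attention, and re-injects a global residual so the
    fused tokens retain the original visual semantics."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Intra-modality temporal modeling (self-attention over time).
        self.visual_temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal fusion: visual tokens attend to audio tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_f = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T_v, d_model), audio: (B, T_a, d_model)
        v = self.norm_v(visual)
        a = self.norm_a(audio)
        v = v + self.visual_temporal(v, v, v)[0]  # temporal dynamics, visual stream
        a = a + self.audio_temporal(a, a, a)[0]   # temporal dynamics, audio stream
        fused = v + self.cross_attn(self.norm_f(v), a, a)[0]  # audio-conditioned fusion
        # Global residual: add the unprocessed visual features back so
        # essential semantic detail survives the fusion stage.
        return fused + visual


if __name__ == "__main__":
    block = DualStreamFusionBlock()
    v = torch.randn(2, 16, 512)   # 16 visual frame tokens
    a = torch.randn(2, 32, 512)   # 32 audio tokens
    print(block(v, a).shape)      # torch.Size([2, 16, 512])
```

In a hierarchical variant of this sketch, several such blocks would be stacked with temporal downsampling between them, which is one way to realize the progressive capture of temporal dynamics described above.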