DEEVISum is a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. It leverages multimodal prompts that combine text- and audio-based cues and integrates multi-stage knowledge distillation (MSKD) and early exit (EE) to balance performance and efficiency. MSKD delivers an absolute F1 improvement of 1.33 points over baseline distillation, while EE reduces inference time by approximately 21% at the cost of a 1.3-point decrease in F1 score. When evaluated on the TVSum dataset, the best-performing model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, making it competitive with much larger models while maintaining a low computational cost. The code and processed dataset are made available to support further research.
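
To make the EE trade-off concrete, the following is a minimal sketch of one common way to stop inference before the final layer: confidence-thresholded exiting over intermediate prediction heads. The exit criterion, the threshold value, and the names `softmax` and `early_exit_predict` are assumptions chosen for illustration and are not taken from DEEVISum itself.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    layer_logits: Iterable[Sequence[float]],
    threshold: float = 0.9,  # hypothetical confidence threshold
) -> Tuple[int, int]:
    """Return (predicted_class, layers_used).

    `layer_logits` yields the logits of one intermediate head per layer,
    shallowest first. Passing a generator means deeper layers are never
    evaluated once a head is confident enough.
    """
    prediction, depth = None, 0
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        prediction = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    if prediction is None:
        raise ValueError("layer_logits must yield at least one layer")
    return prediction, depth
```

Under this assumed criterion, a lower threshold exits earlier and saves more computation but accepts less certain predictions, which is the kind of speed-versus-F1 trade-off reported above.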