Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

Created by
  • Haebom

Author

Anas Anwarul Haq Khan, Utkarsh Verma, Ganesh Ramakrishnan

Outline

DEEVISum is a lightweight, efficient, and scalable vision-language model designed for segment-wise video summarization. It leverages multimodal prompts that combine text- and audio-derived cues, and integrates multi-stage knowledge distillation (MSKD) and early exit (EE) to balance performance and efficiency. MSKD delivers an absolute F1 improvement of 1.33% over baseline distillation, while EE reduces inference time by approximately 21% at the cost of a 1.3-point drop in F1. Evaluated on the TVSum dataset, the best-performing model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, competitive with much larger models while maintaining low computational cost. The code and processed dataset are publicly available to support further research.
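The two efficiency techniques above can be illustrated with a minimal sketch. This is not the paper's implementation: the stage heads, the confidence threshold, and the temperature-scaled KL loss form are illustrative assumptions about how early exit and knowledge distillation are typically realized.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(stages, features, threshold=0.9):
    """Early exit (EE): run classifier heads attached to successive
    stages and stop as soon as the max softmax confidence reaches
    the threshold, skipping the remaining (more expensive) stages.
    Returns (predicted_class, index_of_exit_stage)."""
    for i, stage in enumerate(stages):
        probs = softmax(stage(features))
        conf = max(probs)
        if conf >= threshold or i == len(stages) - 1:
            return probs.index(conf), i

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Knowledge distillation: temperature-softened KL divergence
    KL(teacher || student), scaled by T^2 as in standard KD. In a
    multi-stage setup this loss would be applied at each
    teacher->intermediate->student hop."""
    p = softmax([t / T for t in teacher_logits])
    q = softmax([s / T for s in student_logits])
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

For example, if an early head is uncertain but a later head is confident, inference exits at the later head; when student and teacher logits match, the distillation loss is zero.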

Takeaways, Limitations

Takeaways:
We improved the efficiency and scalability of video summarization through a lightweight vision language model.
We successfully achieved a balance between performance and efficiency through MSKD and EE techniques.
We achieved performance comparable to large-scale models at low computational cost.
We support follow-up research by making our code and datasets publicly available.
Limitations:
Applying the EE technique reduces the F1 score by 1.3 points.
Since the evaluation was conducted only on the TVSum dataset, further validation of generalization performance is needed.