
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Smart Routing for Multimodal Video Retrieval: When to Search What

Created by
  • Haebom

Author

Kevin Dela Rosa

Outline

ModaRoute is an LLM-based routing system that dynamically selects which modalities to search for each query in multimodal video retrieval. Dense text captioning reaches 75.9% Recall@5, but it requires expensive offline processing and misses important visual information in the 34% of clips whose scene text is not captured by ASR. ModaRoute analyzes query intent and predicts information needs, reaching 60.9% Recall@5 while reducing computational overhead by 41%. It uses GPT-4.1 to route each query across ASR (speech), OCR (on-screen text), and visual indexes, searching an average of 1.78 modalities per query instead of all 3.0 in an exhaustive search. An evaluation on 1.8 million video clips suggests that intelligent routing is a practical way to scale multimodal retrieval systems, cutting infrastructure costs while keeping effectiveness competitive for real-world deployments.
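The routing idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `keyword_router` is a hypothetical keyword-based stand-in for the GPT-4.1 router, the index callables and the max-score merge are invented, and none of it is the paper's actual implementation. The `overhead_reduction` helper only reproduces the arithmetic behind the reported 41% figure (1.78 modalities searched out of 3.0).

```python
from typing import Callable, Dict, List, Set, Tuple

MODALITIES = ("asr", "ocr", "visual")

# An index is modelled as a callable: query -> list of (clip_id, score).
Index = Callable[[str], List[Tuple[str, float]]]


def keyword_router(query: str) -> Set[str]:
    """Hypothetical stand-in for the LLM router: choose modalities from keyword cues."""
    q = query.lower()
    chosen: Set[str] = set()
    if any(cue in q for cue in ("says", "said", "talks about", "mentions")):
        chosen.add("asr")      # spoken content
    if any(cue in q for cue in ("sign", "caption", "title", "written", "logo")):
        chosen.add("ocr")      # on-screen text
    if any(cue in q for cue in ("wearing", "scene", "shot of", "color", "looks")):
        chosen.add("visual")   # visual appearance
    # With no clear cue, fall back to searching every modality.
    return chosen or set(MODALITIES)


def routed_search(
    query: str,
    indexes: Dict[str, Index],
    router: Callable[[str], Set[str]],
    k: int = 5,
) -> Tuple[List[str], Set[str]]:
    """Search only the routed indexes and merge hits by best score per clip."""
    selected = router(query)
    best: Dict[str, float] = {}
    for name in sorted(selected):
        for clip_id, score in indexes[name](query):
            # Assumed merge rule: keep the highest score seen for each clip.
            if clip_id not in best or score > best[clip_id]:
                best[clip_id] = score
    ranked = sorted(best, key=lambda clip: (-best[clip], clip))
    return ranked[:k], selected


def overhead_reduction(avg_modalities: float, total_modalities: float = 3.0) -> float:
    """Fraction of index searches avoided relative to searching every modality."""
    return 1.0 - avg_modalities / total_modalities
```

With the reported average of 1.78 modalities per query, `overhead_reduction(1.78)` gives roughly 0.407, which matches the stated 41% reduction under the assumption that cost scales with the number of indexes searched.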

Takeaways, Limitations

Takeaways:
Demonstrates that LLM-based intelligent routing can improve the efficiency and scalability of multimodal video retrieval systems.
Confirms that routing reduces computational overhead (by 41%) and lowers infrastructure costs.
Presents an effective modality selection strategy based on query intent analysis and information need prediction.
Offers a practical solution for real-world deployments.
Limitations:
Recall@5 is 15 percentage points lower (60.9%) than that of the dense text captioning baseline (75.9%).
Because the system depends heavily on GPT-4.1, its performance may be affected by the quality of the LLM's routing decisions.
System performance may be limited by the accuracy of ASR and OCR.
Generalization to diverse types of video data still needs to be verified.
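For readers unfamiliar with the metric behind the 60.9% and 75.9% figures, here is a small sketch of how Recall@K is commonly computed. It assumes exactly one relevant clip per query, which may differ from the paper's evaluation protocol; the function and variable names are illustrative only.

```python
from typing import Dict, List


def recall_at_k(
    ranked_results: Dict[str, List[str]],
    relevant: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant clip appears in the top-k results.

    Assumes a single relevant clip per query (an assumption of this sketch).
    """
    if not ranked_results:
        raise ValueError("ranked_results must contain at least one query")
    hits = sum(
        1
        for query_id, ranked in ranked_results.items()
        if relevant[query_id] in ranked[:k]
    )
    return hits / len(ranked_results)


def percentage_point_gap(baseline_pct: float, system_pct: float) -> float:
    """Absolute gap between two percentages, in percentage points."""
    return round(baseline_pct - system_pct, 1)
```

Under this definition, `percentage_point_gap(75.9, 60.9)` is the 15-point Recall@5 gap between dense captioning and routing noted in the limitations.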