Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Created by
  • Haebom

Authors

Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

Outline

This paper proposes SurgVidLM, a novel video language model for surgical scene understanding in robotic surgery. Unlike existing multimodal large language models (MLLMs), which focus on global understanding of surgical scenes, SurgVidLM targets fine-grained video reasoning for detailed analysis of surgical procedures. To support this, we build SVU-31K, a large-scale dataset of over 31,000 video-description pairs. We introduce the StageFocus mechanism, a two-stage process that first extracts the overall procedural context and then performs high-frequency local analysis guided by temporal cues. We also develop multi-frequency fusion attention, which integrates low- and high-frequency visual tokens to preserve important task-related details. Experimental results show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale. The code and dataset will be made public soon.
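The summary does not include implementation details, but the two-stage StageFocus idea and the fusion of low- and high-frequency visual tokens can be illustrated with a minimal PyTorch sketch. All module names, tensor shapes, and the cross-attention layout below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design): low-frequency tokens come from sparsely
# sampled frames covering the whole procedure (stage one of StageFocus);
# high-frequency tokens come from densely sampled frames of the clip selected
# by temporal cues (stage two).
import torch
import torch.nn as nn

class MultiFrequencyFusionAttention(nn.Module):
    """Hypothetical sketch: fuse low-frequency (global) and high-frequency
    (local) visual tokens with cross-attention before passing them to the LLM."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hi_freq: torch.Tensor, lo_freq: torch.Tensor) -> torch.Tensor:
        # High-frequency tokens query the global low-frequency context, so fine
        # local detail is preserved while procedural context is injected.
        fused, _ = self.cross_attn(query=hi_freq, key=lo_freq, value=lo_freq)
        return self.norm(hi_freq + fused)

if __name__ == "__main__":
    B, dim = 1, 768
    lo_freq = torch.randn(B, 32, dim)   # e.g., 32 sparsely sampled frame tokens
    hi_freq = torch.randn(B, 128, dim)  # e.g., 128 densely sampled clip tokens
    fusion = MultiFrequencyFusionAttention(dim)
    print(fusion(hi_freq, lo_freq).shape)  # torch.Size([1, 128, 768])
```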

Takeaways, Limitations

Takeaways:
Presents SurgVidLM, a novel video language model for surgical scene understanding in robotic surgery.
Designed to support both holistic understanding and detailed analysis of surgical procedures.
Builds SVU-31K, a large-scale surgical video dataset.
Improves fine-grained video reasoning through the StageFocus mechanism and multi-frequency fusion attention.
Outperforms state-of-the-art Vid-LLMs of comparable parameter scale.
Code and dataset are to be released publicly soon.
Limitations:
The current code and dataset are not publicly available.
Further validation of generalization performance in real surgical settings is needed.
The applicability of the model to various surgical types and environments needs to be evaluated.
A more detailed explanation of how the StageFocus mechanism and multi-frequency fusion attention work, and of their limitations, is needed.