Abstract
The increasing availability of surgical video data in modern operating rooms has created new opportunities for intelligent systems to assist in clinical decision-making, training, and quality control. However, effectively understanding surgical videos remains a formidable challenge due to several domain-specific characteristics: limited and privacy-constrained data, costly and expertise-intensive annotations, complex and variable temporal workflows, and the need for high-level reasoning to support clinical planning and judgment. This thesis addresses these challenges through two complementary directions: (1) developing domain-informed learning techniques for robust surgical visual representation under constrained conditions; and (2) enhancing the perceptual and reasoning capabilities of foundational video multimodal large language models (Video-MLLMs), thereby laying the groundwork for future applications in surgical video generalization and intelligent assistance.
In the first part of the thesis, we focus on leveraging surgical domain priors to improve the "seeing" capabilities of surgical video models. To mitigate data scarcity and the annotation bottleneck, we first introduce a knowledge-guided self-supervised learning framework. By distilling high-level semantic priors from general-domain pretrained models and incorporating them into a contrastive learning pipeline, the proposed method enhances representation learning in surgical settings with limited data and class diversity, improving generalization across downstream tasks even on small-scale, highly specialized surgical datasets. To further reduce annotation burdens, we propose the Uncertainty-Aware Temporal Diffusion (UATD) framework, which requires only sparse timestamp-level supervision. Exploiting the continuous and structured nature of surgical workflows, UATD diffuses labels from annotated timestamps to temporally adjacent, high-confidence frames, drastically reducing the cost of manual labeling while achieving reliable performance in frame-level surgical phase recognition.

To address the complex temporal dependencies in surgical procedures, we present the Segment-Attentive Hierarchical Consistency Network (SAHC) for phase recognition. SAHC identifies semantically consistent segments across time and aligns them with frame-level predictions via a segment-frame attention module; a consistency loss further improves robustness in ambiguous transitional regions. Additionally, we introduce SEDSkill, a clinically informed skill assessment framework designed for long-form thoracoscopic surgeries. SEDSkill detects skill-related surgical events and models nuanced variations in surgical performance through an event-aware module and a difference regression block. This work is supported by a newly curated dataset of mitral valve replacement procedures, enabling fine-grained and interpretable skill evaluation.

In the second part, we shift focus from task-specific modeling to enhancing the foundational capabilities of Video-MLLMs, particularly visual perception and hallucination mitigation, which are crucial for general-purpose video understanding and serve as a stepping stone toward future surgical embodied intelligence. Because large-scale surgical instruction-following data is not yet available, we first validate our methods in the autonomous driving domain, which shares key characteristics with surgery, such as long video sequences, real-time decision-making, and the need for detailed visual understanding; these similarities should allow our models to transfer to surgical videos once such data becomes available. To strengthen visual perception, we propose HiLM-D, a two-stream Video-MLLM framework that decouples the modeling of temporal dynamics from high-resolution spatial content. The temporal stream captures motion and procedural context, while the spatial stream preserves fine-grained anatomical and tool-related detail. This dual-stream design significantly improves the model's ability to perceive complex visual cues, paving the way for fine-grained recognition and spatial reasoning. To reduce hallucinations in generated responses, we propose PaMi-VDPO, an online preference learning framework built upon prompt-aware multi-instance learning. It dynamically selects data augmentations and optimizes video-text alignment end to end, without requiring expensive offline preference datasets.
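To make the preference-optimization idea concrete, the following is a minimal sketch of the generic DPO-style objective that video DPO methods build on. The function name and signature are hypothetical, and the sketch deliberately omits the prompt-aware multi-instance selection that distinguishes PaMi-VDPO; it only illustrates the underlying preference loss.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO-style preference loss (illustrative sketch only).

    Each argument is a tensor of per-sample summed log-probabilities of a
    response under either the trainable policy or a frozen reference model.
    The loss pushes the policy to prefer the grounded ('chosen') response
    over the hallucinated ('rejected') one, relative to the reference.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: small when chosen >> rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```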
As a result, the model generates more grounded, coherent, and reliable responses, an essential property for downstream reasoning in sensitive domains such as surgical assistance. Together, the contributions in this thesis provide a comprehensive and forward-looking approach to advancing surgical video understanding. Through domain-informed modeling and foundational improvements in Video-MLLMs, we not only address critical challenges in surgical video analysis today, but also establish a robust and extensible foundation for the development of intelligent, generalizable, and trustworthy systems in future surgical embodied intelligence.
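As a concrete illustration of the sparse timestamp-supervision idea described in the first part, the sketch below shows confidence-gated label propagation from annotated timestamps to neighbouring frames. It conveys the general mechanism that UATD-style diffusion relies on; the function name, inputs, and threshold are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def propagate_timestamp_labels(confidences, timestamps, labels, threshold=0.7):
    """Spread each timestamp's phase label to neighbouring frames while the
    model's per-frame confidence for that label stays above `threshold`.

    confidences: (T, C) array of per-frame class probabilities.
    timestamps:  annotated frame indices (one per labeled phase instance).
    labels:      class ids aligned with `timestamps`.
    Returns a length-T array of propagated labels (-1 = still unlabeled).
    """
    T = confidences.shape[0]
    frame_labels = np.full(T, -1, dtype=int)
    for t, c in zip(timestamps, labels):
        frame_labels[t] = c
        i = t - 1  # expand left while confident and unlabeled
        while i >= 0 and frame_labels[i] == -1 and confidences[i, c] >= threshold:
            frame_labels[i] = c
            i -= 1
        j = t + 1  # expand right while confident and unlabeled
        while j < T and frame_labels[j] == -1 and confidences[j, c] >= threshold:
            frame_labels[j] = c
            j += 1
    return frame_labels
```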
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Xiaomeng LI (Supervisor) |