
Think Before You Segment: Chain-of-Thought Reasoning Segmentation for Image and Video

  • Shiu-hong KAO

Student thesis: Master's thesis

Abstract

Reasoning segmentation is a challenging vision-language task that aims to output a segmentation mask in response to a complex, implicit, or even non-visual text query. Previous works incorporated multimodal Large Language Models (MLLMs) into segmentation models to approach this difficult problem. However, their segmentation quality often falls short in complex cases, particularly for out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with their surroundings, and for time-sensitive queries in video segmentation. In this thesis, we aim to inject powerful, diversely pre-trained MLLMs, such as GPT-4o [8] or Gemma3 [9], into the segmentation module to enhance its reasoning capability.

We first introduce ThinkFirst, a training-free framework that aids the segmentation process by generating a detailed, chain-of-thought (CoT) description of the input image for a language-instructed segmentation assistant. ThinkFirst allows users to easily interact with the segmentation agent through multimodal inputs, such as plain text and image scribbles, for successive refinement or communication. Building on ThinkFirst, we further propose ThinkDeeper, a progressive refinement process that uses GPT-4o to autonomously evaluate the correctness of a reasoning segmentation result and perform self-correction.
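The think-before-segment idea can be illustrated with a minimal sketch. All function names below are hypothetical placeholders: in a real system, `cot_describe` would call an MLLM such as GPT-4o, and `segment` would call a language-instructed segmentation assistant; here both are stubbed so the control flow is runnable.

```python
# Hypothetical sketch of a ThinkFirst-style pipeline (not the thesis code).
# The MLLM and segmentation calls are stubbed for illustration.

def cot_describe(image, query):
    """Stub for the chain-of-thought step: ask an MLLM for a detailed,
    query-focused description of the image before segmenting."""
    return f"Step-by-step description of {image} relevant to '{query}'"

def segment(image, prompt):
    """Stub for a language-instructed segmentation assistant."""
    return {"image": image, "prompt": prompt, "mask": "binary_mask"}

def think_first(image, query):
    # 1) Reason about the scene first (the "think" step).
    description = cot_describe(image, query)
    # 2) Condition the segmentation assistant on the query plus the
    #    CoT description, rather than the raw query alone.
    prompt = f"{query}\nContext: {description}"
    return segment(image, prompt)

result = think_first("kitchen.jpg", "the object used to cut bread")
```

The key design choice is that the CoT description is produced before, and independently of, the mask prediction, so any sufficiently capable MLLM can be swapped in without retraining the segmentation module.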

For reasoning video segmentation, we propose ThinkVideo, a novel, training-free framework that employs the zero-shot CoT capability of MLLMs to extract the temporal-semantic correlation between video frames. ThinkVideo analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses for each object a corresponding keyframe in which it can be clearly observed (temporal). We further extend ThinkVideo to online video streams, where the CoT is used to update the object of interest when a better-matching target emerges and becomes visible. We conduct extensive experiments on image and video object segmentation with both explicit and implicit queries. The results show that our approaches significantly outperform previous works in both settings, qualitatively and quantitatively.
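The keyframe-selection step above can be sketched as follows. This is an illustrative approximation, not the thesis implementation: the MLLM's zero-shot CoT judgment of whether a visible object matches the query is reduced to a numeric `visibility` score per frame, and the segmentation and mask-propagation calls are passed in as stubs.

```python
# Hedged sketch of ThinkVideo-style keyframe selection.
# `visibility`, `segment`, and `propagate` are hypothetical stand-ins.

def select_keyframe(frames, query, visibility):
    """Pick the frame where the queried object is most clearly visible."""
    # visibility(frame, query) stands in for the CoT reasoning step that
    # decides how well a visible object in the frame matches the query.
    scores = [visibility(f, query) for f in frames]
    return max(range(len(frames)), key=scores.__getitem__)

def segment_video(frames, query, visibility, segment, propagate):
    # 1) Choose a keyframe for the object of interest (single object here;
    #    the full method selects one keyframe per candidate object).
    k = select_keyframe(frames, query, visibility)
    # 2) Segment the keyframe, then propagate its mask to every frame.
    mask = segment(frames[k], query)
    return [propagate(mask, f) for f in frames]
```

For the online-stream extension described above, the same scoring step would be re-run as new frames arrive, replacing the current keyframe whenever a later frame scores higher for the query.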

Date of Award: 2025
Original language: English
Awarding Institution: The Hong Kong University of Science and Technology
Supervisor: Chi Keung TANG
