From Objects to Prompts: Towards Generalized Multimodal Foundation Model

  • Feng LI

Student thesis: Doctoral thesis

Abstract

The integration of vision and language has become pivotal in developing generalized multimodal foundation models, enabling AI systems to understand and interact with the world in increasingly human-like ways. This dissertation traces my research journey from object-centric perception to prompt-based multimodal understanding, with a focus on scalability, generalization, and real-world applicability. The foundation of this work lies in advancing object-centric perception through novel query designs for Transformer-based architectures (e.g., DN-DETR, DINO). Beyond closed-set recognition, we extend perception to open-vocabulary language prompts (OpenSeeD, Semantic-SAM) and visual prompts (DINOv), enabling versatile human-AI interaction in real-world scenarios. Furthermore, we generalize vision-language integration by leveraging large language models (LLMs). Our proposed LLaVA-Interleave unifies text, images, video, and 3D data through multimodal interleaved processing, achieving strong generalization and pushing the boundaries of multimodal AI.

Date of Award: 2025
Original language: English
Awarding Institution:
  • The Hong Kong University of Science and Technology
Supervisors: Lionel Ming-shuan Ni (Supervisor) & Heung-yeung Harry Shum (Supervisor)
