The integration of vision and language has become pivotal in developing generalized multimodal foundation models, enabling AI systems to understand and interact with the world in increasingly human-like ways. This dissertation traces my research journey from object-centric perception to prompt-based multimodal understanding, focusing on scalability, generalization, and real-world applicability. The foundation of this work lies in advancing object-centric perception through novel query designs for Transformer-based architectures (e.g., DN-DETR, DINO). Beyond closed-set recognition, we extend perception to open-vocabulary language prompts (OpenSeeD, Semantic-SAM) and visual prompts (DINOv), enabling versatile human-AI interaction in real-world scenarios. Furthermore, we generalize vision-language integration by leveraging large language models (LLMs). Our proposed LLaVA-Interleave unifies text, images, video, and 3D data through multimodal interleaved processing, achieving strong generalization across modalities and tasks and pushing the boundaries of multimodal AI.
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
| Supervisor | Lionel Ming-shuan Ni (Supervisor) & Heung-yeung Harry Shum (Supervisor) |
From Objects to Prompts: Towards Generalized Multimodal Foundation Model
LI, F. (Author). 2025
Student thesis: Doctoral thesis