Building systems capable of seamlessly learning from multiple modalities, such as vision and language, has been a longstanding aspiration in Artificial Intelligence (AI). As humans, we acquire new knowledge and skills through various sensory inputs, including visual signals and textual information. Models that emulate this behavior can potentially learn more effectively, as information from different modalities often complements and reinforces each other. More importantly, their capabilities can be greatly expanded to perform tasks that unimodal models cannot achieve. In this thesis, we investigate and present novel approaches for constructing robust and versatile vision-language (VL) models. In particular, we focus on how to efficiently teach language models (LMs) to comprehend visual data, as this is more resource-efficient than starting from vision models.

Despite the notable progress made by modern deep learning approaches, most prior work on VL learning focuses on task-specific fine-tuning, which generalizes poorly. While preliminary studies have explored VL pre-training to build general-purpose backbones, several fundamental problems remain. First, pre-training VL models from scratch is extremely computationally costly due to the added visual modality. Second, the data used for pre-training are mainly image-text pairs whose text components are short, succinct descriptions of the images. These short texts lead to insufficient language abilities and unsatisfactory downstream performance, especially on generative tasks. However, simply adding long-form text-only data does not help much, owing to the discrepancy between the unimodal and multimodal training losses. Furthermore, even with robust VL backbones, methods to improve their generalization (i.e., zero-shot performance on unseen datasets) and versatility, both critical to their applicability, remain largely unexplored.

To address these challenges, this thesis focuses on two research problems: 1) efficient construction of robust VL models with strong language abilities; and 2) improving the generalization of pre-trained VL models. Specifically, we propose three approaches. First, we present a task-specific vision guidance method that uses visual information to tame pre-trained LMs, enabling them to generate text from VL inputs. This method adapts text-only LMs to the VL domain without compromising their original language abilities. Next, we take a step further by introducing task-agnostic vision-language knowledge distillation (VLKD). VLKD bridges powerful pre-trained vision models and pre-trained LMs to leverage the strengths of both sides. Specifically, we adopt self-supervised learning on a modest amount of image-text data to align the two models, which is considerably more data- and time-efficient than pre-training from scratch. Finally, we introduce InstructBLIP, a simple yet novel VL instruction-tuning framework that trains VL models to follow instructions accurately. This approach dramatically improves generalization to unseen datasets and tasks, achieving state-of-the-art performance on a wide range of tasks, such as image captioning, visual reasoning, and both image- and video-based question answering. Qualitative studies further showcase its robustness and versatility in multi-turn dialogues.
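To make the alignment idea behind approaches like VLKD concrete, below is a minimal, hypothetical sketch of bridging a frozen vision encoder and a frozen pre-trained LM with a small trainable projection, trained on image-caption pairs. This illustrates the general recipe only, not the thesis's actual VLKD implementation: the module names, dimensions, visual-prefix design, and the single captioning loss are all assumptions introduced for this example.

```python
# Illustrative sketch (not the thesis implementation): align a frozen,
# CLIP-style vision encoder with a frozen pre-trained LM by training
# only a small projection ("adapter") on image-caption pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionToLMAdapter(nn.Module):
    """Trainable bridge: maps vision features into the LM's embedding space
    as a short sequence of "visual prefix" tokens. Dimensions are assumptions."""
    def __init__(self, vision_dim=768, lm_dim=1024, num_prefix_tokens=4):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.proj = nn.Linear(vision_dim, lm_dim * num_prefix_tokens)

    def forward(self, vision_feats):              # (B, vision_dim)
        prefix = self.proj(vision_feats)          # (B, lm_dim * k)
        return prefix.view(vision_feats.size(0), self.num_prefix_tokens, -1)

def alignment_step(vision_encoder, lm, adapter, images,
                   embed_tokens, caption_ids, optimizer):
    """One training step. Hypothetical interfaces: `vision_encoder(images)`
    returns pooled features, `embed_tokens` maps token ids to embeddings,
    and `lm(embeds)` returns next-token logits. Only the adapter is updated."""
    with torch.no_grad():                         # both backbones stay frozen
        vision_feats = vision_encoder(images)     # (B, vision_dim)
    prefix = adapter(vision_feats)                # (B, k, lm_dim)
    token_embeds = embed_tokens(caption_ids)      # (B, T, lm_dim)
    inputs = torch.cat([prefix, token_embeds], dim=1)
    logits = lm(inputs)                           # (B, k + T, vocab_size)
    # Next-token prediction on the caption, conditioned on the visual prefix:
    # the logit at position k + t predicts caption token t + 1.
    k = adapter.num_prefix_tokens
    loss = F.cross_entropy(
        logits[:, k:-1].reshape(-1, logits.size(-1)),
        caption_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the adapter receives gradients, the LM's original text-generation abilities are left intact, which is the property the abstract emphasizes for adapting text-only LMs to the VL domain.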
| Date of Award | 2023 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
| Supervisor | Pascale Ngan FUNG (Supervisor) |
Teaching language models to see: building robust and versatile vision-language models
DAI, W. (Author). 2023
Student thesis: Doctoral thesis