Abstract
Large-scale multimodal data, such as text-image datasets, plays a fundamental role in modern data-centric intelligent applications such as generative AI. Enabling humans to interact with the data is an essential way to increase understanding of, and trust in, the AI models built upon it. To this end, visualization is an essential technique for intuitive and interpretable interaction with the data. However, the complexity and heterogeneity of multimodal data necessitate proper feature representations to facilitate visual exploration. Neural embeddings of multimodal data can represent complex multimodal features as high-dimensional vectors, which can be projected to the visualization space through dimensionality reduction. Traditional embedding visualization techniques, however, primarily focus on single-modality embeddings, such as text embeddings or image embeddings alone. Such approaches neglect cross-modal relationships and hamper the interpretability of the visualization, especially for multimodal data. In addition, modern multimodal applications typically involve a wide spectrum of dataset sizes, ranging from small-scale fine-tuning to extremely large-scale evaluation or pretraining, yet previous embedding visualization research falls short in accommodating multi-scale exploration.

To address these problems, this thesis develops embedding-based visual exploration techniques for multimodal data that enable intuitive and transparent access to the data by human users. In response to the limitations of previous work, the author focuses on two important problems in multimodal embedding visualization: contextual projection and multi-scale exploration. Contextual projection addresses the problem of incorporating cross-modal semantics into the visualization, while multi-scale exploration enables users to perform flexible analysis at different granularities. To achieve these goals, this thesis comprises three studies: VISAtlas, ModalChorus, and AKRMap.
The first study, VISAtlas, develops an embedding-based visual exploration approach for medium-scale image datasets in the context of class text anchors. In this work, the author starts from single-modality image data in a specific domain: visualization collections. VISAtlas proposes a fixed-anchor contextual projection method that combines RadViz with pretrained CNNs to incorporate cross-modal semantics into image embeddings. This approach reveals the relationship between categorical text labels and image embeddings, resulting in a more stable and interpretable layout. The work further scales to medium-scale datasets through a combined density and sampling approach in RadViz.
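The fixed-anchor idea can be pictured with a short sketch: class text labels are placed as fixed anchors on a circle, and each image is pulled toward the anchors in proportion to class weights obtained from a pretrained CNN. The weighting scheme and names below are illustrative assumptions, not the exact VISAtlas formulation.

```python
# Minimal sketch of a RadViz-style fixed-anchor contextual projection,
# assuming per-image class probabilities from a pretrained CNN classifier.
# Names and the weighting scheme are illustrative, not VISAtlas's API.
import numpy as np

def radviz_project(class_probs: np.ndarray) -> np.ndarray:
    """Project N samples with K class probabilities onto a unit disk.

    Each class label acts as a fixed anchor on the circle; a sample is
    pulled toward anchors in proportion to its normalized class weights.
    """
    k = class_probs.shape[1]
    angles = 2 * np.pi * np.arange(k) / k                         # evenly spaced anchors
    anchors = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (K, 2)
    weights = class_probs / class_probs.sum(axis=1, keepdims=True)
    return weights @ anchors                                      # (N, 2) layout positions

# Toy usage: 3 images, 4 class anchors (softmax outputs of a CNN).
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.05, 0.80, 0.10, 0.05],
                  [0.25, 0.25, 0.25, 0.25]])
print(radviz_project(probs))
```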
The second study, ModalChorus, extends the contextual method from fixed anchors to free-anchor visualization for general multimodal embeddings, enabling more detailed visual probing of small subsets. Instead of adding cross-modal semantics to single-modality embeddings in the visualization space, this work targets foundational multimodal embeddings that are designed to align text and image features in the original high-dimensional representation. To faithfully represent both inter-modal and intra-modal relationships, ModalChorus develops a novel fusion-based dimensionality reduction method for contextual projection. This approach is particularly suited to small-scale, detailed probing of subsets, as it allows users to customize the contextual anchors and retrieve relevant samples.
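One way to picture such a joint projection is to embed text anchors and image points from a shared embedding space into a single 2D layout. The sketch below uses UMAP purely as a stand-in for the thesis's fusion-based dimensionality reduction; the embedding dimension and data are placeholders.

```python
# Stand-in sketch for jointly projecting text and image embeddings into one
# 2D space so that inter-modal and intra-modal distances both shape the layout.
# UMAP is a placeholder for ModalChorus's fusion-based projection; the
# 512-d CLIP-style dimension and random data are illustrative assumptions.
import numpy as np
import umap  # pip install umap-learn

def joint_projection(image_emb: np.ndarray, text_emb: np.ndarray):
    """Embed image points and free text anchors in the same 2D layout."""
    combined = np.vstack([image_emb, text_emb])   # (N + M, D), shared space
    layout = umap.UMAP(n_components=2, metric="cosine").fit_transform(combined)
    return layout[: len(image_emb)], layout[len(image_emb):]

# Usage with random stand-ins for 512-d joint embeddings:
rng = np.random.default_rng(0)
img_xy, txt_xy = joint_projection(rng.normal(size=(200, 512)),
                                  rng.normal(size=(5, 512)))
```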
The third study, AKRMap, proposes explicitly encoding cross-modal alignment in embedding visualization and develops a scalable projection method for large-scale multimodal datasets. Existing contextual projections suffer from severe limitations in revealing cross-modal alignment for large-scale datasets, as they can only show a limited number of texts and text-image relations. To mitigate this issue, the work proposes a dimensionality reduction method for unified text-image embeddings. By jointly learning a kernel regression of the metric distribution with the projection, the method explicitly encodes text-image alignment values in the visualization space. Its parametric formulation and contour-map overview enable scalability to million-scale datasets.
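The coupling of a parametric projection with kernel regression of an alignment metric can be sketched as follows; the network architecture, Gaussian kernel, bandwidth, and training loop are illustrative assumptions rather than AKRMap's exact formulation.

```python
# Sketch of coupling a parametric 2D projection with Nadaraya-Watson kernel
# regression of an alignment score (e.g., a CLIPScore-like metric). All
# hyperparameters and the toy data are illustrative assumptions.
import torch
import torch.nn as nn

class ParametricProjector(nn.Module):
    """Maps joint text-image embeddings to 2D coordinates."""
    def __init__(self, dim_in: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, 2))

    def forward(self, x):
        return self.net(x)

def kernel_regression(xy, scores, bandwidth=0.5):
    """Predict each point's score from its neighbors in the 2D layout."""
    d2 = torch.cdist(xy, xy).pow(2)
    w = torch.exp(-d2 / (2 * bandwidth ** 2))
    w = w - torch.diag_embed(torch.diagonal(w))   # exclude self-weight
    return (w @ scores) / w.sum(dim=1).clamp_min(1e-8)

# Toy training loop: optimize the projection so the kernel-smoothed score
# field in 2D matches the true per-sample alignment scores, which can then
# be rendered as a contour map over the layout.
emb = torch.randn(1024, 512)      # stand-in unified text-image embeddings
scores = torch.rand(1024)         # stand-in alignment scores
model = ParametricProjector(512)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    xy = model(emb)
    loss = nn.functional.mse_loss(kernel_regression(xy, scores), scores)
    opt.zero_grad()
    loss.backward()
    opt.step()
```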
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Wei ZENG (Supervisor) & Kang ZHANG (Supervisor) |