ModalChorus: Visual Probing and Alignment of Multi-Modal Embeddings via Modal Fusion Map

Yilin Ye, Shishi Xiao, Xingchen Zeng, Wei Zeng*

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

3 Citations (Scopus)

Abstract

Multi-modal embeddings, such as CLIP embeddings (the most widely used text-image embeddings), form the foundation for vision-language models. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.
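
The abstract does not give MFM's actual loss, so the following is only a toy sketch of the general idea of combining a metric objective with a nonmetric one when scoring a low-dimensional projection. It is not the authors' formulation. All names (`metric_stress`, `ordinal_hinge`, `combined_loss`, `weight`, `margin`) are hypothetical, the metric term is a plain MDS-style stress, the nonmetric term is a simple triplet-ordering hinge, and no parametric model (neural network) is trained here.

```python
import numpy as np

def pairwise_dists(points):
    """Euclidean distance matrix for an (n, d) array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def metric_stress(high_d, low_d):
    """Metric term: mean squared mismatch between original and projected distances."""
    iu = np.triu_indices_from(high_d, k=1)
    return float(((high_d[iu] - low_d[iu]) ** 2).mean())

def ordinal_hinge(high_d, low_d, margin=0.0):
    """Nonmetric term: penalise triplets (i, j, k) whose distance ordering in
    the projection contradicts the ordering in the original space."""
    n = high_d.shape[0]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Only consider distinct triplets where j is closer to i than k is.
                if len({i, j, k}) < 3 or not high_d[i, j] < high_d[i, k]:
                    continue
                total += max(0.0, low_d[i, j] - low_d[i, k] + margin)
                count += 1
    return total / max(count, 1)

def combined_loss(embeddings, projection, weight=0.5):
    """Weighted sum of the metric and nonmetric terms (weight is an assumption)."""
    high_d = pairwise_dists(embeddings)
    low_d = pairwise_dists(projection)
    return weight * metric_stress(high_d, low_d) + (1 - weight) * ordinal_hinge(high_d, low_d)
```

In this sketch, a projection that merely rescales the original points keeps every distance ordering intact, so the nonmetric term stays at zero while the metric term grows. That difference is what makes the two kinds of objective complementary.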

Original language: English
Pages (from-to): 294-304
Number of pages: 11
Journal: IEEE Transactions on Visualization and Computer Graphics
Volume: 31
Issue number: 1
Publication status: Published - 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1995-2012 IEEE.

Keywords

  • Multi-modal embeddings
  • data fusion
  • dimensionality reduction
  • interactive alignment
