Understanding and Generating Multi-Modalities: Advancing Efficient, Generalized, and Interactive Human-Centered AI

  • Junming CHEN

Student thesis: Doctoral thesis

Abstract

Human-centered AI has become a significant direction in AI development, prioritizing human needs, values, and capabilities, and aiming to augment rather than replace human abilities. This thesis advances human-centered AI by developing efficient, generalized, and interactive multimodal models that span from understanding the world to generating content within it. The research charts a course from foundational perception to expressive, interactive generation, culminating in a unified multimodal large language model that explores the internal synergy between the two capabilities.

First, we address fundamental challenges in visual perception and understanding, exploring novel methods to meet the human need for efficient, privacy-preserving, and generalizable models. We introduce a real-time streaming video denoising framework that efficiently improves video quality, providing cleaner perceptual input for human and machine viewers alike. We then tackle the critical need for generalization and privacy in AI systems: we propose a federated domain generalization method that trains robust, decentralized image recognition models without centralizing sensitive user data, a crucial step toward trustworthy AI in applications such as healthcare.
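To make the federated setting concrete, the following minimal sketch shows a FedAvg-style training loop, in which each client updates a model locally and only weights, never raw data, are aggregated on the server. The toy linear model, helper names, and hyperparameters are illustrative assumptions for this sketch; the thesis's federated domain generalization method goes beyond plain averaging.

```python
# Minimal FedAvg-style sketch: clients train locally, the server averages
# weights. Illustrative only; not the thesis's domain-generalization method.
import numpy as np

def local_update(weights, data, lr=0.01, steps=10):
    """One client's local SGD on a toy linear model; data stays on the client."""
    w = weights.copy()
    X, y = data
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
        w -= lr * grad
    return w

def fed_avg(client_datasets, dim, rounds=5):
    """Server loop: broadcast weights, collect local updates, average them."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        client_ws = [local_update(global_w, d) for d in client_datasets]
        global_w = np.mean(client_ws, axis=0)  # raw data is never shared
    return global_w

# Hypothetical usage: three clients, each holding its own (domain-shifted) data.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(3)]
model = fed_avg(clients, dim=4)
```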

The focus then shifts from understanding to generation, prioritizing the mission of human-AI interaction within human-centered AI. To enable AI avatars to interact and communicate with humans more naturally, we develop a diffusion-based model that generates synchronized, arbitrarily long, holistic 3D facial expressions and body gestures from speech in real time. This enables the creation of more expressive and believable digital agents, fostering more effective communication between humans and AI.
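For readers unfamiliar with this family of models, the sketch below shows a standard conditional diffusion (DDPM-style) sampling loop: starting from Gaussian noise, a network iteratively denoises a motion sequence conditioned on speech features. The denoiser interface, noise schedule, and conditioning signal here are hypothetical placeholders, not the architecture developed in the thesis.

```python
# Standard DDPM-style ancestral sampling, conditioned on audio features.
# The `denoiser` network and its signature are hypothetical placeholders.
import torch

def sample(denoiser, audio_feat, shape, T=50):
    """Reverse diffusion: denoise from pure noise, guided by speech features."""
    betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # initial Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.tensor([t]), audio_feat)  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])      # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # stochastic step
    return x  # e.g., a sequence of facial-expression and body-gesture parameters
```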

Finally, we unify understanding and generation capabilities within a single multimodal large language model to facilitate more effective human-AI collaboration. Whether this unification benefits both understanding and generation, however, had not been fully explored. We therefore investigate the native synergy between visual understanding and generation through a carefully designed experimental protocol. Our findings demonstrate that training a unified model simultaneously on understanding and generation data can yield more capable and versatile AI systems, with notable improvements in understanding. This result is an important motivation for further research on unified multimodal large language models.
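The core idea of such joint training can be sketched as interleaving the two objectives over one set of shared weights, as below. The model interface (understanding_loss, generation_loss) and the data iterators are hypothetical stand-ins used only to illustrate the recipe, not the thesis's actual experimental protocol.

```python
# Joint-training sketch: one shared model optimized on a mixture of
# understanding and generation batches. Interfaces are hypothetical.
import random

def joint_train_step(model, optimizer, und_batch, gen_batch, mix_ratio=0.5):
    """One step: sample an objective so shared weights see both signals."""
    if random.random() < mix_ratio:
        loss = model.understanding_loss(und_batch)  # e.g., visual question answering
    else:
        loss = model.generation_loss(gen_batch)     # e.g., image-token prediction
    optimizer.zero_grad()
    loss.backward()   # gradients update the same shared backbone either way
    optimizer.step()
    return loss
```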

Collectively, these contributions push the boundaries of human-centered AI by advancing more efficient, robust, privacy-preserving, expressive, and collaborative models for tasks spanning understanding and generation, making AI a more effective partner in human life.

Date of Award: 2025
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
Supervisors: Qifeng CHEN & Fangzhen LIN
