
CoCoGesture: Towards coherent co-speech 3D gesture generation in the wild

  • Xingqun Qi
  • Hengyuan Zhang
  • Yatian Wang
  • Jiahao Pan
  • Chen Liu
  • Muyi Sun
  • Wei Xue
  • Shanghang Zhang
  • Sirui Han*
  • Qifeng Liu
  • Yike Guo*

*Corresponding author for this work

Research output: Contribution to journal (Journal Article, peer-reviewed)

Abstract

Co-speech 3D gesture generation has seen tremendous progress in virtual avatar animation. However, because 3D speech-gesture data are limited in scale, existing methods often produce stiff and implausible gestures for unseen human speech inputs. To address this issue, we curate a large-scale co-speech gesture dataset covering diverse in-the-wild gesture types and propose a framework for generating plausible and diverse gestures from in-the-wild speech. Specifically, our curated dataset, GES-X, contains about 40M meshed postures across 4.3K speakers; it is 4× larger in word corpus and 15× more diverse in gesture motion distribution than the second-largest dataset, providing a solid foundation for diverse gesture generation. Because a gesture sequence should both look natural and generalize to in-the-wild speech audio, we propose CoCoGesture, a novel framework built upon a custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to build a large, generalizable gesture diffusion model by learning the abundant posture manifold provided by our GES-X dataset. To this end, we scale the unconditional diffusion model up to 1B parameters and pre-train it to serve as our gesture experts. At the finetuning stage, we present an audio ControlNet that incorporates the human voice as a condition prompt to guide gesture generation. Because the synthesized postures should be temporally coordinated with the audio rhythm while preserving vividness and diversity, we design a novel Mixture-of-Gesture-Experts (MoGE) block. In particular, the MoGE block adaptively fuses the audio embedding from the human speech with the gesture features from the pre-trained gesture experts through a routing mechanism. Extensive experiments demonstrate that our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
Our dataset will be released on the project page: https://mattie-e.github.io/GES-X/.
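
The abstract does not spell out how the MoGE routing works, so the following is only a generic illustration of soft mixture-of-experts fusion, not the authors' implementation. It assumes a linear router that scores each expert from the audio embedding, a softmax over those scores, and a weighted sum of per-expert gesture features. The names `moge_fuse`, `router_weights`, and `expert_feats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moge_fuse(audio_emb, expert_feats, router_weights):
    """Fuse per-expert gesture features using audio-conditioned gates.

    audio_emb:      audio embedding for one frame, length d_audio
    expert_feats:   K gesture feature vectors, one per expert (assumed
                    to come from pre-trained experts), each length d_gesture
    router_weights: K x d_audio matrix of a hypothetical linear router
    Returns (fused_features, gate_weights).
    """
    # One routing score per expert, computed from the audio embedding.
    gate = softmax(matvec(router_weights, audio_emb))
    dim = len(expert_feats[0])
    # Weighted sum of expert features, element by element.
    fused = [
        sum(g * feat[i] for g, feat in zip(gate, expert_feats))
        for i in range(dim)
    ]
    return fused, gate


if __name__ == "__main__":
    audio = [1.0, 0.0]
    experts = [[1.0, 0.0], [0.0, 1.0]]
    router = [[2.0, 0.0], [0.0, 0.0]]  # favours expert 0 for this audio
    fused, gate = moge_fuse(audio, experts, router)
    print("gate:", gate)
    print("fused:", fused)
```

In this sketch the gate depends only on the audio, so the audio signal decides how much each expert contributes per frame; a real system would typically learn the router jointly with the rest of the finetuned network.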

Original language: English
Article number: 103613
Journal: Information Fusion
Volume: 126
Publication status: Published - Feb 2026

Bibliographical note

Publisher Copyright:
© 2025 Elsevier B.V.

Keywords

  • Co-speech gesture generation
  • Diffusion models
  • Dataset construction
  • Mixture-of-experts
