Temporal consistent multi-view perception for robust embodied manipulation

Haoyuan Chen, Rushuai Yang, Junjie Zhang, Xiaoyu Wen, Yi Chen, Dengxiu Yu, Chenjia Bai*, Zhen Wang*

*Corresponding author for this work

Research output: Contribution to journal; Journal Article; peer-review

Abstract

Perception plays a critical role in developing effective policies for multi-task embodied manipulation, in which visual comprehension and task interpretation are essential. Existing methods typically rely on multi-view 2D representations for visual perception, aiming to build computation-friendly perception modules through imitation learning from extensive collections of high-quality robot trajectories. However, these approaches face significant challenges when expert demonstrations are limited or tasks are highly complex, resulting in inefficiencies. To address these limitations, we propose Temporal Consistent Multi-View Perception (TMVP), a sample-efficient two-stage framework for robot manipulation that integrates temporal information into multi-view representations. Specifically, TMVP employs contrastive learning to extract meaningful, task-relevant features from visual inputs, enhancing temporal consistency and alignment with task instructions. This results in visual representations that are temporally coherent and grounded in task trajectories, enabling the model to better comprehend and execute complex manipulation tasks from diverse perspectives. Experiments conducted on RLBench demonstrate that TMVP outperforms baseline models across a wide range of tasks, achieving superior multi-task performance and few-shot training efficiency. These results highlight the potential of TMVP as an efficient and effective solution for embodied manipulation.
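
The abstract does not give the form of TMVP's contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, to illustrate how contrastive learning can encourage temporally consistent features. It assumes that features from adjacent frames of the same trajectory form positive pairs and that the other frames in the batch act as negatives. The function name `info_nce`, the temperature value, and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive sample for anchors[i]; every other
    entry of `positives` serves as a negative for that anchor.
    This is a generic formulation, not the loss defined in the paper.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D features (assumed for illustration): frame t and frame t+1 of three clips.
frames_t = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frames_t_plus_1 = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]

aligned_loss = info_nce(frames_t, frames_t_plus_1)
shuffled_loss = info_nce(frames_t, frames_t_plus_1[::-1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each frame is paired with its true successor than when the pairing is shuffled, which is the property a temporal-consistency objective of this kind relies on. The same pattern could in principle pair visual features with instruction embeddings, but how TMVP actually does this is described in the article itself, not here.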

Original language: English
Article number: 112177
Journal: Pattern Recognition
Volume: 171
Early online date: 22 Jul 2025
Publication status: E-pub ahead of print - 22 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025

Keywords

  • Embodied manipulation
  • Multi-view
  • Imitation learning
  • Temporal consistency
