Abstract
Perception plays a critical role in developing effective policies for multi-task embodied manipulation, where visual comprehension and task interpretation are essential. Existing methods typically rely on multi-view 2D representations for visual perception, aiming to build computation-friendly perception modules through imitation learning from large collections of high-quality robot trajectories. However, these approaches struggle when expert demonstrations are limited or tasks are highly complex, resulting in poor sample efficiency. To address these limitations, we propose Temporal Consistent Multi-View Perception (TMVP), a sample-efficient two-stage framework for robot manipulation that integrates temporal information into multi-view representations. Specifically, TMVP employs contrastive learning to extract meaningful, task-relevant features from visual inputs, enhancing temporal consistency and alignment with task instructions. The resulting visual representations are temporally coherent and grounded in task trajectories, enabling the model to better comprehend and execute complex manipulation tasks from diverse perspectives. Experiments on RLBench demonstrate that TMVP outperforms baseline models across a wide range of tasks, achieving superior multi-task performance and few-shot training efficiency. These results highlight the potential of TMVP as an efficient and effective solution for embodied manipulation.
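The abstract does not specify the exact contrastive objective, so the following is only a minimal sketch of one common choice: an InfoNCE loss that pulls together embeddings of temporally adjacent frames from a trajectory while treating other trajectories in the batch as negatives. The function name, the `anchors`/`positives` tensors, and the `temperature` value are all hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def temporal_infonce_loss(anchors: torch.Tensor,
                          positives: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical temporal-consistency contrastive loss (InfoNCE).

    anchors:   (B, D) embeddings of frames at time t
    positives: (B, D) embeddings of the same trajectories at time t+1
    Other rows in the batch act as negatives.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature                  # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device) # diagonal = positives
    return F.cross_entropy(logits, labels)
```

Under this (assumed) formulation, minimizing the loss encourages features of neighboring timesteps in the same trajectory to agree, which is one plausible way to realize the temporal coherence the abstract describes.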
| Field | Value |
|---|---|
| Original language | English |
| Article number | 112177 |
| Journal | Pattern Recognition |
| Volume | 171 |
| Early online date | 22 Jul 2025 |
| DOIs | |
| Publication status | E-pub ahead of print - 22 Jul 2025 |
Bibliographical note
Publisher Copyright: © 2025
Keywords
- Embodied manipulation
- Multi-view
- Imitation learning
- Temporal consistency