InvPT++: inverted pyramid multi-task transformer for visual scene understanding

Abstract
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks. Previous studies typically process multi-task features locally and thus cannot effectively learn spatially global and cross-task interactions, which hampers a model's ability to fully exploit the consistency among tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer capable of modeling cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first use a transformer encoder to capture task-generic features for all tasks. We then design a transformer decoder that establishes spatial and cross-task interaction globally, and devise a novel UP-Transformer block that gradually increases the resolution of the multi-task features and establishes cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate the effectiveness of our proposal, which establishes significant state-of-the-art performance.
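To make the decoder idea concrete, here is a minimal NumPy sketch of one decoder stage in the spirit the abstract describes: each task's feature map is upsampled, then tokens from all tasks are concatenated so a single global self-attention models both spatial and cross-task interaction. All function names, the single-head attention, the nearest-neighbor upsampling, and the random weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, rng):
    # Single-head self-attention over ALL tokens at once,
    # so every spatial position of every task attends globally.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v

def upsample2x(fmap):
    # Nearest-neighbor upsampling: (H, W, C) -> (2H, 2W, C).
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def decoder_stage(task_feats, rng):
    """Hypothetical UP-Transformer-style stage: upsample each task's
    feature map 2x, flatten and concatenate tokens from all tasks,
    run one global self-attention, then split back per task."""
    up = [upsample2x(f) for f in task_feats]
    h, w, c = up[0].shape
    tokens = np.concatenate([f.reshape(-1, c) for f in up], axis=0)
    out = self_attention(tokens, rng)
    n = h * w
    return [out[i * n:(i + 1) * n].reshape(h, w, c) for i in range(len(up))]

rng = np.random.default_rng(0)
# e.g. three dense tasks: segmentation, depth, surface normals
feats = [rng.standard_normal((4, 4, 8)) for _ in range(3)]
stage1 = decoder_stage(feats, rng)   # 4x4 -> 8x8
stage2 = decoder_stage(stage1, rng)  # 8x8 -> 16x16
print([f.shape for f in stage2])     # [(16, 16, 8), (16, 16, 8), (16, 16, 8)]
```

Running two stages shows the "inverted pyramid": spatial resolution grows stage by stage while cross-task interaction is recomputed at each scale.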
| Field | Value |
|---|---|
| Original language | English |
| Article number | 10520818 |
| Pages (from-to) | 7493-7508 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
| Volume | 46 |
| Issue number | 12 |
| Early online date | 6 May 2024 |
| DOIs | |
| Publication status | Published - Dec 2024 |
Bibliographical note
Publisher Copyright: © 1979-2012 IEEE.
Keywords
- Dense prediction
- multi-task learning
- scene understanding
- transformer