Abstract
The vision transformer (ViT) architecture offers significant advantages in object detection tasks. However, two limitations hinder further performance gains. First, the ViT relies heavily on inflexible position embeddings, which leads to poor performance on images with complex semantic dependencies. Second, class imbalance in large-scale datasets can easily cause training instability and inference bias. To overcome these limitations, we propose a generalized concordant ViT scheme for object detection (GCViTDet). Specifically, we first introduce a relevance enhancement strategy (RES) into the encoder-decoder structure. RES is composed of the spatial enhanced position embeddings (SEPE) component, the cross multi-pooling attention (CMPA) component, and a global-local path. This strategy establishes semantic-rich dependencies through enhanced position embedding information and omni-feature representations. Subsequently, a bottom-up feature aggregation pathway is employed, which uses cross multi-pooling attention to improve the model’s capacity to capture semantic dependencies. This scheme enables the extraction of high-dimensional features that exhibit complex positional relationships. In addition, we propose a focal unified cross-entropy (FUCE) loss to address the class imbalance problem during training by introducing a uniform threshold that regulates the similarity between positive and negative samples of different classes. Compared with existing methods, GCViTDet not only captures more intricate positional relationships and semantic-rich dependencies but also alleviates the class imbalance problem. Experimental results on the challenging MS-COCO dataset show that GCViTDet consistently improves performance over state-of-the-art object detection baseline models.
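The abstract does not state the FUCE formulation, so the following is only a minimal sketch of the standard binary focal loss, which down-weights easy examples to counter class imbalance. Treating it as related to FUCE is an assumption based on the name alone; the uniform threshold described above is not modelled here. The function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative choices, not values taken from the paper.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for a single prediction.

    NOTE: this is NOT the paper's FUCE loss; it is the generic focal
    loss, shown only to illustrate how imbalance-aware weighting works.

    p: predicted probability of the positive class, in (0, 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (confidently correct) versus a hard positive (confidently wrong)
easy_negative = binary_focal_loss(0.05, 0)
hard_positive = binary_focal_loss(0.05, 1)
```

With these settings the hard positive contributes a far larger loss than the easy negative, which is the mechanism that keeps abundant easy samples from dominating training.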
| Original language | English |
|---|---|
| Article number | 11005511 |
| Pages (from-to) | 10616-10631 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 11 |
| Early online date | 15 May 2025 |
| DOIs | |
| Publication status | Published - Nov 2025 |
Bibliographical note
Publisher Copyright: © 1991-2012 IEEE.
Keywords
- Vision transformer
- Object detection
- Positional embeddings
- Class imbalanced learning
Fingerprint
Dive into the research topics of 'Generalized Concordant Vision Transformer with Masked Image Tokens for Object Detection'. Together they form a unique fingerprint.