Generalized Concordant Vision Transformer with Masked Image Tokens for Object Detection

Yu Quan, Dong Zhang, Jinhui Tang*

*Corresponding author for this work

Research output: Contribution to journal · Journal article · peer-review

Abstract

The vision transformer (ViT) architecture offers significant advantages for object detection. However, two limitations constrain its task performance. First, the ViT relies heavily on inflexible position embeddings, which leads to poor performance on images with complex semantic dependencies. Second, class imbalance in large-scale datasets can easily cause training instability and inference bias. To overcome these limitations, we propose a generalized concordant ViT scheme for object detection (GCViTDet). Specifically, we first introduce a relevance enhancement strategy (RES) into the encoder-decoder structure, composed of a spatial enhanced position embedding (SEPE) component, a cross multi-pooling attention (CMPA) component, and a global-local path. This strategy establishes semantically rich dependencies through enhanced position-embedding information and omni-feature representations. Subsequently, a bottom-up feature aggregation pathway employs cross multi-pooling attention to improve the model's capacity to capture semantic dependencies, enabling the extraction of high-dimensional features that exhibit complex positional relationships. In addition, we propose a focal unified cross-entropy (FUCE) loss that mitigates class imbalance during training by introducing a uniform threshold to regulate the similarity between positive and negative samples of different classes. Compared with existing methods, GCViTDet not only captures more intricate positional relationships and semantically rich dependencies but also alleviates the class-imbalance problem. Experimental results on the challenging MS-COCO dataset validate that GCViTDet consistently improves performance over state-of-the-art object detection baselines.
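
To make the cross multi-pooling attention idea concrete, the following is a minimal, hypothetical sketch of one common way such a component can be realized: queries come from the full-resolution feature map, while keys and values are the concatenation of the same map average-pooled at several spatial scales, so a single attention step mixes local and global context. The pool sizes, the use of `nn.MultiheadAttention`, and the fusion rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossMultiPoolingAttention(nn.Module):
    """Illustrative sketch of a cross multi-pooling attention block.

    Assumption: keys/values are built by pooling the input feature map
    at multiple scales; the actual CMPA design may differ.
    """
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 4)):
        super().__init__()
        # One adaptive average pool per scale (1x1, 2x2, 4x4 token grids).
        self.pools = nn.ModuleList(
            [nn.AdaptiveAvgPool2d(s) for s in pool_sizes])
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)   # queries: (B, H*W, C)
        # Keys/values: concatenated multi-scale pooled tokens, (B, sum(s*s), C).
        kv = torch.cat(
            [p(x).flatten(2).transpose(1, 2) for p in self.pools], dim=1)
        out, _ = self.attn(q, kv, kv)      # cross-attention over pooled tokens
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Because the key/value set is small (21 tokens for the pool sizes above) regardless of input resolution, this style of attention stays cheap while still letting every spatial location attend to global context.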
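The abstract does not give the FUCE formulation, so the following is only a rough, hypothetical sketch of the stated idea: a focal-style cross-entropy term (following the standard focal-loss recipe of Lin et al., 2017) combined with a uniform threshold `tau` that positive-class scores are pushed above and negative-class scores below, independent of class frequency. The function name, `gamma`, and `tau` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def focal_unified_ce(logits, targets, gamma=2.0, tau=0.5):
    # logits: (N, C) per-class scores; targets: (N,) integer labels.
    probs = logits.sigmoid()
    pos = F.one_hot(targets, num_classes=logits.size(-1)).bool()
    # Standard focal weighting: down-weight easy, well-classified samples.
    pt = torch.where(pos, probs, 1.0 - probs)
    focal = (1.0 - pt).pow(gamma) * -pt.clamp(min=1e-6).log()
    # Illustrative "uniform threshold" tau (assumption, not the paper's
    # exact rule): positives are penalized while below tau, negatives
    # while above tau, the same for every class.
    margin = torch.where(pos, (tau - probs).clamp(min=0.0),
                         (probs - tau).clamp(min=0.0))
    return (focal + margin).mean()
```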

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
DOIs
Publication status: Accepted/In press - 15 May 2025

Bibliographical note

Publisher Copyright:
© 1991-2012 IEEE.

Keywords

  • Class imbalanced learning
  • Object detection
  • Positional embeddings
  • Vision transformer
