TY - JOUR
T1 - Generalized Concordant Vision Transformer with Masked Image Tokens for Object Detection
AU - Quan, Yu
AU - Zhang, Dong
AU - Tang, Jinhui
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025/5/15
Y1 - 2025/5/15
N2 - The vision transformer (ViT) architecture offers significant advantages in object detection tasks. However, several limitations hinder its performance. First, the ViT relies heavily on inflexible position embeddings, which causes poor performance when processing images with complex semantic dependencies. Second, class imbalance in large-scale datasets can easily cause training instability and inference bias. To overcome these limitations, we propose a generalized concordant ViT scheme for object detection (GCViTDet). Specifically, we first introduce a relevance enhancement strategy (RES) into the encoder-decoder structure, which is composed of a spatial enhanced position embeddings (SEPE) component, a cross multi-pooling attention (CMPA) component, and a global-local path. This strategy establishes semantically rich dependencies through enhanced position embedding information and omni-feature representations. Subsequently, a bottom-up feature aggregation pathway is employed, utilizing cross multi-pooling attention to improve the model’s capacity to capture semantic dependencies. This scheme enables the extraction of high-dimensional features that exhibit complex positional relationships. In addition, we propose a focal unified cross-entropy (FUCE) loss to address the class imbalance problem during training by introducing a uniform threshold to regulate the similarity between positive and negative samples of different classes. Compared with existing methods, GCViTDet not only captures more intricate positional relationships and semantically rich dependencies but also alleviates the class-imbalance problem. Experimental results on the challenging MS-COCO dataset validate that GCViTDet consistently improves performance over state-of-the-art object detection baseline models.
AB - The vision transformer (ViT) architecture offers significant advantages in object detection tasks. However, several limitations hinder its performance. First, the ViT relies heavily on inflexible position embeddings, which causes poor performance when processing images with complex semantic dependencies. Second, class imbalance in large-scale datasets can easily cause training instability and inference bias. To overcome these limitations, we propose a generalized concordant ViT scheme for object detection (GCViTDet). Specifically, we first introduce a relevance enhancement strategy (RES) into the encoder-decoder structure, which is composed of a spatial enhanced position embeddings (SEPE) component, a cross multi-pooling attention (CMPA) component, and a global-local path. This strategy establishes semantically rich dependencies through enhanced position embedding information and omni-feature representations. Subsequently, a bottom-up feature aggregation pathway is employed, utilizing cross multi-pooling attention to improve the model’s capacity to capture semantic dependencies. This scheme enables the extraction of high-dimensional features that exhibit complex positional relationships. In addition, we propose a focal unified cross-entropy (FUCE) loss to address the class imbalance problem during training by introducing a uniform threshold to regulate the similarity between positive and negative samples of different classes. Compared with existing methods, GCViTDet not only captures more intricate positional relationships and semantically rich dependencies but also alleviates the class-imbalance problem. Experimental results on the challenging MS-COCO dataset validate that GCViTDet consistently improves performance over state-of-the-art object detection baseline models.
KW - Class imbalanced learning
KW - Object detection
KW - Positional embeddings
KW - Vision transformer
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001606669700019
UR - https://openalex.org/W4410394497
UR - https://www.scopus.com/pages/publications/105005532037
U2 - 10.1109/TCSVT.2025.3570504
DO - 10.1109/TCSVT.2025.3570504
M3 - Journal Article
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -