Abstract
Grounded Situation Recognition (GSR) involves interpreting complex events in images by identifying key verbs (e.g., sketching), detecting related semantic roles (e.g., AGENT is man), and localizing noun entities with bounding boxes. Because verbs and noun entities are semantically correlated, existing methods predominantly focus on leveraging these correlations to refine verb predictions using noun entities, or vice versa. However, these approaches often disregard the long-tailed distributions inherent in the training dataset, resulting in biased predictions and poor accuracy when recognizing less frequent noun entities and verbs. To tackle this issue, we introduce a novel KnOwledge Integration (KOI) strategy that alleviates the bias by merging two distinct types of knowledge: general knowledge and GSR-specific downstream knowledge. Specifically, the integration employs vision-language models (VLMs), e.g., CLIP, to extract expansive, contextual general knowledge, which is potentially beneficial for tail category recognition, and harnesses pre-trained GSR models for detailed, domain-focused downstream knowledge, which is typically advantageous for head category recognition. To bridge the gap between general and task-specific knowledge, we devise a trade-off weighting strategy that merges these two sources, ensuring a robust prediction that is not extremely biased towards either head or tail categories. KOI is model-agnostic, which facilitates its integration into various GSR frameworks and demonstrates its universality. Extensive experimental results on the SWiG dataset demonstrate that KOI significantly outperforms existing methods, establishing new state-of-the-art performance across multiple metrics.
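The abstract does not give the form of the trade-off weighting, so the following is only a minimal sketch of the general idea, assuming the two knowledge sources are merged as a convex combination of their class probabilities. The function `integrate_knowledge`, the weight `alpha`, and the example logits are hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_knowledge(general_logits, downstream_logits, alpha=0.5):
    """Blend two per-class score vectors into one probability vector.

    ASSUMPTION: a simple convex combination stands in for the paper's
    trade-off weighting, whose exact form the abstract does not state.

    general_logits    -- scores from a general model (e.g., a VLM such as CLIP)
    downstream_logits -- scores from a pre-trained GSR model
    alpha             -- hypothetical weight on the general knowledge, in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(general_logits) != len(downstream_logits):
        raise ValueError("both score vectors must cover the same classes")
    p_general = softmax(general_logits)
    p_downstream = softmax(downstream_logits)
    return [alpha * g + (1.0 - alpha) * d
            for g, d in zip(p_general, p_downstream)]


# Illustrative scores for two classes: index 0 = head class, index 1 = tail class.
general = [0.0, 2.0]      # the general model favors the tail class
downstream = [1.0, 0.0]   # the downstream model favors the head class
merged = integrate_knowledge(general, downstream, alpha=0.5)
predicted_class = max(range(len(merged)), key=merged.__getitem__)
```

Under this assumption, `alpha = 0` reduces to the downstream GSR model alone and `alpha = 1` to the general model alone; intermediate values trade head-class accuracy against tail-class recall.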
| Original language | English |
|---|---|
| Article number | 111766 |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Pattern Recognition |
| Volume | 167 |
| Early online date | 5 May 2025 |
| Publication status | Published - Nov 2025 |
Bibliographical note
Publisher Copyright: © 2025 Elsevier Ltd
Keywords
- Grounded Situation Recognition
- Vision-Language Models
- Knowledge Integration