
Knowledge Integration for Grounded Situation Recognition

  • Jiaming Lei
  • Sijing Wu
  • Lin Li*
  • Lei Chen
  • Jun Xiao
  • Yi Yang
  • Long Chen

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Grounded Situation Recognition (GSR) involves interpreting complex events in images by identifying key verbs (e.g., sketching), detecting related semantic roles (e.g., AGENT is man), and localizing noun entities with bounding boxes. Because verbs and noun entities are semantically correlated, existing methods predominantly focus on leveraging these correlations to refine verb predictions using noun entities, or vice versa. However, these approaches often disregard the long-tailed distributions inherent in the training dataset, resulting in biased predictions and poor accuracy when recognizing less frequent noun entities and verbs. To tackle this issue, we introduce a novel KnOwledge Integration (KOI) strategy that alleviates the bias by merging two distinct types of knowledge: general knowledge and GSR-specific downstream knowledge. Specifically, the integration employs vision-language models (VLMs), e.g., CLIP, to extract expansive, contextual general knowledge, which is potentially beneficial for tail category recognition, and harnesses pre-trained GSR models for detailed, domain-focused downstream knowledge, which is typically advantageous for head category recognition. To bridge the gap between general and specific knowledge, we devise a trade-off weighting strategy that merges these two sources, yielding a robust prediction that is not heavily biased towards either head or tail categories. KOI's model-agnostic nature allows it to be integrated into various GSR frameworks, demonstrating its universality. Extensive experimental results on the SWiG dataset demonstrate that KOI significantly outperforms existing methods, establishing new state-of-the-art performance across multiple metrics.
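
The abstract does not give the form of the trade-off weighting, so the following is only a minimal sketch of the general idea, assuming a simple convex combination of two probability distributions over the same label set. The function names, the `alpha` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_predictions(general_scores, downstream_scores, alpha=0.5):
    """Blend two score vectors over the same label set.

    ASSUMPTION: a convex combination is used purely for illustration; the
    actual KOI weighting strategy is not specified in the abstract.
    alpha = 1.0 trusts only the general (VLM-style) scores,
    alpha = 0.0 trusts only the downstream (GSR-model-style) scores.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(general_scores) != len(downstream_scores):
        raise ValueError("both score vectors must cover the same labels")
    p_general = softmax(general_scores)
    p_downstream = softmax(downstream_scores)
    return [
        alpha * g + (1.0 - alpha) * d
        for g, d in zip(p_general, p_downstream)
    ]


if __name__ == "__main__":
    # Hypothetical verb labels and made-up scores, for illustration only.
    verbs = ["sketching", "painting", "writing"]
    general = [2.0, 0.5, 0.1]      # stand-in for VLM similarity scores
    downstream = [0.2, 2.5, 0.3]   # stand-in for GSR model logits
    merged = merge_predictions(general, downstream, alpha=0.5)
    for verb, p in zip(verbs, merged):
        print(f"{verb}: {p:.3f}")
```

Because each input is normalized before blending, the merged vector is itself a probability distribution, and moving `alpha` between 0 and 1 shifts the prediction between the two knowledge sources.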

Original language: English
Article number: 111766
Pages (from-to): 1-12
Number of pages: 12
Journal: Pattern Recognition
Volume: 167
Early online date: 5 May 2025
Publication status: Published - Nov 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier Ltd

Keywords

  • Grounded Situation Recognition
  • Vision-Language Models
  • Knowledge Integration
