Overcoming Modality Bias in Question-Driven Sign Language Video Translation

Liqing Gao, Fan Lyu, Peng Shi, Lei Zhu, Junfu Pu, Liang Wan*, Wei Feng*

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Question-Driven Sign Language Translation (QSLT) addresses the challenge of translating sign language with the aid of pertinent questions in question-answering contexts. However, the pronounced modality gap between question text and sign video poses a predicament: the model tends to over-rely on the questions to generate translations, thereby neglecting valuable visual cues. To tackle this issue, this paper presents a Gloss-Bridged Translator (GBT), which introduces sign gloss as an intermediary to establish semantic connections between questions and videos. By leveraging gloss, visual features are transformed into textual counterparts, mitigating the modality imbalance between the two representations. Moreover, a cross-modal contrastive learning strategy is employed to strengthen both the global contextual relevance and the local semantic alignment between questions and sign language. The proposed method is validated through extensive experiments on the proposed QSL dataset and other public sign language datasets. The results demonstrate the efficacy of integrating questions into sign language translation: GBT yields remarkable improvements over prevailing SLT methods, attesting to its effectiveness and rationale. Our code and dataset are available at https://github.com/glq-1992/QSL.
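The cross-modal contrastive strategy mentioned in the abstract can be illustrated with a generic symmetric InfoNCE-style loss between question embeddings and gloss-mediated video embeddings. This is a minimal sketch, not the paper's actual formulation: the function name, the batch-diagonal positive pairing, and the temperature value are all assumptions introduced here for illustration.

```python
import numpy as np

def cross_modal_contrastive_loss(q_emb, v_emb, temperature=0.07):
    """Illustrative symmetric InfoNCE loss between question embeddings
    (q_emb) and gloss-mediated video embeddings (v_emb), shape (B, D).
    Matched question/video pairs are assumed to share a row index."""
    # L2-normalize so dot products become cosine similarities
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    v = v_emb / np.linalg.norm(v_emb, axis=1, keepdims=True)
    logits = q @ v.T / temperature       # (B, B) similarity matrix
    idx = np.arange(len(q))              # positives lie on the diagonal

    def ce_diag(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average the question→video and video→question directions
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

In such a setup, minimizing the loss pulls each question toward its matching sign video in the shared embedding space while pushing it away from the other videos in the batch, which is one common way to encourage the global and local alignment the abstract describes.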

Original language: English
Pages (from-to): 11724-11738
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 11
DOIs
Publication status: Published - 2024

Bibliographical note

Publisher Copyright:
© 1991-2012 IEEE.

Keywords

  • Modality bias
  • gloss-bridged translator
  • modality complexity alignment
  • question-driven sign language dataset
  • sign language translation
