Exploring Clean Label Backdoor Attacks and Defense in Language Models

Shuai Zhao, Luu Anh Tuan, Jie Fu, Jinming Wen*, Weiqi Luo

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

32 Citations (Scopus)

Abstract

Despite being widely applied, pre-trained language models have been proven vulnerable to backdoor attacks. Backdoor attacks introduce targeted vulnerabilities into models by poisoning a subset of training samples through trigger injection and label modification. Traditional textual backdoor attacks suffer from two flaws: the triggers produce unnatural language expressions, and the poisoned samples carry incorrect labels. These flaws reduce the stealthiness of the attack and make it easy for defense models to detect. In this study, we introduce Cbat, a novel and efficient method for performing a clean-label backdoor attack based on text style, which requires no external trigger and keeps the poisoned samples correctly labeled. Specifically, we develop a sentence-rewriting model that leverages the powerful few-shot learning capability of prompt tuning to generate clean-label poisoned samples. Cbat then injects a text style, as an abstract trigger, into the victim model through these poisoned samples. We also introduce an algorithm for defending against backdoor attacks, named CbatD, which effectively removes poisoned samples by locating those with the lowest training loss and calculating feature relevance. Experiments on text classification tasks demonstrate that Cbat and CbatD achieve overall competitive performance in textual backdoor attack and defense, respectively. Notably, Cbat attains leading results on the trigger-free clean-label backdoor attack benchmark.
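The defense sketched in the abstract combines two signals: poisoned samples tend to be memorized early (so their training loss drops unusually low), and they share a common stylistic feature pattern. The snippet below is a minimal, hypothetical illustration of that two-stage filtering idea; it is not the authors' CbatD algorithm, and the function name, thresholds, and use of cosine similarity as the "feature relevance" measure are all assumptions for demonstration.

```python
import numpy as np

def flag_suspicious(losses, features, loss_frac=0.1, sim_thresh=0.8):
    """Hypothetical sketch of a CbatD-style two-stage filter.

    Stage 1: take the fraction of training samples with the lowest
             per-sample loss (low loss = likely memorized/poisoned).
    Stage 2: among those candidates, keep only samples whose feature
             vectors are highly similar to the candidate centroid
             (a stand-in for "feature relevance").
    """
    n = len(losses)
    k = max(1, int(n * loss_frac))
    candidates = np.argsort(losses)[:k]          # lowest-loss indices

    cand_feats = features[candidates]
    centroid = cand_feats.mean(axis=0)
    # cosine similarity of each candidate to the candidate centroid
    sims = (cand_feats @ centroid) / (
        np.linalg.norm(cand_feats, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    return candidates[sims >= sim_thresh]
```

In practice the features would be hidden-state representations from the victim model, and flagged samples would be removed or relabeled before retraining.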

Original language: English
Pages (from-to): 3014-3024
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
DOIs
Publication status: Published - 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2014 IEEE.

Keywords

  • Deep learning
  • backdoor attack
  • clean-label
  • defense
  • pre-trained language model
