Customized Transformer Adapter With Frequency Masking for Deepfake Detection

Zenan Shi, Haipeng Chen, Yixin Jia, Dong Zhang*, Wei Lu, Xun Yang

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

Abstract

The rapid advancement of AI-generated content has intensified concerns over deepfakes due to increasingly sophisticated and visually convincing forgeries. In response, the pre-trained Vision Transformer (ViT) model has become a de facto choice for deepfake detection, thanks to its powerful learning capability. Despite the favorable results achieved by existing ViT-based methods, they have inherent limitations, such as overfitting to single forgery patterns or placing excessive emphasis on dominant forgery regions, that could result in suboptimal performance in scenarios with continuously evolving forgery techniques. In this paper, we propose CUTA, a simple yet effective deepfake detection paradigm that utilizes ViT adapters as the medium and fully exploits the spatial- and frequency-domain features of given images to overcome the limitations of existing methods. Specifically, CUTA focuses on frequency domain masking within the input space, which obscures parts of the high-frequency image to intensify the training challenge while preserving subtle forgery cues in the frequency domain to facilitate comprehensive forgery representations. Furthermore, we propose two task-customized modules within the ViT model, i.e., the texture enhancement module and the multi-scale perceptron module, to seamlessly integrate local texture and rich contextual features. These two modules ensure an organic interaction between the task-specific forgery patterns and general semantic features within the pre-trained ViT framework. The experimental results on several publicly available benchmarks demonstrate CUTA’s superiority in performance, with significant advantages in both cross-dataset and cross-manipulation scenarios.
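
As a rough illustration of what masking in the frequency domain can look like, the sketch below randomly zeroes a share of the high-frequency Fourier coefficients of a grayscale image and transforms the result back to pixel space. It is not the CUTA implementation, which the abstract does not specify: the function name `mask_high_frequencies`, the circular low-frequency cutoff `radius_frac`, the random drop rate `mask_ratio`, and the use of NumPy's FFT are all assumptions made for this example.

```python
import numpy as np


def mask_high_frequencies(image, radius_frac=0.25, mask_ratio=0.5, seed=0):
    """Randomly zero a share of the high-frequency FFT coefficients.

    Hypothetical helper for illustration only; `image` is a 2-D array.
    Returns the masked image and the boolean map of dropped coefficients.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape

    # Centered spectrum: after fftshift the zero frequency sits at (h//2, w//2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every coefficient from the spectrum center.
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)

    # Coefficients outside the cutoff radius are treated as "high frequency".
    high = dist > radius_frac * min(h, w)

    # Drop each high-frequency coefficient with probability mask_ratio;
    # low frequencies are never touched.
    drop = high & (rng.random((h, w)) < mask_ratio)
    spectrum[drop] = 0

    # The random mask is not conjugate-symmetric, so the inverse transform
    # is complex in general; keeping the real part is a simplification.
    masked = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    return masked, drop


if __name__ == "__main__":
    img = np.random.default_rng(1).random((32, 32))
    out, drop = mask_high_frequencies(img)
    print(out.shape, int(drop.sum()))
```

In this sketch, raising `mask_ratio` removes more high-frequency content and so makes the training input harder, while the untouched low-frequency band keeps the overall image structure intact.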

Original language: English
Pages (from-to): 5904-5918
Number of pages: 15
Journal: IEEE Transactions on Information Forensics and Security
Volume: 20
Publication status: Published - 2025

Bibliographical note

Publisher Copyright:
© 2005-2012 IEEE.

Keywords

  • Deepfake detection
  • ViT adapter
  • frequency domain masking
  • vision transformer
