Abstract
The rapid advancement of AI-generated content has intensified concerns over deepfakes, as forgeries become increasingly sophisticated and visually convincing. In response, the pre-trained Vision Transformer (ViT) has become a de facto choice for deepfake detection, thanks to its powerful learning capability. Although existing ViT-based methods achieve favorable results, they have inherent limitations, such as overfitting to single forgery patterns or placing excessive emphasis on dominant forgery regions, that can lead to suboptimal performance when forgery techniques evolve continuously. In this paper, we propose CUTA, a simple yet effective deepfake detection paradigm that uses ViT adapters as the medium and fully exploits the spatial- and frequency-domain features of given images to overcome these limitations. Specifically, CUTA applies frequency-domain masking in the input space, which obscures parts of the high-frequency image content to make training more challenging while preserving subtle forgery cues in the frequency domain, facilitating comprehensive forgery representations. Furthermore, we propose two task-customized modules within the ViT model, i.e., the texture enhancement module and the multi-scale perceptron module, to seamlessly integrate local texture and rich contextual features. These two modules ensure an organic interaction between the task-specific forgery patterns and general semantic features within the pre-trained ViT framework. Experimental results on several publicly available benchmarks demonstrate CUTA's superior performance, with significant advantages in both cross-dataset and cross-manipulation scenarios.
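The abstract does not specify how the frequency-domain masking is implemented. As a rough illustration of the general idea only, the following NumPy sketch randomly zeroes a fraction of the high-frequency Fourier coefficients of a single-channel image. The function name `mask_high_frequencies`, the parameters `radius_ratio` and `mask_ratio`, the radial cutoff, and the random-drop scheme are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def mask_high_frequencies(image, radius_ratio=0.25, mask_ratio=0.5, seed=None):
    """Randomly zero a fraction of the high-frequency FFT coefficients.

    Hypothetical helper for illustration; not the CUTA implementation.
    `image` is a 2-D (grayscale) array. Coefficients farther than
    `radius_ratio * min(height, width)` from the spectrum centre count as
    "high frequency", and each is dropped with probability `mask_ratio`.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape

    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every coefficient from the spectrum centre.
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)

    # Select high-frequency coefficients and drop a random subset of them.
    high = dist > radius_ratio * min(h, w)
    drop = high & (rng.random((h, w)) < mask_ratio)
    spectrum[drop] = 0

    # Back to the spatial domain. Random masking breaks conjugate symmetry,
    # so the inverse transform is complex; keep only the real part.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real
```

In a training pipeline, a function like this would typically be applied per channel as a data augmentation step before the image reaches the detector. Low-frequency content is left untouched, so the overall image structure survives while part of the fine detail is removed.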
| Original language | English |
|---|---|
| Pages (from-to) | 5904-5918 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Information Forensics and Security |
| Volume | 20 |
| DOIs | |
| Publication status | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2005-2012 IEEE.
Keywords
- Deepfake detection
- ViT adapter
- frequency domain masking
- vision transformer
Fingerprint
Dive into the research topics of 'Customized Transformer Adapter With Frequency Masking for Deepfake Detection'. Together they form a unique fingerprint.