Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier

Qishan Zhang; Shuangbing Wen; Tao Hu

Venue: ACM International Conference on Multimedia (ACM MULTIMEDIA) 2024
Recognition: Most Influential ACM MULTIMEDIA 2024 Paper (Rank No. 10)
Edition: 2026-03
Impact factor: 3
Certificate ID: 43aabfe514433b17

Abstract

The outputs of generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), are frequently indistinguishable from genuine samples, making it difficult for individuals to tell real content from synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voice signals poses significant risks to privacy and security. In deepfake audio detection, most models that achieve high detection accuracy currently rely on self-supervised pre-trained models. However, as deepfake audio generation algorithms continue to develop, maintaining high discrimination accuracy against new algorithms grows more challenging. To improve sensitivity to deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. Specifically, the pre-trained XLS-R allows our model to extract diverse audio features from its various layers, each providing distinct discriminative information. With the SLS classifier, our model captures sensitive contextual information across the different layers of audio features and uses it effectively for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with an Equal Error Rate (EER) of 1.92% on the ASVspoof 2021 DF dataset and 7.46% on the In-the-Wild dataset. Code and data are available at https://github.com/QiShanZhang/SLSforADD.
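
The results above are reported as Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how this metric is commonly computed, not the authors' evaluation code. It assumes that a higher score means "more likely bona fide"; the function name `compute_eer` and the toy scores are hypothetical, and official evaluation scripts may interpolate between thresholds rather than pick the nearest one.

```python
from typing import Sequence, Tuple


def compute_eer(
    bonafide_scores: Sequence[float], spoof_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (eer, threshold) for a detector whose scores are higher for bona fide audio.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    """
    if len(bonafide_scores) == 0 or len(spoof_scores) == 0:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap = None
    best_eer = 0.0
    best_threshold = thresholds[0]
    for t in thresholds:
        # False acceptance rate: spoofed trials accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials rejected as spoofed.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # With finitely many trials the two rates may never be exactly
            # equal, so report their mean at the closest crossing point.
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, t
    return best_eer, best_threshold


# Toy example with made-up scores (not data from the paper).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

In the toy example, one of four spoofed trials scores above the threshold and one of four bona fide trials scores below it, so both error rates are 25%. A lower EER indicates better separation between genuine and synthetic audio.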
