Abstract

With the development of deep neural synthesis methods, speech forgery techniques based on text-to-speech (TTS) and voice conversion (VC) pose a serious threat to automatic speaker verification (ASV) systems. Some studies show that the attack success rate of deep synthetic speech against ASV systems can reach about 90%. Existing detection methods improve generalization to known forgery methods by using large amounts of training data, but their performance and robustness against unknown methods remain poor. We propose an anti-spoofing scheme based on one-class classification for detecting unknown synthetic speech. We implement deep support vector data description to capture the characteristics of bonafide speech, and introduce an autoencoder structure to enhance detection performance. The proposed method is trained only on native speech, which reduces reliance on large amounts of fake speech. Our method achieves an equal error rate of 8.10% on the evaluation set of the ASVspoof 2019 challenge and outperforms other state-of-the-art methods. In the generalization test, the proposed method reaches an equal error rate of about 15% on the “In-the-wild” dataset and about 23% on the FoR dataset, lower than those of other advanced algorithms.

1. Introduction

Recently, with the rapid development of deep learning technology, it has become much easier than before to generate highly realistic fake human voices. At present, there are two main categories of speech forgery technology. The first is the text-to-speech (TTS) scheme [1, 2]: by training on a sufficient amount of high-quality recorded speech from a single speaker, a TTS system can synthesize speech from a piece of text in a reading style as natural as that of the training speaker. The other is the voice conversion (VC) scheme [3–6], which converts one person’s voice into another target person’s voice. Depending on the training data, VC schemes can be divided into parallel and nonparallel schemes. The increasingly mature voice forgery technology poses a huge challenge to the security of automatic speaker verification (ASV) systems. In the ASVspoof 2019 challenge, the equal error rate of the baseline system under spoofing attack reaches 44.66%.

With the continuous improvement of forgery technology, researchers are constantly exploring efficient synthetic speech detection algorithms. Traditionally, researchers used hand-crafted audio features to detect fake speech, such as the i-vector [7], the x-vector [8], mel-frequency cepstral coefficients (MFCC) [9–11], linear frequency cepstral coefficients (LFCC) [12, 13], and constant Q cepstral coefficients (CQCC) [14–16]. Later, neural networks were applied to spoofed speech detection; DNNs were first introduced for voice spoofing detection in [17, 18]. Yu et al. [19] proposed a system that combines human log-likelihoods (HLLs) with a DNN classifier, significantly improving spoof detection accuracy. Alzantot et al. [20] presented a deep residual network for fake audio detection and compared the effects of different input features (MFCC, log-magnitude STFT, and CQCC) on the detection results. To improve the robustness of spoofing countermeasures, Wu et al. [21] proposed a feature genuinization process that better fits the distribution of genuine speech, and then fed long-term CQT-based features into an LCNN to distinguish fake speech.

However, these methods reduce the synthetic speech detection problem to a binary classification of bonafide speech and synthetic speech, which requires fitting the distributions of two different classes during training. They therefore need large amounts of fake data, which is costly in time and resources. A more serious disadvantage is that when the test data contain many unknown speakers and synthesis algorithms, the trained model tends to perform poorly. In practice, when we use anti-spoofing countermeasures to protect specific systems or individuals, it is very hard to predict the attack methods used by opponents, so it is necessary to protect speaker verification systems in the absence of known fake samples. Recently, some researchers have tried to use one-class methods to address these problems. Villalba et al. [22] proposed a one-class method using a one-class support vector machine (SVM) with bottleneck features extracted from deep neural networks (DNNs), which attains competitive results on the ASVspoof 2015 dataset. Zhang et al. [23] presented a new loss function called OC-softmax to train their model, and the experimental results show that their method outperforms all existing single systems on the ASVspoof 2019 logical access (LA) dataset. However, the embedding features of these methods are still learned through binary classification, so overfitting to the training set persists, which limits the generalization ability of such methods to some extent.

In this work, we propose a one-class method based on improved support vector data description (SVDD) [24] for detecting fake speech. The method aims to learn the distribution characteristics of native speech while simultaneously trying to reconstruct the source speech. In the training phase, we use only native speech, so that the method fits the distribution of real speech; with the reconstruction process, the difference between native speech and synthetic fake speech is enlarged. In the evaluation phase, by evaluating the quality of the reconstructed speech and the distance to the center of the hypersphere, we can determine whether the speech is genuine or spoofed. The main contributions of our work are summarized as follows.

1.1. One-Class Detection Method

We introduce AESVDD, a one-class classification approach that combines an autoencoder with deep SVDD to identify fake speech. Because it depends only on the characteristics of genuine human speech, the approach holds potential for broader application in deepfake detection, in contrast to the binary classification methods conventionally used for this task.

1.2. Exploring the Influence of Different Features

We construct our method with three different features: constant Q transform (CQT), MFCC, and pulse-code modulation (PCM). These features are commonly used in fake speech detection. We test them with different method parameters, and the results show that CQT is the most effective feature for one-class detection.

1.3. Detecting Fake Speech on Different Datasets

We evaluate our one-class method on fake and real speech from the ASVspoof 2019 logical access corpus, achieving a low equal error rate while training on real speech only. We also test the generalization of our algorithms on the “In-the-wild” [25] and FoR [26] datasets, where they outperform other methods.

The subsequent sections of this paper are structured as follows. Section 2 introduces prior research relevant to our proposed methods. Section 3 outlines the architecture of our approach. Section 4 details the experimental datasets and the evaluation metrics employed. Section 5 analyzes the efficacy of various fake speech detection techniques across diverse datasets. Finally, Section 6 draws conclusions and outlines directions for future work.

2. Related Work

In this section, we provide a concise overview of noteworthy algorithms and research efforts that are directly relevant to our study.

2.1. SVDD Method

Support vector data description (SVDD), as outlined in reference [27], is a one-class classifier widely used in anomaly detection. The primary goal of SVDD is to map the data from its original space into a feature space and then identify a hypersphere of minimum volume in that space. The conventional SVDD framework comes in two main forms: linear SVDD and kernel-based nonlinear SVDD. Given a training dataset $D = \{x_1, x_2, \dots, x_n\}$, the optimization objective for linear SVDD is

$$\min_{R,\,O,\,\xi}\; R^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad \|x_i - O\|^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,n, \tag{1}$$

where $O$ denotes the center of the hypersphere, $\|\cdot\|$ is the Euclidean norm, $\xi_i$ is the relaxation (slack) variable, $R$ is the radius of the hypersphere, and $C$ is the trade-off parameter governing the balance between hypersphere volume and modeling error. By incorporating $\xi_i$, the model penalizes outliers rather than excluding them outright. However, describing normal data with a purely linear boundary is often inadequate in practice, which motivated the kernel SVDD. Its principle is the same as that of the linear version; the key distinction is that the data are transformed by a nonlinear function before the constraints are imposed. Introducing a nonlinear mapping $\phi(\cdot)$ from the input space to a new feature space, the optimization objective becomes

$$\min_{R,\,O,\,\xi}\; R^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad \|\phi(x_i) - O\|^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,n, \tag{2}$$

and this objective is usually solved through its dual form,

$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i\,\langle\phi(x_i),\phi(x_i)\rangle - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,\langle\phi(x_i),\phi(x_j)\rangle \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n}\alpha_i = 1, \tag{3}$$

where $\alpha_i$ are the Lagrangian multipliers and $\langle\cdot,\cdot\rangle$ denotes the inner product in the mapping space. Since the mapping function is difficult to define explicitly, practical implementations use a kernel function $K(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$ to compute the inner product.

The advent of deep neural networks has paved the way for innovations such as the deep SVDD structure proposed by Ruff et al. [24]. Unlike kernel SVDD, deep SVDD uses a deep neural network in place of the kernel function. Given a training dataset $\{x_1, x_2, \dots, x_n\}$, the objective function for soft-boundary deep SVDD is

$$\min_{R,\,\mathcal{W}}\; R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\max\!\left\{0,\; \|\phi(x_i;\mathcal{W}) - c\|^2 - R^2\right\} + \frac{\lambda}{2}\sum_{\ell=1}^{L}\left\|W^{\ell}\right\|_F^2. \tag{4}$$

The first term minimizes $R^2$, thereby effectively reducing the volume of the hypersphere. The second term penalizes data points that lie beyond the confines of the sphere, with the hyperparameter $\nu$ controlling the trade-off between sphere volume and boundary violations. Finally, the network parameters $\mathcal{W}$ are regularized through weight decay with hyperparameter $\lambda$, where $\|\cdot\|_F$ denotes the Frobenius norm and $L$ is the number of hidden layers. In the one-class training mode, where all training samples are deemed normal, the optimization above can be streamlined into a simplified version,

$$\min_{\mathcal{W}}\; \frac{1}{n}\sum_{i=1}^{n}\|\phi(x_i;\mathcal{W}) - c\|^2 + \frac{\lambda}{2}\sum_{\ell=1}^{L}\left\|W^{\ell}\right\|_F^2, \tag{5}$$

where the first term computes the distance between the network output and the center point $c$ of the hypersphere and the second term is the same as in equation (4).
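To make the one-class objective in equation (5) concrete, the following is a minimal PyTorch sketch of the distance term; the encoder name `phi`, the batch layout, and delegating the weight-decay term to the optimizer are our own assumptions rather than details taken from the original paper.

```python
import torch

def deep_svdd_loss(z: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """One-class deep SVDD loss (equation (5)): mean squared distance of the
    batch embeddings z (shape: batch x emb_dim) to the fixed center c.
    The weight-decay regularizer is usually handled by the optimizer."""
    return torch.mean(torch.sum((z - center) ** 2, dim=1))

# Usage sketch (illustrative names):
#   z = phi(x)                      # phi: the encoder network, x: a feature batch
#   loss = deep_svdd_loss(z, c)     # c: center fixed after a pretraining pass
```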

2.2. Autoencoder Method

The autoencoder (AE) is used to compress input data into a feature space from which the data can subsequently be reconstructed. Over the past decades, the AE has demonstrated its efficacy in anomaly detection [28, 29]. The architecture of the autoencoder is shown in Figure 1 and comprises three components: the encoder, the decoder, and the hidden layer space. The encoder compresses the raw data into a feature space of reduced dimension. The hidden layer maps the encoder output to a latent variable $z$, typically using fully connected layers. The decoder performs the reverse operation of the encoder, reconstructing the data from the feature space back into the same domain as the original data. Fundamentally, the AE model is trained to capture the low-level features present in the input data, so the loss function plays a critical role in quantifying the disparity between the input and the output. The mean squared error is the most commonly employed measure, defined as

$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\|\hat{x}_i - x_i\|^2, \tag{6}$$

where $\hat{x}_i$ denotes the reconstructed data and $x_i$ denotes the original data.

3. Proposed Method

In this section, we describe in detail the structure of the proposed one-class fake speech detection method based on improved deep SVDD. The flowchart of AESVDD is shown in Figure 2. The proposed method consists of an autoencoder acting as a feature transformer and a deep SVDD acting as a one-class classifier. In the remainder of this section, we present the input data, the specific structure, and the objective function of our method.

3.1. The Input of the Method

In spoofing audio detection, many hand-crafted features can be chosen as input, such as MFCC, LFCC, CQT, and CQCC. As verified in the ASVspoof 2019 challenge [30], all of these features have achieved excellent results in detecting synthetic speech. Inspired by [31] and by our own experimental results, we find that CQT is more effective than the other features in the one-class task, so we choose CQT as the input feature of our method. The CQT corresponds to a filter bank whose center frequencies are exponentially distributed and whose filter bandwidths differ, but whose ratio of center frequency to bandwidth is a constant Q. It differs from the Fourier transform in that the frequency axis of its spectrum is not linear but logarithmic, and the length of the filter window changes with the frequency of the spectral line to obtain better performance. Most speech synthesis methods focus on the speaker’s timbre and speech fluency, often ignoring the prosodic features of the speech signal. Thus, the CQT feature is effective for detecting spoofed audio.

3.2. The Architecture of AESVDD

In our proposed method, we combine the DSVDD and AE methods into a one-class classifier to detect synthetic speech. The specific architecture of the autoencoder is shown in Figure 3. The whole architecture consists of an autoencoder module and an SVDD module. Since the input data consist of CQT coefficients, a 2D convolutional neural network with five convolution layers is used as the encoder, which maps the input data onto the hidden space. Different from the ordinary AE method, we introduce the SVDD structure in the bottleneck layer so that the feature space variables compressed by the encoder are constrained within a hypersphere. At the end of the network, the decoder is built from a 2D transposed convolution network, where the number and size of the convolution layers are the same as those of the encoder.

Based on this basic architecture, we propose two AESVDD variants, AESVDD-1 and AESVDD-2, to detect real and fake audio. Figure 2(a) shows the architecture of AESVDD-1, and Figure 2(b) shows that of AESVDD-2. AESVDD-1 uses the same encoder and decoder structure as in Figure 3, with the SVDD module embedded inside the autoencoder. Training proceeds in two steps: first, we train the autoencoder without the SVDD module by minimizing the reconstruction loss between the original input X and the reconstructed output X̂; then, we add the SVDD module, initialize the autoencoder with the parameters from the first step, and continue training by reducing the SVDD loss and the reconstruction loss simultaneously. AESVDD-2, in contrast, separates the SVDD module from the autoencoder and adds a new encoder before the SVDD module whose detailed architecture is the same as that of the autoencoder’s encoder. In the training phase, we also train the autoencoder module first, but in the second step, the SVDD module only inherits the encoder parameters from the first step and is trained with the objective function in equation (5) while the autoencoder parameters are kept unchanged. The difference between the two methods therefore lies in the position of the DSVDD module and in the second training step.

Figure 4 uses t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), two algorithms for high-dimensional data visualization, to illustrate how the SVDD module gathers real voice samples and excludes synthetic ones. A clear boundary between the real and synthetic samples can be seen, which shows that the SVDD module is feasible for synthetic audio detection.
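For illustration only, the sketch below outlines an AESVDD-1-style network in PyTorch: a small 2D convolutional encoder, a fully connected bottleneck whose output z is the embedding constrained by the SVDD module, and a mirrored transposed-convolution decoder. The channel counts, kernel sizes, pooling, and the reduced number of layers are placeholders and do not reproduce the paper’s exact five-layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AESVDD1(nn.Module):
    """Illustrative AESVDD-1 skeleton: conv encoder -> SVDD bottleneck -> deconv decoder."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        # Encoder: 2D convolutions over the CQT time-frequency matrix (1 input channel).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.to_latent = nn.Linear(64 * 4 * 4, emb_dim)    # bottleneck z used by the SVDD module
        self.from_latent = nn.Linear(emb_dim, 64 * 4 * 4)
        # Decoder: transposed convolutions roughly mirroring the encoder.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        z = self.to_latent(h.flatten(1))                   # embedding constrained to the hypersphere
        x_hat = self.decoder(self.from_latent(z).view(-1, 64, 4, 4))
        x_hat = F.interpolate(x_hat, size=x.shape[-2:])    # match the input size for the MSE loss
        return z, x_hat
```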

3.3. The Objective Function of the Proposed Method

When we train the method, the first step is training the autoencoder, and the objective function can refer to formula (6). For AESVDD-1, the second step trains the autoencoder module and the SVDD module at the same time. Thus, the ultimate objective can be defined as

$$L_{\text{AESVDD-1}} = \frac{\alpha}{n}\sum_{i=1}^{n}\|\phi(x_i;\mathcal{W}) - c\|^2 + \frac{\beta}{n}\sum_{i=1}^{n}\|\hat{x}_i - x_i\|^2 + \frac{\lambda}{2}\sum_{\ell=1}^{L}\left\|W^{\ell}\right\|_F^2, \tag{7}$$

where $\phi(\cdot;\mathcal{W})$ is the mapping function learned by the encoder, $c$ is the center of the hypersphere, and $\alpha$ and $\beta$ are the coefficients that balance the distance loss and the reconstruction loss. The last term prevents overfitting, as in equation (5). Based on the experience gained from our experiments, $\alpha$ and $\beta$ are set to 1.0 and 0.01, respectively, in this article.

As for AESVDD-2, the training of the autoencoder and the SVDD module is completely independent, with the objective functions given in formulas (5) and (6). Therefore, the training loss can be written as

$$L_{\text{AESVDD-2}} = L_{\text{SVDD}} + L_{\text{MSE}}. \tag{8}$$

In the evaluation phase, however, the two parts need to be combined into a single score,

$$S(x) = \alpha\,\|\phi(x;\mathcal{W}) - c\|^2 + \beta\,\|\hat{x} - x\|^2, \tag{9}$$

where, for AESVDD-2, $\alpha$ and $\beta$ are set to 1.0 and 0.4, respectively.
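As a hedged sketch of how the objectives above might be implemented, the functions below compute the joint AESVDD-1 training loss of equation (7) and the combined AESVDD-2 evaluation score of equation (9); the function names and tensor shapes are our own assumptions.

```python
import torch

def aesvdd1_loss(z, x_hat, x, center, alpha: float = 1.0, beta: float = 0.01):
    """Joint AESVDD-1 objective (equation (7)): SVDD distance plus reconstruction
    error; the weight-decay regularizer is left to the optimizer."""
    dist = torch.mean(torch.sum((z - center) ** 2, dim=1))
    rec = torch.mean((x_hat - x) ** 2)
    return alpha * dist + beta * rec

def aesvdd2_score(z, x_hat, x, center, alpha: float = 1.0, beta: float = 0.4):
    """Per-utterance AESVDD-2 evaluation score (equation (9)): distance to the
    hypersphere center combined with the reconstruction error. Larger scores
    indicate a sample more likely to be spoofed."""
    dist = torch.sum((z - center) ** 2, dim=1)
    rec = torch.mean((x_hat - x) ** 2, dim=(1, 2, 3))      # x has shape (batch, 1, freq, time)
    return alpha * dist + beta * rec
```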

4. Experiment

In this section, we introduce our experimental setup and training process. We then report the results of the proposed method and analyze the influence of different parameters. Finally, we compare the performance of the proposed method with other state-of-the-art one-class classification and binary classification methods.

4.1. Database and Metrics

To validate our techniques, we employ the ASVspoof 2019 dataset [18]. This dataset encompasses two distinct corpora: logical access (LA) scenarios and physical access (PA) scenarios; our research focuses on the LA scenario. The LA data are derived from the VCTK base corpus, a collection of speech recorded from 107 speakers. The corpus is divided into three separate datasets: training, development, and evaluation. These datasets contain genuine speech as well as speech generated by 17 distinct TTS and VC systems. Among these, the training and development sets involve 6 known synthesis algorithms, while the evaluation set contains 2 known algorithms alongside 11 algorithms that remain undisclosed. A succinct overview of the ASVspoof 2019 logical access corpus and the accompanying “In-the-wild” dataset is given in Table 1.

Furthermore, to assess the efficacy of our methods under more realistic conditions, we use the “In-the-wild” [25] audio deepfake dataset and the FoR [26] dataset. The “In-the-wild” collection comprises 37.9 hours of audio clips, partitioned into spoofed and genuine segments; the spoofed clips come from segmenting 219 publicly accessible video and audio files that explicitly promote audio deepfakes. The FoR dataset comprises over 69,300 utterances, covering recent TTS algorithms from prominent providers such as Amazon, Baidu, Google, and Microsoft, alongside a diverse array of authentic speech. Comprehensive details are outlined in Table 1.

In this work, we use the equal error rate (EER), as adopted in the ASVspoof 2019 challenge, to evaluate the performance of the detection method. The EER is defined as the value at which the false alarm rate equals the miss rate, obtained by sweeping the decision threshold. A lower EER indicates better detection performance.
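For reference, the EER can be computed from a set of detection scores as the operating point where the false alarm rate and the miss rate cross. A minimal NumPy sketch is given below; it assumes that higher scores indicate spoofed audio, which is a convention choice rather than something specified in the text.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate. `scores` are detection scores (higher = more likely spoof)
    and `labels` are 1 for spoofed and 0 for bonafide utterances."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    far, frr = [], []
    for t in thresholds:
        decide_spoof = scores >= t
        far.append(np.mean(decide_spoof[labels == 0]))          # false alarms on bonafide
        frr.append(1.0 - np.mean(decide_spoof[labels == 1]))    # misses on spoofed speech
    far, frr = np.array(far), np.array(frr)
    idx = np.argmin(np.abs(far - frr))                          # point where the two rates cross
    return (far[idx] + frr[idx]) / 2.0
```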

4.2. Method Training

In the training phase, we use only bonafide audio from the ASVspoof 2019 training dataset, which includes about 3000 utterances. We extract 432-dimensional CQT coefficients from the utterances, which are cut to 6.4 seconds; the frame size is 20 ms and the hop size is 10 ms. Before feeding the CQT coefficients into the network, we arrange them into a 2D time-frequency matrix. To make the network easier to converge, we normalize the training data using the mean and standard deviation computed on the training dataset.
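A possible preprocessing pipeline matching this description, sketched with librosa, is shown below. The paper does not state the exact CQT settings (minimum frequency, bins per octave), and librosa constrains the hop length for deep CQTs, so the parameter values here are illustrative rather than the ones used in the experiments.

```python
import librosa
import numpy as np

def extract_cqt(path: str, sr: int = 16000, duration: float = 6.4,
                n_bins: int = 432, bins_per_octave: int = 54) -> np.ndarray:
    """Load an utterance, pad/trim it to a fixed duration, and return a 2D
    log-magnitude CQT matrix of shape (n_bins, n_frames)."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=int(sr * duration))   # cut or pad to 6.4 s
    # hop_length=256 (16 ms at 16 kHz) satisfies librosa's multi-octave constraint;
    # the paper reports a 10 ms hop, so this value is illustrative only.
    C = librosa.cqt(y, sr=sr, hop_length=256, fmin=20.0,
                    n_bins=n_bins, bins_per_octave=bins_per_octave)
    return np.log1p(np.abs(C)).astype(np.float32)

# Normalization with statistics computed on the bonafide training set (illustrative):
#   mean, std = train_cqt.mean(), train_cqt.std()
#   cqt_norm = (cqt - mean) / (std + 1e-8)
```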

Our method is implemented with the PyTorch toolkit and trained on a single NVIDIA V100 GPU. For optimization, we use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to update the weights of our framework. The batch size is 200, and the initial learning rate is 0.001. To expedite convergence and improve stability, we apply batch normalization to the convolutional layers. Furthermore, to mitigate overfitting, we apply weight decay to the network parameters. As a preliminary step before the main training phase, we run a pretraining procedure to initialize the center of the hypersphere.
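The optimization setup might look as follows in PyTorch. Initializing the hypersphere center as the mean bottleneck embedding of the bonafide training data after the pretraining pass follows common deep SVDD practice and is our reading of the procedure described above; the weight-decay value is illustrative, since it is not reproduced in this text.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Adam with beta1=0.9, beta2=0.999 and a learning rate of 0.001, as described above.
    The weight-decay value is a placeholder."""
    return torch.optim.Adam(model.parameters(), lr=1e-3,
                            betas=(0.9, 0.999), weight_decay=1e-6)

@torch.no_grad()
def init_center(model: torch.nn.Module, loader, eps: float = 0.1) -> torch.Tensor:
    """Initialize the hypersphere center c as the mean bottleneck embedding of the
    bonafide training data; coordinates too close to zero are pushed to +/- eps to
    avoid a trivial collapsed solution (common deep SVDD practice)."""
    model.eval()
    zs = [model(x)[0] for x in loader]      # the model returns (z, x_hat)
    c = torch.cat(zs).mean(dim=0)
    c[(c.abs() < eps) & (c < 0)] = -eps
    c[(c.abs() < eps) & (c >= 0)] = eps
    return c
```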

5. Result

5.1. Evaluation of Proposed Method

To demonstrate the efficacy of our proposed one-class method, we test it under different parameters. Considering that the dimension of the hidden layer and the weight of the reconstruction error may influence detection performance, we construct our method with various hidden layer dimensions. The results are summarized in Table 2. First, we train our method with three input features, PCM (raw audio data), LFCC, and CQT, and with three hidden layer dimensions, 64, 128, and 256. The pooled column in the table is a mixture of the development and evaluation sets, and Figure 5 shows the DET curve of our method on this pooled set. From the EERs on the development and evaluation sets, we can see that the methods whose input feature is LFCC or PCM have a low EER on the evaluation set but an EER close to 30% on the development set. This shows that the PCM and LFCC features are not robust for distinguishing between native and synthetic speech, while the CQT feature is more stable against unknown attacks. The table shows that the best performance is obtained with the CQT feature and 256-dimensional hidden variables, achieving 8.10%. For AESVDD-2, we also train our method with the three hidden layer dimensions; 256 is again the optimal dimension, with an EER of 9.20%.

5.2. Comparison with Other Systems
5.2.1. Comparison with Other One-Class Methods

We compare our method with other one-class classification systems, including the OC-SVM, based on classical machine learning, and the deep SVDD method, based on neural networks. The input features and training parameters follow the same settings as in our proposed method, and all three methods are trained only on bonafide speech from the training dataset. Table 3 shows the detection results of the different methods on the ASVspoof 2019 dataset; the experimental data are a mixture of the development and evaluation sets. It is worth noting that because all methods are trained only on native speech, all fake speech constitutes unknown attacks for them. From the table, we can see that our method has a lower EER than OC-SVM and DSVDD; compared with DSVDD, the EER of our proposed method is reduced by 33%. This shows that the AE structure helps enhance the ability to characterize native speech.

5.2.2. Comparison with Other Advanced Methods

We also compare our system with other state-of-the-art systems. First, two conventional methods based on the Gaussian mixture model (GMM) [32, 33] are chosen. We then choose the Res-TSSDNet method proposed by Hua et al. [31], which has two versions: an end-to-end version and a CQT version. The other is the ResNet method presented by Alzantot et al. [20], for which we also use two different input features. In this section, we aim to test the ability of our method to defend against unknown attacks compared with other advanced methods. Because the development and evaluation sets share many identical or similar synthesis algorithms, the binary classifiers above are not trained on all known algorithms from the ASVspoof 2019 training set; in this experiment, we use bonafide and A01 synthetic speech to train them. We then test all methods on samples randomly extracted from the development and evaluation sets. Our proposed method is trained only on bonafide speech, as in the first experiment. The EER results are presented in Table 3. It can be seen that ResNet achieves only about 23% and that Res-TSSDNet with the CQT feature performs similarly. The proposed method defends against unknown attacks better, achieving 8.1%.

5.2.3. Generalization of the Method on Different Datasets

To evaluate the generalization of our proposed method, we conducted experiments using the “In-the-wild” [25] and FoR [26] datasets. We trained our proposed methods on bonafide samples from ASVspoof 2019 and then tested them on the two datasets mentioned above. For the binary classification methods, we used the entire ASVspoof 2019 training set for training and then tested them on the “In-the-wild” dataset. Table 4 shows the results of our experiments.

From the table, we can observe that the binary classification methods have excellent detection performance against known algorithms (the first column of EER). However, when detecting audio samples from different datasets, the performance of the binary classification networks decreases significantly. The one-class classification algorithms we propose also degrade in cross-database detection, but they maintain a better ability to distinguish between real and synthetic speech than the binary classification methods.

Our experiments also show that Res-TSSDNet with CQT input features performs better in cross-dataset fake audio detection, which suggests that CQT features generalize better than other features. AESVDD-1 and AESVDD-2 achieve 16.85% and 15.08% on the “In-the-wild” dataset, respectively, about 10 percentage points lower than the other algorithms. On the FoR dataset, our proposed methods achieve 23.34% and 25.88%, respectively, which is also better than the other advanced algorithms.

6. Conclusion

In this research paper, we introduce a novel one-class approach for detecting fake speech. Our method leverages autoencoders and SVDD to improve its ability to counter unfamiliar fake attacks. Our innovation lies in a loss function that combines the reconstruction error and the one-class SVDD loss, resulting in heightened detection capability. Through comprehensive experiments on the ASVspoof 2019 LA corpus, we demonstrate that our methods surpass conventional one-class classifiers. A comparative analysis between our system and other state-of-the-art solutions underscores the superior adaptability of our approach when confronted with unknown attacks. Furthermore, the results obtained on the “In-the-wild” and FoR datasets confirm the superior generalization of the proposed techniques in cross-database detection. Our future work will focus on optimizing the structure of our methods with the aim of further reducing the EER.

Data Availability

The experimental datasets are created using the ASVspoof 2019, FoR and “In-the-wild” databases, which are publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by National Key Technology Research and Development Program under grant no. 2020AAA0140000.