Abstract

Using fake audio to spoof the audio devices in the Internet of Things has become an important problem in modern network security. To address the lack of robust features in fake audio detection, a hidden feature extraction method for audio streams based on a heuristic mask for empirical mode decomposition (HM-EMD) is proposed in this paper. First, using HM-EMD, each signal is decomposed into several monocomponent intrinsic mode functions (IMFs). Then, on the basis of these IMFs, basic features and hidden acoustic component features (HACFs) of the audio streams are constructed. Finally, a machine learning method is used to classify audio streams with these features. The experimental results show that the HM-EMD-based hidden information features effectively supplement the nonlinear and nonstationary information that traditional features such as mel cepstral features cannot express and better represent hidden acoustic events, which provides a new research idea for fake audio detection.

1. Introduction

With the development of Internet of Things (IoT) technology, an increasing number of audio and video acquisition devices are connected to the Internet. Fake audio has become a growing threat to voice interfaces owing to recent breakthroughs in speech synthesis and voice conversion technologies. Therefore, the detection of fake audio has become a new hot issue in network security [1, 2]. There are two main methods of audio forgery. One is to generate spoofed utterances using text-to-speech (TTS) and voice conversion (VC) algorithms, which is also called logical access (LA) [3]; the other is to use professional replay devices to mount spoofing attacks, which is known as physical access (PA) [4]. The growing diversity of forgery techniques makes fake audio detection increasingly difficult [5]. In this paper, a hidden feature extraction method for audio streams based on HM-EMD is proposed and used to extract features of audio streams for fake audio detection.

At present, fake audio detection mainly builds classification models on acoustic features. Linear frequency cepstral coefficients (LFCC) [6], constant-Q cepstral coefficients (CQCC) [7], and mel frequency cepstral coefficients (MFCC) [8] are commonly used in fake audio detection. However, these features are based on fixed filter banks, and none of them generalize well to unknown spoofing technologies. Subsequently, end-to-end deep learning methods for detecting fake audio have gradually attracted researchers' attention. Alejandro et al. proposed a recurrent neural network based on light convolutional gates to extract shallow frame-level features and deep sequence-dependent features at the same time [9]. Zeinali et al. used VGG and light CNN models to detect fake audio [10]. However, such end-to-end deep learning approaches require large, evenly distributed datasets and depend on particular types of fake audio.

Through the analysis of forged audio, it is found that the AI-based fake audio technology focuses more on speech content and ignores the background sound in an audio stream [11]. The background sound will also change during the replay spoofing. Therefore, the construction of features representing the hidden information in audio scenes can be used to detect the fake audio [12].

In order to focus on local details of the signal in specific regions (which may be highly discriminative), an empirical mode decomposition (EMD)-based approach is explored. EMD has superior time-frequency resolution for nonlinear, nonstationary signal processing and has been applied in fake audio detection [13].

However, the traditional EMD method has several disadvantages, including mode aliasing and inconsistent IMF dimensions after signal decomposition; hence, accurately estimating the IMF range of a certain frequency distribution is difficult. In 2005, Deering and Kaiser proposed introducing an auxiliary signal into the decomposition [14]; the ensemble empirical mode decomposition (EEMD) method similarly attempts to solve mode aliasing by adding Gaussian white noise to the signal to be decomposed. In EEMD, the attributes of the Gaussian white noise must be adjusted manually. Moreover, the Gaussian white noise leaves traces in the IMFs decomposed from the signal, resulting in low signal restoration accuracy and extensive calculations. Time-varying filtering-based empirical mode decomposition (TVF-EMD) uses a B-spline time-varying filter for mode selection and thus solves the problem of mode aliasing to a certain extent. However, TVF-EMD must calculate the cutoff frequency first, leaving the problem of dimension inconsistency unsolved [15].

To sum up, in order to make full use of the time-frequency analysis advantages of EMD, the mode aliasing and dimension inconsistency problems of EMD itself must be solved. In this paper, a heuristic mask for empirical mode decomposition (HM-EMD) is proposed to improve the purity of the IMFs and to solve the problems of mode mixing and inconsistent IMF dimensions. Then, the hidden acoustic component features (HACFs) of the audio stream are constructed and used to locate acoustic events in the TASK1A acoustic scene classification dataset of DCASE [16]. Fake audio detection is implemented on the ASVSpoof2019 dataset [17]. The experimental results show that the basic features and HACFs of audio streams based on HM-EMD can represent the audio background, which helps to verify the type of fake audio.

The paper consists of five parts. The first part is the introduction. The second part mainly introduces the principle of HM-EMD. The third part describes the mining of hidden information in audio streams based on the proposed HM-EMD. The fourth part presents the results of classifying audio streams on the basis of HM-EMD. The fifth part summarizes the characteristics of the proposed method and presents future research directions.

2. Heuristic Mask for Empirical Mode Decomposition (HM-EMD)

In this section, the classical empirical mode decomposition (EMD) method is first introduced, followed by an analysis of mode aliasing in EMD; finally, a solution to mode aliasing based on heuristic mask signals is presented in detail.

2.1. Empirical Mode Decomposition Method
2.1.1. Empirical Mode Decomposition

EMD can decompose the original signal x(t) into a series of IMFs whose upper and lower envelopes have a mean value of 0. This decomposition method does not need preset basis functions (unlike the Fourier transform or wavelet analysis), but each IMF should satisfy the following conditions: |N_e − N_z| ≤ 1 (1) and (e_max(t) + e_min(t))/2 = 0 (2), where N_e is the number of extreme points of the data sequence, N_z is the number of zero crossings, and e_max(t) and e_min(t) are the upper and lower envelopes obtained by cubic spline interpolation with the maximum and minimum points as control points, respectively. Formula (1) represents the narrow-band constraint condition of the IMF, and formula (2) represents the local symmetry constraint condition. The process of EMD decomposition to obtain an IMF can be expressed as follows (Algorithm 1).

Input: original signal x (t), supposed IMF number i
Output: intrinsic mode functions, IMF
(1)i = 1, r(t) = x(t).
(2)Get the extremum points of signal r(t), calculate the upper and lower envelopes by cubic spline interpolation with the maximum and minimum points as control points, and get the average value m(t) of the upper and lower envelopes at every point.
(3)h(t) = r(t) − m(t). If h(t) satisfies formulas (1) and (2), then h(t) is taken as the ith IMF signal c_i(t), i = i + 1, and r(t) = r(t) − c_i(t); if not, repeat steps 2 and 3 for signal h(t).
(4)Return to step 2 until the termination condition is satisfied.
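As a reference for the modifications in later sections, the sifting procedure of Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the spline-envelope construction via `scipy`, and the simple stopping rules are all assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import find_peaks

def envelope_mean(h, t):
    """Step 2: mean of the upper and lower cubic-spline envelopes."""
    peaks, _ = find_peaks(h)
    troughs, _ = find_peaks(-h)
    if len(peaks) < 4 or len(troughs) < 4:
        return None  # too few extrema for a stable spline envelope
    upper = CubicSpline(t[peaks], h[peaks])(t)
    lower = CubicSpline(t[troughs], h[troughs])(t)
    return 0.5 * (upper + lower)

def emd(x, max_imfs=5, sift_iters=50, tol=1e-3):
    """Algorithm 1: repeatedly sift the residual to extract IMFs."""
    t = np.arange(len(x), dtype=float)
    r = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        if envelope_mean(r, t) is None:
            break  # residual is (near-)monotonic: stop
        h = r.copy()
        for _ in range(sift_iters):
            m = envelope_mean(h, t)
            if m is None or np.mean(m**2) < tol * np.mean(h**2):
                break  # envelope mean ~ 0: h satisfies the IMF conditions
            h = h - m  # step 3: subtract the envelope mean
        imfs.append(h)
        r = r - h  # peel the IMF off the residual
    return imfs, r
```

Because each IMF is subtracted from the residual, the decomposition is exactly invertible: summing the IMFs and the final residual reproduces the input signal.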
2.1.2. Modal Aliasing of EMD

However, the most significant drawback of EMD is modal aliasing, as shown in Figure 1. Figure 1(b) shows the FFT spectrum corresponding to each IMF in Figure 1(a). It can be seen from the figure that each FFT spectrum contains multiple signals of different frequencies, which means that a single IMF contains signals of different frequencies or that signals of the same frequency appear in different IMF components. Both cases constitute modal aliasing. The main cause of modal aliasing is the absence of extreme values or an inconsistent distribution of extreme values, which makes the deviation between the envelope obtained by interpolation and the real signal trend too large. In this case, the time-domain signal does not meet the narrow-band requirement of IMF decomposition, resulting in mode aliasing.

In order to solve this problem, a mask signal s(t) is usually created to compensate for the missing extreme values, and the two masked signals are given, respectively: x_+(t) = x(t) + s(t) (3) and x_−(t) = x(t) − s(t) (4).

EMD is performed on x_+(t) and x_−(t) to obtain the intrinsic mode functions IMF_+ and IMF_−, respectively. The final IMF is defined as follows: IMF = (IMF_+ + IMF_−)/2 (5).

It can be seen from the above that the extreme value distribution of the mask signal s(t) is very important for solving the modal aliasing problem. White noise is usually used as the mask signal, but this approach does not make full use of the properties of the signal itself and cannot adapt to a variety of signal contents. Therefore, this paper proposes a heuristic mask for empirical mode decomposition. This method makes full use of the structural attributes of the signal itself to construct a variable analysis window and mask signals. The specific principle and implementation process are as follows.
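The masking step described by formulas (3)–(5) can be sketched as follows. `emd_first` stands for any routine that returns the first IMF of a signal (a hypothetical argument, not from the paper); averaging the two decompositions cancels the mask's own contribution.

```python
import numpy as np

def masked_first_imf(x, s, emd_first):
    """Formulas (3)-(5): decompose x + s and x - s separately, then
    average so that the mask signal s cancels out of the result."""
    c_plus = emd_first(x + s)    # first IMF of the positively masked signal
    c_minus = emd_first(x - s)   # first IMF of the negatively masked signal
    return 0.5 * (c_plus + c_minus)
```

With a linear `emd_first`, the mask cancels exactly; with real sifting, it cancels approximately while having steered the extrema distribution during decomposition.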

2.2. Heuristic Mask Signals
2.2.1. Basic Principle Analysis

The signal properties need to be established prior to EMD. Any nonstationary signal can be expressed by a time-varying AM/FM model, that is, x(t) = a(t)cos(φ(t)) (6), where a(t) is the envelope function and φ(t) is the phase function. The analytic signal is z(t) = x(t) + jH[x(t)] = a(t)e^{jφ(t)} (7).

Here, H[·] denotes the Hilbert transform. From z(t), we calculate the instantaneous phase φ(t) and the instantaneous frequency f(t) = (1/2π)dφ(t)/dt. Using the Hilbert transform, we can separate the AM and FM components of the IMF to achieve modal separation.
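This computation can be sketched with `scipy.signal.hilbert`; the function name and the finite-difference scheme for the phase derivative are my assumptions, not from the paper.

```python
import numpy as np
from scipy.signal import hilbert

def inst_freq_amp(x, fs):
    """Instantaneous amplitude a(t) and frequency f(t) from the
    analytic signal z(t) = x(t) + j*H[x(t)]."""
    z = hilbert(x)
    amp = np.abs(z)                           # a(t)
    phase = np.unwrap(np.angle(z))            # phi(t)
    freq = np.diff(phase) * fs / (2 * np.pi)  # f(t) = phi'(t) / (2*pi)
    return amp, freq
```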

For a single-component mode, the instantaneous frequency f(t) should be nearly linear, and its variation range should be considerably small. When mode aliasing occurs, f(t) changes clearly even without considering the end points. In particular, for hidden components, a jump of f(t) occurs at the time point of concealment. We construct a variable analysis window according to the time-frequency characteristics of the instantaneous frequency and then divide the signal into several parts.

If f(t) of the segmented signal is still unstable, the modal separation problem can be transformed into a minimisation problem in which the bandwidth of the IMF is minimised. The bandwidth of a nonstationary signal can be obtained by the Carson rule: B = 2(Δf + f_AM + f_FM), where Δf is the deviation of the instantaneous frequency from its mean value and f_AM and f_FM denote the frequencies of the AM and FM signals, respectively. We can make Δf approach zero to minimise the bandwidth. In other words, the decomposition frequency of each IMF is expected to be equal to the centre frequency of the instantaneous frequency, that is, to the mean instantaneous frequency μ_f. Then, a mask signal with the same frequency as μ_f can be selected and the number of IMFs required can be determined.

2.2.2. Algorithm Description

The HM-EMD algorithm comprises the following steps: variable analysis window construction and mask signal construction.

(1) Variable Analysis Window Construction. The jump point t_d should be picked such that formula (9) is satisfied: |Δf(t_d)| > μ_Δf + λσ_Δf (9), where Δf(t_d) is the difference in instantaneous frequencies at t_d, μ_Δf is the mean value of Δf over all time points, σ_Δf is its variance, and λ is a variable parameter. The original signal is divided into parts at the time division points, and each part is decomposed by EMD independently.
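A minimal sketch of the jump-point rule, assuming formula (9) is the mean-plus-λ-deviations threshold described in the text (the exact formula is not reproduced in this copy, so the function below is an interpretation):

```python
import numpy as np

def hiding_jump_points(inst_freq, lam=3.0):
    """Indices where the instantaneous-frequency difference deviates from
    its mean by more than lam standard deviations (assumed formula (9))."""
    df = np.diff(inst_freq)
    dev = np.abs(df - df.mean())
    return np.where(dev > lam * df.std())[0] + 1

def split_windows(x, points):
    """Cut the signal into variable analysis windows at the jump points."""
    return np.split(np.asarray(x), points)
```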

(2) Mask Signal Construction. The sine signal is a common form of a mask signal, and its amplitude and frequency should be determined. As analysed in Section 2.2.1, the frequency is determined as the mean instantaneous frequency μ_f, and the amplitude is determined as the mean value of the instantaneous amplitude. Then, the mask signal s is defined as s(t) = μ_a sin(2πμ_f t), where μ_a = mean(a(t)) and μ_f = mean(f(t)).
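The mask construction can be sketched as follows; the Hilbert-based estimates of the mean instantaneous frequency and amplitude follow Section 2.2.1, while the function name is a placeholder of mine.

```python
import numpy as np
from scipy.signal import hilbert

def heuristic_mask(residual, fs):
    """Sine mask s(t) = mu_a * sin(2*pi*mu_f*t), where mu_a and mu_f are
    the mean instantaneous amplitude and frequency of the residual itself."""
    z = hilbert(residual)
    mu_a = float(np.mean(np.abs(z)))  # mean instantaneous amplitude
    mu_f = float(np.mean(np.diff(np.unwrap(np.angle(z)))) * fs / (2 * np.pi))
    t = np.arange(len(residual)) / fs
    return mu_a * np.sin(2 * np.pi * mu_f * t)
```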

Then, the IMFs can be refreshed by formulas (3)–(5), in which the number of IMFs is determined from the sampling frequency f_s. The algorithm flow is as follows (Algorithm 2):

Input: signal x (t), supposed IMF number i
Output: intrinsic mode function, IMF
(1)r(t) = x(t), i = 1.
(2)Get the first IMF of the signal residual r(t), calculate the mean μ_Δf and variance σ_Δf of the instantaneous-frequency difference Δf, and use formula (9) to determine whether there is a hiding jump point. The variable analysis window is constructed according to the hiding jump points, and r(t) is segmented.
(3)Construct a mask signal for each segment r_k(t), using the segment's mean instantaneous amplitude and frequency: s_k(t) = μ_a,k sin(2πμ_f,k t).
(4)Do EMD on r_k(t) + s_k(t) and r_k(t) − s_k(t); get the first IMFs c_+ and c_−.
(5)Let c_i = (c_+ + c_−)/2, and splice all the divided pieces.
(6)i = i + 1, r(t) = r(t) − c_i(t); return to step 2 until i reaches the supposed IMF number or no new IMF can be extracted.
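Putting the pieces together, a compact sketch of the HM-EMD loop follows; the variable-analysis-window splitting of steps 2 and 5 is omitted for brevity, and all function names, sifting limits, and stopping rules are assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import find_peaks, hilbert

def first_imf(x, sift_iters=30):
    """One round of classical sifting: extract the first IMF of x."""
    t = np.arange(len(x), dtype=float)
    h = np.asarray(x, dtype=float).copy()
    for _ in range(sift_iters):
        peaks, _ = find_peaks(h)
        troughs, _ = find_peaks(-h)
        if len(peaks) < 4 or len(troughs) < 4:
            break  # too few extrema to continue sifting
        m = 0.5 * (CubicSpline(t[peaks], h[peaks])(t)
                   + CubicSpline(t[troughs], h[troughs])(t))
        h = h - m
    return h

def hm_emd(x, fs, n_imfs=3):
    """Algorithm 2 without window splitting: a heuristic sine mask,
    built from the residual's own Hilbert statistics, per IMF."""
    r = np.asarray(x, dtype=float).copy()
    t = np.arange(len(r)) / fs
    imfs = []
    for _ in range(n_imfs):
        z = hilbert(r)
        mu_f = np.mean(np.diff(np.unwrap(np.angle(z)))) * fs / (2 * np.pi)
        mu_a = np.mean(np.abs(z))
        s = mu_a * np.sin(2 * np.pi * max(mu_f, 0.0) * t)  # heuristic mask
        c = 0.5 * (first_imf(r + s) + first_imf(r - s))    # formulas (3)-(5)
        imfs.append(c)
        r = r - c  # step 6: update the residual
    return imfs, r
```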

3. HM-EMD-Based Acoustic Scene Classification

The audio stream contains the hidden acoustic events that can represent the acoustic scene. In this section, HM-EMD is first used to decompose the acoustic scene signals, and the IMF of hidden acoustic events in these acoustic scene signals is analysed. According to the analysis results, a full-band IMF hidden component feature is proposed to represent the hidden acoustic events. Finally, the process of acoustic scene classification using these features is given in detail.

3.1. Acoustic Scene Signal Analysis by HM-EMD

When processing the original signal with HM-EMD, the variable analysis window and mask signal are used to guide the decomposition. The frame length is selected according to the frequency structure of the signal itself, and the frequency-domain components corresponding to each IMF are relatively independent, which makes the features more interpretable. The instantaneous frequency and amplitude of each IMF contain almost all the information of the IMF components, which means that the instantaneous frequencies and amplitudes of all IMF components carry most of the information of the signal to be analysed and can be used directly as the basic features of the signal. Figure 2 shows the time-domain waveforms of some typical IMFs with hidden acoustic events in ambient audio streams, in which only the most significant of all IMF waveforms is shown. It can be seen that the time-domain waveform characteristics of these events are very distinct, their extreme values and over-average rates differ greatly, and they are distributed in the low, medium, and high frequency bands. Therefore, this paper proposes full-band IMF hidden component features, which can distinguish these events well and effectively improve ambient audio stream recognition. The feature calculation method is given in Section 3.2.

3.2. Hidden Acoustic Component Features

Figure 2 shows various hidden components in the acoustic scene data. On the one hand, the hidden components cause significant interference to the signal spectrum, thereby greatly degrading ambient audio stream recognition based on traditional spectral features (such as MFCC). On the other hand, the types and characteristics of hidden components corresponding to different ambient audio streams also exhibit significant differences. These hidden components are closely related to the types of acoustic events, so features constructed from them can help to distinguish ambient audio streams. For a hidden component, its frequency, amplitude, and change mode can effectively reflect its essential attributes, and almost all of this information is reflected in the envelope shape of the IMFs obtained by decomposition. Therefore, we design a set of hidden acoustic component features (HACFs). Based on the IMFs decomposed by HM-EMD, these features extract the relevant information of hidden components, including the shock intensity feature (SH) and the average crossing rate (ACR) feature.

3.2.1. Shock Intensity Feature (SH)

The shock intensity of the jth IMF is defined as the ratio SH_j = max(e_j^u(t))/min(e_j^u(t)), where max(e_j^u(t)) is the upper limit of the signal amplitude in the jth IMF and min(e_j^u(t)) is the lower limit. The ratio represents the change intensity of the hidden components relative to the steady components and measures the changes in signal amplitude. As the sum of the mean values of the upper and lower envelopes of an IMF is 0, the signal is symmetrical along the time axis, and the information carried by the upper and lower envelopes is almost the same. Therefore, a one-sided envelope is sufficient and ensures the consistency of the signs of the two values. The superscript u means that the upper envelope is used for the calculation.
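A sketch of the SH computation, assuming (since the formula is not reproduced in this copy) that SH is the ratio of the maximum to the minimum of the IMF's upper envelope; the Hilbert analytic amplitude stands in for the envelope, and the edges are trimmed to suppress boundary artefacts.

```python
import numpy as np
from scipy.signal import hilbert

def shock_intensity(imf, trim=0.05):
    """SH of one IMF: max/min ratio of the upper (one-sided) envelope.
    trim: fraction of each edge dropped to avoid Hilbert edge effects."""
    n = len(imf)
    k = max(1, int(n * trim))
    upper = np.abs(hilbert(imf))[k:n - k]  # upper envelope, edges trimmed
    return float(np.max(upper) / np.min(upper))
```

A steady tone yields SH near 1, while an IMF carrying an amplitude burst (a hidden event) yields a markedly larger value.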

3.2.2. ACR Feature

The ACR feature expresses the number of times the upper envelope of an IMF passes through its mean value, that is, the number of times the IMF's upper envelope (time-domain amplitude) fluctuates significantly. If the value is large, the IMF amplitude fluctuates frequently near the mean value. For ambient audio stream recognition, if the value is greater than a certain threshold (10 Hz or above), the data may not contain obvious, meaningful hidden components, and the changes of the upper envelope near its mean are only the normal fluctuation of the acoustic signal itself. If the value is less than the threshold, the data may contain significant hidden components, and one-half of the crossing frequency is the frequency of the hidden components.
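A sketch of the ACR computation under the same assumptions as for SH (Hilbert-based upper envelope; the crossing count is normalized to crossings per second):

```python
import numpy as np
from scipy.signal import hilbert

def average_crossing_rate(imf, fs):
    """ACR: crossings per second of the IMF's upper envelope through its
    mean. High values suggest ordinary amplitude ripple; low values hint
    at a slow, significant hidden component."""
    env = np.abs(hilbert(imf))
    above = (env > env.mean()).astype(int)
    crossings = np.count_nonzero(np.diff(above))
    return crossings * fs / len(imf)  # crossings per second
```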

3.3. Ambient Audio Stream Classification

The process of audio stream classification based on heuristic mask empirical mode decomposition is shown in Figure 3. First, the HM-EMD method is used to decompose the signal into a set of IMFs. Then, the basic features (instantaneous frequency and instantaneous amplitude) and the hidden HACF features are extracted from the IMFs and assembled into a feature matrix, which is input into the classifier to obtain the final recognition result. In order to verify the validity of the features, two kinds of classifiers are used in this paper. One is a three-layer perceptron model, whose specific structure is shown in Figure 3: the first two layers use the sigmoid activation function and have 500 and 250 neurons, respectively, and the output layer is a softmax classifier with 10 neurons. The other is the TridentResNet model, which consists of three branches, each of which is a ResNet101. The branches use different convolution kernel sizes in their bottleneck modules in order to obtain features at different scales, and the branch features are finally fused to give the recognition result. The experimental results show that under both model systems, the system based on HM-EMD features shows satisfactory results. The specific experimental results and analysis are as follows:
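The three-layer perceptron branch of Figure 3 can be sketched with scikit-learn; the architecture (500 and 250 sigmoid neurons, softmax output) follows the text, while the library choice, solver, and iteration budget are assumptions of this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_classifier():
    """Three-layer perceptron of Figure 3: sigmoid hidden layers with 500
    and 250 neurons; the softmax output is handled internally by sklearn."""
    return MLPClassifier(hidden_layer_sizes=(500, 250),
                         activation="logistic",
                         max_iter=300, random_state=0)

def classify_streams(features, labels):
    """Fit on the frame-level feature matrix (basic features + HACFs
    stacked column-wise) and return the training accuracy."""
    clf = build_classifier()
    clf.fit(features, labels)
    return clf.score(features, labels)
```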

4. Experiments and Results

In this section, we evaluate the proposed HM-EMD method with respect to the validity of modal separation and the performance of audio stream classification. First, we describe the experimental setup, including the evaluation criteria and datasets. Second, we present the effectiveness analysis for modal separation and the acoustic scene classification methods against which the proposed method is compared. Finally, we compare HM-EMD with the baseline methods in experiments conducted on the DCASE and ASVSpoof2019 datasets and analyse the results in detail.

4.1. Experimental Setup

We verify the results of this work from two aspects: the validity of modal separation and the validity of the HM-EMD features for environmental audio stream classification.

4.1.1. Validation of Modal Separation

A nonlinearity index DN is defined in formula (13) to measure the stability of the decomposition results. The larger DN is, the greater the degree of nonlinearity and the more unstable the components. The verification data are the mixed signals of the three modes in Figure 1:

4.1.2. Validation of the Features of HM-EMD for the Classification of Audio Streams

To verify the effectiveness of the features designed on the basis of HM-EMD, we use the basic HM-EMD feature matrix and the basic features + HACF matrix as the input parameters of the classifier. Specifically, the number of mask EMD reference IMFs is 20; the basic HM-EMD features are the instantaneous frequency and amplitude of each IMF, and the HACFs add three dimensions per IMF. The audio frame length is 0.5 s with an interframe overlap of 0.25 s. The classical mel frequency cepstral coefficients are selected as the contrast features; they include 13-dimensional MFCCs and their delta features, for a total of 39 dimensions, with an audio frame length of 40 ms.

After setting the feature parameters, we conducted the test according to the process designed in Figure 3. Two datasets are used in our experiment:
(1)TASK1A dataset of DCASE [16]: the dataset contains data from ten cities and nine devices, that is, three real devices (A, B, C) and six simulated devices (S1–S6). The dataset is well annotated, covering three location types (indoor, outdoor, and traffic) and ten different ambient audio streams, namely, airport, shopping mall, metro, metro station, pedestrian street, street traffic, tram, park, public square, and bus. The acoustic data span a total of 64 h, with 40 h used for training and 24 h for verification. Each audio segment is 10 s long, and the sampling rate is 44.1 kHz.
(2)ASVSpoof 2019 dataset [17]: a dataset aiming to foster research on countermeasures for detecting voice spoofing in automatic speaker verification. The dataset contains synthesized and replayed speech attacks, classified as logical access and physical access, respectively. There are three subsets under each of the two tracks, namely, a training set, a development set, and an evaluation set. Bona fide voice data for both tracks were collected from 107 speakers (46 male and 61 female) from the VCTK database. The training and development subsets of physical access were created by simulating room acoustics, including 3 room sizes, 3 reverberation levels, and 3 speaker-to-ASV-microphone distances. In addition, there are nine recording configurations, with three recording distances to the talkers and three playback devices of different qualities. Since this paper focuses on testing the HM-EMD features for fake audio recognition, the 10 neurons in the output layer of the two classifiers in Figure 3 are replaced with 2 neurons and the models are retrained; the model output is then the fake audio detection result.
The results are evaluated by the equal error rate (EER). The specific experimental results are shown in Section 4.2.
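For reference, the EER used below can be computed from per-utterance scores by sweeping a threshold until the false acceptance and false rejection rates meet; this is a generic sketch, not the official ASVspoof evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: operating point where the false acceptance rate (spoof accepted)
    equals the false rejection rate (bona fide rejected).
    labels: 1 = bona fide, 0 = spoof; higher score = more bona fide."""
    order = np.argsort(scores)
    y = np.asarray(labels)[order]
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    frr = np.cumsum(y) / n_pos            # bona fide rejected below threshold
    far = 1.0 - np.cumsum(1 - y) / n_neg  # spoof still accepted above it
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)
```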

4.2. Results and Analysis
4.2.1. Effectiveness Analysis for Modal Separation

By comparison with the traditional EMD results, we can see that DN_HM-EMD/DN_EMD < 1 for any given case. Hence, the IMFs produced by the HM-EMD method have the lowest nonlinearity; that is, the IMFs have a high purity and are close to the blind separation result under ideal conditions. The separation result is shown in Figure 4. Features based on these high-purity IMF signals can effectively characterize the subtle changes of the signal components in the time and frequency domains. Hence, the method is suitable for all types of acoustic correlation analysis and recognition, especially for the recognition of ambient audio streams with hidden acoustic events.

4.2.2. Validity of the HM-EMD-Based Features

(1) Ambient Audio Stream Classification. HACFs can be used to identify the hidden components in IMFs and are thus of great significance for ambient audio stream recognition. We verified the discrimination ability of HACFs in different scenarios (Figure 5). The figure shows the scatter projection of some hidden component features in the three-dimensional space. Even the three-dimensional features in a single IMF have a strong scene discrimination ability. HACFs show good discrimination ability among different ambient audio stream categories and thus provide technical support for subsequent ambient audio stream classification.

Figures 6 and 7, respectively, show the acoustic stream classification and recognition results of the MFCC features, the HM-EMD basic features, and the HM-EMD basic features + HACFs with the simple classifier and the complex classifier. The simple classifier is a simple three-layer perceptron, while the complex classifier adopts the optimal classifier in the DCASE competition [18]. As can be seen from the figures, the basic features based on HM-EMD outperform the MFCC features in most scenes under the simple classifier, and the HACFs can effectively capture hidden information in the environment, thereby improving the accuracy of audio stream classification. In the complex classification model, the improved classification ability of the model can compensate for deficiencies of the feature representation, and the overall recognition rate increases.

Table 1 shows the results of acoustic stream classification. It can be seen that the HM-EMD features are superior to the MFCC features under both classifiers: with the basic classifier, the recognition rate of the HM-EMD basic features is 6.7 percentage points higher than that of the MFCC features. After the addition of HACFs, the recognition rate increases by 17.4 percentage points. This result is close to the classification accuracy of the ResNet network with a 32M model size in the DCASE competition [19], while the simple model used in this paper is only 225K. In the complex classification model, the improvement in model classification ability can make up for the lack of features to some extent; even so, the HM-EMD feature set still improves the accuracy by 1.3%, and the recognition result reaches 75.7%. The basic features provide instantaneous characteristics in the time-frequency domain, and the HACFs represent statistical characteristics in the time-frequency domain; their combination helps improve the accuracy of ambient audio stream classification.

(2) Fake Audio Detection. Table 2 presents the fake audio detection results based on HM-EMD features. It can be seen that forged audio based on LA is easier to identify than forged audio based on PA, and the HM-EMD features have a greater effect on fake audio from LA attacks. After adding the hidden information features (HACFs), the detection error rates are reduced by 5.61% and 6.11%, respectively. This is because LA ignores the background sound in the process of synthesizing fake audio; in this case, the added audio background features greatly reduce the detection error rate. It can also be seen from the table that the HM-EMD features reduce the error rate of fake audio detection in both the simple and the complex models, which further proves the effectiveness of the feature extraction method proposed in this paper.

5. Conclusions

Aiming at the problem of audio fraud in networks, this paper proposes a method for extracting hidden information features from audio streams based on HM-EMD. Because audio background information is difficult to forge, the basic features of audio streams and the hidden acoustic component features (HACFs) are constructed from the stable IMFs decomposed by the HM-EMD method. The experimental results show that the HM-EMD-based features have a richer characterization ability for hidden acoustic events than mel cepstral features and can improve the accuracy of both scene classification and fake audio detection. However, since the HM-EMD decomposition process needs to calculate the mask signal according to the structure of the signal itself and use the mask signal to separate aliased components, the algorithm complexity is higher than that of the classical EMD algorithm. Therefore, in subsequent work, we will consider a coevolutionary framework to optimize the algorithm [20]. At the same time, the relationship between the HM-EMD feature system and different hidden acoustic events will be explored further, so as to achieve accurate hidden acoustic event marking in audio streams at different levels and time scales. In general, HM-EMD-based feature extraction of audio streams is helpful for detecting fake audio and provides a new research idea for countering network audio spoofing.

Data Availability

The data used in this study are the public dataset from DCASE challenge (http://dcase.community/challenge2020/task-acoustic-scene-classification and https://datashare.ed.ac.uk/handle/10283/3336).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Jiu Lou and Decheng Zuo conceived and designed the study. Zhongliang Xu performed the simulations. Hongwei Liu reviewed and edited the manuscript. All authors read and approved the final manuscript.

Acknowledgments

This work was in part supported by the National Key Research and Development Program (2018YFC0830602).