Abstract

Deceptive behaviour is a common phenomenon in human society. Research has shown that humans are poor at distinguishing deception, so studying automated deception detection techniques is a critical task. Most of the relevant technologies are susceptible to personal and environmental influences: EEG-based technologies need large and expensive equipment, and facial-based technologies are sensitive to the camera's perspective; these limitations have somewhat restricted the development of applications for deception detection. In contrast, the equipment required for speech deception detection is cheap and easy to use, and the capture of speech is highly covert. Motivated by the application of signal decomposition algorithms in other fields, such as EEG analysis and speech emotion recognition, this paper proposes an EMD-based signal decomposition and reconstruction method to process the speech signal; better deception detection performance was obtained by improving the speech quality. Comparison with other decomposition algorithms showed that EMD is the most suitable for our method. Across many different classification algorithms, accuracy improved by an average of 2.05% and the F1 score by an average of 1.7%. In addition, a new deception detector, the TCN-LSTM network, is proposed in this paper. Experiments showed that this network organically combines the processing capabilities of the TCN and LSTM for time-series data and greatly improves the deception detection recognition rate, with the highest accuracy and F1 score reaching 86.2% and 86.0% under the EMD-based signal decomposition and reconstruction method. Building on this work, signal decomposition algorithms should be further optimised for speech signals, and more classification algorithms not yet applied to this task should be tried.

1. Introduction

The study of deception detection is of great significance, especially regarding deception intended to avoid punishment for a crime, and has been widely studied and applied in the legal, military, and forensic fields [1]. Deception is a deliberate act of misleading others to gain some advantage or avoid punishment [2]; it does not include, for example, self-deception or pathological behaviour, and whether an act constitutes deception depends on the intention of the person acting. In psychological terms, a person who is deceiving exhibits subconscious or conscious behaviours, including shortening of speech, flushing, changes in speech frequency, eye avoidance, changes in pupil diameter, and changes in body posture [3].

Compared to automatic deception detection systems, it is more difficult for humans themselves to recognize deception: accurately detecting deception is a challenging task for nonspecialists [4], and human judgement is highly subjective, so automatic deception detection methods have considerable research value [5]. Current research on deception detection focuses on the following areas: physiological signals (e.g., electroencephalogram, electromyogram, and so on), facial expressions, and body posture. Changes in physiological signals have been used as indicators throughout the recent history of deception detection research. These signals can accurately reflect changes in a person's mental state and have led to the development of many polygraphs, which have been used in various fields for many years. However, there are problems with this method: acquiring these signals requires close contact with the subject, which is invasive and highly likely to cause psychological fluctuations in the subject, leading to inaccurate detection. Deception detection based on face and gesture does not require contact signal acquisition, requiring only a camera as the primary device, which reduces the additional stress on the subject; physical changes such as expressions and gestures have been proved by psychologists to characterize changes in a person's psychological state. However, using this modality requires a certain viewpoint, and if the camera angle is unfavourable, it is difficult to identify the deception.

But if speech signals are used to recognize deception, the capture of speech is covert and can significantly reduce the psychological stress on the subject. What is more, where the recording device is located does not affect deception detection. Studies have shown that speech can map the psychological state of the speaker at the moment, and one can easily distinguish the speaker's general psychological state (happy, sad, or angry) through speech [4]. In addition, researchers have found that people who intend to deceive often show small changes in a range of behaviours, such as vocal pressure, pitch, speech rate, and the vocal organs, when deceiving [6]. Deception detection using speech is already being investigated in many fields; for example, Duke University's Throckmorton used speech and language analysis methods to identify financial deception [7]. It is therefore feasible to analyse speech signals to detect deception.

There have been many studies on deception detection based on speech signals, and machine learning algorithms have achieved good results. Researchers at Columbia University analysed the effectiveness of machine learning methods and human detection on the CSC (Columbia-SRI-Colorado) dataset and showed that human detection was inferior to random selection, while methods based on support vector machines and Gaussian mixture models achieved 64.4% accuracy [8]. Enos, in his Ph.D. thesis, conducted a comparative analysis of five algorithms (support vector machines, naive Bayes, logistic regression, decision trees, and the RIPPER algorithm); the results showed that decision trees and support vector machines perform better [9]. Velichko et al. analysed many different machine learning algorithms on the real-life trial dataset, and the most effective, the random forest algorithm, achieved an accuracy of 79.4% [10]. The literature [11] investigated the impact of ensemble learning methods on deception detection performance, achieving a 70% recognition rate on the real-life trial dataset. Bareeda et al. used Mel frequency cepstral coefficients and an SVM to detect deception and obtained an accuracy of 81% [12].

Neural networks, which have received much attention in recent years, have also played an important role in research on speech deception detection. Xie et al. proposed replacing the multiplicative operations in the LSTM with convolutional operations and achieved an accuracy of 68.4% on the CSC dataset [13]. In 2019, Xie et al. proposed a method that combines speech features with deep learning and obtained an accuracy of 71.4% using a deep belief network [14]. Fu et al. used an improved autoencoder for deception detection and achieved 62.78% and 63.89% accuracy on the CSC corpus and a self-made dataset, respectively [15]; in 2020, they proposed a method based on a denoising autoencoder (DAE) and a long short-term memory (LSTM) network, with accuracies of 65.78% on the CSC and 68.89% on the self-made dataset [16]. Hershkovitch Neiterman et al. developed a deception detection system based on MLP and LSTM and conducted experiments on cross-lingual datasets [17]. Chou and Lee proposed a BLSTM (bidirectional long short-term memory network) architecture with dense layers incorporating an attention mechanism [18], feature fusion [19], and a multitask architecture [20], all of which achieved excellent performance on their own corpus.

Beyond the abovementioned classification methods, many other networks have not yet been applied to deception detection. Considering that the speech signal is time-series data, in this paper, a TCN (temporal convolutional network) is used to detect deceptive speech. This network performs well on time series and has achieved better results than other networks in many application scenarios [21] but has not been used for deception detection.

Regarding speech deception detection, most studies have focused on the classifier and feature level but ignored the speech itself, which is very critical for speech deception detection. Speech signals contain multiple components, so it makes sense to decompose and analyse the signal. Similar attempts have been made in the study of EEG (electroencephalogram) signals, since there are many different waves in the EEG (mainly α, β, γ, δ, and θ waves), and applying EEG signals to practical problems requires the analysis of these different waves: the literature [22] developed a scheme to automatically identify schizophrenia by decomposing the EEG signal through EMD (empirical mode decomposition) and calculating 22 features from each component; Reference [23] proposed a computer-aided clinical decision support system (CACDSS) to detect and diagnose Parkinson's disease from EEG by combining automatic variational modal decomposition (AOVMD) and automatic extreme learning machine (AOELM) classifiers; the literature [24] developed an EEG rhythm separation scheme (VHERS) based on variational modal decomposition (VMD) and the Hilbert transform (HT) to help experts detect attention deficit hyperactivity disorder (ADHD) in a real-time setting. Reference [25] proposed the robust tuneable Q wavelet transform (TQWT) with automatic selection of optimal tuning parameters to accurately decompose nonsmooth EEG signals and identify motor imagery (MI) tasks with low complexity.

What is more, in some studies of speech emotion recognition, classical signal decomposition algorithms have been used to improve recognition performance. Reference [26] proposed a feature extraction method based on VMD (variational modal decomposition) for speech emotion recognition and also conducted a comparative validation against EMD (empirical mode decomposition) and LMD (local mean decomposition). Kerkeni et al. [27] used EMD and the Teager–Kaiser Energy Operator (TKEO) and obtained high accuracy on both the Spanish emotion database and the Berlin database. Krishnan et al. [28] proposed using EMD to separate signals into high, medium, and low frequencies, then calculating five entropy values at the three frequencies, and achieved good performance using these entropy features. However, none of these algorithms has been applied to speech deception detection. These articles demonstrate the usefulness of signal decomposition techniques for speech signals.

So, by combining EMD with signal reconstruction, a novel deception detection system is proposed, in which the combination of TCN and LSTM is used as the classifier. Overall, this paper makes two contributions:
(1) First, based on the general characteristics of the signal, a signal decomposition algorithm, which is rarely used in speech signal processing, was applied. The proposed method effectively improved the performance of speech deception detection.
(2) Second, two networks that can process time-series data were concatenated, greatly enhancing their respective capabilities. The network retains the time-series features of speech signals as much as possible, so the results of speech deception detection are greatly improved.

2. EMD and TCN-LSTM Deception Detection System

As shown in Figure 1, first, the speech signal is decomposed and reconstructed; then the reconstructed signal is used to extract MFCC (Mel frequency cepstral coefficient) features; finally, the features are fed into the TCN-LSTM network to train the deception detection classifier.

2.1. EMD (Empirical Mode Decomposition)

EMD is an adaptive signal decomposition method proposed by Huang et al. [29]. It is useful for nonsmooth and nonlinear signals, and the speech signal is exactly this kind of signal. The EMD algorithm decomposes the signal into IMFs (intrinsic mode functions), and an IMF must satisfy the following two conditions: first, the difference between the number of extreme points and the number of zero crossings is 0 or 1; second, at any time, the average of the upper envelope formed by the local maxima and the lower envelope formed by the local minima is 0.

The specific steps of EMD are as follows:
(1) Plot the upper and lower envelopes based on the local maxima and minima of the original signal, respectively.
(2) Calculate the mean of the upper and lower envelopes to obtain the mean envelope.
(3) Subtract the mean envelope from the original signal to obtain the intermediate signal.
(4) If the intermediate signal meets the two IMF conditions, this signal is an IMF; subtract it from the signal and repeat steps (1) to (4) on the residual. Otherwise, use the intermediate signal as the original signal and repeat steps (1) to (4). This process iterates until the stopping condition is satisfied.
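As a rough illustration (not the authors' implementation), the sifting procedure above can be sketched in Python with NumPy and SciPy; here the stopping rule is simplified to a fixed number of sifting passes, and the function names are illustrative:

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def sift_once(x):
    """One sifting pass: subtract the mean envelope from the signal."""
    t = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build the envelopes
    upper = CubicSpline(maxima, x[maxima])(t)   # envelope of local maxima
    lower = CubicSpline(minima, x[minima])(t)   # envelope of local minima
    return x - (upper + lower) / 2.0            # remove the mean envelope

def extract_imf(x, n_sifts=10):
    """Approximate one IMF with a fixed number of sifting passes."""
    h = np.asarray(x, dtype=float).copy()
    for _ in range(n_sifts):
        nxt = sift_once(h)
        if nxt is None:
            break
        h = nxt
    return h
```

A full EMD would subtract each extracted IMF from the signal and repeat the sifting on the residual until too few extrema remain.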

Some of the IMFs obtained by EMD may be ineffective, or even harmful, for distinguishing deceptive speech. So, some IMFs could probably be removed to improve the discrimination of deceptive speech. This observation is the motivation for our proposed scheme.

2.2. Speech Signal Decomposition and Reconstruction

In our proposed scheme, the original speech signal needs to be preprocessed, i.e., the speech is first resampled and then a preemphasis operation (which weights the high-frequency part of the resampled signal) is applied to remove the effect of lip radiation.
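A minimal sketch of this preprocessing in Python (the 0.97 pre-emphasis coefficient and the sampling rates are common choices, not values taken from the paper):

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(signal, orig_sr=16000, target_sr=8000, alpha=0.97):
    """Resample the speech, then apply pre-emphasis y[t] = x[t] - alpha*x[t-1]."""
    x = resample_poly(signal, target_sr, orig_sr)  # rational-rate resampling
    return np.append(x[0], x[1:] - alpha * x[:-1])
```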

Next, the above-processed signal is decomposed using EMD to obtain the different components (imf_1, …, imf_i, …, imf_n), and the main components are selected based on a correlation threshold as follows. The correlation coefficient of each imf with the preprocessed signal is calculated as

r_i = \frac{\sum_{j=1}^{N}\left(s_i(j)-\bar{s}_i\right)\left(x(j)-\bar{x}\right)}{\sqrt{\sum_{j=1}^{N}\left(s_i(j)-\bar{s}_i\right)^2}\sqrt{\sum_{j=1}^{N}\left(x(j)-\bar{x}\right)^2}},

where N is the total number of samples of the speech signal, s_i(j) is the jth sample value of the ith subsignal (imf_i) obtained by EMD decomposition, x(j) is the jth sample value of the signal before decomposition, \bar{s}_i and \bar{x} are the average values of the sample points of these two signals, respectively, and r_i lies in the range [−1, 1].

The correlation threshold value T is then obtained from these coefficients:

T = \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i,

where \bar{r} is the mean value of all correlation coefficients and n is the total number of IMFs. Finally, the components whose correlation coefficients reach the threshold are selected, and the selected components are recombined to obtain the reconstructed signal. The complete process is shown in Figure 2(a).
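Putting the selection rule together, a sketch in Python with NumPy (assuming, as read here, that the threshold is the mean of the correlation coefficients):

```python
import numpy as np

def reconstruct_from_imfs(x, imfs):
    """Keep only IMFs whose correlation with the signal reaches the
    threshold (here: the mean correlation), then sum them back up."""
    imfs = np.asarray(imfs)
    r = np.array([np.corrcoef(imf, x)[0, 1] for imf in imfs])
    threshold = r.mean()
    selected = imfs[r >= threshold]
    return selected.sum(axis=0), r, threshold
```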

2.3. Feature and Classifier

Following the previously mentioned signal processing, the features are extracted and a classifier is trained, as shown in Figure 2(b). In this paper, MFCC, a cepstral feature that has been proven effective by many researchers and that was proposed based on the auditory characteristics of the human ear [30], was chosen. The standard MFCC reflects only the static characteristics of the speech parameters; the dynamic characteristics of speech can be described by the differential spectrum of these static features [31]. The complete process is shown in Figure 3, and the specific steps are as follows:
(1) Split the speech signal into frames and multiply each frame by a Hamming window.
(2) Apply the FFT (fast Fourier transform) to each frame to convert the time-domain signal into a frequency-domain representation.
(3) Pass each frame through a Mel filter bank and calculate the logarithmic energy output of each filter.
(4) Obtain the standard MFCC coefficients by DCT (discrete cosine transformation).
(5) Finally, calculate the first- and second-order difference coefficients and combine them with the standard coefficients to get the required MFCC features.
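The five steps can be sketched end to end in NumPy/SciPy (the frame, FFT, and filter-bank sizes here are illustrative, not the paper's settings):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_fft=256, n_mels=26, n_ceps=13):
    # (1) framing + Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # (2) FFT -> power spectrum
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # (3) triangular Mel filter bank + log energy
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        if centre > left:
            fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / (centre - left)
        if right > centre:
            fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / (right - centre)
    log_e = np.log(spec @ fbank.T + 1e-10)
    # (4) DCT -> first n_ceps cepstral coefficients
    ceps = dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # (5) first- and second-order differences, concatenated -> 39 dims
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([ceps, d1, d2])
```

With the defaults above, one second of 8 kHz speech yields a (98, 39) feature matrix, matching the 13 static plus 26 dynamic dimensions described later in the paper.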

As for the classification algorithm, the TCN, based on a convolutional neural network structure, has a powerful deep feature extraction ability, while the LSTM, based on a recurrent neural network, models and predicts time-series data well; the concatenation of TCN and LSTM is therefore used to match the characteristics of the speech signal and the needs of deception detection in our proposed scheme. The MFCC features are fed into the TCN and then the LSTM, which not only effectively extracts deep information but also improves the processing efficiency of the LSTM; a linear layer then outputs the final prediction.
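As an illustrative sketch only (the layer sizes and the simplified non-causal padding are our assumptions, not the paper's configuration), the concatenation could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class TCNLSTM(nn.Module):
    """Sketch of a TCN-LSTM detector: a stack of dilated 1-D convolutions
    followed by a bidirectional LSTM and a linear output layer."""
    def __init__(self, n_feats=39, channels=64, tcn_layers=4, hidden=64):
        super().__init__()
        convs, in_ch = [], n_feats
        for i in range(tcn_layers):
            d = 2 ** i  # dilation doubles at each layer
            convs += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                padding=d, dilation=d),
                      nn.ReLU()]
            in_ch = channels
        self.tcn = nn.Sequential(*convs)
        self.lstm = nn.LSTM(channels, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)  # truthful vs. deceptive

    def forward(self, x):                    # x: (batch, time, n_feats)
        h = self.tcn(x.transpose(1, 2))      # -> (batch, channels, time)
        h, _ = self.lstm(h.transpose(1, 2))  # -> (batch, time, 2*hidden)
        return self.out(h[:, -1])            # last time step -> class logits
```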

3. Dataset and Experiment Configurations

The corpus is a very critical issue in deception detection research. Many researchers have constructed datasets using different approaches: researchers at Columbia University used interviews to build the Columbia-SRI-Colorado (CSC) database [32]; the Idiap Research Institute in Switzerland recorded the Idiap Wolf dataset in the context of the werewolf game [33]; for a Chinese corpus, Soochow University researchers constructed the SUSP deception detection dataset covering three cases: induced deception, deliberately imitative deception, and natural deception [5].

However, many corpora are not open source, and only a small number of them are easily accessible. In this paper, our method is evaluated on the real-life trial dataset [34], which is based on real courtroom trial sessions. Researchers searched public multimedia sources, including public court trials, in which truthful and deceptive statements could be easily identified and verified. The defendant and witnesses in the video should be visible, and the audio quality should be sufficient to understand what was being said.

In the real-life trial dataset, three different verdict results are considered: guilty, not guilty, and exonerated. Thus, the deceptive data were collected from the defendant or the suspect, while the truthful ones were collected from witnesses or from videos of the suspect answering certain facts (verified by the police). The final dataset consists of 121 videos, including 61 deceptive videos and 60 truthful videos, with an average length of 28.0 seconds (27.7 and 28.3 seconds for deceptive and truthful, respectively). The speakers in the dataset comprise 21 females and 35 males, aged between 16 and 60 years.

In our proposed scheme, for the computational convenience of EMD, an upper limit is set on the number of IMFs: if the number reaches 20, no further decomposition is performed.

For the dimension of features, the first 13 dimensions of MFCCs were taken as the features in the experiments of this paper. The first-order and second-order difference features of these static features were extracted to describe dynamic features, and finally, the 39-dimensional MFCC features were used in experiments.

The specific experiments are as follows: first, considering the possible effect of the speech signal sampling rate on deception detection, the results of different sampling rates (4 kHz, 8 kHz, and 16 kHz) are compared. Second, the effectiveness of the signal decomposition reconstruction method is verified, including a comparison of various other decomposition algorithms. Finally, a TCN-LSTM network is trained for deception detection. A 10-fold cross-validation technique was used to ensure that the results obtained were more stable.

In this paper, the accuracy rate and F1 score are used to measure the performance of the system. Accuracy is the percentage of all correct predictions. The F1 score is a statistical measure of the performance of a binary classification model; it takes into account both the precision and recall of a classification model and can be seen as a weighted average of the two. The formulas for calculating accuracy and F1 are as follows:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},

\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN},

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.

The definition of TP, TN, FP, and FN are shown in Table 1.
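For concreteness, both metrics can be computed directly from the four counts defined in Table 1 (a trivial helper, not code from the paper):

```python
def accuracy_f1(tp, tn, fp, fn):
    """Accuracy and F1 score from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1
```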

4. Results and Discussion

Table 2 contains the results at different sampling rates. Overall, the F1 scores do not differ significantly from the accuracy, indicating that the model performance during the experiments was reliable; in particular, there is no case where the deception recognition rate is high while the truth recognition rate is low. The results at 4 kHz are poor: although computation is faster at a low sampling rate, 4 kHz retains too little information, so some useful information is discarded. In addition, there is no significant difference between the results at 8 kHz and 16 kHz, indicating that enough useful information is retained at both rates. However, signal decomposition is faster at 8 kHz, and the results at 16 kHz do not far exceed those at 8 kHz, so an 8 kHz sampling rate is used in the following experiments.

Table 3 shows the results of the signal decomposition and reconstruction method for speech deception detection. In addition to using EMD as the decomposition algorithm, the results were compared with those of two other decomposition algorithms: one is the LMD (local mean decomposition), which decomposes a complex multicomponent signal into the sum of several product functions (PFs); the other is the VMD (variational modal decomposition), which assumes that the signal consists of a series of subsignals, each with a specific centre frequency and finite bandwidth, obtained by constructing and solving a variational problem.

Overall, speech deception detection is improved by applying the signal decomposition and reconstruction method. However, the VMD method causes a significant loss in recognition rate. The reason may be as follows: VMD is computed by solving a variational problem and cannot restore the original signal by summing all subsignals as EMD can. This process is likely to change the internal structure of the signal, which has a negative effect on deception detection, a task that relies on deep signal information.

For a more intuitive analysis, the results of the EMD and LMD algorithms were statistically analysed, and the histograms shown in Figure 4 were drawn based on the differences between their results and those of the original signal.

The histograms clearly show the performance of the two signal decomposition algorithms: EMD outperforms LMD on average, and both have a large standard deviation, owing to differences in the sensitivity of the different classification algorithms to the reconstructed signals. Overall, the EMD algorithm improves accuracy by an average of 2.05% and the F1 score by 1.7%, so subsequent experiments are based on the EMD signal decomposition algorithm only.

Table 4 shows a comparison of different parameter settings of the TCN-LSTM. The number of hidden layers of the TCN, the number of layers of the LSTM, and whether a bidirectional LSTM was used were compared. Based on experience and other research, 3- and 4-layer TCNs and 1- and 2-layer LSTMs were evaluated.

According to the results in the table, first, the bidirectional LSTM performs better than the unidirectional one, but the difference is small, indicating that past information in speech is influenced by future information, though not strongly. Second, the 2-layer LSTM not only increases the computational cost significantly but also reduces performance, showing that a 1-layer LSTM is sufficient for modelling the data. What is more, the 2-layer LSTM yields the biggest difference between accuracy and F1 score in the table, which indicates that model reliability is affected. Finally, the 4-layer TCN is more effective than the 3-layer TCN, suggesting that deeper features of the data are being mined, though it is difficult to identify which depth of features is most appropriate. Moreover, since the improvement from the 4-layer TCN is not large, deeper TCNs were not investigated.

Table 5 compares TCN and LSTM used alone with the TCN-LSTM network. Both the accuracy and the F1 score are greatly improved, which fully demonstrates the superiority of TCN-LSTM and suggests a complementary relationship between TCN and LSTM: the TCN effectively extracts deep features of speech but is weaker at learning the relationship between deception and those features, whereas the LSTM is weaker at extracting deep features but models and predicts time-series data well. So, in the TCN-LSTM, the LSTM effectively learns the relationship between deception and the deep features generated by the TCN.

5. Comparison with Other Studies

To evaluate the work in this paper more fully, a comparative analysis with speech-based deception detection studies from recent years was carried out; the results are shown in Table 6.

Compared with research on the same dataset (the real-life trial dataset), the final scheme of this paper has obvious advantages, although few studies have applied deep learning to this dataset and more attempts are needed. The recognition rate of this paper is also higher than that of studies on other datasets. The key difference from general research is that most deception detection studies do not focus on the temporal information in speech or on additional processing of the speech itself.

6. Conclusions

In this paper, a novel system for speech deception detection is proposed. The use of EMD decomposition to reconstruct the speech signal improves the quality of the original speech and increases the recognition rate under a variety of classical classification algorithms, with an average improvement of 2.05% in accuracy and 1.70% in F1 score. In addition, the new network architecture TCN-LSTM organically combines the features of TCN and LSTM and has extremely strong temporal data processing capability, achieving 86.2% accuracy and an 86.0% F1 score on the real-life trial dataset. Moreover, the method of this paper has great advantages compared to similar studies.

However, there are still some shortcomings in this paper: first, the paper does not focus on the effect of other features; second, there is no suitable modification of EMD for speech signals; and finally, it is difficult to validate in other datasets due to the low sharing of datasets in this domain.

So, in future work, the first step is to experiment with more combinations of features, and the second is to further investigate improvements to signal decomposition algorithms for speech signals. As for the deception detection dataset, the plan is to produce a small Chinese dataset drawing on existing datasets, though this will provide only a limited contribution and will not fully solve the problem.

Data Availability

The data used to support the findings of the study can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under grant no. 2020YFC0833201 and in part by the Natural Science Foundation of Shandong Province under grant no. ZR2020MF004.