Abstract
Electroencephalography (EEG) is the measurement of neuronal activity in different areas of the brain through the use of electrodes. As EEG technology has matured, various methods have been applied to EEG-based emotion recognition, most notably the convolutional neural network (CNN). However, these methods are still not ideal, and shortcomings have been found in the results of some EEG feature extraction and classification models. In this study, two CNN models were selected for the extraction and classification of preprocessed data, namely, common spatial patterns- (CSP-) CNN and wavelet transform- (WT-) CNN. With the CSP-CNN, we first used CSP to reduce dimensionality and then applied the CNN directly to extract and classify the EEG features, whereas with the WT-CNN, we used the wavelet transform to extract EEG features and then applied the CNN for classification. The classification results of the two models were subsequently analyzed and compared: the average classification accuracy was 80.56% for the CSP-CNN model and 86.90% for the WT-CNN model. Thus, the average classification accuracy of the WT-CNN model was 6.34 percentage points higher than that of the CSP-CNN.
1. Introduction
An electroencephalogram (EEG) is a record of changes registered on a human or animal scalp, which indicate the electrophysiological activity of brain nerve cells on the cerebral cortex or scalp surface [1]. An EEG captures the spontaneous bioelectric activity of brain cell groups (also known as brain waves) through electrodes and uses potential as the vertical axis and time as the horizontal axis to display the EEG in the form of a curve [2, 3].
The process of emotion recognition based on EEG encompasses the following key steps: emotion induction, EEG signal acquisition, EEG signal preprocessing, EEG feature extraction, emotion pattern learning, and classification [4, 5]. Of these, feature extraction and classification are of particular importance and are the focus of this study.
Feature extraction selects certain signal features to be used as classification parameters, forming characteristic feature vectors [6]. It is a relatively mature technique in machine learning and has developed to include time-domain features of the EEG signals, such as the mean value, standard deviation, peak amplitude, variance, skewness, and kurtosis, as well as frequency-domain features, which are obtained by transforming the time-domain signal into the frequency domain and then extracting relevant parameters for analysis. Common frequency-domain features are extracted by the Fourier transform and by parametric model methods (such as autoregressive (AR), moving average (MA), autoregressive-moving-average (ARMA), and harmonic signal models). Extraction methods for time-frequency domain features include the short-time Fourier transform (STFT) and the wavelet transform, while those for nonlinear dynamic characteristics include methods based on chaos theory, such as the Lorenz scatter plot, maximum Lyapunov exponent, correlation dimension, and Hurst exponent. Methods based on information theory include permutation entropy, singular value decomposition entropy, Lempel-Ziv complexity (LZC), approximate entropy, and sample entropy [7, 8]. For the extraction of statistical features, statistical methods commonly used in EEG analysis include probability random analysis, independent component analysis, and principal component analysis, among others [8].
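As an illustration of the time-domain features listed above, the following minimal sketch computes them for a single channel. The function name `time_domain_features`, the population (1/n) normalization of the moments, and the toy input are assumptions made for illustration and are not taken from this study.

```python
import math

def time_domain_features(x):
    """Basic time-domain statistics of one signal segment (list of floats)."""
    n = len(x)
    mean = sum(x) / n
    # Central moments with 1/n normalization (an assumption of this sketch).
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "variance": m2,
        "std": std,
        "peak_amplitude": max(abs(v) for v in x),
        "skewness": m3 / std ** 3,   # third standardized moment
        "kurtosis": m4 / m2 ** 2,    # fourth standardized moment (not excess)
    }

# Toy segment standing in for one EEG channel.
features = time_domain_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

Concatenating such values across channels gives one simple form of feature vector for a conventional classifier.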
With the rise of artificial intelligence, the convolutional neural network (CNN) has achieved increasingly good results in image and speech processing, but it is still rarely used in the feature extraction and classification of EEG. Some scholars have tried to apply the CNN directly to EEG; however, the accuracy achieved in two-class classification is only about 50%, and classification performance remains unsatisfactory. When Jie et al. [9] used the CNN alone for imagery EEG signal classification, for example, the accuracy rate was just 45%. In another study, in which EEG classification of addiction craving was based on the CNN [10], a new matrix was formed for each electrode and then fed to the CNN to detect addiction craving, and the accuracy improved to approximately 70%. This accuracy is slightly higher than that obtained with the mean value as the feature, but it does not match that obtained with frequency-domain features, and the performance of this method varies greatly among subjects. Reference [9] also tried to use common spatial patterns (CSP) for dimensionality reduction, selecting standardized covariance as a feature to classify motor imagery data, and the accuracy rate reached 91.46% [11]. Furthermore, in another study, in which the wavelet transform- (WT-) CNN model was proposed and applied to competitive sports thinking data, the accuracy rate reached 88.1%, which is 8.2% higher than that of the traditional WT or support vector machine [12]. In this study, the effect of the CNN on emotion classification is explored by applying the WT-CNN model to emotion recognition. Moreover, a new CSP-CNN model is proposed, and a comparative analysis of these two methods is performed.
The main research content of this study is, thus, the application of a CNN in EEG emotion classification. After the EEG data were collected, they were preprocessed by removing ocular and other artifacts and by filtering. Thereafter, a CNN was used directly to extract and classify the EEG data after either dimensionality reduction or wavelet transformation. It should be noted that the CSP-CNN in this study is different from that previously published as “Multi-class motor imaging EEG signal classification based on CSP and convolutional neural network algorithms” [11]. In the CSP-CNN model in this study, no feature extraction work, such as the identification of standardized covariance or energy, is performed between the CSP and the CNN. Both emotion recognition models presented in this study were designed and developed by the authors.
2. Related Work
2.1. Common Spatial Pattern and Wavelet Transform
A spatial filter is highly suitable for the collection and processing of multidimensional signals such as EEG. It can exploit the spatial correlation of EEG signals, eliminate signal noise, and localize cortical nerve activity. Spatial-domain filtering effectively combines time-domain and frequency-domain features, through which better processing results can be achieved [13, 14]. At present, the commonly used spatial filtering techniques in EEG-BCI research include common average reference (CAR), the Laplacian filter, principal component analysis (PCA), independent component analysis (ICA), and common spatial pattern (CSP), which is the most widely used approach. The application process of CSP is shown in Figure 1. CSP is a spatial-filtering feature extraction algorithm for two-class tasks, which can extract the spatial distribution components of each category from multichannel brain-computer interface data [15].

WT, a more recently developed transform analysis method, inherits the localization concept of the STFT while providing a “time-frequency” window that changes with frequency; it is, thus, an ideal tool for signal time-frequency analysis and processing [16].
2.2. Convolutional Neural Network
The CNN has been widely used in the classification of speech and images and has achieved good results. However, there are relatively few studies on its application to EEG, with only minimal reports on EEG-based emotion recognition, such as [17], in which the CNN was introduced to EEG emotion recognition and its application was explored. Since the EEG signal is relatively weak and the extracted features may not be sufficiently clear for the classification of emotions, we introduced a CNN to further process the feature vector of the EEG signal. This secondary processing and classification are designed to improve the accuracy and robustness of classification [17, 18]. At the same time, we also applied the CNN directly after dimensionality reduction in order to improve the EEG characteristics [19], after which the classification results were evaluated.
3. Method
3.1. Feature Extraction Based on CSP
As mentioned, CSP is a commonly used dimensionality reduction method in EEG feature extraction. Its basic principle is to first find a spatial transformation matrix and then transform the EEG to obtain a new matrix [20]. We represented the EEG signals used for classification with an $N \times T$ matrix $E$, where $N$ represents the number of channels for collecting EEG, $T$ represents the number of samples per EEG signal, and $T \geq N$. The normalized covariance matrix is
$$C = \frac{E E^{T}}{\operatorname{trace}\left(E E^{T}\right)},$$
where $(\cdot)^{T}$ is the transpose operation, and $\operatorname{trace}(\cdot)$ indicates the trace of the matrix. $\bar{C}_{+}$ and $\bar{C}_{-}$ are used to represent the spatial covariance matrices of positive and negative emotions, respectively, each of which is obtained by calculating the mean value of the corresponding covariance matrices [21]. Thereafter, the composite matrix of the two covariance matrices can be expressed as
$$C_{c} = \bar{C}_{+} + \bar{C}_{-}.$$
$C_{c}$ can be decomposed into
$$C_{c} = U_{c} \Lambda_{c} U_{c}^{T}.$$
In the above formula, $U_{c}$ is the matrix of eigenvectors of $C_{c}$, and $\Lambda_{c}$ is the diagonal matrix formed by the eigenvalues of $C_{c}$. The whitening matrix was calculated as follows:
$$P = \Lambda_{c}^{-1/2} U_{c}^{T}.$$
Thereafter, the calculated whitening matrix was used to transform the average covariance matrices, using the following formula:
$$S_{+} = P \bar{C}_{+} P^{T}, \qquad S_{-} = P \bar{C}_{-} P^{T}.$$
$S_{+}$ and $S_{-}$ had the same eigenvectors, namely,
$$S_{+} = B \Lambda_{+} B^{T}, \qquad S_{-} = B \Lambda_{-} B^{T}.$$
In the above two formulae, $\Lambda_{+}$ and $\Lambda_{-}$ satisfy $\Lambda_{+} + \Lambda_{-} = I$. That is, the largest eigenvalue of $S_{+}$ corresponds to the smallest eigenvalue of $S_{-}$. The eigenvalues were sorted from large to small, and the eigenvectors were sorted accordingly to get $B$. The whitened matrix was used to obtain the optimal separation covariance matrix [20, 21], with the first $m$ rows and the last $m$ rows of the transformation matrix $B^{T}$ used to form a new matrix, $\tilde{B}^{T}$. The projection matrix for transforming the original signal is
$$W = \tilde{B}^{T} P.$$
The transformed matrix is
$$Z = W E.$$
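The CSP steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalized_cov`, `mean_cov`, and `csp_projection`, the synthetic random trials, and the choice of 8 channels with m = 2 are all hypothetical.

```python
import numpy as np

def normalized_cov(E):
    """Normalized spatial covariance E E^T / trace(E E^T) of one N x T trial."""
    C = E @ E.T
    return C / np.trace(C)

def mean_cov(trials):
    """Average normalized covariance over the trials of one class."""
    return np.mean([normalized_cov(E) for E in trials], axis=0)

def csp_projection(trials_pos, trials_neg, m):
    """Return the 2m x N CSP projection matrix W."""
    C_pos, C_neg = mean_cov(trials_pos), mean_cov(trials_neg)
    # Eigendecomposition of the composite covariance, then whitening.
    lam, U = np.linalg.eigh(C_pos + C_neg)
    P = np.diag(lam ** -0.5) @ U.T
    # Whitened class covariance; S_pos and S_neg share eigenvectors B.
    S_pos = P @ C_pos @ P.T
    lam_pos, B = np.linalg.eigh(S_pos)
    order = np.argsort(lam_pos)[::-1]          # sort eigenvalues large to small
    B = B[:, order]
    # Keep the first m and last m eigenvectors (most discriminative for each class).
    B_sel = np.hstack([B[:, :m], B[:, -m:]])
    return B_sel.T @ P

# Synthetic two-class data: 20 trials per class, N = 8 channels, T = 200 samples.
rng = np.random.default_rng(0)
N, T, m = 8, 200, 2
mix_pos = rng.standard_normal((N, N))
mix_neg = rng.standard_normal((N, N))
trials_pos = [mix_pos @ rng.standard_normal((N, T)) for _ in range(20)]
trials_neg = [mix_neg @ rng.standard_normal((N, T)) for _ in range(20)]

W = csp_projection(trials_pos, trials_neg, m)
Z = W @ trials_pos[0]                          # transformed trial, shape (2m, T)
```

A useful sanity check on such a sketch is that projecting the composite covariance through W yields the identity matrix, which follows from the whitening step and the orthonormality of the eigenvectors.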
In this study, the feature vector was then formed from the data obtained after CSP had reduced the dimensionality of the EEG, as follows:
3.2. Feature Extraction Based on Wavelet Transform (WT)
As WT has been widely introduced in the literature, this study only briefly explains the principle of this approach. Every WT has a “mother” wavelet and a “father” (or “parent”) wavelet, the latter also termed the “scaling function” [22]. Suppose $\psi(t)$ is a square-integrable function, that is, $\psi(t) \in L^{2}(\mathbb{R})$. If its Fourier transform $\hat{\psi}(\omega)$ satisfies the condition (10), then $\psi(t)$ can be used as the mother wavelet.
$$C_{\psi} = \int_{-\infty}^{\infty} \frac{\left|\hat{\psi}(\omega)\right|^{2}}{|\omega|}\, d\omega < \infty. \tag{10}$$
All of the wavelet series of WT can be obtained by translating and scaling the parent wavelet and mother wavelet. The scaling factor is an integer power of 2, and the magnitude of the translation is related to the scaling factor [22, 23]. The wavelet series are orthonormal, which means that they are not only pairwise orthogonal but also normalized. The wavelet series can be expressed as
$$\varphi_{j,k}(t) = 2^{j/2}\, \varphi\left(2^{j} t - k\right), \qquad \psi_{j,k}(t) = 2^{j/2}\, \psi\left(2^{j} t - k\right).$$
The expansion formula of the complete wavelet transform is
$$f(t) = \sum_{k} c_{j_{0},k}\, \varphi_{j_{0},k}(t) + \sum_{j=j_{0}}^{\infty} \sum_{k} d_{j,k}\, \psi_{j,k}(t).$$
In the above formula, $\varphi$ is the parent wavelet and $\psi$ is the mother wavelet; therefore, the coefficients $c$ and $d$ can be calculated by selecting the appropriate parent wavelet and mother wavelet, respectively. The approximate formula for wavelet expansion is
$$f(t) \approx \sum_{k} c_{j_{0},k}\, \varphi_{j_{0},k}(t) + \sum_{j=j_{0}}^{J} \sum_{k} d_{j,k}\, \psi_{j,k}(t).$$
WT is performed on the signal, which is then decomposed into a sequence of wavelet bases and scale functions. The solution formula is
$$c_{j_{0},k} = \int f(t)\, \varphi_{j_{0},k}(t)\, dt, \qquad d_{j,k} = \int f(t)\, \psi_{j,k}(t)\, dt,$$
where $j$ and $k$ are the integer scale and translation indices. According to the Nyquist sampling theorem, when the sampling frequency $f_{s}$ is greater than twice the highest frequency $f_{\max}$ in the signal ($f_{s} > 2 f_{\max}$), the digital signal after sampling completely retains the information contained in the original signal. With an EEG sampling frequency of 250 Hz, for example, the highest frequency of information retained from the original signal is 125 Hz. In this study, we performed a five-scale WT on the downsampled data (as shown in Figure 2), with each layer decomposing the low-frequency band.
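To make the five-scale decomposition concrete, the sketch below repeatedly splits the low-frequency band and lists the nominal frequency range of each sub-band. It deliberately substitutes the Haar wavelet for simplicity, which is not the sym8 wavelet used in this study; the function names, the 250 Hz rate taken from the example in the text, and the 10 Hz test tone are assumptions for illustration.

```python
import math

def haar_step(x):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavedec_haar(x, levels):
    """Multi-level decomposition: each level splits the current low-frequency band."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)            # details[0] is the finest scale (D1)
    return approx, details

def band_edges(fs, levels):
    """Nominal (name, low_hz, high_hz) of each sub-band for sampling rate fs."""
    nyquist = fs / 2.0
    bands = [(f"D{j}", nyquist / 2 ** j, nyquist / 2 ** (j - 1))
             for j in range(1, levels + 1)]
    bands.append((f"A{levels}", 0.0, nyquist / 2 ** levels))
    return bands

fs, levels = 250.0, 5
# 256 samples of a 10 Hz tone; the length is a multiple of 2**levels.
signal = [math.sin(2 * math.pi * 10.0 * n / fs) for n in range(256)]
approx, details = wavedec_haar(signal, levels)
bands = band_edges(fs, levels)
```

With a 250 Hz sampling rate, `bands` runs from D1 (62.5 to 125 Hz) down to A5 (0 to about 3.9 Hz). These band edges are idealized; real wavelet filters overlap somewhat at the boundaries.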

3.3. Feature Classification Based on the CNN
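In the convolutional layer, the size of the filter, that is, the size of the convolution kernel, is usually a 3 × 3 or 5 × 5 square matrix. We used $w$ to represent the weights of the filter, $b$ to indicate the bias term of the filter, and $f$ to denote the activation function. The output of the filter was as follows:
$$g = f\left(\sum_{x} \sum_{y} \sum_{z} a_{x,y,z}\, w_{x,y,z} + b\right),$$
where $a_{x,y,z}$ is the input value at position $(x, y)$ and depth $z$ of the region covered by the filter.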
The above formula describes the forward propagation of the convolutional structure, in which the filter moves from the upper left corner of the current layer of the neural network to the lower right corner, and the output for each corresponding unit matrix is calculated as it moves [24]. A pooling layer is often added between the convolutional layers, which can effectively reduce both the matrix size and the number of parameters in the subsequent convolutional, pooling, and fully connected layers. This study uses the maximum pooling layer, the formula for which is
$$y_{i,j} = \max_{(p,q) \in R_{i,j}} a_{p,q},$$
where $R_{i,j}$ is the pooling window that corresponds to output position $(i, j)$.
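The forward pass of one convolutional filter followed by max pooling can be sketched in NumPy as below. This is an illustrative sketch only: the ReLU activation, the absence of padding, the non-overlapping pooling windows, and the function names `conv2d_valid` and `max_pool` are assumptions, since the text does not specify these details.

```python
import numpy as np

def conv2d_valid(a, w, b):
    """Slide one filter w (kh x kw) over a 2-D input a, add bias b, apply ReLU."""
    kh, kw = w.shape
    out_h, out_w = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):               # top-left to bottom-right
        for j in range(out_w):
            out[i, j] = np.sum(a[i:i + kh, j:j + kw] * w) + b
    return np.maximum(out, 0.0)          # ReLU is an assumed activation f

def max_pool(a, ph, pw):
    """Non-overlapping max pooling with a ph x pw window."""
    h, w = a.shape[0] // ph, a.shape[1] // pw
    trimmed = a[:h * ph, :w * pw]        # drop rows/columns that do not fill a window
    return trimmed.reshape(h, ph, w, pw).max(axis=(1, 3))

feature_map = conv2d_valid(np.ones((5, 5)), np.ones((3, 3)), 0.0)
pooled = max_pool(np.arange(16.0).reshape(4, 4), 2, 2)
# A 2 x 5 window on a 16 x 750 input, as in the first pooling layer of the CSP-CNN.
pooled_shape = max_pool(np.zeros((16, 750)), 2, 5).shape
```

The last line illustrates how a 2 × 5 pooling window shrinks a 16 × 750 input to 8 × 150, which is why pooling reduces the number of parameters in the layers that follow.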
Each node of the fully connected layer is connected to all the nodes of the previous layer; this layer integrates the features extracted by the preceding layers and acts as a “classifier” for the entire network [25]. In this study, a dropout layer was added after the fully connected layer. The addition of the dropout layer not only reduces the error of the training model at each step and accelerates training but also effectively prevents overfitting. The last layer of the CNN is the Softmax layer. Its function is to turn the original output of the neural network into a probability distribution, thus contributing to normalization. Assuming that the output of the original neural network is $y_{1}, y_{2}, \ldots, y_{n}$, the output after Softmax regression processing is
$$\operatorname{softmax}(y)_{i} = \frac{e^{y_{i}}}{\sum_{j=1}^{n} e^{y_{j}}}.$$
In addition, cross-entropy, which describes the distance between two probability distributions, was applied in this study. Given two probability distributions $p$ and $q$, the formula for expressing the cross-entropy of $p$ by $q$ is
$$H(p, q) = -\sum_{x} p(x) \log q(x).$$
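The Softmax and cross-entropy computations described above can be sketched in plain Python as follows. The function names and the example inputs are hypothetical, and subtracting the maximum before exponentiation is a standard numerical-stability step added here rather than something stated in the text.

```python
import math

def softmax(y):
    """Turn raw network outputs into a probability distribution."""
    m = max(y)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x), skipping terms with p(x) = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Two-class example: true label is class 0 (one-hot p), network outputs are raw scores.
scores = [2.0, 0.5]
q = softmax(scores)
loss = cross_entropy([1.0, 0.0], q)
```

For a one-hot target the loss reduces to the negative log of the probability assigned to the true class, so it approaches zero as the network becomes confident and correct.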
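Error backpropagation is based on the principle of gradient descent, in which the parameters need only be updated in the direction of the negative gradient. Suppose $J$ is the cost function; then the iterative update of each weight $w$ and bias $b$ is
$$w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b},$$
where $\alpha$ is the learning rate, and $\partial J / \partial w$ and $\partial J / \partial b$ are the partial derivatives of the error.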
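The update rule above can be illustrated on a deliberately small problem. The sketch below fits a single weight and bias to toy data by gradient descent on a mean squared error cost; the cost function, data, learning rate, and the name `gradient_step` are assumptions chosen for illustration and are unrelated to the network trained in this study.

```python
def gradient_step(w, b, xs, ys, lr):
    """One gradient-descent update of (w, b) for J = mean((w*x + b - y)^2)."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dJ_dw = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    dJ_db = 2.0 / n * sum(errors)
    # Move against the gradient, scaled by the learning rate.
    return w - lr * dJ_dw, b - lr * dJ_db

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # generated by y = 2x + 1
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, lr=0.1)
```

After the loop, `w` and `b` are close to 2 and 1. In a CNN the same update is applied to every filter weight and bias, with the partial derivatives obtained by backpropagation.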
4. Experiment
4.1. Selection and Design of Stimulus Materials
A total of 210 images were used for the stimulus file, of which 105 were intended to induce positive emotions and the other 105 intended to induce negative emotions. The experiment process is illustrated in Figure 3. At the outset of the experiment, subjects were requested to read the instructions on the screen carefully in order to fully understand the experiment process and details. Once the experiment had been completed, the EEG data samples, comprising the training set of positive and negative emotions, were mixed for the model training, and the EEG data samples of the positive and negative emotion test set were mixed for classification.

4.2. Selection of Mother Wavelet
There are many types of mother wavelets, and therefore, it is essential to select the one that is most suitable for the effective extraction of EEG features. The WT in this study was implemented in MATLAB, which provides 15 kinds of mother wavelets, namely, Haar, Daubechies, Biorthogonal, Coiflet, Symlet, Morlet, Mexican hat, Meyer, Gaus, Dmeyer, ReverseBior, Cgau, Cmor, Fbsp, and Shan. At present, there is no unified standard for the selection of wavelet bases, and the selection is based mainly on the accuracy of classification. In an emotion classification experiment, one of many such experiments previously conducted in our laboratory, the Symlets 8 wavelet (sym8) was found most effective in reducing the original signal, and based on “Video Stimulus EEG Signal Feature Research” [10], its comparative effects were better than those of other mother wavelets. Therefore, sym8 was selected as the mother wavelet in this study.
4.3. Acquisition and Preprocessing of EEG Signals
Six students from the Minzu University of China were chosen as the subjects for the EEG collection. The subjects were aged between 22 and 26; all of them were right-handed and healthy, with good sleeping patterns and no history of brain damage or mental illness. Preprocessing is performed mainly to remove noise components from the EEG signal and thereby support reliable analysis of the EEG signal characteristics and extraction of the emotional characteristics of the signal. In this study, preprocessing was performed using Scan 4.5 software to remove interfering artifacts and to apply digital filtering.
4.4. Training
Two classification models, the WT-CNN and CSP-CNN, were used in this study to analyze the preprocessed EEG data for emotion recognition, after which the classification results were compared and analyzed. The individual differences in EEG are obvious; therefore, all EEG classifications in this study were performed on single-person EEG data.
4.4.1. CSP-CNN
The CSP-CNN was used directly in this study to perform feature extraction on the EEG data after dimensionality reduction. That is to say, the CNN was used to directly perform convolution operations on a 16 ∗ 750 matrix. After continuous improvement, the CNN model was established, as shown in Figure 4.

This model consists of two convolutional layers, two pooling layers, a fully connected layer, a dropout layer, and a Softmax layer. The convolution kernels of both convolutional layers in the network are 3 × 3 in size; the first convolutional layer has 16 convolution kernels, and the second has 32. The size of the first pooling layer filter is 2 × 5, while the size of the second pooling layer filter is 4 × 5, and both pooling layers are max pooling layers.
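As a rough check on the size of this architecture, the sketch below tracks the feature-map shape and counts trainable parameters layer by layer. Several details are assumptions not stated in the text: a single-channel 16 × 750 input, convolutions that preserve spatial size ("same" padding), pooling strides equal to the window size, and a hypothetical fully connected width `fc_units = 4`. Under these assumptions the total comes to 12,484, which coincides with the main parameter count reported for this model, although that agreement does not confirm that the authors used this exact configuration.

```python
def conv_params(kh, kw, c_in, c_out):
    """Weights plus one bias per kernel in a convolutional layer."""
    return kh * kw * c_in * c_out + c_out

def pooled(shape, window):
    """Spatial shape after non-overlapping pooling (stride = window, assumed)."""
    return shape[0] // window[0], shape[1] // window[1]

shape = (16, 750)                            # CSP output fed to the CNN
params_conv1 = conv_params(3, 3, 1, 16)      # 16 kernels of size 3 x 3
shape = pooled(shape, (2, 5))                # first pooling layer: 2 x 5
params_conv2 = conv_params(3, 3, 16, 32)     # 32 kernels of size 3 x 3
shape = pooled(shape, (4, 5))                # second pooling layer: 4 x 5

flat = shape[0] * shape[1] * 32              # flattened input to the dense layer
fc_units = 4                                 # hypothetical width, not given in the text
params_fc = flat * fc_units + fc_units

total_params = params_conv1 + params_conv2 + params_fc
```

Without the two pooling layers, the flattened input to the fully connected layer would be 16 × 750 × 32 values, so the dense layer alone would need several orders of magnitude more weights.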
Table 1 shows that, although the sample dimension was very large, the main parameters numbered 12,484, as the addition of the pooling layers effectively reduced the number of training parameters and sped up the operation and training of the network. Moreover, the cross-entropy of this model mostly remained below 0.01 within 40,000 training steps. A smaller loss function value indicates that the convolutional neural network has converged further after training. To ensure that the training of the network had reached a stable state, the classification result after 50,000 steps was taken as final for this model. The accuracy of the emotion recognition of the six subjects after the training and classification of this model is presented in Table 2.
The CNN was employed to directly extract and classify the data after CSP dimensionality reduction, achieving an average accuracy rate of 80.56%. This result shows that a CSP-CNN can be used to effectively extract features from EEG data.
4.4.2. WT-CNN
When building a CNN model, its structure is determined by a variety of parameters. It is necessary to select the appropriate number of layers and to determine the number and size of each layer of the convolution kernel. After constant debugging, the WT-CNN model shown in Figure 5 was established for the classification of wavelet entropy features.

This model comprises two convolutional layers, a fully connected layer, a dropout layer, and a Softmax layer. No pooling layer was used for dimensionality reduction, in order to optimize information retention and classification accuracy. The parameters of each layer of the WT-CNN model are shown in Table 3, in which it is evident that, while there is no pooling layer, the main training parameters number fewer than 50,000 owing to the small training samples, the small number of convolution kernels, and the minimal parameters in the fully connected layer. Therefore, the single-step training speed of this model is relatively fast, with training generally completed within 100,000 steps, and the value of the loss function remains below 0.02.
A smaller loss function value indicates that a network is more convergent after training. In order to ensure that the network in this WT-CNN model remained stable after training, 110,000 steps were completed, and the classification result at that point was selected as final. The classification results of the WT-CNN model on the six groups of individual EEG data are shown in Table 4.
The average accuracy of the WT-CNN model was 86.90%. In this model, the wavelet transform was used for feature extraction. The support vector machine is a more mature method for EEG-based emotion recognition. The WT-CNN model’s slight improvement in classification indicates that it is a feasible approach for EEG-based emotion recognition.
Table 5 presents a comparison of the classification results of the two models, CSP-CNN and WT-CNN. It shows that a CNN can be used for emotion feature classification, as the classification results were relatively accurate. Furthermore, WT was found to be an excellent method for extracting emotional features, and the WT-CNN achieved the better classification performance of the two emotion recognition models.
5. Conclusion
The study presents research and comparative analysis of the application of two CNN models in EEG-based emotion classification of processed samples.
As the results of previously used methods for feature classification, such as standardized variance, among others, are generally not sufficiently accurate, and because CNNs are widely used in image feature extraction, it was hypothesized that this approach could be used to achieve more effective outcomes. First, a CSP-CNN model was established in which the CNN extracts and classifies the data after CSP dimensionality reduction. The average classification accuracy of the CSP-CNN model was 80.56%, and its classification effect was good. In addition, because the wavelet transform is known to be an excellent method for extracting emotional features, we established the WT-CNN model. Its average classification accuracy was 86.90%, an improvement of 6.34 percentage points over the CSP-CNN model. These experiments, thus, showed the feasibility of using wavelet entropy as an effective method for feature extraction.
Analytical comparison of the two approaches suggests that the WT-CNN achieves better results than the CSP-CNN for the following reasons. First, wavelet variance is an effective feature quantity based on multiresolution analysis. It can characterize the signal at different scales, and it does not directly process a large number of wavelet coefficients but instead mines them to obtain the underlying information. Furthermore, wavelet variance is clear in meaning, simple to calculate, and insensitive to noise. Finally, by selecting the appropriate filter, the wavelet transform can greatly reduce or even remove correlations between the different extracted features, thereby reducing the difficulty of calculations, increasing their speed, and improving accuracy.
Future research could (1) study the application of the convolutional neural network in multitype emotion recognition, (2) use the convolutional neural network in emotion recognition of large-sample EEG data, and (3) investigate whether EEG contains emotional features or look for the timepoint at which emotional features appear.
Data Availability
The EEG data used to support the findings of this study are supplied by the National Natural Science Foundation of China under license and so cannot be made freely available. The data are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.