Abstract
Traditional approaches to emotion recognition rely on unimodal physiological signals, which limits their effectiveness. To overcome these limitations, this paper proposes a new method based on time-frequency maps that extracts features from multimodal biological signals. First, the electroencephalogram (EEG) and peripheral physiological signal (PPS) channels are fused, and then the two-dimensional discrete orthonormal Stockwell transform (2D-DOST) of the multimodal signal matrix is calculated to obtain time-frequency maps. A convolutional neural network (CNN) then extracts local deep features from the absolute value of the 2D-DOST output. Since some deep features are uninformative, a semisupervised dimension reduction (SSDR) scheme reduces their number while balancing generalization and discrimination. Finally, a classifier recognizes the emotion. A Bayesian optimizer finds the SSDR and classifier parameter values that maximize recognition accuracy. The performance of the proposed method is evaluated on the DEAP dataset, which contains EEG signals in 32 channels and PPSs in eight channels from 32 subjects, considering two- and four-class scenarios through extensive simulations. The proposed method reaches accuracies of 0.953 and 0.928 for the two- and four-class scenarios, respectively. The results indicate that multimodal signals are more effective for detecting emotions than unimodal signals and that the proposed method outperforms recently introduced ones.
1. Introduction
Emotion recognition is widely used in healthcare, teaching, human-computer interaction, and other fields. Since physiological signals reflect the real emotional state of an individual, they are widely used for emotion recognition. Single-modality approaches extract a series of features from individual channels and therefore cannot make full use of the relevant information shared among channels. Multimodal emotion recognition is an emerging interdisciplinary field of research in affective computing and sentiment analysis. It aims to exploit the information carried by signals of different natures to make emotion recognition systems more accurate, which requires a powerful multimodal fusion method [1].
This paper proposes an emotion recognition scheme based on multimodal signals consisting of electroencephalograph (EEG) and peripheral physiological signals (PPSs). The proposed method utilizes the two-dimensional discrete orthonormal Stockwell transform (2D-DOST) to capture the intramodal and cross-modal correlations among the EEG and PPS signals as well as the relations between the samples of each signal. Then, a convolutional neural network (CNN) extracts local deep features from the output of the 2D-DOST. Since the set of deep features contains several redundant features, semisupervised dimension reduction (SSDR) is applied, and a classifier recognizes the emotion. The feature reduction and classification performance depend on several parameters, which are obtained by a Bayesian optimization approach to maximize accuracy. We considered binary and four-class scenarios on the database for emotion analysis using physiological signals (DEAP) to assess the performance of the proposed method. The results demonstrate that the proposed method outperforms recently introduced methods. Hence, the contributions of this paper are as follows:
(i) Proposing a new method for multimodal emotion recognition using EEG and PPS
(ii) Using the 2D-DOST to analyze the intramodal and cross-modal correlations
(iii) Extracting deep features by a CNN and then reducing their number by a semisupervised method
(iv) Jointly optimizing the parameters of the SSDR and the classifier
(v) Performing extensive simulations to demonstrate the performance of the proposed method.
Following this introduction, Section 2 presents the related works on multimodal emotion recognition. Section 3 describes the dataset and a detailed description of the proposed method. Section 4 contains the results and discussion, and Section 5 concludes the paper.
2. Related Works
The EEG is the most used physiological signal in single-modal emotion recognition systems [2–6]. In multimodal systems, the EEG is usually combined with other physiological signals, such as the PPS, for emotion recognition. A hierarchical fusion based on the CNN was proposed in [7] to extract the potential information from multimodal signals, including the EEG and the PPS, and feature-level fusion was performed to merge the deep and statistical features; binary classification scenarios based on the valence and arousal dimensions were considered on the DEAP and MAHNOB-HCI datasets. The method presented in [8] combines the EEG and PPS with eye movement signals, and the joint oscillation structure of the multichannel signals was analyzed by the multivariate synchrosqueezing transform (MSST). After that, a deep CNN extracted the local features from the MSST output. Binary scenarios based on the arousal and valence dimensions were evaluated on the DEAP and MAHNOB-HCI datasets. An ensemble CNN was utilized in [9] to analyze the correlation between EEG and PPS signals from the DEAP dataset for multimodal emotion recognition. The multistage multimodal dynamical fusion network was proposed in [10] to analyze the unimodal, bimodal, and trimodal intercorrelations; it was shown that multistage fusion performs better than single-stage fusion on the DEAP dataset. The multiple-fusion-layer-based ensemble classifier of stacked autoencoders was proposed in [11] to recognize emotions from the DEAP dataset. PPSs such as the galvanic skin response (GSR), respiration patterns, and blood volume pressure were utilized in [12]; this method combines several continuous wavelet transforms (CWTs) and classifies them using a CNN, and the four-class scenario on the DEAP dataset was considered for performance evaluation. The EEG, pulse, skin temperature, and blood pressure were recorded by wearable sensor nodes in [13], and a fuzzy support vector machine (SVM) performed the emotion recognition.
Audio- and video-based signals are used separately or combined with physiological signals for multimodal emotion recognition [14–16]. The EEG and facial expressions were used in [17] for multimodal emotion recognition: a combination of the CNN and an attention mechanism extracts the essential features from facial expressions, a CNN extracts the spatial features from EEG signals, and the features of the different modalities are merged at the feature level. Binary scenarios on the DEAP and MAHNOB-HCI datasets were considered for performance evaluation. Another method based on EEG signals and facial expressions was presented in [18]. The authors in [19] used facial expressions, GSR, and EEG signals with a hybrid fusion strategy; they considered three emotions on the LUMED-2 dataset and four classes on the DEAP dataset. In [20], a 3D-CNN extracts the spatiotemporal features from the EEG signals and the video. A hybrid multimodal data fusion method was presented in [21] to fuse the audio and video signals from the DEAP dataset using a latent space linear map. Principal component analysis (PCA) and a CNN were used for fusion and feature extraction from EEG and audio signals in [22], and then the grey wolf optimization algorithm was employed to select the combined features. Since the heart rate can be detected from the photoplethysmography (PPG) signal, some research has used PPG. A method based on PPG and GSR signals was proposed in [23], which uses a 1D-CNN autoencoder model and a lightweight model obtained by knowledge distillation; its performance was evaluated on the DEAP and MERTI-Apps datasets. The heart rate was extracted from PPG signals in [24], and then a combination of the 1D-CNN and long short-term memory (LSTM) was adopted for classification on MAHNOB-HCI. Time- and frequency-domain features were extracted from PPG and GSR signals in [25] for emotion recognition, and it was shown that feature selection with random forest recursive feature elimination and classification by the SVM yields the highest accuracy. Table 1 summarizes the recently introduced research on multimodal emotion recognition from biological signals. It is observed that DEAP is the most used dataset. Also, most works rely on time-domain analysis, while time-frequency analyses were adopted only in [8, 12]. In these works, feature concatenation was performed after feature extraction from each modality, so cross-modal correlation was not considered in the feature extraction process.
3. Proposed Method
3.1. Dataset
To evaluate the performance of the proposed method, we consider the DEAP dataset [26]. Researchers at Queen Mary University of London developed this publicly available dataset to analyze the emotions of 32 subjects, rated on a scale of one to nine for valence and arousal. Forty videos, each with a recording duration of 63 seconds, were used as trigger stimuli during the experiments. This dataset contains EEG and PPS signals. EEG signals were recorded using 32 electrodes. The PPSs are horizontal electrooculography (hEOG), vertical EOG (vEOG), zygomaticus major electromyography (zEMG), trapezius EMG (tEMG), galvanic skin response (GSR), respiration belt, plethysmograph, and temperature. All signals were downsampled to 128 Hz, and the EEG and PPS signals were passed through bandpass and lowpass filters, respectively. The middle 30 seconds of the 63 seconds of recorded data were considered for further processing, since each subject is generally assumed to reach a stable emotional state in the middle of the video. The selected part of the signals was partitioned into three-second segments with 50% overlap between consecutive segments. Therefore, there are 40 trials for each subject, and each trial yields 19 segments of 384 samples.
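To make the segmentation concrete, the following NumPy sketch splits one preprocessed DEAP trial into the 19 overlapping segments described above; the exact placement of the 30-second window within the 63-second recording is an assumption of this sketch, not a detail taken from the dataset description.

```python
import numpy as np

FS = 128                  # sampling rate after downsampling (Hz)
SEG_LEN = 3 * FS          # 3 s segments -> 384 samples
STEP = SEG_LEN // 2       # 50% overlap -> 192 samples

def segment_trial(trial, start_s=16.5, dur_s=30.0):
    """trial: (n_channels, 63 * FS) array of one preprocessed DEAP trial.
    Returns (n_segments, n_channels, SEG_LEN); start_s marks the assumed
    beginning of the middle 30 s window ((63 - 30) / 2 = 16.5 s)."""
    a = int(start_s * FS)
    mid = trial[:, a:a + int(dur_s * FS)]                 # middle 30 s (3840 samples)
    starts = range(0, mid.shape[1] - SEG_LEN + 1, STEP)   # 19 window starts
    return np.stack([mid[:, s:s + SEG_LEN] for s in starts])
```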
This paper considers two scenarios based on valence and arousal for rating the emotional signals. The binary scenario classifies the multimodal signals based on the valence rating into positive and negative emotions, as shown in Figure 1(a). Conversely, the four-class scenario considers the 2D valence-arousal model for classifying emotions into one of the following categories: sad, calm, happy, and angry, as shown in Figure 1(b).

3.2. Proposed Method
Here, the proposed method for multimodal emotion recognition from EEG and PPS signals is explained in detail. The general framework of the proposed method is shown in Figure 2, which consists of the following four main steps: data fusion, feature extraction, feature reduction, and classification.

3.2.1. Fusion
Previous works based on multimodal signals usually extract the features from different modalities separately and then merge the extracted features. In this manner, the cross-modal correlation is not considered. Also, there are many redundant features. To overcome this drawback, we propose to merge the multimodal signals before any feature extraction process. Let XEEG and XPPS denote the EEG and PPS matrices of a segment, with the sizes of 32 × 384 and 8 × 384, respectively. After fusion, there is the matrix Xm with the size of 40 × 384, which is considered for further processing.
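A minimal sketch of this early fusion step, assuming the EEG-PPS channel ordering that Section 4.2 later reports as the best performing:

```python
import numpy as np

def fuse(x_eeg, x_pps):
    """Stack a 32 x 384 EEG segment and an 8 x 384 PPS segment of the same
    time window into one 40 x 384 multimodal matrix (EEG-PPS ordering)."""
    return np.vstack([x_eeg, x_pps])
```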
3.2.2. Feature Extraction
The Stockwell transform was introduced to overcome the drawbacks of the short-time Fourier transform (STFT) and the wavelet transform while retaining their advantages [27]. The STFT uses a fixed window size for signal analysis, resulting in a tradeoff between time and frequency resolution. In contrast, the Stockwell transform uses a variable-length window, so different frequency components can be analyzed with different time resolutions, which is useful for both transient and stationary signals. Since the Stockwell transform uses a Gaussian window, it provides a localized time-frequency map (TFM), whereas the STFT spreads the spectral energy over multiple time-frequency bins due to its rectangular window. This characteristic allows the time and frequency content of signal components to be identified accurately. The STFT also suffers from smearing due to the rectangular analysis window; the Stockwell transform mitigates this issue by using a window that tapers off smoothly. Moreover, the Stockwell transform retains absolutely referenced phase information, whereas the STFT distorts the phase through the windowing process [28, 29].
For a continuous-time signal x(t), the continuous Stockwell transform S(τ, f) is computed as follows [30]:

S(\tau, f) = \int_{-\infty}^{\infty} x(t)\, \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2}}{2}}\, e^{-j 2\pi f t}\, dt = A(\tau, f)\, e^{j \varphi(\tau, f)},

where t and τ are the time variables, f denotes the frequency, and σ = 1⁄| f | is the scale factor of the Gaussian window. Also, A(τ, f) and φ(τ, f) are the magnitude and phase of the Stockwell transform, respectively. The output of the Stockwell transform is a complex-valued matrix whose rows and columns are concerned with time and frequency, respectively.
For the discrete signal x[n], n = 0, 1, …, N − 1, obtained from x(t) by sampling with period T, and with X[m] denoting the discrete Fourier transform (DFT) of x[n], the discrete Stockwell transform can be calculated for m ≠ 0 by replacing f with m/(NT) and τ with kT as follows [30]:

S\!\left[kT, \frac{m}{NT}\right] = \sum_{p=0}^{N-1} X\!\left[\frac{p+m}{NT}\right] e^{-\frac{2\pi^{2} p^{2}}{m^{2}}}\, e^{\frac{j 2\pi p k}{N}}, \quad m \neq 0.

For m = 0, the Stockwell transform equals the DC value of the DFT, i.e., S[kT, 0] = (1/N) ∑ x[n]. The 2D Stockwell transform of the 2D image h(x, y) is computed as follows [30]:

S(u, v, f_x, f_y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)\, \frac{|f_x|\,|f_y|}{2\pi}\, e^{-\frac{(u-x)^{2} f_x^{2} + (v-y)^{2} f_y^{2}}{2}}\, e^{-j 2\pi (f_x x + f_y y)}\, dx\, dy.
The shift parameters u and v control the centre position of the Gaussian windows along the two axes, and f_x and f_y denote the horizontal and vertical frequencies. There is considerable redundancy in the time-frequency matrix provided by the Stockwell transform. The DOST was proposed in [31, 32] to overcome this drawback, and it provides a spatial-frequency representation similar to the wavelet transform [32]. The 2D-DOST of an N × N image h(x, y), with 2D Fourier transform H(u, v), is defined as follows:

S(x, y, \nu_x, \nu_y) = \frac{1}{\sqrt{2^{p_x - 1}\, 2^{p_y - 1}}} \sum_{u' = -2^{p_x - 2}}^{2^{p_x - 2} - 1} \; \sum_{v' = -2^{p_y - 2}}^{2^{p_y - 2} - 1} H(u' + \nu_x,\, v' + \nu_y)\, e^{j 2\pi \left( \frac{u' x}{2^{p_x - 1}} + \frac{v' y}{2^{p_y - 1}} \right)}.

Here, ν_x and ν_y are the horizontal and vertical band-centre frequencies, respectively, with ν_x = 2^(p_x − 1) + 2^(p_x − 2) and ν_y = 2^(p_y − 1) + 2^(p_y − 2). For an N × N image, there are N² DOST points. The 2D-DOST gives information about the frequency content around (ν_x, ν_y) within a bandwidth of 2^(p_x − 1) × 2^(p_y − 1) frequency samples [30].
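Before describing how the 2D-DOST is applied to the fused matrices, the following NumPy sketch illustrates the 1D discrete Stockwell transform in its frequency-domain form; the treatment of negative frequencies in the Gaussian window and the restriction to non-negative voices are implementation choices of this sketch, not details taken from [30].

```python
import numpy as np

def stockwell_1d(x):
    """1D discrete Stockwell transform (frequency-domain computation).
    Row m, column k of the output holds S[kT, m/(NT)] for voices m = 0..N/2."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x) / N                          # DFT with the 1/N convention used above
    p = np.arange(N)
    p_sym = np.where(p <= N // 2, p, p - N)        # symmetric frequency index for the window
    S = np.zeros((N // 2 + 1, N), dtype=complex)
    S[0, :] = x.mean()                             # voice m = 0 equals the DC value of the DFT
    for m in range(1, N // 2 + 1):
        gauss = np.exp(-2.0 * np.pi**2 * p_sym**2 / m**2)   # Gaussian voice window
        S[m, :] = N * np.fft.ifft(np.roll(X, -m) * gauss)   # sum over p of X[p+m] g[p] e^{j2πpk/N}
    return S
```

For example, stockwell_1d(np.sin(2 * np.pi * 5 * np.arange(128) / 128)) concentrates its magnitude around voice m = 5 across all time columns, reflecting the localization property discussed above.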
As mentioned, the input of the 2D-DOST is an N × N image, and N is usually a power of two for computational efficiency. Hence, each fused matrix Xm with the size of 40 × 384 is partitioned into six partitions, each with the size of 40 × 64, and each partition is then resized to 64 × 64. After that, the 2D-DOST is computed for each resized partition to obtain Si, i = 1, …, 6. Finally, the time-frequency matrix T is computed by combining the magnitudes of the six transformed partitions as T = [|S1|, |S2|, …, |S6|].
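The construction of the time-frequency map for one fused segment can then be sketched as below; the skimage-based resizing and the horizontal concatenation of the magnitude maps are assumptions of this sketch, and dost_2d stands for a 2D-DOST routine in the sense of [31, 32] that is not implemented here.

```python
import numpy as np
from skimage.transform import resize

def build_tfm(x_m, dost_2d, n_parts=6, size=64):
    """x_m: fused 40 x 384 segment; dost_2d: callable returning the 2D-DOST
    of a size x size image. Returns the magnitude time-frequency map."""
    parts = np.array_split(x_m, n_parts, axis=1)                  # six 40 x 64 blocks
    maps = [np.abs(dost_2d(resize(p, (size, size)))) for p in parts]
    return np.concatenate(maps, axis=1)                           # assumed horizontal stacking
```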
CNNs provide several benefits for analyzing TFMs. CNNs are particularly effective at capturing local patterns and features. TFMs contain localized structures; hence, CNNs can automatically learn and extract relevant local features from these maps. This enables the model to capture time-varying patterns and frequency-specific information. TFMs often exhibit hierarchical structures, where low-level features correspond to basic signal components and higher-level features capture more complex relationships and patterns. CNNs can learn these hierarchical representations by stacking multiple convolutional layers. This allows them to capture both low-level details, such as individual frequency components, and high-level features that represent more abstract signal characteristics. TFMs are susceptible to noise and variations introduced during signal acquisition or processing. CNNs have demonstrated robustness to noise and variations. By leveraging local receptive fields and pooling operations, CNNs can effectively suppress noise and capture invariant features in TFMs. This robustness enhances the model’s ability to analyze the TFM in the presence of noise or variations [33–35].
The CNN extracts multiscale localized spatial features from the input image using different layers, including image input, convolutional, batch normalization, rectified linear unit (ReLU), pooling, fully connected, and softmax layers. The convolutional layers generate high-level features by detecting local patterns such as lines and edges using small filters, or kernels. Batch normalization normalizes the output of the convolutional layers over each minibatch to reduce the sensitivity to initialization and increase the training speed. A nonlinear activation function, the ReLU, follows this layer, with the input-output relation f(x) = max(0, x). The output of the ReLU layer contains many highly correlated high-level features, and training on such features requires more computational resources. Therefore, the pooling layer is employed to reduce the number of high-level features at the output of the ReLU layer. This layer performs downsampling with functions such as average pooling, global average pooling, maximum pooling, and global maximum pooling, among which max pooling is the most frequently used; it selects the maximum value in the pooling window. The output of the last pooling layer is given to the flatten layer, which converts the feature maps from matrix form to vector form. The elements of this vector are the input of the fully connected and softmax layers, which act as a traditional multilayer perceptron.
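The layer ordering described above can be illustrated with a small PyTorch module; this is an illustrative architecture for a 64 × 64 single-channel TFM, not the network used in the paper, which relies on pretrained backbones as explained next.

```python
import torch.nn as nn

class SmallTFMNet(nn.Module):
    """Conv -> batch norm -> ReLU -> max pooling blocks, followed by
    flatten, fully connected, and softmax layers."""
    def __init__(self, n_classes=4, in_ch=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # deep features are taken at this point
            nn.Linear(32 * 16 * 16, n_classes),    # 64x64 input -> 16x16 maps after two poolings
            nn.Softmax(dim=1),                     # shown to mirror the text; in practice the
        )                                          # loss is usually applied to the raw logits

    def forward(self, x):                          # x: (batch, in_ch, 64, 64)
        return self.classifier(self.features(x))
```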
Designing new CNN structures and training them from scratch is time consuming and requires a huge number of labelled training samples. Transfer learning is utilized to address this challenge. Generally, transfer learning means using a pretrained CNN for a new problem. To this end, only the number of neurons in the last dense layer is modified according to the number of classes of the new problem, and all or some of the weights of the pretrained network are fine-tuned on the training data of the new scenario. Also, the training samples are resized to match the size of the input image layer. After training, the features at the output of the flatten layer are considered the deep features and used for further processing.
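A hedged sketch of this transfer-learning step with Inception-v3 (the backbone that performs best in Section 4) is given below, assuming a recent torchvision release; the choice of which layers to fine-tune, the optimizer, and the input preprocessing are not shown here and would follow Table 2.

```python
import torch.nn as nn
from torchvision import models

def build_inception(n_classes):
    """Load an ImageNet-pretrained Inception-v3 and replace its dense layers
    so that the output size matches the number of emotion classes."""
    model = models.inception_v3(weights="IMAGENET1K_V1")      # expects 299 x 299 inputs
    model.fc = nn.Linear(model.fc.in_features, n_classes)     # main classification head
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, n_classes)  # auxiliary head
    return model
```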
3.3. Feature Reduction
Some high-level deep features obtained from the flatten layer may be highly correlated, increasing the redundancy in the feature vector given to the classifier. The redundant features increase the training complexity and probability of overfitting. Hence, they should be removed from the feature vector. The semisupervised methods combine the efficiencies of both supervised and unsupervised methods and balance the discrimination and generalization. This paper uses the semisupervised dimensionality reduction (SSDR) proposed in [33] for feature reduction.
Let n and d, respectively, denote the number of training samples and the number of extracted deep features. Accordingly, x_1, …, x_n ∈ R^d are the training feature vectors and S = [x_1, …, x_n]. In this method, pairs of training samples belonging to the same class and pairs of samples from different classes, respectively, construct the must-link constraints, M, and the cannot-link constraints, C. SSDR obtains the new feature vector set G = W^T S, where W ∈ R^{d×r}, with W^T W = I, is the projection matrix, and the new features should preserve the structure of the original features. To this end, the objective function J(W) is defined as follows:

J(\mathbf{w}) = \frac{1}{2 n^{2}} \sum_{i, j} \left( \mathbf{w}^{T} \mathbf{x}_i - \mathbf{w}^{T} \mathbf{x}_j \right)^{2} + \frac{\alpha}{2 |C|} \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in C} \left( \mathbf{w}^{T} \mathbf{x}_i - \mathbf{w}^{T} \mathbf{x}_j \right)^{2} - \frac{\beta}{2 |M|} \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in M} \left( \mathbf{w}^{T} \mathbf{x}_i - \mathbf{w}^{T} \mathbf{x}_j \right)^{2}.

The parameters α and β balance the cannot-link and must-link constraints. The concise form of the objective function can be expressed as follows:

J(\mathbf{w}) = \mathbf{w}^{T} \mathbf{S} \mathbf{L} \mathbf{S}^{T} \mathbf{w},

where L = D − P denotes the Laplacian matrix and D is the diagonal matrix obtained as D_{ii} = ∑_j P_{ij}. The elements of the weight matrix P are obtained as follows:

P_{ij} = \begin{cases} \dfrac{1}{n^{2}} + \dfrac{\alpha}{|C|}, & (\mathbf{x}_i, \mathbf{x}_j) \in C, \\ \dfrac{1}{n^{2}} - \dfrac{\beta}{|M|}, & (\mathbf{x}_i, \mathbf{x}_j) \in M, \\ \dfrac{1}{n^{2}}, & \text{otherwise.} \end{cases}
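A compact sketch of this projection, following the formulation above with the weight matrix denoted P, is shown below; the choice of eigenvalue solver and the handling of empty constraint sets are implementation details of the sketch.

```python
import numpy as np

def ssdr(S, must_link, cannot_link, alpha, beta, r):
    """S: (d, n) matrix of deep feature vectors; must_link / cannot_link: lists of
    sample index pairs; returns the (d, r) projection matrix W (then G = W.T @ S)."""
    d, n = S.shape
    P = np.full((n, n), 1.0 / n**2)
    for i, j in cannot_link:
        P[i, j] = P[j, i] = 1.0 / n**2 + alpha / max(len(cannot_link), 1)
    for i, j in must_link:
        P[i, j] = P[j, i] = 1.0 / n**2 - beta / max(len(must_link), 1)
    D = np.diag(P.sum(axis=1))
    L = D - P                                        # graph Laplacian of the weight matrix
    M = S @ L @ S.T                                  # J(w) = w^T S L S^T w is maximized
    eigval, eigvec = np.linalg.eigh(M)
    return eigvec[:, np.argsort(eigval)[::-1][:r]]   # top-r eigenvectors as columns of W
```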
It is observed that the performance of SSDR depends on the parameters α and β. Hence, Bayesian optimization is utilized to find their optimum values that maximize the accuracy.
3.4. Classification
Here, several classifiers, including the SVM, kNN, ANN, decision tree, and random forest, are considered separately to obtain the performance of the proposed method. The performance of these classifiers depends on their parameters: for the SVM, the kernel type and box constraint; for kNN, the number of neighbours, distance metric, and weighting scheme; for the decision tree, the maximum number of splits; and for the random forest, the minimum leaf size and the number of predictors to sample. A joint Bayesian optimization finds their optimum values, as shown in Figure 3. It should be mentioned that the structure of the ANN is chosen according to the dense layers of the corresponding CNN.
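An illustrative version of this joint search with Gaussian-process Bayesian optimization (scikit-optimize) is sketched below for the SSDR weights and an SVM; the search ranges, the five-fold cross-validated accuracy used as the objective, and the reduced dimension are assumptions of the sketch rather than the exact settings used in the paper.

```python
from skopt import gp_minimize
from skopt.space import Real, Categorical
from skopt.utils import use_named_args
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

space = [
    Real(1e-3, 1e2, prior="log-uniform", name="alpha"),    # SSDR cannot-link weight
    Real(1e-3, 1e2, prior="log-uniform", name="beta"),     # SSDR must-link weight
    Real(1e-2, 1e3, prior="log-uniform", name="C"),        # SVM box constraint
    Categorical(["rbf", "linear", "poly"], name="kernel"), # SVM kernel type
]

def make_objective(features, labels, must_link, cannot_link, r=100):
    """features: (n, d) deep features; returns a scalar objective for gp_minimize."""
    @use_named_args(space)
    def objective(alpha, beta, C, kernel):
        W = ssdr(features.T, must_link, cannot_link, alpha, beta, r)  # previous sketch
        reduced = features @ W                                        # (n, r) reduced features
        acc = cross_val_score(SVC(C=C, kernel=kernel), reduced, labels, cv=5).mean()
        return -acc                                                   # gp_minimize minimizes
    return objective

# result = gp_minimize(make_objective(F, y, ml, cl), space, n_calls=50, random_state=0)
```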

4. Results and Discussion
This section explains the simulations performed to assess the performance of the proposed method and the obtained results. The confusion matrix, accuracy (Acc), sensitivity (Sens), precision (Prec), kappa, and F1 scores are calculated and reported. These metrics are computed as follows:

Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad Sens = \frac{TP}{TP + FN}, \quad Prec = \frac{TP}{TP + FP},

F1 = \frac{2 \times Prec \times Sens}{Prec + Sens}, \quad kappa = \frac{Acc - Acc_{r}}{1 - Acc_{r}},

where the number of correctly classified and correctly rejected multimodal signals is denoted by true positive (TP) and true negative (TN), respectively. Conversely, the number of incorrectly identified and incorrectly rejected multimodal signals is given by false positive (FP) and false negative (FN), respectively. Also, Acc_r = 1/C is the random accuracy, where C is the number of classes.
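These metrics can be computed directly from the confusion matrix; in the sketch below the per-class sensitivity and precision are macro-averaged for the multi-class case, and the random accuracy 1/C is used in the kappa score, both of which are assumptions consistent with the definitions above.

```python
import numpy as np

def metrics(cm):
    """cm: C x C confusion matrix (rows: true class, columns: predicted class)."""
    cm = np.asarray(cm, dtype=float)
    C = cm.shape[0]
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    acc = tp.sum() / cm.sum()
    sens = np.mean(tp / (tp + fn))             # macro-averaged sensitivity (recall)
    prec = np.mean(tp / (tp + fp))             # macro-averaged precision
    f1 = 2 * prec * sens / (prec + sens)
    acc_r = 1.0 / C                            # random accuracy
    kappa = (acc - acc_r) / (1 - acc_r)
    return dict(acc=acc, sens=sens, prec=prec, f1=f1, kappa=kappa)
```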
4.1. Simulation Setup
We adopt the cross-subject validation protocol to determine the training and test data. Hence, the proposed method is subject independent: the data of one subject are used for testing, and the data of the remaining subjects train the model. This procedure is repeated with every subject serving as the test data, and the results are finally averaged. This paper considers several frequently used pretrained CNNs for deep feature extraction from the 2D-DOST maps, including AlexNet, VGG19, ResNet18, Inception-v3, and EfficientNet-B0. Table 2 contains the parameters used in the tuning process of the deep feature extractors.
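The cross-subject protocol corresponds to leave-one-subject-out validation and can be sketched as follows; train_and_score is a placeholder for the full pipeline (TFM construction, CNN feature extraction, SSDR, and classification) and is not defined here.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracy(X, y, subject_ids, train_and_score):
    """X, y: segments and labels; subject_ids: subject index of each segment.
    Each fold trains on 31 subjects and tests on the held-out one."""
    logo = LeaveOneGroupOut()
    scores = [train_and_score(X[tr], y[tr], X[te], y[te])
              for tr, te in logo.split(X, y, groups=subject_ids)]
    return float(np.mean(scores))
```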
4.2. Fusion Model
There are several ways to combine the EEG and PPS signals to construct the matrix Xm, such as EEG-PPS, Xm = [XEEG; XPPS], and PPS-EEG, Xm = [XPPS; XEEG]. Another way is to place the channels of the EEG and PPS signals at random rows of the matrix Xm; many such placements are possible, so we examined several of them and report the highest accuracy. The results of using only EEG and only PPS signals are also obtained. The results given in Table 3 show that the EEG-PPS fusion yields the highest accuracy, equal to 0.953 and 0.928 for the two- and four-class scenarios, respectively. The random and PPS-EEG fusions have similar accuracies, and the EEG-PPS scheme is slightly better than both. This fusion scheme preserves the intramodal correlations among the channels of each modality and also considers the cross-modal correlations among the signals of different modalities. In contrast, the random arrangement cannot preserve the intramodal correlations among channels due to the random placement of the signals.
Also, comparing the results of using only EEG and only PPS signals indicates that the EEG signals are more informative than the PPSs; hence, their fusion reaches a higher accuracy than either modality alone. It should be noted that the maximum accuracy in both scenarios is obtained with the deep features extracted by the Inception-v3 CNN and classified by the SVM. The structure of Inception-v3 [36] is given in Table 4; the output size of each module is the input size of the next one. The structure of the inception modules is given in Figure 4.

Tables 5 and 6 present the confusion matrices of the proposed method for the two- and four-class scenarios, respectively. It is observed that the accuracy of detecting negative emotions is slightly higher than that of positive ones in the two-class scenario. Notably, the minimum sensitivity is 94.7%, higher than that of recently introduced works. In the four-class scenario, the angry emotion is recognized most accurately, followed by happy, calm, and sad. Also, the values of the kappa and F1 scores indicate the efficiency of the proposed method.
4.3. Accuracy for Different Pairs of Classifiers and the CNN
Tables 7 and 8 present the accuracy and kappa score of the proposed method for different pairs of CNN and classifier to find the pair that reaches the highest accuracy. Notably, the reported accuracy of each pair is the maximum obtained by optimizing the SSDR and classifier parameters under the EEG-PPS fusion scheme. In both scenarios, the combination of Inception-v3 and the SVM yields the highest accuracy. ResNet18 and EfficientNet-B0 have similar performance, lower than Inception-v3 but higher than AlexNet and VGG19, and VGG19 performs better than AlexNet. For all CNNs, the SVM with the Gaussian kernel reaches the highest accuracy, followed by the ANN in most cases.
Table 9 discusses the effect of feature reduction on the performance of the proposed method. We considered the proposed method without feature reduction, with unsupervised PCA, with supervised LDA, with the combination of PCA and LDA, with static SSDR, in which the parameters are not optimized, and with optimized SSDR. It is observed that feature reduction generally increases the accuracy. Since LDA is supervised, it has higher accuracy than unsupervised PCA; however, its generalization is lower than that of PCA. To overcome this issue, their combination, PCA + LDA, can be used, which reaches a higher accuracy than either alone. The parameters of static SSDR are set randomly, and its performance is slightly lower than that of the hybrid PCA + LDA scheme, while the optimized SSDR achieves the highest accuracy.
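For reference, the PCA + LDA baseline of Table 9 can be assembled as a simple scikit-learn pipeline; the retained variance, kernel, and other settings below are illustrative assumptions, not the values used to produce the reported numbers.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

baseline = make_pipeline(
    PCA(n_components=0.95),   # unsupervised step: keep 95% of the variance
    LDA(),                    # supervised step: at most (n_classes - 1) directions
    SVC(kernel="rbf"),        # Gaussian-kernel SVM classifier
)
# baseline.fit(train_features, train_labels); baseline.score(test_features, test_labels)
```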
4.4. Performance Comparison
Table 10 compares the performance of recently introduced multimodal emotion recognition approaches. As observed, the EEG is the most frequently used modality in multimodal emotion recognition systems; most schemes combine the EEG with other biological signals such as EOG, PPS, GSR, and facial expressions, and the EEG + PPS combination is the most common. Generally, the EEG + PPS scheme reaches a higher accuracy than the other combinations of biological signals. It is observed that the proposed method achieves higher accuracy than the recently introduced works.
5. Conclusion
This paper proposed a new method for emotion recognition from multimodal signals, including EEG in 32 channels and PPS in eight channels. The proposed method employs the 2D-DOST to analyze the relations between the multimodal signals. Then, a CNN was used to extract the deep local features from the absolute value of the 2D-DOST output. After feature reduction by SSDR, a classifier whose parameters are jointly optimized with those of the SSDR determines the emotion. The results showed that the deep features extracted by the Inception-v3 network and classified by the Gaussian SVM reached the highest accuracies, equal to 0.953 and 0.928 for the two- and four-class scenarios on the DEAP dataset, respectively. Several fusion schemes for combining the EEG and PPS signals were examined, and the scheme [XEEG; XPPS] achieved the maximum accuracy. Also, the optimized SSDR showed higher accuracy than frequently used feature reduction schemes such as PCA and LDA. The results indicate the efficiency of multimodal emotion recognition compared to the unimodal approach, and the proposed method outperforms the recently introduced methods.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Behrooz Zali-Vargahan was responsible for conceptualization, investigation, methodology, software, and writing the original draft; Asghar Charmin was responsible for conceptualization, methodology, validation, and supervision; Hashem Kalbkhani was responsible for conceptualization, software, visualization, and review writing and editing; and Saeed Barghandan was responsible for the methodology and review writing and editing.