Abstract
Traditional convolutional neural network methods for facial expression recognition express features through local operations, so they are inefficient at capturing dependencies between long-range pixels, which degrades recognition performance. To address this problem, this paper combines a self-attention mechanism with a residual network and proposes a new facial expression recognition model built on a global operation. The model first introduces self-attention on top of the residual network, estimating the relative importance of each location by computing a weighted average over all location pixels; it then introduces channel attention to learn different features in the channel domain, generating channel attention that focuses on the interacting features across channels and thereby improving robustness; finally, it fuses the self-attention and channel attention mechanisms to strengthen the model's ability to extract globally important features. The model achieves accuracies of 97.89% on the CK+ data set and 74.15% on FER2013, confirming its effectiveness and superiority in extracting global features.
1. Introduction
In daily human communication, facial expressions convey a person's current emotional state, can express more precise information than language, and play an indispensable role in human emotional communication. In the 1970s, the psychologists Ekman and Friesen defined six basic emotions: happiness, anger, surprise, fear, disgust, and sadness; contempt was added later. These seven emotions became the basis for research on expression recognition.
As a research direction in the field of computer vision, facial expression recognition is closely related to face detection and recognition and has gradually been applied to daily life, such as driver fatigue detection, criminal investigation, and entertainment. Current research on facial expression recognition falls into two main directions: traditional methods based on hand-crafted features and deep learning-based methods. Early work classified expressions by combining LDA and PCA [1]. Feng et al. [2] proposed an expression recognition method combining LBP features with SVM classification and improved the model for the low-resolution images encountered in practical applications. Zhi et al. [3] proposed a novel graph-preserving nonnegative matrix factorization algorithm for facial expression recognition that is more robust to partial occlusions than other methods. Zhong et al. [4] proposed a multitask sparse learning method based on LBP features, which casts expression recognition as a multitask sparse learning problem and obtained good results on multiple data sets.
Since 2013, deep learning has gradually been applied to facial expression recognition. Matsugu et al. [5] used convolutional neural networks (CNNs) to achieve translation, rotation, and scale invariance for expression images. Yao et al. [6] proposed HoloNet, a network designed specifically for expression recognition, in which CReLU replaces ReLU and the residual module is combined with CReLU to build intermediate layers, achieving strong results. Zhao et al. [7] designed a feature extraction network by adding a feature selection mechanism to AlexNet. Zhang et al. [8] proposed an end-to-end deep learning model based on a generative adversarial network that exploits different poses and expressions. Cai et al. [9] proposed a new loss function that reduces the distance within expression classes and maximizes the distance between classes so that the network learns more discriminative features. He and Shen [10] used an improved deep residual network to deepen the network and introduced transfer learning to address the small size of current expression recognition data sets, achieving an accuracy of 91.33% on the CK+ data set. In reference [11], a pairwise random forest method was used to handle face pose variation in facial expression recognition. These deep learning-based methods show that convolutional neural networks achieve good recognition results, but convolution is a local operation in space: capturing dependencies between long-range features can only be achieved by repeatedly stacking convolution layers, which is inefficient, and the resulting depth requires carefully designed modules to keep gradients from vanishing.
The main contributions of this paper are as follows:
(i) A facial expression recognition model based on an attention mechanism is proposed. A self-attention mechanism is introduced on top of the residual network to overcome the locality of the convolution operation, improving the model's ability to capture long-range associated features. Considering the correlation between channels of the feature map, channel attention is introduced to learn the weight distribution among channels.
(ii) Through extensive experiments, the effects of different combinations of the self-attention and channel attention mechanisms on model performance are analyzed, and the optimal model is selected.
(iii) Experimental results on the FER2013 and CK+ data sets show that the proposed attention-based facial expression recognition model outperforms other methods.
2. Related Work
2.1. Expression Recognition Method Based on Deep Learning
With the growth of data scale and computing power, deep convolutional neural networks (CNNs) have made important progress in the field of computer vision. In facial expression recognition, the convolutional neural network has become the mainstream method.
Liu et al. [12] proposed the BDBN model based on CNNs, using a deep belief network for feature extraction and expression classification to avoid complex hand-designed features and obtaining better performance. Jung et al. [11] proposed the DTGAN model, composed of two CNN networks: the first extracts temporal appearance features from image sequences, and the other extracts temporal geometric features from facial key points over time; combining the two networks improves expression recognition performance. Zhao et al. [13] proposed a deep region and multilabel learning method (DRML), which uses a feedforward function to obtain face regions and forces the learned weights to capture the structural information of facial key points; it learns automatically and is robust to certain changes in local regions. To overcome the influence of face pose on expression recognition, Zhang et al. [8] used an adversarial autoencoder to generate facial images with different expressions and poses to expand the data set; however, because the generated images are not realistic and natural, the model generalizes poorly. He and Chen [14] proposed an expression recognition model based on multiscale feature fusion: two different networks are trained to extract features from expression images at different resolutions, and the extracted features are then fused to improve recognition performance and enhance generalization. Yao and Zhao [15] proposed a local feature fusion method that extracts features of the eyes, eyebrows, and mouth with a CNN and then uses an SVM classifier for weighted decision-level fusion; it recognizes well and runs in real time.
2.2. Attention Mechanism
Attention mechanisms play an important role in human perception and have been successfully applied to machine translation [16], image generation [17], and other fields; there is also some research on attention in expression recognition. Jaderberg et al. [18] proposed the STN module, which learns an affine transformation matrix end to end, enabling spatial transformations such as translation and rotation of the feature map so as to focus on key target areas. Sun et al. [19] proposed a CNN model based on an attention mechanism that learns attention weights over the feature map to improve performance. Pei et al. [20] proposed an end-to-end attention network in which the attention weights on the feature map are learned automatically by a CNN; as a long short-term memory (LSTM) network processes the information, it learns a weight for each time step and then weights the input feature vectors to obtain the final output vector.
3. Residual Attention Model
3.1. Overall Network Framework
In this paper, a deep convolutional neural network based on an attention mechanism is proposed and applied to facial expression recognition. To overcome the limitation of the local receptive field of convolution, a self-attention mechanism is adopted so that the network attends more to regions related to expression; in addition, to learn the weight distribution across feature maps, a channel attention mechanism is used so that the network automatically learns the importance of each channel.
Figure 1 shows the overall framework of the attention mechanism model proposed in this paper. In the first part, downsampling extracts the expression feature map; the feature map is then fed into the attention residual module for feature transformation to improve model performance; finally, expression classification is performed by the fully connected layer. The attention residual module comprises the self-attention module and the channel attention module; the structure of each part is described in detail below.

3.2. Residual Network
In deep learning, model performance is often improved by increasing the scale of the model. However, as the network deepens, the vanishing gradient problem appears, making training difficult. To solve this problem, the residual network [21] adopts a shortcut connection that allows information from earlier layers to be transmitted directly to the module's output layer, as shown in Figure 2.
As shown in Figure 2, the residual module connects the input and the output through an identity mapping so that the convolution layers learn the residual between the input and output. If F(x, {W_i}) denotes the residual mapping, the output of the residual module is as follows:
y = F(x, {W_i}) + x,
where x and y represent the input and output of the module, respectively.
3.3. Self-Attention Module
In a convolutional neural network, the convolution kernel is usually no larger than 7 × 7 due to limited computing resources, so each convolution covers only a small neighborhood around a pixel and cannot easily capture long-distance features, such as the associated features between the two eyes. Capturing long-range dependencies between pixels requires stacking convolution operations repeatedly and learning them through backpropagation, which easily leads to vanishing gradients and slow convergence; because the network becomes deep, a reasonable structure must be designed so as not to impede gradient propagation.
Different from the local computation of convolution, the core idea of the nonlocal operation is that the output at each position of the feature map is no longer computed only from the pixels in a local neighborhood; instead, all signals in the image related to the current representation are attended to, and a correlation weight expresses the relation between every other position and the position currently being computed. It is defined as follows:

y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j),

where i is a position in the output feature map, j is the index over all possible positions in the feature map, x is the input feature map, y is the output feature map (the same size as the input), f is the function that computes the correlation between any two points, g is a unary function whose purpose is to transform the representation, and C(x) is a normalization factor. Since f and g are general expressions, their concrete forms must be chosen when combining them with the neural network.
First, since g has a unary output, it can be implemented as a 1 × 1 convolution, as follows:

g(x_j) = W_g x_j.
For the function f, which computes the correlation between two locations, the similarity is calculated in an embedding space. The mathematical expression is as follows:

f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)},

where θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j, and the normalization factor is C(x) = Σ_{∀j} f(x_i, x_j). For a given location i, (1/C(x)) f(x_i, x_j) amounts to a softmax computed over all positions j, so the output of the self-attention layer is as follows:

y = softmax(x^T W_θ^T W_φ x) g(x).
The structure of the self-attention network is shown in Figure 3. Suppose the input F ∈ R^{H×W×C} is transformed into two embedding spaces by two convolution weights W_θ and W_φ, yielding feature maps of size H × W × C′, usually with C′ < C; the purpose of this step is to reduce the number of channels and thus the amount of computation. Next, the output feature maps are reshaped to HW × C′, one of the matrices is transposed, and the two are multiplied; computing the similarities yields the HW × HW similarity matrix, and a softmax over its last dimension gives the normalized correlation between each pixel of the current feature map and the pixels at all other positions. The same operations are then applied to the g branch: its dimensionality is first reduced, the reshape operation is performed, and the result is multiplied with the HW × HW matrix, applying the attention mechanism to all channels of the feature map. Finally, a 1 × 1 convolution restores the channels, ensuring that the input and output sizes are exactly the same. From a mathematical perspective, assuming the feature map of the previous layer is x ∈ R^{C×N}, it is first mapped into the two feature spaces by θ(x) = W_θ x and φ(x) = W_φ x, and the attention weight β_{j,i} obtained from the softmax is the contribution of position i when synthesizing position j of the feature map, where C is the number of channels and N = HW is the number of pixels in the previous layer. In the output of the attention layer, W_θ ∈ R^{C′×C}, W_φ ∈ R^{C′×C}, W_g ∈ R^{C′×C}, and the recovery convolution W_v ∈ R^{C×C′} are the convolution kernel weights, where C′ is a hyperparameter and C′ < C.

In addition, a residual connection is introduced for better gradient backpropagation, so the final output of the attention module is as follows:

z_i = γ y_i + x_i,

where γ is a learnable parameter initialized to 0, so that the weight of the attention branch increases gradually during training.
3.4. Channel Attention Module
Each channel of a feature map acts as a feature detector [22]; the channels thus encode which features are useful for the task. In a conventional convolutional neural network, however, the channels are not distinguished by importance: each channel is treated equally, ignoring the fact that each channel contributes differently to the task. In view of this, this paper introduces channel attention to learn the weight distribution across channels, strengthening channels useful for the expression recognition task and weakening irrelevant ones.
To compute channel attention efficiently, the intermediate feature map is compressed along height and width by global average pooling and global max pooling, producing two different descriptors; both are fed into a fully connected network with shared parameters, the two output vectors of the fully connected layers are added element-wise, and the final channel weights are obtained through a sigmoid activation function. The detailed structure is shown in Figure 4.

Suppose the input feature map is F ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels of the feature map, respectively. After pooling, the global-max-pooled descriptor F_max ∈ R^{1×1×C} and the global-average-pooled descriptor F_avg ∈ R^{1×1×C} are obtained. These two descriptors are then sent into a fully connected network with only one hidden layer, and the channel weights are computed as follows:

M_c(F) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max))),

where σ is the sigmoid function, W_0 and W_1 are the shared weights of the fully connected layers, W_0 ∈ R^{C/r×C}, W_1 ∈ R^{C×C/r}, and r is the reduction ratio.
3.5. Attention Fusion
To enhance the feature extraction ability of the network and capture dependencies between long-range features, this paper adds the self-attention mechanism and channel attention to the residual module to form the attention residual module, improving the model's sensitivity to useful information and suppressing useless information. The two mechanisms can be combined in series or in parallel; the serial form is further divided into self-attention before channel attention and channel attention before self-attention.
3.5.1. Self-Attention before Channel Attention
In this serial mode, self-attention is performed first and channel attention second, as shown in Figure 5. The feature map F_in obtained by convolution of the previous layer serves as input. The intermediate attention map F_mid is obtained by applying self-attention M_a and fusing the result with the input feature map; F_mid is then used as the input of channel attention M_c. Finally, the feature map obtained from M_c is fused with F_mid to obtain the output F_out of the attention module. The formal description of the whole process is shown in Figure 5.

3.5.2. Channel Attention before Self-Attention
The serial mode with channel attention first, followed by self-attention, is shown in Figure 6. The feature map F_in obtained by convolving the previous layer is used as input. First, channel attention M_c acts on F_in, and the resulting channel attention map F_c is fused with the input feature map F_in; self-attention M_a is then applied to the fused map to obtain the attention map F_amid, from which the final output F_out of the attention module is obtained. In the formal description of this process, ⊗ represents element-wise multiplication.

3.5.3. Parallel Mode
The parallel connection mode is shown in Figure 7. The feature map F_in obtained by convolution of the previous layer is used as input. First, self-attention M_a and channel attention M_c are applied separately to obtain their respective feature maps.

Then the self-attention map F_amid and the channel attention map F_c are obtained by multiplying the respective feature maps element-wise with the input feature map; finally, the two attention maps are added element-wise to obtain the final output. The formal description of the whole process is as follows:

F_amid = M_a(F_in) ⊗ F_in,
F_c = M_c(F_in) ⊗ F_in,
F_out = F_amid ⊕ F_c,

where ⊕ represents element-wise addition and ⊗ represents element-wise multiplication.
3.6. Attention Residuals
To make better use of the self-attention and channel attention modules designed above, this paper inserts them into the residual module. Three structural designs are possible: using the self-attention mechanism alone, using the channel attention mechanism alone, and using self-attention and channel attention simultaneously. The attention residual module adds the attention mechanism to the original residual module; the specific structures are shown in Figure 8.

Figure 8: structures of the attention residual module: (a) with self-attention alone; (b) with channel attention alone; (c) with self-attention and channel attention combined.
4. Experiment
To verify the validity of the model, experiments are carried out on two data sets, FER2013 and CK+. The experiments are based on the TensorFlow framework [23]. The experimental platform is a six-core Intel Core i7-6850 CPU with 64 GB of memory, a GTX 1080 Ti graphics card, and Ubuntu 16.04; all experiments used single-card training.
4.1. Data Sets
The FER2013 data set [24] comes from Kaggle's expression recognition competition. It contains 35888 facial expression images, including faces under different lighting and poses: 28709 images in the training set and 3589 images each in the public and private test sets. The images are 48 × 48 grayscale in seven categories: anger, disgust, fear, happiness, surprise, sadness, and neutral. Sample images are shown in Figure 9.

The CK+ data set [25] is also a common data set for facial expression recognition. It contains 593 image sequences of 123 subjects, showing each subject's expression changing from a neutral state to the expression peak. Of these, 327 sequences carry expression labels covering eight expressions: neutral, disgust, contempt, fear, happiness, sadness, surprise, and anger. In this paper, 981 images across 7 expression categories are selected for the experiments, and the images are preprocessed to 48 × 48, as shown in Figure 10.

Because both data sets are small, this paper uses data augmentation to expand them, including random rotation, random brightness adjustment, and random graying. The CK+ data set was enlarged to over 29000 images and FER2013 to over 63000. Data augmentation effectively improves the accuracy of the model while preventing overfitting.
4.2. Experimental Results and Analysis
4.2.1. Ablation Experiment
In this section, we verify the effectiveness of the self-attention and channel attention mechanisms through experiments. For the ablation experiments, we used the FER2013 and CK+ data sets and built the benchmark model from the residual module as the basic block. On FER2013, 28709 images were used for training, 3589 images for validating the model, and 3589 images for testing the accuracy of the final model. The CK+ data set was split into training, validation, and test sets at a ratio of 7 : 2 : 1.
During training, Adam is selected as the optimizer, the learning rate is set to 0.0001, training runs for 50 epochs, and the batch size is set to 64. The experimental results are shown in Table 1.
From Table 1, we can draw the following conclusions. (1) On both FER2013 and CK+, the benchmark model performs clearly worse than every model with an attention mechanism, regardless of which attention is added and how; this shows that attention improves the feature extraction ability of the neural network and helps the performance of the expression recognition model. (2) Mixed attention clearly outperforms a single attention mechanism, showing that the added nonlinear mapping is effective for the expression recognition task. (3) Among the mixed attention models, channel attention before self-attention performs best on FER2013: compared with the parallel mode and self-attention before channel attention, its accuracy is higher by 3.98% and 2.89%, respectively. On CK+, self-attention before channel attention performs best: compared with the parallel mode and channel attention before self-attention, its accuracy is higher by 0.66% and 1.48%, respectively.
4.2.2. Scheme Selection
According to the ablation analysis above, the combination of channel attention before self-attention has the best overall performance, with high accuracy on both the FER2013 and CK+ data sets; it is therefore selected as the final model. To verify its effectiveness, the model is compared with other current methods; the experimental results are shown in Tables 2 and 3.
From the experimental data in Tables 2 and 3, we can draw the following conclusions. (1) Compared with the three traditional expression recognition methods, deep learning significantly improves the accuracy of expression recognition, showing that features extracted by a convolutional neural network describe expressions better than hand-crafted feature operators. (2) Compared with current mainstream deep learning methods, the self-attention mechanism model achieves higher accuracy on both data sets. (3) The accuracy on FER2013 is significantly lower than on CK+, indicating that data set quality influences the experimental results: both data sets are relatively small, and FER2013 additionally contains incorrect labels and non-face images, which interfere with training and thus hurt model performance.
Figures 11 and 12 show the training loss and accuracy curves of the model on the FER2013 and CK+ data sets. The curves show that training on FER2013 is less stable than on CK+, which is related to the data sets themselves: the FER2013 expression images vary widely, with low resolution and uneven image quality, which disturbs training, and the accuracy finally stabilizes at about 75%. The CK+ images are of high quality and evenly distributed, so training on this data set is relatively stable and the final accuracy is higher, about 98% on both the training and validation sets.


Figure 13 shows the confusion matrix from the experiments on the FER2013 data set, giving the classification accuracy of face images over the seven expressions; the abscissa is the predicted label and the ordinate the true label. The matrix shows that the model with the self-attention unit improves the accuracy of every expression, with the largest gain, about 13%, on the "sad" expression, indicating that adding the self-attention unit makes expression classification more accurate. There are still clear gaps among the seven expressions: "happy" reaches the highest accuracy, 92%, while "sad," "fear," and "anger" reach only 49%, 53%, and 64%, respectively. On the one hand, the small and imbalanced samples of these three expressions negatively affect network training; on the other hand, these expressions are similar to one another, their feature differences are not obvious, and they are therefore harder to distinguish.

Figure 14 shows the confusion matrix obtained on the CK+ test set. Most expressions improve in recognition accuracy, consistent with the FER2013 results. Because the data for anger, sadness, and contempt are relatively scarce and the feature differences among these expressions are not obvious, their recognition rates are slightly lower than those of disgust, fear, happiness, and surprise.

5. Conclusion
In this paper, a facial expression recognition model based on an attention mechanism is proposed. The model consists of self-attention and channel attention: self-attention assigns different weights to different positions of the feature map, attending differently to different regions, while channel attention learns a weight for each channel of the feature map, strengthening features useful for classification and weakening task-irrelevant ones. Experimental results show that the proposed model improves accuracy on the FER2013 and CK+ data sets.
Data Availability
All data needed to evaluate the conclusions in the paper are present in the paper or the references cited herein.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The study was supported by the Major Project of Natural Science Research of the Jiangsu Higher Education Institutions of China (18KJA520012), the Xuzhou Science and Technology Plan Project (KC19197), and the One Stop Service Platform of Pocket Campus Based on Wechat Applet (XCX2020002).