Abstract
The cultivation of innovative talents is closely related to the quality of course teaching, and facial expressions are correlated with the effectiveness of classroom teaching. This paper proposes a separate long-term recurrent convolutional network (SLRCN) microexpression recognition algorithm based on deep learning for building a course teaching effectiveness evaluation model. First, facial image sequences are extracted from a microexpression data set, and transfer learning is introduced: the spatial features of expression frames are extracted with a pretrained convolutional neural network model to reduce the risk of overfitting during network training. The extracted features of the video sequences are then fed into a long short-term memory (LSTM) network to model temporal features. Experimental results show that the SLRCN algorithm performs best on both the training and test sets and yields the best ROC curve, effectively distinguishing the different expression categories in the database. The proposed model can capture the changes in students' facial expressions in class and evaluate their learning state, thereby promoting the improvement of teaching quality and providing a new method for course teaching quality evaluation.
1. Introduction
Research in higher education shows that the cultivation of innovative talents is closely related not only to the quality of classroom teaching [1] but also to how attentive students' facial expressions are during the teaching process [2]. In view of this, building a teaching effect evaluation model with deep learning, based on the correlation between facial expressions and classroom teaching effects, has become a major research direction in the field of education.
Facial expressions reflect real human emotions. The psychologist Albert Mehrabian suggested that "emotional expression = 7% language + 38% voice + 55% facial expression" [3]. As a carrier of emotion and psychology, facial expression plays an important role in judging human emotion. According to Ekman's basic emotion theory, facial expressions carry rich emotional semantics and are generally divided into six types: happiness, disgust, anger, sadness, fear, and surprise [4]. However, emotion is usually continuous, context dependent, and expressed with varying intensity, so the basic emotion theory still has limitations. Unlike ordinary expressions, microexpressions are spontaneous expressions produced under the influence of subjective emotions [5]. Microexpressions are characterized by short duration (1/25-1/3 s) and small motion amplitude, which makes microexpression recognition very difficult.
Earlier microexpression recognition relied on hand-crafted feature extraction to analyze microexpressions. Because the underlying features are extracted manually, feature extraction is often insufficient, resulting in low recognition accuracy [6]. Deep learning algorithms perform outstandingly in image feature extraction and can therefore extract microexpression features more effectively and improve recognition performance. In addition, owing to limited computing power and the scale of facial expression video data, traditional methods usually analyze static or single-frame expressions, ignoring the temporal dynamics of facial expression. The generation of a facial expression is a process that changes over time: a dynamic expression sequence expresses these changes naturally, whereas a single frame cannot reflect the overall information of the expression. Analysis based on dynamic expression sequences is therefore more helpful for microexpression recognition.
Based on dynamic multiframe expression sequences, this paper proposes a separate long-term recurrent convolutional network (SLRCN) model that combines spatial and temporal features. First, a convolutional neural network is used as a deep visual feature extractor to capture the static features of microexpressions in each image [7], and the features extracted from the video sequence are passed to a bidirectional recurrent neural network to obtain the time-series output, which improves the accuracy of microexpression recognition. In addition, a practical application scenario of facial expression sequences is studied by combining teaching evaluation with facial expression analysis: students' learning states are analyzed by collecting their facial expressions. The model can effectively capture changes in students' facial expressions and evaluate their learning state, thereby promoting the improvement of teaching quality and providing a new method for teaching quality evaluation.
This model has the following advantages:
(1) The model is simple in structure and does not require much data preprocessing.
(2) The overfitting problem is solved by introducing transfer learning for optimization.
(3) It can be used when the data set is insufficient.
This paper mainly consists of five parts, including the first introduction, the second state of the art, the third methodology, the fourth experiment and analysis, and the fifth conclusion.
2. State of the Art
2.1. Correlation between Facial Expression Recognition and Teaching Effect
Facial expression recognition generally refers to representing various emotional states through facial changes and is an extremely important means of nonverbal communication. Artists often express characters' inner feelings by depicting their facial expressions, so as to convey their state of mind vividly. Paul Ekman, a renowned American psychologist known as "the Pope of the Face," has long studied the relationship between facial expressions and inner truth. He found that involuntary reactions are the best indicator of true feelings: when a subject's facial expression is inconsistent with his real thoughts, corresponding flaws always appear. In view of this, relevant universities in China have used deep learning and related technologies to conduct in-depth research on subjects' facial expressions and classroom teaching effects. Literature [8] takes the FER2013 face data set as the research object and proposes an improved multiscale feature fusion algorithm for detecting small faces; the experimental results show a recognition accuracy of up to 73.669%. Literature [9] proposes a classroom teaching effect evaluation system based on an improved VGG network model that combines expression concentration and head-up rate. Many interesting teaching patterns can be deduced from practical classroom experiments; for example, overall class attention rises slowly during roughly the first ten minutes of a lesson and drops sharply in the last five minutes or so before class ends. This requires teachers to adopt different teaching methods in different periods of a lesson, stimulating students' interest in learning to achieve satisfactory teaching results. Literature [10] proposed a facial expression detection method based on deep learning. First, a face detection model is constructed based on an optimized fusion of FaceBoxes and MTCNN. Second, this model is tested and optimized on FDDB, an open-source face detection database. Finally, based on statistics of students' head-up rate, an evaluation standard for classroom teaching based on facial expressions is constructed. At present, the recognition accuracy of classroom teaching effect evaluation models based on facial expression recognition is still low and has not reached a commercial level. Therefore, on the basis of existing research, this paper focuses on the inner relationship between face detection and classroom teaching effect evaluation and on constructing a feasible classroom teaching effect evaluation model and applying it to teaching practice.
2.2. Facial Expression Recognition
Literature [11] proposed the Facial Action Coding System (FACS) in 1976. FACS divides the face into 44 action units (AUs) and combines different AUs to form FACS codes, each of which corresponds to a facial expression. On this basis, the Emotion FACS was developed after analyzing a large number of facial expression pictures [12]. The MIT lab trained sparse codebooks for emotion analysis of microexpressions: exploiting the sparsity of tiny temporal motion patterns, local spatiotemporal features are extracted in the facial region [13], microexpression codebooks are learned from the data, and features are encoded sparsely. Experiments on the AVEC 2012 data set show that this approach performs well.
2.3. Expression Feature Extraction
Expression feature extraction methods can be divided into two categories: those based on static images and those based on dynamic image sequences. Dynamic feature extraction mainly focuses on facial deformation and facial muscle movement; representative methods include the optical flow method [14], motion models, geometric methods, and feature point tracking.
Literature [15] performs microexpression detection and recognition with 3D gradient histograms that capture the gradient relationships between adjacent frames. Literature [16] uses optical-flow-derived strain patterns to process long videos: facial expressions are segmented by dividing the face into several specific subregions (such as the mouth and eyes), and microexpressions are then identified. Literature [17] uses the Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) algorithm to extract features from microexpression image sequences; by extending 2D LBP to 3D, dynamic local texture features are extracted in both the temporal and spatial domains. The CASME database was established in literature [18], where Gabor filtering was applied to extract feature values from microexpression sequences, and a classifier combining Gentle Adaptive Boosting with support vector machines (GentleSVM) was built for classification and recognition.
For microexpression recognition based on spatiotemporal motion information, literature [19] detected and recognized microexpressions by constructing optical strain features and optical strain weighted features from facial optical strain. Literature [20] used Eulerian video magnification, analyzing the phase in the frequency domain and the amplitude in the time domain, to amplify the motion information of microexpressions, eliminate irrelevant facial dynamics, and then apply the LBP-TOP algorithm for feature extraction. Literature [21] proposed a facial dynamics map (FDM) method to characterize microexpression sequences; the method computes the optical flow of the microexpression sequence and aligns it precisely.
2.4. Deep Learning and Microexpression Recognition
Different from traditional machine learning algorithms, deep learning emphasizes feature learning: through layer-by-layer feature mapping, the original data space is mapped to a new feature space, making classification and prediction easier. Deep learning can extract features that meet task requirements directly from data, overcoming the limited generalizability of hand-crafted features. Literature [22] introduced deep learning into microexpression recognition and extracted microexpression features through feature selection; however, because of the small sample size of the data set, overfitting easily occurs during training, which affects recognition accuracy. In literature [23], a convolutional neural network was used to encode the spatial features of microexpressions in different states; the spatial features with expression state constraints were transferred to the temporal domain, and an LSTM network was used to encode the temporal features of microexpressions in different states. Literature [24] proposed an enriched long-term recurrent convolutional network that extracts optical flow features from the data set to enrich the input at each time step or over a given time length.
3. Methodology
Microexpression recognition obtains the face position from a complex scene through a face detection algorithm, detects and segments the face contour, extracts microexpression features, and establishes a recognition and classification model. Its basic steps are as follows (see Figure 1):
(1) Acquire and process facial expression images and expression sequences.
(2) Extract microexpression features from the facial expression sequences and remove redundancy between features to reduce the feature dimension.
(3) Based on a long-term recurrent network, use the microexpression features as the input of a time-series model to learn the dynamic process of the time-varying output sequence.
(4) Establish a dynamic prediction model to classify and recognize facial microexpressions.

The method in this paper is based on the long-term recurrent convolutional network (LRCN) framework, and the model is improved to make it more suitable for recognizing microexpression video clips. To address the small size of microexpression data sets, transfer learning is adopted to avoid overfitting. The convolutional neural network (CNN) and LSTM are fine-tuned, and the SLRCN method is proposed: combining a convolutional neural network with a long-term recurrent network, spatial-domain and temporal-domain features are handled by two independent modules, with spatial features extracted by the CNN and temporal features classified by the LSTM. First, the feature vector of each microexpression frame is extracted with a pretrained CNN model to form a feature sequence. Then, the temporally correlated feature sequence is fed into an LSTM network to obtain the time-series output. In this way, the structure and output of the CNN can be fine-tuned to raise classification accuracy and facilitate learning on small-scale data sets.
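As a rough illustration of this two-stage design, the following minimal PyTorch sketch shows how per-frame CNN features can be collected into a sequence and passed to a bidirectional LSTM; it is a hypothetical outline, not the exact configuration used in the experiments, and a torchvision ResNet-50 stands in for the VGGFace (SENet) backbone, whose weights are not bundled with torchvision.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stage 1: pretrained CNN used as a frozen per-frame feature extractor
# (ResNet-50 is a stand-in for the VGGFace/SENet model used in the paper).
cnn = models.resnet50(weights=None)
cnn.fc = nn.Identity()            # keep the 2048-d pooled feature per frame
cnn.eval()

# Stage 2: bidirectional LSTM over the per-frame feature sequence.
bilstm = nn.LSTM(input_size=2048, hidden_size=512,
                 batch_first=True, bidirectional=True)

frames = torch.randn(1, 10, 3, 224, 224)                  # one clip of 10 frames
with torch.no_grad():
    feats = cnn(frames.flatten(0, 1)).reshape(1, 10, -1)  # (1, 10, 2048) feature sequence
seq_out, _ = bilstm(feats)                                 # temporal features, shape (1, 10, 1024)
```

Separating the two stages in this way allows the CNN features to be computed once and reused, which is convenient when the backbone is frozen for transfer learning.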
3.1. LRCN Network
LRCN is a recurrent convolutional architecture that combines a traditional CNN with an LSTM. The network can process both sequential video input and single-frame images, and can produce both single-value and sequence predictions, making it suitable for large-scale visual learning. The LRCN model directly connects the long-term recurrent network with the convolutional neural network so that convolutional perception and temporal dynamic learning are carried out simultaneously.
Combined with a deep hierarchical visual feature extraction model, this architecture can learn to recognize and serialize spatiotemporal dynamic tasks, including sequential (input, output) data such as video and its description; see Figure 2. At time n, a parameterized feature transformation is applied to each visual input qn (a single image or a video frame) to generate a fixed-length vector ln, ln ∈ Rd, where Rd is the d-dimensional real vector space. The feature-space representation of the video input sequence [l1, l2, ..., lN] is thus established and then fed into the sequence model.

In the usual form, the sequence model maps the input in and the hidden state bn-1 of the previous time step to the output kn and the updated hidden state bn. The states b1 = fM(i1, b0), b2 = fM(i2, b1), ... are computed successively until bn is obtained, where M is the weight parameter. To obtain the prediction distribution P(jn) at time step n, the last step is to apply a softmax logistic regression to the output kn of the sequence model,

P(jn = c) = exp(Mc kn) / Σc′=1..C exp(Mc′ kn),

which maps a vector to a probability distribution over the C possible outcomes at each step. Here jn = c denotes the event that the result belongs to class c, and Mc is the weight vector of class c.
LRCN instantiates the following learning tasks for three major visual problems (behavior recognition, image description, and video description):
(1) Sequential input, fixed output: ⟨i1, i2, ..., iT⟩ → k. For vision-oriented behavior prediction, a video of arbitrary length T is taken as input, and the behavior corresponding to the label is predicted.
(2) Fixed input, sequential output: i → ⟨k1, k2, ..., kT⟩. For image description, a fixed image is used as input, and a description label of arbitrary length is output.
(3) Sequential input and output: ⟨i1, ..., iT⟩ → ⟨k1, ..., kT′⟩. For video description, both the input and the output are sequential.
Experimental results show that LRCN is a model combining spatial and temporal depth. It can be applied to a variety of visual tasks involving input and output of different dimensions and has a good effect in video sequence analysis.
3.2. SLRCN Network
Since microexpressions occur over a sequence of video frames, it is particularly important to extract features in both the spatial and temporal domains. Therefore, taking advantage of LRCN's "dual depth" sequence modeling in behavior recognition and applying LRCN to microexpression sequence classification, an SLRCN model is proposed. The method consists of three parts: preprocessing, microexpression feature extraction, and feature sequence classification. Preprocessing includes facial cropping and alignment to extract the key facial area. Feature extraction applies a pretrained face-oriented CNN model to the image frames to build the feature set. Sequence classification feeds the feature set of the video sequence to an LSTM and then classifies the microexpression of the given sequence. This method has the following advantages:
(1) Based on LRCN, the structure is simple, requiring little input preprocessing and manual feature design and reducing intermediate steps.
(2) It suits the case of insufficient microexpression data and can extract facial microfeatures through transfer learning to avoid overfitting during training.
(3) The training process can be visualized, which is convenient for modifying the model and tuning parameters and features.
SLRCN consists of two parts in the training process: a CNN is used to extract the image features of facial expression frames, and an LSTM is used as a temporal classifier to analyze the correlation of the features along the temporal dimension.
3.2.1. CNN as a Feature Extractor
As a deep learning model, CNN is well suited to extracting basic image features while keeping model complexity low. Therefore, a CNN is used to extract feature vectors from microexpression sequences, giving stronger adaptability and better feature expression across different environments. For microexpression recognition, however, the sample size of the data set is very small, and overfitting may occur during network training, so it is not feasible to train a CNN directly on microexpression data. To reduce overfitting when training deep networks on microexpression data sets, CNN models pretrained on objects and faces are used for transfer learning, and feature selection is used to extract deep features related to the task.
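As a hedged illustration of this transfer-learning idea (the actual VGGFace weights are obtained separately and are not assumed here), a pretrained backbone can be loaded and largely frozen so that only a small number of parameters are trained, which is one common way to limit overfitting on a tiny data set:

```python
import torch.nn as nn
import torchvision.models as models

# Pretrained ImageNet backbone as a stand-in; the VGGFace/SENet weights used in
# the paper are distributed separately and are not assumed here.
extractor = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
extractor.fc = nn.Identity()                 # expose the 2048-d pooled feature

# Transfer learning: freeze all pretrained weights first ...
for param in extractor.parameters():
    param.requires_grad = False

# ... then optionally unfreeze only the last residual stage for fine-tuning,
# keeping the number of trainable parameters small to limit overfitting.
for param in extractor.layer4.parameters():
    param.requires_grad = True
```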
Literature [25] used the ImageNet database to initialize a residual network for transfer learning in microexpression recognition, carried out further pretraining on several macroexpression databases, and finally used microexpression data sets to fine-tune the residual network and the microexpression units. In general, however, expressions in macroexpression databases change greatly and have obvious characteristics, whereas microexpressions change only slightly and are closer to a neutral face image. Therefore, the VGGFace model for face recognition is used as the feature extractor for microexpression frames, since it can extract subtle features across different environments and people. The VGGFace model used in this paper is based on the Squeeze-and-Excitation Network (SENet) architecture and is trained on the VGGFace2 face database. SENet enhances the adaptability of the network by embedding squeeze-and-excitation blocks in a residual network (ResNet) and improves network performance by exploiting the relationships between feature channels. See Figure 3.

As shown in Figure 3, the transformation Ftr : X → P, with X ∈ R(M′ × B′ × C′) and P ∈ R(M × B × C), is realized as follows:

pz = vz ∗ X = Σx=1..C′ vz(x) ∗ X(x),

where vz = [vz(1), vz(2), ..., vz(C′)] denotes the z-th convolution kernel and vz(x) is its two-dimensional spatial kernel acting on the x-th input channel X(x). After this convolution operation, the feature P is obtained, a feature map of size M × B × C. Feature compression (squeeze) transforms the M × B × C input into a 1 × 1 × C output k ∈ RC, calculated as

kz = Fsq(pz) = (1 / (M × B)) Σa=1..M Σb=1..B pz(a, b).

The feature S = [s1, s2, ..., sC] obtained in the feature excitation process has dimension 1 × 1 × C and describes the weights of the C feature maps in P, namely

S = Fex(k, W) = σ(W2 δ(W1 k)),

where δ(W1 ·) is the dimensionality-reduction operation of the fully connected layer and σ(W2 ·) is the dimensionality-increase operation of the fully connected layer; the features are then rescaled channel by channel:

p̃z = Fscale(pz, sz) = sz pz.
Feature extraction is performed by fine-tuning, with feature compression implemented in a global average pooling (GAP) layer and two fully connected layers used to model the correlation between channels; overfitting is minimized by reducing the number of parameters and the amount of computation in the model.
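The squeeze-and-excitation operation described above can be written compactly; the following is a minimal PyTorch sketch of a generic SE block (the reduction ratio and layer names are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: GAP squeeze, two FC layers, channel-wise rescaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # squeeze: M x B x C -> 1 x 1 x C
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # dimensionality increase
            nn.Sigmoid(),                                # channel weights S in [0, 1]
        )

    def forward(self, p):                                # p: (batch, C, M, B)
        b, c = p.shape[:2]
        k = self.pool(p).reshape(b, c)                   # squeezed descriptor k
        s = self.fc(k).reshape(b, c, 1, 1)               # excitation weights S
        return p * s                                     # rescale each feature map
```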
3.2.2. LSTM Builds Sequence Classifier
Since microexpression changes occur over continuous time, it is difficult to identify them accurately without using temporal information. Therefore, to exploit the temporal variation of the expression sequence, a recurrent neural network is used to process input sequences of arbitrary length, which handles the time dimension more naturally. A bidirectional recurrent neural network with LSTM nodes is used to process the time-series data, and a long-term recurrent convolutional network is constructed to judge and classify whether a given sequence contains the relevant microexpressions.
The expression feature input sequence MicroE_Features = (i1, ..., iN), the forward hidden-state sequence (→b1, ..., →bN), the backward hidden-state sequence (←b1, ..., ←bN), and the output sequence Y = (y1, ..., yN) of the bidirectional LSTM model are defined. The output sequence Y is then updated as

→bn = B(M(i→b) in + M(→b→b) →b(n−1) + h(→b)),
←bn = B(M(i←b) in + M(←b←b) ←b(n+1) + h(←b)),
yn = M(→b y) →bn + M(←b y) ←bn + hy,

where the matrices M are the bidirectional LSTM weights, the terms h are the biases, and B(·) is the activation function; the computations are carried out with long short-term memory neurons. The bidirectional LSTM and the memory neuron are shown in Figures 4 and 5. In Figure 5, fn, xn, and on denote the forget gate, input gate, and output gate, respectively, and Cn denotes the state of the memory cell at time n.
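For reference, the gate computations of a single memory neuron (the forget gate fn, input gate xn, and output gate on of Figure 5) can be expressed as in the following sketch; it uses the standard LSTM cell equations rather than any paper-specific variant, and the parameter names are illustrative:

```python
import torch

def lstm_cell_step(i_n, b_prev, c_prev, W, U, bias):
    """One LSTM step: i_n is the input, b_prev/c_prev the previous hidden/cell state.
    W, U, bias each stack the four gate parameters (forget, input, candidate, output)."""
    z = i_n @ W + b_prev @ U + bias               # all gate pre-activations at once
    f_n, x_n, g_n, o_n = z.chunk(4, dim=-1)
    f_n = torch.sigmoid(f_n)                      # forget gate
    x_n = torch.sigmoid(x_n)                      # input gate
    g_n = torch.tanh(g_n)                         # candidate cell update
    o_n = torch.sigmoid(o_n)                      # output gate
    c_n = f_n * c_prev + x_n * g_n                # memory cell state C_n at time n
    b_n = o_n * torch.tanh(c_n)                   # hidden state passed to the next step
    return b_n, c_n
```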


The LSTM inputs are the spatial features extracted from all sequence frames with the pretrained model. A single-layer bidirectional LSTM structure is used in this paper, containing a hidden layer of 512 nodes. A dropout layer is placed between the LSTM hidden layer and the fully connected layer to randomly mask neurons with a certain probability, which enhances the robustness of the network by reducing the co-adaptation between neurons.
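A minimal sketch of this sequence classifier, assuming 2048-dimensional frame features, a single bidirectional LSTM layer with 512 hidden nodes, and a dropout layer before the fully connected output (the dropout probability shown is an assumption):

```python
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=5, p_drop=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p_drop)         # randomly masks neurons during training
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feature_seq):               # feature_seq: (batch, T, 2048)
        out, _ = self.bilstm(feature_seq)         # (batch, T, 1024)
        return self.fc(self.dropout(out[:, -1]))  # class scores from the final time step
```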
3.3. SLRCN Is Used for Microexpression Recognition
According to the improved method, for a given microexpression sequence, the recognition steps in this paper are as follows (a code sketch of steps (1)-(3) is given after this list):
(1) Load the microexpression video files to establish the sequence set {I1, I2, ..., IK} and its corresponding label set. Ix denotes the x-th microexpression sequence in the set, Ix = {Ix,1, Ix,2, ..., Ix,tx}, where Ix,y is the y-th image in the x-th microexpression sequence, tx is the length of the x-th sequence, and each Ix is paired with the x-th label in the label set.
(2) Normalize the sequence length; that is, set the LSTM time step to a fixed value T so that each sequence contains T frames. Face detection is then performed on the normalized video frames in turn to extract the face region, and the effective image sizes are normalized to obtain the processed data set, which makes the input video sequences suitable for the CNN. Because the collected microexpression sequences contain considerable noise and redundant information, irrelevant areas must be removed and data noise eliminated. Face alignment and face cropping are therefore applied to the sequences: a Haar face detector is used to detect faces, and the active appearance model (AAM) algorithm is used to extract facial feature points from the neutral-expression frame of each microexpression sequence. According to the feature point coordinates, the face contour is cropped, and the image is normalized to 224 × 224 × 3 so that size differences do not affect the results.
(3) Facial features are extracted by transfer learning from the pretrained VGGFace weights, which are fine-tuned so that the model adapts to microexpressions more effectively and converges faster. The network input is a 224 × 224 × 3 facial expression image, and the output is a feature vector i of length 2048 taken from the fully connected layer after the global average pooling layer. As in Formula (7), the feature vector i output by the extractor is L2-normalized to î = i / ‖i‖2. Finally, the obtained features are saved to build the feature set {X1, X2, ..., XK}, where Xx is the x-th extracted feature sequence, an N × 2048 matrix generated for one sequence, and Xx,n is the n-th feature of the x-th feature sequence. The resulting feature vectors are passed to the subsequent recurrent network.
(4) Because of the dynamic time-domain characteristics of the microexpression image sequence, temporal correlation exists among all frames. After the spatial feature extraction of each single frame is completed, the bidirectional LSTM network is trained with forward and backward propagation to obtain the temporal feature space of the expression. Let the facial expression feature of the n-th frame in the video sequence be in, with time index n ∈ {1, ..., N}, where N is the number of expression frames; the expression feature time-sequence matrix is then X = [i1, i2, ..., iN].
The sequential-input, fixed-output prediction distribution is then established as j = F(X; M), where F is the activation function, M is the decision-parameter model of the bidirectional LSTM, and j is the multi-class prediction. The implementation steps are shown in Figure 6.
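As a hedged sketch of steps (1)-(3) above, the following uses OpenCV's bundled Haar cascade for face detection and NumPy for the L2 normalization of Formula (7); the AAM landmark step is omitted for brevity, and the feature extractor is simply assumed to expose a 2048-dimensional output as described in the text:

```python
import cv2
import numpy as np

# OpenCV ships with pretrained Haar cascades for frontal faces.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(frame_bgr):
    """Detect the face in one frame, crop it, and normalize it to 224 x 224 x 3."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                     # take the first detected face
    crop = frame_bgr[y:y + h, x:x + w]
    return cv2.resize(crop, (224, 224))

def l2_normalize(feat, eps=1e-10):
    """Formula (7): L2-normalize the 2048-d feature vector i output by the extractor."""
    return feat / (np.linalg.norm(feat) + eps)
```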

4. Result Analysis and Discussion
To test the performance of the model, the CASME-II data set was used for training. The network model was trained according to the method described in this paper to verify its effectiveness.
4.1. The Data Set
The CASME-II data set was used for the experiments. CASME-II is a database of naturally induced microexpressions created by Fu Xiaolan's team at the Chinese Academy of Sciences. It contains 255 microexpression video samples from 26 Asian participants with an average age of 22 years. The data were collected under suitable lighting conditions in a strictly controlled experimental environment, with an image resolution of 640 × 480 pixels. Each sample is labeled with the start frame, the end frame, and the corresponding microexpression label, covering seven emotion categories: happiness, surprise, disgust, fear, sadness, repression, and others. The microexpressions captured in the database are relatively pure and clear, without noise such as head movements or unrelated facial movements. The data set used in this paper is divided into 5 categories, as shown in Table 1.
4.2. Data Set Preprocessing
To reduce differences between individuals and between microexpressions, the microexpression sequences in the data set are first preprocessed. Face alignment is carried out, the facial expression region is cropped, and the resolution of each frame is uniformly adjusted to 224 × 224 pixels so that the input spatial dimension matches the VGGFace network model. Because the number of frames per microexpression sequence is non-uniform, the temporal interpolation model (TIM) is used to interpolate each image sequence to 20 frames, yielding frame sequences of fixed length 20. Each 20-frame sequence is split into two 10-frame time series, and the 10-frame samples are then saved as training data, so two training samples are obtained from each video.
Because the amount of sample data is small, the data set needs to be expanded. In this paper, mirroring is adopted: each sample in the data set is horizontally mirrored, doubling the number of samples.
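The following sketch illustrates the preprocessing described above, with simple index-based resampling standing in for the full TIM graph-embedding interpolation, followed by the 10-frame split and horizontal mirroring; the frame counts follow the text, while the interpolation itself is a simplification:

```python
import numpy as np

def resample_sequence(frames, target_len=20):
    """Resample a (T, H, W, 3) frame array to a fixed length by index interpolation
    (a simplification of the temporal interpolation model, TIM)."""
    idx = np.linspace(0, len(frames) - 1, target_len).round().astype(int)
    return frames[idx]

def expand_sample(frames_20):
    """Split a 20-frame sequence into two 10-frame clips and add mirrored copies."""
    clips = [frames_20[:10], frames_20[10:]]
    clips += [clip[:, :, ::-1, :] for clip in clips]   # horizontal mirror (flip width axis)
    return clips                                       # 4 training clips from one video
```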
4.3. Experimental Analysis
The algorithms from literature [26], literature [27], and literature [28] and the algorithm proposed in this paper were trained for comparison. After 50 iterations, the changes in recognition accuracy and loss for the four algorithms are shown in Figures 7 and 8, respectively.
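A minimal sketch of this comparison protocol, assuming each model exposes a standard PyTorch interface and a common data loader of feature sequences (the 50-epoch budget follows the text; the optimizer and learning rate are assumptions):

```python
import torch
import torch.nn as nn

def train_and_log(model, train_loader, epochs=50, lr=1e-4, device="cpu"):
    """Train for a fixed number of epochs and record loss/accuracy per epoch."""
    model.to(device)
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    criterion = nn.CrossEntropyLoss()
    history = []
    for epoch in range(epochs):
        correct, total, loss_sum = 0, 0, 0.0
        for feats, labels in train_loader:            # feats: (batch, T, 2048)
            feats, labels = feats.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(feats)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            loss_sum += loss.item() * labels.size(0)
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.size(0)
        history.append((epoch, loss_sum / total, correct / total))
    return history
```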


To evaluate the practical performance of the models, the four algorithms were then evaluated on the test set. The changes in accuracy and loss for each algorithm are shown in Figures 9 and 10.


Analysis of the recognition accuracy and loss shows that the four algorithms perform differently on the training and test sets. The algorithm proposed in this paper performs best on both the training and test sets, while the algorithm in literature [26] performs worst.
To further analyze the effects of the above four algorithms, ROC curves are drawn in this paper, as shown in Figure 11.

The ROC curve is a graphical method for showing the trade-off between the true positive rate and the false positive rate of a classifier, with the true positive rate plotted along the Y-axis and the false positive rate along the X-axis. In an ROC plot, a model whose curve lies closer to the upper left corner is better. In Figure 11, the curve nearest the upper left corner belongs to the model proposed in this paper, which is therefore well suited to facial expression recognition; the worst-performing model is that of literature [27], whose curve is farthest from the upper left corner. In addition, the area under the ROC curve (AUC) is another classifier criterion: the larger the area, the better the predictive ability of the model. In terms of total area, the model proposed in this paper has the largest area and thus the best performance, while literature [27] has the smallest area and the worst performance.
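For a multi-class problem such as the five categories used here, per-class ROC curves and AUC values can be obtained with scikit-learn in a one-vs-rest fashion, roughly as follows (the class count and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def per_class_roc(y_true, y_score, n_classes=5):
    """y_true: (N,) integer labels; y_score: (N, n_classes) predicted probabilities."""
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    curves = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])  # one-vs-rest ROC per class
        curves[c] = (fpr, tpr, auc(fpr, tpr))                # larger AUC = better model
    return curves
```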
5. Conclusion
The application of facial expression recognition in learning scenarios is a trend in constructing new classrooms. With the research foundations of informatics, psychology, and pedagogy, learners' learning states can be studied through expression analysis. This paper focuses on common problems in current microexpression recognition research and realizes the recognition and classification of microexpression sequences through deep learning. Building on the excellent performance of LRCN in behavior recognition, an SLRCN method is proposed as an improvement better suited to microexpression data sets. To reduce the risk of overfitting when training a deep network, the feature set of facial expression frames is extracted with a pretrained VGGFace model using transfer learning, and the feature sets are fed into a bidirectional LSTM network to handle the short duration and temporal dependence of microexpression changes. Experimental results show that the method achieves high accuracy. Nevertheless, the recognition rate is still limited by the insufficient amount of labeled microexpression data, the uneven distribution of classes, and the generally weak intensity of microexpressions, so further work is needed to enrich the data. Based on the analysis of learners' emotions from dynamic expression sequences, a psychological characteristic model can be established to study the correspondence between learning states and emotional changes during the learning process, thereby promoting the application of microexpression recognition in teaching quality evaluation.
Data Availability
The labeled data set used to support the findings of this study is available from the author upon request.
Conflicts of Interest
The author declares that there are no conflicts of interest.
Acknowledgments
The research was supported by Investigation and Research on College Chinese Curriculum Education in Colleges and Universities under the Background of “Ideological and Political” (2021YB0285).