Abstract

With the development of smart classrooms, analyzing students’ emotions during classroom learning is an effective means of accurately capturing their learning process. Although facial expression-based emotion analysis methods are effective for analyzing classroom learning emotions, current research focuses on facial expressions alone and does not consider that the same expression under different postures does not represent the same emotion. To provide a continuous and deeper understanding of students’ learning emotions, this study proposes an algorithm to characterize learning emotions based on classroom time-series image data. First, a facial expression dataset for classroom scenarios is established to address the lack of expression databases collected in real teaching environments. Second, to improve the accuracy of facial expression recognition, a residual channel cross transformer masking net expression recognition model is proposed. Finally, to address the fact that existing research analyzes learning emotion along a single dimension, this paper fuses the facial expression and head posture data obtained from deep learning models and innovatively proposes a Dempster–Shafer evidence-theoretic fusion model to characterize the learning emotion within the lecture duration of each knowledge point. The experiments show that both the proposed expression recognition model and the learning emotion analysis algorithm perform well, with the expression recognition model achieving an accuracy of 73.58% on the FER2013 dataset. The proposed learning emotion analysis method provides technical support for holistic analysis of student learning effects and evaluation of students’ understanding of the knowledge points.

1. Introduction

Improving learning outcomes is a constant theme in education, and learning outcomes can be analyzed in terms of behavioral engagement, emotional engagement, and cognitive engagement. In Bloom’s taxonomy of educational objectives, emotion is classified as a separate broad domain, which indicates the importance of emotion in analyzing learning outcomes. Morrish et al. [1] demonstrated that emotions can influence behavior, thinking skills, and decision-making abilities, and Shen [2] pointed out that students’ affective states are an important factor in learning outcomes. Research has shown that students’ learning emotions in the classroom are an important indicator of their classroom learning status and learning effectiveness.

In traditional classrooms, teachers mainly judge students’ learning emotions through observation, but a teacher has limited capacity to perceive the learning emotions of many students at once. With the development of artificial intelligence, it has become practical to reduce teacher effort with automated machine learning methods that automatically select models for data analysis [3]. For example, Pei and Shan [4] developed a classroom microexpression recognition algorithm based on a convolutional neural network and automatic face detection, which improved the recognition rate and provided a new direction for applying deep learning to classroom expression recognition.

However, many current studies analyze students’ classroom learning emotions from a single dimension, which has certain limitations. For example, Fakhar et al. [5] developed a real-time automatic emotion recognition system that uses a deep learning model to identify three emotions (happiness, sadness, and fear) for classroom assessment. Liu et al. [6] proposed a new approach to infrared facial expression recognition with multilabel distribution learning, constructing an expression recognition network through label learning based on the Cauchy distribution function and detecting the seven traditional labels (angry, scared, disgusted, happy, sad, surprised, and neutral) to analyze students’ classroom learning emotions. These studies have some limitations: the expression categories they use do not apply to real classroom scenarios, and the accuracy of the inferred learning emotion suffers when the same facial expression can be produced either by a student thinking about a problem or by the surrounding environment. To remedy this deficiency, multimodal emotion recognition methods have been developed. In multimodal recognition, evaluation accuracy is improved mainly by combining and analyzing features from different data sources [7]. Based on this, Wang et al. [8] proposed multimodal deep belief networks that combine features from multiple physiological-psychological signals and video signals to obtain fused features of each modality for a more accurate assessment of emotion. However, this type of research places extremely high demands on data collection conditions and is not universally applicable in larger classroom environments. In recent years, classroom recording via video cameras has become the main means of data collection in smart classrooms. From a data analysis perspective, appropriate scientometric methods can map, mine, sort, and analyze classroom data to reveal the logic and relationships within it, and combined predictive approaches can further improve analysis accuracy [9, 10]. Therefore, this paper addresses the lack of classroom expression databases and the inaccuracy that results from analyzing learning emotions along a single dimension. Based on smart classroom scenarios, it presents a flexible and diverse analysis of classroom data and proposes a learning emotion analysis method applicable to real classroom environments: facial expression and head posture features are acquired and fused with the Dempster–Shafer theory (DST) to achieve an accurate characterization of students’ learning emotions in the classroom.
The main contributions of this study can be summarized as follows:
(1) An expression dataset applicable to a real classroom environment is constructed to provide a database for expression recognition in classroom teaching videos.
(2) A residual channel cross transformer masking net (RCTMasking-Net) is proposed. It uses down-sampling and up-sampling for multiscale feature extraction and fusion, combining shallow and deep information to enlarge the effective receptive field of the model, and uses a channel cross-attention mechanism for information fusion to better capture feature information, so that the model loses less information at the shallow level and classifies expressions more accurately.
(3) A learning emotion representation model based on multidimensional temporal image analysis is proposed. It integrates facial expression and head posture features to analyze learning emotions, avoiding the limitations of single-dimensional analysis and characterizing not only students’ classroom learning emotions but also their learning outcomes.

The rest of the paper is organized as follows. Section 2 presents the relevant research work. Section 3 proposes a method for analyzing learning emotions based on classroom temporal images. Section 4 presents the experimental procedure and a comparative analysis of the results of different experiments to evaluate the performance of the proposed algorithms. Finally, Section 5 concludes the paper.

2. Related Work

In the field of education, learning effects in the offline classroom can be analyzed with artificial intelligence techniques based on facial expressions, head posture, and multimodal emotion. This section therefore reviews research in these areas.

In studies of student-oriented facial expression recognition, Yi [11] analyzed students’ affective states from their learning status, learning level, and learning effectiveness and established a student affect model with an affect-based evaluation index system. Han et al. [12] proposed a classroom evaluation method based on video analysis of students’ facial expressions combined with the states of other facial organs, evaluating classroom effectiveness through head angle, eyebrow, eye, and lip states. Krishnnan et al. [13] proposed a new algorithmic framework for keyframe recognition in video using structural similarity to improve the recognition rate, incorporating facial expression and sleepiness detection to sense the student’s learning status. Mukhopadhyay et al. [14] proposed a method for assessing students’ affective states in online learning based on facial expressions, which was effective in assessing learning outcomes. The aforementioned studies achieved good results in expression recognition, but the expressions they define are the six basic emotion categories (happiness, sadness, anger, fear, surprise, and disgust), which are not suited to assessing students’ emotional state during classroom learning. Pabba and Kumar [15] proposed a real-time student group engagement monitoring system that analyzes students’ facial expressions to obtain academic affective states related to the learning environment: boredom, confusion, concentration, frustration, yawning, and sleepiness.

In studies of emotion based on head posture, Huang et al. [16] distinguished students’ classroom emotional states by head posture and facial expressions and proposed a method for locating facial feature points based on a deep convolutional neural network and cascading. Duan [17] analyzed learning emotion by detecting attention and proposed an attention detection method based on head-up/head-down states and eye closure detection. Leelavathy et al. [18] used a variety of machine learning techniques to predict student attention and learning affect from eye movements and head posture. Xu and Teng [19] proposed a classroom attention scoring system based on head Euler angles, introducing spatial information to correct the Euler angles, which yields more accurate angles for assessing attention and thus analyzing students’ learning emotions. The abovementioned studies explored the relationship between head posture and learning outcomes; however, they did not consider that analysis from a single dimension alone can lead to inaccurate results. Nevertheless, they provide a reference for extracting head posture features as input for multimodal analysis.

In multimodal emotion research, Yang et al. [20] proposed a multimodal emotion computation model that combines logic functions with a framework containing emotion expression channels such as speech, text, and facial expressions to compute emotion by analyzing emotional interactions in online collaborative learning. Li et al. [21] proposed a multichannel learning sentiment analysis method using speech and image data, together with a quantitative pleasure-arousal-dominance emotion scale, to analyze learning states. Peng and Nagao [22] proposed a multimodal intelligence detection model to identify students’ classroom learning status through the multimodal fusion of face, heart rate, and voice. Peng et al. [23] extracted eye, lip, and head features from interactive videos of a student online tutoring system and combined them with electroencephalogram brainwave sensor data to analyze learning effectiveness. Ling et al. [24] used multiple deep learning models to obtain head posture and classroom audio information and fused the two to detect students’ learning attention and analyze the classroom learning effect. The studies described above conduct fusion analysis across multiple modalities, avoiding the limitations of a single dimension and improving the accuracy of analyzing student learning outcomes. However, the data required by these studies are difficult to obtain in most smart classrooms, and the cost of wearable sensors is too high for large classroom environments.

In summary, research on defining student expressions in the classroom is scarce, most studies use expression categories that do not apply to real teaching environments, and there is still much room for improvement in methods for analyzing learning outcomes in real classroom scenarios. Therefore, this paper constructs an expression dataset applicable to the classroom environment and performs fusion analysis of students’ classroom head posture and facial expressions in a real teaching environment to obtain their learning emotions, providing technical support for assessing learning effectiveness and helping teachers to understand students’ classroom learning promptly and to intervene with individual students to improve classroom learning effectiveness.

3. A Model of Affective Representations of Classroom Learning

The block diagram of the whole classroom student learning emotion representation model is shown in Figure 1. In this paper, RCTMasking-Net and HeadPoseEstimate are used to obtain facial expression and head pose data, and a modified DST model is used to fuse facial expression and head pose to obtain the student learning emotions within the lecture duration of the knowledge point. Face detection is performed using trained MTCNN and FaceNet networks.

3.1. Expression Recognition Based on RCTMasking-Net

The RCTMasking-Net network model is shown in Figure 2. This network uses ResNet34 as the backbone, splits it into four modules, uses the four residual layers of ResNet34 for feature processing, and adds to each a channel cross transformer masking (CTMasking) block responsible for the corresponding feature mapping. The CTMasking block is mainly based on UCTransNet, a U-Net-based network structure [25]. Unlike the traditional U-Net, this network does not use skip connections but replaces them with a channel cross transformer (CCT) and channel cross attention (CCA).

First, the collected classroom face images are used as the original input, and an input feature map is obtained through the first convolution and pooling stage of ResNet34. Second, this feature map passes through the first residual layer to obtain the feature map $F_1$; then, $F_1$ passes through the CTMasking block to obtain a masking feature map $M_1$ of the same size. Finally, the output feature map $O_1$ of the first RCTM block is obtained through the following equation [26]:

$$O_1 = F_1 \otimes (1 + M_1), \qquad (1)$$

where $\otimes$ denotes element-wise multiplication.

Equation (1) enlarges the effective receptive field of the model without losing too much information at the shallow level; the masked term $F_1 \otimes M_1$ highlights the regions of the feature map $F_1$ that are more important for classification [27].
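To make the combination in equation (1) concrete, the following PyTorch sketch shows one RCTM block in which the residual-layer feature is reweighted by the CTMasking output; the module wiring and the sigmoid applied to the mask are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class RCTMBlock(nn.Module):
    """One RCTM block: a ResNet residual layer followed by a CTMasking branch whose
    output reweights the residual feature as O = F * (1 + M), echoing equation (1)."""

    def __init__(self, residual_layer: nn.Module, ctmasking: nn.Module):
        super().__init__()
        self.residual_layer = residual_layer  # e.g., one of ResNet34's four residual layers
        self.ctmasking = ctmasking            # U-Net-style branch producing a same-size mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.residual_layer(x)            # feature map F from the residual layer
        m = torch.sigmoid(self.ctmasking(f))  # masking feature map M, squashed to (0, 1)
        return f * (1.0 + m)                  # emphasized features; shallow information preserved

# Toy usage with identity stand-ins for the two branches.
block = RCTMBlock(nn.Identity(), nn.Identity())
out = block(torch.randn(1, 64, 56, 56))
```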

3.1.1. Channel Cross Transformer

As shown in Figure 2, the output features of the first three down-sampled convolutions are reshaped into two-dimensional flattened token sequences $T_1$, $T_2$, and $T_3$ with patch sizes $P_1$, $P_2$, and $P_3$, respectively. The concatenation of the three token sequences, $T_\Sigma = \mathrm{Concat}(T_1, T_2, T_3)$, serves as the key and value, and the inputs $Q_i$, $K$, and $V$ for multi-head cross attention are then generated as

$$Q_i = T_i W_{Q_i}, \quad K = T_\Sigma W_K, \quad V = T_\Sigma W_V. \qquad (2)$$

In equation (2), $W_{Q_i}$, $W_K$, and $W_V$ are the weights of $Q_i$, $K$, and $V$, respectively; $L$ is the length of the two-dimensional flattened token sequences; and $C_\Sigma$ is the total channel dimension of the three inputs to the CCT.

A similarity matrix is generated from $Q_i$ and $K$, and the value $V$ is weighted by it through the cross-attention mechanism so that the gradients can propagate smoothly when the attention operation is applied along the channels. The calculation formula is shown in the following equation:

$$\mathrm{CA}_i = \sigma\!\left[\psi\!\left(\frac{Q_i^{\top} K}{\sqrt{C_\Sigma}}\right)\right] V^{\top}, \qquad (3)$$

where $\sigma(\cdot)$ and $\psi(\cdot)$ denote the softmax and the instance normalization performed, respectively [28].

Unlike self-attention, this method performs the attention operation along the channel axis and uses instance normalization, so that the similarity matrix of each instance on the similarity map is normalized and the gradient can propagate smoothly.

After the multi-head attention mechanism, the output is

$$\mathrm{MCA}_i = \frac{1}{N}\sum_{h=1}^{N} \mathrm{CA}_i^{h}, \qquad (4)$$

where $N$ is the number of attention heads.

Next, after an MLP and a residual connection, the output is given by the following expression:

$$O_i = \mathrm{MCA}_i + \mathrm{MLP}\left(Q_i + \mathrm{MCA}_i\right). \qquad (5)$$
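The channel-wise attention above can be sketched in PyTorch as follows. The sketch is single-head and single-scale for brevity; the layer names, the MLP width, and the omission of per-head projections are simplifying assumptions rather than the exact CCT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelCrossAttention(nn.Module):
    """Attention computed along the channel axis: queries come from one encoder
    level's tokens, keys/values from the concatenation of all levels' tokens, and
    the similarity map is instance-normalized before the softmax."""

    def __init__(self, c_query: int, c_sum: int):
        super().__init__()
        self.w_q = nn.Linear(c_query, c_query, bias=False)
        self.w_k = nn.Linear(c_sum, c_sum, bias=False)
        self.w_v = nn.Linear(c_sum, c_sum, bias=False)
        self.norm = nn.InstanceNorm2d(1)                  # normalizes each similarity map
        self.mlp = nn.Sequential(nn.Linear(c_query, 4 * c_query), nn.GELU(),
                                 nn.Linear(4 * c_query, c_query))

    def forward(self, t_i: torch.Tensor, t_sum: torch.Tensor) -> torch.Tensor:
        # t_i: (B, L, C_i) tokens of one level; t_sum: (B, L, C_sum) concatenated tokens
        q, k, v = self.w_q(t_i), self.w_k(t_sum), self.w_v(t_sum)
        sim = q.transpose(1, 2) @ k                       # (B, C_i, C_sum) channel-wise similarity
        sim = self.norm(sim.unsqueeze(1)).squeeze(1)      # instance normalization of the map
        attn = F.softmax(sim / (k.size(-1) ** 0.5), dim=-1)
        ca = (attn @ v.transpose(1, 2)).transpose(1, 2)   # weight the values, back to (B, L, C_i)
        return ca + self.mlp(q + ca)                      # MLP and residual connection

# Toy usage: 196 tokens, query channels 64, concatenated key/value channels 448.
cct = ChannelCrossAttention(c_query=64, c_sum=448)
out = cct(torch.randn(2, 196, 64), torch.randn(2, 196, 448))
```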

3.1.2. Channel Cross Attention

In the channel cross-attention block, the CCT output $O_i$ and the decoder feature $D_i$ are used as inputs to the channel cross attention, where $D_i$ is the result of up-sampling after cascading with the output of the preceding channel attention block (for the first decoder stage, the result of up-sampling the feature obtained after the third down-sampling in the network). A global average pooling layer is used for spatial compression to produce the vector $\mathcal{G}(X)$, whose $k$-th channel is calculated as follows:

$$\mathcal{G}_k(X) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_k(i, j). \qquad (6)$$

$L_1$ and $L_2$ in equation (7) are the weights of the two linear layers, and $\delta(\cdot)$ is the ReLU operator [29].

A single linear layer and a sigmoid function are used to construct the attention feature $\hat{O}_i$, with the following equation:

$$\hat{O}_i = \sigma\!\left(L_3 \cdot \delta\!\left(L_1 \cdot \mathcal{G}(O_i) + L_2 \cdot \mathcal{G}(D_i)\right)\right) \odot O_i, \qquad (7)$$

where $L_3$ is the weight of the single linear layer, $\sigma(\cdot)$ is the sigmoid function, and $\odot$ denotes channel-wise multiplication with $O_i$.

The activation $\sigma(\cdot)$ in equation (7) indicates the importance of each channel.

Finally, $\hat{O}_i$ and $D_i$ are cascaded and up-sampled in turn before passing through a convolution layer to generate the masking feature map output by the CTMasking block.
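A minimal PyTorch sketch of this recalibration step is given below; the layer sizes and the final fusion convolution follow the description above, but the exact placement of the ReLU and sigmoid is an assumption.

```python
import torch
import torch.nn as nn

class CCABlock(nn.Module):
    """Channel cross-attention: global average pooling squeezes the spatial
    dimensions, two linear layers with ReLU build a channel descriptor, a single
    sigmoid-activated linear layer recalibrates the CCT output, and the result is
    concatenated with the decoder feature and fused by a convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc_o = nn.Linear(channels, channels)      # linear layer on the CCT-output branch
        self.fc_d = nn.Linear(channels, channels)      # linear layer on the decoder branch
        self.fc_attn = nn.Linear(channels, channels)   # single linear layer before the sigmoid
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, o_i: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        # o_i, d_i: (B, C, H, W) CCT output and up-sampled decoder feature
        g_o = o_i.mean(dim=(2, 3))                     # global average pooling, (B, C)
        g_d = d_i.mean(dim=(2, 3))
        m = torch.relu(self.fc_o(g_o) + self.fc_d(g_d))   # channel descriptor
        w = torch.sigmoid(self.fc_attn(m))             # per-channel importance in (0, 1)
        o_hat = o_i * w.unsqueeze(-1).unsqueeze(-1)    # recalibrated CCT output
        return self.conv(torch.cat([o_hat, d_i], dim=1))
```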

3.2. Head Pose Estimation

The head pose estimation algorithm aims to calculate the Euler angles of each person’s head from the images. Face alignment is performed using 3D dense face alignment (3DDFA) [30]. The algorithm uses a 3D morphable model [31] for the face representation, which is calculated as follows:

$$S = \bar{S} + A_{\mathrm{id}}\,\alpha_{\mathrm{id}} + A_{\mathrm{exp}}\,\alpha_{\mathrm{exp}}. \qquad (8)$$

In equation (8), $S$ represents the 3D face shape, $\bar{S}$ represents the mean face shape, $\alpha_{\mathrm{id}}$ is the shape parameter on the principal axes $A_{\mathrm{id}}$ of the 3D base shape, and $\alpha_{\mathrm{exp}}$ is the expression parameter on the principal axes $A_{\mathrm{exp}}$ of the 3D offset shape. The 3D face shape is then projected onto the image plane using a scaled orthographic projection. The calculation formula is as follows:

$$V(\mathbf{p}) = f \cdot \mathrm{Pr} \cdot \mathbf{R} \cdot S + \mathbf{t}_{2d}. \qquad (9)$$

In equation (9), $V(\mathbf{p})$ is the projection function generating the 2D positions of the model vertices, $f$ is the scale factor, $\mathrm{Pr}$ is the orthographic projection matrix, $\mathbf{R}$ is the rotation matrix constructed from the Euler angles, and $\mathbf{t}_{2d}$ is the translation vector.

The pitch, yaw, and roll angles are obtained from the rotation matrix $\mathbf{R}$. The head pose of each student over the lecture duration of a knowledge point is denoted by the resulting sequence of Euler-angle triples.
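As an illustration of how the Euler angles can be read off the rotation matrix $\mathbf{R}$, the following NumPy sketch uses a standard ZYX decomposition; 3DDFA's own convention may order or sign the axes differently, so this is an illustrative sketch rather than its exact code.

```python
import numpy as np

def euler_angles_from_rotation(R: np.ndarray):
    """Recover pitch, yaw, and roll (in degrees) from a 3x3 rotation matrix
    using a standard ZYX decomposition."""
    yaw = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))   # rotation about the vertical axis
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))            # head up/down
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))             # in-plane tilt
    return pitch, yaw, roll

# Example with the identity matrix (a frontal, upright head): all angles are zero.
print(euler_angles_from_rotation(np.eye(3)))
```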

3.3. A DST-Based Approach to the Emotional Representation of Learning

In the actual teaching environment, the three main gaze targets of students are the PPT, the desktop, and the teacher, and the attention state a student exhibits can be determined from the gaze orientation in the classroom. Combining existing research and the actual environment, this paper defines the attention recognition framework as $\Theta = \{\text{attentive}, \text{head-down}, \text{inattentive}\}$. The attentive frame covers gaze towards the PPT and the teacher’s position, the head-down frame covers looking at the desktop, and gaze towards other positions is classified into the inattentive frame.

In this paper, the yaw angle and the pitch angle in the obtained student head posture are used as two separate bodies of evidence, $E_{\mathrm{yaw}}$ and $E_{\mathrm{pitch}}$. The head pose data within the lecture duration of a knowledge point are used to assign basic probabilities to each part of the attention recognition frame based on the gaze thresholds corresponding to the pitch and yaw angles at each position. This paper proposes a new two-dimensional probability assignment method that not only fits the actual teaching scenario but also makes full use of the information contained in the two bodies of evidence, improving the accuracy of DST decision fusion to a large extent.

The assignment formula is shown in the following equation:

In equation (11), $\theta_1$ and $\theta_2$ are the yaw-angle thresholds for gazing towards the PPT and the teacher’s position, respectively, and $\theta_3$ is the pitch-angle threshold for looking down at the desk.

Based on the number of knowledge points taught by the teacher and the lecture duration of each knowledge point, the basic probability assignment of each body of evidence to each attention frame is calculated over that duration. The probabilities of the two bodies of evidence for each attention frame are then spatially fused [32] to obtain the probability of each attention state within the knowledge point’s lecture duration; the state with the maximum probability is taken as the student’s main attention state for that knowledge point, as shown in the following equation:

$$m(A) = \frac{1}{K}\sum_{B \cap C = A} m_1(B)\, m_2(C), \quad A \neq \varnothing. \qquad (12)$$

In equation (12), $K$ is the normalization factor, and $m_1(\cdot)$ and $m_2(\cdot)$ denote the basic probability assignments of the two bodies of evidence $E_{\mathrm{yaw}}$ and $E_{\mathrm{pitch}}$ for the student over the lecture duration of the knowledge point. The formula for calculating $K$ is shown in the following equation:

$$K = \sum_{B \cap C \neq \varnothing} m_1(B)\, m_2(C). \qquad (13)$$
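The spatial fusion of the two bodies of evidence follows Dempster's combination rule in equations (12) and (13); a minimal Python sketch is shown below, with made-up mass values used only to illustrate the call.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two basic probability assignments over the same frame of discernment
    with Dempster's rule (equations (12) and (13)). Each assignment maps a hypothesis,
    represented as a frozenset of attention states, to a mass in [0, 1]."""
    combined, conflicting = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflicting += mb * mc            # mass falling on the empty set
    k = 1.0 - conflicting                     # normalization factor K
    return {h: v / k for h, v in combined.items()}

# Illustrative call with singleton hypotheses; the mass values are made up, not measured.
m_yaw = {frozenset({"attentive"}): 0.7, frozenset({"head-down"}): 0.2, frozenset({"inattentive"}): 0.1}
m_pitch = {frozenset({"attentive"}): 0.6, frozenset({"head-down"}): 0.3, frozenset({"inattentive"}): 0.1}
fused = dempster_combine(m_yaw, m_pitch)
main_state = max(fused, key=fused.get)        # attention state with the highest fused mass
```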

Since judging from facial expressions alone does not accurately portray a student’s learning emotion in the learning scenario, this paper combines attention states with facial expressions to portray learning emotion (a minimal sketch of these steps is given after this list).
(1) The expression recognition network is used to identify the temporal sequence of expressions output for each keyframe within the lecture duration of a knowledge point. The frequency of each expression is calculated, and the expression with the highest frequency is taken as the student’s main expression during that lecture time.
(2) The attention state and the main expression are each assigned a value by the following equation:
(3) The attention state and the expression are then fused to derive the learning emotion, calculated as follows:
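A minimal sketch of these three steps is shown below; the frequency computation follows the description in step (1), while the decision rule standing in for equations (14) and (15) is a hypothetical mapping consistent with the paper's three learning emotion labels (dedicated, doubt, distracted), not the authors' exact formula.

```python
from collections import Counter

def main_expression(frame_expressions: list) -> str:
    """Most frequent expression label among the keyframes of one knowledge point."""
    return Counter(frame_expressions).most_common(1)[0][0]

def learning_emotion(attention_state: str, expression: str) -> str:
    """Hypothetical fusion rule: inattention dominates, doubt is kept when the student
    is attentive, and an attentive student with a neutral or engaged face is treated
    as dedicated. Equations (14) and (15) may combine the cues differently."""
    if attention_state in ("head-down", "inattentive"):
        return "distracted"
    if expression == "doubt":
        return "doubt"
    if expression in ("tired", "boredom"):
        return "distracted"
    return "dedicated"        # happy, dedicated, or neutral while attentive

# Example for one knowledge point: attentive student whose keyframes are mostly doubtful.
expr = main_expression(["neutral", "doubt", "doubt", "happy"])
print(learning_emotion("attentive", expr))    # -> "doubt"
```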

4. Experimental Results and Analysis

In this section, the performance of the proposed expression classification algorithm (RCTMasking-Net) is evaluated first. The algorithm is trained on the publicly available FER2013 dataset, and the trained model is evaluated with metrics such as the confusion matrix and accuracy. Second, the proposed student learning emotion representation model is evaluated and analyzed to verify its reliability in analyzing learning outcomes.

4.1. Data Sets
4.1.1. Dataset 1

The FER2013 dataset was used to evaluate the model because its image quality and label categories are similar to those of images in the classroom scenario. The dataset contains 35,887 grayscale images of size 48 × 48 pixels, classified into seven expression labels: anger, disgust, fear, happiness, sadness, surprise, and neutral. A total of 28,709 images were used for training, 3,589 for validation, and 3,589 for testing. To match the parameters of the pretrained ResNet model, the original images were resized to 224 × 224 and converted to RGB format.
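One possible torchvision preprocessing pipeline for this adaptation is sketched below; the ImageNet normalization statistics are an assumption, and the authors' exact preprocessing may differ.

```python
from torchvision import transforms

# Adapt 48x48 grayscale FER2013 images to the 224x224 RGB input of a pretrained ResNet34.
fer2013_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),      # replicate the single channel to three
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```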

4.1.2. Dataset 2

The classroom expression dataset ClassFaceD is derived from smart classroom videos of a course for university freshmen and contains a total of 19,691 images. The dataset divides classroom expressions into three main categories: dedicated, distracted, and doubt. The subcategories of dedicated are happy and dedicated; the subcategories of distracted are neutral, tired, and boredom; and the subcategory of doubt is doubt. A sample of the created ClassFaceD data is shown in Figure 3, and the distribution of the training, validation, and test sets is shown in Table 1.

4.2. Performance of the Emotion Classification Algorithm (RCTMasking-Net)
4.2.1. Architecture and Training Parameters

This paper uses the pretraining parameters of ResNet34 trained on the ImageNet dataset. The network input for the experiments is an RGB image of size 224 × 224. After several experimental tests and comparisons, the following combination of initial parameters was used: the optimization algorithm is Adam, the initial learning rate is 0.0001, the weight decay is 0.001, and the batch size is 48; the model is implemented in the PyTorch framework. The experimental environment for this study is shown in Table 2.
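A sketch of this training configuration is shown below; a plain ImageNet-pretrained ResNet34 stands in for RCTMasking-Net so that the snippet runs on its own, and the dummy batch replaces the real FER2013 data loader.

```python
import torch
from torch import nn, optim
from torchvision import models

# Stand-in backbone with a seven-class head (the real model adds the CTMasking blocks).
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 7)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)  # reported settings

# One optimization step on a dummy batch of 48 preprocessed 224x224 RGB images.
images, labels = torch.randn(48, 3, 224, 224), torch.randint(0, 7, (48,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```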

4.2.2. Model Evaluation and Analysis

To validate the performance of the RCTMasking-Net model, this study used the confusion matrix and accuracy rate as evaluation metrics, and some classical classification models were used on the public dataset FER2013 for comparison with the RCTMasking-Net model. These models include EfficientNet-XGBoost, Inception-v3, STN + TL, ResMaskingNet, and LHC-NET. All models were trained and tested in the same environment, and Table 3 shows the number of parameters as well as the accuracy of each model.

As can be seen from Table 3, although the network model in this paper has the largest number of parameters, it outperforms recent network models in terms of accuracy. The experiments show that multiscale feature extraction and fusion through down-sampling and up-sampling, together with the combination of shallow and deep information, enlarges the effective receptive field of the model, and that the channel cross-attention mechanism captures feature information better so that the model does not lose too much information in the shallow layers; together, these give the model a larger effective receptive field than the other models and thus higher recognition accuracy.

In this study, the training period was set to 100 epochs; the best training accuracy reached 95.96%, the best validation accuracy was 70.97%, and the accuracy curves are shown in Figure 4.

In addition, as the model in this paper uses ResNet34 and ResMaskingNet as its primary framework, the ResNet34, ResMaskingNet, and RCTMasking-Net models were compared on dataset 2. The results are shown in Table 4. The accuracy of the proposed model was 65.16%, compared with 62.23% and 63.35% for ResNet34 and ResMaskingNet, respectively. The results in Tables 3 and 4 show that the proposed network outperforms the other network models in expression recognition.

4.2.3. Ablation Experiments

To explore the role of the channel cross-attention mechanism in the RCTMasking-Net model, an ablation experiment was conducted on the FER2013 dataset by comparing against ResMaskingNet, which lacks the channel cross-attention mechanism, under the same experimental parameters (Adam optimizer, initial learning rate of 0.0001, weight decay of 0.001, batch size of 48, and 50 training epochs). The confusion matrices of the experimental results are shown in Figure 5.

Figure 5 shows the correct recognition rates of both models for each category of facial expression, with the RCTMasking-Net model outperforming the ResMaskingNet model for most expressions. The experimental results show that incorporating the channel cross-attention mechanism captures feature information better and thus improves the accuracy of the model.

4.3. Learning Emotion Analysis over Whole Knowledge Point Lectures
4.3.1. Experiments on the Integration of Learning Emotions within the Lecture Duration of Knowledge Points

In this paper, the collected set of keyframe images is used as input to obtain the head pose data of classroom students through the trained HeadPoseEstimate model. The network achieves an average NME of 3.59% on the AFLW2000-3D dataset with a computation time of 7.2 ms, giving good recognition results, and it outperforms other network models in multiperson scenarios. The results of running the data through the HeadPoseEstimate model are shown in Figure 6.

The model is used to obtain each student’s head posture data for the duration of the lecture, which is then used to analyze the student’s gaze direction. In the classroom scenario, two gaze landing points are specified for the head-up state (towards the PPT and towards the teacher’s position) and one for the head-down state (towards the desk). The yaw angle is used to determine whether the student’s gaze lands on the PPT or the teacher’s position, and the pitch angle is used to determine whether the student is looking down at the desk. Because the spatial coordinates of the seats differ, the corresponding gaze landing points have different rotation angles. In this paper, the thresholds $\theta_1$, $\theta_2$, and $\theta_3$ corresponding to each landing point are obtained by gazing at the corresponding landing point from each seat. The thresholds and angle change curves obtained for seat 1 (first position on the left in the middle of the first row in Figure 6) are shown in Figure 7.

Subsequently, experiments were conducted with six students over the lecture durations of 48 knowledge points.
(1) The HeadPoseEstimate model was used to obtain the head posture (yaw and pitch angles) of each student within each knowledge point; the head posture curves for student number 0616 (first position on the left in the middle of the first row in Figure 6) are shown in Figure 8.
(2) Calculating the basic probability assignment for each body of evidence by the method described in Section 3.3 yields Table 5, which gives the probability distribution of each Euler angle (yaw and pitch) over the three attention states for the student during the lecture duration of each knowledge point. For example, one entry gives the probability distribution of the pitch angle over the three attention states for the first student during the lecture duration of knowledge point 48. Spatial fusion of the results in Table 5 yields the probability of each attention state for that student; the results are shown in Table 6, which gives the probability of each attention state for the first student during the lecture duration of knowledge point 1.

Table 6 shows that the probability of the attentive state during the lecture duration of knowledge point 1 is 0.8796, so the attention state of student 0616 during that knowledge point is attentive. The attention state during the lecture duration of knowledge point 3 was head-down (looking at the desk), and the attention states for the other knowledge points can be obtained by analogy.

At the same time, the expression recognition method in Section 3.1 was used to obtain the temporal expression data of student 0616 in the classroom and the main expression within the lecture duration of each knowledge point, as shown in Figure 9. The most frequent expression within the lecture duration of a knowledge point was taken as the main expression for that knowledge point.

The results are shown in Table 7, which combines the students’ attention states and main expressions during the lecture time of each knowledge point using formula (15). For presentation purposes, dedicated is denoted by “1,” doubt by “0,” and distracted by “−1.”

4.3.2. Analysis of the Results

In this paper, experts in education and psychology were invited to watch the videos and label the learning emotions within the lecture time of the corresponding knowledge points; the emotion with the highest aggregated score was taken as the ground-truth learning emotion for that knowledge point. Classroom tests were also used to assess the reliability of the students’ classroom learning emotions. In this section, the accuracy of the DST fusion-based approach to characterizing learning emotion is compared with the accuracy obtained using expression classification alone [8] and head posture alone [13]; the results are shown in Table 8. Finally, the results of Table 6 are combined with those of Table 7 to synthesize the learning emotions of the six students over the forty-eight knowledge points, and each student’s classroom test score is compared in Table 9.

As can be seen from Table 8, there is a large margin of error in the actual classroom if only the single dimension of facial expressions is used to characterize students’ learning emotions. The accuracy rates for students 0616 and 0621 were 27.27% and 54.55%, respectively, when using facial expressions only, and 72.72% and 45.45% when using head pose only. By reviewing the classroom videos, it was found that student 0616 remained expressionless during the lecture, resulting in a lower accuracy from facial expressions alone, while student 0621 had a puzzled expression during the lecture and was easily judged as concentrating from head posture alone, also resulting in a lower accuracy. The experiment showed that, because most students keep a calm face in the classroom, their expressions are classified as neutral and it is difficult to judge learning emotions from facial expressions alone. Analysis of head posture can only determine whether students are paying attention and ignores the possibility that they are in doubt. These problems can be avoided when facial expressions are combined with head posture to characterize students’ learning emotions.

Classroom tests can be used to assess the effectiveness of student learning, which is influenced by the emotional state of students in the classroom. When students are in a positive affective learning state, they master the knowledge better and their test scores are relatively high. As can be seen from Table 9, the results of the proposed learning emotion characterization model correspond to the students’ test scores, and students with higher assessment scores are basically in a focused state, indicating that the model has good reliability. Meanwhile, Tables 7 and 9 show that student 0616 was basically in a distracted learning state in the early stage; the course teacher intervened by moving student 0616’s seat to the front row, and after the 11th knowledge point, his learning state gradually became positive and his final overall test score improved. This shows that the proposed algorithm can accurately analyze students’ learning emotions and provide a basis for teachers to take interventions that improve learning outcomes.

5. Conclusion

In this paper, a facial expression recognition network based on a channel cross-attention masking block and a DST-based learning emotion analysis algorithm were proposed to improve the assessment of learning effectiveness in a real classroom environment using classroom time-series image data. The method predicts students’ learning emotions in a real classroom setting, and its effectiveness is verified against individual student performance. The experiments demonstrate that the learning emotion analysis algorithm can analyze learning emotions accurately, effectively avoids the limitation of judging learning emotions from a single expression, helps teachers better understand students’ learning effectiveness, and supports interventions to improve it. However, the method has certain limitations: gaze thresholds must be set for each new seating position before a student’s learning emotion can be analyzed, and students in the head-down state are judged as inattentive and their emotion is ultimately judged as distracted, ignoring the fact that a student may be looking down to read, which leads to emotional misjudgment. In general, the proposed algorithm helps teachers analyze the overall classroom learning effect over the length of a lecture and consider whether to reinforce the teaching of particular knowledge points, which makes it well suited to real classroom situations. In addition, to better define classroom expressions, this paper samples and labels classroom videos to create a facial expression dataset, ClassFaceD, that applies to the classroom environment. The analysis of learning effectiveness includes aspects such as learning effect, cognitive state, and active thinking. Therefore, future research will consider combining cognitive states to reflect students’ classroom learning effectiveness, using student seating, interactions, and student relationships in smart classrooms to build classroom temporal social network features, and combining temporal data from learner knowledge tests to accurately portray learners’ cognitive states. The integration of cognitive state, classroom attention, and learning emotions will help teachers understand students’ learning status in a timely and accurate manner and support them in optimizing the teaching process. At the same time, more data will be collected to create a rich and diverse dataset. In addition, we plan to deploy the algorithm to embedded devices for use in smart classrooms to continuously help improve learning outcomes.

Data Availability

The data used in this work cannot easily be published directly because they involve student privacy; the data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 62177012, 61967005, and 62267003), the Innovation Project of GUET Graduate Education (Grant no. 2021YCXS033), the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (Grant no. CRKL190107), and the Guangxi University Young and Middle-aged Teachers Research Basic Ability Improvement Project (Grant no. 2021KY0212).