Abstract
As a result of rapid technological improvement and the rise of online social media, image data have grown rapidly. Because of their rich content and intuitiveness, images are one of the key modes of people's daily communication and are often used as communication vehicles. In image recognition, image feature extraction is a critical stage, and the quality of the extracted features directly impacts the effectiveness of recognition; feature extraction is therefore a key factor influencing recognition accuracy. Unfortunately, owing to individual variations and lighting, certain features that are strongly related to changes in the image are hard to extract. As a result, features that appropriately capture these changes are urgently needed. For this purpose, this research proposes an expression identification approach based on a deep convolutional neural network (DCNN) for facial expression recognition under online image feature information extraction. The method combines feature extraction and classification into a single deep convolutional network and uses VGG19 and Resnet18 to recognize and classify facial expressions. The proposed model is compared with recent approaches on the FER2013 and CK+ databases. The experimental results reveal that this method outperforms the competition and that a substantial amount of useful image feature information can be extracted.
1. Introduction
Deep learning is a young field that has seen significant progress in recent years, attracting the interest of an increasing number of researchers. With the rapid emergence of deep learning, the most essential model is the artificial neural network, whose goal is to imitate some processes and systems of the brain and realize some of its functions based on physiological research findings [1]. Scientists constructed perceptron models from artificial neurons; however, the early perceptron is quite limited since it can only solve linearly separable problems, so neural networks were neglected by researchers for a time. The breakthrough of the convolutional neural network [2] in 2012 altered this situation. The convolutional neural network was originally used to extract features. The goal of feature extraction is to extract the most important subset of features from the whole feature space of the data so as to enhance the model's predictive accuracy as much as feasible while ensuring the correctness of the prediction model. There are several approaches for extracting features.
Feature extraction techniques fall into three types: filtering, packaging (wrapper), and embedding methods. The filtering method [3] relies on characteristics of the data, such as correlation, distance, and information, and does not involve any learning algorithm. The packaging method [4] requires a predefined learning algorithm to evaluate the performance of the extracted feature set. In the embedded technique [5], feature selection and model training are carried out concurrently. The embedded technique has a lower computational complexity than the packaging method and can offer an acceptable feature subset for the learning algorithm, but the extracted feature subset may not be appropriate for other learning algorithms. Online feature extraction has been studied in the literature [6–8]. These studies assume that all features are available at the same time and that training samples are available before the learning process begins; the goal is to select a subset of features from the observed feature set and return a suitable model. This article focuses on the online learning setting in which training samples arrive sequentially. The literature [9, 10] studied sparse online learning, whose goal is to build a sparse linear classifier through learning. The purpose of feature extraction is to extract a subset of important features from a larger data set. To alleviate the curse of dimensionality, feature extraction can remove unnecessary and redundant features and increase generalization performance. This speeds up learning, improves the interpretability of the model, and hence improves the predictive model's performance. For high-dimensional data sets, feature extraction has been a popular way of overcoming these difficulties.
However, as methods become more complex, the modeling ability of deep learning grows rapidly. The authors of [11] first presented a gradient-learning-based convolutional neural network technique in 1989 and effectively applied it to handwritten digit recognition with less than 1% error. Alex-Net [12], a convolutional neural network, won the 2012 ImageNet image classification contest, often dubbed the "World Cup" of computer vision, with an accuracy roughly 12% higher than that of the runner-up. This victory marked the beginning of convolutional neural networks' growing domination of visual technology; since then, convolutional neural networks have been crowned ImageNet champions on an annual basis. In a convolutional neural network, convolutional layers are stacked on top of each other to extract features from a large amount of input. Convolutional neural networks surpassed human performance on the ImageNet data set (4.94% error) for the first time in 2015. Convolutional neural networks have become increasingly complex in recent years as the number of scholars working in related disciplines has grown and technology has evolved at an exponential rate. Over the years, researchers and engineers have employed a wide range of layer counts, from the earliest 5 layers to the 152 layers proposed in [13], whose residual network was the ILSVRC 2015 champion model. The depth and width of a neural network are the two most critical elements determining network complexity; however, depth is more effective than width in increasing network capacity, which is why the VGG concept focuses on increasing network depth. The VGG model, runner-up of ILSVRC 2014, was created by [14]; its most sophisticated configuration consists of 16 convolutional/fully connected layers, and 3 × 3 convolutions and 2 × 2 pooling are employed throughout the network. The ResNeXt [15] convolutional neural network is a multibranch convolutional network; the multibranch structure of Inception is an early example of this sort of network. According to ResNeXt, "cardinality" is as important to network performance as depth and width.
The perceptron algorithm [16] is one of the most widely used online learning algorithms. The perceptron technique first creates a linear model and then continually feeds training data into it; when an input training sample is misclassified, the perceptron modifies its parameters so that the sample is classified correctly. In addition to the perceptron algorithm, researchers have devised a slew of online learning algorithms [16, 17], the majority of which adhere to the maximum-margin learning principle. According to the Passive-Aggressive (PA) [18] approach, when a new instance arrives, the classifier should be updated so that the change from the previous classifier is as small as possible while the new instance is classified correctly. The PA algorithm's shortcoming is that it only exploits first-order information. The literature [19] developed a confidence-weighted online learning method to extend the PA algorithm to second-order information. The above online learning algorithms consider all the features of the data when classifying, so they can only process data sets with a small number of features. As a result, this paper investigates online learning in a big data environment. Because each data set comprises a huge number of features, the feature set must be selected and the most representative subset extracted. This paper proposes a deep convolutional neural-network-based technique for extracting image feature information to address the inadequacies of existing methods. Our method combines facial expression feature extraction and facial expression classification into a single network and then employs VGG19 and Resnet18 to complete face expression recognition and classification. The experiments show that the proposed strategy is superior to other advanced models and methodologies currently in use. This research also makes the following contributions:
(1) First, it investigates the basic principles of a convolutional neural network (its basic layers and their functionalities, with flowcharts) and the design of the convolutional neural network.
(2) Second, it combines facial expression feature extraction and facial expression classification into a single network and employs VGG19 and Resnet18 to complete face expression recognition and classification.
(3) Third, it presents the proposed online extraction algorithm of image feature information based on a convolutional neural network and experimentally shows that the proposed strategy is superior to other advanced models and methodologies currently in use.
The remaining portion of this paper is arranged as follows: Section 2 explains the basic principles of the convolutional neural network, Section 3 presents the proposed model for the online extraction algorithm of image feature information based on the convolutional neural network, Section 4 contains the experimental work and their simulations, and Section 5 concludes this paper.
2. Basic Principles of Convolutional Neural Networks
This section discusses the fundamental concepts of the convolutional neural network. A convolutional neural network is a type of deep neural network used for computer vision or processing visual imagery.
2.1. Convolutional Neural Network
Convolutional neural networks (CNNs) are a subcategory of neural network algorithms. A CNN not only learns image feature representations efficiently but also outperforms many traditional hand-crafted feature approaches [20]. Neural network models feature a hierarchical representation of data and rely on a sequence of layers computed sequentially, with the output of the previous layer serving as the input to the subsequent layers. Each layer corresponds to a single representation level and is specified by a set of weights; the input units are linked to the output units by weights and a set of biases [21]. In a CNN, weights are shared locally, which means the weights are the same at every input location; the shared weights form a filter, and each filter produces a corresponding output feature map. A convolutional neural network is made up of stacked convolutional layers, and each convolutional layer employs several convolution kernels. A convolution kernel, for example, scans the whole picture horizontally and vertically to generate a feature map. As the picture is processed, each pixel in the output image has a restricted receptive field, meaning that it depends on only a tiny portion of the input image. By gradually widening the receptive field of each consecutive convolutional layer, more nuanced and abstract information in the image is acquired. After several convolutional layers, abstract representations of the picture at various scales are eventually obtained.
2.2. The Key Structure of Convolutional Network
The key structure of the convolutional network is seen in Figure 1. According to this diagram, the convolution network structure consists of three primary layers: the convolutional layer, the pooling layer, and the fully connected layer, which are all components of a normal convolutional neural network. The convolutional layer takes local information from the picture, the pooling layer is used to substantially reduce the parameter magnitude (dimensionality reduction), and the fully connected layer outputs the desired result in the same way as a traditional neural network.

2.2.1. Convolutional Layer
As the name implies, the convolutional layer consists of a set of convolution kernels, which can be understood as filters. A filter exists in the form of a matrix, that is, a grid of values that can modify the image, and each filter extracts a specific feature. To determine the value of each pixel in the output, the products of the pixel's surrounding pixels and the corresponding filter matrix elements are calculated and then summed; the filtering is then complete. Image convolution is the mathematical term for this operation. Figure 2 depicts the convolution process.
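The following is a minimal NumPy sketch of the convolution just described: a kernel slides over the image and the elementwise products are summed at each position. The image, kernel, and "valid" sliding scheme are illustrative assumptions and not the exact filters used later in the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and sum elementwise products ("valid" mode, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)            # toy 8x8 grayscale "picture"
kernel = np.ones((3, 3)) / 9.0          # a simple 3x3 averaging filter for illustration
feature_map = conv2d(image, kernel)     # 6x6 feature map
print(feature_map.shape)                # (6, 6)
```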

Convolutional layers are the essential building blocks that allow all the magic to happen in a convolutional neural network. In a normal image recognition program, a convolutional layer is composed of multiple filters that identify the various aspects of an image. Figure 3 depicts classic filters.

(1) Sobel Filter. Edge identification and the search for intensity patterns in images are two common applications of Sobel filters. Applying a Sobel filter to an image is equivalent to taking the (approximate) derivative of the image along the x or y direction. The Sobel operators in the x and y directions can be defined as
$$\mathrm{Sobel}_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad \mathrm{Sobel}_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$$
The Sobel operator also indicates where edges are strongest, which is reflected in the gradient magnitude: the stronger the edge, the greater the magnitude, and vice versa. The magnitude (absolute value) of the gradient is the square root of the sum of the squares of the individual x and y gradients:
$$G = \sqrt{G_x^2 + G_y^2}.$$
In many cases, we need to find edges in a specific direction, for example, lines that only face upward or to the left. By separately calculating the image gradient in the x and y directions, we can determine the direction of the gradient. The gradient direction is the arctangent of the y gradient divided by the x gradient:
$$\theta = \arctan\left(\frac{G_y}{G_x}\right).$$
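A small sketch applying the standard Sobel kernels and computing the gradient magnitude and direction described above; conv2d is the function from the earlier listing, and the grayscale input array is an assumption.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T                       # derivative along y

def sobel_edges(image):
    gx = conv2d(image, sobel_x)                   # gradient along x
    gy = conv2d(image, sobel_y)                   # gradient along y
    magnitude = np.sqrt(gx ** 2 + gy ** 2)        # sqrt(Gx^2 + Gy^2)
    direction = np.arctan2(gy, gx)                # arctan(Gy / Gx)
    return magnitude, direction
```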
(2) Laplace Filter. The Laplace operator is a second-order differential isotropic operator that is better suited when only the position of the edge is of concern, rather than the grayscale difference between neighboring pixels. The Laplace operator is more sensitive to isolated pixels than to lines; hence, it is only useful for images that are free of noise. In the presence of noise, edge detection with the Laplacian operator requires low-pass filtering first.
(3) Canny Filter. In 1986, John Canny proposed the Canny edge detection technique, which is still widely used today. It is a multistage method. Gradient operators sharpen the image, in particular refining the edge contours so that edges become visible, but they are all affected by noise: false edges appear wherever the gray level changes considerably. As a result, the first step is to reduce noise. Next, the gradient of the image is computed to find candidate edges, that is, the places where the gray level changes most dramatically. This step can only obtain candidates: a place where the grayscale shifts may or may not be a true edge, so all available edge candidates are gathered at this stage. Non-maximum suppression then removes many points: only the point with the largest gray-level change along the gradient direction is kept, while the rest along that direction are discarded. This concentrates the response, replacing a multi-pixel-wide edge with a single-pixel-wide one; in other words, "fat edges" become "thin edges."
(4) Filtering with Two Thresholds. After non-maximum suppression there are still many candidate edge points, so a dual threshold is set: a low threshold (low) and a high threshold (high). Pixels whose gradient magnitude is larger than high become strong edge pixels, while pixels below low are removed. Pixels that fall between low and high are ambiguous; whether to keep them is decided by the presence of strong edge pixels in the surrounding area. Because some edges may not be completely closed, this step supplements the edge map with those points between low and high that are most likely to belong to an edge.
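For reference, a hedged sketch of the Laplacian and Canny (dual-threshold) steps described above, using OpenCV; the file name, blur parameters, and the 50/150 thresholds are illustrative assumptions, not values from the paper.

```python
import cv2

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)   # hypothetical grayscale input image
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)          # low-pass filter to suppress noise first

laplacian = cv2.Laplacian(blurred, cv2.CV_64F)        # second-order (Laplace) edge response
edges = cv2.Canny(blurred, 50, 150)                   # low/high dual-threshold Canny edges
```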
2.2.2. Nonlinear Activation Unit
Neurons are the inspiration for the nonlinear activation unit. Dendrites deliver signals to the cell body, which processes and sums them; when the sum exceeds a certain threshold, the neuron is activated and transmits a spike along its axon. A nonlinear activation function increases the neural network's nonlinearity: without it, each layer's output would be a linear function of its input, so no matter how many layers the network has, it would be equivalent to the original perceptron model. Nonlinear activation is therefore what allows a neural network to represent complex functions. At present, ReLU and Leaky ReLU are the most commonly used. The logistic (sigmoid) unit is gradually withdrawing from the stage of history because its saturation regions cause the gradient of the entire network to vanish (although the last layer sometimes still uses a sigmoid to limit the output to 0.0–1.0).
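A minimal NumPy sketch of the ReLU and Leaky ReLU activations mentioned above; the negative-side slope alpha is an assumed common default.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)               # zero for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):              # alpha: assumed small slope for x < 0
    return np.where(x > 0, x, alpha * x)
```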
2.2.3. Pooling Layer
The filters of the convolutional layer are responsible for finding patterns in the image. The more filters, the more parameters, which means the dimensionality of the convolutional layer's output may be very large, so we need a way to reduce it. This is the role played by the pooling (down-sampling) layer in a convolutional network. This layer is introduced after the convolutional layer, in particular after a nonlinear activation has been applied to the feature maps output by the convolutional layer. Pooling layers reduce the size of the feature maps, which reduces the number of parameters to learn as well as the amount of computation performed in the network. The pooling layer summarizes the features in a region of the feature map produced by a convolution layer. There are three main forms of pooling, as shown in Figure 4.

(1) General Pooling. The size of the pooling window is n × n; generally, the pooling window is square, and the stride is equal to n, so there is no overlap between the pooling windows. Elements inside the range of the matrix are computed directly, while positions that exceed the range are filled with 0 before being computed. There are two types of pooling: maximum pooling and average pooling. In maximum pooling, the maximum value within the pooling window is used as the sampled output value. For example, a window sliding 2 steps at a time splits the input into non-overlapping regions (marked with different colors in Figure 4), and each element of the output is the maximum element value in its corresponding region. In average pooling, the average value within the pooling window is used as the sampled output value; this kind of pooling is not as common as max pooling.
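A sketch of non-overlapping max and average pooling as described above, assuming for simplicity that the input height and width are divisible by the window size n.

```python
import numpy as np

def pool2d(image, n, mode="max"):
    """Non-overlapping pooling with an n x n window and stride n."""
    h, w = image.shape
    out = np.zeros((h // n, w // n))
    for i in range(0, h, n):
        for j in range(0, w, n):
            window = image[i:i + n, j:j + n]
            out[i // n, j // n] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))    # 2x2 output: each element is the max of its 2x2 region
```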
(2) Overlapping Pooling. When pooled areas overlap, less neighboring spatial information is lost, which is why overlapping (or fractional) pooling can be somewhat more effective. It is important to note that non-overlapping pooling is not necessarily a problem in practice, and overlapping or fractional pooling only modestly increases accuracy. In overlapping pooling, the pooling windows overlap, that is, the step size is greater than or equal to 1 and less than n, and the calculation is otherwise the same as in general pooling.
(3) Pyramid Pooling. Spatial pyramid pooling (SPP) can convert pictures of different sizes into a representation of the same size. SPP first treats the picture as 1 block and performs maximum pooling on it to obtain 1 value; it then divides the picture into 4 blocks and performs maximum pooling on each to obtain 4 values; then it divides the picture into 16 blocks and max-pools each to obtain 16 values; and so on. In this way, the number of values obtained is the same for images of varying sizes. Because maximum pooling is used, padding out-of-range positions with 0 has no effect on the outcome.
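A hedged sketch of spatial pyramid pooling using PyTorch's adaptive max pooling; the pyramid levels (1 × 1, 2 × 2, 4 × 4) follow the 1/4/16-block description above, and the channel count and input size are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    # feature_map: tensor of shape (batch, channels, H, W), any H and W
    pooled = [nn.AdaptiveMaxPool2d(l)(feature_map).flatten(start_dim=1) for l in levels]
    return torch.cat(pooled, dim=1)      # fixed-length vector regardless of input size

x = torch.randn(1, 64, 37, 53)           # arbitrary spatial size
print(spatial_pyramid_pool(x).shape)     # torch.Size([1, 1344]) = 64 * (1 + 4 + 16)
```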
2.2.4. Fully Connected Layer
Traditionally, a neural network forms the foundation of the fully connected layer: every neuron in one layer is connected to every neuron in the next layer. Because of its enormous number of parameters, the fully connected layer is prone to overfitting and does not align with human perception of images. For the most part, it is no longer used for feature extraction but rather for linear classification; in other words, the extracted high-level feature vectors are passed through a linear combination to generate the final prediction.
2.3. Convolutional Network Design Formula
2.3.1. Filling Formula of Convolutional Layer
There are two problems when convolving the original image directly. First, the image (feature map) shrinks after each convolution, so it cannot be convolved many times. Second, compared with points in the middle of the picture, points on the edge of the picture participate in fewer convolution computations, so information on the edge is easily lost. The filling (padding) method addresses this issue: a ring of zeros is padded around the image before each convolution, so the original edge pixels receive additional computations.
If we pad an (8, 8) picture to (10, 10), then after a (3, 3) filter it remains (8, 8), unchanged; padding can thus ensure that the input data and the output data have the same spatial size. Assuming that the number of zero padding is $p$, the convolution kernel is $f \times f$, and the convolution kernel slides one step at a time, then $p$ should be set to
$$p = \frac{f - 1}{2}.$$
2.3.2. Calculating the Output Size of the Convolutional Layer
Assuming that the original input image is $n \times n$, the output image is $o \times o$, the number of zero padding is $p$, the convolution kernel is $f \times f$, and the convolution kernel sliding step is $s$, the output size is
$$o = \left\lfloor \frac{n - f + 2p}{s} \right\rfloor + 1.$$
2.3.3. Calculating the Number of Parameters after Convolution
Suppose the input image is $w \times h \times d$, where $d$ is the image depth (number of channels), the convolution kernel is $f \times f$, and the number of convolution kernels is $n$; then the number of weights is $f \times f \times d \times n$, and the number of biases is $n$.
2.3.4. Calculating the Output Size of the Pooling Layer
The pooling layer rarely uses zero padding. Assuming that the original input image is $n \times n$, the output image is $o \times o$, the pooling window is $f \times f$, and the sliding step is $s$, the output size is
$$o = \frac{n - f}{s} + 1.$$
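The small helpers below restate the design formulas above in code; the variable names follow the text (n: input size, f: kernel size, p: zero padding, s: stride, d: depth), and the printed example values are for illustration only.

```python
def conv_output_size(n, f, p, s):
    return (n - f + 2 * p) // s + 1

def same_padding(f):
    return (f - 1) // 2                  # keeps the output equal to the input size for stride 1

def conv_param_count(f, d, num_kernels):
    weights = f * f * d * num_kernels    # f x f x d x n weights
    biases = num_kernels                 # n biases
    return weights + biases

def pool_output_size(n, f, s):
    return (n - f) // s + 1

print(conv_output_size(8, 3, 1, 1))      # 8: an (8, 8) image with padding 1 stays (8, 8)
print(conv_param_count(3, 3, 64))        # 1792 parameters for 64 3x3 kernels over 3 channels
```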
3. Online Extraction Algorithm of Image Feature Information Based on Convolutional Neural Network
3.1. Model Design
For the extraction of facial expression features and the classification of facial expressions, we use deep convolutional neural networks; specifically, we use VGG19 and Resnet18 to complete the process of recognizing and classifying expressions. Each small block of VGG19 consists of a convolutional layer, a BatchNorm layer, a ReLU layer, and an average pooling layer. Each of Resnet18's modules has shortcut links connecting the input and output ends of its convolutional and BatchNorm layers. To strengthen the model's robustness, we insert a dropout layer before the fully connected layer; compared with the traditional VGG19 and Resnet18, we omit several fully connected layers and classify directly into seven expression categories after a single fully connected layer.
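The following is a minimal PyTorch sketch of the classification head just described: a backbone with dropout inserted before a single fully connected layer mapping to seven expression classes. It is an illustrative reconstruction, not the authors' exact network; the torchvision backbones, the dropout rate, and the input preprocessing (FER2013 images are 48 × 48 grayscale and would need resizing/channel replication) are assumptions.

```python
import torch.nn as nn
from torchvision import models

def build_resnet18_head(num_classes=7, p_drop=0.5):
    backbone = models.resnet18(weights=None)
    backbone.fc = nn.Sequential(
        nn.Dropout(p=p_drop),                          # dropout before the fully connected layer
        nn.Linear(backbone.fc.in_features, num_classes),
    )
    return backbone

def build_vgg19_head(num_classes=7, p_drop=0.5):
    backbone = models.vgg19_bn(weights=None)           # conv + BatchNorm + ReLU blocks
    backbone.classifier = nn.Sequential(               # replace the stacked FC layers
        nn.Dropout(p=p_drop),
        nn.Linear(512 * 7 * 7, num_classes),           # single FC layer to 7 expression classes
    )
    return backbone
```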
3.2. Loss Function Design
The cross-entropy loss function was the first of the two loss functions we investigated in the design. After the fully connected layer, the model outputs a score for each category, but these scores are not yet normalized. To make processing easier, we use a softmax layer to normalize them into probabilities summing to 1. The cross-entropy loss is then
$$L = -\sum_{i} y_i \log p_i,$$
where $p_i$ is the softmax probability of class $i$ and $y_i$ is 1 for the true class and 0 otherwise.
To solve the multiclass problem, we employ softmax regression, which uses the normalized probabilities to obtain a solution. The hinge loss function is an abstraction of the linear SVM classification loss; its loss curve is a polyline, which is where the name hinge loss comes from. The loss term for the $i$th sample is
$$L_i = \sum_{j \neq y_i} \max\left(0, s_j - s_{y_i} + \Delta\right),$$
where $s_j$ is the score of class $j$, $y_i$ is the true class, and $\Delta$ is the margin.
If a sample is classified correctly by at least the margin, the loss term is 0; otherwise the loss is $s_j - s_{y_i} + \Delta$.
Classifiers based on SVM and softmax are the most popular. Unlike the SVM, the softmax classifier is a logistic classifier generalized to multiple classes, and its normalized classification probabilities are more interpretable than the SVM's score for each class; the probabilities of all classes sum to one.
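A sketch of the two loss choices discussed above, softmax cross-entropy and a multiclass hinge (SVM) loss, in PyTorch. The margin value delta = 1 is an assumed common default, not a value stated in the paper.

```python
import torch
import torch.nn.functional as F

def softmax_cross_entropy(scores, targets):
    # scores: (batch, 7) unnormalized outputs; targets: (batch,) class indices
    return F.cross_entropy(scores, targets)                    # softmax + negative log-likelihood

def multiclass_hinge(scores, targets, delta=1.0):
    correct = scores.gather(1, targets.unsqueeze(1))           # s_{y_i} for each sample
    margins = (scores - correct + delta).clamp(min=0)          # max(0, s_j - s_{y_i} + delta)
    mask = torch.ones_like(margins).scatter_(1, targets.unsqueeze(1), 0.0)
    return (margins * mask).sum(dim=1).mean()                  # exclude the correct-class term
```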
4. Experimental Design and Results
4.1. Experimental Design
This section describes the proposed scheme’s experimental design. This approach aids in the development of the suggested model for the online extraction of visual feature information using a convolutional neural network.
4.1.1. Data Sets
For this paper, the FER2013 and CK+ databases are used. The FER2013 data set consists of 28709 training images and 3589 images each in the public and private test sets; each image is a 48 × 48 pixel grayscale image. The FER2013 database covers seven expressions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The data come from the 2013 Kaggle competition. Because the database was primarily compiled by web crawlers, it inevitably contains some labeling errors; its estimated accuracy ranges from 65 to 75 percent. The CK+ database, released in 2010, is an extension of the Cohn-Kanade data set. It contains 123 subjects and 593 image sequences; each sequence ends with a label identifying the action unit, and 327 of the 593 sequences carry emotion labels. As a laboratory-collected database, it is more reliable and rigorous than databases collected outside the laboratory, and CK+ is a well-established database for facial expression recognition that is used for testing in many publications.
4.1.2. Data Enhancement
To prevent the network from overfitting too quickly, some image transformations, such as flipping, rotation, and cropping, can be applied artificially; this process is known as data augmentation (enhancement). In addition, augmenting the data increases the effective database size and makes the trained network more robust. During the training phase of the experiment, images were randomly cropped and mirrored before being fed to the network. During testing, this article uses an ensemble-style approach: the images are cropped and mirrored at the upper-left, lower-left, upper-right, and lower-right corners, as well as at the center, so that ten images are fed into the model, effectively enlarging the test data tenfold. The average probability over the ten crops is then calculated, and the expression with the largest average output is taken as the result. This strategy effectively reduces classification error.
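A hedged sketch of the ten-crop test-time procedure described above, using torchvision; the crop size of 44 pixels on 48 × 48 FER2013 images is an assumption for illustration.

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.TenCrop(44),                                   # 4 corners + center, each also mirrored
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)                 # (10, C, 44, 44)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)
    return probs.mean(dim=0).argmax().item()    # expression with the largest average probability
```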
4.1.3. Performance Evaluation
Accuracy can be used as an index to evaluate the quality of a machine learning model, and this evaluation index is directly or indirectly related to the confusion matrix, that is, it can be directly calculated from the confusion matrix.
(1) Confusion Matrix. Precision evaluation is commonly written as a matrix with n rows and n columns, known as an error matrix or confusion matrix. The confusion matrix is a visualization tool in artificial intelligence, notably for supervised learning; in unsupervised learning it is commonly referred to as a matching matrix. There is a column for each predicted category and a row for each true category: the total of each column reflects the number of samples predicted to be that category, and each row shows the number of samples that truly belong to that category. Table 1 shows the confusion matrix for a binary classification task.
In the confusion matrix presented in Table 1, each row reflects the true condition and each column represents the model's predicted class/predicted condition. Using the confusion matrix, we can see a model's categorization in each category (positive and negative). The definitions of TP, FP, FN, and TN are shown in Table 2.
In the labels TP, FP, TN, and FN, the second letter (P or N) represents the predicted category of the sample, while the first letter (T or F) represents whether the predicted category matches the true category.
(2) Accuracy. Accuracy is the most intuitive performance metric: it is simply the ratio of correctly predicted observations to the total sample size. Accuracy is an excellent statistic, but only when the numbers of false positives and false negatives are nearly equal. The accuracy of our model is computed as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Accuracy indicates the proportion of samples (TP and TN) that are predicted correctly among all samples.
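A small sketch computing the confusion matrix and the accuracy defined above from predicted and true labels; the row/column convention follows the description given earlier, and the example labels are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                 # rows: true class, columns: predicted class
    return cm

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()  # (TP + TN) / (TP + TN + FP + FN) in the binary case

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```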
4.2. Experimental Results
This section describes the findings of the proposed research. These findings are discussed in two parts: first, the experiments on the CK+ data set are explained, and second, the experiments on the FER2013 data set are explained in detail.
4.2.1. Experiments on CK + Data set
Table 3 shows the outcomes of the experiments. According to the average results of tenfold cross-validation, VGG19 outperforms Resnet18. Compared with FER2013, the CK+ data set yields more accurate results because it was collected in a laboratory and the samples are easier to identify. Our VGG19 + dropout + 10-crop + softmax model achieves an average accuracy of 0.9465.
4.2.2. Experiments on FER2013 Data set
This section describes our experiment on the FER2013 data set. As explained previously, the data comprise grayscale face images 48 × 48 pixels in size. The faces have been automatically registered so that they are roughly centered and occupy approximately the same area in each image. On the FER database, the VGG and Resnet18 depth models achieve good classification results, as shown in Figures 5 and 6.


According to the above figures, the VGG-based approach is superior to the Resnet18-based one. Dropout is a powerful tool for reducing overfitting and increasing precision: by removing certain connections at random during training and reactivating them during testing, dropout achieves an effect similar to combining many sound models into a more complete forecast. Recognition error rates can be reduced further using the 10-crop approach: random cropping increases the quantity of data in the training stage, which is similar to directly enlarging the data set and reduces overfitting, and in the prediction step ten crops are used to forecast the result at the same time, decreasing network misjudgments. Softmax's classification performance outperforms the SVM's: softmax takes all categories into account, and the classification result is the outcome of all K classes combined, whereas the SVM approach only compares two classes at a time and only learns whether or not a sample belongs to the correct class, which makes classification more challenging. The VGG19 + dropout + 10-crop + softmax model achieves a very high level of single-model performance: it reaches 0.7150 on the Public test set and 0.7311 on the Private test set, which is also the most advanced level under a single model. The training and testing accuracy curves of VGG19 + dropout + 10-crop + softmax are shown in Figure 7, and its confusion matrix on the PrivateTest data set is shown in Figure 8.


On the one hand, as seen in Figure 8, happiness and surprise are recognized much more accurately, while fear is far less accurate. One possible explanation is that the distribution of expressions in the data set is unequal: there are 7215 images of happiness but only 436 images of disgust, against an average of 4101 images per class. In addition, there are certain similarities among the four emotions of anger, disgust, fear, and sadness. The class imbalance can thus lead to classification errors, and individuals would also find it challenging to distinguish between these four types of expressions in real life; facial emotions are harder to decipher between people who do not know each other. Misjudgments were also shown to be concentrated in specific categories: in certain cases it is difficult to tell one category from another and easy to mix them up. Further investigation of attention modules for certain expressions, and further work on enhancing the categorization capacity, is required. Figure 9 shows the comparison with state-of-the-art methods.

As far as we can determine, our model appears to be the most accurate on the current FER2013 data among nonintegrated (single-model) methods. We believe this is due to the deep convolutional network's superior ability to extract features; data augmentation strategies also help with facial expression categorization.
Figure 10 compares the accuracy of our two models: model-1 (comprising VGG19, dropout, 10-crop, and softmax) and model-2 (comprising Resnet18, dropout, 10-crop, and softmax). From this figure, it is clear that the accuracy of model-1 is higher than that of model-2.

5. Conclusions
Image recognition is currently a hot research topic for many scholars, but within image recognition, image feature extraction is a critical stage, and the quality of the extracted features directly impacts the effectiveness of recognition. Unfortunately, due to individual variations and illumination, certain features that are strongly related to changes in the image are hard to extract, so features that appropriately capture these changes are urgently needed. To address these issues, this research introduced an image feature extraction approach based on deep convolutional neural networks and applied it to facial expression images. Several experiments were performed for effective online image feature information extraction. In the experimental work, a high accuracy rate on both FER2013 and CK+ was achieved using the deep convolutional VGG19 and Resnet18 models, illustrating the deep convolutional network's accuracy and dependability on the expression classification task. The proposed method combines facial expression feature extraction and facial expression classification into a single network and employs VGG19 and Resnet18 to complete face expression recognition and classification. The experiments proved that the proposed strategy is superior to other advanced models and methodologies currently in use.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The author declares that he has no conflicts of interest.
Acknowledgments
This paper was supported by the Projects of Guangxi Young and Middle-Aged Teachers' Basic Scientific Research Ability Improvement (Grant no. 2019KY1594) and by the project "Research and Implementation of Smart Campus Online Service Hall Based on Micro-Service Architecture" under the Projects of Guangxi Young and Middle-Aged Teachers' Basic Scientific Research Ability Improvement (Grant no. 2022XXH0017).