Abstract
Purpose. Face images in security videos are few in number, high-dimensional, weakly structured, and unlabeled, which makes faces difficult to track and recapture. To address these problems, we propose a multiscale facial feature manifold (MSFFM) algorithm based on VGG16. Method. We first build the VGG16 architecture to obtain face features at different scales and then construct a multiscale face feature manifold whose dimensions are these features. Recognition rate, accuracy, and running time are used to evaluate the performance of VGG16, LeNet-5, and DenseNet on the same database. Results. In the comparative experiments, VGG16 achieves the highest recognition rate (97.588%) and accuracy (95.889%) of the three networks. Its running time is only 3.5 seconds, which is 72.727% faster than LeNet-5 and 66.666% faster than DenseNet. Conclusion. The proposed model addresses a key problem in face detection and tracking in the public security field: it predicts the position of the target face image in the time-dimension manifold space and improves the efficiency of face detection.
1. Introduction
Face recognition [1–5], a biometric recognition technology, has in recent years been a prominent topic in pattern recognition, image processing, machine vision, neural networks, and cognitive science. Because it is stable, accurate, difficult to forge, and readily accepted by users [6–9], it has broad application prospects in identity authentication, security monitoring, human-computer interaction, and related fields. As information technology advances, the image processing involved in face recognition has become increasingly complex. With sufficient samples, a uniform background, and stable ambient light, most algorithms can achieve high recognition rates.
In practical applications, however, coping with environmental factors, facial expressions, and posture changes remains a difficult test for any algorithm. Eigenface-based face recognition extracts face features through dimensionality reduction. This lowers the computational complexity, but some effective features are lost in the process.
To capture the nonlinear structure of image features, previous researchers proposed two classical manifold learning methods: isometric mapping (ISOMAP) [10–12] and local linear embedding (LLE) [13–15]. By learning a mapping from the ambient space to the feature space, these methods preserve the structure between adjacent points after projection. Although they can model the manifold structure of the data, they require a large amount of dense data as training samples, which makes them unsuitable for some practical applications. Therefore, this paper constructs a convolutional neural network (CNN) architecture to obtain face features at different scales and thereby alleviate the problem of insufficient sample size.
In recent years, the CNN [16–20] has become a research hotspot in speech analysis and image recognition, especially in face recognition. A CNN exploits the locality of the data by combining local receptive fields on the face image, weight sharing, and spatial subsampling. These properties give it a degree of robustness to changes in illumination, posture, and occlusion.
Previously, there have been many studies on face recognition, such as methods based on statistical manifolds [21–24] and methods combined with deep neural networks. It can be seen that the image classification method based on the manifold will still maintain considerable research enthusiasm for a long time in the future.
In this paper, a CNN is used to construct a face recognition system. First, the VGG16 model [25–29] is used to extract facial features, and then multiscale facial feature manifold (MSFFM) [30, 31] is used for classification and tested in the actual environment.
2. Methodology
2.1. CNN Architecture
2.1.1. VGG16
VGGNet was proposed by the Visual Geometry Group at the University of Oxford [32] to explore the relationship between the depth of a CNN and its performance. By repeatedly stacking small 3 × 3 convolution kernels and 2 × 2 max pooling layers, it successfully constructed CNNs 16 to 19 layers deep. VGGNet is a modification of AlexNet [33]; the training image size is 224 × 224, and the mean of all training images is subtracted from each image. VGGNet comes in several configurations, ranging in depth from 11 to 19 layers, of which VGGNet-16 and VGGNet-19 are the most commonly used. VGGNet divides the network into 5 segments, each of which connects multiple 3 × 3 convolutional layers in series and is followed by a max pooling layer. The network ends with 3 fully connected layers and a softmax layer. The network structure of VGG16 is shown in Figure 1.
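The five-segment layout described above can be traced with a few lines of Python. This is a minimal sketch of the standard VGG16 configuration, assuming a 224 × 224 RGB input, 3 × 3 convolutions with padding 1, and 2 × 2 max pooling with stride 2; the names `VGG16_CFG` and `trace_vgg16` are illustrative and do not come from the paper.

```python
# Standard VGG16 layout: an integer is the number of output channels of a
# 3x3 convolution, "M" marks a 2x2 max pooling layer that closes a segment.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def trace_vgg16(input_size=224):
    """Return (number of conv layers, per-segment (size, channels)) for a square input.

    Padded 3x3 convolutions keep the spatial size; each pooling layer halves it.
    """
    size = input_size
    channels = 3  # RGB input (assumption)
    conv_layers = 0
    segment_outputs = []
    for item in VGG16_CFG:
        if item == "M":
            size //= 2
            segment_outputs.append((size, channels))
        else:
            channels = item
            conv_layers += 1
    return conv_layers, segment_outputs


conv_layers, segment_outputs = trace_vgg16()
print(conv_layers + 3)    # convolutional layers plus 3 fully connected layers
print(segment_outputs)    # spatial size and channel count after each segment
```

The five segment outputs are feature maps at five different spatial scales, so they are natural candidates for multiscale features, although the paper does not state which layers MSFFM actually uses.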

2.1.2. LeNet-5
LeNet-5 [34] is a classic CNN structure and the pioneering work on CNNs, used mainly for handwritten character recognition. Although the network is simple, its structure is complete: the convolutional layer, pooling layer, and fully connected layer it introduced are still in use today. The network is very shallow and uses a single kernel size. The three convolutional layers C1, C3, and C5 all use 5 × 5 kernels. The feature map of C5 is 1 × 1 because the feature map of S4 is 5 × 5, the same size as the kernel, so the convolution result is 1 × 1. The two pooling layers S2 and S4 use a 2 × 2 window, and there are two types of pooling here. F6 is a fully connected layer with 84 neurons. The network structure of LeNet-5 is shown in Figure 2.
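The feature map sizes quoted above follow from the usual output-size formula for unpadded windows. The sketch below assumes the classic 32 × 32 LeNet-5 input, 5 × 5 convolutions with stride 1, and 2 × 2 pooling with stride 2; `conv_out` and `trace_lenet5` are illustrative helper names.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def trace_lenet5(input_size=32):
    """Side length of the feature map after each LeNet-5 layer up to C5."""
    layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1),
              ("S4", 2, 2), ("C5", 5, 1)]
    sizes = {}
    size = input_size
    for name, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
        sizes[name] = size
    return sizes


lenet_sizes = trace_lenet5()
print(lenet_sizes)
```

Under these assumptions S4 produces a 5 × 5 map, so the 5 × 5 kernel of C5 covers it exactly and yields a 1 × 1 output.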

2.1.3. DenseNet
DenseNet [35] breaks away from the conventional ways of improving network performance, namely deepening the network as in ResNet [36] or widening it as in Inception [37]. Working instead from the perspective of features, it uses feature reuse and bypass connections to greatly reduce the number of network parameters and, to a certain extent, alleviate the vanishing gradient problem. Combining the ideas of information flow and feature reuse, DenseNet received the best paper award at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR). DenseNet absorbs the most essential part of ResNet and builds on it with further innovations that improve network performance.
DenseNet is a CNN with dense connections. In this network, there is a direct connection between any two layers: the input of each layer is the union of the outputs of all previous layers, and the feature map learned by that layer is passed directly to all subsequent layers as input. The network structure of DenseNet is shown in Figure 3.
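One consequence of dense connectivity is that the input width of each layer grows linearly with its position in the block. The sketch below illustrates this bookkeeping, assuming that the outputs are combined by channel concatenation and that every layer emits a fixed number of channels (the growth rate); the values 64, 32, and 6 are illustrative and are not taken from the paper.

```python
def dense_block_input_channels(k0, growth_rate, num_layers):
    """Input channel count seen by each layer of one dense block.

    Layer i (1-based) receives the block input (k0 channels) concatenated
    with the outputs of the i - 1 preceding layers (growth_rate channels each).
    """
    return [k0 + growth_rate * (i - 1) for i in range(1, num_layers + 1)]


def dense_block_output_channels(k0, growth_rate, num_layers):
    """Channels leaving the block: the input plus every layer's output."""
    return k0 + growth_rate * num_layers


channels = dense_block_input_channels(k0=64, growth_rate=32, num_layers=6)
block_out = dense_block_output_channels(k0=64, growth_rate=32, num_layers=6)
print(channels)
print(block_out)
```

Because each layer only adds a small number of new channels and reuses everything before it, the per-layer parameter count stays modest even though the connectivity is dense.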

2.2. Manifold Learning
2.2.1. Manifold Types Commonly Used in Image Classification
In computer vision, covariance matrices, which conform to the manifold geometry of symmetric positive definite matrices, have proven very effective in image classification tasks. Because the covariance matrix adapts well to many types of image variation, it has become a mainstream image feature representation in image classification based on Riemannian manifolds. The manifold types commonly used for image classification are the symmetric positive definite matrix manifold and the Grassmann manifold.
2.2.2. Kernel Function on Manifold
In the field of machine learning, the kernel method is a type of learning algorithm used to solve pattern recognition problems. The most classic example is the support vector machine (SVM). The main idea is to embed the data in the original space into a specific high-dimensional space through some kind of implicit nonlinear mapping, so that the linearly inseparable data in the original space becomes linearly separable after being mapped to the high-dimensional space.
Literature [18] presents a kernel function on the manifold of symmetric positive definite matrices. Using the Frobenius norm and the polarization identity, the inner product of two n-dimensional symmetric positive definite matrices in the tangent space of Sym_n is defined as Equation (1).
The corresponding kernel function on Sym_n then follows as Equation (2).
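Equations (1) and (2) are not reproduced in this text, so the following is only a sketch of one kernel that matches the description: the log-Euclidean kernel, which takes the Frobenius inner product of the matrix logarithms, k(X, Y) = tr(log X · log Y). Whether this is exactly the kernel of [18] is an assumption, and the helper names `spd_log`, `log_euclidean_kernel`, and `covariance_descriptor` are illustrative.

```python
import numpy as np


def spd_log(matrix):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    # V diag(log(lambda)) V^T: scale the eigenvector columns, then project back.
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_kernel(x, y):
    """Frobenius inner product of the matrix logarithms: tr(log(x) log(y))."""
    return float(np.trace(spd_log(x) @ spd_log(y)))


def covariance_descriptor(features, eps=1e-6):
    """Covariance descriptor of a (num_samples, dim) feature array.

    A small ridge (eps, an assumed value) keeps the matrix positive definite.
    """
    cov = np.cov(features, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])


rng = np.random.default_rng(0)
x = covariance_descriptor(rng.normal(size=(50, 4)))  # synthetic features, not face data
y = covariance_descriptor(rng.normal(size=(50, 4)))
k_xy = log_euclidean_kernel(x, y)
k_yx = log_euclidean_kernel(y, x)
k_xx = log_euclidean_kernel(x, x)
k_yy = log_euclidean_kernel(y, y)
print(k_xy, k_xx, k_yy)
```

Because this kernel is an ordinary inner product in the tangent space, a kernel matrix built from it can be passed to a standard kernel classifier such as an SVM.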
2.2.3. Face Super-Resolution Algorithm Based on Sparse Representation
The sparse representation (SR) method adds sparse representation theory to manifold learning: it uses a subset of the training sample blocks to linearly represent the input low-resolution image block and uses the L1 norm to solve for the optimal weight coefficients. In this way, the input image block is represented by the most similar face training sample blocks.
For each image block of the input low-resolution image, all training sample blocks at the same position in the low-resolution sample space are sparsely learned to reconstruct the representation coefficients, and the objective function is expressed as Equation (3).
Then, the objective function is transformed into Equation (4) for solving. We linearly combine the representation coefficients obtained from Equation (4) with the corresponding high-resolution training sample blocks to obtain the high-resolution prediction image block. This algorithm solves the problem that the solution is not unique in the location-based image block algorithm.
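The two steps above, sparse coding in the low-resolution space followed by weighting of the high-resolution blocks, can be sketched as follows. Equations (3) and (4) are not reproduced in this text, so the sketch assumes the common L1-regularized least-squares objective 0.5·||x − Dw||² + λ||w||₁ and solves it with ISTA, which may differ from the solver used in the paper. All data are synthetic, and names such as `ista`, `lr_blocks`, and `hr_blocks` are illustrative.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(dictionary, signal, lam=0.1, n_iter=500):
    """Minimize 0.5*||signal - dictionary @ w||^2 + lam*||w||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(dictionary, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        grad = dictionary.T @ (dictionary @ w - signal)
        w = soft_threshold(w - step * grad, step * lam)
    return w


rng = np.random.default_rng(0)
n_train, lr_dim, hr_dim = 20, 16, 64
# Columns are training blocks taken at the same position in every training face.
lr_blocks = rng.normal(size=(lr_dim, n_train))   # low-resolution sample space
hr_blocks = rng.normal(size=(hr_dim, n_train))   # matching high-resolution blocks

# Synthetic input block built from two training blocks (illustration only).
true_w = np.zeros(n_train)
true_w[[2, 7]] = [0.6, 0.4]
lr_input = lr_blocks @ true_w

lam = 0.01
w = ista(lr_blocks, lr_input, lam=lam)   # sparse representation coefficients
hr_pred = hr_blocks @ w                  # high-resolution prediction block
print(np.count_nonzero(w), hr_pred.shape)
```

The same coefficients are reused in both spaces, which is what ties the low-resolution input to its high-resolution estimate; the L1 penalty is what selects a small subset of training blocks.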
2.3. Data Set
2.3.1. Data Sources
In practical applications, the biggest challenge for a face recognition system is that recognition degrades under pose changes and occlusion. Therefore, this paper collects several types of face images from practical settings as experimental samples. A normal face sample is aligned or inclined at only a small angle, is not occluded, and has a single expression, as shown in Figure 4(a). The face samples with posture changes include a variety of expressions and side faces, as shown in Figure 4(b). A sample of an occluded face is shown in Figure 4(c). For the system test, 20 face images were collected from each of 36 people, for a total of 720 images. The classification and pixel sizes of the data set are shown in Table 1.

Figure 4: Face samples. (a) Normal face. (b) Face with posture changes. (c) Occluded face.
2.3.2. Experimental Environment
The software platform, processor, and operating system of the experimental environment are shown in Table 2. To keep the comparison fair, all methods are implemented, as far as possible, with the source code provided by the original authors, and the main parameters are set according to the instructions in the original papers.
2.4. Evaluation Criteria
This paper uses recognition rate, accuracy rate, and running time to evaluate the performance of VGG16, LeNet-5, and DenseNet.
First, the recognition rate is the ratio of the number of recognized images to the total number of images, as given in Equation (5):
$$R = \frac{N_r}{N} \times 100\% \qquad (5)$$
In Equation (5), $N_r$ is the number of recognized images and $N$ is the total number of images.
Second, accuracy is the ratio of the number of correctly identified images to the total number of images, as given in Equation (6):
$$A = \frac{N_c}{N} \times 100\% \qquad (6)$$
In Equation (6), $N_c$ is the number of correctly identified images and $N$ is the total number of images.
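Both criteria are simple ratios and can be computed as follows. This is a minimal sketch in which the function names and the example counts are hypothetical; only the total of 720 images comes from the data set description, and the other counts are not experimental results.

```python
def recognition_rate(num_recognized, num_total):
    """Percentage of images for which the system returns a recognition result."""
    return 100.0 * num_recognized / num_total


def accuracy(num_correct, num_total):
    """Percentage of images that are identified correctly."""
    return 100.0 * num_correct / num_total


# Hypothetical counts for illustration only.
total_images = 720
rate = recognition_rate(num_recognized=700, num_total=total_images)
acc = accuracy(num_correct=690, num_total=total_images)
print(f"recognition rate: {rate:.3f}%, accuracy: {acc:.3f}%")
```

Since a correctly identified image must first be recognized, the accuracy defined this way can never exceed the recognition rate on the same test set.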
3. Experimental Results
3.1. Recognition Rate
Figure 5 shows the comparison curve of the recognition rate of VGG16, LeNet-5, and DenseNet.

Figure 5 shows that the recognition rate increases with the number of samples and that MSFFM based on VGG16 achieves the highest recognition rate. When the number of samples is 240, its recognition rate is 97.588%.
3.2. Accuracy
Figure 6 shows the accuracy comparison curves of VGG16, LeNet-5, and DenseNet. The comparison shows that the face recognition algorithm based on VGG16 reaches an accuracy of 95.889%.

3.3. Computational Time
The data in Table 3 show that, on this database, the three networks take about 3.5 seconds, 9.1 seconds, and 11.6 seconds, respectively, to complete one run, so the proposed model shortens the computation time by more than half. VGG16 therefore reduces the computational cost effectively, which makes it more feasible in practical applications.
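The relative saving in running time can be computed directly from the three times quoted above. The sketch assumes that the times are listed in the order VGG16, LeNet-5, DenseNet and that the saving is measured relative to the slower network; other conventions, such as measuring relative to the faster network, give different percentages.

```python
def relative_reduction(fast, slow):
    """Percentage of the slower running time that is saved by the faster method."""
    return 100.0 * (slow - fast) / slow


# Running times in seconds, assumed to be in the order VGG16, LeNet-5, DenseNet.
times = {"VGG16": 3.5, "LeNet-5": 9.1, "DenseNet": 11.6}
reductions = {
    name: relative_reduction(times["VGG16"], seconds)
    for name, seconds in times.items()
    if name != "VGG16"
}
for name, value in reductions.items():
    print(f"VGG16 needs {value:.1f}% less time than {name}")
```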
4. Discussion
This paper introduces a CNN that integrates manifold learning. It uses the spatial manifold information of the image as an additional feature and integrates it into the improved CNN model, so that the model is better adapted to the data and more accurate. The proposed model thereby compensates for a lack of generalization ability.
Facial super-resolution is a scene-specific application of image super-resolution technology and has attracted widespread attention from scholars. In real scenes, the acquired face images are usually blurry and of low quality, for several reasons.
First, surveillance cameras are mounted high and cover a large area, so the target face image is small. Second, surveillance equipment has limited storage space, so the video is highly compressed and the image loses detail. Third, external conditions such as rainy weather and low light at night further reduce the quality of the captured images.
Existing image feature representation methods focus on matrix-type manifolds. How to fuse multiple types of Riemannian manifolds, such as linear subspaces and probability distributions, into a multimodel representation, and how to effectively unify the feature information of the different manifolds, are problems worthy of attention.
As research on CNNs has deepened, vector-based convolution and pooling have been studied thoroughly. Within a network, Riemannian manifold geometry can also be applied to the data in the intermediate layers. Such pooling and iterative processing in matrix form can have a positive effect on the final output of the network. In future work, we will continue research in this direction.
5. Conclusion
We studied and adopted a multiscale facial feature manifold algorithm based on VGG16 and designed and implemented a face recognition system on this basis. The system first extracts each frame from the video stream for face detection and then recognizes the detected faces. The test results in a real environment show that, when the training samples are sufficient, the system achieves a high recognition rate under changes in face pose, expression, and occlusion. Considerable room remains for research on manifold-learning-based algorithms in face recognition and on their application in face recognition systems.
Data Availability
The face image data used to support the findings of this study have not been made available because the images were collected by the authors and include the faces of project members, who are not willing to make their information public.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 62006102).