Abstract
Face image super-resolution refers to recovering a high-resolution face image from a low-resolution one. In recent years, breakthroughs in deep representation learning have made face super-resolution one of the most active topics in the super-resolution field. However, the performance of these deep learning-based approaches relies heavily on the scale of the training set, and their efficiency is limited in real-time applications. To address these issues, we introduce a novel method based on parallel imaging theory and OpenVINO. In particular, inspired by the learning-by-synthesis methodology in parallel imaging, we propose to learn from a combination of virtual and real face images. In addition, we introduce a center loss function, borrowed from deep models, to enhance the robustness of our model, and we apply OpenVINO to speed up inference. To the best of our knowledge, this is the first work to tackle face super-resolution with the parallel imaging methodology and OpenVINO. Extensive experimental results and comparisons on the publicly available LFW, WebCaricature, and FERET datasets demonstrate the effectiveness and efficiency of the proposed method.
1. Introduction
Image super-resolution (SR) is a classical problem in image processing and computer vision [1]. Face super-resolution is one of its important branches. With the advent of the age of intelligence, face super-resolution has been gradually adopted in applications such as face recognition, intelligent surveillance, and identity recognition.
Face super-resolution is a mathematical inverse problem: it is the ill-posed problem of producing, from a low-resolution face image, a high-resolution face image that carries the critical information or has a good visual effect (close to the real face image) [2]. At present, there are two categories of reconstruction methods: interpolation-based methods and learning-based methods [3].
Interpolation-based methods were the first proposed for the face super-resolution problem. They essentially treat face super-resolution as an ill-posed image problem and solve it using the prior information of the low-resolution image. The classical methods include nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation [4]. Nearest neighbor interpolation is simple and fast because the interpolated value is taken from the gray value of the nearest neighboring pixel [5]. Bilinear interpolation obtains each interpolated pixel value from its surrounding pixels by interpolating linearly in the horizontal and vertical directions [6]. Unlike these two methods, bicubic interpolation uses the 16 neighboring pixels of the point to be interpolated [7]. Although these traditional methods are computationally simple and run in real time, the reconstructed images are often unsatisfactory because of aliasing, blocking, and blurring artifacts. In recent years, some advanced interpolation-based methods have been proposed. For example, Jiang et al. [8] presented a missing intensity interpolation method based on smooth regression with local structure prior (LSP), named SRLSP for short. Wei [9] proposed an image super-resolution reconstruction method using high-order derivative interpolation associated with fractional filter functions. However, since interpolation methods rely on the prior information of the low-resolution image, the reconstruction quality drops rapidly when the input image is small.
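The difference between nearest neighbor and bilinear interpolation described above can be illustrated with a minimal sketch. It is written in plain Python for a grayscale image stored as a list of rows; the function names and the integer-factor, top-left-aligned sampling grid are assumptions made for illustration, not the implementations used in the cited works.

```python
def nearest_upscale(img, factor):
    """Each output pixel copies the gray value of its nearest source pixel."""
    h, w = len(img), len(img[0])
    return [[img[i // factor][j // factor] for j in range(w * factor)]
            for i in range(h * factor)]


def bilinear_upscale(img, factor):
    """Each output pixel blends its four surrounding source pixels linearly."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h * factor):
        # Map the output row back to a fractional source row (clamped at the border).
        y = min(i / factor, h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(w * factor):
            x = min(j / factor, w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            # Linear interpolation along the horizontal direction, then the vertical one.
            top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]
            bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]
            row.append((1 - dy) * top + dy * bottom)
        out.append(row)
    return out


small = [[0.0, 10.0], [20.0, 30.0]]
print(nearest_upscale(small, 2))
print(bilinear_upscale(small, 2))
```

Nearest neighbor interpolation produces blocks of constant value, while bilinear interpolation produces intermediate values between neighbors, which is the source of the blocking and blurring artifacts mentioned above.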
The other category is the machine learning-based method, which learns the mapping between low-resolution face images and high-resolution face images. Statistical dictionary methods are one kind of learning-based method: they build a dictionary and learn the connection between low-resolution and high-resolution images. For example, Yang et al. [10] proposed a reconstruction method based on a sparse representation model, which trains a dictionary of low-resolution image blocks together with a dictionary of high-resolution image blocks and reconstructs the image from the resulting sparse representation. However, methods based on statistical learning need the number of dictionary elements and the noise variance model to be specified in advance, so it is difficult to obtain optimal parameters for natural images in real life. Other learning-based methods include position-patch-based methods. Ma et al. [11] proposed a position-patch-based super-resolution method, which uses the image patches at the same position in each training image, with the same weights, to hallucinate and reconstruct the high-resolution face image. Huang and Wu [12] introduced an approach that reconstructs a high-resolution face image directly from a single low-resolution face image by using multiple local linear transformations and Procrustes analysis, without large redundant databases of low-resolution and high-resolution patches. Jiang et al. [13] proposed a simple and effective scheme, local constraint representation (LCR), which not only incorporates locality constraints to maintain locality but also exploits the sparse property of the redundant data representation.
In [14], a single-image method is proposed that estimates the high-resolution embedding from training image pairs and enforces local compatibility and smoothness constraints between the patches of the target high-resolution image through overlapping. It is noteworthy that super-resolution methods fall into single image-based and multiple image-based categories, with examples such as TLcR-RL [15], a context-patch-based face hallucination approach that fully exploits the contextual information of image patches to obtain a high-resolution face image. There are essential differences between the two categories. In this paper, we discuss only single image-based super-resolution.
Deep learning methods show great potential for face super-resolution. Dong et al. [16] showed that a CNN can effectively learn an end-to-end mapping from low-resolution images to high-resolution images and proposed the super-resolution convolutional neural network (SRCNN). In 2016, Dong et al. [17] proposed FSRCNN, a compact hourglass-shaped CNN that reconstructs the HR image by introducing a deconvolution layer; it improves on SRCNN [16] to achieve faster and better image super-resolution. Liang et al. [18] proposed a framework consisting of a deep convolutional neural network with image gradient priors, named SRCNN-Pr, which exploits various image priors during the training phase of a deep CNN. Mao et al. [19] introduced an approach based on a deep convolutional neural network (DSRCNN), which combines an adversarial autoencoder with two depixelate layers to reconstruct the pixelated image and map it to a higher resolution. Lim et al. [20] proposed a multiscale deep super-resolution method that removes unnecessary modules from the traditional residual networks and reconstructs high-resolution images with different upscaling factors in a single model. SRCNN is a pioneering work of deep learning in super-resolution reconstruction, and the various improvements built on it achieve better performance, directly or indirectly, by enlarging the network structure. This type of deep learning method has been successfully applied to face super-resolution reconstruction and achieves good results [3].
However, these methods are limited in the following aspects:
(1) Training the reconstruction network relies on a large number of face images, but collecting a large number of face images from real-world scenes is time-consuming.
(2) The mean-square loss function forces the prediction to match the label exactly, i.e., “either or ,” which makes the model overconfident.
(3) The network depends heavily on GPU devices, and inference is too slow on ordinary CPU devices.
We take the following strategies to address these three issues, respectively:
(1) According to the parallel image theory [21, 22], we propose to learn from the large number of available caricature faces. In this paper, we use the WebCaricature [23] database as an example to design the model and conduct experiments.
(2) We introduce a center loss function for constructing the network, which classifies the unlabeled categories in the training set and makes the features discriminative.
(3) We apply the latest OpenVINO toolkit [24–26] provided by Intel to optimize the model on a relatively inexpensive 6th-generation CPU. This shows that our proposed method could be used on a general mobile device for real-time applications.
In summary, we propose a deep face super-resolution algorithm based on parallel image theory and OpenVINO, termed SRCNN-CL. It is a simple improvement on the lightweight SRCNN and yields better results than the original network, both qualitatively and quantitatively. In particular, we build a three-layer convolutional neural network to learn the mapping from low-resolution face images to high-resolution face images. During this process, we use the center loss function to overcome the inability to discriminate features, and we train the model based on parallel image theory. Finally, we use the OpenVINO toolkit to speed up the inference of the model and measure this acceleration quantitatively.
All in all, this paper makes three contributions:
(1) We propose a face super-resolution method based on parallel imaging theory with high robustness and efficiency.
(2) We propose an OpenVINO-based face super-resolution reconstruction method, which accelerates the processing speed of the model and meets the needs of real-time applications.
(3) We improve the conventional network structure by introducing a center loss function, which improves the ability of the model to distinguish features. Compared with existing methods on the benchmark databases, our method obtains better results both qualitatively and quantitatively.
2. OpenVINO Tools
In May 2018, Intel launched a toolkit called OpenVINO [27]. The toolkit enables quick deployment of applications and solutions that emulate human vision. For convolutional neural network (CNN)-based methods in particular, it extends computer vision (CV) workloads across Intel hardware to maximize performance. OpenVINO is mainly composed of the model optimizer and the inference engine. The model optimizer imports models trained in standard frameworks such as Caffe and TensorFlow, then transforms and optimizes them into a format that can be used by Intel tools, especially the inference engine [28]. The inference engine runs deep learning models. It contains a set of libraries that make it easy to integrate inference into applications.
Specifically, the model optimizer is a cross-platform command-line tool. It facilitates the transition between the training and deployment environments, analyzes static models, and adjusts deep learning models for optimal execution on endpoint target devices. After the model optimizer has created the intermediate representation, the inference engine is used to run inference on the input data. The inference engine is a C++ library with a set of C++ classes for inferring input data (images) and obtaining results [29]. The library provides an API to read intermediate representations, set input and output formats, and execute models on the device. Figure 1 illustrates a typical workflow for deploying a trained deep learning model.

The steps to optimize and deploy a trained model are summarized as follows:
(i) Configure the model optimizer for the framework.
(ii) Convert the trained model, based on its network topology, weights, and bias values, into an optimized intermediate representation (IR).
(iii) Test the model in IR format with the inference engine in the target environment, using the validation application or the sample applications.
(iv) Integrate the inference engine into the application to deploy the model in the target environment.
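The last two steps, loading an intermediate representation and running it on a CPU, can be sketched as follows. This is only an illustration under stated assumptions: it uses the Python API of recent OpenVINO releases (`import openvino as ov`), not the C++ inference engine API of the 2018 toolkit described above, and the function name `run_ir_inference` and the model path are hypothetical.

```python
def run_ir_inference(xml_path, input_array, device="CPU"):
    """Run one synchronous inference on an OpenVINO IR model (sketch).

    xml_path: hypothetical path to the IR .xml file written by the model
        optimizer (the matching .bin file is expected next to it).
    input_array: array whose shape matches the model input.
    """
    # Imported lazily so this module still loads where OpenVINO is not installed.
    import openvino as ov

    core = ov.Core()                       # entry point of the runtime
    model = core.read_model(xml_path)      # read the intermediate representation
    compiled = core.compile_model(model, device)  # prepare the model for the device
    results = compiled([input_array])      # synchronous inference call
    return results[compiled.output(0)]     # tensor of the first model output
```

A call such as `run_ir_inference("srcnn_cl.xml", low_res_batch)` would then return the reconstructed image tensor, where both the file name and `low_res_batch` are placeholders for a converted model and a preprocessed input.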
3. Proposed Method
Parallel images [21] are a branch of parallel vision [22] and can provide large-scale, diverse image data (both real and virtual images) for parallel vision research. First, virtual images are used to expand and supplement the real images, yielding parallel image “big data” that combines the virtual and the real. Then, various visual models are learned and evaluated through computational experiments. Finally, the visual models are optimized online by parallel execution to realize intelligent perception and understanding of complex environments [22, 30].
3.1. Parallel Imaging for Face Super-Resolution
Inspired by the parallel image theory [22], we address face super-resolution by using caricature face images from the Internet to expand and supplement the corresponding real-world face images, which yields parallel face images that combine virtual and real information. We then learn and evaluate the reconstruction model through computational experiments and, finally, optimize the visual model online with the help of parallel execution to realize intelligent perception and understanding of complex scenes. Figure 2 shows the parallel vision framework for face super-resolution.

We use web crawlers to download the corresponding face caricature images from the Internet, use the face super-resolution model to carry out the computational experiments and parallel execution of parallel vision, and learn to extract the “little knowledge” that applies to face super-resolution. During the computational experiments and parallel execution, we introduce the OpenVINO framework to accelerate and optimize the face super-resolution model. Figure 3 shows the technical flow of parallel imaging.

3.2. Network Architecture
To obtain high-resolution images that better meet the requirements, we apply bicubic interpolation to obtain the low-resolution images before training the network and perform the corresponding preprocessing in the testing phase.
In our network, the first convolutional layer performs feature extraction. Following the SRCNN formulation [16], it can be expressed as
$$F_1(Y) = \max(0, W_1 * Y + B_1),$$
where $Y$ is the input low-resolution image, $B_1$ is the bias, $W_1$ is the filter with size $c \times f_1 \times f_1 \times n_1$, $c$ is the number of channels of the input image, $f_1$ is the spatial size of the filter, $n_1$ is the number of filters, $*$ is the convolution operation, and the parameter is uniformly distributed.
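In the first layer, we extract features from each image block through a convolution operation. We then map the features extracted by the first layer, denoted $F_1(Y)$ for an input image $Y$, to the second layer. Following the SRCNN formulation [16], the second layer can be expressed as
$$F_2(Y) = \max(0, W_2 * F_1(Y) + B_2),$$
where $B_2$ is the bias, $W_2$ is the filter with size $n_1 \times f_2 \times f_2 \times n_2$, $n_1$ is the number of filters in the first layer, $f_2$ is the spatial size of the filter, and $n_2$ is the number of filters.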
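The third layer of our network produces the final high-resolution image. Following the SRCNN formulation [16], with $F_2(Y)$ denoting the output of the second layer for an input image $Y$, it can be expressed as
$$F(Y) = W_3 * F_2(Y) + B_3,$$
where $B_3$ is the bias, $W_3$ is the filter with size $n_2 \times f_3 \times f_3 \times c$, $n_2$ is the number of filters in the second layer, $f_3$ is the spatial size of the filter, and $c$ is the number of channels of the output image.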
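The three layers described above can be traced with a small forward pass written in plain Python. This is a toy sketch rather than the trained model: the ReLU activation follows the SRCNN formulation, and the filter sizes (3, 1, 3), the channel counts, the Gaussian initialization, and the 8 x 8 input are all arbitrary values chosen for illustration.

```python
import random


def conv2d(feature_maps, filters, biases):
    """Valid convolution (cross-correlation, as in most CNN libraries).

    feature_maps: list of C input maps, each H x W (lists of rows).
    filters: list of N filters, each a list of C kernels of size f x f.
    biases: list of N floats.
    Returns N output maps of size (H - f + 1) x (W - f + 1).
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    f = len(filters[0][0])
    out = []
    for kernel_set, b in zip(filters, biases):
        fmap = []
        for i in range(h - f + 1):
            row = []
            for j in range(w - f + 1):
                s = b
                for c, kernel in enumerate(kernel_set):
                    for u in range(f):
                        for v in range(f):
                            s += kernel[u][v] * feature_maps[c][i + u][j + v]
                row.append(s)
            fmap.append(row)
        out.append(fmap)
    return out


def relu(feature_maps):
    """Elementwise max(0, x), assumed here as in SRCNN."""
    return [[[max(0.0, v) for v in row] for row in fmap] for fmap in feature_maps]


def random_filters(n_out, n_in, f, rng):
    """Gaussian initialization; the standard deviation 0.01 is an arbitrary toy value."""
    return [[[[rng.gauss(0.0, 0.01) for _ in range(f)] for _ in range(f)]
             for _ in range(n_in)] for _ in range(n_out)]


def forward(image, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(conv2d([image], w1, b1))  # layer 1: feature extraction
    f2 = relu(conv2d(f1, w2, b2))       # layer 2: mapping between feature spaces
    return conv2d(f2, w3, b3)[0]        # layer 3: reconstruction, no activation


rng = random.Random(0)
c, n1, n2 = 1, 4, 2  # toy channel counts, not the values used in the paper
params = [
    (random_filters(n1, c, 3, rng), [0.0] * n1),
    (random_filters(n2, n1, 1, rng), [0.0] * n2),
    (random_filters(c, n2, 3, rng), [0.0] * c),
]
image = [[rng.random() for _ in range(8)] for _ in range(8)]
output = forward(image, params)
print(len(output), len(output[0]))
```

Because the sketch uses unpadded convolutions, each layer with filter size f shrinks the image by f - 1 pixels per dimension, so the 8 x 8 input yields a 4 x 4 output here.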
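To learn the end-to-end mapping function $F$, the traditional method estimates the network parameters $\Theta$ by minimizing the mean square error (MSE) between the reconstructed images $F(Y_i; \Theta)$ and the corresponding ground-truth high-resolution images $X_i$:
$$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \left\| F(Y_i; \Theta) - X_i \right\|^2,$$
where $n$ is the number of training samples.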
According to Wen et al. [31], when the network is used as a feature extractor for recognition and the training set contains the categories of the test samples, the performance of the model degrades less. For face super-resolution, it is impossible to collect all face images, so some faces are naturally absent from the training set. Thus, the features extracted by the network need to be not only separable but also highly discriminative, and using the MSE loss function alone is inappropriate. Therefore, we introduce the center loss function to construct the network: where is the number of categories.
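The two loss terms discussed above can be illustrated numerically. The sketch below computes the MSE loss as the mean squared distance between reconstructed and ground-truth images, and the center loss in the form given by Wen et al., half the sum of squared distances between each feature vector and the center of its class. How the two terms are combined and weighted in the proposed network is not shown here, and the function names and toy data are assumptions.

```python
def mse_loss(predictions, targets):
    """Mean over samples of the squared L2 distance between flattened images."""
    n = len(predictions)
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / n


def center_loss(features, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2, the form given by Wen et al.

    features: list of feature vectors x_i.
    labels: class label y_i of each feature vector.
    centers: mapping from class label to its center vector.
    """
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


# Toy data: two flattened 2-pixel images and two 2-D feature vectors.
print(mse_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [1.0, 1.0]]))
print(center_loss([[1.0, 1.0], [3.0, 0.0]], [0, 1],
                  {0: [0.0, 0.0], 1: [3.0, 2.0]}))
```

The center loss shrinks only when features of the same class gather around a common center, which is what makes the learned features discriminative rather than merely separable.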
In general, we have built a three-layer convolutional network as shown in Figure 4.

4. Experimental Results
4.1. Data Preparation and Parameter Setting
Our work uses labeled faces in the wild (LFW) database [32] and WebCaricature [23] database collected through the network as training set and testing set. We select images of LFW database and images of WebCaricature database randomly and take a total of images of them as the training set. We randomly select images in the face recognition technology (FERET) [33] database as the test set. An example of the training sample set is shown in Figure 5.

In our experiment, we choose the simplest three-layer convolution neural network (--). We set to 64, to 32, and to 0.9. The weight of each layer in the network is initialized to a Gaussian distribution with a mean value of and a standard deviation of . The deviation of each layer is initialized to , and the learning rate is always fixed to during network training. The convergence diagram of network training is shown in Figure 6. It shows that the model has converged. In order to solve the problem of different magnification factors, we respectively downsample the data in the training set with magnification factors of 2, 3, and 4, and finally, we get a training set containing face images.

4.2. Quantitative Evaluation Protocol
To evaluate the performance of the super-resolution algorithm, we use the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). To evaluate the test speed of the model, we measure the acceleration effect of OpenVINO by the parallel speedup ratio.
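The peak signal-to-noise ratio is the ratio between the peak signal power of an image and the power of the reconstruction noise. The higher the value, the better the reconstruction. For an original image $I$ and a reconstructed image $K$ of size $m \times n$, it is calculated as
$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ I(i,j) - K(i,j) \right]^2,$$
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right),$$
where $\mathrm{MAX}_I$ indicates the maximum value of the image color; for a $B$-bit image, this value is $2^B - 1$.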
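As a worked example of the PSNR definition, the following sketch computes the mean squared error between two small grayscale images and converts it to decibels. The function names and the default of 8 bits per pixel are assumptions made for illustration.

```python
import math


def mean_squared_error(original, reconstructed):
    """Average squared pixel difference between two images of equal size."""
    m, n = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(original, reconstructed, bits=8):
    """PSNR in dB; the peak value of a B-bit image is 2**B - 1."""
    max_value = 2 ** bits - 1
    err = mean_squared_error(original, reconstructed)
    if err == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [[0, 255], [255, 0]]
degraded = [[0, 255], [255, 255]]  # one of four pixels is completely wrong
print(psnr(reference, degraded))
```

In this example the squared error equals the squared peak value divided by four, so the result is 10 log10(4), roughly 6 dB, far below what a usable reconstruction would reach.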
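Image structure similarity is an index that measures the similarity of two images $x$ and $y$. Its value does not exceed $1$, and the closer the value of SSIM is to $1$, the more similar the structure of the two images. It is calculated as
$$\mathrm{SSIM}(x, y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1} \cdot \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \cdot \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3},$$
where $\mu_x$ is the average of $x$, $\mu_y$ is the average of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$, $c_2$, and $c_3$ are constants.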
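A simplified illustration of SSIM is given below. It computes the luminance, contrast, and structure terms from statistics of the whole image, whereas practical implementations usually average the index over local windows. The constants follow the common convention of c1 = (0.01 L)^2, c2 = (0.03 L)^2, and c3 = c2 / 2 for dynamic range L; these defaults and the function name are assumptions, since the constants used in the experiments are not stated.

```python
import math


def global_ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (no sliding window)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


a = [[10.0, 200.0], [200.0, 10.0]]
inverted = [[200.0, 10.0], [10.0, 200.0]]
print(global_ssim(a, a))         # identical images give a value of 1 (up to rounding)
print(global_ssim(a, inverted))  # opposite structure gives a much lower value
```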
For the execution of the model, we hope that tasks are evenly distributed among the cores of the central processing unit (CPU) without introducing additional workload for each core. If this goal is achieved, then on a $p$-core system where each core runs one process or thread, the parallel program runs $p$ times as fast as the serial program. In practice, multiple processes and threads always introduce some costs. For example, shared memory programs usually have critical sections and need a mutual exclusion mechanism such as a mutex. Calling a mutex is a cost that parallel programs incur but serial programs do not, because it forces the parallel program to execute the critical section code serially. Distributed memory programs usually need to transfer data across the network, which is slower than accessing data in local memory. Serial programs have none of these additional overheads, so it is very difficult to find a parallel program with an ideal speedup ratio. In addition, the overhead grows with the number of processes: more threads means more threads need to access the critical section, and more processes means more data needs to be transferred across the network. Therefore, we simply define the speedup of a parallel program as
$$S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}},$$
where $T_{\mathrm{serial}}$ and $T_{\mathrm{parallel}}$ are the run times of the serial program and the parallel program, respectively.
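The speedup definition amounts to a single division, shown here with a helper that also expresses the ratio as a percentage. The timings in the example are hypothetical placeholders, not measurements from our experiments.

```python
def speedup(serial_time, parallel_time):
    """Ratio of the serial run time to the parallel (accelerated) run time."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


def speedup_percent(serial_time, parallel_time):
    """The same ratio expressed as a percentage."""
    return 100.0 * speedup(serial_time, parallel_time)


# Hypothetical timings in seconds per image, before and after acceleration.
print(speedup(7.0, 1.0))
print(speedup_percent(7.0, 1.0))
```

With these placeholder values the accelerated program is 7 times as fast, which corresponds to a speedup ratio of 700% in percentage terms.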
4.3. Analysis of Experimental Results
After reconstructing low-resolution face images obtained by interpolation, it shows a better visual result in terms of detailed reconstruction with our method. Since the human eyes, nose, and mouth are the regions with large signal energy in the face image [42], they are also the most important representations of the human face. Therefore, in this paper, we have chosen to illustrate the face reconstruction image with a magnification factor of , as shown in Figure 7.

Figure 7: panels (a) to (r).
We compared our method with fifteen face image super-resolution methods, namely LcR [13], SSR [41], LLT [12], Wang’s [40], NE [14], SR [10], WSR [34], CDA [39], bicubic [35], MSR [11], Yang’s [10], Glasner’s [36], DPSR [37], SRCNN [16], and SRAHF [38]. The results are shown in Table 1.
According to Table 1, our method achieves the best or second-best SSIM and PSNR among the compared algorithms. In particular, compared with the original SRCNN, the PSNR and SSIM are improved by approximately 1.3 dB and 0.02, respectively. This indicates that the proposed algorithm is highly robust and that introducing the parallel image theory improves the robustness of the model.
We also compared the test speed of SRCNN and of our method before and after using OpenVINO. The results are shown in Table 2. After using OpenVINO, the parallel speedup ratios of SRCNN and our method reach 700% and 760%, respectively, which shows that OpenVINO accelerates the test speed of our model.
5. Conclusion
This paper proposes an improved three-layer end-to-end convolutional neural network that reconstructs low-resolution face images to obtain the corresponding high-resolution images. We introduce the parallel image theory to improve the robustness of the model and use OpenVINO to accelerate its test speed. In addition, we train jointly with multiple magnification factors to reduce the number of network models required. The experimental results show that the proposed method can handle the super-resolution reconstruction of face images with different magnification factors and achieves better results than traditional reconstruction methods.
Our experiments also show that the parallel image theory can effectively improve robustness within the framework of an existing model. In the parallel execution stage of the parallel image, OpenVINO can be used to accelerate the execution of the model, which makes the parallel vision model faster and better suited to real-time applications. Therefore, our next research direction is to apply OpenVINO and the parallel image theory to other computer vision models.
Data Availability
The training data used to support the findings of this study were collected from the LFW, WebCaricature, and FERET public databases.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Authors’ Contributions
Zhijie Huang and Wenbo Zheng contributed equally to this study.
Acknowledgments
This work was supported in part by the National Key R&D Program of China (2020YFB1600400), in part by the National Natural Science Foundation of China (61806198), and in part by the Key Research and Development Program of Guangzhou (202007050002).