Abstract
Recent works based on deep learning and facial priors have performed well in superresolving severely degraded facial images. However, owing to limited illumination, the pixel count of the monitoring probe, the focusing area, and human motion, surveillance face images are usually blurred or even deformed. To address this problem, we propose Face Restoration Generative Adversarial Networks to improve the resolution and restore the details of blurred faces. The framework comprises the Head Pose Estimation Network, the Postural Transformer Network, and the Face Generative Adversarial Networks. In this paper, we employ the following: (i) a Swish-B activation function in the Face Generative Adversarial Networks to accelerate the convergence of the cross-entropy cost function, (ii) a special prejudgment monitor that improves the accuracy of the discriminator, and (iii) a modified Postural Transformer Network, combined with a 3D face reconstruction network, to correct faces at different expression and pose angles. Our method improves the resolution of face images and performs well in image restoration. We demonstrate that our method produces high-quality faces and is superior to state-of-the-art methods on the blind face reconstruction task for in-the-wild images; in particular, our 8× SR SSIM and PSNR are 0.078 and 1.16 higher, respectively, than FSRNet on AFLW.
1. Introduction
Image generation has attracted broad attention in recent years. Among these works [1–3], synthesizing a face from different angles while retaining identity is an important task because of its wide range of industrial applications, such as video monitoring and face analysis. Recently, this task has been greatly advanced by a number of Generative Adversarial Network models. To tackle face reconstruction, existing approaches [4–6] typically apply predefined parameterized 3D models or Convolutional Neural Networks (CNNs) to represent the face. Despite exhibiting promising ability in describing faces [7, 8], the positioning of different head poses shows obvious deviation. In addition, these methods cannot describe complex expressions and facial postures, and the complex parametric fitting requires large amounts of precise data and detailed descriptions. Generative adversarial networks have recently demonstrated excellence in image editing [9–11] and show great potential in producing realistic images [12]; many modified generative adversarial network models are used to generate real face images. There are plenty of such models [13, 14]; CycleGAN, for example, is a well-known data-driven GAN used for face reconstruction.
Additionally, in [15], the authors found that a Graph CNN and a GAN can effectively reconstruct high-quality face shape and texture, respectively, by learning a completely nonlinear model. A Graph CNN convolves directly over non-Euclidean structures such as graphs, which both captures important information from edges and reduces computational complexity. Because of these characteristics, it has recently been widely applied to mesh datasets [16, 17] and 3D face datasets [18, 19]. Meanwhile, as a texture decoder, GANs have recently shown a strong ability to synthesize high-fidelity textures and structural features even where they are absent from the input. Nevertheless, some problems remain before the results are fully satisfactory.
Our Face Restoration Generative Adversarial Networks (FRGAN) consist of the Head Pose Estimation Network (HPENet), the Postural Transformer Network (PTNet), and the Face Generative Adversarial Networks (FaceGAN). The limited expression information of a 2D face leads to distortion of the 3D face pose during generation, which causes blurred pixels and poor coherence at the junction regions. Our reconstruction framework is based on a combination of a 3DMM model and a graph CNN, which works in a coarse-to-fine manner. A CNN is used to fit the parameters of the 3DMM model; with those parameters, the affine model computes the face shape and an initial rough texture. In the refinement stage, we use a pretrained CNN to extract face features and feed them into the graph CNN, which generates offset coordinate values for the mesh vertices. Our framework adopts a differentiable rendering layer [20] for self-supervised training. HPENet performs head pose estimation on the guided face image B and stores the face position features and the pose direction vector. We input the low-quality face image A and the guided face image B, fit the 3D face information through PTNet, and convert the face posture with a mapping function, so as to generate face B′ with the same posture as face A. We then input the original image A and the intermediate image B′ into FaceGAN, and the generator (G) fuses the feature information of images B′ and A. After fusion, the new face image A′ and the original image A are evaluated through feedback adjustment by the prejudgment monitor (P) and the discriminator (D) until the target threshold is reached. Figure 1 depicts the process of low-quality face reconstruction.

We treat P as a third party participating in the game. P not only learns identity characteristics but also assigns identity labels to the G and D domains. Intuitively, P joins D in opposing G. In effect, P and D distinguish face identity and image quality, respectively, while G tries to improve the quality of the generated image so as to reduce the classification accuracy of P and D. When P and D can no longer distinguish between the two domains, training converges, which suggests that G is capable of producing high-quality face images that retain identity information. We follow the principle of information symmetry to reduce the difficulty of training: the features of the true and false domains are derived from G and P, respectively. We determine the locations of the face key points through HPENet and output the face features to P, so the network is pushed to preserve identity. If face features are extracted under different postures, the cosine distance between the two feature vectors is very large, and it becomes difficult to train the two modules because they represent different feature spaces. We therefore add PTNet, which reduces the training difficulty by constructing face images with the same pose.
In this paper, we propose a method to inpaint a low-quality face with a guide face, which considers the distribution of the input face features and uses FRGAN to generate high-quality shapes and textures. In the process, we introduce a novel method that utilizes the Swish-X loss and a prejudgment monitor to optimize the adversarial game, which further helps the model achieve high-fidelity and clear face reconstruction. We summarize the contributions of this paper as follows:
(i) We use HPENet to fit the key points and structure information of the 2D face and use PTNet to warp the face for pose alignment.
(ii) We add the prejudgment monitor (P) to FaceGAN. P stores face features and key-point information obtained from HPENet. In FaceGAN, the generator faces the double test of P and D; P and D jointly confront G, driving G to generate faces of high quality that retain identity information.
(iii) We add the Swish-X loss function to FaceGAN, which effectively improves the convergence speed of the cross-entropy cost function.
2. Related Work
The core of traditional image processing technology is to perform mathematical transformations between pixels based on the information already present; different methods differ only in the transformation used. When image information is missing, that is, when pixel information is insufficient, traditional methods cannot create content out of nothing and restore the missing image plausibly.
2.1. Face Alignment
Over the past few decades, many classical facial landmark detection methods have been proposed in the literature. Parameterized appearance models are represented by Active Appearance Models (AAMs) [21], Constrained Local Models (CLMs) [22], and Cascaded Regression [23]. They complete the task by maximizing the confidence of part positions in the image. Specifically, AAMs and their follow-ups [24–26] attempt to jointly model global holistic appearance and shape, while CLMs and their variants [27, 28] instead learn a group of local experts by imposing various shape constraints. In the Cascaded Regression framework, the main operations are vector additions, which are efficient and of low computational complexity. Recently, deep-learning-based strategies have achieved state-of-the-art performance on this task.
In the following, we briefly introduce representative works of this category. The study in [29] incorporates the estimation of face posture and introduces boundary information into key-point regression; the network consists of three parts: a boundary heatmap estimator, a boundary-aware landmark regressor, and a boundary effectiveness discriminator. The study in [30] estimates the 3D face shape with a CNN trained on face images and fits the face shape to the corresponding 3D face model, which allows detecting facial features and matching the face contour; in addition, it solves the problem that databases with different numbers of feature points cannot be cross-validated. The study in [31] proposed a powerful method to achieve 3D face reconstruction and dense face alignment simultaneously. It designed a UV position map, a 2D representation that records the 3D shape of the face in UV space, and can reconstruct the complete facial structure without any prior face model.
2.2. 3D Face Reconstruction
The Candide model and 3D Morphable Models (3DMM) are two commonly used models in 3D face reconstruction. CANDIDE-3 [32] consists of 113 vertices and 168 triangular surfaces. Through global and local adjustments, face alignment and details are refined, after which vertex interpolation can create a reconstructed face. The advantages and disadvantages of this model are obvious: because there are few vertices in the template, reconstruction is fast, but the reconstruction accuracy is seriously insufficient and facial details are reconstructed poorly. Traditional 3DMM reconstruction is an iterative fitting process, which is relatively inefficient and not conducive to building a real-time 3D face.
The development of deep learning has stimulated end-to-end 3D face reconstruction methods based on CNNs. The work in [33] regresses identity and residual parameters with a CNN. Its formulation is similar to that of 3DMM except that, in addition to the ordinary reconstruction loss (generally an elementwise L2 loss), an identity loss is added to guarantee that the ID features of the reconstructed face remain unchanged. The work in [34] also regresses 3DMM parameters; it considers that high-level semantic features can represent ID information while mid-level semantic features represent facial details, so the corresponding parameters can be regressed from different levels to accomplish the 3D face reconstruction task. Other commonly used 3D face reconstruction methods are shown in [35–37], which also propose end-to-end approaches.
The work in [38] trains a complex two-dimensional facial landmark network on single images, with an additional network to estimate depth. The work in [39] trains a volumetric 3D face representation regressed from a 2D image with a network; however, the expression of face key points is not accurate enough, and the whole volume must be regressed to recover the face shape.
2.3. Face Applications with GAN
In a traditional GAN, G takes Xr as input and outputs the synthesized image Xs, while D takes either real or generated images as input. During training, D and G fight against each other: D maximizes its classification accuracy, while G tries to synthesize high-quality images that reduce D's classification accuracy. When D can no longer distinguish whether an image generated by G comes from the real samples, training converges, which also indicates that the image quality of the true and false domains is very close.
The work in [40] describes an approach to cross-domain image transformation with GANs; it can turn human faces into emoji or animated expressions. The study in [41] introduces a method that uses a GAN to generate a frontal portrait (i.e., a face looking forward) from a face at a specific angle; this kind of technology can be applied to face verification or recognition systems. The study in [42] introduces a method that uses a GAN to generate face pictures of different age groups. In particular, [43] demonstrates the application of GANs to constructing different versions of face images. The study in [44] also introduces how to use a GAN to inpaint and reconstruct damaged face images. The study in [45] shows a case of generating highly realistic face photos, which attracted wide attention from the media. GANs have been widely used in face processing, but many problems remain: both the handling of face edge information and the speed of face generation need to be improved.
In this paper, we propose FRGAN, which considers both the guide face and the low-quality face features and generates more accurate and clearer shapes and textures. We also adopt Swish-X to improve the speed of face reconstruction.
3. Materials and Methods
Our novel FRGAN approach consists of two major enhancements: (i) using the Swish-X loss function to greatly accelerate the convergence rate of the model and (ii) adding the P module to increase the adversarial intensity and improve face image quality.
3.1. Head Pose Estimation Network
In the Head Pose Estimation Network (HPENet), we modify the loss by adding spatial constraints on the key points, the three posture angles, and a data-balance term. The backbone uses MobileNet, together with an auxiliary network that makes the point position prediction more stable and robust.
The loss function should reflect not only the pointwise difference but also the true geometric distance; if the object itself is three-dimensional and is only represented in 2D, such a distance is not accurate, so the loss caused by the facial pose angle must also be taken into account. The network can learn information about the 3D pose, most simply by predicting the three Euler angles and combining the predicted angle loss with the point position loss, for example by adding or multiplying them. The additive approach is similar to a multitask loss, while the multiplicative approach can be understood as a kind of weighting, since the angle loss is usually normalized to between 0 and 1. In practice, the angle distribution suffers from sample imbalance, so a sample-equilibrium term is added. The loss function with the sample-equilibrium term can be defined as follows:

$$\mathcal{L}=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\omega_{n}\left(\sum_{k=1}^{K}\bigl(1-\cos\theta_{k}^{m}\bigr)\right)\left\|d_{n}^{m}\right\| \qquad (1)$$
In (1), $\omega_{n}$ is the adjustable weight function, and different weights are selected according to different situations, such as the normal, occlusion, and dark-light situations. $\theta_{k}^{m}$ denotes the three-dimensional Euler angles of the face pose (K = 3), $d_{n}^{m}$ is the deviation between the regressed landmarks and the ground truth, N is the number of face key points, and M is the number of samples. The loss function gives a small weight, during gradient backpropagation, to data that contribute little to model training, such as frontal faces with large sample sizes; data with relatively small sample sizes, such as profile faces, raised or lowered heads, and extreme expressions, are given a large weight and thus make a greater contribution to model training. In this way, the loss function elegantly balances the unbalanced training samples collected under various conditions.
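The loss in (1) is straightforward to implement. The sketch below is a minimal PyTorch version based on the symbol definitions above; the tensor shapes and the exact way the per-sample weight enters are our assumptions, not taken from the paper.

```python
import torch

def hpenet_loss(pred_pts, gt_pts, pred_angles, gt_angles, sample_weights):
    """Balanced landmark loss of the form in (1) (a minimal sketch, not the authors' code).

    pred_pts, gt_pts:        (M, N, 2) regressed / ground-truth landmarks.
    pred_angles, gt_angles:  (M, 3) Euler angles (yaw, pitch, roll) in radians.
    sample_weights:          (M,) per-sample weight, larger for rare cases such as
                             profile faces, occlusion, or extreme expressions.
    """
    # Landmark term: distance between regressed points and ground truth.
    dist = torch.norm(pred_pts - gt_pts, dim=-1)                               # (M, N)
    # Angle term: sum over the K = 3 Euler angles of (1 - cos(angle error)),
    # so samples with larger pose error contribute more to the loss.
    angle_term = torch.sum(1.0 - torch.cos(pred_angles - gt_angles), dim=-1)   # (M,)
    # Per-sample weight x angle term x summed landmark distance, averaged over the batch.
    per_sample = sample_weights * angle_term * dist.sum(dim=-1)                # (M,)
    return per_sample.mean()
```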
Choosing a proper feature extraction backbone matters: HPENet selects MobileNet-V1 as the backbone network. We directly regress the key-point positions of the face, producing 2N outputs, where N is the number of key points. With the face detector model mentioned above, we need to train with enough data and add some tricks to obtain good results. In practical applications, however, extreme cases such as occlusion (hands and glasses), illumination (strong or weak light), extreme posture (large yaw, pitch, and roll), and extreme facial expressions may cause the face posture conversion to produce dislocated or blurred images.
To solve the above problems from an engineering perspective, we could adopt a backbone with stronger feature description ability (such as VGG16 or ResNet), increase the training data for extreme situations, balance the proportion of training data across situations, and control the sampling of the data. In view of the above situation, we instead put forward a solution from the perspective of algorithm design.
In the model design, the backbone network of HPENet does not adopt large models such as VGG16 or ResNet. To increase the expressive ability of the model, the output features of MobileNet are modified. Figure 2 shows the method of face feature extraction and classification.

We increase the expressiveness of the model by integrating features at three different scales. The backbone of the network structure is MobileNet-V1, which can still achieve good performance on embedded devices.
In an initial design, we used a simple network with an MSE loss function. To balance the training data across situations, we could only tune performance by increasing the training data for extreme situations, balancing the proportion of training data across situations, and controlling the data sampling with a non-fully-random sampling scheme.
To train HPENet, we introduce a subnetwork to supervise the training of the network model. The subnetwork only works in the training phase: it estimates the 3D Euler angles of the input face sample, and its ground truth is estimated from the key-point information in the training data. The purpose of this network is to supervise and assist training convergence, mainly serving the key-point detection network. The input of the subnetwork is not the training data but an intermediate output of HPENet. Figure 3 describes the principle of face key-point positioning and pose feature estimation.
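The text does not give the exact structure of this auxiliary branch; the sketch below is one plausible PyTorch realization, assuming it consumes an intermediate HPENet feature map and regresses the three Euler angles with an MSE loss used only during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryPoseHead(nn.Module):
    """Hypothetical auxiliary branch: regresses yaw, pitch, and roll from an
    intermediate HPENet feature map; used only in the training phase."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 3),  # the three Euler angles
        )

    def forward(self, feat):
        return self.net(feat)

# Training-time supervision, combined with the landmark loss:
#   angles = AuxiliaryPoseHead()(intermediate_feat)
#   angle_loss = F.mse_loss(angles, gt_angles)
```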

3.2. Postural Transformer Network
In this work, we use PTNet to fit a 3DMM that represents the facial features f_p, including posture, expression, and general shape. The face is warped according to the pose estimated from the face key points.
Nonetheless, even though the Postural Transformer Network (PTNet) can be trained end-to-end with reconstruction and adversarial learning, we have empirically found that it cannot converge to the desired solution and fails to align the normalized pose and expression of the guided image. Its improvement over U-Net is still limited, especially when the degraded observation and the guided image differ significantly in posture. Besides, the target and guided images are captured under different lighting conditions, so it is not feasible to directly use the target image to guide PTNet learning. We therefore use a face alignment method to detect the landmarks of the target and guided images and introduce a landmark loss and total variation (TV) regularization to train PTNet.
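The two auxiliary losses named above admit simple implementations. The sketch below assumes PTNet predicts a dense flow field used to warp the guided image (an assumption on our part): a landmark loss between warped-guide and target landmarks, and a TV term on the flow.

```python
import torch

def landmark_loss(warped_guide_pts, target_pts):
    """Mean squared distance between the landmarks of the warped guide image
    and the landmarks of the target image, both of shape (B, N, 2)."""
    return torch.mean(torch.sum((warped_guide_pts - target_pts) ** 2, dim=-1))

def tv_regularization(flow):
    """Total variation of the predicted flow field (B, 2, H, W): penalizes
    abrupt spatial changes so that the deformation stays smooth."""
    dh = torch.abs(flow[:, :, 1:, :] - flow[:, :, :-1, :]).mean()
    dw = torch.abs(flow[:, :, :, 1:] - flow[:, :, :, :-1]).mean()
    return dh + dw
```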
Both solvePnP and solvePnPRansac in OpenCV can estimate the face pose. Determining the affine transformation from the 3D model to the face in the image is equivalent to determining the pose, which contains rotation and translation information. The solvePnP output consists of a rotation vector and a translation vector. We only care about the rotation information, so we mainly work on the rotation vector. Figure 4 shows the neural network diagram of face pose conversion.

Given the positions of the model points in the world coordinate system, their pixel coordinates, and the camera parameters, we can solve for the rotation and translation, but the relationship is clearly nonlinear. OpenCV already provides the function solvePnP() for solving PnP problems, which makes this step simple. The rotation vector is then converted to a rotation matrix, from which the Euler angles can be obtained.
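A typical OpenCV head-pose recipe looks like the sketch below; the 3D model points, 2D landmark coordinates, and pinhole camera parameters are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

# Hypothetical 3D reference points of a generic face model and their detected 2D landmarks.
model_points = np.array([
    [0.0, 0.0, 0.0],          # nose tip
    [0.0, -330.0, -65.0],     # chin
    [-225.0, 170.0, -135.0],  # left eye left corner
    [225.0, 170.0, -135.0],   # right eye right corner
    [-150.0, -150.0, -125.0], # left mouth corner
    [150.0, -150.0, -125.0],  # right mouth corner
], dtype=np.float64)
image_points = np.array([
    [359, 391], [399, 561], [337, 297],
    [513, 301], [345, 465], [453, 469],
], dtype=np.float64)

# Simple pinhole camera: focal length ~ image width, principal point at the center.
w, h = 640, 480
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype=np.float64)
dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                              camera_matrix, dist_coeffs,
                              flags=cv2.SOLVEPNP_ITERATIVE)

# Rotation vector -> rotation matrix -> Euler angles (degrees).
R, _ = cv2.Rodrigues(rvec)
sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
yaw = np.degrees(np.arctan2(-R[2, 0], sy))
roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
```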
The accuracy of a generic 3DMM is limited: a fixed 3D face model is built into the algorithm, and the spatial positions of its key points are mapped to the spatial positions of the real face. Fitting person-specific 3D face models makes the spatial positions of the key points more accurate and improves the accuracy of head pose estimation.
After two stride-2 downsamplings, the residual blocks are connected directly, and then the original size is restored by two upsamplings. Because the resolution drops only slightly, enough spatial location information is kept, which is very helpful for improving image quality, and the network is easier to train. We use 12 residual blocks, take 256 × 256 images as input, and directly use a pixelwise L1 reconstruction loss. We added 3 more residual blocks to correspondingly increase the receptive field of the network; the results indicate that the added residual blocks help produce higher-quality images.
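A minimal PyTorch sketch of such a generator is shown below, assuming 64 base channels and instance normalization inside the residual blocks (choices not specified in the paper); the upsampling is realized as nearest-neighbour upsampling followed by convolution, matching the UpSampling2D + Conv2D choice discussed next.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """Sketch: 2x downsampling -> 12 residual blocks -> 2x upsampling (256x256 in/out)."""
    def __init__(self, in_ch=3, base=64, n_blocks=12):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),     # 256 -> 128
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True), # 128 -> 64
        )
        self.blocks = nn.Sequential(*[ResidualBlock(base * 4) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(base * 4, base * 2, 3, padding=1), nn.ReLU(inplace=True),           # 64 -> 128
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True),               # 128 -> 256
            nn.Conv2d(base, in_ch, 7, padding=3), nn.Tanh(),
        )

    def forward(self, x):
        return self.up(self.blocks(self.down(x)))

# Pixelwise L1 reconstruction loss, as described above.
reconstruction_loss = nn.L1Loss()
```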
For upsampling, we found UpSampling2D followed by Conv2D to work better than transposed convolution, so we adopt UpSampling2D + Conv2D for all upsampling layers. We use standard preactivation to guarantee a wider output range. Our discriminator network architecture adopts the most general form, and the adversarial training uses WGAN-GP [46]. In practice, instance normalization contributed to the stabilization and convergence of the training.
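For reference, the WGAN-GP gradient penalty [46] used in this adversarial training can be written as follows (a standard sketch; the discriminator signature is assumed).

```python
import torch

def gradient_penalty(discriminator, real, fake, device):
    """WGAN-GP penalty: pushes the discriminator's gradient norm toward 1 on
    random interpolations between real and generated samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_out, inputs=interp,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True, retain_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```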
Adversarial training is necessary to realize face attribute conversion; classification constraints alone cannot produce satisfactory results. Without the adversarial term, the generated images are essentially adversarial examples that can fool the classifier while the human eye can hardly see any change in the image.
3.3. Face Generative Adversarial Networks
Face Generative Adversarial Networks (FaceGAN) are mainly composed of a generator (G), a discriminator (D), and a prejudgment monitor (P). We trained on VGGFace2 and CelebA and tested and compared face recognition performance on CASIA-WebFace. Experimental results indicate that the face recognition model trained by the same network on VGGFace2 achieves better results in 1 : 1 verification and 1 : N search. Figure 5 depicts the process of low-quality face reconstruction. Training stage: face A is encoded and then restored through decoder A; face B is encoded and then restored through decoder B. A and B share one encoder but use different decoders: the encoder encodes the features common to faces, and each decoder restores the individual features of its own face. Testing stage: after face B passes through the encoder, it is restored by the decoder of A, so the result is that B's face looks like A and the face transformation is realized. Encoder: facial features are extracted by 7 convolutions, and the common features are recovered by one deconvolution after fully connected layers. Decoder: the personality characteristics are recovered by five consecutive deconvolutions, features are then fused through residual networks, and the output is produced.
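The training/testing scheme of Figure 5 can be sketched as one shared encoder with two identity-specific decoders. The layer widths below are illustrative assumptions, and the residual-fusion step mentioned above is omitted for brevity.

```python
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.1, inplace=True))

def deconv_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class SharedEncoder(nn.Module):
    """Shared encoder: 7 convolutions, fully connected layers, one deconvolution,
    producing features common to both identities (256x256 input assumed)."""
    def __init__(self):
        super().__init__()
        chs = [3, 64, 128, 256, 512, 512, 512, 512]
        self.convs = nn.Sequential(*[conv_block(chs[i], chs[i + 1]) for i in range(7)])
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(512 * 2 * 2, 1024),
                                nn.Linear(1024, 512 * 4 * 4))
        self.deconv = deconv_block(512, 512)

    def forward(self, x):
        h = self.convs(x)                      # (B, 512, 2, 2)
        h = self.fc(h).view(-1, 512, 4, 4)
        return self.deconv(h)                  # common features, (B, 512, 8, 8)

class Decoder(nn.Module):
    """Identity-specific decoder (one per identity A / B): five deconvolutions."""
    def __init__(self):
        super().__init__()
        chs = [512, 256, 128, 64, 32, 16]
        self.deconvs = nn.Sequential(*[deconv_block(chs[i], chs[i + 1]) for i in range(5)])
        self.out = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, z):
        return self.out(self.deconvs(z))       # back to 256x256

# Training: A -> encoder -> decoder_A, B -> encoder -> decoder_B.
# Testing (face transformation): B -> encoder -> decoder_A.
encoder, decoder_a, decoder_b = SharedEncoder(), Decoder(), Decoder()
```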

When G generates faces of high quality while retaining the identity characteristics, the system reaches its resting point. Unlike a simple classifier used as an external discriminator, we satisfy information symmetry, ensuring that both real and synthesized samples are mapped into the same feature space. Furthermore, the identity classifier extracts the respective identity features from the input samples and the generated images, which substantially reduces the difficulty of GAN training. Our GAN follows the criterion of information symmetry. Finally, G is supervised directly by the features obtained from P and D rather than by some kind of label, which leads to better results.
Before training, all samples are subjected to face pose estimation and face alignment and resized to 256 × 256. The loss function is Swish-X. P uses a Residual Network (ResNet18) with three fully connected layers added at the end. G and D follow BEGAN: instead of directly estimating the difference between the generated distribution Pg and the real distribution Px, we estimate the difference between their reconstruction-error distributions; the two distributions are considered similar when the error distributions between them are similar. We use standard training methods that converge quickly and stably without training tricks. We conducted training on the VGGFace2 dataset. Figure 6 shows the result of image convergence and the effect of face inpainting.

To handle small posture differences between the guiding image and the original image, we adopt the Swish function of [47], defined as follows:

$$\mathrm{Swish}(x)=x\cdot\sigma(x),$$

where $\sigma(\cdot)$ denotes the sigmoid function.
The Swish-B function is defined as

$$\mathrm{Swish}\text{-}B(x)=x\cdot\sigma(\beta x),$$

where $\beta$ is a scaling parameter that may be fixed or trainable.
Swish and Swish-B are unsaturated, smooth, and nonmonotonic. A number of tests conducted by Google in [47] show that these activation functions perform excellently, outperforming the current best activation functions on different datasets. Furthermore, we apply the Swish-X loss function built upon these activations.
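Both activations from [47] are one-line functions; a minimal sketch is given below. It covers Swish and Swish-B only, since the exact form of the paper's Swish-X variant is not reproduced here.

```python
import torch

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * torch.sigmoid(x)

def swish_b(x, beta=1.0):
    """Swish-B: x * sigmoid(beta * x); beta can be fixed or learned."""
    return x * torch.sigmoid(beta * x)
```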
During the training stage, Z is uniformly distributed in the range [−1, 1]. The loss balance between G and D is initialized to 1 and decreases during training. The batch size is 100, trained on 8 GPUs. The initial learning rate of G and D is 0.0005, decreased by 0.0002 at 50k iterations; C starts at 0.0008 and decreases to 0.0002 after 150k iterations. P directly uses a Residual Network (ResNet18) without further training.
Monitoring during training is essential for a GAN: once a mode collapse is detected, training is stopped immediately, and generation resumes from the model obtained in the last stable training run. In fact, during our training, the model was close to collapse from the end of epoch 7 to the beginning of epoch 8, and about 2 epochs of training effort were wasted.
Generator: it minimizes the reconstruction error R(·) of its synthesized samples against D; G is trained by minimizing two distances, corresponding to the supervision from D and from P described above.
Prejudgment monitor: it converts the input into the desired morphological characteristics obtained from PTNet, which represent the posture and label we want. P is a face recognition network module used to store the face features extracted by HPENet. G takes the original face A, face B′, and the random noise Z (a 256-dimensional vector) as input and synthesizes a face image X_S, whose size is also 256 × 256.
We use the Euclidean distance to judge the authenticity of the face identity as

$$\varepsilon=\left\|\Omega-\Omega_{k}\right\|_{2},$$

where $\Omega$ represents the face to be discriminated and $\Omega_{k}$ represents a certain face in the training set, both represented by their eigenface weights. When the distance is less than the threshold value, the two are judged to be the same person; when the distance to every face in the training set is greater than the threshold, the sample is classified as either a new face or a non-face. The threshold setting is not fixed and depends on the training set.
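The identity check described above reduces to a nearest-neighbour search under a threshold; a minimal sketch, assuming faces are already represented by their feature (eigenface-weight) vectors.

```python
import numpy as np

def identify(query_feat, gallery_feats, threshold):
    """query_feat: (D,) feature vector of the face to be discriminated.
    gallery_feats: (K, D) feature vectors of known faces.
    Returns the index of the matching identity, or None if every distance
    exceeds the threshold (new face / non-face case)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # Euclidean distances
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None
```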
Discriminator: the traditional way is to use a binary classifier to determine the true and false domains. However, this is not desirable for image generation problems because of the sparsity of the image signal. To obtain pixel-level supervision, we use an autoencoder as the discriminator; in other words, D reconstructs the input image, and the pixelwise distance between the input and the output is measured.
We use the L1 norm to express this reconstruction difference:

$$R(v)=\left\|v-D(v)\right\|_{1},$$

where $D(v)$ denotes the autoencoder output of the discriminator for an image $v$.
The reconstruction errors of true and false samples are then compared: $R(x)$ for a real sample $x$ and $R(x_{S})$ for a generated sample $x_{S}$.
The discriminator is trained by minimizing the reconstruction error of true samples and maximizing the reconstruction error of false samples:

$$\mathcal{L}_{D}=R(x)-k_{t}\,R(x_{S}),\qquad k_{t+1}=k_{t}+\lambda\bigl(\gamma R(x)-R(x_{S})\bigr). \qquad (8)$$
In (8), in order to maintain the balance between $R(x)$ and $R(x_{S})$, we introduce the regularization coefficient $k_{t}$, which is dynamically updated during training. Given a 256 × 256 input, P estimates its facial morphological features. Here $\lambda$ is the learning rate of $k_{t}$ and $\gamma$ is the diversity ratio; we set $\lambda$ = 0.001 and $\gamma$ = 0.4 in this work.
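A minimal sketch of this BEGAN-style balance follows, under the assumption that D is an autoencoder and that the generator's adversarial loss is its own reconstruction error; the additional identity term from P is omitted here.

```python
import torch

def recon_error(discriminator, v):
    """R(v): L1 distance between an image and its autoencoder (discriminator) reconstruction."""
    return torch.mean(torch.abs(v - discriminator(v)))

def began_step(discriminator, real, fake, k_t, lam=0.001, gamma=0.4):
    """One BEGAN-style balance step (a sketch, not the authors' exact code)."""
    r_real = recon_error(discriminator, real)
    r_fake = recon_error(discriminator, fake.detach())
    d_loss = r_real - k_t * r_fake               # minimize real error, maximize fake error
    g_loss = recon_error(discriminator, fake)    # generator minimizes its own reconstruction error
    k_t = k_t + lam * (gamma * r_real.item() - r_fake.item())  # dynamic equilibrium update
    k_t = min(max(k_t, 0.0), 1.0)                # keep k_t in [0, 1]
    return d_loss, g_loss, k_t
```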
4. Results and Discussion
In this section, we evaluate FRGAN against state-of-the-art models [48–52]. In Section 4.1, we briefly describe the face datasets used in our experiments. The experiments in Section 4.2 present the results of face pose normalization and pose estimation by FRGAN, as well as the visual inpainting effect on low-quality face images. The experiments in Section 4.3 compare the images generated by FRGAN with those of GFRNet, DCP, SCGAN, and MSRGAN. On the basis of this qualitative comparison, a quantitative comparison is further carried out in Section 4.4 to evaluate the face restoration ability and the quality of the synthesized images. The peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) are adopted to quantitatively compare against the related state of the art. For qualitative evaluation, we illustrate results from FRGAN and the competing methods. To evaluate the generalization ability of FRGAN, we also give results on real low-quality images.
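PSNR and SSIM can be computed with scikit-image as in the sketch below, assuming 8-bit RGB images and a recent scikit-image version that provides the channel_axis argument.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored, ground_truth):
    """restored, ground_truth: uint8 RGB images of identical shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    ssim = structural_similarity(ground_truth, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```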
4.1. Datasets
We adopt the VGGFace2 [53] and CASIA-WebFace [54] datasets to train and test on low-quality faces. VGGFace2 is a large-scale facial recognition dataset containing 3.31 million images of 9,131 identities, with about 362 images of varying size per identity; its notable characteristic is the large number of images per identity. CASIA-WebFace contains 10,575 identities, each with approximately 46 images of size 256 × 256. The images were collected in the wild, covering a wide range of postures, ages, visibility, and expressions. For each identity, at most three high-quality images are selected; the frontal image with open eyes serves as the guided image, and the ground-truth images are synthetically degraded to form the low-quality inputs. We use CelebA, CASIA-FaceV5 [55], AFLW, ORL [56], and 300W [57] to evaluate the reconstruction quality of the top-performing models.
4.2. Implementation Details
We address low-quality image restoration together with guide-face pose conversion, and the experiments performed well on VGGFace2. Our procedure consists of training and testing the posture transitions: HPENet's training involves aligning faces with different expressions and angles, while the training of FaceGAN includes face superresolution reconstruction by the generator and adversarial training against the prejudgment monitor and the discriminator. Figure 7 shows the results of face pose conversion and low-quality face inpainting by FRGAN.

To further investigate the performance of PTNet across poses and datasets, we report the NME for faces at 30°, 60°, and 90° yaw angles, together with the average NME, on the AFLW2000-3D and AFLW-LFPA datasets. Table 1 shows the results of the comparative experiment; the comparative results are cited directly from 2DASL [58]. Our method performs well on both datasets: on AFLW-LFPA the average NME is 0.1 lower than that of 2DASL, and on AFLW2000-3D it is 0.09 lower. The fitting effect is best at large deflection angles. We believe that training with more in-the-wild face images would further improve performance.
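For reference, NME is typically computed as the mean point-to-point landmark error normalized by a face-size term. The sketch below uses the square root of the ground-truth bounding-box area as the normalizer, which is the common convention on AFLW2000-3D; the paper does not state its normalizer, so this is an assumption.

```python
import numpy as np

def nme(pred, gt):
    """Normalized Mean Error for one face.
    pred, gt: (N, 2) predicted / ground-truth landmarks.
    Normalizer: sqrt of the ground-truth bounding-box area (assumed convention)."""
    bbox_area = (gt[:, 0].max() - gt[:, 0].min()) * (gt[:, 1].max() - gt[:, 1].min())
    norm = np.sqrt(bbox_area)
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / norm
```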
We performed 3D shape fitting for the guide face and also list the identity matching similarity of faces viewed from different angles. The result of the 3D face fitting is shown in Figure 8.

4.3. Comparisons to the State of the Art
Figure 9 shows the results on real low-quality images for all the competing methods. Regarding the pose problem, it shows additional restoration results of our FRGAN compared with the top-5 performing methods on real low-quality images with different poses; FRGAN shows strong robustness in restoring facial images across poses. We selected real images with resolution below 50 × 50 from CASIA-WebFace. Even though the degree of degradation is unknown, our method produces visually faithful results in the low-quality face regions, whereas the other methods achieve only modest improvements in visual quality.

From the experimental results, it is obvious that our blind face inpainting results are significantly better than those of DCP, MDeblurGAN, SCGAN, and MERGAN. Our results are very close to GFRNet, but ours are closer to reality in face texture restoration. The pixel reconstruction quality of MSRGAN is very good, but the details are blurred. In distinguishing noise from non-noise, our method recovers the facial details very well, keeping as much textural detail of the face as possible while removing the noise points. SCGAN has the worst pixel recovery and structure. GFRNet restores a little structure and texture visually, but the restored image resolution is too low. In summary, our model can restore a complete facial structure and clear texture details.
4.4. Results and Analysis
Herein, we present the results of facial landmark localization (on 300W), which also evaluates superresolution through the landmark precision of a pretrained FRGAN applied to the superresolved images. We report the results of the following variants:
(i) FRGAN-i: we use our superresolution loss function and then run FRGAN on it.
(ii) FRGAN-ii: we use our superresolution loss function with the feature term and then run FRGAN on it.
(iii) FRGAN-iii: we use our superresolution loss function with the heatmap term and then run FRGAN on it. We use the same FRGAN as above. This variant is intended to emphasize the importance of jointly training the face alignment and superresolution networks.
(iv) FRGAN-iv: the training method is the same as above, but this time FRGAN is trained jointly with the rest of the network.
The results are summarized in Table 2.
Figure 10 shows the PSNR and SSIM results for the two test subsets, where our FRGAN achieves a significant performance improvement over all competing approaches, including PFSR and FSRNet. Our method performs well in both 4× SR and 8× SR, producing better restoration of low-quality images with clearer structure and texture. With the help of the Swish-X loss, our model converges to a stable solution and leads to reasonable inpainting; the loss function also greatly improves the convergence rate of the model.

Figure 10 also shows the restoration results of FRGAN. Our model produces clearer and richer details that are more visually realistic and achieves better visual effects than the other models, and we also obtain the best performance in the qualitative results. We can introduce details of other identities into the results. Furthermore, six models are trained under two settings of our general test model; because the six models are trained in different test environments, it would be unfair to compare them on any single synthetic test set. In comparison, our model produces sharp and clean results even for the most complex facial expressions. The results indicate that our model is effective in simulating real low-quality images, which usually have unknown and complex degradations and expressions. Finally, our algorithm can correctly align the guided image to the target posture and expression. The experiments further demonstrate the necessity and effectiveness of the Swish-B loss function, which greatly improves the reconstruction speed of the model.
5. Conclusions
We presented a novel face restoration model, FRGAN, which can effectively inpaint low-resolution facial images. Our approach is based on a new architecture that combines the Head Pose Estimation Network (HPENet), the Postural Transformer Network (PTNet), and the Face Generative Adversarial Networks (FaceGAN). The improvement on low-quality faces benefits from our face pose estimation and the prejudgment monitor, and the Swish-X loss in FRGAN effectively improves the convergence speed of the model. More importantly, our design greatly reduces the reconstruction bias caused by inconsistent facial expressions and postures. By comparing with state-of-the-art methods both qualitatively and quantitatively, we demonstrate the effectiveness and superiority of our method in low-quality face reconstruction. Superresolved face images can not only repair low-quality face images but also improve the human visual experience. In the future, our method can be further applied to the repair of blurry video frames.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.