Abstract

Cross-age face recognition is highly challenging in practical applications because the facial features of the same person at different ages contain age-variant aging features in addition to invariant identity features. To better extract the age-invariant identity features hidden beneath the age-variant aging features, this paper proposes a deep learning-based approach with multiple attention mechanisms. First, we propose the stepped local pooling strategy to improve the SE module. Then, by incorporating the residual-attention mechanism, the self-attention mechanism, and the improved channel-attention mechanism into the backbone network, we propose the Multiple Attention Mechanism Network (MAM-CNN) framework for the cross-age face recognition problem. The proposed framework can focus on essential face regions to highlight identity features and diminish the distractions caused by aging features. Experiments are carried out on two well-known public-domain face aging datasets (MORPH and CACD-VS). The results show that the introduced attention mechanisms jointly improve model performance by 0.96% and 0.52%, respectively, over the state-of-the-art algorithms.

1. Introduction

Computer vision has found applications in many fields, such as 3D object detection in autonomous driving [1], hand gesture recognition in human-machine interaction [2], plant disease detection in agriculture [3], and speech emotion recognition in natural language processing [4]. As an important research topic in the field of computer vision, face recognition technology has been extensively studied and widely used in real-world scenarios such as high-speed rail station entry, hotel check-in, and smartphone face unlocking. Over the last decades, many efforts have been devoted to dealing with variations in illumination, pose, expression, and occlusion in face recognition, yet little attention has been paid to the aging problem. In tasks such as recovering missing children, face recognition technology can help identify possible matches between a lost child’s face photograph and a missing persons database. Traditional face recognition performs poorly on such tasks because facial shape and texture change significantly with age, as shown in Figure 1. Identifying fugitives from justice after many years poses a similar challenge. Such tasks call for research on cross-age face recognition.

The earliest attempts at cross-age face recognition fell into the category of generative approaches [5, 6], which rely on simulating the face aging process and augmenting the datasets with synthetic samples to improve performance. However, these methods are computationally expensive due to the complexity of the simulation model itself, and the generated samples lack diversity due to the strong parametric assumptions of the model. Later, discriminative approaches were extensively explored by researchers [7–10], in which a feature-based classifier is used to make the face matching decision.

As pointed out by [8], the facial feature of a person can be expressed by an identity component that is stable over the aging process and an age component that reflects the aging effect. An excellent discriminative classifier is expected to extract the identity factors from the face images while discarding the aging factors. Unfortunately, most existing discriminative methods fail to extract age-invariant identity features robustly, leaving room for improvement.

To overcome the above-mentioned difficulties, deep learning technologies were introduced for better performance. Deep learning has become one of the most powerful feature engineering tools, and deep learning-based methods dominate many fields of computer vision, including object detection [11, 12] and face recognition [13, 14]. The convolutional neural network (CNN), a typical deep learning model with effective feature extraction ability, has been successfully applied to cross-age face recognition systems and has achieved better accuracy and efficiency than traditional methods [8, 15–17].

It is widely acknowledged that facial identity is a kind of high-level biometric feature that can be better extracted by deeper network architectures. However, most of the deep network architectures applied to age-invariant face recognition tasks suffer from the vanishing gradient problem and are far from perfect in distinguishing the identity components from the age components. The visual attention mechanism simulates the visual perception process of human beings; it helps grasp the essential contents of a scene and enables the visual system to obtain useful information with limited processing resources [18]. It is expected to separate the identity factor, the age factor, and noise, and to focus on the identity factor while ignoring the rest. The residual network [19], also known as ResNet, is one of the most popular deep neural network structures in the research community. Its shortcut connections tackle the vanishing gradient problem and make it possible to train very deep networks.

Motivated by the above facts, this paper proposes a cross-age face recognition approach using a deep learning model incorporating spatial/channel multiple attention mechanisms. Benefiting from the residual structure in the ResNet-50 network and the residual-attention module, the proposed approach can easily reach a deep level and extract multi-level identity features by fusing deep and shallow features. Based on ResNet-50, we make several modifications to enhance performance. Bottleneck blocks are replaced with hierarchical residual-like connections to obtain a larger feature map and fine-grained features of facial identity minutiae. The self-attention mechanism is incorporated to learn facial correlation information and concentrate on identity feature extraction. The channel-attention mechanism is introduced by modifying the SE module to take advantage of the abundant convolution kernel information in the channel dimension. Experiments are carried out on two well-known public-domain face aging datasets (MORPH and CACD-VS) to demonstrate the effectiveness of the proposed approach.

In this paper, we propose what is, to the best of our knowledge, the first reported framework for the cross-age face recognition problem that incorporates three attention mechanisms, and show its superiority over existing state-of-the-art methods. The contributions of this paper can be summarized as follows:

(1) A new residual network-based approach is proposed for the cross-age face recognition problem. The modified network can be optimized end-to-end, and the added modules are straightforward to implement and computationally lightweight. Experimental results on the CACD [20] and MORPH [21] datasets show that the proposed approach achieves notable improvements over the state-of-the-art algorithms.

(2) A novel pooling strategy, called stepped local pooling (SLP), is proposed to improve the SE module. In SLP, local maximum pooling and global average pooling are performed sequentially. Local maximum pooling extracts local texture features, and global average pooling eliminates local noise in the feature map. The proposed SLP can retain local texture features while being robust to noise and occlusions.

(3) To the best of our knowledge, this is the first reported framework for the cross-age face recognition problem that combines three attention mechanisms. The residual-attention mechanism and the hierarchical residual-like connection structure extract multi-level fine-grained identity features and tackle the vanishing gradient problem. The self-attention mechanism explores the correlations between pixels and makes the model concentrate on extracting identity features. The improved channel-attention mechanism strengthens the global expression of the critical convolution kernels in the channel dimension and guides the model to focus more on identity features.

The rest of the paper is organized as follows: Section 2 describes related work. Section 3 introduces the proposed method for cross-age face recognition. Experimental results on the datasets are presented and discussed in Section 4. Section 5 concludes the paper.

2. Related Work

2.1. Cross-Age Face Recognition

Current cross-age face recognition methods can be roughly divided into two categories: generative approaches and discriminative approaches. Most of the earliest cross-age face recognition methods are generative ones [5, 6], which synthesize the corresponding face images in the target age span and then perform recognition on them. Park et al. [5] present a 3D aging modeling technique to model shape and texture separately at different ages for cross-age face recognition. Du et al. [6] employ an improved prototype method to carry out the facial aging simulation and demonstrate a sparse-constrained non-negative matrix factorization (NMFsc) algorithm to increase the recognition accuracy. However, generative approaches suffer from unstable synthesis results and high computational costs in simulating the aging process. Subsequently, discriminative approaches based on traditional machine learning models appeared. [7] applies SIFT and multi-scale local binary patterns (MLBP) as local descriptors to build feature spaces. [8] describes the hidden factor analysis (HFA) method, which separates the age and identity information and uses the identity component for face recognition. In [9], a texture-embedded discriminative graph matching (TED-GM) model is introduced to address the problem of age-invariant face recognition by applying the Gabor binary pattern histogram sequence to encode discriminative and compact features. [10] reports an effective maximum entropy feature descriptor to construct an identity matching framework. Nevertheless, discriminative approaches based on traditional machine learning models rely heavily on hand-crafted feature descriptors, which limits their performance in complicated environments.

Recently, deep learning models have been applied to cross-age face recognition and have achieved state-of-the-art performance [15–17]. These methods are mainly based on CNNs that extract effective face features automatically. In [15], the age estimation guided convolutional neural network (AE-CNN) is used to separate the age component from the person-specific feature. [16] designs a latent factor guided CNN framework (LF-CNN) for cross-age face recognition. By utilizing an LIA module and a new fully connected layer (LF-FC), age-invariant identity features can be extracted from deep visual features. An age-related factor guided joint task modeling CNN is presented in [17] to separate identity features and aging features by combining an identity discrimination network with an age discrimination network. The above deep learning-based methods are superior to the traditional machine learning-based methods in robustness and accuracy. However, the complex sub-network structures specially designed for cross-age face recognition incur additional computational cost and may cause the vanishing gradient problem.

Table 1 briefly summarizes the above-mentioned studies.

2.2. Attention Mechanism in Deep Learning Model

Inspired by the human perception system, attention mechanisms can focus more on important locations and enhance the representations of these locations. Combined with deep learning models, they can adaptively distinguish the crucial features in feature maps and improve the accuracy and training efficiency of the models. [22] designs Mask-RCNN to utilize the segmentation attention module and decouple the classification and segmentation tasks. For the image classification task, RACNN [23] describes an attention proposal network (APN), trained with an additionally designed rank loss, to extract the target, which is more favorable for fine-grained classification. A multi-scale attention module composed of two convolution layers [24] is designed to softly weight the multi-scale features at each pixel location and improve the recognition ability of the network for targets at different scales. [25] combines a traditional CNN with the non-local attention operation to extract global semantic features for video classification. [26] describes the residual-attention mechanism module, which combines shallow features and deep features. However, the idea of integrating the residual-attention mechanism with residual networks such as ResNet-50 is only briefly mentioned, without discussing how to realize it. [27] describes the SE module, which extracts attention weights in the channel dimension and enhances the weights of important convolution kernels, but does not combine it with an appropriate CNN. Therefore, how to smoothly incorporate attention mechanisms into deep learning models to emphasize identity features for recognition and improve performance is worth researching, which motivates this work.

3. Proposed Method

This section introduces the proposed cross-age face recognition model based on a convolutional neural network with spatial/channel multiple attention mechanisms. ResNet-50 [19] uses residual learning and skip connections to extract discriminative deep visual features and avoid vanishing gradients, so we choose it as the backbone convolutional neural network of the proposed approach. However, applying ResNet-50 directly to cross-age face recognition results in unsatisfactory performance due to the age-variant aging features. By incorporating multiple attention mechanisms, the proposed approach can retain the expressivity of ResNet-50 and effectively extract identity features that are more conducive to cross-age face recognition.

The network structure is illustrated in Figure 2. First, we replace the identity blocks and skip connections with residual-attention blocks and hierarchical residual-like connections. In this way, the proposed approach can capture detailed features at a fine-grained level and adequately fuse the shallow local identity features and the deep global identity features. Meanwhile, the self-attention mechanism is introduced in the shallow stage of the network to adaptively learn face features, especially the facial correlation information. These two spatial attention mechanisms strengthen the weight of identity features and filter out the aging features. Furthermore, we design the improved channel-attention branch to take full advantage of the information contained in the convolutional kernels and further improve recognition accuracy. The proposed approach performs end-to-end extraction of face features without introducing sub-networks, ensuring the reliability of cross-age face recognition while incurring only a small computational burden.

3.1. Residual-Attention Mechanism

As shown in some deep learning-based cross-age face recognition approaches, most identity features are semantic features that can only be extracted at the deep levels of a convolutional neural network. However, the vanishing gradient problem appears as the network depth increases. The residual-attention mechanism [26] is a stackable attention module that uses residual learning to solve this problem and enables the network to combine deep features with shallow features.

The structure of the residual-attention mechanism block is displayed in Figure 3. Each block contains a trunk branch and a soft mask branch. In the trunk branch, convolutional, batch normalization, and activation layers are alternately stacked as a typical convolution unit to perform feature processing, producing an output $T(x)$ (notation as in [26]). The soft mask branch uses a bottom-up and top-down structure to output an attention mask $M(x)$ of the same size as $T(x)$. The bottom-up and top-down structure first increases the receptive field of the model and extracts deep visual features through a series of convolution and pooling layers. Then, the feature map is enlarged by up-sampling so that the focused area of the attention mechanism is mapped to each pixel of the input feature map. By assigning an attention weight to each pixel value of $T(x)$, the attention mask $M(x)$ can enhance identity features that are meaningful for cross-age face recognition and suppress noisy aging features in the original feature map.

Nevertheless, the output values of the mask branch are restricted to $[0, 1]$ to match the definition of a weighting coefficient. Performing element-wise multiplication directly between the trunk output $T(x)$ and the mask $M(x)$ decreases the feature map values, which can break the beneficial characteristics of the original network and cause vanishing gradients. Following [26], the residual-attention mechanism applies the following equation to generate the final weighted residual-attention map $H(x)$:

$$H(x) = \left(1 + M(x)\right) \odot T(x)$$

In this way, the decrease in feature values is avoided, and the differences between the identity features of different people become more pronounced.
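The effect of the residual weighting can be illustrated with a minimal sketch. It assumes the formulation of [26], in which a trunk output T and a soft mask M with values in [0, 1] are combined as (1 + M) * T; the function names and the flat-list representation of a feature map are hypothetical and chosen only for illustration.

```python
def residual_attention(trunk, mask):
    """Combine trunk output T and soft mask M element-wise as (1 + M) * T.

    Both arguments are flat lists of floats standing in for feature maps
    (an illustrative simplification, not the paper's implementation).
    """
    if len(trunk) != len(mask):
        raise ValueError("trunk and mask must have the same size")
    weighted = []
    for t, m in zip(trunk, mask):
        if not 0.0 <= m <= 1.0:
            raise ValueError("mask values are expected to lie in [0, 1]")
        weighted.append((1.0 + m) * t)
    return weighted


def naive_attention(trunk, mask):
    """Plain element-wise product M * T, which can only shrink magnitudes."""
    return [m * t for t, m in zip(trunk, mask)]


trunk = [2.0, -1.0, 0.5]
mask = [0.0, 0.5, 1.0]
print(residual_attention(trunk, mask))  # [2.0, -1.5, 1.0]
print(naive_attention(trunk, mask))     # [0.0, -0.5, 0.5]
```

With a mask value of 0 the residual form passes the trunk feature through unchanged, whereas the plain product erases it; this is the value decrease that the residual form is meant to avoid.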

As shown in Figure 2, our cross-age face recognition model replaces the identity blocks from the second stage to the fourth stage of the original ResNet-50 with residual-attention mechanism blocks. Considering the large number of fine-grained identity features (such as facial features and moles) in face images, the receptive field of the residual-attention mechanism alone is not sufficient to extract the attention weights of these features. We therefore also introduce hierarchical residual-like connections [28] to replace the skip connections in ResNet-50. As shown in Figure 4, the feature map is divided into $s$ subsets along the channel dimension; let $x_i$ and $K_i$ denote the feature map subset and the convolution kernel of the $i$-th level, respectively. Following the formulation in [28], the corresponding output $y_i$ is defined as follows:

$$y_i = \begin{cases} x_i, & i = 1, \\ K_i(x_i), & i = 2, \\ K_i(x_i + y_{i-1}), & 2 < i \le s. \end{cases}$$

Here $s$ is the control parameter of the scale dimension. A larger $s$ means more internal multi-scale feature fusion, while a smaller $s$ brings fewer parameters. We follow [28] in choosing $s$.

In this way, each convolution kernel in the hierarchical residual-like connection structure can implicitly make the current level output contain feature information of all the previous levels and capture multi-scale identity features with finer-grained receptive fields.
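The way each level inherits information from the previous levels can be sketched as follows. The recursion assumes the formulation of [28] (the first subset is passed through, the second is convolved, and every later subset is added to the previous output before being convolved); the convolution kernels are replaced by arbitrary callables, and all names are hypothetical.

```python
def hierarchical_residual(subsets, kernels):
    """Apply hierarchical residual-like connections to channel subsets.

    subsets: list of s feature subsets, each a list of floats.
    kernels: list of s callables standing in for the per-level convolutions;
             kernels[0] is never called because level 1 is an identity path.
    Returns the list of per-level outputs.
    """
    outputs = []
    for level, x in enumerate(subsets):
        if level == 0:
            y = list(x)                      # first level: identity
        elif level == 1:
            y = kernels[level](x)            # second level: convolve only
        else:
            # later levels: fuse with the previous output, then convolve
            fused = [a + b for a, b in zip(x, outputs[-1])]
            y = kernels[level](fused)
        outputs.append(y)
    return outputs


def double(values):
    """Toy stand-in for a convolution: scales every value by 2."""
    return [2.0 * v for v in values]


subsets = [[1.0], [2.0], [3.0], [4.0]]
print(hierarchical_residual(subsets, [double] * 4))
# [[1.0], [4.0], [14.0], [36.0]]
```

In the toy run, the last output (36.0) depends on the second and third subsets as well as the fourth, which mirrors the statement that each level implicitly contains feature information from the previous levels.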

3.2. Self-Attention Mechanism

According to scientific research [29], the primary identity features, such as the relative size, relative distance, and relative angle of facial features, are largely fixed by the age of 10 to 12 and change only slightly in the subsequent aging process. This means that the relative relationships between the facial features of the same person remain basically unchanged across age groups, whereas they vary greatly between different people. The self-attention mechanism [30] can automatically learn the correlation information of key image areas. Therefore, we use it to extract facial correlation information and further strengthen the expression of identity features.

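The structure of the self-attention mechanism is shown in Figure 5. Following the formulation in [30], the output feature map $x \in \mathbb{R}^{C \times N}$ of a hidden layer is converted into three new feature maps $[f(x), g(x), h(x)]$ through convolutional layers, where $C$ is the number of channels, $N$ is the number of feature locations in the original feature map, and $f(x) = W_f x$, $g(x) = W_g x$, $h(x) = W_h x$. The attention weight $\beta_{j,i}$ indicates the extent to which the model attends to the $i$-th location when synthesizing the $j$-th region and is given by

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})},$$

where $s_{ij} = f(x_i)^{T} g(x_j)$. The weighted self-attention map is $o = (o_1, o_2, \ldots, o_N)$, where

$$o_j = \sum_{i=1}^{N} \beta_{j,i}\, h(x_i).$$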

Through repeated experiments, we also found that enabling the self-attention mechanism from the very beginning of training can limit network performance. As training continues, the relationships between facial features become more evident, and the effect of the self-attention mechanism becomes more significant. Therefore, we follow [29] and multiply the weighted self-attention feature map $o_i$ by a learnable scale parameter $\gamma$, which is initialized as 0, and add back the input feature map $x_i$ to get the final output $y_i$:

$$y_i = \gamma\, o_i + x_i$$

This means that the self-attention mechanism is effectively inactive in the early stages of training and is gradually integrated to explore identity features as training progresses. By enhancing the extraction of the important correlations between facial features, the proposed approach weakens the influence of aging features and can achieve better performance in cross-age face recognition.
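The gating behavior of the scale parameter can be illustrated with a small sketch. It assumes a standard self-attention formulation: three linear maps of the input locations, softmax-normalized pairwise scores, and an output equal to a scale parameter times the attended feature plus the input. The weight matrices, the plain-list data layout, and all names are hypothetical; the value map is assumed to preserve the channel count so that it can be added back to the input.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_f, w_g, w_h, gamma):
    """Self-attention over N locations, each a feature vector of length C.

    Returns gamma * o_j + x_j for every location j, where o_j is the
    attention-weighted sum of the value features h(x_i).
    """
    f = [matvec(w_f, xi) for xi in x]
    g = [matvec(w_g, xi) for xi in x]
    h = [matvec(w_h, xi) for xi in x]
    n = len(x)
    outputs = []
    for j in range(n):
        # score of location i for region j, normalized over all i
        scores = [sum(a * b for a, b in zip(f[i], g[j])) for i in range(n)]
        beta = softmax(scores)
        o_j = [sum(beta[i] * h[i][c] for i in range(n)) for c in range(len(h[0]))]
        outputs.append([gamma * o + xc for o, xc in zip(o_j, x[j])])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, -1.0]]
# With gamma = 0 the block reduces to the identity mapping.
print(self_attention(x, identity, identity, identity, 0.0) == x)  # True
```

With the scale parameter at its initial value of 0, the block returns its input unchanged, so the attention path contributes only once training moves the parameter away from 0.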

3.3. Improved Channel-Attention Mechanism

To provide rich feature expression ability at a single level, a CNN uses numerous convolution kernels that are sensitive to different features. As the network deepens, the convolution layers accumulate, and the output feature maps can contain thousands of channels (e.g., 2048 channels in the fifth stage of ResNet-50). By leveraging the abundant contribution information of the convolution kernels in the channel dimension, the proposed approach designs an improved channel-attention mechanism based on the SE module [27] to obtain the weight values of the convolution kernels and further enhance the weight of identity features.

First, we squeeze the input feature map of size $H \times W \times C$ into channel-wise global descriptors. The original SE module uses global average pooling to generate the global expression of each convolution kernel, which can lose a great deal of the texture feature information in the face image. Directly replacing it with global maximum pooling can capture more representative features, but it loses spatial information and degrades robustness to interference. Therefore, we design a Stepped Local Pooling (SLP) block, which combines the advantages of the two pooling strategies to form a more expressive squeeze operation. The structure of SLP is shown in Figure 6.

First, SLP divides the input feature map of each channel into $n$ sub-regions of size $h \times w$ and calculates the local maximum value of each sub-region. Then, global average pooling is applied to eliminate the influence of occlusion and noise in local areas and obtain the channel-wise global descriptor $z_c$, which is given by

$$z_c = \frac{1}{n} \sum_{i=1}^{n} \max\left(u_c^{i}\right),$$

where $u_c^{i}$ is the $i$-th sub-region of the feature map in channel $c$. In this way, we can retain the surface texture features of face images and increase the reliability of the channel-attention mechanism.
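The two-step squeeze can be sketched for a single channel as follows. The sketch assumes square, non-overlapping sub-regions whose side length divides the feature map evenly; the text does not specify the sub-region layout, so this choice and the function name are illustrative assumptions.

```python
def stepped_local_pooling(channel, k):
    """Squeeze one channel (an H x W list of lists) into a single descriptor.

    Step 1: take the maximum of every non-overlapping k x k sub-region.
    Step 2: average the local maxima (global average pooling).
    The k x k non-overlapping layout is an assumption for illustration.
    """
    height, width = len(channel), len(channel[0])
    if height % k or width % k:
        raise ValueError("k must divide both spatial dimensions")
    local_maxima = []
    for top in range(0, height, k):
        for left in range(0, width, k):
            local_maxima.append(
                max(channel[r][c]
                    for r in range(top, top + k)
                    for c in range(left, left + k))
            )
    return sum(local_maxima) / len(local_maxima)


channel = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(stepped_local_pooling(channel, 2))  # 3.25
```

For this toy channel the global average is 1.375 and the global maximum is 8, so the stepped descriptor (3.25) sits between the two: it keeps local peaks while damping the single outlier.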

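Subsequently, considering the nonlinear relationship between channels, the channel-wise global descriptors $z$ obtained by the squeeze operation are transformed into channel-attention weights $s$ by using the sigmoid activation $\sigma$:

$$s = \sigma\left(W_2\, \delta(W_1 z)\right),$$

where $W_1$ and $W_2$ are the parameters of two fully connected layers, which limit model complexity and improve generalization, and $\delta$ denotes the ReLU activation used in the original SE module [27]. The final output of the channel-attention module is as follows:

$$\tilde{u}_c = s_c \cdot u_c,$$

where $u_c$ is the input feature map in channel $c$.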
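A minimal sketch of this excitation step is given below. It assumes the structure of the original SE module [27]: a fully connected layer, a ReLU, a second fully connected layer, and a sigmoid, followed by channel-wise rescaling. The weight values and function names are hypothetical and bias terms are omitted for brevity.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_weights(z, w1, w2):
    """Map channel descriptors z to attention weights in (0, 1).

    w1 reduces the channel dimension and w2 restores it, as in the
    SE module; a ReLU sits between the two layers (assumed from [27]).
    """
    hidden = [max(0.0, v) for v in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-v)) for v in matvec(w2, hidden)]


def rescale(feature_maps, weights):
    """Scale every value of channel c by its attention weight."""
    return [[[w * v for v in row] for row in fmap]
            for fmap, w in zip(feature_maps, weights)]


z = [1.0, 3.0]                 # one descriptor per channel
w1 = [[0.5, 0.5]]              # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]           # 1 hidden unit -> 2 channels
weights = channel_weights(z, w1, w2)
feature_maps = [[[1.0, 2.0]], [[3.0, 4.0]]]
print(weights)
print(rescale(feature_maps, weights))
```

In this toy setting the first channel receives a weight above 0.5 and the second a weight below 0.5, so the rescaling emphasizes the first channel and suppresses the second.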

As shown in Figure 2, the proposed approach combines the improved channel-attention mechanism branch with the residual-attention mechanism branch to further improve the accuracy of cross-age face classification. Through a learnable weight parameter, the network can gradually output the optimal fused attention map during training.

4. Experiments and Results Analysis

4.1. Experiments Details

(1) Datasets: In our experiments, we evaluate the performance of the proposed CNN model with the Multiple Attention Mechanism Network (MAM-CNN) on two well-known public cross-age datasets: CACD [20] and MORPH [21]. Some statistical information about these datasets is given in Table 2.

(2) Preprocessing: To remove low-resolution and mislabeled images, all face images in the datasets are processed by the DeepFace model [31], and only the images with a score above 99.75% are preserved. To balance the number of images and the age interval, we partition the face images into six age groups {10-20, 21-30, 31-40, 41-50, 51-60, 61+}. Due to the long age span and the difficulty of collection, the overall quality of the above cross-age face datasets is relatively poor and is not sufficient to train the model effectively. Therefore, the proposed model is first pre-trained on the CASIA-WebFace dataset [32] and then fine-tuned on the cross-age datasets to perform cross-age face recognition.

(3) Implementation: The overall network architecture of the proposed model is shown in Figure 2 and implemented in TensorFlow. In the training process, we use the Adam optimizer with batch size 32, initial learning rate 0.0001, and decay rate 0.0005. After training is complete, the trained model is evaluated by accuracy (ACC) and Rank-1 identification rate [33].
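The filtering and age-grouping steps of the preprocessing can be sketched as follows. The 99.75% threshold and the six age groups are taken from the text; the record layout (dictionaries with `path`, `age`, and `score` keys) and the function names are hypothetical, and the score is assumed to be a value in [0, 1] produced by a separate face detector.

```python
# Age-group bounds from the text; everything at or above 61 falls in "61+".
AGE_GROUPS = [(10, 20), (21, 30), (31, 40), (41, 50), (51, 60)]
SCORE_THRESHOLD = 0.9975  # "more than 99.75%"


def age_group(age):
    """Return the group label for an age, or None if it is below 10."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "61+" if age >= 61 else None


def preprocess(records):
    """Drop low-scoring images and bucket the remaining paths by age group.

    records: iterable of dicts with hypothetical keys 'path', 'age', 'score'.
    """
    groups = {}
    for record in records:
        if record["score"] <= SCORE_THRESHOLD:
            continue  # discard images that do not exceed the threshold
        label = age_group(record["age"])
        if label is None:
            continue  # outside the age range covered by the six groups
        groups.setdefault(label, []).append(record["path"])
    return groups


records = [
    {"path": "a.jpg", "age": 25, "score": 0.999},
    {"path": "b.jpg", "age": 25, "score": 0.990},
    {"path": "c.jpg", "age": 70, "score": 0.998},
]
print(preprocess(records))  # {'21-30': ['a.jpg'], '61+': ['c.jpg']}
```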

4.2. Architecture Deployment

Original ResNet-50 consists of 5 stages, and there are various choices for the deployment of spatial/channel-attention mechanism modules in the proposed approach. In this section, we demonstrate the recognition performance of different deployment modes and finally determine the optimal architecture of the proposed model through comparative experiments.

ResNet-50 pre-trained on the CASIA-WebFace dataset and fine-tuned on the 70% training set of the MORPH dataset is chosen as the baseline. Architectures with different attention mechanism deployment modes are tested on the CACD and MORPH. For spatial attention mechanism modules, we firstly replace the identity blocks from the second stage to the fifth stage with the residual-attention modules in all possible ways to discover the optimal architecture of the improved residual attention ResNet-50. Then, we use the hierarchical residual-like connection structure to replace the skip connection structure of the above optimal residual attention ResNet-50 in different ways. The experimental results are shown in Tables 3 and 4. Considering the balance between performance and parameter quantity, the final optimal residual-attention module in the proposed model can be formed by replacing the identity blocks in stages 2-4 with the residual-attention blocks and the skip connections in stages 3-5 with the hierarchical residual-like connections.

Finally, to limit model complexity, we add a single self-attention mechanism module at different stages and explore the best deployment position through comparative experiments. As shown in Table 5, the best performance is obtained by placing the self-attention module at the second stage. This is because the feature map is still large at this stage and the receptive field is not big enough to contain more than two facial features, which means that the correlation information between the facial features has not yet been destroyed by the convolution kernels. Moreover, the number of channels at this stage is sufficient ($C$ = 256), so reducing the number of channels to $C/8$ when calculating the attention map does not significantly affect the recognition accuracy.

For the improved channel-attention mechanism modules, we deploy them at the same positions as the residual-attention mechanism modules (stages 2-4) to extract the weight value of each convolution kernel in the channel dimension and achieve an effective fusion of the spatial-attention and channel-attention mechanisms.

4.3. Cross-Age Recognition Performance and Results Analysis

To further test the effectiveness and robustness of the proposed model in cross-age face recognition, we evaluate it on the datasets and compare its performance with state-of-the-art methods, including several classic cross-age face recognition methods [8, 10, 15, 16, 20] and one deep learning-based conventional face recognition method [13]. ACC and Rank-1 are selected as the evaluation measures for CACD-VS and MORPH, respectively, to remain consistent with the compared algorithms. The results are shown in Tables 6 and 7.
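The two evaluation measures can be sketched as follows. The sketch assumes that faces are compared by the cosine similarity of their feature vectors, that Rank-1 is computed by matching each probe to its most similar gallery entry, and that ACC is computed on labeled image pairs with a fixed decision threshold; these choices, the data layout, and the function names are illustrative assumptions rather than the paper's evaluation protocol.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank1_identification_rate(probes, gallery):
    """Fraction of probes whose most similar gallery entry has the same identity.

    probes, gallery: lists of (identity, feature_vector) pairs.
    """
    hits = 0
    for probe_id, probe_feat in probes:
        best_id, _ = max(gallery, key=lambda item: cosine(probe_feat, item[1]))
        hits += best_id == probe_id
    return hits / len(probes)


def verification_accuracy(pairs, threshold):
    """Fraction of pairs classified correctly as same / different person.

    pairs: list of (feature_a, feature_b, same_person) triples.
    """
    correct = sum((cosine(a, b) >= threshold) == same for a, b, same in pairs)
    return correct / len(pairs)


gallery = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
probes = [("a", [0.9, 0.1]), ("b", [0.2, 0.8]), ("a", [0.1, 0.9])]
print(rank1_identification_rate(probes, gallery))  # 2 of 3 probes match

pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([1.0, 0.0], [0.8, 0.6], False),
]
print(verification_accuracy(pairs, 0.5))  # 2 of 3 pairs are classified correctly
```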

4.3.1. Experiments on the MORPH Dataset

As Table 6 reveals, all of the CNN models with multiple (two or more) attention mechanisms outperform the existing state-of-the-art approach on the MORPH dataset. This indicates that the added attention mechanisms help highlight the identity features and suppress the aging features in the extracted deep face features. Moreover, the proposed MAM-CNN outperforms both the model using a single residual-attention mechanism and the model simply combining the residual-attention and self-attention mechanisms, which demonstrates that different attention mechanisms are complementary and can jointly boost cross-age face recognition performance.

4.3.2. Experiments on the CACD Dataset

The CACD dataset is one of the recently released cross-age face recognition datasets with good image quality and varied scenes, illumination, and pose conditions, containing 163,446 images of 2,000 celebrities across a wide range of ages. However, after inspection, we noticed that it contains many duplicate samples and mismatched labels. Therefore, we filter the dataset and retain 121,592 images of 1,963 celebrities to train the models. We also follow the same partitions as the original CACD to obtain the filtered subset CACD-VS for testing.

The results in Table 7 lead to a conclusion similar to that of the experiments on MORPH. The proposed MAM-CNN model again achieves higher recognition accuracy than the existing state-of-the-art approaches, demonstrating the superiority of the proposed approach. The reason is that the self-attention mechanism can explore the correlation between facial features, and the residual-attention mechanism allows our network to reach a deep level easily. In addition, the improved channel-attention mechanism with SLP strengthens the global expression of the critical convolution kernels in the channel dimension and guides the model to focus more on identity features. These observations demonstrate the reliability and robustness of our method to facial aging in face images.

4.3.3. Comprehensive Analysis

From the results obtained on MORPH and CACD-VS, we can draw several conclusions. All deep learning-based methods ([13, 15, 16] and Ours) outperform traditional machine learning-based methods ([8, 10, 20]) significantly on both datasets, showing their dominance in the cross-age face recognition problem. The methods specially designed for the cross-age face recognition problem ([15, 16] and Ours) are more competitive, as they are dedicated to mining the underlying identity information from the cross-age images. Tables 6 and 7 demonstrate that the relative improvement of deep learning-based methods over traditional methods is more significant on CACD-VS; it can be explained that deep learning-based methods can take better advantage of the “big-data” provided by CACD-VS.

As Figure 7 illustrates, our proposed method improves on AE-CNN by 0.96% on the MORPH dataset and 0.52% on the CACD-VS dataset. As the MORPH dataset has more subjects, fewer images per subject, and a larger age range, it is considered a more challenging dataset. The added attention mechanisms contribute substantially to the ability of MAM-CNN to tackle challenging datasets.

5. Conclusion

This paper proposed a CNN model with spatial/channel multiple attention mechanisms to address the cross-age face recognition task. By combining the residual-attention mechanism, self-attention mechanism, and improved channel-attention mechanism into one framework, the proposed approach can focus more on identity features that are most distinctive between different persons and filter out the background noises and the effects of aging features. Experiments are conducted on the MORPH dataset and the CACD dataset to examine the effectiveness of the proposed method, and the results achieved demonstrate that the proposed method outperforms the existing state-of-the-art ones.

One limitation of attention mechanisms is that they are prone to overfitting. One possible direction for enhancing the performance of MAM-CNN is therefore to study learning from very few samples.

Other future work includes solving cross-age face recognition problems involving masks or occlusion and developing a more dedicated loss function for the attention mechanism network to obtain better global optimization and improve model efficiency.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no competing interest.

Acknowledgments

This research work is funded by the National Natural Science Foundation of China under Grant 61971283 and the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.