Abstract

We present a new approach, the Multiscale facial fusion feature (MS3F), for classifying gender from face images. The fusion feature is extracted by combining the Local Binary Pattern (LBP) and Local Phase Quantization (LPQ) descriptors, and a multiscale feature is generated through Multiblock (MB) and Multilevel (ML) methods. A Support Vector Machine (SVM) is employed as the classifier. All experiments are performed on the Images of Groups (IoG) dataset. The results demonstrate that the multiscale fusion feature greatly improves the performance of gender classification, and that our approach outperforms state-of-the-art techniques.

1. Introduction

Gender classification plays an important role in many scenarios. As a demographic attribute, gender is a soft biometric that provides ancillary information about an individual's identity. Moreover, it can improve the performance of face recognition. It is therefore widely used to provide smart services in human-computer interaction applications such as visual surveillance, intelligent interfaces, and intelligent advertising.

A variety of modalities have been used for gender classification, including gait [1], iris [2], hand shape [3], and the human face. The majority of existing work uses the human face, and this paper does as well, since face images provide more useful information than other modalities: they capture distinguishing differences between men and women, such as face contour, hair, and beard.

There are a great number of studies on gender classification. Golomb et al. [4] were the first, training a fully connected three-layer neural network to discriminate gender on a set of 90 face images in the early 1990s. Several similar neural network classifiers were subsequently proposed [5–7]. A few studies [8, 9] conducted on the FERET dataset highlighted the choice of classifier. Gutta et al. [8] proposed a hybrid classification architecture, an ensemble of Radial Basis Function (RBF) networks and Decision Trees (DT). Moghaddam et al. [9] investigated nonlinear Support Vector Machines (SVMs) on low-resolution faces, which proved superior to traditional pattern classifiers such as the Fisher Linear Discriminant and the Nearest Neighbour classifier. Viola et al. [10] developed a visual object detection framework based on AdaBoost, which Shakhnarovich et al. [11] adopted to classify gender at extremely fast speed.

More recently, feature extraction combined with a classifier has been widely used. For example, Yildirim et al. [12] applied a Haar cascade to classify gender from frontal face images. Shan [13] reached an accuracy of 74.9% with boosted Local Binary Pattern (LBP) features combined with an SVM. Bekhouche et al. [14] extracted Multilevel Local Phase Quantization (ML-LPQ) features from normalized face images to predict gender. Other feature types have also been used in classification, such as Gabor [15] and Histogram of Oriented Gradients (HOG) [16]; comparative studies of different features for facial gender classification can be found in [17–19]. Deep feature extraction using pretrained convolutional neural networks (CNNs) is very powerful and has recently been applied successfully in many image domains, but it is rarely applied to IoG for gender classification. Although a CNN architecture can reduce the time required for feature selection, training remains time-consuming, and IoG is not large enough to train a network fully. Whether a CNN is used as a feature extractor (taking the output of a fully connected layer) or as an end-to-end classifier, a sufficiently exhaustive handcrafted feature descriptor can compete with deep learning-based algorithms [20].

However, the above-mentioned methods extract only a single facial feature, from images captured under controlled conditions. Illumination and viewing angle can have a large impact on classification results, and a single feature cannot represent the information in facial images in sufficient detail. Feature fusion is often used to improve feature extraction, and it is also very effective in other computer vision tasks, such as visual tracking [21–25].

In this paper, we propose a new approach, the Multiscale facial fusion feature (MS3F), to classify gender from face images captured under uncontrolled conditions. MS3F is extracted by applying the LBP and LPQ descriptors: each face image is divided into two blocks, LBP is applied to the top block, and LPQ is applied to the bottom block. In addition, the parameters of LBP and LPQ vary with the size of the subblocks. Because each descriptor is computed on only half of the image, the calculation cost is halved. Finally, an SVM is employed as the classifier. The experimental results are presented in Section 3.

2. Our Approach

2.1. Feature Extraction

In gender classification, the dimensionality of the raw data is very high, so feature extraction should be applied before classification, and classification performance largely depends on the choice of feature descriptor. LBP characterizes the spatial structure of local image texture, and LPQ is based on computing the short-term Fourier transform (STFT) over a local image window. Both have been shown to outperform other facial descriptors [13, 19], and they are complementary to each other. Moreover, LBP and LPQ do not require large amounts of data or computational resources. In our approach, the LBP and LPQ descriptors are used together.

LBP is a descriptor proposed by Ojala et al. [26]. It expresses the texture of an image in a local 3 by 3 neighbourhood using a binary code. Each neighbourhood pixel value $g_p$ is compared with the central pixel value $g_c$: if $g_p$ is smaller than $g_c$, the neighbourhood pixel is set to 0; otherwise it is set to 1. The resulting eight-bit binary code is transformed into a decimal value as follows:
$$\mathrm{LBP} = \sum_{p=0}^{7} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0. \end{cases}$$
Furthermore, Ojala et al. [27] introduced an extended operator, $\mathrm{LBP}_{P,R}$, where $P$ is the number of pixels in the local neighbourhood and $R$ is the radius of the local neighbourhood. The basic LBP is thus $\mathrm{LBP}_{P,R}$ with $P = 8$ and $R = 1$. An extended $\mathrm{LBP}_{P,R}$ is shown in Figure 1.
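To make the operator concrete, the following Python sketch (our illustration, not code from the paper) computes the basic 3 by 3 LBP code map with NumPy; the clockwise neighbour ordering is one common convention.

```python
import numpy as np

def basic_lbp(img):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel
    and pack the resulting bits into one decimal code per pixel."""
    img = img.astype(np.int32)
    H, W = img.shape
    c = img[1:-1, 1:-1]                                  # centre pixels g_c
    # 8 neighbours in a fixed clockwise order, taken as shifted views
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        g_p = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]  # neighbours g_p
        code |= ((g_p >= c).astype(np.uint8) << p)       # s(g_p - g_c) * 2^p
    return code

# The 256-bin histogram of the code map serves as the texture descriptor:
# hist, _ = np.histogram(basic_lbp(face), bins=256, range=(0, 256))
```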

The LPQ descriptor was proposed by Ojansivu et al. [28] for texture classification, especially for blurred images. It utilizes local phase information extracted by the short-term Fourier transform (STFT), computed over a rectangular $M \times M$ neighbourhood $N_x$ at each pixel position $x$ of the image $f$:
$$F(u, x) = \sum_{y \in N_x} f(x - y)\, e^{-j 2\pi u^{T} y} = w_u^{T} f_x,$$
where $u$ is a frequency, $w_u$ is the basis vector of the STFT at frequency $u$, and $f_x$ is a vector containing all image samples from $N_x$.

In LPQ, only four complex coefficients are considered, corresponding to the frequencies $u_1 = [a, 0]^T$, $u_2 = [0, a]^T$, $u_3 = [a, a]^T$, and $u_4 = [a, -a]^T$, where $a = 1/M$. Stacking the corresponding basis vectors gives the transformation matrix $W = [\operatorname{Re}\{w_{u_1}, w_{u_2}, w_{u_3}, w_{u_4}\},\ \operatorname{Im}\{w_{u_1}, w_{u_2}, w_{u_3}, w_{u_4}\}]^T$. After decorrelation, the signs of the real and imaginary parts of the four coefficients are quantized into an eight-bit binary code, so each pixel is represented as a decimal value between 0 and 255.
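The following Python sketch illustrates this quantization, computing the four STFT coefficients by separable convolution and packing the signs of their real and imaginary parts into an 8-bit code. It omits the optional decorrelation step, and the window size, frequency ordering, and normalization are our assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def lpq(img, win_size=3):
    """Minimal LPQ: STFT at four low frequencies over an MxM window,
    then quantize the signs of the real/imaginary parts into 8 bits.
    The decorrelation step of the original method is omitted here."""
    img = img.astype(np.float64)
    r = (win_size - 1) // 2
    x = np.arange(-r, r + 1)
    a = 1.0 / win_size
    w0 = np.ones_like(x, dtype=complex)   # DC basis
    w1 = np.exp(-2j * np.pi * a * x)      # frequency a
    w2 = np.conj(w1)                      # frequency -a
    # Separable 2-D STFT: vertical pass with `col`, horizontal pass with `row`
    conv = lambda row, col: convolve2d(
        convolve2d(img, col[:, None], mode='valid'), row[None, :], mode='valid')
    F = [conv(w1, w0),   # u1 = [a, 0]^T
         conv(w0, w1),   # u2 = [0, a]^T
         conv(w1, w1),   # u3 = [a, a]^T
         conv(w1, w2)]   # u4 = [a, -a]^T
    code = np.zeros(F[0].shape, dtype=np.uint8)
    for i, part in enumerate([f.real for f in F] + [f.imag for f in F]):
        code |= ((part > 0).astype(np.uint8) << i)
    # 256-bin histogram of the codewords is the descriptor
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()
```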

2.2. Feature Fusion

In most existing work, LBP and LPQ are applied separately to extract facial features. However, LBP is a spatial-domain descriptor operating on pixel gray values, and is therefore sensitive to changes in gray value, whereas LPQ is a frequency-domain descriptor operating on frequency coefficients, reflecting changes in the image gradient. Consequently, using LBP or LPQ alone leaves some useful information unextracted.

In this paper, a fusion feature combining LBP and LPQ is proposed to represent facial information. Considering the calculation cost, the fusion feature is not a direct cascade of the two full-image histograms. Instead, the image is first cut into two subblocks, a different descriptor is applied to each subblock, and the resulting histograms are combined (see Figure 2). In this way, the calculation cost is halved while the feature remains more exhaustive than either descriptor alone.

To improve performance, the face region is divided horizontally rather than vertically: LBP is applied to the top block, while LPQ is applied to the bottom block. Four divisions of the face region are compared to verify that the division adopted in our work achieves the best performance; the experimental results are shown in Section 3.
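As a sketch, the fusion step could look as follows, reusing the lpq helper above together with scikit-image's LBP implementation; the equal half-split point and histogram normalization are our assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def fusion_feature(face, P=8, R=1, lpq_win=3):
    """Fusion feature: LBP histogram of the top half (hair/eye region),
    LPQ histogram of the bottom half (nose/chin region), concatenated."""
    h = face.shape[0] // 2
    top, bottom = face[:h], face[h:]
    lbp = local_binary_pattern(top, P, R)                # codes in [0, 2^P)
    lbp_hist, _ = np.histogram(lbp, bins=2 ** P, range=(0, 2 ** P))
    lbp_hist = lbp_hist / max(lbp_hist.sum(), 1)
    return np.concatenate([lbp_hist, lpq(bottom, lpq_win)])
```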

2.3. Multiscale Feature

Based on the fusion feature method, Multiblock (MB) and Multilevel (ML) representations are also utilized to improve classifier performance. MB and ML are two kinds of face representation. The main idea of MB is to divide the face into several subblocks and extract features from each subblock, capturing local features. ML combines a series of MBs into a spatial pyramid representation containing both global and local features, as demonstrated in Figure 3.

We use MS3F to represent facial information so that the extracted feature is more precise. The main idea of MS3F is to use different window sizes at different levels, rather than a single size for LBP and LPQ; the window size is determined by the size of the neighbourhood.

In ML, the blocks at different levels have different sizes, and their features express global and local information, respectively. We therefore use a large window size at low levels to extract global features, and a small window size at high levels to extract local features, as shown in Figure 4.
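A possible sketch of the full MS3F pyramid, building on fusion_feature above; the per-level block counts, LBP radii, and LPQ window sizes shown here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def ms3f(face, blocks=(1, 2, 4, 8), lbp_radii=(3, 2, 1, 1),
         lpq_wins=(7, 5, 3, 3)):
    """MS3F sketch: at each level, split the face into n x n subblocks and
    extract the fusion feature with a level-specific window size
    (large windows at coarse levels, small windows at fine levels)."""
    H, W = face.shape
    feats = []
    for n, R, M in zip(blocks, lbp_radii, lpq_wins):
        bh, bw = H // n, W // n
        for i in range(n):
            for j in range(n):
                block = face[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                feats.append(fusion_feature(block, P=8, R=R, lpq_win=M))
    return np.concatenate(feats)   # concatenated spatial-pyramid histogram
```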

3. Experimental Results and Discussion

3.1. Experimental Setting

For a fair comparison, the Images of Groups (IoG) dataset is used, the largest database of its kind, collected from Flickr images [29]. The dataset consists of 5050 images with 28231 faces labelled by age and gender (see Table 1).

IoG covers unconstrained facial expressions, different head poses, a wide age range, and close-to-real-world illumination. Some images show people sitting, lying, standing on elevated surfaces, or even wearing dark glasses. In addition, some images in IoG are of low resolution or contain occluded faces or unusual facial expressions (see Figure 5).

With the features extracted from the face images in IoG, we use an SVM to perform classification. The Radial Basis Function (RBF) kernel is utilized, and the parameters are tuned by cross-validated grid search: the penalty term C and the kernel parameter gamma are each searched over a logarithmic grid, multiplying the parameter by a constant factor at each step, and the best-performing pair is selected for the final model. Without loss of generality, 14116 face images are randomly selected for training, while the remaining face images are used for testing. The results obtained with the proposed method are compared with a traditional pattern classifier, Principal Component Analysis (PCA), in Section 3.2.
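A scikit-learn sketch of this training setup follows; the logarithmic grid bounds are common defaults and are our assumptions, while the 14116-image training split matches the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def train_gender_svm(X, y):
    """RBF-kernel SVM with cross-validated grid search over C and gamma.
    X: MS3F feature vectors, y: binary gender labels."""
    param_grid = {'C': 2.0 ** np.arange(-5, 16, 2),      # assumed bounds
                  'gamma': 2.0 ** np.arange(-15, 4, 2)}  # assumed bounds
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=14116,
                                              random_state=0)
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=-1)
    search.fit(X_tr, y_tr)
    return search.best_estimator_, search.score(X_te, y_te)  # test accuracy
```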

3.2. Experimental Results

We evaluated the performance of the proposed method in various experiments. The results show that it achieves higher accuracy than PCA; even a single extracted feature combined with an SVM achieves better accuracy (see Table 2). The performance of gender classification is evaluated as follows:
$$\text{Accuracy} = \frac{N_c}{N_t},$$
where $N_c$ is the number of face images classified correctly and $N_t$ is the total number of face images.

With a satisfactory classifier in hand, the focus of our work turns to feature extraction. As mentioned above, the image is divided before the features are fused. To evaluate the division we adopted, four division experiments are conducted to find the best one (see Table 3): the image is divided horizontally or vertically, with LBP and LPQ each assigned to one of the two blocks.

It is easy to see that the UpLBP-DownLPQ division achieves the highest accuracy. Dividing vertically captures repeated information, because the face is almost symmetric. The main areas of the top block are the hair and eyes, where the gray value changes markedly, while the bottom block contains the chin and nose, where the contour is clear. The proposed division therefore extracts more valuable information and obtains the best accuracy.

Based on the fusion feature, MB and ML are adopted to improve classification performance: the image is divided into subblocks and the feature is extracted from each subblock until most of the local features are captured. At scale $n$, the image is divided into $2^n \times 2^n$ subblocks, where $n$ is an integer from 0 to 3, giving four different sizes of MB in this paper: one subblock in MB1, $2 \times 2$ subblocks in MB2, $4 \times 4$ subblocks in MB3, and $8 \times 8$ subblocks in MB4. ML is obtained by combining MBs, so there are four MLs in total; the last ML concatenates the histograms of all these subblocks (see Table 4).

MS3F is the critical technique proposed in this paper, and its key is the window size of the feature descriptors. To find the most appropriate window sizes for MS3F, we compare two different scales of MS3F against the basic window size (see Figure 6).

It can be clearly observed that MS3F shows the best performance, especially Multiscale 2. The number of pixels involved grows as the window size increases, and only an appropriate window size achieves the best performance. To evaluate the effect of the multiscale approach in isolation, we also compare basic LPQ with multiscale LPQ (see Figure 7); the multiscale LPQ clearly performs better.

All experiments are performed on the Images of Groups dataset for gender classification. A comparison against the state-of-the-art is given in Table 5, demonstrating the advantage of our proposed approach.

4. Conclusion

Gender classification is one of the most important tasks in computer vision. To address the low accuracy of gender classification on images captured under uncontrolled conditions, a new approach, MS3F, is proposed in this paper. The LBP and LPQ feature descriptors are used together to extract features from face images. The proposed approach is tested on the IoG dataset and compared with state-of-the-art methods in terms of accuracy. The experimental results show that it achieves better performance for gender classification on face images captured in daily life.

Data Availability

The previously reported IoG database was used to support this study and is available at http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html. The prior study (and dataset) is cited at the relevant places within the text as reference [29].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by National Natural Science Foundation of China (61876112, 61303104, 61601311, and 61603022), Natural Science Foundation of Beijing (4162017), Support Project of High-Level Teachers in Beijing Municipal Universities in the period of 13th Five-Year Plan (CIT&TCD20170322), Project of Beijing Excellent Talents (2016000020124G088), Beijing Municipal Education Research Plan Project (SQKM201810028018), Capacity Building for Sci-Tech Innovation-Fundamental Scientific Research Funds (025185305000/134/187/188/189), and also the Youth Innovative Research Team of Capital Normal University.