Abstract

It is very difficult to process and analyze high-dimensional data directly. Therefore, it is necessary to learn an underlying subspace of the high-dimensional data through a good dimensionality reduction algorithm, so as to preserve the intrinsic structure of the data and discard the less useful information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two popular dimensionality reduction methods for preprocessing high-dimensional sensor data. LDA comprises two basic methods, namely, classic linear discriminant analysis and FS linear discriminant analysis. In this paper, a new method, called similar distribution discriminant analysis (SDDA), is proposed based on the similarity of the samples’ distributions, and a method for solving the optimal discriminant vectors is given. These discriminant vectors are orthogonal and nearly statistically uncorrelated. SDDA overcomes the disadvantages of PCA and LDA, and the features it extracts are more effective; its recognition performance exceeds that of PCA and LDA by a large margin. Experiments on the Yale face database, the FERET face database, and the UCI multiple features dataset demonstrate that the proposed method is effective, and the results show that SDDA obtains better performance than the compared dimensionality reduction methods.

1. Introduction

The data collected by various sensors (such as visual sensors and sound sensors) are mostly high-dimensional, which makes later processing and analysis inconvenient. In order to use these high-dimensional data effectively, effective dimensionality reduction algorithms must be adopted. In fact, dimensionality reduction is an effective data preprocessing method: it reduces the size of the data while retaining the valid information, which greatly facilitates later analysis and computation. In pattern recognition, data dimensionality reduction has a wide range of applications. Principal component analysis (PCA) [1, 2], based on the K-L transform, and linear discriminant analysis (LDA) [3–6] are the two most widely used dimensionality reduction methods. PCA and LDA have been widely used in the analysis and processing of various types of data; they can be used in data compression, data preprocessing, data mining, data retrieval, data classification, and so on. Independent component analysis (ICA) is a data processing method developed from blind source separation, which decomposes the original data into independent components. ICA helps to find the maximally independent projection directions as the dimension of the data is reduced, but ICA requires a preprocessing step, namely, PCA and whitening. In the pattern recognition field, some researchers have shown through experimental comparisons that the overall performance of ICA is not better than that of PCA [7, 8]. At present, PCA and LDA have many applications in image processing, voice processing, communication, networking, and other areas. Many researchers [9–27] have carried out extended research based on LDA and PCA and have made progress, but PCA and LDA also have some shortcomings. The disadvantage of PCA is that the data after dimensionality reduction have no clustering characteristics, so the classification accuracy obtained with the reduced features is uncertain. The disadvantage of LDA is that it tends to overfit the training samples, so the classification accuracy is closely tied to the characteristics of the training samples. PCA is a dimensionality reduction method that maintains the maximum dispersion of the samples. However, category information is not used in the dimensionality reduction process of PCA, with the result that the accuracy obtained with the minimum distance measurement is usually lower than that obtained with the nearest neighbor measurement. In contrast, LDA obtains the projection that best discriminates the classes, so the accuracy of LDA with the minimum distance method is close to that with the nearest neighbor method. Because LDA takes category information into account, its test accuracy is better than that of PCA; on the other hand, LDA may overfit the training set, which worsens its generalization ability. In particular, when there is a considerable difference between the training set and the test set, the test results of LDA are likely to be unsatisfactory [7, 28, 29]. Pattern recognition for images has high application and research value, so it has become a hot research area in pattern recognition and machine vision, especially face recognition [30–42]. Meanwhile, face recognition is also an important way to verify the effectiveness of pattern recognition methods, and LDA-based methods are widely used in face recognition [23–27, 30–42].
Research on LDA can be traced back to a classic paper by Fisher in 1936 [3]. The basic idea is to choose the vector that maximizes the Fisher criterion function as the optimal projection vector, so that the projected samples achieve the maximum between-class scatter and the smallest within-class scatter. Building on Fisher's idea, Wilks and Duda proposed classic linear discriminant analysis (CLDA), respectively [4, 5]. Foley and Sammon proposed a method called FS linear discriminant analysis (FSLDA) [6], in which a set of optimal discriminant vectors satisfying an orthogonality condition is used for dimensionality reduction. The specific algorithm for solving the optimal discriminant vectors in the two-class case was presented by Foley, and the solution for the multiclass case was given by Duchene and Leclercq [43]. Jin et al. proposed the concept of uncorrelated linear discriminant analysis (ULDA) [44, 45], which derives optimal discriminant vectors from the perspective of statistical uncorrelation. A simple algorithm for solving the optimal set of uncorrelated discriminant vectors is presented in [45], where it is also pointed out that ULDA is equivalent to CLDA when the eigenvalues of the generalized characteristic equation corresponding to the Fisher criterion function are all distinct.

Although the discriminant vectors of CLDA are statistically uncorrelated, they are not orthogonal. In contrast, the discriminant vectors of FSLDA are orthogonal but statistically correlated. Some researchers have argued that the performance of orthogonal discriminant vectors is better than that of statistically uncorrelated ones [46, 47], while others hold the contrary opinion [45, 48]. In fact, each of these two properties is suited to particular situations, and considering only one of them is deficient. Firstly, nonorthogonal discriminant vectors are unfavorable for extracting useful features, which weakens generalization to test samples. Especially when the number of training samples is small and the distance between samples is small, the test performance of CLDA is inferior to that of FSLDA. Secondly, the discriminant vectors of FSLDA are orthonormal; however, in the case of fewer categories, more samples per class, and larger intraclass dispersion, the redundancy between the discriminant features obtained by the individual discriminant vectors is very large, that is to say, the statistical uncorrelation of the FSLDA features is very poor. For example, in character recognition, the performance of FSLDA is significantly worse than that of CLDA.

In general, statistical uncorrelation holds strictly only for the training samples and only approximately for the test samples. Therefore, only near statistical uncorrelation should be required of the optimal discriminant vectors. On the other hand, orthogonality is a strict constraint, which reflects the perpendicularity of the axes in Euclidean space and enhances generalization to test samples. We therefore conclude that the discriminant vectors of the most effective discriminant method should be orthogonal and nearly statistically uncorrelated.

To solve the above problems, this paper presents a similar distribution discriminant analysis (SDDA) method based on the similarity of the samples’ distributions. The advantage of SDDA is that the projection vectors are orthogonal, the data after dimensionality reduction are nearly statistically uncorrelated, and the distribution of the projected data approximates the distribution of the principal components of the sample class means. The proposed method uses the statistically uncorrelated characteristics of PCA, combines PCA with the class labels of the samples, and gives the solution for the optimal discriminant vectors. These discriminant vectors are orthogonal and nearly statistically uncorrelated. The SDDA algorithm requires that the distribution of the data after dimensionality reduction be similar to the distribution of the principal components of the original samples; that is to say, the principal component characteristics of the original samples are well preserved in the process of dimensionality reduction. The principal component property suppresses overfitting well, which solves the overfitting problem of LDA. The proposed SDDA method overcomes the disadvantages of the two basic methods of LDA and combines their advantages, so the extracted discriminative features are more effective, which improves recognition performance and adaptability. Finally, the effectiveness is validated through experiments on the Yale face database, the FERET face database, and the UCI multiple features dataset. The results indicate that the recognition accuracy of SDDA is superior to that of PCA and the two basic methods of LDA.

The $N$ samples in the training set come from $c$ categories $\omega_1, \omega_2, \dots, \omega_c$, where $N = \sum_{j=1}^{c} N_j$ and $j = 1, 2, \dots, c$. $N_j$ is the number of samples in the $j$th class, and $x_i^{(j)}$ is the $i$th sample of the $j$th class. All samples are $m$-dimensional column vectors. Thus, the within-class scatter matrix $S_w$, the between-class scatter matrix $S_b$, and the total scatter matrix $S_t$ are defined, respectively, by the following expressions:

$$S_w = \sum_{j=1}^{c} P_j \frac{1}{N_j} \sum_{i=1}^{N_j} \bigl(x_i^{(j)} - \mu_j\bigr)\bigl(x_i^{(j)} - \mu_j\bigr)^T, \qquad (1)$$

$$S_b = \sum_{j=1}^{c} P_j (\mu_j - \mu)(\mu_j - \mu)^T, \qquad (2)$$

$$S_t = S_w + S_b = \frac{1}{N} \sum_{j=1}^{c} \sum_{i=1}^{N_j} \bigl(x_i^{(j)} - \mu\bigr)\bigl(x_i^{(j)} - \mu\bigr)^T, \qquad (3)$$

where $\mu_j$ denotes the mean vector of all samples in the $j$th class and $\mu$ is the expected mean vector of all samples. $P_j$ is the prior probability of the samples in the $j$th class, which is generally taken as $P_j = N_j / N$. Then, the mean vector of all samples can be represented as $\mu = \sum_{j=1}^{c} P_j \mu_j$.
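
As a concrete illustration, the scatter matrices in (1)–(3) can be computed with the following minimal NumPy sketch (the data layout and function name are ours, for illustration only):

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class (Sw), between-class (Sb), and total (St) scatter.
    X: (m, N) array, one m-dimensional sample per column.
    labels: length-N array of class indices."""
    m, N = X.shape
    mu = X.mean(axis=1, keepdims=True)                 # overall mean vector
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        Nc = Xc.shape[1]
        mu_c = Xc.mean(axis=1, keepdims=True)          # class mean vector
        Pc = Nc / N                                    # prior probability P_j = N_j / N
        Dc = Xc - mu_c
        Sw += Pc * (Dc @ Dc.T) / Nc
        Sb += Pc * (mu_c - mu) @ (mu_c - mu).T
    return Sw, Sb, Sw + Sb                             # St = Sw + Sb
```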

2.1. Classical Principal Component Analysis

The criterion function of PCA is defined as (4). The vectors $w_1, w_2, \dots, w_d$ in the optimal projection matrix make (4) reach the maximum, and they form a group of orthonormal vectors. The physical significance of (4) is that the total dispersion of the projected features is maximized:

$$J(w) = w^T S_t w, \quad \text{subject to } w^T w = 1. \qquad (4)$$

Actually, the vectors $w_1, w_2, \dots, w_d$ in the optimal projection matrix are the orthonormal eigenvectors of $S_t$ corresponding to its $d$ largest eigenvalues. The criterion function of PCA can also be represented as follows:

$$J(W) = \operatorname{tr}\bigl(W^T S_t W\bigr), \quad \text{subject to } W^T W = I. \qquad (5)$$

And the best projection matrix is $W_{\mathrm{opt}} = [w_1, w_2, \dots, w_d]$.
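
As an illustration, the PCA projection in (5) can be computed as follows (a minimal NumPy sketch with samples stored as columns; not the authors' implementation):

```python
import numpy as np

def pca(X, d):
    """Return W_opt = [w_1, ..., w_d]: the orthonormal eigenvectors of the
    total scatter matrix S_t associated with its d largest eigenvalues."""
    Xc = X - X.mean(axis=1, keepdims=True)
    St = Xc @ Xc.T / X.shape[1]          # total scatter matrix
    vals, vecs = np.linalg.eigh(St)      # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :d]             # d leading eigenvectors
    return W                             # reduced features: W.T @ X
```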

2.2. Linear Discriminant Analysis (LDA)

LDA was first proposed by Fisher. The basic idea is to select the vector $w$ that maximizes the Fisher criterion function and to take $w$ as the optimal projection direction, which is also called the optimal discriminant vector. Then, the ratio of the between-class dispersion to the within-class dispersion reaches its maximum after the samples are projected in this direction. The Fisher discriminant criterion function is defined as

$$J_F(w) = \frac{w^T S_b w}{w^T S_w w}, \qquad (6)$$

where $S_w$ is the within-class scatter matrix, $S_b$ is the between-class scatter matrix, and $w$ is any nonzero $m$-dimensional column vector.

The Fisher criterion function combines the between-class and within-class dispersion of samples skillfully and provides a perfect criterion for determining the optimal projection direction.
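
For reference, in the two-class case the maximizer of (6) has the well-known closed form (a standard result stated here for completeness, not taken from this paper):

$$w^{*} \propto S_w^{-1}(\mu_1 - \mu_2),$$

where $\mu_1$ and $\mu_2$ are the two class mean vectors.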

2.2.1. Classic Linear Discriminant Analysis

Inspired by Fisher's two-class method, Wilks and Duda extended the problem of finding one optimal projection direction for two classes to that of finding multiple optimal projection directions for multiple classes. Their idea is called classic Fisher linear discriminant analysis, and the classic Fisher discriminant criterion function is (7) or (8):

$$J(W) = \frac{\bigl|W^T S_b W\bigr|}{\bigl|W^T S_w W\bigr|}, \qquad (7)$$

$$J(W) = \operatorname{tr}\Bigl(\bigl(W^T S_w W\bigr)^{-1}\bigl(W^T S_b W\bigr)\Bigr). \qquad (8)$$

In fact, the column vectors of the optimal projection matrix of classic Fisher linear discriminant analysis are the eigenvectors corresponding to the $d$ largest eigenvalues of the generalized characteristic equation $S_b w = \lambda S_w w$ (at most $c-1$ of these eigenvalues are nonzero).
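
As an illustration, the CLDA projection can be obtained by solving this generalized eigenvalue problem directly; the sketch below assumes $S_w$ is nonsingular and is not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def clda(Sw, Sb, d):
    """Columns are the eigenvectors of Sb w = lambda Sw w belonging to the
    d largest eigenvalues (at most c - 1 of them are nonzero)."""
    vals, vecs = eigh(Sb, Sw)        # symmetric-definite generalized eigenproblem
    return vecs[:, ::-1][:, :d]      # reorder so the largest eigenvalue comes first
```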

2.2.2. FS Linear Discriminant Analysis

FSLDA aims at finding a set of optimal discriminant vectors $w_1, w_2, \dots, w_d$. They maximize the Fisher criterion function and satisfy the following orthogonality condition:

$$w_i^T w_j = 0, \quad i \ne j, \quad i, j = 1, 2, \dots, d. \qquad (9)$$

The first vector $w_1$ of the FS optimal discriminant vectors is the Fisher optimal discriminant direction, that is, the unit eigenvector corresponding to the maximum eigenvalue of the generalized characteristic equation $S_b w = \lambda S_w w$. After the first $r-1$ discriminant vectors $w_1, \dots, w_{r-1}$ have been found, the $r$th discriminant vector $w_r$ is obtained by solving the following optimization problem:

$$w_r = \arg\max_{w} J_F(w) \quad \text{subject to } w^T w_i = 0, \quad i = 1, 2, \dots, r-1. \qquad (10)$$

In fact, $w_r$ is the eigenvector corresponding to the maximum eigenvalue of the generalized characteristic equation

$$M_r S_w^{-1} S_b\, w = \lambda w, \qquad (11)$$

where $D_{r-1} = [w_1, w_2, \dots, w_{r-1}]^T$ and $M_r = I - D_{r-1}^T\bigl(D_{r-1} S_w^{-1} D_{r-1}^T\bigr)^{-1} D_{r-1} S_w^{-1}$.
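
One practical way to obtain the FS discriminant vectors is to restrict the search for $w_r$ to the orthogonal complement of $w_1, \dots, w_{r-1}$, which enforces the constraints in (10) exactly. The following NumPy/SciPy sketch follows this route (illustrative only, not the authors' implementation; it assumes $S_w$ is nonsingular):

```python
import numpy as np
from scipy.linalg import eigh, null_space

def fslda(Sw, Sb, d):
    """Greedy Foley-Sammon vectors: maximize w^T Sb w / w^T Sw w
    subject to orthogonality with the previously found vectors."""
    m = Sw.shape[0]
    W = np.zeros((m, 0))
    for _ in range(d):
        # Orthonormal basis Q of the orthogonal complement of span(W);
        # writing w = Q v enforces w^T w_i = 0 for all previous w_i.
        Q = null_space(W.T) if W.shape[1] else np.eye(m)
        _, V = eigh(Q.T @ Sb @ Q, Q.T @ Sw @ Q)    # reduced generalized eigenproblem
        w = Q @ V[:, -1]                           # eigenvector of the largest eigenvalue
        W = np.hstack([W, (w / np.linalg.norm(w))[:, None]])
    return W
```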

3. Similar Distribution Discriminant Analysis (SDDA)

By reducing the dimension, the proposed SDDA method makes the overall distribution of the extracted features as close as possible to the principal component distribution while the extracted features minimize the within-class dispersion. In other words, the extracted features not only discriminate well but also retain the principal component characteristics. At the same time, the optimal discriminant vectors are orthogonal and nearly statistically uncorrelated, which makes the extracted discriminant features more effective and improves classification and recognition performance.

3.1. Theoretical Framework of SDDA

Suppose there are two points $a$ and $b$ in $m$-dimensional space, representing the vectors $a = (a_1, a_2, \dots, a_m)^T$ and $b = (b_1, b_2, \dots, b_m)^T$, respectively. For the similarity between $a$ and $b$, the following correlation-type similarity measure is usually adopted:

$$r(a, b) = \frac{\tilde{a}^T \tilde{b}}{\|\tilde{a}\| \, \|\tilde{b}\|}, \qquad (12)$$

where $\tilde{a} = a - \bar{a}\mathbf{1}$ and $\tilde{b} = b - \bar{b}\mathbf{1}$, in which $\bar{a}$ and $\bar{b}$ represent the means of all elements in $a$ and $b$, respectively, and $\mathbf{1}$ is the all-ones vector. The larger the value of $r(a, b)$ is, the more similar the two vectors are, and $r(a, b) = 1$ means that the two vectors are completely similar.

The similarity measure is then extended from two vectors to two sets of vectors. Suppose one set of vectors is $A = [a_1, a_2, \dots, a_n]$ and the other is $B = [b_1, b_2, \dots, b_n]$, where $a_i$ and $b_i$ are both $m$-dimensional column vectors. Set $\tilde{A} = [a_1 - \bar{a}, \dots, a_n - \bar{a}]$ and $\tilde{B} = [b_1 - \bar{b}, \dots, b_n - \bar{b}]$, where the vectors $\bar{a}$ and $\bar{b}$ represent the means of all column vectors in $A$ and $B$, respectively. Let $\tilde{A}_{(j)}$ and $\tilde{B}_{(j)}$ denote the $j$th row vectors of $\tilde{A}$ and $\tilde{B}$; then the similarity of $A$ and $B$ can be defined as (13) or (14):

$$R(A, B) = \frac{1}{m} \sum_{j=1}^{m} \frac{\tilde{A}_{(j)} \tilde{B}_{(j)}^T}{\|\tilde{A}_{(j)}\| \, \|\tilde{B}_{(j)}\|}, \qquad (13)$$

$$R(A, B) = \frac{\operatorname{tr}\bigl(\tilde{A} \tilde{B}^T\bigr)}{\|\tilde{A}\|_F \, \|\tilde{B}\|_F}, \qquad (14)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Equation (14) is easier to analyze, so it is adopted in this paper. $R(A, B) = 1$ indicates that the distributions of the two sets of vectors are completely consistent, which is called distribution equivalence.
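
The two similarity measures can be computed as follows (a minimal sketch; the trace-normalized form used for the two sets of vectors is our reading of (14) and should be treated as an assumption):

```python
import numpy as np

def vec_similarity(a, b):
    """Similarity of two vectors as in (12): correlation of the centered vectors."""
    ta, tb = a - a.mean(), b - b.mean()
    return float(ta @ tb) / (np.linalg.norm(ta) * np.linalg.norm(tb))

def set_similarity(A, B):
    """Similarity of two sets of column vectors, one reading of (14):
    centered matrices compared via a normalized trace inner product."""
    TA = A - A.mean(axis=1, keepdims=True)
    TB = B - B.mean(axis=1, keepdims=True)
    return float(np.trace(TA @ TB.T)) / (np.linalg.norm(TA) * np.linalg.norm(TB))
```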

For a given matrix $Y$ (in other words, a matrix with a certain expected distribution), the dimension of the samples is larger than that of $Y$ (the dimension of $Y$ is $d \times n$). If the discriminative features of the samples are to be extracted by dimension reduction such that the distribution of the overall discriminant features is closest to the expected distribution, then we need to find an optimal projection matrix $W$ satisfying the condition

$$W_{\mathrm{opt}} = \arg\max_{W} R\bigl(W^{T} X, Y\bigr), \qquad (15)$$

where $w_1, w_2, \dots, w_d$ are the column vectors of $W$.

Given a set of $n$ samples $X = [x_1, x_2, \dots, x_n]$ from $c$ classes, where $x_i$ is an $l$-dimensional column vector, set $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n]$, in which $\tilde{x}_i = x_i - \mu$ and $\mu$ is the mean vector of all the samples.

Because the principal components of samples are statistically uncorrelated, the expected matrix $Y$ is constructed from principal components; then, by solving for a projection matrix $W$ with orthogonal columns, the discriminant vectors can have both orthogonal and nearly statistically uncorrelated characteristics. In addition, the obtained discriminant vectors should be helpful for classification, that is to say, the expected matrix should have the smallest within-class distance. Therefore, the expected matrix is built from the principal components of the sample class means, and the expected vectors belonging to the samples of the same class are identical.

Let $Z = [z_1, z_2, \dots, z_c]$ be the set of the principal components of the sample class means in the total sample set $X$, where $z_i = P^T \mu_i$. $\mu_i$ is the mean vector of the $i$th class. $P$ is the projection matrix of the principal components of the class means, which consists of the orthonormal column vectors corresponding to the nonzero eigenvalues.

The principal component extension matrix $Y$ can be defined as

$$Y = [Y_1, Y_2, \dots, Y_c], \qquad (16)$$

where $Y_i = [z_i, z_i, \dots, z_i] \in \mathbb{R}^{d \times n_i}$, in which $n_i$ is the number of samples of the $i$th class.

The set of class mean principal components $Z$ has statistically uncorrelated characteristics, which means that $\tilde{Z}\tilde{Z}^T = \Lambda$ and the matrix $\Lambda$ is diagonal, where $\tilde{Z}$ denotes the column-centered $Z$. For the same number of samples per class, $n_i = n/c$, we get $\tilde{Y}\tilde{Y}^T = (n/c)\,\tilde{Z}\tilde{Z}^T$. Therefore, the principal component extension matrix $Y$ also has the property of statistical uncorrelation.
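
The construction of $Z$ and the principal component extension matrix $Y$ described above can be sketched as follows (illustrative code under our reading of the construction; whether the class means are centered before the PCA step is an assumption):

```python
import numpy as np

def pc_extension_matrix(X, labels, d):
    """Build Z = [z_1, ..., z_c] (principal components of the class means)
    and Y, which assigns z_i to every sample of class i (d <= c - 1)."""
    classes = np.unique(labels)
    M = np.column_stack([X[:, labels == c].mean(axis=1) for c in classes])  # class means
    Mc = M - M.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(Mc @ Mc.T / len(classes))  # PCA of the class means
    P = vecs[:, ::-1][:, :d]                               # eigenvectors of the nonzero eigenvalues
    Z = P.T @ M                                            # z_i = P^T mu_i
    idx = {c: k for k, c in enumerate(classes)}
    Y = Z[:, [idx[l] for l in labels]]                     # one column per sample, repeated per class
    return Z, Y
```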

3.2. Solution to the Projection Matrix

Because of the statistically uncorrelated characteristics of the principal component extension matrix $Y$, matching the projected data to $Y$ makes the extracted features statistically uncorrelated to some extent, so the discriminant vectors to be solved only need to be mutually orthogonal. That is to say, after the first $r-1$ discriminant vectors $w_1, w_2, \dots, w_{r-1}$ have been solved, the $r$th discriminant vector $w_r$ is obtained by solving the optimization problem in (17).

In order to obtain the $r$th discriminant vector $w_r$, we define the function in (18), where the vector involved is given and therefore has no effect on the solution of $w_r$; thus, (18) is rewritten as (19).

According to the method of Lagrange multipliers, $w_r$ makes the Lagrangian function in (20) achieve its maximum value.

Taking the derivative of (20) with respect to $w_r$ and setting it to zero, we obtain (21).

We introduce the definitions used below and thus rewrite (21) as (22).

Multiplying both sides of (21) by $w_r^T$ and noting that, according to (17), the third term is zero, we obtain (23).

Hence, the solution to the problem is to maximize the value of $\lambda$.

Multiplying both sides of (21) by $w_i^T$ ($i = 1, 2, \dots, r-1$) and noting that, according to (17), the second term is zero, we get (24), which can be rewritten as (25).

And (22) is rewritten as (26).

Thus, the corresponding updating rule is presented in (27).

By combining formulas (27) and (22), we have (28).

After some rearrangement, we obtain (29), and $w_r$ is the eigenvector corresponding to the largest eigenvalue of the generalized characteristic equation (29). In order to satisfy the constraint $w_r^T w_r = 1$, $w_r$ needs to be adjusted to $w_r / \|w_r\|$ after being calculated, where $\|w_r\| = \sqrt{w_r^T w_r}$.

Remarkably, if the scatter matrix involved is singular, all the samples first have to be compressed by the K-L transform to reduce the original samples from high-dimensional to low-dimensional ones, so that the matrix is guaranteed to be invertible after dimensionality reduction.
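
The K-L (PCA) preprocessing step mentioned here can be sketched as follows (illustrative only; the number of retained dimensions is a user choice made so that the scatter matrices in the reduced space are nonsingular):

```python
import numpy as np

def kl_preprocess(X, keep):
    """Project the samples onto their 'keep' leading principal directions so
    that the scatter matrices computed afterwards are invertible."""
    Xc = X - X.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
    P = vecs[:, ::-1][:, :keep]          # leading K-L basis vectors
    return P.T @ X, P                    # reduced samples and the basis
```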

The SDDA method proposed in this paper mainly addresses the adaptability problem of the two basic methods of LDA in various applications. The architecture of the proposed SDDA starts from classic PCA and classic LDA, so SDDA itself is also a basic method, at the same level as the compared methods, and can be used as a supplement to classic PCA and classic LDA. Therefore, in this paper, SDDA is only used as a basic method and compared with the existing classical methods. Actually, techniques for improving PCA and LDA can also be applied to the proposed SDDA method. For example, following the construction of KPCA, KFSLDA, and KFDA, kernel techniques could be used to construct a kernel SDDA (KSDDA).

4. Experiment Results and Analysis

We conduct experiments on the Yale face database, the FERET face database, and the UCI multiple features dataset to demonstrate the adaptability and effectiveness of the proposed algorithm on different objects. SDDA is compared with PCA and the two basic methods of LDA (CLDA and FSLDA), and the comparison results are analyzed.

4.1. Experiment on the Yale Face Database

The Yale face database [49] contains images of 15 volunteers, with 11 images per person. The different images of each person vary considerably in expression and illumination. Figure 1 shows the 11 images of one person in the Yale face database.

Since the discriminant vectors of SDDA are obtained under the orthonormality constraint, it is not necessary to verify their orthogonality experimentally. We therefore only conduct an experiment to verify the statistically uncorrelated characteristic and show it intuitively with statistical uncorrelation diagrams. The element in the $i$th row and $j$th column of such a diagram measures the correlation between the features extracted by the $i$th and $j$th discriminant vectors. As shown in Figure 2, apart from the diagonal elements, the closer the values at the other locations are to 0 (black), the better the statistical uncorrelation between the discriminant vectors. Comparing the statistical uncorrelation diagrams of SDDA, CLDA, FSLDA, and PCA shows that the discriminant vectors obtained by SDDA are almost completely statistically uncorrelated, while FSLDA has poor statistical uncorrelation.

In order to evaluate the performance of the proposed SDDA, we conduct two sets of experiments on the Yale face database. One set selects the odd-numbered images (6 samples) of each person as the training set and the even-numbered images (5 samples) as the test set; the other set selects the even-numbered images (5 samples) of each person for training and the odd-numbered images (6 samples) for testing. The final results are the average of the two sets of experiments. Both the minimum distance and nearest neighbor classifiers are adopted as measurement methods in this paper.

Because of the small number of samples in the Yale face database, a lower dimension is used in the experiment; that is to say, only a few projection vectors are used to extract features. In this experiment, the number of features equals the reduced dimension of the samples, and the dimension of the samples is reduced by the various algorithms. Table 1 shows the experimental results of the algorithms for feature numbers from 4 to 11. As the number of features increases, the accuracy of each algorithm improves. When the dimension of the samples is reduced to nine, the test accuracy of SDDA already exceeds the maximum accuracy of the other algorithms. For the same number of features, the results of SDDA are better than those of LDA and PCA under both the minimum distance and nearest neighbor measurements. FSLDA performs similarly to CLDA, while PCA performs worst because it ignores category information. The results of this experiment demonstrate that the discriminant vectors of SDDA not only have principal component characteristics but also are orthogonal and nearly statistically uncorrelated, so its performance is the best.

4.2. Experiment on the FERET Database

To validate the effect of SDDA on a dataset with many categories, we choose the FERET face database [50]. The FERET face database contains 1400 images of 200 persons. For each person, there are 7 images, whose file names contain the identification strings “ba,” “bj,” “bk,” “be,” “bf,” “bd,” and “bg” to indicate the variation in each image. The samples contain changes in pose, illumination, and expression. In the experiment, the face in each original image is cropped according to the position of the eyes, resized to a uniform resolution, and preprocessed with histogram equalization. Figure 3 shows the 7 images of one person in the database.

We also conduct two sets of experiments on the FERET face database: in one set, the images whose file names contain the identification strings ba, bd, be, and bf are used for training and the remaining images for testing; in the other set, the images with ba, be, bg, and bk are used for training and the rest for testing. The mean of the two results is taken as the final result. Both the minimum distance and nearest neighbor measurement methods are used.

Because the FERET database contains many face images and the number of classes is 200, we use the various algorithms to reduce the dimension of the samples from 9 to 99. In order to reflect the trend of each algorithm as the dimension changes, we add line charts showing the accuracy of each algorithm at different dimensions. The experimental results are shown in Figures 4 and 5 and Table 2. From the results, we can see that the SDDA algorithm is much better than the other algorithms at the lowest dimension. As the dimension increases, the test accuracy of SDDA approaches its maximum quickly: at dimension 29, it already exceeds the maximum accuracy of the other algorithms, and at dimension 59, it reaches its own maximum. Because of the large difference between the training samples and the test samples, the training sample space cannot cover the test sample space well. The discriminant vectors of CLDA are statistically uncorrelated, which fits the training data too tightly and results in the worst test accuracy. The PCA method maintains a certain test accuracy even though category information is not considered. FSLDA ranks second; its discriminant vectors are orthogonal and it has good generalization ability. The proposed SDDA method is orthogonal and nearly statistically uncorrelated; its test results are the best, and its recognition accuracy is significantly higher than that of the other three methods.

4.3. Experiment on the UCI Multiple Features Dataset

In the previous two experiments, the number of samples per person is small. To further evaluate the performance of SDDA, the UCI multiple features dataset [51] is used. The UCI multiple features dataset contains six feature sets of handwritten digits 0 to 9. Each feature set is divided into 10 categories, each of which has 200 samples, for a total of 2000 samples.

The 240-dimensional pixel average feature of the sample data is selected to reflect the dimensionality reduction effect of these algorithms. The minimum distance measurement method is used to verify the effectiveness of the feature. We also conduct two experiments.

The first experiment randomly selects 100 samples from each class as the training set, and the remaining 100 samples of each class are used for testing. The experiments are repeated 10 times, and the average results are given in Table 3. The SDDA algorithm still achieves the best results. Because a large number of samples is selected for each class, the advantage of CLDA is reflected: its performance improves greatly and ranks second, while the performance of FSLDA drops greatly and ranks fourth. We can see that the results of SDDA and CLDA are similar, SDDA is significantly better than PCA and FSLDA, and PCA performs better than FSLDA.

In the second experiment, only 20 samples are selected from each class as the training set. The experiment is repeated 10 times, and the average results are reported. Figure 6 is a line chart of the results, with the dimension of the extracted features on the horizontal axis and the test accuracy (in percent) on the vertical axis. As the line chart shows, the results of CLDA are the worst, and its performance declines greatly. The results of PCA and FSLDA are similar, and the performance of SDDA is still the best.

The experiments on the UCI multiple features dataset show that, when the number of training samples per class is large, the dispersion of the samples is large, so the accuracy of the discriminant method with statistically uncorrelated vectors is clearly superior to that of the orthogonal one; that is to say, with larger sample sizes, CLDA performs much better than FSLDA. Conversely, FSLDA performs significantly better than CLDA when the number of training samples per class is small. With orthogonal and nearly statistically uncorrelated discriminant vectors, SDDA maintains the best performance regardless of the number of training samples. As can be seen from Figure 6, the accuracy of SDDA is higher than that of the compared algorithms for the same number of features, and the curve rises smoothly. The gradient of the accuracy curve decreases: the accuracy increases quickly at first and then more slowly as the number of features grows. The proposed method obtains superior performance with a small number of features, and the smoothly rising curve indicates that the overall performance of SDDA is stable and reliable.

Analyzing the experimental results on the different databases together, we can conclude that the SDDA method performs best on all databases, while the two basic methods of LDA have unstable performance. In contrast, the SDDA method has stronger adaptability than CLDA and FSLDA: it not only retains the principal component characteristics but also overcomes the shortcomings of CLDA and FSLDA and can extract more discriminative features. In addition, under the same sample conditions, CLDA has the fastest training speed, while FSLDA and SDDA have similar training speeds; the training speed of CLDA is about 1.5 times that of SDDA and FSLDA. At test time, all algorithms have the same speed because they use projection vectors of the same dimension.

5. Conclusions

For the large amounts of collected high-dimensional data, the proposed method can effectively reduce information overload and improve data transmission and processing, which makes this study all the more important. In view of the shortcomings of the two basic methods of LDA, this paper proposes a similar distribution discriminant analysis (SDDA) method and presents the solution for the optimal discriminant vectors. The optimal discriminant vectors are mutually orthogonal and nearly statistically uncorrelated. The proposed SDDA method is mainly aimed at the two basic methods of LDA: FSLDA, whose projection vectors are orthogonal, and CLDA, whose features are statistically uncorrelated after dimensionality reduction. The performance of these two algorithms differs on different data: FSLDA, with its orthogonal vectors, has stronger generalization ability, but as the training sample size increases, the performance of the statistically uncorrelated CLDA improves. Taking both characteristics into consideration, SDDA performs well regardless of sample size. SDDA actually combines the advantages of PCA and LDA to maintain optimal performance and better adaptability in every experiment. Extensive experiments on the Yale face database, the FERET face database, and the UCI handwritten digits multiple features dataset confirm that SDDA is a more effective and adaptable dimensionality reduction method, which extracts better discriminative features than CLDA, FSLDA, and PCA. Many theories and applications based on LDA can also be extended on the basis of the method proposed in this paper.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (grant no. 61503329) and Prospective Joint Research Project of Jiangsu Province (grant no. BY201506-01).