Abstract
With the advent of the big data era, feature selection for high- and ultra-high-dimensional data is increasingly important in statistics and machine learning. In this paper, we propose a marginal utility screening method based on mutual information, MI-SIS. The proposed marginal utility measure has several appealing features compared with existing independence screening methods. First, the procedure is model-free: it does not specify any relationship between the predictors and the response and is valid under a wide range of model settings, including parametric and nonparametric models. Second, it accommodates all combinations of continuous and categorical predictors and responses. Finally, the new procedure performs well at discovering weak signals in finite samples, and its computation is simple and easy to implement. We establish the sure screening property for the proposed procedure under mild conditions. Simulation experiments and real data applications illustrate the finite sample performance of the proposed procedure.
1. Introduction
With the development of information technology, more and more data are collected in many scientific areas. The number of predictors p can be larger than the sample size n; theoretically, classical results allow p to diverge at an exponential rate of the sample size. The difficulties of ultra-high-dimensional data analysis are obvious, the most typical being the sheer amount of computation. Fan and Lv [1] proposed a relatively new approach to this problem: a sure independence screening (SIS, hereafter) method based on the Pearson correlation. Since their work, there has been a great deal of research on feature screening. Wang [2] proposed forward regression for linear models; Fan et al. [3] designed a screening method based on marginal likelihood estimators for generalized linear models and robust regression. Fan and Song [4] established the sure independence screening property for generalized linear models. Within ultra-high-dimensional nonparametric modeling, various feature screening methods have also been proposed, including, but not limited to, nonparametric independence screening (Fan et al. [5]), conditional correlation sure independence screening (Liu et al. [6]), and iterative nonparametric independence screening (Fan et al. [7]). All these methods rely on specific model assumptions.
Model-based screening methods perform well when the underlying model is correctly specified. However, specifying a correct model is challenging in ultra-high-dimensional data analysis, and model-based screening methods may perform poorly under model misspecification. To overcome this problem, great efforts have been made to relax model assumptions and make screening methods less model-dependent, and new model-free screening methods have appeared in the recent literature. See the works of Zhu et al. [8], Li et al. [9], He et al. [10], Mai and Zou [11], Shao and Zhang [12], Cui et al. [13], among others. In particular, Zhu et al. [8] proposed a sure independent ranking and screening (SIRS) method to screen predictors in multi-index models and showed that SIRS enjoys the ranking consistency property. Li et al. [9] developed a distance-correlation-based screening method (DC-SIS), which requires no model specification, handles grouped predictor variables, and possesses the sure screening property. Shao and Zhang [12] proposed a martingale-difference-correlation-based screening method (MDC-SIS), which is also model-free, extends to quantile regression, and likewise enjoys the sure screening property.
For ultra-high-dimensional classification, Cui et al. [13] proposed a screening method (MV-SIS) that applies when either the predictors or the response is categorical. Huang et al. [14] proposed a screening method based on the Pearson chi-square statistic (PC-SIS), which applies when both the response and the predictors are categorical; by categorizing continuous variables, both MV-SIS and PC-SIS gain wider applicability. Inspired by the simplicity of MV-SIS and PC-SIS, in this paper we propose a new screening method based on mutual information. It has an advantage over PC-SIS and MV-SIS: it does not require categorizing the continuous variables. Categorization loses information, especially in small samples, and one does not know in advance how many categories to use. Cui et al. [13] suggested treating the number of categories as a tuning parameter determined by cross validation, but this clearly increases the computational cost, especially in higher dimensions.
Research on independence screening methods continues to flourish; we list only a few relevant studies. Zhang et al. [15] proposed a Gini correlation screening (GCS) method to select important variables in ultra-high-dimensional data. Zhou and Zhu [16] proposed a modified martingale difference correlation to remedy some drawbacks of the martingale difference correlation. Dai et al. [17] proposed a feature selection method based on kernel density estimation for interval-valued data. An et al. [18] proposed a new model for supervised multiclass feature selection with a norm in both the fidelity loss and the regularization terms, together with an additional constraint. Cuong et al. [19] established fundamental qualitative properties of the minimum sum-of-squares clustering problem, proving that the problem always has a global solution and that the global solution set is finite. For more details, we refer to the selective surveys by Kamolov [20, 21].
In this paper, we propose a new model-free screening method with wider applicability: a mutual-information-based screening method, which we call MI-SIS. Table 1 shows the application scope and algorithmic complexity of different feature screening methods. We systematically study the theoretical properties of MI-SIS and establish its sure screening property under mild conditions. The new procedure performs well at discovering weak signals in finite samples. Where their application scopes overlap, MI-SIS is comparable with MV-SIS and PC-SIS. When both the predictors and the response are categorical, the MI index is similar to the PC index proposed by Huang et al. [14]; thus, the efficiency of MI-SIS matches that of PC-SIS in this situation. Moreover, to assess the performance of the proposed method in finite samples, we conduct four Monte Carlo simulation studies and a real data analysis; MI-SIS performs well in all of them.
The rest of this paper is organized as follows. In Section 2, we propose the MI-SIS procedure and study its theoretical properties. In Section 3, we conduct four Monte Carlo simulation studies to examine the performance of MI-SIS in finite samples, especially with very small samples, and analyze a real dataset with very encouraging results. All technical proofs are given in Appendix A.
2. Independence Screening Using Mutual Information
2.1. Mutual Information
The mutual information between two random variables X and Y is defined in terms of their joint probability distribution as

MI(X, Y) = \iint f(x, y) \log \frac{f(x, y)}{f_X(x) f_Y(y)} \, dx \, dy,

where f(x, y) is the joint density of (X, Y) and f_X, f_Y are the marginal densities. MI(X, Y) is always nonnegative, and MI(X, Y) = 0 if and only if X and Y are independent.
The MI marginal measure for the k-th predictor is obtained by letting MI_k = MI(X_k, Y). The estimator of the mutual information based on a nonparametric density estimator is as follows. Let \{(X_{ki}, Y_i), i = 1, \ldots, n\} be a random sample of size n from the population (X_k, Y), and assume that (X_k, Y) has a continuous joint pdf. Define

\widehat{MI}_k = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat f(X_{ki}, Y_i)}{\hat f(X_{ki}) \, \hat f(Y_i)},

where the marginal and joint densities are estimated by

\hat f(x) = \frac{1}{n h_1} \sum_{j=1}^{n} K\!\left(\frac{x - X_{kj}}{h_1}\right), \quad \hat f(y) = \frac{1}{n h_2} \sum_{j=1}^{n} K\!\left(\frac{y - Y_j}{h_2}\right), \quad \hat f(x, y) = \frac{1}{n h_1 h_2} \sum_{j=1}^{n} K\!\left(\frac{x - X_{kj}}{h_1}\right) K\!\left(\frac{y - Y_j}{h_2}\right),

in which K is a kernel function and h_1, h_2 are the bandwidths of the nonparametric density estimates; in practice, the kernels most often used are the Gaussian and Epanechnikov kernels. The MI marginal measure in the other cases is as follows. When X_k is continuous and Y is categorical with support \{y_1, \ldots, y_R\},

MI_k = \sum_{r=1}^{R} p_r \int f(x \mid Y = y_r) \log \frac{f(x \mid Y = y_r)}{f(x)} \, dx,

where p_r = P(Y = y_r) and f(x \mid Y = y_r) is the conditional density of X_k given Y = y_r. When X_k is categorical and Y is also categorical,

MI_k = \sum_{r} \sum_{s} p_{rs} \log \frac{p_{rs}}{p_{r \cdot} \, p_{\cdot s}},

where p_{rs} = P(X_k = x_r, Y = y_s), p_{r \cdot} = P(X_k = x_r), and p_{\cdot s} = P(Y = y_s); in this case, the MI index is very similar to the PC index [14]. When X_k is categorical and Y is continuous,

MI_k = \sum_{r} p_r \int f(y \mid X_k = x_r) \log \frac{f(y \mid X_k = x_r)}{f(y)} \, dy,

where p_r = P(X_k = x_r) and f(y \mid X_k = x_r) is the conditional density of Y given X_k = x_r.
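As a concrete illustration of the plug-in estimator above, here is a minimal Python sketch (the paper's own computations were done in R); the Silverman-type rule-of-thumb bandwidth and the function names are our assumptions, not the paper's:

```python
import numpy as np

def kde1(samples, points, h):
    """1-D Gaussian kernel density estimate evaluated at `points`."""
    z = (points[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def mi_continuous(x, y):
    """Plug-in MI estimate for two continuous variables: the sample average
    of log joint density minus log product of the marginal densities."""
    n = len(x)
    # Silverman-type rule-of-thumb bandwidths (our assumption; any standard
    # nonparametric bandwidth selector could be substituted).
    hx = 1.06 * np.std(x) * n ** (-0.2)
    hy = 1.06 * np.std(y) * n ** (-0.2)
    zx = (x[:, None] - x[None, :]) / hx
    zy = (y[:, None] - y[None, :]) / hy
    fxy = np.exp(-0.5 * (zx**2 + zy**2)).sum(axis=1) / (n * hx * hy * 2 * np.pi)
    return np.mean(np.log(fxy) - np.log(kde1(x, x, hx)) - np.log(kde1(y, y, hy)))
```

On simulated data, this index comes out clearly larger for a dependent pair than for an independent one, mirroring the defining property that MI = 0 characterizes independence.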
2.2. An Independence Ranking and Screening Procedure
We now propose a new model-free sure independence screening procedure using MI_k for ultra-high-dimensional data analysis. Let Y be the response, which can be either a discrete or a continuous random variable, and let X = (X_1, \ldots, X_p)^T denote the predictor vector, where p \gg n and n is the sample size. Without specifying a regression model, define the active predictor index subset by

\mathcal{A} = \{ k : F(y \mid X) \text{ functionally depends on } X_k \text{ for some } y \},

where F(y \mid X) = P(Y \le y \mid X) is the conditional distribution function of Y given X, and define the inactive predictor index subset by \mathcal{I} = \{1, \ldots, p\} \setminus \mathcal{A}.
With the above notation, the active predictors are \{X_k : k \in \mathcal{A}\} and the inactive predictors are \{X_k : k \in \mathcal{I}\}. Our main purpose is to accurately recover the active predictor index subset \mathcal{A}.
We now calculate MI_k between each predictor and the response, for k = 1, \ldots, p. Note that MI_k = 0 if and only if the predictor X_k is statistically independent of Y; thus we can use the index MI_k as a dependence measure to screen the predictors. The MI-based method is model-free because it involves only the marginal and joint densities of the random variables, and the index can characterize both linear and nonlinear relationships between the response and the predictors.
The primary objective of feature screening in ultra-high-dimensional data analysis is to find a reduced model of small scale that contains the true model with high probability. In this paper, we propose using the index MI_k to select the moderate model

\hat{\mathcal{A}} = \{ k : \widehat{MI}_k \ge c n^{-\kappa}, \ 1 \le k \le p \},

where c and \kappa are predetermined positive values. In practice, we often select the reduced model using another formula:
\hat{\mathcal{A}} = \{ k : \widehat{MI}_k \text{ is among the largest } d \text{ values} \}, with d = [n / \log n]. Obviously, the predictors in \hat{\mathcal{A}} are the ones most likely to be relevant to the response, so we can use them to estimate the true model. For ease of presentation, we call the above procedure the MI-SIS procedure.
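The ranking rule above can be sketched in a few lines of Python (the paper's simulations were in R; the function names and the rule-of-thumb bandwidth are ours), here for the case of a continuous predictor and a categorical response:

```python
import numpy as np

def mi_cont_cat(x, y):
    """MI index between a continuous predictor x and a categorical response y:
    sum over classes of P(Y = r) times the average log-ratio of the class
    conditional density to the marginal density (plug-in KDE version)."""
    n = len(x)
    h = 1.06 * np.std(x) * n ** (-0.2)  # rule-of-thumb bandwidth (our assumption)

    def kde(samples, points):
        z = (points[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

    fx = kde(x, x)
    mi = 0.0
    for r in np.unique(y):
        mask = y == r
        mi += mask.mean() * np.mean(np.log(kde(x[mask], x[mask]) / fx[mask]))
    return mi

def mi_sis(X, y, d=None):
    """Rank the p predictors by the MI index and keep the top d = [n / log n]."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    scores = np.array([mi_cont_cat(X[:, k], y) for k in range(p)])
    return np.argsort(scores)[::-1][:d]
```

In use, `mi_sis(X, y)` returns the indices of the retained predictors; a second-stage model is then fitted on those columns only.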
In the following, we establish the theoretical properties of the proposed independence screening procedure. Fan and Lv [1] and Ji and Jin [22] demonstrated that the sure screening property guarantees the effectiveness of this class of independence screening procedures, so establishing the sure screening property for MI-SIS is essential. The following conditions are assumed to guarantee that the MI-SIS procedure has the sure screening property; they are imposed mainly to facilitate the technical proofs and may not be the weakest possible. (C1) Suppose that (X_k, Y), k = 1, \ldots, p, come from distributions which are unknown but have Lebesgue pdfs and satisfy the conditions of Lemma A.3 in the Appendix. (C2) There exists a positive constant such that the moment bound used in the proof of Theorem 1 holds. (C3) There exist positive constants c and 0 \le \kappa < 1/2 such that the minimum MI of the active predictors satisfies \min_{k \in \mathcal{A}} MI_k \ge 2 c n^{-\kappa}. (C4) Both X_k and Y satisfy a subexponential tail probability uniformly in p; that is, there exists a positive constant s_0 such that, for all 0 < s \le s_0, \sup_{p} \max_{1 \le k \le p} E\{\exp(s X_k^2)\} < \infty and E\{\exp(s Y^2)\} < \infty.
Theorem 1 (sure screening property). Under conditions (C1)-(C2), there exists a positive constant c_1 such that

P\left( \max_{1 \le k \le p} \left| \widehat{MI}_k - MI_k \right| \ge c n^{-\kappa} \right) \le O\left( p \exp\left( -c_1 n^{1 - 2\kappa} \right) \right).

Further, we have that

P\left( \mathcal{A} \subseteq \hat{\mathcal{A}} \right) \ge 1 - O\left( a_n \exp\left( -c_1 n^{1 - 2\kappa} \right) \right).

In the above equation, a_n is the cardinality of \mathcal{A}.
Theorem 1 indicates that we can deal with the ultra-high-dimensional case with \log p = o(n^{1 - 2\kappa}).
3. Numerical Studies
In this section, we first assess the finite sample performance of the proposed MI-SIS through Monte Carlo simulation studies; we then use real data to examine the sure screening performance of the proposed method. All of our simulation studies were performed in the R language.
Example 1 (X is continuous and Y is categorical). In this example, we simulate a quadratic discriminant analysis problem with ultra-high-dimensional predictors, following ideas similar to those in Cui et al. [13] and Pan et al. [23]. Our example differs slightly from theirs: we conduct a quadratic discriminant analysis in which the categorical response Y comes from two distributions that have a very small difference. We generate Y from a discrete uniform distribution with R categories; given Y, the predictor is then generated by adding the class mean to an error term, where the class mean has one relatively small nonzero component and all other components equal to zero, and the p-dimensional error terms likewise have only one non-null component. The error components can also take other symmetric distributions, such as t-distributions; here we only illustrate the simulation results for the normal case. The shape of the conditional density given Y is shown in Figure 1(a), and that of the marginal density is shown in Figure 1(b).
To illustrate the performance of the novel approach, we compare our results with the existing methods PC-SIS [14] and MV-SIS [13]. MV-SIS can be applied directly in this situation, but PC-SIS applies only when both the response and the predictors are categorical, so here we must categorize the predictors. Cui et al. [13] proposed a specific procedure for categorizing a continuous variable X_k: define a new variable from the percentiles of X_k by summing indicators of X_k exceeding the sample quantiles, where I(\cdot) denotes an indicator function. We then immediately face the question of how many categories J to use in categorizing X_k; their paper suggests a small set of candidate values. In this simulation, we take several values of J separately, and in each case we vary n and p. We repeat each experiment 100 times and define several evaluation indices to assess performance. MMMS is short for the median of the minimum model size needed to include all the active predictors; we also report the robust estimate of its standard deviation, RSD (= IQR/1.34, where IQR stands for the interquartile range), in parentheses. P_s denotes the proportion, over the 100 simulations, in which an individual active predictor is selected for the given model size d = [n / \log n], where [\cdot] denotes the integer part. P_a denotes the proportion in which all active predictors are selected for that model size. MMMS assesses the model complexity of the underlying procedure: the closer it is to the true model size, the better the screening procedure. The sure screening property ensures that P_s and P_a approach 1 when the estimated model size is sufficiently large. Detailed results are reported in Table 2.
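The quantile-slicing step used to categorize a continuous predictor can be sketched as follows; the exact indicator convention (counting exceeded quantiles, giving labels 0, ..., J-1) is our assumption about the construction in Cui et al. [13]:

```python
import numpy as np

def categorize(x, J):
    """Discretize a continuous variable into J roughly balanced categories:
    each observation's label is the number of the J - 1 equally spaced
    sample quantiles that it reaches or exceeds."""
    qs = np.quantile(x, [j / J for j in range(1, J)])
    return (x[:, None] >= qs[None, :]).sum(axis=1)
```

For example, `categorize(np.arange(12.0), 4)` splits twelve ordered values into four groups of three.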
Table 2 implies that MI-SIS performs reasonably well when the sample size is relatively small; thus, we can conclude that MI-SIS captures more subtle signals in this situation. Table 2 also shows that PC-SIS and MV-SIS perform poorly here, with results essentially unchanged across the three values of J chosen for PC-SIS. As the sample size grows, all three procedures perform well, but the result of PC-SIS depends on the selected J; note that J cannot be too small.

[Figure 1: (a) the conditional density of the active predictor given Y; (b) its marginal density.]
Example 2 (both X and Y are continuous). This example compares the performance of MI-SIS with those of PC-SIS and MV-SIS when both the response and the predictors are continuous. We consider the two following models: (2.a) and (2.b), where (2.a) is a linear regression in which the relationship between the active predictors and the response is linear, while in model (2.b) the relationship is nonlinear, involving an indicator function, a sine function, and an interaction term. All three methods are model-free, so they can be applied directly to both models, but we must categorize the continuous response for MV-SIS and both the continuous response and the predictors for PC-SIS; in this example, we fix the number of categories when categorizing the continuous variables. For each model, we consider two scenarios with different error distributions to assess the three methods. Moreover, in order to expose the drawbacks of categorizing, we also compute MI using formula (3), which requires categorizing the continuous response; we name this variant MI-SIS2 in Table 3. Table 3 presents the simulation results. The performances of MV-SIS, PC-SIS, MI-SIS, and MI-SIS2 are very similar for model (2.a), but MI-SIS outperforms the other procedures for model (2.b). We conclude that MI-SIS performs better than the other procedures when both the response and the predictors are continuous.
Example 3 (both X and Y are categorical). In this case, our proposed index reduces to the form in (4). To compare our index with PC-SIS, we borrow an example from Huang et al. [14], which was used there to assess PC-SIS in the case of predictors without interaction. The categorical response Y is generated first; the true model is defined by a fixed set of active predictors. Next, conditional on Y, each active predictor is generated from conditional mass functions whose specific values are shown in Table 4. The remaining (inactive) predictors are generated independently of Y.
Under this mechanism, we infer that the inactive predictors are independent of the response: their conditional mass functions are identical to the unconditional mass functions, so they carry no information about Y. In this example, in order to systematically study the asymptotic equivalence between MI-SIS and the PC-SIS of Huang et al. [14], we designed four cases and performed 100 simulations for each case. Asymptotic equivalence is a very strong property, so we define several further evaluation indices. One index denotes the proportion of runs in which all active predictors are included in the MI-SIS selected set for a given model size, while another denotes the proportion in which all active predictors are included in the PC-SIS selected set at the same model size. More specifically, we let PCS denote the proportion of identical correct predictors selected by PC-SIS and MI-SIS, and PFS the proportion of identical falsely selected predictors.
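For this purely categorical case, the MI index in (4) has a direct plug-in estimate from the contingency table. A small Python sketch (the simulation constants below are illustrative, not the paper's):

```python
import numpy as np

def mi_discrete(x, y):
    """Plug-in MI for two categorical variables: the sum over the
    contingency table of p_rs * log(p_rs / (p_r * p_s))."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    counts = np.zeros((len(xs), len(ys)))
    np.add.at(counts, (xi, yi), 1)          # cell counts of the table
    p = counts / counts.sum()               # joint cell probabilities
    pr = p.sum(axis=1, keepdims=True)       # row marginals
    ps = p.sum(axis=0, keepdims=True)       # column marginals
    nz = p > 0                              # 0 * log 0 is taken as 0
    return float((p[nz] * np.log(p[nz] / (pr @ ps)[nz])).sum())
```

When x is generated by flipping a binary y with some probability, the index is clearly positive; for an independent pair, it is near zero, which is the behavior the screening rule exploits.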
Detailed results are shown in Table 5. The four simulation results are very similar, which is within our expectation. Both MI-SIS and PC-SIS are very efficient at selecting the active predictors: the MMMS (RSD) values and both inclusion proportions are very close to their ideal values. The value of PCS approaches 1 as n grows, which demonstrates that MI-SIS and PC-SIS are asymptotically equivalent: both almost surely select the same active predictors, even if they do not select all the true predictors. However, PFS behaves relatively poorly; the reason may be that as n grows, the pool of inactive candidates remains very large, so, due to randomness, the falsely selected predictors vary considerably between the two methods.
Example 4 (X is categorical and Y is continuous). In this example, we use the same settings as in Example 2 but add some new categorical variables. The models considered are modifications of (2.a) and (2.b) in which some active predictors are continuous and others are categorical. We first generate the continuous variables as in Example 2 and then generate the categorical variables from a uniform distribution on \{0, 1\}. The simulation results, presented in Table 6, show that the MI-SIS method again outperforms its competitors when X is categorical and Y is continuous.
4. Real Data Analysis
In this section, we illustrate the proposed MI-SIS procedure with an application to detecting important voice features of Parkinson's patients using the LSVT dataset, available at http://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation. The LSVT (Lee Silverman voice treatment) Voice Rehabilitation dataset comes from the UCI repository; it was created by Tsanas et al. and first analyzed in [24]. The goal of the study is to improve the effectiveness of rehabilitative speech treatment through appropriate statistical algorithms for the LSVT Companion system: algorithms that automatically detect whether the characteristics of a voice are acceptable. An efficient algorithm can automatically assess the effectiveness of the LSVT Companion system while the software is used away from expert clinical guidance.
This dataset contains 126 samples from 14 participants, 309 predictors, and one response. The predictors are features of the voice; the response is a binary variable, with one meaning “unacceptable” and zero meaning “acceptable.” “Unacceptable” means that a clinician judged that the voice did not persist during in-person rehabilitation treatment. More details about the predictors can be found in [24], where the authors demonstrated that their algorithm could replicate the experts’ binary assessment with approximately 90% accuracy.
Here, we use two penalized regression models to analyze this dataset. We first establish a penalized logistic model using all the predictors by minimizing the penalized likelihood
For the second model, we propose first applying MI-SIS to screen the predictors and then expanding the predictor space by adding the interaction terms of the screened predictors. We then apply the penalized logistic model to this new feature space; in this way, we can explore nonlinear relationships among the screened predictors. For simplicity, we refer to these two models as the penalized-logistic model and the MI-SIS-penalized-logistic model, respectively.
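A minimal sketch of this screen-then-expand step (Python; the paper's analysis was done in R, and the final penalized logistic fit, e.g. via glmnet or scikit-learn's L1-penalized `LogisticRegression`, is omitted; the function names and the rule-of-thumb bandwidth are our assumptions):

```python
import numpy as np

def mi_cont_cat(x, y):
    """MI index between a continuous predictor and a categorical label,
    via class-wise Gaussian KDEs with a rule-of-thumb bandwidth (assumption)."""
    n = len(x)
    h = 1.06 * np.std(x) * n ** (-0.2)

    def kde(samples, points):
        z = (points[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

    fx = kde(x, x)
    return sum((y == r).mean() * np.mean(np.log(kde(x[y == r], x[y == r]) / fx[y == r]))
               for r in np.unique(y))

def screen_and_expand(X, y, d):
    """Keep the d predictors with the largest MI index, then append all
    pairwise products of the screened predictors as interaction terms."""
    scores = np.array([mi_cont_cat(X[:, k], y) for k in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:d]
    Z = X[:, keep]
    inter = [Z[:, i] * Z[:, j] for i in range(d) for j in range(i + 1, d)]
    return np.column_stack([Z] + inter), keep
```

The expanded matrix (d main effects plus d(d-1)/2 interactions) is then handed to any penalized logistic solver with the regularization strength chosen by cross validation.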
The top-right and bottom-right panels of Figure 2 show the penalized-logistic coefficient paths for the two models. The top-left and bottom-left panels show the CV error for each value of the regularization parameter λ; λ is selected by 5-fold cross validation, and the best λ is obtained at the minimum of the CV error curve. We summarize the classification results in Table 7. From the confusion matrices, we conclude that both models perform well. The penalized-logistic model finally selects only 13 predictors at the best λ, with a classification rate of 91.27%; the MI-SIS-penalized-logistic model finally selects 25 predictors in the new feature space at its best λ, with a classification rate of 96.83%. This example further demonstrates that MI-SIS combined with a penalized-logistic model is attractive in real data analysis.

5. Discussion
In this paper, we proposed a new independent feature screening method based on mutual information, MI-SIS. The proposed procedure is model-free, and the sure screening property was established when the number of predictors diverges at an exponential rate of the sample size. The new procedure performs well at discovering weak signals in finite samples. As in Fan and Lv [1], we select a cutoff d for MI-SIS. How to choose d is an important and difficult problem: a well-chosen d plays a central role in feature screening methods, and a more principled choice of d for MI-SIS deserves further discussion.
Theoretically, we ignore the marginal dependence between the predictors; such dependence may cause trouble during the feature screening procedure, and how to handle it remains an open question. Following the idea of ISIS, we adapted MI-SIS into an iterative procedure; however, owing to the deficiencies of nonparametric estimation in small samples, the iterative MI-SIS performed poorly, so we do not report those results here. More research is needed to overcome the marginal dependence problem.
Appendix
A. Proof of Theoretical Result
In order to prove Theorem 1, we need the three following lemmas. The first two provide exponential inequalities; their proofs can be found in [25].
Lemma A.1. Let . If , then
Lemma A.2. Let h(x_1, \ldots, x_m) be the kernel of a U-statistic U_n with \theta = E\,h and a \le h \le b. If n \ge m, then, for any t > 0,

P\left( U_n - \theta \ge t \right) \le \exp\left( - \frac{2 [n/m] t^2}{(b - a)^2} \right),

where [n/m] denotes the integer part of n/m.
Lemma A.2 is the one-sided tail inequality for U_n; the two-sided tail inequality follows easily by symmetry.
Lemma A.3 (the asymptotic property of nonparametric density estimators). Suppose that the second derivative of the density exists and the stated bandwidth conditions hold; then the usual bias and variance expansions of the kernel density estimator follow, so both vanish asymptotically.
Lemma A.3 directly implies the convergence in probability of the kernel density estimator. Under somewhat stricter conditions, we have its strong uniform convergence.
More details about the strong uniform convergence can be found, for example, in [26] or [27].
Proof of Theorem 1. First, we show that, for each k, the stated inequality holds. This follows from the decomposition of the estimation error: by Lemma A.3 and the strong law of large numbers, the first term converges. Next, we establish a bound for the second term.
Define h as the kernel of the corresponding U-statistic. With Markov's inequality, we obtain the stated exponential bound. Using the technique for U-statistics employed by Li et al. [9], together with condition (C2), the one-sided tail bound follows immediately. By a suitable choice of the constant, and by the symmetry of the U-statistic, we obtain the two-sided tail inequality. Using the relationship between the empirical and population quantities, we can then show the intermediate bound. Under condition (C2), taking a sufficiently large constant, we can easily prove the next inequality; similarly, taking a still larger constant gives the corresponding bound directly. Combining inequality (A.11) with property (A.1) and Bonferroni's inequality then yields the uniform bound, which establishes the first part of the theorem. We then prove the second part of Theorem 1.
If \mathcal{A} \not\subseteq \hat{\mathcal{A}}, then there must exist some k \in \mathcal{A} with \widehat{MI}_k < c n^{-\kappa}. By condition (C3), this implies that |\widehat{MI}_k - MI_k| > c n^{-\kappa} for some k \in \mathcal{A}. Thus, the event \{\mathcal{A} \not\subseteq \hat{\mathcal{A}}\} is contained in the event \{\max_{k \in \mathcal{A}} |\widehat{MI}_k - MI_k| > c n^{-\kappa}\}; taking complements on both sides and applying the first part of the theorem gives the stated bound, where a_n is the cardinality of \mathcal{A}. This completes the proof.
Data Availability
The LSVT (Lee Silverman voice treatment) Voice Rehabilitation Dataset was adopted to illustrate the proposed MI-SIS procedure in Section 4. The dataset is available at UCI website: http://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This research was funded by Doctoral Program of Harbin Normal University, China (grant no. XKB201805).