Abstract

Complex nonlinear high-dimensional data pose a major challenge to the development of sparse techniques. In this paper, we propose a novel feature selection method for nonlinear support vector regression, called FS-NSVR, which is the first attempt to solve the nonlinear feature selection problem in the field of regression. FS-NSVR preserves the representative features of a complex nonlinear system because it applies a feature selection matrix in the original space. The resulting FS-NSVR model is a challenging mixed-integer programming problem, which we solve efficiently with an alternate iterative greedy algorithm. Experimental results on three artificial datasets and five real-world datasets confirm that FS-NSVR effectively selects representative features and discards redundant features in a nonlinear system. FS-NSVR outperforms L1-norm support vector regression, L1-norm least squares support vector regression, and Lp-norm support vector regression in both feature selection ability and regression performance.

1. Introduction

High-dimensional data commonly emerge in diverse fields, such as finance [1], economics [2], biology [3], and medicine [4]. Complex nonlinear relationships between features may exist in high-dimensional datasets [5]. For example, most economic and financial time series follow nonlinear behavior [6]. Ling et al. explored the nonlinear relationship between globalization, natural resources, financial development, and carbon emissions [6]. Another example of complex nonlinear high-dimensional data arises in medicine, where medical costs have a sophisticated relationship with the underlying features [7]. Complex nonlinear high-dimensional data also frequently arise in biology, where nonlinear relations between features can depict biological relationships more precisely and reflect critical patterns in biological systems [8].

Complex nonlinear high-dimensional data may include some irrelevant and redundant features, which may reduce the effectiveness of data mining and may detract from the quality of the results [9–11]. Thus, complex nonlinear high-dimensional data need a sparse technique. Feature selection, as a useful sparse technique, selects some useful features upon which to focus its attention and ignores the rest [12–17]. In general, feature selection methods are classified as filter, wrapper, and embedded methods [16–18]. The embedded method is very popular for feature selection since it conducts feature selection and other learning tasks simultaneously [19].

Sparse support vector regression, as a branch of sparse support vector machine [20–23], is a computationally powerful feature selection method. Sparse support vector regression adopts a sparse regularization term to realize feature selection and regression simultaneously. Therefore, sparse support vector regression is an embedded feature selection method [24–26]. L1-norm support vector regression (L1-SVR) [27] and L1-norm least squares support vector regression (L1-LSSVR) [28] use the L1-norm sparse regularization term to shrink some coefficients of the regression estimators towards 0. According to the regression estimators, the contribution of each feature to the final decision function can be judged, and then the useful features are selected, while the irrelevant and redundant features are discarded. To improve the sparseness of L1-SVR, Zhang et al. [29] proposed Lp-norm support vector regression (Lp-SVR) (0 < p < 1). The Lp-norm regularization term in Lp-SVR shrinks more coefficients of the regression estimators towards 0, and some coefficients are shrunk to exactly 0, leading to more irrelevant and redundant features being discarded. However, L1-SVR, L1-LSSVR, and Lp-SVR only solve the linear feature selection problem, which is not always suitable for complex nonlinear cases.

To solve the feature selection problem for complex nonlinear high-dimensional data, we follow the spirit of nonlinear support vector machine-based feature selection [9, 30, 31] and propose a novel feature selection method for nonlinear support vector regression, called FS-NSVR. We bring a feature selection matrix, a diagonal matrix whose diagonal elements are either 1 or 0, into nonlinear support vector regression. As a result, FS-NSVR becomes a mixed-integer programming problem (MPP). To solve FS-NSVR efficiently, we employ an alternate iterative greedy algorithm to find a local optimal value [32], in which we iteratively solve the standard SVR problem and a smaller nonconvex feature selection problem. In addition, a feature-ranking strategy is suggested [33], which ranks the features according to their contributions to the objective of the MPP. The experimental results show that FS-NSVR selects more appropriate representative features under highly complex nonlinear relationships and produces smaller estimation errors than L1-SVR, L1-LSSVR, and Lp-SVR. This means that FS-NSVR not only selects the representative features but also has good regression effectiveness. The contributions of this paper are summarized as follows:
(1) Bringing a feature selection matrix into nonlinear support vector regression, we propose a novel feature selection method for nonlinear support vector regression that identifies complex nonlinear relationships between features in their original space. The proposed model is the first attempt to solve the nonlinear feature selection problem in the field of regression.
(2) The proposed model is a complex mixed-integer programming problem. To ensure the efficiency of the learning process, we employ an alternate iterative greedy algorithm to find a local optimal value for the proposed model. The algorithm, which transforms the complex mixed-integer problem into a min-max optimization problem, effectively reduces the computational complexity.
(3) Experimental results on both artificial and real-world datasets indicate that the proposed model preserves the representative features of the complex nonlinear system and outperforms the other three linear feature selection methods in both feature selection and regression. The training speed of the proposed method confirms the efficiency of the alternate iterative greedy algorithm.

The remainder of this paper is organized as follows: Section 2 briefly reviews support vector regression. In Section 3, we propose the feature selection method for nonlinear support vector regression. Section 4 provides experiments on artificial and real-world datasets, and Section 5 concludes the paper.

2. Background

Starting with the notation, we consider a regression problem in the $n$-dimensional real vector space $\mathbb{R}^{n}$. Suppose that $y \in \mathbb{R}^{m}$ is the response vector, $X \in \mathbb{R}^{m \times n}$ is a known design matrix of covariates, and $x_i \in \mathbb{R}^{n}$ is the $i$th $n$-dimensional training sample. Next, we briefly review support vector regression (SVR) [26], which is closely related to FS-NSVR.

The optimal nonlinear regression function of SVR is constructed as follows:

$$f(x) = w^{\top}\varphi(x) + b, \tag{1}$$

where $w$ and $b$ are the unknown parameters, $\varphi(\cdot)$ is a nonlinear mapping into a feature space, and $K(x_i, x_j) = \varphi(x_i)^{\top}\varphi(x_j)$ is an appropriately chosen kernel. The parameters in function (1) are estimated by solving the following optimization problem:

$$\begin{aligned}
\min_{w,\, b,\, \xi,\, \xi^{*}} \ & \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}(\xi_i + \xi_i^{*}) \\
\text{s.t.} \ & y_i - w^{\top}\varphi(x_i) - b \le \varepsilon + \xi_i, \\
& w^{\top}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0, \ \xi_i^{*} \ge 0, \quad i = 1, \ldots, m,
\end{aligned} \tag{2}$$

where $\xi = (\xi_1, \ldots, \xi_m)^{\top}$ and $\xi^{*} = (\xi_1^{*}, \ldots, \xi_m^{*})^{\top}$ are the slack variables and $C > 0$ is a parameter determining the trade-off between the empirical risk and the regularization term. To derive the dual formulation of SVR, we first introduce the Lagrangian function for problem (2), which is

$$L = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}(\xi_i + \xi_i^{*}) - \sum_{i=1}^{m}\alpha_i\big(\varepsilon + \xi_i - y_i + w^{\top}\varphi(x_i) + b\big) - \sum_{i=1}^{m}\alpha_i^{*}\big(\varepsilon + \xi_i^{*} + y_i - w^{\top}\varphi(x_i) - b\big) - \sum_{i=1}^{m}(\eta_i\xi_i + \eta_i^{*}\xi_i^{*}), \tag{3}$$

where $\alpha, \alpha^{*}, \eta, \eta^{*} \ge 0$ are the Lagrangian multiplier vectors. The Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions for problem (2) are given by

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*})\varphi(x_i) = 0, \quad \frac{\partial L}{\partial b} = \sum_{i=1}^{m}(\alpha_i^{*} - \alpha_i) = 0, \quad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0, \quad \frac{\partial L}{\partial \xi_i^{*}} = C - \alpha_i^{*} - \eta_i^{*} = 0. \tag{4}$$

According to the previously mentioned KKT conditions, we obtain the dual formulation of problem (2) as follows:

$$\begin{aligned}
\max_{\alpha,\, \alpha^{*}} \ & -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})K(x_i, x_j) - \varepsilon\sum_{i=1}^{m}(\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{m}y_i(\alpha_i - \alpha_i^{*}) \\
\text{s.t.} \ & \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*}) = 0, \quad 0 \le \alpha_i,\, \alpha_i^{*} \le C, \quad i = 1, \ldots, m.
\end{aligned} \tag{5}$$

The weight vector $w$ can be obtained from the solution $\alpha$ and $\alpha^{*}$ of (5) by

$$w = \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*})\varphi(x_i). \tag{6}$$

For any solution $\alpha$ and $\alpha^{*}$ to (5), if $\alpha_i\alpha_i^{*} = 0$, the bias $b$ of problem (2) can be obtained in the following way:

(1) For any nonzero component $\alpha_i \in (0, C)$,

$$b = y_i - \sum_{j=1}^{m}(\alpha_j - \alpha_j^{*})K(x_j, x_i) - \varepsilon. \tag{7}$$

(2) For any nonzero component $\alpha_i^{*} \in (0, C)$,

$$b = y_i - \sum_{j=1}^{m}(\alpha_j - \alpha_j^{*})K(x_j, x_i) + \varepsilon. \tag{8}$$

The final decision function can be constructed as

$$f(x) = \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*})K(x_i, x) + b. \tag{9}$$
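For concreteness, the following minimal sketch fits a standard $\varepsilon$-insensitive SVR with a Gaussian kernel on synthetic data. It uses scikit-learn's SVR purely as an off-the-shelf solver for the dual problem (5); the data-generating function and the parameter values are illustrative assumptions, not settings taken from this paper.

# A minimal sketch of standard epsilon-insensitive SVR with a Gaussian (RBF) kernel,
# corresponding to the dual problem (5) and the decision function (9).
# scikit-learn's SVR is used only as an off-the-shelf dual solver; the data and
# parameter values below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 5))            # 200 samples, 5 features
y = np.sin(3.0 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(200)

# gamma plays the role of the Gaussian kernel parameter, C the trade-off parameter,
# and epsilon the width of the insensitive tube.
model = SVR(kernel="rbf", gamma=1.0, C=2.0, epsilon=0.01)
model.fit(X, y)

y_hat = model.predict(X)                              # evaluates f(x) as in (9)
print("training NMSE:", np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))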

3. Feature Selection for Nonlinear Support Vector Regression

3.1. Problem Formulation

In this section, we propose feature selection for nonlinear support vector regression. Let $D = \mathrm{diag}(d_1, \ldots, d_n)$ be an $n \times n$ feature selection matrix, where each diagonal element $d_j$ is either 1 (the $j$th feature is selected) or 0 (the $j$th feature is discarded). We consider the following nonlinear regression function:

$$f(x) = w^{\top}\varphi(Dx) + b, \tag{10}$$

where $\varphi(\cdot)$ is a nonlinear mapping into a feature space and $K(Dx_i, Dx_j) = \varphi(Dx_i)^{\top}\varphi(Dx_j)$ is an appropriately chosen kernel. $w$ and $b$ are the unknown parameters that need to be estimated, and the optimal feature selection matrix $D$ also needs to be searched simultaneously.

The estimator of the regression function (10) can be defined as the solution to the FS-NSVR optimization problem:

$$\begin{aligned}
\min_{w,\, b,\, \xi,\, \xi^{*},\, D} \ & \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}(\xi_i + \xi_i^{*}) \\
\text{s.t.} \ & y_i - w^{\top}\varphi(Dx_i) - b \le \varepsilon + \xi_i, \\
& w^{\top}\varphi(Dx_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0, \ \xi_i^{*} \ge 0, \quad i = 1, \ldots, m, \\
& D = \mathrm{diag}(d), \quad d_j \in \{0, 1\}, \quad j = 1, \ldots, n,
\end{aligned} \tag{11}$$

where $\xi$ and $\xi^{*}$ are slack variables and $C > 0$ is a parameter determining the trade-off between the empirical risk and the regularization term. In fact, the feature selection matrix $D$ defines a subspace spanned by the selected features. Minimizing the objective of (11) over $D$ has the beneficial effect of suppressing variables to produce a sparse set of nonzero feature weights. Therefore, FS-NSVR has nonlinear feature selection ability.
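The role of the feature selection matrix in (10) and (11) can be illustrated with a short sketch: because $D$ is diagonal with entries in $\{0, 1\}$, evaluating the Gaussian kernel on $Dx$ amounts to masking the discarded feature columns before computing pairwise distances. The function and variable names below are illustrative assumptions, not the paper's notation.

# A minimal sketch of the kernel used in (10)-(11): a Gaussian kernel evaluated on
# Dx, where D is a 0/1 diagonal feature selection matrix. Multiplying by D simply
# zeroes out (masks) the discarded feature columns. Names are illustrative.
import numpy as np

def gaussian_kernel_selected(X1, X2, d, gamma):
    """K(D x_i, D x_j) = exp(-gamma * ||D x_i - D x_j||^2) for a 0/1 mask d."""
    Z1 = X1 * d            # equivalent to X1 @ np.diag(d)
    Z2 = X2 * d
    sq = (Z1 ** 2).sum(1)[:, None] + (Z2 ** 2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.random.default_rng(0).normal(size=(6, 4))
d = np.array([1.0, 0.0, 1.0, 0.0])   # keep features 1 and 3, discard 2 and 4
K = gaussian_kernel_selected(X, X, d, gamma=0.5)
print(K.shape)  # (6, 6); only the selected features influence the kernel values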

3.2. Problem Solution

Obviously, the FS-NSVR optimization problem is a mixed-integer programming problem. We reformulate problem (11) as follows:

$$\min_{D \in \mathcal{D}} \ \min_{w,\, b,\, \xi,\, \xi^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}(\xi_i + \xi_i^{*}) \quad \text{s.t. the constraints of (11)}, \tag{12}$$

where $\mathcal{D}$ denotes the set of all $n \times n$ diagonal matrices with diagonal entries in $\{0, 1\}$.

Solving problem (12) to global optimality is highly challenging and impractical [24]. We employ an alternate iterative greedy algorithm to find a local optimal value. First, we fix the integer part $D$ and then obtain the solution to (12), which leads to solving the problem in the same manner as SVR. Similar to the deduction process of nonlinear SVR in Section 2, we obtain the dual formulation of the inner minimization problem. Then, problem (12) can be rewritten as

$$\min_{D \in \mathcal{D}} \ \max_{\alpha,\, \alpha^{*}} \ -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})K(Dx_i, Dx_j) - \varepsilon\sum_{i=1}^{m}(\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{m}y_i(\alpha_i - \alpha_i^{*}) \quad \text{s.t.} \ \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*}) = 0, \ 0 \le \alpha_i,\, \alpha_i^{*} \le C. \tag{13}$$

Obviously, problem (13) is a challenging min-max optimization problem. Fixing the optimal solution $(\alpha, \alpha^{*})$ of the inner maximization problem, we obtain the outer minimization integer problem, which would require exhaustive computation of the objective over all $2^{n}$ possible choices of $D$.

To make the greedy algorithm work, we follow the strategy in [33] to initialize $D$, which makes the algorithm more stable. After solving the SVR problem, the value of each feature is computed by (14).

The score of the $j$th feature, which reflects its importance among all the features, is computed by (15).

The greedy algorithm starts from an initial $D^{0}$ generated by (15): if the score of the $j$th feature is below a given threshold, then $d_j = 0$; otherwise, $d_j = 1$. We then fix $D$ and solve problem (13) to obtain $(\alpha, \alpha^{*})$. We calculate the feature values and scores according to (14) and (15), respectively. $D$ is updated by flipping its diagonal entries whenever the objective of (13) decreases by more than the tolerance. After updating $D$, $(\alpha, \alpha^{*})$ can be obtained again. The algorithm terminates when the objective of (13) decreases by less than the tolerance. We summarize the procedure of this greedy approach in Algorithm 1, which gives the feature selection method for nonlinear support vector regression. The proof of convergence of the greedy approach in Algorithm 1 can be obtained from Mangasarian and Kou [32].

Input: Training set {(x_i, y_i)}, i = 1, …, m; the appropriate kernel parameter σ; parameter C; tolerance tol;
Output: α, α*, and D;
Begin
 Start with the initial D^0 generated by (15); set the iteration number k = 0;
 while not converged do
  Find the solution (α, α*) to problem (13) with the fixed D^k; compute each feature score by (15).
  for i = 1 : n
   If the score of the ith feature is below the threshold then set d_i = 0
   else set d_i = 1
   end
  end
  If the objective of (13) decreases by less than tol then
   Converged; output α, α*, and D^k as the final solutions.
  else
   set D^{k+1} = diag(d) and k = k + 1
  end
 end
 Output α, α*, and D as the final solutions;
end

After obtaining the solution of problem (13), the weight vector $w$ can be obtained by

$$w = \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*})\varphi(Dx_i). \tag{16}$$

For any solution $\alpha$ and $\alpha^{*}$ to (13), if $\alpha_i\alpha_i^{*} = 0$, the bias $b$ of problem (11) can be obtained in the following way:

(1) For any nonzero component $\alpha_i \in (0, C)$,

$$b = y_i - \sum_{j=1}^{m}(\alpha_j - \alpha_j^{*})K(Dx_j, Dx_i) - \varepsilon. \tag{17}$$

(2) For any nonzero component $\alpha_i^{*} \in (0, C)$,

$$b = y_i - \sum_{j=1}^{m}(\alpha_j - \alpha_j^{*})K(Dx_j, Dx_i) + \varepsilon. \tag{18}$$

The final decision function can be constructed as

$$f(x) = \sum_{i=1}^{m}(\alpha_i - \alpha_i^{*})K(Dx_i, Dx) + b. \tag{19}$$
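As a rough illustration of the alternating scheme in Algorithm 1 (not the paper's exact procedure), the sketch below alternates between fitting a standard SVR on the masked features and greedily flipping one entry of the feature mask at a time, accepting a flip only when a hold-out error decreases by more than a tolerance. scikit-learn's SVR stands in for the inner dual solver, a simple validation error stands in for the feature values (14) and scores (15), and all names and constants are illustrative assumptions.

# A rough sketch of an alternating greedy scheme in the spirit of Algorithm 1:
# fix the 0/1 feature mask d and solve a standard SVR on the masked data, then
# try flipping each mask entry and keep a flip only if a hold-out error decreases
# by more than a tolerance. This is an illustrative stand-in, not the paper's
# exact procedure; names, criteria, and constants are assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

def masked_validation_error(X_tr, y_tr, X_val, y_val, d, gamma=1.0, C=2.0, epsilon=0.01):
    # Fit the inner SVR problem on the masked features D x and return a hold-out error.
    model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon).fit(X_tr * d, y_tr)
    return np.mean((y_val - model.predict(X_val * d)) ** 2), model

def greedy_feature_selection(X, y, tol=1e-4, max_iter=20):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    n = X.shape[1]
    d = np.ones(n)                                   # start with all features selected
    best_err, best_model = masked_validation_error(X_tr, y_tr, X_val, y_val, d)
    for _ in range(max_iter):
        improved = False
        for j in range(n):                           # try flipping one mask entry at a time
            d_try = d.copy()
            d_try[j] = 1.0 - d_try[j]
            err, model = masked_validation_error(X_tr, y_tr, X_val, y_val, d_try)
            if err < best_err - tol:                 # accept only a sufficiently large decrease
                d, best_err, best_model = d_try, err, model
                improved = True
        if not improved:                             # no flip helps anymore: stop
            break
    return d, best_model

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 8))
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(150)
d, model = greedy_feature_selection(X, y)
print("selected feature mask:", d)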

3.3. Computational Complexity

Concerning the computational complexity of FS-NSVR, we find that FS-NSVR consists of two parts: one is repeatedly solving the inner maximization problem of (13), and the other is repeatedly computing (14) and (15). The first part requires solving one quadratic programming problem at each iteration, with a time complexity approximately equal to that of training a standard SVR on the $m$ training samples. The second part is easy to compute with the fixed $(\alpha, \alpha^{*})$, and it is evaluated no more than $n$ times per iteration.

4. Experimental Results

To test the feature selection and regression effectiveness of the proposed FS-NSVR, we compare it with L1-SVR [27], L1-LSSVR [28], and Lp-SVR [29] on three artificial datasets and five real-world datasets. L1-SVR, Lp-SVR, and L1-LSSVR are embedded linear feature selection methods. All of these methods are implemented in the MATLAB R2019b environment on a PC running a 64-bit Windows OS with a 1.6 GHz Intel (R) processor and 16 GB of RAM.

For the feature selection of nonlinear support vector regression, we employ a Gaussian kernel, and its kernel parameter is selected from a candidate set of values; parameter C is selected from the same candidate set. The insensitive parameter $\varepsilon$ is fixed at 0.01. The optimal values of the parameters in the experiments are obtained by the grid search method.
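The grid search described above can be implemented along the following lines; the candidate grids, the hold-out validation split, and the RMSE scoring are illustrative assumptions rather than the exact protocol used in the experiments.

# A minimal sketch of grid search over the Gaussian kernel parameter (gamma) and C
# for SVR with epsilon fixed at 0.01. The candidate grids and the hold-out
# validation split are illustrative assumptions, not the paper's exact settings.
import numpy as np
from itertools import product
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

def grid_search_svr(X, y, gammas, Cs, epsilon=0.01):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    best = (None, np.inf)
    for gamma, C in product(gammas, Cs):
        model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon).fit(X_tr, y_tr)
        rmse = np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2))
        if rmse < best[1]:
            best = ((gamma, C), rmse)
    return best  # ((best_gamma, best_C), validation RMSE)

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(120)
print(grid_search_svr(X, y, gammas=2.0 ** np.arange(-4, 5), Cs=2.0 ** np.arange(-4, 5)))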

Let $m$ be the number of test samples, $y_i$ the $i$th test response, $\hat{y}_i$ the predicted value of $y_i$, and $\bar{y}$ the average value of the test responses. We use the following evaluation criteria to evaluate the variable selection and regression results.

P1: the proportion of simulation runs in which the nonzero coefficients are selected.

R2: the coefficient of determination is defined as

$$R^{2} = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{m}(y_i - \bar{y})^{2}}.$$

NMSE: the normalized mean squared error (NMSE) is defined as

$$\mathrm{NMSE} = \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{m}(y_i - \bar{y})^{2}}.$$

RMSE: the root mean square error (RMSE) is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^{2}}.$$

Thus, the smaller the values of NMSE and RMSE are, the more statistical information is captured from the selected variables.
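For reference, the three regression criteria defined above can be computed with a small helper such as the following; the variable names are illustrative.

# A small helper computing R^2, NMSE, and RMSE as defined above; names are illustrative.
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "NMSE": ss_res / ss_tot,
        "RMSE": np.sqrt(ss_res / len(y_true)),
    }

print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))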

4.1. Artificial Datasets

To test the nonlinear feature selection performance of FS-NSVR, we generate three artificial datasets (Type A, Type B, and Type C) from three nonlinear regression functions with additive noise, in which only a subset of the input features is truly relevant. The specifications of these artificial datasets, including the numbers of training samples, test samples, and features, are listed in Table 1.

To evaluate the performance of the feature selection results, we adopt the following criteria:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where $TP$ is the number of truly relevant features that are selected (true positives), $FP$ is the number of irrelevant features that are selected (false positives), $FN$ is the number of truly relevant features that are discarded (false negatives), and $TN$ is the number of irrelevant features that are discarded (true negatives). Precision and recall are commonly used to report results for binary decision problems in machine learning since they give a more accurate evaluation of an algorithm's performance. Here, we use precision and recall to evaluate the feature selection results of FS-NSVR, L1-SVR, Lp-SVR, and L1-LSSVR.
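The precision and recall of a selected feature set can be computed with a small helper such as the following; the index convention and the ground-truth set are illustrative assumptions.

# A small helper computing the precision and recall of a selected feature set
# against the ground-truth relevant features; index conventions are illustrative.
def selection_precision_recall(selected, relevant):
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)                 # relevant features that were selected
    fp = len(selected - relevant)                 # irrelevant features that were selected
    fn = len(relevant - selected)                 # relevant features that were missed
    precision = tp / (tp + fp) if selected else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: features 0 and 2 are truly relevant, and the method selected 0, 2, and 5.
print(selection_precision_recall(selected=[0, 2, 5], relevant=[0, 2]))  # (0.666..., 1.0)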

The best parameters of FS-NSVR, L1-SVR, Lp-SVR, and L1-LSSVR for the artificial datasets are shown in Table 2. The feature selection and regression results on the three artificial datasets are shown in Table 3. From Table 3, we find that FS-NSVR achieves higher precision and recall than L1-SVR, Lp-SVR, and L1-LSSVR. Meanwhile, FS-NSVR obtains a larger R2 and a smaller NMSE than L1-SVR, Lp-SVR, and L1-LSSVR. It is clear that FS-NSVR has the ability to select the representative features and discard the redundant features in the nonlinear system. Therefore, FS-NSVR is suitable for solving the nonlinear feature selection problem, while L1-SVR, Lp-SVR, and L1-LSSVR are not suitable for solving the feature selection problem of complex nonlinear high-dimensional data. In terms of running times, although FS-NSVR trains more slowly than L1-LSSVR, it is significantly faster than L1-SVR and Lp-SVR.

4.2. Parameters and Nonlinear Feature Selection Analysis

In this part, we analyze the effects of the kernel parameter and the parameter C on the nonlinear feature selection results. To test the influence of the kernel parameter on NMSE, R2, precision, and recall, we first fix parameter C at the optimal value used in the experiments on the artificial datasets. Figures 1–3 illustrate the influence of the kernel parameter on the nonlinear feature selection results for Type A, Type B, and Type C, respectively. From Figures 1–3, we find that as the kernel parameter increases, the NMSE value first decreases and then increases, while R2 and precision first increase and then decrease, which means that the kernel parameter has a strong influence on the feature selection ability of FS-NSVR. When the kernel parameter is set to its optimal value, precision and recall reach their maximum values, which means that FS-NSVR selects representative features and discards irrelevant features. Although the number of selected features is small, FS-NSVR can select the representative features in the datasets.

To further test the influence of parameter C on the FS-NSVR feature selection results, we fix the kernel parameter at the optimal value used in the experiments on the artificial datasets. Figures 4–6 show the influence of C on NMSE, R2, precision, and recall. From Figures 4–6, we observe that as parameter C increases, NMSE decreases and then remains constant. When C = 2, precision reaches a maximum value and then remains constant, which means that FS-NSVR selects the representative features and discards the irrelevant features.

4.3. Real-World Datasets

To further test the feature selection and regression performance of FS-NSVR, we consider five real-world datasets from the UC Irvine (UCI) Machine Learning Repository [34]. Table 1 displays the dataset information, including the specific numbers of training samples, test samples, and features. The best parameters of FS-NSVR, L1-SVR, Lp-SVR, and L1-LSSVR for real-world datasets are shown in Table 4.

Table 5 lists the feature selection and regression results of FS-NSVR, L1-SVR, Lp-SVR, and L1-LSSVR. One can easily observe that FS-NSVR selects fewer features than L1-SVR, Lp-SVR, and L1-LSSVR, while its NMSE and RMSE values remain small and comparable with those of the other methods. FS-NSVR selects very few useful features yet still captures the nonlinear statistical information in the test datasets. FS-NSVR realizes feature selection and regression simultaneously due to its inherent feature selection property. Faced with complex nonlinear high-dimensional datasets, L1-SVR, Lp-SVR, and L1-LSSVR struggle because they can only solve the linear version of the feature selection problem. Regarding the running time, FS-NSVR is significantly faster than L1-SVR and Lp-SVR.

5. Conclusion

Our paper focused on the nonlinear feature selection problem posed by high-dimensional data, especially when complex nonlinear relationships between features exist. To solve this problem, we brought a feature selection matrix in the original space into nonlinear support vector regression and proposed a novel feature selection method for nonlinear support vector regression (FS-NSVR). FS-NSVR is a mixed-integer programming problem (MPP). To solve FS-NSVR efficiently, we employed an alternate iterative greedy algorithm to find a local optimal value. The feature selection matrix and the supervised selection process of FS-NSVR ensured that the representative features were selected and the redundant features were discarded automatically. The feature selection and regression performance of FS-NSVR on the artificial and real-world datasets confirmed its sparseness and effectiveness.

The proposed method also has limitations that should be addressed in future studies. First, FS-NSVR is not suitable for the heterogeneity problem of nonlinear high-dimensional data; the spirit of quantile regression [35–37] could be brought into the nonlinear feature selection framework in the future. Second, more efficient methods for solving FS-NSVR are needed, since the training speed of the current method is not fast enough for large-scale datasets. Third, from an application perspective, how to use FS-NSVR to deal with real-world nonlinear feature selection problems remains a question for future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 11871183, 61866010, 61603338, and 12101552), the National Social Science Foundation of China (No. 21BJY256), the Philosophy and Social Sciences Leading Talent Training Project of Zhejiang Province (No. 21YJRC07-1YB), the Natural Science Foundation of Zhejiang Province (No. LY21F030013), and the Natural Science Foundation of Hainan Province (No. 120RC449).