Abstract
Twin support vector regression (TSVR) generates two nonparallel hyperplanes by solving a pair of smaller-sized problems instead of the single larger-sized problem in the standard SVR. Due to its efficiency, TSVR is frequently applied in various areas. In this paper, we propose a new variant of TSVR, named Linear Twin Quadratic Surface Support Vector Regression (LTQSSVR), which directly uses two quadratic surfaces in the original space for regression. It is worth noting that our new approach not only avoids the difficult and time-consuming task of searching for a suitable kernel function and its corresponding parameters in the traditional SVR-based methods but also achieves better generalization performance. Besides, in order to further improve the efficiency and robustness of the model, we introduce the 1-norm to measure the error. The linear programming structure of the new model avoids the matrix inverse operation and makes the model tractable even for very large problems; the capability of handling large-scale problems is particularly important in the big-data era. In addition, to verify the effectiveness and efficiency of our model, we compare it with several well-known methods. The numerical experiments on 2 artificial data sets and 12 benchmark data sets demonstrate the validity and applicability of the proposed method.
1. Introduction
Support Vector Machine (SVM) was first introduced by Vapnik in [1, 2]. It is a classical classification method based on the principle of structural risk minimization. Due to its great performance, it has attracted much attention in the field of machine learning and has been widely applied in many areas [3, 4]. Similarly, the SVM-based method Support Vector Regression (SVR) is an efficient model for regression problems and guarantees strong generalization ability. In the literature, traditional regression methods include the General Regression Neural Network [5], Multiple Linear Regression [6], and Ridge Regression [7]. Compared with these models, SVR achieves better robustness and generalization, since it considers not only the empirical risk but also the structural risk. At present, SVM-based regression models mainly comprise Support Vector Regression (SVR) [8], Least Squares Support Vector Regression (LSSVR) [9], and Twin Support Vector Regression (TSVR) [10]. These models have been widely utilized in various areas, such as stock market forecasting [11, 12], image understanding [13, 14], and pattern recognition [15].
The classical SVR [8] obtains the final regression function by solving a quadratic programming problem in which the training error and the complexity of the model are minimized in the objective function. Similar to SVM, SVR enjoys good statistical properties. Based on this work, LSSVR makes an improvement by introducing a series of equality constraints; in this way, it only needs to solve a linear system of equations, which leads to a much lower computational complexity. Inspired by the twin support vector classification machine [16], Peng [10] proposed a method called TSVR. Unlike the traditional SVR with two parallel hyperplanes, this model uses two nonparallel hyperplanes to construct the final regression function. Due to this flexibility, TSVR has better generalization ability than the traditional SVR. Moreover, another advantage of TSVR is that the final regression function is obtained by solving two small-sized quadratic programming problems, which leads to a lower computational burden.
It is worth pointing out that, in order to deal with the nonlinear structures in most data sets, these three models first have to use mapping functions to project the original points into some high-dimensional spaces. Since these mapping functions cannot be handled directly, researchers introduce kernel functions to solve the dual problems. However, there is no universal rule for automatically choosing a suitable kernel function for a given data set. Since the kernel function largely affects the final performance of these kernel-based models, users have to spend a lot of time and effort selecting a proper kernel function and its corresponding parameters. This is a tedious and demanding task that damages the applicability of those models. Recently, Luo et al. [17] proposed a kernel-free fuzzy quadratic surface support vector machine (FQSSVM) model which directly generates a quadratic surface for classification. In this way, it skips the notorious search for a proper kernel function in the classical kernel-based SVM. Hence, this new model can save much effort for the user and greatly improve the total efficiency. Based on this leading work, some extensions of kernel-free SVM have been developed and applied [18–20]. The good results of these models demonstrate that the traditional kernel method can be replaced by the quadratic surface method.
In this paper, we propose a completely new linear twin quadratic surface support vector regression (LTQSSVR) model. The main contributions of our work can be summarized in three aspects. First, inspired by the cutting-edge idea of the FQSSVM model, we apply two nonparallel quadratic surfaces to build the regression model. In this way, we successfully avoid the kernel function selection process; hence, the applicability and efficiency of our model are significantly improved. It is worth noting that, to the best of our knowledge, this is the first kernel-free SVM-based regression model. Second, in order to increase the robustness of the model, we incorporate the 1-norm into our model to measure the error between the predicted and actual values. The numerical experiments demonstrate the effectiveness and robustness of this extended model. Third, we equivalently transform the above 1-norm regression model into a linear programming problem. Hence, we can further relieve the computational burden and improve the efficiency. Compared with the benchmark SVM-based regression models, our new model not only achieves slightly better accuracy but also builds and solves the model far more efficiently.
The remaining paper is arranged as follows. Section 2 briefly introduces those benchmark works. Then, the TQSSVR and LTQSSVR are proposed in Sections 3 and 4, respectively. After that, we conduct a comprehensive numerical experiment to compare the performances of those models in Section 5. Finally, we summarize this paper in Section 6.
2. Related Works
2.1. Support Vector Regression
First, we introduce some notations used in this paper. e denotes the vector of ones with an appropriate dimension, ℝ denotes the set of real numbers, ℝ^n denotes the n-dimensional real vector space, and I denotes the identity matrix with an appropriate dimension. For a matrix A ∈ ℝ^{m×n}, A_{ij} denotes the element in the ith row and jth column of A.
Consider a given data set T = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where x_i ∈ ℝ^n are input vectors and y_i ∈ {−1, 1} are outputs. Let A ∈ ℝ^{m×n} be the matrix whose ith row is A_i = x_i^T, and let Y = (y_1, y_2, …, y_m)^T denote the output vector. The basic idea of the SVM model is to find a hyperplane f(x) = w^T x + b that separates the training points into two classes with a maximum level of separation [2].
Comparatively, the elements of the output vector Y are real numbers in the SVR model. Hence, the goals of SVR and SVM differ: instead of constructing a proper classifier as in SVM, SVR aims to find a suitable regressor. In other words, SVR tries to keep all the training points as close to the regression plane as possible, while SVM tries to keep them as far away from the separating plane as possible. Specifically, SVR can be transformed into a classification-type problem by introducing an ϵ-insensitive tube, which allows a small error in fitting the training data; any error smaller than ϵ is ignored. Let ξ_1 and ξ_2 be the slack vectors that measure the errors of the samples lying outside the ϵ-tube. Then SVR can be formulated as the following quadratic programming problem:
\[
\begin{aligned}
\min_{w,b,\xi_1,\xi_2} \quad & \frac{1}{2}\|w\|^2 + C e^T(\xi_1 + \xi_2) \\
\text{s.t.} \quad & Y - (Aw + eb) \le e\epsilon + \xi_1, \\
& (Aw + eb) - Y \le e\epsilon + \xi_2, \\
& \xi_1 \ge 0, \ \xi_2 \ge 0.
\end{aligned}
\]
The interested readers can refer to [8] for more details.
Specifically, the first term of the objective function maximizes the margin of this classification-type problem, while the constraints indicate that all training points should be contained in the soft ϵ-tube. Here, C is a tradeoff parameter between the margin and the training errors.
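For illustration, the roles of C and ϵ can be seen in the following minimal Python sketch, which fits an ϵ-SVR with a Gaussian kernel to noisy sinc data using scikit-learn (not the MATLAB setup of our experiments); the parameter values are placeholders:

    import numpy as np
    from sklearn.svm import SVR

    # Noisy one-dimensional training data: y = sin(x)/x plus Gaussian noise.
    rng = np.random.default_rng(0)
    X = np.linspace(-4 * np.pi, 4 * np.pi, 200).reshape(-1, 1)
    y = np.sinc(X / np.pi).ravel() + rng.normal(0.0, 0.1, size=200)

    # C trades off flatness against training errors; epsilon is the tube width.
    model = SVR(kernel="rbf", C=4.0, epsilon=0.1, gamma=0.5)
    model.fit(X, y)
    y_pred = model.predict(X)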
2.2. Least Squares Support Vector Regression
LSSVR was first introduced in [9, 21]. Like SVR, the main idea of LSSVR is to seek a decision function of the form f(x) = w^T x + b and keep all the points within a small region. The regression function is obtained by solving the following quadratic programming problem [9, 22, 23]:
\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\xi^T\xi \\
\text{s.t.} \quad & Y = Aw + eb + \xi,
\end{aligned}
\]
where ξ is the slack vector. Similarly, its objective function maximizes the margin, while its constraints require all training points to be close to the regression plane. One significant difference between the above model and the SVR model is that solving LSSVR is equivalent to solving a linear system of equations; therefore, the computational process is relatively simple. On the other hand, it should be pointed out that almost all training points contribute to the decision function; hence, LSSVR no longer enjoys the sparsity of SVR.
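In the linear case, this amounts to a single linear system; the following numpy sketch (a helper of our own, not part of the original toolbox) solves the corresponding normal equations:

    import numpy as np

    def lssvr_linear(A, Y, C):
        # Minimize 0.5*||w||^2 + (C/2)*||Y - A w - e b||^2 over (w, b).
        m, n = A.shape
        e = np.ones(m)
        top = np.hstack([A.T @ A + np.eye(n) / C, (A.T @ e).reshape(-1, 1)])
        bottom = np.hstack([(e @ A).reshape(1, -1), np.array([[m]])])
        lhs = np.vstack([top, bottom])          # block system from the zero-gradient conditions
        rhs = np.concatenate([A.T @ Y, [e @ Y]])
        sol = np.linalg.solve(lhs, rhs)
        return sol[:n], sol[n]                  # (w, b)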
2.3. Twin Support Vector Regression
TSVR is similar to TSVM in that it also derives the following pair of nonparallel planes around the data points:
\[
f_1(x) = w_1^T x + b_1, \qquad f_2(x) = w_2^T x + b_2.
\]
However, there are some differences between TSVR and TSVM. First, TSVM considers only one class of data points in each quadratic programming problem, while TSVR uses all data points in both of its quadratic programming problems. Second, TSVM finds two hyperplanes such that each plane is close to one class and as far as possible from the other class, whereas TSVR determines the up- or down-bound function by using only one group of constraints in each quadratic programming problem. Specifically, TSVR is obtained by solving the following pair of quadratic programming problems [10]:
\[
\begin{aligned}
\min_{w_1,b_1,\xi} \quad & \frac{1}{2}\|Y - e\epsilon_1 - (Aw_1 + eb_1)\|^2 + C_1 e^T\xi \\
\text{s.t.} \quad & Y - (Aw_1 + eb_1) \ge e\epsilon_1 - \xi, \quad \xi \ge 0,
\end{aligned}
\]
and
\[
\begin{aligned}
\min_{w_2,b_2,\eta} \quad & \frac{1}{2}\|Y + e\epsilon_2 - (Aw_2 + eb_2)\|^2 + C_2 e^T\eta \\
\text{s.t.} \quad & (Aw_2 + eb_2) - Y \ge e\epsilon_2 - \eta, \quad \eta \ge 0,
\end{aligned}
\]
where C_1, C_2 > 0 and ϵ_1, ϵ_2 ≥ 0 are parameters and ξ and η are the slack vectors. The final regressor is generated by f(x) = (1/2)(f_1(x) + f_2(x)). TSVR is approximately four times faster than the standard SVR in theory due to its formulation.
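To make the structure of this pair concrete, the following Python sketch formulates the linear TSVR primal pair with cvxpy (the variable names are ours; in practice the kernelized version of Section 2.4 is solved):

    import cvxpy as cp
    import numpy as np

    def tsvr_linear(A, Y, C1, C2, eps1, eps2):
        m, n = A.shape
        G = np.hstack([A, np.ones((m, 1))])     # augmented data matrix [A e]

        # Down-bound regressor, u1 = (w1, b1).
        u1, xi = cp.Variable(n + 1), cp.Variable(m)
        f = Y - eps1                             # shifted targets Y - e*eps1
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(f - G @ u1) + C1 * cp.sum(xi)),
                   [f - G @ u1 >= -xi, xi >= 0]).solve()

        # Up-bound regressor, u2 = (w2, b2).
        u2, eta = cp.Variable(n + 1), cp.Variable(m)
        h = Y + eps2                             # shifted targets Y + e*eps2
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(h - G @ u2) + C2 * cp.sum(eta)),
                   [G @ u2 - h >= -eta, eta >= 0]).solve()

        return u1.value, u2.value                # average the two bound functions to predict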
2.4. Nonlinear Case and Dual Problems
For a nonlinear case, these three models (SVR, LSSVR, and TSVR) first need to project the data points into a higher dimensional space via a mapping function ϕ(x): ℝ^n → ℝ^d, d > n, and then conduct the linear regression f(x) = w^T ϕ(x) + b in this new space [24]. Since the mapping function is difficult to handle directly, researchers introduce a kernel function K(x_i, x_j) = ϕ(x_i) ⋅ ϕ(x_j) into the dual problems [25].
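For reference, the Gaussian kernel matrix used later in Section 5.1 can be computed with the following small numpy helper of our own:

    import numpy as np

    def gaussian_kernel(X1, X2, sigma):
        # K[i, j] = exp(-||x1_i - x2_j||^2 / sigma^2)
        sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                    + np.sum(X2 ** 2, axis=1)[None, :]
                    - 2.0 * X1 @ X2.T)
        return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)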
Following this way, the dual problem of SVR can be written as follows:
\[
\begin{aligned}
\max_{\alpha,\alpha^*} \quad & -\frac{1}{2}\sum_{i,j=1}^{m}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) - \epsilon\sum_{i=1}^{m}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{m}y_i(\alpha_i - \alpha_i^*) \\
\text{s.t.} \quad & \sum_{i=1}^{m}(\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \ i = 1, \ldots, m,
\end{aligned}
\]
where α and α* are the vectors of Lagrange multipliers [8]. Then, the decision function of SVR is of the following form:
\[
f(x) = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)K(x_i, x) + b.
\]
Similarly, by introducing the Lagrange multipliers α for the equality constraints of LSSVR, its dual problem reduces to the following linear system:
\[
\begin{pmatrix} 0 & e^T \\ e & K(A, A^T) + I/C \end{pmatrix}
\begin{pmatrix} b \\ \alpha \end{pmatrix} =
\begin{pmatrix} 0 \\ Y \end{pmatrix},
\]
where K(A, A^T) denotes the kernel matrix with entries K(x_i, x_j). The decision function of LSSVR is
\[
f(x) = \sum_{i=1}^{m}\alpha_i K(x_i, x) + b.
\]
Moreover, the dual problems of TSVR take the following forms:
\[
\begin{aligned}
\max_{\alpha} \quad & -\frac{1}{2}\alpha^T G(G^TG)^{-1}G^T\alpha + f^T\bigl(G(G^TG)^{-1}G^T - I\bigr)\alpha \\
\text{s.t.} \quad & 0 \le \alpha \le C_1 e,
\end{aligned}
\]
and
\[
\begin{aligned}
\max_{\gamma} \quad & -\frac{1}{2}\gamma^T G(G^TG)^{-1}G^T\gamma - h^T\bigl(G(G^TG)^{-1}G^T - I\bigr)\gamma \\
\text{s.t.} \quad & 0 \le \gamma \le C_2 e,
\end{aligned}
\]
where G = [K(A, A^T) \ \ e], f = Y - e\epsilon_1, and h = Y + e\epsilon_2. The parameters of the two bound functions f_1(x) = K(x^T, A^T)w_1 + b_1 and f_2(x) = K(x^T, A^T)w_2 + b_2 are recovered from the dual solutions by
\[
\begin{pmatrix} w_1 \\ b_1 \end{pmatrix} = (G^TG)^{-1}G^T(f - \alpha), \qquad
\begin{pmatrix} w_2 \\ b_2 \end{pmatrix} = (G^TG)^{-1}G^T(h + \gamma).
\]
The final regression result of TSVR is determined by the average value of f_1(x) and f_2(x) as follows:
\[
f(x) = \frac{1}{2}\bigl(f_1(x) + f_2(x)\bigr) = \frac{1}{2}\bigl(K(x^T, A^T)(w_1 + w_2) + b_1 + b_2\bigr).
\]
It is worth noting that there are many kinds of kernel functions, such as the linear kernel, the Gaussian kernel, and the polynomial kernel. However, there is no general guideline to help the user choose a suitable kernel for a given data set, and the choice of the kernel function largely affects the regression performance. Hence, the time-consuming search for a proper kernel function and its corresponding parameters significantly drags down the total efficiency.
3. Twin Quadratic Surface Support Vector Regression
In this paper, we propose a totally new kernel-free TQSSVR model, which directly generates quadratic surfaces for regression in the original space instead of projecting the data points into a higher dimensional space. Our new model overcomes the main drawback of the traditional SVM-based regression models by avoiding the kernel selection process. Hence, it has much higher efficiency and applicability.
For a given training data set, we want to find two quadratic surfaces
\[
f_1(x) = \frac{1}{2}x^T W_1 x + b_1^T x + c_1, \qquad
f_2(x) = \frac{1}{2}x^T W_2 x + b_2^T x + c_2,
\]
with symmetric matrices W_1, W_2 ∈ ℝ^{n×n}, vectors b_1, b_2 ∈ ℝ^n, and scalars c_1, c_2, which determine the ϵ-insensitive down- and up-bound regressors. Following the basic scheme of TSVR, TQSSVR can be formulated as the following two quadratic programming problems:
\[
\begin{aligned}
\min_{W_1,b_1,c_1,\xi} \quad & \frac{1}{2}\|Y - e\epsilon_1 - F_1(A)\|^2 + C_1 e^T\xi \\
\text{s.t.} \quad & Y - F_1(A) \ge e\epsilon_1 - \xi, \quad \xi \ge 0,
\end{aligned}
\]
and
\[
\begin{aligned}
\min_{W_2,b_2,c_2,\eta} \quad & \frac{1}{2}\|Y + e\epsilon_2 - F_2(A)\|^2 + C_2 e^T\eta \\
\text{s.t.} \quad & F_2(A) - Y \ge e\epsilon_2 - \eta, \quad \eta \ge 0,
\end{aligned}
\]
where F_k(A) = (f_k(x_1), …, f_k(x_m))^T for k = 1, 2, the parameters C_1, C_2 > 0 and ϵ_1, ϵ_2 ≥ 0 are chosen a priori, and ξ and η are the slack vectors. f_1(x) generates the ϵ_1-insensitive down-bound regressor, while f_2(x) generates the ϵ_2-insensitive up-bound regressor.
Note that the first term in the objective function of (14) is the sum of squared deviations of the (ϵ_1-shifted) training points from the quadratic surface f_1(x); therefore, minimizing it amounts to minimizing the regression error, and this term vanishes only when f_1(x) matches the shifted targets exactly. Moreover, the constraints indicate that the estimated function f_1(x) lies at least ϵ_1 below the training points, up to the slack ξ. The second term of the objective function minimizes the sum of the slack errors. Like SVR, we assume that we can tolerate at most an ϵ_1 deviation between f_1(x) and y. Similar explanations apply to problem (15).
It is worth pointing out that the above formulations can be further simplified by the following steps. Note that the matrices W_1 and W_2 are symmetric. First, let w̄_1 and w̄_2 be the vectors formed by taking the elements of the upper triangular part of W_1 and W_2, respectively:
\[
\bar{w}_k = (W_{k,11}, W_{k,12}, \ldots, W_{k,1n}, W_{k,22}, W_{k,23}, \ldots, W_{k,nn})^T \in \mathbb{R}^{n(n+1)/2}, \quad k = 1, 2.
\]
Then, for each point x_i, we construct a vector s_i ∈ ℝ^{n(n+1)/2+n} as follows:
\[
s_i = \Bigl(\tfrac{1}{2}x_{i1}^2,\ x_{i1}x_{i2},\ \ldots,\ x_{i1}x_{in},\ \tfrac{1}{2}x_{i2}^2,\ x_{i2}x_{i3},\ \ldots,\ \tfrac{1}{2}x_{in}^2,\ x_{i1},\ \ldots,\ x_{in}\Bigr)^T,
\]
so that \tfrac{1}{2}x_i^T W_k x_i + b_k^T x_i = s_i^T z_k. Let S = (s_1, s_2, \ldots, s_m)^T. Finally, we define two vectors of variables in the following form:
\[
z_1 = \begin{pmatrix} \bar{w}_1 \\ b_1 \end{pmatrix}, \qquad
z_2 = \begin{pmatrix} \bar{w}_2 \\ b_2 \end{pmatrix}.
\]
It is easy to check that problem (14) and problem (15) can be equivalently reformulated as follows:
\[
\begin{aligned}
\min_{z_1,c_1,\xi} \quad & \frac{1}{2}\|Y - e\epsilon_1 - (Sz_1 + ec_1)\|^2 + C_1 e^T\xi \\
\text{s.t.} \quad & Y - (Sz_1 + ec_1) \ge e\epsilon_1 - \xi, \quad \xi \ge 0,
\end{aligned}
\]
and
\[
\begin{aligned}
\min_{z_2,c_2,\eta} \quad & \frac{1}{2}\|Y + e\epsilon_2 - (Sz_2 + ec_2)\|^2 + C_2 e^T\eta \\
\text{s.t.} \quad & (Sz_2 + ec_2) - Y \ge e\epsilon_2 - \eta, \quad \eta \ge 0.
\end{aligned}
\]
For a point x, the final regression result of TQSSVR is the mean value of f_1(x) and f_2(x):
\[
f(x) = \frac{1}{2}\bigl(f_1(x) + f_2(x)\bigr) = \frac{1}{2}\bigl(s^T(z_1 + z_2) + c_1 + c_2\bigr),
\]
where s is constructed from x in the same way as s_i.
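A small Python helper of our own, shown below, builds the matrix S with rows s_i under the convention assumed above (a factor 1/2 on the squared terms). With S in hand, the reformulated problems have exactly the same structure as the linear TSVR pair sketched in Section 2.3, with the augmented matrix [S e] in place of [A e].

    import numpy as np

    def quad_features(X):
        # Map each row x of X to the vector s used in the TQSSVR reformulation,
        # assuming the surface 0.5*x^T W x + b^T x + c with symmetric W.
        m, n = X.shape
        cols = []
        for j in range(n):
            for k in range(j, n):
                if j == k:
                    cols.append(0.5 * X[:, j] ** 2)    # squared terms carry 1/2
                else:
                    cols.append(X[:, j] * X[:, k])     # cross terms appear once
        for j in range(n):
            cols.append(X[:, j])                       # linear terms
        return np.column_stack(cols)   # S with rows s_i, shape (m, n(n+1)/2 + n)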
4. Linear Twin Quadratic Surface Support Vector Regression
In the above model, the error between the predicted value and the actual value is measured by the 2-norm. It is worth noting that the 2-norm is sensitive to points that are far away from the regressor and may amplify the influence of such error points, especially in the presence of outliers or mislabeled information. Compared with the 2-norm, the 1-norm is more robust and less sensitive to such errors [26, 27].
In this section, in order to increase the robustness, we introduce the 1-norm and transform problem (19) and problem (20) into two linear programming problems. This not only enhances the performance but also greatly improves the computational efficiency. Hence, we extend our model by replacing the squared 2-norm error terms with 1-norm terms as follows:
\[
\begin{aligned}
\min_{z_1,c_1,\xi} \quad & \|Y - e\epsilon_1 - (Sz_1 + ec_1)\|_1 + C_1 e^T\xi \\
\text{s.t.} \quad & Y - (Sz_1 + ec_1) \ge e\epsilon_1 - \xi, \quad \xi \ge 0,
\end{aligned}
\]
and
\[
\begin{aligned}
\min_{z_2,c_2,\eta} \quad & \|Y + e\epsilon_2 - (Sz_2 + ec_2)\|_1 + C_2 e^T\eta \\
\text{s.t.} \quad & (Sz_2 + ec_2) - Y \ge e\epsilon_2 - \eta, \quad \eta \ge 0.
\end{aligned}
\]
For a vector a, ‖a‖_1 is the sum of the absolute values of all its elements. Let s_1 = ‖Y − eϵ_1 − (Sz_1 + ec_1)‖_1 and s_2 = ‖Y + eϵ_2 − (Sz_2 + ec_2)‖_1. Then, for any nonnegative vectors t_1 and t_2 satisfying
\[
-t_1 \le Y - e\epsilon_1 - (Sz_1 + ec_1) \le t_1, \qquad
-t_2 \le Y + e\epsilon_2 - (Sz_2 + ec_2) \le t_2,
\]
we have s_1 ≤ e^T t_1 and s_2 ≤ e^T t_2, with equality when t_1 and t_2 equal the componentwise absolute values of the residuals.
Based on these inequalities, we can reformulate the two twin regression problems as the following two linear programming problems:
\[
\begin{aligned}
\min_{z_1,c_1,t_1,\xi} \quad & e^T t_1 + C_1 e^T\xi \\
\text{s.t.} \quad & -t_1 \le Y - e\epsilon_1 - (Sz_1 + ec_1) \le t_1, \\
& Y - (Sz_1 + ec_1) \ge e\epsilon_1 - \xi, \quad \xi \ge 0, \ t_1 \ge 0,
\end{aligned}
\]
and
\[
\begin{aligned}
\min_{z_2,c_2,t_2,\eta} \quad & e^T t_2 + C_2 e^T\eta \\
\text{s.t.} \quad & -t_2 \le Y + e\epsilon_2 - (Sz_2 + ec_2) \le t_2, \\
& (Sz_2 + ec_2) - Y \ge e\epsilon_2 - \eta, \quad \eta \ge 0, \ t_2 \ge 0.
\end{aligned}
\]
It is worth pointing out that the linear structure of these two reformulations greatly improves the computational efficiency. Moreover, using the 1-norm distance instead of the 2-norm distance of TQSSVR makes the new model less sensitive to noise or large errors.
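To make the reformulation concrete, the following sketch assembles and solves the first subproblem with scipy.optimize.linprog, mirroring the role played by linprog.m in our experiments; the variable layout and helper name are our own, S can be built by the quad_features helper above, and the second subproblem is assembled analogously with Y + eϵ_2 and the reversed bound constraint.

    import numpy as np
    from scipy.optimize import linprog

    def ltqssvr_lp1(S, Y, C1, eps1):
        # Variable layout: x = [z1 (d), c1 (1), t (m), xi (m)], where t bounds the
        # absolute residuals |Y - e*eps1 - (S z1 + e c1)| componentwise.
        m, d = S.shape
        e = np.ones((m, 1))
        Im, Zm = np.eye(m), np.zeros((m, m))
        rhs = Y - eps1                           # Y - e*eps1

        # Objective: sum(t) + C1 * sum(xi); z1 and c1 do not appear.
        c_vec = np.concatenate([np.zeros(d + 1), np.ones(m), C1 * np.ones(m)])

        # Inequality constraints written as A_ub @ x <= b_ub.
        A_ub = np.vstack([
            np.hstack([ S,  e, -Im, Zm]),   #  S z1 + c1 - t  <=  Y - eps1
            np.hstack([-S, -e, -Im, Zm]),   # -S z1 - c1 - t  <= -(Y - eps1)
            np.hstack([ S,  e,  Zm, -Im]),  #  S z1 + c1 - xi <=  Y - eps1
        ])
        b_ub = np.concatenate([rhs, -rhs, rhs])

        bounds = [(None, None)] * (d + 1) + [(0, None)] * (2 * m)
        res = linprog(c_vec, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:d], res.x[d]               # (z1, c1)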
To further illustrate the robustness of the 1-norm, we give an example here. The points (x_i, y_i) were generated with x_i uniformly distributed on [0, 1] and additive Gaussian noise ξ_i following N(0, 0.15²). For simplicity, we set ϵ_1 = ϵ_2 = 0, and the final regressors were derived by LTQSSVR and TQSSVR. Then, to test the robustness of TQSSVR and LTQSSVR, we added some outliers to the data. As shown in Figure 1, the two regressors almost overlap on the data without outliers. However, on the data with outliers, TQSSVR is obviously biased while LTQSSVR remains quite robust.
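A data-generation sketch for this kind of robustness check is given below; the underlying function g(x) = x² and the outlier magnitudes are placeholder choices of ours, not necessarily those used for Figure 1.

    import numpy as np

    def make_robustness_data(m=100, n_outliers=5, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(0.0, 1.0, size=m)
        y = x ** 2 + rng.normal(0.0, 0.15, size=m)           # clean targets plus N(0, 0.15^2) noise
        idx = rng.choice(m, size=n_outliers, replace=False)
        y_out = y.copy()
        y_out[idx] += rng.uniform(2.0, 3.0, size=n_outliers)  # inject a few large outliers
        return x.reshape(-1, 1), y, y_out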

Figure 1: Regression results of TQSSVR and LTQSSVR: (a) data without outliers; (b) data with outliers.
5. Numerical Experiment and Discussion
To investigate the performance of the proposed LTQSSVR and compare it with other benchmark SVM-based regression methods, we conducted a comprehensive numerical experiment. The comparison list includes Ridge Regression, Linear Regression, nu-SVR, ϵ-SVR, LSSVR, TSVR, LTSVR, and our LTQSSVR, where LTSVR, proposed by Xu [28], introduces the 1-norm into TSVR. Besides, two different types of artificial data sets and twelve benchmark real-world data sets were used in the experiment.
All computational tests were executed in MATLAB (R2016a) on a personal laptop with an Intel P4 processor (1.8 GHz) and 4 GB usable RAM. All the corresponding models were solved by the function "quadprog.m" or "linprog.m" in the MATLAB toolbox (available from https://pan.baidu.com/s/1QRX40tVcnO–f4bz8c0–58Q.password:itiz).
5.1. Kernel Function and Parameters Selection
For the kernel-based models, the Gaussian kernel function K(x_i, x_j) = exp(−∥x_i − x_j∥²/σ²) was adopted, since it is the most commonly used one. The corresponding optimal kernel parameter σ in nu-SVR, ϵ-SVR, LSSVR, TSVR, and LTSVR was selected from the set {2^i ∣ i = −4, −3, …, 4, 5}. It is worth pointing out that we only considered the Gaussian kernel for the kernel-based models in the experiment; had other kernel functions also been compared in the searching process, the total running times of those models would have increased tremendously.
Moreover, we set C_1 = C_2 = C and ϵ_1 = ϵ_2 = ϵ in our experiment. Specifically, the optimal penalty parameter C for these models was selected from the set {2^i ∣ i = −4, −3, …, 4, 5}, and the insensitive parameter ϵ was chosen from the set {0.001, 0.01, 0.1, 0.2}. Note that all these parameters were determined by fivefold cross validation: the data set was randomly split into five subsets, and each time one subset was reserved as the testing set while the other four subsets were used together as the training set. For every data set, this process was repeated ten times, and all reported results are the average values of these ten tests.
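The tuning procedure can be summarized by the following Python sketch, in which fit_predict is a hypothetical wrapper around any of the compared regressors and the grids match those listed above:

    import itertools
    import numpy as np
    from sklearn.model_selection import KFold

    def tune(X, y, fit_predict, n_repeats=10, seed=0):
        # Select (C, eps) by repeated five-fold cross validation on RMSE.
        C_grid = [2.0 ** i for i in range(-4, 6)]
        eps_grid = [0.001, 0.01, 0.1, 0.2]
        best, best_score = None, np.inf
        for C, eps in itertools.product(C_grid, eps_grid):
            scores = []
            for rep in range(n_repeats):
                kf = KFold(n_splits=5, shuffle=True, random_state=seed + rep)
                for tr, te in kf.split(X):
                    pred = fit_predict(X[tr], y[tr], X[te], C, eps)
                    scores.append(np.sqrt(np.mean((y[te] - pred) ** 2)))  # RMSE
            if np.mean(scores) < best_score:
                best, best_score = (C, eps), np.mean(scores)
        return best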
In each cross validation, we use the following four indicators from [10, 28] as the criteria for selecting the optimal parameters and verifying the performance of the eight models (a small code sketch of these measures follows the list):
(1) RMSE: Root Mean Squared Error, defined as RMSE = ((1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²)^{1/2}, where ŷ_i denotes the predicted value of y_i. It measures the deviation between the predicted value and the true value. Because RMSE averages squared errors, it is sensitive to abnormal points: if the regression value of a certain point is unreasonable, its error is relatively large and has a great impact on the RMSE. In general, the smaller the RMSE, the more accurate the prediction results.
(2) MAE: Mean Absolute Error, defined as MAE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i|. It is the mean of the absolute errors and reflects the actual magnitude of the prediction error.
(3) SSE/SST: SSE (Sum of Squares due to Error), SSE = Σ_{i=1}^{m} (y_i − ŷ_i)², measures the difference between the actual values and the predicted values, which represents the part not explained by the regression equation. SST (Sum of Squares Total), SST = Σ_{i=1}^{m} (y_i − ȳ)², measures the deviation of y from its mean ȳ, which represents the total variation of the data. In most cases, a smaller SSE/SST means a better consistency between estimates and actual values.
(4) SSR/SST: SSR (Sum of Squares due to Regression), SSR = Σ_{i=1}^{m} (ŷ_i − ȳ)², measures the difference between the estimated values and the mean value of y, which represents the part explained by the regression equation. The larger the SSR, the more statistical information the model captures from the test sample. A small SSE/SST is usually accompanied by an increase in SSR/SST.
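These four indicators can be computed directly from their definitions, as in the following short helper:

    import numpy as np

    def regression_metrics(y_true, y_pred):
        resid = y_true - y_pred
        rmse = np.sqrt(np.mean(resid ** 2))
        mae = np.mean(np.abs(resid))
        sse = np.sum(resid ** 2)                      # unexplained variation
        sst = np.sum((y_true - y_true.mean()) ** 2)   # total variation
        ssr = np.sum((y_pred - y_true.mean()) ** 2)   # explained variation
        return rmse, mae, sse / sst, ssr / sst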
5.2. Artificial Data Sets
First, we followed the traditional way in [10, 28] to generate a classical type of artificial data set in the experiment. This is a 2-D artificial data set obtained from the sinc function y = sinc(x) = sin(x)/x. In reality, data sets always contain noise points; in order to check the performance of our model in this situation, two kinds of noises, Gaussian noise and uniformly distributed noise, were added. Specifically, four types of training points were generated by corrupting the sinc outputs with different noise settings, where U[a, b] represents the uniform distribution on [a, b] and N[0, d²] represents the normal distribution with mean 0 and variance d². To avoid a biased comparison, we randomly generated 10 groups of independent noise samples for each type of noise. Following the traditional way in [10, 28], each group contains 600 training samples and 400 test samples, and the test points were generated by the function sinc(x) without noise. The corresponding results of the eight algorithms Ridge Regression, Linear Regression, nu-SVR, ϵ-SVR, LSSVR, TSVR, LTSVR, and LTQSSVR are shown in Table 1.
From Table 1, we can see that Ridge Regression achieves the best results, obtaining the minimum RMSE and MAE values, for noises of Type 1 and Type 2, while ϵ-SVR obtains the best results for noises of Type 3 and Type 4. Besides, LTQSSVR shows a stable performance on most data sets and has better robustness to noise. It can also be seen from Table 1 that Linear Regression is the fastest among all algorithms; however, it is only suitable for cases where the data have a linear structure.
In order to verify the generalization ability of the eight models, we used another type of artificial data set from [20] to test their performances. We now describe how this type of artificial data set is obtained. First, we used different matrices U and vectors c to generate various quadratic surfaces, where each element of U and c was randomly selected from the interval [−10, 10]. Then, we randomly generated points on both sides of each quadratic surface. Similarly, we generated 10 independent groups of samples, where each group includes 60% training samples with noise and 40% test points without noise. The corresponding results are shown in Table 2.
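A sketch of this generator is given below; the sampling domain of the inputs, the symmetrization of U, and the noise scale are our own assumptions, since the exact settings follow [20]:

    import numpy as np

    def quadratic_surface_data(m=500, n=3, noise=0.1, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.uniform(-10.0, 10.0, size=(n, n))
        U = 0.5 * (U + U.T)                               # symmetrize the quadratic term
        c = rng.uniform(-10.0, 10.0, size=n)
        X = rng.uniform(-1.0, 1.0, size=(m, n))
        y = np.einsum("ij,jk,ik->i", X, U, X) + X @ c     # points on the surface
        y += rng.normal(0.0, noise * np.std(y), size=m)   # scatter them around it
        return X, y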
From Table 2, we can see that LTQSSVR achieves the best performance in seven of the eight cases, obtaining the smallest RMSE, MAE, and SSE/SST values. Compared with TSVR, LTSVR improves the training results by introducing the 1-norm instead of the original 2-norm, but for large-scale data sets its advantage is not significant. Among these methods, Ridge Regression has the shortest processing time, followed by Linear Regression, nu-SVR, and ϵ-SVR, which lack robustness to noise. Moreover, the experimental results on data sets of different scales show that LTQSSVR performs particularly well on large-scale but low-dimensional data sets. Besides, we also show the box plots of the RMSE values in Figure 2. It is easy to check that the RMSE values of LTQSSVR are concentrated and have the smallest average; hence, LTQSSVR is robust in most cases.

Figure 2: Box plots of the RMSE values of the eight methods on the second type of artificial data sets ((a)–(c)).
5.3. Benchmark Data Sets
In this part, we use 12 benchmark data sets from the UCI Repository to test the eight models (available from https://archive.ics.uci.edu/ml/datasets.php). The list includes Auto MPG, Oring, Wis, Slump, Haberman, and so on. The detailed information of these data sets is summarized in Table 3.
To prevent features with large magnitudes from dominating, we normalized the features of all data sets to [0, 1] before training. The experimental results of the eight models are summarized in Table 4, where for each result the first term denotes the average value over ten runs and the second term denotes the standard deviation. From Table 4, we can see that LTQSSVR obtains good results in most of the cases. Compared with LSSVR and TSVR, the new model greatly improves the computational efficiency while maintaining the accuracy.
The performances of the eight methods on the 12 data sets are summarized in Table 4. We can see that the introduction of the 1-norm in LTSVR improves the efficiency in some cases, but this advantage does not hold for large-sized data sets. In contrast, LTQSSVR greatly improves the training efficiency in most of the cases. It can be seen from Figures 3 and 4 that the processing time of LTQSSVR is much shorter than those of the other methods; hence, LTQSSVR has a large advantage in terms of efficiency. This is because our kernel-free model avoids the time-consuming task of selecting a proper kernel and its parameters. Besides, the structure of two small-scale linear programming problems further accelerates the computation [29–32].


From Table 4, we can easily see that our proposed method generates stable regressions on most of the data sets. It achieves the best accuracy on two data sets (Auto mpg, Computer hardware), the second best accuracy on one data set (Hayes roth), the third best accuracy on two data sets (Body fat, Haberman), and the fourth best accuracy on five data sets (Oring, Wis, Real estate valuation, Slump, Ozone). We summarize the average rank information of these eight methods on the 12 data sets in Table 5.
Then, we employ the Friedman test to check for differences among all methods under the null hypothesis that all the algorithms perform equally well, i.e., that their mean ranks are equal. The Friedman statistic is
\[
\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right],
\]
where R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j and r_i^j denotes the rank of the jth of k methods on the ith of N data sets. Then, the statistic F_F is obtained by
\[
F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}.
\]
With eight methods and 12 data sets, F_F is distributed according to the F distribution with (7, 77) degrees of freedom. According to (26), (27), and Table 5, we obtain F_F = 2.3593. The critical value of F(7, 77) for α = 0.05 is 2.131 and, similarly, 1.796 for α = 0.1, so we reject the null hypothesis at both levels. Hence, there is a significant difference among the eight methods. Note that LTQSSVR obtains the smallest average rank, 3.25, which means that the accuracy of our approach is also very close to the best one.
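The test can be reproduced with a few lines of Python; the average ranks below are placeholders rather than the actual values from Table 5 (only the value 3.25 for LTQSSVR is reported above):

    import numpy as np
    from scipy.stats import f as f_dist

    k, N = 8, 12
    # Hypothetical average ranks (they must sum to k*(k+1)/2 = 36).
    avg_ranks = np.array([3.25, 4.10, 4.40, 4.60, 4.70, 4.80, 5.00, 5.15])

    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    crit = f_dist.ppf(0.95, k - 1, (k - 1) * (N - 1))   # critical value of F(7, 77) at alpha = 0.05
    print(chi2_F, F_F, crit, F_F > crit)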
As for the processing time, Ridge Regression and Linear Regression take the shortest time among these methods, but they lack robustness. In addition, LTQSSVR takes much shorter computational time than TSVR and LTSVR on all data sets. For the large data sets, TSVR and LTSVR require a lot of time, since they have to tune the parameters in the kernel function. In particular, we only used the Gaussian kernel function in this paper; hence, if the selection of the kernel function were added to those kernel-based models, their efficiency would be dragged down even further. Therefore, these methods are not suitable for dealing with huge-sized cases, and their applicability is strictly limited in the big-data era. In contrast, due to its kernel-free and linear programming structure, our new approach has a strong ability to handle huge-sized data sets quickly. Overall, our proposed method not only has good generalization ability but also achieves great efficiency.
6. Conclusion
In this paper, we proposed a new variant of TSVR, which directly uses two quadratic surfaces in the original space to fit the data. It is worth noting that our new approach avoids the difficult and time-consuming task of searching for a suitable kernel function and its corresponding parameters in the traditional kernel-based SVR methods. Besides, in order to further improve the efficiency and robustness, our model incorporates the 1-norm to measure the regression error. The corresponding linear programming structure avoids the matrix inverse operation and leads to high efficiency even for huge-sized problems. Finally, the experimental results on different data sets demonstrate the validity and applicability of our method. Specifically, compared with those benchmark nonlinear regression models, our model is superior in both accuracy and efficiency.
For future work, since the efficiency advantage of our method is less obvious on high-dimensional data sets, we hope to incorporate a feature selection method to improve it.
Data Availability
The data used in this paper are available from https://archive.ics.uci.edu/ml/datasets.php.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
Tian’s research has been supported by the Fundamental Research Funds for the Central Universities (Nos. JBK2002001, JBK1805005, and JBK190504).