Abstract
Localized multiple kernel learning (LMKL) is an effective method of multiple kernel learning (MKL). It tries to learn the optimal kernel from a set of predefined basic kernels by directly using the maximum margin principle embodied in the support vector machine (SVM). However, LMKL does not consider the radius of the minimum enclosing ball (MEB), which, together with the separating margin, actually determines the error bound of SVM. In this paper, we propose an improved version of LMKL, named ILMKL. The proposed method explicitly takes both the margin and the radius into consideration and thus achieves better performance than its counterpart. Moreover, the proposed method can automatically tune the regularization parameter while learning the optimal kernel and consequently avoids the time-consuming cross-validation process for choosing this parameter. Comprehensive experiments are conducted, and the results demonstrate the effectiveness and efficiency of the proposed method.
1. Introduction
Over the past decade, kernel methods [1] have drawn considerable attention from researchers in the machine learning community and have been widely applied. A kernel characterizes the similarity between two samples [2]. In practice, the performance of a kernel-based algorithm often strongly depends on the selected kernel, and an unsuitable kernel generally leads to poor performance. It is therefore critical to choose a suitable kernel for a kernel-based algorithm.
Recent research on kernel methods has highlighted the need to learn a suitable kernel matrix or function from the training data. A generic technique for doing so is multiple kernel learning (MKL) [3]. Given a set of predefined basic kernel functions, MKL tries to find their combination by employing a criterion that maximizes a generalization performance measure or minimizes an error bound. Since practical problems frequently involve multiple heterogeneous data sources [4], MKL fits this setting naturally. Many studies [5–13] have shown that MKL can generally find a suitable combination of basic kernel functions and so usually achieves better performance than a single kernel. The idea of MKL has been applied in all sorts of kernel-based algorithms, for example, the support vector machine (SVM), a powerful machine learning method based on Vapnik’s statistical learning theory [14]. In this paper, we focus only on SVM-based MKL.
Localized multiple kernel learning (LMKL) [15–17] is an attractive MKL method that combines multiple heterogeneous attributes according to their discriminative ability for each individual instance. Most other MKL methods learn a global combination over the whole input space [2, 5, 6, 18, 19], whereas LMKL embodies the idea that a sample-specific local combination should better reflect the distinctive characteristics of each instance; this is the key difference between LMKL and other MKL methods. Overall, LMKL consists of an SVM learning problem and a parametric gating model, where the gating model assigns local weights to the predefined basic kernels. In LMKL, a two-step alternating optimization method is employed to train the two components. Compared with other MKL methods, LMKL generally provides fewer support vectors while achieving statistically similar accuracy. The idea of LMKL has been extended to other kernel-based methods and successfully used in some practical applications [20–23].
However, LMKL learns the kernel function (essentially the parameters of the gating model) only by maximizing the margin embodied in single-kernel SVM. A key fact is that the generalization performance of SVM depends not only on the separating margin but also on the radius of the smallest ball that encloses the data [24–28]. Standard (single-kernel) SVM has no need to exploit the radius, because the radius of the minimum enclosing ball (MEB) is fixed once the kernel and its parameters are selected. In the context of LMKL, however, the radius is not fixed but is a function of the parameters of the gating model.
Actually, several recent attempts have been directed at incorporating the radius into SVM-based MKL [29]. However, most of these works address SVM with L2 soft margin (L2-SVM), because the problem of L2-SVM can be transformed into a form of SVM with hard margin, for which the radius-margin bound holds and can be used to conduct model selection [25]. Unfortunately, for SVM with L1 soft margin (L1-SVM), the radius-margin bound does not hold, since there is no way to reduce the formulation of L1-SVM to a hard-margin SVM. Hence, one cannot directly utilize the radius-margin bound in LMKL, whose formulation is rooted in L1-SVM. However, in [27], Chung et al. investigated several heuristic bounds for SVM and developed a modified radius-margin bound to conduct model selection for L1-SVM. Their experimental results indicated its effectiveness.
Inspired by the work of Chung et al. [27] and aiming at this drawback of LMKL, in this paper we propose an improved version of LMKL, named ILMKL. A noticeable characteristic of the proposed method is that it takes account of both the separating margin and the radius of the MEB; that is, it integrates the information of the margin and the radius to measure the goodness of the kernel and to learn the parameters of the gating model. A key insight of our work is that learning the parameters of the gating model in LMKL is similar to conducting model selection. Analogous to LMKL, the optimization problem of the proposed method can be efficiently solved by employing the two-step alternating optimization method. Moreover, the proposed method treats the regularization parameter as an extra variable that can be automatically learned, so that it is tuned jointly with the parameters of the gating model during the kernel learning process. This improves the computational efficiency of the proposed method to some extent since it avoids the time-consuming cross-validation process. Comprehensive experiments are conducted, and the experimental results demonstrate the efficiency and effectiveness of the proposed method.
The rest of the paper is organized as follows. Section 2 reviews the related work. In Section 3, we first present the formulation of the proposed method and then detail how to solve its optimization problem; after that, we give some preliminary discussion and outline the algorithm. Section 4 reports the experimental results, and conclusions are drawn in Section 5.
2. Preliminaries
In this paper, we suppose a training dataset consisting of $n$ samples, represented by $\mathcal{D}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$, where the samples $\mathbf{x}_{i}\in\mathbb{R}^{d}$ and their corresponding labels $y_{i}\in\{-1,+1\}$. Here, $d$ is the dimension of the sample space. For convenience, denote by $\mathbb{N}_{n}=\{1,2,\dots,n\}$ the set of all sample indices.
2.1. Radius-Margin Bound for SVM
SVM embodies the structural risk minimization principle, which is related to the probability of incorrectly classifying an unknown sample. Geometrically, the key idea of SVM is to construct a separating hyperplane between the two classes of samples by employing the maximal margin principle [14]. In the nonlinear case, SVM with L1 soft margin (L1-SVM) defines the following optimization problem:
$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i=1}^{n}\xi_{i}\quad \text{s.t.}\ y_{i}\big(\langle\mathbf{w},\Phi(\mathbf{x}_{i})\rangle+b\big)\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in\mathbb{N}_{n},\tag{1}$$
where $\Phi(\cdot)$ maps the samples into a feature space, $\langle\cdot,\cdot\rangle$ is the inner product of two vectors, $\sum_{i}\xi_{i}$ represents the training error, and $C$ is the regularization parameter that balances the training error and the regularization term $\frac{1}{2}\|\mathbf{w}\|^{2}$. Problem (1) can be efficiently solved by transforming it into its corresponding dual problem [30], which is formulated as
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}k(\mathbf{x}_{i},\mathbf{x}_{j})\quad \text{s.t.}\ \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\ 0\leq\alpha_{i}\leq C,\ i\in\mathbb{N}_{n}.\tag{2}$$
Here, $k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\Phi(\mathbf{x}_{i}),\Phi(\mathbf{x}_{j})\rangle$ is called the kernel function. Suppose that $\boldsymbol{\alpha}^{*}$ solves the above optimization problem and $b^{*}$ is the optimal threshold, which can be computed by using the KKT conditions of (1); the decision function of SVM is then formulated as
$$f(\mathbf{x})=\operatorname{sign}\Big(\sum_{i=1}^{n}\alpha_{i}^{*}y_{i}k(\mathbf{x}_{i},\mathbf{x})+b^{*}\Big).\tag{3}$$
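To make the dependence on the kernel concrete, the following minimal sketch (illustrative code, not tied to this paper) trains a soft-margin SVM on a precomputed Gram matrix with scikit-learn, so that the choice of kernel enters only through the matrices `K_train` and `K_test`; the `rbf_kernel` helper and the random data are assumptions made for the example.

```python
# Sketch: soft-margin SVM on a precomputed Gram matrix (illustrative only).
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X1 and X2."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(80, 4)), rng.normal(size=(20, 4))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=80))   # toy labels

K_train = rbf_kernel(X_train, X_train)      # n x n Gram matrix
K_test = rbf_kernel(X_test, X_train)        # m x n cross-kernel

clf = SVC(C=1.0, kernel="precomputed")      # solves the standard dual (2)
clf.fit(K_train, y_train)
y_pred = clf.predict(K_test)                # evaluates the decision function (3)
```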
To obtain good performance in practical applications, it is very important to choose suitable hyperparameters for SVM, namely, the regularization parameter $C$ and the kernel parameter of the kernel function. This is the so-called model selection problem [24]. Generally, one can set these hyperparameters empirically, but this is difficult because the suitable values cannot be known in advance for every practical application. Many works have tried to find a good criterion to automatically learn the related hyperparameters [24–26]. In [27], Chung et al. proposed a modified radius-margin bound for L1-SVM (equation (4)), which combines the margin term, the training error, the radius $R$ of the MEB, and an additional parameter $\Delta$ that can generally be set to a fixed default value. This bound is differentiable and was successfully used to conduct model selection for L1-SVM in [27].
2.2. Localized Multiple Kernel Learning
In the context of MKL, we assume that there exist $M$ different feature mappings $\Phi_{m}(\cdot)$ ($m=1,\dots,M$), where the $m$th mapping is endowed with the base kernel $k_{m}(\cdot,\cdot)$ of the associated reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{m}$.
As a method of MKL, LMKL is based on L1-SVM and defines its optimization problem as follows:
$$\min_{\{\mathbf{w}_{m}\},b,\boldsymbol{\xi}}\ \frac{1}{2}\sum_{m=1}^{M}\|\mathbf{w}_{m}\|^{2}+C\sum_{i=1}^{n}\xi_{i}\quad \text{s.t.}\ y_{i}\Big(\sum_{m=1}^{M}\eta_{m}(\mathbf{x}_{i})\langle\mathbf{w}_{m},\Phi_{m}(\mathbf{x}_{i})\rangle+b\Big)\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in\mathbb{N}_{n},\tag{5}$$
where $\mathbf{w}_{m}$ is the weight vector in the $m$th feature space, $b$ is the threshold, $\xi_{i}$ are the slack variables whose sum $\sum_{i}\xi_{i}$ represents the training error, $C$ is a regularization parameter that balances the training error and the regularization term, and $\eta_{m}(\mathbf{x})$ is a gating function defined up to a set of parameters which need to be learned from the training data. Further, by using duality, the dual formulation of the primal problem in (5) (for a fixed gating model) is
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}k_{\eta}(\mathbf{x}_{i},\mathbf{x}_{j})\quad \text{s.t.}\ \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\ 0\leq\alpha_{i}\leq C,\ i\in\mathbb{N}_{n},\tag{6}$$
where the locally combined kernel function is defined as
$$k_{\eta}(\mathbf{x}_{i},\mathbf{x}_{j})=\sum_{m=1}^{M}\eta_{m}(\mathbf{x}_{i})\,k_{m}(\mathbf{x}_{i},\mathbf{x}_{j})\,\eta_{m}(\mathbf{x}_{j}).\tag{7}$$
If the gating model is constant (not a function of $\mathbf{x}$), LMKL finds a fixed combination over the whole input space and is similar to the original MKL formulation. The main advantage of LMKL is that it can achieve statistically similar accuracy while storing fewer support vectors compared with the original MKL.
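The locally combined kernel in (7) is straightforward to compute once the gating values are available. The sketch below is an illustrative implementation under the notation assumed here (gating values stored as an n × M array), not the authors' code.

```python
# Sketch: locally combined kernel matrix K_eta from base kernels and gating values.
import numpy as np

def locally_combined_kernel(base_kernels, gating):
    """base_kernels: array (M, n, n) of base Gram matrices K_m.
    gating: array (n, M) with gating[i, m] = eta_m(x_i).
    Returns the n x n locally combined Gram matrix K_eta of (7)."""
    M, n, _ = base_kernels.shape
    K_eta = np.zeros((n, n))
    for m in range(M):
        eta_m = gating[:, m]                              # eta_m(x_i) for all i
        K_eta += np.outer(eta_m, eta_m) * base_kernels[m]  # eta_m(x_i) K_m eta_m(x_j)
    return K_eta
```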
3. The Proposed LMKL Framework
In this section, we first present the primal optimization problem of the proposed method ILMKL and then detail how to solve it, and finally some preliminary discussion on the proposed method is given and the algorithm is outlined.
3.1. Primal Optimization Model of the Proposed Method
In the context of LMKL, it is easy to see that the radius of the MEB is not fixed but is a function of the parameters of the gating model. Nevertheless, LMKL learns the parameters of the gating model only by maximizing the separating margin. Therefore, LMKL ignores the fact that the generalization performance of SVM depends not only on the separating margin but also on the radius.
Actually, the purpose of learning the parameters of the gating model is in essence to yield an appropriate kernel matrix for good performance. In our opinion, this process is similar to model selection, by which SVM chooses appropriate hyperparameters to achieve good performance. Therefore, following the basic idea of [27], we define the primal optimization problem of ILMKL by incorporating the radius $R$ of the MEB into the margin-based objective of LMKL (problem (8)). As in Section 2.2, $\mathbf{w}_{m}$, $b$, and $\xi_{i}$ denote the weight vectors, the threshold, and the slack variables, $\sum_{i}\xi_{i}$ represents the training error, and $C$ is a regularization parameter that balances the training error and the regularization term. Problem (8) can be equivalently reformulated as problem (9). Here, $\eta_{m}(\mathbf{x}|\mathbf{V})$ is a gating function. As in [15], we employ the softmax gating model determined by the parameters $\mathbf{V}=\{\mathbf{v}_{m},v_{m0}\}_{m=1}^{M}$:
$$\eta_{m}(\mathbf{x}_{i}|\mathbf{V})=\frac{\exp\big(\sum_{k=1}^{d}v_{mk}x_{ik}+v_{m0}\big)}{\sum_{h=1}^{M}\exp\big(\sum_{k=1}^{d}v_{hk}x_{ik}+v_{h0}\big)},\tag{10}$$
where $x_{ik}$ is the $k$th feature of the $i$th sample, $\mathbf{v}_{m}=(v_{m1},\dots,v_{md})$ and $v_{m0}$ are the parameters of the gating model associated with the $m$th kernel, and the softmax guarantees nonnegativity. As pointed out in [17], more complex gating models can also be used.
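The softmax gating model (10) can be sketched as follows; the parameter names `V` and `v0` are chosen here for illustration. Its output can be fed directly into the locally combined kernel sketch given in Section 2.2.

```python
# Sketch: softmax gating values eta_m(x_i | V) for all samples and kernels.
import numpy as np

def softmax_gating(X, V, v0):
    """X: (n, d) samples; V: (M, d) gating weights; v0: (M,) biases.
    Returns an (n, M) matrix whose (i, m) entry is eta_m(x_i | V)."""
    scores = X @ V.T + v0                          # linear gating scores
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)        # rows sum to one, entries >= 0
```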
Obviously, in contrast with LMKL (5), ILMKL has two noticeable characteristics. One is that it takes the information of both the radius and the margin into consideration. The other is that it treats the regularization parameter $C$ as a variable that is automatically learned during the procedure of learning the parameters of the gating model. To sum up, the key insight of our method is that, in the context of LMKL, learning the parameters of the gating model is similar to conducting model selection for SVM.
3.2. Training with Alternating Optimization
Generally, it is very difficult to solve problem (9) directly. In LMKL, a two-step alternating optimization method is employed to find the parameters of the gating model and the discriminant function; in ILMKL, we use the same strategy.
The first step is to fix the gating model parameters $\mathbf{V}$ (and the regularization parameter $C$) and solve (9) with respect to the SVM variables; the second step is to optimize the parameters of the gating model by using a gradient-descent method. The objective value obtained for a fixed gating model is an upper bound for (9), and the gating parameters are then updated according to the current solution. Owing to the gradient-descent procedure, the objective value obtained at the next iteration cannot be greater than the current one, and, as iterations progress with a proper step size selection procedure, the objective value of (9) never increases. Note that this procedure does not guarantee convergence to the global optimum, so the initial gating parameters may affect the solution quality.
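The loop structure of this two-step procedure can be sketched as below. For brevity the sketch descends only on the margin-based dual objective (the radius/SVDD term of ILMKL is omitted) and replaces the analytic gradient with a crude finite-difference one, reusing the `softmax_gating` and `locally_combined_kernel` helpers sketched earlier; it illustrates the alternation, not the paper's actual implementation.

```python
# Sketch: two-step alternating optimization (margin term only, illustrative).
import numpy as np
from sklearn.svm import SVC

def dual_objective(K, y, C):
    """Optimal value of the SVM dual for the given Gram matrix K."""
    clf = SVC(C=C, kernel="precomputed").fit(K, y)
    a_y = clf.dual_coef_[0]                   # alpha_i * y_i on support vectors
    idx = clf.support_
    return np.sum(np.abs(a_y)) - 0.5 * a_y @ K[np.ix_(idx, idx)] @ a_y

def combined_kernel(base_kernels, X, V, v0):
    eta = softmax_gating(X, V, v0)            # helpers from the earlier sketches
    return locally_combined_kernel(base_kernels, eta)

def alternate_optimize(base_kernels, X, y, C, steps=20, lr=0.1, eps=1e-4):
    M, d = base_kernels.shape[0], X.shape[1]
    V, v0 = np.zeros((M, d)), np.zeros(M)     # uniform gating at initialization
    for _ in range(steps):
        # Step 1: fix the gating model, solve the convex SVM subproblem.
        J = dual_objective(combined_kernel(base_kernels, X, V, v0), y, C)
        # Step 2: gradient step on the gating weights (finite differences here;
        # the gating biases v0 are kept fixed for brevity).
        grad_V = np.zeros_like(V)
        for m in range(M):
            for k in range(d):
                Vp = V.copy(); Vp[m, k] += eps
                Jp = dual_objective(combined_kernel(base_kernels, X, Vp, v0), y, C)
                grad_V[m, k] = (Jp - J) / eps
        V -= lr * grad_V                      # descend on the objective value
    return V, v0
```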
In this subsection, we discuss how to solve problem (9) when the gating parameters $\mathbf{V}$ and the regularization parameter $C$ are fixed. In the following two subsections, we discuss, respectively, how to optimize $\mathbf{V}$ and $C$.
For a fixed gating model, the objective of (9) can be written in terms of the locally combined kernel (equations (11) and (12)). Therefore, for fixed $\mathbf{V}$ and $C$, problem (9) of the proposed method can be expressed as problem (13). Note that, once the gating model parameters and the regularization parameter are fixed, optimization problem (13) becomes convex. To find its solution, we switch to the dual: by using duality, the dual problem of the primal problem in (13) can be formulated as problem (14), where the locally combined kernel function is defined as in (7). Obviously, this formulation corresponds to solving, respectively, a canonical SVM dual problem and a canonical support vector domain description (SVDD) [31] dual problem with the kernel matrix $\mathbf{K}_{\eta}$, which should be positive semidefinite.
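For the SVDD part of the fixed-gating subproblem, the squared MEB radius can be obtained from the standard SVDD dual. The sketch below is generic SVDD code (not the authors' implementation) and would be applied to the locally combined Gram matrix $\mathbf{K}_{\eta}$.

```python
# Sketch: squared MEB radius via the standard SVDD dual
#   max_beta  sum_i beta_i K_ii - beta' K beta,   beta >= 0, sum_i beta_i = 1.
import numpy as np
from scipy.optimize import minimize

def meb_radius_squared(K):
    """Squared MEB radius for a (positive semidefinite) Gram matrix K."""
    n = K.shape[0]
    diag = np.diag(K)

    def neg_dual(beta):                        # minimize the negated dual
        return -(diag @ beta - beta @ K @ beta)

    def neg_dual_grad(beta):
        return -(diag - 2.0 * K @ beta)

    beta0 = np.full(n, 1.0 / n)                # feasible starting point
    res = minimize(neg_dual, beta0, jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    return -res.fun                            # optimal dual value equals R^2
```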
Finally, once the final gating model has been learned and problem (14) has been solved, the resulting discriminant function of the proposed method ILMKL can be expressed as
$$f(\mathbf{x})=\sum_{i=1}^{n}\alpha_{i}^{*}y_{i}\sum_{m=1}^{M}\eta_{m}(\mathbf{x}_{i}|\mathbf{V})\,k_{m}(\mathbf{x}_{i},\mathbf{x})\,\eta_{m}(\mathbf{x}|\mathbf{V})+b^{*}.\tag{15}$$
3.3. Optimizing the Parameters of the Gating Model
In order to optimize the parameters of the gating model by a gradient-descent method, one needs to calculate the derivatives of the primal objective with respect to the gating parameters $v_{mk}$ and $v_{m0}$. Next, we discuss how to calculate these derivatives.
First, the derivatives of the softmax gating model (10) with respect to its parameters are noted; from these, the derivatives of the locally combined kernel function (7) with respect to the gating parameters follow. Thus, we can calculate the derivatives of the objective of (14) with respect to the gating model parameters, and, further, the corresponding derivatives of the radius term can be obtained. Finally, applying the chain rule, the derivatives of the overall objective with respect to the parameters of the gating model can be formulated (equations (16)–(20)).
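Because these derivatives chain several terms together, a finite-difference check is a useful safeguard when implementing them. The sketch below is a generic gradient checker; `objective` and `analytic_grad` are placeholders to be supplied by the implementer.

```python
# Sketch: verify an analytic gradient against central finite differences.
import numpy as np

def check_gradient(objective, analytic_grad, V, eps=1e-6, tol=1e-4):
    """objective: V -> float; analytic_grad: V -> array of the same shape as V."""
    g = analytic_grad(V)
    g_num = np.zeros_like(V)
    it = np.nditer(V, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        Vp, Vm = V.copy(), V.copy()
        Vp[idx] += eps
        Vm[idx] -= eps
        g_num[idx] = (objective(Vp) - objective(Vm)) / (2 * eps)  # central difference
    return np.max(np.abs(g - g_num)) < tol      # True if the two gradients agree
```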
3.4. Optimizing the Regularization Parameter
In our method, the regularization parameter $C$ is treated as a variable that is learned together with the gating model. Similar to the process of optimizing the parameters of the gating model, we employ a gradient-descent method to optimize the regularization parameter, and so the derivative of (14) with respect to $C$ is needed. In the following, we discuss how to compute this derivative.
Actually, the derivative of (14) with respect to $C$ can be expressed in closed form (equation (21)). Obviously, two intermediate quantities need to be obtained in advance, and they are, respectively, computed as in (22) and (23).
3.5. Discussion
It should be noted that $C>0$ must hold throughout the whole procedure; however, during the iterations this condition may be violated by an unconstrained gradient step. To deal with this problem, following [27], we use the substitution $C=e^{c}$ and optimize $c$ instead of $C$ in the solving procedure. The reason is that $c$ can be any real number while $C=e^{c}$ remains positive, so the positivity of $C$ is guaranteed automatically. Accordingly, the above partial derivatives need to be rewritten: by the chain rule, we can modify (21) into $\partial J/\partial c=(\partial J/\partial C)(\partial C/\partial c)=C\,\partial J/\partial C$, where $J$ denotes the objective of (14).
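The effect of this reparameterization can be sketched as follows, under the assumption that the substitution is $C=e^{c}$ as in the log-scale tuning of [27]; the function names are illustrative.

```python
# Sketch: unconstrained gradient step on c with C = exp(c), keeping C > 0.
import numpy as np

def grad_wrt_c(grad_wrt_C, c):
    """Chain rule: dJ/dc = dJ/dC * dC/dc = exp(c) * dJ/dC."""
    return np.exp(c) * grad_wrt_C

def gradient_step_on_c(c, grad_wrt_C, lr=0.01):
    """One gradient-descent step on the unconstrained variable c."""
    return c - lr * grad_wrt_c(grad_wrt_C, c)

# Example: starting from C = 1 (c = 0) with dJ/dC = -2.0, the step increases C.
c_new = gradient_step_on_c(0.0, -2.0)
C_new = np.exp(c_new)                           # always strictly positive
```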
Finally, according to the above discussion, we outline the complete algorithm of ILMKL in Algorithm 1.
4. Experiments
In this section, we report the experimental results. In the first experiment, we investigate the influence of the parameters on the performance of ILMKL. In the second experiment, we further explore the behavior of learning the regularization parameter under different initial values on a synthetic dataset. In the third experiment, we conduct experiments on several UCI datasets and compare the proposed method with traditional MKL methods.
4.1. Parameter Influence on Performance of the Proposed Method
In the training procedure of LMKL, the regularization parameter $C$ must be predefined. In our method ILMKL, this parameter can be automatically tuned while learning the parameters of the gating model; however, we first need to set the bound parameter $\Delta$ and the initial value of $C$. In [27], the authors suggested default values for both. In this subsection, we investigate the influence of these two settings on the final learned regularization parameter and on the classification performance of the proposed method.
We used the Sonar dataset, selected from the UCI repository [32]. In the experiment, 50% of the dataset was randomly selected for training and the rest for testing. The data were preprocessed in the following way: first, the mean and the standard deviation of each feature were computed on the training data; then, the training examples were normalized to zero mean and unit variance; finally, the testing examples were preprocessed with the same mean and standard deviation. The base kernels include one linear kernel and one polynomial kernel of degree two. All kernel matrices are calculated and normalized to unit trace before training.
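The standardization step can be sketched as follows (illustrative code): the statistics are estimated on the training split only and then applied to both splits.

```python
# Sketch: standardize features using statistics from the training split only.
import numpy as np

def standardize(X_train, X_test, eps=1e-12):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps            # avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std
```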
Figure 1 shows the experimental results under different values of $\Delta$. From Figure 1(a), we can see that the value of $\Delta$ indeed influences the final value of the learned $C$. The reason, in our opinion, is that the algorithm may fall into a local minimum, since the adopted gradient-descent method cannot guarantee finding the global minimum. Nevertheless, the final values of the learned $C$ under different $\Delta$ are close to each other on the whole, and the classification accuracies under different $\Delta$ show almost no difference according to Figure 1(b). Figure 2 shows the experimental results under different initial values of $C$. From Figure 2(a), similar to the case shown in Figure 1(a), we can see that the initial value of $C$ also affects the final learned $C$; however, according to Figure 2(b), the initial value scarcely influences the classification accuracy.

Figure 1: (a) the final learned $C$ of ILMKL under different $\Delta$; (b) the classification accuracy of ILMKL under different $\Delta$.
Figure 2: (a) the final learned $C$ of ILMKL under different initial values of $C$; (b) the classification accuracy of ILMKL under different initial values of $C$.
Therefore, it can be concluded that the proposed method effectively learns a suitable $C$ in the SVM-based LMKL scenario under different $\Delta$ and different initial values of $C$.
4.2. Experimental Results on Synthetic Dataset
In the above experiments, the final learned regularization parameter $C$ is always much larger than its initial value, so $C$ almost always increases during the learning process. In this experiment, we show that the final learned $C$ can actually be either larger or smaller than the initial value.
Following [15], we create a synthetic dataset that consists of two classes, each containing 200 samples. The samples come from four Gaussian components (two for each class), each with its own mean vector and covariance matrix, where the samples in class 1 are drawn from the first two components (denoted by red ×) and the others belong to class 2 (denoted by blue +). Here, we adopt the same base kernels as in Section 4.1, that is, one linear kernel and one polynomial kernel of degree two. Before training, all kernel matrices are computed and normalized to unit trace in advance.
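A comparable synthetic dataset can be generated as sketched below; the means and covariance used here are placeholders chosen only for illustration and are not the paper's exact component parameters.

```python
# Sketch: two-class dataset from four Gaussian components (placeholder parameters).
import numpy as np

rng = np.random.default_rng(0)

def sample_class(mean_a, mean_b, cov, n=200):
    """Half the samples of a class from each of its two Gaussian components."""
    return np.vstack([rng.multivariate_normal(mean_a, cov, n // 2),
                      rng.multivariate_normal(mean_b, cov, n // 2)])

cov = np.array([[0.8, 0.0], [0.0, 2.0]])                 # placeholder covariance
X1 = sample_class([-3.0, 1.0], [3.0, 1.0], cov)          # class 1 (red x)
X2 = sample_class([-1.0, -2.0], [1.0, -2.0], cov)        # class 2 (blue +)
X = np.vstack([X1, X2])
y = np.hstack([np.ones(200), -np.ones(200)])
```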
Figure 3 illustrates the experimental results of LMKL under different values of the regularization parameter $C$. It can easily be seen that the value of the regularization parameter severely affects the result of LMKL; the result illustrated in Figure 3(c), where the regularization parameter is set as $C=10^{1}$, is clearly the best. The experimental results of the proposed method are illustrated in Figure 4. It can be seen that we obtain almost the same result under different initial values of $C$, and the final learned $C$ is sometimes smaller and sometimes larger than the initial value. Note that the final learned values of $C$ differ under different initial values; the reason, as pointed out in Section 4.1, is that a gradient-descent method is employed to optimize the regularization parameter. However, the final learned regularization parameters are close to each other. These experiments further validate that our method can automatically and effectively learn the regularization parameter $C$, which is an advantage over traditional LMKL.

Figure 3: the separating hyperplanes obtained by LMKL under (a) $C=10^{-1}$, (b) $C=10^{0}$, (c) $C=10^{1}$, (d) $C=10^{2}$, and (e) $C=10^{3}$.
Figure 4: the separating hyperplanes obtained by ILMKL under the initial values (a) $10^{-1}$, (b) $10^{0}$, (c) $10^{1}$, (d) $10^{2}$, and (e) $10^{3}$.
4.3. Experimental Results on UCI Datasets
In this subsection, we report the performance comparison about SimpleMKL [2], LMKL [15], and the proposed method on several UCI datasets [32].
In the experiment, we use 50% of each dataset as the training set and the rest as the test set. As in Section 4.1, the data were normalized (i.e., zero mean and unit standard deviation). The base kernels include seven Gaussian kernels with different widths and four polynomial kernels with degrees of one to four. Before training, all kernel matrices are calculated in advance and normalized to unit trace. Each experiment is repeated 50 times, and the mean accuracy and standard deviation are computed. In the experiments, SimpleMKL and LMKL employ cross-validation to choose the regularization parameter from a predefined set of candidate values. For our method ILMKL, cross-validation is unnecessary because it learns an appropriate value of the parameter.
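The construction of the base kernel family can be sketched as follows (illustrative code): several Gaussian kernels with different widths and polynomial kernels of degrees one to four, each Gram matrix normalized to unit trace. The particular width values are placeholders, not the ones used in the paper.

```python
# Sketch: base kernel bank of Gaussian and polynomial kernels, unit-trace normalized.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def unit_trace(K):
    return K / np.trace(K)

def build_base_kernels(X,
                       gammas=tuple(2.0 ** k for k in range(-3, 4)),  # placeholder widths
                       degrees=(1, 2, 3, 4)):
    kernels = [unit_trace(rbf_kernel(X, gamma=g)) for g in gammas]
    kernels += [unit_trace(polynomial_kernel(X, degree=d)) for d in degrees]
    return np.stack(kernels)                   # shape (M, n, n)
```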
Table 1 reports the classification accuracies of the SVM-based MKL methods on the selected datasets. From Table 1, it can be seen that LMKL has performance comparable to SimpleMKL. On the whole, however, ILMKL clearly improves the classification performance in contrast with SimpleMKL and LMKL. These experimental results indicate that the generalization performance of SVM-based MKL can be improved when the information of the radius of the MEB is considered, and the proposed method ILMKL embodies this idea.
For a rigorous comparison, we further conducted paired two-tailed t-tests [33] on these methods. In a t-test, the p value represents the probability that the two sets of results are generated from distributions with equal means; the smaller the p value, the more significant the difference between the two means. Generally, 0.05 is viewed as a typical threshold; that is, a difference is considered statistically significant when the p value is smaller than 0.05. Table 2 reports the results of the t-tests. For example, the p value of the t-test comparing LMKL and SimpleMKL on the Ionosphere dataset is 0.0862 (>0.05), meaning that SimpleMKL does not perform significantly better than LMKL on this dataset at the 0.05 significance level, although SimpleMKL has better classification accuracy according to Table 1. In contrast, ILMKL performs significantly better than SimpleMKL on this dataset, since the corresponding p value is 0.0296 (<0.05). From Table 2, ILMKL achieves, on the whole, a significant improvement in generalization performance in contrast with SimpleMKL and LMKL.
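The significance test used above can be reproduced with a standard paired two-tailed t-test over the per-repetition accuracies of two methods; the accuracy arrays below are illustrative, not values from the paper.

```python
# Sketch: paired two-tailed t-test over per-repetition accuracies.
import numpy as np
from scipy.stats import ttest_rel

acc_method_a = np.array([0.86, 0.88, 0.85, 0.87, 0.89])   # e.g., ILMKL runs (illustrative)
acc_method_b = np.array([0.84, 0.86, 0.85, 0.83, 0.86])   # e.g., SimpleMKL runs (illustrative)

t_stat, p_value = ttest_rel(acc_method_a, acc_method_b)
significant = p_value < 0.05                 # reject equal means at the 0.05 level
```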
Finally, we investigated the support vector percentages of several methods on the selected datasets, which are reported in Table 3. Generally, fewer support vectors mean less test time. From Table 3, LMKL tends to have more support vectors in contrast with SimpleMKL. The proposed method ILMKL has on the whole similar support vector percentages to LMKL. So, our method inherits the advantage of LMKL that it stores fewer support vectors but can achieve statistically similar accuracy results compared with other MKL methods.
5. Conclusions
In this paper, following the work in [27], we presented a novel LMKL method. Different from traditional LMKL, our method takes the information of both the radius and the margin into consideration when learning the parameters of the gating model and, as a result, achieves better accuracy. At the same time, our method can automatically tune the regularization parameter during the process of learning the gating model parameters, which improves its computational efficiency by avoiding the time-consuming cross-validation used to find a suitable regularization parameter. Comprehensive experiments were conducted on several toy and benchmark datasets, and the results demonstrate the efficiency and effectiveness of the proposed method.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
This work is supported in part by the Scientific Research Project “Chunhui Plan” of the Ministry of Education of China (Grant no. Z2015102), the Key Scientific Research Foundation of Sichuan Provincial Department of Education (no. 11ZA004), and the National Natural Science Foundation of China (Grants nos. 61103168, 61532009, and 61602390).