Abstract
Multikernel clustering achieves clustering of linearly inseparable data by applying a kernel method to samples in multiple views. A localized SimpleMKKM (LI-SimpleMKKM) algorithm has recently been proposed to perform min-max optimization in multikernel clustering where each instance is only required to be aligned with a certain proportion of the relatively close samples. The method has improved the reliability of clustering by focusing on the more closely paired samples and dropping the more distant ones. Although LI-SimpleMKKM achieves remarkable success in a wide range of applications, the method keeps the sum of the kernel weights unchanged. Thus, it restricts kernel weights and does not consider the correlation between the kernel matrices, especially between paired instances. To overcome such limitations, we propose adding a matrix-induced regularization to localized SimpleMKKM (LI-SimpleMKKM-MR). Our approach addresses the kernel weight restrictions with the regularization term and enhances the complementarity between base kernels. Thus, it does not limit kernel weights and fully considers the correlation between paired instances. Extensive experiments on several publicly available multikernel datasets show that our method performs better than its counterparts.
1. Introduction
Clustering is a widely used machine learning technique [1–4]. Multikernel clustering builds on multiview clustering and performs clustering by implicitly mapping the sample points of different views into high-dimensional spaces. Many studies have been carried out in recent years [5–9]. For example, early work [10] shows that kernel matrices can encode different views or sources of the data, and MKKM [11] extends the kernel combination by adapting the weights of the kernel matrices. Gönen and Margolin [12] improve the performance of MKKM by learning sample-specific weights that focus on the correlations between neighbors, a method called localized MKKM. Du et al. [13] employed the $\ell_{2,1}$-norm to reduce the uncertainty of the results caused by unexpected factors such as outliers. To enhance the complementary nature of the base kernels and reduce redundancy, Liu et al. [14] employed a regularization term containing a matrix that measures the correlation between base kernels to facilitate alignment. Other works [15–19] differ from the original MKKM method [11], which fuses the kernels of multiple views in advance. These methods first obtain the clustering result of each kernel matrix and then fuse these results at a later stage to obtain a unified result.
More recently, a newly proposed optimization strategy, simple multiple kernel k-means (SimpleMKKM) [20], has emerged as a representative of multikernel clustering (MKC). Unlike the standard MKKM algorithm, SimpleMKKM minimizes over the kernel weights and maximizes over the cluster partition, which leads to a min-max optimization problem that is difficult to solve directly. SimpleMKKM converts it into a minimization problem and solves it with a specially designed gradient descent method rather than a coordinate descent method. However, strictly aligning the combined kernel matrix forces the alignment to hold globally, including between distant samples. Therefore, Liu et al. [21] proposed localized SimpleMKKM, which reduces the negative impact of distant samples on clustering by restricting the kernel alignment to the k-nearest neighbors of each sample rather than aligning globally. In this way, LI-SimpleMKKM can sufficiently account for the variation between samples, improving clustering performance.
Although localized SimpleMKKM shows excellent performance on MKC problems, we find that it does not sufficiently consider the correlation between the given kernels, which leaves room for improvement in two respects:
(i) The original method [21] stabilizes the optimization by choosing the largest weight as the reference in the gradient descent step and maintains the sum and nonnegativity constraints on the weights through its association with the other weights. However, this idea only links the weights of different views and does not consider the relationship between the view kernel matrices, especially between pairs.
(ii) The original method may simultaneously select highly correlated kernels for clustering. Repeatedly selecting similar information sources makes the combination redundant and reduces its diversity, so the different kernel matrices are used ineffectively, which ultimately affects the accuracy of the clustering results.
Motivated by these observations, we propose a localized SimpleMKKM with matrix-induced regularization (LI-SimpleMKKM-MR), which improves upon the LI-SimpleMKKM algorithm by adding a regularization term containing a matrix that measures the correlation between each pair of base kernel matrices. LI-SimpleMKKM-MR reduces the probability of simultaneously selecting highly correlated kernels, thereby enhancing the diversity of the combined kernel and the complementarity of low-correlation kernels. Moreover, it inherits the advantages of localized SimpleMKKM: it aligns each sample only with its k nearest neighbors through a neighbor index matrix, and it uses the min-max optimization strategy instead of min-min.
Compared with the original multiple kernel clustering, the proposed method optimizes kernel matrix weights by using gradient descent rather than coordinate descent, combined with localized sample alignment and kernel matrix induced regularization. This reduces the negative effects of forced alignment of long-distance samples and high redundancy and low complementarity of multiple kernel matrices.
We evaluated the algorithm on six benchmark datasets and compared it with nine baseline algorithms that solve similar problems, using four metrics: clustering accuracy (ACC), normalized mutual information (NMI), purity, and Rand index. We find that LI-SimpleMKKM-MR outperforms the other methods. To the best of our knowledge, this is the first work to fully consider and address the correlation between the base kernels.
The contributions of this method are summarized as follows:
(1) The proposed LI-SimpleMKKM-MR algorithm deals with the alignment problem between kernel matrices through a regularization term that accounts for their correlation, which reduces the redundancy and enhances the complementarity between kernel matrices.
(2) The proposed method reduces to SimpleMKKM or LI-SimpleMKKM by adjusting the hyperparameters, making LI-SimpleMKKM-MR an extension of the previous two methods.
(3) We conducted extensive experiments on six public multiple kernel datasets using four metrics. The results show that our method achieves state-of-the-art performance compared to nine existing baseline algorithms. The experiments support our analysis of the previous problems and the effectiveness of the proposed solution.
2. Related Works
2.1. Multiple Kernel K-Means
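Let $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ be a set of n samples, and let $\phi_p(\cdot): x \in \mathcal{X} \mapsto \mathcal{H}_p$ map the features of a sample in the $p$th view into a high-dimensional Hilbert space $\mathcal{H}_p$ ($1 \le p \le m$). Each sample can then be represented by $\phi_{\gamma}(x) = [\gamma_1 \phi_1(x)^{\top}, \ldots, \gamma_m \phi_m(x)^{\top}]^{\top}$, where $\gamma = [\gamma_1, \ldots, \gamma_m]^{\top}$ denotes the weights of the m prespecified base kernels $\{\kappa_p(\cdot,\cdot)\}_{p=1}^{m}$. The kernel weights are updated during the kernel learning step of the optimization. From the definition of $\phi_{\gamma}(x)$ and the definition of a kernel function, the combined kernel function can be defined as follows:
$$\kappa_{\gamma}(x_i, x_j) = \phi_{\gamma}(x_i)^{\top}\phi_{\gamma}(x_j) = \sum_{p=1}^{m} \gamma_p^{2}\, \kappa_p(x_i, x_j). \tag{1}$$
Applying (1) to the training samples $\{x_i\}_{i=1}^{n}$ gives a kernel matrix $\mathbf{K}_{\gamma}$. Based on $\mathbf{K}_{\gamma}$, the objective function of MKKM can be expressed as follows:
$$\min_{\mathbf{H},\,\gamma}\ \mathrm{Tr}\big(\mathbf{K}_{\gamma}(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\big) \quad \text{s.t.}\ \mathbf{H} \in \mathbb{R}^{n \times k},\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k,\ \gamma^{\top}\mathbf{1}_m = 1,\ \gamma_p \ge 0\ \forall p. \tag{2}$$
Here, $\mathbf{H}$ is a soft label matrix, also called the partition matrix, which relaxes the NP-hard problem caused by direct hard assignment. Moreover, $\mathbf{I}_k$ is the identity matrix of size $k \times k$.
Optimization of (2) can be divided into two steps: optimizing $\mathbf{H}$ or $\gamma$ while fixing the other one.
(i) Optimizing $\mathbf{H}$ with $\gamma$ fixed: the problem of optimizing $\mathbf{H}$ in (2) can be represented as follows:
$$\max_{\mathbf{H}}\ \mathrm{Tr}\big(\mathbf{H}^{\top}\mathbf{K}_{\gamma}\mathbf{H}\big) \quad \text{s.t.}\ \mathbf{H} \in \mathbb{R}^{n \times k},\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k. \tag{3}$$
The optimal $\mathbf{H}$ in (3) is obtained by taking the eigenvectors corresponding to the k largest eigenvalues of the matrix $\mathbf{K}_{\gamma}$.
(ii) Optimizing $\gamma$ with $\mathbf{H}$ fixed: with the soft label matrix fixed, the problem of optimizing $\gamma$ in (2) can be represented as follows:
$$\min_{\gamma}\ \sum_{p=1}^{m} \gamma_p^{2}\, \mathrm{Tr}\big(\mathbf{K}_p(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\big) \quad \text{s.t.}\ \gamma^{\top}\mathbf{1}_m = 1,\ \gamma_p \ge 0\ \forall p. \tag{4}$$
Given these constraints, it can be easily solved by the Lagrange multiplier method [10].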
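The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: it assumes the combined kernel has the form $\mathbf{K}_{\gamma} = \sum_p \gamma_p^2 \mathbf{K}_p$, and the function names (`combine_kernels`, `update_H`, `update_gamma`, `mkkm`) are hypothetical. Under that assumption, the Lagrange multiplier solution of the weight step is $\gamma_p \propto 1/z_p$ with $z_p = \mathrm{Tr}(\mathbf{K}_p(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top}))$.

```python
import numpy as np

def combine_kernels(K, gamma):
    """K_gamma = sum_p gamma_p^2 K_p for a kernel stack K of shape (m, n, n).
    The squared-weight form is an assumption of this sketch."""
    return np.tensordot(gamma ** 2, K, axes=1)

def update_H(K_gamma, k):
    """H step: eigenvectors of K_gamma for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh((K_gamma + K_gamma.T) / 2)  # eigenvalues ascending
    return vecs[:, -k:]

def update_gamma(K, H):
    """gamma step: closed form gamma_p proportional to 1 / z_p."""
    residual = np.eye(H.shape[0]) - H @ H.T
    z = np.array([np.trace(Kp @ residual) for Kp in K])
    inv = 1.0 / np.maximum(z, 1e-12)  # guard against division by zero
    return inv / inv.sum()

def mkkm(K, k, n_iter=20):
    """Alternate the two steps; return H, gamma, and the objective history."""
    m, n, _ = K.shape
    gamma = np.full(m, 1.0 / m)  # start from uniform weights
    history = []
    for _ in range(n_iter):
        H = update_H(combine_kernels(K, gamma), k)
        gamma = update_gamma(K, H)
        residual = np.eye(n) - H @ H.T
        history.append(float(np.trace(combine_kernels(K, gamma) @ residual)))
    return H, gamma, history
```

Because each step solves its subproblem exactly, the recorded objective values should not increase from one iteration to the next. Cluster labels would then typically be obtained by running k-means on the rows of `H`.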
2.2. MKKM with Matrix-Induced Regularization
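As (2) shows, the relative value of $\gamma_p$ depends only on $\mathbf{K}_p$ and $\mathbf{H}$, so the interactions between different kernel matrices are not considered. Liu et al. [14] defined a criterion $\mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$ to measure the correlation between $\mathbf{K}_p$ and $\mathbf{K}_q$. A larger $\mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$ means high correlation between $\mathbf{K}_p$ and $\mathbf{K}_q$, and a smaller one implies that their correlation is low. By introducing this criterion into (2), we obtain the following objective function:
$$\min_{\mathbf{H},\,\gamma}\ \mathrm{Tr}\big(\mathbf{K}_{\gamma}(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\big) + \frac{\lambda}{2}\gamma^{\top}\mathbf{M}\gamma \quad \text{s.t.}\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k,\ \gamma^{\top}\mathbf{1}_m = 1,\ \gamma_p \ge 0\ \forall p, \tag{5}$$
where $\mathbf{M}$ is the $m \times m$ matrix with entries $M_{pq} = \mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$ and $\lambda$ is a hyperparameter that balances the clustering loss and the regularization term.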
2.3. Localized SimpleMKKM
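Unlike the existing min-min paradigm, SimpleMKKM adopts $\min_{\gamma}\max_{\mathbf{H}}$ optimization [20]. Localized SimpleMKKM extends it to make full use of the information between local sample neighbors, combining local alignment with min-max optimization to enhance the clustering effect. The objective of LI-SimpleMKKM can be represented as follows:
$$\min_{\gamma \in \Delta}\ \max_{\mathbf{H}}\ \mathrm{Tr}\Big(\mathbf{H}^{\top}\Big(\sum_{i=1}^{n}\mathbf{A}^{(i)} \odot \mathbf{K}_{\gamma}\Big)\mathbf{H}\Big) \quad \text{s.t.}\ \mathbf{H} \in \mathbb{R}^{n \times k},\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k, \tag{6}$$
where $\Delta = \{\gamma \in \mathbb{R}^{m} \mid \sum_{p=1}^{m}\gamma_p = 1,\ \gamma_p \ge 0\ \forall p\}$, $\odot$ denotes the elementwise product, and $\mathbf{A}^{(i)}$ with $\mathbf{A}^{(i)} \in \{0,1\}^{n \times n}$ are the ith sample’s neighborhood mask matrices; that is, only the samples closest to the target sample will be aligned. This problem is hard to solve with a simple two-step alternating optimization. Instead, LI-SimpleMKKM first optimizes $\mathbf{H}$ by a method similar to MKKM and then converts the problem into a minimization with respect to $\gamma$. After proving the differentiability of the resulting function, the gradient descent method can be used to optimize $\gamma$ [21].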
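To make the neighborhood masking concrete, the sketch below builds the summed mask and applies it to a stack of base kernels. It rests on assumptions not stated in the text: neighbors are taken as the round($\tau n$) most similar samples under the average kernel, each mask is the outer product of a 0/1 neighbor indicator vector with itself, and the names `neighborhood_mask_sum`, `localize_kernels`, and `tau` are hypothetical.

```python
import numpy as np

def neighborhood_mask_sum(K_ref, tau):
    """Return sum_i a_i a_i^T, where a_i in {0,1}^n marks the round(tau * n)
    samples most similar to sample i under the reference kernel K_ref."""
    n = K_ref.shape[0]
    k = min(n, max(1, int(round(tau * n))))
    # Indices of the k largest similarities in each row.
    nearest = np.argsort(-K_ref, axis=1, kind="stable")[:, :k]
    indicator = np.zeros((n, n))
    indicator[np.arange(n)[:, None], nearest] = 1.0  # row i holds a_i
    return indicator.T @ indicator

def localize_kernels(K, tau):
    """Elementwise-mask every base kernel in the stack K of shape (m, n, n)."""
    mask = neighborhood_mask_sum(K.mean(axis=0), tau)  # average kernel (assumption)
    return K * mask  # the n x n mask broadcasts over the m kernels
```

With `tau = 1.0` every sample is a neighbor of every other, the mask is constant, and the localized kernels are a rescaled copy of the originals, which corresponds to global alignment. Since the summed mask is a Gram matrix, the elementwise product with a positive semidefinite kernel stays positive semidefinite.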
3. Localized Simple Multiple Kernel K-Means with Matrix-Induced Regularization
According to Liu et al. [21], the relative value of $\gamma_p$ depends only on $\mathbf{K}_p$, $\mathbf{H}$, and $\gamma_u$, where u indexes the largest component of $\gamma$. Only the weights of different kernels are linked, which indicates that the LI-SimpleMKKM algorithm does not fully consider the interaction between the kernels when optimizing the kernel weights. This motivates us to introduce a regularization term that measures the correlation between the base kernels to remedy this shortcoming.
3.1. Formulation
Although the performance of clustering can be improved to some extent by aligning samples with closer samples, there is still room for further improvement of that algorithm.
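To address this issue, we define a criterion $\mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$ to measure the correlation between $\mathbf{K}_p$ and $\mathbf{K}_q$. A larger $\mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$ means high correlation between $\mathbf{K}_p$ and $\mathbf{K}_q$, and a smaller one implies that their correlation is low. We propose to add a matrix-induced regularization to LI-SimpleMKKM, enhancing the kernel alignment between multiple kernels and reducing the redundancy of kernels with higher correlation. By combining the regularization term with (6), we obtain the following objective function:
$$\min_{\gamma \in \Delta}\ \max_{\mathbf{H}}\ \mathrm{Tr}\Big(\mathbf{H}^{\top}\Big(\sum_{i=1}^{n}\mathbf{A}^{(i)} \odot \mathbf{K}_{\gamma}\Big)\mathbf{H}\Big) + \frac{\lambda}{2}\gamma^{\top}\mathbf{M}\gamma \quad \text{s.t.}\ \mathbf{H} \in \mathbb{R}^{n \times k},\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k, \tag{7}$$
where $\Delta$ and $\mathbf{A}^{(i)}$ are the weight simplex and neighborhood mask matrices of (6), $\mathbf{M}$ is the $m \times m$ matrix with entries $M_{pq} = \mathcal{M}(\mathbf{K}_p, \mathbf{K}_q)$, and $\lambda$ is a trade-off parameter that balances the clustering loss and the regularization term on the kernel weights. Many choices exist for the correlation criterion, such as the KL divergence and the Hilbert–Schmidt independence criterion.
In our proposed algorithm, we set $M_{pq} = \mathrm{Tr}(\mathbf{K}_p^{\top}\mathbf{K}_q)$ for each element of $\mathbf{M}$ to measure the correlation between $\mathbf{K}_p$ and $\mathbf{K}_q$. This choice keeps the computation simple and implicitly adopts the Hilbert–Schmidt independence criterion, which reflects the correlation between different base kernels to a certain extent.
Incorporating the regularization term makes better use of the base kernels, thus improving clustering performance. Moreover, if we set $\lambda = 0$, equation (7) reduces to LI-SimpleMKKM, so LI-SimpleMKKM is a special case of (7).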
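As an illustration of how the matrix-induced regularization penalizes redundant kernels, the following sketch computes a correlation matrix and the penalty term. It assumes the trace form $M_{pq} = \mathrm{Tr}(\mathbf{K}_p^{\top}\mathbf{K}_q)$ and a penalty of $\frac{\lambda}{2}\gamma^{\top}\mathbf{M}\gamma$. The function names and the toy kernels are hypothetical.

```python
import numpy as np

def kernel_correlation_matrix(K):
    """M[p, q] = Tr(K_p^T K_q) for a kernel stack K of shape (m, n, n).
    The trace equals the sum of elementwise products of the two matrices."""
    return np.einsum("pij,qij->pq", K, K)

def regularizer(gamma, M, lam):
    """Penalty (lam / 2) * gamma^T M gamma on the kernel weights."""
    return 0.5 * lam * float(gamma @ M @ gamma)

# Toy stack: kernels 0 and 1 are identical, kernel 2 is uncorrelated with them.
K_toy = np.array([
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.0, -1.0], [-1.0, 1.0]],
])
M_toy = kernel_correlation_matrix(K_toy)
redundant = regularizer(np.array([0.5, 0.5, 0.0]), M_toy, lam=1.0)
diverse = regularizer(np.array([0.5, 0.0, 0.5]), M_toy, lam=1.0)
```

In this toy case, putting the weight on the two identical kernels gives a penalty twice as large as splitting it between two uncorrelated kernels, which is the behavior the regularization term is meant to encourage.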
Li et al. [22] use instead of as a regular term, where means a matrix with , . Although this method shows excellent performance, we find that the induced regularization of matrices should be global rather than local because the kernel alignment should be for the global kernel matrix. It can also be found from the experimental results in Table 1 that the global kernel-induced regularization has a better effect.
3.2. Alternate Optimization
We design a two-step alternating optimization to solve the formula in (7).(i)Optimizing by is fixed: fixed , the optimization value with respect to in (7) is represented as follows: Treating the summation as a whole, (8) can be solved by solving for the eigenvalues of the matrix.(ii)Optimizing by is fixed: fixed , the optimization value with respect to in (7) can be represented as follows:
We first prove the differentiability of (9), then calculate the gradient, and optimize by the gradient descent method. The first part of the objective function in (9) is as follows:
With the hyperparameter defined, we can regard as a whole, which is global kernel alignment and PSD [21]. For convenience, we let .
Thus, the function in (9) can be represented as follows:with
Theorem 1.  in (12) is differentiable. ,
where .
Proof. For any given , the maximum of optimization problem  is unique [21], with  the corresponding maximizer. According to theorem in [23], the former part of  is differentiable. By defining other elements in  except for  as s and the latter part of the  as , the differential of  can be expressed as follows:where  means one of the components of  and s means all of the other components so that , and the whole  in (12) is differentiable.
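We can solve this problem by designing a reduced gradient descent method. After obtaining the gradient of $J(\gamma)$, we update $\gamma$ by gradient descent [23] while satisfying the equality constraint $\gamma^{\top}\mathbf{1}_m = 1$ and the nonnegativity constraints $\gamma_p \ge 0$. To implement this method, we let $\gamma_u$ be a nonzero component of $\gamma$, and $\nabla J(\gamma)$ denotes the reduced gradient of $J(\gamma)$. The $p$th ($p \neq u$) element of $\nabla J(\gamma)$ is
$$[\nabla J(\gamma)]_p = \frac{\partial J(\gamma)}{\partial \gamma_p} - \frac{\partial J(\gamma)}{\partial \gamma_u}, \quad \forall p \neq u, \tag{14}$$
and
$$[\nabla J(\gamma)]_u = \sum_{p=1,\,p \neq u}^{m}\Big(\frac{\partial J(\gamma)}{\partial \gamma_u} - \frac{\partial J(\gamma)}{\partial \gamma_p}\Big). \tag{15}$$
To improve numerical stability, we choose u as the index of the largest component of the vector $\gamma$. The nonnegativity constraint on $\gamma$ also needs to be considered during gradient descent.
To minimize $J(\gamma)$, we take $-\nabla J(\gamma)$ as a descent direction. However, if there is an index $p$ with $\gamma_p = 0$ and $[\nabla J(\gamma)]_p > 0$, then $\gamma_p < 0$ may occur when $\gamma$ is updated, violating the nonnegativity constraint. In this case, the descent direction for that component is set to zero. This gives the descent direction $\mathbf{d}$ as follows:
$$d_p = \begin{cases} 0, & \text{if } \gamma_p = 0 \text{ and } [\nabla J(\gamma)]_p > 0, \\ -[\nabla J(\gamma)]_p, & \text{if } \gamma_p > 0 \text{ and } p \neq u, \\ -[\nabla J(\gamma)]_u, & \text{if } p = u. \end{cases} \tag{16}$$
The update adopts the formula $\gamma \leftarrow \gamma + \alpha\mathbf{d}$, where $\alpha$ is the step size. We determine $\alpha$ by a one-dimensional line search rather than setting it directly, and, to ensure global convergence, the search uses an appropriate stopping criterion, for example, Armijo’s rule [21].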
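A reduced-gradient update of this kind can be sketched as follows. The sketch is illustrative and makes several assumptions: the objective is taken to be $J(\gamma) = \max_{\mathbf{H}} \mathrm{Tr}(\mathbf{H}^{\top}\tilde{\mathbf{K}}_{\gamma}\mathbf{H}) + \frac{\lambda}{2}\gamma^{\top}\mathbf{M}\gamma$ with $\tilde{\mathbf{K}}_{\gamma} = \sum_p \gamma_p^2 \tilde{\mathbf{K}}_p$ over already localized kernels `K_loc`, the maximizer is assumed unique so that the gradient is $2\gamma_p \mathrm{Tr}(\mathbf{H}^{\top}\tilde{\mathbf{K}}_p\mathbf{H}) + \lambda(\mathbf{M}\gamma)_p$, and the reference component of the direction is set to the negative sum of the others (as in SimpleMKL-style solvers) so that the weights keep summing to one. All function names are hypothetical.

```python
import numpy as np

def objective_and_grad(K_loc, M, gamma, lam, k):
    """J(gamma) and its gradient, assuming a unique maximizer H."""
    K_gamma = np.tensordot(gamma ** 2, K_loc, axes=1)
    _, vecs = np.linalg.eigh((K_gamma + K_gamma.T) / 2)
    H = vecs[:, -k:]  # eigenvectors of the k largest eigenvalues
    align = np.array([np.trace(H.T @ Kp @ H) for Kp in K_loc])
    J = float(gamma ** 2 @ align + 0.5 * lam * gamma @ M @ gamma)
    grad = 2.0 * gamma * align + lam * (M @ gamma)
    return J, grad

def descent_direction(grad, gamma, tol=1e-12):
    """Reduced-gradient direction that keeps sum(gamma) fixed."""
    u = int(np.argmax(gamma))          # reference: largest weight
    reduced = grad - grad[u]           # reduced gradient for p != u
    d = -reduced
    d[(gamma <= tol) & (reduced > 0)] = 0.0  # do not push zero weights negative
    d[u] = 0.0
    d[u] = -d.sum()                    # equality constraint: sum(d) == 0
    return d

def gradient_step(K_loc, M, gamma, lam, k, c=1e-4, shrink=0.5, max_tries=30):
    """One update of gamma with a backtracking (Armijo) line search."""
    J0, grad = objective_and_grad(K_loc, M, gamma, lam, k)
    d = descent_direction(grad, gamma)
    decreasing = d < 0
    # Largest step that keeps every weight nonnegative.
    step = float(np.min(-gamma[decreasing] / d[decreasing])) if decreasing.any() else 0.0
    slope = float(grad @ d)            # nonpositive by construction
    for _ in range(max_tries):
        candidate = np.clip(gamma + step * d, 0.0, None)
        candidate /= candidate.sum()   # remove round-off drift
        J1, _ = objective_and_grad(K_loc, M, candidate, lam, k)
        if J1 <= J0 + c * step * slope:
            return candidate, J1
        step *= shrink
    return gamma, J0                   # no acceptable step found
```

Calling `gradient_step` repeatedly until the objective stops decreasing gives a simple solver loop. Each accepted step satisfies the Armijo condition, so the recorded objective values do not increase.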
The specific calculation steps of the algorithm in equation (13) are detailed in Algorithm 1.
Theorem 2. The proposed algorithm converges.
Proof. Note that for the kth iteration, will be bigger than k + 1th iteration. In each iteration, the gradient of is smaller than 0 by equation (14) because u is the component of , and in order to get the maximum of , should be larger than other parts, so the differential of it is bigger than others. The component u has the gradient which is the opposite number of other component gradients’ sum by the equation (15). According to the equation (16), the component of will be bigger, while the coefficient of u will be smaller, and we can let as the difference of the kth iteration and k + 1th’s, with , with as the largest part of each , is bigger than 0, it can be easy to get the conclusion is smaller than 0, because the non-negativity of and kernel matrix, the former term has the lower bound 0 and convex, so the former term’s convergence is been proofed.
As for the latter term with the similar thought, it is also decreasing monotonically because is a PSD matrix, is not negative, and is bigger than 0; the second derivative of can be easy to be calculated bigger than 0 (since each element of is bigger than 0), so the latter term has the lower bound 0 and convex. At the same time, the whole equation (13) is monotonically decreasing and lower-bounded.
3.3. Computational Complexity Analysis
We theoretically analyze the time complexity of the algorithm LI-SimpleMKKM-MR. We assume that n and m denote the number of samples and the number of base kernels. LI-SimpleMKKM-MR based on Algorithm 1 first computes a neighborhood mask matrix with computational complexity and then computes the regularization term with computational complexity . Therefore, the time complexity of LI-SimpleMKKM-MR is per iteration.
Let us compare the complexity of LI-SimpleMKKM-MR and LI-SimpleMKKM. Since in most cases, the number of base kernels is much fewer than the number of samples , compared with LI-SimpleMKKM , the time complexity of the proposed method does not increase significantly.
4. Experiments
4.1. Datasets
In this section, we evaluate the clustering performance of our algorithm on a set of standard MKKM benchmark datasets, including Oxford Flower17 (FLO17) and Flower102 (FLO102) (https://www.robots.ox.ac.uk/~vgg/data/flowers/), Protein Fold Prediction (proteinFold) (https://mkl.ucsd.edu/dataset/protein-fold-prediction/), Digital (https://ss.sysu.edu.cn/~py/), and Caltech101-25views (Cal-25views) and Caltech101-7classes (Cal-7classes) (https://files.is.tue.mpg.de/pgehler/projects/iccv09/). Caltech101-25views denotes the variant with 25 randomly selected kernels, and Caltech101-7classes denotes the variant with 7 randomly selected classes. The details of these datasets can be found in Table 2. We use these datasets to compare the performance of the different MKKM algorithms.
4.2. Compared Algorithms
In addition to the proposed localized SimpleMKKM with matrix-induced regularization, we tested nine other MKKM algorithms for comparison: average kernel k-means (Avg-KKM), multiple kernel k-means (MKKM) [10], localized multiple kernel k-means (LMKKM) [12], optimal neighborhood kernel clustering (ONKC) [24], multiple kernel k-means with matrix-induced regularization (MKKM-MR) [14], multiple kernel clustering with local alignment maximization (LKAM) [22], multiview clustering via late fusion alignment maximization (LF-MVC) [25], simple multiple kernel k-means (SimpleMKKM) [20], and localized SimpleMKKM (LI-SimpleMKKM) [21].
The implementations of the comparison algorithms are publicly available from the corresponding papers, and we apply them directly in our experiments without modification. Among these algorithms, ONKC, MKKM-MR, LKAM, LF-MVC, and LI-SimpleMKKM have hyperparameters to set. Based on the published papers and our own experiments, we report the best clustering results of these methods obtained by tuning the hyperparameters on each dataset.
4.3. Experimental Settings
In all experiments, to reduce the difference between views, all the base kernels are first centered and then scaled so that for all i and p, we have $\mathbf{K}_p(x_i, x_i) = 1$. For our proposed algorithm, its trade-off parameters and are chosen from and by grid search, where n is the number of samples.
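The centering and scaling step can be sketched as below. This is an assumed reading of the preprocessing: centering is done in feature space with the usual centering matrix, and scaling divides each entry by the square roots of the corresponding diagonal entries so that every diagonal entry becomes 1. The function names are hypothetical.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: C K C with C = I - (1/n) 1 1^T."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def scale_kernel(K, eps=1e-12):
    """Scale so that K[i, i] == 1: K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.maximum(np.diag(K), eps))  # eps guards against zero diagonals
    return K / np.outer(d, d)

def preprocess(K_stack):
    """Center, then scale, every base kernel in a stack of shape (m, n, n)."""
    return np.stack([scale_kernel(center_kernel(Kp)) for Kp in K_stack])
```

After this preprocessing, all base kernels share the same diagonal, so no single view dominates the combination merely because of its numeric scale.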
For all the datasets, we set the number of clusters k to the actual number of categories in the dataset. We use four metrics to measure clustering quality: clustering accuracy (ACC), normalized mutual information (NMI), purity, and Rand index. To reduce the effect of randomness, we initialized and executed all algorithms 50 times and report the mean and variance of each metric.
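Two of the four metrics, purity and the Rand index, can be computed directly from the contingency table of true and predicted labels, as in the following sketch. ACC (which additionally needs an optimal one-to-one matching of clusters to classes) and NMI are left out here. The function names are hypothetical, and the definitions are the standard textbook ones rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def contingency(y_true, y_pred):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, ti = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, pi = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)
    return table

def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    table = contingency(y_true, y_pred)
    return table.max(axis=0).sum() / table.sum()

def rand_index(y_true, y_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    table = contingency(y_true, y_pred)

    def pairs(x):
        return x * (x - 1) // 2  # number of unordered pairs

    same_both = pairs(table).sum()
    same_true = pairs(table.sum(axis=1)).sum()
    same_pred = pairs(table.sum(axis=0)).sum()
    total = pairs(table.sum())
    return (total + 2 * same_both - same_true - same_pred) / total
```

Both metrics are invariant to how the cluster labels are numbered, so a prediction that recovers the true partition under permuted labels scores 1.0 on each.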
4.4. Experimental Results
Table 3 reports the ACC, NMI, purity, and rand index of the previously mentioned algorithms on all 6 datasets. The following observations were made based on the results:
The proposed localized SimpleMKKM with matrix-induced regularization consistently outperforms localized SimpleMKKM. For example, it exceeds the LI-SimpleMKKM algorithm by 1.8%, 0.1%, 3.1%, 0.3%, 0.6%, and 3.4% in terms of ACC on the Flower17, Flower102, ProteinFold, DIGIT, Caltech-25 views, and Caltech-7 classes datasets, respectively. These results support the effectiveness of accounting for the correlation between kernel matrices.
Also, our proposed LI-SimpleMKKM-MR outperforms the MKKM-MR algorithm by 3.6%, 3.8%, 4.7%, 7.5%, 3.3%, and 6.3% in terms of ACC on the Flower17, Flower102, ProteinFold, DIGIT, Caltech-25 views, and Caltech-7 classes datasets, respectively. This result indicates that exploiting the data’s local structure together with min-max optimization improves clustering performance.
The proposed algorithm adopts the min-max formulation and uses matrix-induced regularization to account for the correlation between kernel matrices, reducing redundancy and increasing the diversity of the selected kernel matrices, which makes it superior to its counterparts.
Together, these factors give LI-SimpleMKKM-MR a clear improvement over the other algorithms on the same datasets. In addition, due to time complexity and memory constraints, the results of LMKKM on some datasets are not reported.
4.5. Parameter Sensitivity of LI-SimpleMKKM-MR
We designed comparative experiments to study how the two hyperparameters, which control localized alignment and matrix-induced regularization, influence the clustering performance. According to equation (7), LI-SimpleMKKM-MR tunes the clustering performance through two hyperparameters, $\lambda$ and $\tau$, which are the regularization balance factor and the nearest neighbor ratio, respectively.
We experimentally show the difference in clustering performance in and in all benchmark datasets.
Figure 1 shows the ACC and NMI of our algorithm by varying one of or with the other one fixed. Based on these figures, we can conclude that (1) as the value of increases, the ACC and NMI of each dataset increase to their highest value and, correspondingly, decrease when decreases and (2) by keeping the unchanged, the ACC and NMI will exceed SimpleMKKM and be steady when is small.

Hence, we conclude that our proposed algorithm presents a new state-of-the-art performance for clustering compared to other algorithms that only preserve the global kernel, such as LI-MKKM. Thus, it focuses on preserving the local structure of the data as specific results are displayed in Table 1.
On top of min-max optimization, combining matrix-induced regularization and local alignment improves the clustering performance when the parameters are appropriately set.
4.6. Convergence of LI-SimpleMKKM-MR
In addition to theoretical verification, we experimentally verify the convergence of the algorithm. We present the behavior of our proposed algorithm on different datasets in Figure 2. According to the results, the objective value of the proposed algorithm oscillates first, then decreases monotonically, and finally converges within several iterations. Moreover, the experiments show that most datasets converge in fewer than 10 iterations. This result is comparable to the state-of-the-art methods.

4.7. Performance of LI-SimpleMKKM-MR by Learned H
We calculate the four clustering metrics at each iteration to show how the clustering performance of the learned $\mathbf{H}$ varies on different datasets and plot them in Figure 3. As observed, the clustering performance first increases with the iterations and remains stable after some oscillation.

4.8. Running Time of LI-SimpleMKKM-MR
We report the running time comparison of all the baseline algorithms and LI-SimpleMKKM-MR on different datasets in Figure 4. With the analysis of the time complexity in Section 3.3 and the experiment result from Figure 4, even though there are additional computational steps, we found that LI-SimpleMKKM-MR does not significantly increase in computation time.

5. Conclusion
Although LI-SimpleMKKM can address the task of multiple kernel k-means through min-max optimization and realize local alignment, it does not sufficiently account for the correlation between the base kernels. This work proposes the LI-SimpleMKKM-MR algorithm, which combines localized sample alignment and matrix-induced regularization to solve this problem. Theoretical analysis and experiments show that our method outperforms existing algorithms in clustering performance. In future research, we will apply this algorithm to incomplete MKKM problems.
Data Availability
The data that support the findings of this study are openly available at https://www.robots.ox.ac.uk/~vgg/data/flowers/, https://mkl.ucsd.edu/dataset/protein-fold-prediction/, https://ss.sysu.edu.cn/~py/, and https://files.is.tue.mpg.de/pgehler/projects/iccv09/.
Disclosure
A preprint has previously been published [26].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Jiaji Qiu and Huiying Xu equally contributed to the paper.
Acknowledgments
This work was supported by the Outstanding Talents of “Ten Thousand Talents Plan” in Zhejiang Province (project no. 2018R51001), Natural Science Foundation of China (project no. 61976196), and Zhejiang Provincial Natural Science Foundation of China under grant no. LZ22F030003.