Abstract

In recent years, Internet of Things (IoT) technology has developed rapidly and is widely used in many fields. Uncovering the underlying patterns and insights in high-dimensional IoT data, and extracting valuable information to guide production and daily life, is of great research significance. Clustering can explore the natural cluster structure of the data, which is conducive to a further understanding of the data, and it is an essential preprocessing step for data analysis. However, clustering is highly dependent on the data. In order to reduce the complexity of the model, lower the computational cost, and obtain a more robust clustering solution, we combine subspace clustering and ensemble learning and propose a novel subspace weighted clustering ensemble framework for high-dimensional data. The proposed framework first combines random feature selection and unsupervised feature selection to generate a set of base subspaces. Clustering is performed on each base subspace to obtain a set of subspace clustering solutions, from which a set of adaptive core clusters is generated. The granularity of a core cluster lies between that of a single sample and a cluster. In the ensemble process, the core clusters are viewed as the basic units: the stability of each cluster is evaluated by measuring the distances between core cluster pairs and the similarity between core clusters and the clusters in the base subspaces, and the subspace clustering solutions are weighted accordingly. Under this framework, we propose four subspace ensemble approaches based on core clusters to improve the accuracy of the consensus clustering solutions. Extensive experiments are conducted on multiple real-world high-dimensional datasets, demonstrating that the proposed framework can process high-dimensional IoT data and that the proposed subspace clustering ensemble approaches are superior to state-of-the-art clustering approaches.

1. Introduction

The Internet of Things is a network platform that connects various information collection devices to the Internet, facilitating the sharing of information between devices. Data acquisition devices are widely used in many application fields, and the data they collect are typically high-dimensional. The impact of high-dimensional data on data mining is twofold. On the one hand, as the data dimension grows, there are more features that describe a sample from different perspectives, which brings richer and more comprehensive information to data analysis; on the other hand, high-dimensional data increases the complexity of the model and makes data analysis difficult.

Clustering is an unsupervised learning technique that partitions unlabeled samples into clusters according to specific criteria. Clustering can reveal the intrinsic properties of the objects under study and discover their underlying patterns. In the fields of data mining and pattern recognition, clustering is often used as a data preprocessing step and forms the basis for subsequent data analysis. Clustering has always been an active research direction, and in different fields, multiple clustering algorithms have been developed that achieve satisfactory results. However, the growing prevalence of high-dimensional data poses a huge challenge to clustering.

Existing clustering approaches for high-dimensional data can be roughly divided into two categories: dimensionality reduction followed by clustering, and subspace clustering. The former typically uses feature selection or feature extraction techniques to extract, from the high-dimensional space, the features that are relevant to the subsequent clustering criteria, thereby reducing the dimensionality of the samples; clustering is then performed in the lower-dimensional space. Subspace clustering assumes that the clusters of high-dimensional data are located in low-dimensional subspaces [1], and its goal is to find the clusters hidden in the high-dimensional space.

Clustering is highly dependent on the data. As the dimensionality increases, a large number of redundant and irrelevant features appear in the data. This makes the model more complex, and the computational complexity grows exponentially. Different clustering algorithms are designed for different types of data; they can discover the underlying structure of a particular dataset but are not valid for all types of data. Even with the same clustering algorithm, setting different parameters can group the samples into different clusters. As a result, more researchers are focusing on clustering ensemble, trying to achieve better and more robust clustering results by merging multiple base clustering solutions. Recently, spectral clustering [2] has become one of the most popular clustering approaches. It first calculates the similarity between samples to construct an affinity matrix, then performs eigendecomposition on the graph Laplacian of the affinity matrix to obtain the eigenvectors associated with its eigenvalues. The samples are then mapped to the lower-dimensional space spanned by these eigenvectors to obtain the final clustering solution.

Spectral clustering [3] can find clusters for complex sample distributions, is simple to implement, and can achieve better results than traditional clustering approaches. Recently, some researchers have proposed clustering high-dimensional data through subspace clustering ensemble approaches [4–7], some of which employ spectral clustering [4, 5]. Cai et al. proposed a spectral clustering approach based on random subspaces and graph fusion, termed SC-SRGF [5]. It combines the affinity graphs of the subspaces with an iterative similarity network fusion scheme and performs spectral clustering on the fused matrix to obtain the final clustering solution. Huang et al. proposed multidiversified ensemble clustering (MDEC) [4], under whose framework consensus clustering solutions are obtained by performing spectral clustering, bipartite graph partitioning, or a hierarchical clustering-based consensus function. These approaches [4–7] use random feature selection to generate a set of subspaces; that is, a certain proportion of features are randomly extracted from the original feature set to generate feature subsets. The clusters of the subspaces are then ensembled to achieve the final clustering solution. However, these approaches [4–6] do not take into account the contributions of different feature subspaces during integration. The approach of [7] considers the contribution of each subspace clustering solution in the ensemble process, but it selects only the members with larger contributions to participate in the ensemble. This ensemble selection strategy ignores feature subspaces with small contributions, which easily biases the clustering results.

To address the abovementioned problems, this paper proposes a novel adaptive core cluster-based subspace weighted clustering ensemble approach, termed C2SWCE, which considers the similarity of clusters across subspaces in the ensemble process; weighting the subspace clustering solutions helps to improve the accuracy of the consensus clustering result. The overall process of C2SWCE is shown in Figure 1. The proposed framework first uses a combination of random feature selection and unsupervised feature selection to generate a set of base subspaces. Then, spectral clustering is performed in each base subspace to obtain its clustering solution, from which a set of adaptive core clusters is generated. The core clusters are viewed as the basic units of the clustering ensemble, and the clustering solution of each subspace is weighted by calculating the distances between the core clusters in the subspace and the similarity between the core clusters and the clusters. Finally, we propose four consensus functions under this framework, which combine the locally weighted subspace clustering solutions to achieve the consensus clustering solutions.

The main contributions of our approaches are summarized as follows:
(1) We combine random feature selection and unsupervised feature selection to generate a set of base subspaces. Random feature selection generates a set of random subspaces; unsupervised feature selection is then performed on the random subspaces, selecting the features that retain the local structure of each random subspace, which further reduces the dimensionality of the random subspaces. This joint feature selection strategy ensures the diversity of the feature subspaces and improves the computational efficiency.
(2) We introduce the concept of the core cluster, which is a collection of samples that are grouped into the same cluster in all subspaces. In this paper, the core cluster is viewed as the basic unit of the clustering ensemble, which improves the computational efficiency of the integration process.
(3) We propose a locally weighted subspace clustering ensemble framework, which evaluates the stability of clusters by calculating the distances between the core clusters in each subspace and the similarity between core clusters and clusters. We further propose four weighted ensemble approaches based on core clusters, which fuse the clusters of the base subspaces and obtain the consensus clustering solution.
(4) The experimental results on several real high-dimensional datasets show that the comprehensive performance of our approaches in clustering accuracy and time complexity is significantly better than that of the state-of-the-art clustering approaches.

The rest of this paper is organized as follows: Section 2 reviews the related work, Section 3 presents the locally weighted subspace clustering ensemble approaches, Section 4 describes the experimental settings, results, and analysis, and Section 5 concludes this paper.

2. Related Work

Clustering ensemble, also known as cluster integration, combines multiple base clustering solutions to obtain a better and more robust consensus clustering result. The cluster integration process consists of two stages: base clustering generation and the ensemble of clusters. In the first stage, different clustering algorithms are performed on the dataset, or the same clustering algorithm is run with different parameter settings to partition the samples; the goal of this stage is to generate diverse clustering solutions. In the second stage, the input base clustering solutions are fused by the consensus function to obtain the final clustering result.

The subspace clustering technique can explore the natural cluster structure of high-dimensional data in different low-dimensional spaces. Subspace clustering ensemble is a method that combines subspace clustering and ensemble learning, fusing the clusters of different feature subspaces to achieve consensus clustering solutions. Among the recently proposed subspace clustering ensemble approaches, Huang et al. proposed the multidiversified ensemble clustering framework, termed MDEC [4], in which random sampling is used to generate a set of feature subspaces. The novelty of this approach is that a randomized scaled exponential similarity kernel is used to obtain a large number of diversified metrics for each random subspace, forming metric-subspace pairs. Spectral clustering is performed on the similarity matrix of each metric-subspace pair to obtain a clustering solution for each subspace, the entropy criterion is used to weight the clusters of the subspaces, and then spectral clustering, bipartite graph partitioning, and hierarchical clustering are used to obtain a consensus clustering solution.

Cai et al. proposed a novel spectral clustering approach based on random subspaces, termed SC-SRGF [5], which first generates a set of random feature subspaces, uses the local structure information of each subspace to construct a KNN affinity graph, fuses the affinity graphs of the subspaces into a unified affinity graph through an iterative similarity network fusion scheme, and obtains the final clustering solution by performing spectral clustering on the unified affinity graph.

Shankar proposed the subspace clustering ensemble framework AP2CE [6], which uses affinity propagation to produce representative feature subsets, employs multiple distance metrics to produce diverse subspace clustering solutions, and uses Ncut to partition the consensus matrix to obtain the final result.

Verma et al. proposed double weighting semisupervised ensemble clustering based on selected constraint projection, termed DCECP [7], which treats the prior knowledge of experts as pairwise constraints and assigns different subsets of pairwise constraints to different ensemble members. In addition, an adaptive ensemble member weighting process is designed to associate different weight values with different ensemble members. Finally, the final clustering result is obtained using the weighted normalized cut algorithm.

Although many successful subspace ensemble clustering approaches have been developed, most existing approaches [4–6] treat each subspace clustering solution equally during the ensemble process. How to weight the clustering solutions according to the contributions of different feature subspaces is therefore worth considering in subspace ensemble.

3. Proposed Framework

In this section, we describe the overall process of the locally weighted subspace clustering ensemble approach. First, we give a brief overview of the proposed methods and then detail the proposed algorithm from three aspects: subspace generation, subspace clustering, and fusion of clusters in subspace.

3.1. Overview

In this paper, we introduce a locally weighted subspace clustering ensemble framework. First, we randomly select feature subsets from the original feature set to generate random subspaces, and then select the features that represent the local structure of each random subspace to generate the base subspaces. Then, spectral clustering is performed on each base subspace to obtain the subspace clustering solutions. The adaptive core clusters are generated from the base subspace clustering solutions, where a core cluster is a set of samples that are grouped into the same cluster in all subspaces. Then, by calculating the distances between the core clusters in different subspaces, the stability of the clusters is evaluated and the base subspace clustering solutions are weighted. Finally, in order to integrate the clustering solutions of the base subspaces and obtain consensus results, we propose four adaptive core cluster-based consensus functions under the C2SWCE framework to achieve the final clustering solutions.

Let $X \in \mathbb{R}^{N \times D}$ be the input dataset that contains $N$ samples, each with $D$ features. Let $x_i$ ($i = 1, 2, \ldots, N$) denote the $i$th sample, which corresponds to the $i$th row of $X$, so that the input data can be represented as $X = \{x_1, x_2, \ldots, x_N\}$. Let $f_j$ ($j = 1, 2, \ldots, D$) denote the $j$th feature of the samples, which corresponds to the $j$th column of $X$; therefore, the matrix $X$ can also be represented as $X = \{f_1, f_2, \ldots, f_D\}$.

3.2. Generate Subspaces

The natural cluster structure of high-dimensional data is hidden in low-dimensional subspaces [8]. Subspace clustering explores the possibility of grouping samples under different feature sets. Subspace clustering ensemble fuses the clusters of different subspaces to obtain the final clustering solution, which is an effective method for clustering high-dimensional data. Recently proposed subspace clustering ensemble approaches [5, 9] use random feature selection to generate a set of random subspaces. Random subspaces are diverse; they explore the potential clusters of a high-dimensional dataset from the perspective of different features to achieve diverse subspace clustering solutions. The work in [10] uses a stratified sampling method to generate feature groups and verifies that this method is superior to random sampling. This paper uses random feature selection to obtain a variety of subspaces, and unsupervised feature selection is then performed on each random subspace to select the features that retain its local structure.

The specific procedure of feature subspace generation is as follows. The original feature set is randomly sampled according to the sampling ratio $r_1$, and the original features are grouped into $m$ feature groups $\{F_1, F_2, \ldots, F_m\}$, where $F_t$ represents the set of features contained in the $t$th random subspace. Let $X_t \in \mathbb{R}^{N \times d_t}$ denote the matrix corresponding to the $t$th random subspace, where $N$ is the number of samples and $d_t$ is the number of features in the $t$th random subspace. Obviously, it holds that $F_t \subset \{f_1, f_2, \ldots, f_D\}$ and $d_t = |F_t| = \lceil r_1 D \rceil < D$.

A set of base subspaces is generated by calculating the Laplacian score [11] of the features in each random subspace and selecting the important features of each random subspace. To this end, we construct the KNN graph for each random subspace, which represents the local structure of that subspace. The KNN graph of the $t$th subspace is defined as

$$G_t = (V_t, E_t), \quad (1)$$

where $V_t$ is the set of nodes corresponding to the samples in the $t$th subspace and $E_t$ is the edge set of the $t$th subspace. We use the Gaussian kernel function to calculate the weights of the edges between the nodes in the subspace and their corresponding KNN nodes. The weight matrix $W_t$ is defined as

$$W_t(i, j) = \begin{cases} \exp\!\left(-\dfrac{\operatorname{dist}(x_i^t, x_j^t)^2}{2\,\sigma_i^t \sigma_j^t}\right), & x_j^t \in \operatorname{KNN}(x_i^t) \text{ or } x_i^t \in \operatorname{KNN}(x_j^t), \\ 0, & \text{otherwise}, \end{cases} \quad (2)$$

where

$$\sigma_i^t = \frac{1}{k} \sum_{x_j^t \in \operatorname{KNN}(x_i^t)} \operatorname{dist}(x_i^t, x_j^t). \quad (3)$$

In Equations (2) and (3), $x_i^t$ and $x_j^t$ are two samples in the $t$th subspace, $\operatorname{dist}(x_i^t, x_j^t)$ is the Euclidean distance between $x_i^t$ and $x_j^t$, $\operatorname{KNN}(\cdot)$ is the $k$-nearest neighbor (KNN) operator, and $\sigma_i^t$ is the mean Euclidean distance between the sample $x_i^t$ and its $k$ nearest neighbors.

Let $D_t$ be the degree matrix of the $t$th subspace, where $D_t(i, i) = \sum_j W_t(i, j)$. The Laplacian of the graph is calculated as follows:

$$L_t = D_t - W_t. \quad (4)$$

In the $t$th subspace, the Laplacian score of the $j$th feature $f_j^t$ ($j = 1, 2, \ldots, d_t$) is

$$LS(f_j^t) = \frac{\tilde{f}_j^{t\top} L_t \tilde{f}_j^t}{\tilde{f}_j^{t\top} D_t \tilde{f}_j^t}, \quad (5)$$

where

$$\tilde{f}_j^t = f_j^t - \frac{f_j^{t\top} D_t \mathbf{1}}{\mathbf{1}^\top D_t \mathbf{1}}\,\mathbf{1}, \quad (6)$$

and $\mathbf{1} = [1, 1, \ldots, 1]^\top$.

We calculate the Laplacian score for each feature in the subspace, arrange the features in descending order of their scores, and select the top features. The number of features selected in this second step is determined by the number of features in the random subspace. Let $r_2$ represent the unsupervised feature selection ratio in the random subspace; that is, from each random feature group $F_t$ ($t = 1, 2, \ldots, m$), we select the $\lceil r_2 d_t \rceil$ most important features that represent the local structure of the random subspace to generate the base subspace $B_t$, so that the number of selected features satisfies $d'_t = \lceil r_2 d_t \rceil < d_t$. For convenience, let $X'_t \in \mathbb{R}^{N \times d'_t}$ ($t = 1, 2, \ldots, m$) represent the matrix of the $t$th base subspace, where $N$ is the number of samples and $d'_t$ is the number of features.
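To make the hybrid strategy concrete, the following Python sketch is our own illustration (the function names, the neighborhood size k, and the use of NumPy/scikit-learn utilities are assumptions, not part of the paper): it generates m random subspaces and keeps, in each, the features best ranked by a Laplacian-score-style criterion.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k=5):
    """Laplacian score of every feature of a subspace matrix X (n_samples x n_features),
    computed from a Gaussian-weighted KNN graph (He et al., Laplacian Score)."""
    n = X.shape[0]
    A = kneighbors_graph(X, n_neighbors=min(k, n - 1), mode='distance').toarray()
    conn = (A > 0) | (A.T > 0)                        # symmetrise the KNN relation
    dist = np.where(A > 0, A, A.T)
    sigma = dist[conn].mean() if conn.any() else 1.0  # bandwidth: mean KNN distance
    W = np.where(conn, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
    D = W.sum(axis=1)
    L = np.diag(D) - W
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j]
        f_t = f - (f @ D) / D.sum()                   # remove the degree-weighted mean
        den = f_t @ (D * f_t)
        scores[j] = (f_t @ L @ f_t) / den if den > 0 else np.inf
    return scores

def generate_base_subspaces(X, m=40, r1=0.5, r2=0.8, k=5, seed=0):
    """m random subspaces with a fraction r1 of the features; in each, keep the
    fraction r2 of features best ranked by the Laplacian score."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    base = []
    for _ in range(m):
        feats = rng.choice(d, size=max(1, int(r1 * d)), replace=False)
        s = laplacian_scores(X[:, feats], k=k)
        keep = max(1, int(r2 * len(feats)))
        # classical Laplacian score: smaller is better (reverse the sort if the
        # descending-order variant described in the text is used instead)
        base.append(feats[np.argsort(s)[:keep]])
    return base
```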

3.3. Generate Subspace Clustering Solution

Let $B = \{B_1, B_2, \ldots, B_m\}$ represent the base subspaces generated by the joint feature selection strategy, where $B_t$ ($t = 1, 2, \ldots, m$) is the $t$th base subspace. Let $\pi_t$ ($t = 1, 2, \ldots, m$) be the clustering solution for $B_t$; formally, the clustering solution of the $t$th base subspace is $\pi_t = \{C_1^t, C_2^t, \ldots, C_{k_t}^t\}$, where $C_i^t$ ($i = 1, 2, \ldots, k_t$) denotes the $i$th cluster in $\pi_t$ and $k_t$ is the number of clusters in $\pi_t$. For $\pi_t$, it holds that $\bigcup_{i=1}^{k_t} C_i^t = X$ and $C_i^t \cap C_j^t = \varnothing$ for $i \neq j$. Then, we obtain the clustering solution set of the base subspaces, which is represented as $\Pi = \{\pi_1, \pi_2, \ldots, \pi_m\}$. Subspace clustering ensemble is the fusion of the clustering solutions from the multiple subspaces to achieve a consistent result $\pi^*$.

In the clustering process, most approaches view the input samples as the base units and group them into different clusters. However, as the number of samples grows, the computational complexity also increases significantly. Huang et al. introduced the concept of the superobject [12]: if two samples are partitioned into the same cluster in every base clustering of the ensemble, the two samples belong to the same superobject. The size of a superobject is between that of a cluster and a sample, and it has been shown that viewing superobjects as the base units of integration can significantly improve scalability with respect to the data size and simplify the computation.

Inspired by the concept of the superobject [12], this paper groups the samples in the base subspaces to achieve a set of base subspace clustering solutions and then generates the adaptive core clusters. The core clusters in the high-dimensional space are defined as follows.

Definition 1. Let $X$ be the input dataset and $\Pi = \{\pi_1, \ldots, \pi_m\}$ be the set of base subspace clustering solutions. A set of samples $\theta \subseteq X$ that simultaneously satisfies the following two conditions is a core cluster: (1) for any two samples $x_i, x_j \in \theta$ and any base subspace clustering $\pi_t \in \Pi$, $x_i$ and $x_j$ belong to the same cluster of $\pi_t$; (2) for any sample $x_l \notin \theta$, there exists at least one $\pi_t \in \Pi$ in which $x_l$ is not grouped with the samples of $\theta$.

Let $\Theta = \{\theta_1, \theta_2, \ldots, \theta_{n_\theta}\}$ be the set of core clusters in the high-dimensional space, where $n_\theta$ is the number of core clusters. It holds that $\bigcup_{p=1}^{n_\theta} \theta_p = X$, $\theta_p \cap \theta_q = \varnothing$ for $p \neq q$, and $n_\theta \le N$.

We provide an example of clustering samples in different subspaces. Given a dataset in which $x_i$ denotes the $i$th sample, the samples are grouped into clusters in three subspaces, and the relationships between the samples, core clusters, and clusters are described in Table 1. In the first subspace, the samples are grouped into two clusters, and in the other two subspaces, they are grouped into three clusters each.

According to Definition 1, four core clusters are generated in the above example. The relationship between the core clusters and the samples is shown in Table 1. The core clusters are viewed as the basic units of the clusters in each subspace; for example, each cluster of the first base subspace is a union of core clusters.
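A core cluster can be computed directly from the label assignments: two samples share a core cluster exactly when their label vectors over all base subspaces are identical. The following minimal Python sketch is our own illustration; the toy label matrix only mimics the spirit of the Table 1 example.

```python
import numpy as np

def core_clusters(label_matrix):
    """Group samples into adaptive core clusters.

    label_matrix: (n_samples, n_subspaces) array; column t holds the cluster
    labels assigned by the clustering of base subspace t.  Two samples belong
    to the same core cluster iff their rows are identical (Definition 1).
    """
    # map each distinct label tuple to one core-cluster id
    _, core_ids = np.unique(label_matrix, axis=0, return_inverse=True)
    cores = [np.flatnonzero(core_ids == c) for c in range(core_ids.max() + 1)]
    return core_ids, cores

# toy example: 6 samples clustered in 3 subspaces
labels = np.array([[0, 0, 0],
                   [0, 0, 0],
                   [0, 1, 1],
                   [1, 1, 1],
                   [1, 2, 2],
                   [1, 2, 2]])
ids, cores = core_clusters(labels)
print(cores)   # four core clusters: [0, 1], [2], [3], [4, 5]
```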

In conventional subspace clustering approaches, samples are the basic units. However, samples correspond to different features in different feature subspaces, so the implicit relationships between samples also differ across subspaces. A core cluster is a set of samples that are grouped into the same cluster in all subspaces. In the process of subspace ensemble, we view the core cluster as the basic unit; that is, each cluster in a subspace is composed of core clusters. We evaluate the stability of the clusters in a subspace by measuring the distances between the core cluster pairs contained in each cluster.

Definition 2. For any two core clusters $\theta_p, \theta_q \in \Theta$ ($p \neq q$), the distance between the core cluster pair in the $t$th base subspace is defined as

$$d(\theta_p, \theta_q) = \frac{1}{|\theta_p|\,|\theta_q|} \sum_{x_i \in \theta_p} \sum_{x_j \in \theta_q} \operatorname{dist}(x_i, x_j), \quad (8)$$

where $|\theta_p|$ and $|\theta_q|$ are the numbers of samples contained in $\theta_p$ and $\theta_q$, respectively, and $\operatorname{dist}(\cdot, \cdot)$ is a distance metric between samples in $B_t$, which can be chosen as the Euclidean distance, the Manhattan distance, the cosine distance, and so on. The smaller the distances between the core cluster pairs, the more stable the clusters in the corresponding subspace, and vice versa. Clusters are evaluated by the average distance between their core clusters.

For $\pi_t$ ($t = 1, 2, \ldots, m$), there are a total of $k_t$ clusters, and each cluster $C_i^t$ ($i = 1, 2, \ldots, k_t$) contains $n_i^t$ core clusters. Obviously, it holds that $1 \le n_i^t \le n_\theta$ and $\sum_{i=1}^{k_t} n_i^t = n_\theta$.

Definition 3. For each cluster $C_i^t$ ($i = 1, \ldots, k_t$; $t = 1, \ldots, m$), the average distance of its core clusters is defined as

$$\overline{d}(C_i^t) = \frac{2}{n_i^t (n_i^t - 1)} \sum_{\theta_p, \theta_q \in C_i^t,\; p < q} d(\theta_p, \theta_q). \quad (9)$$

In Equation (9), $d(\cdot, \cdot)$ is the distance metric of core cluster pairs defined in Equation (8) and $n_i^t$ is the number of core clusters in $C_i^t$; for $n_i^t = 1$, we set $\overline{d}(C_i^t) = 0$.

The smaller the average distance between the core clusters in a cluster, the denser the cluster; that is, the greater the probability that these core clusters will be grouped together in the base subspaces and the higher the stability of the cluster. We introduce the cluster stability index (CSI) to describe this relationship.

Definition 4. For each cluster $C_i^t$ ($i = 1, \ldots, k_t$; $t = 1, \ldots, m$), its cluster stability index $\mathrm{CSI}(C_i^t)$ is defined as a monotonically decreasing function of $\overline{d}(C_i^t)$, the average distance between the core clusters in $C_i^t$, with $\mathrm{CSI}(C_i^t) = 1$ when $\overline{d}(C_i^t) = 0$. The range of the cluster stability index is $(0, 1]$.

The CSI is an indicator that describes the clusters in a subspace. If a cluster has only one core cluster, the average distance between its core clusters is 0 and the stability index of the cluster is 1; such a cluster can no longer be divided and is stable. If there are multiple core clusters in a cluster, the greater the average distance between the core clusters, the worse the stability of the cluster, and vice versa.

Thus, for $\pi_t$ ($t = 1, \ldots, m$), the average of the stability indices of all its clusters is

$$\overline{\mathrm{CSI}}(\pi_t) = \frac{1}{k_t} \sum_{i=1}^{k_t} \mathrm{CSI}(C_i^t). \quad (11)$$

When fusing the clustering solutions of base subspace, we weight the subspace by the stability of the clusters in the corresponding base subspace. The weight of each base subspace is calculated by

where
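As a hedged illustration of this weighting step, the Python sketch below assumes a concrete stability index, CSI = 1 / (1 + average core-cluster distance), which satisfies the properties stated above (it equals 1 for a single core cluster and decreases as the average distance grows), and normalizes the mean CSI of each subspace to obtain its weight; the exact formulas of Equations (10)–(13) in the paper may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def core_cluster_distance(Xs, core_a, core_b):
    """Average pairwise Euclidean distance between the samples of two core
    clusters, measured in the subspace matrix Xs (cf. Definition 2)."""
    return cdist(Xs[core_a], Xs[core_b]).mean()

def subspace_weights(subspace_mats, labels, cores):
    """Per-subspace weights from cluster stability (assumed CSI form).

    subspace_mats: list of (n_samples x d_t) base-subspace matrices
    labels       : list of per-subspace label arrays (length n_samples each)
    cores        : list of index arrays, one per core cluster
    """
    weights = []
    for Xs, lab in zip(subspace_mats, labels):
        csis = []
        for c in np.unique(lab):
            members = [g for g in cores if lab[g[0]] == c]   # core clusters in cluster c
            if len(members) <= 1:
                csis.append(1.0)                             # single core cluster -> CSI = 1
                continue
            d = [core_cluster_distance(Xs, members[i], members[j])
                 for i in range(len(members)) for j in range(i + 1, len(members))]
            csis.append(1.0 / (1.0 + np.mean(d)))            # assumed CSI formula
        weights.append(np.mean(csis))
    w = np.asarray(weights)
    return w / w.sum()                                       # normalise to sum to 1
```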

3.4. Subspace Clustering Ensemble

Clustering ensemble is an effective way to improve robustness and stability of clustering solution [13]. We propose four core cluster-based consensus functions that ensemble clustering solutions for each base subspace to achieve the final clustering results.

We define the core cluster similarity matrix of each base subspace according to whether any two core clusters in the subspace are grouped into the same cluster.

Definition 5. For the $t$th base subspace, the core cluster similarity matrix is defined as

$$S^{(t)} = \left[s_{pq}^{(t)}\right]_{n_\theta \times n_\theta}, \quad (14)$$

where

$$s_{pq}^{(t)} = \begin{cases} 1, & \mathrm{Cls}_t(\theta_p) = \mathrm{Cls}_t(\theta_q), \\ 0, & \text{otherwise}. \end{cases} \quad (15)$$

In Equation (15), $\mathrm{Cls}_t(\theta_p)$ and $\mathrm{Cls}_t(\theta_q)$ represent the clusters containing $\theta_p$ and $\theta_q$ in $\pi_t$, respectively. Unlike ordinary similarity matrices, we define similarity matrices between core clusters rather than between samples. At the same time, we obtain the weighted core cluster similarity matrix based on the weights of the base subspaces, which is represented as

$$\tilde{S} = \sum_{t=1}^{m} w_t S^{(t)}, \quad (16)$$

where $w_t$ is the weight of the $t$th base subspace.
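The following sketch builds the weighted core-cluster similarity matrix under our reading of Equations (14)–(16), i.e., a per-subspace binary co-membership matrix over core clusters combined by a weighted sum; the helper names are ours.

```python
import numpy as np

def weighted_core_similarity(labels, cores, weights):
    """Weighted core-cluster similarity (co-association) matrix.

    labels : list of per-subspace label arrays (length n_samples each)
    cores  : list of index arrays, one per core cluster
    weights: per-subspace weights (summing to 1)

    In each subspace, two core clusters get similarity 1 iff they fall in the
    same cluster; the per-subspace matrices are combined by a weighted sum.
    """
    n_core = len(cores)
    S = np.zeros((n_core, n_core))
    for lab, w in zip(labels, weights):
        core_lab = np.array([lab[g[0]] for g in cores])   # label of each core cluster
        S += w * (core_lab[:, None] == core_lab[None, :])
    return S
```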

3.4.1. Hierarchical Clustering Based on the Core Cluster

In this section, we propose the consensus function for hierarchical clustering based on core clusters, termed C2SWCE_HC. The proposed method views the core clusters as the basic units and constructs a dendrogram over them, where the root of the tree corresponds to the entire dataset and its leaves correspond to all the core clusters in $\Theta$. Each level of the dendrogram represents a clustering result with a different number of regions, and the final clustering solution is achieved by selecting a specific level of the dendrogram.

The specific steps of the integration are as follows. First, the core clusters in $\Theta$ are taken as the initial regions, which are represented as

$$R^{(0)} = \{R_1^{(0)}, R_2^{(0)}, \ldots, R_{n_\theta}^{(0)}\}. \quad (17)$$

In Equation (17), $R_p^{(0)}$ denotes the $p$th initial region, which corresponds to the $p$th core cluster. Let $\tilde{S}$ be the initial similarity matrix. We merge the two most similar regions into one larger region and update the similarity matrix according to the average link for the next merge of core clusters. We then repeatedly merge the two most similar regions of the similarity matrix updated in the previous iteration. After each merge, the number of regions is reduced by 1; after the $\tau$th iteration, the merged regions are represented as $R^{(\tau)}$, and the corresponding similarity matrix is updated as follows:

$$\tilde{S}(R_a \cup R_b, R_c) = \frac{n_a \tilde{S}(R_a, R_c) + n_b \tilde{S}(R_b, R_c)}{n_a + n_b}, \quad (18)$$

where $n_a$ is the number of core clusters contained in region $R_a$ after the $\tau$th iteration, and the maximum number of iterations of the dendrogram is $n_\theta - 1$.

For clarity, C2SWCE_HC is summarized in Algorithm 1.

Input: ,
Output:
1. Perform clustering on each base subspace to generate a set of subspace clustering solutions
2. Generate the adaptive core clusters according to Definition 1
3. Calculate the average distances between core clusters according to Equations (8) and (9)
4. Calculate the CSI of each cluster according to Equation (10), and obtain the subspace weights according to Equations (11)–(13)
5. Construct the core cluster similarity matrix of each base subspace according to Equations (14) and (15), and the weighted core cluster similarity matrix according to Equation (16)
6. Initialize the regions according to Equation (17)
7. Construct the dendrogram
   for each merging iteration do
    According to the current similarity matrix, merge the two most similar regions
    Update the similarity matrix according to Equation (18)
   end for
8. Select the level of the dendrogram according to the desired number of clusters, and obtain the clusters
9. Map the labels of the core clusters to the samples
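A compact Python illustration of the C2SWCE_HC consensus step is given below; it applies SciPy's average-linkage routine to the weighted core-cluster similarity matrix, and the similarity-to-distance conversion (1 - S/max(S)) as well as the helper names are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def c2swce_hc(S, cores, n_samples, n_clusters):
    """Consensus by average-link agglomeration over core clusters (sketch).

    S         : weighted core-cluster similarity matrix
    cores     : list of sample-index arrays, one per core cluster
    n_clusters: desired number of consensus clusters
    """
    D = 1.0 - S / max(S.max(), 1e-12)        # assumed similarity -> distance conversion
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    core_labels = fcluster(Z, t=n_clusters, criterion='maxclust') - 1
    # map core-cluster labels back to the original samples
    labels = np.empty(n_samples, dtype=int)
    for g, c in zip(cores, core_labels):
        labels[g] = c
    return labels
```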
3.4.2. Spectral Clustering Based on the Core Cluster

In this section, we introduce the core cluster-based spectral clustering consensus function to ensemble the subspace clustering solutions. First, we build the affinity graph that treats the core clusters as graph nodes and the weighted core cluster similarity matrix as the adjacency matrix. The graph is defined as

$$G = (V, E), \quad (19)$$

where $V$ is the node set and $E$ is the edge set. The weight of the edge between nodes $v_p$ and $v_q$ is determined by the matrix $\tilde{S}$, that is, $E(v_p, v_q) = \tilde{S}(p, q)$. Let $D$ be the degree matrix of $G$ with $D(p, p) = \sum_q \tilde{S}(p, q)$; the normalized graph Laplacian is computed as

$$L_{\mathrm{sym}} = D^{-1/2} (D - \tilde{S}) D^{-1/2}. \quad (20)$$

We perform eigendecomposition on the normalized Laplacian to obtain its first eigenvalues, construct a matrix from the corresponding eigenvectors, and perform k-means on the row vectors of this matrix.

For clarity, the C2SWCE_SC algorithm is summarized in Algorithm 2.

Input: ,
Output:
1. Perform clustering on each base subspace to generate a set of subspace clustering solutions
2. Generate the adaptive core clusters according to Definition 1
3. Calculate the average distances between core clusters according to Equations (8) and (9)
4. Calculate the CSI of each cluster according to Equation (10), and obtain the subspace weights according to Equations (11)–(13)
5. Construct the core cluster similarity matrix of each base subspace according to Equations (14) and (15), and the weighted core cluster similarity matrix according to Equation (16)
6. Build the graph with the core clusters as nodes and the weighted core cluster similarity matrix as the adjacency matrix
7. Construct the normalized graph Laplacian according to Equation (20)
8. Perform eigendecomposition on the normalized Laplacian to obtain the first eigenvalues and the corresponding eigenvectors
9. After row-normalizing the eigenvector matrix, perform k-means to categorize the core clusters
10. Map the labels of the core clusters to the samples
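The core of C2SWCE_SC can be sketched in Python as standard normalized spectral clustering over the core-cluster graph, assuming the symmetric normalized Laplacian of Equation (20); the helper names are ours.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def c2swce_sc(S, cores, n_samples, n_clusters, seed=0):
    """Consensus by spectral clustering of the weighted core-cluster graph (sketch)."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(S)) - (d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :])
    # eigenvectors of the smallest eigenvalues give the spectral embedding
    _, vecs = eigh(L)
    U = vecs[:, :n_clusters]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # row-normalise
    core_labels = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(U)
    labels = np.empty(n_samples, dtype=int)
    for g, c in zip(cores, core_labels):      # map core-cluster labels back to samples
        labels[g] = c
    return labels
```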
3.4.3. Bipartite Graph Partition Based on the Core Clusters

Under the C2SWCE framework, we propose the bipartite graph clustering ensemble based on core clusters, termed C2SWCE_BG. In the ensemble process, we use the core clusters and the clusters as graph nodes to construct the bipartite graph. The specific steps are described below.

In different base subspaces, the core clusters are grouped into different sets of clusters. Over all base subspaces, the set of clusters is

$$\mathcal{C} = \{C_1^1, \ldots, C_{k_1}^1, C_1^2, \ldots, C_{k_m}^m\}, \quad (21)$$

where $k_t$ is the number of clusters in $\pi_t$. We view both the core clusters and the clusters as graph nodes and construct the bipartite graph; that is,

$$G_B = (V_B, E_B), \quad V_B = \Theta \cup \mathcal{C}, \quad (22)$$

where $\Theta$ is the node set corresponding to the core clusters and $\mathcal{C}$ is the node set corresponding to the clusters of all base subspaces; $E_B$ is the edge set.

In $\mathcal{C}$, clusters in the same subspace contain different core clusters, while clusters in different subspaces may contain the same core cluster. Therefore, we use the Jaccard coefficient to measure the similarity of clusters. The core clusters are viewed as the base units, and the similarity matrix between clusters is defined as

$$J = \left[J(C_i, C_j)\right], \quad C_i, C_j \in \mathcal{C}, \quad (23)$$

where

$$J(C_i, C_j) = \frac{|\Theta(C_i) \cap \Theta(C_j)|}{|\Theta(C_i) \cup \Theta(C_j)|}, \quad (24)$$

and $\Theta(C_i)$ denotes the set of core clusters contained in cluster $C_i$.

Clusters in the same base subspace do not intersect, so the Jaccard coefficient between clusters in the same subspace is 0. The similarity between cluster pairs is calculated by Equation (24), and the clusters are weighted based on this similarity.

The similarity matrix between clusters and the core clusters is constructed as where

In Equation (26), is the cluster stability index of in , is the weight of in (,,).

Connect the matrix and matrix to generate matrix , that is,

The entry of corresponds to the weight of the edge between the two nodes in , denoted as

In Equation (28), is the cluster stability index of the th cluster. is the Jaccard coefficient operator, which is calculated according to Equation (24). In , there are no edges between cluster nodes or between the core cluster nodes, only edges between cluster nodes and core cluster nodes.

For the edge between a core cluster and a cluster, we use the weight of the base subspace corresponding to the cluster to which the core cluster belongs as the weight of that edge. The higher the stability of the cluster, the greater its impact during integration and the greater the weight assigned.

Finally, we use Tcut [14] to segment the bipartite graph. All nodes are partitioned into disjoint groups, and the samples contained in the core clusters and clusters of the same group are assigned to the same consensus cluster.
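The sketch below illustrates the bipartite construction in simplified form: each core cluster is connected to the clusters that contain it, with an edge weight combining the subspace weight and the cluster stability index, and the graph is cut with a standard bipartite spectral partition (SVD of the degree-normalized connection matrix followed by k-means) as a stand-in for Tcut. The exact edge weights of Equations (25)–(28), the csi_per_cluster input (a per-subspace dictionary mapping cluster labels to their CSI), and the SVD-based cut are our assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def c2swce_bg(labels_per_subspace, cores, weights, csi_per_cluster,
              n_samples, n_clusters, seed=0):
    """Sketch of C2SWCE_BG: bipartite graph between core clusters and clusters."""
    # enumerate all clusters of all subspaces as columns of the connection matrix
    cols, col_w = [], []
    for t, lab in enumerate(labels_per_subspace):
        for c in np.unique(lab):
            cols.append((t, c))
            col_w.append(weights[t] * csi_per_cluster[t][c])   # assumed edge weight
    B = np.zeros((len(cores), len(cols)))
    for p, g in enumerate(cores):
        for j, (t, c) in enumerate(cols):
            if labels_per_subspace[t][g[0]] == c:              # core cluster p lies in cluster c
                B[p, j] = col_w[j]
    # bipartite spectral partition: SVD of the degree-normalised matrix
    d1 = np.maximum(B.sum(axis=1), 1e-12)
    d2 = np.maximum(B.sum(axis=0), 1e-12)
    Bn = B / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, _ = np.linalg.svd(Bn, full_matrices=False)
    core_labels = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(U[:, :n_clusters])
    out = np.empty(n_samples, dtype=int)
    for g, c in zip(cores, core_labels):
        out[g] = c
    return out
```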

The specific steps of C2SWCE_BG are summarized in Algorithm 3.

Input:,
Output:
1. Perform clustering on each base subspace to generate a set of subspace clustering solutions
2. Generate the adaptive core clusters according to Definition 1
3. Calculate the similarity between clusters according to Equations (23) and (24)
4. Calculate the similarity between clusters and core clusters according to Equations (25) and (26)
5. Construct the bipartite graph according to Equation (22)
6. Combine the two similarity matrices to generate the edge weight matrix according to Equations (27) and (28)
7. Segment the bipartite graph by Tcut
8. Map the labels of the core clusters to the samples
3.4.4. Metacluster-Based Ensemble Clustering

Under the proposed C2SWCE framework, we propose the metacluster-based consensus clustering algorithm, termed C2SWCE_MC. In the proposed approach, clusters are regarded as the basic units; the similarities between them are used to divide the clusters into different groups, and the samples of the clusters in the same group are assigned to the same consensus cluster.

C2SWCE_MC first treats all clusters as nodes and constructs the similarity graph, which is defined as

$$G_M = (V_M, E_M), \quad (29)$$

where $V_M$ is the node set of all clusters in $\mathcal{C}$ and $E_M$ is the edge set. The weights of the edges between nodes are defined by the similarity matrix of the clusters in Equations (23) and (24), that is,

$$E_M(C_i, C_j) = J(C_i, C_j). \quad (30)$$

Finally, we adopt Ncut [15] to partition the cluster nodes into disjoint metaclusters, denoted as $M = \{M_1, M_2, \ldots, M_K\}$, where $M_l$ ($l = 1, \ldots, K$) represents the $l$th metacluster, which is a set of clusters. The core clusters are treated as the base units in the ensemble. In each subspace, a core cluster $\theta_p$ is partitioned into exactly one cluster, so $\theta_p$ may appear in multiple clusters of a metacluster. We define the discriminant function to represent this relationship:

$$\delta(\theta_p, C) = \begin{cases} 1, & \theta_p \subseteq C, \\ 0, & \text{otherwise}. \end{cases} \quad (32)$$

The probability that $\theta_p$ ($p = 1, \ldots, n_\theta$) is grouped into metacluster $M_l$ is

$$P(\theta_p, M_l) = \frac{1}{|M_l|} \sum_{C \in M_l} \delta(\theta_p, C), \quad (34)$$

where $|M_l|$ is the number of clusters in the $l$th metacluster.

Finally, $\theta_p$ is assigned to the metacluster with the highest probability, namely,

$$l^* = \arg\max_{l} P(\theta_p, M_l). \quad (35)$$
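A simplified Python sketch of this metacluster route follows: clusters are grouped by their Jaccard similarity over core clusters, and each core cluster then joins the metacluster that contains it with the highest frequency. Using scikit-learn's SpectralClustering in place of Ncut, and the helper names, are our substitutions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def c2swce_mc(labels_per_subspace, cores, n_samples, n_clusters, seed=0):
    """Sketch of C2SWCE_MC: metaclusters of clusters, then voting by core clusters."""
    # membership matrix: one row per cluster, one column per core cluster
    rows = []
    for lab in labels_per_subspace:
        for c in np.unique(lab):
            rows.append([1 if lab[g[0]] == c else 0 for g in cores])
    M = np.array(rows, dtype=float)
    inter = M @ M.T
    union = M.sum(1)[:, None] + M.sum(1)[None, :] - inter
    J = inter / np.maximum(union, 1e-12)                 # Jaccard similarity of clusters
    meta = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',
                              random_state=seed).fit_predict(J)
    # each core cluster joins the metacluster that contains it most frequently
    sizes = np.maximum(np.bincount(meta, minlength=n_clusters), 1)
    core_labels = np.empty(len(cores), dtype=int)
    for p in range(len(cores)):
        votes = np.bincount(meta[M[:, p] > 0], minlength=n_clusters)
        core_labels[p] = np.argmax(votes / sizes)
    out = np.empty(n_samples, dtype=int)
    for g, c in zip(cores, core_labels):
        out[g] = c
    return out
```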

We summarize C2SWCE_MC in Algorithm 4.

Input: ,
Output:
1. Perform clustering on each base subspace to generate a set of subspace clustering solutions
2. Generate the adaptive core clusters according to Definition 1
3. Calculate the similarity matrix of the clusters according to Equations (23) and (24)
4. Build the graph according to Equations (29)–(31)
5. Partition the graph nodes into disjoint groups by Ncut, generating the metaclusters
6. Treat the core clusters as the base units and determine the metacluster of each core cluster according to Equations (32), (34), and (35)
7. Map the labels of the core clusters to the samples

4. Experiments

In this section, we conduct experiments on eight high-dimensional datasets to compare the proposed four subspace clustering ensemble algorithms against several clustering approaches. All the experiments in this paper are conducted in MATLAB R2016a on a PC with 8 Intel 3.40 GHz processors and 8 GB of RAM.

4.1. Datasets

In our experiments, we use eight high-dimensional datasets, including four cancer gene expression datasets and four image datasets. The four gene expression datasets are Yeoh02v1, Yeoh02v2, Bhattacharjee2001, and Golub1999v1. The Yeoh02v1 and Yeoh02v2 datasets are pediatric acute lymphoblastic leukemia datasets, in which Yeoh02v1 contains 2 categories of genes expressed in leukemic blasts and Yeoh02v2 contains 6 categories. Bhattacharjee2001 is a lung tumor dataset, which contains 186 lung tumor samples and 17 normal lung tissues. Golub1999v1 is a leukemia dataset, which contains acute myeloid leukemia samples and acute lymphoblastic leukemia samples. The four image datasets are COIL_20, USPS, Semeion, and Multiple Features. The COIL_20 dataset is an image dataset containing 20 object categories, and the other three image datasets are all handwritten digit datasets. The USPS dataset contains a total of 10 categories and 11,000 samples. To facilitate comparison, we randomly selected 10% of the samples from each category of the USPS dataset to form a dataset containing 1,100 samples, which is denoted as USPS_10P.

To simplify the description, the 8 datasets are abbreviated as D1 to D8, respectively. The details of the datasets are given in Table 2. The datasets are publicly available at the following sources:
(1) https://schlieplab.org/Supplements/CompCancer/
(2) http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
(3) https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
(4) http://archive.ics.uci.edu/ml/index.php

4.2. Evaluation Criterion

We use the normalized mutual information (NMI) and the clustering accuracy (ACC) to evaluate the quality of the clustering results. The NMI metric measures the accuracy of a clustering result according to the information shared by the ground-truth clustering solution and the test clustering solution. Let $\pi^*$ denote the clustering solution of the proposed method and $\pi^{gt}$ denote the ground-truth clustering solution. The NMI score of $\pi^*$ with respect to $\pi^{gt}$ is calculated as

$$\mathrm{NMI}(\pi^*, \pi^{gt}) = \frac{\sum_{i=1}^{k^*} \sum_{j=1}^{k^{gt}} n_{ij} \log \dfrac{N\, n_{ij}}{n_i\, n_j}}{\sqrt{\left(\sum_{i=1}^{k^*} n_i \log \dfrac{n_i}{N}\right)\left(\sum_{j=1}^{k^{gt}} n_j \log \dfrac{n_j}{N}\right)}}, \quad (36)$$

where $k^*$ and $k^{gt}$ are the numbers of clusters in $\pi^*$ and $\pi^{gt}$, respectively, $n_i$ is the number of samples in the $i$th cluster of $\pi^*$, $n_j$ is the number of samples in the $j$th cluster of $\pi^{gt}$, $N$ is the number of input samples, and $n_{ij}$ is the number of samples that the $i$th cluster of $\pi^*$ and the $j$th cluster of $\pi^{gt}$ jointly contain.

The ACC measures the ratio of the number of correctly classified samples to the number of all samples. The ACC is an indicator that evaluates the accuracy of the clustering result, which is defined as

$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left(y_i, \mathrm{map}(c_i)\right), \quad (37)$$

where $y_i$ is the ground-truth label corresponding to $x_i$ and $c_i$ is the test clustering label assigned to $x_i$ by the proposed approach. $\mathrm{map}(\cdot)$ is the relabeling function that aligns the test clustering labels with the ground-truth labels. $\delta(\cdot, \cdot)$ is an indicator function, which is defined as

$$\delta(a, b) = \begin{cases} 1, & a = b, \\ 0, & \text{otherwise}. \end{cases} \quad (38)$$

The value interval of the ACC and NMI scores is $[0, 1]$, and a higher ACC or NMI score indicates a better clustering result.
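For reference, both criteria can be computed as follows in Python; the NMI comes from scikit-learn, and the label alignment for ACC is obtained with the Hungarian algorithm (our choice of implementation).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC with the optimal label permutation found by the Hungarian algorithm."""
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    classes_t, yt = np.unique(y_true, return_inverse=True)
    classes_p, yp = np.unique(y_pred, return_inverse=True)
    cost = np.zeros((len(classes_p), len(classes_t)), dtype=int)
    for i, j in zip(yp, yt):
        cost[i, j] += 1                          # confusion counts
    row, col = linear_sum_assignment(-cost)      # maximise matched samples
    return cost[row, col].sum() / len(y_true)

# usage
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
print(clustering_accuracy(y_true, y_pred))           # 1.0
```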

4.3. Discussion of the Parameter Selection

There are several parameters in the proposed C2SWCE, where $m$ is the number of subspaces, $r_1$ is the random feature selection ratio, $r_2$ is the unsupervised feature selection ratio, and $K$ is the number of clusters. To increase the diversity of the ensemble members, each subspace randomly generates a different number of clusters within a given range. We set the random feature selection ratio $r_1$ to a fixed value, and the number of generated random subspaces $m$ is varied over a range with an increment of 5. For each parameter setting, we run the spectral clustering method 10 times in each random subspace, generating core clusters according to Definition 1 each time. The average number of generated core clusters is shown in Figure 2. (1) The relationship between the number of subspaces and the number of core clusters

In the proposed C2SWCE framework, the core cluster is the basic unit of integration.

In Figure 2, $m$ is the number of generated random subspaces, $N$ is the number of input samples, and $n_\theta$ is the number of core clusters. It is observed from Figure 2 that, on each dataset, as the number of subspaces increases, so does the number of generated core clusters. For the image datasets (Figure 2(b)), even when the number of subspaces is set to its largest tested value, the number of core clusters is still much smaller than the number of input samples, and this trend is most pronounced on the D6 and D8 datasets. For the gene expression datasets (Figure 2(a)), due to the small number of input samples, when the number of subspaces is large, the number of generated core clusters is close to the number of input samples. (2) The influence of parameters $m$ and $r_2$ on clustering accuracy

In this section, we compare the effect of the number of random subspaces $m$ and the unsupervised feature selection ratio $r_2$ on the accuracy of the clustering ensemble. We vary parameter $m$ over its value interval with an increment of 5. For each random subspace, the unsupervised feature selection ratio $r_2$ is varied over its range with an increment of 0.1. For each parameter setting of $m$ and $r_2$, we run the proposed methods 10 times, and the averages of the NMI scores are shown in Figure 3.

As shown in Figure 3, for the image datasets (D5-D7), the clustering results are insensitive to parameters $m$ and $r_2$. When $r_2$ is fixed, the number of subspaces has little effect on the consensus clustering solutions; when $m$ is fixed, the NMI score is relatively high over a broad range of $r_2$. For the D1-D3 datasets, as shown in Figures 3(a)–3(c), the higher NMI scores are concentrated in a middle interval of parameter $m$ and a middle interval of parameter $r_2$. For the D4 dataset, it can be clearly observed that the NMI score is higher only under a narrower range of parameter settings.

For each dataset, the more base subspaces the data are divided into, the more adaptive core clusters are generated, which increases the computational complexity and the algorithm runtime. As observed from Figure 2, the number of core clusters generated when parameter $m$ lies in a moderate range is acceptable. At the same time, when $m$ and $r_2$ are set in these ranges, the ensemble results have higher NMI scores. Therefore, in subsequent experiments, we fix $m$, $r_1$, and $r_2$ to the same values in the experiments on all datasets.

4.4. Compare the Effects of Feature Selection and Weighted Strategies on Ensemble Results

In this section, we first compare the impact of the hybrid feature selection strategy on the consensus clustering solutions. For each dataset, the proposed four ensemble methods are run on the random subspaces and on the base subspaces generated by the hybrid feature selection strategy, and their clustering results are compared. For a fair comparison, each ensemble approach is run 10 times in the unweighted manner, and the average of its NMI scores is recorded. The comparison results are reported in Table 3, where the notation “N” corresponds to the clustering result on the random subspaces and the notation “Y” corresponds to the clustering result on the base subspaces.

As shown in Table 3, the clustering performance of the proposed clustering ensemble approaches on the base subspaces is consistently better than that on the random subspaces. Therefore, in subsequent experiments, we use the hybrid feature selection strategy: unsupervised feature selection is performed on the random subspaces to generate a set of base subspaces, and the ensemble members are then generated by performing clustering on the base subspaces.

The proposed four ensemble approaches treat the core cluster as the base unit. The base subspace clustering solutions are weighted by calculating the distances between the core cluster pairs contained in each cluster, or the similarity of the core clusters to each cluster. To verify the effect of the weighted clustering ensemble, we compare the results of the weighted and unweighted ensemble manners on the base subspaces. Each ensemble method is run 10 times, and the average NMI scores of the proposed methods are recorded in Table 4. In Table 4, the notation “N” corresponds to the unweighted clustering results of the proposed methods, and the notation “Y” corresponds to the weighted clustering results.

As observed from Table 4, on all datasets except D4, the fusion results of the proposed four ensemble methods under the weighted ensemble manner are better than those under the unweighted ensemble manner. For the D4 dataset, the NMI score of the weighted ensemble manner of the proposed C2SWCE_HC is 0.9101, which is higher than the NMI score of 0.8257 for the unweighted ensemble manner; however, the clustering results obtained by the other three ensemble methods with the weighted manner show no significant advantage on this dataset. For the other datasets, the consensus clustering results achieved by the weighted ensemble manner are better than those obtained by the unweighted ensemble manner. This illustrates the effectiveness of the proposed weighted ensemble strategies.

4.5. Comparison with Base Clustering

Subspace clustering ensemble can fuse multiple subspace clustering solutions to obtain a more accurate and robust consensus solution. In this section, we compare the clustering performance of the proposed locally weighted subspace clustering ensemble approaches, namely, C2SWCE_HC, C2SWCE_SC, C2SWCE_BG, and C2SWCE_MC, against the spectral clustering approach. Each method is run 20 times, and the average NMI scores are shown in Figure 4.

In Figure 4, “base” corresponds to the base clustering results, and “HC,” “SC,” “BG,” and “MC” correspond to the clustering results of C2SWCE_HC, C2SWCE_SC, C2SWCE_BG, and C2SWCE_MC, respectively. As can be seen from Figure 4, on each experimental dataset, the proposed approaches achieve better and more robust consensus clustering results than the spectral clustering approach. In particular, on the D1, D2, D3, D6, D7, and D8 datasets, the performance of the proposed four clustering ensemble approaches is significantly better than that of the spectral clustering approach.

4.6. Comparison with Other Clustering Methods

In this section, we compare the proposed consensus clustering approaches with state-of-the-art clustering approaches to evaluate their effectiveness. Among the comparative clustering approaches, k-means and GNMF [16] are classic clustering methods, and SSC [17] and LSC [2] are clustering methods based on spectral clustering. Among the contrasting ensemble methods, MDEC_SC [4], MDEC_HC [4], and MDEC_BG [4] all use spectral clustering to generate the clustering ensemble members and then use spectral clustering, hierarchical clustering, and bipartite graph partitioning, respectively, to fuse the base clustering solutions. SC_SRGF [5] adopts spectral clustering to obtain the clustering solution of each subspace, whereas ECPCS-HC [18], ECPCS-MC [18], WEAC-AL [19], GP-MGLA [19], LWGP [20], and LWEA [20] all use k-means to generate the base clustering solutions. All parameters of the comparison approaches are set as suggested in the corresponding papers.

For a fair comparison, we run each clustering approach 20 times on each dataset, and the average performance in terms of NMI and ACC is summarized in Tables 5 and 6, respectively. In Tables 5 and 6, the highest score is highlighted in bold, and the symbol “–” indicates that the algorithm cannot be performed on the corresponding dataset.

The landmark-based spectral clustering (LSC) [2] approach is not suitable for datasets in which the number of features is greater than the number of samples, so there are no corresponding clustering results on the D1-D4 datasets. As shown in Table 5, SSC achieves the best average performance in terms of NMI on the D3 dataset. For the remaining datasets, the clustering ensemble approaches yield better average performance than the traditional clustering approaches. This confirms the earlier observation that clustering ensemble approaches are more suitable than traditional clustering approaches for high-dimensional data clustering scenarios.

As can be seen from Tables 5 and 6, the proposed C2SWCE_HC achieves the highest average NMI scores on all datasets among the comparative hierarchical clustering ensemble approaches. Especially on the D1-D4 datasets, the performance of C2SWCE_HC is significantly better than that of WEAC_AL and ECPCS-HC, and it also achieves the highest average ACC score on the D2, D4, D6, and D8 datasets. Comparing the ensemble approaches based on spectral clustering, the average performance in terms of NMI of C2SWCE_SC is significantly higher than that of the other methods on the D1-D4 datasets. For example, on the D1 and D2 datasets, the average NMI scores of SC_SRGF are 0.2371 and 0.1931, respectively, while the average NMI scores of C2SWCE_SC are 0.8943 and 0.5666, respectively, roughly three to four times higher. In addition, compared with the other bipartite graph partitioning ensemble methods, the proposed C2SWCE_BG achieves the highest average NMI score on 6 datasets and the highest average ACC score on 5 datasets. Its performance is significantly better than that of MDEC_BG, GP_MGLA, and LWGP. In particular, on the D1, D3, and D4 datasets, the average NMI scores of C2SWCE_BG are 0.8821, 0.5514, and 0.8955, respectively, significantly exceeding the corresponding NMI scores of 0.2794, 0.2847, and 0.2194 for GP_MGLA. Finally, the metacluster-based ensemble clustering proposed in this paper is significantly better than the ECPCS-MC method on all datasets except D7.

4.7. Execution Time

In this section, we compare the execution times of different clustering ensemble approaches in the integration phase on the USPS dataset. The USPS dataset contains 10 categories with a total of 11,000 samples. To compare the clustering performance of different ensemble approaches on datasets of different sizes, we first randomly select samples of different proportions from each category of the input dataset, according to the method described in Section 4.1, to generate USPS datasets of different sizes. In the experiment, the sampling ratio is varied with an increment of 0.1; that is, each generated dataset contains the corresponding proportion of the samples in the USPS dataset. We run the various ensemble approaches on the USPS datasets of different sizes, and Figure 5 compares their execution times in the integration phase.

As can be seen from Figure 5, the execution times of all ensemble approaches increase significantly as the size of the USPS dataset increases, with the SC_SRGF and MDEC_SC approaches increasing the fastest. When clustering all samples of the USPS dataset, the proposed C2SWCE_HC, C2SWCE_SC, C2SWCE_BG, and C2SWCE_MC approaches take 195.23 s, 209.15 s, 220.72 s, and 231.65 s, respectively. Compared with the other ensemble methods, the proposed methods have obvious advantages; in particular, the computational efficiency of C2SWCE_SC is higher than that of the contrasting spectral clustering ensemble methods MDEC_SC and SC_SRGF. The results show that the spectral clustering method based on core clusters can reduce the complexity of spectral clustering to a certain extent. Further, compared with the other hierarchical clustering ensemble methods, the execution time of C2SWCE_HC is comparable to that of WEAC_AL but much lower than those of ECPCS-HC and MDEC_HC. At the same time, we also observe that the execution time of C2SWCE_BG is slightly higher than that of GP_MGLA but has a significant advantage over MDEC_BG.

In summary, the proposed approaches have a modest computational cost for ensemble tasks on USPS datasets of different sizes. Compared with ensemble approaches of the same type, the proposed four approaches have an advantage in execution time.

5. Conclusion

In this paper, we propose a novel locally weighted subspace clustering ensemble framework, termed C2SWCE. It first uses a hybrid feature selection strategy that combines random feature selection and unsupervised feature selection to generate a set of base subspaces. The strategy combines the diversity of random feature selection with the unsupervised selection of representative features from each random subspace, further reducing the dimensionality of the random subspaces. To increase the diversity of the subspace ensemble members, in addition to using random feature selection to generate feature subspaces, we also randomly group the samples into different numbers of clusters. Furthermore, we introduce the concept of the core cluster. In the ensemble process, the core cluster is viewed as the base unit, which improves the ensemble efficiency to a certain extent. The subspace clustering solutions are weighted by evaluating the stability of the clusters. Last but not least, under the proposed framework, four weighted ensemble approaches are proposed to integrate the clustering solutions of the base subspaces and achieve the final clustering result. Extensive experiments are conducted on 8 real-world datasets to verify the effectiveness of the proposed ensemble approaches. The experimental results show that, compared with state-of-the-art ensemble methods, our methods are more robust and offer a favorable overall trade-off between clustering accuracy and efficiency.

Data Availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The author states that this article has no conflict of interest.