Abstract

Clustering analysis is an unsupervised learning method with applications across many fields, such as pattern recognition, machine learning, information security, and image segmentation. The density-based method, as one of the various clustering approaches, has achieved good performance. However, it performs poorly on multidensity and complex-shaped datasets. Moreover, its result depends heavily on the input parameters. Thus, we propose a novel clustering algorithm (called MST-DC) in this paper, which is based on the density core. Firstly, we employ the reverse nearest neighbors to extract core objects. Secondly, we use the minimum spanning tree algorithm to cluster the core objects. Finally, the remaining objects are assigned to the cluster to which their nearest core object belongs. The experimental results on several synthetic and real-world datasets show the superiority of MST-DC over Kmeans, DBSCAN, DPC, DCore, SNNDPC, and LDP-MST.

1. Introduction

Clustering analysis, which classifies unlabeled data into clusters, refers to the task of discovering the internal structure of data or the potential data models [1]. Since the early 1950s, quite a few clustering algorithms have been put forward [2, 3]. These algorithms can be roughly classified into four categories: partition-based clustering algorithms [4, 5], hierarchical clustering algorithms [6, 7], density-based clustering algorithms [8, 9], and graph-based clustering algorithms [10–12]. Thanks to their capability of discovering clusters of different shapes and sizes along with outliers, density-based and partition-based clustering technologies are widely used in the fields of health care [13], information security [14], the Internet [15], etc. Besides, clustering is also a key technique for analyzing big data.

Partition-based clustering algorithms are the simplest and most fundamental clustering algorithms. They organize the data objects into several nonoverlapping partitions where each partition represents a cluster, and each data object belongs to one cluster [16]. Nevertheless, traditional partition-based methods usually cannot find clusters with arbitrary shapes.

However, identifying clusters with arbitrary shapes is a very important task in the applications of clustering algorithms. Since the density-based clustering algorithm does not need to know the number of clusters in advance and can effectively process datasets with arbitrary shapes, it has always been a focus of clustering research. The idea of density-based clustering [17] is that the clusters in a dataset are collections of dense data regions separated by sparse data regions. DBSCAN [18] and DPC [8] are two typical density-based algorithms. The DBSCAN algorithm requires the user to set the neighborhood radius (Eps) and the minimum number of points (MinPts), and it classifies data points into core points, boundary points, and outliers. DBSCAN can be effectively applied to datasets with complex shapes, and outliers can be detected during the clustering process. However, the algorithm gets poor clustering results on multidensity datasets. Furthermore, different datasets require different parameter settings in DBSCAN, which leads to unstable clustering results. The DPC (density peak clustering) algorithm was published in the journal Science in 2014, and it holds that cluster centers are characterized by a higher density and a relatively larger distance from points of higher density [8]. However, DPC still has some drawbacks. Firstly, a cutoff distance needs to be set by users. Besides, the cluster centers are selected from a decision graph, which involves human judgment.

To improve the performance of DPC, DPC-KNN-PCA [19] and SNN-DPC [20] have been proposed. DPC-KNN-PCA integrates PCA, DPC, and KNN to avoid the defects of DPC. However, this density-based algorithm still cannot recognize clusters with manifold distributions [21]. To address the dependence on the cutoff distance, paper [20] proposed a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm. The computation of its local density, and of the distance to the nearest point of higher density, takes both the nearest neighbors and the shared neighbors into consideration. The assignment process in DPC is sensitive and has low fault tolerance: if a data point is assigned incorrectly, the subsequent assignments magnify the error, resulting in more errors that seriously degrade the clustering result. Therefore, paper [20] adopted a two-step assignment to overcome this drawback of DPC. Yet, SNN-DPC still has several apparent defects. Firstly, the number of nearest neighbors needs to be set by manual experience. Besides, SNN-DPC still relies on a decision graph to select cluster centers. MST-based clustering methods [22, 23] do not assume that data points are grouped around centers or separated by regular geometric curves. Instead, they use tree edge information to divide a dataset into clusters and are able to recognize clusters with arbitrary shapes. However, they are time-consuming and susceptible to noise points. The LDP-MST algorithm in [24] uses a new distance between local density peaks based on shared neighbors to construct a minimum spanning tree on the local density peaks, which excludes the interference of noise points and reduces the running time of MST-based clustering algorithms. Nevertheless, LDP-MST still needs input parameters, which means that the algorithm cannot exclude the interference of human factors.

To resolve the problems mentioned above, we propose a novel clustering algorithm (called MST-DC). Firstly, we automatically obtain the reverse neighbors of each object based on the concept of natural neighbor searching. Secondly, we obtain the core objects according to the number of reverse neighbors of each object. Thirdly, based on the Prim algorithm from graph theory, we construct a minimum spanning tree of the core objects to obtain the clustering result of the core objects. Lastly, unallocated objects are assigned the label of their nearest core objects. There is no need to set parameters in MST-DC. Furthermore, MST-DC can be applied to complex patterns with extremely large variations in density.

The remainder of this paper is organized as follows: Section 2 presents a brief overview of density core and natural neighbors; Section 3 presents the clustering algorithm (MST-DC); Section 4 presents the analysis of synthetic datasets and real datasets; and finally, Section 5 presents the summary of this paper and future work.

2. Related Work

In this section, we review related work on the density core [25] and natural neighbors [26], which were originally described by Dai et al. [25].

2.1. Density Core

There exist some intrinsic defects in centroid-based clustering methods, including shape loss, false distances, and false peaks, which cause centroid-based methods to fail when applied to complex patterns [27]. Hence, Chen et al. [27] proposed a hybrid decentralized approach named DCore to overcome these defects. Density cores can roughly maintain the shape of a cluster while being located far from each other.

As is well known, the mean shift algorithm can identify nonspherical patterns by shifting tracks. Thus, DCore uses mean shift together with a center-seeking step to obtain convergence points. DCore is a hybrid method that decentralizes each density peak into a loose density core, which avoids some intrinsic defects of centroid-based clustering approaches. The application of DCore to different datasets indicates that it performs well on many complex datasets. Nevertheless, it still has some obvious limitations:

(1) DCore uses a global fixed scanning radius to search for convergent points and density cores. For a dataset with multiple density levels, it cannot obtain ideal density representative points with a global fixed scanning radius, and thus cannot obtain ideal clustering results.

(2) To filter noise, DCore adopts three filtering strategies. However, it is usually difficult to determine the specific pattern of a dataset, so it is difficult to select the corresponding strategy to detect outliers and noise.

(3) DCore needs to adjust five parameters to attain better clustering results, and it is difficult to find the ideal combination of parameters that makes the clustering perform better.

2.2. Natural Neighbor

Natural neighbor [26] is a new concept that originates from the observation that the number of one's real friends should be the number of people who regard him or her as a friend and whom he or she regards as friends at the same time. For example, if object p regards object q as a neighbor and object q regards object p as a neighbor at the same time, then object q is one of the natural neighbors of object p. To put it another way, the main idea of the natural neighbor stable structure is that objects lying in sparse regions possess a small number of neighbors, whereas objects lying in dense regions possess a large number of neighbors [28]. Thus, the natural neighbor stable structure of objects is formulated as follows:

\[(\forall p \in D)(\exists q \in D)\,(p \neq q) \wedge (q \in KNN_k(p)) \wedge (p \in KNN_k(q)),\]

where KNN_k(q) is the set of k nearest neighbors of object q.

Definition 1. (k nearest neighbors). The k nearest neighbors of object p are the set of the k objects in dataset D closest to p, that is,

\[KNN_k(p) = \{q \in D \mid d(p, q) \le d(p, o_k)\},\]

where d(p, o_k) is the distance between object p and its k-th nearest object o_k.

Definition 2. (Reverse neighbors). The reverse neighbors of object p are the set of objects that regard p as one of their k nearest neighbors, i.e.,

\[RNN_k(p) = \{q \in D \mid p \in KNN_k(q)\}.\]

The natural neighbor stable structure is formed as follows: the neighbor searching range k is continuously expanded from 1 to λ (λ is named the natural neighbor eigenvalue (NaNE)) [26]; in each searching round, the number of reverse neighbors of each object is calculated and the following two conditions are checked: (1) all objects have reverse neighbors, and (2) the number of objects without reverse neighbors remains unchanged. When one of these conditions is met, the natural neighbor stable structure is formed, and the searching range at this moment is equal to the natural characteristic value. Therefore, λ is obtained by

\[\lambda = \min_k \{k \mid (\forall p \in D)\, f_k(p) > 0 \ \vee\ \mathrm{num}_k = \mathrm{num}_{k-1}\},\]

where k is initialized with 1, f_k(p) = |RNN_k(p)| is the number of object p's reverse neighbors in the k-th iteration (note that f_0(p) = 0), and num_k is defined as follows:

\[\mathrm{num}_k = |\{p \in D \mid f_k(p) = 0\}|.\]

Based on the above concepts, the natural neighbor is defined as follows:

Definition 3. (Natural neighbors). For each object p, its λ nearest neighbors are its natural neighbors, denoted as NaN(p) = KNN_λ(p), where λ is equal to the natural characteristic value.

Apparently, each object has the same number of natural neighbors in this paper. The details of the natural neighbor searching algorithm are shown in Algorithm 1.

Input: D (the dataset)
Output: λ (the natural characteristic value), NaN (natural neighbors)
(1) Initialize k = 1, num_0 = |D|, KNN_0(p) = ∅, RNN(p) = ∅, and SRNN(p) = 0 for each object p in D;
(2) create a KD-tree T from the dataset D;
(3) while true do
(4)   for each object p in D do
(5)     find the k-th nearest neighbor q of p using T;
(6)     KNN_k(p) = KNN_{k-1}(p) ∪ {q};
(7)     RNN(q) = RNN(q) ∪ {p};
(8)     SRNN(q) = SRNN(q) + 1;
(9)   end for
(10)  num_k = |{p ∈ D | SRNN(p) = 0}|;
(11)  Δ = num_{k-1} − num_k;
(12)  if num_k = 0 or Δ = 0 then
(13)    break;
(14)  end if
(15)  k = k + 1;
(16) end while
(17) λ = k;
(18) for each object p in D do
(19)   NaN(p) = KNN_λ(p);
(20) end for
(21) Return λ, NaN
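To make the procedure concrete, the following Python sketch mirrors Algorithm 1 using a KD-tree from SciPy. It is only an illustrative reimplementation under the assumptions noted in the comments (the dataset is an n-by-d NumPy array of distinct points, and names such as natural_neighbor_search are ours), not the authors' released MATLAB code.

import numpy as np
from scipy.spatial import cKDTree

def natural_neighbor_search(D):
    """Return the natural characteristic value lambda, the reverse-neighbor
    count of every object, and the natural neighbors of every object in D."""
    n = len(D)
    tree = cKDTree(D)                      # step (2): KD-tree on the dataset
    rnn_count = np.zeros(n, dtype=int)     # SRNN(p) for each object p
    knn = [[] for _ in range(n)]           # KNN_k(p), grown one neighbor per round
    k, prev_unreached = 1, n
    while True:
        # step (5): column k of the query result is the k-th neighbor
        # (column 0 is the query point itself, assuming distinct points)
        _, idx = tree.query(D, k=k + 1)
        kth = idx[:, k]
        for p in range(n):
            q = kth[p]
            knn[p].append(q)               # KNN_k(p) = KNN_{k-1}(p) U {q}
            rnn_count[q] += 1              # p becomes a reverse neighbor of q
        unreached = int(np.sum(rnn_count == 0))
        # steps (10)-(14): stop when every object has a reverse neighbor,
        # or when the number of objects without reverse neighbors stops changing
        if unreached == 0 or unreached == prev_unreached:
            break
        prev_unreached = unreached
        k += 1
    lam = k                                # natural characteristic value (NaNE)
    nan = {p: knn[p] for p in range(n)}    # NaN(p) = KNN_lambda(p)
    return lam, rnn_count, nan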

3. The Proposed Algorithm

Based on the idea of the density core, we apply the reverse nearest neighbors obtained by the natural neighbor searching algorithm to extract the density core set, which maintains the general shape of the clusters well. Subsequently, we construct a minimum spanning tree of the density core points for clustering. Because the reverse nearest neighbors are obtained during natural neighbor searching, this process does not require any parameter settings. MST-DC can recognize extremely complicated clusters with large variations in density.

3.1. Density Core Set

According to Algorithm 1, we calculate the number of reverse nearest neighbors of each object. Since the number of reverse nearest neighbors of a core object is greater than that of a noncore object, we use the number of reverse nearest neighbors to extract the core objects. The core object is defined as follows:

Definition 4. (Core object). An object p is a core object if it satisfies the following formula:

\[SRNN(p) \ge \lambda,\]

where SRNN(p) represents the number of reverse nearest neighbors of object p, and λ represents the natural characteristic value.

As mentioned above, each data point regards its neighbor points as potential density core points, and a neighbor point becomes a true density core point when enough data objects treat it as such. Figure 1(a) shows an original dataset with three clusters. After the core point extraction process, as shown in Figure 1(b), the red regions represent potential clusters, and each point in the red regions is a core object; the gray points are noncore objects. Algorithm 2 presents the process of finding the core set, which roughly retains the shape of the clusters.

Input: D (the dataset)
Output: C (the density core set)
(1) Initialize: C = ∅;
(2) obtain λ and SRNN(·) from Algorithm 1 on D;
(3) for each point p in D do
(4)   get SRNN(p), the number of reverse nearest neighbors of p;
(5)   if SRNN(p) ≥ λ then
(6)     C = C ∪ {p};
(7)   end if
(8) end for
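A corresponding sketch of the core-object extraction in Algorithm 2, assuming the natural_neighbor_search helper above: an object is kept as a density core point when its reverse-neighbor count reaches the natural characteristic value (Definition 4).

import numpy as np

def extract_density_core(D):
    # Definition 4: p is a core object iff SRNN(p) >= lambda
    lam, rnn_count, _ = natural_neighbor_search(D)
    core_idx = np.where(rnn_count >= lam)[0]   # indices of the density core set C
    return lam, core_idx
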
3.2. Clustering Core Objects

After we obtain the density core points, how to cluster them becomes the key task. We propose a method of clustering the density core points based on the minimum spanning tree. The density core sets extracted from the dataset retain the general shape of the clusters, while the density core sets of different clusters lie farther apart from each other. After constructing the minimum spanning tree, it is therefore easy to find the longest edges of the tree and cut them.

The process of clustering the density core points based on the minimum spanning tree is as follows: firstly, we construct a minimum spanning tree on the set of density core points; secondly, we cut off the edges whose length is greater than the trimming threshold. Afterwards, we can obtain the clusters of the density core set according to the tree structure remaining after trimming. The trimming threshold is defined as follows:

\[T = \mu_W + c \cdot \sigma_W,\]

where μ_W represents the average value of all the edge weights in the minimum spanning tree, and σ_W represents the standard deviation of all the edge weights in the minimum spanning tree. c is an empirical value, and its value range is [2, 5], which has been verified by a large number of experiments. When c = 3, it meets the requirements of most datasets, so we choose c = 3 as the experimental parameter in this article. The setting of the trimming threshold is based on statistical principles, which can check whether there are outliers in the data. The lengths of the edges of the constructed minimum spanning tree approximately follow a Gaussian distribution, and the edges we need to trim are the longer ones located between different clusters.

Definition 5. (Trimming threshold). The threshold T = μ_W + c·σ_W is derived from the edge weights of the overall minimum spanning tree.

The steps of clustering the density core points are as follows:

(1) We utilize the Prim algorithm to construct the minimum spanning tree on all the density core points. The length of an edge, computed as the Euclidean distance, is used as its weight in the minimum spanning tree. The minimum spanning tree built on the density core points is shown in Figure 2.

(2) After building the minimum spanning tree, we obtain its edge set. Since the tree is built on the density core, the weights of the edges within the same cluster are relatively small and vary little, while the weights of the edges between different clusters are larger, so it is easy to find the edges connecting different clusters and cut them off. As shown in Figure 3(a), the colored dots represent the weights of the edges of the minimum spanning tree: the weights of two edges (red dots) are much larger than the weights of the other edges (blue dots). The red dotted line indicates the calculated trimming threshold, which effectively identifies the long edges in the minimum spanning tree. Figure 3(b) shows the minimum spanning subtrees of the subclusters after cutting off the edges longer than the trimming threshold. From the figure, we can see that the red density core points have been divided into three parts, each with its own minimum spanning subtree.

(3) We cluster the density cores according to the minimum spanning subtree structure retained after trimming, namely, assigning the points on the same minimum spanning subtree to the same cluster.

According to the description of the above steps, the specific steps of clustering density core points are shown in Algorithm 3.

Input: C (the core set)
Output: L (the clustering result of the core set)
(1) Initialize the core set C, the cluster number m = 0, and L(p) = 0 for each p in C;
(2) build the minimum spanning tree MST of C with the Prim algorithm;
(3) compute the trimming threshold T by formula (6);
(4) for each edge e in MST do
(5)   if w(e) > T then
(6)     cut this edge;
(7)   end if
(8) end for
(9) for each object p in C do
(10)   if L(p) = 0 then
(11)     m = m + 1;
(12)     L(p) = m;
(13)     Q = {p};
(14)     while Q ≠ ∅ do
(15)       pop an object q from Q and set L(q) = m;
(16)       push into Q all unlabeled objects connected to q in the trimmed MST;
(17)     end while
(18)   end if
(19) end for
(20) Return L
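The trimming step can also be sketched in a few lines of Python with SciPy, again only as an illustration: build the MST over the core points, cut every edge heavier than mean + c·std of the edge weights, and read the clusters off the connected components. Note that SciPy's minimum_spanning_tree routine is used here in place of the heap-optimized Prim algorithm described in the text; for distinct edge weights it yields the same tree.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def cluster_core_points(core_points, c=3.0):
    """Cluster the density core points by trimming long MST edges."""
    dist = squareform(pdist(core_points))            # pairwise Euclidean distances
    mst = minimum_spanning_tree(csr_matrix(dist))    # sparse MST over the core set
    weights = mst.data
    threshold = weights.mean() + c * weights.std()   # trimming threshold, formula (6)
    mst.data[mst.data > threshold] = 0               # cut the edges between clusters
    mst.eliminate_zeros()
    # steps (9)-(19): each remaining subtree is one cluster of core points
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels

In this sketch, the number of clusters is simply the number of connected components that remain after trimming.
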
3.3. The MST-DC Clustering Algorithm

In this section, we introduce the novel clustering algorithm MST-DC. The basic steps are as follows: firstly, we find the reverse nearest neighbors of each object using the natural neighbor searching algorithm; secondly, we use formula (5) to obtain the core objects; thirdly, we build the minimum spanning tree of the density core set C, cut the edges between clusters according to formula (6), and cluster the density cores according to the resulting subtrees; fourthly, we apply the concept of outlier clusters from paper [29] to eliminate erroneous clusters. In [29], an outlier cluster detection algorithm called ROCF is proposed based on the mutual neighbor graph and on the idea that outlier clusters are usually much smaller than normal clusters. Finally, the noncore points are assigned to the clusters to which their closest density core points belong. The overall steps of the MST-DC algorithm are shown in Algorithm 4.

Input: D (the dataset)
Output: Cluster label L
(1) Initialize the core object set C = ∅;
(2) Initialize the dataset D;
(3) obtain λ and SRNN(·) from Algorithm 1 on D;
(4) C = Algorithm 2(D);
(5) L = Algorithm 3(C);
(6) eliminate erroneous clusters with ROCF [29];
(7) for each object p which is unallocated do
(8)   L(p) = the label of the nearest core object of p;
(9) end for
(10) Return L
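Putting the pieces together, a sketch of Algorithm 4 that reuses the illustrative helpers defined above (and omits the ROCF outlier-cluster filtering of step (6)) assigns every noncore object the label of its nearest core object:

import numpy as np
from scipy.spatial import cKDTree

def mst_dc(D, c=3.0):
    lam, core_idx = extract_density_core(D)               # Algorithms 1 and 2
    _, core_labels = cluster_core_points(D[core_idx], c)  # Algorithm 3
    labels = np.full(len(D), -1, dtype=int)
    labels[core_idx] = core_labels
    # steps (7)-(9): each unallocated object takes the label of its nearest core object
    noncore = np.setdiff1d(np.arange(len(D)), core_idx)
    _, nearest = cKDTree(D[core_idx]).query(D[noncore], k=1)
    labels[noncore] = core_labels[nearest]
    return labels

For example, labels = mst_dc(X) on an n-by-d NumPy array X runs the pipeline end to end (minus the ROCF filtering) without any user-specified parameters other than the empirical constant c.
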
3.4. The Complexity Analysis

Based on the description of the MST-DC clustering algorithm in the previous sections, the time complexity of MST-DC depends on the following parts: (1) we use the natural neighbor algorithm optimized by [30] to obtain the reverse nearest neighbors of each data point, the natural eigenvalue, and the distances between data points; with a KD-tree, its time complexity is O(n log n); (2) the process of extracting core points is equivalent to traversing the data points, and its time complexity is O(n); (3) the time complexity of clustering the core points based on the minimum spanning tree is dominated by the Prim algorithm used to build the tree; this paper uses the heap-optimized Prim algorithm, whose time complexity is O(m² log m), where m represents the number of obtained core points; and (4) we assign the remaining points to their nearest density core with time complexity O(l·m), where l represents the number of remaining noncore points. In summary, the total time complexity of MST-DC is approximately O(n log n + m² log m).

4. Experiments and Analysis

4.1. Experimental Design and Environment

In this section, we demonstrate the effectiveness of the proposed clustering algorithm on several synthetic datasets and UCI real-world datasets, and compare MST-DC with well-known and state-of-the-art clustering methods, including Kmeans [5], DBSCAN [18], DPC [8], DCore [27], SNN-DPC [20], and LDP-MST [24].

The experiments are conducted in MATLAB 2018a on a computer equipped with an Intel Core 2.20 GHz CPU, 8 GB of RAM, and the Windows 10 operating system. The MATLAB code is available at https://github.com/qczggaoqiang/MST-DC

4.2. Metrics for Measurement

We adopt the accuracy (ACC) [31], the F-measure [32], and the normalized mutual information (NMI) [33] to test the performance of our proposed algorithm MST-DC and the six comparison algorithms. The upper limit of the three indexes is 1, and the larger the values of the three indexes are, the better the clustering result is.

We choose ACC as the first evaluation indicator. For n objects x_1, ..., x_n, y_i and c_i are the intrinsic category label and the predicted cluster label of object x_i, respectively, and ACC is calculated as follows:

\[ACC = \frac{1}{n}\sum_{i=1}^{n}\delta\bigl(y_i, map(c_i)\bigr),\]

where map(·) is a mapping function that maps each predicted cluster label to its intrinsic category label by the Hungarian algorithm [34], and δ(a, b) equals 1 if a = b and 0 otherwise. ACC lies in [0, 1]; namely, the higher the value of ACC is, the better the clustering performance will be.
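As a reference for how this metric can be computed, the short Python sketch below builds the contingency matrix and uses SciPy's Hungarian solver (linear_sum_assignment) to find the mapping map(·); the function name and structure are illustrative only.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency matrix: rows are predicted clusters, columns are true classes
    overlap = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, cl in enumerate(clusters):
        for j, cs in enumerate(classes):
            overlap[i, j] = np.sum((y_pred == cl) & (y_true == cs))
    # Hungarian algorithm: choose the label mapping that maximizes the matches
    row, col = linear_sum_assignment(-overlap)
    return overlap[row, col].sum() / len(y_true)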

The F-measure combines the precision P and the recall R. P is the ratio between the number of correct positive results and the number of all positive results returned by the classifier, and R is the ratio between the number of correct positive results and the number of all samples that should have been identified as positive. P, R, and F are defined by formulas (8)–(10):

\[P = \frac{|A \cap B|}{|B|}, \qquad R = \frac{|A \cap B|}{|A|}, \qquad F = \frac{2PR}{P + R},\]

where A is the set of all samples that should have been identified as positive and B is the set of all positive results returned by the classifier.
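For a single class, the three quantities can be read directly from the two sets, as in the small sketch below; how the per-class values are aggregated into one overall score is not restated here, so this shows only the per-class computation under that assumption.

def f_measure(A, B):
    """Formulas (8)-(10): A = samples that truly belong to the class,
    B = samples the clustering assigns to it."""
    A, B = set(A), set(B)
    precision = len(A & B) / len(B)
    recall = len(A & B) / len(A)
    return 2 * precision * recall / (precision + recall)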

The mutual information [32] can be used to measure the information shared by two clusterings. Given a set X of n data points and two partitions of X, namely U = {U_1, U_2, ..., U_R} and V = {V_1, V_2, ..., V_C}, suppose that we select a point at random from X; then the probability that the point belongs to cluster U_i is

\[P(i) = \frac{|U_i|}{n}.\]

Entropy describes the uncertainty about which cluster a randomly selected point belongs to. The entropy of the clustering U is given by the following formula:

\[H(U) = -\sum_{i=1}^{R} P(i)\log P(i).\]

The mutual information between the clusterings U and V is defined by

\[MI(U, V) = \sum_{i=1}^{R}\sum_{j=1}^{C} P(i, j)\log\frac{P(i, j)}{P(i)\,P'(j)},\]

where P(i, j) = |U_i ∩ V_j| / n and P'(j) = |V_j| / n.

The NMI is calculated as follows:

\[NMI(U, V) = \frac{MI(U, V)}{\sqrt{H(U)\,H(V)}}.\]
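A direct transcription of these formulas into Python (assuming the square-root normalization shown above) looks as follows; in practice a library routine such as scikit-learn's normalized_mutual_info_score can be used instead.

import numpy as np

def nmi(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log(p))

    mi = 0.0
    for u in np.unique(y_true):          # clusters of partition U
        for v in np.unique(y_pred):      # clusters of partition V
            p_uv = np.sum((y_true == u) & (y_pred == v)) / n
            if p_uv > 0:
                p_u = np.sum(y_true == u) / n
                p_v = np.sum(y_pred == v) / n
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi / np.sqrt(entropy(y_true) * entropy(y_pred))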

4.3. Experiments on Synthetic Datasets

We first conduct comparison experiments on ten synthetic datasets. Their characteristics are described in Table 1, and the original datasets are displayed in Figure 4. D1 and D2 contain spherical clusters with different numbers of clusters and densities. D1 contains three clusters and a total of 600 objects. D2 consists of five clusters with a skewed distribution and a total of 6699 objects. In contrast, the remaining synthetic datasets contain clusters with arbitrary shapes. D3 is composed of four line clusters with a total of 1268 objects. D4 has a spherical cluster in the middle of two ring clusters, with a total of 1897 objects. D5 is composed of two moon-shaped manifold clusters with 1532 objects, including noise objects. D6 includes four manifold clusters and a total of 630 objects. D7 has four spherical clusters along two right-angle line clusters with some noise objects, and a total of 1916 objects. D8 has three spherical clusters on one manifold cluster with several noise objects, and a total of 1427 objects. D9 is composed of six rectangular clusters that cross and run parallel to each other, with a total of 8000 objects including some noise objects. D10 consists of three circle clusters, two spiral clusters, and two spherical clusters with a total of 8533 objects, including some noise objects.

The parameter settings of each clustering algorithm on the ten synthetic datasets are displayed in Table 2. For the Kmeans algorithm, k represents the number of clusters in the dataset, and the initial clustering centers are randomly selected. DBSCAN needs two parameters: Eps and MinPts. The cutoff distance of DPC is set as 2%. SNNDPC needs the parameter k to find the k nearest neighbors, and we test different values of k to achieve better results. The results of DCore are affected by the selection of five parameters, including r1 and r2, and we use different parameter settings to achieve better results. The LDP-MST algorithm needs the parameter N to be set manually. Concerning MST-DC, there is no need to set parameters; in Table 2, the symbol “—” indicates that no parameters are needed for MST-DC.

The experimental result on D1 is shown in Figure 5. It shows that all clustering algorithms can find the correct clusters in D1, which means that these algorithms are effective for uniformly distributed spherical datasets. However, all algorithms except MST-DC need parameter settings.

Figure 6 shows that DPC, LDP-MST, and MST-DC can obtain the correct clustering on D2, while Kmeans, DBSCAN, DCore, and SNNDPC cannot. The number of clusters in the Kmeans algorithm is input by users. Because it cannot recognize clusters with different densities, the low-density area is mistakenly recognized as a single cluster, while the high-density area is erroneously partitioned as well. Because the chosen Eps is too large, DBSCAN aggregates D2 into four clusters. Because of improper parameter choices, DCore and SNNDPC cannot correctly identify the clusters in the D2 dataset. Therefore, Figure 6 shows that global fixed parameter settings are not applicable to multidensity patterns.

The experimental results on D3 are shown in Figure 7. It shows that all algorithms except Kmeans and SNNDPC can find the correct clusters. It also illustrates that Kmeans cannot be applied to line cluster datasets, and that incorrect parameter settings lead to incorrect clustering results.

The experimental results on D4 are shown in Figure 8. It demonstrates whether those algorithms can process circle clusters or not. It shows that Kmeans, DPC, DCore, and SNNDPC algorithms are not suitable for datasets containing circular clusters, and the correct clustering results cannot be obtained. DBSCAN, LDP-MST, and MST-DC algorithms can find the correct clusters for D4.

The clustering results displayed in Figures 9–11 demonstrate whether those algorithms can process clusters with arbitrary shapes or not. The results of DBSCAN, LDP-MST, and MST-DC are similar, and all three can find the clusters of the three datasets correctly. As shown in Figure 9(d), DCore misclassifies only two points in D5. Besides, none of Kmeans, DPC, DCore, and SNNDPC can find the right clusters on the D5, D6, and D7 datasets.

The clustering results shown in Figure 12 display that MST-DC, LDP-MST, and DCore have recognized the D8 dataset correctly, while Kmeans, DBSCAN, DPC, and SNNDPC have not. In other words, MST-DC, LDP-MST, and DCore are good at dealing with manifold structure datasets. In Figure 12(b), DBSCAN can detect the manifold cluster; yet, it does not work for datasets that combine spherical clusters with a manifold cluster.

The experimental results displayed in Figures 13 and 14 are used to evaluate the clustering performance on D9 and D10, which are more complex patterns. As shown in Figure 13, only DCore, LDP-MST, and MST-DC can detect the rectangular clusters. Because of inappropriate parameter settings, DBSCAN and SNNDPC did not get impressive clustering results, and neither did Kmeans and DPC. The clustering results shown in Figure 14 demonstrate that only the LDP-MST and MST-DC algorithms obtained the correct clusters. DBSCAN and DPC have similar results in that neither can deal with the spiral clusters and circle clusters. Although SNNDPC can detect the spiral clusters, it fails to handle the circle clusters. DCore can detect the circle clusters; however, it cannot detect the spiral clusters. Hence, MST-DC can be applied to more complex situations without parameter settings.

From Figures 5–14, we can see that MST-DC performs better than the other algorithms. Moreover, MST-DC does not need any parameters, so several intrinsic flaws of the other algorithms are avoided.

The running time of the seven clustering algorithms on the synthetic datasets is shown in Table 3. Although MST-DC runs slower than Kmeans, DBSCAN, and LDP-MST, it runs evidently faster than DPC and SNNDPC. Moreover, the running time of MST-DC is similar to that of DCore.

4.4. Experiments on Real-World Datasets

To further prove the superiority of the MST-DC algorithm, we also apply the proposed method to six real-world datasets, including Segmentation, Pageblock, Iris, Control, Column2C, and Breast, obtained from the University of California, Irvine (UCI) Machine Learning Repository. The characteristics of the six real datasets are displayed in Table 4. Table 5 illustrates the parameter setting of each clustering algorithm on the six UCI real-world datasets, where the symbol “—” indicates that there is no need to set parameters in MST-DC. Table 6 shows the clustering performance of the seven clustering algorithms according to three external criteria, namely, ACC, F-measure, and NMI. Table 7 shows the running time of each algorithm, where “—” indicates that the algorithm did not finish within the specified time (20 minutes).

As shown in Table 6, except for the Iris dataset, the clustering performance of the MST-DC algorithm is superior to the Kmeans, DBSCAN, DPC, DCore, SNNDPC, and LDP-MST algorithms on the other datasets. The LDP-MST algorithm obtains the optimal clustering results on the Iris dataset in terms of ACC, F-measure, and NMI. Besides, LDP-MST achieves the best results on the Segmentation and Control datasets in terms of Accuracy, and on the Column2C dataset in terms of NMI. On relatively simple datasets, the ACC, F-measure, and NMI values of the Kmeans, DBSCAN, DPC, and SNNDPC algorithms are high; however, on datasets with higher dimensions or complex structures, these four algorithms perform poorly. Except for the Segmentation dataset, the DCore algorithm can obtain a relatively good clustering effect on multiple datasets. However, the DCore algorithm has a major flaw in that it needs five parameters to be set manually, and it is usually very difficult to adjust the parameter combination for better clustering results.

According to Table 7, MST-DC is slower than the Kmeans, DBSCAN, and LDP-MST algorithms on the UCI datasets. However, the MST-DC algorithm is much faster than the DPC and SNNDPC algorithms. The running time of the MST-DC and DCore algorithms is similar on most datasets.

From the above analysis, we conclude that MST-DC provides an overall good performance in clustering compared with the other existing methods. Firstly, the MST-DC algorithm employs a natural neighbor algorithm to obtain the reverse neighbor information and then extracts the density core point according to the number of reverse neighbors and natural characteristic value. This process does not need to set parameters, while other algorithms need to set parameters manually. Secondly, MST-DC only utilizes the density core points instead of all points to build the minimum spanning tree, which reduces the computational cost while excluding the interference of noise points. Thirdly, the MST-DC algorithm can recognize complex datasets efficiently and accurately. However, compared with Kmeans, DBSCAN, and LDP-MST, the time efficiency of the MST-DC algorithm is not optimal, which is worth further exploration.

5. Conclusions

In this paper, we propose a novel clustering algorithm named MST-DC. The algorithm consists of the following four steps: firstly, we automatically obtain the reverse neighbors of each object based on the concept of natural neighbor searching, without any parameters set by the user; secondly, we obtain the core objects according to the number of reverse neighbors of each object; thirdly, we construct a minimum spanning tree of the core objects to obtain the clustering result of the core objects; and finally, unallocated objects are assigned the label of their nearest core objects.

The experimental results on synthetic and real-world datasets demonstrate that MST-DC can detect quite complex patterns with large variations in density. Besides, unlike most clustering methods, MST-DC does not require the user to set parameters: the only quantity it relies on, the natural characteristic value, is obtained automatically through the concept of the natural neighbor. Therefore, our proposed algorithm, MST-DC, is superior to the other algorithms.

Nevertheless, there are several aspects of this paper that can be improved. Firstly, a similarity measure based on the Euclidean distance is used when acquiring the natural neighbor information, extracting the density cores, and assigning the remaining points; however, the Euclidean distance is prone to the curse of dimensionality in high-dimensional data spaces, resulting in poor clustering effects. Therefore, we will explore the adaptability of this algorithm to high-dimensional data in the future. Secondly, the approach used to assign the remaining points in this paper is to directly assign them to the cluster where the closest density core is located. Subsequently, we will study new methods of allocating the remaining points.

Data Availability

The data that support the findings of this study are openly available on GitHub at https://github.com/qczggaoqiang/MST-DC

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant no. 61701051), the Fundamental Research Funds for Central Universities (Grant no. 2019CDCGJSJ329), and the Graduate Scientific Research and Innovation Foundation of Chongqing, China (Grant no. CYS20067).