Abstract
Traditional clustering algorithms suffer from randomly selected initial cluster centers and a manually determined number of clusters, which leads to unstable clustering results. To address these problems, we propose an improved clustering algorithm based on density peaks and nearest neighbors. Firstly, an improved density peak clustering method is proposed to optimize the cutoff distance and the local density of data points, which prevents the random selection of initial cluster centers from falling into a local optimum. Furthermore, a K-value selection method is presented to choose the optimal number of clusters, determined by the sum of the squared errors within the clusters. Finally, we employ the idea of K-nearest neighbors to assign outliers. Experiments on UCI real data sets indicate that our proposed algorithm achieves better clustering results than several known algorithms.
1. Introduction
Clustering is a family of unsupervised algorithms that plays an important role in machine learning and data mining. Traditional clustering algorithms can be divided into five categories [1–3]: partitioning, hierarchical, density-based, grid-based, and model-based. K-means is one of the most popular and simplest partition-based clustering algorithms and has been widely used in many fields [4, 5]. However, the K-means algorithm has its own drawbacks, and many scholars have proposed improvements [6], mainly concentrated in two directions [7–12]: the selection of the initial cluster centers and the determination of the number of clusters K.
In 2014, Rodriguez and Laio proposed the density peak clustering (DPC) algorithm [13]. It assumes that a cluster center is surrounded by data points with lower local density and is relatively far away from any data point with higher local density. The DPC algorithm is of great significance for the selection of initial cluster centers and the determination of the K-value, but it also has limitations [14–16]. A new strategy [17] attempts to automatically determine the cutoff distance by information entropy, based on the data potential energy in the data field instead of the local density between data points. There is an approach [18] that determines the initial cluster centers with an improved DPC algorithm and uses entropy to compute a weighted Euclidean distance between data points to optimize the K-means algorithm. The DPCSA algorithm [19] improves the DPC algorithm based on the K-nearest neighbor (KNN) algorithm and a weighted local density sequence. A fuzzy weighted KNN is used to calculate the local density of data points in [20], and two assignment strategies are employed to classify data points, which improves the robustness of the clustering algorithm. The DPC-DLP algorithm [21] uses the KNN idea to calculate the global cutoff distance and the local density of the data points, and uses a graph-based label propagation algorithm to assign data points to clusters. The PageRank algorithm can be used to calculate the local density of data points [22], which avoids the instability of clustering results caused by the cutoff distance. The CFSFDP-HD algorithm [23] uses heat diffusion to calculate the local density of data points to reduce the influence of the cutoff distance on the clustering result. The ADPC-KNN algorithm [24] uses the KNN to calculate the local density and automatically selects the initial cluster centers. The DPADN algorithm [25] uses a continuous function to redesign the local density, which can automatically determine the cluster centers. A multi-feature fusion model with adaptive graph learning [26] is developed, which is an unsupervised algorithm addressing the issue of person re-identification. An adaptive approach [27] is effective in simultaneously learning the affinity graph and the feature fusion, resulting in better clustering results. The novel RCFE method [28] guarantees that the number of clusters converges to the ground truth via a rank constraint on the Laplacian matrix. Finally, there are some other recent works involving density peaks and nearest neighbors that are worth investigating [29–36].
Inspired by the ideas of DPC and KNN, this paper proposes an improved clustering algorithm. First, the sum of the squared errors (SSE) is used to determine the optimal number of clusters, the K-value; then K initial cluster centers are selected based on an improved DPC algorithm. In addition, the K-means algorithm is applied iteratively, and data points are divided into core points and outliers according to the average distance within each cluster. Finally, combined with the nearest neighbor idea of KNN, outliers are assigned by voting. Experimental results show that our proposed algorithm achieves a better clustering effect in most cases compared with several known clustering algorithms.
The remainder of this paper is organized as follows. Section 2 reviews the DPC algorithm and introduces several improved methods. Section 3 provides our K-value selection method. Section 4 describes the proposed algorithm in detail. Experimental results are presented and discussed in Section 5. The conclusion is stated in Section 6.
2. Related Work
2.1. The DPC Algorithm
The DPC algorithm indicates that cluster centers are characterized by a higher density than their neighbors and a relatively large distance from data points with higher densities. In this approach, two attributes need to be calculated for each data point $x_i$: the local density $\rho_i$ and the distance $\delta_i$.
There are two ways to calculate the local density $\rho_i$ of $x_i$, namely, the cutoff kernel and the Gaussian kernel. The cutoff kernel is often used when the data size is large and is defined as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0, \end{cases}$$

where $d_{ij}$ is the distance between data points $x_i$ and $x_j$, and $d_c$ is the cutoff distance, which needs to be set manually. Usually, all distances are arranged in ascending order and a value between the first 1% and 2% is selected as the cutoff distance. When the data size is small, the Gaussian kernel is used:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right).$$
Comparing the two local density calculations, the local density obtained by the cutoff kernel is a discrete value, while the Gaussian kernel yields a continuous value. The cutoff kernel may therefore produce identical densities for different data points and cause conflicts; hence, the Gaussian kernel is more generally used to calculate the local density.
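For illustration, the following Python sketch (not part of the original MATLAB implementation; the function name and interface are ours) computes the local density under both kernels from a precomputed pairwise distance matrix.

```python
# Sketch: local density of each point under the cutoff kernel and the Gaussian
# kernel, given an (n, n) symmetric pairwise distance matrix and a cutoff distance dc.
import numpy as np

def local_density(dist: np.ndarray, dc: float, kernel: str = "gaussian") -> np.ndarray:
    n = dist.shape[0]
    off_diag = ~np.eye(n, dtype=bool)                  # exclude the zero d_ii terms
    if kernel == "cutoff":
        # rho_i = number of points within distance dc (a discrete value)
        return np.sum((dist < dc) & off_diag, axis=1).astype(float)
    # rho_i = sum_j exp(-(d_ij / dc)^2) (a continuous value)
    return np.where(off_diag, np.exp(-(dist / dc) ** 2), 0.0).sum(axis=1)
```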
The distance $\delta_i$ of data point $x_i$ is measured by computing the minimum distance between $x_i$ and any other point with higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}.$$

For the point with the highest density, it conventionally takes

$$\delta_i = \max_{j} d_{ij}.$$
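A corresponding sketch of the distance computation, again for illustration only, is given below.

```python
# Sketch: delta_i is the distance to the nearest point with strictly higher density;
# for the densest point, the maximum distance to any other point is used instead.
import numpy as np

def delta_distance(dist: np.ndarray, rho: np.ndarray) -> np.ndarray:
    n = dist.shape[0]
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]             # points with larger local density
        delta[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()
    return delta
```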
Assume that the original data set is as shown in Figure 1. The local density $\rho_i$ and distance $\delta_i$ are calculated for each data point $x_i$, and the resulting decision graph is illustrated in Figure 2. According to the DPC algorithm, data points that have both large $\rho_i$ and large $\delta_i$ are most likely to be the cluster centers; they are located in the upper right of Figure 2.


In order to clearly determine the cluster centers and their number, the quantity $\gamma_i$ in [13], which is a comprehensive consideration of $\rho_i$ and $\delta_i$, is defined as

$$\gamma_i = \rho_i \cdot \delta_i.$$
According to $\gamma_i$ and the number of data points, a new decision graph is shown in Figure 3. Obviously, the larger the value of $\gamma_i$, the more likely the corresponding data point is to be a cluster center. It can be seen from Figure 3 that $\gamma$ is relatively smooth for the noncluster centers, while there is an obvious jump in the $\gamma$ value between the cluster centers and the noncluster centers. The data points above the dotted line are therefore the cluster centers. Consequently, each remaining data point is assigned to the cluster of the nearest point with higher local density.
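The selection of centers by $\gamma$ and the subsequent assignment can be sketched as follows; this is an illustration under the usual DPC convention of processing points in decreasing density order, not the authors' code.

```python
# Sketch: gamma_i = rho_i * delta_i; the K points with the largest gamma become centers,
# and every other point inherits the label of its nearest higher-density neighbor.
import numpy as np

def dpc_assign(dist: np.ndarray, rho: np.ndarray, delta: np.ndarray, K: int) -> np.ndarray:
    gamma = rho * delta
    centers = np.argsort(-gamma)[:K]                   # K largest gamma values
    labels = -np.ones(dist.shape[0], dtype=int)
    labels[centers] = np.arange(K)
    for i in np.argsort(-rho):                         # decreasing density order
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:                           # densest point not picked as a center
            labels[i] = labels[centers[np.argmin(dist[i, centers])]]
        else:
            labels[i] = labels[higher[np.argmin(dist[i, higher])]]
    return labels
```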

2.2. Set the Cutoff Distance
It is not advisable to manually set the cutoff distance $d_c$ based on experience for diverse data structures: the cutoff distance affects the local density of data points, which in turn leads to different clustering results.
Assuming $p$ is a percentage, we define

$$d_c = dis_{\lceil N \cdot p \rceil},$$

where $N$ is the number of distances between any two data points and $dis$ is the ascending order of these distances; that is, the value at the $N \cdot p$ position of the $dis$ sequence is taken as the cutoff distance $d_c$. Figure 4 shows the clustering results obtained by running the DPC algorithm with different percentages $p$.
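A small sketch of this rule, assuming the $\lceil N \cdot p \rceil$ position of the sorted distance sequence as described above, is shown below.

```python
# Sketch: d_c is the value at the ceil(N*p) position of the ascending sequence of all
# pairwise distances, with p given as a fraction (e.g., p = 0.02 for 2%).
import math
import numpy as np

def cutoff_from_percentage(dist: np.ndarray, p: float) -> float:
    d = np.sort(dist[np.triu_indices_from(dist, k=1)])     # each pair counted once
    pos = max(1, math.ceil(len(d) * p))                    # 1-based N*p position
    return float(d[pos - 1])
```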

[Figure 4: clustering results of the DPC algorithm under different percentages p; panels (a)–(d).]
We can see that the cutoff distance obtained at different percentages $p$ has a huge impact on the clustering results. Hence, the setting of the cutoff distance should be flexible, and an appropriate $d_c$ should be selected for a given data structure.
In information theory, Shannon entropy is used to measure the uncertainty of a system: the greater the entropy, the greater the uncertainty [37]. Similarly, the uncertainty of a data distribution can be expressed by entropy. Suppose that the local density is

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{\sigma}\right)^2\right),$$

where $\sigma$ is an impact factor which is used to optimize the cutoff distance $d_c$. For a data set $X = \{x_1, x_2, \ldots, x_n\}$, the local density of each point $x_i$ is $\rho_i$. Considering the information entropy to evaluate the rationality of the local density estimation [38], we have

$$H = -\sum_{i=1}^{n} \frac{\rho_i}{Z} \log \frac{\rho_i}{Z}.$$
Here, $Z = \sum_{i=1}^{n} \rho_i$ is the normalization factor. The relationship between the information entropy $H$ and the impact factor $\sigma$ is shown in Figure 5. We can see that when $\sigma$ starts to increase from 0, the information entropy $H$ first decreases rapidly, then increases slowly, and finally tends to be stable. It follows that the optimal $\sigma$ is the one at which the entropy is the lowest. According to the $3\sigma$ rule of the Gaussian distribution in [39], a data point has a radius of influence on other points of $3\sigma/\sqrt{2}$. Similarly, we consider that in the clustering algorithm a data point can only affect points within this radius [40]. Therefore, we set the cutoff distance as

$$d_c = \frac{3\sigma}{\sqrt{2}}.$$
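Under the reconstruction above (Gaussian-form density with impact factor σ, entropy minimized over a candidate grid, and $d_c = 3\sigma/\sqrt{2}$), the selection can be sketched as follows; the candidate grid of σ values is an assumption.

```python
# Sketch under the stated assumptions: scan candidate sigma values, compute the Shannon
# entropy of the normalized densities, keep the sigma with minimum entropy, and set
# d_c = 3 * sigma / sqrt(2). The sigma grid must contain strictly positive values.
import numpy as np

def entropy_cutoff(dist: np.ndarray, sigmas: np.ndarray) -> float:
    best_sigma, best_H = None, np.inf
    for sigma in sigmas:
        rho = np.exp(-(dist / sigma) ** 2).sum(axis=1) - 1.0   # drop the d_ii = 0 terms
        p = rho / rho.sum()                                    # normalize by Z
        H = float(-(p * np.log(p + 1e-12)).sum())              # Shannon entropy
        if H < best_H:
            best_sigma, best_H = sigma, H
    return 3.0 * best_sigma / np.sqrt(2.0)
```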

2.3. Optimize the Local Density
The DPC algorithm only considers the global structure of the data set, and the clustering result is not good enough for unevenly distributed data sets [41]. When the data distribution is relatively concentrated and the densities of the clusters differ considerably, changes in local density make it difficult to select the correct cluster centers, which affects the final clustering results. Based on the idea of the KNN algorithm, the nearest neighbors are introduced into the local density calculation: the closer a data point is to the target point, the more it contributes to the local density of the target point. The new local density $\rho_i$ is defined over the $m$ nearest neighbors of data point $i$, where $d_{i(1)} \le d_{i(2)} \le \cdots$ represents the ascending order of the distances between data point $i$ and the other data points, $n$ is the number of data points in the data set, $m$ is the number of nearest neighbors of data point $i$, and $K$ denotes the number of clusters.
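Since the exact formula is not reproduced here, the sketch below uses one common KNN-based density, $\rho_i = \exp(-\frac{1}{m}\sum$ of squared distances to the $m$ nearest neighbors$)$, as a stand-in to illustrate the idea; it is not necessarily the definition used in this paper.

```python
# Sketch of a commonly used KNN-based local density (a stand-in, not necessarily the
# paper's formula): points whose m nearest neighbors are close receive larger values.
import numpy as np

def knn_density(dist: np.ndarray, m: int) -> np.ndarray:
    nn = np.sort(dist, axis=1)[:, 1:m + 1]     # skip column 0 (the zero self-distance)
    return np.exp(-np.mean(nn ** 2, axis=1))
```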
3. The K-Value Selection Method
In the DPC algorithm, the boundary between cluster center points and noncluster center points may be unclear, which causes different people to choose different numbers of clusters according to their own experience. Similarly, in the K-means algorithm, the number of clusters also needs to be preset by the user based on experience. However, it is often difficult to determine the value of K.
In order to solve the problem that it is difficult to choose the optimal number of clusters, we propose a K-value selection method based on the SSE and the ET-SSE algorithm in [42].
Ordinarily, the K-means algorithm employs the SSE to measure clustering quality:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} d(x, c_i)^2,$$

where $d(x, c_i)$ represents the Euclidean distance between data point $x$ and cluster center $c_i$, $K$ is the number of clusters, $C_i$ denotes the set of all data points in the $i$-th cluster, and $c_i$ represents the cluster center of the $i$-th cluster. Usually, the relationship between the SSE and the K-value is as demonstrated in Figure 6(a): as K increases, the SSE decreases and eventually stabilizes. When the K-value gradually increases and approaches the actual number of clusters, the SSE decreases rapidly; when the K-value is greater than the actual number of clusters, the SSE decreases slowly. Consequently, the K-value corresponding to the obvious “elbow point” in Figure 6(a) can be used as the optimal number of clusters. Nevertheless, sometimes the situation shown in Figure 6(b) arises, where there is no obvious “elbow point”. In this case, it is more difficult to select the optimal K-value, which affects the final clustering results.
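The SSE curve over candidate K-values can be sketched as follows; scikit-learn's KMeans is used here only as a convenient stand-in for the K-means step.

```python
# Sketch: SSE(K) for K = 1..k_max, i.e., the sum of squared Euclidean distances from
# each point to its cluster center; the "elbow" of this curve suggests the optimal K.
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X: np.ndarray, k_max: int) -> list:
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        centers = km.cluster_centers_[km.labels_]
        sse.append(float(((X - centers) ** 2).sum()))   # equals km.inertia_
    return sse
```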

[Figure 6: the relationship between the SSE and the K-value; (a) a curve with an obvious elbow point; (b) a curve without an obvious elbow point.]
To solve the problem that the “elbow point” in Figure 6(b) is not clear, the exponential function is introduced into the SSE formula. Using the property that the exponential function is sensitive to positive numbers, the SSE values of the different clusterings are rescaled so that the differences between SSE values are enlarged when the K-value is not equal to the actual number of clusters. Meanwhile, in order to prevent exponential explosion, an adjustment factor is added to update the weight of the SSE value. In the resulting SSE formula, max means the largest SSE value among the K clusters. In order to reduce the influence of manual parameters on the clustering result, and based on a large number of experiments on different data sets, the adjustment factor is fixed at the value for which the “elbow point” in Figure 6 is the most obvious; the predicted number of clusters is then closest to the actual number of clusters, and the clustering effect is the best.
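The exact ET-SSE rescaling from [42] is not reproduced here; purely as an illustration of the idea (exponential amplification with an adjustment factor θ and normalization by the largest SSE value), one hypothetical form is sketched below.

```python
# Hypothetical illustration only, not the formula from [42]: each SSE(k) is divided by
# theta times the largest SSE value before applying exp(), so differences in the curve
# are amplified while the exponent stays bounded.
import numpy as np

def rescaled_sse(sse: list, theta: float) -> np.ndarray:
    sse = np.asarray(sse, dtype=float)
    return np.exp(sse / (theta * sse.max()))
```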
4. The Improved Clustering Algorithm
Aiming at the problem of unstable clustering results caused by randomly selecting initial cluster centers in the traditional K-means algorithm, this paper proposes an improved clustering algorithm based on density peaks and nearest neighbors. Firstly, using information entropy to improve the DPC algorithm, we find the optimal cutoff distance of the data set and then calculate the local density of the data points; additionally, the optimal number of clusters is obtained by the K-value selection method proposed in Section 3. Finally, according to the descending order of the $\gamma$ values in (5), the top K corresponding data points are selected as the initial cluster centers, and the K-means algorithm is used for iterative clustering.
4.1. Weighted Euclidean Distance
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set containing $n$ data points, each of which contains $m$-dimensional attribute characteristics, $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, where $x_{ip}$ represents the $p$-th attribute characteristic of the $i$-th data point.
In order to remove the unit restrictions of the different attributes in the original data and avoid their impact on the clustering results, the original data need to be normalized and converted into pure numerical data. After normalization, every attribute is on the same order of magnitude, which is suitable for comprehensive comparative evaluation. The normalization formula is as follows:

$$x_{ip}' = \frac{x_{ip} - \min_{1 \le j \le n} x_{jp}}{\max_{1 \le j \le n} x_{jp} - \min_{1 \le j \le n} x_{jp}},$$

where $\max_{1 \le j \le n} x_{jp}$ and $\min_{1 \le j \le n} x_{jp}$, respectively, represent the maximum and minimum of the $p$-th dimension attribute over all data points.
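A minimal sketch of this column-wise min-max normalization is given below.

```python
# Sketch: min-max normalization applied per attribute (column), mapping values to [0, 1].
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant columns
    return (X - x_min) / span
```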
The traditional K-means algorithm employs the Euclidean distance to measure the similarity between data points. It applies a uniform measurement to every attribute of a data point and treats the differences between attributes equally. In practice, however, different attributes contribute quite differently to the clustering results. To solve this problem, a weighted Euclidean distance is used to measure the similarity between data points. Let the weight $w_p$ of the $p$-th dimension attribute of the data set be defined from the deviations of the attribute values from their mean, where $\bar{x}_p$ denotes the average value of the $p$-th dimension attribute. Then, the weighted Euclidean distance between data points $x_i$ and $x_j$ is

$$d_w(x_i, x_j) = \sqrt{\sum_{p=1}^{m} w_p \left(x_{ip} - x_{jp}\right)^2}.$$
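The distance itself can be sketched as follows; the variance-based weighting used here is one plausible reading of the weight definition above and should be treated as an assumption.

```python
# Sketch: attribute weights from the (normalized) squared deviations of each attribute
# from its mean (an assumed form), and the resulting weighted Euclidean distance.
import numpy as np

def attribute_weights(X: np.ndarray) -> np.ndarray:
    dev = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    return dev / dev.sum()                               # weights sum to one (assumed)

def weighted_euclidean(xi: np.ndarray, xj: np.ndarray, w: np.ndarray) -> float:
    return float(np.sqrt(np.sum(w * (xi - xj) ** 2)))
```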
4.2. The Framework of the Proposed Algorithm
The K-means algorithm is sensitive to outliers [43], and the improved DPC algorithm can exclude the influence of outliers on the selection of initial cluster centers. However, K-means is an iterative clustering algorithm, and each iteration will generate new cluster centers. The outliers will have an impact on the generation of new cluster centers. Hence, it is necessary to distinguish the outliers in each cluster.
Let $C_1, C_2, \ldots, C_K$ be the K clusters after the first iteration, with cluster centers $c_1, c_2, \ldots, c_K$. The average distance of the $i$-th cluster is

$$MeanDist_i = \frac{1}{|C_i|} \sum_{x \in C_i} d(x, c_i),$$

where $|C_i|$ is the number of data points in the $i$-th cluster. According to the average distance MeanDist of each cluster, the data points in the cluster are divided into core points and outliers. If the distance between a data point $x_j$ and its cluster center is less than MeanDist, $x_j$ is regarded as a core point; otherwise, it is an outlier. The average value of the data points marked as core points in the cluster is calculated as the new cluster center for the next iteration, and the outliers do not participate in the calculation of the new cluster center.
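A sketch of this core/outlier split and the center update from core points only is shown below (plain Euclidean distances are used for brevity).

```python
# Sketch: within each cluster, points closer to the center than the cluster's mean
# distance (MeanDist) are core points; new centers are the means of the core points.
import numpy as np

def split_core_outliers(X: np.ndarray, labels: np.ndarray, centers: np.ndarray):
    is_core = np.zeros(len(X), dtype=bool)
    new_centers = centers.copy()
    for k in range(len(centers)):
        idx = np.where(labels == k)[0]
        if idx.size == 0:
            continue
        d = np.linalg.norm(X[idx] - centers[k], axis=1)
        core = idx[d < d.mean()]                         # d.mean() is MeanDist of cluster k
        is_core[core] = True
        if core.size > 0:
            new_centers[k] = X[core].mean(axis=0)        # outliers excluded from the update
    return is_core, new_centers
```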
To ensure that the assignment of outliers is closer to the real situation and to improve the clustering accuracy, the idea of KNN is introduced into the assignment of outliers [44–46]. Suppose that $x_o$ is an outlier; calculate the distance sequence from $x_o$ to each core point by (17). The sequence is then sorted in ascending order, and the first $2K$ corresponding core points are selected as the core neighbor sequence $Core$. Since the core points have already been assigned, the cluster category of each data point in the sequence $Core$ is counted, and the cluster with the largest count is regarded as the cluster to which the outlier belongs. If more than one cluster has the largest count, the cluster whose center is nearest to $x_o$ is marked as the cluster category of $x_o$.
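The voting step can be sketched as follows (again with plain Euclidean distances; the tie-breaking follows the rule described above).

```python
# Sketch: each outlier is assigned by majority vote among its 2K nearest core points,
# with ties broken by the nearest cluster center.
import numpy as np

def assign_outliers(X: np.ndarray, labels: np.ndarray, is_core: np.ndarray,
                    centers: np.ndarray, K: int) -> np.ndarray:
    labels = labels.copy()
    core_idx = np.where(is_core)[0]
    for o in np.where(~is_core)[0]:
        d = np.linalg.norm(X[core_idx] - X[o], axis=1)
        neigh = core_idx[np.argsort(d)[:2 * K]]          # the 2K nearest core points
        votes = np.bincount(labels[neigh], minlength=K)
        best = np.where(votes == votes.max())[0]
        if best.size == 1:
            labels[o] = best[0]
        else:                                            # tie: pick the nearest center
            labels[o] = best[np.argmin(np.linalg.norm(centers[best] - X[o], axis=1))]
    return labels
```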
In summary, the steps of our proposed algorithm are as follows:
(1) Input the original data set $X = \{x_1, x_2, \ldots, x_n\}$.
(2) Normalize the data set X according to (15) to obtain the processed data set X.
(3) According to (17), compute the weighted Euclidean distance between every pair of data points in X.
(4) Calculate $\rho_i$ and $\delta_i$ of each data point in X according to (3), (4), and (11), and determine the optimal cutoff distance according to the information entropy in Section 2.2.
(5) Calculate $\gamma_i$ and arrange the $\gamma$ values in descending order to obtain the sequence $\gamma$.
(6) Determine the number of clusters K according to the K-value selection method presented in Section 3.
(7) Select the first K corresponding data points from the sequence $\gamma$ as the initial cluster centers.
(8) Calculate the distance from each data point to each cluster center and classify each data point into the cluster with the nearest cluster center.
(9) Calculate the mean distance MeanDist of each cluster according to (18), and divide the data points in each cluster into core points and outliers.
(10) Calculate the average of the core points in each cluster as the new cluster centers for the next iteration, and re-assign the outliers using the nearest neighbor idea of the KNN algorithm.
(11) When the cluster centers no longer change, terminate the algorithm and output the clustering result; otherwise, go to Step 8.
Furthermore, a flowchart for the overall process is provided in Figure 7.

4.3. The Time Complexity Analysis
The time complexity of the improved DPC algorithm mainly depends on computing the local density $\rho_i$ and the distance $\delta_i$ for all pairs of points, which is $O(n^2)$. The time complexity of the K-value selection method mainly comes from computing the sum of the squared errors (SSE) for each candidate K, which is linear in $n$ for each K. The time complexity of the nearest-neighbor-based assignment of outliers is at most $O(n^2)$. Therefore, the total time complexity of our proposed algorithm is $O(n^2)$.
5. Experiments
5.1. Experimental Environment and the Data Sets
The hardware environment is Windows 10 Professional 64-bit with an Intel Core i3-4000M CPU at 2.40 GHz and 4 GB of memory. The proposed algorithm is implemented in MATLAB R2011a. The Wine, Pima, WDBC, Iris, and Parkinsons data sets from the UCI real data sets [47] are used as the experimental data sets, as shown in Table 1.
5.2. The Evaluation Measures
This paper uses accuracy (ACC), adjusted Rand index (ARI), and adjusted mutual information (AMI) to evaluate the performance of the clustering algorithms. Assume that $U = \{u_1, \ldots, u_n\}$ denotes the known, manually assigned labels and $V = \{v_1, \ldots, v_n\}$ denotes the labels generated by the clustering algorithm. The ACC is calculated as

$$ACC = \frac{1}{n}\sum_{i=1}^{n} \delta\bigl(u_i, \mathrm{map}(v_i)\bigr),$$

where $\mathrm{map}(\cdot)$ is the optimal one-to-one mapping from cluster labels to class labels and $\delta(a, b) = 1$ if $a = b$ and 0 otherwise. The value range of ACC is [0, 1], and the value range of ARI and AMI is [−1, 1]. The specific meanings and calculations of ARI and AMI are given in [48, 49]. All three evaluation measures are positively correlated with clustering performance: the larger the value, the better the clustering performance of the algorithm.
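ACC requires a one-to-one mapping between predicted cluster labels and ground-truth labels; a standard way to obtain it is the Hungarian algorithm, sketched below.

```python
# Sketch: clustering accuracy (ACC) via the optimal matching between predicted clusters
# and ground-truth classes, computed with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-cost)              # maximize correctly matched pairs
    return float(cost[row, col].sum()) / len(y_true)
```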
5.3. Experimental Results and Discussion
In this paper, we compare our proposed algorithm with several existing algorithms: K-means, DPC [13], CNACS-K-means [50], and DCC-K-means [51]. Because of the influence of the randomly selected initial cluster centers, the reported K-means results are averaged over 20 runs. The experimental results on the UCI data sets are shown in Tables 2–4, where bold values indicate the best result on each data set.
As shown in Tables 2–4, the algorithm we proposed is generally better than the other four comparison algorithms in terms of the three evaluation measures. It achieves the best experimental results especially on the Wine and Iris data sets.
In the comparison of ACC in Table 2, our algorithm is slightly worse than the K-means and DCC-K-means algorithms on the Pima and WDBC data sets, but is better than the other two comparison algorithms. However, the situation is just the opposite on the Parkinsons data set.
As for the ARI comparison in Table 3, the proposed algorithm is lower than CNACS-K-means on the Pima data set but higher than the other three comparison algorithms. On the WDBC data set, it is only 1.7% lower than the K-means and DCC-K-means algorithms. Furthermore, on the Parkinsons data set it is lower than CNACS-K-means and differs from DPC by only 0.02%, but it is higher than the other two comparison algorithms.
In Table 4, the comparison of AMI shows that our algorithm is superior to the other four comparison algorithms on the Pima data set. On the WDBC data set, it is close to the K-means and DCC-K-means algorithms, and superior to the other two comparison algorithms. Although our algorithm outperforms the DPC and CNACS-K-means algorithms, it is indeed inferior to the other two comparison algorithms on the Parkinsons data set.
6. Conclusion
Focusing on the problems of randomly selecting initial cluster centers, manually determining the number of clusters, and ignoring the influence of outliers on the clustering process, this paper proposes an improved clustering algorithm based on density peaks and nearest neighbors. Our algorithm uses an improved DPC algorithm to determine the initial cluster centers and calculates the sum of the squared errors within the clusters to find the optimal cluster number K. Moreover, the average distance within the clusters and the nearest neighbor idea are combined to identify the outliers and determine their assignment. The experimental results show that the proposed algorithm achieves better clustering results on the UCI real data sets. However, it is not yet certain that the algorithm can be applied to large-scale data sets. In future work, improving the stability and running efficiency of our algorithm on large-scale data sets will be the focus.
Data Availability
The data sets used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61902262) and Handan City Science and Technology Research and Development Program (19422031008-15).