Abstract

To address the problems of the traditional K-means clustering algorithm, namely the local optimal solutions and slow clustering caused by the uncertainty of the k value and the randomness of the initial cluster center selection, this paper proposes an improved K-means clustering method. The algorithm first applies the elbow rule, based on the sum of squared errors, to obtain an appropriate number of clusters k. It then uses the variance as a measure of the dispersion of the samples and selects as the initial cluster centers the k data points with the smallest variance whose mutual distances are greater than the average distance between samples. Finally, the triangle inequality principle is applied to eliminate unnecessary distance calculations in the iterative process of the K-means algorithm and improve its running efficiency. The improved K-means clustering algorithm was tested on UCI data sets. Compared with the traditional K-means algorithm and the Canopy-KMeans algorithm, its accuracy and speedup ratio are significantly higher, and the clustering quality is improved.

1. Introduction

“Birds of a feather flock together.” Clustering follows this principle: it groups unlabeled samples into different sets (clusters) so that samples in the same cluster are similar to each other while samples in different clusters are sufficiently dissimilar [1]. Many clustering methods have emerged, such as partition-based methods (K-means, K-medoids), hierarchical methods (DIANA, AGNES, BIRCH), density-based methods (DBSCAN), and grid-based methods (STING, CLIQUE). Among them, the K-means algorithm is the most classic and widely used. It is an unsupervised pattern classification method whose idea is simple and easy to implement, and it is applied in many fields of scientific research. However, it is very sensitive to the selection of the initial cluster centers and the value of k, and it easily falls into local optimal solutions and an unstable number of iterations [2].

Many scholars have proposed optimization strategies for these problems. Reference [3] uses the Canopy algorithm to select center points, but the initial thresholds T1 and T2 of the Canopy algorithm are generally chosen manually, so the effect is unstable. Reference [4] selects initial cluster centers based on a dissimilarity metric, taking the points with greater dissimilarity as initial centers, but the k value is still specified manually, and different k values may yield different clustering results. Reference [5] obtains globally optimal Canopy center points one by one using the minimum variance; this effectively solves the problem of random initial center selection, but it offers no good way of choosing the initial threshold and may cause the cluster centers to be too concentrated. Reference [6] uses dual MapReduce to improve and optimize the K-means algorithm, but the method is not suitable for data sets whose cluster sizes differ greatly. Reference [7] uses the minimum-maximum distance method to select the initial cluster centers, which avoids centers falling in the same area, but a noise point may be chosen as a center. Reference [8] proposes an improved K-means algorithm based on distance and weight, which reduces the number of iterations but introduces higher time complexity.

On the basis of the above analysis, this paper proposes an improved K-means algorithm that improves clustering quality along three dimensions. First, the elbow rule based on the sum of squared errors is used to obtain a suitable number of clusters k, improving the stability of the clustering effect. Second, the variance is used to measure the dispersion of the samples, and the k data points with the smallest variance whose mutual distances are greater than the average distance between samples are selected as the initial cluster centers, improving the accuracy of the clustering results. Third, the triangle inequality principle is combined with K-means to eliminate unnecessary distance calculations during iteration, shortening the iteration time and improving the execution efficiency of the algorithm.

2.1. Principle of K-Means Algorithm

The K-means algorithm is a partition-based clustering algorithm. Distance is generally used as the similarity measure, so that objects within the same cluster are highly similar while objects in different clusters are dissimilar. The algorithm first selects k centers at random, computes the distance from every data point in the sample set to each center, and assigns each point to the class of its nearest centroid; the set of points assigned to the same centroid forms a cluster. The centroid of each cluster is then updated according to this assignment [9]. These two steps, point assignment and centroid update, are repeated until the membership of the clusters no longer changes.
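As an illustration of this basic procedure (not of the improvements proposed later), the standard algorithm can be run with the KMeans implementation of scikit-learn; the iris data set and the parameter values below are placeholders chosen for the sketch.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                     # 150 samples, 4 numerical attributes
model = KMeans(n_clusters=3,             # k must be specified in advance
               init="random",            # random initial centroids, as in the basic algorithm
               n_init=10,                # rerun with different seeds to lessen the risk of a poor local optimum
               random_state=0)
labels = model.fit_predict(X)            # alternate point assignment and centroid update until convergence
print(model.cluster_centers_)            # final centroids
print(model.inertia_)                    # sum of squared errors (SSE) of the final clustering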

The K-means algorithm has many advantages, such as a concise idea, fast convergence, and easy implementation. However, the selection of the initial centroids is closely related to the running efficiency of the algorithm: randomly chosen centroids may lead to a large number of iterations or trap the algorithm in a local optimum. In addition, the algorithm requires the user to specify the number of clusters k, and different k values can produce very different clustering results, affecting their accuracy [10]. Therefore, how to optimally select the initial cluster centers and determine the k value is an important direction for improving and optimizing the algorithm.

2.2. Canopy-KMeans Algorithm

The Canopy-KMeans algorithm is a K-means variant improved with the help of the Canopy algorithm [11]. The idea is to first use the Canopy algorithm for “coarse” clustering to obtain the k value and the initial cluster centers, and then apply the K-means algorithm for “fine” clustering starting from the center points of the Canopy classes to obtain the final clustering result [12, 13].

Using the Canopy-KMeans algorithm can effectively avoid the instability of the clustering effect caused by manually specifying the k value and randomly selecting the center points. However, it is difficult to determine suitable values of T1 and T2 for the Canopy algorithm. When T2 is too large, the number of clusters decreases and the clustering effect becomes unstable; when T2 is too small, the number of clusters increases and so does the computation time. When T1 is too large, a data object may belong to several Canopy classes, increasing the computational cost [14, 15].

3. Improvement of k Value Based on Elbow Rule

The parameter k of the K-means algorithm is difficult to determine and generally has to be specified in advance or obtained from empirical values and repeated experiments. For a new data set there is no reference value for k, and different k values may lead to very different numbers of iterations, affecting the accuracy of the clustering results. Therefore, the first step toward a good clustering effect is to determine the optimal number of clusters [16].

3.1. Elbow Rule

Definition 1. The Elbow Rule
The position where the marginal improvement of the degree of distortion drops most sharply is the elbow; the degree of distortion is generally used to determine the best value of k.

Definition 2. The SSE objective function:

SSE = Σ_{i=1}^{k} Σ_{x∈Ci} dist(x, ci)²,

where dist is the standard Euclidean distance, k is the number of clusters, x is a data object in cluster Ci, and ci is the centroid of cluster Ci.
The K-means algorithm uses the sum of squared errors (SSE) as the objective function to measure clustering quality. The degree of distortion of each class equals the sum of the squared distances from each sample point to the center of its class: the more compact the members of a class, the smaller its degree of distortion; the more scattered they are, the greater it is. For data with a certain degree of separability, the degree of distortion improves greatly up to a certain critical point and then declines only slowly as k increases. This critical point can be regarded as a point with good clustering performance, and the value corresponding to the position where the marginal improvement of the distortion drops most sharply is the elbow.
The “elbow rule” can therefore suggest a value of k from the analysis of the SSE curve. Taking the iris data set from the UCI repository as an example, the elbow rule indicates that k = 3 is the most appropriate number of clusters, as shown in Figure 1.

3.2. Algorithm Idea for Determining the Optimal k Value
(1) For a data set of n points, run the clustering for successive values of k, starting from 1 and increasing up to a preset maximum [1]; after each clustering, calculate the sum of the squared distances from every point to the center of the cluster it belongs to.
(2) The sum of squares decreases as k grows and reaches 0 when every point is its own cluster center.
(3) During this decrease there is an inflection point, the “elbow”: the value of k at which the rate of decline suddenly slows is taken as a good k value (a minimal code sketch is given below).
(4) Output the optimal number of clusters k.
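The following Python sketch illustrates this procedure with scikit-learn; the data set, the range of candidate k values, and the plotting details are placeholders, and the elbow itself is still read off the plotted curve by the analyst.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k_values = range(1, 11)                     # candidate numbers of clusters (placeholder upper bound)
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)              # inertia_ is the SSE for this value of k

plt.plot(list(k_values), sse, marker="o")   # k-SSE line chart; the bend suggests the elbow k
plt.xlabel("number of clusters k")
plt.ylabel("SSE")
plt.show()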

4. Using Minimum Variance to Select Cluster Centers of Sample Data Sets

4.1. Algorithmic Thinking

In addition to determining the value of the parameter k, the K-means algorithm also needs an optimized choice of the initial cluster centers. Following the concept and role of variance in statistics, this paper uses the variance to measure the dispersion of the sample data set [17]. By definition, the variance measures the dispersion of a random variable or a set of values: it is the mean of the squared differences between each value and the mean. The smaller the variance of a sample, the denser the data points around it and the faster the convergence; the larger the variance, the sparser the data points around it and the slower the convergence.

Therefore, this paper optimizes the initial cluster centers of the data set based on the idea of minimum variance, selecting k minimum-variance data points located in different regions as cluster centers. This keeps the centers as far apart as possible, prevents several centers from falling in the same area, and places as many data points as possible near each center so that the algorithm converges quickly. First, calculate the variance of every sample xi (i = 1, 2, ..., n) in the data set of n samples and the average distance dAvg between samples. Select the sample with the smallest variance as the first initial cluster center and add it to the cluster center list Clist{}. Then, among the remaining samples whose distance to the selected centers is greater than the average distance between samples, take the data sample with the smallest variance as the second initial cluster center and add it to the list. Continue in the same way until the k-th initial cluster center is selected; at that point Clist{} stores all the cluster center vectors of the data set.

4.2. Related Definitions

Suppose the data set X contains n data objects (x1, x2, …, xi, …, xj, …, xn), each with p attributes. Let C1, …, Ci, …, Cj, …, Ck denote the k cluster centers and S1, …, Si, …, Sj, …, Sk the sample sets of the k clusters, with Si ∩ Sj = ∅ for i ≠ j.

Definition 3. The Euclidean distance between samples xi and xj:

d(xi, xj) = √( Σ_{m=1}^{p} (xim − xjm)² ).

Definition 4. The average distance from sample xi to the other samples in the data set:

dAvg(xi) = (1 / (n − 1)) Σ_{j=1, j≠i}^{n} d(xi, xj).

Definition 5. The average distance of the data set samples:

dAvg = (1 / n) Σ_{i=1}^{n} dAvg(xi).

Definition 6. The variance of sample xi of the data set:

Var(xi) = (1 / (n − 1)) Σ_{j=1, j≠i}^{n} (d(xi, xj) − dAvg(xi))².

4.3. Selection of the Initial Center
(1) According to Definition 6, calculate the variance of each sample. The sample with the smallest variance in the data set is taken as the first initial cluster center C1 and added to the cluster center list Clist{}; the data points whose distance to C1 is less than or equal to the average distance between samples are stored in S1, and the remaining data objects are X = X − S1.
(2) Among the remaining samples, the sample with the smallest variance is taken as the second initial cluster center C2 and added to Clist{}; the data points whose distance to C2 is less than or equal to the average distance between samples are stored in S2, and the remaining data objects are X = X − S2.
(3) Repeat this process until k initial cluster centers are found.
(4) Output the cluster center list Clist{}. A minimal code sketch of this procedure follows.
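The sketch below is one possible Python implementation of this selection procedure, following Definitions 3 to 6 as reconstructed above. The variance used here (the variance of a sample's distances to the other samples), the fallback when no candidate satisfies the distance condition, and the function name are assumptions of the sketch rather than the paper's exact formulation.

import numpy as np

def min_variance_centers(X, k):
    # Select k initial centers: smallest distance-variance first; each later center
    # must also lie farther than the average pairwise distance from every chosen center.
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # Definition 3: pairwise Euclidean distances
    mean_dist = dist.sum(axis=1) / (n - 1)                            # Definition 4: average distance of each sample
    d_avg = mean_dist.mean()                                          # Definition 5: average distance of the data set
    diff = dist - mean_dist[:, None]
    np.fill_diagonal(diff, 0.0)                                       # exclude the zero self-distance
    var = (diff ** 2).sum(axis=1) / (n - 1)                           # Definition 6: distance variance of each sample

    centers = [int(np.argmin(var))]                                   # first center: globally smallest variance
    while len(centers) < k:
        candidates = [i for i in range(n) if i not in centers
                      and all(dist[i, c] > d_avg for c in centers)]   # keep only points far from all chosen centers
        if not candidates:                                            # fallback (assumption): relax the distance test
            candidates = [i for i in range(n) if i not in centers]
        centers.append(min(candidates, key=lambda i: var[i]))         # next center: smallest variance among candidates
    return X[centers]

The returned array can then be passed to scikit-learn, for example KMeans(n_clusters=k, init=min_variance_centers(X, k), n_init=1).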

5. Improving the Efficiency of the Algorithm by Using the Triangle Inequality Principle

There are many unnecessary distance calculations in the traditional k-means algorithm in the iterative process. In order to reduce redundant calculations and improve the efficiency of the algorithm, this paper proposes to combine the triangle inequality principle in the distance calculation, so as to achieve the purpose of accelerating the clustering algorithm. This is especially important in the case of large amounts of data.

In any triangle, the sum of any two sides is greater than the third side, and the difference of any two sides is less than the third side. This extends from plane geometry to multidimensional Euclidean space: for any three vectors a, b, c, letting d(x, y) denote the distance between x and y, the triangle inequality gives

|d(a, c) − d(b, c)| ≤ d(a, b) ≤ d(a, c) + d(b, c).

Assume a set of data points X = {x1, x2, ..., xn}, a set of centroids C = {C1, C2, ..., Ck}, and the corresponding set of clusters S = {S1, S2, ..., Sk}.

Definition 7. The principle of the triangle inequality
Let Ci, Cj ∈ C with i ≠ j, and let x ∈ X. If 2d(x, Ci) ≤ d(Ci, Cj), then d(x, Ci) ≤ d(x, Cj).
Proof: since 2d(x, Ci) ≤ d(Ci, Cj) and, by the triangle inequality, d(Ci, Cj) ≤ d(x, Ci) + d(x, Cj), it follows that 2d(x, Ci) ≤ d(x, Ci) + d(x, Cj), that is, d(x, Ci) ≤ d(x, Cj).
Definition 7 eliminates many unnecessary distance calculations: if 2d(x, Ci) ≤ d(Ci, Cj), then x is certainly no closer to Cj than to Ci, so d(x, Cj) does not need to be computed. For example, if d(x, Ci) = 4 and d(Ci, Cj) = 10, then 2 × 4 ≤ 10 guarantees d(x, Cj) ≥ 6, and the computation can be skipped. Hence, performing this screening before each distance iteration avoids a large amount of redundant calculation.

6. Improvement of the K-Means Algorithm

Based on the above analysis, this paper proposes an improved K-means algorithm that uses the elbow rule to obtain the number of clusters k, uses minimum-variance optimization to select the cluster centers, and combines the triangle inequality principle to reduce the number of iterations, improving the accuracy and iterative efficiency of Algorithm 1 along these three dimensions. A Python sketch of the iterative phase follows the listing.

(1) Use the elbow rule to obtain the number of clusters k
(2) Use minimum-variance optimization to select the k initial cluster centers C1(0), …, Ck(0)
(The number of iterations t = 0, 1, 2, …, until the objective function converges)
(3) repeat
(4)  Calculate the distances between the k centroids and use a hash table to save, for each centroid, the shortest distance to any other centroid (d(Ci, Cj) denotes the distance between centroid Ci and its nearest centroid Cj)
(5)  for each data point x do
      if x has already been assigned to the cluster whose centroid is Ci then
       if 2d(Ci, x) ≤ d(Ci, Cj) then
        the assignment of x does not need to be changed
       else
        calculate the distances from x to all k centroids and assign x to the cluster with the closest centroid
       end if
      else (x has not yet been assigned to any cluster)
       for i from 1 to k do
        if 2d(Ci, x) ≤ d(Ci, Cj) then
         assign x to the cluster whose centroid is Ci and exit the for loop
        end if
       end for
      end if
     end for
(6)  Recalculate the centroids
(7) until no centroid moves and the sum of squared errors SSE converges
(8) Output the cluster centers and the sample sets of the k clusters {z1, …, zk}
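A compact Python sketch of the iterative phase, steps (3) to (7), is given below, assuming the initial centers come from the min_variance_centers function sketched earlier. As a simplification of the pseudocode, unassigned points (the first pass) are handled with a full distance computation rather than the early-exit loop of step (5); the function name and parameters are assumptions of the sketch.

import numpy as np

def improved_kmeans_iterate(X, centers, max_iter=100, tol=1e-6):
    # Lloyd-style iterations with the triangle-inequality screening of Definition 7.
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    labels = np.full(len(X), -1)                                           # -1 means "not assigned yet"
    for _ in range(max_iter):
        cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)   # pairwise centroid distances (step 4)
        np.fill_diagonal(cc, np.inf)
        nearest_other = cc.min(axis=1)                                     # d(Ci, Cj): distance from Ci to its nearest centroid

        for idx, x in enumerate(X):
            ci = labels[idx]
            if ci != -1 and 2 * np.linalg.norm(x - centers[ci]) <= nearest_other[ci]:
                continue                                                   # pruning: x cannot be closer to any other centroid
            d = np.linalg.norm(centers - x, axis=1)                        # otherwise compute all k distances
            labels[idx] = int(np.argmin(d))

        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])             # step 6: recalculate the centroids
        if np.linalg.norm(new_centers - centers) < tol:                    # step 7: centroids no longer move
            centers = new_centers
            break
        centers = new_centers
    return labels, centers

For example, labels, centers = improved_kmeans_iterate(X, min_variance_centers(X, k)) reproduces steps (3) to (8) on a data set X, with k obtained beforehand from the elbow rule.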

The optimized algorithm flow chart is shown in Figure 2.

7. Simulation Experiment and Analysis

7.1. Experimental Environment and Experimental Data

The experimental environment in this paper uses Windows 10 as the operating system and PyCharm Community Edition 2019.3.3 x64 as the programming environment.

The experiments use an artificial data set (100 randomly generated data samples, Gaussian noise of 0.2, random seed initialized to 0) and real data sets from the UCI repository (http://archive.ics.uci.edu/ml/index.php), such as the Iris data set (150 records, 4 numerical attributes) and the Waveform data set (5000 records, 40 numerical attributes). First, the clustering visualization of the algorithm in this paper, the K-means algorithm, and the Canopy-KMeans algorithm is compared on the synthetic data set. Then 10 UCI data sets are selected to compare the algorithm in this paper with the traditional K-means algorithm and the Canopy-KMeans algorithm in terms of the k value, silhouette coefficient, clustering accuracy, ARI, NMI, F-score, iteration time, and other parameters.

7.2. Cluster Number k Value Analysis

As shown in Figures 3 and 4, the sklearn module of Python is used to plot k-SSE line charts of the number of clusters k against the sum of squared errors SSE for the wine and soybean-small data sets from UCI. The elbow rule indicates that k = 3 is suitable for the wine data set and k = 4 for the soybean-small data set.

Similarly, line charts of the number of clusters k against the sum of squared errors are made to analyze eight other UCI data sets. The experimental data sets and the predicted k values are shown in Table 1.

As can be seen from the above table, the k value estimated from the k-SSE line chart of the traditional K-means algorithm combined with the elbow rule corresponds to the actual number of clusters in each UCI data set. Therefore, this method provides an important reference for setting the k value.

7.3. Algorithm Performance Comparison Analysis

Analyzing the clustering effect requires clustering evaluation, which mainly covers the feasibility and the effectiveness of clustering on the data set. Feasibility is measured by the silhouette coefficient, and effectiveness is measured by the accuracy rate [18], ARI [19], NMI [4], F-score [13], iteration time, and so on.

Silhouette coefficient: for a data set D containing n objects, suppose D is divided into k clusters C1, C2, …, Ci, …, Ck. For each object xi ∈ D, calculate the average distance d(xi) between xi and the other objects in its own cluster, and the minimum average distance b(xi) between xi and the objects of every other cluster [20]. Assuming xi ∈ Ci (1 ≤ i ≤ k), then

d(xi) = ( Σ_{xj∈Ci, xj≠xi} dist(xi, xj) ) / (|Ci| − 1),
b(xi) = min_{1≤m≤k, m≠i} ( Σ_{xj∈Cm} dist(xi, xj) / |Cm| ).

The silhouette coefficient of object xi is defined as:

s(xi) = (b(xi) − d(xi)) / max{d(xi), b(xi)}.

The value of the silhouette coefficient is between −1 and 1. d(xi) reflects the compactness of the cluster to which the object xi belongs. The smaller the value, the more compact it is. b(xi) reflects the degree of separation between the object xi and other clusters. The larger the value, the farther the object xi is separated from other clusters. To measure the fit of the clusters in the clustering, generally, the average value s of the silhouette coefficients of all objects in the cluster is calculated. When the value of s is close to 1, it means that the clusters are close and the distances between different clusters are far.
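For reference, scikit-learn provides this index directly; the short sketch below assumes that X and labels come from any of the clustering sketches shown earlier.

from sklearn.metrics import silhouette_score, silhouette_samples

s_mean = silhouette_score(X, labels)       # average silhouette coefficient s over all objects
s_each = silhouette_samples(X, labels)     # per-object silhouette values in [-1, 1]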

Correct rate: the total number of correctly identified individuals divided by the total number of identified individuals:

Correct rate = ( Σ_{i=1}^{k} ai ) / ( Σ_{i=1}^{k} mi ),

where ai is the number of correctly identified individuals of the i-th class in the clustering results and mi is the number of individuals identified as belonging to the i-th class.
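One common way to compute this rate in code is to map each cluster to its majority true class before counting. The sketch below is such a heuristic, assuming class labels encoded as non-negative integers; it is not necessarily the paper's exact evaluation script.

import numpy as np

def clustering_accuracy(y_true, y_pred):
    # Map each cluster to its most frequent true class, then count correct assignments.
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    correct = 0
    for cluster in np.unique(y_pred):
        members = y_true[y_pred == cluster]
        correct += np.bincount(members).max()   # members whose class is this cluster's majority class
    return correct / len(y_true)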

ARI: the adjusted Rand index takes a random model as its null hypothesis; it requires no assumption about the structure of the clusters and is usually used to compare clustering results [21]. ARI ∈ [−1, 1]; a larger value means the clustering result is more consistent with the true partition.

NMI: normalized mutual information is also an index used to measure the similarity between two clustering results (label assignments). It measures the similarity between the result of the algorithm and the reference labeling: the more similar they are, the closer the NMI is to 1. The value range of NMI is [0, 1].

F-score: the clustering result is converted into a classification result through the external labels and then evaluated. The F-score is the harmonic mean of precision and recall, so it considers both jointly. Its value range is [0, 1]; the larger the value, the better the model performance.

Clustering time: the running time t of the algorithm; the shorter the time, the better the algorithm.
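The external indices above are available in scikit-learn. The short sketch below assumes true labels y_true and predicted cluster labels y_pred are at hand; the F-score additionally requires mapping cluster labels to class labels, which is omitted here.

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

ari = adjusted_rand_score(y_true, y_pred)             # ARI in [-1, 1]
nmi = normalized_mutual_info_score(y_true, y_pred)    # NMI in [0, 1]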

The clustering visualization effects of the algorithm in this paper, the K-means algorithm, and the Canopy-KMeans algorithm on the artificial data set are compared, as shown in Figures 5–7.

Figures 5–7 show that the K-means clustering results depend on the random selection of the initial cluster centers and may converge to a local optimum instead of the global optimum. This is because K-means updates each center as the mean vector of its cluster, and outliers can strongly affect the mean of an attribute column, shifting the center point and making the clustering results unstable. The Canopy-KMeans algorithm first performs coarse clustering to determine the k value and the k initial cluster centers and then uses the K-means algorithm for fine clustering, which to a certain extent avoids the local optimum caused by randomly selected initial centers. The algorithm in this paper determines the number of clusters and the initial cluster centers more accurately through the elbow rule and the minimum variance, and thus better improves the clustering quality and clustering accuracy.

The traditional K-means algorithm, the Canopy-KMeans algorithm, and the improved K-means algorithm are applied to 10 different UCI data sets, and the performance parameters of each algorithm are compared. The experimental data sets and the experimental analysis are shown in Table 2.

From the experimental results in Table 2, it can be seen that, compared with the traditional K-means algorithm and the Canopy-KMeans algorithm, the algorithm in this paper, which uses the elbow rule to pre-estimate the k value and the minimum sample variance to determine the cluster center of each area, improves the accuracy significantly: it is 12.15 to 17.51 percentage points higher than the traditional K-means algorithm and 4.97 to 9.37 percentage points higher than the Canopy-KMeans algorithm. The silhouette coefficient, which measures how well the clusters fit, also increases by 13 to 18 percentage points over the traditional K-means algorithm and by 6 to 10 percentage points over the Canopy-KMeans algorithm, showing that the improved algorithm makes the clusters more compact and the distances between different clusters longer. The other clustering evaluation indices, ARI, NMI, and F-score, are also improved. The algorithm in this paper therefore achieves a better clustering effect and clustering speed.

The other clustering performance indicators of the algorithm in this paper, the K-means algorithm, and the Canopy-KMeans algorithm are further compared on different data sets. Figures 8 and 9 show, respectively, the average accuracy and the running time of the three algorithms over 10 runs on 10 different UCI data sets.

As can be seen from Figures 8 and 9, the correct rate and the running time of the algorithms are closely related to the number of records and dimensions of the data set: on data sets with relatively many dimensions and records, such as Glass Identification, Diabetes, and E. coli, the accuracy is lower and the running time is longer. The line charts also show that the average number of iterations and the accuracy depend strongly on the determination of the k value and the selection of the initial cluster centers. The traditional K-means algorithm has the longest running time and the lowest accuracy on every data set, because its number of clusters k is specified arbitrarily and its cluster centers are selected randomly, so different k values and initial centers cause large fluctuations in the clustering results and an unstable clustering effect. The Canopy-KMeans algorithm uses coarse clustering to predetermine the k value and the initial cluster centers, which improves the accuracy and efficiency to a certain extent. The algorithm in this paper first uses the elbow rule to determine the number of clusters k, then uses the minimum variance to determine the initial cluster centers so that they agree more closely with the actual cluster centers of the data set, and finally uses the triangle inequality principle to eliminate unnecessary distance calculations and further shorten the clustering time, improving the clustering quality along three dimensions. Therefore, the average running time, average number of iterations, and accuracy of the algorithm in this paper are all the best of the three.

8. Conclusion

This paper proposes an improved K-means algorithm to address the difficulty of determining the k value and the randomness of initial cluster center selection in the K-means algorithm, which make it easy to fall into local optimal solutions. The elbow rule is introduced to obtain the optimal k value from the k-SSE line chart, k initial cluster centers are selected based on the minimum variance and the average distance between samples, and finally the triangle inequality principle is applied to reduce the number of iterations of the K-means algorithm. Experiments on 10 UCI data sets verify that the improved algorithm significantly outperforms the traditional algorithm in accuracy and running efficiency, that its clustering results are stable, and that it objectively reflects the distribution of the original data samples. The method in this paper is mainly aimed at distance-based clustering. Future work will study data sets with arbitrary shapes [22], different sizes, variable density [23], and overlapping clusters [24, 25] to expand the scope of application of the algorithm.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the Research Project of Education Department of Jilin Province (project no. JJKH20210695KJ).