Abstract

The goal of clustering analysis is to group a set of data points into several clusters based on similarity or distance. The similarity or distance is usually a scalar in numerous traditional clustering algorithms. However, a vector, such as the data gravitational force, carries more information than a scalar and can be exploited in clustering analysis to improve clustering performance. Therefore, this paper proposes a three-stage hierarchical clustering approach called GHC, which takes advantage of the vector nature of the data gravitational force, inspired by the law of universal gravitation. In the first stage, a sparse gravitational graph is constructed based on the top k data gravitations between each data point and its neighbors in the local region. The sparse graph is then partitioned into many subgraphs according to the gravitational influence coefficient. In the last stage, the final clustering result is obtained by merging these subgraphs iteratively using a new linkage criterion. To demonstrate the performance of the GHC algorithm, experiments on synthetic and real-world data sets are conducted; the results show that the GHC algorithm achieves better performance than the other existing clustering algorithms evaluated.

1. Introduction

Clustering is one of the major unsupervised learning techniques and has been applied in many fields such as pattern recognition [1], image processing [2, 3], community detection [4, 5], bioinformatics [6, 7], and information retrieval [8, 9]. The main task of clustering is to partition a dataset into nonoverlapping clusters based on a suitable similarity metric so that elements in the same cluster are similar, while elements from different clusters are dissimilar. A variety of clustering methods has been proposed; they can be classified as partition-based, hierarchical, grid-based, density-based, and model-based clustering, among others.

K-means [10] and its successors [2, 11] are typical partition-based clustering approaches. They require the number of clusters to be specified in advance. Each data point is assigned to its closest cluster according to the Euclidean distances among data points, and the cluster centroids are recalculated repeatedly until every element is consistently assigned to the same cluster, at which point the centroids have stabilized and no longer change. However, these approaches are unable to detect nonspherical clusters because an element is always assigned to the nearest center. Numerous studies have been conducted to overcome this drawback of K-means-type algorithms, particularly by exploiting the density distribution. In density-based clustering, clusters of arbitrary shape are regarded as dense regions separated by sparse regions in the data space [12]. DBSCAN [13] is the most representative density-based clustering algorithm: it requires a density threshold to be specified, discards points with densities lower than this threshold as noise, and assigns disconnected regions of high density to different clusters. DP [14] is a novel algorithm that efficiently discovers the centers of clusters by finding density peaks. It assumes that cluster centers are surrounded by neighbors with lower local density and are at a relatively large distance from any point with a higher local density.

Furthermore, hierarchical clustering is a significant branch of cluster analysis that seeks to build a hierarchy of clusters. Hierarchical clustering algorithms can be divided into two categories: agglomerative and divisive. An agglomerative hierarchical clustering algorithm starts with every single element of a dataset as its own cluster and then aggregates the closest clusters under a linkage criterion in each iteration until all elements form one cluster. A divisive hierarchical clustering algorithm starts with the whole dataset considered as a single cluster, which is separated into subclusters until every element forms its own cluster. The remaining differences among hierarchical clustering approaches are determined by their choices of similarity and linkage criteria. BIRCH is one of the most effective hierarchical clustering methods [15]. It constructs a tree data structure whose leaf nodes hold the cluster centroids, which can either be taken as the final cluster centroids or be provided as input to another clustering algorithm. In addition, there are many multistage hierarchical clustering algorithms, such as Chameleon [16], a representative approach that can detect clusters of arbitrary shape effectively. In the first stage, Chameleon uses a graph-partitioning algorithm to divide the data items into several relatively small subclusters. In the second stage, it finds the genuine clusters by repeatedly combining these subclusters according to both their interconnectivity and their closeness.

These classical clustering approaches usually utilize only one kind of internal evaluation function to determine clustering quality [17]. Many scholars have therefore focused on multiobjective clustering to overcome this defect of conventional clustering algorithms. Peng et al. [18] proposed fuzzy multiobjective clustering based on PSO to obtain well-separated, connected, and compact clusters. Saha and Maulik [19] proposed multiobjective clustering based on incremental learning for categorical data.
Moreover, many other clustering algorithms, such as DenPEHC [20], GHFHC [21], and Muenc [22], have been put forward to improve clustering performance. Meanwhile, new gravity-based clustering approaches have also been proposed, such as the LGC algorithm, which will be discussed in Section 2.

Designing a new clustering algorithm remains an important task because every algorithm has its own advantages and disadvantages. In conventional clustering algorithms, the similarity or distance is usually a scalar, which carries only partial information about the relation between data points. To obtain more information, a vector can be adopted to represent the similarity of two data points. The data gravitational force, which is analogous to the universal gravitational force, is employed here to cluster data points. We therefore propose a novel hierarchical clustering algorithm based on a sparse gravitational graph, in which each vertex denotes an object of the data set and each edge denotes the data gravitational force between its two vertices. In the clustering process, a weighted graph is first constructed based on universal gravitation. The graph is then divided into several subgraphs based on the gravitational influence coefficients between each vertex and its adjacent vertices. Finally, subgraphs are merged iteratively based on a new linkage measure until the genuine clusters are found.

The highlights of this paper are threefold. First, the sparse gravitational graph is defined based on the data gravitation model, and a new measure is used to extract more valuable information between each vertex and its adjacent vertices in this graph. Second, a new linkage measure that makes full use of the vector characteristics of data gravitation is proposed to merge the subclusters iteratively. Third, a novel three-stage gravity-based hierarchical clustering method named GHC is proposed. The GHC algorithm can detect clusters of arbitrary shape effectively and achieves excellent clustering performance on the synthetic and real-life data sets used in this study.

The remainder of the paper is organized as follows. In Section 2, the related work on gravity-based clustering is reviewed. Section 3 establishes the data gravitation model. The novel gravity-based hierarchical clustering algorithm (GHC) is proposed and analyzed in detail in Section 4. In Section 5, the experiments on synthetic and real-world data sets are conducted and discussed. Finally, the conclusions are drawn in Section 6.

2. Related Work

Using gravity theory in clustering is not a new idea. Numerous gravity-based clustering algorithms, which simulate the process of objects attracting and merging under their mutual gravitational forces, have been studied. Usually, these algorithms consider each data point as an object in feature space and assign a mass to it. Wright [23] proposed the first version of gravitational clustering, which updates the position of each data point at each iteration and aggregates the data points into clusters when they are close. Yung [24] employed the gravitational clustering approach to segment color images: each pixel with unit mass maps to a location (a particle) in RGB space, the mass of a particle is the total number of pixels mapped to it, gravity causes the particles to move in the space under constraints, and the particles are clustered when they move to the same location in RGB space. Wang et al. [25] proposed two novel clustering approaches based on the local gravitation model. In this model, each data item is considered an object with mass and is associated with a local resultant force (LRF) generated by its neighbors in the local region; the clustering process exploits the differences between the LRFs of data points close to the cluster centers and those at the boundaries of the clusters. Bahrololoum et al. [26] proposed another approach that finds the best positions of the cluster centroids by employing the law of gravity. In this approach, the data points and cluster centroids are considered fixed celestial objects and movable objects, respectively: the celestial objects apply a gravitational force to the movable objects and change their positions in the feature space, and the best cluster centroids are obtained when the sum of the forces on each centroid approaches zero. Mohammed Alswaitt et al. [27] proposed a modification of a gravity-based data clustering algorithm; the modified algorithm adopts the dependence of the agent on velocity and an initialization step for centroid positions to impose a balance between the exploitation and exploration abilities of the gravity-based clustering approach. Besides, a series of approaches based on gravity theory and Newton's second law of motion was proposed by Gómez et al. [28], Kundu [29], and Sanchez et al. [30]; in these approaches, points of the same cluster move toward the direction of their cluster center. Inspired by the phenomena of gravitation and the black hole, Hatamlou [31] proposed a new heuristic optimization approach called the black hole algorithm. Other heuristic algorithms inspired by gravitational phenomena have also been designed for clustering. For instance, a heuristic gravitational search algorithm (GSA) was proposed by Rashedi et al. [32] and was applied to the wind-hydro-thermal CO problem by Shukla and Singh [33]. Yin et al. [34] designed a hybrid data clustering algorithm based on GSA.

To the best of our knowledge, in most existing gravity-based clustering algorithms, each data point of a dataset is considered a movable object with mass: data points move around in the feature space under the influence of the law of gravity and merge into clusters when they move close enough to each other. In our approach, by contrast, we establish the data gravitation model and utilize the relation between each data point and the neighbors that exert the largest gravitational forces on it to group a dataset into many subclusters; then, the two subclusters with the largest resultant gravitational force are merged. To boost the effectiveness of clustering, we define the sparse gravitational graph based on the data gravitation model, which can be divided into many subgraphs based on the relation between each vertex and its adjacent vertices. The subgraphs are then repeatedly merged into larger subgraphs until the termination condition is satisfied.

3. Data Gravitation Model

Newton’s law of universal gravitation states that every point mass attracts every other point mass with a force acting along the line through the two points, which is proportional to the product of their masses and inversely proportional to the square of the distance between them. The gravitational force can be calculated as follows:

$$\vec{F}_{ij} = \kappa \frac{m_i m_j}{r_{ij}^2}\, \hat{e}_{ij}, \tag{1}$$

where $\vec{F}_{ij}$ denotes the gravitational force exerted on point mass $i$ by point mass $j$, $m_i$ and $m_j$ are the masses of the two points, respectively, $r_{ij}$ is the distance between point mass $i$ and point mass $j$, $\hat{e}_{ij}$ is the unit vector from point mass $i$ to $j$, and $\kappa$ is the gravitational constant.

Similar to the gravitational force, it is assumed that data gravitation exists between any two data points in the data space. The data gravitation can be given as follows:

$$\vec{F}_{ij} = f(r_{ij})\, m_i m_j\, \hat{e}_{ij}, \tag{2}$$

where $f(x)$ is a decreasing function of $x$ (e.g., $f(x) = 1/x^2$), $m_i$ and $m_j$ are, respectively, the masses of the data points $i$ and $j$, $r_{ij}$ is the distance between the two data points, and $\hat{e}_{ij}$ is the unit vector from data point $i$ to $j$. The mass of data point $i$ can be defined by

$$m_i = 1 + \sum_{j \ne i} \chi(\gamma - r_{ij}), \tag{3}$$

where $\chi(x) = 1$ if $x > 0$ and $\chi(x) = 0$ otherwise, and $\gamma$ is a cutoff distance. In other words, $m_i$ equals the number of points (including point $i$ itself) whose distance to point $i$ is less than $\gamma$. In particular, the mass of a data point is equal to 1 when $\gamma \le 0$. Moreover, we assume that the gravitational forces exerted on a data point are the top $k$ gravitational forces between it and the other data points. Therefore, the gravitational resultant force (GRF) of data point $i$ can be obtained as follows:

$$\vec{F}_i = \sum_{j \in N_k(i)} \vec{F}_{ij}, \tag{4}$$

where $N_k(i)$ is the set of neighboring data points that exert the top $k$ gravitational forces on data point $i$. The gravitational force between two data points changes with the cutoff distance $\gamma$ because their masses are related to $\gamma$; the GRF of a data point therefore also changes with $\gamma$ according to equation (4). For example, Figure 1 shows the gravitational forces and GRFs in a 2D data set when $\gamma$ is set to different values. In Figure 1(a), the mass of each data point is 1 when $\gamma = 0$, and the GRF of a data point is directed towards the neighbors that exert the largest forces on it, indicating that those neighbors provide more influence on it. In Figure 1(b), the masses of the five data points are 4, 1, 2, 3, and 3, respectively, and the GRF of each data point is again directed towards the neighbors that provide more influence on it. The gravitational influence coefficient (GIC) is then introduced to represent the relationship between the GRF of a data point and the gravitational forces exerted on it by other data points. The GIC of data points $i$ and $j$ is defined as follows:

$$GIC_{ij} = \frac{\vec{F}_i \cdot \vec{F}_{ij}}{\|\vec{F}_i\|\, \|\vec{F}_{ij}\|}, \tag{5}$$

where $\vec{F}_i$ is the resultant force of data point $i$, and $\vec{F}_{ij}$ is the gravitational force exerted on data point $i$ by its neighboring data point $j$. $GIC_{ij}$ ranges from −1 to 1; the larger $GIC_{ij}$ is, the more influence point $j$ provides on data point $i$. Intuitively, the gravitational influence coefficient can be adopted to realize cluster analysis: data points $i$ and $j$ are grouped into the same cluster if both $GIC_{ij}$ and $GIC_{ji}$ are larger than a threshold; otherwise, they are assigned to different clusters. In this way, a coarse clustering method can be obtained.
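To make the model concrete, the following Python sketch computes the masses, top k forces, GRFs, and GICs for a small 2D data set. It is a minimal sketch of equations (2)–(5) under the assumptions stated above ($f(x) = 1/x^2$ and the cosine form of the GIC); the function and variable names are our own illustrations, not part of the paper's algorithm listings.

```python
import numpy as np

def data_gravitation(X, k=3, gamma=0.2):
    """Illustrative sketch of the data gravitation model, equations (2)-(5).

    Assumes f(x) = 1/x^2 in equation (2) and the cosine form of the GIC.
    """
    n = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = x_j - x_i
    dist = np.linalg.norm(diff, axis=2)           # pairwise distances r_ij

    # Equation (3): m_i = 1 + number of *other* points within the cutoff gamma.
    others = (dist < gamma) & ~np.eye(n, dtype=bool)
    mass = 1 + others.sum(axis=1)

    # Equation (2): F_ij = f(r_ij) * m_i * m_j * e_ij, with f(x) = 1/x^2.
    with np.errstate(divide="ignore"):
        mag = np.outer(mass, mass) / dist**2      # force magnitudes
    np.fill_diagonal(mag, 0.0)
    safe = np.where(dist == 0, 1.0, dist)[:, :, None]
    F = mag[:, :, None] * (diff / safe)           # F[i, j] = force on i by j

    # Top k strongest forces acting on each point, and the GRF (equation (4)).
    topk = np.argsort(-mag, axis=1)[:, :k]
    grf = np.array([F[i, topk[i]].sum(axis=0) for i in range(n)])

    # Equation (5): GIC_ij = cosine of the angle between GRF_i and F_ij.
    gic = {}
    for i in range(n):
        for j in topk[i]:
            denom = np.linalg.norm(grf[i]) * np.linalg.norm(F[i, j])
            gic[(i, int(j))] = float(F[i, j] @ grf[i] / denom) if denom > 0 else 0.0
    return mass, topk, F, grf, gic

# Toy usage: five points in two loose groups.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [2.0, 2.0], [2.1, 2.1]])
mass, topk, F, grf, gic = data_gravitation(X, k=2, gamma=0.3)
```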

4. The Proposed Gravity-Based Hierarchical Clustering Algorithm

Though the coarse clustering algorithm based on the data gravitation model can be employed to cluster a dataset, its clustering performance is not good. Therefore, a novel hierarchical clustering algorithm (GHC) is proposed based on the sparse gravitational graph, which makes the algorithm easy to implement and effective in practice. The time complexity of the GHC algorithm is analyzed at the end of this section.

4.1. Sparse Gravitational Graph

Let $X = \{x_1, x_2, \ldots, x_n\}$ denote a data set with $n$ data points, in which each data point has $v$ features. The sparse gravitational graph $G = (V, E)$ is composed of the vertex set $V$, in which the weight of each vertex is calculated by equation (3), and the edge set $E$, in which the weight of each edge is calculated by equation (2). The smaller the value of $k$, the sparser the graph. Figure 2 shows the gravitational graphs of a dataset when different values of $k$ are specified (Figures 2(a) and 2(c) for one value, Figures 2(b) and 2(d) for another). Meanwhile, the weights of the vertices and edges are all influenced by the value of $\gamma$.

Although the relationship between two vertices $i$ and $j$ can be described simply by the gravitational force between them, a vertex is influenced not only by a single adjacent vertex but also by its other adjacent vertices in the sparse gravitational graph. Considering the influence of each vertex on its adjacent vertices, the gravitational influence coefficient can also be introduced into the gravitational graph to describe the influence between two vertices. Two vertices $i$ and $j$ are grouped into the same cluster if both $GIC_{ij}$ and $GIC_{ji}$ are larger than a threshold $\theta$. The edges between two vertices in the same cluster are retained in the graph, while the edges whose endpoints belong to different clusters are removed from the graph. The gravitational graph is thereby partitioned into many subgraphs. However, these subgraphs are not the final clustering result.

4.2. Gravity-Based Hierarchical Clustering Algorithm

Though the gravitational graph can be partitioned into many subgraphs denoting different subclusters, the clustering performance at this point is still poor. These subgraphs can, however, be considered intermediate clustering results. Therefore, a new hierarchical clustering algorithm is proposed based on the intermediate results of partitioning the gravitational graph. The proposed clustering approach consists of the following three stages.

During the first stage, the data set is mapped into a sparse gravitational graph, which is similar to a k-NN graph. Firstly, the data set is preprocessed using feature transformation and dimension reduction techniques. Then, the mass of each data point is calculated by equation (3), and the gravitational force between two vertices is computed by equation (2). The initial gravitational graph is constructed, in which the weights of the vertices and edges are the corresponding masses and forces. The procedure for constructing the sparse gravitational graph is presented in Algorithm 1.

Input: X: the data set. k: the number of neighbors with the top k gravitational forces. γ: the cutoff distance used to determine the mass of each point.
Output: G: the sparse gravitational graph.
(1)Scale the data set X using a feature transformation technique;
(2)Calculate the Euclidean distance between any two data points i and j in the data set X;
(3)Calculate the mass of each data point i in the data set X by equation (3);
(4)Calculate the data gravitational force between any two data points i and j in the data set X;
(5)Initialize the sparse gravitational graph G = (V, E) with one vertex per data point, and set E = ∅;
(6)for each data point x in X do
(7) Assign the mass of x as the weight of the corresponding vertex in V;
(8) Select the data points x_1, …, x_k that exert the top k gravitational forces on data point x;
(9)for j ← 1 to k do
(10)  Insert the edge (x, x_j) into the set E;
(11)  Assign the data gravitational force between x and x_j as the weight of the edge (x, x_j);
(12)end
(13) Calculate the gravitational resultant force of data point x by equation (4) and store it at the corresponding vertex in V;
(14)end
(15)return the sparse gravitational graph G;
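As a rough illustration of Algorithm 1, the following sketch assembles the sparse gravitational graph with networkx, reusing the hypothetical data_gravitation() helper from the sketch in Section 3; attribute names such as mass, grf, and force are our own choices, not those of the paper.

```python
import numpy as np
import networkx as nx

def create_grav_graph(X, k=3, gamma=0.2):
    """Sketch of Algorithm 1: build the sparse gravitational graph.

    Vertices carry the mass and GRF (steps (7) and (13)); each edge carries
    the magnitude of the gravitational force between its endpoints.
    """
    mass, topk, F, grf, _ = data_gravitation(X, k=k, gamma=gamma)
    G = nx.Graph()
    for i in range(len(X)):
        G.add_node(i, mass=int(mass[i]), grf=grf[i])
    for i in range(len(X)):
        for j in topk[i]:                    # steps (9)-(12)
            G.add_edge(i, int(j), force=float(np.linalg.norm(F[i, j])))
    return G

G = create_grav_graph(X, k=2, gamma=0.3)     # X from the earlier sketch
```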

During the second stage, the gravitational graph is partitioned into many small connected subgraphs based on the gravitational influence coefficients among vertices. If both $GIC_{ij}$ and $GIC_{ji}$ are greater than the threshold $\theta$, the edge $(i, j)$ is retained in the graph; otherwise, the edge is removed from the graph. The process of the second stage is described in Algorithm 2.

Input: G: the sparse gravitational graph. θ: the threshold of the gravitational influence coefficient.
Output: G_s: the separated gravitational graph.
(1)G_s ← G;
(2)for each vertex v in the graph G_s do
(3) Search the adjacent vertices v_1, …, v_t of the vertex v;
(4)for j ← 1 to t do
(5)  Calculate the GIC of the vertex v with respect to the vertex v_j by equation (5);
(6)  Calculate the GIC of the vertex v_j with respect to the vertex v by equation (5);
(7)  if GIC(v, v_j) ≤ θ or GIC(v_j, v) ≤ θ then
(8)   Remove the edge (v, v_j) from the edge set E of G_s;
(9)  end
(10)end
(11)end
(12)return G_s;
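A minimal sketch of Algorithm 2 follows, again using the hypothetical gic dictionary from the earlier sketch. Treating pairs without a recorded GIC as below the threshold is an assumption of this sketch, not a statement of the paper.

```python
def part_grav_graph(G, gic, theta=0.1):
    """Sketch of Algorithm 2: keep an edge only if the GICs in both
    directions exceed theta.

    `gic` is the dictionary from data_gravitation(); pairs with no recorded
    GIC (the neighbor is not in the point's top k) are treated here as
    falling below the threshold.
    """
    Gs = G.copy()
    for i, j in list(Gs.edges()):
        if gic.get((i, j), -1.0) <= theta or gic.get((j, i), -1.0) <= theta:
            Gs.remove_edge(i, j)
    return Gs

Gs = part_grav_graph(G, gic, theta=0.1)
```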

In the last stage, the genuine clusters are found by merging subgraphs iteratively. The core of merging the subgraphs is the linkage criterion between two clusters, which determines the similarity among the subgraphs. Common linkage criteria include complete linkage, single linkage, mean average linkage, centroid linkage, minimum energy linkage, graph degree linkage, and so on. Different from the above linkage criteria, a novel linkage measure is defined to determine the similarity of two subgraphs based on the vector property of gravitational forces. It is called the gravitational merging coefficient (GMC) and is obtained by combining the gravitational forces between two subgraphs. Mathematically, the GMC is formulated as follows:

$$GMC(G_i, G_j) = \frac{\left\| \sum_{u \in G_i} \sum_{v \in G_j} \vec{F}_{uv} \right\|}{n_i\, n_j}, \tag{6}$$

where $G_i$ is the $i$th subgraph, $G_j$ is the $j$th subgraph, $n_i$ is the number of vertices in $G_i$, and $n_j$ is the number of vertices in $G_j$. In each iteration, the two subgraphs with the largest GMC are merged into a new subgraph. The clustering process terminates when the end conditions are met. The processing steps are presented in Algorithm 3.

Input: G: the original sparse gravitational graph. G_s: the separated gravitational graph.
Output: G_s: the merged gravitational graph.
(1)Search out the connected subgraphs G_1, G_2, … in the gravitational graph G_s;
(2)Calculate the GMC of each pair of connected subgraphs G_i and G_j by equation (6);
(3)Select the connected subgraphs G_i and G_j which have the largest GMC;
(4)for each vertex v in G_i do
(5) for each vertex u in G_j do
(6)  if the edge (v, u) is in G then
(7)   Insert the edge (v, u) into G_s;
(8)  end
(9) end
(10)end
(11)return G_s;
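The following sketch implements one merging iteration of Algorithm 3 under the GMC form given in equation (6) as reconstructed above; F is the force array from the earlier data_gravitation() sketch.

```python
import numpy as np
import networkx as nx

def merge_grav_graph(G, Gs, F):
    """Sketch of one iteration of Algorithm 3: merge the pair of connected
    subgraphs with the largest GMC by restoring their edges from G.

    Assumes the GMC of equation (6) is the norm of the summed inter-subgraph
    forces divided by the product of the subgraph sizes.
    """
    comps = [list(c) for c in nx.connected_components(Gs)]
    if len(comps) < 2:
        return Gs
    best, pair = -1.0, None
    for a in range(len(comps)):
        for b in range(a + 1, len(comps)):
            resultant = sum(F[u, v] for u in comps[a] for v in comps[b])
            gmc = np.linalg.norm(resultant) / (len(comps[a]) * len(comps[b]))
            if gmc > best:
                best, pair = gmc, (a, b)
    a, b = pair
    for u in comps[a]:
        for v in comps[b]:
            if G.has_edge(u, v):             # steps (4)-(10): restore edges
                Gs.add_edge(u, v, **G.edges[u, v])
    return Gs
```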

The overall procedure of GHC is presented in Algorithm 4. To illustrate the GHC algorithm, the main clustering steps are shown in Figure 3. Figure 3(a) shows the first stage of GHC: the artificial data set with 25 data points is mapped into a gravitational graph by Algorithm 1. Figure 3(b) shows the second stage, in which the gravitational graph is partitioned into many subgraphs by Algorithm 2. Figures 3(c) and 3(d) show the third stage, in which the subgraphs with the highest GMC are merged by using Algorithm 3: Figure 3(c) shows the gravitational graph after six iterations, and Figure 3(d) shows the clustering result after twelve iterations. Obviously, the data set is grouped into two clusters correctly.

Input: X: the data set. γ: the cutoff distance. k: the number of neighbors with the top k gravitational forces. θ: the threshold of the gravitational influence coefficient. n_c: the number of the final clusters.
Output: Y: the cluster labels of the data set.
(1)G ← CreateGravGraph (X, k, γ);
(2)G_s ← PartGravGraph (G, θ);
(3)Calculate the number t of the connected subgraphs in the graph G_s;
(4)while t > n_c do
(5) G_s ← MergeGravGraph (G, G_s);
(6) Calculate the number t of the connected subgraphs in the graph G_s;
(7)end
(8)Assign a cluster label to each subgraph in G_s;
(9)return the cluster labels Y;
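Putting the three stages together, a minimal end-to-end sketch of Algorithm 4 might look as follows, built from the helpers sketched above; n_clusters stands in for the target cluster count n_c of the listing.

```python
def ghc(X, k=6, gamma=0.2, theta=0.1, n_clusters=2):
    """Sketch of Algorithm 4: the full three-stage GHC pipeline."""
    # data_gravitation() is recomputed inside create_grav_graph(); it is
    # called once more here only to obtain F and gic for the later stages.
    mass, topk, F, grf, gic = data_gravitation(X, k=k, gamma=gamma)
    G = create_grav_graph(X, k=k, gamma=gamma)            # stage 1
    Gs = part_grav_graph(G, gic, theta=theta)             # stage 2
    while nx.number_connected_components(Gs) > n_clusters:
        Gs = merge_grav_graph(G, Gs, F)                   # stage 3
    labels = np.empty(len(X), dtype=int)
    for c, comp in enumerate(nx.connected_components(Gs)):
        labels[list(comp)] = c
    return labels

labels = ghc(X, k=2, gamma=0.3, theta=0.1, n_clusters=2)  # toy X from above
```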

4.3. Complexity Analysis

The time complexity can be defined as the sum of the complexities of the three stages of the GHC algorithm. In the first stage, each data point needs its mass to be calculated and its neighbors with the top k gravitational forces to be found, after which the gravitational graph is constructed. For a data set with n data points, the time complexity of Algorithm 1 is therefore O(n²). During the second stage, the GIC between each vertex and its adjacent vertices is calculated, and the gravitational graph is divided into subgraphs; thus, the time complexity of Algorithm 2 is O(κn), where κ is the mean number of adjacent vertices and n is the number of vertices in the gravitational graph. Usually, κ is much smaller than n. During the third stage, it is required to calculate the GMC of every pair of subgraphs and merge the two subgraphs with the largest GMC, which costs O(n²) per iteration in the worst case. Since up to n merging iterations may be needed, the worst-case time complexity of the merging stage, and hence of the GHC algorithm, is O(n³).

5. Experiments

5.1. Performance Metrics

In this study, four clustering performance metrics, namely, Purity [27], Rand Index (RI) [35], F-measure [35], and Normalized Mutual Information (NMI) [36], are used to evaluate the performance of the clustering algorithms. Given a dataset $X$ with $p$ categories and $n$ data points, the set $P = \{P_1, P_2, \ldots, P_p\}$ denotes the real classes, in which each $P_j$ is a subset of $X$. The clustering result is the set $Q = \{Q_1, Q_2, \ldots, Q_q\}$, in which each $Q_i$ is also a subset of $X$. Purity is an external evaluation criterion of cluster quality. The purity of a cluster $Q_i$ with $n_i$ data points is defined as follows:

$$P(Q_i) = \frac{1}{n_i} \max_{j} n_i^j, \tag{7}$$

where $n_i^j$ is the number of the data points of the $j$th class that are assigned to the $i$th cluster. The overall purity of a clustering result is defined as

$$Purity = \sum_{i=1}^{q} \frac{n_i}{n}\, P(Q_i). \tag{8}$$

In general, a larger Purity denotes a better clustering result. The Rand Index is calculated as follows:

$$RI = \frac{a + b}{\binom{n}{2}}, \tag{9}$$

where $\binom{n}{2} = n(n-1)/2$ is the total number of pairs of data items in $X$, $a$ is the number of pairs of data items in $X$ that are in the same subset of $Q$ and in the same subset of $P$, and $b$ is the number of pairs of data items in $X$ that are in different subsets of $Q$ and in different subsets of $P$. The F-measure is similar to RI with the exception that true negatives are not taken into account. Mathematically, the F-measure is calculated as follows:

$$F = \frac{2a}{2a + c + d}, \tag{10}$$

where $c$ is the number of pairs of data items in $X$ that are in different subsets of $Q$ and in the same subset of $P$, and $d$ is the number of pairs that are in the same subset of $Q$ and in different subsets of $P$. The normalized mutual information (NMI) is also adopted in this paper. The NMI is computed as

$$NMI(Q, P) = \frac{2\, I(Q; P)}{H(Q) + H(P)}, \tag{11}$$

where $I(Q; P)$ is the mutual information between $Q$ and $P$, and $H(\cdot)$ denotes entropy.

The larger NMI denotes a better performance of clustering.
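As a practical note, standard implementations of RI and NMI are available in scikit-learn, and Purity is easy to compute directly. The following sketch, with a toy labeling and assuming integer-encoded class labels, illustrates all three:

```python
import numpy as np
from sklearn.metrics import rand_score, normalized_mutual_info_score

def purity(y_true, y_pred):
    """Overall purity: each cluster is credited with its majority class
    (equations (7) and (8))."""
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        total += np.bincount(members).max()   # size of the majority class
    return total / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])
print(purity(y_true, y_pred))                        # Purity: 0.833...
print(rand_score(y_true, y_pred))                    # Rand Index
print(normalized_mutual_info_score(y_true, y_pred))  # NMI
```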

5.2. Parameter Settings

To investigate the performance of GHC, experiments are performed on the synthetic datasets shown in Figure 4 and the real-life datasets tabulated in Table 1. Six well-known clustering algorithms, namely, K-means [10], K-means++ [37], Spectral Clustering (SC) [38], DBSCAN [13], Birch [39], and LGC [25], are employed for comparison with the GHC algorithm.

The well-tuned parameter settings of the GHC algorithm and the competing algorithms are tabulated for each data set in Table 2. For K-means, K-means++, and SC, the parameter τ is the number of classes in each data set. For the SC algorithm, the kernel parameter is sought from the set {0.01, 0.1, 0.5, 1, 1.5, 2, 5, 10, 20, 70, 80, 100, 150}, and the value with the best RI is selected. DBSCAN has two parameters, eps (e) and min-samples (m). The parameter e is chosen from the interval from 0.1 to 1 with an increment of 0.1, and the parameter m is chosen from the set {4, 5, 6, 8, 10, 15, 20}; the parameter pair (e, m) with the best RI value is taken as the well-tuned setting. Birch has two tunable parameters, threshold (t) and branching-factor (b) [39]. The parameter t is varied from 0.1 to 1 with an increment of 0.1, and the parameter b is chosen from the set {10, 20, 40, 50, 60, 100, 150, 200}; the parameter pair (t, b) with the best RI value is selected for each data set. For the three parameters IM (i), kn (n), and cFactor (f) of the LGC algorithm, the value of i is chosen from the set {5, 8, 10, 20, 30, 40, 50, 60, 70, 80, 150, 250}, the value of n is chosen from the set {4, 5, 7, 10, 15, 20}, and the value of f is varied from 0 to 1 with an increment of 0.1; the parameter triple (i, n, f) with the best RI value is chosen for each dataset.

To demonstrate the performance of the GHC algorithm, the three parameters γ, θ, and k are set to 0.2, 0.1, and 6, respectively, for all synthetic data sets. For the real-world data sets, the triple (γ, θ, k) with the best RI value is chosen for each dataset: the tunable parameter γ is varied from −1 to 10 with an increment of 0.1, the parameter θ is chosen from −1 to 1 with an interval of 0.1, and the parameter k is chosen from the set {4, 5, 6, 10}. For all nondeterministic approaches, we run the algorithms 100 times on each data set and adopt the average of each performance criterion to evaluate the performance. For all deterministic approaches, the performance metrics are obtained from a single run.
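For reference, the tuning protocol described above amounts to an exhaustive grid search over (γ, θ, k) scored by RI; a sketch using the hypothetical ghc() pipeline from Section 4 is given below.

```python
import itertools
import numpy as np
from sklearn.metrics import rand_score

def tune_ghc(X, y_true):
    """Sketch of the tuning protocol above: search the stated grids for
    (gamma, theta, k) and keep the triple with the best Rand Index."""
    gammas = np.arange(-1.0, 10.0 + 1e-9, 0.1)
    thetas = np.arange(-1.0, 1.0 + 1e-9, 0.1)
    ks = [4, 5, 6, 10]
    n_clusters = len(np.unique(y_true))
    best_ri, best_params = -1.0, None
    for g, t, k in itertools.product(gammas, thetas, ks):
        labels = ghc(X, k=k, gamma=g, theta=t, n_clusters=n_clusters)
        ri = rand_score(y_true, labels)
        if ri > best_ri:
            best_ri, best_params = ri, (g, t, k)
    return best_ri, best_params
```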

5.3. Experiments on Synthetic Datasets

In order to investigate the performance of the proposed approach, a series of experiments on the twelve synthetic datasets shown in Figure 4 is performed using the proposed GHC and the other existing algorithms. The performance results are tabulated in Table 3, in which the first column denotes the dataset, the first row denotes the algorithm, and the remaining fields contain the evaluation results. The values of Purity, RI, F-measure, and NMI all range from 0 to 1, and higher values denote better performance on a dataset. Although the synthetic datasets are intuitively easy to cluster, not all the clustering algorithms achieve remarkable performance in this study. Overall, the GHC, DBSCAN, and LGC algorithms are more competitive than the other algorithms. The GHC algorithm obtains good clustering results on all synthetic datasets, while the DBSCAN and LGC algorithms perform worse on a few of them.

5.4. Experiments on Real-Life Datasets

In order to investigate the performance further, the proposed GHC algorithm and the other competing approaches are applied to the clustering problems on the real-world datasets tabulated in Table 1. For each real-life dataset, the well-tuned parameters of all algorithms are also tabulated in Table 2. The performance results are shown in Table 4, in which the first row denotes the algorithms, the first column denotes the real-life datasets used in the experiments, and the remaining fields contain the evaluation results for the GHC algorithm and the other existing algorithms on each dataset. The GHC algorithm obtains the best values of all evaluation criteria on datasets such as BTissue, Iris, and Wine. On the other datasets, the evaluation results of the GHC algorithm are the best or close to the best for the four evaluation criteria. Overall, the GHC algorithm outperforms the other competing algorithms on these real-world datasets.

5.5. Discussions

In this subsection, we mainly discuss the role and impact of the parameters on the performance of the GHC algorithm. There are three tunable parameters, γ, θ, and k, which must be determined for GHC. The parameter γ determines the masses of the data points, which directly affects the gravitational forces and thereby controls the structure of the gravitational graph as the forces vary. The second parameter θ controls the number of subgraphs into which the gravitational graph is partitioned. The third parameter k determines the sparsity and connectivity of the gravitational graph. As shown in the previous subsection, the GHC algorithm performs better than the state-of-the-art algorithms on all synthetic datasets, even though these parameters are set to the fixed values (γ = 0.2, θ = 0.1, k = 6), which may not be optimal.

To illustrate the impact of the parameters γ, θ, and k, we conduct a series of experiments on the real-world datasets to analyze the influence of each parameter on the clustering performance of the GHC algorithm. The prior knowledge of the real-world datasets can be used to search for the optimal parameters with the best values of the evaluation metrics.

Figure 5 shows how the evaluation metrics Rand Index and Purity change as the parameter γ varies in a given interval. The parameter γ varies from 0 to 10 with an increment of 0.1 for all datasets except the SControl dataset, for which it varies from 0 to 250 with an increment of 5. It can be noticed that the performance values fluctuate within an interval as γ increases, and the evaluation result on each dataset converges to a fixed value when γ moves beyond that interval. Because γ determines the mass of each data point by equation (3), the gravitational force between two data points differs significantly when γ is set to different values. In essence, the different distributions of the data points' masses affect the gravitational forces between them and lead to different clustering performance.

Figure 6 shows how the best values of the evaluation metrics Rand Index and Purity change as the parameter θ takes different values from −1 to 1 with an increment of 0.1. From Figure 6, it can be noticed that the evaluation values increase on the general trend as the threshold θ increases on most of the real-world datasets. The reason is that data points of different clusters are grouped into the same cluster in the second stage of the GHC algorithm when θ is set to a low value; in contrast, the performance of the GHC algorithm is better when θ is set to a high value because data points of different clusters can then be separated correctly.

Figure 7 shows how the best values of the evaluation metrics Rand Index and Purity change as the parameter k takes different values. In general, the performance metric Purity decreases as k increases. From Figure 7, there is a single peak at which the Rand Index reaches its maximum as k changes from 1 to 20. From the above analysis, the GHC algorithm can achieve good performance when the three parameters are set to suitable values for each dataset.

6. Conclusions

In this paper, we propose a novel gravity-based clustering approach that fully utilizes the vector properties of the gravitational force. To some extent, the data gravitational force can be considered a similarity measure that takes not only density but also distance into account. To illustrate the performance of the GHC algorithm, experiments with well-tuned parameters have been conducted on synthetic and real-life datasets in comparison with other well-known clustering algorithms. The experimental results show that the GHC algorithm is robust and achieves competitive performance. It can also be noticed that the time complexity of the GHC algorithm is high; this problem can be addressed in future work. Meanwhile, the GHC algorithm can be applied in more application fields.

Data Availability

The data used to support the findings of this study have been deposited in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) and the figshare database (https://doi.org/10.6084/m9.figshare.8187623.v1).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61672136 and 11872323).