Abstract

Determining the appropriate number of clusters without any prior information is a core problem in cluster analysis. In this paper, we propose a nonparametric clustering method based on different weighted spatial rank (WSR) functions. The main idea behind WSR is to define a dissimilarity measure locally based on a localized version of multivariate ranks. We consider a nonparametric Gaussian kernel weight function. We compare the performance of the method with other standard techniques and assess its misclassification rate. The method is completely data-driven, robust against distributional assumptions, and accurate for the purpose of intuitive visualization, and it can be used both to determine the number of clusters and to assign each observation to its cluster.

1. Introduction

In recent years, there has been a significant advancement in clustering methods in the field of data science and machine learning. Density-based spatial clustering of applications with noise (DBSCAN) has become one of the popular nonparametric clustering algorithms that group together points that are close to each other and are surrounded by low-density areas [1]. Also, hierarchical clustering [2] has been widely used to build a hierarchy of clusters by either merging smaller clusters into larger ones or splitting larger clusters into smaller ones. The K-means clustering [3] remains one of the most popular partitioning methods, where the data are partitioned into K distinct, nonoverlapping clusters.

In addition, spectral clustering [4] has gained popularity due to its ability to use the eigenvectors of the similarity matrix to partition the data into clusters. Affinity propagation [5] assigns each data point to an exemplar, which is a representative point of a cluster, and iteratively updates the exemplars until convergence. Fuzzy clustering [6] assigns each data point a probability of belonging to each cluster, rather than assigning it to a single cluster. The K-medoids algorithm introduced by [7] uses actual data points as representatives or medoids for each cluster, and the medoids selected from the dataset are typically the most centrally located points within a cluster. This makes K-medoids more robust to outliers and noise compared to K-means. Another recent method is distance density clustering (DDC), introduced by [8], a medoid-based clustering approach that takes the density of time series data into account and provides clustering results in a hierarchical fashion.

The clustering by the fast search and find of density peaks (densityClust) method [9] is a density-based clustering method that aims to identify clusters based on the local density of data points and the distances between them, which makes it suitable for datasets with irregularly shaped clusters and varying densities. Furthermore, neighborhood grid clustering (NGC) introduced by [10] is a density-based clustering algorithm that aims to identify clusters based on the local density of data points and their spatial relationships. It utilizes a grid-based approach to partition the data space into cells and then assigns data points to their corresponding cells. The algorithm then determines the cluster assignments by considering the density and spatial proximity of the points within each cell.

These methods have their own strengths and weaknesses and are suitable for different types of data and applications. The choice of the appropriate clustering method depends on various factors such as the size and nature of the data, the desired number of clusters, and the research question being addressed.

Further advancements have been made in the use of spatial ranks to analyse multivariate data, with the development of nonparametric methods being particularly notable [11–13]. They have a number of attractive features, including being distribution-free and easy to compute. Furthermore, the traditional spatial rank function gives information about how central each observation is and its direction in relation to the centre. However, it does not capture the distances between each pair of observations, which are important for cluster analysis.

Recently, Baragilly and Chakraborty [11] utilized spatial ranks as a clustering tool, using a forward search methodology based on nonparametric multivariate spatial rank functions to determine the number of clusters in the data. Their method does not depend on the choice of the initial subsample and has been shown to perform well in different mixture distributions. This work has been extended to clustering functional datasets in medical applications [14].

This paper proposes a novel approach to utilizing spatial ranks for clustering, using a nonparametric weighted spatial ranks function that takes into account the distances between each pair of observations as weights and defines a dissimilarity measure based on spatial ranks. By measuring the distances between each pair of observations instead of their central tendency, it becomes easier to segment a given set of data into a specific number of clusters.

The main idea behind the weighted spatial ranks (WSRs) is to define a dissimilarity measure based on a localized version of spatial ranks, such that the weighted ranks can be used as a classifier and a confirmatory tool to determine the number of clusters and assign each observation to its cluster. Proper selection of a weight function can lead to better identification of clusters, and kernel weights are a popular choice in pattern analysis, classification, cluster analysis, machine learning, and support vector machines.

The paper also demonstrates how weighted spatial ranks may be used for the purpose of visualizing the clusters, so that the number of clusters may be determined using weighted rank contours for a low-dimensional input space after dimension reduction.

Section 2 of the paper introduces the weighted spatial rank function and evaluates its use for different parametric and nonparametric weight functions. Section 3 demonstrates the weighted rank-based clustering algorithm and proposes a confirmatory classifier based on weighted ranks that can be used to assign observations to the most appropriate cluster for two-dimensional data. Section 4 extends the weighted rank-based clustering algorithm to higher-dimensional data, and Section 5 provides numerical examples based on both simulated and real datasets to examine the performance of the proposed algorithm. The algorithm is compared with other clustering methods in Section 6, and concluding remarks are presented in Section 7.

2. Weighted Spatial Rank Functions

In this section, we propose two different weighted spatial rank functions. Suppose that $X$ has a $d$-dimensional distribution $F$, which we assume to be continuous hereon; then the unweighted spatial rank function of the point $\mathbf{x} \in \mathbb{R}^d$ with respect to $F$ can be defined as

$$R_F(\mathbf{x}) = E_F\left[u(\mathbf{x} - X)\right], \qquad (1)$$

where $u(\mathbf{v}) = \mathbf{v}/\lVert\mathbf{v}\rVert$ for $\mathbf{v} \neq \mathbf{0}$, $u(\mathbf{0}) = \mathbf{0}$, and $\lVert\cdot\rVert$ is the Euclidean norm. For a random sample $\mathbf{x}_1, \ldots, \mathbf{x}_n$ from $F$ with empirical distribution function $F_n$, the corresponding empirical spatial rank is $R_{F_n}(\mathbf{x}) = n^{-1}\sum_{i=1}^{n} u(\mathbf{x} - \mathbf{x}_i)$.

The first weighted spatial rank function of the point $\mathbf{x}$ with respect to $F_n$ is a vector function and can be defined as

$$R_{W_1}(\mathbf{x}; F_n) = \frac{\sum_{i=1}^{n} w(\mathbf{x}, \mathbf{x}_i)\, u(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{n} w(\mathbf{x}, \mathbf{x}_i)}, \qquad (2)$$

where $w(\cdot, \cdot) \ge 0$ is a weight function.

The second weighted spatial rank function can be defined as

$$R_{W_2}(\mathbf{x}; F_n) = \frac{1}{n}\sum_{i=1}^{n} w(\mathbf{x}, \mathbf{x}_i)\, u(\mathbf{x} - \mathbf{x}_i), \qquad (3)$$

and their $L_2$ norms are

$$\lVert R_{W_1}(\mathbf{x}; F_n)\rVert \qquad (4)$$

and

$$\lVert R_{W_2}(\mathbf{x}; F_n)\rVert. \qquad (5)$$

Note that the difference between (2) and (3) lies in the scale (denominator), which in (2) is the sum of the weights $\sum_{i=1}^{n} w(\mathbf{x}, \mathbf{x}_i)$ and is therefore data dependent, whereas in (3) it is $n$, which is data independent.

Kernel weight functions are often used in nonparametric estimation and, as already indicated, in a range of classification and pattern recognition problems. Here, we consider the Gaussian kernel weight, which is one of the most commonly used nonparametric kernel weight functions; it is defined as follows:

$$w(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}_i\rVert^2}{2\sigma^2}\right), \qquad (6)$$

where $\sigma > 0$ is a bandwidth parameter, $\lVert\cdot\rVert$ is the Euclidean norm, and $u(\mathbf{v}) = \mathbf{v}/\lVert\mathbf{v}\rVert$ is the direction of the $d$-dimensional vector $\mathbf{v}$ (for comprehensive details on kernel weights, see Souza [15]).
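The following is a minimal sketch, not the authors' reference implementation, of the empirical weighted spatial rank functions in (2)–(5) with the Gaussian kernel weight in (6). The helper names (`unit`, `wsr1`, `wsr2`) and the bandwidth `sigma` are illustrative assumptions.

```python
import numpy as np

def unit(v, eps=1e-12):
    """Direction u(v) = v / ||v||, with u(0) taken to be the zero vector."""
    norm = np.linalg.norm(v)
    return v / norm if norm > eps else np.zeros_like(v)

def gaussian_weight(x, xi, sigma=1.0):
    """Gaussian kernel weight based on the Euclidean distance between x and xi, as in (6)."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def wsr1(x, data, sigma=1.0):
    """First weighted spatial rank (2): weighted directions scaled by the sum of the weights."""
    w = np.array([gaussian_weight(x, xi, sigma) for xi in data])
    dirs = np.array([unit(x - xi) for xi in data])
    return (w[:, None] * dirs).sum(axis=0) / w.sum()

def wsr2(x, data, sigma=1.0):
    """Second weighted spatial rank (3): weighted directions scaled by the sample size n."""
    w = np.array([gaussian_weight(x, xi, sigma) for xi in data])
    dirs = np.array([unit(x - xi) for xi in data])
    return (w[:, None] * dirs).sum(axis=0) / len(data)

def wsr1_norm(x, data, sigma=1.0):
    """L2 norm (4), used for the contour plots."""
    return np.linalg.norm(wsr1(x, data, sigma))

def wsr2_norm(x, data, sigma=1.0):
    """L2 norm (5), used for the contour plots and the classifier."""
    return np.linalg.norm(wsr2(x, data, sigma))
```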

3. Weighted Spatial Ranks Clustering Algorithm

We now introduce the weighted spatial ranks clustering algorithm, starting with the bivariate case (d = 2) before considering the higher dimensional case (d > 2).

3.1. Weighted Spatial Ranks Clustering Algorithm for a Bivariate Case (d = 2)
(1) Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be a random sample with two variables $X_1$ and $X_2$, and let $G = G_1 \times G_2$ be the Cartesian product of two equally spaced sets $G_1$ and $G_2$, so that each $\mathbf{g} \in G$ is a two-dimensional vector.
(2) For each $\mathbf{g} \in G$, we calculate the weighted spatial rank norm with respect to the sample as

$$\lVert R_{W}(\mathbf{g}; F_n)\rVert = \left\lVert \frac{1}{n}\sum_{i=1}^{n} w(\mathbf{g}, \mathbf{x}_i)\, u(\mathbf{g} - \mathbf{x}_i) \right\rVert, \qquad (7)$$

where $w(\mathbf{g}, \mathbf{x}_i)$ is the Gaussian kernel weight in (6) and $u(\mathbf{g} - \mathbf{x}_i) = (\mathbf{g} - \mathbf{x}_i)/\lVert\mathbf{g} - \mathbf{x}_i\rVert$.
(3) Plot the weighted spatial ranks contour based on $G$ and $\lVert R_{W}(\mathbf{g}; F_n)\rVert$, and then determine the number of clusters $K$ from the contour lines (a code sketch of steps (1)–(3) is given after this list).
(4) Based on the contour lines, specify the assigned observations in each cluster. A lower contour level can be used for better visualization.
(5) Use the weighted spatial rank classifier's rule defined in Section 3.2 to confirm the assignment of each observation and allocate the unassigned observations to the proper cluster.
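As a rough illustration of steps (1)–(3), the sketch below evaluates the weighted spatial rank norm over a rectangular grid and draws the contour from which the number of clusters is read off; the grid resolution, bandwidth, and contour levels are arbitrary illustrative choices, and `wsr2_norm` refers to the helper sketched in Section 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def wsr_contour(data, wsr_norm, sigma=1.0, grid_size=60, levels=15):
    """Evaluate ||WSR|| over a Cartesian grid spanning the data and plot its contour."""
    g1 = np.linspace(data[:, 0].min(), data[:, 0].max(), grid_size)
    g2 = np.linspace(data[:, 1].min(), data[:, 1].max(), grid_size)
    G1, G2 = np.meshgrid(g1, g2)                 # equally spaced Cartesian product grid
    Z = np.empty_like(G1)
    for i in range(grid_size):
        for j in range(grid_size):
            g = np.array([G1[i, j], G2[i, j]])   # two-dimensional grid point
            Z[i, j] = wsr_norm(g, data, sigma)   # weighted spatial rank norm at g
    plt.contour(G1, G2, Z, levels=levels)
    plt.scatter(data[:, 0], data[:, 1], s=8, c="k")
    plt.show()
    return G1, G2, Z
```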
3.2. Weighted Spatial Ranks Classifier

Suppose that we have $k$ groups with distributions $F_1, F_2, \ldots, F_k$; then, based on the second WSR function in (3), we can assign a $d$-dimensional observation vector $\mathbf{x}$ to the $j$-th group if

$$\lVert R_{W_2}(\mathbf{x}; F_j)\rVert = \max_{1 \le i \le k} \lVert R_{W_2}(\mathbf{x}; F_i)\rVert, \qquad (8)$$

where $j \in \{1, 2, \ldots, k\}$. Note that, had we used $R_{W_1}$ instead, then since $\lVert R_{W_1}(\mathbf{x}; F_i)\rVert$ increases outward from the spatial median, we would have the rule

$$\lVert R_{W_1}(\mathbf{x}; F_j)\rVert = \min_{1 \le i \le k} \lVert R_{W_1}(\mathbf{x}; F_i)\rVert. \qquad (9)$$

So, after determining the number of clusters using the WSR contour, (8) may be used to assign each observation to the most suitable cluster.
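A hedged sketch of the rule in (8): each observation is placed in the group whose fitted $\lVert R_{W_2}\rVert$ is largest. The helper `wsr2_norm` is the one sketched in Section 2, and the groups are assumed to be given as arrays of the observations already assigned to each cluster from the contour step.

```python
import numpy as np

def wsr_classify(x, groups, wsr2_norm, sigma=1.0):
    """Assign x to the cluster with the largest second weighted spatial rank norm (rule (8))."""
    scores = [wsr2_norm(x, g, sigma) for g in groups]   # one ||WSR2|| value per candidate cluster
    return int(np.argmax(scores))                       # index of the most suitable cluster
```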

3.3. Confirmatory Analysis Based on the Weighted Spatial Ranks Clustering Algorithm

In order to assess the weighted spatial rank functions' performance, we compare them with other standard methods such as Euclidean distances, Mahalanobis distances, spatial ranks, and spatial depth. A simulation study is used to assess the performance of the proposed weighted ranks-based clustering in defining the group structure in multivariate data. The simulated data are sampled from a bivariate normal mixture distribution that is assumed to cluster into two groups, with mixture proportion p = 0.3 and sample size n = 1000, such that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is a random sample from

$$p\, N_2(\boldsymbol{\mu}_1, \Sigma_1) + (1 - p)\, N_2(\boldsymbol{\mu}_2, \Sigma_2), \qquad (10)$$

where $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the component mean vectors and $\Sigma_1$ and $\Sigma_2$ are the corresponding covariance matrices.
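A minimal sketch of this simulation design is given below; the particular mean vectors and covariance matrices are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 0.3
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])   # assumed means, for illustration only
cov = np.eye(2)                                          # assumed common covariance matrix
in_first = rng.random(n) < p                             # True -> component 1 (proportion p)
sample1 = rng.multivariate_normal(mu1, cov, n)
sample2 = rng.multivariate_normal(mu2, cov, n)
data = np.where(in_first[:, None], sample1, sample2)     # bivariate normal mixture sample
```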

All the contour plots that follow are derived from a random sample of 1000 observations simulated from the bivariate normal mixture distribution in (10). Figure 1 shows the contour plots of the Euclidean distances, Mahalanobis distances, spatial ranks, and the spatial depth. The figure clearly shows that the contours produced fail to map the shape of the two-cluster structure in the bivariate mixture distribution.

In Figure 2, $\lVert R_{W_1}\rVert$ (defined in (4)) and $\lVert R_{W_2}\rVert$ (defined in (5)) are used to derive the contour plots based on the nonparametric Gaussian kernel weight function defined in (6). In general, Figure 2 reveals that, compared with Figure 1, the contours produced from both $\lVert R_{W_1}\rVert$ and $\lVert R_{W_2}\rVert$ based on the Gaussian kernel function capture more of the structure of the simulated data. However, the contour based on $\lVert R_{W_1}\rVert$ failed to detect some of the observations that are not close to either of the clusters. These undetected observations are indicated by some lines between the two clusters, suggesting the potential presence of a third cluster. In contrast, it is clear that the cluster structure of the data is better defined by using $\lVert R_{W_2}\rVert$ than by the $\lVert R_{W_1}\rVert$ contour in Figure 2(a).

This is because the scale of $R_{W_1}$ in (2) is the sum of the weights, so that $\lVert R_{W_1}\rVert$ behaves like the norm of the ordinary spatial rank and its values increase outward from the centre, or spatial median, of the cluster. Thus, unassigned points may be assigned to the cluster which results in the lowest weighted rank for the point. In contrast, the contour derived from the second normed weighted spatial rank function, $\lVert R_{W_2}\rVert$, defined in (5), is based on values that decrease outward from the spatial median. Thus, the larger the weighted spatial rank value for an individual point, the closer it is to the centre of the cluster.

In summary, $\lVert R_{W_2}\rVert$ is more successful than $\lVert R_{W_1}\rVert$ at capturing the cluster structure of the simulated data. Overall, the most accurate contour plot is obtained when $\lVert R_{W_2}\rVert$ is used with Gaussian kernel weights.

4. Weighted Spatial Ranks Based Clustering Algorithm for Higher Dimensions (d > 2)

For real-world datasets, we often have to analyse complex multidimensional data, and this makes data visualization and computation more complicated. In such cases, a dimension reduction strategy may be employed, and here we use principal component analysis (PCA) to reduce the dimensionality of the data to two dimensions in order to derive a contour plot (see [16]).

The main idea of using PCA is to find a lower-dimensional subspace that captures most of the variance in the data. Specifically, it involves finding the orthogonal rotation of the axes that maximises the variance. For a $d$-dimensional random variable $X = (X_1, X_2, \ldots, X_d)^{\top}$ with covariance matrix $\Sigma$, let $C_j = \mathbf{a}_j^{\top} X$, where $\mathbf{a}_j$ is a $d$-dimensional vector of constants. Since the variance of $C_j$ is $\mathbf{a}_j^{\top}\Sigma\,\mathbf{a}_j$, finding the first principal component requires finding

$$\max_{\mathbf{a}}\ \mathbf{a}^{\top}\Sigma\,\mathbf{a}.$$

It is necessary to constrain $\mathbf{a}$ to have unit length ($\mathbf{a}^{\top}\mathbf{a} = 1$) to ensure finite values. This may be solved by using the method of Lagrange multipliers, so that, for a Lagrange multiplier $\lambda$, the problem reduces to solving the eigen equation

$$\Sigma\mathbf{a} = \lambda\mathbf{a}.$$

It also means that, for the component which yields the largest eigenvalue $\lambda_1$, $C_1 = \mathbf{a}_1^{\top} X$ has the largest variance and $\mathbf{a}_1$ is the corresponding eigenvector. It is also straightforward to show that the $\mathbf{a}_j$ for $j = 1, \ldots, d$ are orthogonal. Thus, for $A$ the matrix of eigenvectors, the matrix of principal component scores $C$ is given by $C = XA$. This represents a rigid rotation of the $X$ axes to an orientation of maximum variance. Thus, the first principal component $C_1$ has the largest variance for $X$, the second component $C_2$ has the second largest variance, and so on (for more details on PCA, see [17]).
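As a small numerical check of this construction (not part of the original exposition), the eigendecomposition of the sample covariance matrix gives the rotation $A$ and the scores $C = XA$ once the data have been mean-centred; the simulated input below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(4), np.diag([4.0, 3.0, 1.0, 0.5]), size=200)
Xc = X - X.mean(axis=0)                         # centre the data
cov = np.cov(Xc, rowvar=False)                  # sample covariance matrix
eigvals, A = np.linalg.eigh(cov)                # solves  cov @ a = lambda * a
order = np.argsort(eigvals)[::-1]               # reorder by decreasing eigenvalue (variance)
eigvals, A = eigvals[order], A[:, order]
C = Xc @ A                                      # principal component scores
explained = eigvals / eigvals.sum()             # proportion of total variance per component
```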

4.1. Weighted Spatial Ranks Clustering Algorithm When d > 2
(1) Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be a $d$-dimensional random sample; then use PCA to obtain the first two components $C_1$ and $C_2$ of that sample and construct the $n \times 2$ matrix $[C_1, C_2]$ consisting of the two components $C_1$ and $C_2$.
(2) Consider $C_1$ and $C_2$ as the new variables and perform the steps of the weighted spatial ranks clustering algorithm for $d = 2$ (see the sketch after this list).
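A hedged sketch of this two-step procedure, reusing the `wsr_contour` and `wsr2_norm` helpers sketched earlier; the PCA step is implemented directly with an eigendecomposition rather than a library routine.

```python
import numpy as np

def wsr_cluster_highdim(X, wsr_contour, wsr2_norm, sigma=1.0):
    """Reduce X to its first two principal components, then run the d = 2 WSR contour step."""
    Xc = X - X.mean(axis=0)
    eigvals, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    C = Xc @ A[:, order[:2]]                     # n x 2 matrix of the first two component scores
    return wsr_contour(C, wsr2_norm, sigma=sigma)
```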

5. Numerical Examples

In this section, we apply the weighted spatial ranks-based clustering algorithm to two simulated datasets and three real datasets.

5.1. Simulated Data Examples

In the first simulated data example, we consider a mixture of three quadvariate (4-dimensional) normal distributions, with mixing proportions $p_1$, $p_2$, and $p_3 = 1 - p_1 - p_2$ and sample size $n$, such that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is a random sample from the 4-dimensional normal mixture distribution

$$p_1\, N_4(\boldsymbol{\mu}_1, \Sigma_1) + p_2\, N_4(\boldsymbol{\mu}_2, \Sigma_2) + p_3\, N_4(\boldsymbol{\mu}_3, \Sigma_3),$$

where $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\mu}_3$ are the component mean vectors and $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$ are the corresponding covariance matrices.

Figure 3(a) shows the scatterplot matrix of the principal components and reveals a mixed picture. It is clear from the component 1 versus component 2 panel that there are 3 clusters. In contrast, the component 2 versus component 3 and the component 2 versus component 4 panels suggest that there are only 2 clusters. Figure 3(b) gives the proportion of the total variance explained by each component; 97% of the total variance is explained by the first two components. Figures 3(c) and 3(d) demonstrate that the weighted spatial ranks contour accurately fits the shape of the three clusters without any misclassification. Finally, in Figures 3(e) and 3(f), the confirmatory plots based on the weighted ranks classifier for the first two components show that the observations have been correctly assigned to the three simulated clusters.

In the second example, we simulate a random sample of size n = 100 from a mixture of four 6-dimensional normal distributions with equal mixing proportions p = 0.25, i.e.,

$$0.25\, N_6(\boldsymbol{\mu}_1, \Sigma_1) + 0.25\, N_6(\boldsymbol{\mu}_2, \Sigma_2) + 0.25\, N_6(\boldsymbol{\mu}_3, \Sigma_3) + 0.25\, N_6(\boldsymbol{\mu}_4, \Sigma_4),$$

with mean vectors $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_4$ and covariance matrices $\Sigma_1, \ldots, \Sigma_4$.

From the scatter plot matrix in Figure 4(a), we can see that, whilst there are four clear clusters in the panels plotting component 1 against the other components, the number of clusters is less clear in the remaining panels. However, it is clear from Figure 4(b) that the first two components explain the majority (98%) of the variance.

The weighted spatial ranks contour plots shown in Figures 4(c) and 4(d) clearly reveal the shape of the four clusters, where, in the latter, a lower contour level of 0.001 has been used. Finally, the confirmatory plots based on the weighted spatial ranks classifier for the first two components and the original data, shown in Figures 4(e) and 4(f), demonstrate the correct assignment of the simulated observations to the right clusters.

5.2. Real Datasets’ Examples

In this subsection, the algorithm is applied to three real datasets: the iris data [18], financial data [19], and Old Faithful geyser data [20, 21]. The iris dataset consists of three different types of irises (Setosa, Versicolour, and Virginica). However, most clustering techniques consider there to be two groups, since iris Virginica and iris Versicolour are not separable without the species information that Fisher used. As we can see from Figure 5(c), the weighted ranks contour based on the first two components, which explain 97.8% of the total variance, indicates two clusters. The confirmatory plots in Figures 5(e) and 5(f) assign all of the observations to two groups.

The second real dataset is the financial data [19], which contain measurements of three variables monitoring the performance of 103 investment funds operating in Italy since April 1996 (Table A.16 of Atkinson et al. [19]). These data include two different kinds of funds (stock funds and balanced funds). From Figure 6(c), the weighted ranks contour of components 1 and 3, which explain 96.4% of the total variance, suggests there are two clusters. Moreover, the confirmatory plots provide a valid assignment of the observations, which is consistent with the two types of funds.

The third dataset, the Old Faithful geyser data, is taken from Azzalini and Bowman [20] and the MASS library of Venables and Ripley [21]. It includes 272 observations on two variables, the waiting time between eruptions and the duration of the eruption, both measured in minutes, for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. This dataset consists of two apparent clusters, the short and the long eruptions. From Figures 7(b) and 7(c), it can be seen that the weighted ranks contour of the data indicates two clusters with an unassigned observation (number 174). Using the confirmatory classifier shown in Figure 7(d), observation 174 is correctly assigned to the second cluster.

6. Comparison with Other Clustering Methods

The WSRN method determines the number of clusters in a dataset and classifies the data into each of the clusters. In this section, we compare the WSRN method with other clustering and classification methods.

The first method is the model-based clustering “mclust” [22]. This is based on a Gaussian mixture model GMM [23] where the number of clusters corresponds to the model which returns the largest Bayesian information criterion (BIC). The second method is the K-means algorithm combined with the Calinski–Harabasz (CH) index [24]. The number of clusters that returns the highest CH index is selected before applying the K-means [25] algorithm to classify the data.

The third method used as a comparator is the high-dimensional data clustering (HDDC) [26] which is again a clustering method based on the Gaussian mixture model where the BIC is used to select the number of clusters. The fourth method used is the mixture of probabilistic principal component analyses “MixtPPCA” [27] where the number of clusters corresponds to the largest BIC. The fifth method for comparison is the partitioning around medoids “PAM” clustering [28] method, where the number of clusters is estimated based on the optimum average silhouette width [29]. The sixth method used in the comparison is the density-based spatial clustering of applications with noise (DBSCAN), where the number of clusters is estimated using a density-based approach to identify regions of high density in the data and these are considered as clusters [1]. Other methods that have been used in the comparison are KMD: clustering with K-medoids [7], FCM: fuzzy C-means clustering [30], GG: Gath–Geva clustering algorithm [31], DDC: distance density clustering [8], SNN: clustering with shared nearest neighbor clustering [32], and densityClust: clustering by fast search and find of density peaks [9].

Each method was applied to the three real datasets in Section 5. As the external classes are known, the different clustering methods were compared using the purity, entropy, and the misclassification rate. Although the purity and entropy are external validation methods commonly used in classification, they measure the homogeneity of the data in clusters and do not penalize algorithms that identify the incorrect number of clusters. Indeed if each cluster is homogeneous for a particular class, both the purity and entropy will assign a perfect score (1 for purity, 0 for entropy) even if the number of clusters is incorrect. The following misclassification rate does penalize algorithms which identify the incorrect number of clusters.

Let there be $n$ data points with $r$ true classes, $T = \{T_1, T_2, \ldots, T_r\}$, for which the algorithm identifies $k$ clusters, $C = \{C_1, C_2, \ldots, C_k\}$. Let $A = \{1, 2, \ldots, k\}$ and $B = \{1, 2, \ldots, r\}$, and let $n_{ij}$ denote the number of points in cluster $C_i$ that belong to true class $T_j$. The misclassification rate $H$ is defined as

$$H = 1 - \frac{1}{n}\max\left(\sum_{i \in A,\, j \in B} n_{ij}\right),$$

subject to the constraint that if two terms $n_{ij}$ and $n_{tu}$ appear in the sum, then $i = t$ if and only if $j = u$ [14, 33].

Thus, each row and each column of the $k \times r$ matrix $[n_{ij}]$ contribute at most one element to the sum. A consequence of this is that an entry $n_{ij}$ contributes nothing if another entry from the same row or column is one of the terms that maximises the sum in parentheses. Also, when there is only one cluster, the sum contains one term only.
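A hedged sketch of one way to compute $H$ as defined above: build the contingency matrix of clusters against true classes and select at most one entry per row and per column so as to maximise the matched count, here via the Hungarian algorithm in SciPy. The function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_rate(true_labels, cluster_labels):
    """H = 1 - (matched count under the best one-per-row, one-per-column selection) / n."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    n_ij = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                      for t in classes] for c in clusters])
    rows, cols = linear_sum_assignment(n_ij, maximize=True)   # optimal cluster-to-class matching
    return 1.0 - n_ij[rows, cols].sum() / len(true_labels)
```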

The adjusted Rand index (ARI) is another commonly used metric for evaluating the performance of clustering algorithms [34]. While H compares clusters based on set matching, ARI assesses clusters by counting point pairs where there is agreement or disagreement. Moreover, ARI takes into account the expected value of the unadjusted Rand index, which is determined by randomly selecting entries in the contingency table with fixed column and row totals.
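For reference, the ARI is available off the shelf, for example in scikit-learn; the toy labels below simply illustrate that it is invariant to relabelling of the clusters.

```python
from sklearn.metrics import adjusted_rand_score

true_labels    = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 2]      # same partition, different label names
print(adjusted_rand_score(true_labels, cluster_labels))   # 1.0: perfect agreement
```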

Other cluster validity indices can be used to evaluate the goodness of the different clustering structures such as the connectivity index [35], CS index [36], and Sym index [37] (for more extensive details, see [38]).

Table 1 shows the results of the different algorithms applied to the iris dataset, where nine of the twelve methods recorded perfect scores for the purity and entropy, despite only 4 algorithms identifying the correct number of clusters. For these data, the WSR, mclust, and densityClust have the best (lowest) misclassification rate H, and each has perfect entropy, purity, and ARI scores.

For the financial dataset, as shown in Table 2, the WSR and seven other algorithms share the joint lowest misclassification rate. The HDDC algorithm records the best scores for the purity and entropy, but it identifies an incorrect number of clusters. Based on the ARI, the WSR, K-means, MixtPPCA, PAM, KMD, FCM, and GG record the best scores. Finally, for the Old Faithful dataset, Table 3 shows that the WSR algorithm has the joint third lowest H, but the HDDC and GG algorithms are the best across all four metrics for this dataset.

7. Concluding Remarks

In this paper, we have introduced a new clustering method based on weighted spatial ranks. The WSRN algorithm is completely data-driven and it both determines the number of clusters and classifies the data. As a nonparametric method, it does not require any assumptions to be made on the underlying distribution(s) of the data. The synthesis of weighted rank contours, based on principal components analysis when the data have more than two dimensions, allows the intuitive visualization of the cluster structure in relation to the distribution of the data points.

We considered nonparametric kernel weights and we introduced WSRN functions based on Gaussian kernel weights. Compared to other standard approaches, the WSRN function based on the Gaussian kernel weights provided the best results in terms of cluster detection and visualization. The weighted rank contours based on Gaussian weights were more accurate and provided the best fit to the shape of the clusters’ structure. They captured each observation carefully and assigned it to the proper group with a minimal probability of misclassification. It also performed competitively with other methods when clustering and classifying the data from three real datasets.

Although the WSRN method is invariant under orthogonal transformations, it is not affine invariant. Using affine invariant ranks has the potential to improve the results if the scales of different clusters are not similar [39]. A further possible extension to the method would be to consider generalizations of the Euclidean norm for estimating the WSRN. Thus, different $L_p$ norms for $p \ge 1$ could be investigated to establish the optimal value of $p$.

Data Availability

The data used to support the findings of the study are available in the public domain and are appropriately referenced in this article.

Disclosure

This paper is a modified version of part of the published work in MB’s Thesis (Baragilly [40]).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Biman Chakraborty for his helpful discussions and suggestions relating to this work, and BHW and MB were supported by a Clinician Scientist award with the Medical Research Council, UK (MR/N007999/1).