Abstract

In this paper, a correlation index-based clustering algorithm (COIN) is proposed for clustering categorical data. The proposed algorithm was tested on nine datasets gathered from the University of California at Irvine (UCI) repository. The experiments were conducted in two ways: one by specifying the number of clusters and the other without specifying the number of clusters. The proposed COIN algorithm is compared with five existing categorical clustering algorithms, namely, Mean Gain Ratio (MGR), Min–Min-Roughness (MMR), COOLCAT, K-ANMI, and G-ANMI. The result analysis clearly shows that COIN outperforms the other algorithms. It produced better accuracies for eight datasets (88.89%) and slightly lower accuracy for one dataset (11.11%) when compared individually with the MMR, K-ANMI, and MGR algorithms. It produced better accuracies for all nine datasets (100%) when compared with the G-ANMI and COOLCAT algorithms. When COIN was executed without specifying the number of clusters, it outperformed MGR for 88.89% of the test datasets and produced lower accuracy for 11.11% of the test datasets.

1. Introduction

Categorical data are qualitative data that can be grouped for data analysis. Cluster analysis has been a focus of research since the 1960s and still plays a dominant role in decision making across various domains. Clustering is widely used in the fields of medical diagnostics, academics, bioinformatics, search engines, text mining, statistics, pattern recognition, image processing, cellular manufacturing, etc. Categorical data represent class variables that can take more than two states. For example, categorical data in academia include information about faculty, students, and research scholars. In marketing, categorical data are collected through surveys or questionnaires, and purchase decisions are significantly influenced by the opinions of current users of a product. New customers rely heavily on opinion-rich information available online when making purchase decisions. A recommender system recommends a product based on the opinions of current customers. Since there is a huge demand for the opinions and sentiments of others, recommender systems use clustering algorithms to analyze product recommendations based on various user opinions. For example, all the items available on the eBay online shopping site can be grouped into a set of unique product clusters. In social network analysis, clustering algorithms can be used to identify subgroups within large populations.

There are several methods available for cluster analysis. The major categories are hierarchical, partitioning, model-based, grid-based, and density-based methods. The clustering method is chosen based on the type of data and the application [1]. The hierarchical clustering method is a traditional method that can handle both numerical and categorical data [2, 3], but it is not suitable for clustering larger datasets because of its very high computation cost. Ralambondrainy [4] proposed a k-means-based clustering approach that could handle a greater amount of categorical data; however, it could not process the categorical data directly, so the categorical data had to be converted into binary data for further processing. To overcome this limitation, Huang [5] proposed the k-modes algorithm, which replaces the means with modes of the categorical values and uses a new distance measure for clustering. The computation cost was minimized using frequency-based methods. The k-prototypes method combines the k-means and k-modes algorithms for clustering mixed data types. Huang and Ng [6] proposed a fuzzy k-modes algorithm in which modes were used with a simple matching distance measure; the idea of using modes in fuzzy k-modes was to overcome the limitations of fuzzy k-means. Ganti et al. [7] proposed CACTUS, a fast summarization algorithm for categorical data clustering that requires only two scans of the dataset to form the clusters. It was found suitable for clustering datasets with a very large number of attributes.

Guha et al. [8] proposed the ROCK algorithm, a robust hierarchical clustering algorithm for clustering binary and categorical data. It used the concept of 'links' to find neighboring data points and merged neighboring points with the current point rather than finding the minimum distance between points, which distinguishes it from traditional hierarchical clustering algorithms. It was tested on synthetic and real-life datasets such as Mushroom, Votes, and Mutual Funds and evaluated against traditional hierarchical clustering algorithms. The authors of QROCK, a quick version of this algorithm, claimed that it outperforms the traditional hierarchical clustering algorithm by reducing the time complexity. Barbará et al. [9] proposed an entropy-based incremental algorithm named COOLCAT for clustering large amounts of categorical data and data streams. He et al. [10] proposed a one-pass, threshold-based Squeezer algorithm for clustering categorical data, in which the first instance is added to the first cluster and each subsequent instance is processed sequentially and added either to an existing cluster or to a new cluster, depending on whether the similarity between the instance and the clusters exceeds a threshold value. Ng and Wong [11] proposed a tabu search algorithm for obtaining globally optimal results on categorical data. In other work, He et al. [12] proposed TCSOM, a transactional clustering algorithm based on a self-organizing map for binary transactional data. Notably, the k-modes, k-prototypes, and fuzzy k-modes algorithms produce only locally optimal solutions. Sun et al. [13] proposed an iterative initial-point refinement for the k-modes algorithm, while Kim et al. [14] proposed fuzzy centroids for the conventional fuzzy k-modes. Kim et al. [15] introduced k-populations, a brief communication extending the concept of conventional k-modes clustering. Gan et al. [16] proposed a hybrid genetic fuzzy k-modes algorithm for clustering categorical data. Cao et al. [17] and Bai et al. [18, 19] introduced initialization methods for the k-modes algorithm that simultaneously find the number of clusters and good initial centers. Ahmad and Dey [20] used a new cost function and distance measure in their k-means algorithm for clustering mixed categorical and numerical data.

Some of the significant research works done in recent years related to clustering algorithms are tabulated in Table 1. To the best of our knowledge, no existing research has used a correlation similarity measure for clustering categorical data. This research introduces a correlation index-based clustering algorithm for clustering categorical data. The proposed algorithm was tested in two ways on nine datasets gathered from the University of California at Irvine (UCI) repository: first by specifying the number of clusters and second without specifying the number of clusters. The performance of the algorithm was evaluated against five popular categorical clustering algorithms. Section 2 presents the algorithm and explains how it works. Section 3 discusses the results of the COIN algorithm and compares them with those of the other algorithms.

2. Materials and Methods

2.1. Datasets Used in Experimentation

In the proposed algorithm, the input data is in the form of a matrix containing 'n' instances and 'm' attributes. The data for testing the proposed algorithm were taken from the UCI repository [35]; the HAYES-ROTH dataset is used here for illustration. Though the HAYES-ROTH dataset has 132 instances and 4 attributes divided into 3 classes, only 8 instances were considered for the sake of illustrating the proposed algorithm. The sample input data are given in Table 2. The instances {O1, O2, …, O8} are arranged in rows, while the attributes (Name (A1), Hobby (A2), Age (A3), and Educational Level (A4)) are arranged in columns. The actual class label of the data (1, 2) is provided in the last column.
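
As a minimal illustration of this input format, the sketch below encodes a small categorical table as an integer matrix in Python (the actual implementation, as noted in Section 3, is in MATLAB; the attribute codes here are hypothetical placeholders, not the actual rows of Table 2):

```python
import numpy as np

# Hypothetical stand-in for Table 2: 8 instances (rows, O1..O8) and
# 4 categorical attributes (columns, A1..A4) encoded as integer codes.
# These values are illustrative placeholders, not the real Table 2 data.
X = np.array([
    [1, 1, 1, 2],   # O1
    [2, 1, 3, 2],   # O2
    [3, 2, 2, 1],   # O3
    [1, 1, 2, 2],   # O4
    [2, 3, 1, 1],   # O5
    [3, 2, 1, 3],   # O6
    [1, 3, 3, 1],   # O7
    [2, 2, 3, 1],   # O8
])
n_instances, m_attributes = X.shape   # n = 8 instances, m = 4 attributes
```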

2.2. COIN: Correlation Index-Based Clustering Algorithm

The proposed algorithm is a correlation similarity-based heuristic method that can partition 'x' instances into 'y' groups. We name this algorithm the correlation index-based clustering algorithm (COIN). The pseudocode of the algorithm is presented in Figure 1.

Step 1: Input categorical data consisting of 'n' instances (rows) and 'm' attributes (columns).
Step 2: Compute the correlation coefficients and generate a correlation matrix of the instances of size n × n.
Step 3: Generate the modified correlation matrix using the epsilon-gamma substitution procedure.
Step 4: From the first row of the modified correlation matrix, divide the instances into two groups based on the positive and negative correlation values.
Step 5: Scan each of the two initial groups for negatively correlated instances.
Step 6: Remove those negatively correlated instances and pool them in a separate array named the negatively correlated instances (NCI) array.
Step 7: Apply the correlation index similarity measure (CSM) procedure to each element of the NCI to assign the instance either to an existing cluster or to a new cluster.
Step 8: Repeat Step 7 until the end of the NCI.
Step 9: Evaluate the final clusters.

The correlation coefficients were computed from the input matrix, and the corresponding correlation matrix of the instances was generated. It is a square symmetric matrix with diagonal elements equal to 1. Table 3 presents the correlation matrix for the data considered. The (x, y)th element of the correlation matrix represents the correlation coefficient between the xth and yth instance vectors. The correlations of instances with themselves are always equal to 1 and lie on the diagonal of the matrix.
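
A minimal sketch of this step, assuming the categorical values are already integer-encoded as above (the function name is ours):

```python
import numpy as np

def correlation_matrix(X):
    # np.corrcoef treats each row as one variable, so passing the data
    # matrix directly correlates instances (rows) over their attributes.
    # The result is a square symmetric n x n matrix with unit diagonal.
    # Assumes no instance has constant attribute values (which would
    # produce NaN entries).
    return np.corrcoef(X)
```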

In the proposed algorithm, the correlation similarity matrix is not directly used for the formation of clusters. Instead, a threshold ε (epsilon) is used for the formation of clusters. In each iteration, only one value is selected as ε. The values of the correlation matrix that are less than or equal to ε are changed to a negative correlation value γ (gamma), which varies from -0.1 to -1 in decrements of 0.1. The above procedure was applied to the correlation matrix by considering ε = 0.1 and γ = -0.1. The modified correlation matrix is shown in Table 4, where all values less than or equal to 0.1 were replaced by -0.1.
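
A one-line sketch of this epsilon-gamma substitution, using the example values ε = 0.1 and γ = -0.1 (names are ours):

```python
import numpy as np

def epsilon_gamma_substitution(R, eps=0.1, gamma=-0.1):
    # Replace every correlation value <= epsilon with the negative value
    # gamma, leaving the remaining (stronger) correlations intact.
    return np.where(R <= eps, gamma, R)
```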

In this proposed algorithm, initial clusters are formed by choosing any row of the modified correlation matrix and splitting the instances with positive and negative correlation values into separate groups. For instance, consider the first row of the modified correlation matrix, where instance O1 is positively correlated with instances {O1, O4} and negatively correlated with instances {O2, O3, O5, O6, O7, O8}. The positively correlated instances were grouped together as Cluster1 and the negatively correlated instances as Cluster2. The pictorial representation of the formation of the initial clusters is shown in Figure 2. The initial clusters formed are as follows: Cluster1 ⟶ {O1, O4}; Cluster2 ⟶ {O2, O3, O5, O6, O7, O8}.
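
The split can be sketched as follows (a sketch only, assuming 0-based instance indices; the helper name is ours):

```python
import numpy as np

def initial_clusters(R_mod, row=0):
    # Split the instances into two groups using the sign of the chosen
    # row of the modified correlation matrix (the first row by default).
    positive = list(np.where(R_mod[row] > 0)[0])   # e.g. {O1, O4}
    negative = list(np.where(R_mod[row] < 0)[0])   # e.g. {O2, O3, ..., O8}
    return [positive, negative]
```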

The next step in the proposed algorithm is to find the negatively correlated instances (NCI) within the initial clusters. A simple search procedure was initiated in which each possible pair of instances in a cluster was checked for negative correlation. All negatively correlated pairs of instances were removed from that cluster and pooled into an array called the negatively correlated instances (NCI) array. All other instances were retained in the same cluster.

For example, consider the initial cluster Cluster1, which has two instances {O1, O4}. Since the correlation value of instances O1 and O4 in the modified correlation matrix is positive, these instances remained in the existing cluster, and the procedure proceeded to check Cluster2. Cluster2 has six instances {O2, O3, O5, O6, O7, O8}, in which the correlation value of the first pair of instances {O2, O3} is positive. Hence, the instances O2 and O3 remained in the same cluster, and the next pair of instances was checked. Since the correlation value of the instances {O2, O5} is negative, they were removed from the cluster and pooled into the negatively correlated instances array. The pictorial representation of finding the negatively correlated instances in the initial clusters is shown in Figure 3.

The initial clusters after removing the negatively correlated instances were Cluster1 ⟶ {O1, O4}, Cluster2 ⟶ {O7, O8}, and NCI ⟶ {O2, O3, O5, O6}.
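
One plausible reading of this scan is sketched below (both members of every pair whose modified correlation is negative are moved into the NCI array; the helper name is ours):

```python
from itertools import combinations

def extract_nci(clusters, R_mod):
    # Scan each cluster for negatively correlated pairs; both members of
    # any such pair are pooled into the NCI array, the rest remain.
    nci, cleaned = [], []
    for cluster in clusters:
        removed = set()
        for a, b in combinations(cluster, 2):
            if R_mod[a, b] < 0:
                removed.update((a, b))
        nci.extend(i for i in cluster if i in removed)
        cleaned.append([i for i in cluster if i not in removed])
    return cleaned, nci
```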

The next step is to find clusters for the instances in the NCI based on the correlation index similarity measure. For each instance in the NCI, the correlation index similarity with all the existing clusters is computed, and the instance is assigned to the cluster that holds the highest positive value. If the correlation index similarity is negative for all the existing clusters, a new cluster is created in which that instance becomes the first element.

The proposed correlation index similarity measure is given in equation (1). It is defined as the sum of the correlation values of an instance with all elements in the cluster:

$$\mathrm{CSM}(ins, i) = \sum_{k=1}^{K} \rho(ins, o_{ik}), \qquad (1)$$

where $\mathrm{CSM}(ins, i)$ is the correlation index similarity of any instance 'ins' with the $i$th cluster, $K$ is the number of instances in the $i$th cluster, and $\rho(ins, o_{ik})$ is the correlation value of the instance with the $k$th element $o_{ik}$ of the $i$th cluster.
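
Equation (1) translates directly into a short helper (a sketch; the ρ values are read from the modified correlation matrix, as in the worked example below):

```python
def csm(ins, cluster, R_mod):
    # Correlation index similarity of instance 'ins' with one cluster:
    # the sum of its modified correlation values with every member.
    return sum(R_mod[ins, k] for k in cluster)
```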

For example, consider the NCI and the existing clusters of the input data given in Table 2: Cluster1 ⟶ {O1, O4}, Cluster2 ⟶ {O7, O8}, and NCI ⟶ {O2, O3, O5, O6}.

Now, start with the first instance O2 of the NCI.

Calculate CSM(O2, Cluster1) as follows. From the modified correlation matrix shown in Table 4, CSM(O2, Cluster1) = ρ(O2, O1) + ρ(O2, O4).

Similarly, calculate CSM(O2, Cluster2). From Table 4, CSM(O2, Cluster2) = ρ(O2, O7) + ρ(O2, O8).

From the above calculations, since CSM(O2, Cluster2) has the maximum positive value, the instance O2 was assigned to Cluster2. Hence, the clusters became Cluster1 ⟶ {O1, O4}, Cluster2 ⟶ {O2, O7, O8}, and NCI ⟶ {O3, O5, O6}.

Similarly, the procedure was repeated for the remaining instances in the NCI. The pictorial representation of the formation of clusters from the NCI is given in Figure 4.

Since the correlation index similarity value of every NCI instance with at least one existing cluster was positive, no new clusters were created. The final clusters formed for the input data are Cluster1 ⟶ {O1, O4, O5, O6} and Cluster2 ⟶ {O2, O3, O7, O8}.
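
Putting the assignment rule together, a minimal sketch of the NCI reassignment loop follows (the csm() helper from above is restated inline so the sketch is self-contained; when every CSM value is negative, the instance seeds a new cluster):

```python
def assign_nci(nci, clusters, R_mod):
    def csm(ins, cluster):
        # correlation index similarity, equation (1)
        return sum(R_mod[ins, k] for k in cluster)

    for ins in nci:
        scores = [csm(ins, c) for c in clusters]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > 0:
            clusters[best].append(ins)    # join best positive cluster
        else:
            clusters.append([ins])        # start a new cluster
    return clusters
```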

2.3. Performance Evaluation

The performance of the algorithm was evaluated using the clustering accuracy given in equation (2):

$$AC = \frac{1}{X} \sum_{j=1}^{y} D_j, \qquad (2)$$

where $j$ is the cluster number $(1 \ldots y)$, $D_j$ is the number of instances of the dominating class label in cluster $j$, and $X$ is the total number of instances.

The value of the clustering accuracy ranges between 0 and 1. A value close to 0 indicates bad clustering, and a value close to 1 indicates good clustering.
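
A sketch of equation (2), assuming clusters are lists of instance indices and labels is the vector of true class labels (the helper name is ours):

```python
from collections import Counter

def clustering_accuracy(clusters, labels):
    # D_j: count of the dominating (majority) class label in cluster j;
    # the accuracy is the sum of D_j over all clusters divided by X.
    X = sum(len(c) for c in clusters)
    dominating = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                     for c in clusters if c)
    return dominating / X
```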

Table 5 shows the results of the COIN algorithm for the sample data considered in Table 2. The total number of dominating instances is 4, so for the samples considered above, the clustering accuracy of the clusters at the end of the process was 4/8 = 0.5. The time complexity is calculated as follows. Let m be the number of instances, n be the number of attributes, and d be the number of negatively correlated instances. The time taken to form the correlation matrix is O(m²n), and the time taken to find the correct clusters is O(d). The total time complexity of the algorithm is O(m²d).

3. Experimental Analysis

To conduct the experimental analysis, nine datasets were gathered from the UCI machine learning repository, as shown in Table 6. The datasets were Hayes-Roth, Zoo, Votes, Balance Scale, Breast Cancer, Car Evaluation, Chess, Mushroom, and Nursery. The Breast Cancer dataset had some missing values, so the instances with missing attribute values were deleted for the experiment. After deletion, the Breast Cancer dataset contained 683 instances.

The COIN algorithm was coded in the MATLAB programming language, and its performance was tested on those nine datasets. The performance of COIN was compared with five existing algorithms, namely, the MGR, MMR, COOLCAT, K-ANMI, and G-ANMI algorithms. The clustering accuracies of the MGR, MMR, COOLCAT, K-ANMI, and G-ANMI algorithms on the same nine datasets were taken from Qin et al. [36].

3.1. Results of the COIN Algorithm When Specifying the Number of Clusters

The proposed COIN algorithm can be executed either by specifying the number of clusters or without specifying the number of clusters. For instance, since the number of classes present in the Nursery dataset is five, the number of clusters specified in COIN was five. In the second case, the COIN algorithm was executed without specifying the number of clusters. The clustering results of the five existing algorithms and the proposed COIN algorithm on the nine datasets are tabulated in Table 7, while the comparison of clustering accuracy for the six algorithms on the nine datasets is shown in Figure 5. It can be observed that COIN outperformed the other five algorithms on six datasets: Hayes-Roth, Votes, Balance Scale, Car Evaluation, Chess, and Nursery. The COIN algorithm stands second best on the Zoo and Mushroom datasets by a very marginal difference. For the Breast Cancer dataset, the accuracy of COIN is lower than that of K-ANMI, G-ANMI, and MGR.

COIN produced better accuracies for eight datasets (88.89%) and slightly lower accuracy for one dataset (11.11%) when compared individually with the MMR, K-ANMI, and MGR algorithms. It produced better accuracies for all nine datasets (100%) when compared with the G-ANMI and COOLCAT algorithms. The results (number of instances in each cluster and accuracy) for the nine datasets Zoo, Hayes-Roth, Votes, Balance Scale, Breast Cancer, Car Evaluation, Chess, Mushroom, and Nursery are furnished as a supporting document. Readers can also request the data from the authors.

3.2. COIN Results without Specifying the Number of Clusters

Table 8 shows the comparative analysis of the accuracies of MGR and COIN on the nine datasets without specifying the number of clusters. From Table 8, it is observed that the COIN algorithm produced better accuracies for seven datasets (77.78%) and equal accuracy for two datasets (22.22%) without specifying the number of clusters. Figure 6 shows the comparison of accuracies for MGR and COIN without specifying the number of clusters. Tables 9–17 show the results of COIN on the Zoo, Hayes-Roth, Congressional Votes, Balance Scale, Breast Cancer, Car Evaluation, Chess, Mushroom, and Nursery datasets, respectively. They contain details such as the number of clusters achieved, the distribution of instances with respect to the various classes in each cluster, the dominating cluster instances, and the accuracy.

4. Conclusions

A new correlation index-based clustering algorithm for categorical data has been presented. The performance of the proposed algorithm was tested on nine datasets from the UCI repository in two ways: by specifying the number of clusters and without specifying the number of clusters. The performance of COIN was compared with the MMR, K-ANMI, G-ANMI, COOLCAT, and MGR algorithms. In the case of specifying the number of clusters, COIN outperformed all five algorithms for 66.67% of the test datasets and produced slightly lower accuracies for 33.33% of the test datasets. When COIN was executed without specifying the number of clusters, it outperformed MGR for 88.89% of the test datasets and produced lower accuracy for 11.11% of the test datasets.

Nomenclature

COIN:Correlation index-based clustering algorithm
UCI:University of California at Irvine
MGR:Mean gain ratio
MMR:Min–min-roughness
COOLCAT:An entropy-based algorithm for categorical data
K-ANMI:Average normalized mutual information-based clustering algorithm, where K represents the number of desired clusters
G-ANMI:Genetic algorithm-based average normalized mutual information
MATLAB:Matrix laboratory
CACTUS:Clustering categorical data using summaries
ROCK:Robust clustering using links
TCSOM:Transactional clustering using self-organized map.

Data Availability

Data associated with this research are available on request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.