Abstract
In the era of big data, more and more datasets fall beyond the scope of traditional clustering algorithms because of their large scale and high dimensionality. To overcome these limitations, incremental mechanisms and feature reduction have become two indispensable components of modern clustering algorithms. Combining feature reduction with single-pass and online incremental strategies, respectively, we propose two incremental fuzzy clustering algorithms. The first uses the Weighted Feature Reduction Fuzzy C-Means (WFRFCM) clustering algorithm to process each chunk in turn, feeding the clustering results of the previous chunk into the next chunk for joint computation. The second applies the WFRFCM algorithm to all chunks in parallel and then merges and reclusters the results of each chunk. To investigate the clustering performance of these two algorithms, six datasets were selected for comparative experiments. The results show that both algorithms select high-quality features through feature reduction and handle large-scale data through the incremental strategy. Combining the two mechanisms ensures clustering efficiency while maintaining high clustering accuracy.
1. Introduction
Clustering is an important technique in data mining research. It groups data in an unsupervised way while ensuring high intragroup similarity and intergroup dissimilarity. Since fuzzy set theory [1] was proposed and applied to cluster analysis [2], research on fuzzy clustering has grown rapidly. Today, many fuzzy clustering algorithms, represented by Fuzzy C-Means (FCM) clustering [3–5], have been proposed and widely used.
In the context of big data, datasets exhibit two main characteristics: (1) large volume and (2) high dimensionality. When the FCM algorithm processes such datasets, all features participate in clustering with equal importance weights, which easily degrades both clustering accuracy and efficiency. In 2018, Yang [6] proposed Feature Reduction Fuzzy C-Means (FRFCM) clustering. This algorithm designs an objective function for feature reduction based on weighted entropy and computes a weight for each feature: the greater a feature's influence, the higher its weight, and vice versa. Guided by these weights, the original high-dimensional features are reduced to a lower-dimensional space, which effectively improves efficiency.
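As a hedged illustration of this idea, the sketch below computes entropy-style feature weights from the fuzzy within-cluster dispersion of each feature: a feature with small dispersion separates the clusters well and receives a larger weight. The function name and the exact weighting form are assumptions for illustration; FRFCM's actual update formulas (not reproduced in this text) differ.

```python
import numpy as np

def feature_weights(X, centers, U, m=2.0):
    """Illustrative entropy-style feature weighting (an assumption, not the
    exact FRFCM update): smaller within-cluster dispersion -> larger weight."""
    N, D = X.shape
    disp = np.zeros(D)
    for c in range(centers.shape[0]):
        diff2 = (X - centers[c]) ** 2               # (N, D) squared deviations
        disp += (U[c][:, None] ** m * diff2).sum(axis=0)
    # exponential of negative normalized dispersion, then renormalize to sum 1
    w = np.exp(-disp / disp.sum())
    return w / w.sum()
```

Features whose resulting weight falls below a threshold (the paper later uses 1/(DN)^(1/2)) would be discarded.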
Although fuzzy clustering can effectively handle high-dimensional feature data through feature reduction [7–13], it still struggles to process large-scale data, especially streaming data. Previously, to realize large-scale data clustering [14–18], Hore et al. [19, 20] proposed two incremental algorithms, SPFCM (Single-Pass Fuzzy C-Means) and OFCM (Online Fuzzy C-Means), based on single-pass and online clustering strategies, respectively. The single-pass method divides the whole dataset into several chunks; the centroids obtained by clustering one chunk participate in the clustering of the next chunk, until all chunks have been processed. In the online method, by contrast, each chunk is clustered separately, and the centroids of all chunks form a new chunk, which is clustered again to generate the final result. Although SPFCM and OFCM can both handle large-scale datasets, their results are rarely satisfactory once the data are high-dimensional and sparse. Therefore, considering both scalability and high-dimensional processing capability, Mei et al. [21] extended SPFCM and OFCM to SPHFCM (Single-Pass Hyperspherical Fuzzy C-Means) [21] and OHFCM (Online Hyperspherical Fuzzy C-Means) [21], respectively. Their work adds a normalization step at each iteration to scale all centroids to unit norm and uses cosine similarity instead of Euclidean distance to measure the distance between each centroid and each object.
Based on the above analysis, to cluster large-scale and high-dimensional datasets, we propose two incremental fuzzy c-means clustering algorithms based on feature reduction, named SPFRFCM (Single-Pass Feature Reduction Fuzzy C-Means) and OFRFCM (Online Feature Reduction Fuzzy C-Means). In both algorithms, features receive different weights according to their importance, and dimension reduction is used to lower the data dimensionality and improve clustering efficiency. In addition, the incremental mechanism enables them to handle large-scale data.
The rest of the paper is organized as follows. Section 2 introduces related work, including the FRFCM algorithm and incremental clustering based on FCM. Section 3 presents our two algorithms, SPFRFCM and OFRFCM. Section 4 reports experiments demonstrating the performance of the proposed algorithms. Section 5 concludes the paper.
2. Related Work
2.1. FRFCM
In the FRFCM algorithm, each feature has its own weight, and the weight value is updated at each iteration. Let X = {x1, x2, …, xN} be a D-dimensional dataset, let U be a membership matrix whose element uci represents the fuzzy membership of the i-th object (1 ≤ i ≤ N) to the c-th cluster (1 ≤ c ≤ C), let V be a centroid matrix whose element vcj represents the value of the j-th feature of the c-th centroid, and let W be a feature weight matrix whose element ωj represents the weight of the j-th feature (1 ≤ j ≤ D), where N and C are the numbers of objects and clusters, respectively.
The objective function and constraints are expressed as follows, where δj is used to adjust the feature weight ωj and Tω depends on the values of N and C:
2.2. Incremental Clustering Based on FCM
Both the SPFCM and OFCM algorithms are based on WFCM (Weighted Fuzzy C-Means) clustering [22–25], which assigns different weights to objects according to their importance: the higher the weight, the greater the importance, and vice versa. The objective function of the WFCM algorithm weights the contribution of the i-th object by its object weight, and the iterative formulas of uci and vcj can be obtained, respectively, by the Lagrange multiplier method:
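One WFCM iteration can be sketched directly: the membership update is the standard FCM formula, and the centroid update scales each object's contribution by its object weight. The function name `wfcm_step` is chosen here for illustration, not taken from the paper.

```python
import numpy as np

def wfcm_step(X, V, obj_w, m=2.0, eps=1e-12):
    """One WFCM iteration: FCM membership update plus object-weighted
    centroid update. X: (N, D) data, V: (C, D) centroids, obj_w: (N,) weights."""
    # squared Euclidean distances between centroids and objects, shape (C, N)
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps
    # u_ci proportional to d2_ci^(-1/(m-1)), columns normalized to sum to 1
    p = 1.0 / (m - 1.0)
    inv = (1.0 / d2) ** p
    U = inv / inv.sum(axis=0, keepdims=True)
    # v_cj = sum_i w_i u_ci^m x_ij / sum_i w_i u_ci^m
    um = obj_w * U ** m
    V_new = um @ X / um.sum(axis=1, keepdims=True)
    return U, V_new
```

With all object weights equal to 1, this reduces to an ordinary FCM iteration.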
After dividing dataset X into s chunks of size n, that is, X = [X1, X2, …, Xs], both SPFCM and OFCM process the data chunk by chunk. In SPFCM, let Xz denote the z-th chunk (1 ≤ z ≤ s). After clustering Xz with the WFCM algorithm, we obtain a centroid set Δz = [v1z, v2z, …, vcz]. These centroids and their weights are then merged with the objects of the (z+1)-th chunk to form a new chunk Xz+1′ = [Δz, Xz+1], on which the WFCM algorithm is carried out again. This process repeats until all chunks have been processed.
Unlike SPFCM, the OFCM algorithm clusters the s chunks separately at the same time; the centroids of all chunks then form a combined chunk, on which the WFCM algorithm is implemented again to produce the final clustering result.
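The two data flows can be contrasted in code. The sketch below substitutes a tiny weighted k-means for the WFCM step (an assumption made only to keep the example self-contained and short): `single_pass` carries weighted centroids forward into each next chunk, while `online` clusters the chunks independently and then reclusters the pooled, weighted centroids.

```python
import numpy as np

def weighted_kmeans(X, w, C, iters=10, seed=0):
    """Tiny weighted k-means used as a stand-in for WFCM, to show the
    incremental data flow only (not the paper's exact clustering step)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), C, replace=False)]
    for _ in range(iters):
        lab = ((X[:, None] - V[None]) ** 2).sum(-1).argmin(1)
        for c in range(C):
            if (lab == c).any():
                V[c] = np.average(X[lab == c], axis=0, weights=w[lab == c])
    cw = np.array([w[lab == c].sum() for c in range(C)])  # centroid weights
    return V, cw

def single_pass(chunks, C):
    V, cw = None, None
    for Xz in chunks:
        data = Xz if V is None else np.vstack([V, Xz])
        w = np.ones(len(Xz)) if V is None else np.concatenate([cw, np.ones(len(Xz))])
        V, cw = weighted_kmeans(data, w, C)   # carry centroids into next chunk
    return V

def online(chunks, C):
    parts = [weighted_kmeans(Xz, np.ones(len(Xz)), C) for Xz in chunks]
    Vall = np.vstack([V for V, _ in parts])
    wall = np.concatenate([cw for _, cw in parts])
    V, _ = weighted_kmeans(Vall, wall, C)     # final pass over pooled centroids
    return V
```

Note the structural difference: `single_pass` is inherently sequential, whereas the per-chunk calls inside `online` are independent and could run in parallel.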
3. Incremental Fuzzy Clustering Based on Feature Reduction
Based on the single-pass and online incremental frameworks, this paper proposes two incremental fuzzy clustering algorithms based on feature reduction, named SPFRFCM and OFRFCM, respectively. To implement SPFRFCM and OFRFCM, a weighted FRFCM algorithm, named WFRFCM (Weighted Feature Reduction Fuzzy C-Means), is designed. Its objective function is
The constraints include
According to the Lagrange multiplier method, fixing U and W and setting the partial derivative with respect to V to zero yields the value of vcj:
Similarly, the iterative formulas of uci and ωj can be described as
Next, we analyze the time complexity of the WFRFCM algorithm. It consists of three parts: (1) computing the membership uci, (2) updating the centroid vcj, and (3) updating the feature weight ωj. The complexities of these parts are O(NC²D), O(NCD), and O(NCD²), respectively, so the total computational complexity is O(NC²D + NCD²).
3.1. SPFRFCM
Before the SPFRFCM algorithm is implemented, the dataset is divided into s chunks of size n, and the data weight vector is initialized as a 1 × n vector of ones. Let z be the chunk index. The algorithm then runs as follows:
(1) When z = 1, the WFRFCM algorithm clusters this chunk directly, and we obtain the centroid matrix Δ1 = [v11, v21, …, vC1] and the corresponding data weight matrix, in which every weight equals 1 (1 ≤ i ≤ nz, where nz is the size of the z-th chunk). We also obtain the feature weight matrix ω1 = [ω11, ω21, …, ωD1]. By minimizing the objective function, the formulas of the membership uci and the centroid after clustering chunk X1 are obtained.
(2) When z > 1, the centroids of the previous chunk are added to the current chunk, producing a new chunk Xz′ = [Δz−1, Xz]. Its weight vector contains the C centroid weights from the previous chunk followed by the nz object weights (each equal to 1) of the current chunk. WFRFCM clustering is then carried out on this new chunk, and the data weights and feature weights are updated accordingly.
By minimizing the objective function, we obtain the formulas of the membership uci and the centroid vcj:
The pseudocode of the SPFRFCM algorithm is as follows:
Step 1: Initialize the parameters: number of clusters C, fuzzy index m = 2, membership matrix U(0), feature weight matrix W(0) = [ωj]1×D, iteration counter t = 1, threshold ε1 = 1/(DN)^(1/2), and threshold ε2 = 1 × 10−8.
Step 2: Divide the dataset into s chunks, X = [X1, X2, …, Xs]; initialize the data weights to [1, 1, …, 1]; set z = 1.
Step 3: Calculate V(t) with formula (19) or (23).
Step 4: Calculate W(t) with formula (16) or (20).
Step 5: Calculate ωj with formula (17).
Step 6: If ωj < ε1, delete the j-th feature and update the number of features dnew = D − dr, where dr is the number of deleted features.
Step 7: Calculate U(t) with formula (18) or (22).
Step 8: If z < s: if ║U(t) − U(t−1)║ > ε2, set t ← t + 1 and go to Step 3; otherwise, add the centroids and weights to the next chunk to obtain a new chunk, set z ← z + 1, and go to Step 3.
Step 9: If z = s: if ║U(t) − U(t−1)║ > ε2, set t ← t + 1 and go to Step 3; otherwise, the algorithm ends.
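Step 6, the feature-deletion rule, can be sketched on its own. `prune_features` is a hypothetical helper name; the threshold ε1 = 1/(DN)^(1/2) is taken from Step 1 of the pseudocode.

```python
import numpy as np

def prune_features(X, w_feat, N_total):
    """Drop features whose weight falls below eps1 = 1/sqrt(D*N), as in
    Step 6; returns the reduced data, surviving weights, and the keep mask."""
    D = X.shape[1]
    eps1 = 1.0 / np.sqrt(D * N_total)
    keep = w_feat >= eps1
    return X[:, keep], w_feat[keep], keep
```

The surviving feature set (dnew = D − dr columns) then participates in all subsequent iterations, which is what shrinks the per-iteration cost.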
3.2. OFRFCM
OFRFCM is implemented on the online incremental framework. First, the s chunks are clustered separately; then the centroids and weights of all chunks are merged to form a new chunk and its weight matrix, which collects the C centroid weights produced by each of the s chunks. The specific steps are as follows:
(1) Cluster each chunk to obtain the centroid matrix Δz = [v1z, v2z, …, vCz] and the corresponding weight matrix. The feature weight matrix ωz = [ω1z, ω2z, …, ωDz] is also obtained. By minimizing the objective function, we obtain the formulas of the membership uci and the centroid vcj.
(2) The centroids from all chunks, with their different weights, form a new dataset, on which the WFRFCM algorithm is implemented again to generate the final result.
The formulas of the membership uci and the centroid vcj obtained from the new chunk are as follows, where 1 ≤ i ≤ np and np is the number of objects in the new chunk.
The clustering process of the OFRFCM algorithm can be described by the following pseudocode:
Step 1: Initialize the parameters: number of clusters C, fuzzy index m = 2, membership matrix U(0), feature weight matrix W(0) = [ωj]1×D, iteration counter t = 1, centroid matrix Xnew, data weight matrix Q, threshold ε1 = 1/(DN)^(1/2), and threshold ε2 = 1 × 10−8.
Step 2: Divide the dataset into s chunks, X = [X1, X2, …, Xs]; initialize the data weights to [1, 1, …, 1]; set z = 1.
Step 3: Calculate V(t) with formula (27) or (29).
Step 4: Calculate W(t) with formula (24).
Step 5: Calculate ωj with formula (25).
Step 6: If ωj < ε1, delete the j-th feature and update the number of features dnew = D − dr, where dr is the number of deleted features.
Step 7: Calculate U(t) with formula (26) or (28).
Step 8: If z ≤ s: if ║U(t) − U(t−1)║ > ε2, set t ← t + 1 and go to Step 3; otherwise, add the centroids and weights to the centroid matrix Xnew and the object weight matrix Q, respectively, set z ← z + 1, and go to Step 3.
Step 9: If z > s: if ║U(t) − U(t−1)║ > ε2, set t ← t + 1 and go to Step 3; otherwise, the algorithm ends.
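The stopping rule shared by Steps 8 and 9 of both pseudocodes can be sketched as a Frobenius-norm test on successive membership matrices; `converged` is a name chosen here for illustration.

```python
import numpy as np

def converged(U_t, U_prev, eps2=1e-8):
    """Stop when the membership matrix changes by no more than eps2
    in Frobenius norm, i.e. when ||U(t) - U(t-1)|| <= eps2."""
    return np.linalg.norm(U_t - U_prev) <= eps2
```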
4. Experiments
4.1. Experimental Preparation
In order to verify the effectiveness of our algorithms, six datasets and six incremental comparison clustering algorithms are selected for the experiments. The information of each dataset is listed in Table 1. The comparison algorithms include three single-pass algorithms, SPFCM, SPHFCM, and SPFCOM (Single-Pass Fuzzy C-Ordered-Means clustering) [26], and three online algorithms, OFCM, OHFCM, and OFCOM (Online Fuzzy C-Ordered-Means clustering) [26]. SPFCOM and OFCOM are incremental fuzzy c-ordered-means clustering algorithms built on the single-pass and online architectures, respectively, and both achieve better robustness. The experimental results are evaluated with two criteria, accuracy (AC) and F-Measure (FM) [27].
4.2. Experimental Results
The experiments are divided into three parts. The first part evaluates clustering accuracy by comparing our algorithms with the six comparison algorithms, the second verifies the effectiveness of dimension reduction, and the third evaluates the robustness of our algorithms.
4.2.1. Algorithm Accuracy
This group of experiments first compares our SPFRFCM algorithm with the three single-pass comparison algorithms and then compares our OFRFCM algorithm with the three online comparison algorithms. The chunk size is set to 5%, 10%, 20%, and 50% of the total number of objects in each dataset to verify the clustering performance under different chunk sizes. The experiments adopt the same parameter settings as the FRFCM algorithm, and each reported result is the average of 10 independent runs of each algorithm. The results of the four single-pass algorithms and the four online algorithms are shown in Tables 2 and 3, respectively.
It can be seen from Table 2 that when the chunk size is 5%, the average FM value of the SPFRFCM algorithm on the six datasets is 10.98%, 13.89%, and 18.55% higher than those of the SPFCOM, SPHFCM, and SPFCM algorithms, respectively, and the average AC value is 10.97%, 12.71%, and 18.22% higher, respectively. When the chunk size is 10%, the average FM value of SPFRFCM is 13.65%, 15.12%, and 17.94% higher than those of SPFCOM, SPHFCM, and SPFCM, respectively, and the AC value is 9.93%, 11.96%, and 15.34% higher, respectively. When the chunk size is 20%, the average FM improvements are 14.33%, 15.22%, and 14.28%, respectively, and the AC improvements are 11.48%, 15.22%, and 12.35%, respectively. When the chunk size is 50%, the FM improvements are 11.26%, 15.12%, and 19.02%, respectively, and the AC improvements are 7.37%, 13.06%, and 17.53%, respectively. In terms of the result distribution, SPFRFCM improves greatly on the WR, WB, and SE datasets, where the FM and AC values increase by 20.16% and 10.15% on average, respectively. The improvements on the GS, PB, and IR datasets are smaller, with average FM and AC increases of 5.76% and 5.88%, respectively.
Table 3 lists the clustering accuracy comparison of the four online algorithms. It shows that when the chunk size is 5%, the average FM value of the OFRFCM algorithm on the six datasets is 14.98%, 25.19%, and 14.65% higher than those of the OFCOM, OHFCM, and OFCM algorithms, respectively, and the AC value is 14.20%, 34.88%, and 18.66% higher, respectively. When the chunk size is 10%, the average FM value of OFRFCM is 13.13%, 22.95%, and 13.74% higher, and the AC value is 9.22%, 21.14%, and 16.37% higher, respectively. When the chunk size is 20%, the FM improvements of OFRFCM are 14.76%, 23.63%, and 13.49%, respectively, and the AC improvements are 11.45%, 31.20%, and 15.51%, respectively. When the chunk size is 50%, the FM improvements are 13.40%, 26.72%, and 16.97%, respectively, and the AC improvements are 9.77%, 33.29%, and 24.62%, respectively. In terms of the result distribution, OFRFCM improves greatly on the SE, WR, and WB datasets, where the FM and AC values increase by 27.40% and 25.68% on average, respectively. The improvements on the GS, PB, and IR datasets are smaller, with average FM and AC increases of 8.20% and 14.37%, respectively.
This group of experiments shows that both the SPFRFCM and OFRFCM algorithms achieve higher accuracy than the comparison algorithms on the experimental datasets, with obvious improvements on some of them. The reason is that features with low weights may play a negative role in the clustering process; filtering them out through feature reduction therefore helps improve clustering accuracy.
4.2.2. Feature Reduction
In the process of feature reduction, features whose weights fall below the threshold are discarded. To verify the effectiveness of feature reduction, the second part of the experiments also sets the chunk size to 5%, 10%, 20%, and 50% of the total number of objects in each dataset and records the number of remaining features. The experimental results are shown in Figures 1 and 2.
Figures 1 and 2 describe the feature reduction performance of the SPFRFCM and OFRFCM algorithms, respectively. On the IR dataset, both algorithms reduce the number of features by 50%. On the WB dataset, SPFRFCM reduces the features by up to 70%, while OFRFCM reduces them by up to 33%. On the WR dataset, SPFRFCM reduces the number of features by 73%, while OFRFCM reduces it by 9%. On the PB dataset, neither algorithm reduces the number of features. On the SE dataset, SPFRFCM outperforms OFRFCM in feature reduction when the chunk size is 5% or 10%, and the two are equivalent when the chunk size is 20% or 50%. On the GS dataset, SPFRFCM reduces the number of features to 22 on average, while OFRFCM reduces it to 120 on average. Thus, in our experiments, the feature reduction method effectively reduces the number of features.
In addition, the first part of the experiments shows that clustering accuracy improves with feature reduction, which indicates that some noisy features are filtered out during the process. Noisy features not only harm clustering accuracy but also reduce clustering efficiency; once they are filtered out, the robustness of the algorithms improves and the clustering accuracy is more easily raised.
After feature reduction, clustering time is reduced and clustering efficiency therefore improves. The average running time of each algorithm is shown in Table 4. For high-dimensional datasets, the running time of the SPFRFCM algorithm is 70.2%, 12.6%, and 38.5% lower than those of the single-pass comparison algorithms SPFCOM, SPHFCM, and SPFCM, respectively. The running time of the OFRFCM algorithm is comparable to that of the other online algorithms and is not significantly improved, which is a consequence of the incremental strategy. Under the single-pass strategy, the feature weights are updated and low-weight features are filtered out as each chunk is clustered; the reduced feature set then participates in the computation for all subsequent chunks, directly improving their clustering efficiency. Hence, over the whole dataset, SPFRFCM greatly reduces the clustering time. Under the online strategy, however, each chunk is clustered separately, and low-weight features are filtered out within each chunk independently, so the overall efficiency changes little. Therefore, the single-pass incremental algorithm outperforms the online incremental algorithm in terms of feature reduction and efficiency improvement.
4.2.3. Robustness
This part of the experiments evaluates the robustness of our algorithms. The chunk size is set to 50% of each dataset, and noise data are added to the six datasets in proportions of 5%, 10%, 20%, and 50%, respectively, before clustering. The experimental results are shown in Figures 3–8.
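The text does not specify the noise model, so the sketch below is only one plausible setup for such a robustness experiment: appending a given proportion of uniform-random points spanning each feature's observed range. The function name and noise distribution are assumptions.

```python
import numpy as np

def add_noise(X, frac, rng=None):
    """Append frac * len(X) uniform-random noise points drawn within the
    per-feature min/max range of X (an assumed noise model, for illustration)."""
    rng = np.random.default_rng(rng)
    n_noise = int(round(frac * len(X)))
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise])
```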
For the four single-pass algorithms, when the noise proportion is 5%, the average FM values of the SPFRFCM algorithm on the six datasets are 8.86%, 6.40%, and 1.75% higher than those of the SPFCM, SPHFCM, and SPFCOM algorithms, respectively, and the AC values are 8.18%, 10.65%, and 2.3% higher, respectively. When the noise proportion is 10%, the mean FM improvements of SPFRFCM are 8.10%, 9.81%, and 2.4%, respectively, and its mean AC values are 6.10% and 10.30% higher than those of SPFCM and SPHFCM, respectively, but 0.67% lower than that of SPFCOM. When the proportion is 20%, the mean FM values of SPFRFCM are 6.28% and 7.85% higher than those of SPFCM and SPHFCM but 1.01% lower than that of SPFCOM, and the mean AC values are 8.05% and 11.1% higher than those of SPFCM and SPHFCM but 0.4% lower than that of SPFCOM. When the noise proportion is 50%, the mean FM values are 7.60% and 6.93% higher than those of SPFCM and SPHFCM but 4.27% lower than that of SPFCOM, and the mean AC values are 7.95% and 9.56% higher than those of SPFCM and SPHFCM but 4.83% lower than that of SPFCOM. These results show that the robustness of the SPFRFCM algorithm exceeds that of both SPFCM and SPHFCM. Compared with SPFCOM, SPFRFCM is more robust when the noise ratio is below 10%; between 10% and 20%, the two algorithms differ little; above 20%, SPFCOM is the more robust of the two. The Fuzzy C-Ordered-Means (FCOM) clustering is well known for its insensitivity to noise and outliers in data, which our experiments further verify.
For the four online algorithms, when the noise proportion is 5%, the average FM values of the OFRFCM algorithm on the six datasets are 5.38%, 8.18%, and 1.53% higher than those of the OFCM, OHFCM, and OFCOM algorithms, respectively, and the AC values are 6.81%, 10.56%, and 1.85% higher, respectively. When the noise proportion is 10%, the average FM values of OFRFCM are 7.33%, 10.86%, and 3.55% higher, respectively, and the AC values are 7.88%, 11.28%, and 0.98% higher, respectively. When the proportion is 20%, the mean FM improvements of OFRFCM are 7.20%, 9.00%, and 0.26%, respectively, and the mean AC improvements are 9.98%, 11.40%, and −0.98%, respectively. When the noise proportion is 50%, the mean FM values of OFRFCM are 5.66%, 8.18%, and 1.53% higher than those of OFCM, OHFCM, and OFCOM, respectively, and the mean AC values are 8.01% and 8.20% higher than those of OFCM and OHFCM, respectively, but 3.17% lower than that of OFCOM. This part of the experiments shows that the robustness of the OFRFCM algorithm exceeds that of both OFCM and OHFCM, and it again confirms the strong robustness of FCOM.
5. Conclusions
As datasets grow larger, their dimensionality also increases, and traditional clustering algorithms struggle with large-scale, high-dimensional data. We therefore proposed two incremental fuzzy c-means clustering algorithms, named SPFRFCM and OFRFCM. The former divides the whole dataset into several chunks and clusters them successively with the WFRFCM algorithm, with the clustering results of each chunk participating in the clustering of the next. The latter clusters each chunk with the WFRFCM algorithm and then merges the results of all chunks for a final clustering pass. The two algorithms combine the advantages of the incremental framework and feature reduction: they can handle large-scale data while reducing the feature dimensionality, which helps improve the robustness and efficiency of clustering. To evaluate their effectiveness, experiments were carried out on six datasets. The results show that the SPFRFCM algorithm effectively reduces the feature dimensionality through feature reduction and achieves higher clustering accuracy and efficiency than the comparison algorithms. Although the clustering efficiency of OFRFCM is not significantly improved because of its online strategy, it also reduces the feature dimensionality and improves clustering accuracy.
Data Availability
All the datasets used in this paper are derived from the UCI (University of California Irvine) Machine Learning Repository. Please visit https://archive.ics.uci.edu/ml/datasets.php.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to thank the members of the IR&DM Research Group at Henan Polytechnic University for their invaluable advice, which helped bring this paper to completion. The authors also acknowledge the support of the National Science Fund subsidized project under Grant no. 61872126.