Abstract
The neural network is a supervised classification algorithm that can handle highly complex and nonlinear data. A supervised algorithm needs known labels in the training process and then corrects its parameters through backpropagation. However, because such labels are often unavailable, the existing literature mostly uses Auto-Encoders to reduce the dimensionality of the data when facing clustering problems. This paper proposes an RBF (Radial Basis Function) neural network clustering algorithm based on K-nearest-neighbor theory, which first uses the K-means algorithm for preclassification and then constructs self-supervised labels based on K-nearest-neighbor theory for backpropagation. The proposed algorithm is a self-supervised neural network clustering algorithm, and it gives the neural network a genuine capability for self-decision and self-optimization. The experimental results on artificial data sets and UCI data sets show that the proposed algorithm has excellent adaptability and robustness.
1. Introduction
Cluster analysis is an important method of data mining whose core idea is to group highly similar points in a data set into clusters while ensuring that different clusters differ significantly. Clustering can uncover hidden patterns and rules in data, embodies the decision-making ability of artificial intelligence algorithms, and is now widely used in computer science, information security, and image processing.
In recent years, the rapid development of neural networks has demonstrated their power: they handle high-dimensional and complex data well and have achieved successful applications in image clustering [1, 2], facial recognition [3–5], image segmentation [6, 7], and other fields. However, the shortcoming of neural networks is also obvious: the reliance on manually annotated labels restricts their self-decision ability. Existing unsupervised learning methods include clustering and dimension reduction. Clustering algorithms are complex and diverse and can be divided into prototype-based, density-based, and hierarchical methods, while dimension reduction includes Auto-encoders and PCA (Principal Component Analysis), which are mainly used for data preprocessing.
From the existing literature, a common idea is to migrate the loss function of a traditional clustering algorithm onto a neural network structure and achieve clustering through global optimization. Yang et al. [8] propose the DCN model, which migrates the K-means loss function to the feature space of an Auto-encoder and realizes feature learning and clustering through the alternating optimization of network parameters and cluster centers. Yang et al. [9] combine hierarchical clustering with CNNs (Convolutional Neural Networks) and realize feature clustering through the joint optimization of cluster merging and feature learning. SpectralNet [10] introduces the idea of spectral clustering into deep learning: it first learns the similarity matrix between features through a Siamese network, then obtains a new feature measure based on the spectral clustering objective function, and finally performs K-means in the feature space to obtain cluster assignments. VaDE (Variational Deep Embedding) [11] migrates GMM (Gaussian Mixture Model) clustering into the Variational Auto-encoder (VAE) and realizes the joint optimization of feature learning and cluster allocation through distribution constraints on the feature space. Another idea is to directly design a specific clustering loss function based on the desired clustering assumptions [12]. DAC (Deep Adaptive Image Clustering) [13] converts the multiclassification problem into a binary classification problem, continuously generates positive and negative sample pairs with high confidence based on the idea of self-paced learning, uses them as supervised information to guide the training of the model, and finally outputs the clustering result. IMSAT (Information Maximizing Self-Augmented Training) [14] and IIC (Invariant Information Clustering) [15] are based on the same assumption that simple transformations of the data do not alter its intrinsic semantic information, and clustering is achieved by maximizing the mutual information between an original sample and its augmented sample.
In terms of combining traditional clustering algorithms with Auto-encoders, Ren et al. [15] propose a deep density clustering framework that combines density clustering with an Auto-encoder, using t-SNE (t-distributed stochastic neighbor embedding) [16] and DPC (density peaks clustering) [17], and completes training through the alternating optimization of cluster pseudolabels and feature representations. Mrabah et al. [18] propose a model training method in which, during the pretraining phase, reliable feature representations are learned in a self-supervised manner by introducing data augmentation and adversarial interpolation techniques [19]. Because of the high dimensionality of images, combining dimension reduction with traditional clustering algorithms can indeed effectively improve accuracy, but it does not really use the neural network itself to classify the data, nor does it give the neural network the ability to make its own decisions.
Unsupervised clustering methods based on deep learning also use VAEs (Variational Auto-encoders) and GANs (Generative Adversarial Networks) [20, 21], which use existing data to generate data that does not exist in reality and are mainly applied to image generation and sharpening. VaDE is a generative clustering model based on the VAE; this algorithm models the data-generation process by introducing a Gaussian mixture model. GMVAE [22] adopts a strategy similar to VaDE, imposing a Gaussian mixture distribution constraint on the feature space; by minimizing the information constraint, the model avoids falling into a local solution at the beginning of training. Mukherjee et al. [23] propose a clustering method based on generative adversarial networks; the algorithm utilizes mixed discrete and continuous latent variables and constructs new spaces for clustering through interpolation.
Existing cluster analysis methods lack robustness, and the self-decision classification ability of neural networks is low. In order to make full use of the advantages of the neural network, this paper proposes an RBF neural network clustering algorithm based on the KNN graph (RBF-KNN). The core idea is to use the K-means algorithm to generate initial pseudolabels, to exploit the global and local information retention ability of the RBF neural network to train a classification network under these pseudolabels, and then to use a neighbor-based self-supervision method to continuously generate corrected class labels, optimize the resulting neural network, and obtain the clustering results, finally achieving self-optimization and self-decision by the neural network. The proposed procedure is simple and, while having fewer parameters, can effectively handle irregular and unbalanced data sets. The experimental results on the artificial and UCI data sets show that the proposed algorithm has good robustness and adaptability.
2. Related Works
2.1. RBF Neural Network
In 1985, Powell proposed the Radial Basis Function (RBF) method for multivariate interpolation, which uses a Gaussian kernel function in most cases. The RBF neural network is a typical three-layer neural network consisting of an input layer, a hidden layer, and an output layer. The transformation from the input space to the hidden space is nonlinear, while the transformation from the hidden space to the output space is linear. The network structure is shown in Figure 1.

The hidden nodes of a BP (Back Propagation) neural network use the inner product of the input pattern and the weight vector as the argument of the activation function, and the activation function is the sigmoid function. All parameters have an equally important effect on the output of the BP network, so the BP neural network is a global approximation of the nonlinear map.
Compared with the traditional BP neural network, the hidden nodes of the RBF neural network use the similarity between the input pattern and a center vector (for example, the Euclidean distance) as the argument of the function, with a radial basis function as the activation function. The farther the input of a neuron is from the center of the radial basis function, the less activated the neuron becomes (Gaussian function). The RBF neural network therefore carries more local information and has a "local mapping" characteristic.
2.2. K-Nearest Neighbor and K-Means Algorithms
The K-nearest neighbor algorithm is one of the most commonly used supervised classification algorithms. It selects the K closest sample points and obtains the corresponding class label by majority voting, which gives it low complexity and high accuracy. The KNN graph is an undirected graph formed by connecting each sample to its K nearest samples according to the K-nearest-neighbor principle.
K-means is the most commonly used traditional clustering algorithm. It is based on the greedy principle and searches for the optimal partition in an iterative manner; its optimization function is

$$J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \left\| x_j - \mu_i \right\|^2,$$

where $C_i$ is the $i$-th cluster and $\mu_i$ is its center.
The K-means method is simple and practical and can meet most needs, but its sensitivity to noisy data and its restriction to spherical clusters have always been problems that scholars try to solve. The existing DPC algorithm is a density-based clustering method, but it needs to find center points, so it can also be seen as an improvement of the K-means algorithm.
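As a minimal illustration of these two building blocks, the following sketch (assuming NumPy and scikit-learn are available; helper names such as `knn_graph` and `kmeans_objective` are ours, not from the paper) builds a symmetric KNN graph and evaluates the K-means objective J for a fitted partition.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

def knn_graph(X, k=3):
    """Return a symmetric 0/1 adjacency matrix of the KNN graph."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X)
    A = np.zeros((len(X), len(X)), dtype=int)
    for i, neighbors in enumerate(idx[:, 1:]):        # skip the point itself
        A[i, neighbors] = 1
    return np.maximum(A, A.T)                         # make the graph undirected

def kmeans_objective(X, labels, centers):
    """Sum of squared distances of every point to its assigned cluster center (the K-means J)."""
    return float(np.sum((X - centers[labels]) ** 2))

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(kmeans_objective(X, km.labels_, km.cluster_centers_))
print(knn_graph(X, k=3).sum(axis=1)[:5])              # degrees of the first five nodes
```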
3. Introduction of Algorithms in This Paper
In this paper, the algorithm needs to perform K-means clustering twice and multiple iterations of RBF neural network backpropagation.
3.1. Generate Pre-Trained Pseudo-Labels
Since the training of the neural network requires the assistance of class labels, the proposed algorithm first uses the K-means algorithm to cluster the data set (random labels can also be used, but more iterations are then required), and the resulting pseudolabels are transformed as shown in Figure 2.

The resulting pseudolabels are used for the next step of neural network training.
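A minimal sketch of this preclassification step, assuming the pseudolabels are encoded as one-hot target vectors for the network (the exact transformation of Figure 2 is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_pseudolabels(X, n_clusters):
    """Pre-classify X with K-means and turn the cluster indices into one-hot pseudolabel vectors."""
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    Y = np.zeros((len(X), n_clusters))
    Y[np.arange(len(X)), assignments] = 1.0
    return Y

# Example: 3 pseudo-classes for a toy two-dimensional data set
X = np.random.rand(150, 2)
Y_pseudo = generate_pseudolabels(X, n_clusters=3)
```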
3.2. Full RBF Neural Network Training
The traditional RBF neural network is a locally weighted network that selects part of the sample set as center points for training. Its training process uses RBFs as the "basis" of the hidden layer, so that the input vector is directly mapped into the hidden space. The RBF neural network establishes a mapping relationship around the center points, and the mapping from the hidden-layer space to the output space is linear, which means that the output of the network is the linear weighted sum of the outputs of the hidden units. The role of the hidden layer is to map the vectors from a low-dimensional space to a high-dimensional space through the kernel function, so that a problem that is linearly inseparable in low dimensions can become linearly separable in high dimensions.
The output of the RBF network is

$$y = \sum_{i=1}^{n} W_i h_i(x),$$

where y is the output of the RBF neural network, n is the number of neurons in the hidden layer, $W_i$ is the connection weight between the i-th hidden-layer neuron and the output-layer neuron, and $h_i(x)$ is the activation function of the i-th hidden neuron. The activation function usually takes the Gaussian function, which is defined as

$$h_i(x) = \exp\left(-\frac{\left\| x - c_i \right\|^2}{2\sigma_i^2}\right),$$

where x is the input matrix, $c_i$ is the selected center point, $\sigma_i$ is the width of the neuron, and $\left\| x - c_i \right\|$ is the Euclidean distance between the input matrix and the radial basis center.
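A minimal sketch of this hidden-layer computation, with the centers and a shared width passed in explicitly (variable names are ours):

```python
import numpy as np

def rbf_hidden_layer(X, centers, sigma):
    """Gaussian activations h_i(x) = exp(-||x - c_i||^2 / (2*sigma^2)) for every sample/center pair."""
    # squared Euclidean distances between all samples (rows of X) and all centers
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.random.rand(10, 2)
H = rbf_hidden_layer(X, centers=X, sigma=0.5)   # full RBF: every sample is kept as a center
print(H.shape)                                  # (10, 10)
```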
Clustering can be seen as a process of jointly optimizing the neural network outputs and weights. The output of the radial basis network used in this algorithm takes the form given above with every sample retained as a center, and its objective function can be seen as the minimization of

$$\min_{W} \frac{1}{N} \sum_{j=1}^{N} \left\| y(x_j) - \hat{y}_j \right\|^2,$$

where $\hat{y}_j$ is the pseudolabel of sample $x_j$ and N is the number of samples.
In this paper, the algorithm no longer selects center points but retains all samples as centers and uses Gaussian kernel functions to map them to the high-dimensional space, so that the RBF neural network retains the information between the data to the greatest extent. Using the pseudolabels to train the RBF neural network, the algorithm selects gradient descent as the backpropagation method and finally obtains suitable network weights after several iterations of training.
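The following sketch trains the output weights of such a full RBF network on the one-hot pseudolabels with plain gradient descent on the mean squared error; the learning rate and epoch count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_rbf(X, centers, sigma):
    """h_i(x) = exp(-||x - c_i||^2 / (2*sigma^2)) for every sample/center pair."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def train_full_rbf(X, Y_pseudo, sigma=0.5, lr=0.1, epochs=200):
    """Fit output weights W so that the full RBF network H(X) @ W approximates the pseudolabels."""
    H = gaussian_rbf(X, centers=X, sigma=sigma)       # every sample is kept as a center
    W = np.zeros((H.shape[1], Y_pseudo.shape[1]))
    for _ in range(epochs):
        error = H @ W - Y_pseudo                      # prediction error on the pseudolabels
        W -= lr * (H.T @ error) / len(X)              # gradient-descent step on the MSE loss
    return W
```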
3.3. Generation of Self-Supervised Remediation Labels
Due to the limitations of traditional partition-based clustering algorithms, the class labels obtained by preprocessing are not necessarily correct when facing nonspherical clusters and unbalanced datasets, so the network parameters need to be corrected.
According to the K-nearest-neighbor principle, from the microscopic point of view a data sample should have the same class label as its adjacent data points, so the corrected label is the mean of the class labels of its neighboring samples; the specific formula is

$$\hat{y}_i = \frac{1}{K} \sum_{j \in N_K(i)} y_j,$$

where $N_K(i)$ is the set of the K nearest neighbors of sample i and $y_j$ is the label vector of neighbor j.
The loss function in this paper uses the MSE (Mean Square Error), and the backpropagation algorithm again uses gradient descent; after multiple rounds of iteration, the new measure labels are finally obtained.
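A minimal sketch of one correction round under this reading of the formula, with the neighbors taken over the input space (the function name and the choice of `n_neighbors` are our assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_corrected_labels(X, Y_out, n_neighbors=3):
    """Replace each sample's label vector by the mean label vector of its K nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return Y_out[idx[:, 1:]].mean(axis=1)    # drop the sample itself, average its neighbors

# One self-supervision round: correct the current network outputs, then retrain the
# weights on the corrected labels with the same gradient-descent routine as above.
# Y_corrected = knn_corrected_labels(X, network_outputs, n_neighbors=3)
```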
3.4. Clustering Generating Class Labels
After feeding the data set X into the trained network again, the new measure labels are obtained, and the K-means clustering algorithm is applied once more to the resulting outputs to obtain the final clustering result.
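A short sketch of this final step, where the network outputs after the correction rounds (a placeholder array here) are clustered to produce the class labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the corrected network outputs (one row per sample).
Y_final = np.random.rand(150, 3)
final_labels = KMeans(n_clusters=3, n_init=10).fit_predict(Y_final)
```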
The pseudocode of the algorithm in this paper is shown in Table 1.
4. Experimental Results and Analysis
4.1. Artificial Data Set Verification
An artificial data set is a man-made data set with obvious characteristics that can easily be judged by experience. However, artificial data can still pose considerable difficulty for artificial intelligence algorithms in terms of data set shape, density, imbalance, or other properties, so it can test well the adaptability of an algorithm to one or several types of complex properties.
The artificial data sets used in this paper are all two-dimensional (Table 2); two-dimensional data sets display the quality of the clustering results more intuitively and allow an objective evaluation of the performance of the algorithm. The six data sets selected in this paper are typical artificial data sets with problems of density, shape, or other complex properties between clusters.
The algorithm in this paper first uses the K-means algorithm for preprocessing; the K-means results are shown on the left side of Figures 3–8, and the results of the proposed RBF-KNN algorithm are shown on the right side of Figures 3–8.






The proposed algorithm is a neural network self-supervised clustering method that uses partition-based clustering for preprocessing, so the six artificial data sets selected in this paper are all data sets that the K-means algorithm cannot handle well. The experimental results show that RBF-KNN achieves an accuracy of 100% on the "aggregate," "long," "spiral," and "target" data sets, while there are still deficiencies in the details of the results on the "jain" and "flame" data sets; nevertheless, these results are greatly improved compared with the preprocessing results. The remaining deficiencies on some data sets are due to the functional limitations of the neural network.
The structure of the RBF-KNN algorithm is simple, with only two parameters (the K of the K-nearest-neighbor step is basically 2 or 3), and compared with traditional partition-based clustering algorithms, its results can basically meet the requirement of clustering robustness.
4.2. UCI Data Validation and Evaluation Indicators
In order to verify that the proposed algorithm can achieve a good clustering effect when dealing with practical problems, several UCI data sets are selected to test the performance of the proposed algorithm (Table 2), all of which are taken from the UCI machine learning repository.
The UCI data sets come from different types of industries. "Ecoli" is a data set on molecular research and cell biology for the determination and prediction of yeast data. "Iris" is a classification data set about iris plants, used to distinguish three species. The "Seeds" data set consists of coefficients obtained by X-ray technology for three wheat varieties and serves as a test data set for wheat classification. The "Soybean" data set is Michalski's famous soybean disease database for predicting soybean diseases, while the "Segment" data set classifies image data with a larger number of numerical attributes and can better test the performance of the proposed algorithm.
The evaluation criteria selected in this paper include the V-measure coefficient, the adjusted Rand index (ARI), and the normalized mutual information (NMI), among which the ARI requires the use of the Rand index (RI). Because using a single evaluation index would make the evaluation too one-sided, this paper selects multiple evaluation indicators for cross-validation, as in formulas (6)–(8):

$$V = \frac{2hc}{h + c},$$
$$RI = \frac{a + b}{C_n^2},$$
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]},$$

where h denotes homogeneity; c indicates completeness; a and b are the numbers of sample pairs assigned consistently to the same and to different categories, respectively; and $C_n^2$ is the number of ways of picking 2 samples from all n samples. As an adjusted version of RI, ARI is computed from the resulting RI.
V-measure represents the harmonic mean of homogeneity and completeness; its value range is [0, 1], and the larger the value, the better the clustering effect. ARI indicates the degree to which the resulting category information matches the expected categories; its range is [−1, 1], and the larger the value, the higher the agreement between the clustering result and the real situation.
NMI is defined in formulas (9) and (10):

$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)},$$
$$NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\,H(Y)}},$$

where X and Y represent the two label variables, I(X, Y) represents the mutual information of the two variables, and H(X) and H(Y) represent the entropies of the respective variables.
NMI measures the interdependence between variables, indicating the strength of the relationship between two variables. The value range of the NMI index is [0, 1], and the closer the value is to 1, the better the clustering effect; conversely, the clustering effect is considered to be poor. Through the above evaluation methods, the processing effect of the various algorithms on the data sets can be compared more intuitively.
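In practice these three indicators can be computed directly, for example with scikit-learn (a sketch; the ground-truth and predicted label arrays below are placeholders):

```python
from sklearn.metrics import (v_measure_score, adjusted_rand_score,
                             normalized_mutual_info_score)

y_true = [0, 0, 1, 1, 2, 2]     # ground-truth classes (placeholder)
y_pred = [0, 0, 1, 2, 2, 2]     # cluster assignments produced by an algorithm (placeholder)

print("V-measure:", v_measure_score(y_true, y_pred))
print("ARI:      ", adjusted_rand_score(y_true, y_pred))
print("NMI:      ", normalized_mutual_info_score(y_true, y_pred))
```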
This section selects several types of unsupervised learning methods for comparison with the proposed algorithm, including the partition-based K-means algorithm, the K-nearest-neighbor-based DPC density clustering algorithm, the VaDE clustering algorithm based on the VAE and GMM, the DCN clustering algorithm combining an Auto-encoder with K-means, and the semisupervised, graph-theory-based SpectralNet clustering algorithm, for which only k labeled points are selected in this paper.
The experimental results show that, compared with the traditional K-means algorithm and the DPC algorithm combined with KNN, the proposed algorithm obtains better clustering results on each data set, and the comprehensive comparison of the V-measure, ARI, and NMI indicators shows that the results obtained by the proposed algorithm are superior (Tables 3–5).
Compared with the neural-network-based clustering algorithms VaDE and DCN, which are designed for image processing, both are inefficient in coping with low-dimensional data sets; SpectralNet, as a semisupervised clustering algorithm, also has a poor clustering effect when little prior knowledge is available. As a self-supervised clustering algorithm, the proposed method obtains better results than the five selected algorithms on all three evaluation criteria, which shows that RBF-KNN has better stability and robustness.
4.3. Discussion of RBF-KNN Algorithm Performance and Stability
From the V-measure results on the artificial data sets, the clustering accuracy of the RBF-KNN algorithm on the Jain data set is relatively low, so the clustering process on the Jain data is selected for analysis in this section. The MSE loss in each iteration is shown in Figure 9(a), and the algorithm is quickly optimized by about the 40th iteration. The main step of the algorithm is to map the data set through the Gaussian function and back-propagate, so the complexity of the RBF-KNN algorithm is O(n² · epochs).

Figure 9: (a) MSE loss per iteration on the Jain data set; (b) V-measure stability results.
Without explicit class-label correction, the clustering process of the neural network would fluctuate greatly because of initialization problems and missing labels. In terms of stability, this section still uses the Jain data set and runs the process for 100 epochs. The V-measure result is shown in Figure 9(b); it can be seen that the clustering results of the proposed algorithm do not fluctuate significantly, showing good stability.
5. Conclusion
In this paper, an RBF neural network clustering algorithm based on the K-nearest-neighbor principle (RBF-KNN) is proposed, which is a self-supervised clustering algorithm. The central idea is to use the full RBF network to retain global information and then back-propagate to self-supervise the resulting neural network based on the K-nearest-neighbor principle, in order to solve the poor adaptability and lack of robustness of traditional clustering algorithms. The processing results on the artificial and UCI data sets show that the performance of the proposed algorithm is excellent and that it can handle multitype and unbalanced data well.
Based on the above analysis, compared with the lack of versatility of traditional clustering methods and the complex prior conditions of semisupervised clustering, the proposed algorithm can ensure clustering accuracy while keeping the procedure and parameter settings simple, which gives it clear advantages over the traditional clustering algorithms and semisupervised neural network clustering algorithms compared above.
Data Availability
The artificial data used to support the findings of this study are included within the article. The UCI data used to support the findings of this study have been deposited in the “https://archive.ics.uci.edu/ml/index.php” repository.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Shandong Provincial Natural Science Foundation (ZR2021MF045).