Abstract

Community structure detection is used to understand the structure and function of complex networks, to reveal the relations within them, and to find the rules by which they evolve. Monitoring and forecasting this evolution is of theoretical significance and has wide application prospects in epidemic monitoring, online public opinion analysis, recommendation, advertising push, counterterrorism, and the safeguarding of national security. The label propagation algorithm has been one of the popular community detection algorithms in recent years. Its biggest advantages are its simple logic and, relative to modularity optimization algorithms, its very fast convergence; the clustering process needs no optimization function, and the number of communities does not need to be specified before initialization. However, the algorithm suffers from unstable partitioning results and strong randomness. To solve these problems, this paper proposes an unsupervised label propagation community detection algorithm based on density peak. The proposed algorithm first uses the density peak to find the clustering centers, which determines the prototype of each community and fixes the number of communities and the clustering centers of the complex network. It then uses label propagation to detect the communities, which improves the accuracy and robustness of community discovery, reduces the number of iterations, and accelerates the formation of communities. Finally, experiments on synthetic and real network data sets show that the proposed method achieves better performance.

1. Introduction

Community structure is a very important attribute of complex networks. It plays a crucial role not only in the analysis of social relations in human society [1] but also in the analysis of functional relations between biological network organizations and organs [2] and of citation relations in collaboration networks among scientists [3]. Therefore, the discovery of community structures in complex networks has been extensively studied in the past decade [4–8].

In 2002, Girvan and Newman, in pioneering work, pointed out that community structure is common in complex networks and proposed modularity to measure the stability of communities in networks [1]. Although no definition of community structure has been unanimously agreed on, a community is generally considered to be a group of nodes, also called a cluster or module, characterized by tight internal connections and sparse external connections [9].

As one of the hot spots of current research, community discovery based on label propagation has been widely used in community detection. Label propagation is a graph-based semisupervised learning method [10]. The advantage of semisupervised learning is that it can label a large number of unlabeled samples from a small number of labeled ones, thus improving the effectiveness of the learning process [11, 12]. The basic idea of label propagation is to predict the labels of unlabeled nodes from the labels of labeled nodes by using the topological relations between nodes, and finally to divide the graph into a clustering structure. The algorithm is simple to implement, has clear logic, does not need the number of communities in advance, and has close to linear time complexity, but it suffers from unstable partition results and strong randomness. In each iteration, the community a node joins is given by the label with the largest cumulative weight among its neighbors. When several labels tie for the largest weight, one of them is selected at random. This randomness brings an avalanche effect: a small clustering error at the beginning is continuously amplified. The updating order of node labels also has a great impact: the earlier the most important nodes are updated, the faster the process converges. Moreover, the closer the initial labels are set to the core points, the more accurate the clustering. However, in specific applications, it is often not feasible to know the number of communities in advance, and it is very inefficient to determine it by searching all possible candidate values.
Therefore, inspired by the density peak algorithm (DP) [13], we propose a label propagation algorithm based on density peak (DPLPA) for community detection in complex networks. The central idea of DP is that core nodes are surrounded by other nodes of the same class and that core nodes are not closely connected to each other. In other words, core nodes have higher density, so the algorithm can be used to calculate the number of cores. Unfortunately, DP cannot be applied directly to a complex network. We therefore use an improved DP algorithm that can be applied to complex networks and yields a reasonable number of cores, feed this number into the label propagation algorithm, and, according to the network topology, use the similarity matrix to update nodes by priority, which reduces both the randomness and the number of iterations.

2. Background

2.1. Label Propagation Algorithm

Raghavan et al. proposed the label propagation algorithm (LPA) [14], which uses the label values of a few preset nodes to divide the community structure of a large-scale complex network. However, the accuracy of LPA is low because of the randomness of propagation, which leads to large errors in the clustering results. The reason is that when several labels tie for the highest frequency among the neighbors of a node, the algorithm treats every label equally and randomly selects one of them as the label of the node being updated. As a result, the division may contain small, fragmented communities or overly large communities that are not in line with the actual situation. Figure 1 shows a situation in which an error occurs during label propagation: one label ends up in two communities, which is not in line with the actual situation.
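The tie-breaking behavior described above can be made concrete with a small sketch. It is not the reference implementation of [14]; it assumes an asynchronous update order, unique initial labels, and the rule that a node keeps its label when that label is already among the most frequent ones. The function names `lpa` and `as_partition` and the toy graph are illustrative only.

```python
import random
from collections import Counter

def lpa(adj, seed=None, max_sweeps=100):
    """Asynchronous label propagation; adj maps each node to a list of neighbours."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}           # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                 # random update order
        changed = False
        for v in nodes:
            counts = Counter(labels[u] for u in adj[v])
            if not counts:
                continue                   # isolated node keeps its label
            top = max(counts.values())
            best = sorted(l for l, c in counts.items() if c == top)
            if labels[v] not in best:
                labels[v] = rng.choice(best)   # tie among top labels: random pick
                changed = True
        if not changed:
            break
    return labels

def as_partition(labels):
    """Group nodes by label so that different runs can be compared."""
    groups = {}
    for v, l in labels.items():
        groups.setdefault(l, []).append(v)
    return sorted(sorted(g) for g in groups.values())

# Two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
runs = [as_partition(lpa(adj, seed=s)) for s in range(20)]
distinct = {tuple(map(tuple, p)) for p in runs}
print(len(distinct), "distinct partitions in 20 runs")
```

Running the sketch with different seeds may return different partitions of the same graph, which is the instability that the improvements discussed next try to reduce.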

In view of the problems of LPA, scholars have proposed many improvements. Tibély and Kertész [15] proved that LPA produces different community structures for the same network, so the algorithm retains a great deal of randomness. Leung et al. [16] showed that LPA can be applied to networks at the scale of tens of millions, revealing its potential for large-scale data. Barber and Clark [17] proposed LPAm, which adds constraints to address the inability of LPA to integrate different clustering results well. Liu and Murata [18] addressed the tendency of LPAm to fall into local optima by optimizing the modularity. Zhuoxiang et al. [19] estimated the number of communities from the potential influence of nodes; when the estimate is less than the actual number of communities, the algorithm does not obtain the correct partition. Xie and Szymanski [20] combined the label propagation algorithm with the Markov clustering algorithm (MCL) and proposed a new label propagation algorithm, LabelRank, whose main feature is that a node can hold multiple neighbor labels during propagation. Lin et al. [21] sorted nodes by weight and then updated the node labels in that order. Zhang et al. [22] proposed a labeling algorithm based on the edge clustering coefficient. Kipf and Welling [23] extended graph-based label propagation with a graph convolutional neural network, in which label information propagates through the aggregation of adjacent nodes. In addition, PageRank has been used to quantify the importance of nodes, leading to the LPAp algorithm [24]. Improved label propagation community discovery algorithms based on feedback control [25], objective functions [17], circles [26], and other methods have also been proposed.
The above algorithms optimize how node labels are handled during propagation, which improves the stability and accuracy of LPA to a certain extent, but most of them add computational overhead and do not achieve ideal results.

Zhu et al., however, proposed another label propagation algorithm (LP) [27]. They described the clustering problem as a form of propagation on a graph, in which the label of a node propagates to neighboring nodes according to the similarity between them. In this process, LP fixes the labels of a small number of known labeled data points. The labeled data then act like signal sources that push their labels through the unlabeled data. An accurate set of known labels therefore plays an important role in the propagation process of LP and greatly improves the accuracy of the clustering results.

The algorithm based on label propagation can be described as follows:
(1) Propagate the labels: $F \leftarrow PF$
(2) Reset the labels of the core points in $F$: $F_L = Y_L$
(3) Repeat steps (1) and (2) until $F$ converges

Here, step (1) multiplies the probability transition matrix $P$ by the label matrix $F$, so that the label of each node $i$ propagates to every other node $j$ with probability $P_{ij}$. The more similar two nodes are, the more easily one takes on the label of the other. In step (2), the known labels are the most important information and must not change, so after every propagation they are reset to their original values. As the labeled points keep propagating their labels, the final class boundaries move across the high-density regions and come to rest in the low-density gaps between them. This is equivalent to the labeled nodes of the different categories dividing up their spheres of influence.
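The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: it assumes the labeled nodes come first in the node ordering, that the transition matrix (called `P` here) is row-stochastic, and that convergence means the largest entry-wise change of the label matrix (called `F` here) drops below a tolerance `tol`. The function name `propagate` is hypothetical.

```python
def propagate(P, Y_L, tol=1e-9, max_iter=1000):
    """Clamped label propagation.

    P   : n x n row-stochastic transition matrix (list of lists)
    Y_L : l x C one-hot label rows for the first l nodes (the labelled ones)
    """
    n, l, C = len(P), len(Y_L), len(Y_L[0])
    F = [row[:] for row in Y_L] + [[0.0] * C for _ in range(n - l)]
    for _ in range(max_iter):
        # Step (1): propagate, F <- P F
        F_new = [[sum(P[i][k] * F[k][c] for k in range(n)) for c in range(C)]
                 for i in range(n)]
        # Step (2): reset the labelled rows to their known labels
        F_new[:l] = [row[:] for row in Y_L]
        # Step (3): stop once F no longer changes
        change = max(abs(F_new[i][c] - F[i][c])
                     for i in range(n) for c in range(C))
        F = F_new
        if change < tol:
            break
    return F

# Path a - c - d - b with a and b labelled (indices: a=0, b=1, c=2, d=3).
P = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.5, 0.5, 0.0]]
Y_L = [[1.0, 0.0], [0.0, 1.0]]
F = propagate(P, Y_L)
print(F[2], F[3])
```

In this toy path graph, node c sits next to the source of class 0 and node d next to the source of class 1, so each ends up with about two thirds of its label mass on the nearer class.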

However, how to determine the number of known labels is still an open question. Traditional community detection algorithms obtain the number of communities by optimizing an objective function or evaluation index. These methods are easily affected by many factors, such as the initial matrix and the choice of objective function, so it is difficult to determine the number of communities accurately. To solve this problem, we use an improved density peak clustering to obtain the number of core points as the input parameter of LP.

2.2. Density Peak Algorithm

In 2014, Rodriguez and Laio [13] proposed a density-based clustering method in Science that can recognize clusters of various shapes and whose parameters are easy to determine. The method is robust and overcomes the disadvantages of the DBSCAN algorithm [28], which struggles when the density differs greatly among classes and for which the neighborhood range is difficult to determine. The core idea of the density peak algorithm (DP) rests on one assumption: the density of a cluster center is greater than the density of its surrounding neighbors, and the distance between a cluster center and any point of higher density is relatively large. Therefore, the DP algorithm calculates two quantities: the local density of a node and its distance from the nodes of higher density. Usually, $\rho_i$ represents the local density of node $i$, and $\delta_i$ represents the distance between node $i$ and the nearest node of higher density.

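There are two ways to define the local density $\rho_i$. The first uses a cut-off kernel:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \tag{1}$$

where

$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0. \end{cases} \tag{2}$$

Here, $d_{ij}$ represents the distance from node $i$ to node $j$, and $d_c$ is the cut-off distance, so $\rho_i$ is the number of nodes whose distance to node $i$ is less than $d_c$.

The second method is the Gaussian kernel function:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{3}$$

The minimum distance between node $i$ and the nodes of higher local density is denoted by $\delta_i$ and defined as

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}. \tag{4}$$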

Once $\rho_i$ and $\delta_i$ have been calculated for all nodes, only the nodes with both high $\rho_i$ and high $\delta_i$ can become cluster centers, while points with a large $\delta_i$ but a small $\rho_i$ are outliers. Each remaining node is assigned to the same cluster as its nearest neighbor of higher local density.
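As an illustration of the two DP quantities, the following sketch computes the local density with the Gaussian kernel and the distance to the nearest denser point from a precomputed distance matrix. It is a minimal sketch under stated assumptions: the point with the highest density receives its largest distance to any other point (the convention of [13]), and the function name `density_peaks`, the cut-off `d_c = 1.0`, and the one-dimensional toy points are all illustrative.

```python
import math

def density_peaks(D, d_c):
    """Return (rho, delta) for a symmetric distance matrix D (list of lists)."""
    n = len(D)
    # Gaussian-kernel local density: every other point contributes exp(-(d/d_c)^2)
    rho = [sum(math.exp(-(D[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [D[i][j] for j in range(n) if rho[j] > rho[i]]
        # Densest point: no denser neighbour exists, so use its largest distance
        delta.append(min(higher) if higher else max(D[i]))
    return rho, delta

# Two well-separated groups of points on a line.
pts = [0.0, 1.0, 1.8, 10.0, 10.9, 11.5, 12.6]
D = [[abs(a - b) for b in pts] for a in pts]
rho, delta = density_peaks(D, d_c=1.0)

# Points with both large rho and large delta are centre candidates.
ranked = sorted(range(len(pts)), key=lambda i: rho[i] * delta[i], reverse=True)
print("centre candidates:", ranked[:2])
```

In this toy data the two highest products of density and distance fall on one point in each group, which is the pattern the decision graph is meant to expose.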

Because DP is a density-based clustering algorithm, it can detect clusters of arbitrary shape without the number of center points being set in advance. Moreover, the selection of the center points can be inspected visually through the decision graph. However, DP still has some defects. First, the cut-off distance must be set manually, and an improper setting has a great impact on the clustering results. Second, the center points must be selected manually, so subjective human factors affect the clustering results.

3. Methodology

In this section, the proposed label propagation algorithm based on density peak optimization (DPLPA) is introduced. The core idea of DPLPA is to regard high-density nodes surrounded by low-density neighbors as community center points, with the community center points far away from each other. In other words, a node with a higher density is more closely connected to its neighbors and is more likely to be the core point of a community. A community network is a complex network whose structure is reflected by the connections between nodes. DP, in contrast, is a density-based clustering algorithm that handles data sets of any shape by calculating the distance between points and using high-density areas as the basis for judgment. Calculating distance directly from coordinates is not applicable to community networks. If the distance between nodes in a community network is calculated in this way, the similarity between nodes becomes meaningless because the distances are nearly uniform or even identical. Therefore, DP cannot be used directly to detect communities in networks. To solve this problem, this paper uses the improved DP algorithm [29] to obtain the number of communities in a complex network as the input parameter of the label propagation algorithm.

3.1. Predictive Fetch of Label Matrix

Let $G = (V, E)$ be an undirected, unweighted complex network. The node set $V$ contains $n$ nodes, the edge set $E$ contains $m$ edges, and the adjacency matrix of the graph is $A$. If node $i$ and node $j$ are connected by an edge, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$. The similarity of node $i$ and node $j$ is measured by the Salton index [30], also known as cosine similarity:

$$s_{ij} = \frac{|\Gamma(i) \cap \Gamma(j)|}{\sqrt{k_i k_j}}, \tag{5}$$

where $\Gamma(i)$ and $\Gamma(j)$ represent the neighbor sets of node $i$ and node $j$, respectively, and $k_i$ represents the number of neighbors of node $i$. The numerator is the number of neighbors shared by node $i$ and node $j$, while the denominator represents the number of neighbors that node $i$ and node $j$ are expected to share. The value of $s_{ij}$ lies between 0 and 1; the closer $s_{ij}$ is to 1, the more similar the two nodes are. The distance between node $i$ and node $j$ is

$$d_{ij} = \frac{1}{s_{ij} + \varepsilon}, \tag{6}$$

where $\varepsilon$ is a small positive number that keeps the denominator from being 0.
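The similarity and distance computation can be sketched as follows. The Salton index is the number of common neighbors divided by the square root of the product of the two degrees. The distance form used here, the reciprocal of similarity plus a small constant `eps`, is an assumption read off the surrounding description rather than a confirmed formula, and the function names and the value `eps=1e-3` are illustrative.

```python
import math

def salton_similarity(adj):
    """Salton (cosine) index for every ordered pair; adj maps node -> set of neighbours."""
    S = {}
    for i in adj:
        for j in adj:
            if i == j:
                continue
            k_i, k_j = len(adj[i]), len(adj[j])
            common = len(adj[i] & adj[j])          # shared neighbours
            S[i, j] = common / math.sqrt(k_i * k_j) if k_i and k_j else 0.0
    return S

def similarity_to_distance(S, eps=1e-3):
    """Assumed distance: reciprocal of similarity, with eps avoiding division by zero."""
    return {pair: 1.0 / (s + eps) for pair, s in S.items()}

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
S = salton_similarity(adj)
D = similarity_to_distance(S)
print(S[0, 1], S[2, 3], D[0, 1])
```

Under this neighbor definition, two adjacent nodes without a common neighbor (nodes 2 and 3 here) get similarity 0 and therefore the largest distance. Whether the neighbor sets should include the node itself is not stated in the text, so that choice is left open in the sketch.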

Next, the local density of each node is calculated. Of the two available methods, the Gaussian kernel function is used:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right), \tag{7}$$

where $\rho_i$ represents the local density of node $i$, $d_{ij}$ represents the distance between node $i$ and node $j$, and $d_c$ represents the cut-off distance, whose size is selected according to [13]. The value of $\rho_i$ is then normalized.

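Then, the distance $\delta_i$ of each node is defined:

$$\delta_i = \begin{cases} \max\limits_{j} d_{ij}, & \text{if } \rho_i \text{ is the largest local density}, \\ \min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{otherwise}. \end{cases} \tag{9}$$

That is, when the local density of node $i$ is the largest, $\delta_i$ is the maximum distance between node $i$ and the other nodes. When the local density of node $i$ is not the largest, $\delta_i$ is the distance between node $i$ and the nearest node whose local density is larger than that of node $i$.

The value of $\delta_i$ is then standardized.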

The threshold $d_c$ is selected from the list of distances $d_{ij}$ sorted from small to large, at the position recommended in [13].

Finally, a decision graph is generated with $\rho$ as the $x$-axis and $\delta$ as the $y$-axis. Then, we calculate $\gamma_i = \rho_i \delta_i$ for each node $i$, put the values of $\gamma_i$ that exceed the sum of the mean and the standard deviation of $\gamma$ into a list, and sort the list. An appropriate cut-off value on this list gives the number of core points, which serve as the known labels $Y_L$ in the label propagation algorithm.
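The candidate selection step can be sketched as below: the product of local density and distance is computed per node, and nodes whose product exceeds the mean plus one standard deviation are kept and sorted. This is only a sketch of the filtering rule described above. It assumes the inputs are already normalized, uses the population standard deviation (the text does not say which estimator is meant), and leaves out the final choice of a cut-off on the sorted list. The name `select_cores` and the sample values are hypothetical.

```python
import statistics

def select_cores(rho, delta):
    """Return candidate core nodes sorted by gamma = rho * delta, plus gamma itself."""
    gamma = [r * d for r, d in zip(rho, delta)]
    # Assumption: population standard deviation; the text does not specify the estimator.
    threshold = statistics.mean(gamma) + statistics.pstdev(gamma)
    candidates = [i for i, g in enumerate(gamma) if g > threshold]
    candidates.sort(key=lambda i: gamma[i], reverse=True)
    return candidates, gamma

# Hypothetical normalized densities and distances for six nodes.
rho = [0.2, 1.0, 0.3, 0.9, 0.25, 0.1]
delta = [0.1, 1.0, 0.15, 0.9, 0.1, 0.2]
cores, gamma = select_cores(rho, delta)
print("candidate cores:", cores)
```

With these sample values only nodes 1 and 3 combine a high density with a large distance, so they are the only candidates that pass the threshold.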

3.2. Label Propagation Algorithm Based on Density Peak

LP is a graph-based clustering algorithm, so a graph needs to be constructed first. The nodes of the graph are the data points. This paper uses the Gaussian kernel to construct the weight of the edge between two nodes:

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right), \tag{11}$$

where $d_{ij}$ is the distance between node $i$ and node $j$ and $\sigma$ is a hyperparameter. The weights $w_{ij}$ make up the similarity matrix $W$.

Next, the known labels are propagated through the edges between nodes. The greater the weight of an edge, the more similar its two nodes are and the more easily a label spreads between them. We define the probability transition matrix:

$$P_{ij} = \frac{w_{ij}}{\sum_{k=1}^{n} w_{ik}}, \tag{12}$$

where $P_{ij}$ represents the probability of propagating the label of node $i$ to node $j$. Since the core points have known labels, a label matrix $Y_L$ is defined. Its $i$th row is the label indicator vector of node $i$: if the label of the $i$th node is $j$, then the $j$th element is 1 and the rest are 0. An unlabeled matrix $Y_U$ is likewise defined for the unlabeled nodes. Combining the two gives the label matrix of all nodes:

$$F = \begin{bmatrix} Y_L \\ Y_U \end{bmatrix}. \tag{13}$$
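A minimal sketch of how the weight matrix and the probability transition matrix might be built from a distance matrix follows. It assumes a Gaussian weight of the form exp(-d²/σ²), zero self-weights, and row normalization. The name `transition_matrix`, the parameter `sigma`, and the guard for rows whose weights all underflow to zero are illustrative choices, not details taken from the paper.

```python
import math

def transition_matrix(D, sigma=1.0):
    """Build Gaussian weights W and the row-normalized transition matrix P."""
    n = len(D)
    # Assumption: no self-loops, so the diagonal weight is set to zero.
    W = [[math.exp(-(D[i][j] ** 2) / sigma ** 2) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    P = []
    for row in W:
        total = sum(row)
        # Guard: if every weight underflowed to 0, leave the row as zeros.
        P.append([w / total for w in row] if total > 0 else row[:])
    return W, P

# Hypothetical distances between three nodes.
D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
W, P = transition_matrix(D, sigma=2.0)
for row in P:
    print([round(p, 3) for p in row])
```

The weight matrix is symmetric, but the transition matrix generally is not, because each row is normalized by its own sum. When distances are large relative to `sigma`, the weights can underflow to zero, so the scale of `sigma` has to match the scale of the distances.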

Then, the label matrix $F$ is propagated according to the similarity between nodes encoded in the probability matrix $P$:

$$F \leftarrow PF. \tag{14}$$

Propagation also changes the block $F_L$ of $F$ that belongs to the core nodes. However, $Y_L$ holds the labels assigned to the core nodes beforehand, which are accurate and should not be changed. Therefore, the label matrix $F$ needs to be reset:

$$F_L = Y_L. \tag{15}$$

Then, the label matrix $F$ is propagated through the probability matrix $P$ again, and the $F_L$ part of the propagated matrix is reset. We iterate this process until the change of the unlabeled part $Y_U$ of matrix $F$ between iterations falls below a critical value. At this point, DPLPA has completed the label partition. Algorithm 1 shows the algorithm flow of DPLPA.

DPLPA
  Input: Complex network $G = (V, E)$
  Output: Label matrix $F$
  Construct the adjacency matrix $A$ from the complex network.
  Calculate the node similarity $s_{ij}$ by Equation (5).
  Calculate the distance matrix between nodes by Equation (6).
  Calculate the local density $\rho$ of each node by Equations (7) and (8).
  Calculate $\delta$ by Equations (9) and (10).
  Calculate $\gamma$ to get the core points.
  Get the probability transition matrix $P$ by Equations (11) and (12).
  Build the label matrix $F$ by Equation (13).
  while convergence criteria not reached do
      $F \leftarrow PF$
      $F_L \leftarrow Y_L$
  end while
  /* Iterate the update until convergence, that is, until the node labels change very little. */

After the clustered label matrix $F$ is obtained, the algorithm gathers the nodes whose label indicator in $F$ falls in the same dimension into one community. All nodes are divided according to their dimension. The clustering is then finished, and the complex network has been divided.
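The final grouping step can be illustrated with a short sketch. Because the propagated label matrix generally holds fractional values, the sketch assigns each node to the column with the largest entry in its row; using the arg max is an assumption, and the function name `communities_from_labels` and the sample matrix are hypothetical.

```python
def communities_from_labels(F):
    """Group node indices by the column holding the largest value in each row of F."""
    groups = {}
    for node, row in enumerate(F):
        community = max(range(len(row)), key=row.__getitem__)  # arg max over columns
        groups.setdefault(community, []).append(node)
    return groups

# Hypothetical converged label matrix for four nodes and two communities.
F = [[1.0, 0.0],
     [0.0, 1.0],
     [2 / 3, 1 / 3],
     [1 / 3, 2 / 3]]
print(communities_from_labels(F))
```

If a row contains a tie, `max` returns the first maximal column, so ties are resolved toward the lower community index in this sketch.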

4. Experimental Study

To assess our algorithm, we test it on a variety of real and synthetic data sets and compare it with several classic methods: the Newman fast greedy algorithm (FN) [31], the Louvain algorithm (BGLL) [32], LPA [14], and the improved label propagation algorithm (LPAm) [17]. The hardware environment of the experiment is an Intel(R) Core(TM) i7-7700M CPU at 3.60 GHz with 8 GB of memory. DPLPA is implemented in 64-bit Python 3.7.

4.1. Evaluation Metrics

To verify the accuracy of the algorithm, we use the modularity function $Q$ proposed by Newman [31] as an evaluation index of the experiment. Modularity is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j), \tag{16}$$

where $m$ represents the total number of edges of the community network, $A$ represents the adjacency matrix, $k_i$ represents the degree of node $i$, and $C_i$ represents the community to which node $i$ is allocated. $\delta(C_i, C_j)$ is defined as follows:

$$\delta(C_i, C_j) = \begin{cases} 1, & C_i = C_j, \\ 0, & \text{otherwise}. \end{cases} \tag{17}$$

That is, when node $i$ and node $j$ are in the same community, $\delta(C_i, C_j)$ is 1; otherwise, it is 0.
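For reference, modularity can be computed directly from its definition: for every pair of nodes in the same community, the adjacency entry minus the expected number of edges (the product of the two degrees over twice the edge count) is summed and the total is divided by twice the edge count. The sketch below is a straightforward quadratic-time version for small undirected graphs; the function name `modularity` and the toy graph are illustrative.

```python
def modularity(A, communities):
    """Newman modularity Q for an undirected graph.

    A           : symmetric 0/1 adjacency matrix (list of lists)
    communities : communities[i] is the community id of node i
    """
    n = len(A)
    k = [sum(row) for row in A]        # node degrees
    two_m = sum(k)                     # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

# Two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

print(modularity(A, [0, 0, 0, 1, 1, 1]))   # the two triangles as communities
print(modularity(A, [0, 0, 0, 0, 0, 0]))   # everything in one community
```

Splitting the graph into its two triangles gives a clearly positive score, while putting every node into a single community gives zero, since the observed and expected edge counts then cancel.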

We also use normalized mutual information (NMI) [33] to measure the similarity of two clustering results. NMI is an important measure in community discovery because it objectively evaluates how well a community division matches the real division. The value range of NMI is $[0, 1]$, and the higher the value, the closer the detected communities are to the real communities. NMI is defined as

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} N_{ij} \log\left(\frac{N_{ij}\, n}{N_{i\cdot} N_{\cdot j}}\right)}{\sum_{i=1}^{C_A} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{n}\right) + \sum_{j=1}^{C_B} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{n}\right)}, \tag{18}$$

where $A$ ($B$) represents a community discovery method, $N$ is the confusion matrix, $N_{ij}$ represents the number of nodes shared by community $i$ of method $A$ and community $j$ of method $B$, $C_A$ ($C_B$) represents the number of communities found by method $A$ ($B$), $N_{i\cdot}$ ($N_{\cdot j}$) represents the sum of the $i$th row (column $j$) of $N$, and $n$ represents the number of nodes. If the clustering results of methods $A$ and $B$ are the same, then $NMI(A, B) = 1$.
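The confusion-matrix form of NMI can be sketched as follows. The code counts how many nodes each pair of communities shares, then combines those counts with the row and column totals in the usual normalized form. It is an illustrative sketch: the function name `nmi` is hypothetical, and returning 1.0 when both partitions consist of a single community (where the denominator is zero) is a convention chosen here, not something stated in the text.

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two label lists of equal length."""
    n = len(a)
    shared = Counter(zip(a, b))   # confusion matrix entries N_ij (non-zero only)
    rows = Counter(a)             # row sums: community sizes in partition a
    cols = Counter(b)             # column sums: community sizes in partition b
    num = -2.0 * sum(nij * math.log(nij * n / (rows[i] * cols[j]))
                     for (i, j), nij in shared.items())
    den = (sum(c * math.log(c / n) for c in rows.values())
           + sum(c * math.log(c / n) for c in cols.values()))
    # Convention: two trivial one-community partitions count as identical.
    return num / den if den != 0 else 1.0

print(nmi([0, 0, 0, 1, 1, 1], [5, 5, 5, 7, 7, 7]))   # same division, different ids
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))               # independent divisions
```

The measure depends only on which nodes are grouped together, not on the label values, so relabeling the communities of one partition leaves the score unchanged.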

4.2. Performance on Synthetic Networks

Artificially synthesized networks have become an effective means of evaluating the strengths and weaknesses of an algorithm. The most widely used benchmark network for community detection is the LFR benchmark proposed by Lancichinetti et al. [34]. The LFR benchmark network is an extension of the GN benchmark network [1] and has high practical value. It reflects the heterogeneity of the community size distribution and the power-law distribution of node degrees. Its important parameters are as follows: $N$ represents the number of nodes, $k$ the average degree of nodes, $maxk$ the maximum degree of nodes, $minc$ the minimum community size, and $maxc$ the maximum community size; $t_1$ and $t_2$ represent the negative exponents of the power-law distributions of node degree and community size, respectively; and $\mu$ is the ratio of the number of edges between communities to the total number of edges, which expresses how pronounced the communities in the network are. The smaller the value of $\mu$, the more obvious the community structure. Figure 2 compares the NMI results of the algorithms on the LFR benchmark data set.

In this LFR experiment, the parameters $N$, $k$, $maxk$, $minc$, $maxc$, $t_1$, and $t_2$ are fixed, and $\mu$ is varied over a range of values. It can be seen from Figure 2 that when $\mu$ is small, that is, when the community structure of the complex network is obvious, the NMI values of all algorithms except FN are high. The NMI values of FN and LPA are the first to decrease significantly. The remaining algorithms all begin to decrease at a larger value of $\mu$, but DPLPA decreases more slowly than BGLL and LPAm and ends with a higher NMI value. This indicates that DPLPA has higher accuracy in community detection and better stability on networks with less pronounced community structure.

4.3. Real-World Networks

To further compare the strengths and weaknesses of the algorithms, this paper also tests them on several real social networks. These networks differ in size but are representative and come from various fields. Table 1 lists the number of nodes, the number of edges, and the number of identified communities of each network.

Among them, Karate [35] is a data set of member relations in a university karate club in the United States; it is constructed from the interactions between club members and is often used in the analysis of social networks. Dolphins [36] is a network constructed from the living habits of 62 bottlenose dolphins, in which an edge connects two dolphins that are often seen together. Polbook [37] is a network constructed from political books sold on Amazon in the United States. Each node represents a book, and an edge connects two books if they were purchased by the same customer. Football [1] is a network constructed from the American college football schedule. The nodes represent the participating teams, and an edge connects two teams if they played a match against each other. The results of the different algorithms on the different networks are shown in Table 2 and Figure 3.

To examine the clustering effect of DPLPA more closely, this paper takes the Football data set as a detailed example. The actual grouping of the Football data set is shown in Table 3, and the clustering effect of DPLPA is shown in Figure 4.

It can be seen from Figure 3 that although the $Q$ value of our method is not the best on some data sets, the division produced by DPLPA is the same as the actual community distribution, as Table 3 and Figure 4 show. This is mainly because, during label propagation, the probability transition matrix suppresses the randomness of the process well, so that each node is updated to the label of a node in the same community as often as possible, which makes the community division more stable and closer to the real communities. The numbers of communities found by the different algorithms on the different networks are compared in Table 4 and Figure 5.

In addition, Figure 5 shows that DPLPA detects the true number of communities, which is completely consistent with the actual value. This is mainly because DPLPA calculates the local density and distance of the nodes from the network topology at the very beginning and selects the number of communities through a decision graph. Therefore, the number of communities does not need to be provided, and DPLPA has the advantage of detecting it automatically.

To better show the experimental results, we use the Karate network and the Dolphins network as case studies and visualize the detected communities. Nodes in the same community are drawn in the same color. Figure 6 shows the division of the Karate network by DPLPA, and Figure 7 shows the division of the Dolphins network by DPLPA.

It can be seen from Figure 6 that node 1 and node 34 have the highest local density, and from Figure 7 that node 15 and node 18 have the highest local density. These nodes also have large node distances, so it is reasonable for DPLPA to select them as core points, and the resulting division is completely consistent with the actual community division. Therefore, we believe that DPLPA can perform high-quality community detection on real networks. To observe the convergence of DPLPA, this paper compares its behavior on multiple data sets, as shown in Figure 8.

Here, the $x$-axis is the number of iterations, and the $y$-axis is the number of node labels that change in an iteration. As can be seen from Figure 8, when clustering the Karate and Dolphins data, DPLPA completes the division of most node labels in the first few iterations and then completes the division of the few remaining nodes. When clustering the Polbook and Football data, the labels of most nodes have been assigned by the 30th iteration. After that, the curve of label changes becomes flat, indicating that all nodes have completed the label division and the algorithm has converged.

5. Concluding Remarks

In this paper, we propose DPLPA for community detection in complex networks. It incorporates the density peak algorithm and can predict the number of communities without prior knowledge. It avoids the defects of the label propagation algorithm, such as unstable division and strong randomness, and effectively improves the accuracy of community mining and the stability of the algorithm. In addition, a probability transition matrix is constructed to reduce the number of label propagation iterations, so that the algorithm runs efficiently and finds the community structure of a network quickly. The tests on the benchmark networks and the classical real networks show that the proposed algorithm has better stability and accuracy than the compared algorithms and that the number of communities it predicts is consistent with the actual number of communities. However, there is still room for improvement. In future research, we will address large-scale network data and further reduce the time complexity of the algorithm. Dynamic networks and overlapping communities will also be taken as research objects.

Data Availability

The data used in “Label propagation community detection algorithm based on density peak optimization” are commonly used data sets for studying complex networks and can be found on multiple data websites, for example, https://snap.stanford.edu/data/, http://konect.cc/, and https://networkrepository.com/index.php, from which the data used in the experiments were obtained. The data are also available at http://www-personal.umich.edu/~mejn/netdata/.

Disclosure

The current address of Ma Yan and Chen Guoqiang is School of Computer and Information Engineering, Henan University, Henan Province, China. This work was outlined at the 2021 17th International Conference on Computational Intelligence and Security (CIS) [38].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Key Science and Technology Program of Henan Province, China (Grant No. 162102210168), on behalf of which the grant is held at the School of Computer and Information Engineering, Henan University (email address: hnkjgg@163.com).