Abstract

Community structure is an important feature of complex networks, and detecting overlapping communities is an active research topic in data mining and graph theory. Community detection algorithms based on seed expansion have several shortcomings: randomly selected seeds make the detection results unstable, similar seeds expand into similar communities, and deleting nodes during seed expansion increases the amount of computation. To address these shortcomings, this paper proposes an overlapping community detection algorithm based on high-quality subgraph extension in local core regions of the network (OLCRE). First, a novel seed community selection method is designed. By analyzing the sum of node degrees of the subgraph formed by a node and its neighbor nodes in the local core region of the network, together with the tightness of the internal and external connections of that subgraph, a seed community selection function is proposed. According to this function, high-quality subgraphs are selected from all the local core regions of the network as seed communities. Then, taking the definition of community as the guideline, a new community expansion strategy is proposed: whether a neighbor node can join a seed community is determined by comprehensively considering its influence on the tightness of the internal and external connections of the seed community. Finally, after all seed communities have been expanded, overlapping nodes are simplified and possible missing nodes are redetected to further improve the quality of community detection. The proposed algorithm is tested on artificial and real-world networks and compared with several overlapping community detection algorithms. The experimental results verify the effectiveness and feasibility of the proposed algorithm.

1. Introduction

Many complex systems in the real world can be shown in the form of complex networks through their connection modes. Components in the system can be regarded as nodes in the network, and the connection relations between different components can be regarded as edges in the network, for example, the social network [1] that is interconnected among people and the metabolic network [2] that is connected through chemical reactions. With the further study of complex networks, it is found that community structure is the basic statistical characteristic among them. A community in a complex network can be understood as a collection of nodes with similar characteristics, which is usually represented by close connections between nodes within the same community, while sparse connections between nodes in different communities. The purpose of community detection studied in this paper is to reveal the real community structure in complex networks, which has important theoretical significance and practical value for the topological structure analysis and functional analysis of complex networks [3]. At present, the research achievements of community detection have been widely applied in the public opinion analysis and control [4], search engines [5], personalized interest recommendation [6], and other fields. In addition, in view of the actual needs of epidemic transmission prevention and control [7], the community structure, which is between the macro- and micronetwork characteristics, is taken as the entry point, and community detection of social networks is combined with the epidemic transmission, so as to provide important information about the transmission risk class of persons involved in the epidemic for epidemic transmission prevention and control.

Up to now, many classical complex network community detection algorithms have been proposed, which can be divided into two categories according to their community detection results: nonoverlapping community detection algorithms and overlapping community detection algorithms. The nonoverlapping community detection algorithm divides the complex network into multiple disjoint communities. However, in real-world networks, there exist overlapping communities; that is, a node can belong to multiple communities at the same time. In a social network, for example, a person may belong to multiple social circles (family circle, friend circle, and colleague circle). Therefore, the detection of overlapping communities in complex networks has more practical value. According to the different research perspectives, the overlapping community detection algorithms are mainly divided into algorithms based on label propagation, algorithms based on cliques, algorithms based on local extension, algorithms based on edge, algorithms based on nonnegative matrix factorization, and algorithms based on spectral clustering.

Algorithms based on label propagation, for example, the SLPA algorithm [8], firstly initialize labels for nodes in the network and then carry out label propagation. The storage space of each node will save all labels received in the process of label propagation. In order to prevent too many overlapping nodes, the label control threshold is set to determine which labels will be saved in the storage space of nodes. After label propagation stops, nodes with the same label are divided into the same community, and nodes with multiple labels are considered overlapping nodes. OMKLP algorithm [9] proposed a new core node evaluation model by analyzing the node degree and local coverage density of the subgraph formed by this node and its neighbor nodes and assigned the same label to the core node and its neighbor nodes to achieve fast convergence of the algorithm. In the process of label propagation, each node adopts an asynchronous update to receive the community label corresponding to the maximum belonging coefficient of its neighbor nodes. After label propagation stops, nodes with the same label are divided into the same community, while nodes with multiple community labels are overlapping nodes.

Algorithms based on cliques, for example, the CPM algorithm [10], start from complete subgraphs and detect communities through the percolation of these complete subgraphs. The nodes belonging to multiple disconnected cliques are overlapping nodes. The LOC algorithm [11] first finds all the cliques in the network and selects the node with the local maximum density as the initial community. Then, for each neighbor node of the initial community whose fitness function value is positive, the clique in which that node participates is added to the community. If the node does not participate in any clique, only the node itself is added to the community. Because a node can belong to multiple cliques and be added to different communities, overlapping community structures can be detected.

Algorithms based on local extension, for example, the LFM algorithm [12], start from different seed nodes and expand the community by constantly optimizing the fitness function value of the community. The nodes that are extended into multiple communities are overlapping nodes. The ECES algorithm [13] weights the network graph according to the similarity between nodes and then selects the node with the highest centrality value as the core node and expands it. This process is repeated in the remaining set of nodes until there are no nodes left.

Algorithms based on edges, for example, the LC algorithm [14], use the Jaccard function to calculate edge similarity, construct a hierarchical tree of edge communities with a clustering method, and then cut the hierarchical tree using the partition density function to obtain edge communities. Since a node can be incident to multiple edges, overlapping nodes appear naturally once the community of each edge is determined. Finally, the edge communities are transformed into node communities to obtain the overlapping community structure. The LCDEL algorithm [15] first transforms the node graph into a line graph, constructs the adjacency matrix of the line graph, calculates the node distance matrix of the line graph using the NDML metric, and obtains the feature matrix of the node distance matrix by principal component analysis. Finally, the feature matrix is clustered by the k-means clustering algorithm combined with ensemble learning to obtain the overlapping community structure.

Algorithms based on nonnegative matrix factorization, for example, the DNMF algorithm [16], directly find the discrete community membership matrix, which can assign explicit community memberships to nodes without postprocessing. In addition, the pseudosupervision module is added to DNMF to utilize the identification information in an unsupervised way, which further enhances its robustness. The AGNMF-AN algorithm [17] uses an augment attributed graph to combine both the topological structure and attributed nodes of the network and introduces an effective framework to update the affinity matrix, in which the weight of the affinity matrix in each iteration is modified adaptively instead of using a fixed affinity matrix. In addition, the -norm is also used to reduce the impact of random noise and outliers on the community quality, which greatly improves the effectiveness of this algorithm.

Algorithms based on spectral clustering, for example, the SPOC algorithm [18], can extract prior information such as the likelihood of each node belonging to multiple communities from available metadata and node centrality measure, and a hierarchical algorithm is introduced to automatically detect communities. The ASC algorithm [19] constructs a new affinity matrix based on both the network structure and attribute information and does not need to define control parameters to combine structure and attribute. In addition, extra nodes and edges are not added to the original network which makes the algorithm suitable for application to large-scale networks.

In recent years, local community detection algorithms based on seed extension have been widely used in the field of community detection, because they can detect communities without the complete structural information of a complex network and offer high efficiency [20–22] and validity [23–25]. However, in terms of overlapping community detection, there are still shortcomings in the quality and stability of the detection results: randomly selected seeds make the results unstable, similar seeds expand into similar communities, and deleting nodes during seed expansion increases the amount of computation. In view of these shortcomings, this paper proposes an overlapping community detection algorithm based on high-quality subgraph extension in local core regions of the network (OLCRE). The major contributions of this paper are as follows:

(1) A new method of seed community selection is proposed; that is, the subgraphs with tight internal connections and sparse external connections in the local core regions of the network are selected as seed communities, which conforms to the definition of community and ensures the high quality of the selected seed communities. Moreover, the seed communities selected by this method are determined rather than random, which avoids instability in the community detection results.

(2) A new seed community expansion strategy is proposed, which takes the definition of community as the guideline. Whether a neighbor node can join a seed community is decided by comprehensively considering its influence on the tightness of the internal and external connections of the seed community, so that the seed community expands in the direction of tight internal connections and sparse external connections and finally yields a high-quality community structure.

(3) The OLCRE algorithm proposed in this paper does not need any parameters to be set. It can be applied to networks of different scales and types and has universal applicability. The OLCRE algorithm is tested on artificial networks and real-world networks and compared with several overlapping community detection algorithms, and the experimental results show that it is effective and feasible.

2. Basic Concepts and Definitions

A complex network can be modeled as an undirected and unweighted graph $G = (V, E)$, where $V$ is a nonempty finite set of nodes and $E$ is a nonempty finite set of edges. Table 1 lists the notations used in this paper and gives a brief explanation. The basic concepts and definitions used in this paper are described below.
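As a concrete illustration of this graph model, the following minimal sketch stores an undirected, unweighted graph as a dictionary of neighbor sets built from an edge list. The function name `build_adjacency` and the toy edge list are assumptions made for illustration; they do not come from the OLCRE source code.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected, unweighted graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored (an assumption of this sketch)
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)

# The degree of a node is the size of its neighbor set.
degree = {v: len(nbrs) for v, nbrs in adj.items()}
```

Neighbor sets make the set operations used throughout the definitions (neighbors of a node, common neighbors of two nodes, neighbors outside a community) one-line expressions.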

Definition 1 (Seed community selection function). The seed community selection function, denoted by , is defined as follows: where represents the subgraph formed by node and its neighbor nodes and represents the degree of node . if there is an edge connection between nodes and . Otherwise, . Likewise, if there is an edge connection between nodes and . Otherwise, .
The larger the SCS value of the subgraph formed by a node and its neighbor nodes, the closer the subgraph is to a local core region of the network, and the more tightly it is connected internally and the more sparsely it is connected to the external region.

Definition 2 (Common neighbor edge). The common neighbor edge of edge , denoted by , is defined as follows: where is the set of neighbor nodes of node and is the set of neighbor nodes of node .

Definition 3 (Cluster triangle). The cluster triangle in which edge participates, denoted by , is defined as follows: where represents the set of cluster triangles in which edge participates.
The more cluster triangles an edge participates in, the tighter the edge is connected to its neighbor edges. The more cluster triangles exist in the community, the tighter the connection within the community.
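The following sketch illustrates the idea behind Definitions 2 and 3 under one explicit assumption: a "cluster triangle" of an edge is read here as an ordinary triangle that contains the edge, so the number of triangles on an edge equals the number of common neighbors of its two endpoints. The function names are hypothetical, and the exact definitions in the paper may differ.

```python
def triangles_on_edge(adj, u, v):
    """Number of triangles that contain the edge (u, v).

    Assumption: each common neighbor of u and v closes exactly one triangle.
    """
    if v not in adj[u]:
        raise ValueError(f"({u}, {v}) is not an edge")
    return len(adj[u] & adj[v])


def internal_triangle_load(adj, community):
    """Sum of per-edge triangle counts over the edges inside `community`,
    counting only triangles whose third node is also in the community."""
    total = 0
    for u in community:
        for v in adj[u] & community:
            if u < v:  # visit each undirected edge once (nodes must be orderable)
                total += len(adj[u] & adj[v] & community)
    return total


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

On this toy graph, the edge (1, 2) lies on one triangle while the pendant edge (3, 4) lies on none, which matches the intuition that edges with more triangles are more tightly embedded in their neighborhood.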

Definition 4 (Node to the community interior influence function). The node to the community interior influence function, denoted by , is defined as follows: where is the edge set of community and, likewise, is the edge set of community formed when a neighbor node joins community . represents the number of cluster triangles in which edge participates, and represents the number of nodes in community .
If the corresponding I value is greater than 0 after a node joins community C, the node improves the internal connection tightness of community C by joining it.

Definition 5 (Community boundary nodes). The boundary nodes of community , denoted by , are defined as follows: where is the node set of the community .

Definition 6 (Node to the community exterior influence function). The node to the community exterior influence function, denoted by , is defined as follows: where is the node set of the community and, likewise, is the node set of the community formed when a neighbor node joins community . represents the number of boundary nodes of community , and represents the number of boundary nodes of community . if there is an edge connection between boundary node and node . Otherwise, .
If the corresponding E value is less than 0 after a node joins community C, the node makes the connections between community C and the outside sparser by joining it.
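To make the notion of a community boundary concrete, the sketch below computes boundary nodes and the number of outgoing edges of a community before and after a neighbor node joins. It assumes that a boundary node is a community member with at least one neighbor outside the community; this reading, the function names, and the toy graph are assumptions, and the code does not reproduce the exterior influence function E itself.

```python
def boundary_nodes(adj, community):
    """Members of `community` with at least one neighbor outside it
    (assumed reading of 'boundary node')."""
    return {v for v in community if adj[v] - community}


def external_edge_count(adj, community):
    """Number of edges with exactly one endpoint inside `community`."""
    return sum(len(adj[v] - community) for v in community)


# Toy graph: triangle 1-2-3, then a path 3-4-5 leaving the triangle.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

before = {1, 2, 3}
after = before | {4}  # community formed when neighbor node 4 joins
```

Comparing `boundary_nodes` and `external_edge_count` for `before` and `after` shows how a single join shifts the boundary, which is the kind of change an exterior influence measure has to capture.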

Definition 7 (Community quality optimization function). The community quality optimization function, denoted by , is defined as follows: The community quality optimization function is used to simplify overlapping nodes and redetect possible missing nodes so as to further improve the quality of community detection results.

3. The OLCRE Algorithm

3.1. General Description of the OLCRE Algorithm

As shown in Algorithm 1, the OLCRE algorithm first traverses the whole network and, according to the seed community selection function SCS, selects the subgraphs with tight internal connections and sparse external connections from the local core regions of the network as seed communities. In the seed community expansion stage, the influence of a neighbor node on the tightness of the internal and external connections of the seed community is comprehensively considered to determine whether the neighbor node can join the seed community. When the I value and E value of a neighbor node of the seed community satisfy $I > 0$ and $E < 0$, the neighbor node can join the seed community; otherwise, it cannot. When no neighbor node of a seed community satisfies the expansion strategy, that seed community stops expanding, and the algorithm moves on to the remaining seed communities until all seed communities have been expanded. After that, overlapping nodes are simplified and possible missing nodes are redetected according to the proposed community quality optimization function, so as to further improve the quality of community detection. Finally, the output is the overlapping community structure C. Through the above steps, the overlapping community detection of complex networks is completed.

Algorithm 1: OLCRE algorithm.
Input: Graph G = (V, E)
Output: Overlapping community structure C
1: C = ∅;
2: Select the seed community set, denoted by Seeds, from network G according to the
  seed community selection algorithm (Algorithm 2);
3: If Seeds ≠ ∅, select any seed community, denoted by s, and go to Step 4. Otherwise,
  go to Step 5;
4: Remove s from Seeds, expand it into a community structure Cs according to the seed
  community extension algorithm (Algorithm 3), add Cs to C, and return to Step 3;
5: Simplify overlapping nodes and redetect possible missing nodes;
6: Output the overlapping community structure C;

3.2. Seed Community Selection

Seed selection is a key step of overlapping community detection algorithms based on seed expansion and has an important impact on the results of community detection. In this paper, a novel seed community selection method is proposed. According to the seed community selection function SCS, subgraphs with tight internal connections and sparse external connections are selected from the local core regions of the network as seed communities; see Algorithm 2 for the specific process.

Algorithm 2: Seed community selection algorithm.
Input: Graph G = (V, E)
Output: Seed community set Seeds
1: Seeds = ∅;
2: for each i ∈ V do
3:  if node i has been visited then
4:   continue;
5:  else
6:   mark node i as visited;
7:  end if
8:  max ← the SCS value of node i;
9:  while true do
10:   value ← the maximum SCS value among all neighbor nodes of node i, all of which
   are marked as visited; let j be the neighbor node with this value. If the node with
   the maximum SCS value is not unique, node j is selected randomly among them;
11:   if max >= value then
12:    Seeds ← Seeds ∪ {the subgraph formed by node i and its neighbor nodes};
13:    break;
14:   else
15:    max = value;
16:    i = j;
17:   end if
18:  end while
19: end for

The seed community selection algorithm starts from any node i in the network and calculates the SCS values of node i and of its neighbor nodes. If the SCS value of node i is not the largest, the search continues in the direction of the maximum SCS value. Once the node with the maximum SCS value in a region is found, the subgraph formed by this node and its neighbor nodes is taken as a seed community. If the node with the maximum SCS value is not unique, one of them is selected randomly. The search for seed communities then continues in the unvisited areas of the network until all nodes in the network have been traversed. In this way, the subgraphs with tight internal connections and sparse external connections in all local core regions of the network are found and used as seed communities.
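The search described above is a hill climb over node scores, and it can be sketched as follows. The sketch takes the scoring function as a parameter; `closed_degree_sum` is only a placeholder (the degree sum of a node and its neighbors) and is not the paper's SCS function. The names `select_seeds` and `closed_degree_sum`, the seeded random generator used for tie-breaking, and the removal of duplicate seeds are assumptions of this sketch.

```python
import random


def closed_degree_sum(adj, i):
    """Placeholder score, NOT the paper's SCS: degree sum of i and its neighbors."""
    return sum(len(adj[v]) for v in adj[i] | {i})


def select_seeds(adj, score, rng=None):
    """Hill-climb from each unvisited node toward a local maximum of `score`;
    the closed neighborhood of each local maximum becomes a seed community."""
    rng = rng or random.Random(0)  # seeded so that tie-breaking is repeatable
    visited = set()
    seeds = []
    for start in sorted(adj):
        if start in visited:
            continue
        visited.add(start)
        i, best = start, score(adj, start)
        while True:
            nbrs = sorted(adj[i])
            visited.update(nbrs)
            scores = {j: score(adj, j) for j in nbrs}
            value = max(scores.values(), default=float("-inf"))
            if best >= value:
                seed = frozenset(adj[i] | {i})
                if seed not in seeds:  # the same peak may be reached twice
                    seeds.append(seed)
                break
            # Move to a best-scoring neighbor; ties are broken at random.
            i = rng.choice([j for j in nbrs if scores[j] == value])
            best = value
    return seeds


# Toy graph: triangles 1-2-3 and 4-5-6 joined by the bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
seeds = select_seeds(adj, closed_degree_sum)
```

Because the current best score strictly increases on every move, the inner loop always terminates. On the toy graph the climb stops at nodes 3 and 4, so the two seeds are the closed neighborhoods of those nodes.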

3.3. Seed Community Expansion

In the stage of seed community expansion, a novel seed community expansion strategy is designed according to the proposed node to the community interior influence function I and node to the community exterior influence function E; see Algorithm 3 for the specific process.

Algorithm 3: Seed community extension algorithm.
Input: Graph G = (V, E), seed community set Seeds
Output: Overlapping community set C
1: C = ∅;
2: for each s ∈ Seeds do
3:  Cs = s;
4:  while true do
5:   select any neighbor node i of the seed community Cs;
6:   if I(i) > 0 and E(i) < 0 then
7:    Cs = Cs ∪ {i};
8:   end if
9:   if all neighbor nodes of seed community Cs do not satisfy I > 0 and E < 0 then
10:    break;
11:   end if
12:  end while
13:  C = C ∪ {Cs};
14: end for

The new seed community expansion strategy is as follows: select any neighbor node of the seed community and calculate the I value and E value of the node. If $I > 0$ and $E < 0$ are both satisfied, the node is added to the seed community; otherwise, it is not. When no neighbor node of the seed community satisfies the expansion strategy, the seed community stops expanding, and the remaining seed communities are then expanded in the same way until all seed communities have been expanded.
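The expansion loop can be sketched with the two influence functions passed in as parameters. The `interior_gain` and `exterior_gain` functions below are simple stand-ins (change in internal edges per node, and change in the number of outgoing edges); they are assumptions chosen to make the example runnable and are not the paper's I and E functions. All function names are hypothetical.

```python
def internal_edges(adj, community):
    """Number of edges with both endpoints inside `community`."""
    return sum(len(adj[v] & community) for v in community) // 2


def interior_gain(adj, community, i):
    """Stand-in for I: change in internal edges per node if i joins."""
    links = len(adj[i] & community)
    m_in = internal_edges(adj, community)
    return (m_in + links) / (len(community) + 1) - m_in / len(community)


def exterior_gain(adj, community, i):
    """Stand-in for E: change in the number of outgoing edges if i joins."""
    links = len(adj[i] & community)
    return (len(adj[i]) - links) - links


def expand_seed(adj, seed, interior, exterior):
    """Add neighbor nodes while some neighbor satisfies interior > 0 and exterior < 0."""
    community = set(seed)
    while True:
        frontier = set().union(*(adj[v] for v in community)) - community
        accepted = [i for i in sorted(frontier)
                    if interior(adj, community, i) > 0
                    and exterior(adj, community, i) < 0]
        if not accepted:
            break  # no neighbor node meets the expansion strategy
        community.add(accepted[0])
    return community


# Toy graph: triangles 1-2-3 and 4-5-6 joined by the bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
grown = expand_seed(adj, {1, 2}, interior_gain, exterior_gain)
```

Starting from the seed {1, 2}, node 3 is accepted because it closes a triangle and reduces the outgoing edges, while node 4 is rejected because it adds only one internal edge. Each accepted node enlarges the community, so the loop stops after at most as many steps as there are nodes.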

3.4. Dealing with Overlapping Nodes and Missing Nodes

After the expansion of all seed communities is completed, any node that has not been added to a community (a missing node) is added to the community for which its value of the community quality optimization function is largest. In addition, to prevent excessive overlap from degrading the quality of community detection, the detected overlapping nodes need to be simplified. For each overlapping node, the value of the community quality optimization function is calculated with respect to each community to which the node belongs. If the value is positive, the overlapping node is kept in that community; if the value is negative, the overlapping node is removed from that community. When the values with respect to all of its communities are negative, the overlapping node is kept only in the community with the largest value.
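The rules above can be sketched as a single pass over the nodes, with the quality function passed in as a parameter. The `link_balance` function below (links into the community minus links out of it) is a stand-in and not the paper's community quality optimization function; the function names, the in-place update order, and the choice to keep a node whose value is exactly zero are assumptions of this sketch.

```python
def link_balance(adj, community, v):
    """Stand-in quality value: links from v into the community minus links out."""
    links = len(adj[v] & community)
    return 2 * links - len(adj[v])


def refine(adj, communities, quality):
    """Assign missing nodes and prune overlapping nodes (needs >= 1 community)."""
    comms = [set(c) for c in communities]
    for v in sorted(adj):
        member_of = [k for k, c in enumerate(comms) if v in c]
        scores = {k: quality(adj, comms[k] - {v}, v) for k in range(len(comms))}
        if not member_of:
            # Missing node: join the community with the largest value.
            comms[max(scores, key=scores.get)].add(v)
        elif len(member_of) > 1:
            # Overlapping node: drop it where the value is negative.
            keep = [k for k in member_of if scores[k] >= 0]
            if not keep:  # all negative: keep only the best community
                keep = [max(member_of, key=scores.get)]
            for k in member_of:
                if k not in keep:
                    comms[k].discard(v)
    return comms


# Toy graph: triangles 1-2-3 and 4-5-6 joined by 3-4; node 7 hangs off 5 and 6.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6},
       5: {4, 6, 7}, 6: {4, 5, 7}, 7: {5, 6}}
result = refine(adj, [{1, 2, 3, 4}, {3, 4, 5, 6}], link_balance)
```

In this example nodes 3 and 4 initially overlap; each is kept only in the community where most of its neighbors lie, and the missing node 7 is attached to the second community.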

3.5. Time Complexity Analysis

Assume that the number of nodes in network is and the average degree of nodes is . The number of seed communities, the number of overlapping nodes, and the number of missing nodes detected by the OLCRE algorithm are , , and , respectively. Firstly, high-quality subgraphs are selected from local core areas of the network as seed communities, whose time complexity is . After that, the time complexity for all seed communities to complete the extension is . Finally, the time complexity of simplifying and redetecting overlapping nodes and missing nodes is . To sum up, the time complexity of the OLCRE algorithm is . Since , , , and are far less than , the time complexity of the OLCRE algorithm is about , where is a constant.

4. Experimental Results and Analysis

4.1. Experimental Data Sets
4.1.1. Artificial Networks

Since the LFR benchmark network [26] is very similar to the real-world complex network in the statistical characteristics of node degree and community size distribution, this paper uses this benchmark network as the test data set for the proposed algorithm and other comparison algorithms. The parameters of the LFR benchmark network are shown in Table 2.

In order to objectively reflect the performance of each algorithm, four groups of different types of artificial networks (see Table 3) are generated with the LFR toolkit by changing the mixing parameter, the number of overlapping nodes, the number of memberships of the overlapping nodes, and the number of nodes in the network. The four groups are artificial network group N1 with a gradually fuzzy community structure, artificial network group N2 with a gradually increasing number of overlapping nodes, artificial network group N3 with a gradually increasing number of communities to which overlapping nodes belong, and artificial network group N4 with a gradually increasing number of nodes.

4.1.2. Real-World Networks

In order to compare the performance of each algorithm in detecting network community structure, seven real-world network data sets of different sizes and types are used in this paper. They, respectively, are the Zachary karate club network (Karate for short) [27], bottlenose dolphin network (Dolphins for short) [27], books about US politics network (Polbooks for short) [28], US election blog network (Polblogs for short) [27], author collaboration network (Netscience for short) [27], trust network (PGP for short) [29], and friendship network (HR for short) [30]. The details of the seven real-world networks are listed in Table 4.

4.2. Evaluation Metrics and Experimental Settings
4.2.1. Evaluation Metrics

Since the community structure of the artificial network is known, normalized mutual information (NMI for short) [12] is used as the evaluation metric of artificial network community detection results. NMI measures the similarity between the community structure detected by the algorithm and the real community structure, and its value range is [0,1]. The more accurate the community structure detected by the algorithm, the larger the corresponding NMI value. The NMI is defined as follows:

$$NMI(A,B)=\frac{-2\sum_{i=1}^{C_A}\sum_{j=1}^{C_B}N_{ij}\log\left(\frac{N_{ij}\,n}{N_{i\cdot}N_{\cdot j}}\right)}{\sum_{i=1}^{C_A}N_{i\cdot}\log\left(\frac{N_{i\cdot}}{n}\right)+\sum_{j=1}^{C_B}N_{\cdot j}\log\left(\frac{N_{\cdot j}}{n}\right)}$$

where $C_A$ is the number of real communities in the artificial network and $C_B$ is the number of communities detected by the algorithm on the artificial network. The rows of matrix $N$ correspond to the real communities of the artificial network, and the columns of matrix $N$ correspond to the communities detected by the algorithm on the artificial network. $N_{ij}$ is the number of nodes shared by real community $i$ and detected community $j$. $N_{i\cdot}$ is the sum of the elements of $N$ in row $i$, $N_{\cdot j}$ is the sum of the elements of $N$ in column $j$, and $n$ is the number of nodes in the network.
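The confusion-matrix description above can be turned into a short computation. The sketch below assumes the widely used form NMI = -2 Σ_ij N_ij log(N_ij n / (N_i. N_.j)) / (Σ_i N_i. log(N_i./n) + Σ_j N_.j log(N_.j/n)), where n is the number of nodes; the function name `nmi`, the parameter names, and the handling of the degenerate zero-denominator case are assumptions.

```python
import math


def nmi(real, found, n):
    """Confusion-matrix NMI between two lists of node sets; n = number of nodes."""
    N = [[len(a & b) for b in found] for a in real]
    row = [sum(r) for r in N]                      # N_i. (row sums)
    col = [sum(N[i][j] for i in range(len(real)))  # N_.j (column sums)
           for j in range(len(found))]
    num = 0.0
    for i in range(len(real)):
        for j in range(len(found)):
            if N[i][j] > 0:  # empty cells contribute nothing
                num += N[i][j] * math.log(N[i][j] * n / (row[i] * col[j]))
    den = (sum(r * math.log(r / n) for r in row if r > 0)
           + sum(c * math.log(c / n) for c in col if c > 0))
    if den == 0:
        return 1.0  # degenerate case; treated as identical (convention of this sketch)
    return -2.0 * num / den


real = [{1, 2, 3}, {4, 5, 6}]
same = nmi(real, [{1, 2, 3}, {4, 5, 6}], n=6)
merged = nmi(real, [{1, 2, 3, 4, 5, 6}], n=6)
```

Identical structures give a value of 1, and a detected structure that merges everything into one community carries no information about the real structure and gives 0.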

Since the community structure of the real-world network is unknown, the extended modularity (EQ for short) [31] is adopted as the evaluation metric of the community detection results of the real-world network. EQ measures the tightness of community connection, and its value range is [0,1]. A higher EQ value means that the quality of the communities detected by the algorithm is better. The EQ is defined as follows:

$$EQ=\frac{1}{2m}\sum_{c=1}^{K}\sum_{i\in C_c,\,j\in C_c}\frac{1}{O_iO_j}\left[A_{ij}-\frac{k_ik_j}{2m}\right]$$

where $m$ is the number of edges in the network, $K$ is the number of communities detected by the algorithm in the real-world network, and $C_c$ is the $c$-th detected community. $O_i$ is the number of communities to which node $i$ belongs, and $k_i$ is the degree of node $i$. $A_{ij}$ is an adjacency matrix element of the network: $A_{ij}=1$ if there is an edge connection between nodes $i$ and $j$; otherwise, $A_{ij}=0$.
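As a worked example of this metric, the sketch below evaluates the commonly used extended modularity, EQ = (1/2m) Σ_c Σ_{i,j in c} [A_ij - k_i k_j / (2m)] / (O_i O_j), on a small graph. The function name `extended_modularity` and the toy graph are assumptions; the quadratic loop over community members is chosen for clarity rather than speed.

```python
def extended_modularity(adj, communities):
    """Extended modularity EQ of a (possibly overlapping) list of node sets."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    # O_i: number of communities to which node i belongs.
    membership = {v: 0 for v in adj}
    for c in communities:
        for v in c:
            membership[v] += 1
    eq = 0.0
    for c in communities:
        for i in c:
            for j in c:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                eq += (a_ij - expected) / (membership[i] * membership[j])
    return eq / (2 * m)


# Toy graph: triangles 1-2-3 and 4-5-6 joined by the bridge 3-4 (7 edges).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
eq_value = extended_modularity(adj, [{1, 2, 3}, {4, 5, 6}])
```

For this split, each triangle contributes 6 - 49/14 = 2.5 before normalization, so EQ = 5/14. When no node overlaps, every O_i equals 1 and EQ reduces to the standard modularity.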

4.2.2. Experimental Settings

The OLCRE algorithm is tested on artificial network data sets and real-world network data sets and compared with overlapping community detection algorithms DNMF [16], CoEuS [32], MULTICOM [33], and APAL [34] to verify the effectiveness and feasibility of the OLCRE algorithm. The experimental running environment is a computer equipped with an Intel Core i9-11900K 3.50 GHz processor, 32 GB memory, and Windows 10 operating system. The algorithm proposed in this paper is programmed by MATLAB R2021a, and the source code has been publicly shared and is available at https://github.com/GitZhaoY/OLCRE.git.

Table 5 lists the year, programming language, and time complexity of each comparison algorithm, where represents the number of edges in the network, represents the number of nodes in the network, represents the number of seeds, represents the number of nodes within the seed community, represents the number of communities, and represents the number of iterations. From the data listed in Table 5, it can be seen that both the CoEuS algorithm and the MULTICOM algorithm have linear time complexity, which is on the same order of magnitude as the time complexity of the OLCRE algorithm proposed in this paper. The time complexity of the DNMF algorithm is order of magnitude, which is significantly higher than that of the OLCRE algorithm. The time complexity of the APAL algorithm is , which indicates that it has good operating efficiency on sparse networks and is not suitable for dense networks.

4.3. Experimental Results on Artificial Networks

Figures 1–3 and Table 6, respectively, show the NMI values obtained by each algorithm on the four groups of different types of artificial networks. In the network group N1, as the mixing parameter increases, that is, as the network community structure is gradually blurred, the community detection accuracy of each algorithm decreases, but the community detection accuracy of the OLCRE algorithm is better than that of each comparison algorithm for every value of the mixing parameter. In the network group N2 with a gradually increasing number of overlapping nodes and the network group N3 with a gradually increasing number of communities to which overlapping nodes belong, the community detection accuracy of the OLCRE algorithm is better than that of each comparison algorithm. From the experimental data listed in Table 6 (“\\” means that the algorithm failed to detect communities in this experimental running environment), it can be seen that the community detection accuracy of the OLCRE algorithm is relatively stable and better than that of each comparison algorithm in the network group N4 with a gradually increasing number of nodes.

According to the above experimental results, it is shown that the seed community selection method and community expansion strategy of the OLCRE algorithm proposed in this paper are effective and can be applied to networks of different scales and types.

4.4. Experimental Results on Real-World Networks

Table 7 lists the EQ values obtained by the OLCRE algorithm and the other four overlapping community detection algorithms on the seven real-world network data sets (“\\” means that the algorithm failed to detect communities in this experimental running environment). As can be seen from Table 7, the EQ values obtained by the OLCRE algorithm on the Dolphins network, Polbooks network, Polblogs network, Netscience network, PGP network, and HR network are all higher than those obtained by each comparison algorithm. Only on the Karate network is the EQ value obtained by the OLCRE algorithm slightly lower than that obtained by the DNMF algorithm.

The reason why the OLCRE algorithm does not obtain the maximum EQ value on the Karate network is analyzed below. Figures 4 and 5, respectively, show the community detection results of the OLCRE algorithm and the DNMF algorithm on the Karate network. A comparison of Figures 4 and 5 shows that the DNMF algorithm does not detect any overlapping nodes in the Karate network, while the OLCRE algorithm detects node 3 as an overlapping node, which loses some connection tightness. Therefore, the EQ value obtained by the OLCRE algorithm is slightly lower than that obtained by the DNMF algorithm.

5. Conclusions

The OLCRE algorithm proposed in this paper first selects high-quality subgraphs from all local core regions of the network as seed communities according to the proposed seed community selection function. Then, the seed communities are expanded in turn according to the proposed expansion strategy. Finally, after all seed communities have been expanded, overlapping nodes are simplified and possible missing nodes are redetected to further improve the quality of community detection. In this paper, four groups of artificial networks of different types and scales are designed, and the OLCRE algorithm is compared on them with several overlapping community detection algorithms. The community detection accuracy of the OLCRE algorithm on these four groups of artificial networks is better than that of each comparison algorithm. In the experiments on seven real-world networks, the OLCRE algorithm fails to obtain the maximum EQ value only on the Karate network, and its results on the other six real-world networks are all higher than those of the comparison algorithms. In conclusion, the experimental results verify that the OLCRE algorithm is effective and feasible. In addition, the OLCRE algorithm does not need any parameters to be set and requires only the basic network information (nodes and edges) to complete the detection of overlapping communities. It can be applied to networks of different scales and types and has universal applicability.

Data Availability

The data sets used to support this study are obtained from http://snap.stanford.edu/data/ and http://www-personal.umich.edu/~mejn/netdata/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61370083), the Humanity and Social Science Research Project of Ministry of Education of China (no. 22JDSZ3023), the ZheJiang Provincial Natural Science Foundation of China (no. LY15F020040), the ZheJiang Provincial Education Science Planning Project of China (no. 2020SCG046), and the Ministry of Education's Industry-University Collaboration Education Project (nos. 220603372015422 and 220604029012441).