Abstract
To support the research and development of scientific research projects, this paper studies methods for mining cooperative research teams. Because it is currently difficult to define and identify cooperative research teams accurately and reliably from a co-author network, an improved Louvain algorithm that integrates core node recognition is proposed: the Louvain-LSCR algorithm. Based on an analysis of the Louvain algorithm, and considering the local topology of a node in the network and its communication range, a new core node identification algorithm, LSCR, is constructed. The LSCR and Louvain algorithms are then merged to obtain the improved Louvain-LSCR algorithm. In this algorithm, the leaf nodes are first pruned in phase 1 of the Louvain algorithm to reduce computation; then, in phase 2, seed nodes are selected according to the LSCR algorithm. Experimental results on related datasets show that the LSCR algorithm has certain advantages in identifying core nodes. The modularity of the Louvain-LSCR algorithm is better than that of the other algorithms, and the community structure is more reasonable. The results verify that the algorithm can mine potential cooperative research teams in a co-author network.
1. Introduction
Modern scientific research projects are becoming more complex and larger in scale, and their research and development often requires several researchers working together [1]. Scientific research cooperation can effectively supplement the research capabilities of individual scholars, so it is both necessary and effective for multiple people to carry out project research and development cooperatively. The first key step is to find suitable researchers to form a team. A cooperative research team is a special group formed by researchers with different skills or knowledge to achieve a common goal. The problem of cooperative research team discovery has attracted widespread attention from the academic community [2–4].
Methods for discovering cooperative research teams fall mainly into two types: traditional discovery based on multiple kinds of explicit data, and discovery based on network analysis. In recent years, the network analysis approach has gradually replaced the traditional one; it first builds an overall network from large-scale data and then uses community detection to discover small teams in the constructed network. In the real world, discovering communities is very important in various fields [5, 6]. For example, in an IoT system, it is important to discover and analyze key parts that are easily damaged. In virtual communities, mining the communities or nodes that play an important role in information transmission allows their behaviors to be studied and analyzed [7, 8]. In addition, community detection can be applied to knowledge graphs [9, 10] to mine abnormal data [11, 12], which can lay the foundation for building a knowledge graph or for providing big data mining services.
A cooperative research team is a group of researchers who work closely together; in terms of network structure, it can be understood as a community in the network. The members of a community cooperate closely, while the connections between members of different communities are sparse. In a complex network, nodes belonging to the same community may have the same or similar properties in some respects, and the number of connections inside the community exceeds the number of connections to external nodes [13, 14]. After the entire network is divided into communities, the network structure presents a high degree of modularity, and the nodes within each module are highly aggregated. There are many network-based methods for cooperative research team discovery, such as the GN algorithm [15] and the agglomerated subgroup method [16]. Li et al. [17] used the Louvain method [18] to discover scientific research teams in the network and identified the roles of team members by constructing a research team citation network. Jin et al. [1] detected communities in the co-author network and implemented co-author recommendation, which opens up the possibility of large-scale research cooperation in the future. Reference [19] builds a network of Italian sociologists, uses the Leiden algorithm [20] to detect co-author communities, and uses the exponential random graph model to show that cooperative relationships are mainly driven by the research interests of these groups. Sciabolazza et al. [21] used a modularity-based algorithm to study the co-authoring network of scholars.
With the development of computer technology, new community detection algorithms continue to emerge [22–24]. Among them, the Louvain algorithm is an agglomerative community detection algorithm based on modularity and is considered to be one of the most efficient. The whole algorithm is divided into two stages. In stage 1, similar nodes are grouped into one category; in stage 2, the nodes of the same category are merged into one node, and then stage 1 is repeated. To address the deficiencies of the Louvain algorithm, many researchers have modified it to improve its efficiency and the quality of the detected communities. Wu [25] pruned the leaf nodes in the network to improve the operating efficiency of the Louvain algorithm. Hu et al. [26] improved the efficiency of the Louvain algorithm by combining it with the LPA algorithm. Zhao et al. [27] improved the Louvain algorithm by selecting nodes with a large degree as seed nodes. Zhang et al. [28] used an adaptive neighbor algorithm to improve the efficiency of the algorithm. Li et al. [29] divided neighbor communities based on absolute gain, rather than relative gain, to obtain better modularity. To solve the problems of too many iterations and too many small communities in stage 2 of the Louvain algorithm, this paper combines the Louvain algorithm with the new core node recognition algorithm LSCR and proposes a new Louvain-LSCR algorithm. It uses LSCR to measure the importance of nodes and then selects seed nodes.
The main contributions of this paper are as follows:
(1) We propose a new core node recognition algorithm, LSCR, that takes local information into account and is used to identify core nodes in the network.
(2) Aiming at the problems of too many iterations and too many small communities in stage 2 of the Louvain algorithm, a new community detection algorithm, Louvain-LSCR, is proposed, which integrates the core node recognition algorithm LSCR into the Louvain algorithm. Seed nodes are selected by calculating the LSCR value of each node, which reduces the number of iterations and merges small communities.
(3) Comparative experiments are conducted on public datasets to verify the effectiveness and advantages of the proposed LSCR and Louvain-LSCR algorithms, and the Louvain-LSCR algorithm is used in a case study of mining potential cooperative research teams.
The rest of this paper is organized as follows. Section 2 introduces the basis of core node recognition algorithms, the Louvain algorithm, and its flaws. Section 3 introduces the improved core node recognition algorithm LSCR and the new Louvain-LSCR algorithm. Section 4 introduces the evaluation criteria for core node recognition and community detection. Section 5 tests and compares the experimental results of the proposed algorithms. Finally, Section 6 presents conclusions and discussion.
2. Basic Theory and Algorithm
The network $G = (V, E)$ represents an undirected and unweighted network, where the node set $V$ represents the individuals in the network, the edge set $E$ represents the relationships between individuals, and $|V| = n$, $|E| = m$. If there is an edge between nodes $i$ and $j$, then $a_{ij} = 1$; otherwise, $a_{ij} = 0$. $k_i$ represents the degree of node $i$.
2.1. Core Node Recognition Algorithm
2.1.1. Degree Centrality
The degree centrality of node $i$ in the network is the number of nodes directly connected to it. The calculation formula is as follows:
$$DC(i) = k_i \tag{1}$$
where $k_i$ is the degree of node $i$.
2.1.2. Betweenness Centrality
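The betweenness centrality of node $v$ in the network is the ratio of the number of shortest paths between any two nodes that pass through node $v$ to the number of all shortest paths between them:
$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{2}$$
Among them, $\sigma_{st}(v)$ represents the number of shortest paths from node $s$ to node $t$ through node $v$, and $\sigma_{st}$ represents the number of shortest paths from node $s$ to node $t$.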
2.1.3. Closeness Centrality
The closeness centrality of node $i$ in the network is the reciprocal of the sum of the shortest distances from the node to all other nodes, and its calculation formula is as follows:
$$CC(i) = \frac{1}{\sum_{j \neq i} d_{ij}} \tag{3}$$
where $d_{ij}$ is the shortest distance from node $i$ to node $j$.
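As a rough illustration of the degree and closeness definitions above, the following sketch computes both on a small adjacency dictionary. The function names and the toy path graph are assumptions made for this example; they are not taken from the experiments in this paper.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj, node):
    """Number of nodes directly connected to node."""
    return len(adj[node])


def closeness_centrality(adj, node):
    """Reciprocal of the summed shortest distances to the reachable nodes."""
    dist = bfs_distances(adj, node)
    total = sum(d for v, d in dist.items() if v != node)
    return 1.0 / total if total > 0 else 0.0


# Toy path graph a - b - c (hypothetical data).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(adj, "b"), closeness_centrality(adj, "b"))
```

In this toy graph the middle node has the largest value under both measures, which matches the intuition that it is the most central node.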
2.1.4. K-Index Centrality
Giovanni et al. proposed the K-index [30]. Nodes are first compared by their H-index [31, 32]; when the H-index values are equal, the citations (the sum of the citations of the h core papers) are used to further distinguish the size of influence. The K-index calculation formula is as follows:
Among them, the citation term represents the sum of the citations of the h core papers of a certain author, which corresponds to the sum of the degrees of the h core neighbor nodes of the node in network $G$.
2.2. Louvain Algorithm
2.2.1. Concept of Louvain Algorithm
The Louvain algorithm is an agglomerative community detection algorithm based on modularity. It merges nodes continuously; when the modularity no longer increases, the community structure obtained at that point is the final division result [18].
The Louvain algorithm is divided into two stages, and both stages are iterated repeatedly. In the first stage, each node in the network is assigned to its own community, so the number of communities initially equals the number of nodes. The modularity gain obtained by moving a node into the community of each of its neighbor nodes is then calculated. For node $i$, suppose that its neighbor nodes belong to different communities. The modularity gain $\Delta Q$ before and after moving node $i$ into the community of each neighbor node in turn is recorded, and the maximum value is denoted $\max \Delta Q$. When $\max \Delta Q > 0$, node $i$ is moved to the community corresponding to $\max \Delta Q$. When $\max \Delta Q \le 0$, the community to which node $i$ belongs is kept unchanged. This is repeated until the community membership of all nodes no longer changes. In the second stage, each community obtained in the first stage is merged into a super node, and the weight of the edge between two super nodes is set to the sum of the weights of the edges between the two corresponding communities in the first stage. The new network is then divided by the method of the first stage, and the process continues until the modularity value no longer increases. The formula for calculating the modularity gain is as follows:
$$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right] \tag{5}$$
In the formula, $m$ represents the total number of edges in the network; $k_i$ is the sum of the weights of all the edges connected to node $i$; $\Sigma_{in}$ is the sum of the weights of the internal edges of community $C$; $k_{i,in}$ is the sum of the weights of the edges between node $i$ and the nodes in community $C$; and $\Sigma_{tot}$ is the sum of the weights of all the edges connected to the nodes in community $C$. $\Delta Q$ consists of two parts. The first part represents the modularity of community $C$ after node $i$ is moved into it, and the second part is the sum of the modularity of community $C$ and the modularity of node $i$ when node $i$ forms an independent community.
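The two-part gain described above can be sketched as a small function. This is a minimal sketch that assumes the commonly cited form of the Louvain gain from [18]; the argument names and the numeric values in the example are hypothetical, and the exact convention used for the internal weight of a community is an assumption.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Gain from moving an isolated node i into community C.

    sigma_in  : weight of the edges inside C
    sigma_tot : weight of all edges incident to nodes in C
    k_i       : weight of all edges incident to node i
    k_i_in    : weight of the edges between node i and nodes in C
    m         : total edge weight of the network
    """
    two_m = 2.0 * m
    # Modularity of C after node i has joined it.
    after = (sigma_in + k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    # Modularity of C plus that of node i as a separate community.
    before = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return after - before


# Hypothetical values: a node of degree 2 with both edges pointing into C.
gain = modularity_gain(sigma_in=0, sigma_tot=2, k_i=2, k_i_in=2, m=4)
print(gain)
```

Expanding the squares shows that the gain reduces to $k_{i,in}/(2m) - \Sigma_{tot} k_i / (2m^2)$, so only the links into the candidate community and the two degree totals need to be tracked during phase 1.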
2.2.2. Defects of Louvain Algorithm
Real networks often contain many leaf nodes, especially networks with an obvious hierarchical structure, and these leaf nodes can be handled by pruning to improve the efficiency of the algorithm. A leaf node is connected to only one other node, so it must be assigned to the community of that node. Assigning it directly avoids many related calculations and thereby improves the efficiency of the algorithm. Generally speaking, leaf nodes exist only during the phase 1 division. In phase 2, after the communities obtained in phase 1 are compressed into super nodes, there are no more leaf nodes [33].
In the Louvain algorithm, stage 2 iterates repeatedly until the modularity no longer changes. The problems in phase 2 of the Louvain algorithm are as follows: (1) the operating efficiency of the algorithm decreases because of too many iterations and (2) there are too many small communities.
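The pruning idea can be sketched as two helper functions: one sets the leaf nodes aside before the division, and the other attaches each leaf to the community of its only neighbor afterwards. The function names are hypothetical, and the sketch assumes a connected graph with more than two nodes, so that the neighbor of a leaf is never itself a leaf.

```python
def split_leaves(adj):
    """Return ({leaf: its only neighbor}, set of non-leaf nodes)."""
    leaf_parent = {u: next(iter(nbrs)) for u, nbrs in adj.items() if len(nbrs) == 1}
    core = set(adj) - set(leaf_parent)
    return leaf_parent, core


def attach_leaves(community_of, leaf_parent):
    """Copy the community of each leaf's neighbor onto the leaf."""
    result = dict(community_of)
    for leaf, parent in leaf_parent.items():
        result[leaf] = result[parent]
    return result


# Triangle 1-2-3 with a leaf node 4 hanging off node 1 (hypothetical data).
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
leaf_parent, core = split_leaves(adj)
# Suppose the division of the non-leaf nodes placed them all in community 0.
full = attach_leaves({1: 0, 2: 0, 3: 0}, leaf_parent)
print(leaf_parent, core, full)
```

Only the nodes in `core` would take part in the gain calculations of phase 1, which is where the saving comes from.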
3. Improved Algorithm
3.1. LSCR Algorithm
The importance of a node in the network depends not only on the degree of the node itself, but also on how strongly the neighboring nodes within two hops depend on it. Inspired by Ruan et al. [34], the LHN similarity [35] is used to measure the degree of topological overlap between a node and its neighbor nodes. The formula is as follows:
where $\Gamma(i)$ represents the set of neighbor nodes of node $i$ and $|\Gamma(i) \cap \Gamma(j)|$ represents the number of common neighbor nodes of node $i$ and node $j$. The larger the Local Similarity Centrality (LSC) value of a node, the lower the overlap between the node and its neighbor nodes in terms of topological structure.
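To make the overlap measure concrete, the sketch below computes the LHN similarity between two nodes, taken here in its usual form: the number of common neighbors divided by the product of the two degrees. It illustrates only this similarity ingredient, not the LSC value itself; the function name and the toy graph are assumptions for this example.

```python
def lhn_similarity(adj, i, j):
    """Common neighbors of i and j divided by the product of their degrees."""
    k_i, k_j = len(adj[i]), len(adj[j])
    if k_i == 0 or k_j == 0:
        return 0.0  # an isolated node shares no neighbors with anyone
    return len(adj[i] & adj[j]) / (k_i * k_j)


# Triangle 1-2-3 with a pendant node 4 attached to node 3 (hypothetical data).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(lhn_similarity(adj, 1, 2))  # nodes 1 and 2 share neighbor 3
print(lhn_similarity(adj, 3, 4))  # nodes 3 and 4 share no neighbor
```

A pair with a high similarity has heavily overlapping neighborhoods, so a node whose neighbors are all similar to it adds little reach of its own.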
In daily life, differences in ability, personality, and other factors lead to differences of opinion. People distinguish close relationships from distant ones according to how far opinions differ; not everyone can communicate with everyone else, and people communicate within a certain range of trust. Inspired by these factors, the ramp function is used to correct the above LSC value, and its calculation formula is as follows:
3.2. Louvain-LSCR Algorithm
The improved community detection algorithm Louvain-LSCR has two stages, and both take the weights between nodes into consideration. Phase 1 differs from phase 1 of the Louvain algorithm in that it first identifies which nodes are leaf nodes and records the nodes connected to them. If a node is a leaf node, the subsequent operations on it are skipped during calculation; at the end, the community of the node connected to the leaf node is found and the leaf node is moved into it. After phase 1, the main communities have formed. However, the quality of the division is not yet good enough and there are too many communities, so the communities need to be further divided in phase 2. In phase 2, the result of the phase 1 division is first compressed to build a new network. The LSCR value of each node in the new network is then calculated according to Algorithm 1, and the seed nodes are selected by formula (8), in which the threshold is the average importance value of the nodes. Finally, each node in phase 2 is assigned to its own community, and for each non-seed node $i$, the modularity gain $\Delta Q$ of moving it to the community of each neighbor node is calculated. If the communities of its neighbor nodes contain seed nodes and $\Delta Q > 0$ for these communities, node $i$ is preferentially moved to the seed community corresponding to $\max \Delta Q$. Otherwise, $\Delta Q$ is calculated for the neighbor communities without seed nodes; if $\max \Delta Q > 0$, node $i$ is moved to the community corresponding to $\max \Delta Q$. If none of the above conditions is met, the community to which node $i$ belongs remains unchanged.
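The phase 2 selection rule can be sketched as follows. This is an illustrative sketch only: it assumes that a node becomes a seed when its importance score is strictly above the mean score, which is one reading of the threshold described above, and the function names, scores, and gains are hypothetical.

```python
def select_seeds(score):
    """Nodes whose importance score is above the mean (assumed strict)."""
    mean = sum(score.values()) / len(score)
    return {node for node, s in score.items() if s > mean}


def choose_community(gains, seed_communities):
    """Pick the target community for one non-seed node.

    gains maps each neighboring community to its modularity gain.
    Communities holding a seed node are preferred when their gain is
    positive; otherwise the best positive gain among the rest is used.
    Returns None when the node should stay where it is.
    """
    seeded = {c: g for c, g in gains.items() if c in seed_communities and g > 0}
    if seeded:
        return max(seeded, key=seeded.get)
    others = {c: g for c, g in gains.items() if c not in seed_communities and g > 0}
    if others:
        return max(others, key=others.get)
    return None


seeds = select_seeds({"a": 3.0, "b": 1.0, "c": 2.0})
target = choose_community({"c1": 0.01, "c2": 0.05}, seed_communities={"c1"})
print(seeds, target)
```

Note that under this rule a seed community with a small positive gain wins over a non-seed community with a larger gain, which is what pulls small communities toward the core nodes.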
In the Louvain algorithm, the loop over phase 1 and phase 2 accounts for most of the running time. The cost of phase 1 is determined by the time needed to traverse the nodes during the division and the time needed to move a node; phase 2 adopts the idea of phase 1, so its complexity is the same as that of phase 1. The algorithm in this paper is an improvement on the Louvain algorithm: the pruning added in phase 1 reduces the cost of the division, and the seed selection in phase 2 reduces the number of iterations. Therefore, the algorithm in this paper is expected to outperform the Louvain algorithm in terms of operating efficiency.
4. Evaluation Criteria
4.1. Core Nodes Evaluation Criteria
This paper uses different core node recognition algorithms to identify the core nodes and then deliberately attacks the network by removing these nodes. The maximum connected subgraph coefficient and the network efficiency are used to quantify the impact of the deliberate attacks on the structure and function of the network, and thus to evaluate the effectiveness of each core node recognition algorithm.
4.1.1. Maximum Connectivity Coefficient
A connected subgraph is a subgraph of the network in which there is at least one path between any two nodes. A disconnected graph can be decomposed into two or more connected subgraphs, and the one with the largest number of nodes is the largest connected subgraph. After the network is attacked, its topology changes and the network decomposes into multiple subgraphs. The formula for calculating the maximum connectivity coefficient is as follows:
$$S = \frac{R}{N} \tag{9}$$
Among them, $R$ represents the number of nodes in the maximum connected subgraph of the network after the network is attacked and $N$ represents the number of nodes in the initial network.
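A minimal sketch of this coefficient is given below: it removes a set of attacked nodes, finds the largest remaining connected component by depth-first search, and divides its size by the original node count. The function names and the toy path graph are assumptions for this example.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component once removed nodes are deleted."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def max_connectivity_coefficient(adj, removed):
    """Largest surviving component divided by the original number of nodes."""
    return largest_component_size(adj, removed) / len(adj)


# Path graph a - b - c - d - e (hypothetical data); attack the middle node.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(max_connectivity_coefficient(adj, {"c"}))
```

A ranking that identifies core nodes well should drive this coefficient down quickly as more top-ranked nodes are removed.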
4.1.2. Network Efficiency
Network efficiency examines how, under a certain attack strategy, removing nodes and all their corresponding edges interrupts some paths in the network and lengthens the shortest paths between some nodes. In turn, the average path length of the entire network becomes larger, which affects the connectivity of the network. The network efficiency is expressed as
$$E = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \tag{10}$$
where $N$ is the number of nodes in the network and $d_{ij}$ is the shortest distance between node $i$ and node $j$.
Under a certain attack strategy, the ratio by which the network efficiency declines after the attack is calculated to measure the robustness of the network. The calculation formula for the decline in network efficiency is as follows:
$$\mu = 1 - \frac{E}{E_0} \tag{11}$$
Among them, $E$ represents the network efficiency after the network is attacked and $E_0$ represents the original network efficiency. The larger the value of $\mu$, the weaker the robustness of the network after being attacked.
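The efficiency and its decline ratio can be sketched with a breadth-first search from every surviving node. The sketch assumes the usual definition of efficiency as the average inverse shortest distance over ordered node pairs, with unreachable pairs contributing zero, and it assumes that the normalization keeps the original node count after an attack; the function names and the toy graph are hypothetical.

```python
from collections import deque


def efficiency(adj, removed=()):
    """Average inverse shortest distance over ordered pairs of nodes."""
    removed = set(removed)
    n = len(adj)  # assumption: normalize by the original number of nodes
    total = 0.0
    for source in adj:
        if source in removed:
            continue
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in removed and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable pairs never enter dist, so they contribute nothing.
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def efficiency_decline(adj, removed):
    """Relative drop in efficiency caused by removing the given nodes."""
    return 1.0 - efficiency(adj, removed) / efficiency(adj)


# Path graph a - b - c (hypothetical data).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(efficiency(adj), efficiency_decline(adj, {"b"}), efficiency_decline(adj, {"c"}))
```

Removing the middle node destroys every path, so the decline reaches 1, while removing an end node leaves one edge intact and gives a smaller decline.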
4.2. Community Detection Evaluation Criteria
To measure the results of community division, Newman and Girvan proposed the concept of modularity ($Q$) [36]. Generally speaking, when the modularity exceeds a given threshold, the community structure of the network divided by the algorithm is obvious. The closer the modularity value is to 1, the higher the quality of the communities obtained by the algorithm and the more obvious the community structure of the network. The modularity is defined as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{12}$$
In the above formula, $m$ is the sum of the weights of all the edges in the network and $k_i$ represents the sum of the weights of the edges connecting node $i$ and all other nodes. In an undirected graph, $A_{ij} = 0$ means that there is no edge between node $i$ and node $j$; otherwise, $A_{ij}$ is the connection weight between node $i$ and node $j$. $c_i$ represents the community to which node $i$ belongs, and $\delta(c_i, c_j)$ is a binary function: if $c_i$ and $c_j$ are equal, the function value is 1; otherwise, the function value is 0.
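For a weighted undirected network, the modularity can be computed per community as the internal edge weight divided by the total edge weight, minus the squared share of the community's total degree. The sketch below uses this equivalent per-community form; it assumes that each undirected edge is listed once and that there are no self-loops, and the function name and the toy network are hypothetical.

```python
from collections import defaultdict


def modularity(weights, community_of):
    """Modularity of a partition.

    weights      : {(u, v): w}, each undirected edge listed once
    community_of : {node: community label}
    """
    m = float(sum(weights.values()))
    inside = defaultdict(float)  # edge weight inside each community
    total = defaultdict(float)   # summed weighted degree of each community
    for (u, v), w in weights.items():
        total[community_of[u]] += w
        total[community_of[v]] += w
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += w
    return sum(inside[c] / m - (total[c] / (2.0 * m)) ** 2 for c in total)


# Two triangles joined by a single edge (hypothetical data).
weights = {(1, 2): 1, (2, 3): 1, (1, 3): 1, (4, 5): 1, (5, 6): 1, (4, 6): 1, (3, 4): 1}
split = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
merged = {node: 0 for node in split}
print(modularity(weights, split), modularity(weights, merged))
```

Placing each triangle in its own community gives a clearly positive value, while placing all nodes in one community gives zero, which is why modularity is used to compare divisions of the same network.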
5. Algorithm Verification and Examples of Cooperative Research Team Discovery
The experiments in this paper verify the performance of the improved algorithms. The experimental environment is as follows: an Intel® Core™ i5-4258U processor at 2.40 GHz, 8 GB of memory, the Python programming language, and Visual Studio Code 1.48.2 as the programming environment.
5.1. Algorithms Verification
The datasets used in the experiment are the Zachary Karate network, the Dolphins social network, the Polbooks network, the Adjnoun network, the Celegansneural network, and the WS small-world network. The basic parameters are shown in Table 1, where $n$ represents the number of network nodes, $m$ represents the number of network edges, $\langle k \rangle$ represents the average degree, and $\langle d \rangle$ represents the average shortest path length.
5.1.1. Experiment and Analysis of Core Node Recognition Algorithm
Based on the above networks, this paper compares the LSCR algorithm with four other algorithms that also use local information: Degree [37], LLS [34], K-index [30], and LC [38]. Nodes are removed by static attacks according to the ranking produced by each of the five algorithms, and the changes in the maximum connected subgraph coefficient and the network efficiency under these deliberate attacks are recorded, so as to evaluate the accuracy of each core node recognition algorithm.
Figure 1 shows how attacking the network according to the different core node recognition algorithms affects the largest connected subgraph of the network. In Figure 1(a), the LSCR algorithm is slightly worse than the LLS algorithm only in one part of the curve and is the best of these algorithms elsewhere. In Figure 1(b), the LSCR algorithm is the best in the top 23% and the bottom 50% and is slightly worse than the LLS algorithm in the middle part. In Figure 1(c), the LLS, Degree, LC, and LSCR algorithms differ little in the top 25%; in the latter part, the LSCR algorithm is better than the other four algorithms. In Figures 1(d) and 1(e), the LSCR algorithm reaches the minimum value faster than the other algorithms in the second half. In Figure 1(f), the LSCR algorithm is significantly better than the other four algorithms.

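Figure 1: Maximum connected subgraph coefficient of the six networks, panels (a) to (f), under attacks guided by the different core node recognition algorithms.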
Figure 2 shows how deleting network nodes according to the different core node recognition algorithms affects network efficiency. In Figure 2(a), the LSCR algorithm performs slightly worse only in a small part of the curve and is the best elsewhere. In Figures 2(b) and 2(c), the LSCR algorithm is the first to reach a 100% reduction in efficiency, while the remaining algorithms have not reached 100% after 80% of the nodes are removed. In Figures 2(d), 2(e), and 2(f), the LSCR algorithm is the best almost throughout and its network efficiency drops the fastest, while the K-index algorithm performs the worst.

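Figure 2: Decline in network efficiency of the six networks, panels (a) to (f), as nodes are deleted according to the different core node recognition algorithms.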
In summary, the LSCR algorithm identifies the core nodes in these networks more effectively than the other algorithms compared, so it is feasible to apply it within the Louvain algorithm to select core nodes.
5.1.2. Louvain-LSCR Algorithm Experiment and Analysis
To further illustrate the differences in modularity and number of communities between the Louvain-LSCR algorithm and other algorithms, the Louvain-LSCR algorithm is compared with the SLPA [39], SCD [40], and ILouvain [27] algorithms. The results are shown in Table 2. The algorithm in this paper is better than the ILouvain and SLPA algorithms, but on some networks its modularity is lower than that of the SCD algorithm. The reason may be that the SCD algorithm detects too many communities. For example, on the Karate network, SCD obtains 16 communities, while the remaining algorithms obtain 3 to 4.
5.2. Examples of Cooperative Research Team Discovery
The above experimental results show that the Louvain-LSCR community detection algorithm achieves better modularity than most of the compared algorithms and a more reasonable community structure, indicating that the improved algorithm is well suited to community detection. It is therefore feasible to use the improved algorithm for team mining on social networks.
5.2.1. Data Source
In this paper, the co-author network is constructed from an author co-occurrence matrix, which is in turn built from the authors' published papers. The experimental data all come from papers included in the CNKI "China Academic Journals Online Publishing Database" and "China's Important Conference Papers Full-Text Database," with a retrieval period up to March 2020. First, an expert in mechanical engineering was selected and the data on his published papers were collected; the rest of the paper data were then collected by the "snowball" method [41]. The raw data were cleaned programmatically, and 1042 valid records were finally obtained after manual screening. These papers, used as the source for data analysis, involve a total of 963 authors with 3102 relationships between them. The collaborative papers were converted into a co-occurrence matrix. Because the weight of an edge influences the division result, the co-occurrence matrix in this paper is not binary: the number of collaborations between two authors is the weight of the edge. Because cooperation between the authors of a paper is mutual, the co-author network is undirected. The nodes of the co-author network represent the authors of the papers; if two authors have cooperated, the nodes representing them are connected by an edge.
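The construction of the weighted co-occurrence relation can be sketched as follows: every pair of authors on the same paper gains one unit of edge weight. This is a minimal sketch with made-up author lists; the function name and the data are assumptions for this example and do not come from the CNKI dataset.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(papers):
    """Count collaborations per unordered author pair.

    papers : iterable of author lists, one list per paper
    returns: Counter mapping (author_a, author_b) to the number of joint papers
    """
    weights = Counter()
    for authors in papers:
        # Sort so that each undirected pair always has the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical author lists for three papers.
papers = [["A", "B", "C"], ["A", "B"], ["C", "D"]]
weights = coauthor_weights(papers)
print(dict(weights))
```

The resulting dictionary is the non-binary co-occurrence relation: authors A and B have weight 2 because they share two papers, while every other pair has weight 1.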
5.2.2. Analysis of Experimental Results
The Louvain-LSCR algorithm is compared and analyzed through experiments, and two evaluation indicators, modularity $Q$ and the number of communities, are used to analyze the pros and cons of the algorithm [42]. These two indicators reflect the quality of community detection. The larger the $Q$, the better the effect of community detection.
To evaluate the quality of community detection, the Louvain-LSCR algorithm is compared with the LPA [43], SCD [40], and ILouvain [27] algorithms. The experimental results are shown in Table 3. The larger the $Q$, the more compact the community structure and the more reasonable the community detection results.
Table 3 shows that the modularity of the Louvain-LSCR algorithm is greater than that of the other algorithms and that its number of communities is smaller. Compared with the ILouvain algorithm, $Q$ is improved by 10.9% in relative terms. Compared with SCD and LPA, the improvement is more obvious, and the obtained community structure is more compact. The comparison between the ILouvain algorithm and the Louvain-LSCR algorithm is shown in Figure 3. The ILouvain algorithm requires 11 iterations, while the Louvain-LSCR algorithm needs only 7. Because edge weights are introduced, their influence on the division results is taken into account during division, and more compact communities are obtained. In the end, the Louvain-LSCR algorithm obtained 15 communities, and the ILouvain algorithm obtained 23. The communities containing less than 1% of the members are further compared in Figure 4. There are 5 such communities for ILouvain and only 2 for Louvain-LSCR, which shows that the Louvain-LSCR algorithm can merge small communities in time. In summary, the community quality obtained by the algorithm in this paper is better than that obtained by the above algorithms.


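Figure 3: Comparison of the ILouvain and Louvain-LSCR algorithms.
Figure 4: Community detection results on the co-author network, panels (a) and (b).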
5.2.3. Analysis of Cooperative Research Team
The co-authoring relationships of the authors form a co-author network with 963 nodes and 3102 edges. For the 15 teams obtained by the Louvain-LSCR algorithm, the number of authors, the average degree, and the internal team connections were counted, and their average values were calculated. The results are shown in Table 4.
Analysis of the structure of the co-author network shows that certain nodes in the network are closely connected, which causes these nodes to combine into small groups. In a complex network, such a group is called a community [44, 45]. The community characteristics of the co-author network reflect the characteristics of teamwork between authors. The main purpose of community analysis in complex network analysis is to discover how many communities there are in the network, and community detection can identify the strongly, directly, and closely related communities in it. Therefore, NetworkX [46] is used to draw the co-author network, and the improved Louvain algorithm is then used to conduct community detection on it. There are 15 closely cooperating communities; that is, there are 15 potential cooperative research teams in the network, numbered 1–15, as shown in Figure 4(b). The team mining results show that some research teams are large and some are small, because research teams in the network do not have a uniform size or clear boundaries [47]. The relationships between the 15 research teams are explored further and shown in Figure 5. The teams are clearly separated, yet they do not exist in isolation: they communicate and cooperate with each other across teams. Across the entire network, some nodes act as bridges between teams, connecting these teams into one large cooperation network.

6. Conclusions
To explore the potential cooperative research teams in a co-author network, this paper addresses the deficiencies of the Louvain algorithm and proposes corresponding improvements to the detection of cooperative research teams. First, an LSCR algorithm based on local information is proposed to identify the core nodes in the network. By combining the Louvain and LSCR algorithms, a Louvain-LSCR algorithm is then proposed to mine teams in the network. It uses the LSCR algorithm to select seed nodes, which mitigates the problems of too many iterations and too many small communities in phase 2 of the Louvain algorithm. In experiments on the co-author network, five other real networks, and one artificial network, the LSCR algorithm is better than the four compared algorithms based on local information (LLS, Degree, LC, and K-index), while the Louvain-LSCR algorithm is better than the SCD, SLPA, and ILouvain algorithms.
The core nodes in the network play an important role in the transmission of information, such as finding influential users in the virtual community and protecting the core nodes to improve the robustness of the network. In addition, community detection can be applied to the knowledge graph to mine abnormal data, which can lay a foundation for building a knowledge graph or big data mining providing services.
However, this study only briefly analyzed the co-authoring network, and did not consider the order of authors and the influence of corresponding authors. In addition, it did not consider the time attenuation of cooperation. Therefore, building related models and solving algorithms for the above attributes is our next research direction.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this study.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 71961005), Natural Science Foundation of Guangxi Zhuang Autonomous Region (Grant no. 2020GXNSFAA297024), and Guangxi Key Laboratory of Manufacturing System and Advanced Manufacturing Technology (Grant no. 20-065-40S002).