Abstract

Previous studies have found that endpoint degree, H-index, and coreness can each quantify the influence of endpoints in link prediction, and that combining endpoint degree with H-index improves prediction performance over traditional link prediction models. However, neither endpoint degree nor H-index describes the aggregation degree of neighbors, so the intensity of endpoint influence is expressed inaccurately. Our investigations show that the importance of coreness for endpoint influence has been overlooked, and that the combination of endpoint degree and coreness can both describe the maximal connected subgraph of an endpoint accurately and express the intensity of its influence. In this paper, we propose two SRW-based models: DCHI, which combines endpoint degree and coreness, and HCHI, which combines H-index and coreness. Extensive simulations on twelve real benchmark datasets show that, in most cases, DCHI achieves better link prediction performance than HCHI and the traditional models.

1. Introduction

Link prediction aims to find missing, spurious, or potential links from the observed network structure and information [1–5]. Link prediction algorithms have therefore been applied in many fields. For example, they can remove noise from networks [6], recommend friends on online social networks [7–11] and products on e-commerce websites [12–16], guide biological experiments and thereby reduce their cost [17–19], and reveal the evolution mechanisms and organization patterns of networks [20–22].

To reveal the structure of complex networks, researchers have proposed a large number of link prediction models, among which models based on local information have gained particular attention. For example, Kossinets [23] finds that two strangers in a social network are more likely to become friends if they have more common friends. Newman [24] finds that two scientists are more likely to collaborate in the future if they have more common collaborators. Based on this phenomenon, researchers propose the common neighbors model (CN). Improved models built on CN include Salton [25] and LHN-I [26]. Furthermore, to account for the different similarity contributions of common neighbors, Adamic and Adar propose the AA model [27], and Zhou et al. [28] propose the resource-allocation model (RA). Moreover, Cannistraci et al. [29] show that CN, AA, RA, and other algorithms can be weighted by local community information, which further improves their performance. However, the models based on common neighbors only consider the influence of endpoints on one-step paths. Through further research, Lü et al. [30] propose the local path model (LP), which considers the influence of endpoints on three-step paths. In addition, some models consider global information, such as Katz [31] and the hierarchical structure model [32]. Other models consider quasi-local information, which compensates for both the low accuracy of local information and the high computational complexity of global information. For example, the local random walk model (LRW) [33] considers a random walker within a quasi-local range, and the superposed random walk model (SRW) [33] superposes the effects of LRW over different path lengths. Based on SRW, the HSRW [34] and CSRW [34] models consider the roles of H-index [35] and coreness [36], respectively, with different path lengths. The simple hybrid influence model (SHI) [37] combines endpoint degree and H-index into a hybrid influence with different path lengths.

At present, many link prediction models consider only the degree [33] of endpoints, such as Sørensen [38], LHN [26], LRW [33], and SRW [33]. These models imply that a source endpoint spreads its influence to a target endpoint effectively if it has more neighbors connecting it to the target. Lü et al. [39] find that H-index quantifies the influence of an endpoint better than degree or coreness. Zhu et al. [37] find that an endpoint with a large combined degree and H-index has a more extensive maximal connected subgraph, which helps it attract other nodes. Through further investigation, we find that endpoint influence can be expressed by the aggregation degree of neighbors: a large aggregation degree indicates that the endpoint has an extensive maximal connected subgraph and therefore attracts more nodes. The aggregation degree of neighbors can only be quantified by the coreness of endpoints. Thus, we combine endpoint degree and coreness (or H-index and coreness) to quantify endpoint influence and build new link prediction models. Although the SHI model, based on combined degree and H-index, has been explored, the combination of endpoint degree and coreness (or of H-index and coreness) has not been fully verified.

Figure 1 gives an illustration. Endpoint b has degree 6, H-index 3, and coreness 3. Considering degree alone, the influence intensity of endpoint b is 6. However, degree alone cannot accurately express the depth and scope of the influence of an endpoint. Because coreness contributes to endpoint influence, combining degree and coreness, or H-index and coreness, better quantifies the maximal connected subgraph of an endpoint and the aggregation degree of its neighbors. For endpoint b, the product of degree and coreness is 18, and the product of H-index and coreness is 9. With the same coreness, degree and H-index thus indicate different sizes of the maximal connected subgraph of endpoint b, leading to different endpoint influences. Therefore, the prediction performance obtained with different quantification indices of endpoint influence needs to be further explored.

In the real world, many phenomena support this idea. For example, on Weibo, an ordinary individual has limited influence because his/her followers, however many, are individual colleagues, classmates, relatives, or friends, which indicates only a large degree. Public figures, in contrast, have extensive and strong influence because they have large fan clubs, which indicates a large coreness that strengthens their influence. In a scientific collaboration network, a scientist who merely cooperates with many scholars, and thus has a large degree but small coreness, does not become known to more researchers and can hardly attract them to cooperate. In an e-commerce network, the applicability of a product depends on purchase groups with similar identities, such as male/female, student, and teacher groups, which shows the importance of the aggregation degree. In a paper-citation network, the value of a paper depends on citations from researchers in the same field rather than from researchers in different fields.

In summary, we define the hybrid influence as the combination of degree and coreness (or of H-index and coreness), use it to redefine SRW, and propose two improved models, DCHI and HCHI, to further explore the accuracy of link prediction. Experimental results on twelve real networks show that DCHI achieves better link prediction performance.

The rest of this paper is organized as follows. Section 2 builds two models based on combined degree and coreness and on combined H-index and coreness, respectively. Section 3 introduces the twelve benchmark datasets. Section 4 describes the link prediction metric and eight mainstream baselines. Section 5 discusses the experimental results. Section 6 concludes the paper.

2. Models Based on Hybrid Influence of Endpoints

Firstly, we study link prediction models in an undirected simple network $G(V, E)$, where $E$ is the set of links ($|E|$ denotes the number of all edges) and $V$ refers to the set of nodes. Multiple links and self-connections are eliminated. For every pair of nodes $x, y \in V$, a score $s_{xy}$ is given to estimate the probability of their future connection. In this paper, we use the similarity value directly as the score, and a larger score indicates that the potential link is more likely to exist.

Secondly, we show two models based on the degree (SRW [33]) and the synthetical degree and H-index (SHI [37]) separately as follows.

2.1. SRW Model

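Liu et al. [33] build a similarity model using a random walk, which visits the intermediate nodes between two endpoints sequentially according to a Markov chain with one-step transition probability $P_{xy} = a_{xy}/k_x$, where $k_x$ represents the degree of node $x$, and $a_{xy} = 1$ if node $x$ connects to node $y$ and $a_{xy} = 0$ if not. Let $\pi_{xy}(t)$ denote the $t$-step transition probability from $x$ to $y$, that is, the probability that a walker starting at $x$ is located at $y$ after $t$ steps. The vector $\vec{\pi}_x(t)$ collecting these probabilities satisfies $\vec{\pi}_x(t) = P^{T}\vec{\pi}_x(t-1)$ with $\vec{\pi}_x(0) = \vec{e}_x$. Importantly, Liu et al. use the degrees $k_x$ and $k_y$ to quantify the influence of endpoints and define SRW as

$$s_{xy}^{\mathrm{SRW}}(t) = \sum_{l=1}^{t}\left[\frac{k_x}{2|E|}\,\pi_{xy}(l) + \frac{k_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $k_x$ and $k_y$ denote the degrees of endpoints $x$ and $y$, respectively, and $|E|$ indicates the number of links in the network. $k_x/(2|E|)$ and $k_y/(2|E|)$ describe the influence of endpoints $x$ and $y$, respectively.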

2.2. SHI Model

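Zhu et al. [37] find that the H-index can represent the maximal connected subgraph of endpoints and describe the influence intensity. Thus, Zhu et al. combine degree and H-index into the hybrid influence of endpoints, which replaces the degree in SRW, and define the simple hybrid influence model (SHI) as

$$s_{xy}^{\mathrm{SHI}}(t) = \sum_{l=1}^{t}\left[\frac{k_x h_x}{2|E|}\,\pi_{xy}(l) + \frac{k_y h_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $k_x h_x$ and $k_y h_y$ denote the hybrid influence of nodes $x$ and $y$ based on combined degree and H-index, respectively. Here $k_x$ and $h_x$ are the degree and H-index of node $x$, $|E|$ is the number of links, and $\pi_{xy}(l)$ is the $l$-step transition probability from $x$ to $y$.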

Although endpoint degree and H-index can quantify endpoint influence, they represent only the number of neighbors and the maximal connected subgraph of endpoints, respectively, and ignore the influence intensity of endpoints. This intensity can be expressed by the coreness of endpoints, because coreness quantifies the aggregation degree of neighbors. We therefore consider the role of coreness in endpoint influence and build two models, one based on combined degree and coreness (DCHI) and one based on combined H-index and coreness (HCHI), as follows.
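The three endpoint indices discussed here can be computed directly from an adjacency structure. The following is a minimal Python sketch rather than the authors' implementation; the function names and the toy graph are illustrative assumptions. It takes the H-index of a node to be the largest h such that at least h of its neighbors have degree at least h, and it obtains coreness by repeatedly peeling low-degree nodes (k-core decomposition).

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected simple graph as adjacency sets


def degree(graph: Graph) -> Dict[Hashable, int]:
    """Number of neighbors of every node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def h_index(graph: Graph) -> Dict[Hashable, int]:
    """Largest h such that a node has at least h neighbors of degree >= h."""
    deg = degree(graph)
    result = {}
    for node, nbrs in graph.items():
        ranked = sorted((deg[n] for n in nbrs), reverse=True)
        h = 0
        for rank, neighbor_degree in enumerate(ranked, start=1):
            if neighbor_degree < rank:
                break
            h = rank
        result[node] = h
    return result


def coreness(graph: Graph) -> Dict[Hashable, int]:
    """k-core index of every node, found by repeatedly peeling low-degree nodes."""
    remaining = {node: set(nbrs) for node, nbrs in graph.items()}
    core = {}
    k = 0
    while remaining:
        # The shell index never decreases while peeling.
        k = max(k, min(len(nbrs) for nbrs in remaining.values()))
        stack = [node for node, nbrs in remaining.items() if len(nbrs) <= k]
        while stack:
            node = stack.pop()
            if node not in remaining:
                continue  # already peeled through a duplicate stack entry
            core[node] = k
            for nbr in remaining.pop(node):
                remaining[nbr].discard(node)
                if len(remaining[nbr]) <= k:
                    stack.append(nbr)
    return core


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to node 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
k, h, c = degree(toy), h_index(toy), coreness(toy)
dchi_influence = {v: k[v] * c[v] for v in toy}  # degree x coreness
hchi_influence = {v: h[v] * c[v] for v in toy}  # H-index x coreness
```

For endpoint b of Figure 1 (degree 6, H-index 3, coreness 3), the same two products give 18 and 9.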

2.3. DCHI Model

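Following the explanation in Section 1 and the illustration in Figure 1, we combine degree and coreness to quantify the influence of endpoints and replace the degree in SRW to build a new model, DCHI, as

$$s_{xy}^{\mathrm{DCHI}}(t) = \sum_{l=1}^{t}\left[\frac{k_x c_x}{2|E|}\,\pi_{xy}(l) + \frac{k_y c_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $k_x c_x$ and $k_y c_y$ denote the hybrid influence of nodes $x$ and $y$ based on combined degree and coreness, respectively. Here $k_x$ and $c_x$ are the degree and coreness of node $x$, $|E|$ is the number of links, and $\pi_{xy}(l)$ is the $l$-step transition probability from $x$ to $y$.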

2.4. HCHI Model

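Furthermore, we combine H-index and coreness to quantify the influence and replace the degree in SRW to build a new model, HCHI, as

$$s_{xy}^{\mathrm{HCHI}}(t) = \sum_{l=1}^{t}\left[\frac{h_x c_x}{2|E|}\,\pi_{xy}(l) + \frac{h_y c_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $h_x c_x$ and $h_y c_y$ denote the hybrid influence of nodes $x$ and $y$ based on combined H-index and coreness, respectively. Here $h_x$ and $c_x$ are the H-index and coreness of node $x$, $|E|$ is the number of links, and $\pi_{xy}(l)$ is the $l$-step transition probability from $x$ to $y$.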
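The four random-walk models of this section (SRW, SHI, DCHI, and HCHI) can be read as one superposed random walk that differs only in the per-node influence weight. The Python sketch below follows that reading and is not the authors' code. `hybrid_srw_score` and its `influence` argument are illustrative names, the influence values are supplied by the caller (for example degree times coreness for DCHI), and the normalization by twice the number of links is carried over from SRW as an assumption. The graph is assumed to be connected, so every node the walker visits has at least one neighbor.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected simple graph as adjacency sets


def walk_distributions(graph: Graph, source: Hashable, steps: int) -> List[Dict[Hashable, float]]:
    """Return [pi_source(1), ..., pi_source(steps)] for a walker started at source."""
    current = {node: 0.0 for node in graph}
    current[source] = 1.0  # pi_source(0) is the unit vector on source
    history = []
    for _ in range(steps):
        following = {node: 0.0 for node in graph}
        for node, mass in current.items():
            if mass > 0.0:
                share = mass / len(graph[node])  # one-step probability 1 / k_node
                for nbr in graph[node]:
                    following[nbr] += share
        history.append(following)
        current = following
    return history


def hybrid_srw_score(graph: Graph, influence: Dict[Hashable, float],
                     x: Hashable, y: Hashable, steps: int) -> float:
    """Sum over l = 1..steps of q_x * pi_xy(l) + q_y * pi_yx(l), with q = influence / (2|E|)."""
    num_links = sum(len(nbrs) for nbrs in graph.values()) // 2
    q_x = influence[x] / (2 * num_links)
    q_y = influence[y] / (2 * num_links)
    from_x = walk_distributions(graph, x, steps)
    from_y = walk_distributions(graph, y, steps)
    return sum(q_x * px[y] + q_y * py[x] for px, py in zip(from_x, from_y))
```

Passing the plain degree as `influence` gives the SRW score, while passing degree times coreness or H-index times coreness gives DCHI-style and HCHI-style scores under the stated assumptions.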

3. Experimental Data

In this section, we introduce the 12 real network datasets used in the following experiments. (1) US Air97 (USAir) [40] represents the US airline network. (2) Yeast PPI (Yeast) [41] represents the network of interactions between yeast proteins. (3) Food Web (Food) [42] represents the carbon exchanges in the cypress wetlands ecosystem of Florida. (4) Power Grid (Power) [43] represents the electrical power transmission network of the western US. (5) NetScience (NS) [44] represents coauthorships between scientists publishing papers on the subject of networks. (6) Jazz [45] represents the network of jazz musicians. (7) e-mail network (e-mail) [46] represents the e-mail communication network of University Rovira i Virgili (URV) in Spain. (8) Slavko [47] represents the friendship network of Slavko Zitnik on Facebook. (9) UC Irvine social network (UCsocial) [48] represents an online social network of students at the University of California, Irvine. (10) Infectious (Infec) [49] represents the offline contact network of visitors to the exhibition “Infectious: Stay Away” at the Science Gallery in Dublin in 2009. (11) EuroSiS web (EuroSiS) [50] represents the interaction network between Science in Society actors from twelve European countries. (12) C. elegans (CE) [43] represents the network of neurons in the C. elegans worm. Table 1 lists the fundamental topological features of these networks.

In preprocessing, directed arcs are converted to undirected links, and loops and multiple edges are removed so that each network is unweighted and undirected. The largest connected component of the simplified network is then extracted to ensure connectivity.
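The preprocessing just described can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that each raw network is available as a list of (source, target) pairs; the names `simplify` and `largest_component` are hypothetical and do not come from the paper.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def simplify(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Drop direction, self-loops, and repeated edges from a raw edge list."""
    graph: Graph = {}
    for u, v in edges:
        if u == v:
            continue  # self-loop
        graph.setdefault(u, set()).add(v)  # sets absorb multi-edges
        graph.setdefault(v, set()).add(u)  # store both directions
    return graph


def largest_component(graph: Graph) -> Graph:
    """Return the subgraph induced by the largest connected component."""
    seen: Set[Hashable] = set()
    best: Set[Hashable] = set()
    for start in graph:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:  # breadth-first search from start
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return {node: graph[node] & best for node in best}
```

Nodes that appear only in self-loops are dropped by `simplify`, which is consistent with keeping only the largest connected component afterwards.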

The link set $E$ is first divided at random into a training set $E^{T}$ and a testing set $E^{P}$, while the connectivity of the training network is preserved [1]. This division is repeated independently 30 times on each network. The experiments are then run on each of the 30 training/testing pairs, and the reported accuracy is the average of the metric over these 30 realizations.
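One way to realize a random split that keeps the training network connected is to try removing links in random order and to accept a removal only if the remaining graph is still connected. The Python sketch below illustrates this idea and is not taken from the paper. The function names, the `test_fraction` parameter, and the `seed` parameter are assumptions; the split ratio is left to the caller, and node labels are assumed to be mutually comparable (for example integers).

```python
import random
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # undirected simple graph as adjacency sets


def is_connected(graph: Graph) -> bool:
    """Depth-first search from an arbitrary node of a non-empty graph."""
    start = next(iter(graph))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == len(graph)


def split_links(graph: Graph, test_fraction: float, seed: int) -> Tuple[Graph, List[Tuple[int, int]]]:
    """Move random links into a test list while the training graph stays connected."""
    rng = random.Random(seed)
    train = {node: set(nbrs) for node, nbrs in graph.items()}  # copy, input is untouched
    links = sorted((u, v) for u in graph for v in graph[u] if u < v)
    rng.shuffle(links)
    target = int(round(test_fraction * len(links)))
    test: List[Tuple[int, int]] = []
    for u, v in links:
        if len(test) == target:
            break
        train[u].discard(v)
        train[v].discard(u)
        if is_connected(train):
            test.append((u, v))
        else:  # removing this link would disconnect the training graph, so restore it
            train[u].add(v)
            train[v].add(u)
    return train, test
```

If too few links can be removed without disconnecting the graph, the returned test list is shorter than requested. Calling the function with 30 different seeds gives 30 independent divisions.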

4. Experimental Methods

4.1. Metric

AUC [36], a metric of accuracy, can be interpreted as the probability that a potential link (a link in the testing set $E^{P}$) receives a higher score than a nonexistent link (a link in $U \setminus E$, where $U$ denotes the universal link set). In practice, among $n$ independent comparisons, if the potential link scores higher $n'$ times and the two links score the same $n''$ times, the total score accumulates $n'$ and $0.5\,n''$. AUC is then the average score over the $n$ comparisons:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$

AUC evaluates the performance of a model globally. If all scores were drawn from an independent and identical distribution, the AUC value would be approximately 0.5. Therefore, the extent to which the AUC exceeds 0.5 indicates how much better a model performs than pure chance.
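The sampled comparison procedure behind this metric is short enough to write out. The following Python sketch is illustrative only; the function name `sampled_auc` and its parameters are assumptions, and the two input lists are taken to hold the scores of the testing links and of the nonexistent links.

```python
import random
from typing import Sequence


def sampled_auc(test_scores: Sequence[float], nonexistent_scores: Sequence[float],
                comparisons: int, seed: int) -> float:
    """Estimate AUC as (n' + 0.5 * n'') / n over n random pairwise comparisons."""
    rng = random.Random(seed)
    higher = 0  # n': the testing link scored strictly higher
    ties = 0    # n'': both links received the same score
    for _ in range(comparisons):
        potential = rng.choice(test_scores)
        nonexistent = rng.choice(nonexistent_scores)
        if potential > nonexistent:
            higher += 1
        elif potential == nonexistent:
            ties += 1
    return (higher + 0.5 * ties) / comparisons
```

A model that always ranks testing links above nonexistent links yields 1.0, and a model that cannot tell them apart yields about 0.5.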

4.2. Baselines

For comparison, we introduce eight baseline models as follows:

(1) Common neighbors (CN) [24] describes the similarity between endpoints by the number of their common neighbors, defined as

$$s_{xy}^{\mathrm{CN}} = |\Gamma(x) \cap \Gamma(y)|,$$

where $\Gamma(x)$ represents the set of neighbors of endpoint $x$ and $|\Gamma(x) \cap \Gamma(y)|$ refers to the number of common neighbors of endpoints $x$ and $y$.

(2) Adamic/Adar (AA) [27], based on CN, suppresses the contributions of common neighbors with large degree by applying the inverse logarithm, defined as

$$s_{xy}^{\mathrm{AA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z},$$

where $k_z$ represents the degree of node $z$.

(3) Resource-Allocation (RA) [28], analogous to AA, suppresses common neighbors with large degree by applying the reciprocal of their degrees, defined as

$$s_{xy}^{\mathrm{RA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.$$

(4) Local Path Index (LP) [30] considers the similarity on two-step and three-step paths between endpoints simultaneously, with the two-step paths preferred, defined as

$$S^{\mathrm{LP}} = A^{2} + \varepsilon A^{3},$$

where $A$ represents the adjacency matrix and $\varepsilon$ is a punishment parameter.

(5) Superposed Random Walk (SRW) [33] is introduced in Section 2.

(6) CSRW [34] uses the coreness to quantify the influence of endpoints and replaces the degree influence in SRW, defined as

$$s_{xy}^{\mathrm{CSRW}}(t) = \sum_{l=1}^{t}\left[\frac{c_x}{2|E|}\,\pi_{xy}(l) + \frac{c_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $c_x$ and $c_y$ represent the coreness of nodes $x$ and $y$, respectively.

(7) HSRW [34] uses the H-index to quantify the influence of endpoints and replaces the degree influence in SRW, defined as

$$s_{xy}^{\mathrm{HSRW}}(t) = \sum_{l=1}^{t}\left[\frac{h_x}{2|E|}\,\pi_{xy}(l) + \frac{h_y}{2|E|}\,\pi_{yx}(l)\right],$$

where $h_x$ and $h_y$ represent the H-index of nodes $x$ and $y$, respectively.

(8) Simple hybrid influence (SHI) [37] is introduced in Section 2.
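The four local and quasi-local baselines in this list translate almost directly into set operations on neighbor sets. The Python sketch below is a hedged illustration, not the reference implementation of any of the cited papers; the function names and the toy graph used in the usage note are assumptions. The three-step term of LP uses the identity that entry $(x, y)$ of $A^{3}$ equals the sum, over neighbors $u$ of $x$, of the number of common neighbors of $u$ and $y$.

```python
import math
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected simple graph as adjacency sets


def common_neighbors(graph: Graph, x: Hashable, y: Hashable) -> int:
    """CN: number of nodes adjacent to both x and y."""
    return len(graph[x] & graph[y])


def adamic_adar(graph: Graph, x: Hashable, y: Hashable) -> float:
    """AA: common neighbors weighted by 1 / log(degree).

    A common neighbor of two distinct nodes has degree >= 2, so the log is positive.
    """
    return sum(1.0 / math.log(len(graph[z])) for z in graph[x] & graph[y])


def resource_allocation(graph: Graph, x: Hashable, y: Hashable) -> float:
    """RA: common neighbors weighted by 1 / degree."""
    return sum(1.0 / len(graph[z]) for z in graph[x] & graph[y])


def local_path(graph: Graph, x: Hashable, y: Hashable, epsilon: float) -> float:
    """LP: (A^2)_xy + epsilon * (A^3)_xy, computed from neighbor sets."""
    two_step = len(graph[x] & graph[y])
    # (A^3)_xy = sum over neighbors u of x of (A^2)_uy
    three_step = sum(len(graph[u] & graph[y]) for u in graph[x])
    return two_step + epsilon * three_step
```

On a square 0-1-2-3 with the diagonal 1-3 (a hypothetical toy graph), nodes 0 and 2 share the two neighbors 1 and 3, each of degree 3, so CN is 2, RA is 2/3, and there are two three-step walks from 0 to 2.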

5. Results and Discussion

To explore the prediction performance of the proposed models, extensive simulations are conducted on 12 real datasets. We compare the models with several main baselines in terms of the accuracy metric and discuss the findings below.

The SHI, HCHI, and DCHI models mainly consider two aspects: the random walk on paths and the hybrid influence of endpoints. The simulations show that the number of random-walk steps between endpoints affects the accuracy of link prediction. To illustrate how prediction accuracy changes with the number of steps t, we plot the corresponding curves in Figure 2.

In Figure 2, the SHI (combined degree and H-index), HCHI (combined H-index and coreness), and DCHI (combined degree and coreness) models show how prediction performance depends on the number of random-walk steps t, and each reaches its optimal accuracy at a particular t. Specifically, SHI attains its optimal AUC at t = 15 on Food, Power, NS, e-mail, UCsocial, and EuroSiS; t = 5 on USAir and CE; t = 3 on Yeast; t = 2 on Jazz; t = 6 on Slavko; and t = 9 on Infec. The optimal number of steps for SHI thus mainly appears at the long path length t = 15, which illustrates that long paths further facilitate the spreading of the hybrid influence based on degree and H-index. In contrast, HCHI and DCHI both attain optimal AUC values at t = 5 on USAir, Food, Slavko, Infec, EuroSiS, and CE, which illustrates that quasi-local paths further facilitate the spreading of the hybrid influence based on H-index and coreness or on degree and coreness. Importantly, we find that the influence associated with coreness easily leaks during the random walk on longer paths, which weakens the intensity of influence spreading between endpoints. On Power, however, HCHI and DCHI reach their optimal values at t = 15 because the Power network contains many long paths, with an average distance much longer than that of the other datasets (see Table 1). In addition, compared with SHI and HCHI, DCHI has a larger maximal connected subgraph and more paths along which to spread the hybrid influence of endpoints. Therefore, DCHI shows the best prediction performance on ten datasets (marked in black on each dataset), the exceptions being Yeast and CE.

In addition, we compare HCHI and DCHI with eight link prediction models: CN, AA, RA, LP, SRW, CSRW, HSRW, and SHI. Table 2 shows the AUC values of all models averaged over 30 simulations. Underlined bold values are the best AUC values in each dataset, and the numbers in parentheses indicate the optimal number of random-walk steps t, at which HCHI and DCHI obtain the optimal AUC values on eight datasets altogether.

As can be seen from Table 2, DCHI attains the optimal value on seven datasets: Power, NS, Jazz, e-mail, Slavko, Infec, and EuroSiS. In contrast, the local models CN, AA, and RA show the worst prediction performance because they consider only local paths and ignore the influence of endpoints. LP attains the optimal value on three datasets (Yeast, Food, and UCsocial), which illustrates that quasi-local paths improve prediction performance only to a limited extent. SRW, CSRW, and HSRW also perform poorly because they consider the contributions of degree, coreness, and H-index separately, which means that none of these indices alone quantifies the influence of endpoints comprehensively. Finally, we focus on SHI, HCHI, and DCHI. Of the twelve datasets, DCHI achieves the optimal performance on seven. Compared with SHI and HCHI, DCHI captures the effective influence of endpoints (e.g., the extensive maximal connected subgraph of endpoints and the aggregation degree of neighbors) and finds sufficient paths between two unconnected endpoints. Therefore, because the combination of degree and coreness is a good quantification index of hybrid influence, DCHI enhances prediction accuracy more than SHI and HCHI in many cases of link prediction.

Besides, the low computation complexity is a necessary condition in link prediction. The time complexity of the product of two matrices is . According to the definitions of the baseline models, CN, AA, and RA possess the time complexity of and LP, SRW, CSRW, HSRW, and SHI have with coefficient . Although HCHI and DCHI have the same time complexity , two proposed models, especially DCHI, show greater performance improvement. Therefore, the proposed models show a better performance with no increase in complexity.

6. Conclusions

At present, researchers pay increasing attention to the contribution of endpoint influence to link prediction based on local, quasi-local, or global similarity. To quantify this influence, researchers have considered degree, H-index, or coreness separately, none of which evaluates the influence of endpoints comprehensively. Specifically, endpoint degree represents only the number of neighbors of an endpoint and cannot describe its maximal connected subgraph. The H-index can express the maximal connected subgraph of an endpoint and thus quantify the scope of its influence. However, neither endpoint degree nor H-index quantifies the influence intensity of endpoints, so the expression of influence remains incomplete. We find that coreness represents the aggregation degree of endpoints and can therefore quantify the influence intensity of endpoints accurately.

Through our investigations, we find that the combination of degree and coreness and the combination of H-index and coreness can quantify the influence of endpoints accurately and comprehensively. Therefore, we combine degree (or H-index) and coreness into the hybrid influence of endpoints and replace the degree in SRW to build two models, DCHI and HCHI.

We explore the prediction performance of DCHI and HCHI through comparisons with CN, AA, RA, LP, SRW, CSRW, HSRW, and SHI on twelve real datasets. The results show that DCHI clearly outperforms the other models on the AUC metric without increasing computational complexity. This improvement in accuracy illustrates that the combination of degree and coreness, used as the hybrid influence of endpoints, describes the endpoint influence intensity accurately and reflects the ability of endpoints to attract more nodes to form links.

Although our models have been verified on these datasets, they make only a simple combination of endpoint degree, H-index, and coreness. Degrees differ across networks, and so do H-indices and coreness. The network heterogeneity characterized by heterogeneous degrees, H-indices, and coreness directly results in heterogeneous influences. We also find that endpoints in networks with less heterogeneous influence are more likely to attract each other. We will therefore study heterogeneous hybrid influence models based on DCHI and HCHI. In future research, the impact of heterogeneity in complex networks will become a crucial problem.

In addition, our study may provide new insights into similarity-based link prediction. Our results can be applied to friend recommendation, product recommendation, scientific collaboration, biological experiments, and so on.

Data Availability

The data used to support the findings of the study are available at http://vlado.fmf.uni-lj.si/pub/networks/data/ and http://snap.stanford.edu/data/index.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61471060) and Beijing University of Posts and Telecommunications-China Mobile Research Institute Joint Innovation Center.