Abstract

Evaluating scientific articles has always been a challenging task, made even more difficult by constantly evolving citation networks. Despite numerous attempts at solving this problem, most existing approaches fail to consider the link relationships within the citation network, which can bias the evaluation results. To overcome this limitation, we present an optimization ranking algorithm that combines the P-Rank algorithm with a weighted citation network to provide a more accurate article ranking. The proposed approach employs two hyperbolic tangent functions to model the age of articles and the number of citations they have received, and uses them to update the link relationships of each paper node in the citation network. We validate the effectiveness of the proposed approach using three evaluation indicators and conduct experiments on three public datasets. The experimental results demonstrate that the optimization article ranking method achieves competitive performance compared with other unweighted ranking algorithms. In addition, we note that the optimal Spearman’s rank correlation and robustness are both achieved with one specific combination of the three function parameters.

1. Introduction

Scholarly impact evaluation has always been a hot topic, as it promotes the public understanding and application of scientific achievements [1–3]. Nevertheless, assessing and quantifying academic achievements has proven difficult owing to the constantly changing nature of citation networks [4, 5]. Moreover, different evaluation indicators or ranking algorithms can lead to significantly different evaluation and ranking results [6]. As far back as 1972, Garfield introduced the Journal Impact Factor (JIF) as a means of ranking academic publications [7]. In 1983, Garfield extended the JIF methodology to calculate the academic influence of the author community [8]. In addition, Braun et al. introduced the journal H-index as a comprehensive measure of the correlation between the number of publications and the academic impact of a journal [9, 10].

As a well-known ranking algorithm, PageRank [11–13] has been widely and effectively used to address a variety of ranking tasks such as network traffic prediction and community discovery. In [14], the scholarly impact and activity of scientists in the author community are assessed using the PageRank algorithm. Furthermore, Bollen et al. employed a variant of the PageRank algorithm to refine the computation of the journal impact factor [12]. It is worth noting that most ranking evaluation methods treat the generation of article nodes in a citation network as static. However, the articles in a heterogeneous scholarly network are written, published, and cited in chronological order. Clearly, the methods above fail to account for the dynamic characteristics of the citation network, so newly generated nodes are often neglected owing to a dearth of citations. In order to make adequate use of the time information of nodes in the citation network, Sayyadi et al. developed a time-aware ranking evaluation method based on PageRank called FutureRank [4]. Compared with other ranking evaluation methods, FutureRank is more advanced at evaluating the scholarly impact of each academic entity in a real heterogeneous scholarly community. By employing a simple network traffic model with time information, a ranking evaluation algorithm called CiteRank [15] was proposed to predict the number of citations each article will receive in the future. However, the changing mechanism behind academic entity scores cannot be fully explained by network traffic models. In addition, although PageRank is good at capturing the global structure of a network, it does not consider the local elements that affect the performance and results of evaluation methods.

To address the issue above, Yan et al. [16] proposed an enhanced evaluation approach called P-Rank. P-Rank builds a heterogeneous scholarly network that incorporates various entities such as articles, authors, and journals, and then propagates scores among the subnetworks to calculate the impact of individual entities. In addition to P-Rank, Kleinberg [17] presented an algorithm called HITS, which focuses on the concept of authority. Specifically, academic entities are first divided into hubs and authorities, and their scores are then calculated in a mutually reinforcing manner by exploiting the local structure. In order to improve the evaluation effectiveness of the HITS algorithm, Wang et al. [18] introduced a ranking framework known as PageRank + HITS, which has been shown to produce accurate and reliable ranking results, making it a popular choice in scholarly network analysis. Despite the enhanced performance achieved by these ranking methods, they tend to overlook the link weightings between different networks, which can bias the article ranking results. Taking inspiration from the PageRank + HITS framework, a link optimization scheme called W-Rank [19] was developed, which assigns link weightings to the corresponding subnetworks by calculating citation correlation and author contribution. The relevant experimental results suggest that incorporating link weightings into ranking methods can improve accuracy and reliability.

In essence, the requirements placed on a ranking algorithm differ completely depending on whether it is designed for prediction or for comparative analysis. Existing article ranking algorithms rarely integrate link weights into heterogeneous scholarly networks, which leads to biased article ranking results. Such algorithms treat the links between different entity units as equivalent, which means the differences between links are ignored. In [2, 3], Zhou et al. developed a virtual citation network based on the parallel intelligence framework. The parallel intelligence system employs multiple intelligent agents to form an integrated modeling group, considering the correlation between macrogroup phenomena and rules as well as the variability in microindividual behavior and decision-making. Meanwhile, this modeling approach elucidates how the interaction of various entity units in complex networks and communities affects the overall system, while verifying the coherence of features between virtual and real communities [20, 21]. Therefore, this paper develops a link weighting algorithm, which assigns weights to the links between article nodes based on their practical significance and representation. The aim of this study is to explore and verify whether the evaluation performance of the ranking algorithm can be improved by assigning link weights to the corresponding network and introducing time information.

In comparison with relevant works, the main advantages of the study presented here are fourfold:
(1) The proposed method comprehensively takes into account the influence of the time information and citation number of article nodes in the citation network
(2) We employ two different hyperbolic tangent functions to calculate the link weights between corresponding nodes and establish a weighted citation network
(3) We fully assess the performance of the optimization ranking algorithm by tuning the function configurations and parameter combinations under different conditions
(4) By incorporating appropriate link weightings into the citation network, the optimization ranking method achieved superior performance on different datasets in comparison with the original ranking algorithms

The rest of this paper is organized as follows. Section 2 introduces the heterogeneous scholarly network and the proposed optimization ranking method in detail. In Section 3, we present the experimental results to analyze the influence of function configurations and parameter settings on the performance of the algorithm. Furthermore, we validate the benefits of incorporating link weights into the citation network on the evaluation performance of the algorithm. Finally, we conclude the paper in Section 4.

2. Methodology

This section introduces the proposed optimization ranking method in detail. More specifically, a heterogeneous scholarly network is defined as a network consisting of three layers of entity elements (authors, articles, and journals) that are interconnected through various links. An optimization ranking method based on a weighted citation network is then developed to calculate and evaluate the quality of articles.

2.1. An Introduction to Heterogeneous Academic Networks

A complete heterogeneous academic network is usually composed of three entity elements, namely, the author community, the article citation network, and the journals, as illustrated in Figure 1. A scholarly network can thus be regarded as a heterogeneous network that integrates the information of authors, articles, and journals into a single unit and allows them to interact with each other through subnetworks.

It can be observed in Figure 1 that there exist three linking relationships between the three entity elements: “Write” indicates the relationship between the authors and the articles, “Cite” denotes the citation relationship between an original article and the articles that cite it, and “Publish” denotes the relationship between the journals and the corresponding articles. In [16], a heterogeneous academic network consists of three types of scholarly entities (authors, articles, and journals), which can be represented as follows:

$$G = (V_A \cup V_P \cup V_J,\; E_P \cup E_{AP} \cup E_{PJ}),$$

where $V_A$, $V_P$, and $V_J$ represent the author nodes, article nodes, and journal nodes in the given entity layers. $E_P$ indicates the “Cite” link in the citation network, $E_{AP}$ is the “Write” link from author to article, and $E_{PJ}$ is the “Publish” link from article to journal of publication.

As shown in Figure 2, the three subnetworks which link the three academic entities can be expressed as $G_P = (V_P, E_P)$, $G_{AP} = (V_A \cup V_P, E_{AP})$, and $G_{PJ} = (V_P \cup V_J, E_{PJ})$, respectively. The arrows in Figure 2 denote the specific entity behavior and its direction. It is worth remarking that $G_{AP}$ and $G_{PJ}$ are two undirected subnetworks, and the corresponding links $E_{AP}$ and $E_{PJ}$ only denote the behaviors of “Write” and “Publish.” By contrast, the article citation network $G_P$ is a directed network and the arrows point in the direction of article citation: in Figure 2, an arrow from one article node to another means that the former cites the latter. In addition, Figure 2 shows an article, published in a single journal, that is written by two authors, which illustrates that the relationship between the author layer and the article layer is many-to-one (namely, an article can be jointly written by different authors), while the relationship between the article layer and the journal layer is one-to-one (namely, an article can only be published in one journal). In this study, we introduce a link weighting scheme that updates the unweighted citation network to $G_P = (V_P, E_P, W_P)$, in which $W_P$ indicates the link weight in the citation network. With $W_P$ defined in the citation network, the heterogeneous scholarly network can be updated as follows:

$$G = (V_A \cup V_P \cup V_J,\; E_P \cup E_{AP} \cup E_{PJ},\; W_P).$$

In order to ensure a more objective evaluation of the three academic entities, we assume that the citation correlation between two article nodes in the citation network is mainly influenced by the article age and the number of citations the article has received (see Section 2.2 for more details) [2, 3, 6, 18, 19].
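To make the three layers and their links concrete, the following is a minimal sketch of how such a network could be held in memory. The class name `ScholarlyNetwork`, its methods, and the identifiers `p1`, `a1`, `j1` are illustrative assumptions and do not come from the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ScholarlyNetwork:
    """Hypothetical container for the Cite, Write, and Publish links."""
    cite: set = field(default_factory=set)       # (citing_article, cited_article), directed
    write: set = field(default_factory=set)      # (author, article)
    publish: dict = field(default_factory=dict)  # article -> journal (one journal per article)

    def add_article(self, article, authors, journal):
        # Enforce the one-to-one article/journal relationship described in the text.
        if article in self.publish and self.publish[article] != journal:
            raise ValueError(f"{article} is already published in {self.publish[article]}")
        self.publish[article] = journal
        for author in authors:
            self.write.add((author, article))

    def add_citation(self, citing, cited):
        self.cite.add((citing, cited))

    def authors_of(self, article):
        return sorted(a for a, p in self.write if p == article)


net = ScholarlyNetwork()
net.add_article("p1", authors=["a1", "a2"], journal="j1")  # jointly written article
net.add_article("p2", authors=["a2"], journal="j2")
net.add_citation("p2", "p1")                               # p2 cites p1
```

Storing `publish` as a dictionary keyed by article is one simple way to guarantee that an article belongs to exactly one journal, while the `cite` set keeps the direction of each citation.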

2.2. Link Optimization in Citation Network

Most ranking evaluation approaches view article citation as a static process, without attaching importance to the dynamic change of link relationships during citation. It is worth remarking that citation relevance plays a crucial role in evaluating article quality and should also be taken into consideration in the citation network. Furthermore, influence evaluation and analysis based on topic modeling is an important line of research in data mining, mainly applied to user behavior modeling, sentiment analysis, text mining, and other aspects of social networks. In [22, 23], for instance, Tang et al. developed a topic-based model that is combined with a random walk framework to score articles, authors, and journals simultaneously. The experimental results demonstrate that the topic-based method achieved promising performance in comparison with certain baseline models. In this work, an optimization ranking method based on a weighted citation network is developed to enhance the evaluation performance and rationality of the ranking method. More specifically, the proposed method updates the link relationships between nodes in a citation network by taking into account the age and number of citations of the relevant articles. Compared with binary citation counting, weighted citation calculation is more effective in assessing the actual academic influence of scientific publications because it accounts for important factors that arise as the network changes. The difference between a binary citation network and a weighted citation network is illustrated in Figure 3.

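In the original PageRank and P-Rank, the citation network with $N$ articles is expressed as an $N \times N$ adjacency matrix $M$, in which the link weighting between different articles is computed using

$$M_{ij} = \begin{cases} 1, & \text{if article } i \text{ cites article } j, \\ 0, & \text{otherwise.} \end{cases}$$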
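As an illustration, a binary adjacency matrix of this kind can be built from a list of citation pairs. The sketch below assumes articles are indexed from 0 and that entry `M[i, j]` is 1 when article `i` cites article `j`; the function name `adjacency_matrix` and the toy data are hypothetical.

```python
import numpy as np


def adjacency_matrix(n_articles, citations):
    """Build a binary citation matrix; M[i, j] = 1 when article i cites article j (assumed convention)."""
    M = np.zeros((n_articles, n_articles), dtype=int)
    for citing, cited in citations:
        M[citing, cited] = 1
    return M


# Toy network: article 1 cites 0, article 2 cites both 0 and 1.
citations = [(1, 0), (2, 0), (2, 1)]
M = adjacency_matrix(3, citations)
citations_received = M.sum(axis=0)  # column sums count incoming citations
```

Under this convention the column sums give the number of citations each article has received, which is one of the two quantities the weighting scheme relies on.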

Inspired by the works in [2, 3], this study uses two hyperbolic tangent functions to model the article age and the number of citations, and uses them to update the link relationships between article nodes in the citation network. The weight function that captures the influence of article age, equation (4), applies a hyperbolic tangent function to the age of the $i$th node, measured in months, and returns a weight factor that determines the probability of the article being cited. This weight function should follow two principles, one of which is that it is increasing over the age interval considered. The relationship between article age and the weight factor that affects the probability of the article being cited is depicted in Figure 4.

Subsequently, we employ a second hyperbolic tangent function to describe the relationship between the weight probability and the number of citations of an article. This function takes the number of citations already received as input and returns a weight factor that determines the probability of the article being cited. For rationality, it follows three qualitative principles, including the following: it is increasing in the number of citations received, and its overall slope is always decreasing over the same interval. The relationship between the number of citations already received by an article and the weight factor that influences the probability of the article being cited is illustrated in Figure 5.
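The qualitative behaviour of the two weight functions can be illustrated with a small sketch. The functional forms below are assumptions chosen only to satisfy the stated principles (an increasing age weight with a translation and a slope parameter, and an increasing citation weight with decreasing slope); they are not the paper's equations, and the names `shift` and `scale` and their default values are hypothetical.

```python
import math


def age_weight(age_months, shift=24.0, scale=12.0):
    """Assumed form: increasing in age, bounded in (0, 1).

    `shift` translates the curve along the age axis and `scale` controls its slope.
    """
    return 0.5 * (1.0 + math.tanh((age_months - shift) / scale))


def citation_weight(n_citations, scale=50.0):
    """Assumed form: increasing in the citation count, with decreasing slope, and zero at zero citations."""
    return math.tanh(n_citations / scale)


age_curve = [age_weight(a) for a in (0, 12, 24, 48, 96)]
citation_curve = [citation_weight(c) for c in (0, 10, 20, 30, 40)]
```

Because the hyperbolic tangent saturates, both weights stay bounded however old or highly cited an article becomes, which keeps a few extreme nodes from dominating the link weights.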

Based on the two weight functions for article age and number of citations, the link weighting from one article node to another in the citation network is computed by equation (6), which combines the weight probabilities associated with the age and the number of citations of the articles. The two weight probabilities are scaled by two correlation coefficients, which are defined through an exponential function controlled by a single shaping parameter, together with the median values of the age weights and the citation weights, respectively. The remaining constants are fixed in advance.

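Therefore, the unweighted citation network can be updated using equation (8), which replaces the binary entries of the adjacency matrix with the corresponding link weightings.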
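One way the link weighting step could look in code is sketched below. Everything specific here is an assumption made for illustration: the weight of a link is taken to be a linear combination of the cited article's age weight and citation weight, each coefficient is an exponential of the distance from the median, and the shaping parameter `lam` defaults to 1.0. The paper's equations (6) to (8) may differ in form.

```python
import numpy as np


def link_weights(M, age_w, cit_w, lam=1.0):
    """Turn a binary citation matrix into a weighted one (assumed combination rule).

    M[i, j] = 1 when article i cites article j; age_w and cit_w hold one
    weight per article, e.g. the outputs of two tanh weight functions.
    """
    age_w = np.asarray(age_w, dtype=float)
    cit_w = np.asarray(cit_w, dtype=float)
    alpha = np.exp(lam * (age_w - np.median(age_w)))  # coefficient for the age weight
    beta = np.exp(lam * (cit_w - np.median(cit_w)))   # coefficient for the citation weight
    node_w = alpha * age_w + beta * cit_w             # one combined weight per cited article
    return M * node_w[np.newaxis, :]                  # scale column j by article j's weight


M = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
W = link_weights(M, age_w=[0.9, 0.5, 0.1], cit_w=[0.8, 0.2, 0.0])
```

Entries that were zero in the binary matrix stay zero, so the weighting changes the strength of existing citations without creating new links.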

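Now, we first define $\bar{M}$ as a fractionalized citation matrix. Next, let $x(v)$ indicate the PageRank vector corresponding to the vector $v$; $x(v)$ can be computed from $x = \bar{\bar{M}}x$, in which $\bar{\bar{M}} = d\bar{M} + (1 - d)\,v e^{T}$ and $e$ is the all-ones vector. Hence, the PageRank eigenvector can be calculated by

$$x(v) = (1 - d)(I - d\bar{M})^{-1} v,$$

with $d$ (normally set at 0.85) being a control coefficient. Here, let $Q = (1 - d)(I - d\bar{M})^{-1}$; then $x = Qv$. Thus, for any given $v$, the vector $x(v)$ can be obtained as $Qv$.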
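The closed-form solution can be checked numerically against a plain power iteration. The sketch below assumes a matrix `W` whose entry `W[i, j]` is positive when article `i` cites article `j`, normalises it so that each citing article distributes a total weight of 1, and spreads the score of articles that cite nothing evenly over all articles. The function names and the handling of such articles are assumptions.

```python
import numpy as np


def fractionalize(W):
    """Return a column-stochastic matrix; column i spreads article i's score over the articles it cites."""
    A = np.asarray(W, dtype=float).T          # A[j, i] > 0 when article i cites article j
    n = A.shape[0]
    col_sums = A.sum(axis=0)
    M_bar = np.full_like(A, 1.0 / n)          # articles citing nothing spread evenly (assumption)
    has_refs = col_sums > 0
    M_bar[:, has_refs] = A[:, has_refs] / col_sums[has_refs]
    return M_bar


def pagerank_closed_form(M_bar, v, d=0.85):
    """x(v) = Q v with Q = (1 - d) (I - d * M_bar)^-1."""
    n = M_bar.shape[0]
    Q = (1.0 - d) * np.linalg.inv(np.eye(n) - d * M_bar)
    return Q @ v


W = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
M_bar = fractionalize(W)
v = np.full(3, 1.0 / 3)
x = pagerank_closed_form(M_bar, v)

# Cross-check with power iteration on x = d * M_bar @ x + (1 - d) * v.
x_iter = v.copy()
for _ in range(200):
    x_iter = 0.85 * (M_bar @ x_iter) + 0.15 * v
```

The matrix inverse is convenient for small examples; for citation networks with many articles the iterative form is the practical choice, since it only needs matrix-vector products.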

2.3. Optimization Ranking Method

The weighted evaluation score of the article samples is represented as the PageRank vector associated with a personalized eigenvector. The definition of this eigenvector involves two vectors: one contains the number of articles written by each individual author in a set of articles, and the other contains the number of articles published in each journal. The interdependence between the three academic entities is controlled by two parameters (normally set to 0.5). The scores of the authors and the journals are then calculated accordingly.

In the optimization ranking method, the initial score of each article in the dataset is set to $1/N$, where $N$ is the total number of articles, and the scores of all articles are normalized to sum to 1 in each iteration. Furthermore, the convergence threshold is set to 0.0001, and the above steps are executed iteratively until convergence. The pseudocode of the optimization ranking method is shown in Algorithm 1.

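Input: scholarly network, number of citations and age information of all articles
Output: PageRank score for each article
Settings: parameters of the two hyperbolic tangent functions, correlation coefficients, control coefficient (0.85), interdependence parameters (0.5), convergence threshold (0.0001)
Steps:
1 Initialize the scores of all article samples to $1/N$, where $N$ denotes the total number of articles.
2 Update the link weightings in the citation network utilizing equations (6) and (8).
While not converged do
    update the article, author, and journal scores as described in Section 2.3
end
Return the article, author, and journal scores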
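The iteration in Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: author and journal scores are taken to be the mean score of their articles, the personalized vector mixes the two with weights `phi1` and `phi2`, and every author and journal is assumed to have at least one article. The function and argument names are hypothetical.

```python
import numpy as np


def rank_articles(M_bar, authorship, journal_of, d=0.85, phi1=0.5, phi2=0.5,
                  tol=1e-4, max_iter=1000):
    """Iterate article scores to convergence (assumed P-Rank-style update).

    M_bar:      column-stochastic weighted citation matrix, shape (n, n)
    authorship: 0/1 matrix, authorship[a, p] = 1 when author a wrote article p
    journal_of: journal index of each article, length n
    """
    n = M_bar.shape[0]
    authorship = np.asarray(authorship, dtype=float)
    journal_of = np.asarray(journal_of)
    n_journals = int(journal_of.max()) + 1
    per_author = authorship.sum(axis=1)
    per_journal = np.bincount(journal_of, minlength=n_journals)

    def entity_scores(x):
        # Mean article score per author and per journal (assumed definition).
        x_auth = authorship @ x / per_author
        x_jour = np.bincount(journal_of, weights=x, minlength=n_journals) / per_journal
        return x_auth, x_jour

    x = np.full(n, 1.0 / n)                       # initial score 1/N
    iterations = max_iter
    for step in range(1, max_iter + 1):
        x_auth, x_jour = entity_scores(x)
        v = phi1 * (authorship.T @ x_auth) + phi2 * x_jour[journal_of]
        v /= v.sum()                              # personalized vector sums to 1
        x_new = d * (M_bar @ x) + (1.0 - d) * v
        x_new /= x_new.sum()                      # article scores sum to 1
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            iterations = step
            break
    x_auth, x_jour = entity_scores(x)
    return x, x_auth, x_jour, iterations


# Toy data: article 1 cites 0; article 2 cites 0 and 1; article 0 cites nothing.
M_bar = np.array([[1 / 3, 1.0, 0.5],
                  [1 / 3, 0.0, 0.5],
                  [1 / 3, 0.0, 0.0]])
authorship = [[1, 1, 0], [0, 1, 1]]
journal_of = [0, 0, 1]
scores, author_scores, journal_scores, n_iter = rank_articles(M_bar, authorship, journal_of)
```

The loop stops once the total absolute change in article scores falls below the threshold, with `max_iter` acting as a safety cap.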

3. Experiments

In this section, we first verify the evaluation performance of the proposed ranking algorithm under different conditions. Furthermore, we validate the robustness and ROC performance of the optimization ranking algorithm with different parameter combinations.

3.1. Datasets and Settings

In this work, we perform ranking evaluation experiments on three publicly available datasets, i.e., arXiv (hep-th) (http://www.cs.cornell.edu/projects/kddcup/datasets.html), Cora (http://people.cs.umass.edu/mccallum/data.html), and MAG (https://aminer.org/open-academic-graph). These three datasets were selected so that the results are more representative than those from a single dataset. In addition, the convergence rate and robustness of the optimization ranking method need to be verified on several datasets of the same type. Before the experiments, each sample is characterized by four elements: article serial number, article age, number of citations of the article, and article score. Table 1 presents a summary of statistics for the three datasets.

The data preprocessing and relevant experiments are conducted on a server with a 3.60 GHz Intel i9-9900K processor and a Linux 4.17.0 operating system. The optimization algorithm is implemented in Python 3.7.6 64-bit, and the code is available at https://github.com/Weighted-P-Rank and https://github.com/pjzj/JIF-Modeling.

3.2. Evaluation Metrics

In this section, we introduce the two evaluation metrics used in detail.

3.2.1. Spearman’s Rank Correlation

Assessing and ranking articles has always been a daunting task because the real academic quality or influence of an article is difficult to quantify accurately [24]. Furthermore, the ranking evaluation results vary significantly with the ranking indicators or approaches utilized [25]. In [4], Sayyadi et al. utilized the FutureRank score as the baseline guide for evaluation. Yet this approach may result in some older articles receiving higher scores because the iteration of PageRank is inherently biased towards older article nodes in the citation network. To address the limitations of employing FutureRank scores as the baseline guide, Wang et al. utilized the number of future citations as an alternative evaluation metric, which provides a less biased evaluation of article quality by focusing on future citations rather than historical factors [18, 19]. In this study, we also employ Spearman’s rank correlation to evaluate the ranking performance of the optimization method under different parameter conditions. For a given sample set of $n$ articles, the initial data are first transformed into rank data. Let $R_1(P_i)$ and $R_2(P_i)$ represent the ranking of article $P_i$ in the first and second experiments, and let $\bar{R}_1$ and $\bar{R}_2$ represent the average rankings in the two experiments. Spearman’s rank correlation is then computed as follows:

$$\rho = \frac{\sum_{i=1}^{n} \left(R_1(P_i) - \bar{R}_1\right)\left(R_2(P_i) - \bar{R}_2\right)}{\sqrt{\sum_{i=1}^{n} \left(R_1(P_i) - \bar{R}_1\right)^2 \sum_{i=1}^{n} \left(R_2(P_i) - \bar{R}_2\right)^2}}$$
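For reference, Spearman's rank correlation can be computed directly from two score lists by ranking each list and then correlating the ranks. The sketch below uses only NumPy and assigns tied values their average rank; the function names are illustrative, and it assumes neither list is constant (otherwise the denominator is zero).

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 (smallest) upwards; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman(a, b):
    """Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ra_c, rb_c = ra - ra.mean(), rb - rb.mean()
    return float((ra_c * rb_c).sum() / np.sqrt((ra_c ** 2).sum() * (rb_c ** 2).sum()))


rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # two adjacent pairs swapped
```

In the experiments, one list would hold the scores produced by a ranking algorithm and the other the reference quantity, such as the number of future citations.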

3.2.2. Robustness

Among the many evaluation metrics available, robustness refers to the ability of a system or algorithm to maintain its performance and stability in the face of variations or disturbances that arise during its operation. In this study, the entire duration of each dataset is split into two stages at a historical time node chosen for that dataset. The robustness of an article ranking method on a dataset is then computed as the correlation between the ranking obtained from the stage before the historical time node and the ranking obtained from the entire duration. That is, the higher the correlation between the two time durations, the more robust the algorithm.
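The robustness measure can be sketched as follows. To keep the example self-contained, the ranking algorithm is replaced by a stand-in that simply counts citations received; in practice the ranking method under test would be passed as `rank_fn`. The data layout (publication month per article, citations as `(citing, cited, month)` triples) and all names are assumptions.

```python
import numpy as np


def _ranks(values):
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def _spearman(a, b):
    ra, rb = _ranks(a), _ranks(b)
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))


def citation_counts(n_articles, citations):
    """Stand-in ranker (placeholder): score = number of citations received."""
    scores = np.zeros(n_articles)
    for _citing, cited, _month in citations:
        scores[cited] += 1
    return scores


def robustness(published, citations, time_node, rank_fn=citation_counts):
    """Correlate scores from before the time node with scores from the full period."""
    published = np.asarray(published)
    n = len(published)
    early = [c for c in citations if c[2] < time_node]
    existing = published < time_node           # compare only articles that already existed
    return _spearman(rank_fn(n, early)[existing], rank_fn(n, citations)[existing])


published = [0, 1, 2, 10]                      # publication month of each article
citations = [(1, 0, 1), (2, 0, 2), (2, 1, 2), (3, 0, 10), (3, 2, 10)]
r = robustness(published, citations, time_node=5)
```

Restricting the comparison to articles published before the time node avoids penalising a method for articles that could not have been ranked in the earlier stage.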

3.3. Experimental Results and Analysis
3.3.1. Function Configurations

In the experiments, we employ two parameters to modulate the function configurations. Through the utilization of various function configurations, we evaluate and analyze the optimization ranking method in comparison with prior studies. The four conditions are as follows:
(1) Neither function is used: the conventional P-Rank method is employed for computing ranks
(2) Age function only: a hyperbolic tangent function that takes into account article age only is incorporated into the citation network
(3) Citation function only: a hyperbolic tangent function that only employs the number of citations of the article is introduced into the citation network
(4) Both functions: two hyperbolic tangent functions are introduced into the citation network, which considers the article age and the number of citations of the article simultaneously

3.3.2. Three Parameters in Two Hyperbolic Tangent Functions

From Figure 4, it can be seen that the degree of translation and the overall slope of the curves are jointly affected by the two parameters in equation (4). Similarly, Figure 5 shows that the parameter of the citation weight function reflects the slope of each curve. Inspired by the studies in [2, 3], three kinds of parameter settings are used in this work:
(1) Minimum sampling condition
(2) Baseline sampling condition
(3) Maximum sampling condition

With the assumptions above, we will validate Spearman’s ranking correlation and robustness of the proposed optimization method across all datasets, which can be found in Figure 6 and Tables 2 and 3.

It can be seen in Figure 6 that the optimal Spearman’s ranking correlation (arXiv: 0.603; Cora: 0.335; and MAG: 0.553) and robustness (arXiv: 0.854; Cora: 0.446; and MAG: 0.707) of the optimization method are obtained with one specific combination of the three parameters. Furthermore, we note that the proposed optimization method achieved competitive ranking performance under all three parameter configurations in comparison with the original P-Rank and PageRank. That is, the optimization ranking method based on a weighted citation network outperforms these unweighted baselines in our experiments. This result suggests that the evaluation performance of an article ranking approach can be enhanced by jointly taking into account the article age and the number of citations of the article in the citation network.

It can be observed in Figure 6(a) and Tables 2 and 3 that the greater the article age (i.e., the smaller the two parameters of equation (4)), the higher the Spearman’s ranking correlation and the robustness tend to be. In addition, we note that the ranking performance of the proposed optimization algorithm improves as the remaining parameter becomes smaller.

Subsequently, we employ ROC curves and AUC to further validate the evaluation performance of the optimization approach with three parameter combinations (see Figure 7). Initially, all article samples are scored and ranked utilizing the PageRank algorithm. Based on the default threshold, all article samples are classified as positive samples or negative samples. To ensure the reliability of the experimental results, we conduct five independent tests on each dataset and record the average performance of the different ranking algorithms.
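As a reminder of what the AUC measures, the sketch below computes it as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. How samples are labelled positive or negative is not reproduced here; the labels and scores in the example are made up, and the function name is illustrative.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()    # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()   # tied pairs count one half
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
auc_mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

An AUC of 0.5 corresponds to a ranking that is no better than random ordering of positives and negatives, which is a useful reference point when reading the values reported for each method.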

As illustrated in Figure 7, the proposed optimization ranking method achieved competitive ROC performance on the three datasets under all three parameter combinations. It is worth remarking that the optimal AUC values (arXiv: 0.5529; Cora: 0.4531; and MAG: 0.5912) on the three datasets are all obtained by the optimization ranking method with one of these parameter combinations. By contrast, the AUC values achieved by PageRank (arXiv: 0.4533; Cora: 0.3279; and MAG: 0.4503) and P-Rank (arXiv: 0.3289; Cora: 0.3484; and MAG: 0.5020) are lower on every dataset. The results suggest that considering link weighting between different nodes in the citation network can effectively improve the rationality of the ranking evaluation algorithm.

4. Conclusion and Future Work

Refining the evaluation of scientific papers is crucial, yet it poses significant difficulty owing to the intricate and constantly evolving nature of the academic network. In this study, we proposed an optimization ranking algorithm based on a weighted citation network and the P-Rank algorithm. The main aim of the developed optimization ranking method is to attach link weightings between different nodes in the citation network by calculating the corresponding article age and the number of citations of each article. The effectiveness of the proposed optimization ranking approach was validated by experiments on three different datasets. The experimental results demonstrated that the optimization ranking method exhibited superior performance across all three datasets, with optimal outcomes attained under one specific parameter combination. In addition, the proposed optimization ranking method achieved competitive ROC performance on the three datasets under all parameter combinations tested. Taken together, the proposed link weighting scheme improves the performance of the article ranking algorithm, in particular compared with the other unweighted methods.

In the future, we would like to further examine the effectiveness and universality of the link weighting scheme by testing it against more ranking evaluation algorithms. Moreover, combining the advantages of various approaches and incorporating factors such as discipline and topic may further improve the ranking performance of the proposed method.

Data Availability

The optimization algorithm is implemented in Python 3.7.6 64-bit, and the code is available at https://github.com/Weighted-P-Rank and https://github.com/pjzj/JIF-Modeling.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Zhejiang Provincial Key Research and Development Program (2023C01233).