Abstract
Circular RNAs (circRNAs) have significant effects on a variety of biological processes, and their dysfunction is closely related to the emergence and development of diseases. Identifying circRNA-disease associations therefore contributes to the analysis of disease pathogenesis. Here, we present a computational model called BRWSP to predict circRNA-disease associations, which searches paths on a multiple heterogeneous network based on biased random walk. Firstly, BRWSP constructs a multiple heterogeneous network by using circRNAs, diseases, and genes. Then, the biased random walk algorithm runs on the multiple heterogeneous network to search paths between circRNAs and diseases. Finally, the experimental results show that BRWSP performs significantly better than the state-of-the-art algorithms. Furthermore, BRWSP contributes to the discovery of novel circRNA-disease associations.
1. Introduction
circRNAs are a special type of endogenous noncoding RNA (ncRNA), which are widely present in the gene expression of various organisms. The discovery of circRNAs dates back to the 1970s. Sanger et al. [1] first observed circRNAs while studying plant viruses by electron microscopy. Over the following decades, circRNAs were gradually found in different species and cells, such as yeast [2], zebrafish [3], and mouse [4]. Because of their low abundance and the lack of known function, circRNAs received little attention for a long time.
With the rise and development of high-throughput sequencing technologies, a large number of circRNAs have been found and identified [5, 6]. As the study of circRNAs has deepened, more and more circRNAs have been identified and published. Therefore, various circRNA databases with different emphases have been constructed, such as CircR2Disease [7], circBase [8], exoRBase [9], PlantcircBase [10], circAtlas 2.0 [11], and CSCD [12]. In addition, the biological functions of circRNAs have also been gradually revealed, such as acting as miRNA sponges [13], interacting with RNA-binding proteins (RBPs) [14], and participating in transcriptional regulation [15].
Complex diseases seriously threaten human health [16–18]. Therefore, studies on complex diseases have been a hot topic in medicine and bioinformatics [19, 20]. As more and more biological functions of circRNAs have been revealed, a large body of evidence has indicated that circRNAs play an important role in the emergence and development of complex diseases. According to Liu et al. [21], circRNAs are versatile and can function as microRNA (miRNA) sponges [5, 13] and protein sponges [22, 23]. For example, circSMARCA5 [24] and circCFH [25] have been found to be expressed in a glioma-specific pattern and may be used as tumor biomarkers. circNFIX [26] and circNT5E [27] have been found to play oncogenic roles in glioma, whereas circFBXW7 and circSHPRH have been reported to function as tumor suppressors. Furthermore, circRNAs might become an ideal choice for gene/protein delivery in future brain cancer therapies [21].
Identifying circRNA-disease associations through biological experiments, as in the studies above, is time-consuming and costly. This disadvantage can be overcome by adopting computational methods to identify circRNA-disease associations. Because few circRNA-disease associations were known in past years, machine learning methods have not been widely used to identify circRNA-disease associations. However, progress in predicting miRNA-disease associations and lncRNA-disease associations benefits the development of computational models for circRNAs [28–30]. Recently, Fan et al. [7] constructed the CircR2Disease database through literature retrieval, which provides 661 circRNAs, 100 diseases, and 725 circRNA-disease associations. Another similar database is circRNA-Disease [31], which provides an opportunity to identify circRNA-disease associations by using computational methods. Lei et al. [32] employed depth-first search to search paths between circRNAs and diseases in a heterogeneous network composed of circRNAs and diseases and then used a path weighted method to infer the probability of each circRNA-disease association from the searched paths. Fan et al. [33] built a heterogeneous network by using a circRNA similarity network, a disease similarity network, and circRNA-disease associations, and then employed the KATZ method to predict circRNA-disease associations. Xiao et al. [34] utilized a manifold regularization learning framework to predict human disease-related circRNAs based on a heterogeneous circRNA-disease bilayer network. Zhao et al. [35] proposed a computational algorithm to identify circRNA-disease associations based on bipartite network projection and the KATZ algorithm. Wei et al. [36] employed an improved matrix factorization algorithm to identify circRNA-disease associations. Yan et al. [37] utilized the DWNN-RLS algorithm, based on regularized least squares with a Kronecker product kernel, to identify circRNA-disease associations.
In this paper, we propose a new computational method, named BRWSP, to identify circRNA-disease associations based on biased random walk to search paths on a multiple heterogeneous network. Specifically, BRWSP first establishes a multiple heterogeneous network by using circRNA coexpression similarity network, gene similarity network, disease similarity network, circRNA-gene associations, circRNA-disease associations, and gene-disease associations. Containing multiple types of biological data can facilitate a comprehensive analysis of circRNA-disease associations. Next, a biased random walk runs on this multiple heterogeneous network to search paths between a specific circRNA and a specific disease. BRWSP then calculates the score of specific circRNA-disease association by using those searched paths. Compared with state-of-the-art algorithms, BRWSP obtains better performance in the identification of circRNA-disease associations. The overall framework of BRWSP is depicted in Figure 1.
2. Materials and Methods
2.1. Motivations
(1) Ba-Alawi et al. [38] used a depth-first search algorithm to traverse all simple paths between a specific drug and a specific target protein and then aggregated the scores of the searched paths to infer drug-target interactions. This algorithm was later extended to identify miRNA-disease associations [39], lncRNA-disease associations [40], circRNA-disease associations [32], and microbe-disease associations [41], and it obtained satisfactory performance. However, this algorithm needs to search for all paths between a specific circRNA and a specific disease. If the network is very large, this type of algorithm cannot handle it well, so it cannot be readily extended to a multiple heterogeneous network constructed from many different types of biological networks. Inspired by [42], a biased random walk is proposed to search paths. Compared with the depth-first search algorithm, it chooses paths according to probabilities (see Figure 1(c)). Therefore, if the probability of one path is much smaller than that of the other paths, the walker is very unlikely to select this path when choosing the next step.
(2) Recently, many methods [32–34] have been proposed to identify circRNA-disease associations based on a heterogeneous network. However, these methods use fewer types of biological data and depend greatly on the known circRNA-disease associations, which leads to an insufficient analysis of circRNA-disease associations from a variety of biological perspectives. Therefore, the gene similarity network and gene-disease associations are introduced to build a multiple heterogeneous network, which also contains the circRNA coexpression network, circRNA-disease associations, and the disease similarity network.
2.2. Materials and Preprocessing
2.2.1. circRNA-Disease Associations
The circRNA-disease association dataset is downloaded from the CircR2Disease database (http://bioinfo.snnu.edu.cn/) [7]. The CircR2Disease database contains 725 circRNA-disease associations consisting of 661 circRNAs and 100 diseases. To ensure the accuracy of the data, we only extract circRNAs with circBase IDs and gene symbols. Finally, 427 circRNA-disease associations remain, consisting of 372 circRNAs, 330 gene symbols, and 77 diseases.
2.2.2. Disease Semantic Similarity
The similarity between diseases can be calculated from a directed acyclic graph (DAG). Firstly, we search the Disease Ontology database (http://www.disease-ontology.org/) [43] for the DOIDs corresponding to the 77 diseases extracted in Section 2.2.1. After deleting diseases without a DOID, the dataset contains 55 diseases with DOIDs, 291 circRNAs, 261 gene symbols, and 340 circRNA-disease associations. Based on the disease ontology, Yu et al. [44] created the R package DOSE, which can calculate disease semantic similarity with the doSim function based on Wang's method [45]. In this study, we adopt the DOSE package to calculate disease semantic similarity.
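As a rough illustration of the idea behind Wang's method [45], the sketch below computes the semantic similarity of two terms on a toy DAG. It is not the DOSE implementation: the `parents` mapping, the term names, and the semantic contribution factor `w = 0.5` are assumptions chosen for illustration only.

```python
def s_values(term, parents, w=0.5):
    """Semantic contribution of `term` and each of its ancestors.

    parents: dict mapping a term to the list of its parent terms (hypothetical toy DAG).
    w: semantic contribution factor per edge (assumed value, not taken from the paper).
    """
    values = {term: 1.0}          # a term contributes 1 to its own semantics
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            candidate = w * values[node]
            # Keep the largest contribution over all paths to this ancestor.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(a, b, parents, w=0.5):
    """Shared semantic contribution divided by the total semantic value of both terms."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    common = set(sa) & set(sb)
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))


# Toy ontology: two diseases that share one parent term.
toy_parents = {"disease_A": ["root"], "disease_B": ["root"]}
sim_ab = wang_similarity("disease_A", "disease_B", toy_parents)
sim_aa = wang_similarity("disease_A", "disease_A", toy_parents)
```

In practice the DAG would come from the Disease Ontology and the computation would be delegated to DOSE, as described above.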
2.2.3. circRNA Expression Profile
To calculate the circRNA coexpression similarity network, the circRNA expression profile is downloaded from the database exoRBase (http://www.exorbase.org/) [9]. After converting exor_circ_ID to circBase ID, we eliminate some circRNAs without expression profile among 291 circRNAs. The final data contain expression profile data of 154 circRNAs on 90 samples, 192 circRNA-disease associations consisting of 154 circRNAs (corresponding to 140 gene symbols) and 48 diseases (being shown in Figure 2).
2.2.4. Gene-Disease Associations
To detect associations between the 48 diseases and the 140 genes (corresponding to circRNAs), we download the integrated gene-disease associations from the human_disease_textmining_full.tsv file of the DISEASE Database [46]. This database gives a confidence score for each association. To ensure the reliability of the data, we only select the gene-disease associations whose confidence score is greater than or equal to 2, following previous research [47]. In total, among the 48 diseases and 140 gene symbols, we obtain 80 gene-disease associations consisting of 29 diseases and 34 genes.
Besides, we also extract genes associated with the 48 diseases mentioned above from the DISEASE database [46] and the DisGeNET database [48]. Similarly, we only extract gene-disease associations with a confidence score greater than or equal to 2 from the human_disease_experiments_full.csv file of the DISEASE database [46]. For the DisGeNET database, the gene-disease associations are extracted from the curated_gene_disease_associations.tsv.gz file. Finally, among the 48 diseases mentioned above, 2193 disease-gene associations are extracted, which contain 37 diseases and 1607 disease-related genes.
2.2.5. Constructing Multiple Heterogeneous Network
In this paper, we extract 140 gene symbols (corresponding to circRNAs) from CircR2Disease. Based on these gene symbols, the gene similarity network is constructed by mapping gene products to GO annotations [49]. Genes are annotated by cellular component (CC), molecular function (MF), and biological process (BP). Herein, we use the biological process (BP) annotations to measure gene semantic similarity, which has been shown to give better performance in previous papers [50]. Finally, the adjacency matrix GS is utilized to represent the gene similarity network, and the value GS(i, j) represents the functional similarity between gene i and gene j, which can be calculated by the geneSim function in the GoSemSim package of R [49].
The adjacency matrix CD is constructed to represent circRNA-disease associations, and CD(i, j) is equal to 1 when circRNA i is associated with disease j and 0 otherwise; similarly, the adjacency matrices CG and GD are used to describe circRNA-gene interactions and gene-disease associations, respectively. Besides, we employ the adjacency matrix DS to describe disease semantic similarity, in which DS(i, j) indicates the semantic similarity between disease i and disease j. For the circRNA coexpression similarity CS, CS(i, j) represents the similarity value between circRNA i and circRNA j, which is calculated by using the Pearson correlation coefficient based on the circRNA expression profiles.
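The coexpression similarity described above can be sketched with NumPy as follows. The toy expression values are made up, and the sketch returns the raw Pearson coefficients; how negative correlations are handled before they are used as edge weights is not stated here, so that step is left out.

```python
import numpy as np


def coexpression_similarity(expr):
    """Pearson correlation between every pair of circRNAs.

    expr: array of shape (n_circRNAs, n_samples), one row per circRNA.
    """
    expr = np.asarray(expr, dtype=float)
    # np.corrcoef treats each row as one variable and each column as one observation.
    return np.corrcoef(expr)


# Toy profile: 3 circRNAs measured on 4 samples (illustrative numbers only).
toy_expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
])
CS_toy = coexpression_similarity(toy_expr)
```

With the real data, `expr` would hold the 154 circRNAs on 90 samples, giving a 154 x 154 matrix.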
When predicting circRNA-disease associations, the performance of the algorithm largely depends on the known circRNA-disease associations. However, the known circRNA-disease associations are still limited, which affects the accuracy of the algorithm. To solve this problem, we calculate an initial score for circRNA-disease associations based on the gene-disease associations. The initial score of the association between circRNA i and disease k is defined by equation (1), where g_i is the gene corresponding to circRNA i and g_j represents a gene associated with disease k. GS(g_i, g_j) represents the semantic similarity value between gene g_i and gene g_j calculated by the GoSemSim package of R [49], and the value of equation (1) is the initial score of the association between circRNA i and disease k. If CD(i, k) is equal to 0, CD(i, k) is assigned this initial score as its new value.
Next, a multiple heterogeneous network is constructed by using the circRNA coexpression network, the disease similarity network, the gene functional similarity network, and their association information. It is represented by the block matrix H in equation (2), where CG^T, CD^T, and GD^T are the transposed matrices of CG, CD, and GD, respectively. To avoid the biases caused by larger values in the multiple heterogeneous network, H is normalized by using its degree matrix D, which yields the normalized multiple heterogeneous network.
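A minimal sketch of how such a block matrix could be assembled is given below. The ordering of the blocks (circRNAs, then genes, then diseases) and the row normalization by the degree are assumptions made for illustration; the text does not fix either choice, and the toy matrices are invented.

```python
import numpy as np


def build_heterogeneous_network(CS, GS, DS, CG, CD, GD):
    """Stack similarity and association matrices into one symmetric block matrix.

    Assumed node order: circRNAs, genes, diseases.
    """
    return np.block([
        [CS,   CG,   CD],
        [CG.T, GS,   GD],
        [CD.T, GD.T, DS],
    ])


def row_normalize(H):
    """Divide each row by its degree (row sum). This is one possible normalization."""
    H = np.asarray(H, dtype=float)
    degree = H.sum(axis=1)
    degree[degree == 0] = 1.0      # leave isolated nodes as all-zero rows
    return H / degree[:, None]


# Toy example with 2 circRNAs, 2 genes, and 1 disease.
CS = np.array([[1.0, 0.5], [0.5, 1.0]])
GS = np.array([[1.0, 0.2], [0.2, 1.0]])
DS = np.array([[1.0]])
CG = np.array([[1.0, 0.0], [0.0, 1.0]])
CD = np.array([[1.0], [0.0]])
GD = np.array([[0.0], [1.0]])

H = build_heterogeneous_network(CS, GS, DS, CG, CD, GD)
H_norm = row_normalize(H)
```

Dropping the gene blocks from `np.block` gives the two-layer network that Section 3.4 compares against.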
The overall framework of BRWSP is depicted in Figure 3.
2.3. BRWSP Methods
2.3.1. Biased Random Walk to Search Paths
As discussed in [42], depth-first search (DFS) can reach more different types of nodes because it explores a network as deeply as possible, whereas breadth-first search (BFS) explores the neighbourhood of the source node. Inspired by this, a biased random walk algorithm is designed to search paths between circRNAs and diseases, which combines the advantages of DFS and BFS by adjusting the parameter q of BRWSP (explained below).
Formally, let P(i, j) represent one path between circRNA i and disease j, where each element of P(i, j) is a node (circRNA or disease) and L represents the length of P(i, j). Let n_k indicate the node accessed by the kth biased random walk step. The strategy for selecting the next node is given by equation (3), which defines the transition probability of selecting node x in the next biased random walk step when the currently visited node and the last visited node are v and t, respectively. N(v) and N(t) represent the neighbourhoods of v and t, respectively. For parameter q, if q is assigned a larger value, the nodes of P(i, j) are highly interconnected and belong to communities or similar network clusters (similar to the BFS algorithm). Otherwise, the nodes of P(i, j) can more exactly describe a macroview of the neighbourhood (similar to the DFS algorithm). In other words, we can integrate the strategies of DFS and BFS by adjusting the value of the parameter q. Finally, each neighbour of v obtains a probability of being visited in the next biased random walk step. A roulette selection algorithm, a simple random choice based on probability, is employed to randomly select the next node from the neighbourhood of v based on these probabilities. The selected node is then added to P(i, j). If k is equal to 1, the next node is randomly selected from the neighbourhood of the last node based on their probability.
During the biased random walk that searches paths between circRNA i and disease j, the path from i to j is saved if its length is less than or equal to L. Otherwise, the current biased walk fails to find a corresponding path. To search for more possible paths between circRNA i and disease j, we repeat the above steps maxiter times. Therefore, after the biased random walk, we obtain many paths from circRNA i to disease j.
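The search procedure described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the exact BRWSP procedure: the way `q` enters the transition weights follows the node2vec-style bias of [42] (candidates that are not neighbours of the last visited node have their weight divided by `q`), only simple paths are kept, and the toy graph `adj` and the function name are hypothetical.

```python
import random


def biased_walk_paths(adj, source, target, max_len, q, maxiter, seed=0):
    """Collect paths from `source` to `target` with at most `max_len` edges.

    adj: dict mapping node -> {neighbour: positive edge weight}.
    """
    rng = random.Random(seed)
    found = set()
    for _ in range(maxiter):
        path, prev = [source], None
        while len(path) - 1 < max_len:
            cur = path[-1]
            # Assumption: visited nodes are skipped so that every path is simple.
            candidates = [x for x, w in adj[cur].items() if w > 0 and x not in path]
            if not candidates:
                break                      # dead end: this walk fails
            weights = []
            for x in candidates:
                w = adj[cur][x]
                # Assumed bias: small q favours moving away from the last node
                # (DFS-like), large q favours staying close to it (BFS-like).
                if prev is not None and x not in adj[prev]:
                    w /= q
                weights.append(w)
            # Roulette selection: pick one candidate in proportion to its weight.
            nxt = rng.choices(candidates, weights=weights, k=1)[0]
            path.append(nxt)
            if nxt == target:
                found.add(tuple(path))     # save the successful path
                break
            prev = cur
    return found


# Toy network: one circRNA reaches one disease through a gene or another circRNA.
adj = {
    "c1": {"g1": 0.6, "c2": 0.4},
    "g1": {"c1": 0.6, "d1": 0.9},
    "c2": {"c1": 0.4, "d1": 0.7},
    "d1": {"g1": 0.9, "c2": 0.7},
}
toy_paths = biased_walk_paths(adj, "c1", "d1", max_len=3, q=0.12, maxiter=300)
```

Because each walk samples one route, low-probability branches are rarely expanded, which is what lets the search avoid enumerating every path as a depth-first search would.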
2.3.2. Calculating circRNA-Disease Score Based on Paths
A circRNA and a disease are likely to be associated with each other if many paths with higher weight and shorter length are found between them. Therefore, an exponential decay function is utilized to give more support to paths with high weight and short length, as defined in equation (4). Equation (4) gives the predicted association score between circRNA i and disease j by aggregating over all paths that have been searched between them. For each searched path, it uses the weight of every edge in the path, the length of the path, and a parameter that acts as the decay factor.
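One way to realise such a score is sketched below. It assumes the path-weighting form used by the depth-first-search-based methods that this work builds on [38–41]: for each path, the product of its edge weights is raised to the power (decay factor x path length), and the results are summed over all paths. Whether equation (4) has exactly this form is an assumption, and the function name and toy weights are illustrative.

```python
def path_score(paths, adj, alpha):
    """Aggregate a circRNA-disease score from a collection of paths.

    paths: iterable of node tuples, e.g. ("c1", "g1", "d1").
    adj: dict mapping node -> {neighbour: edge weight in (0, 1]}.
    alpha: decay factor (its value must be supplied by the caller).
    """
    total = 0.0
    for path in paths:
        length = len(path) - 1                 # number of edges in the path
        product = 1.0
        for u, v in zip(path, path[1:]):
            product *= adj[u][v]
        # Weights are at most 1, so a longer path or a larger alpha shrinks the term.
        total += product ** (alpha * length)
    return total


toy_adj = {
    "c": {"g": 0.5, "d": 0.5},
    "g": {"c": 0.5, "d": 0.5},
    "d": {"c": 0.5, "g": 0.5},
}
toy_score = path_score([("c", "g", "d"), ("c", "d")], toy_adj, alpha=1.0)
```

With edge weights no larger than 1, short paths made of strong edges dominate the sum, which matches the intent stated above.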
3. Results and Discussion
3.1. Evaluation Metrics
In this paper, leave-one-out cross-validation (LOOCV) is utilized to analyse the performance of BRWSP in predicting circRNA-disease associations. Based on the results of LOOCV, the receiver operating characteristic (ROC) curve is plotted and the area under the ROC curve (AUC) is calculated as the evaluation criterion.
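For reference, the AUC can be computed directly from predicted scores without plotting the curve, since it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative sample (ties counted as one half). The sketch below uses made-up scores, and the function name is illustrative.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5        # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores only: 3 of the 4 positive-negative pairs are ordered correctly.
toy_auc = auc_from_scores([0.9, 0.8], [0.1, 0.85])
```

In a LOOCV setting, the scores of the held-out positive samples and of the negative samples would be pooled before calling such a function.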
When predicting circRNAs associated with disease k, the positive samples are the circRNAs known to be associated with disease k. Reliable negative samples are also required for evaluation. However, there is no prior information about negative samples (non-disease-related circRNAs). One option is to regard all unknown circRNAs as negative samples, but this approach has two disadvantages. Firstly, there is currently no evidence that the unknown circRNAs are unrelated to diseases. Secondly, this approach leads to a class-imbalance problem, since the number of known circRNAs is much smaller than the number of unknown circRNAs. This phenomenon has also been widely discussed in identifying disease-related genes, miRNAs, and lncRNAs [47, 51–53]. Therefore, it is not scientific to regard all unknown circRNAs as negative samples. To overcome these problems and extract reliable negative samples, we first calculate the initial scores of the associations between all circRNAs and disease k according to equation (1) and arrange them in ascending order. The same number of circRNAs as there are positive samples is then selected as negative samples from the front of this ascending order. If all initial scores are equal to 0, we randomly select negative samples from the circRNAs not known to be associated with disease k, so that the number of negative samples is equal to the number of positive samples. Finally, we obtain the predicted scores for all positive and negative samples.
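The negative sampling rule can be sketched as below. The function name and the toy scores are hypothetical, and the sketch assumes that only circRNAs without a known association with disease k are candidates for negative samples.

```python
import random


def select_negatives(initial_scores, positives, seed=0):
    """Pick as many negative circRNAs as there are positives for one disease.

    initial_scores: dict mapping circRNA -> initial score for the disease.
    positives: set of circRNAs known to be associated with the disease.
    """
    rng = random.Random(seed)
    unknown = [c for c in initial_scores if c not in positives]
    n = min(len(positives), len(unknown))
    if all(initial_scores[c] == 0 for c in unknown):
        # No score information at all: fall back to random selection.
        return rng.sample(unknown, n)
    # Otherwise take the circRNAs with the lowest initial scores.
    return sorted(unknown, key=lambda c: initial_scores[c])[:n]


toy_scores = {"circ_a": 0.9, "circ_b": 0.1, "circ_c": 0.5, "circ_d": 0.0, "circ_e": 0.3}
toy_negatives = select_negatives(toy_scores, positives={"circ_a", "circ_c"})
```

Choosing the lowest-scoring candidates keeps the two classes balanced while preferring circRNAs whose host genes look least related to the disease.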
3.2. Effects of Parameters
There are four parameters in the BRWSP algorithm. Among them, we set the path length L to 3 based on previous studies [38–41]. However, the values of q, maxiter, and the decay factor are undefined. Therefore, we set maxiter = 300 and test different values of q and the decay factor. The experimental results for the different combinations of q and the decay factor are shown in Figure 4. Figure 4 shows that the BRWSP algorithm obtains the best AUC value (0.8675) when q = 0.12 and the decay factor is set to the corresponding value in Figure 4.
3.3. Comparison with Other Methods
To analyse the performance of the BRWSP algorithm in predicting circRNA-disease associations, BRWSP (L = 3, q = 0.12, maxiter = 300, and the decay factor selected in Section 3.2) is compared with KATZHCDA [33], iCircDA-MF [36], RLS-Kron [37, 54], and DFSPW [38–41]. The DFSPW algorithm first searches all paths between circRNAs and diseases and then calculates the score between circRNAs and diseases from these paths by formula (4). For the parameters of DFSPW, the maximum path length and the decay factor are set to 3 and 2.26, respectively, based on previous studies [38–41]. For ease of comparison, we apply all of these computational methods to the same dataset in this paper.
The comparison results of BRWSP and the other algorithms are shown in Figures 5–7. We can observe from Figure 5 that the AUC value of BRWSP is 0.8675, which is an improvement of 6.49%, 19.36%, 21.65%, and 22.81% over the KATZHCDA, RLS-Kron, iCircDA-MF, and DFSPW algorithms, respectively. The precision and recall are listed at each top 100 circRNAs in Figure 7, in which BRWSP achieves excellent performance. In addition, we calculate the number of circRNAs associated with each disease. Then, we arrange them in ascending order and select the top 4 cancer diseases (breast cancer, stomach cancer, colorectal cancer, and papillary thyroid carcinoma) for analysis. The four common diseases are associated with 24 circRNAs, 22 circRNAs, 13 circRNAs, and 12 circRNAs, respectively. Figure 7 shows the performance of each algorithm on the four cancer diseases. In summary, Figures 5–7 show that BRWSP achieves satisfactory performance.
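Precision and recall at a given cut-off of the ranked list can be computed as in the sketch below. The ranked list and the positive set are made-up examples, and the function name is illustrative.

```python
def precision_recall_at_k(ranked, positives, k):
    """Precision and recall over the top-k entries of a ranked candidate list.

    ranked: candidates sorted by predicted score, best first.
    positives: set of candidates with a known association.
    """
    top = ranked[:k]
    hits = sum(1 for c in top if c in positives)
    precision = hits / len(top)          # share of the top-k that are true positives
    recall = hits / len(positives)       # share of all positives found in the top-k
    return precision, recall


toy_ranked = ["circ_a", "circ_b", "circ_c", "circ_d"]
toy_pr = precision_recall_at_k(toy_ranked, positives={"circ_a", "circ_c", "circ_e"}, k=2)
```

Evaluating the function at increasing values of k gives the kind of precision and recall curves over the ranked list that are reported in the figures.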
3.4. The Effect of Gene Network
One of the highlights of our paper is that the gene similarity network is utilized to construct a multiple heterogeneous network with circRNA coexpression similarity network, disease semantic similarity, and associations among them. In this section, we analyse its impact on predicting circRNA-disease associations. In other words, we run our algorithm on a heterogeneous network (constructed by circRNA coexpression similarity network, disease semantic similarity, initial score, and their association information) and a multiple heterogeneous network (constructed by circRNA coexpression similarity network, gene similarity network, disease semantic similarity, initial score, and their association information).
Figure 8 shows that our algorithm performs better on the multiple heterogeneous network (Mul_Het_Net) than on the heterogeneous network (Het_Net). The only difference between Mul_Het_Net and Het_Net is that Mul_Het_Net introduces the gene similarity network. Therefore, the introduction of the gene similarity network is helpful for identifying circRNA-disease associations.
3.5. Case Study
To further demonstrate the effectiveness of BRWSP (L = 3, q = 0.12, maxiter = 300, and the decay factor selected in Section 3.2) in predicting new circRNA-disease associations, a case study is performed for colorectal cancer, which is associated with 13 circRNAs (shown in Table 1). In the experiment, the 13 circRNAs associated with colorectal cancer are used as training data and the other circRNAs act as candidate samples. At the end of the prediction, we rank the scores of the candidate samples in descending order and select the top 20 candidate samples (circRNAs). The literature mining method and the interaction network method are then utilized to analyse the associations between these circRNAs and colorectal cancer.
The result of the literature validation method is shown in Table 2. In the fourth column of Table 2, if there is literature indicating that the gene corresponding to a circRNA is associated with colorectal cancer, the entry is set to the PMID of that literature; otherwise, it is set to “-”. Table 2 shows that 12 literature studies support our results.
The interaction network method shows whether the host gene of a circRNA interacts with disease genes in the PPI network and the Pathway network. If the host gene of a predicted circRNA interacts with disease genes, the predicted circRNA is likely to be associated with the corresponding disease. Genes associated with colorectal cancer are extracted from the DISEASE database [46] and the DisGeNET database [48]; the protein-protein interaction (PPI) network and the Pathway network are extracted from [55]. Then, we extract the interactions between genes associated with colorectal cancer and the genes corresponding to the top 20 circRNAs in the PPI network and the Pathway network. The final analysis result is shown in Figure 9. We can observe that 11 genes corresponding to circRNAs interact with colorectal cancer genes. The gene POLD1 is not only a colorectal cancer gene but is also associated with hsa_circ_0052012. In addition, three connected graphs are formed by the predicted circRNAs, the host genes of the predicted circRNAs, and colorectal cancer genes. The first connected graph contains hsa_circ_0067531, hsa_circ_0002362, hsa_circ_0091894, hsa_circ_0000893, hsa_circ_0052012, hsa_circ_0006022, hsa_circ_0008719, hsa_circ_0008615, hsa_circ_0001727, the host genes of these predicted circRNAs, and the corresponding disease genes. The second connected graph includes hsa_circ_0021549, hsa_circ_0021553, MPPED2, and CAGE1. Similarly, hsa_circ_0064996, SNRK, and STK11 form the third connected graph.
4. Conclusion
In this study, we propose a novel path weighted computational method, named BRWSP, to predict circRNA-disease associations. The highlights of BRWSP are the construction of a multiple heterogeneous network and the use of a biased random walk strategy to search paths between circRNAs and diseases. Firstly, BRWSP constructs a multiple heterogeneous network by using the circRNA similarity network, the gene similarity network, the disease similarity network, and their associations, which makes it possible to analyse circRNA-disease associations from different biological perspectives. Secondly, the biased random walk is employed to search paths, which can eliminate some low-probability paths. Experimental results show that BRWSP achieves satisfactory performance compared with other algorithms.
Although BRWSP can effectively predict circRNA-disease associations, it still has several shortcomings. Firstly, we only use a small number of circRNA-disease associations and do not consider circRNAs without a gene symbol, circBase ID, or expression profile information. Besides, BRWSP has four parameters to set (maxiter, q, L, and the decay factor). Therefore, selecting optimal parameters in different situations remains a challenge. These limitations will motivate our further research in future work.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61972451, 61672334, 61902230) and the Fundamental Research Funds for the Central Universities, Shaanxi Normal University (GK201901010).