Abstract

In view of time delay existing in gene regulation, by using the analysis idea and methods of complex network, this paper proposes a multi-time-delay gene regulation network analysis method based on the fuzzy label propagation. The algorithm takes the relative change trend coefficient, the correlation coefficient, and the mutual information as the similarity measurement indexes for the gene pair, fully reflecting the correlation of gene pairs and simultaneously obtaining the gene regulation relationship and the time delay through the fuzzy label propagation algorithm of the semisupervised learning. Experimental results of the cell cycle-regulated genes of yeast show that the proposed construction method of GRN can not only correctly select potential regulation genes but also provide details about the gene regulator model, thereby more accurately constructing gene regulation network.

1. Introduction

The interaction between the implicit genes in gene expression data can be used to construct gene regulatory network by analysing gene expression data [1]. The research of gene regulatory network is one of the topics of postgenomic informatics. It mainly analyses gene expression data, uses bioinformatics methods and technologies to identify the topological structure of gene network to deeply understand the structure and function of biology and the mechanism of pathological changes, and understands life phenomena in a systematic framework [2, 3]. Gene network research can be used to reveal the development process and mechanism of biological tissue system and help understand the regulation of internal substances, which can promote people to effectively identify the cause of disease. In particular, the study of human tumor gene regulatory network can make us have a deep understanding of the regulatory relationship of tumor related genes and then provide basis and guidance for tumor gene therapy. Gene regulatory network, as the molecular basis of basic cell life activities, has the biological characteristics of randomness, complexity, spatiotemporal specificity, and dynamic. This makes the construction of gene regulatory network very difficult.

The time-series gene expression data have been widely applied in the research on the gene regulation network, and attention to the time delay has been paid increasingly as the important factor for the gene regulation network construction [4]. The time-delay processing can be generally classified into two types: first type—firstly calculate the time delay among genes, then translate the time-series gene expression data to the calculated time delay to achieve the effect of removing delay, and finally construct the gene regulation network to analyse the regulation relationship among the genes; second type—directly construct the time-delay gene regulation network model and obtain the time delay and the regulation relationship through the time-delay regulation network method [5]. In the first type of analysis, Ahsen et al. [6] obtained the phase and frequency of two gene expression data in the frequency domain through the frequency domain change and calculated the time delay on the basis of the relationship between the phase and the frequency. Huang et al. [7] obtained the mutual time delay among 101 genes by using this time-delay estimation method, constructed the gene regulation network by adopting the concept of community detection, and obtained the better result. The delay time obtained by this method is not always the integer multiple of time interval, and it is necessary to obtain the delay removed gene expression value by using the curve fitting method, which is not conducive to the follow-up study.

The regulation time of gene expression in cells is not synchronous, and the regulation delay length is also different [8]. The existing dynamic Bayesian network model of gene expression regulation network based on time-series gene expression data is difficult to model the asynchronous multi-time-delay regulation relationship [9]. In order to solve this urgent problem, this paper proposes a semisupervised learning method that can accurately model the asynchronous and multi-time-delay regulatory relationship between genes. It can learn the gene expression regulatory network with asynchronous and multi-time-delay characteristics from the time-series expression data of gene chip. In the learning process, it can use the known class data and the unknown class data to obtain more information and have better learning effect.

The reconstruction of gene regulatory network based on expression data is also called reverse engineering or network inference. In recent years, various algorithms have been proposed by analysing gene expression data, such as GA [10], gene programming [11], evolutionary strategies [12], and ACO [13]. However, the GRNs modeled by the above algorithm consist of only a limited number of genes. How to reconstruct large-scale gene regulatory network is still an unknown biological problem.

At present, there are various models to model gene regulatory network. The simplest model is based on Boolean networks. In reverse engineering, Boolean networks are used to infer the underlying topology and the Boolean functions at the nodes from the observed gene expression data. In addition, continuous network is an extension of Boolean network [14], which is also widely used to model gene regulatory network. Nodes still represent the regulatory effect of genes and their connections on gene expression. Genes in biological systems show continuous range of activity levels, and it has been considered that continuous networks can capture some properties of gene regulatory networks that do not exist in Boolean models. Many methods based on continuous networks have been proposed to infer gene regulatory networks, for example, based on linear regression and based on mutual information. In Arachne algorithm, the specific information of each gene pair can be calculated in an appropriate way to get the actual value of mutual information, and compared with the fixed threshold value, a regulatory interaction can be inferred. In addition, many probabilistic graphical models have been proposed to measure the high-order dependence between different gene expression patterns. Bayesian network is one of the most popular methods to infer gene regulatory network. In Bayesian networks, directed acyclic graphs are used to indicate the conditional dependence between random variables [15].

Many researchers think that the time delay among the genes is constant value, and the time delay varies from gene pairs so that the analysis on the multi-time-delay gene regulation network was proposed in succession. To be favorable to construct the gene regulation network, the time delay is normally regarded as the integral multiple of time interval. Based on this, Yang et al. [16] firstly established the time-delay gene expression matrix to dig the time-delay regulation relationship among the genes through the decision tree classifier. Yang [17] constructed the multi-time-delay gene regulation network by using the high-order Markov dynamic Bayesian network. Raja Chowdhury and Chetty [18] constructed the multi-time-delay gene regulation network by using the correlation coefficient method. In this method, the time-delay correlation coefficient among the genes was firstly established, the maximum value of correlation coefficient in each gene pair and the time delay corresponding to this value were obtained through the dynamic threshold method, and finally the maximum value of correlation coefficient in the gene pair was compared with the given threshold to screen the correlation coefficient greater than the threshold and obtain the genes corresponding to these correlation coefficients and the time delay to complete the analysis on the multi-time-delay gene regulation network. This method is simple and can effectively handle the time-delay problem. Aderhold et al. [19] established the time-delay mutual information among the genes and constructed the multi-time-delay gene regulation network through the dynamic Bayesian network: firstly construct the multi-time-delay mutual information matrix to select the larger gene in the mutual information and then complete the analysis on the gene regulation relationship by the dynamic Bayesian network. Better effect has been obtained by this method. However, most of these time-delay methods start from the relationship between genes but ignore the characteristics of genes.

The single metrical scale was used when the similarity among the genes was measured by the above methods. When constructing the gene regulation network, Liu et al. [20] pointed that the single similarity evaluation scale cannot reflect the correlation among the genes very well, so they evaluated the correlation among the genes by using the combined method of correlation coefficient and interquartile range and obtained better gene regulation relationship through the vector analysis by taking the interquartile range of the gene pair as the horizontal ordinate and the correlation coefficient as the vertical coordinate. By reference to [21, 22], this paper combines the multi-time-delay correlation coefficient, the mutual information, and the relative change trend coefficient to construct the new gene pair correlation evaluation matrix and complete the analysis on the multi-time-delay gene regulation network through the semisupervised learning method of fuzzy label propagation.

3. Relevant Notes

The time-series gene expression data are denoted as , wherein expresses the expression value of gene at the time point and . The maximum delay time among the genes is denoted as times the time interval.

3.1. Time-Delay Relative Change Trend Coefficient

The matrix obtained by gene expression data discretization is denoted as .where .

For any two genes and in the dataset, the relative change trend coefficient for the gene and the gene at the time point is denoted as after the gene is delayed by unit of time and can be calculated by the following the formulas:

The value of is , wherein indicates the similar change trend of two genes.

The relative change trend of two genes after delay is graded. The number of values equal to is denoted as , and the relative change trend score for the gene and the gene is denoted as after the gene is delayed by unit of time.

3.2. Time-Delay Correlation Coefficient

The correlation coefficient for the gene and the gene is denoted as after the gene is delayed by unit of time.where expresses the mean value of the former expression values for the gene and expresses the mean value of the latter expression values for the gene .

3.3. Time-Delay Mutual Information

Mutual information expresses the shared information amount between two genes, firstly performing interval partition for the gene expression dataset and then calculating the delay mutual information matrix among the genes. The mutual information for the gene and the gene is denoted as after the gene is delayed by unit of time.where is information entropy. The calculation method is as shown in formulas (7) to (9):where takes the former expression value and takes the latter expression value.

3.4. Gene Pair Similarity Evaluation Matrix

The gene pair similarity evaluation matrix is , expresses the expression value of the attribute for the gene and the gene after the gene is delayed by unit of time, and , wherein the attributes of gene pair are, respectively, relative change trend, correlation coefficient, and mutual information. For the convenience of subsequent analysis, the time-delay similarity evaluation matrix is denoted as , wherein expresses the similarity sample for the gene and the gene after the gene is delayed by unit of time.

4. Multi-Time-Delay Gene Regulation Network Based on Fuzzy Label Propagation

4.1. Algorithm Description

The converted datasets are classified by using the fuzzy label propagation algorithm of semisupervised learning. There are two label values: and , wherein indicates that the regulation relationship exists between two genes in the gene pair and indicates that there is no regulation relationship.

In the fuzzy label propagation algorithm, firstly divide into the labeled data set and the unlabeled dataset and calculate the similarity of any two samples and by using RBP kernel function.where expresses the variance of difference value between two samples.

Express the category of sample with the vector of dimensions:(1)If the sample ,where .(2)If the sample , the label value of is propagated from the adjacent samples and the membership that belongs to the category meetswhere expresses the set composed of adjacent samples of , and the results are obtained from formula (12):

As the category labels of unknown samples are continuously renewed, in formula (13) is required to be repeatedly calculated until the fuzzy category label values of all samples are not changed.

Obtain the fuzzy label value of all samples and convert the label value matrix by the following formula:

Convert the label value matrix by the following formula:

The regulation relationship exists between two genes corresponding to the samples with the label value of , and the time delay is times the time interval.

4.2. Algorithm Steps

Step 1: estimate the missing value in the simulation data set by using the missing value estimation method [23] and construct the complete dataset.Step 2: calculate the time-delay relative change trend coefficient matrix, the time-delay correlation coefficient matrix, and the time-delay mutual information matrix of all gene pairs in the complete dataset.Step 3: obtain the similarity evaluation matrix of gene pair. This similarity evaluation matrix is a multidimensional space matrix. For simulation simplicity, the matrix is processed accordingly to be converted to the two-dimensional space matrix. Make , wherein the row sequence of the row vector is as follows: no delay between the gene and the gene , delay 1 time unit between the gene and the gene , and delay time unit between the gene and the gene .Step 4: add the label value to a small number of gene pairs, calculate the fuzzy label values of unknown gene pairs on the basis of fuzzy label propagation algorithm, and obtain the regulation relationship and the time delay between the genes.

4.3. Time Complexity Analysis of Algorithm

There are two main bottlenecks in the calculation of this algorithm. The first is to use mutual information to find the time delay between gene pairs, and the second is to use fuzzy label transfer algorithm to classify datasets. It is assumed that the number of genes is , the length of gene time series is , the maximum time delay is , and the number of iterations of fuzzy label transfer algorithm is . When we use equation (6) to find the mutual information of a target gene and its regulator under a certain time delay, we need to traverse the gene expression level matrix once, and the algorithm complexity is . So, the time complexity of the algorithm is . The time complexity of initialization is . The time complexity of fuzzy label transfer algorithm using semisupervised learning is . The time complexity of calculating score function is . Therefore, the total time complexity is .

5. Results and Discussion

5.1. Simulation Dataset

The simulation dataset is selected from the yeast cell gene chip data [24, 25] provided by Spellman et al. in Stanford University, from which 6 genes are extracted to form a small gene regulation network. The data are as shown in Table 1.

Extract the regulation relationship among 6 genes based on the research of Hou et al. [26]. The regulation network structure is shown in Figure 1.

6. Results

In simulation, firstly we need to select part of samples to add the labels. In this paper, the sample label value is set to when the delay between the genes Clb6 and Cln1 is 0, the sample label value is set to when the delay between the genes Clb2 and Cln2 is 0, and the maximum time delay is set to . The simulation results are shown in Table 2.

It can be seen from Figure 1 that there are 10 pairs of genes having the regulation relationship. Table 2 shows that the method of this paper can correctly identify 8 pairs of genes having the regulation relationship, accounting for 80% of the total gene pairs having the regulation relationship, and the accuracy is relatively perfect. In 8 pairs of genes correctly identified, there are two pairs of genes having time delay, namely, Swi5 and Cln2 and Cln2 and Clb1, and the time delay is 1 unit. The change relationship of expression values for 8 pairs of genes correctly identified is shown in Figure 2. Horizontal coordinates represent gene expression level and vertical coordinates represent time point of gene expression.

It can be seen from Figure 2 that the change relationships of the gene expression data in Figures 2(a) and 2(c)2(e) are basically consistent and the change relationships of the gene expression data in Figures 2(b) and 2(f)2(h) are basically contrary, and Figure 2(g) shows the change relationship between the gene Cln2 and the gene Clb1 after the gene Cln2 is delayed by 1 unit of time. On the left side, the time points corresponding to the peaks and troughs of Cln2 expression value and Clb1 expression value are basically the same, and the change trends are contrary; on the right side, the expression value changes of two genes are disordered to some extent, but except for the last three time points, the expression value changes of other time points substantially conform to the change contrary trends. Figure 2(h) shows the expression value change relationship between the gene Cln2 and the gene Swi5 after the gene Cln2 is delayed by 1 unit of time. It can be seen from the figure that the change trends of two gene expression values are contrary, wherein the first peak of the expression value for the gene Swi5 was obtained at the seventh time point and the first trough of the expression value for the gene Swi5 was obtained at the twelfth time point; the first peak of the expression value for the gene Cln2 was obtained at the seventh time point, and the first trough of the expression value for the gene Cln2 was obtained at the eleventh time point; the time points of peaks and troughs for the two genes are basically the same. Therefore, based on the premise and assumption, the result that the time delay between Swi5 and Cln2 and between Cln2 and Clb1 is 1 time interval in the regulation is reasonable.

7. Discussion

In order to have objective and scientific comparison results, hypothesis testing is used on the experimental results. Let the variables denote the classification error rate of algorithms proposed in this paper, reference [17], reference [20], and reference [27], respectively. Since the value of is subject to many random factors, we assume that they submit to normal distribution, , . Now, we compare the random variable mean of these algorithms, (). The smaller is, the lower the expected classification error rate is and the higher the efficiency is. Because the sample variance is the unbiased estimation of the overall variance, the sample variance value is used as an estimate of the generality variance. In this experiment, the significance level is set as 0.01.

Table 3 shows the comparison process on and other parameters. We can see from Table 1 that the expectations of classification error rate in this paper are far below than other algorithms.

Next, we use some evaluation indexes to evaluate the algorithm. TP, TN, FP, and FN are abbreviations of true positive, false positive, true negative, and false negative, respectively. Perform the following operations on all target genes and regulatory genes. If the regulatory relationship between the target gene and regulatory gene is inferred by this algorithm and the previous literature has proved the regulatory relationship, then the value of TP is increased by 1. If the regulatory relationship between the target gene and regulatory gene is inferred by this algorithm, but the previous literature has not proved the regulatory relationship, then FP is increased by 1. If the algorithm in this paper does not infer the regulatory relationship between the target gene and the regulatory gene and no previous literature has proved that there is a regulatory relationship between the target gene and the regulatory gene, then the value of TN is increased by 1. And if the algorithm in this paper does not infer that there is a regulatory relationship between the target gene and the regulatory gene, but the previous literature has proved that the regulatory relationship exists, then add 1 to the value of FN. Each algorithm evaluation standard is evaluated by some combination of TP, FP, TN, and FN. The most common algorithms for predicting gene regulatory networks are sensitivity (Sn), specificity (Sp), and accuracy (Acc). Sn = TP/(TP + FN), Sp = TN/(TN + FP), and Acc = (TP + TN)/(TP + FP + TN + FN). The comparison results are shown in Table 4.

Table 4 compares the inference results of the four methods for gene regulatory network. The sensitivity of reference [17] method is only 37.5%, that of reference [20] method is 36.4%, that of reference [27] method is 24.2%, and that of the proposed method is 43.8%. It can be seen that in the network construction of this gene, the method proposed in this paper is better for identifying the right edge; it also shows that the addition of transcription factor linkage site data reduces the information loss in data processing. The data in accuracy are also optimal, which shows that the accuracy of network construction in this paper has been improved.

Therefore, each gene has a complex regulatory relationship in different cell cycles. The direction of regulation can be determined by using the method of multiple time delay, which is in line with the mechanism of biological time sequence activity. The introduction of transcription factor linked site data can reduce the network complexity and construct the regulatory network more effectively.

To sum up, the multi-time-delay gene regulation network method based on the fuzzy label propagation is feasible.

8. Conclusions

In consideration of the time delay existing in the interaction of genes, this paper constructs the multi-time-delay gene regulation network, uses the relative change trend coefficient, the correlation coefficient, and the mutual information as the evaluation indexes of the gene pair to construct the similarity matrix of the gene pairs, and then analyses the regulation relationship and the time delay among the genes by using the fuzzy label propagation algorithm. Due to the high complexity of algorithm in this paper, the method proposed in this paper is unsuitable for the construction of large network, and the error recognition ratio will be increased when the maximum time delay is set to high value. However, the method proposed in this paper is feasible. Therefore, how to effectively modularize the large network, divide the large network into many small networks, and integrate the small networks into the large network in the analysis will be an improvement direction of the method in this paper.

Data Availability

The yeast cell gene chip data used to support the findings of this study have been deposited by Spellman et al. in Stanford University.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of the paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant nos. 61973304 and 61876185) and the Fundamental Research Funds for Central Universities (no. 2015QNB21).