Abstract
Ubiquitination is an important post-translational modification of proteins. It takes part in the regulation of many cellular and biological processes and plays key roles in enzymatic processes. To build a new tool for classifying ubiquitin sites, we trained a random forest model on the experimentally identified ubiquitinated protein sequences of A. thaliana. Specifically, we used the composition of k-spaced amino acid pairs (CKSAAP) encoding and binary encoding to represent candidate protein segments. The proposed tool achieves 72.83% in specificity (Sp), 72.42% in sensitivity (Sn), 72.63% in accuracy (Acc), and 0.4525 in Matthews correlation coefficient (MCC). These results indicate that the tool performs acceptably on the Arabidopsis dataset.
1. Introduction
Ubiquitination is a typical post-translational modification of proteins [1–7]. It takes part in several significant processes at the cellular level and is an important element of enzymatic processes, which are central to the activities of life. Ubiquitin is attached to its target proteins through a dedicated regulatory process [8–15], either as a single ubiquitin or as a ubiquitin chain. Most ubiquitin modifications occur on lysine residues of proteins and peptides. The process consists of three key steps: activation, conjugation, and ligation. First, ubiquitin is activated by ubiquitin-activating enzymes (E1s). Next, it is transferred to ubiquitin-conjugating enzymes (E2s), which complete the conjugation step. Finally, ubiquitin ligases (E3s) attach it to the target protein. It is well known that ubiquitination is one of the post-translational modifications found in any cell and is closely related to many biological processes and to different kinds of diseases in plants and animals. For example, it has been associated with conditions such as Alzheimer’s disease, Parkinson’s disease, and anaphylaxis, and it regulates various types of signaling and endocytosis. These functions are as important as other regulatory functions in biological processes [16–22].
Ubiquitination plays an important role in the construction and processing of many proteins, and a great number of studies have investigated its molecular properties and regulatory functions. To identify ubiquitinated residues at the protein level, a variety of experimental methods have been proposed, such as ubiquitin antibodies (anti-ubi), ubiquitin-binding proteins (binding-ubi), high-throughput mass spectrometry (MS), and liquid chromatography. Because ubiquitination is dynamic, rapid, and reversible, these experimental methods are expensive, laborious, and time consuming. Therefore, species-specific computational methods, which are considered efficient, convenient, and economical, have been proposed to identify such modification sites. Various computational methods, each with its own characteristics, have been developed [23–29].
Ubiquitin sites are poorly conserved across species. Therefore, a predictor built for one species is not well suited to predicting ubiquitin sites in other organisms. In previous studies, existing predictors were applied to the Arabidopsis dataset, and their classification performance fell short of what effective identification requires. It is therefore necessary to develop species-specific predictors of ubiquitin sites in order to improve performance. Nevertheless, only one predictor has been developed for the model plant Arabidopsis. Although this predictor performs well, there is still room for improvement. In this work, we develop a new computational tool to identify ubiquitin sites in the protein sequences of Arabidopsis [30–32].
To develop this tool, we used the random forest model as the classifier for candidate ubiquitin sites in Arabidopsis protein sequences. The composition of k-spaced amino acid pairs (CKSAAP) was selected as the main feature. The proposed RF_CKSAAP tool performs better than the other machine learning algorithms compared in this work. The next section describes the development of the proposed method in detail.
2. Materials and Methods
In brief, our method uses RF to predict ubiquitin sites. It was developed by comparing two sequence encoding schemes (CKSAAP and binary encoding) and several typical classification methods, all at the optimal window size of 27. After evaluating the two encodings, the RF classifier with CKSAAP encoding was chosen to build the ubiquitin site identification tool. The final model was then obtained by optimizing the parameters and evaluating the performance [33–35].
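The window extraction implied by the window size of 27 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extract_windows` is hypothetical, and padding the sequence ends with the blank symbol O is an assumption based on the 21-symbol alphabet described in Section 2.2.

```python
def extract_windows(sequence, window_size=27, pad="O"):
    """Return (1-based position, window) for every lysine (K) in `sequence`.

    Assumption: sequence ends are padded with the blank symbol `pad` so that
    every window has exactly `window_size` characters.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that K sits in the centre")
    flank = window_size // 2
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + flank of `padded`,
            # so the slice [i, i + window_size) is centred on it.
            windows.append((i + 1, padded[i:i + window_size]))
    return windows


if __name__ == "__main__":
    # Toy sequence (hypothetical), small window for readability.
    for position, window in extract_windows("MKAAK", window_size=5):
        print(position, window)
```

Each window is a candidate site: it is labelled positive if the central lysine is an experimentally confirmed ubiquitin site and negative otherwise.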
2.1. Dataset Arrangement
In this work, a training dataset of ubiquitinated protein sequences was used to train the model, and its performance was checked on an independent test dataset. The ubiquitinated protein sequences were collected from previously published papers [36–42]. Experimentally confirmed ubiquitin sites (lysine residues) in plant cells were extracted from [42]. Ubiquitin site annotations for the model plant Arabidopsis were extracted from UniProtKB/Swiss-Prot and the NCBI protein sequence database (https://www.ncbi.nlm.nih.gov/protein/). In the selected Arabidopsis proteins, modified sites are far fewer than unmodified sites. To deal with this imbalance, we constructed datasets with several ratios of positive to negative samples: nonubiquitin sites were randomly selected from all negative samples, and training datasets were built with positive-to-negative ratios of 1 : 1, 1 : 2, and 1 : 3.
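The random selection of negative samples at a fixed ratio can be sketched as below. This is an illustrative sketch only: `build_training_set`, its parameters, and the fixed random seed are assumptions, and the paper does not state how the random selection was implemented.

```python
import random


def build_training_set(positives, negatives, neg_per_pos=1, seed=0):
    """Randomly under-sample negatives to a 1 : neg_per_pos ratio.

    Returns a shuffled list of (window, label) pairs, with label 1 for
    ubiquitin sites and 0 for nonubiquitin sites.
    """
    n_neg = len(positives) * neg_per_pos
    if n_neg > len(negatives):
        raise ValueError("not enough negative samples for the requested ratio")
    rng = random.Random(seed)               # fixed seed for reproducibility
    sampled = rng.sample(negatives, n_neg)  # sampling without replacement
    data = [(w, 1) for w in positives] + [(w, 0) for w in sampled]
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    # Placeholder identifiers stand in for real sequence windows.
    pos = [f"P{i}" for i in range(10)]
    neg = [f"N{i}" for i in range(100)]
    for ratio in (1, 2, 3):
        train = build_training_set(pos, neg, neg_per_pos=ratio)
        print(ratio, len(train))
```

Sampling without replacement keeps every selected negative window unique within a training set.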
To test the performance of the model, both cross-validation and an independent test dataset are used. From the downloaded protein sequences, 250 sequences that are not included in the training dataset are randomly selected to construct the independent test dataset. The stability of the model is assessed from its predictive performance on all positive and negative samples of the independent test dataset. However, whether a trained model is assessed by jackknife tests or by cross-validation tests, the set being evaluated contains positive and negative samples in a 1 : 1 ratio.
2.2. CKSAAP Encoding
The composition of k-spaced amino acid pairs (CKSAAP) is a widely used encoding for representing protein sequences [43–51]. Its key parameter is k, the number of residues that separate the two residues of a pair, so a k-spaced pair spans k + 2 positions. For instance, k = 0 means there is no gap between the two residues, and with k = 2 the pattern “AxxA” denotes two A residues separated by any two residues x. For a window of size 2r + 1, the number of k-spaced pairs is N_total = 2r − k (for example, 2r − 1 when k = 1). The alphabet has 21 symbols: the 20 amino acid residues and a blank, written O, which stands for no amino acid. There are therefore 21 × 21 = 441 possible pairs. For each value of k, the feature vector is given by the following equation:

\[ \left( \frac{N_{AA}}{N_{total}}, \frac{N_{AC}}{N_{total}}, \ldots, \frac{N_{OO}}{N_{total}} \right)_{441} \]

where N_xy is the number of times the k-spaced pair xy occurs in the window.
In this work, the optimal window length is 27. The final feature vector has 2646 dimensions, that is, 21 × 21 × (kmax + 1) with kmax = 5. In other words, each sample consists of 2646 features and one label, so a dataset of n samples becomes an n × 2647 matrix, where 2647 counts the features plus the label.
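The encoding can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the name `cksaap_encode` and the alphabetical ordering of the 21 symbols are assumptions, and kmax = 5 is inferred from the stated feature count of 2646.

```python
from itertools import product

# 20 amino acid residues plus the blank symbol 'O' (ordering is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]   # 441 pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap_encode(window, k_max=5):
    """Return the CKSAAP feature vector of one sequence window.

    For each k in 0..k_max, count every pair of residues separated by k
    positions and divide by the number of such pairs in the window.
    Symbols outside ALPHABET raise KeyError.
    """
    if len(window) < k_max + 2:
        raise ValueError("window too short for the requested k_max")
    features = []
    for k in range(k_max + 1):
        counts = [0] * len(PAIRS)
        n_total = len(window) - k - 1        # number of k-spaced pairs
        for i in range(n_total):
            pair = window[i] + window[i + k + 1]
            counts[PAIR_INDEX[pair]] += 1
        features.extend(c / n_total for c in counts)
    return features


if __name__ == "__main__":
    window = "A" * 13 + "K" + "A" * 13       # toy window of length 27
    print(len(cksaap_encode(window)))        # 21 * 21 * 6 features
```

Because each block of 441 values is a set of relative frequencies, every block sums to 1, which is a convenient sanity check on an encoded dataset.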
2.3. Binary Encoding
To capture position-specific information in the protein sequence, binary (one-hot) encoding was used. The alphabet again has 21 symbols: the 20 amino acid residues and one blank. For instance, the amino acid A is represented as 100000000000000000000. Each of the other symbols is coded in the same way. Excluding the central lysine (K) residue, the final feature vector has 21 × 26 = 546 dimensions.
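A minimal sketch of this encoding follows. The function name `binary_encode` is hypothetical, and the ordering of symbols after A is an assumption; the text only fixes A as the first symbol.

```python
# 20 amino acid residues plus the blank symbol 'O'; A comes first as in the
# text, the rest of the ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"


def binary_encode(window):
    """One-hot encode every position of `window` except the central residue."""
    center = len(window) // 2
    features = []
    for pos, residue in enumerate(window):
        if pos == center:
            continue                         # skip the central lysine (K)
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1  # ValueError on unknown symbols
        features.extend(one_hot)
    return features


if __name__ == "__main__":
    window = "A" * 13 + "K" + "A" * 13       # toy window of length 27
    vector = binary_encode(window)
    print(len(vector))                       # 21 * 26 features
```

Unlike CKSAAP, this representation keeps the exact position of each residue relative to the central lysine.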
2.4. Random Forest
The random forest classifier has been widely used in many areas. Compared with other algorithms, it is robust in the presence of noise and outliers. The random forest is an ensemble of decision trees, each trained on a randomly selected subset of samples, as follows [52–55]. First, n samples are drawn at random with replacement from the training set, and k candidate features are considered. Next, the best split node is selected among these input features, which is the key step in growing a decision tree. Finally, the decision tree is grown without pruning. When all individual trees have voted, the class with the most votes is the prediction of the forest. In our study, RF assigns each site to one of two categories, positive (ubiquitin sites) or negative (nonubiquitin sites), according to the votes of the trees. If the score is greater than or equal to 0.50, the lysine (K) position is declared a ubiquitin site. If the score is less than 0.50, the lysine (K) position is declared a nonubiquitin site. The closer the score is to 1, the more confidently the lysine is predicted to be a ubiquitin site.
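The sampling and voting steps can be illustrated with the short sketch below. It assumes that the score is the fraction of trees voting positive, which the text implies but does not state explicitly, and the helper names are hypothetical. Tree construction itself is omitted.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, as done for each tree."""
    return rng.choices(samples, k=len(samples))


def forest_score(tree_votes):
    """Fraction of trees voting positive (1 = ubiquitin site, 0 = not)."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return sum(tree_votes) / len(tree_votes)


def classify_site(tree_votes, threshold=0.50):
    """Apply the 0.50 decision rule to the votes of all trees."""
    score = forest_score(tree_votes)
    label = "ubiquitin site" if score >= threshold else "nonubiquitin site"
    return score, label


if __name__ == "__main__":
    rng = random.Random(0)
    print(len(bootstrap_sample(list(range(8)), rng)))
    print(classify_site([1, 1, 0, 1]))   # three of four trees vote positive
    print(classify_site([1, 0, 0, 0]))   # one of four trees votes positive
```

A tie (score exactly 0.50) falls on the positive side, matching the "greater than or equal to" rule above.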
2.5. Model Training
As mentioned earlier, this study used three classification algorithms, namely, RF, SVM, and naïve Bayes, to construct predictors of ubiquitin sites. All classifiers are trained on the same training dataset, regardless of how well they distinguish ubiquitin sites from nonubiquitin sites. The support vector machine uses a radial basis function kernel with optimized parameter values and serves as a comparison for RF. The models are trained with different ratios of positive and negative samples and assessed with different cross-validation schemes. Among the three classification methods, RF is found to be the best at separating ubiquitin sites from nonubiquitin sites. We used 5-fold cross-validation with a 1 : 1 ratio of positive and negative samples.
2.6. Model Performance Evaluation and Cross-Validation
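To assess the proposed ubiquitin site tool, four standard measures were used, namely, sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC) [56–60], which are defined as follows:

\[ \mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}, \]

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \]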
TP, FP, FN, and TN represent true positive, false positive, false negative, and true negative, respectively. To provide a comprehensive performance measurement, we also consider receiver operating characteristic (ROC) curves, which plot the true positive rate (i.e., sensitivity) against the false positive rate (i.e., 1 − specificity). In addition, the area under the ROC curve (AUC) is used to quantify the overall performance of the proposed method. The closer the AUC value is to 1, the better the performance is. Jackknife, 2-fold, and 5-fold cross-validation tests are also used to examine the performance of the prediction model. In a 5-fold cross-validation test, the training dataset is divided into five subsets of approximately equal size. Four subsets are used to train the classifier, and the remaining subset is used as the test set to evaluate it, so the classifier is run five times. The prediction performance on the independent test dataset provides a further check of the model. The performance indicators Sn, Sp, and MCC are calculated at the threshold fpr = 0.20 (FP rate), while the AUC (the total area under the ROC curve) is a threshold-independent measure.
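The four measures and the 5-fold split can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, the folds are contiguous (shuffle the data beforehand if the samples are ordered), and the confusion counts in the example are invented so that Sn and Sp match the random forest values reported for Table 1 on a balanced set.

```python
import math


def evaluate(tp, fp, fn, tn):
    """Compute Sn, Sp, Acc, and MCC from the four confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    # Hypothetical counts: 10000 positives and 10000 negatives.
    print(evaluate(tp=5384, fp=2746, fn=4616, tn=7254))
    for train, test in k_fold_indices(23, k=5):
        print(len(train), len(test))
```

With these hypothetical counts, Sn = 0.5384 and Sp = 0.7254 give Acc = 0.6319 and an MCC of approximately 0.2685, which agrees with the reported random forest values and is consistent with evaluation on a 1 : 1 set.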
3. Results and Discussion
3.1. Performance Assessment of the Models
In practice, positive and negative samples do not occur in a 1 : 1 ratio, and this imbalance may affect classification accuracy. To address this, we selected three ratios of positive to negative samples: 1 : 1, 1 : 2, and 1 : 3. The classification performance at each ratio is shown in Figure 1 (1 : 1), Figure 2 (1 : 2), and Figure 3 (1 : 3). The RF model is compared with the other machine learning classifiers at each of these ratios.



Table 1 shows that the neural network achieves 58.11% in Sp, 59.67% in Sn, 58.89% in Acc, and 0.1778 in MCC. The naïve Bayesian model achieves 37.08% in Sp, 75.89% in Sn, 56.49% in Acc, and 0.1407 in MCC. The support vector machine achieves 65.29% in Sp, 59.16% in Sn, 62.23% in Acc, and 0.2450 in MCC. Finally, the random forest algorithm achieves 72.54% in Sp, 53.84% in Sn, 63.19% in Acc, and 0.2685 in MCC.
Table 2 shows that the neural network and the naïve Bayesian model fall short of the support vector machine and the random forest. The random forest also obtains acceptable results when the positive and negative samples are balanced.
Table 3 shows that the neural network achieves 36.68% in Sp, 75.48% in Sn, 56.08% in Acc, and 0.1319 in MCC. The naïve Bayesian model achieves 56.70% in Sp, 56.37% in Sn, 56.54% in Acc, and 0.1307 in MCC. The support vector machine achieves 77.48% in Sp, 59.70% in Sn, 68.59% in Acc, and 0.3778 in MCC. Finally, the random forest algorithm achieves 83.43% in Sp, 56.83% in Sn, 70.13% in Acc, and 0.4176 in MCC.
These results show that, as the proportion of negative samples increases, the random forest and the support vector machine improve: for the random forest, Acc rises from 63.19% in Table 1 to 70.13% in Table 3, and MCC rises from 0.2685 to 0.4176. In other words, the proposed features and the random forest classifier suit the real situation, in which positive samples are far fewer than negative ones.
To show that CKSAAP encoding is an effective feature for protein sequence processing, we compared it with several state-of-the-art features.
Table 4 shows that DNABIND achieves 61.10% in Sp, 61.96% in Sn, 61.53% in Acc, and 0.2305 in MCC. DNAbinder achieves 56.66% in Sp, 57.73% in Sn, 54.70% in Acc, and 0.0941 in MCC. DBD-Threader achieves 22.06% in Sp, 93.72% in Sn, 57.89% in Acc, and 0.2262 in MCC. DNA-Prot achieves 62.56% in Sp, 47.61% in Sn, 55.09% in Acc, and 0.1029 in MCC. iDNA-Prot achieves 62.07% in Sp, 59.71% in Sn, 60.89% in Acc, and 0.2179 in MCC. PLMLA achieves 59.20% in Sp, 59.67% in Sn, 59.44% in Acc, and 0.1887 in MCC.
Table 5 shows that DNABIND achieves 63.98% in Sp, 65.13% in Sn, 64.55% in Acc, and 0.2911 in MCC. DNAbinder achieves 52.96% in Sp, 60.62% in Sn, 56.79% in Acc, and 0.1361 in MCC. DBD-Threader achieves 19.41% in Sp, 91.60% in Sn, 55.50% in Acc, and 0.1590 in MCC. DNA-Prot achieves 63.71% in Sp, 52.07% in Sn, 57.89% in Acc, and 0.1589 in MCC. iDNA-Prot achieves 62.91% in Sp, 62.97% in Sn, 62.94% in Acc, and 0.2588 in MCC. PLMLA achieves 57.17% in Sp, 61.74% in Sn, 59.46% in Acc, and 0.1894 in MCC.
Table 6 shows that DNABIND achieves 65.49% in Sp, 67.14% in Sn, 66.32% in Acc, and 0.3263 in MCC. DNAbinder achieves 56.82% in Sp, 63.07% in Sn, 59.95% in Acc, and 0.1993 in MCC. DBD-Threader achieves 22.69% in Sp, 93.26% in Sn, 57.98% in Acc, and 0.2252 in MCC. DNA-Prot achieves 66.95% in Sp, 52.21% in Sn, 59.58% in Acc, and 0.1936 in MCC. iDNA-Prot achieves 64.98% in Sp, 65.37% in Sn, 65.17% in Acc, and 0.3035 in MCC. PLMLA achieves 59.80% in Sp, 63.86% in Sn, 61.83% in Acc, and 0.2368 in MCC.
4. Conclusion
Ubiquitination is an important post-translational modification of proteins. It takes part in the regulation of many cellular and biological processes and plays key roles in enzymatic processes. To build a new tool for classifying ubiquitin sites, we trained a random forest model on the experimentally identified ubiquitinated protein sequences of A. thaliana. Specifically, we used the composition of k-spaced amino acid pairs (CKSAAP) encoding and binary encoding to represent candidate protein segments. The proposed tool achieves 72.83% in Sp, 72.42% in Sn, 72.63% in Acc, and 0.4525 in MCC. These results indicate that the tool performs acceptably on the Arabidopsis dataset.
Data Availability
All the data used to support the findings of this study are included within the manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.