Abstract
Self-interacting proteins (SIPs) play an influential role in regulating cell structure and function, so it is critically important to identify whether a protein interacts with itself. Although experimental methods for self-interaction recognition exist, they are both expensive and time-consuming. Therefore, it is necessary to develop an efficient and stable computational method for predicting SIPs. In this study, we develop an effective computational method for predicting SIPs based on a rotation forest (RF) classifier combined with the histogram of oriented gradients (HOG) and the synthetic minority oversampling technique (SMOTE). When performing SIPs prediction on the yeast and human datasets, the proposed method achieves accuracies of 97.28% and 89.41%, respectively. In addition, the proposed approach was compared with the state-of-the-art support vector machine (SVM) classifier and with other methods on the same datasets. The experimental results demonstrate that our method has good robustness and effectiveness and can be regarded as a useful tool for SIPs prediction.
1. Introduction
Protein is the material basis of life and an important component of all cells and tissues. The study of protein-protein interactions (PPIs) has therefore become a fundamental problem: these interactions help elucidate the mechanisms of biological reactions and play a crucial role in living organisms. In recent years, PPIs have become a research hot spot because they are key to a further understanding of protein function [1]. Self-interacting proteins (SIPs) are a special case of PPIs, referring to interactions between two or more copies of the same protein; copies of a protein expressed from the same gene are likely to form homo-oligomers. A large number of studies have shown that homo-oligomerization promotes a wide range of biological processes, such as enzyme activation, regulation of gene expression, signal transduction, and the immune response. SIPs also play an important role in the study of protein interaction networks (PINs): they can interact with other protein partners and thus contribute to the study of biological cell systems [2]. Moreover, the genes encoding SIPs show markedly higher duplicability than other genes and are often widely represented at the whole-genome level [3]. Although there is a great deal of theoretical knowledge about SIPs in biological proteomics, few machine learning methods have been applied to this problem [4, 5]. Because traditional experiments for detecting SIPs are costly and time-consuming, it is necessary to develop a reliable and efficient computational method for detecting SIPs.
So far, a large number of computational methods based on biological sequences have been proposed for the prediction of PPIs. For instance, Zhang et al. [6] proposed an ensemble of deep neural networks trained with different configurations based on each descriptor. Jha et al. [7] extracted features from the 3D voxel representations of proteins and from the amino acid sequences, concatenated these features in pairs, and used a bidirectional GRU-based classifier on the concatenated features to predict PPIs. Jia et al. [8] used a discrete wavelet transform (DWT) to extract protein information, obtained a variety of physicochemical descriptors, and used a random forest model for classification; their experimental results demonstrate that this method can effectively predict PPIs. Sumonja et al. [9] developed GA-STACK, a stacking algorithm for the automatic ensembling of different machine learning algorithms, which achieves good prediction performance. Tian et al. [10] proposed an efficient prediction method based on multi-information fusion, in which pseudo amino acid composition (PseAAC), autocovariance (AC), and encoding based on grouped weight (EBGW) are used to extract features of protein sequences; after two-dimensional wavelet denoising, the features are input to a support vector machine (SVM) for classification. Xu et al. [11] proposed a sequence-based prediction method that uses graph energy to extract effective information between proteins and then feeds this information to a weighted sparse representation-based classification (WSRC) classifier to predict PPIs. Wang et al. [12] proposed a method based on a novel stochastic block model, which captures the latent structural features of proteins, from the perspective of protein complex formation, by simulating the generative process of the protein interaction network. In summary, these methods mainly study interactions between protein pairs using information such as co-localization, co-expression, and co-evolution, and they cannot be used to predict SIPs. The most important reason is that the datasets used by these methods do not contain information on interactions between identical partners. Given this limitation, these methods are not suitable for predicting SIPs. Therefore, it is very important to develop an efficient and practical computational method for the study of large-scale SIPs.
In this study, a method for predicting SIPs from the PSSM of a protein sequence is proposed, which combines the HOG feature extraction method with SMOTE and an RF classifier. Two SIPs datasets, yeast and human, were used to evaluate the predictive performance of the proposed method, which achieved high prediction accuracies of 97.28% and 89.41% on the yeast and human datasets, respectively. These results indicate that the proposed method can be effectively used to identify self-interacting proteins. In addition, our method was compared with the popular support vector machine (SVM) classifier and with other methods on the same yeast and human datasets. All the experimental results show that the proposed method has good robustness and effectiveness and can be used to predict SIPs.
2. Materials and Methodology
2.1. Datasets
First, 20,199 reliably curated human protein sequences were obtained from the published UniProt database [13]. Second, interaction data for these proteins were collected from several other databases, including BioGRID [14], DIP [15], InnateDB [16], IntAct [17], and MatrixDB [18]. In this experiment, PPIs from these datasets that contained only two interacting protein partners were extracted, and the PPIs obtained in this way were considered "direct interactions." In total, 2,994 human self-interacting protein sequences were collected. In order to further test the accuracy of the proposed method, these human self-interacting proteins were processed using the following measures. (1) Protein sequences in the entire human proteome with lengths between 50 and 5,000 residues were retained. (2) In order to construct an accurate and reliable positive SIPs dataset, the selected data had to meet at least one of the following requirements: (a) the self-interaction was detected by at least one small-scale experiment or two large-scale experiments; (b) the protein is annotated in the UniProt database as a homo-oligomer (including homodimer and homotrimer); (c) the self-interaction was reported by at least two publications. (3) To construct the negative dataset, all proteins annotated as SIPs were removed from the entire human proteome. After this processing, a total of 17,379 proteins were used to construct the human SIPs datasets, of which 1,441 human SIPs formed the positive dataset and 15,938 human non-SIPs formed the negative dataset. In addition, in order to further verify the validity of our method on other data, the same processing measures were used to construct a yeast dataset with 710 positive SIPs and 5,511 negative non-SIPs. Table 1 summarizes the datasets used in this experiment; the positive samples are experimentally identified samples.
2.2. Position Specific Scoring Matrix (PSSM)
The PSSM can detect distantly related proteins by sequence comparison. The PSSM is built from a set of protein sequences previously aligned by structural or sequence similarity [19]. In addition, the widely used homology-search tool Position-Specific Iterated BLAST (PSI-BLAST) [20] was applied in this experiment to convert each protein sequence into a PSSM. Each PSSM can be represented as an $L \times 20$ matrix $M = \{m_{ij}\}$, where the number of rows $L$ equals the length of the protein sequence and the 20 columns correspond to the 20 standard amino acids. The position-specific score $m_{ij}$ specifies the score of the $j$th amino acid at the $i$th position and is expressed as

$$m_{ij} = \sum_{k=1}^{20} p(i, k) \times q(j, k),$$

where $p(i, k)$ represents the frequency with which the $k$th amino acid appears at position $i$ of the protein, and $q(j, k)$ is the entry of Dayhoff's mutation matrix between the $j$th and $k$th amino acids [21].
In order to obtain the evolutionary information of highly homologous proteins, the PSI-BLAST tool was used to search each protein sequence and convert it into a PSSM for further use in the prediction of SIPs. The e-value parameter of PSI-BLAST was set to 0.001 and the number of iterations was set to 3. Each PSSM is thus an $L \times 20$ matrix, where $L$ is the total number of residues in the protein and 20 is the number of amino acid types. The PSI-BLAST tool is available at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
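As an illustration only, a PSSM with these settings could be generated and loaded roughly as follows; the query file name, database name, and output path are hypothetical, and the ASCII parsing is a simplified sketch rather than the authors' pipeline.

```python
# Illustrative sketch: generate an ASCII PSSM with PSI-BLAST (BLAST+) using the
# settings reported above (e-value 0.001, 3 iterations), then load it as an
# L x 20 NumPy array. File names and the database name are assumptions.
import subprocess
import numpy as np

subprocess.run([
    "psiblast",
    "-query", "protein.fasta",          # one protein sequence in FASTA format
    "-db", "swissprot",                 # assumed local BLAST database
    "-num_iterations", "3",
    "-evalue", "0.001",
    "-out_ascii_pssm", "protein.pssm",  # ASCII PSSM written here
], check=True)

def load_pssm(path):
    rows = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            # Data lines start with the residue index and amino acid, followed
            # by 20 log-odds scores (the trailing percentage columns are ignored).
            if len(parts) >= 22 and parts[0].isdigit():
                rows.append([int(v) for v in parts[2:22]])
    return np.array(rows)               # shape: (sequence length L, 20)

pssm = load_pssm("protein.pssm")
```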
2.3. Histogram of Oriented Gradients (HOG)
The HOG feature descriptor was first applied to human detection by Dalal and Triggs [22]. The HOG descriptor characterizes the local appearance and shape of an image according to the distribution of gradient directions. It is created by dividing the original image into small cell regions, and each cell builds a histogram of the gradient directions of the pixels inside it [23]. The important features are obtained by computing the HOG of local image regions, and the resulting descriptor is used to recognize the target object to be detected. So far, HOG descriptors have been widely used for various computer vision problems, usually in combination with machine learning methods, for example in face recognition, facial expression recognition, vehicle detection, and pedestrian detection in video. Because of its good performance on complex problems, HOG has attracted the attention of many researchers as a feature extraction method and is widely used in the literature. HOG extracts information about target edges in local regions of the original image, which makes it an important means of describing target shapes [24]. The HOG feature extraction procedure is described as follows.
Using the HOG algorithm, gradient computation on the original image is a crucial first step. The gradient direction of each pixel can be determined from the gradient values along the abscissa and the ordinate. The main purpose is to obtain the contour information of the image and to reduce the influence of illumination on the experimental results [25]. The gradient calculation formulas are as follows:

$$G_x(x, y) = I(x + 1, y) - I(x - 1, y),$$
$$G_y(x, y) = I(x, y + 1) - I(x, y - 1),$$

where $G_x(x, y)$ represents the horizontal gradient at position $(x, y)$ in the original image, $G_y(x, y)$ indicates the vertical gradient at position $(x, y)$, and $I(x, y)$ refers to the pixel value at position $(x, y)$. In practice, the horizontal gradient component is obtained by convolving the original image with the kernel $[-1, 0, 1]$, and the vertical gradient component is obtained by convolving the image with the transposed kernel $[-1, 0, 1]^{T}$.
The next step is to calculate the gradient magnitude and orientation from the horizontal gradient $G_x(x, y)$ and the vertical gradient $G_y(x, y)$:

$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2},$$
$$\theta(x, y) = \arctan\!\left(\frac{G_y(x, y)}{G_x(x, y)}\right),$$

where $G(x, y)$ represents the gradient magnitude at point $(x, y)$ and $\theta(x, y)$ denotes the gradient orientation at point $(x, y)$. In this experiment, each PSSM of a protein sequence was divided into cells with an overlap ratio of 1/2. Each cell covers gradient orientations of 0–180°, divided into 9 bins of 20° each, so that the accumulated pixel magnitude in each orientation bin can be obtained; a minimal sketch of these steps is given below.
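The following NumPy sketch is our own illustration (not the authors' code) of the gradient, magnitude, orientation, and 9-bin voting steps for a single cell; the function name and cell handling are simplified assumptions.

```python
# Minimal sketch of the per-cell HOG computation: centered-difference gradients,
# magnitude, unsigned orientation in [0, 180), and voting into nine 20-degree bins.
import numpy as np

def cell_histogram(cell):
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]   # horizontal gradient, kernel [-1, 0, 1]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]   # vertical gradient, kernel [-1, 0, 1]^T

    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation

    # vote each pixel's magnitude into one of nine 20-degree bins
    bins = np.minimum((orientation // 20).astype(int), 8)
    hist = np.zeros(9)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m
    return hist
```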
Since illumination and contrast have a large influence on the gradient magnitude, the gradient magnitudes are normalized. The HOG feature of each cell is obtained by normalizing all cell histograms within a block. The block normalization formula is as follows:

$$v' = \frac{v}{\sqrt{\lVert v \rVert_2^{2} + \varepsilon^{2}}},$$

where $v$ is the unnormalized histogram vector, $\lVert v \rVert_2$ is the L2-norm of $v$, and $\varepsilon$ is a small constant that prevents division by zero.
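As a small illustration, the normalization step can be implemented as follows; the helper name and the value of eps are assumptions, since the paper does not specify the constant.

```python
# Sketch of the L2 block normalization; eps is the small constant in the formula
# above (its exact value is not specified by the authors).
import numpy as np

def normalize_block(v, eps=1e-6):
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
```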
Therefore, we use the HOG algorithm to extract HOG features from the PSSM of each protein sequence. In order to reduce the influence of noise on the experimental results, we further apply the singular value decomposition (SVD) method to the extracted HOG features to retain the most significant components. Finally, each protein sequence of the yeast and human datasets is converted into a 180-dimensional vector for the SIPs prediction experiments.
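One way to realize this dimensionality reduction is sketched below; scikit-learn's TruncatedSVD stands in for the unspecified SVD routine, and the variable X_hog is a hypothetical matrix stacking the raw HOG features of all proteins.

```python
# Hedged sketch: reduce the stacked HOG feature matrix to 180 dimensions with a
# truncated SVD. The authors' exact SVD procedure is not described, so this is
# only an illustration (assumes the raw HOG dimension exceeds 180).
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=180, random_state=0)
X_180 = svd.fit_transform(X_hog)   # X_hog: (n_proteins, raw HOG dimension)
```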
2.4. Synthetic Minority Oversampling Technique (SMOTE)
SMOTE is a technique for oversampling minority classes to balance datasets; it was proposed by Chawla et al. [26]. The main idea is to form new minority-class samples by interpolating between a minority-class sample and one of its minority-class neighbors. Because the technique generates synthetic samples rather than simply copying minority-class samples, it helps avoid overfitting. The SMOTE algorithm is described as follows. First, a point $x$ is selected at random from the minority class and its nearest neighbors among the other minority samples are found. Then, one minority-class sample $\tilde{x}$ is chosen from these neighbors. Finally, interpolation is used to generate a new synthetic sample on the line segment between $x$ and $\tilde{x}$:

$$x_{\text{new}} = x + \operatorname{rand}(0, 1) \times (\tilde{x} - x),$$

where $\operatorname{rand}(0, 1)$ represents a random number between 0 and 1. This process is repeated according to the chosen oversampling rate, and the new minority-class samples generated by SMOTE are added to the original training samples to increase the number of minority-class instances. Geometrically, the method can be seen as interpolation between two minority-class samples. It expands the decision region of the minority class, which allows the classifier to generalize better to unseen minority-class samples. SMOTE is efficient and simple to apply and helps avoid overfitting. In our datasets, the positive samples are few. To demonstrate the effectiveness of SMOTE in Section 3, we applied it after the five-fold split, so that only the positive samples of the training folds were expanded. The resulting training set is then fed to the classifier, and the prediction results are obtained. The process of SMOTE is shown in Figure 1.
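The sketch below is our own illustration of applying SMOTE only to the training folds of a five-fold split, using imbalanced-learn and scikit-learn; X_180 and y are the hypothetical feature matrix and label vector from the earlier sketches.

```python
# Sketch: oversample only the minority (positive) class of each training fold.
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X_180, y):
    X_train, y_train = X_180[train_idx], y[train_idx]
    X_test, y_test = X_180[test_idx], y[test_idx]
    # SMOTE is fit only on the training fold; the test fold stays untouched
    X_train_smote, y_train_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)
    # ...train the classifier on (X_train_smote, y_train_smote) and evaluate on the test fold
```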

2.5. Rotation Forest (RF)
Rotation forest, an ensemble classifier based on the bagging algorithm, was originally introduced by Rodriguez et al. [27]. The idea of the rotation forest classifier comes from random forest, but the main difference is that it builds accurate and diverse base classifiers. RF has been widely used in machine learning for practical classification problems, such as cancer classification [28], sequence-based protein-protein interaction prediction [29], and diagnosis of neuromuscular disorders [30]. The rotation forest framework is described as follows. Suppose the training sample set $X = \{x_1, x_2, \ldots, x_n\}$ contains $n$ observation feature vectors, where each $x_i$ is a $D$-dimensional feature vector, $F$ is the feature set, and $Y = \{y_1, y_2, \ldots, y_n\}$ are the corresponding class labels. The decision trees in the RF algorithm are denoted as $D_1, D_2, \ldots, D_L$, respectively. The processing steps for a single classifier $D_i$ can be described as follows. An appropriate parameter $K$ is selected and the feature set $F$ is randomly divided into $K$ disjoint subsets $F_{i,j}$, each containing $M = D/K$ features. Next, the columns of the training sample set $X$ corresponding to the features in subset $F_{i,j}$ are selected to form a new matrix $X_{i,j}$. Using the bootstrap method, three-quarters of the samples are drawn from $X_{i,j}$ to form a new nonempty training set $X'_{i,j}$. After that, principal component analysis is applied to $X'_{i,j}$ to obtain the coefficients $a_{i,j}^{(1)}, \ldots, a_{i,j}^{(M_j)}$, where the $k$th column of coefficients is used as the $k$th principal component. Finally, a sparse rotation matrix $R_i$ is constructed, whose coefficients are arranged as follows:

$$R_i = \begin{bmatrix} a_{i,1}^{(1)}, \ldots, a_{i,1}^{(M_1)} & 0 & \cdots & 0 \\ 0 & a_{i,2}^{(1)}, \ldots, a_{i,2}^{(M_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{i,K}^{(1)}, \ldots, a_{i,K}^{(M_K)} \end{bmatrix}.$$

The columns of $R_i$ are then rearranged to match the original order of the features, giving the rotation matrix $R_i^{a}$, and the base classifier $D_i$ is trained on the transformed training set $(X R_i^{a}, Y)$.
In the classification problem, for a given test sample $x$, each classifier $D_i$ produces $d_{i,j}(x R_i^{a})$ to estimate the probability that $x$ belongs to class $w_j$. Then, the average combination method is used to calculate the confidence of each class:

$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x R_i^{a}), \quad j = 1, \ldots, c.$$
Finally, $x$ is assigned to the class with the greatest confidence. The pseudocode of the rotation forest algorithm (Algorithm 1) can be summarized as follows.
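Since Algorithm 1 itself is not reproduced here, the following Python sketch summarizes the training and prediction procedure; the class name, parameters, and the use of scikit-learn's PCA and DecisionTreeClassifier are our own assumptions, not the authors' implementation.

```python
# Minimal sketch of rotation forest (Rodriguez et al.): per tree, split the
# features into K disjoint subsets, run PCA on a 75% bootstrap of each subset,
# assemble the sparse rotation matrix, and train a decision tree on X @ R.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


class RotationForest:
    def __init__(self, n_trees=13, n_subsets=5, sample_frac=0.75, random_state=0):
        self.n_trees = n_trees          # L: number of decision trees
        self.n_subsets = n_subsets      # K: number of disjoint feature subsets
        self.sample_frac = sample_frac  # bootstrap fraction (3/4 in the paper)
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        n, d = X.shape
        self.trees_, self.rotations_ = [], []
        for _ in range(self.n_trees):
            order = self.rng.permutation(d)
            subsets = np.array_split(order, self.n_subsets)
            R = np.zeros((d, d))
            for cols in subsets:
                idx = self.rng.choice(n, int(self.sample_frac * n), replace=True)
                # assumes the bootstrap has at least as many samples as features,
                # so PCA returns a full square coefficient matrix for the subset
                pca = PCA().fit(X[np.ix_(idx, cols)])
                R[np.ix_(cols, cols)] = pca.components_.T
            tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
            self.trees_.append(tree)
            self.rotations_.append(R)
        self.classes_ = self.trees_[0].classes_
        return self

    def predict_proba(self, X):
        # average combination of the base classifiers' posterior estimates
        probs = [t.predict_proba(X @ R) for t, R in zip(self.trees_, self.rotations_)]
        return np.mean(probs, axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```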
3. Results and Discussion
3.1. Performance Evaluation
In order to further test the accuracy and reliability of the proposed method, the following evaluation indicators were calculated: accuracy (Ac), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC). They are defined as follows:

$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{Sn} = \frac{TP}{TP + FN},$$
$$\mathrm{Sp} = \frac{TN}{TN + FP},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where true positive ($TP$) represents the number of true SIPs correctly predicted by the model, false positive ($FP$) refers to the number of non-self-interacting proteins predicted to be self-interacting, true negative ($TN$) represents the number of true non-self-interacting proteins correctly predicted, and false negative ($FN$) represents the number of self-interacting proteins predicted to be non-self-interacting. In addition, to better demonstrate the predictive results of our model, receiver operating characteristic (ROC) [31] curves were also used to evaluate the performance of the proposed method.
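The short sketch below (an assumed helper, not from the paper) shows how these indicators and the ROC curve could be computed with scikit-learn, assuming binary labels coded as 0/1.

```python
# Compute Ac, Sn, Sp, MCC, and the ROC curve/AUC from predictions and scores.
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_curve, auc

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    ac = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    sn = tp / (tp + fn)                        # sensitivity
    sp = tn / (tn + fp)                        # specificity
    mcc = matthews_corrcoef(y_true, y_pred)    # Matthews correlation coefficient
    fpr, tpr, _ = roc_curve(y_true, y_score)   # points for the ROC curve
    return ac, sn, sp, mcc, auc(fpr, tpr)
```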
3.2. Performance of the Proposed Method
In this experiment, the proposed method was applied to the yeast and human datasets for the detection of SIPs. In order to avoid overfitting and to better test the validity and stability of our model, the datasets were divided into training sets and independent test sets, and five-fold cross-validation was used to evaluate the performance of the proposed method. The workflow of the five-fold cross-validation is shown in Figure 2.

In addition, in order to ensure the fairness of the experimental results, the two main parameters of the RF classifier were optimized, and the same parameter settings were used for the SIPs experiments on both the yeast and human datasets: the number of feature subsets $K$ was set to 5 and the number of decision trees $L$ was set to 13. The predicted results of the original method and the proposed method for the SIPs experiments on the yeast dataset are shown in Tables 2 and 3.
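As an illustration only, these settings map onto the hypothetical RotationForest sketch from Section 2.5 (with the SMOTE-balanced training fold from the Section 2.4 sketch) as follows.

```python
# Hypothetical instantiation with the reported settings (K = 5 subsets, L = 13 trees).
clf = RotationForest(n_trees=13, n_subsets=5)
clf.fit(X_train_smote, y_train_smote)
y_score = clf.predict_proba(X_test)[:, 1]   # scores used for the ROC curves
```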
From Table 2, we can clearly see that the RF model achieved excellent prediction results on the yeast dataset, with each of the five experiments reaching an accuracy above 97%: 97.99%, 97.19%, 97.83%, 97.35%, and 97.59%, respectively. The average accuracy, specificity, sensitivity, and MCC were 97.59%, 99.47%, 82.87%, and 87.78%, with standard deviations of 0.33%, 0.28%, 2.42%, and 1.92%, respectively. From Table 3, when using the SMOTE and RF models to predict SIPs on the yeast dataset, the average accuracy, specificity, sensitivity, and MCC were 97.28%, 98.02%, 91.50%, and 87.34%, respectively. Although the SMOTE and RF method likewise achieved about 97% accuracy, the average sensitivity increased by 8.63%. To better visualize the performance of the model, ROC curves on the yeast dataset are shown in Figures 3 and 4, where the x-axes represent the false positive rate (FPR) and the y-axes indicate the true positive rate (TPR).


Similarly, the same model was applied to the human dataset to perform SIPs prediction. The predicted results of the original method and the proposed method for the SIPs experiments on the human dataset are shown in Tables 4 and 5. It is worth mentioning that the proposed method also achieved an average accuracy of more than 89% on the human dataset. As shown in Table 4, when exploring SIPs on the human dataset with the RF model alone, an average accuracy, specificity, sensitivity, and MCC of 91.68%, 99.92%, 0.56%, and 5.08% were obtained, with standard deviations of 0.59%, 0.04%, 0.21%, and 1.14%, respectively. As shown in Table 5, when predicting SIPs on the human dataset with the SMOTE and RF models, the average accuracy, specificity, sensitivity, and MCC were 89.41%, 96.44%, 11.69%, and 19.76%, respectively. From this comparison, we can see that although the average accuracy decreased by 2.27 percentage points, the average sensitivity and average MCC increased by 11.13 and 14.68 percentage points, respectively. This improvement comes from combining the SMOTE method with the rotation forest model.
It can be seen that a strong feature extraction method and an appropriate choice of classifier can effectively improve the prediction accuracy of SIPs experiments. Our method achieves such good prediction results mainly for the following reasons. (1) The PSSM not only captures effective information from protein sequences but also retains important prior evolutionary information, which plays a crucial role in predicting SIPs. (2) The histogram of oriented gradients (HOG) extracts effective local texture features and remains largely invariant to geometric and photometric deformations, so it can be used to acquire important protein information from the PSSM. (3) The rotation forest classifier provides a reliable and robust way to classify the sample features. (4) The SMOTE method effectively addresses the low sensitivity caused by unbalanced samples. Finally, these experimental results fully demonstrate that the proposed method can predict SIPs efficiently and accurately. Figures 5 and 6 show the ROC curves obtained by the proposed method on the human dataset.


3.3. Comparison with SVM-Based Method
A large number of machine learning models have been proposed for detecting interactions between proteins, and one of the most popular classifiers in the field is the support vector machine (SVM). In order to better test the performance of the proposed method, we used the state-of-the-art SVM classifier to perform SIPs experiments on the yeast and human datasets, applying the same protein data and feature extraction methods. The LIBSVM toolbox [32] was used to carry out the SVM classification task; it can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Here, the radial basis function (RBF) is used as the kernel of the SVM, and a grid search is used to determine the values of the main parameters $C$ and $\gamma$, where $C$ is set to 0.1 and $\gamma$ is set to 0.9. The predicted results of the SIPs experiments on the yeast and human datasets using the SMOTE and SVM method are shown in Tables 6 and 7.
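The sketch below is our own illustration (not the authors' LIBSVM script) of this baseline, using scikit-learn's SVC, which wraps LIBSVM, with the reported RBF kernel and parameter values; the variable names follow the hypothetical earlier sketches.

```python
# Baseline SVM with an RBF kernel and the reported parameters C = 0.1, gamma = 0.9.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=0.1, gamma=0.9, probability=True)
svm.fit(X_train_smote, y_train_smote)      # SMOTE-balanced HOG features
y_score = svm.predict_proba(X_test)[:, 1]  # scores for the ROC curve
```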
From Table 6, we can see that the accuracies of the five experiments on the yeast dataset using the SVM classifier were 93.09%, 92.60%, 94.53%, 94.29%, and 94.22%, respectively. The corresponding averages of accuracy, specificity, sensitivity, and MCC are 93.75%, 94.14%, 90.72%, and 75.48%, which are lower than those achieved by the proposed method (97.28%, 98.02%, 91.50%, and 87.34%, respectively). Similarly, from Table 7, the average accuracy, specificity, sensitivity, and MCC obtained by the SVM classifier on the human dataset are 85.49%, 91.51%, 18.89%, and 22.88%, respectively, whereas the RF-based method achieved an average accuracy of 89.41% on the human dataset. Figures 7 and 8 show the ROC curves obtained on the yeast and human datasets with the SVM-based method. Comparing these experimental results, we conclude that the RF classifier outperforms the SVM classifier in predicting SIPs. Therefore, the proposed prediction model is well suited to the prediction of SIPs and may benefit other bioinformatics tasks.


3.4. Comparison with Other Methods
So far, an increasing number of computational methods have been developed for predicting PPIs. In order to test the reliability and effectiveness of the proposed model, it was compared with the following methods: SLIPPER, CRS, SPAR [33], DXECPPI [34], PPIevo [35], and LocFuse [36], all evaluated on the same yeast and human datasets. The prediction results obtained on the yeast and human datasets by these seven methods (the six above plus the proposed method) are shown in Tables 8 and 9, respectively. It can be seen from Table 8 that the average prediction accuracies of the six other methods on the yeast dataset range from 66.28% to 87.46%, all lower than the prediction accuracy of the proposed method; their specificity, sensitivity, and MCC values are likewise lower. Table 9 shows the average accuracy, specificity, sensitivity, and MCC values obtained by the seven methods on the human dataset, where the proposed method achieves a prediction accuracy of 89.41%, again a better prediction performance. These comparisons fully demonstrate that the proposed method can effectively predict SIPs.
4. Conclusions
In this study, an effective computational method was developed for predicting SIPs from protein sequence information. The proposed method is based on an RF classifier combined with the position specific scoring matrix (PSSM), HOG, and SMOTE methods. In the first step, all protein sequences are transformed into PSSMs, which contain evolutionary information about the protein sequences. Second, the HOG feature extraction method is used to obtain valuable protein features from the PSSM, and these features are further processed by the SVD method to reduce the interference of noise in the experimental results. To address the class imbalance of self-interacting proteins, the SMOTE method is applied to the obtained HOG features. Finally, a reliable RF classifier is used to predict whether a given protein sequence can interact with itself. In addition, in order to better evaluate the performance of the proposed model, our method was compared with the popular support vector machine (SVM) classifier and with other methods on the same yeast and human datasets. All the experimental results show that the proposed method has good robustness and effectiveness and can be used to predict SIPs, which will be beneficial to future bioinformatics research.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Disclosure
Zheng Wang and Yang Li are the co-first authors.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors’ Contributions
Zheng Wang and Yang Li contributed equally to this work.
Acknowledgments
This research was funded by the National Natural Science Foundation of China under Grant nos. 62072378 and 61873212.