Abstract
The detection and characterization of somatic mutations have become an important means of analyzing the occurrence and development of cancer and, ultimately, will help in selecting effective and precise treatments for specific cancer patients. However, it is very difficult to detect somatic mutations accurately from massive sequencing data. In this paper, a forest-graph-embedded deep feed-forward network (forgeNet) is utilized to detect somatic mutations from sequencing data. In forgeNet, a random forest (RF) or Gradient Boosting Machine (GBM) is used to extract features, and a graph-embedded deep feed-forward network (GEDFN) is used to perform classification. Three real somatic mutation datasets collected from 48 triple-negative breast cancers are utilized to test the somatic mutation detection performance of forgeNet. The detection results show that forgeNet achieves improvements of 0.05%–0.424% in area under the curve (AUC) compared with support vector machines and random forest.
1. Introduction
With the rapid development of new sequencing technologies, large amounts of omics sequencing data have been generated, which are processed and analyzed in order to solve biological problems [1–5]. Species without a reference sequence can be sequenced at the genomic level to obtain a reference sequence, which lays the foundation for follow-up research and molecular breeding [6–8]. For species with a reference sequence, whole-genome sequencing can detect mutation sites related to important traits of an organism, including single-nucleotide polymorphisms (SNPs) and insertion-deletions (InDels), which are the molecular basis of individual differences and play an important role in research and industry [9–12].
Somatic mutations, including SNPs and InDels, occur in normal body cells and are not passed on to offspring. They differ from germline mutations, which occur in the cells that become gametes (sperm and egg) [13] and can be passed on to offspring [14, 15]. Although somatic mutations do not cause genetic changes in offspring, they can change the genetic structure of some cells. Many researchers have studied the causes of cancer [16–20]. Carcinogenic factors can cause abnormal structures or functions of cellular genetic material. Most of these abnormalities are not inherited from germ cells but are caused by new gene mutations in somatic cells. The mutated precancerous cells develop into tumors under the action of tumor-promoting factors [21–23]. Therefore, most tumors can be regarded as a kind of somatic genetic disease [24], and the study of cancer-related somatic variation plays an important role in the treatment and prevention of cancer.
Nowadays, many machine-learning methods have been utilized to solve biomedical problems [25–29]. However, it is very difficult to detect somatic mutations accurately from massive sequencing data, and in recent years many researchers have worked on this problem. Ding et al. investigated the performance of four classical classification methods for detecting somatic single-nucleotide variants (SNVs) [30]. Shiraishi et al. proposed a novel somatic mutation detection algorithm for whole-exome sequencing data, namely, Bayesian mutation calling. They also presented an empirical Bayesian method to distinguish somatic mutations from sequencing errors [31]. Koboldt et al. proposed a variant calling tool, namely, VarScan 2, to discriminate germline mutations from somatic mutations in next-generation sequencing (NGS) data [32]. Sahraeian proposed a new somatic mutation identification method based on the convolutional neural network, which could outperform the previous methods [33]. Yang and Chen proposed an ensemble method based on the flexible neural tree (FNT) model and the Radial Basis Function (RBF) to improve the accuracy of somatic mutation identification [34]. Dorri et al. proposed the MuClone method to detect somatic mutations across multiple tumor samples, which could classify mutations into biologically meaningful groups [35].
Recently, Kong and Yu presented a novel classifier based on a feature graph and a deep neural network, namely, forgeNet. forgeNet was applied to RNA-seq data from public databases, and the results showed that the method was valuable for classification and feature selection on biological data [36]. Thus, in this paper, forgeNet is utilized to detect somatic mutations from sequencing data. In forgeNet, the random forest and the graph-embedded deep feed-forward network are utilized. Three real somatic mutation datasets collected from 48 triple-negative breast cancers are utilized to test the somatic mutation detection performance of forgeNet.
The rest of the paper is organized as follows. The second section introduces the forgeNet algorithm in detail, together with the detailed identification process of somatic mutations. The third section presents three experiments on the forgeNet method. The last section gives conclusions and possible future research.
2. Methods
2.1. forgeNet
The forest-graph-embedded deep feed-forward network (forgeNet), proposed by Kong and Yu in 2020, is a novel classification method based on a feature extraction algorithm and a deep neural network (DNN). Because this method has been successfully applied to biological data, forgeNet is utilized here to detect somatic mutations. The forgeNet method contains the following two steps [37].
2.1.1. Feature Extractor Part
In this part, a random forest or Gradient Boosting Machine (GBM) is utilized to select the proper features according to the training dataset. Suppose that a forest has $M$ decision trees. From the training dataset, the fitted forest is obtained as $\mathcal{F}(\Theta) = \{T_m(\Theta_m)\}_{m=1}^{M}$, where $\Theta = \{\Theta_1, \ldots, \Theta_M\}$ denotes the parameters of the trees. Since a binary tree can be viewed as a special case of a graph, a set of graphs is obtained as follows:
$$\mathcal{G} = \{G_m(V_m, E_m)\}_{m=1}^{M},$$
where $V_m$ and $E_m$ are the sets of vertices and edges in $G_m$.
The final feature graph $G$ is obtained by merging all graphs in the graph set $\mathcal{G}$, and it is passed to the second step of forgeNet.
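The merging step can be illustrated with a minimal sketch. It assumes a toy tree representation (a leaf is `None`, an internal node is a tuple of split feature, left subtree, and right subtree) and hypothetical feature names; neither comes from the forgeNet implementation. Each tree contributes a directed edge from a parent's split feature to each child's split feature, and the feature graph is the union of the vertex and edge sets over all trees.

```python
from itertools import chain

# Hypothetical toy representation (an assumption for illustration):
# a tree is None (leaf) or a tuple (split_feature, left_subtree, right_subtree).
def tree_edges(tree):
    """Return the directed edges (parent_feature, child_feature) of one tree."""
    if tree is None:
        return set()
    feature, left, right = tree
    edges = set()
    for child in (left, right):
        if child is not None:
            edges.add((feature, child[0]))
            edges |= tree_edges(child)
    return edges

def merge_forest(forest):
    """Union the vertex and edge sets of all trees into one feature graph."""
    edges = set(chain.from_iterable(tree_edges(t) for t in forest))
    # Include root features so that single-split trees still add a vertex.
    vertices = {t[0] for t in forest if t is not None}
    vertices |= {v for edge in edges for v in edge}
    return vertices, edges

# Two small trees over hypothetical sequencing features.
forest = [
    ("depth", ("base_quality", None, None), ("map_quality", None, None)),
    ("base_quality", ("strand_bias", None, None), None),
]
vertices, edges = merge_forest(forest)
print(sorted(vertices))
print(sorted(edges))
```

Features that never appear as a split in any tree are absent from the vertex set, which is how the forest acts as a feature selector in this sketch.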
2.1.2. Neural Network Part
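In this part, a graph-embedded deep feed-forward network (GEDFN) is utilized to tackle the classification problem [37]. The structure of the GEDFN is given as follows:
$$Z_1 = \sigma\left(X (W_{\mathrm{in}} \odot A) + b_{\mathrm{in}}\right), \quad Z_{l+1} = \sigma\left(Z_l W_l + b_l\right), \; l = 1, \ldots, L-1, \quad \hat{y} = \mathrm{softmax}\left(Z_L W_{\mathrm{out}} + b_{\mathrm{out}}\right),$$
where $X$ is the data matrix with the proper features selected in the first step of forgeNet, $A$ is the adjacency matrix of the feature graph, $\odot$ denotes the Hadamard product, $\sigma$ is the activation function, and $W$ and $b$ (with the corresponding subscripts) are the weights and biases of the layers.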
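The role of the Hadamard product can be shown with a small sketch of a graph-embedded first layer for a single sample. It assumes that the mask is the adjacency matrix of the feature graph with self-connections on the diagonal and that the activation is ReLU; the function name, weights, and input values are illustrative only and are not taken from the forgeNet code.

```python
def relu(value):
    return max(0.0, value)

def graph_embedded_layer(x, weights, adjacency, bias):
    """Compute relu(x (W ⊙ A) + b) for one sample x (assumed layer form)."""
    n = len(x)
    # Hadamard product: each weight survives only where the graph has an edge.
    masked = [[weights[i][j] * adjacency[i][j] for j in range(n)] for i in range(n)]
    return [
        relu(sum(x[i] * masked[i][j] for i in range(n)) + bias[j])
        for j in range(n)
    ]

x = [1.0, 2.0, 4.0]
# Features 0 and 1 are connected; feature 2 is isolated (self-connection only).
adjacency = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 1]]
weights = [[0.5, 0.5, 0.5] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

hidden = graph_embedded_layer(x, weights, adjacency, bias)
print(hidden)
```

Because the mask zeroes every weight between unconnected features, the first two hidden units depend only on features 0 and 1, and the third depends only on feature 2.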
2.2. Somatic Mutation Identification
In order to test the detection performance of forgeNet and identify somatic mutations, a cross-validation method is utilized, which helps to guard against overfitting [38, 39]. With the $k$-fold cross-validation method, the detection process of somatic mutations with forgeNet is as follows (Figure 1):
(1) The feature data of somatic mutations are divided into $k$ groups, with approximately equal numbers of samples in each group. $k$ is generally greater than or equal to 2.
(2) Each subset is used as the testing set once, and the remaining $k-1$ subsets are used as the training set. With the divided training and testing sets, the forgeNet method is fitted. Through $k$ runs, $k$ models are obtained. The area under the curve (AUC) on the testing sets of these models is used as the performance index of the classifier.
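The two steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it splits sample indices in order, without shuffling or stratification, and `fit_and_score` is a hypothetical callback standing in for fitting forgeNet on the training indices and returning the AUC on the testing indices.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k folds whose sizes differ by at most one."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n_samples, k, fit_and_score):
    """Use each fold once for testing and the other k-1 folds for training."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return scores

# Placeholder scorer: reports the test-fold size instead of a real AUC.
sizes = cross_validate(10, 3, lambda train, test: len(test))
print(sizes)
```

With imbalanced data such as somatic mutation calls, a stratified split that preserves the class ratio in each fold would usually be preferable to this ordered split.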

3. Experiments
In order to investigate the somatic mutation identification performances of forgeNet, three real somatic mutation datasets are utilized, which were collected from 48 triple negative breast cancers by capturing tumour/normal pairs sequenced with the Illumina genome analyzer [30]. The positive and negative samples of datasets are described in Table 1.
The receiver-operating characteristic (ROC) curve is utilized to measure the performance of the somatic mutation classification model on each dataset, and the area under the curve (AUC) is utilized to quantify the ROC curve. The closer the ROC curve lies to the upper-left corner, the better the classification performance. AUC ranges from 0 to 1; a value of 0.5 corresponds to random guessing, so a useful classifier has an AUC between 0.5 and 1. In order to test the detection performance of forgeNet, sensitivity, specificity, and an overall performance measure are also utilized, which are defined in equation (4). Support vector machines (SVM) [40, 41] and random forest (RF) [42, 43] are also applied to the three real datasets so that their performance can be compared with that of forgeNet.
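As a minimal sketch of how such measures can be computed, the example below assumes the standard definitions of sensitivity (the fraction of true somatic mutations that are detected) and specificity (the fraction of true nonsomatic mutations that are rejected), and computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, not data from the paper.

```python
def confusion_rates(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    # Assumes both classes are present; otherwise a division by zero occurs.
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]                 # 1 = somatic, 0 = nonsomatic
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]   # illustrative classifier scores
predictions = [1 if s >= 0.5 else 0 for s in scores]

sensitivity, specificity = confusion_rates(labels, predictions)
area = auc(labels, scores)
print(sensitivity, specificity, area)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes the ranking quality over all thresholds, which is why the two kinds of measure can favor different classifiers.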
The somatic mutation detection results of SVM, RF, and forgeNet on the three datasets are listed in Table 2. For dataset 1, forgeNet has the highest sensitivity, which reveals that forgeNet could identify more true somatic mutations. RF has higher specificity than forgeNet and SVM, which shows that RF could identify more true nonsomatic mutations. Overall, RF performs better than forgeNet and SVM, but the results of RF and forgeNet are extremely close. For dataset 2, forgeNet and SVM have the same sensitivity, which is 0.933. In terms of specificity, RF has the best performance, which is 0.997. The result reveals that forgeNet and RF could detect the same numbers of true somatic and nonsomatic mutations. In terms of overall performance, however, forgeNet performs best. For dataset 3, forgeNet has the better sensitivity, while RF has the better specificity. Overall, forgeNet performs better than RF.
The ROC curves and AUC values of the three methods (forgeNet, SVM, and RF) under 10-fold cross validation on dataset 1, dataset 2, and dataset 3 are depicted in Figures 2, 3, and 4, respectively. In Figure 2, the ROC curves of RF and forgeNet are very close, and both are better than that of SVM. RF obtains the best AUC value, which is 0.99499. forgeNet has the second-best AUC value, which is 0.16% lower than that of RF and 0.32% higher than that of SVM. Figure 3 shows that forgeNet has a better ROC curve than RF and SVM on dataset 2. forgeNet obtains the highest AUC value, which is the closest to 1: it is 0.424% higher than that of SVM and 0.05% higher than that of RF. In Figure 4, for dataset 3, forgeNet and RF have close ROC curves, which are better than that of SVM. In terms of AUC, forgeNet is 0.24% higher than SVM and 0.105% higher than RF. From the identification results on the three datasets, we can see that forgeNet obtains better performance than SVM and RF when the ratio of somatic mutations is low.



In order to investigate the performance of forgeNet further, forgeNet, SVM, and RF are utilized to identify somatic mutations on dataset 2 and dataset 3 with 3-fold, 5-fold, and 8-fold cross validation. For 3-fold cross validation, the ROC curves and AUC values of the three methods are depicted in Figures 5 and 6 for dataset 2 and dataset 3, respectively. Figure 5 shows that, in terms of AUC, forgeNet is 0.278% higher than SVM and 0.425% higher than RF. Figure 6 reveals that, in terms of AUC, forgeNet is 0.328% higher than SVM and 0.028% higher than RF.


For 5-fold cross validation, the ROC curves and AUC values of the three methods are depicted in Figures 7 and 8 for dataset 2 and dataset 3, respectively. Figure 7 shows that, in terms of AUC, forgeNet is 0.167% higher than SVM and 0.388% higher than RF. Figure 8 shows that, in terms of AUC, forgeNet is 0.27% higher than SVM and 0.05% higher than RF. For 8-fold cross validation, the ROC curves and AUC values of the three methods are depicted in Figures 9 and 10 for dataset 2 and dataset 3, respectively. Figure 9 shows that, in terms of AUC, forgeNet is 0.34% higher than SVM and 0.11% lower than RF. Figure 10 shows that, in terms of AUC, forgeNet is 0.064% higher than SVM and slightly higher than RF. Across the 3-fold, 5-fold, and 8-fold results, forgeNet has higher AUC values than SVM in every case and than RF in all cases but one (dataset 2 under 8-fold cross validation), which suggests that forgeNet could identify somatic mutations more accurately.




4. Conclusions
In this paper, a novel classifier, namely, forgeNet, is utilized to improve the accuracy of somatic mutation identification. forgeNet contains two parts, the feature extractor part and the neural network part, which are utilized to extract features and perform classification, respectively. Three real somatic mutation datasets are utilized to test the somatic mutation detection performance of forgeNet, using 3-fold, 5-fold, 8-fold, and 10-fold cross validation. forgeNet could identify more true somatic mutations, while random forest could identify more true nonsomatic mutations. The classification results reveal that forgeNet achieves an AUC improvement of 0.05%–0.424% compared with support vector machines and random forest.
In the future, we will analyze the biological significance of somatic mutations in the process of classification. Also, the somatic mutations of different cancers will be classified and analyzed.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
F. W. conceived the method. H. Q. designed the method and wrote the main manuscript text. C. W. conducted the experiments. All authors reviewed the manuscript.
Acknowledgments
The authors acknowledge the funding received from the Key Research Program of the Science Foundation of Shandong Province (ZR2020KE001).