Abstract
Variations in genes can change the overall patterns of DNA, and such changes lead to cancer driver genes. Oncology needs better and more accurate clinical methods to identify cancer driver genes at an earlier stage. Many techniques are currently applied to this problem, but only a few succeed because of the huge datasets and the complex structure of gene mutations, so these strategies and frameworks need to be optimized. This study proposes a machine learning framework that uses three different machine learning models. The proposed framework is named Identification of Lower Grade Glioma Cancer Progression (ILGGCP). A benchmark dataset is acquired, and statistical feature extraction is used to formulate the dataset. The proposed framework aims to detect cancerous mutations in driver genes as early as possible, whereas clinical diagnosis is only possible after a symptom is detected. Random Forest demonstrated the best results: accuracy of 99%, sensitivity of 99%, specificity of 99%, Matthews Correlation Coefficient (MCC) of 0.98, and Area Under the Curve (AUC) of 1.00.
1. Introduction
A gene is a segment of the double-helix DNA molecule, made up of a linear arrangement of nucleotide pairs. DNA holds the information for all processes in the body. Gene mutation is a process in which a gene changes its structure. These mutations can give information on the progression of cancer [1]. The number of known causes of cancer cell growth keeps increasing as researchers gain deeper knowledge. Many biomarkers can be used to identify cancer. By finding patterns of gene mutation in TCGA, we can identify cancerous mutations in gene sequences even when there is no physical symptom on the body and before imaging resources can detect cancer [2]. By focusing on various types of gene mutation, we can identify different types of cancer. When a mutation occurs in the human body, it accelerates the growth of particular tumor cells, which increases the number of active cells in the body. The mutation disturbs cell growth and thereby the normal process of cell birth and death [3]. In this scenario, the body continues to generate cells while the death process stops, so the number of active cells exceeds the body's limit. This process is called cancer. Timely identification of cancer via machine learning is not new for researchers; many publications and research articles address it using different techniques. In the last few decades, machine learning (ML) methods have been widely used to develop timely identification models for effective decision making [4]. ML provides techniques to determine whether a subject is cancerous or benign [5]. Three crucial classes of genes, i.e., oncogenes, tumor suppressor genes, and DNA repair genes, hold information about the mutations that occur. Genes that have the potential to cause cancer are known as cancer driver genes.
Passenger mutations do not cause the onset of carcinoma. Not all gene mutations confer an evolutionary advantage on cancers. Very few mutations are cancerous, meaning that once the gene is mutated it provides a suitable environment for cancer growth. For example, isocitrate dehydrogenase 1 (IDH1) has the highest number of mutations that can help to identify Lower Grade Glioma Cancer; it encodes the information for making an enzyme called isocitrate dehydrogenase. DNA repair is directed by IDH1 within the cell. Certain mutations within IDH1 can compromise its ability to supervise this maintenance, thus increasing the risk of developing cancer. The tumor suppressor function of the IDH1 gene qualifies it as a cancer driver once mutated. Genes are the units of inheritance holding protein-encoding data for a particular function. Computational intelligence algorithms are widely used for the identification of obscure patterns within genomic and proteomic data. This study attempts to identify such patterns in cancer driver genes and to distinguish them from passenger genes. There are a couple of major oncogenic sequencing projects, for example, “The Cancer Genome Atlas” (TCGA) [6]. To date, various computational methods have been proposed. Based on their underlying principles, existing methods can be grouped into several categories. The most common methods are based on mutation frequency [7]: they identify significantly mutated genes (SMGs), whose mutation rates are significantly higher than the background mutation rate, and judge them to be driver genes. Current tools are divided into three main categories by their underlying principles: frequency-based, subnetwork-based, and hotspot-based methods. The identification of cancer driver genes plays a significant role in precision oncology and personalized cancer treatment.
Researchers working on personalized cancer therapy aim to identify the genes involved in the onset of the disease and then take measures to silence those genes or their cancer-causing functional characteristics. Hence, a tool that can accurately and efficiently identify [8] cancer driver genes is of significant interest. The ability to recognize such driver genes can help unravel the mechanism of the disease and hence play a vital role in advancing research on novel drugs and therapies for cancer. A useful and systematic sequence-based method for a biological system can be developed by following these basic steps [9]: (1) construction of a benchmark dataset for training and testing the prediction model, (2) formulation of the sequence samples with an effective mathematical expression that reflects their intrinsic relationship with the target labels, (3) development of an effective computational algorithm for prediction, (4) validation of results that objectively evaluates the expected accuracy, and (5) provision of a publicly usable framework based on the resulting robust model. The structure of DNA is depicted in Figure 1 [10].

2. Literature Review
Lower Grade Glioma Cancer is a slow-growing type of cancer that occurs in the glial cells of the brain. According to a survey report, about 24,530 people in the United States of America suffered from Lower Grade Glioma Cancer in 2021 [11]. Several medical and computational studies have been proposed for the identification of Lower Grade Glioma Cancer. In this section, some of the latest computational techniques are discussed.
Most studies use MRI images of the brain in their identification and classification algorithms. One study proposed a noninvasive method for the identification of tumors from MRI images. Feature extraction techniques are applied to the MRI images to obtain the main features, followed by the fuzzy c-means segmentation technique and the machine learning classifiers SVM, LVQ, and NB. The dataset includes 200 images. Figure 2 explains the working of the noninvasive method [11].

The efficiency of SVM, LVQ, and NB for this study was 88.88%, 91%, and 91%, respectively [11].
In another study, artificial neural network (ANN) techniques are used for the identification of astrocytoma cancer. A gray-level co-occurrence matrix is used for feature extraction (Energy/Homogeneity), and an ANN is applied to the extracted features. For this classification system, the ANN provides an accuracy of 99% in classifying images into normal brain images and tumor brain images [12]. Another study uses k-means segmentation, Discrete Wavelet Transform, and Principal Component Analysis to extract brain features from MRI images and applies the SVM algorithm to classify low-grade glioma and high-grade glioma. The classification accuracy of this system was 98.8% [13]. The Computer-Aided Diagnosis (CAD) system is also used in research for the grading of glioma [14]. A CAD classifier and a VGG-19 DNN classifier are applied for the identification of glioma tumor grade. The accuracy of the CAD classifier and VGG-19 DNN was 92.86% and 98.25%, respectively.
In another study, machine learning-based radiomics is applied to data extracted from 426 subjects to determine the molecular subtype and histology of gliomas. An accuracy of 89.2% is obtained for the histology diagnosis [15].
The model proposed in the current study achieved the highest accuracy, 99%, obtained through RF. The model is also evaluated through different validation techniques, namely the self-consistency test, independent set test, and 10-fold cross-validation test, as explained in the results section.
3. Materials and Methods
This section gives a detailed description of the computational processes involved in the proposed model and of how they are applied. The proposed model is also illustrated in Figure 3.

3.1. Benchmark Dataset Collection and Its Preprocessing
It is difficult to find a generalized dataset of mutations in Lower Grade Glioma Cancer (LGGC); a general mutated dataset for this cancer is not available. Therefore, the normal sequences are taken from https://asia.ensembl.org through web scraping. Mutation information is available at https://intOGen.Org and is also extracted using web scraping code. The mutated dataset is then created by incorporating the provided mutation information into the normal sequences. The total number of driver genes is 38 and the total number of mutations is 1331. A benchmark dataset typically consists of experimentally verified, unambiguous known samples. These samples are used for training as well as for testing. The aim is to develop a high-quality benchmark dataset [16] that is diverse, accurate, and relevant. Further, the outcome of the experimental work is validated through a range of validation tests such as independent set and subsampling (K-fold cross-validation) tests [17, 18].
The obtained dataset was not balanced; therefore for balancing the dataset, Synthetic Minority Oversampling Technique (SMOTE) is used.
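To make the balancing step concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python. It is not the implementation used in this study: the function name `smote`, the neighbour count `k`, and the toy minority points are assumptions chosen for illustration, and a maintained library implementation would normally be preferred.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy minority-class feature vectors (illustrative values, not study data)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, so the minority class grows without exact duplication of existing samples.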
Coherent and meaningful data remain important because the obtained result combines several distinct benchmark dataset samples. A meaningful dataset with well-defined annotation of cancer driver gene sequences is assembled. The dataset is needed as a benchmark of true cancer driver gene sequences.
A word cloud is generated with a matplotlib function in Python; it draws words according to their frequency, which defines the size of the plotted text. The word cloud of ILGGCP is shown in Figure 4.

https://IntOgen.org records 38 genes, 1331 mutations as shown in Table 1, and 1292 sample instances of mutations in a wide variety of human genes. The majority of the cases are passenger mutations that do not cause cancer, while 1161 recorded mutations are cancer-causing. A total of 38 cancer driver genes are associated with Lower Grade Glioma Cancer-causing mutations. The information gathered in this way is used to design a benchmark dataset for the described problem. The benchmark dataset for Lower Grade Glioma Cancer in the current study is denoted as I, which is defined as
$$I = I^{+} \cup I^{-}$$
After careful preprocessing and redundancy reduction, a database was formulated; the final benchmark dataset contained mutated human gene sequences ($I^{+}$) and carefully selected negative sample genes ($I^{-}$) obtained from a larger collection of passenger genes.
3.2. Sample Formulation
A DNA sequence can be expressed as
$$D = N_{1} N_{2} N_{3} \cdots N_{L}$$
where
$$N_{i} \in \{A, C, G, T\}$$
indicates the nucleotide at any arbitrary position $i$, $L$ is the length of the sequence, and $\in$ is the set-theory symbol meaning “member of” [19]. The formulation of biological sequences is one of the most fundamental problems in computational biology. Vector formulation is key to representing a sequence while preserving the sequence patterns and features needed for the targeted analysis, since a vector representation makes it possible to process the formulated sequence with machine learning algorithms [20, 21]. In this work, multiple genes involved in LGGC are selected. According to the chosen composition, samples in the dataset can be described as in [22]. Equation (4) states that each sample is a subsequence of fixed size, while equation (5) states that 20 residues upstream and 20 residues downstream were taken, so that P21 is the central position of the sample.
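The fixed-size sample formulation described above (20 positions upstream, the site itself, and 20 positions downstream) can be sketched as follows. This is an illustrative sketch only: the helper names `apply_substitution` and `extract_window`, the padding character `X` for sites near the sequence ends, and the toy sequence are assumptions and are not taken from the study.

```python
def apply_substitution(sequence, position, new_base):
    """Return a copy of sequence with the base at 0-based position replaced."""
    return sequence[:position] + new_base + sequence[position + 1:]

def extract_window(sequence, position, flank=20, pad="X"):
    """Return the (2 * flank + 1)-long window centred on position.

    Sites closer than `flank` to either end are padded with `pad`
    (an assumed convention) so that every sample has the same length.
    """
    left = sequence[max(0, position - flank):position]
    right = sequence[position + 1:position + 1 + flank]
    return left.rjust(flank, pad) + sequence[position] + right.ljust(flank, pad)

normal = "ACGT" * 15                            # toy 60-base sequence, not real gene data
mutated = apply_substitution(normal, 30, "A")   # G -> A at 0-based position 30
window = extract_window(mutated, 30)            # 41-long sample, mutated site at index 20
```

With `flank=20` the window always has 41 positions, and the site of interest sits at the 21st position, matching the sample length stated in this section.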
Each sample peptide sequence is 41 in length due to which equation (4) can be formulated as
3.3. Statistical Moments Calculation
The arrangement of each gene sequence follows certain patterns. For this reason, each sequence is described with several statistical parameters. In previous work, statistical moments have been used for feature extraction [23, 24]. Here, raw, central, and Hahn moments are used for feature extraction. Extracted features can be location and scale variant. Raw moments, which are location variant, are used to compute the mean, variance, and asymmetry of the sample distribution in the dataset. Central moments are also used for feature extraction by estimating the mean, variance, and asymmetry; they are location invariant because they are computed about the centroid, but they are scale variant [25, 26]. Hahn moments are used to estimate statistical parameters, and these moments are both location and scale variant [27, 28]. Hahn moments are computed using Hahn polynomials to estimate the mean, the variance, and the deviation of the probability distribution. For this purpose, the sequences are arranged in a two-dimensional n × n matrix denoted by S′ [29].
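A mapping function is used to transform the sequence into the matrix S′, and the elements of this matrix are used to compute the moments. Moments were computed up to order three, such as L01, L10, L12, L21, L30, and L03; the raw moments are computed as
$$L_{ij} = \sum_{b=1}^{n} \sum_{q=1}^{n} b^{i} q^{j} \beta_{bq}$$
where $\beta_{bq}$ is the element of S′ in row $b$ and column $q$. The sum of $i$ and $j$ denotes the order of the moment, that is, $i + j$, and it is less than or equal to three. The central moments can be computed as
$$\eta_{ij} = \sum_{b=1}^{n} \sum_{q=1}^{n} (b - \bar{x})^{i} (q - \bar{y})^{j} \beta_{bq}$$
where $\bar{x} = L_{10}/L_{00}$ and $\bar{y} = L_{01}/L_{00}$ are the coordinates of the centroid.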
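The raw and central moments described above can be illustrated with a short sketch. The numeric encoding of nucleotides in `CODES`, the row-by-row filling of the square matrix, and the zero padding of unused cells are assumptions made for this example; the study's own mapping function may differ.

```python
import math

CODES = {"A": 1, "C": 2, "G": 3, "T": 4}  # assumed numeric encoding

def to_square_matrix(sequence):
    """Fill an n x n grid row by row; unused trailing cells stay 0."""
    n = math.ceil(math.sqrt(len(sequence)))
    values = [CODES[base] for base in sequence] + [0] * (n * n - len(sequence))
    return [values[r * n:(r + 1) * n] for r in range(n)]

def raw_moment(matrix, i, j):
    """Raw moment of order i + j, with 1-based row index b and column index q."""
    return sum((b + 1) ** i * (q + 1) ** j * v
               for b, row in enumerate(matrix) for q, v in enumerate(row))

def central_moment(matrix, i, j):
    """Central moment of order i + j, taken about the centroid of the matrix."""
    total = raw_moment(matrix, 0, 0)
    x_bar = raw_moment(matrix, 1, 0) / total
    y_bar = raw_moment(matrix, 0, 1) / total
    return sum((b + 1 - x_bar) ** i * (q + 1 - y_bar) ** j * v
               for b, row in enumerate(matrix) for q, v in enumerate(row))

grid = to_square_matrix("ACGTACGTA")  # toy sequence -> 3 x 3 grid
orders = [(i, j) for i in range(4) for j in range(4) if i + j <= 3]
features = [raw_moment(grid, i, j) for i, j in orders] + \
           [central_moment(grid, i, j) for i, j in orders]
```

Because central moments are taken about the centroid, the first-order central moments vanish, which is a quick sanity check on any implementation.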
Hahn moments can be easily computed for even-dimensional data. The reversibility of Hahn moments follows from their orthogonality. Hahn moments of order n are computed as
Normalized orthogonal Hahn moments of two-dimensional discrete data are computed as
3.4. Determination of PRIM and R-PRIM
In cell biology, there are often cases where biological sequences are homologous. This usually happens when a common ancestor is part of the evolutionary process and more than one sequence evolves from it. In such cases, the performance of the classifier is strongly affected by these homologous sequences. Thus, to produce accurate results, effective and reliable sequence similarity searching is performed during data processing.
In machine learning, accuracy and efficiency depend heavily on the thoroughness and precision of the algorithms through which the most relevant features in the data are extracted. During the learning phase, machine learning algorithms learn and adapt to the most deeply embedded obscure patterns in the data in order to capture the relevant features [30–33]. The R-PRIM and PRIM computations follow the same method, but R-PRIM works with the reversed ordering of the sequence. Computing R-PRIM revealed hidden patterns that helped resolve ambiguities between homologous sequences. It is described by equation (11).
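As a hedged illustration of the PRIM and R-PRIM idea, the sketch below assumes that PRIM denotes a position relative incidence matrix, as in related sequence-based predictors, and adopts one possible convention: entry (i, j) accumulates the offsets of every occurrence of residue j that follows the first occurrence of residue i. This convention, the 4-letter alphabet, and the function names `prim` and `r_prim` are assumptions; equation (11) of the study may define the entries differently.

```python
ALPHABET = "ACGT"

def prim(sequence, alphabet=ALPHABET):
    """Square matrix: entry [i][j] sums the offsets of every occurrence of
    alphabet[j] that follows the first occurrence of alphabet[i]."""
    size = len(alphabet)
    matrix = [[0] * size for _ in range(size)]
    for i, a in enumerate(alphabet):
        first = sequence.find(a)
        if first == -1:
            continue  # residue a never occurs; its row stays zero
        for pos in range(first + 1, len(sequence)):
            j = alphabet.find(sequence[pos])
            if j != -1:
                matrix[i][j] += pos - first
    return matrix

def r_prim(sequence, alphabet=ALPHABET):
    """Same computation applied to the reversed sequence."""
    return prim(sequence[::-1], alphabet)

forward = prim("ACGA")
reverse = r_prim("ACGA")
```

Flattening both matrices gives 32 additional position-aware features per sample under this convention, and the reversed variant captures ordering information that the forward matrix alone does not.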
3.5. Feature Scaling
Feature scaling is used to give all features an equal opportunity to contribute to the identification and prediction of the mutations involved in Lower Grade Glioma Carcinoma sequences. The standard scaling formulation is given by
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is a feature value, $\mu$ is the mean of that feature, and $\sigma$ is its standard deviation.
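Standard scaling, as named above, can be sketched per feature column as follows. The function names and the toy feature rows are illustrative assumptions; the population standard deviation is used here, which is a common convention for this kind of scaling but is not stated in the text.

```python
from statistics import fmean, pstdev

def standard_scale(column):
    """Shift a feature column to zero mean and scale it to unit variance."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (assumed convention)
    if std == 0:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(x - mean) / std for x in column]

def scale_features(rows):
    """Apply standard scaling to every column of a row-major feature table."""
    columns = [standard_scale(col) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy feature vectors
scaled = scale_features(rows)
```

After scaling, both columns of the toy table become identical even though their original ranges differ by a factor of ten, which is the equal-contribution effect this step is meant to achieve.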
3.6. Prediction Algorithm
Three machine learning algorithms, Multilayered Perceptron (MLP) [25], Logistic Regression (LR) [34], and Random Forest (RF), are applied to learn the hidden patterns in normalized and mutated sequences related to LGGC. The MLP classifier demonstrates an accuracy of 84% and is illustrated in Figure 5.

The dataset contains a set of 1331 mutations in 38 genes. In this work, supervised learning methods are used to predict Lower Grade Glioma Carcinoma active mutated sites; predictions distinguish between sequence positions with and without active sites. MLP is a feed-forward artificial neural network that maps input data to the most appropriate output. It consists of an input layer and an output layer, with multiple hidden layers between them. Every node is connected to all nodes in the adjacent layer; accordingly, it is known as a fully connected network [35]. Figure 5 shows a graphical representation of the MLP classifier. Figure 6 explains the working process of MLP. The MLP classifier is made of N neurons and every neuron has R weights, which are arranged in an N×R matrix. The input weighting function has N components, denoted by F as per equation (14). The working process of the hidden layer is described by the equations (14) and (16):

Results of the output layer from the hidden layer are calculated by the equations (16)–(19):
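As a hedged illustration of the fully connected computation described in this section, the sketch below runs one forward pass through a hidden layer of N neurons with R weights each, followed by an output layer. The sigmoid activation, the bias terms, and all function names are assumptions made for the example; the study's equations (14)–(19) may use a different activation or notation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), one row of W per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate the inputs through the hidden layer, then the output layer."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)

# N = 2 hidden neurons with R = 3 weights each (an N x R weight matrix)
hidden_w = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
hidden_b = [0.0, 0.1]
output_w = [[0.7, -0.6]]  # a single output neuron
output_b = [0.05]
probability = mlp_forward([1.0, 0.5, -1.0], hidden_w, hidden_b, output_w, output_b)[0]
```

Training would adjust the weights and biases by backpropagation; only the forward computation is shown here.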
4. Results
4.1. Estimated Accuracy
This section reports the results of the first model to predict Lower Grade Glioma Cancer mutation sites, obtained by applying three ML models and three evaluation techniques, i.e., independent set test, self-consistency test, and 10-fold cross-validation. Test results are collected and reported as described in the “Materials and Methods” section. The acquired datasets had nonnumeric values consisting of sequences of alphabetic characters. As the data varied widely, scaling is used so that each feature makes an equal contribution to the prediction and identification of Lower Grade Glioma Cancer mutation sites. A neural network, the MLP classifier, is trained on the data and then predicts Lower Grade Glioma Cancer mutation sites efficiently. The working of the MLP classifier is illustrated graphically in Figure 6 and described mathematically in equations (14)–(19).
4.2. Formulation of Metrics
True Positive (TP) refers to samples that have Lower Grade Glioma Cancer and are predicted as such, True Negative (TN) refers to samples that do not have Lower Grade Glioma Cancer and are predicted as such, and False Positive and False Negative are denoted as FP and FN, respectively. From these values, accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews Correlation Coefficient (MCC) are determined. Several metrics are used to validate prediction accuracy: sensitivity validates the prediction of actual positive samples, specificity estimates negative sample prediction, and MCC estimates the stability of the prediction model. These metrics are defined as
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sn} = \frac{TP}{TP + FN}$$
$$\mathrm{Sp} = \frac{TN}{TN + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
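The four metrics can be computed directly from the confusion-matrix counts, as in the following sketch. The function name `classification_metrics` and the counts in the usage line are illustrative assumptions and are not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # fraction of actual positives that are recovered
    sp = tn / (tn + fp)   # fraction of actual negatives that are recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Illustrative counts only
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the two classes differ in size.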
4.3. Test Methods
There are various techniques of deep learning and machine learning which are currently used to evaluate and validate the features of the formulated model. In this study, the independent set test, self-consistency test, and K-fold cross-validation are used to validate the formulated prediction model.
4.3.1. Independent Set Test
The first evaluation method used for the proposed study is an independent set test evaluation method. In this method, 80% of the dataset is used for the training set, and 20% is used for the test set. In the discussed study, an independent set test has 99% accuracy, 99% sensitivity, 99% specificity, 0.98 MCC, and 1.00 AUC achieved by RF. The results of MLP, LR, and RF are shown in Table 2.
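The 80%/20% partition used for the independent set test can be sketched as below. Stratifying by class label, the fixed seed, and the function name `stratified_split` are assumptions for illustration; the text does not state whether the split was stratified.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return sorted (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 50 + [0] * 50  # toy balanced labels
train_idx, test_idx = stratified_split(labels)
```

The model is then fitted on the samples in `train_idx` only and scored on the samples in `test_idx`, which it has never seen.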
The AUC values for Logistic Regression, Random Forest, and Multilayered Perceptron are 0.89, 1.00, and 0.97, respectively. The ROC curve of MLP, LR, and RF for the independent set test is shown in Figures 7–9, respectively. The combined ROC curve is shown in Figure 10.
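For readers who want to see how an AUC value of this kind is obtained, the following sketch uses the rank-based interpretation of the area under the ROC curve: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1.00 therefore means that every positive sample is scored above every negative sample, while 0.5 corresponds to random ranking.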




4.3.2. Self-Consistency Test
In the self-consistency test, both training and testing are carried out on the exact same dataset. This is possible because the true labels of the benchmark dataset are already known. The test validates how accurately the formulated prediction model has been trained. It does not provide as robust an evaluation as K-fold cross-validation but still has importance in the overall validation process. The results of the self-consistency test are given in Table 3, which shows that Multilayer Perceptron, Logistic Regression, and Random Forest have accuracies of 84%, 80%, and 99%, respectively.
The AUC of MLP, LR, and RF is 1.00, 0.88, and 1.00, and the ROC curve is depicted in Figures 11–13, respectively. The combined ROC for self-consistency test for applied classifiers is given in Figure 14.




4.3.3. K-Fold Cross-Validation Test
The dataset is divided into 10 parts. In each iteration, 9 parts are used for training and the remaining part is held out for testing; the held-out part changes from one iteration to the next until every part has been used for testing once. The parameter k of K-fold cross-validation defines how many parts the data samples are divided into. K can be any integer from 2 up to the number of samples; we use k = 10, which divides the overall learning into 10 folds. This process is also explained in Figure 15. It is a robust way to validate that ILGGCP is predicting true positives. The samples are assigned to folds at random, and Acc, Sn, Sp, and MCC are obtained by averaging the results of all folds.
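The fold rotation described above can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, the fixed seed, and the placeholder scoring function are assumptions for illustration; in practice `evaluate` would train a classifier on the training indices and return a metric measured on the held-out fold.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(100, k=10))
# Placeholder scorer: reports the held-out share instead of a real metric
mean_score = cross_validate(100, lambda train, test: len(test) / 100)
```

With 100 samples and k = 10, every iteration trains on 90 samples and tests on the remaining 10, and the reported figure is the mean over the 10 iterations.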

The working of 10-fold cross-validation is illustrated in Figure 15. Detailed results of the 10-fold cross-validation test are given in Table 4. Multilayer Perceptron, Logistic Regression, and Random Forest have accuracies of 79%, 79%, and 98%, respectively.
The mean Area Under the Curve of the ROC (MROC) of Multilayer Perceptron, Logistic Regression, and Random Forest is 98%, 88%, and 99%, as shown in Figures 16–18, respectively. The combined ROC for the 10-fold cross-validation test on the applied classifiers is given in Figure 19.




5. Comparison and Discussion
LGGC is a cancer that occurs in the glial cells of the brain. Several computational and medical studies have been proposed for the identification of glioma cancer, as discussed in the literature review section. A limitation of these studies is that most of them use a small dataset of MRI images or a small dataset of sequences obtained from other studies. The previous work has low accuracies, and the evaluation metrics are also not sufficient. Different algorithms including SVM, LVQ, NB, ANN, CAD, and CNN are applied for the identification of glioma cancer [11–14].
The proposed framework ILGGCP shows the best results for the early detection of LGGC using a machine learning framework. The study overcomes the limitation of the previous studies by using a large dataset, multiple ML techniques, and multiple evaluation metrics. The dataset contains 38 genes with 1292 samples having 1331 mutations. Three machine learning algorithms, Multilayer Perceptron, Logistic Regression, and Random Forest, are used for the early detection of Lower Grade Glioma Cancer under three evaluation methods: self-consistency test, independent set test, and 10-fold cross-validation test. The maximum accuracies of 99%, 99%, and 98% are obtained by Random Forest for the self-consistency test, independent set test, and 10-fold cross-validation, respectively. MLP shows accuracies of 84%, 79%, and 79%, and LR shows accuracies of 80%, 81%, and 79% for the self-consistency test, independent set test, and 10-fold cross-validation test, respectively. All these results show an AUC [36] of more than 75%, which is considered an excellent result for detection.
6. Conclusion
In conclusion, variations in genes can change the overall patterns of DNA and thereby lead to cancer driver genes. Three validation techniques, i.e., self-consistency test, independent set test, and 10-fold cross-validation test, validate the proposed model named ILGGCP. ILGGCP is developed using MLP, LR, and RF as classifiers and the self-consistency test, independent set test, and 10-fold cross-validation test as evaluation methods, with best accuracies of 99%, 99%, and 98%. The best sensitivity, specificity, Matthews Correlation Coefficient, and AUC are 99%, 100%, 100%, 0.98, and 1.00, respectively. These results suggest that the ILGGCP model will perform very well even if the dataset is updated. In the future, a deep learning model will be developed, and an identification framework for other types of diseases will also be developed.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.
Acknowledgments
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University (KKU) for funding this research under project Number (RGP.1/213/42). The authors are thankful to the Deanship of Scientific Research at University of Bisha for supporting this work through the Fast-Track Research Support Program.