Abstract
Diabetes is one of the most common metabolic diseases, causing high blood sugar. Early diagnosis of the condition is challenging due to its complex interdependence with various factors, and there is a need for critical decision support systems to assist medical practitioners in the diagnosis process. This research proposes developing a predictive model that can achieve high classification accuracy for type 2 diabetes. The study consisted of two fundamental parts. First, the study investigated handling missing data by adopting three data imputation methods, namely, median value imputation, K-nearest neighbor imputation, and iterative imputation. The study then validated the implications of these imputations using various classification algorithms, i.e., linear, tree-based, and ensemble algorithms, to see how each method affected classification accuracy. Second, an Artificial Neural Network was employed to model the best performing imputed data, balanced with SMOTETomek to ensure each class was represented fairly. This approach provided the best accuracy of 98% on the test data, outperforming accuracies achieved in prior studies using the same dataset. The dataset used in this study is constrained in terms of gender and population; as a prospect, the study recommends adopting a larger population sample without geographic boundaries. Additionally, as the developed Artificial Neural Network model did not undergo any specific hyperparameter tuning, it would be interesting to explore tuning on top of normalized data to optimize accuracy further.
1. Introduction
Diabetes mellitus, more commonly known simply as diabetes, is a global health problem affecting people worldwide. According to [1], 463 million people have diabetes globally, of whom 88 million are from South East Asia. Diabetes is the general term for separate but related chronic diseases. It occurs when the human body is incapable of producing or properly using the hormone known as insulin [2]. Insulin is responsible for converting glucose from the food humans consume into the energy they need for daily life; thus, its imbalance is detrimental to human health. The exact cause of diabetes remains unknown, but studies point to genetic and environmental factors [3].
Additionally, in most countries diabetes has grown with dietary changes, aging populations, reduced physical activity, and other unhealthy lifestyle choices [4]. Essentially, there are two types of diabetes. Type 1 diabetes occurs when the body’s cells responsible for producing insulin are destroyed, causing insulin deficiency. The more prevalent type 2 diabetes occurs when insulin production is not sufficient to control the blood glucose level, causing relative insulin deficiency [5].
Diabetes increases the risk of being susceptible to other diseases, as its development often leads to heart attack, stroke, heart disease, and many other severe consequences [6]. The first step in the development of type 2 diabetes is prediabetes, in which the blood sugar level is higher than usual but not high enough to be type 2 diabetes. However, the long-term damage of diabetes, especially to the heart, blood vessels, and kidneys, may already have started, making an early diagnosis of diabetes vitally important [7]. Assessing symptoms of diabetes is a challenging task: 50% of patients with type 1 and type 2 symptoms remain underdiagnosed due to the absence of noticeable signs of disease and its complex interdependence with other conditions [2]. There is a need to accurately identify patterns and relationships in patient data. With the increase in the amount of data in medical databases, it has become too time-consuming and costly to assign staff to conduct traditional manual analysis to identify data patterns and recognize anomalies in patients’ medical records.
In the National Diabetes and Diabetic Retinopathy survey, the prevalence of diabetes in India is stated to be 11.8% [8]. According to National Health and Nutrition Examination Survey data, more people died from the consequences of diabetes than from HIV/AIDS, malaria, and tuberculosis combined. The World Health Organization has predicted that by the year 2030, diabetes will be the seventh leading cause of death [9]. By 2040, around 482 million people will be affected globally by prediabetes and an estimated 642 million by diabetes, according to the International Diabetes Federation [10]. These statistics show that type 2 diabetes does not discriminate: it has an impact on everyone across all classes, societies, countries, and continents. Unless effective diagnosis methods are utilized, this problem will remain prevalent, stressing healthcare systems, challenging economies, and predominantly affecting quality of life.
A practical solution is to employ data mining techniques for examining data and extracting useful information from medical databases to help clinical staff diagnose type 2 diabetes more accurately [11]. The healthcare industry generates a vast amount of valuable data comprising patient information, electronic medical records, treatment history, etc. Knowledge extraction on these data can identify patterns, support decision-making, and reduce costs through the development of intelligent medical systems. Accurate early detection can help prevent diabetes [12].
Researchers have utilized a variety of classification algorithms to achieve high-performing models. This research builds on previous literature by developing a classification model using data mining and patients’ laboratory information while applying different data imputation approaches to enhance model performance. In addition, we consider the imbalance in data distribution between diabetic patients and healthy people, which helped in training the classification model to ensure that each class was represented fairly. By identifying the best performing data preprocessing methods, the machine learning model’s accuracy was optimized, making it the highest among previous studies.
The positive impact of the proposed solution is on both the patients and the medical practitioners. The countries with a high rate of diabetic patients can employ an accurate early prediction system of diabetes. The more precise the prediction, the more accurate the severity estimation of the underlying medical condition [13]. Detecting diabetes at an early stage and treating it is an essential step towards keeping people with diabetes healthy. Early detection will benefit the patients and medical staff as they can make use of computer-based systems rather than relying on traditional manual analysis of data, which is costly, time-consuming, and prone to human errors [14].
The rest of the article is organized as follows: Section 2 covers a summary of prior literature; Section 3 presents the data and methods used to adopt the proposed solution; Section 4 outlines the results and discussion of the findings, followed by the conclusion in Section 5.
2. Literature Review
Diabetes mellitus is a common chronic disease found in numerous countries around the world. While it can seriously deteriorate an individual’s quality of life, early diagnosis can minimize the long-term adverse effects [15]. With the abundance of medical data from electronic records, patient information, treatment data, etc., researchers have utilized data mining techniques to discover knowledge through these data. Useful knowledge extraction can support decision-making and cost reduction in intelligent diabetes diagnosis systems. Table 1 outlines the extensive research conducted in the early prediction of diabetes with a summary of 20 selected prior studies. For a fair comparison, all selected studies used the same diabetes dataset, publicly available in the UCI Machine Learning Repository, to conduct their investigation on the development of a diabetes predictive model [30]. Additionally, we noted each study’s data preprocessing in handling missing data and its suggested solutions to class imbalance problems.
Existing literature shows various data mining and machine learning algorithms such as neural networks, tree-based algorithms, rule-based fuzzy classifiers, linear classifiers, and hybrid models used to produce highly accurate predictive models. The Artificial Neural Network (ANN) was the most popular classifier: 11 out of the 20 prior studies used different ANN types in their experiments. More than half of the prior studies reported prediction accuracy within the 80%–90% range. Wu et al. [22] used a hybrid predictive model to reach 95.42% prediction accuracy, the highest in the literature. Medical datasets are likely to contain many missing data values, so we reviewed each study’s suggested solution and how it influenced classification accuracy. Earlier studies showed that using mean value imputation [2, 20, 22] led to better accuracy than deleting records with missing data [11, 16, 27].
Nevertheless, most prior literature chose to delete missing values, with only 5 out of the 20 studies using imputation to replace missing values with meaningful values based on central tendency measures. Another aspect of data preprocessing often encountered in binary classification in the medical domain is the class imbalance problem. The data contain a larger sample of the class labeled as not having diabetes than of the class having diabetes, which can lead to a biased predictive model whose accuracy applies only to the majority class. Among prior studies, only [18] addresses both missing values and the class imbalance problem. The authors used naïve Bayes imputation to replace missing values with meaningful values and ADASYN oversampling to get equal representation of both classes in training the predictive model, achieving 87.10% accuracy.
A high-quality classification model is one whose accuracy reflects its capability of generalizing to unseen data samples. Developing such a model essentially requires more data representing both classes fairly. While deleting missing data values is a simple solution, it risks losing data points with valuable information, reducing the model’s ability to generalize [31]. Furthermore, correcting a skewed class distribution will ensure that predictive performance applies to both majority and minority classes. Therefore, despite the extensive reported research, prediction of diabetes remains a challenging problem that needs to be explored further, especially in terms of the effect of preprocessing approaches on classification accuracy. There is a need to experiment with various data imputation methods and to solve the class imbalance problem to ensure that the predictive model’s resulting accuracy is of high quality.
3. Methodology
Optimal classification performance is dependent on efficient data preparation and data preprocessing methods. The proposed algorithm fills the research gap identified in the literature review. Handling missing values of attributes through a meaningful approach can substantially improve a machine learning model’s performance [32]. Initially, the investigation focused on a comparative study of three different data imputation methods to identify the optimal way of handling missing values in the dataset. Consequently, data from the highest performing method was employed to achieve optimized performance in modeling.
Figure 1 presents the proposed framework for type 2 diabetes prediction. It has two phases, where the first phase consists of identifying the optimal imputation technique. After loading the data, we apply three different imputation techniques, namely, median, iterative, and KNN imputation. We handle each attribute’s outliers, split the instances into a proportion of 70% training and 30% testing, and model the data using seven different machine learning algorithms. Table 2 outlines these algorithms below.

We chose the highest performing outlier-treated imputed dataset for the second phase. In the second phase, we adopt a random sampling technique to introduce some noise into the data. The data are then further preprocessed to handle the class imbalance problem using undersampling and oversampling, giving each class equal representation in the machine learning model’s training data. The preprocessed data are then split into 70% train and 30% test datasets. Standard scaling smooths and prepares the training data for modeling. ANN is chosen as the machine learning model based on its common use in past literature. The ANN architecture is created with 1 initial layer, 5 hidden layers, and 1 final layer. The ReLU activation function is applied to each layer except the final layer, which instead uses the sigmoid activation function. The Adam optimizer is used for model compiling, and “binary_crossentropy” is used as the loss function. After a few trials, we concluded that 10 epochs showed good results in terms of “accuracy”, the primary evaluation metric for this model.
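The described architecture can be sketched in Keras as follows. The per-layer neuron counts are illustrative assumptions, as the exact widths are not listed here; only the layer structure, activations, optimizer, and loss follow the description.

```python
# Sketch of the described ANN: an input layer, 5 hidden ReLU layers,
# and a sigmoid output unit, compiled with Adam and binary cross-entropy.
# The hidden-layer widths (64/64/32/32/16) are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers


def build_model(n_features: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),      # initial (input) layer
        layers.Dense(64, activation="relu"),    # hidden layer 1
        layers.Dense(64, activation="relu"),    # hidden layer 2
        layers.Dense(32, activation="relu"),    # hidden layer 3
        layers.Dense(32, activation="relu"),    # hidden layer 4
        layers.Dense(16, activation="relu"),    # hidden layer 5
        layers.Dense(1, activation="sigmoid"),  # final layer
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model(8)  # the dataset has 8 predictor attributes
# model.fit(X_train, y_train, epochs=10)  # 10 epochs, as in the study
```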
Figure 2 shows the ANN model architecture with an input layer, all hidden layers, activation function, and the output layer.

Figure 3 represents the various steps taken to achieve the research objectives, i.e., to obtain high classification accuracy of the type 2 diabetes predictive model. Once the dataset is loaded, we create three duplicate copies of the data for three separate implementation approaches.

Let us consider that the dataset loaded is X. Three duplicate copies are created:

X1, X2, X3 = duplicate(X), (1)

where X1, X2, and X3 are the duplicate datasets created. Equations (2), (3), and (4) show the individual datasets being imputed, with X1m, X2i, and X3k being the imputed datasets created, respectively, for datasets X1, X2, and X3:

X1m = median_impute(X1), (2)
X2i = iterative_impute(X2), (3)
X3k = knn_impute(X3). (4)
We process the individual imputed datasets with outlier treatment in equation (5):

X1o = outlier_treat(X1m), X2o = outlier_treat(X2i), X3o = outlier_treat(X3k). (5)

X1o, X2o, and X3o are then split into a 70% train set and a 30% test set individually. Each dataset’s training set is modeled using 10-fold cross-validation across the seven machine learning algorithms. We then use different metrics and accuracies to determine model performance. The test set enables further evaluation of performance, leading to the conclusion of which imputed dataset provides the best accuracy.
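The first-phase evaluation loop can be sketched as below, using synthetic stand-in data and two of the seven classifiers (logistic regression and gradient boosting); the real study runs all seven algorithms on the imputed diabetes datasets instead.

```python
# Sketch of the first-phase evaluation: 70/30 split, then 10-fold
# cross-validation of each classifier on the training portion.
# The synthetic data below stand in for one imputed dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=0)

for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("GB", GradientBoostingClassifier())]:
    scores = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```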
Xo, the best performing outlier-treated dataset, is then exposed to random sampling to introduce some noise into the data, as shown in equation (6):

Xr = random_sample(Xo). (6)
SMOTETomek links are used to create combined under- and oversampling of the data. Equation (7) shows that the created Xs is our current modified dataset:

Xs = SMOTETomek(Xr). (7)

This step helps convert the imbalanced dataset into a balanced dataset. Xs is then split into train and test data at a 70%–30% ratio. The training data are then scaled for processing in the ANN.
ANN model’s performance is judged based on different metrics. Prediction outcome is drawn out of the stable model for better understanding.
3.1. Data Description
The dataset used in this study is taken from the National Institute of Diabetes and Digestive and Kidney Diseases. As discussed by [23], this is a publicly available dataset at [30]. A detailed description of all attributes is given in Table 3. It contains several medical predictors: the independent variables and one dependent variable, the “Outcome.”
Figure 4 illustrates the histogram of each attribute, explaining the distribution and skewness, and shows which attributes have missing values. Based on the histograms, the attributes age (A), diabetes pedigree function (B), insulin (C), pregnancies (D), and skin thickness (E) have most of their data on the left side of the histogram, making them skewed to the right. This means the data are positively skewed and the mean is larger than the median.
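The skewness observation can be verified numerically: for a right-skewed sample, the mean exceeds the median and the skewness coefficient is positive. A minimal check with a toy right-skewed sample:

```python
# For positively (right-) skewed data, mean > median and skewness > 0.
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 9, 15])  # toy right-skewed sample
assert s.mean() > s.median()   # mean (4.2) exceeds median (2.5)
assert s.skew() > 0            # positive skewness coefficient
```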

Figure 5 shows that the histograms for BMI (A), blood pressure (B), and glucose (C) are symmetric; i.e., they approximate a normal distribution. The data are symmetrically distributed, with most observations around the central peak.

Most healthcare data contain missing values and noisy, inconsistent entries. This diabetes dataset is no exception: several of its attributes/features include missing values recorded as zero (0), namely, glucose, blood pressure, skin thickness, insulin, and BMI. We observed several candidates with a zero reading, for instance, five candidates with a zero glucose level, 35 with zero blood pressure, 227 with zero skin thickness, 374 with a zero insulin level, and 11 with a zero BMI. This reflects an inaccurate scenario; i.e., the zero values present in the data are incorrect and would probably lead to wrong results in diabetes prediction. Thus, these zero values need to be imputed with something more meaningful for each parameter. As any machine learning model’s performance heavily depends on the data, we performed the necessary preprocessing techniques to handle the missing information.
Figure 6 represents the percentage of missing data related to a specific attribute. 48.69% of data are missing for insulin. 29.55% of information is missing for the skin thickness feature. 4.55% of data are missing for the blood pressure attribute. 1.43% of data are missing for BMI, and 0.65% of data are missing for the glucose attribute.

3.2. Data Preprocessing
In this research, we consider data preprocessing a significant component. In the first phase of the proposed framework, we conducted an investigation of data imputation methods. After determining the optimal method, we handled the data’s outliers and the class imbalance problem.
3.2.1. Handling Missing Values
For the first phase, we applied three different imputation techniques to fill in the missing information. Method-I refers to imputation in which we take the median value of each variable with missing data, grouped by the outcome variable, and assign those medians to the missing values. Method-II refers to the iterative imputation technique. Method-III refers to the KNN imputation technique.
(1) Method-I. In order to fill missing values, we group each missing variable’s median values by the outcome class and assign those medians to the missing values. For example, the median value of the glucose parameter is 107 for outcome class 0 and 140 for outcome class 1. All missing glucose values with outcome class 0 are replaced with the value 107, and all missing glucose values with outcome class 1 are replaced with the value 140.
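A minimal pandas sketch of Method-I: zeros are marked as missing, and each gap is filled with the median of its Outcome group. The toy frame below yields its own group medians; in the full dataset the glucose medians are 107 and 140, as noted above.

```python
# Method-I sketch: replace zero (missing) readings with the median of
# the feature within each Outcome class. Toy data for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 0, 137],
    "Outcome": [1, 0, 1, 0, 0, 1],
})
df["Glucose"] = df["Glucose"].replace(0, np.nan)  # zeros are missing
# Fill each missing value with the median of its Outcome group
df["Glucose"] = df.groupby("Outcome")["Glucose"].transform(
    lambda s: s.fillna(s.median()))
```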
The correlation matrix for this imputed dataset is shown in Figure 7. The highest positive correlation is 0.57, between skin thickness and BMI. The highest correlated factor for output class is glucose at 0.50, followed by insulin at 0.38 and BMI at 0.32.
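The correlation matrices discussed in this section can be reproduced with pandas. The toy frame below constructs two deliberately correlated features (stand-ins for skin thickness and BMI) to illustrate reading off the strongest pair:

```python
# Correlation-matrix sketch: pandas computes pairwise Pearson
# correlations, from which the most correlated feature pairs are read.
# The two features below are synthetic and correlated by construction.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bmi = rng.normal(32, 6, 200)                 # BMI-like values
skin = 0.6 * bmi + rng.normal(0, 4, 200)     # correlated with BMI
df = pd.DataFrame({"BMI": bmi, "SkinThickness": skin})

corr = df.corr()  # symmetric matrix of Pearson correlations
print(corr.round(2))
```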

(2) Method-II. Iterative Imputation. A more sophisticated practice for finding and replacing missing values in a dataset is iterative imputation. This involves modeling each feature as a function of all other features and repeatedly estimating the feature values. The repetition allows the estimated values of other features to be used as inputs for predicting missing values during subsequent iterations.
This is basically a data transformation method. By default, a Bayesian Ridge model is fitted as a function of all other input parameters. Parameters are considered in ascending order of missingness, filling the parameter with the fewest missing values first and then proceeding to the parameter with the next fewest. Different regression algorithms can be used to estimate the missing values for each parameter, although linear methods are often used for simplicity.
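A minimal scikit-learn sketch of this technique; note the experimental-feature import that `IterativeImputer` still requires, and that its default estimator is BayesianRidge, matching the description above.

```python
# Iterative imputation sketch with scikit-learn's IterativeImputer,
# which by default fits a BayesianRidge model on the other features.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],   # one missing entry to be estimated
              [10.0, 5.0, 9.0]])
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)  # missing entry replaced by a model estimate
```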
The correlation matrix for this imputation method is shown in Figure 8. The highest correlation factor is between skin thickness and BMI (0.71). The highest correlated factor for output class glucose is 0.50, followed by other factors such as insulin (0.36) and BMI (0.32).

(3) Method-III. KNN Imputation. Another good practice for finding and replacing missing values is KNN imputation using the KNN machine learning model. In this method, the k-nearest neighbors algorithm imputes a missing value by finding the training samples “closest” to it and averaging these nearby points to fill in the value.
Similar to the last method, the KNN imputer is another data transformation method. KNN imputation requires numeric input values, which our dataset satisfies. KNN is a lazy learner, as it learns on the fly by checking the nearest observations/neighbors. Configuring KNN imputation requires selecting the distance measure and the number of nearest neighbors used for each prediction; the default distance measure is Euclidean distance, and the number of neighbors selected is 5.
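A minimal scikit-learn sketch of KNN imputation; `n_neighbors` is reduced to 2 here only because the toy matrix is tiny (the study's configuration uses the default of 5).

```python
# KNN imputation sketch: each missing entry is replaced by the mean of
# that feature over the k nearest samples (Euclidean-style distance).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)  # default is 5; 2 fits this toy matrix
X_filled = imputer.fit_transform(X)
```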
The correlation matrix for this imputation method is shown in Figure 9. The highest correlation factor is between skin thickness and BMI (0.64). The highest correlated factor for output class glucose is 0.50, followed by other factors such as insulin (0.32) and BMI (0.31). Overall, all three imputed dataset correlation matrices show that the most correlated features with the target variable Outcome are glucose, insulin, and BMI.

3.2.2. Handling Outliers
The next preprocessing step was to handle outliers in all three imputed datasets. We set thresholds to decide which data points constituted outliers. For each variable, we calculated the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR = Q3 − Q1). We set the upper limit as Q3 + 1.5 × IQR and the lower limit as Q1 − 1.5 × IQR.
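The capping rule can be sketched as a small pandas helper (toy data shown; the study applies this per attribute of each imputed dataset):

```python
# Outlier-capping sketch following the stated rule: values above
# Q3 + 1.5*IQR are clipped to the upper limit, values below
# Q1 - 1.5*IQR to the lower limit.
import pandas as pd


def cap_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)


s = pd.Series([70, 72, 74, 76, 78, 200])  # 200 is an obvious outlier
capped = cap_outliers(s)                  # 200 is clipped to the upper limit
```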
The results shown below illustrate the median value imputed dataset before and after outliers are handled.
Figure 10 shows outliers for the blood pressure feature. We can see that there are multiple data points that are out of bounds. Blood pressure is treated with both upper limit and lower limit outlier correction. All these data points are corrected, as shown in Figure 10(b) after the removal of outliers.

Figure 11 shows outliers for the diabetes pedigree function (DPF) feature. The DPF has multiple data points, which are beyond the upper limit bound of the boxplot. Upper limit outlier treatment is applied to DPF. Figure 11(b) shows the DPF feature after the removal of outliers.

Figure 12 shows outliers for the skin thickness feature. There seems to be only one data point that is beyond the upper limit of the boxplot. Upper limit outlier treatment is applied to skin thickness. Figure 12(b) shows the skin thickness feature after the removal of outliers.

Figure 13 shows outliers for the insulin feature. Insulin has multiple data points that are beyond the upper limit bound of the boxplot. Upper limit outlier treatment is applied to insulin. Figure 13(b) shows the insulin feature after the removal of outliers.

Figure 14 shows outliers for the pregnancy feature. We can see that there are three data points that are out of bounds before outliers are handled. These particular data points can cause erroneous outcomes if not treated properly. The pregnancy feature is treated with upper limit outlier treatment. These data points are corrected, as shown in Figure 14(b) after the removal of outliers.

Figure 15 shows outliers for the BMI feature. We can see that there are multiple data points that are out of bounds. Most of them are beyond the upper limit; only one data point is beyond the lower limit. BMI is treated with both upper limit and lower limit outlier correction. Figure 15(b) shows the BMI feature after the removal of outliers.

Figure 16 shows outliers for the glucose feature. We can see that there is only one data point that is out of bounds. Glucose is the only feature treated with lower limit outlier treatment alone, as it has a data point breaching the lower limit. This data point is corrected, as shown in Figure 16(b) after the removal of outliers.

3.2.3. Handling Class Imbalance
The final preprocessing technique was to handle class imbalance and scale the imputed data. The imbalance in the number of diabetic samples compared to nondiabetic samples was 268 to 500. Data sampling techniques reduced the risk of biased predictive performance and provided more data samples for learning. According to [33], a combination of the synthetic minority oversampling technique (SMOTE) and random under- or oversampling performs better than SMOTE used by itself. Therefore, in this research, we used a combination of random oversampling and SMOTETomek. First, we used random oversampling to create more data samples and then transformed them with SMOTETomek. SMOTETomek combines over- and undersampling using SMOTE and Tomek links. SMOTE randomly increases minority class examples by synthesizing new minority instances between existing ones. Tomek links are pairs of examples that belong to different classes and are each other’s nearest neighbors. After the class imbalance has been dealt with, we normalize the data features using standard scaling and prepare them as input into the ANN machine learning model.
4. Results and Discussion
Implementing the proposed framework’s first phase generates three datasets, one per imputation method. We applied seven classification algorithms on each dataset, and the results were slightly different as each algorithm’s working criteria are different. We used 10-fold cross-validation to evaluate the training set performance. The primary evaluation metric used throughout this study is accuracy:

accuracy = (TP + TN) / (TP + TN + FP + FN).
The results of all classifiers were evaluated based on the following outcomes:
(a) True positive (TP): the subject has the disease, and the model correctly predicts the disease
(b) True negative (TN): the subject has no disease, and the model correctly predicts no disease
(c) False positive (FP): the subject has no disease, but the model incorrectly predicts that the subject is diabetic
(d) False negative (FN): the subject has the disease, but the model incorrectly predicts that the subject is not diabetic
For the evaluation of the accuracy of the classification model, the values from the confusion matrix are used to calculate the following evaluation metrics:
(a) F1-score = 2 × (precision × recall) / (precision + recall)
(b) Precision = TP / (TP + FP)
(c) Sensitivity (recall) = TP / (TP + FN)
(d) Specificity = TN / (TN + FP)
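As a quick sanity check, these confusion-matrix metrics can be computed directly; the counts below are purely illustrative.

```python
# Confusion-matrix metrics computed from illustrative (hypothetical) counts.
def metrics(tp: int, tn: int, fp: int, fn: int):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1


acc, prec, rec, spec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
```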
4.1. Data Imputation Results
Various evaluation metrics are used to assess the performance of each method of imputation technique applied via 10-fold cross-validation for seven machine learning classifiers.
4.1.1. Prediction Accuracy
Accuracies achieved for each method of imputation technique applied via 10-fold cross-validation using different classifiers are discussed as follows.
Figure 17 shows that GB is the best performing classifier for Method-I imputation. The GB classifier achieves an accuracy of 91.06%, followed by the LGBM classifier, which achieves an accuracy of 90.69%. The LR, KNN, DT, and RF classifiers all show good accuracy, but not as good as GB. SVM performs the weakest.

Figure 18 shows the accuracy performance measure of all the classifier algorithms applied to Method-II imputation. LR and RF classifiers have almost similar accuracies, with 78.585% and 78.588%, respectively. GB classifier also performs well with 78.03% accuracy. Like in Method-I, SVM classifier performs the weakest.

Figure 19 shows the accuracy performance measure of all the classifier algorithms applied to Method-III imputation. LR performs the best and achieves an accuracy of 78.21%, followed by GB classifier scoring accuracy of 77.65%. KNN and DT classifiers have achieved the same accuracies of 71.89%. SVM performs the weakest in this method as well.

4.1.2. Precision, Recall, and F1-Score
The above figures show that Method-I imputation resulted in the highest accuracy. To further validate that Method-I is indeed the optimal data imputation technique among the three methods, performance was also observed on other evaluation metrics, namely, F1-score, precision, and recall. As our data form an imbalanced dataset, we consider the weighted average for scoring.
Table 4 describes the performance of each classification model evaluated with the help of the mentioned metrics. These scores were achieved by applying 10-fold cross-validation. The precision score is similar for both the GB and LGBM classifiers, whereas for recall and F1-score, GB scores better than LGBM.
Table 5 describes the performance of each classification model evaluated with the help of the mentioned metrics for imputation technique Method-II. The LGBM classifier has precision, recall, and F1-score of 0.74 each. This is the best result, and by this score, the model marginally outperforms the GB classifier. SVM has the lowest scores on the individual metrics.
Table 6 describes the performance of each classification model evaluated with the help of the mentioned metrics for imputation technique Method-III. LGBM and LR have precision, recall, and F1-score of 0.73 each. SVM has the lowest scores on the individual metrics.
4.1.3. Area under ROC (Receiver Operating Characteristic) Curve
Furthermore, we also considered the Receiver Operating Characteristic (ROC) curve as part of the performance evaluation. The greater the area under the curve, the better the model. An ideal model has a TP rate of 1 and an FP rate of 0, so any curve hugging the top-left corner (more aligned with the Y-axis) is deemed to perform better.
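For reference, the TP-rate/FP-rate trade-off and the area under it can be computed with scikit-learn (toy labels and scores shown):

```python
# ROC/AUC sketch: the ROC curve plots TP rate against FP rate over all
# thresholds; a larger area under it indicates a better classifier.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # toy labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]   # toy probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.4f}")
```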
Figure 20 shows the AUC that we achieved for the Method-I imputation technique. The solid red line with the “o” marker is the curve for the GB classifier, closely followed by the green dashed line, the LGBM classifier.

Figure 21 shows the AUC that we have achieved for the Method-II imputation technique. The green dashed line has more area under its curve than that of other models. Thus, it indicates that the LGBM classifier has a better model performance for the Method-II imputation technique.

Figure 22 shows the AUC that we have achieved for the Method-III imputation technique. The green dashed line has more area under its curve than that of other models. Thus, it indicates that LGBM classifier has a better model performance for Method-III imputation technique.

4.1.4. Summary
Table 7 summarizes the findings of the first phase results, presenting the highest numbers of evaluation metrics achieved by each method.
All methods result in favorable performance, with more than 60% across all evaluation metrics. The LGBM classifier recurs as a top performing model across all methods. Among the three imputation methods, Method-I outperforms the other two by a significant margin across all the evaluation metrics, while Method-II and Method-III have similar performance numbers. The correlation matrices of the imputed data presented in Section 3.2.1 offer a possible reason for this difference in performance. Method-II and Method-III imputation generated data with higher correlation between variables than Method-I, so their imputed data may suffer from multicollinearity. When predictor variables are significantly dependent on or associated with each other, the model’s ability to learn the contribution of each predictor variable individually is hindered, as a change in one causes a change in the other. Method-I’s high performance can thus plausibly be attributed to a lower chance of multicollinearity in its imputed dataset.
4.2. ANN Results
The second phase consisted of implementing an ANN model on the highest performing imputed data, i.e., the Method-I imputed dataset. After handling the class imbalance and scaling the training dataset, an accuracy of 0.9817 was achieved. Upon evaluation of the test dataset, it produced an accuracy of 0.9809. Table 8 presents the classification report with relevant evaluation metrics.
Precision, recall, and F1-score achieved for outcome “0” are 0.99, 0.97, and 0.98, respectively. Precision, recall, and F1-score achieved for outcome “1” are 0.97, 0.99, and 0.98, respectively. We can also see that the macro average and the weighted average for precision, recall, and F1-score are 0.98. Even though hyperparameter tuning was not conducted for this ANN model, high-performance results were achieved. In the comparison of area under ROC curves, the ANN model outperformed the other seven machine learning algorithms.
Figure 23 shows the ROC curves for a comparative study among all the classification techniques applied. The ANN clearly outperforms GB, which had performed best in the data imputation phase.
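The kind of per-class report in Table 8 and the ROC AUC comparison in Figure 23 are typically produced with `sklearn.metrics`; the labels and scores below are illustrative, not the study’s actual outputs:

```python
from sklearn.metrics import classification_report, roc_auc_score

# Illustrative ground truth, hard predictions, and predicted P(class 1).
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95]

# Per-class precision/recall/F1 plus macro and weighted averages (cf. Table 8).
print(classification_report(y_true, y_pred, digits=2))

# Area under the ROC curve uses the continuous scores, not the hard labels
# (cf. Figure 23); all positives score above all negatives here, so AUC is 1.0.
print(roc_auc_score(y_true, y_score))
```

The macro average reported in Table 8 is simply the unweighted mean of the per-class scores, e.g., (0.99 + 0.97) / 2 = 0.98 for precision.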

Key findings of this research include the following:
(i) Method-I, in which missing values were imputed with median values, showed better results than the KNN imputation and iterative imputation techniques. A possible reason is the lower presence of multicollinearity in the Method-I imputed dataset compared with the other two methods.
(ii) SMOTETomek, along with random oversampling, was applied to the dataset created by the Method-I imputation technique. This combined treatment of imbalanced data performed better than either technique used individually.
(iii) ANN was implemented after balancing the dataset and produced an accuracy of 0.9809, the best classification accuracy achieved so far compared with past studies on the same dataset.
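The random-oversampling component of the balancing step can be sketched in plain NumPy (SMOTETomek itself is provided by the imbalanced-learn package as `imblearn.combine.SMOTETomek`); the arrays below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced dataset: 6 majority-class (0) vs 2 minority-class (1) rows.
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of samples.
minority = np.flatnonzero(y == 1)
n_extra = np.sum(y == 0) - minority.size
extra = rng.choice(minority, size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # class counts after balancing -> [6 6]
```

SMOTETomek additionally synthesizes interpolated minority samples (SMOTE) and removes borderline Tomek links, which is why the paper combines it with plain oversampling rather than relying on duplication alone.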
Apart from this, data mining revealed some interesting facts about the attributes’ relationship with type 2 diabetes:
(i) As per the correlation matrix shown in Section 3, BMI is highly correlated with the diabetic outcome, suggesting that obesity carries a high chance of causing type 2 diabetes.
(ii) Insulin resistance has been identified as a major risk factor in the pathogenesis of type 2 diabetes and has been thoroughly studied in the Arizona Pima. Insulin resistance appears to increase incrementally with BMI primarily and body fat levels secondarily. As shown in the data, obesity leads to a high BMI, which in turn increases the chance of type 2 diabetes. Further, DPF, which captures diabetes history in relatives and the genetic relationship, indicates a high chance of diabetes.
5. Conclusion
With advancements in technology and computing power, intelligent machine systems can provide a more efficient alternative to traditional manual analysis for medical decision support. Current solutions include classification models whose accuracy is limited, in part because they do not consider some essential data preprocessing steps that can positively impact the overall accuracy achieved. This research adopted a two-phase framework. The first phase focused on identifying the optimal handling of missing data, investigating three different data imputation techniques and observing their impact on classification model performance. In the second phase, the best-performing imputed data were balanced using a combination of SMOTETomek and random oversampling and modeled using an ANN.
There are certain assumptions that we made during data preprocessing. The upper and lower boundaries set for treating outliers were based on generic assumptions. ANN performs well with larger datasets, but in our case the dataset was small; thus, the randomly oversampled imputed data used to build the ANN are assumed to be close to real data. The number of nodes in the input and hidden layers of the ANN was determined through multiple trials, and the current network is assumed to work best. Eight nodes were used for the input layer, as the target variable depends on 8 predictor variables. The first hidden layer had double the number of input nodes, i.e., 16. For each consecutive hidden layer, the number of nodes was reduced by 4. The final layer of the network had 1 node, as is standard for binary classification.
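The paper does not name its neural-network framework, so as an illustrative stand-in, the layer progression described above (8 inputs, hidden layers shrinking by 4 from 16, a single output node) can be approximated with scikit-learn’s `MLPClassifier`; the hidden sizes (16, 12, 8, 4) assume the reduction stops at 4, which the text does not state explicitly, and the data below are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 8 predictor variables, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hidden layers start at 2 x 8 = 16 nodes and shrink by 4 per layer;
# the single output node is implicit for binary classification.
model = MLPClassifier(hidden_layer_sizes=(16, 12, 8, 4),
                      max_iter=2000, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the synthetic data
```

In a deep-learning framework the same shape would be an explicit stack of dense layers ending in one sigmoid unit; `MLPClassifier` handles the output layer and loss automatically.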
There are several limitations to this study. The data are from a specific region, restricted to a Native American population, and all patients considered are females aged 21 and above. Future work should include data across different geographic regions and of various origins, and it would be interesting to see how the model behaves with a mixed group of males and females. From a modeling point of view, hyperparameter optimization of the ANN could also be adopted to study its effect on accuracy.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors are grateful to Taif University Researchers Supporting Project (no. TURSP-2020/215), Taif University, Taif, Saudi Arabia. This work was also supported by the Faculty of Computer Science and Information Technology, the University of Malaya under Postgraduate Research Grant (PG035-2016A).