One of the predominant health issues affecting Saudi Arabia and leading to many complications is Type 2 diabetes (T2D). Early detection and effective preventative measures help curb and control the disease. Few datasets exist in the literature for the detection of T2D in the Saudi population, and past studies using Saudi data have favoured machine learning algorithms to classify T2D. Although the application of this data in machine learning is evident, no studies in the literature compare such data using deep learning algorithms. This study's objective is to use a specific Saudi dataset to develop multiple deep learning models for detecting T2D. The research uses a Deep Neural Network (DNN), an Autoencoder (AE), and a Convolutional Neural Network (CNN) to create predictive models and compares their performance with a traditional machine learning classifier, the Decision Forest (DF), which outperformed other machine learning algorithms on the same dataset. Various metrics were used to evaluate the effectiveness of the models: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve (AUC). Two cases are considered in this paper: (i) using all features of the dataset and (ii) using six of the ten features, as done with DF. In case (i), the results showed that AE outperformed the other models, with the highest accuracies of 81.12\% and 79.16\% for imbalanced and balanced data, respectively. In case (ii), AE scored the highest accuracy of 81.01\% with imbalanced data, while DF achieved the highest accuracy of 82.1\% with balanced data. Both cases therefore revealed that AE performs consistently better when imbalanced data is used, whereas DF demonstrated the highest accuracy when a balanced dataset was used with a reduced feature set.
These models help identify undiagnosed T2D and are valuable to health-sector professionals in Saudi Arabia for promoting health engagement, identifying risks, and containing or improving diabetes management.
Insulin resistance and insulin deficiency are the main causes of type 2 diabetes (T2D), which raises blood sugar levels [1]. The illness has spread around the world in recent decades and affected a large number of people [2]. According to estimates from the World Health Organisation (WHO), 3 million Saudi Arabians have prediabetes, i.e., an elevated risk of developing diabetes, and 7 million Saudi Arabians suffer from diabetes, which has a serious impact on their lives [3]. Furthermore, with an estimated 24% of the adult population suffering from T2D, Saudi Arabia is ranked seventh among nations with a high prevalence of T2D [4], [5]. Factors including urbanisation, rapidly rising obesity rates in the region, and sedentary lifestyles account for the large number of patients [5]. Early detection and the right kind of intervention can help manage T2D, postponing or avoiding the disease's consequences such as retinopathy, heart problems, and renal failure. Early identification hence improves treatment outcomes and lessens the load on healthcare systems [6]. Healthcare problems such as diagnosis and disease identification have benefited from the adoption of deep learning and machine learning approaches [7]. Since these methods can support the early detection of T2D, analysing the available medical records is necessary. Moreover, a number of studies have been published demonstrating the superiority of deep learning and machine learning models over traditional statistical techniques in indicating the risk of diabetes [8].
Motivation. Early detection of T2D is crucial because it aids in the management and prevention of complications such as nephropathy, retinopathy, and cardiovascular disease. Conventional screening techniques, such as blood glucose testing and oral glucose tolerance tests, take a lot of time and do not always yield reliable results [9]. Thus, there has been interest in creating a disease detection technique that is accurate, dependable, and economical. Due to their ability to analyse massive datasets, machine learning and deep learning techniques have been shown to be effective [7]. Because these approaches analyse complex data automatically rather than by hand, they are more efficient than standard statistical methods and have been applied in a variety of healthcare applications [10]. Numerous studies have looked into the use of deep learning and machine learning methods for T2D prediction in recent years. However, there is a dearth of research on applying deep learning algorithms to diagnose T2D in the Saudi population, and few datasets are available in the related literature for Saudi T2D detection. One such dataset was gathered in [3] and is detailed in the dataset sub-section of this work. Using it, the authors in [3] obtained the highest accuracy of 78.9% with a decision forest classifier on imbalanced data and 82.1% on data balanced with the Synthetic Minority Over-sampling Technique (SMOTE). They did, however, employ only a few machine learning algorithms, and comparable research applying deep learning models to T2D datasets is currently lacking.
Contributions. This research aims to fill two gaps related to Type 2 Diabetes (T2D) detection in Saudi Arabia. The ways this study contributes to the field include:
This work used both balanced and imbalanced versions of the dataset. These versions are essential because they provide important information on how data preprocessing techniques such as SMOTE influence the performance of deep learning and traditional models. This research examined the performance of both a complete feature set and a reduced one, which enabled us to evaluate the models' capabilities and the techniques that can be adopted to optimize them for detecting T2D in Saudi Arabia. As a result, this study bridges current gaps in the literature by introducing deep learning methods to the Saudi T2D detection landscape and by proposing a rigorous comparative framework that can be used by future studies and in clinical decision-making processes. The results of this study can be used to enhance the effectiveness and precision of both T2D screening and diagnosis in Saudi Arabia.
This paper is structured into distinct sections: Section 2 delineates the models employed for diabetes detection. Section 3 presents materials and methods used. Section 4 exhibits the experimental results, and Section 5 provides a conclusive summary.
The methods used to detect diabetes, particularly type 2 diabetes, are reviewed in this section. With the aid of machine learning techniques, Farooq Ahmad et al. [20] elaborate on the factors that predict T2D. The researchers used records of three thousand patients from several Saudi hospitals. They then applied preprocessing methods and discussed their importance, which reduced the dataset by 162 cases. Modelling was performed using ensemble majority voting (EMV), random forest (RF), logistic regression (LR), decision tree (DT), and support vector machine (SVM). SVM obtained 82.10% accuracy on the first dataset, while on the second dataset RF achieved 88.27% accuracy with nine features and 87.65% with eight features. Additionally, Syed and Khan [3] developed an application that might be utilised to identify T2D in Saudi Arabia. Data for their study were obtained from King Abdulaziz University and included 3906 non-diabetic subjects and 990 diabetic cases. To identify important features, binary LR and the Pearson chi-squared test were employed. The dataset was split into training and testing sets using an 80:20 ratio, and the Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the classes. After nine different binary classification methods were evaluated, the Decision Forest (DF) approach was found to outperform the other models.
Gollapalli et al. [4] created an efficient model able to predict and detect three types of diabetes: Type 1 Diabetes, T2D, and prediabetes. They did this by applying a variety of machine learning classifiers to a dataset obtained from King Fahad University Hospital (KFUH) in Saudi Arabia. The sample contained 897 instances and 10 attributes. The researchers used stacking techniques, SVM, Bagging, DT, K-nearest neighbour (KNN), and RF as their main classifiers. To maximise the outcomes, four trials were carried out, with experiments 2, 3, and 4 employing SMOTE to balance the dataset. Their novel stacking model combined KNN with a KNN meta-classifier, Bagging DT, and Bagging KNN to obtain an accuracy of 94.48%. A critical study of feature importance found that factors including sex, education, antiDiab, and nutrition had a significant impact on the accuracy of the model. Alassaf et al. [21] presented a method intended to proactively diagnose diabetes. They received data from KFUH in Khobar, which was the first time this information was used to support a diagnosis. The authors performed pre-processing and identified key attributes before classifying the data, employing recursive feature elimination and the correlation coefficient for feature selection. Following that, four algorithms, namely Naïve Bayes (NB), Artificial Neural Networks (ANN), Support Vector Machines (SVM), and KNN, were evaluated with an emphasis on classification accuracy, F-measure, precision, and recall. ANN fared better than the other models, with 77.5% accuracy.
Alanazi and Mezher proposed a model combining RF and SVM classifiers to predict diabetes. They obtained a real dataset from the primary health care unit of the security forces in Tabuk, Saudi Arabia. The RF classifier showed remarkable performance, with an accuracy of 98% and an area under the receiver operating characteristic (ROC) curve of 99%, signifying that RF was much better than SVM in accuracy [22]. In [23], the research project leveraged real healthcare datasets comprising 18 attributes, sourced from the Ministry of National Guard Health Affairs (MNGHA) database. The primary objective was to construct a predictive model for identifying diabetic patients within the adult population of Saudi Arabia. Three distinct algorithms, namely the Self-Organizing Map (SOM), C4.5, and RF, were applied for this purpose. Comparative analysis against various classifiers revealed that RF consistently delivered superior performance.
In the investigation of T2D, Jaber and James [24] employed diverse classifiers, including the NB, LR, and RF algorithms. The Pima Indian Diabetes Dataset (PIDD) served as the foundational dataset. The findings underscored the supremacy of the RF algorithm over the alternative approaches. Various machine learning algorithms, including linear discriminant (LD), linear SVM, quadratic SVM, cubic SVM, Gaussian SVM, fine KNN, weighted KNN, and neural pattern recognition (NPR), were harnessed to construct classification models for diabetes detection in [25]. A rigorous performance analysis showed that the weighted KNN achieved commendable predictive accuracy in estimating the prevalence of diabetes in both male and female datasets, with an impressive average accuracy of 94.5% and a shorter training time than the other classification methods.
The areas for improvement identified in the related work that this study addresses include:
This study made use of deep learning approaches to compare T2D classifications. The two cases considered are (i) using all features and (ii) using 6 out of 10 features, as done with the Decision Forest in [3]. Both cases use balanced data (with SMOTE) and imbalanced data (without SMOTE). The whole system for detecting T2D is depicted in Figure 1. The dataset is first preprocessed and then separated into two parts: (1) a training set and (2) a testing set. The system uses different deep learning models, namely DNN, AE, and CNN. Finally, a set of performance metrics is used to assess each model's effectiveness: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve (AUC).
A) Dataset
The dataset used was collected by Syed and Khan in [3]; it is referred to here as the Saudi Dataset (SD) for ease of reference. SD was collected via a cross-sectional survey: participants from King Abdulaziz University (KAU) were issued a form to complete. The survey used close-ended questions to gather data on the participants' diabetes risk factors in order to predict the occurrence of diabetes in the western part of Saudi Arabia. When preparing the survey, the researchers extracted the most common attributes for diabetes prediction from recently published diabetes-prediction papers. The researchers used non-invasive tests and direct observation techniques to address the attributes in the survey. The necessary permissions were first obtained from the KAU Deanship of Graduate Studies before the study was conducted at the university. The participants were students, staff, and faculty members. The survey contains eleven questions as follows:
The total number of subjects in this study was 4896, of which 990 were at high risk and the remaining 3906 at low risk of diabetic complications. The diabetic data attributes of the participants are shown in Table 1; the researchers used the Label Encoding method as a preprocessing step to encode the attributes. Q1-Q10 represent the explanatory variables, or predictors, of this study. The survey's last question was used as a categorical response variable capturing the level of High Fasting Blood Glucose. Participants who answered "Yes" to this question were considered at high risk of developing diabetes, and those who answered "No" at low risk [3].
No. | Attributes | Type | Description | Labels |
1 | Region | Integer | Subject's Region | 1=Abwa, 2=Jeddah, 3=Khulays, 4=Medina, 5=Masturah, 6=Mecca, 7=Rabigh, 8=Sabar, 9=Thual, and 10=Yambu |
2 | Age | Integer | Subject's Age | 0: Age < 40 years, 1: 40 <= Age <= 49 years, 2: 50 <= Age <= 59 years, and 3: Age >= 60 years |
3 | Gender | Integer | Subject's Gender | 0=Female and 1=Male |
4 | BMI | Integer | Body Mass Index of subject (weight in kg / (height in m)^2) | 0: BMI < 25 kg/m^2, 1: 25 <= BMI <= 30 kg/m^2, and 2: BMI > 30 kg/m^2 |
5 | Waist Size (WS) | Integer | Subject's waist size in cm, with separate cut-offs for males and females | Male: 0: WS < 94 cm, 1: 94 <= WS <= 102 cm, 2: WS > 102 cm; Female: 0: WS < 80 cm, 1: 80 <= WS <= 88 cm, 2: WS > 88 cm |
6 | Physical Activity | Integer | The subject's physical activity, defined as 30 minutes of exercise daily | 0: Yes and 1: No |
7 | Diet | Integer | The subject's healthy diet, defined as regular consumption of fruits and vegetables | 0: Every day and 1: Not every day |
8 | BP | Integer | Whether the subject takes blood pressure medicine | 0: No and 1: Yes |
9 | Family History | Integer | The subject's family history, defined as having any family member diagnosed with diabetes | 0: No family history of diabetes, 1: Grandparents have diabetes, and 2: Parents have diabetes |
10 | Smoking | Integer | Subject's smoking habit | 0: Non-smoker and 1: Smoker |
11 | Class | Boolean | The subject's response variable, based on fasting plasma glucose >= 5.6 mmol/L measured during a health examination or pregnancy | 0: Low Risk and 1: High Risk |
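As a small illustration of the Label Encoding step applied to these attributes, the snippet below encodes one categorical attribute with sklearn's LabelEncoder. Note that sklearn assigns codes in alphabetical order, so the exact mapping may differ from the codes chosen in [3]; the category names here are taken from Table 1.

```python
from sklearn.preprocessing import LabelEncoder

# Raw survey answers for the Smoking attribute (Table 1, row 10).
smoking = ["Non-Smoker", "Smoker", "Smoker", "Non-Smoker"]

encoder = LabelEncoder()
codes = encoder.fit_transform(smoking)  # alphabetical: Non-Smoker -> 0, Smoker -> 1

print(list(codes))             # [0, 1, 1, 0]
print(list(encoder.classes_))  # ['Non-Smoker', 'Smoker']
```

The same call is repeated per column; `encoder.inverse_transform` recovers the original labels when needed.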
B) Deep Learning Models
This part defines the deep learning models used, including DNN, AE, and CNN.
C) Preparing the Data
The TensorFlow library was installed so that the proposed models could be built with the Keras library. Keras was preferred because it makes it simple to create neural networks capable of running on Central Processing Units (CPUs) and Graphics Processing Units (GPUs), allowing faster computations and seamless parallel processing. A key advantage of Keras for deep learning network construction is its organization into simple layers, which requires fewer steps to construct complex networks [32].
The first step of the proposed solution involves importing the necessary libraries for data handling, modeling, and processing, such as numpy, pandas, TensorFlow, and sklearn. After importing them, the input features are normalized. This step is important because it ensures that all input features are on the same scale, improving the model's training. The Synthetic Minority Oversampling Technique (SMOTE) was then used as an oversampling technique to balance the dataset. The final step splits the dataset into 70\% for training and 30\% for testing, as shown in Table 2.
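The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the toy data stands in for SD, and `smote_like` is a simplified, hand-rolled version of the SMOTE idea (the paper would use the imbalanced-learn library's SMOTE implementation).

```python
# Illustrative preprocessing: normalization, SMOTE-style oversampling, 70/30 split.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def smote_like(X, y, minority_label=1):
    """Interpolate between random pairs of minority samples until classes balance."""
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    synthetic = []
    for _ in range(n_needed):
        a, b = X_min[rng.integers(len(X_min), size=2)]
        synthetic.append(a + rng.random() * (b - a))  # point on the segment a-b
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal

# Toy stand-in for SD: 4896 subjects, 10 label-encoded features.
X = rng.integers(0, 3, size=(4896, 10)).astype(float)
y = np.array([1] * 990 + [0] * 3906)

X = MinMaxScaler().fit_transform(X)       # normalize features to [0, 1]
X_bal, y_bal = smote_like(X, y)           # balance the two classes
X_tr, X_te, y_tr, y_te = train_test_split(  # 70% train / 30% test
    X_bal, y_bal, test_size=0.3, random_state=42)
print(len(X_tr), len(X_te))
```

For the imbalanced-data experiments, the oversampling step is simply skipped and the original `X`, `y` are split directly.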
Dataset | Number of subjects | Train | Test |
SD | 4896 | 3427 | 1469 |
D) Building and Training Deep Learning Model
The deep learning model was constructed using three different types of layers as follows:
(i) Case: Using all Features of Dataset
(ii) Case: Using 6 features out of 10
In this case, Region, Gender, BMI, Diet, BP, and Smoking are used as defined in Table 1, mirroring the feature subset used with the Decision Forest in [3]. For all deep learning models, the all-features experiment was repeated with the size of the input layer changed from 10 to 6.
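To make the input-layer change concrete, the sketch below builds a tiny fully connected network in plain numpy as a stand-in for the Keras models. The ReLU hidden layers and sigmoid output follow the configuration described later in Table 3, but the hidden sizes and selected column indices are illustrative and the weights are random, so this only demonstrates shapes, not trained behaviour.

```python
# Minimal numpy sketch of a DNN forward pass: only the input dimension
# changes between case (i) (10 features) and case (ii) (6 features).
import numpy as np

rng = np.random.default_rng(0)

def build_dnn(n_features, hidden=(128, 256, 128)):
    """Return randomly initialised weight matrices for a dense network."""
    sizes = (n_features, *hidden, 1)
    return [rng.standard_normal((a, b)) * 0.05
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)                        # ReLU hidden layers
    return 1.0 / (1.0 + np.exp(-(x @ weights[-1])))       # sigmoid output

x10 = rng.random((4, 10))          # case (i): all 10 features
x6 = x10[:, [0, 2, 3, 6, 7, 9]]    # case (ii): 6 columns (illustrative indices)

p10 = forward(build_dnn(10), x10)
p6 = forward(build_dnn(6), x6)
print(p10.shape, p6.shape)  # each row is a probability of high T2D risk
```

In the actual Keras models, the same change amounts to setting the first layer's `input_shape` to `(6,)` instead of `(10,)`; everything downstream is unchanged.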
E) Tuning the Algorithms
Deep learning models differ from machine learning models in that they are heavily parameterised by hyperparameters, which control aspects such as the number of hidden units, i.e., the network's structure [33]. In this study, we performed a hyperparameter search using the grid search method (GridSearchCV) provided by the sklearn module. Because the search space was large, we tuned DNN, AE, and CNN by selecting the parameter values that yielded the best accuracy. Eight steps were used to determine the best parameter values for this study [34]:
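As a minimal sketch of the grid search step, the snippet below runs sklearn's GridSearchCV over a small grid. It uses sklearn's MLPClassifier as a stand-in for the Keras models (tuning the Keras networks themselves would require a wrapper such as scikeras), and the grid values and toy data are illustrative, not the paper's full search space.

```python
# Hedged sketch of GridSearchCV hyperparameter tuning on toy data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((200, 10))              # toy stand-in for the training features
y = rng.integers(0, 2, size=200)       # toy binary labels

param_grid = {
    "hidden_layer_sizes": [(32,), (64, 32)],  # network structure
    "learning_rate_init": [1e-3, 1e-2],       # learning rate
    "batch_size": [32, 64],                   # batch size
}
search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)  # parameter combination with best cross-validated accuracy
```

The same pattern extends to epochs, optimizer, and dropout once the Keras model is wrapped for sklearn; each combination is scored by cross-validated accuracy and the best one is kept, as done for Tables 3-5.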
The DNN's, AE's, and CNN's fine-tuned parameters are shown in Tables 3, 4, and 5, respectively.
Parameter | Value |
Training optimization algorithm | RMSprop |
Epochs | 100 |
Activation function | Sigmoid for output layer; ReLU for hidden layers |
Learning rate | 0.001 |
Batch size | 64 |
Hidden layers number | 3 |
Number of neurons in the hidden layers | H1=128, H2=256, H3=256, H4=128 |
Dropout regularization | 0.5 |
Parameter | Value |
Training optimization algorithm | RMSprop and Adam |
Epochs | 200 |
Activation function | Sigmoid for output layer; ReLU for hidden layers |
Learning rate | 0.001 |
Batch size | 32 |
Hidden layers number | 1 |
Number of neurons in the hidden layers | 32 |
Parameter | Value |
Training optimization algorithm | RMSprop |
Epochs | 200 |
Activation function | Sigmoid for output layer; ReLU for other layers |
Learning rate | 0.001 |
Batch size | 128 |
Convolutional layers count | 1 |
Number of filters in each convolutional layer | 64 |
Kernel size in each convolutional layer | 3 |
Pooling size after each convolutional layer (if applicable) | 2 |
Dense layers count | 2 |
Number of neurons in each dense layer | 128 and 64 |
Dropout regularization | 0.5 |
F) Pseudocode
The pseudocode for the proposed solution that involves using SD to detect diabetes is as follows:
The algorithms were executed using the Python programming language. They can be run on Google Colab, which provides access to GPUs that significantly speed up the training process. When run on Google Colab, the code freely accesses the GPUs, and the Python code can be developed and run directly without additional configuration [35]. At the backend, TensorFlow is used to build and train the networks, written in Python 3.
A) Performance Measures
Evaluation of the performance of the proposed models was done using various measures: Accuracy (Acc), precision, recall, F1-measure, and area under the receiver operating characteristic (ROC) curve (AUC). The model's accuracy is defined as the proportion of patients that the model diagnosed correctly. The formula for accuracy is shown below [36]: \begin{equation}\label{e1} Accuracy (Acc)=\frac{TN+TP}{TN+TP+FN+FP} \end{equation}
True positive (TP) refers to the patients classified as positive who are truly positive. True negative (TN) represents the patients predicted as negative who are actually negative. False positive (FP) represents the patients classified as positive who are in fact negative. False negative (FN) refers to the patients categorized as negative who are actually positive. These quantities are regularly estimated to determine the proposed model's classification quality [36]. Precision is the second performance evaluation metric; it is the proportion of predicted positives that are truly positive. The formula for calculating precision is presented below [37]: \begin{equation}\label{e2} Precision=\frac{TP}{TP+FP} \end{equation}
Recall, also known as sensitivity or the True Positive Rate, measures the proportion of actual positives that are classified correctly. It is expressed as the number of true positives divided by the total number of actual positive cases. The formula for calculating recall is presented below [37]: \begin{equation}\label{e3} Recall=\frac{TP}{TP+FN} \end{equation}
The F1-measure, also known as the F1-score, is a performance metric that balances recall and precision as their harmonic mean. The formula used to calculate the F1-measure is presented below [37]:
\begin{equation}\label{e4} \text{F1-Score}=2 \times \frac{Precision \times Recall}{Precision+Recall} \end{equation}
The Area under the Curve (AUC) value can be used to measure the discriminative power of classification algorithms. AUC is used to assess models' performance with values ranging from 0 to 1. Values of 1 or near 1 mean the model is excellent at finding the balance between recall and precision. Such values indicate models capable of producing superior classification performance [3].
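All five metrics above can be computed directly with sklearn. The toy labels below are illustrative only (the printed values are for this toy data, not the paper's results); note that AUC is computed from predicted probabilities rather than hard class labels.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # predicted classes
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]   # predicted P(high risk)

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total -> 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP)    -> 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN)    -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean -> 0.75
print(roc_auc_score(y_true, y_prob))    # area under ROC curve -> 0.875
```

Here TP=3, TN=3, FP=1, FN=1, which reproduces the formulas in equations (1)-(4) above.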
B) Result
(i) Case 1: Using the entire dataset’s features
In the first case, we used the dataset's entire feature set, as outlined in Table 1 above. This study explored how the three deep learning models compare with one another in detecting T2D in both imbalanced data (without SMOTE) and balanced data (with SMOTE). SMOTE is currently one of the most widely used methods for addressing imbalanced classes, and this study used it to obtain balanced data. As shown in Figure 2, the metrics used to evaluate the developed deep learning methods are accuracy, F1-score, precision, AUC, and recall. Figure 2 (a) below shows the results for imbalanced data:
Based on the scores of the three models developed in this study, AE is the best model for detecting cases of diabetes, since it has the highest scores among the three models in recall, F1-score, AUC, and accuracy. Although DNN outperformed the other models in precision, AE had the highest scores in all the other metrics. Figure 2 (b) shows the results of the three models for balanced data. The summary of the scores is as follows:
Again, AE is the strongest model compared to the others, as evidenced by its metric scores: it leads in most metrics, including accuracy, precision, F1-score, and AUC. Although DNN had an excellent recall score, AE performed better in the other metrics, revealing that it is more accurate and reliable for detecting people with diabetes. Based on the results for the imbalanced and balanced datasets, it is evident that AE is the best model for identifying cases of diabetes in both, as it had superior accuracy, F1-score, and AUC among the three models. In contrast, DNN and CNN lagged behind, recording some of the poorest recall and F1-score performances, especially on imbalanced data.
(ii) Case 2: Using six features out of ten features
Another case compared the three deep learning models developed in this study with the Decision Forest (DF), which is known for outperforming most machine learning algorithms in \cite{3}. All four models were used to detect T2D in imbalanced data (without SMOTE) and balanced data (with SMOTE). The four models were evaluated using five metrics: accuracy, AUC, precision, F1-score, and recall, as shown in Figure 3. Figure 3 (a) shows the DNN, AE, CNN, and DF results when using the imbalanced dataset. The summary of the results is as follows:
Generally, the results of the four models reveal that AE is the best for detecting diabetic cases in imbalanced datasets, as it had the highest scores in all the performance metrics used in this study. It can therefore be concluded that AE is the most reliable model for identifying true positive cases of diabetes while balancing class differentiation. By contrast, DNN's scores indicate that, although its overall performance is moderate, it struggles with recall, making it less reliable at detecting positive cases. CNN shows varying scores across the metrics: while it has excellent precision, it had some of the worst recall and F1 scores, revealing that it is not a reliable classification model here. DF showed balanced performance in precision and recall but failed to record promising scores in the other metrics. DF's performance across all the metrics suggests that the deep learning methods, especially AE, are more reliable.
Figure 3 (b) shows the DNN, AE, CNN, and DF results when using balanced datasets. The summary of the scores of these four models, when balanced data is used, is as follows:
The results of these four models reveal that DF is superior, recording the highest accuracy, precision, recall, F1-score, and AUC, which shows that DF is the best model for detecting cases of diabetes in the balanced-data context. Conversely, although DNN has an excellent recall score, its accuracy and precision are only moderate. While DF outperforms AE and CNN, these two deep learning models score better than DNN and have more balanced performance profiles. Overall, DF is the better model for correctly identifying positive cases of diabetes and ensuring high prediction accuracy when using balanced datasets. Among the three deep learning models, AE is the strongest, as it had better accuracy, recall, F1-score, and AUC for imbalanced data; DF, on the other hand, is the better model for balanced data, outperforming all three deep learning models.
Early detection of Type 2 Diabetes (T2D) is critical for implementing appropriate treatment strategies and lifestyle adjustments, which help decelerate and prevent the advancement of the condition. Studies in machine learning on medical datasets have overlooked the Saudi population for T2D detection. This study addresses that research gap by developing and comparing three deep learning methods and a traditional classifier. The deep learning methods developed are a Deep Neural Network (DNN), an Autoencoder (AE), and a Convolutional Neural Network (CNN), while the traditional classifier used is the Decision Forest (DF). This study thereby contributes to the literature by providing a comparative analysis of different models, an aspect that was missing. The performance metrics used to evaluate these four models are accuracy, precision, recall, F1-score, and AUC. The findings revealed that, when all features of the dataset are used, AE outperformed the other models for both imbalanced and balanced data, with the highest accuracies of 81.12% and 79.16%, respectively. When 6 features are used, AE achieved the highest accuracy of 81.01% with imbalanced data, while DF achieved the highest accuracy of 82.1% with balanced data. The findings suggest that AE is superior when all features are used, for both imbalanced and balanced datasets, whereas DF is the more reliable model when using balanced data with fewer features. Overall, AE is more useful for detecting T2D cases that are yet to be diagnosed. These results signify that the model can be a valuable tool for healthcare practitioners in Saudi Arabia. The models developed in this research may be employed as a screening instrument in public health initiatives, supporting awareness goals about T2D and in turn promoting healthier lifestyles among different populations. Future investigations should incorporate a larger and more diverse dataset.
This should encompass additional variables such as genetic factors and dietary patterns. Exploring alternative deep learning architectures and integrating advanced techniques such as data augmentation and transfer learning could lead to improved results. Future research should validate the models using more extensive datasets and assess their practical viability within clinical contexts.
The authors declare no conflict of interest. All authors read and approved the final version of the paper.
All authors contributed equally to this paper.