Worldwide, healthcare systems are greatly affected by the changing needs of the populations they serve. Diabetes is a chronic condition that can lead to serious complications if not properly controlled. It is divided into Type 1 (T1D) and Type 2 (T2D) diabetes. Research shows that roughly 90% of diabetes cases are T2D, with T1D making up the remaining 10%. This paper proposes a Rough-Neuro classification model for detecting Type 2 diabetes using a two-stage process: the rough set JohnsonReducer eliminates redundant features, and a multi-layer perceptron performs disease classification. The proposed technique seeks to reduce the number of input features, which in turn reduces the time needed to train the neural network and the storage space required. The findings show that decreasing the number of input features lowers neural network training time, improves model performance, and reduces storage needs by 63%. Notably, a smaller neural network with only seven hidden neurons, trained for 1000 epochs with a learning rate of 0.01, attained the best performance while substantially reducing time and storage.
Diabetes, also referred to as diabetes mellitus, is a recognised chronic condition associated with various negative health outcomes such as stroke, chronic kidney failure, heart attack, and diabetic foot syndrome [1,2]. According to forecasts from the World Health Organisation (WHO) [3], diabetes is expected to become the seventh leading cause of death by 2030. Additionally, the International Diabetes Federation projects that the number of individuals with diabetes will reach 693 million over the next 26 years, up from 451 million worldwide in 2017 [4]. Diabetes is a persistent metabolic disorder that gives rise to variations in blood glucose levels and is typically classified into two primary forms: Type 1 and Type 2. T1D results from inadequate insulin production in the body, while T2D occurs when the body cannot utilise the insulin it produces. T1D makes up around 10% of diabetes cases, with T2D accounting for the remaining 90% [3]. T2D is an increasingly widespread issue for the medical field [1]. Although the exact cause of diabetes remains unknown, experts suggest that it results from a combination of genetic and environmental influences. Diabetes poses a significant threat because it cannot be cured, although medications can manage the condition. Consequently, early detection of diabetes is crucial for minimising complications and serious health problems [5].
Rough set theory is a modern technique for managing uncertainty. It helps identify data dependencies, categorise and rank objects, assess the significance of features, reduce redundancy, and classify data. Furthermore, it is used to retrieve rules from databases, with one benefit being the generation of comprehensible if-then rules. These rules can uncover previously unknown patterns in the data, and the resulting rule base can serve as a classifier for unseen specimens. Rough set analysis relies solely on the data itself and does not require external information. Another significant benefit is that rough set theory can determine the completeness of the data through a straightforward evaluation and, if the information is incomplete, indicate what additional items are needed [6].
In addition, even if the data is incomplete, rough sets can still detect data duplicates and determine the minimum information needed for evaluation [6]. This property is crucial when domain knowledge is limited or when collecting data is expensive or time-consuming: it verifies that the data already collected is sufficient to build a reliable classification model, saving the time and effort of gathering more information about the objects [7].
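To make these ideas concrete, the sketch below illustrates indiscernibility and reducts on a toy decision table. This is a brute-force illustration of the concept only; the table, attribute names, and exhaustive search are illustrative assumptions, not the algorithm used later in this paper:

```python
from itertools import combinations

def partition(rows, attrs):
    """Group objects that are indiscernible with respect to `attrs`,
    i.e. that share identical values on every attribute in `attrs`."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(i)
    return {frozenset(b) for b in blocks.values()}

def minimal_reducts(rows, all_attrs):
    """Exhaustively find the smallest attribute subsets that preserve
    the indiscernibility partition of the full attribute set."""
    full = partition(rows, all_attrs)
    for size in range(1, len(all_attrs) + 1):
        found = [c for c in combinations(all_attrs, size)
                 if partition(rows, c) == full]
        if found:
            return found
    return [tuple(all_attrs)]

# Toy table: no single attribute discerns all three objects,
# but any pair of attributes does, so three reducts of size 2 exist.
table = [
    {"a": 1, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 2},
]
print(minimal_reducts(table, ("a", "b", "c")))
# -> [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Exhaustive search is exponential in the number of attributes, which is why practical tools use heuristics such as the Johnson algorithm discussed later.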
Rough set theory is applied across various fields to address challenges related to classification, feature selection, decision-making, and knowledge discovery. In medicine, it is applied to the analysis of medical data for tasks such as diagnosing diseases, distinguishing patterns, and supporting decisions [7,8]. In finance, researchers have employed it to evaluate credit risks and forecast financial results [9]. It is also used in image processing tasks such as image segmentation and feature selection [10], and in pattern recognition for object classification and feature selection [11]. Further applications include categorising opponent behaviour in real-time strategy games [6], predicting reductions in forest fire risk [12], and aiding decision-making for security forces' operations [13].
Data reduction can condense a large dataset into a smaller one without compromising the integrity of the original data [14]. Feature reduction keeps the original characteristics intact while selecting a subset that accurately predicts the target class variable [15]. Neural network classifiers, in particular, face several challenges as input dimensionality grows: higher training cost, increased storage, and longer training time.
This paper utilises rough set theory because of its ability to reduce attributes while providing a straightforward, concise, and easily comprehensible explanation, with no hidden stages and at low computational cost. Consequently, it reduces the number of input features and shortens the time needed for network training. This paper presents a model that merges a rough set attribute reduction algorithm with a neural network classifier. The paper is structured as follows: Section 2 reviews established models for identifying type 2 diabetes. Section 3 provides an overview of the proposed solution. Section 4 presents the findings of the present investigation. Section 5 concludes the paper.
Feature reduction, as discussed earlier, decreases the number of input features, resulting in reduced training time and storage needs while also enhancing network performance. The goal of feature selection is to improve prediction accuracy and to gain deeper insight into the data. This section offers a summary of various classification techniques; many of them rely on standard algorithms to assess the effectiveness of the features in a dataset. Choosing the best features decreases time and space complexity while also improving classification accuracy. The section reviews the current literature on detecting diabetes, focusing on type 2 diabetes, to examine the impact of feature reduction or selection.
In a study by Kakoly et al. [16], a questionnaire was created and distributed among both urban and rural communities in Bangladesh. Data from 738 subjects was gathered and prepared by addressing missing values and outliers. Two techniques were then employed to select features: Information Gain (IG) and Principal Component Analysis (PCA). The selected features were fed into five distinct classifiers: Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), k-nearest neighbours (KNN), and Logistic Regression (LR). The outcomes demonstrate an accuracy rate exceeding 82.2%, accompanied by an Area Under the Curve (AUC) value of 87.2%.
Li et al.'s [17] study used three feature selection techniques to improve classification performance on the Pima Indian Diabetes dataset (PIDD). In conjunction with K-means clustering, the researchers investigated several combinations of the genetic algorithm (GA), particle swarm optimisation (PSO), and harmony search (HR). The GA-Kmeans combination identified Blood Pressure, Insulin, and Age as pivotal for classification. Likewise, the GA-PSO-Kmeans combination showed that Glucose, Blood Pressure, Insulin, and BMI were significant factors, while the HR-Kmeans combination identified Blood Pressure, Insulin, and Glucose as critical variables. The KNN classifier was used to categorise diabetes cases. Notably, the proposed feature selection combinations outperformed earlier results on the same dataset; the HR-Kmeans hybrid performed best with an accuracy of 91.65%, demonstrating a substantial improvement in classification performance.
In addition, Saxena et al. [18] applied three different feature selection methods to the PIDD dataset to identify T2D: IG, correlation attribute evaluation, and PCA. The dataset was pre-processed by eliminating outliers and replacing missing values with the mean. Each method was then used to select either 4 or 6 features. The study employed four machine learning algorithms: DT, KNN, multilayer perceptron (MLP), and random forest (RF). According to the findings, random forest achieved the highest accuracy at 79.8%.
Rahman et al. [2] developed a Convolutional Long Short-Term Memory (Conv-LSTM) model to predict diabetes using the PIDD dataset; CNN (Convolutional Neural Network), CNN-LSTM, and traditional LSTM (T-LSTM) models were employed for comparison. The Boruta algorithm selected the following features: age, BMI, glucose, blood pressure, and insulin. The authors filled in missing values with the median and employed grid search for hyperparameter optimisation. The model comprised embedding, Conv-LSTM, and dense layers. With a simple train-test split, the proposed model's accuracy was 91.38%; with five-fold cross-validation, it reached 97.26%. However, the model's complexity remained a drawback for performance.
Kumar et al. [19] used a Deep Neural Network (DNN) classifier to predict T2D through an unsupervised learning method. The model's performance was assessed on PIDD, and the dataset was pre-processed by eliminating unclear or missing data. To improve the model, features were chosen according to their importance score: BMI, Glucose, Age, and Diabetes Pedigree Function. These features were then used to train the DNN, which has four nodes in the input layer, one node in the output layer, and three hidden layers with 20, 10, and 10 nodes, respectively. The results showed that the model outperformed previous studies in the field, achieving 98.16%. Nevertheless, the model is constrained by the substantial computational expense of DNN processing.
In addition, Zhou et al. [20] utilised a deep learning model for predicting diabetes called DLPD, which can forecast the type of diabetes that may develop later on. The model was built on a DNN and assessed on the Diabetes Type Dataset (DTD) and PIDD. The plan is divided into four phases: preparing the dataset, building and training the DLPD model, producing the output, and tuning the hyperparameters. The authors first divided the dataset into training (70%), validation (15%), and testing (15%) data. In the second phase, the proposed model has three kinds of layers: the input layer simply passes the dataset's features to the hidden layers without performing any computation on them; the hidden layers, whose number is unrestricted, process the data; and the results are then transferred to the output layer. During the third phase, dropout is controlled to prevent overfitting. To create an accurate DNN prediction model, the authors adjusted certain parameters before applying the binary cross-entropy loss function. The experimental results showed that the proposed model performed better; however, no comparison to related work was provided.
Naz and Ahuja [21] utilised three supervised learning algorithms, DT, Artificial Neural Network (ANN), and Naive Bayes (NB), alongside deep learning, specifically a layered feed-forward neural network trained with stochastic gradient descent and back-propagation, to identify diabetes. PIDD was used to measure performance. To ensure validity, the data was divided into training and testing sets. The proposed model consists of an input layer, two hidden layers for processing the data, and an output layer for prediction. Experimental results show that the multilayer feed-forward perceptron model reaches a maximum accuracy of 98.07%. However, the authors did not pre-process the dataset.
In addition, Lukmanto et al. [22] introduced a framework for identifying T2D, evaluated on PIDD. The data was first processed by removing features containing many missing values, such as skin thickness and insulin. F-score selection was then used to pick specific features from PIDD; only blood glucose and body mass index were used in the classification process. The data was divided into 87% for training and 13% for testing, and classification was performed with a fuzzy support vector machine. According to the results, the proposed model achieved an accuracy of 89.02%.
Prabhu and Selvabharathi [23] developed a Deep Belief Neural (DBN) network model for diabetes detection, using PIDD to assess performance. The model consists of three main stages: pre-processing, pre-training using the DBN, and fine-tuning. Normalisation is applied to prepare the dataset, after which PCA selects the appropriate values from the training data; although normalisation is not a built-in part of PCA, it is crucial to normalise data before employing PCA or similar methods. In the pre-training phase, the DBN consists of an input layer, an output layer, and three hidden layers, each using a Rectified Linear Unit (ReLU) activation function. The classifier is then refined during the fine-tuning phase based on the results of the initial training. Experimental findings show that the model outperforms traditional models such as NB, DT, LR, SVM, and RF. However, the authors did not employ an optimisation method to address over-fitting.
Kumar and Manjula [24] utilised the Keras toolkit to develop an MLP network for diabetes detection, with performance evaluated on PIDD. The authors encoded categorical data and independent variables to organise the data. The model has an input layer (IL), which uses the ReLU activation function, and an output layer (OL), which uses the sigmoid activation function. According to the results, the reported accuracy is 86.67%. The authors did not use dropout to avoid overfitting.
Nguyen et al. [25] combined a deep feed-forward neural network with the power of an established linear model in a wide and deep learning architecture to improve overall performance while eliminating features related to glucose or insulin. To detect T2D, the proposed methodology relies on electronic health records for the US population. The 1312 features were organised into three groups: fixed and basic features such as blood pressure, sex, BMI, and age; crossed features, including the top diagnosis and medication characteristics; and adjustable features, such as diagnostic features dependent on laboratory tests and medication. The model employed the Synthetic Minority Oversampling Technique (SMOTE) at 150 and 300 percent for each fold of the cross-validated training. Three sets of features are represented by the embeddings of the hidden layers in the deep component: 151 input features for diagnosis, 134 for treatment, and 80 for laboratory testing. To improve learning from a sparse binary vector to a dense 16-dimensional vector, each embedding used independent shallow layers. The hidden layers contained 256 and 128 neurons, respectively. Their output was joined with the crossed features of the wide component in the last layer to construct a 1439-dimensional vector, and the framework was finalised with a single 128-to-1 layer using a logistic activation function; ReLU served as the activation function in the other layers. The final predictive model for the onset of T2D averaged the output probabilities of the top 10 models and compared the result to a threshold of 0.5 to indicate diabetes. The results indicated that the proposed model outperformed other machine learning algorithms on the identical dataset. The study faced challenges from the dataset's high dimensionality and sparsity, and the wide and deep model could not predict certain important risk factors.
In addition, Kannadasan et al. [26] created a deep neural network model to forecast T2D using stacked autoencoders. The model's performance was assessed on PIDD with metrics such as recall, F1-score, precision, accuracy, and specificity. Features are extracted from the dataset through a stacked autoencoder, and the dataset is then classified using a softmax layer. Tests were conducted in two scenarios: the first used fine-tuning, while the second did not. During the final classification stage, fine-tuning employs supervised backpropagation on the training dataset to enhance performance. Compared with various other models and state-of-the-art techniques, the proposed model came out ahead, achieving an accuracy of 86.26%. The authors did not pre-process the dataset.
Deshmukh and Fadewar [27] utilised a hybrid fuzzy deep learning method based on a deep CNN to identify diabetes. The approach converts the data into matrices to meet the input requirements of a CNN: after fuzzification, every data point is converted into a 5 × 5 matrix whose rows represent the values and whose columns represent the characteristics. PIDD was used to evaluate performance. The CNN consisted of convolutional layers with a 3 × 3 kernel size and pooling layers with a 2 × 2 kernel size. The results show that a CNN with fuzzification is more effective than traditional neural networks for identifying diabetes. However, the authors did not pre-process the dataset.
Ashiquzzaman et al. [28] created a deep neural network model for detecting diabetes, utilising PIDD. They addressed over-fitting by incorporating a regularisation layer, Dropout. The proposed model includes an input layer, two fully connected layers (FCLs) each followed by a dropout layer, and an output layer. Data enters through the input layer; the first FCL consists of 64 neurons and the second of 23 neurons, both using ELU as the activation function. The final layer consists of a single neuron that uses Softplus as its activation function to produce the decision. The backpropagation algorithm is used to train the model. Based on the findings, the model surpasses the other models discussed in the research; yet the authors did not pre-process the dataset.
Table 1 summarises the existing models for detecting T2D, with publications ranging from 2017 to 2023. We collected several datasets related to T2D detection for our research; the PIDD dataset is widely utilised because it is publicly available. When working with classifiers, it is important to pre-process the dataset by eliminating unnecessary characters and handling missing values, and PIDD contains incomplete and absent values. Rahman et al. [2] replaced missing values with the median, yet the majority of researchers did not pre-process the datasets. Choosing the right features plays a crucial role in determining the effectiveness of machine learning algorithms: the diabetes data is used to train the models, which ultimately leads to accurate outcomes. Several researchers have employed different feature selection methods in models that predict T2D. In [19], Feature Importance (FI) was utilised on PIDD to choose four features from a total of eight, and the Boruta algorithm was employed in [2] on PIDD to choose five features out of eight. None of the previous studies utilised rough set theory for detecting T2D.
| Ref. | Year | Dataset | Features | Subjects | Technique | Feature selection | Pre-processing | Accuracy |
|---|---|---|---|---|---|---|---|---|
| [16] | 2023 | Survey data gathered from rural and urban communities in Bangladesh | Many features | 738 | DT, RF, SVM, LR and KNN | PCA and IG | Remove outliers and fill records with missing values | 82.2% |
| [17] | 2023 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | KNN | K-means with harmony search, particle swarm optimisation, and genetic algorithm | - | 91.65% |
| [18] | 2022 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | MLP, DT, KNN, RF | Correlation, IG and PCA | Remove outliers, fill missing values with the mean | 79.8% |
| [2] | 2020 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | Convolutional LSTM | Boruta algorithm to select 5 features out of 8 | Replace missing values with the median | 97.26% |
| [19] | 2020 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | DNN | Feature Importance to select 4 out of 8 | Eliminate empty, redundant or ambiguous data | 98.16% |
| [20] | 2020 | 1-PIDD; 2-DTD | 1-PC, PG, BP, ST, 2HSI, BMI, DPF, Age; 2-Age, BS_fast, BS_PP, Plasma_R, Plasma_F, HbA1c, Type | 1-768; 2-1009 | DNN | - | Split data into training and testing data | 1-PIDD: 99.41%; 2-DTD: 94.02% |
| [21] | 2020 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | ANN, NB, DT, multilayer feed-forward perceptron | - | Not mentioned | Multilayer feed-forward perceptron: 98.07% |
| [22] | 2019 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | Fuzzy SVM | F-score to select PG and BMI | Remove ST and 2HSI | 89.02% |
| [23] | 2019 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | Deep BNN | - | Apply normalisation technique | 80.8% |
| [24] | 2019 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | MLP | - | Encode the categorical data and independent variables | 86.67% |
| [25] | 2019 | Record data sourced by Practice Fusion from public hospitals' EHRs | 1-Fixed and basic features; 2-Adjustable features; 3-Crossed features | 9948 | Ensemble model | - | - | 84.28% |
| [26] | 2019 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | DNN | - | - | 86.26% |
| [27] | 2018 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | CNN | - | - | 95% |
| [28] | 2017 | PIDD | PC, PG, BP, ST, 2HSI, BMI, DPF, Age | 768 | DNN | - | - | 88.41% |

DT: Decision Tree; SVM: Support Vector Machine; RF: Random Forest; LR: Logistic Regression; PCA: Principal Component Analysis; KNN: k-nearest neighbours; IG: Information Gain; PC: Pregnancy Count; PIDD: Pima Indian Diabetes Dataset; PG: Plasma Glucose; ST: Skin Thickness; BP: Blood Pressure; 2HSI: 2-Hour Serum Insulin; DPF: Diabetes Pedigree Function; BMI: Body Mass Index; MLP: Multi-Layer Perceptron; DNN: Deep Neural Network; DTD: Diabetes Type Dataset; LSTM: Long Short-Term Memory; BS_fast: fasting blood sugar; Plasma_R: randomly taken plasma glucose test; BS_PP: blood sugar 90 minutes post meal; Plasma_F: plasma glucose test typically taken at daybreak; ANN: Artificial Neural Network; HbA1c: Haemoglobin A1c test; NB: Naive Bayes; BNN: Belief Neural Network; CNN: Convolutional Neural Network; EHR: Electronic Health Records.
The proposed Diabetes Type Two Detection (DTTD) Rough-Neuro model combines a rough set attribute reduction algorithm with a neural network classifier for detecting T2D. Figure 1 illustrates the proposed model, consisting of two primary phases: the rough set phase and the neural network classifier phase.
The dataset used in this study is the Pima Indians Diabetes Database (PIDD), a commonly utilised dataset acquired from the University of California Irvine (UCI) machine learning repository. It includes medical information for 768 individuals aged 21 years or above, of whom 268 have been diagnosed with diabetes. The dataset contains eight predictor variables: Pregnancy, Blood Pressure, Glucose, Skin Thickness, Body Mass Index (BMI), Diabetes Pedigree Function, Age, and Insulin. The target variable, "Outcome", signifies whether a patient is diagnosed with diabetes [2]. During pregnancy, blood sugar levels can increase, leading to diabetes-related complications. An elevated blood glucose level is a crucial sign of diabetes, and elevated blood sugar can raise blood pressure, another key indicator. T2D is more common in individuals who are overweight, so obesity greatly raises the likelihood of developing it. An imbalance in insulin levels is an important indicator of diabetes, and the diabetes pedigree function is valuable because diabetes can be hereditary. Individuals with insulin-dependent diabetes show signs of skin thickening, and the risk of T2D increases with age, particularly after 45. All of these characteristics are therefore relevant for identifying T2D. The outcome column takes the value 1 for "tested positive for diabetes" and 0 for "tested negative for diabetes". The dataset is briefly described in Table 2 [2]. PIDD contains missing values in some attributes, namely Glucose, Blood Pressure, Skin Thickness, Insulin, and BMI: for these attributes (unlike Pregnancy, where zero is a valid count), a zero in the minimum-value column indicates missing entries.
| No. | Attribute | Description | Minimum value | Maximum value |
|---|---|---|---|---|
| 1 | Pregnancy | The number of a participant's pregnancies | 0 | 17 |
| 2 | Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test | 0 | 199 |
| 3 | Blood pressure | Diastolic blood pressure, the pressure exerted on the arteries between heartbeats (mm Hg) | 0 | 122 |
| 4 | Skin Thickness | Triceps skinfold thickness (mm), determined by the collagen content | 0 | 99 |
| 5 | Insulin | 2-hour serum insulin (mu U/mL) | 0 | 846 |
| 6 | Body Mass Index | Body mass index (weight in kg/(height in m)²) | 0 | 67.1 |
| 7 | Diabetes Pedigree Function | A function that scores the likelihood of diabetes based on family history | 0.078 | 2.42 |
| 8 | Age | Participant's age (years) | 21 | 81 |
| 9 | Outcome | Diabetes class variable: 1 confirms diabetes in a patient, 0 represents its absence | 0 | 1 |
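Because these zeros encode missing measurements, a typical first preprocessing step is to convert them to explicit missing values before modelling. The following is a hedged Python sketch of that step (the file name is an assumption, and the column names follow the common public CSV layout rather than anything specified in this paper); median imputation mirrors the choice made in [2]:

```python
import numpy as np
import pandas as pd

# Columns where a zero is physiologically impossible and thus encodes
# a missing measurement (Pregnancy legitimately allows zero).
cols_with_hidden_missing = ["Glucose", "BloodPressure", "SkinThickness",
                            "Insulin", "BMI"]

df = pd.read_csv("pima-indians-diabetes.csv")  # hypothetical local copy of PIDD

# Mark the hidden missing values explicitly before any imputation.
df[cols_with_hidden_missing] = df[cols_with_hidden_missing].replace(0, np.nan)

# One common choice (used by Rahman et al. [2]) is median imputation.
df = df.fillna(df.median(numeric_only=True))
print(df.describe().loc[["min", "max"]])
```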
A. Phase 1: Rough Set

Table 3 shows a sample of the PIDD decision table with its condition attributes labelled (a)-(h) and the decision attribute Outcome labelled (i). Table 4 shows two example reducts derived from this sample: {c, e, g} and {b, f, h}.
| No. | Pregnancies (a) | Glucose (b) | Blood Pressure (c) | Skin Thickness (d) | Insulin (e) | Body Mass Index (f) | Diabetes Pedigree Function (g) | Age (h) | Outcome (i) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 (Diabetic) |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 (Non Diabetic) |
| 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 (Diabetic) |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 (Non Diabetic) |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 (Diabetic) |
| 6 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 (Non Diabetic) |
Reduct {c, e, g}:

| No. | c | e | g | i |
|---|---|---|---|---|
| 1 | 72 | 0 | 0.627 | 1 (Diabetic) |
| 2 | 66 | 0 | 0.351 | 0 (Non Diabetic) |
| 3 | 64 | 0 | 0.672 | 1 (Diabetic) |
| 4 | 66 | 94 | 0.167 | 0 (Non Diabetic) |
| 5 | 40 | 168 | 2.288 | 1 (Diabetic) |
| 6 | 74 | 0 | 0.201 | 0 (Non Diabetic) |

Reduct {b, f, h}:

| No. | b | f | h | i |
|---|---|---|---|---|
| 1 | 148 | 33.6 | 50 | 1 (Diabetic) |
| 2 | 85 | 26.6 | 31 | 0 (Non Diabetic) |
| 3 | 183 | 23.3 | 32 | 1 (Diabetic) |
| 4 | 89 | 28.1 | 21 | 0 (Non Diabetic) |
| 5 | 137 | 43.1 | 33 | 1 (Diabetic) |
| 6 | 116 | 25.6 | 30 | 0 (Non Diabetic) |
Reducts such as these are the candidates from which the most suitable reduction is chosen in this research context.
B. Phase 2: Classifier using Neural Networks
To confirm the practicality of the reduction, a classifier constructed with an MLP was put into action. To find the optimal setup, a series of experiments was carried out on the network's configuration and activation functions [6].
Rough sets help cope with incomplete data and decrease the input size of neural networks, thereby reducing the network's training time and storage needs. Here, the DTTD Rough-Neuro model's performance is assessed for each phase and for the complete model, using different assessment criteria at each stage.
A. Phase 1: Rough Set
To diminish the number of measured attributes, the Rough Set Attribute Reduction (RSAR) algorithms available in ROSETTA, the toolkit developed by Aleksander Øhrn, were employed; each of the four reduction algorithms generated its own reducts and rules [6]. The data, in CSV (comma-separated values) format, was imported into ROSETTA using Microsoft Open Database Connectivity (ODBC) and processed with each reduction algorithm [6]. Table 5 shows the number of rules and reducts generated by each algorithm for PIDD: the SAVGeneticReducer produced the highest numbers, while the ManualReducer produced the lowest.
| No. | Reduction algorithm | No. of reducts | No. of rules |
|---|---|---|---|
| 1 | SAVGeneticReducer | 39 | 29,271 |
| 2 | Holte1RReducer | 8 | 1,250 |
| 3 | JohnsonReducer | 1 | 768 |
| 4 | ManualReducer | 1 | 755 |
As per Table 6, the ManualReducer generates the smallest number of rules but also exhibits the lowest support, making it unsuitable for inclusion among the available options. With the ManualReducer excluded, the JohnsonReducer has the fewest rules and the highest support, which makes it the most suitable choice for the proposed model. As Table 6 shows, the JohnsonReducer yields a reduced attribute set (Pregnancies, Glucose, and Diabetes Pedigree Function) that produces the same outcome as the original attributes while cutting the inputs from eight to three, a reduction of roughly 63%. Figure 2 illustrates the pseudocode of the JohnsonReducer.
| No. | Reduction algorithm | No. of reducts | No. of rules | Cardinality of reduct | Support of reduct |
|---|---|---|---|---|---|
| 1 | JohnsonReducer | 1 | 768 | 3 | 100 |
| 2 | SAVGeneticReducer | 39 | 29,271 | 3, 4 | 100 |
| 3 | ManualReducer | 1 | 755 | 3 | 0 |
| 4 | Holte1RReducer | 8 | 1,250 | 1 | 1 |
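The JohnsonReducer pseudocode in Figure 2 follows Johnson's classic greedy set-cover heuristic over a discernibility structure. The sketch below is an illustrative reconstruction of that heuristic, not ROSETTA's exact implementation; note that on raw continuous attributes almost any single attribute discerns all object pairs, so in practice discretisation precedes this step:

```python
from collections import Counter

def johnson_reduct(rows, cond_attrs, dec_attr):
    """Greedy Johnson heuristic: repeatedly pick the attribute that
    discerns the most not-yet-discerned pairs of objects having
    different decision values."""
    # Discernibility entries: for each pair of objects with different
    # decisions, the set of condition attributes on which they differ.
    entries = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i][dec_attr] != rows[j][dec_attr]:
                diff = frozenset(a for a in cond_attrs
                                 if rows[i][a] != rows[j][a])
                if diff:
                    entries.append(diff)
    reduct = []
    while entries:
        # The attribute appearing in the most remaining entries wins.
        counts = Counter(a for entry in entries for a in entry)
        best = counts.most_common(1)[0][0]
        reduct.append(best)
        entries = [entry for entry in entries if best not in entry]
    return reduct
```

Because the greedy loop returns a single attribute subset, this is consistent with the single reduct of cardinality 3 that ROSETTA's JohnsonReducer reports in Table 6.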
B. Phase 2: Neural Network Classifier
The performance of the proposed model was assessed based on accuracy (ACC), the percentage of patients correctly classified by the model, as shown in Eq. (1): \[ACC = \frac{TP + TN}{TP + TN + FP + FN}\tag{1}\] TP (true positive) is the number of patients identified as positive who are truly positive, and TN (true negative) is the number of patients predicted negative who indeed test negative. Classifying patients as positive when they are actually negative is a false positive (FP), while classifying patients who are actually positive as negative is a false negative (FN). These quantities are commonly used to assess a model's classification accuracy [2].
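As a trivial sanity check of Eq. (1), accuracy can be computed directly from confusion-matrix counts (the numbers below are made-up examples, not the paper's results):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. (1): the fraction of correctly classified patients."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for illustration only.
print(accuracy(tp=50, tn=80, fp=12, fn=14))  # 0.8333...
```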
This section discusses the accuracy of the MLP classifier with varying numbers of neurons in the hidden layer, comparing two scenarios: before reduction and after reduction. Two learning rates were used, 0.01 and 0.001, and for each scenario the classifiers were trained for 500, 1000, and 1500 epochs.
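The paper does not specify its neural network implementation; the following scikit-learn sketch shows one plausible way to run this experiment grid (the function name, split ratio, and hidden-neuron range are assumptions for illustration):

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def run_grid(X, y):
    """Train single-hidden-layer MLPs over the grid of hidden-neuron
    counts, learning rates, and epoch budgets described above, and
    return the test accuracy for each configuration. X holds either
    all 8 PIDD features or the 3-feature Johnson reduct."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    results = {}
    for hidden, lr, epochs in product(range(3, 8), [0.01, 0.001],
                                      [500, 1000, 1500]):
        clf = MLPClassifier(hidden_layer_sizes=(hidden,),
                            learning_rate_init=lr,
                            max_iter=epochs,  # epochs over the data for SGD
                            solver="sgd",
                            random_state=0)
        clf.fit(X_tr, y_tr)
        results[(hidden, lr, epochs)] = clf.score(X_te, y_te)
    return results
```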
According to Figure 3, before reduction the top accuracy is 78.1%, achieved by a classifier with 3 hidden neurons trained for 1000 epochs. After reduction, the highest accuracy is 78.61%, achieved by a classifier with 7 hidden neurons trained for 1000 epochs (Figure 4). The 0.51-point difference between the two values is small, indicating that the reduction preserved, and here even slightly improved, the result while reducing training time and storage requirements.
At 500 epochs, accuracy reached 77.09% with 4 hidden neurons without reduction, whereas with reduction it rose to 78.36% with 3 hidden neurons (Figure 5). The 1.27-point difference shows that the reduction slightly improved the outcome in this setting while also decreasing training time and storage requirements.
At 1000 epochs, an accuracy of 78.1% was achieved with 3 hidden neurons before reduction, while 78.61% was achieved with 7 hidden neurons after reduction (Figure 6). The 0.51-point difference again indicates that the reduction preserved the result while decreasing training time and storage requirements.
At 1500 epochs, accuracy reached 77.85% without reduction, using 3 hidden neurons, and dropped slightly to 77.09% with reduction, still using 3 hidden neurons (Figure 7). The 0.76-point discrepancy is insignificant: although the reduction did not enhance the outcome here, it decreased both training time and storage requirements.
According to Figure 8, the top accuracy before reduction was 79.37%, achieved by a classifier with 3 hidden neurons trained for 500 epochs. After reduction, the highest accuracy was 78.48%, achieved with 3 hidden neurons trained for 1000 epochs (Figure 9). The 0.89-point discrepancy is insignificant, indicating that while the reduction did not enhance the outcome, it saved training time and storage.
At 500 epochs, the highest accuracy was 79.37% with 3 hidden neurons before reduction and 77.22% with 5 hidden neurons after reduction (Figure 10). The 2.15-point difference was considered insignificant: the reduction did not enhance the result, but it did reduce training time and storage requirements.
At 1000 epochs, an accuracy of 79.11% was achieved with 6 hidden neurons before reduction, and 78.48% with 3 hidden neurons after reduction (Figure 11). The 0.63-point difference is inconsequential: although the reduction did not improve the outcome, it reduced training time and storage demands.
At 1500 epochs, the pre-reduction scenario peaked at 78.23% accuracy with 3 hidden neurons, while the post-reduction scenario peaked at 77.34% with 5 hidden neurons (Figure 12). The insignificant 0.89-point difference suggests that while the reduction did not improve the outcome, it did reduce training time and storage requirements.
Based on the data above, accuracy is essentially preserved when the number of inputs is decreased: practically the same outcome can be achieved with a smaller neural network (3 input, 7 hidden, 1 output neurons) at a learning rate of 0.01 trained for 1000 epochs, with significant savings in time and storage.
Various factors may have contributed to the lower classification accuracy (79%) of the proposed approach compared with the studies in Table 1 that report accuracies above 90% on the PIDD dataset, including differences in data preprocessing, feature selection, model selection, and hyperparameter tuning.
Accordingly, to enhance classification accuracy, researchers should thoroughly analyse these factors, and they can also explore more sophisticated machine learning approaches and ensembling techniques. It is equally important to report methods and results clearly, including how class imbalance was addressed and how cross-validation was conducted, to fully grasp the model's performance.
The study introduced a Rough-Neuro classification model that uses a two-stage approach to detect type 2 diabetes. The methodology applies the rough set JohnsonReducer to eliminate redundant attributes, while disease classification is performed by a multi-layer perceptron. The proposed solution is designed to reduce the number of input features, thereby decreasing both the training time and the storage required by the neural network. The outcomes illustrate that reducing the input features shortens neural network training, enhances model performance, and yields a notable 63% decline in storage requirements. The most favourable results were attained by training a compact neural network (3 input, 7 hidden, 1 output neurons) with a learning rate of 0.01 for 1000 epochs, leading to a remarkable reduction in time and storage requirements. Future improvements to the proposed solution involve training the neural network within hybrid models, exploring how various machine learning algorithms can be combined with neural networks, since blending methods can lead to more effective outcomes, and extending the model toward disease progression modelling, forecasting progression and risk factors.
The authors declare no conflict of interest. All authors read and approved the final version of the paper.
All authors contributed equally to this paper.