Prognosis of Diabetes by Performing Data
Mining of HbA1c
Veingus Kehar
Department of Software
Engineering, Mehran University
of Engineering and Technology,
Jamshoro, Pakistan
Sania Bhatti
Department of Software
Engineering, Mehran University
of Engineering and Technology,
Jamshoro, Pakistan
Mohsin Ali Memon
Department of Software
Engineering, Mehran University
of Engineering and Technology,
Jamshoro, Pakistan
ABSTRACT: This paper addresses the prediction of diabetes by applying data mining techniques. Discovering knowledge
in clinical datasets is important for making effective medical diagnoses. The aim of data mining is to extract
knowledge from data stored in a dataset and to produce a clear and understandable description of patterns. Diabetes
is a chronic disease and a major public health challenge worldwide. Using data mining techniques on HbA1c test data
to help predict diabetes has gained significant popularity. In this paper, six classification models are used to
classify patients as diabetic or non-diabetic and as male or female. The dataset was collected from the Diagnostics
and Research Laboratory, Liaquat University of Medical and Health Sciences, Jamshoro, which records data for patients
with and without diabetes by taking a blood sample and performing the HbA1c test. We used the Weka tool for the
diabetic versus non-diabetic analysis. Of the six classification algorithms, four achieve 100% accuracy on both the
train and test data.
KEY WORDS: Data mining, Diabetes, HbA1c, Classification models, Weka.
I. INTRODUCTION
HbA1c is a term associated with diabetes: it indicates how much glucose is present in the blood, and diabetes is
diagnosed by measuring HbA1c, also known as glycohemoglobin. It gives medical technologists a thorough picture of a
patient's average blood sugar level over a period of weeks or months. This matters for diabetic patients because the
higher the HbA1c, the greater the risk of diabetes-related complications. HbA1c is sometimes also termed hemoglobin
A1C or simply A1C. It is now officially adopted in many countries as a diagnostic test for (type 2) diabetes. In the
diagnosis of diabetes, we are fundamentally concerned with defining a disease state rather than establishing a
reference interval for health. Analysis of glycated hemoglobin (HbA1c) in blood provides evidence about a person's
average blood glucose levels over the past few months, which corresponds to the expected half-life of red blood cells
(RBCs). HbA1c is now recommended as a standard of care (SOC) for testing and monitoring diabetes, especially type 2
diabetes [1].
Data analysis [2] is the process of analyzing large datasets from a wide variety of fields, including health care,
satellite imagery, agricultural imagery, biodiversity and many more. In this paper we apply the analysis process
through machine learning algorithms and focus on medical data. Specifically, we use six classification algorithms to
classify patients as diabetic or non-diabetic. The accuracy and root mean square error of all the algorithms are
also calculated. The rest of the paper is organized as follows. Section 2 presents the work of other researchers.
Section 3 discusses the methodology, the dataset and the tool used for analysis. Section 4 presents implementation
details. Section 5 outlines the results and, finally, Section 6 gives the concluding remarks.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 1, January 2020
1 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
II. RELATED WORK
Several efforts have been made to evaluate the performance of classification techniques on clinical datasets,
particularly for diabetes [3]. In study [4], a comparison of three different methods, Neural Network, Support Vector
Machine (SVM) and Multilayer Perceptron, was reported for a diabetes dataset. The results indicated that SVM gave
better accuracy than Neural Network and Multilayer Perceptron.
An effective predictive machine learning technique for a diabetes dataset, using several classifiers available in the
WEKA and RapidMiner data mining tools, was addressed in [5], with the SVM classifier yielding better accuracy.
Moreover, an accuracy of 80.41% for classification between two classes (absence or presence of diabetes) was reported
in [6]. The study in [7] created models for diabetes prediction using stream associative classification and
association rules and compared them with predictive rules mined with decision trees.
In another study, Decision List, K-NN and Naïve Bayes were used for the classification of diabetes and the accuracy
of the models was compared. Naïve Bayes was reported as the better classifier, with an accuracy of 52.33% [8]. Another
study [9] focused on three popular data mining classification algorithms, Decision Tree, Naïve Bayes and K-NN, and
compared their accuracy on the highly scattered Cleveland Diabetes database. The dataset was further partitioned into
three distinct cases and every classifier was applied to the scattered datasets. The K-NN classifier was observed to
outperform the other two classifiers (Decision Tree and Naïve Bayes) [9]. Three well-known data mining classification
algorithms, CART, ID3 and Decision Table, were reported in study [10]; the accuracy of each model on the Cleveland
diabetes database was evaluated using 10-fold cross-validation. The results showed that CART outperformed the other
methods considered [10].
The study in [11] compared 10 different classification algorithms, including Naïve Bayes, Decision Tree, Decision
Stump, K-NN, Random Forest (RF), Rule Induction, CHAID, Neural Network and SVM. The results revealed that Naïve Bayes
and SVM performed better for the prediction and detection of diabetes [11]. This study considers Decision Tree (DT),
Naïve Bayes (NB), Single Conjunctive Rule Learner (SCRL), Radial Basis Function (RBF), K-Nearest Neighbor (KNN),
Multilayer Perceptron (MLP), RF and SVM for the heart disease dataset. The reason for using these algorithms is that
almost all aspects of supervised learning approaches are considered. The experimental results therefore cover a wider
range of supervised learning algorithms for diverse healthcare data (i.e., diabetes). Further, this study also
combines ensemble methods with the considered classification methods to achieve better accuracy.
III. METHODOLOGY AND DATASET
Figure 1 outlines the steps of the methodology followed in this research work. The major steps are dataset
collection, identification of attributes, implementation of six classification algorithms and performance comparison
of those algorithms. The dataset was collected from the Diagnostics and Research Laboratory, Liaquat University of
Medical and Health Sciences, Jamshoro, which records data for patients with and without diabetes by taking a blood
sample and performing the HbA1c test. We used the Weka tool for the diabetes analysis. A sample of the dataset is
shown in Figure 2. The dataset consists of the HbA1c reports of 8,524 patients.
Figure 1. Methodology (flowchart steps, in order): Diabetic patient data collection (LUMHS); Preprocessing; Identification of attributes from HbA1c; Implementation of classification algorithms; Performance comparison.
The diabetes data was split into train and test sets using a 70%/30% ratio respectively. The train set contained
8,524 instances and the test set 3,655. The key attributes used during modeling were “Sex”, “Age”, “Result”, and “Class”.
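The reported split sizes can be checked with a short calculation, and a shuffled hold-out split can be sketched in a few lines. This is a minimal Python sketch, not the Weka procedure used in this work; the helper names `train_share` and `holdout_split`, the seed and the toy records are illustrative assumptions.

```python
import random

# Split sizes reported in the text.
TRAIN_SIZE = 8524
TEST_SIZE = 3655


def train_share(train_n, test_n):
    """Fraction of all instances that fall in the train set."""
    return train_n / (train_n + test_n)


def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into train and test parts.

    Hypothetical helper: the paper does not state how the split was drawn.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


print(f"train share: {train_share(TRAIN_SIZE, TEST_SIZE):.3f}")

toy_records = list(range(10))  # stand-in for patient rows
train, test = holdout_split(toy_records)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible, which matters when several classifiers are compared on the same partition.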
Table I: Model attributes of data set used in modeling
Model attribute | Scale of Measurement
Sex | Nominal
Age | Numeric
Result | Numeric
Class | Nominal
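The four attributes of Table I map onto a small record type. The following Python sketch is only one possible representation; the field names, the label strings and the example values are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PatientRecord:
    sex: str       # nominal, e.g. "M" or "F" (label values are assumed)
    age: float     # numeric
    result: float  # numeric HbA1c result
    label: str     # nominal class attribute ("Class" in Table I)


def scale_of(field_name):
    """Return the scale of measurement for a field, following Table I."""
    field_types = {f.name: f.type for f in fields(PatientRecord)}
    return "Nominal" if field_types[field_name] is str else "Numeric"


# Hypothetical example row, for illustration only.
example = PatientRecord(sex="F", age=52, result=7.1, label="diabetic")
print([(f.name, scale_of(f.name)) for f in fields(PatientRecord)])
```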
Figure 2. Sample dataset of patients
IV. IMPLEMENTATION
In this work we used the following six classification models to classify patients as diabetic or non-diabetic and as
male or female: Bayesian classifier, J-48 decision tree, Naïve Bayes, Multilayer Perceptron, SVM and RF.
The confusion matrices computed using WEKA Explorer for the train and test sets are shown in Table II and Table III
respectively [12].
True Positive (TP): patients with diabetes who are predicted to have diabetes.
True Negative (TN): patients without diabetes who are predicted to have no diabetes.
False Positive (FP): patients without diabetes who are predicted to have diabetes.
False Negative (FN): patients with diabetes who are predicted to have no diabetes.
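The four definitions above can be turned into a small counting routine. This is a minimal Python sketch; the function name `confusion_counts`, the label strings and the five toy predictions are illustrative assumptions and do not come from the dataset.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="diabetic"):
    """Count TP, TN, FP and FN for a binary problem."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            # Actual positive: correct guess is TP, otherwise FN.
            counts["TP" if guess == positive else "FN"] += 1
        else:
            # Actual negative: a positive guess is FP, otherwise TN.
            counts["FP" if guess == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Toy labels, for illustration only.
actual = ["diabetic", "diabetic", "non-diabetic", "non-diabetic", "diabetic"]
predicted = ["diabetic", "non-diabetic", "non-diabetic", "diabetic", "diabetic"]
print(confusion_counts(actual, predicted))
```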
Table II: Confusion Matrix of Train Set
Model | True Negative | False Positive | False Negative | True Positive
Bayesian Classifier | 4,531 | 0 | 0 | 3,993
J-48 Decision Tree | 4,531 | 0 | 0 | 3,993
Naïve Bayes | 4,336 | 195 | 55 | 3,938
Multilayer Perceptron | 4,531 | 0 | 0 | 3,993
SVM | 4,463 | 68 | 0 | 3,993
Random Forest | 4,531 | 0 | 0 | 3,993
Table III: Confusion Matrix of Test Set
Model | True Negative | False Positive | False Negative | True Positive
Bayesian Classifier | 1,906 | 0 | 0 | 1,749
J-48 Decision Tree | 1,906 | 0 | 0 | 1,749
Naïve Bayes | 1,839 | 67 | 0 | 1,726
Multilayer Perceptron | 1,906 | 0 | 0 | 1,749
SVM | 1,889 | 0 | 0 | 1,749
Random Forest | 1,906 | 0 | 0 | 1,749
All six classification algorithms were also used to classify patients as male or female. The confusion matrices
computed using WEKA Explorer for the train and test sets are shown in Table IV and Table V.
True positive (TP): the male patients are predicted male.
True Negative (TN): the female patients are predicted female.
False Positive (FP): the female patients are predicted male.
False Negative (FN): the male patients are predicted female.
Table IV: Confusion Matrix of Train Set Data
Model | True Negative | False Positive | False Negative | True Positive
Bayesian Classifier | 2,396 | 1,997 | 1,649 | 2,482
J-48 Decision Tree | 2,541 | 1,852 | 1,437 | 2,694
Naïve Bayes | 2,483 | 1,910 | 1,856 | 2,275
Multilayer Perceptron | 2,969 | 1,424 | 2,216 | 1,915
Random Forest | 3,449 | 944 | 1,430 | 2,701
Table V: Confusion Matrix of Test Set Data
Model | True Negative | False Positive | False Negative | True Positive
Bayesian Classifier | 1,065 | 771 | 713 | 1,106
J-48 Decision Tree | 1,078 | 758 | 647 | 1,172
Naïve Bayes | 1,088 | 748 | 811 | 1,008
Multilayer Perceptron | 1,270 | 566 | 1,005 | 814
Random Forest | 1,174 | 662 | 925 | 894
V. RESULTS AND DISCUSSION
The six classifiers are compared on three measures: accuracy, the Kappa statistic and RMSE. Accuracy is calculated
using equation (1):
$$\text{Accuracy} = \frac{TP + TN}{P + N} \qquad (1)$$
where P = TP + FN and N = FP + TN.
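As a worked check of equation (1), the accuracy of two classifiers can be recomputed from their train-set confusion matrices in Table II. This is a minimal Python sketch; the function name `accuracy` and the dictionary layout are illustrative choices, while the counts are copied from Table II.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (P + N), with P = TP + FN and N = FP + TN."""
    p = tp + fn
    n = fp + tn
    return (tp + tn) / (p + n)


# Train-set rows copied from Table II.
naive_bayes = dict(tn=4336, fp=195, fn=55, tp=3938)
svm = dict(tn=4463, fp=68, fn=0, tp=3993)

for name, row in (("Naïve Bayes", naive_bayes), ("SVM", svm)):
    print(f"{name}: {accuracy(**row) * 100:.1f}%")
```

Both values agree with the train-set accuracies reported for these two models in Table VI.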
Kappa measures the proportion of data values on the main diagonal of the confusion matrix and then adjusts this value
for the amount of agreement that could be expected due to chance alone.
To compute Kappa, the observed level of agreement is calculated first:
$$p_o = \frac{TP + TN}{P + N}$$
This value is then compared with the agreement that would be expected if the two raters (here, the actual class and
the predicted class) were totally independent:
$$p_e = \frac{P\,(TP + FP) + N\,(TN + FN)}{(P + N)^2}$$
The value of Kappa is defined as
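$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (2)$$
where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.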
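The Kappa statistic can be recomputed from the four confusion-matrix counts. The Python sketch below uses the standard Cohen's kappa formula for a two-class table; the function name `cohen_kappa` is an illustrative choice, and the counts are the train-set rows of Table II.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2x2 confusion matrix.

    observed: share of instances on the main diagonal.
    expected: agreement expected if actual and predicted classes were independent.
    """
    total = tp + tn + fp + fn
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return (observed - expected) / (1 - expected)


# Train-set rows copied from Table II.
naive_bayes = dict(tn=4336, fp=195, fn=55, tp=3938)
svm = dict(tn=4463, fp=68, fn=0, tp=3993)

print(f"Naïve Bayes kappa: {cohen_kappa(**naive_bayes):.2f}")
print(f"SVM kappa: {cohen_kappa(**svm):.2f}")
```

The formula is undefined when the expected agreement equals 1, which happens only if every instance belongs to one class and is predicted as that class.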
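Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how
far data points lie from the regression line; in other words, RMSE tells how concentrated the data is around the line
of best fit [13].
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad (3)$$
where $n$ is the number of instances, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for instance $i$.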
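RMSE can be illustrated with a few lines of Python. This sketch assumes that each prediction is a probability for the positive class and that the actual value is coded as 0 or 1; the four example values are hypothetical, and the exact way Weka derives its RMSE figure is not described in this paper.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual values and predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: 1 = diabetic, 0 = non-diabetic.
actual = [1, 0, 1, 0]
predicted_probability = [0.9, 0.1, 0.8, 0.3]
print(f"RMSE: {rmse(actual, predicted_probability):.4f}")
```

A classifier that assigns probability 1 to the correct class for every instance has an RMSE of 0, which is why values close to zero accompany the 100% accuracy rows.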
The accuracy, Kappa statistic and RMSE of the six classification algorithms for classifying diabetic and non-diabetic
patients on the train and test data are shown in Table VI and Table VII respectively. According to the experimental
results, the Bayesian classifier, J-48 decision tree, Multilayer Perceptron and Random Forest classify 100% of
instances correctly, whereas Naïve Bayes and SVM achieve about 97% and 99% accuracy respectively. The Kappa statistic
and RMSE values show the same pattern, with the highest Kappa and the lowest RMSE for these four classifiers.
Table VI: Performance Comparison of six models (Train data) (diabetic and non-diabetic)
Model | Accuracy | Kappa Statistic | Correctly Classified Instances | Incorrectly Classified Instances | RMSE
Bayesian Classifier | 100.0% | 1.00 | 8,524 | 0 | 0.0002
J-48 Decision Tree | 100.0% | 1.00 | 8,524 | 0 | 0.0000
Naïve Bayes | 97.1% | 0.94 | 8,274 | 250 | 0.1713
Multilayer Perceptron | 100.0% | 1.00 | 8,524 | 0 | 0.0000
SVM | 99.2% | 0.98 | 8,456 | 68 | 0.0893
Random Forest | 100.0% | 1.00 | 8,524 | 0 | 0.0002
Table VII: Performance Comparison of six models (Test data) (diabetic and non-diabetic)
Model | Accuracy | Kappa Statistic | Correctly Classified Instances | Incorrectly Classified Instances | RMSE
Bayesian Classifier | 100.0% | 1.00 | 3,655 | 0 | 0.0002
J-48 Decision Tree | 100.0% | 1.00 | 3,655 | 0 | 0.0000
Naïve Bayes | 97.5% | 0.95 | 3,565 | 90 | 0.1667
Multilayer Perceptron | 100.0% | 1.00 | 3,655 | 0 | 0.0026
SVM | 99.5% | 0.99 | 3,638 | 17 | 0.0682
Random Forest | 100.0% | 1.00 | 3,655 | 0 | 0.0000
The performance metrics used to compare the models show no sign of underfitting or overfitting, which is a positive
result for all the models. The Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron and RF models achieved
the highest accuracy (100%) on both the train and test sets. Naïve Bayes and SVM achieved accuracies of 97.1% and
99.2% on the train data and 97.5% and 99.5% on the test data respectively.
In addition, the Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron and RF models had the lowest Root
Mean Square Error. All the models performed well in classifying diabetic and non-diabetic patients, but these four
models are clearly the best for distinguishing patients with diabetes from those without. All the models had a high
Area under the Receiver Operating Characteristic curve (>0.9), which indicates strong classification performance.
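The comparison of train and test accuracy behind the overfitting remark can be made explicit. In the Python sketch below the accuracies are copied from Tables VI and VII; the helper names and the 1.0 percentage point tolerance are illustrative assumptions rather than criteria used in this paper.

```python
# Accuracy (%) as (train, test), copied from Tables VI and VII.
ACCURACY = {
    "Bayesian Classifier": (100.0, 100.0),
    "J-48 Decision Tree": (100.0, 100.0),
    "Naïve Bayes": (97.1, 97.5),
    "Multilayer Perceptron": (100.0, 100.0),
    "SVM": (99.2, 99.5),
    "Random Forest": (100.0, 100.0),
}


def train_test_gaps(table):
    """Return train accuracy minus test accuracy, in percentage points."""
    return {model: round(train - test, 1) for model, (train, test) in table.items()}


def flag_overfitting(gaps, tolerance=1.0):
    """List models whose train accuracy exceeds test accuracy by more than tolerance.

    The tolerance is an assumed threshold, chosen only for illustration.
    """
    return [model for model, gap in gaps.items() if gap > tolerance]


gaps = train_test_gaps(ACCURACY)
print(gaps)
print("flagged:", flag_overfitting(gaps))
```

A small gap is only one indication of good generalization; it does not by itself rule out other problems with the evaluation setup.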
Figure 3 and Figure 4 show the accuracy comparison of the six classification models for classifying male and female
patients on the training and testing data respectively. The comparison shows that Random Forest gives the highest
accuracy among the classifiers for classifying male and female patients. This classification also indicates that
there are more male diabetic patients than female diabetic patients.
Figure 3. Accuracy comparison of six classification models to classify male and female patients (Training data)
Figure 4. Accuracy comparison of six classification models to classify male and female patients (Testing data)
VI. CONCLUSION
In this paper we have performed data mining using classification algorithms. The HbA1c test dataset used in this
work was collected from the Diagnostics and Research Laboratory, LUMHS, Hyderabad. The HbA1c results show that many
patients were prediabetic and fewer patients had diabetes; since this test helps to predict diabetes, such patients
can take steps to avoid becoming diabetic in the future. The classification experiments indicate that there are more
male diabetic patients than female diabetic patients. In both classification experiments, the Random Forest model
shows the highest accuracy.
[Figure 3 chart, Train Data: accuracy axis from 0 to 100; bars for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM and Random Forest]
[Figure 4 chart, Test Data: accuracy axis from 0 to 100; bars for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM and Random Forest]
REFERENCES
[1] Hemoglobin A1c (HbA1c) Test for Diabetes
Available: https://www.webmd.com/diabetes/guide/glycated-hemoglobin-test-hba1c
[2] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining Concepts and Techniques” Third edition.
[3] B. L. Shivakumar, and S. Alby. "A survey on data-mining technologies for prediction and diagnosis of diabetes." 2014 IEEE International
Conference on Intelligent Computing Applications (ICICA), 2014.
[4] M. Kumari, V. Rajan and A. Anshul, "Prediction of Diabetes Using Bayesian Network.", International Journal of Computer Science and
Information Technologies, Vol. 5 (4), pp. 5174-5178, 2014.
[5] Balpande, R. Vrushali and R. D. Wajgi. "Prediction and severity estimation of diabetes using data mining technique." IEEE International
Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017.
[6] M., Komi, et al. "Application of data mining methods in diabetes prediction." 2017 IEEE 2nd International Conference on Image, Vision and
Computing (ICIVC), 2017.
[7] H. A., Madni, A. Zahid, and A. S., Munam, "Data mining techniques and applications—A decade review." 23rd IEEE International Conference
on Automation and Computing (ICAC), 2017.
[8] Marcano-Cedeno, Alexis, and Diego Andina. "Data mining for the diagnosis of type 2 diabetes." World Automation Congress (WAC), 2012.
IEEE, 2012.
[9] A. H., Shurrab, and A. Y., Maghari. "Blood diseases detection using data mining techniques." 2017 IEEE 8th International Conference on
Information Technology (ICIT), 2017.
[10] A. Marcano-Cedeno, and D. Andina. "Data mining for the diagnosis of type 2 diabetes." IEEE World Automation Congress (WAC), 2012.
[11] G. L., Beckles, & P. E., Thompson-Reid, “Diabetes and Women’s Health across the Life Stages” 2011.
[12] Derived Measures for a test. Available:
http://www.academicos.ccadet.unam.mx/jorge.marquez/cursos/Instrumentacion/FalsePositive_TrueNegative_etc.pdf
[13] Statistics How To. Available: https://www.statisticshowto.datasciencecentral.com/rmse/
AUTHORS PROFILE
Veingus Kehar is a software engineer at Liaquat University of Medical and Health Sciences, Jamshoro, Sindh, Pakistan. She obtained her B.Eng. in
Software Engineering from Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. She is currently pursuing her
M.Eng. in Software Engineering at IICT, Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. Her research area
is data mining.
Sania Bhatti is working with the Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Sindh,
Pakistan. She obtained her PhD from the University of Leeds, United Kingdom, in 2010 under a faculty development program
scholarship. Her research interests include modelling, simulation, communication networks and machine learning algorithms. She has published
more than twenty national and international journal papers and various international conference papers. She was awarded two research
grants by Microsoft in the field of Artificial Intelligence as principal investigator.
Mohsin Ali Memon is working with the Department of Software Engineering, Mehran UET, Jamshoro, Sindh, Pakistan. He obtained his PhD
degree from the Department of Computer Science, University of Tsukuba, in 2014. His research interests include interaction technologies, life
logging, privacy control methods and machine learning. He received his B.Eng. in Software Engineering and M.Eng. in Information
Technology from Mehran University of Engineering and Technology, Pakistan, in 2006 and 2009, respectively.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 1, January 2020
7 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

PDF
An Experimental Study of Diabetes Disease Prediction System Using Classificat...
PDF
Analysis and Prediction of Diabetes Diseases using Machine Learning Algorithm...
PDF
Machine learning and operations research to find diabetics at risk for readmi...
PDF
Ascendable Clarification for Coronary Illness Prediction using Classification...
PDF
A Hybrid Apporach of Classification Techniques for Predicting Diabetes using ...
PDF
Diabetes Prediction by Supervised and Unsupervised Approaches with Feature Se...
PDF
K-Nearest Neighbours based diagnosis of hyperglycemia
PDF
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
An Experimental Study of Diabetes Disease Prediction System Using Classificat...
Analysis and Prediction of Diabetes Diseases using Machine Learning Algorithm...
Machine learning and operations research to find diabetics at risk for readmi...
Ascendable Clarification for Coronary Illness Prediction using Classification...
A Hybrid Apporach of Classification Techniques for Predicting Diabetes using ...
Diabetes Prediction by Supervised and Unsupervised Approaches with Feature Se...
K-Nearest Neighbours based diagnosis of hyperglycemia
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...

What's hot (20)

PDF
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
PDF
Acute coronary-syndrome-prediction-using-data-mining-techniques--an-application
PDF
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
PDF
IRJET- Diabetes Prediction by Machine Learning over Big Data from Healthc...
PDF
The Analysis of Performace Model Tiered Artificial Neural Network for Assessm...
PPTX
Ai in diabetes management
PDF
IRJET- Predicting Diabetes Disease using Effective Classification Techniques
PDF
Heart Disease Prediction using Machine Learning Algorithm
PDF
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
PDF
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
PDF
IRJET- Heart Failure Risk Prediction using Trained Electronic Health Record
PDF
Estimating the Survival Function of HIV AIDS Patients using Weibull Model
PDF
prediction of heart disease using machine learning algorithms
PDF
Improving the performance of k nearest neighbor algorithm for the classificat...
PDF
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
PDF
Decision Tree Models for Medical Diagnosis
PDF
PERFORMANCE OF DATA MINING TECHNIQUES TO PREDICT IN HEALTHCARE CASE STUDY: CH...
PDF
FORESTALLING GROWTH RATE IN TYPE II DIABETIC PATIENTS USING DATA MINING AND A...
PDF
Predicting Heart Ailment in Patients with Varying number of Features using Da...
PDF
Prediction of Diabetes using Probability Approach
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
Acute coronary-syndrome-prediction-using-data-mining-techniques--an-application
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
IRJET- Diabetes Prediction by Machine Learning over Big Data from Healthc...
The Analysis of Performace Model Tiered Artificial Neural Network for Assessm...
Ai in diabetes management
IRJET- Predicting Diabetes Disease using Effective Classification Techniques
Heart Disease Prediction using Machine Learning Algorithm
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
IRJET- Heart Failure Risk Prediction using Trained Electronic Health Record
Estimating the Survival Function of HIV AIDS Patients using Weibull Model
prediction of heart disease using machine learning algorithms
Improving the performance of k nearest neighbor algorithm for the classificat...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
Decision Tree Models for Medical Diagnosis
PERFORMANCE OF DATA MINING TECHNIQUES TO PREDICT IN HEALTHCARE CASE STUDY: CH...
FORESTALLING GROWTH RATE IN TYPE II DIABETIC PATIENTS USING DATA MINING AND A...
Predicting Heart Ailment in Patients with Varying number of Features using Da...
Prediction of Diabetes using Probability Approach
Ad

Similar to Prognosis of Diabetes by Performing Data Mining of HbA1c (20)

PDF
An Empirical Study On Diabetes Mellitus Prediction For Typical And Non-Typica...
PPT
Diabetes prediction using machine learning
PDF
PREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUES
PDF
IRJET - Machine Learning for Diagnosis of Diabetes
PDF
IRJET - Prediction and Detection of Diabetes using Machine Learning
PDF
Disease prediction in big data healthcare using extended convolutional neural...
PDF
Performance Evaluation of Data Mining Algorithm on Electronic Health Record o...
PDF
Performance evaluation of random forest with feature selection methods in pre...
PDF
IRJET- Diabetes Diagnosis using Machine Learning Algorithms
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
IRJET- Diabetes Prediction using Machine Learning
PDF
DIABETES PROGNOSTICATION UTILIZING MACHINE LEARNING
PDF
Ijcatr04041015
PDF
Early Stage Diabetic Disease Prediction and Risk Minimization using Machine L...
PDF
DIABETES PREDICTOR USING ENSEMBLE TECHNIQUE
PDF
Analyzing the behavior of different classification algorithms in diabetes pre...
PDF
Forecasting Diabetes Mellitus at an Initial Stage using Machine Learning Methods
PDF
Artificial Intelligence Approaches for Predicting Diabetes in Egypt
PDF
Artificial Intelligence Approaches for Predicting Diabetes in Egypt
PDF
Diabetes Prediction Using ML
An Empirical Study On Diabetes Mellitus Prediction For Typical And Non-Typica...
Diabetes prediction using machine learning
PREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUES
IRJET - Machine Learning for Diagnosis of Diabetes
IRJET - Prediction and Detection of Diabetes using Machine Learning
Disease prediction in big data healthcare using extended convolutional neural...
Performance Evaluation of Data Mining Algorithm on Electronic Health Record o...
Performance evaluation of random forest with feature selection methods in pre...
IRJET- Diabetes Diagnosis using Machine Learning Algorithms
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IRJET- Diabetes Prediction using Machine Learning
DIABETES PROGNOSTICATION UTILIZING MACHINE LEARNING
Ijcatr04041015
Early Stage Diabetic Disease Prediction and Risk Minimization using Machine L...
DIABETES PREDICTOR USING ENSEMBLE TECHNIQUE
Analyzing the behavior of different classification algorithms in diabetes pre...
Forecasting Diabetes Mellitus at an Initial Stage using Machine Learning Methods
Artificial Intelligence Approaches for Predicting Diabetes in Egypt
Artificial Intelligence Approaches for Predicting Diabetes in Egypt
Diabetes Prediction Using ML
Ad

Recently uploaded (20)

PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
Computer network topology notes for revision
PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PDF
Foundation of Data Science unit number two notes
PPTX
Supervised vs unsupervised machine learning algorithms
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PDF
Lecture1 pattern recognition............
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Database Infoormation System (DBIS).pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
Computer network topology notes for revision
Major-Components-ofNKJNNKNKNKNKronment.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Foundation of Data Science unit number two notes
Supervised vs unsupervised machine learning algorithms
Clinical guidelines as a resource for EBP(1).pdf
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Fluorescence-microscope_Botany_detailed content
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
climate analysis of Dhaka ,Banglades.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Miokarditis (Inflamasi pada Otot Jantung)
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
Lecture1 pattern recognition............
oil_refinery_comprehensive_20250804084928 (1).pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Database Infoormation System (DBIS).pptx

Prognosis of Diabetes by Performing Data Mining of HbA1c

  • 1. Prognosis of Diabetes by Performing Data Mining of HbA1c Veingus kehar Department of Software Engineering, Mehran University of engineering and Technology, Jamshoro, Pakistan Sania Bhatti Department of Software Engineering, Mehran University of engineering and Technology, Jamshoro, Pakistan Mohsin Ali Memon Department of Software Engineering, Mehran University of engineering and Technology, Jamshoro, Pakistan ABSTRACT— This paper helps in foreseeing diabetes by applying data mining strategy. The revelation of information from clinical datasets is significant so as to make powerful medical determination. The point of data mining is to extricate information from data put away in dataset and produce clear and reasonable depiction of examples. Diabetes is an interminable sickness and a significant general wellbeing challenge around the world. Utilizing data mining techniques by taking hba1c test data to help individuals to predict diabetes has increase significant fame. In this paper, six classification models are used to classify a diabetic or non-diabetic patient and male and female patients. The dataset utilized is gathered from a Diagnostics and research laboratory Liaquat university of medical and health sciences Jamshoro, which gathers the data of patients with diabetes, without diabetes by taking blood sample of patient and performing hba1c. We utilized Weka tool for the analysis diabetes, no-diabetic examination. Out of six classification algorithms, four algorithms depict hundred percent accuracy on train and test data. KEY WORDS: Data mining, Diabetes, HbA1c, Classification models, Weka. I. INTRODUCTION HbA1c term is related to diabetes, it shows how much blood glucose is present in our body and used for diagnosing patients with diabetes via measuring HbA1c or Glycohemoglobin. Medical technologists can receive thorough image of how much average blood sugar have been by the end of weeks/months. 
It is important for diabetic patients if the HbA1c is high there is greater possibility of diabetes related complications. HbA1c sometimes also termed as hemoglobin A1C or simply A1C. HbA1c is presently officially embraced in numerous nations as an indicative test for (type 2) diabetes diagnosis. In analysis of diabetes, we are fundamentally worried about characterizing an illness state as opposed to building up a reference interim for wellbeing. Analysis of glycated hemoglobin (HbA1c) in blood gives proof about a person's normal blood glucose levels during the past a few months, which is the predicted half- existence of red platelets (RBCs). HbA1c is presently suggested as a standard of care (SOC) for testing and checking diabetes, specially the sort diabetes 2 [1]. Data analysis [2] is the process of analyzing large dataset related to wide variety of fields including, health care, satellite images, agriculture images, biodiversity, and many more. In this paper we are applying analysis process via machine learning algorithms and focusing on medical data. Specifically, we are using six classification algorithms to classify diabetic and non-diabetic patient. Further accuracy and root mean square error of all the algorithms are also calculated. The rest of the paper is organized as follows. Section 2 presents the work of other researchers. Section 3 discusses the methodology, dataset and the tool used for analysis. Section 4 presents implementation details. Section 5 outlines the results and finally section 6 portrays the concluding remarks. International Journal of Computer Science and Information Security (IJCSIS), Vol. 18, No. 1, January 2020 1 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 2. II. RELATED WORK
Several efforts have been made to evaluate the performance of classification methods on clinical datasets, particularly for diabetes [3]. In study [4], a comparison of three different methods (Neural Network, Support Vector Machine (SVM) and Multilayer Perceptron) was reported for a diabetes dataset. The results showed that SVM gave better accuracy than Neural Network and Multilayer Perceptron. An effective predictive machine learning approach for a diabetes dataset, using several classifiers available in the WEKA and RapidMiner data mining tools, was addressed in [5], resulting in better accuracy for the SVM classifier [5]. Moreover, an accuracy of 80.41% for classification between two classes (absence or presence of diabetes) was reported in [6]. The study in [7] built models for diabetes prediction using stream associative classification and association rules, and compared them with predictive rules mined with decision trees. In another study, Decision List, K-NN and Naïve Bayes were used to classify diabetes and the accuracy of the models was compared; Naïve Bayes was the better classifier, with an accuracy of 52.33% [8]. Another study [9] focused on three popular data mining classification algorithms, Decision Tree, Naïve Bayes and K-NN, and compared their accuracy on the highly scattered Cleveland Diabetes database. The dataset was further divided into three distinct cases and every classifier was applied to the scattered datasets. The K-NN classifier was observed to perform better than the other two classifiers (i.e., Decision Tree and Naïve Bayes) [9]. Three well-known data mining classification algorithms, CART, ID3 and Decision Table, were reported in study [10]; the accuracy of each model on the Cleveland diabetes database was evaluated using 10-fold cross-validation. The results showed that CART outperformed the other methods considered [10].
The study in [11] compared 10 different classification algorithms, including Naïve Bayes, Decision Tree, Decision Stump, K-NN, Random Forest (RF), Rule Induction, CHAID, Neural Network and SVM. The results revealed that Naïve Bayes and SVM performed better for the prediction and detection of diabetes [11]. This study considers Decision Tree (DT), Naïve Bayes (NB), Single Conjunctive Rule Learner (SCRL), Radial Basis Function (RBF), K-Nearest Neighbor (KNN), Multilayer Perceptron (MLP), RF and SVM for the heart disease dataset. These algorithms were chosen because together they cover almost all aspects of supervised learning approaches. The experimental results therefore cover a wider range of supervised learning algorithms for diverse healthcare data (i.e., diabetes). Further, this study also combines ensemble methods with the classification methods considered to achieve better accuracy.
III. METHODOLOGY AND DATASET
Figure 1 outlines the steps of the methodology followed in this research work. The major steps are dataset collection, identification of attributes, implementation of six classification algorithms, and performance comparison of those algorithms. The dataset was collected from the Diagnostics and Research Laboratory, Liaquat University of Medical and Health Sciences, Jamshoro, which records data for patients with and without diabetes by taking a blood sample from each patient and performing an HbA1c test. We used the Weka tool for the diabetes analysis. A sample of the dataset is shown in Figure 2. The dataset consists of the HbA1c reports of 8,524 patients.
Figure 1. Methodology: Diabetic patient data collection (LUMHS) → Preprocessing → Identification of attributes from HbA1c → Implementation of classification algorithms → Performance comparison
  • 3. The diabetes data was split into a train set and a test set in a 70:30 ratio. The train set contained 8,524 records and the test set 3,655. The key attributes used during modeling were "Sex", "Age", "Result", and "Class".
Table I: Model attributes of the dataset used in modeling
Model attribute    Scale of measurement
Sex                Nominal
Age                Numeric
Result             Numeric
Class              Nominal
Figure 2. Sample dataset of patients
IV. IMPLEMENTATION
In this work we used the following six classification models to classify patients as diabetic or non-diabetic and as male or female: Bayesian classifier, J-48 decision tree, Naïve Bayes, Multilayer Perceptron, SVM and RF. The confusion matrices computed with WEKA Explorer for the train and test set data are shown in Table II and Table III respectively [12].
True Positive (TP): patients with diabetes who are predicted to have diabetes.
True Negative (TN): patients without diabetes who are predicted to have no diabetes.
False Positive (FP): patients without diabetes who are predicted to have diabetes.
False Negative (FN): patients with diabetes who are predicted to have no diabetes.
Table II: Confusion matrix of the train set
Model                    True Negative   False Positive   False Negative   True Positive
Bayesian Classifier      4,531           0                0                3,993
J-48 Decision Tree       4,531           0                0                3,993
Naïve Bayes              4,336           195              55               3,938
Multilayer Perceptron    4,531           0                0                3,993
SVM                      4,463           68               0                3,993
Random Forest            4,531           0                0                3,993
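As a rough illustration of the 70:30 split and of how the four confusion-matrix cells are tallied, the following Python sketch may help. It is not the Weka workflow used in this work; the function names, the shuffle seed, and the class label strings ("diabetic", "non-diabetic") are assumptions made only for the example.

```python
import random
from collections import Counter


def split_70_30(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into train and test parts.

    The seed and the rounding rule are illustrative choices, not the
    settings used in the paper.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def confusion_counts(actual, predicted, positive="diabetic"):
    """Tally TN, FP, FN and TP, treating `positive` as the diabetic class.

    The label string "diabetic" is a hypothetical encoding of the
    "Class" attribute.
    """
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TN", "FP", "FN", "TP")}
```

Calling `confusion_counts` on the actual and predicted labels of a split yields one row of a table laid out like Table II.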
  • 4. Table III: Confusion matrix of the test set
Model                    True Negative   False Positive   False Negative   True Positive
Bayesian Classifier      1,906           0                0                1,749
J-48 Decision Tree       1,906           0                0                1,749
Naïve Bayes              1,839           67               0                1,726
Multilayer Perceptron    1,906           0                0                1,749
SVM                      1,889           0                0                1,749
Random Forest            1,906           0                0                1,749
The six classification algorithms were also used to classify male and female patients. The confusion matrices computed with WEKA Explorer for the train and test set data are shown in Table IV and Table V.
True Positive (TP): male patients who are predicted male.
True Negative (TN): female patients who are predicted female.
False Positive (FP): female patients who are predicted male.
False Negative (FN): male patients who are predicted female.
Table IV: Confusion matrix of the train set data
Model                    True Negative   False Positive   False Negative   True Positive
Bayesian Classifier      2,396           1,997            1,649            2,482
J-48 Decision Tree       2,541           1,852            1,437            2,694
Naïve Bayes              2,483           1,910            1,856            2,275
Multilayer Perceptron    2,969           1,424            2,216            1,915
Random Forest            3,449           944              1,430            2,701
Table V: Confusion matrix of the test set data
Model                    True Negative   False Positive   False Negative   True Positive
Bayesian Classifier      1,065           771              713              1,106
J-48 Decision Tree       1,078           758              647              1,172
Naïve Bayes              1,088           748              811              1,008
Multilayer Perceptron    1,270           566              1,005            814
Random Forest            1,174           662              925              894
V. RESULTS AND DISCUSSION
The six classifiers are compared on three measures: accuracy, the Kappa statistic and RMSE. Accuracy is calculated using equation (1):
$$\text{Accuracy} = \frac{TP + TN}{P + N} \qquad (1)$$
where P = TP + FN and N = FP + TN. Kappa measures the proportion of data values on the main diagonal of the table and then adjusts this value for the amount of agreement that could be expected due to chance alone. To compute Kappa, the observed level of agreement is calculated first. This value is then compared with the agreement that would be expected if the two raters were totally independent. The value of Kappa is defined as
  • 5. $$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (2)$$
where $p_o$ is the observed level of agreement and $p_e$ is the agreement expected by chance.
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far data points lie from the regression line; RMSE therefore indicates how concentrated the data is around the line of best fit [13].
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (3)$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value and $n$ is the number of instances.
The accuracy, Kappa statistic and RMSE of the six classification algorithms for classifying diabetic and non-diabetic patients are shown in Table VI for the training data and Table VII for the testing data. According to the experimental results, the Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron and Random Forest classify 100% of instances correctly, whereas Naïve Bayes and SVM reach 97% and 99% accuracy respectively. The Kappa statistic and RMSE values show the same pattern: these four classifiers have the highest Kappa (1.00) and the lowest RMSE.
Table VI: Performance comparison of six models (train data, diabetic and non-diabetic)
Model                    Accuracy   Kappa Statistic   Correctly Classified Instances   Incorrectly Classified Instances   RMSE
Bayesian Classifier      100.0%     1.00              8,524                            0                                  0.0002
J-48 Decision Tree       100.0%     1.00              8,524                            0                                  0.0000
Naïve Bayes              97.1%      0.94              8,274                            250                                0.1713
Multilayer Perceptron    100.0%     1.00              8,524                            0                                  0.0000
SVM                      99.2%      0.98              8,456                            68                                 0.0893
Random Forest            100.0%     1.00              8,524                            0                                  0.0002
Table VII: Performance comparison of six models (test data, diabetic and non-diabetic)
Model                    Accuracy   Kappa Statistic   Correctly Classified Instances   Incorrectly Classified Instances   RMSE
Bayesian Classifier      100.0%     1.00              3,655                            0                                  0.0002
J-48 Decision Tree       100.0%     1.00              3,655                            0                                  0.0000
Naïve Bayes              97.5%      0.95              3,565                            90                                 0.1667
Multilayer Perceptron    100.0%     1.00              3,655                            0                                  0.0026
SVM                      99.5%      0.99              3,638                            17                                 0.0682
Random Forest            100.0%     1.00              3,655                            0                                  0.0000
The performance metrics used to compare the models show no sign of underfitting or overfitting, since each model performs almost identically on the train and test data. This is a positive result for all the models.
The Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron and RF models achieved the highest accuracy (100%) on both the train and test set data. Naïve Bayes and SVM achieved accuracies of 97.1% and 99.2% on the train data and 97.5% and 99.5% on the test set data respectively. In addition, the Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron and RF models had the lowest Root Mean Square Error. All the models performed well in classifying diabetic and non-diabetic patients, but these four models are clearly the best for separating patients with diabetes from those without. All the models had an Area under the Receiver Operating Characteristic curve above 0.9, which indicates very good classification performance.
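The three measures can be checked against the reported tables with a short Python sketch. This is an illustrative recomputation, not the Weka implementation; the function names are hypothetical, and the counts are the Naïve Bayes and SVM rows of Table II. The RMSE part assumes a classifier that outputs hard 0/1 predictions, in which case RMSE reduces to the square root of the error rate; classifiers that output graded class probabilities need not follow this.

```python
import math


def accuracy(tn, fp, fn, tp):
    """Equation (1): correctly classified instances over all instances."""
    return (tp + tn) / (tn + fp + fn + tp)


def cohen_kappa(tn, fp, fn, tp):
    """Equation (2): observed agreement corrected for chance agreement."""
    n = tn + fp + fn + tp
    p_o = (tp + tn) / n
    # Chance agreement from the actual and predicted class totals.
    p_e = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def rmse(targets, estimates):
    """Equation (3): root of the mean squared difference."""
    squared = [(t - e) ** 2 for t, e in zip(targets, estimates)]
    return math.sqrt(sum(squared) / len(squared))


# Train-set counts from Table II.
naive_bayes_train = dict(tn=4336, fp=195, fn=55, tp=3938)
svm_train = dict(tn=4463, fp=68, fn=0, tp=3993)

print(round(accuracy(**naive_bayes_train), 3))     # 0.971
print(round(cohen_kappa(**naive_bayes_train), 2))  # 0.94
print(round(cohen_kappa(**svm_train), 2))          # 0.98

# Hard 0/1 predictions rebuilt from the SVM row (assumption: no graded
# probabilities): 4,463 true negatives, 68 false positives, 3,993 true positives.
svm_targets = [0] * 4531 + [1] * 3993
svm_estimates = [0] * 4463 + [1] * 68 + [1] * 3993
print(round(rmse(svm_targets, svm_estimates), 4))  # 0.0893
```

Under these assumptions the recomputed values agree with the Naïve Bayes and SVM rows of Table VI.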
  • 6. Figure 3 and Figure 4 show the accuracy comparison of the six classification models for classifying male and female patients on the training and testing data respectively. The comparison shows that Random Forest gives the highest accuracy of the six classifiers for classifying male and female patients. This classification also shows that there are more male diabetic patients than female diabetic patients.
Figure 3. Accuracy comparison of six classification models to classify male and female patients (Training data). [Bar chart: accuracy from 0 to 100 for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM and Random Forest; Train Data]
Figure 4. Accuracy comparison of six classification models to classify male and female patients (Testing data). [Bar chart: accuracy from 0 to 100 for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM and Random Forest; Test Data]
VI. CONCLUSION
In this paper we performed data mining using classification algorithms. The HbA1c test dataset used in this work was collected from the Diagnostics and Research Laboratory, LUMHS, Hyderabad. The HbA1c tests showed that many patients were prediabetic and that fewer patients had diabetes; because the test predicts diabetes, it can help a patient avoid becoming diabetic in future. The classification experiments indicate that there are more male diabetic patients than female diabetic patients. In both classification experiments, the random forest model shows the highest accuracy.
  • 7. REFERENCES
[1] Hemoglobin A1c (HbA1c) Test for Diabetes. Available: https://www.webmd.com/diabetes/guide/glycated-hemoglobin-test-hba1c
[2] J. Han, M. Kamber and J. Pei, "Data Mining Concepts and Techniques", Third edition.
[3] B. L. Shivakumar and S. Alby, "A survey on data-mining technologies for prediction and diagnosis of diabetes," 2014 IEEE International Conference on Intelligent Computing Applications (ICICA), 2014.
[4] M. Kumari, V. Rajan and A. Anshul, "Prediction of Diabetes Using Bayesian Network," International Journal of Computer Science and Information Technologies, Vol. 5 (4), pp. 5174-5178, 2014.
[5] Balpande, R. Vrushali and R. D. Wajgi, "Prediction and severity estimation of diabetes using data mining technique," IEEE International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017.
[6] M. Komi, et al., "Application of data mining methods in diabetes prediction," 2017 IEEE 2nd International Conference on Image, Vision and Computing (ICIVC), 2017.
[7] H. A. Madni, A. Zahid and A. S. Munam, "Data mining techniques and applications—A decade review," 23rd IEEE International Conference on Automation and Computing (ICAC), 2017.
[8] A. Marcano-Cedeno and D. Andina, "Data mining for the diagnosis of type 2 diabetes," World Automation Congress (WAC), 2012. IEEE, 2012.
[9] A. H. Shurrab and A. Y. Maghari, "Blood diseases detection using data mining techniques," 2017 IEEE 8th International Conference on Information Technology (ICIT), 2017.
[10] A. Marcano-Cedeno and D. Andina, "Data mining for the diagnosis of type 2 diabetes," IEEE World Automation Congress (WAC), 2012.
[11] G. L. Beckles and P. E. Thompson-Reid, "Diabetes and Women's Health across the Life Stages," 2011.
[12] Derived Measures for a test. Available: http://www.academicos.ccadet.unam.mx/jorge.marquez/cursos/Instrumentacion/FalsePositive_TrueNegative_etc.pdf
[13] Statistics How To. Available: https://www.statisticshowto.datasciencecentral.com/rmse/
AUTHORS PROFILE
Veingus Kehar is a software engineer at Liaquat University of Medical and Health Sciences, Jamshoro, Sindh, Pakistan. She obtained her B.Eng. in Software Engineering from Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. She is currently pursuing her M.Eng. in Software Engineering at IICT, Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. Her research area is data mining.
Sania Bhatti is working with the Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. She obtained her PhD from the University of Leeds, United Kingdom, in 2010 under a faculty development program scholarship. Her research interests include modelling, simulation, communication networks and machine learning algorithms. She has published more than twenty national and international journal papers and various international conference papers. She was awarded two research grants by Microsoft in the field of Artificial Intelligence as principal investigator.
Mohsin Ali Memon is working with the Department of Software Engineering, Mehran UET, Jamshoro, Sindh, Pakistan. He obtained his PhD from the Department of Computer Science, University of Tsukuba, in 2014. His research interests include interaction technologies, life logging, privacy control methods and machine learning. He received his B.Eng. in Software Engineering and M.Eng. in Information Technology from Mehran University of Engineering and Technology, Pakistan, in 2006 and 2009, respectively.