Prognosis of Diabetes by Performing Data Mining of HbA1c

Veingus Kehar
Department of Software Engineering, Mehran University of Engineering and Technology,
Jamshoro, Pakistan
Sania Bhatti
Jamshoro, Pakistan
Mohsin Ali Memon
Jamshoro, Pakistan
ABSTRACT: This paper addresses the prediction of diabetes using data mining. Discovering knowledge in clinical
datasets is important for making effective medical diagnoses. The aim of data mining is to extract knowledge from the
data stored in a dataset and to produce clear, understandable descriptions of patterns. Diabetes is a chronic disease
and a major public health challenge worldwide. Applying data mining techniques to HbA1c test data to help predict
diabetes has gained considerable popularity. In this paper, six classification models are used to classify patients as
diabetic or non-diabetic and as male or female. The dataset was collected from the Diagnostics and Research
Laboratory, Liaquat University of Medical and Health Sciences, Jamshoro, which records data on patients with and
without diabetes by taking a blood sample from each patient and performing the HbA1c test. We used the Weka tool
for the diabetic versus non-diabetic analysis. Of the six classification algorithms, four achieve 100% accuracy on both
the training and test data.
KEY WORDS: Data mining, Diabetes, HbA1c, Classification models, Weka.
I. INTRODUCTION
HbA1c, also called glycohemoglobin, hemoglobin A1C, or simply A1C, is a measure closely associated with diabetes: it
reflects the amount of glucose in the blood and is used to diagnose patients with diabetes. From an HbA1c
measurement, medical technologists obtain a picture of a patient's average blood sugar over the preceding weeks or
months. This matters for diabetic patients because the higher the HbA1c, the greater the risk of diabetes-related
complications. HbA1c is now officially adopted in many countries as a diagnostic test for type 2 diabetes. In
diagnosing diabetes, we are primarily concerned with defining a disease state rather than establishing a reference
interval for health. Analysis of glycated hemoglobin (HbA1c) in blood provides evidence about a person's average
blood glucose level over the previous few months, which corresponds to the expected half-life of red blood cells
(RBCs). HbA1c is currently recommended as a standard of care (SOC) for testing and monitoring diabetes, especially
type 2 diabetes [1].
Data analysis [2] is the process of analyzing large datasets from a wide variety of fields, including health care,
satellite imagery, agricultural imagery, biodiversity, and many more. In this paper we apply machine learning
algorithms to medical data. Specifically, we use six classification algorithms to classify diabetic and non-diabetic
patients, and we calculate the accuracy and root mean square error of each algorithm. The rest of the paper is
organized as follows. Section 2 presents the work of other researchers. Section 3 discusses the methodology, the
dataset, and the tool used for analysis. Section 4 presents implementation details. Section 5 outlines the results, and
Section 6 presents the concluding remarks.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 1, January 2020
1 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

II. RELATED WORK
Several efforts have been made to evaluate the performance of classification methods on clinical datasets, particularly
for diabetes [3]. Study [4] reported a comparison of three methods, Neural Network, Support Vector Machine (SVM),
and Multilayer Perceptron, on a diabetes dataset. The results showed that SVM gave better accuracy than Neural
Network and Multilayer Perceptron.
An effective predictive machine learning approach for a diabetes dataset, using several classifiers available in the
WEKA and RapidMiner data mining tools, was addressed in [5]; the SVM classifier gave the better accuracy [5].
Moreover, an accuracy of 80.41% for classification between two classes (absence or presence of diabetes) was
reported in [6]. The study in [7] developed models for diabetes prediction using stream associative classification and
association rules and compared them with predictive rules mined with decision trees.
In one study, Decision List, K-NN, and Naïve Bayes were used to classify diabetes and the accuracy of the models was
compared; Naïve Bayes was the better classifier with an accuracy of 52.33% [8]. Another study [9] focused on three
popular data mining classification algorithms, Decision Tree, Naïve Bayes, and K-NN, and compared their accuracy on
the highly scattered Cleveland Diabetes database. The dataset was further divided into three different cases, and each
classifier was applied to the scattered datasets. K-NN was observed to perform better than the other two classifiers
(Decision Tree and Naïve Bayes) [9]. Three well-known data mining classification algorithms, CART, ID3, and
Decision Table, were reported in study [10]; the accuracy of each model on the Cleveland diabetes database was
evaluated using 10-fold cross-validation. The results showed that CART outperformed the other methods considered
[10].
The study in [11] compared 10 different classification algorithms, including Naïve Bayes, Decision Tree, Decision
Stump, K-NN, Random Forest (RF), Rule Induction, CHAID, Neural Network, and SVM. The results revealed that
Naïve Bayes and SVM performed better for the prediction and detection of diabetes [11]. This study considers
Decision Tree (DT), Naïve Bayes (NB), Single Conjunctive Rule Learner (SCRL), Radial Basis Function (RBF),
K-Nearest Neighbor (KNN), Multilayer Perceptron (MLP), RF, and SVM for the heart disease dataset. These
algorithms were chosen because they cover almost all aspects of supervised learning approaches. The experimental
results therefore cover a wider range of supervised learning algorithms for diverse health care data (i.e., diabetes).
This study additionally combines ensemble methods with the considered classification methods to achieve better
accuracy.
III. METHODOLOGY AND DATASET
Figure 1 outlines the steps of the methodology followed in this research work. The major steps are dataset
collection, identification of attributes, implementation of six classification algorithms, and performance comparison
of those algorithms. The dataset was collected from the Diagnostics and Research Laboratory, Liaquat University of
Medical and Health Sciences (LUMHS), Jamshoro, which records data on patients with and without diabetes by taking
a blood sample from each patient and performing the HbA1c test. We used the Weka tool for the analysis. A sample
of the dataset is shown in Figure 2. The dataset consists of the HbA1c reports of 8,524 patients.
Figure 1. Methodology: Diabetic patient data collection (LUMHS) -> Preprocessing -> Identification of attributes from HbA1c -> Implementation of classification algorithms -> Performance comparison

The diabetes data was split into training and test sets using a 70:30 ratio. The training set contained 8,524 instances
and the test set 3,655. The key attributes used during modeling were “Sex”, “Age”, “Result”, and “Class”.
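As a rough illustration of such a split, the following minimal Python sketch shuffles a list of records and cuts it at 70%. The function name `split_train_test`, the seed, and the sample records are hypothetical; the paper performed its split in Weka, so this is not the authors' procedure.

```python
import random

def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical records with the four attributes: (Sex, Age, Result, Class)
records = [
    ("M", 45, 7.2, "diabetic"), ("F", 38, 5.1, "non-diabetic"),
    ("M", 60, 8.4, "diabetic"), ("F", 29, 4.9, "non-diabetic"),
    ("F", 52, 6.9, "diabetic"), ("M", 33, 5.3, "non-diabetic"),
    ("M", 47, 7.8, "diabetic"), ("F", 41, 5.0, "non-diabetic"),
    ("F", 66, 9.1, "diabetic"), ("M", 25, 4.7, "non-diabetic"),
]
train, test = split_train_test(records)
print(len(train), len(test))  # 7 3

# Share of the training set implied by the reported sizes
train_share = 8524 / (8524 + 3655)
print(round(train_share, 3))  # 0.7
```

The last two lines only check that the reported set sizes are consistent with a 70:30 ratio.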
Table I: Model attributes of data set used in modeling
Model attribute  Scale of measurement
Sex              Nominal
Age              Numeric
Result           Numeric
Class            Nominal
Figure 2. Sample dataset of patients
IV. IMPLEMENTATION
In this work we used the following six classification models to classify patients as diabetic or non-diabetic and as
male or female: Bayesian classifier, J-48 decision tree, Naïve Bayes, Multilayer Perceptron, SVM, and RF.
The confusion matrices computed using WEKA Explorer for the training and test sets are shown in Table II and
Table III, respectively [12].
True Positive (TP): patients with diabetes who are predicted as having diabetes.
True Negative (TN): patients without diabetes who are predicted as not having diabetes.
False Positive (FP): patients without diabetes who are predicted as having diabetes.
False Negative (FN): patients with diabetes who are predicted as not having diabetes.
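The four definitions above can be sketched as a small counting routine. This is an illustrative Python example only; the function name `confusion_counts`, the label strings, and the sample data are assumptions, and the paper's matrices were produced by WEKA Explorer.

```python
def confusion_counts(actual, predicted, positive="diabetic"):
    """Return (TN, FP, FN, TP) for two equal-length sequences of class labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # diabetic, predicted diabetic
        elif a == positive:
            fn += 1   # diabetic, predicted non-diabetic
        elif p == positive:
            fp += 1   # non-diabetic, predicted diabetic
        else:
            tn += 1   # non-diabetic, predicted non-diabetic
    return tn, fp, fn, tp

# Hypothetical labels for five patients
actual    = ["diabetic", "diabetic", "non-diabetic", "non-diabetic", "diabetic"]
predicted = ["diabetic", "non-diabetic", "non-diabetic", "diabetic", "diabetic"]
counts = confusion_counts(actual, predicted)
print(counts)  # (1, 1, 1, 2)
```

The tuple order (TN, FP, FN, TP) follows the column order of the confusion matrix tables.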
Table II: Confusion Matrix of Train Set
Model                  True Negative  False Positive  False Negative  True Positive
Bayesian Classifier    4,531          0               0               3,993
J-48 Decision Tree     4,531          0               0               3,993
Naïve Bayes            4,336          195             55              3,938
Multilayer Perceptron  4,531          0               0               3,993
SVM                    4,463          68              0               3,993
Random Forest          4,531          0               0               3,993

Table III: Confusion Matrix of Test Set
Model                  True Negative  False Positive  False Negative  True Positive
Bayesian Classifier    1,906          0               0               1,749
J-48 Decision Tree     1,906          0               0               1,749
Naïve Bayes            1,839          67              23              1,726
Multilayer Perceptron  1,906          0               0               1,749
SVM                    1,889          17              0               1,749
Random Forest          1,906          0               0               1,749
All six classification algorithms were also used to classify patients as male or female. The confusion matrices
computed using WEKA Explorer for the training and test sets are shown in Table IV and Table V.
True Positive (TP): male patients who are predicted as male.
True Negative (TN): female patients who are predicted as female.
False Positive (FP): female patients who are predicted as male.
False Negative (FN): male patients who are predicted as female.
Table IV: Confusion Matrix of Train Set Data
Model                  True Negative  False Positive  False Negative  True Positive
J-48 Decision Tree     2,541          1,852           1,437           2,694
Naïve Bayes            2,483          1,910           1,856           2,275
Random Forest          3,449          944             1,430           2,701
Table V: Confusion Matrix of Test Set Data
Model                  True Negative  False Positive  False Negative  True Positive
J-48 Decision Tree     1,078          758             647             1,172
Naïve Bayes            1,088          748             811             1,008
Random Forest          1,174          662             925             894
V. RESULTS AND DISCUSSION
The six classifiers are compared on three measures: accuracy, the Kappa statistic, and RMSE. Accuracy is calculated
using equation (1):
$$\text{Accuracy} = \frac{TP + TN}{P + N} \tag{1}$$
where P = TP + FN and N = FP + TN.
Kappa measures the proportion of data values on the main diagonal of the confusion matrix and then adjusts this value
for the amount of agreement that could be expected due to chance alone.
To compute Kappa, the observed level of agreement is calculated first:
$$p_o = \frac{TP + TN}{P + N}$$
This value is then compared with the agreement that would be expected if the two raters (here, the actual and
predicted classes) were totally independent:
$$p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(P + N)^2}$$
The value of Kappa is defined as
$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{2}$$
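As a worked check of the accuracy and Kappa definitions, the following Python sketch computes both from the four confusion matrix counts, using the standard two-class form of Cohen's Kappa. The function names are illustrative; the input is the Naïve Bayes row of Table II, and the output agrees with the 97.1% accuracy and 0.94 Kappa reported for that model on the training data.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of instances on the main diagonal: (TP + TN) / (P + N)."""
    return (tp + tn) / (tn + fp + fn + tp)

def cohen_kappa(tn, fp, fn, tp):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    p_observed = (tp + tn) / total
    # Chance agreement: product of the marginal totals for each class
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Naïve Bayes on the training set (Table II)
nb_train = dict(tn=4336, fp=195, fn=55, tp=3938)
print(round(accuracy(**nb_train), 3))     # 0.971
print(round(cohen_kappa(**nb_train), 2))  # 0.94
```

For a classifier with no errors, the observed agreement is 1 and Kappa is exactly 1, which matches the 1.00 entries reported for the four best models.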
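Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how
far data points lie from the regression line; in other words, RMSE indicates how concentrated the data is around the
line of best fit [13].
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{3}$$
where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of instances.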
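A minimal Python sketch of the RMSE calculation is given below. Applying it to 0/1 class labels and predicted probabilities of the diabetic class is an assumption made for illustration, and the sample values are hypothetical; it is not necessarily how the RMSE values in the tables were produced.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical example: 1 = diabetic, 0 = non-diabetic
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.1]
print(round(rmse(labels, probabilities), 4))  # 0.2345
```

An RMSE of 0 means every prediction matches its label exactly; larger values indicate predictions that lie further from the labels.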
The accuracy, Kappa statistic, and RMSE of the six classification algorithms for classifying diabetic and non-diabetic
patients are shown in Table VI (training data) and Table VII (test data). According to the experimental results, the
Bayesian classifier, J-48 decision tree, Multilayer Perceptron, and Random Forest classify 100% of instances
correctly, whereas Naïve Bayes and SVM achieve about 97% and 99% accuracy, respectively. The Kappa statistic and
RMSE values show the same pattern, with the highest Kappa and lowest RMSE for these four classifiers.
Table VI: Performance Comparison of six models (Train data) (diabetic and non-diabetic)
Model                  Accuracy  Kappa Statistic  Correctly Classified Instances  Incorrectly Classified Instances  RMSE
Bayesian Classifier    100.0%    1.00             8,524                           0                                 0.0002
J-48 Decision Tree     100.0%    1.00             8,524                           0                                 0.0000
Naïve Bayes            97.1%     0.94             8,274                           250                               0.1713
Multilayer Perceptron  100.0%    1.00             8,524                           0                                 0.0000
SVM                    99.2%     0.98             8,456                           68                                0.0893
Random Forest          100.0%    1.00             8,524                           0                                 0.0002
Table VII: Performance Comparison of six models (Test data) (diabetic and non-diabetic)
Model                  Accuracy  Kappa Statistic  Correctly Classified Instances  Incorrectly Classified Instances  RMSE
Bayesian Classifier    100.0%    1.00             3,655                           0                                 0.0002
J-48 Decision Tree     100.0%    1.00             3,655                           0                                 0.0000
Naïve Bayes            97.5%     0.95             3,565                           90                                0.1667
Multilayer Perceptron  100.0%    1.00             3,655                           0                                 0.0026
SVM                    99.5%     0.99             3,638                           17                                0.0682
Random Forest          100.0%    1.00             3,655                           0                                 0.0000
The performance metrics used to compare the models show no sign of underfitting or overfitting, since the results on
the training and test sets are closely matched for every model. The Bayesian Classifier, J-48 Decision Tree,
Multilayer Perceptron, and RF models achieved the highest accuracy (100%) on both the training and test sets. Naïve
Bayes and SVM achieved accuracies of 97.1% and 99.2% on the training data and 97.5% and 99.5% on the test data,
respectively.
In addition, the Bayesian Classifier, J-48 Decision Tree, Multilayer Perceptron, and RF models had the lowest root
mean square error. All the models performed well in classifying diabetic and non-diabetic patients, but these four
models are clearly the best suited to distinguishing patients with diabetes from those without. All the models had an
area under the Receiver Operating Characteristic curve above 0.9, which indicates strong classification performance.

Figure 3 and Figure 4 show the accuracy comparison of the six classification models for classifying male and female
patients on the training and test data, respectively. The comparison shows that Random Forest gives the highest
accuracy among the six classifiers for classifying male and female patients. This classification also indicates that
there are more male diabetic patients than female diabetic patients.
Figure 3. Accuracy comparison of six classification models to classify male and female patients (Training data)
Figure 4. Accuracy comparison of six classification models to classify male and female patients (Testing data)
VI. CONCLUSION
In this paper we performed data mining using classification algorithms. The HbA1c test dataset used in this work was
collected from the Diagnostics and Research Laboratory, LUMHS, Hyderabad. The HbA1c tests showed that many
patients were prediabetic and fewer patients had diabetes; because the test helps predict diabetes, a prediabetic
patient can act to avoid becoming diabetic in the future. The classification experiments indicate that there are more
male diabetic patients than female diabetic patients. In both classification experiments, the random forest model
shows the highest accuracy.
[Figure 3 plot, Train Data: accuracy (0 to 100) for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM, and Random Forest]
[Figure 4 plot, Test Data: accuracy (0 to 100) for Bayesian Classifier, J-48 Decision Tree, Naïve Bayes, Multilayer Perceptron, SVM, and Random Forest]

REFERENCES
[1] Hemoglobin A1c (HbA1c) Test for Diabetes
Available: https://www.webmd.com/diabetes/guide/glycated-hemoglobin-test-hba1c
[2] J. Han, M. Kamber, and J. Pei, “Data Mining: Concepts and Techniques,” Third edition.
[3] B. L. Shivakumar, and S. Alby. "A survey on data-mining technologies for prediction and diagnosis of diabetes." 2014 IEEE International
Conference on Intelligent Computing Applications (ICICA), 2014.
[4] M. Kumari, V. Rajan and A. Anshul, "Prediction of Diabetes Using Bayesian Network.", International Journal of Computer Science and
Information Technologies, Vol. 5 (4), pp. 5174-5178, 2014.
[5] Balpande, R. Vrushali and R. D. Wajgi. "Prediction and severity estimation of diabetes using data mining technique." IEEE International
Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017.
[6] M. Komi, et al. "Application of data mining methods in diabetes prediction." 2017 IEEE 2nd International Conference on Image, Vision and
Computing (ICIVC), 2017.
[7] H. A., Madni, A. Zahid, and A. S., Munam, "Data mining techniques and applications—A decade review." 23rd IEEE International Conference
on Automation and Computing (ICAC), 2017.
[8] Marcano-Cedeno, Alexis, and Diego Andina. "Data mining for the diagnosis of type 2 diabetes." World Automation Congress (WAC), 2012.
IEEE, 2012.
[9] A. H. Shurrab and A. Y. Maghari. "Blood diseases detection using data mining techniques." 2017 IEEE 8th International Conference on
Information Technology (ICIT), 2017.
[10] A. Marcano-Cedeno, and D. Andina. "Data mining for the diagnosis of type 2 diabetes." IEEE World Automation Congress (WAC), 2012.
[11] G. L. Beckles and P. E. Thompson-Reid, “Diabetes and Women’s Health across the Life Stages,” 2011.
[12] Derived Measures for a test Available:
http://www.academicos.ccadet.unam.mx/jorge.marquez/cursos/Instrumentacion/FalsePositive_TrueNegative_etc.pdf
[13] Statistics How To, Available: https://www.statisticshowto.datasciencecentral.com/rmse/
AUTHORS PROFILE
Veingus Kehar is a software engineer at Liaquat University of Medical and Health Sciences, Jamshoro, Sindh, Pakistan. She obtained her B.Eng. in
Software Engineering from Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. She is currently pursuing her
M.Eng. in Software Engineering at IICT, Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. Her research area
is data mining.
Sania Bhatti is working with the Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Sindh,
Pakistan. She obtained her PhD from the University of Leeds, United Kingdom, in 2010 under a scholarship from the faculty development
program. Her research interests include modelling, simulation, communication networks, and machine learning algorithms. She has published
more than twenty national and international journal papers and various international conference papers. She was awarded two research
grants by Microsoft in the field of Artificial Intelligence as principal investigator.
Mohsin Ali Memon is working with the Department of Software Engineering, Mehran UET, Jamshoro, Sindh, Pakistan. He obtained his PhD
degree from the Department of Computer Science, University of Tsukuba, in 2014. His research interests include interaction technologies, life
logging, privacy control methods, and machine learning. He received his B.Eng. in Software Engineering and M.Eng. in Information
Technology from Mehran University of Engineering and Technology, Pakistan, in 2006 and 2009, respectively.