SlideShare a Scribd company logo
Journal homepage: https://shmpublisher.com/index.php/joscex 32
Increasing Accuracy of C4.5 Algorithm Using Information Gain
Ratio and Adaboost for Classification of Chronic Kidney
Disease
Aprilia Lestari1
, Alamsyah2
1,2
Computer Science Department, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia
Article Info ABSTRACT
Article history:
Received Aug 2, 2020
Revised Aug 20 , 2020
Accepted Sept 12, 2020
Data information that has been available is very much and will require a very
long time to process large amounts of information data. Therefore, data
mining is used to process large amounts of data. Data mining methods can be
used to classify patient diseases, one of them is chronic kidney disease. This
research used the classification tree method classification with the C4.5
algorithm. In the pre-processing process, a feature selection was applied to
reduce attributes that did not increase the results of classification accuracy.
The feature selection used the gain ratio. The Ensemble method used
adaboost, which well known as boosting. The datasets used by Chronic
Kidney Dataset (CKD) were obtained from the UCI repository of learning
machine. The purpose of this research was applying the information gain ratio
and adaboost ensemble to the chronic kidney disease dataset using the C4.5
algorithm and finding out the results of the accuracy of the C4.5 algorithm
based on information gain ratio and adaboost ensemble. The results obtained
for the default iteration in adaboost which was 50 iterations. The accuracy of
C4.5 stand-alone was obtained 96.66%. The accuracy for C4.5 using
information gain ratio was obtained 97.5%, while C4.5 method using
information gain ratio and adaboost was obtained 98.33%.
Keywords:
Data Mining;
C4.5;
Gain Ratio Boosting;
CKD;
This is an open access article under the CC BY-SA license.
Corresponding Author:
Aprilia Lestari
Computer Science Departement
Faculty of Mathematich and Natural Sciences, Universitas Negeri Semarang,
Email: aprilialestari11@gmail.com
1. INTRODUCTION
Along with the rapid development of technology development, people can easily access information anytime
and anywhere. Data information that has been available is very much and will require a very long time to
process large amounts of information data. Therefore, to process large amounts of data, data mining techniques
are used. Data mining can be applied in various fields. One of them in the health sector is to predict and
classify a disease from a patient's medical record data. Data mining methods can be used to classify patient
diseases based on the severity of the disease, one of which is to classify patients with kidney disease and not.
Chronic kidney disease or kidney failure is a very serious problem in all corners of the world, where the
kidneys are damaged and become a cause of non-maximal kidney function [1].
Data mining is used to determine patterns in data mining knowledge and is useful in solving data problems in
large data warehouses [2]. The term Data mining is also referred to as knowledge discovery. Data mining has
a variety of types of methods, for that the selection of the right method will depend on the purpose and process.
One of the methods in data mining is classification. The classification method has input in the form of a
collection of records, where each record is marked with tuple (x.y). X is an attribute and Y is a specific attribute
/ target showing the class label. Classification has several algorithms including Naïve Bayes and C4.5, each
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 33
of which has different accuracy [3]. Some techniques that exist in classification, decision trees are
classification techniques that are very popular and widely used.
Decision tree is the most powerful approach in scientific discovery and data mining, and a very effective tool
in various fields such as data and text extraction, information extraction, machine learning, and pattern
recognition [4]. One of the most popular decision tree techniques is C4.5. C4.5 algorithm is one of the
algorithms developed by J. Ross Quinlan which is the development of algorithm ID3 (Iterative Dichotomiser
3) [5].
This research was conducted using Chronic Kidney Disease Dataset obtained from the UCI repository of
learning machine. The following are some of the researches that are relevant to CKD. In the research [6]
compared two algorithms namely C4.5 standalone and C4.5 with Pessimistic prunning applied to the Chronic
Kidney Disease dataset. Standalone C4.5 has an accuracy of 95% and C4.5 with pressimistic prunning
resulting in an accuracy of 96.5%. Based on research [7] discusses the prediction of chronic kidney sufferers
using decision tree and naïve bayes algorithms. The dataset used for this research is the chronic kidney disease
dataset. The results of this research are decision trees resulting in an accuracy of 91% and Naïve Bayes
resulting in an accuracy of 86%. In the research by [8] which states that the C4.5 algorithm has the highest
accuracy when applied to the Chronic Kidney Disease dataset compared to the Expectation Maximization
(EM) and Artificial Neural Network (ANN) algorithms. The C4.5 algorithm produces an accuracy of 96.75%,
EM 70% and ANN 75%.
For improving the accuracy of the C4.5 algorithm, this research used methods in pre- processing and ensemble
methods. In the pre-processing process, a feature selection is applied to reduce attributes that do not increase
the results of classification accuracy. The feature selection used the gain ratio. The Ensemble method used
adaboost which well know as boosting.
The purpose of this research was applying the Information Gain Ratio and Adabost ensemble to the Chronic
Kidney Disease dataset using C4.5 algorithm and finding out the results of the accuracy of the C4.5 algorithm
based on Information Gain Ratio and Adaboost ensemble.
2. METHOD
2.1 Feature Selection
Feature selection is the process for selecting a subset of original attributes, so that the feature space optimally
decreases according to certain criteria. Feature selection which aims to reduce the number of certain features
focusing on relevant data and improve quality therefore Feature selection is able to work better than processes
that are driven by the selected features [11].
2.2.1 Information Gain Ratio
Information gain ratio is the ratio of obtaining information gain with intrinsic information. To reduce the bias
towards multi value attributes by taking the number and size of branches in a calculation when selecting
attributes. This is useful as a consideration for logarithmic probabilities to measure the impact of this type of
calculation in a dataset.
2.2 Algoritma C4.5
The C4.5 algorithm was introduced by Quinlan as an improved version of ID3. In ID3, induction of decision
trees can only be done on categorical type features (nominal / ordinal), while numerical types (internal / ratio)
cannot be used. The improvement that distinguishes C4.5 algorithm from ID3 is that it can handle features
with numeric types, pruning decision trees, and deriving rule sets. The C4.5 algorithm also uses gain criteria
in determining features that are node breakers in the induced tree [12].
In the C4.5 algorithm, building a decision tree the first thing to do is to choose attributes as roots. Then a
branch is created for each value in the root. The next step is to divide the case in branches. Then repeat the
process for each branch until all the cases in the branch have the same class [13].
(1)
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 34
For calculating the gain, Equation 2 is used as follows [15].
(2)
Description:
S = Set of Case
A = Attributs
n = The number of A Attribute Partition
|Si|= The Case Number in i Partition
|S| = The Case Number
Meanwhile, calculating the entropy value can be seen in Equation 3.
(3)
Description:
S = Set of case
n = The partition number of S
pi = The proportion Si to S, which 𝑙𝑜𝑔2𝑝𝑖 can be calculated using Equation 4.
(4)
Entropy is used to determine which node will be the next training data solver. A higher entropy value will
increase the potential for classification. What needs to be considered is that if entropy for nodes is 0 means that
all vector data are on the same class label and that node becomes a leaf containing a decision (class label).
What also needs to be considered in the calculation of entropy is if one of the elements w_i is 0 then the entropy
is confirmed to be 0 too. If the proportion of all w_i elements is equal, it is certain that entropy is worth [16].
For calculate Split Entropy Equation 5 is used as follows.
(5)
Description:
S = Set of Case
A = Attributs
n = The number of A Attribute Partition
|Si|= The Case Number in i Partition
|S| = The Case Number
2.3 Adaptive Boosting (Adaboost)
Adaboost is a boosting algorithm part of ensemble learning that is used to improve classification performance
[17]. According to research conducted by Nurzahputra & Muslim [18] states that adaboost is a part of machine
learning introduced by Freud and Schapire (1995) which is used to improve accurate prediction rules by uniting
many inaccurate and weak regulations.
Adaboost and its variants have been successfully applied in several fields because of their strong theoretical
basis, accurate predictions and great simplicity. The steps in the adaboost algorithm are as follows.
a. Input: A collection of research samples with labels {(xi,yi), …, (xn,yn)}, a component learn
algorithm, the amount of rotation T.
b. Initialize: Weight of a training sample 𝑤𝑖
1
=
1
𝑁
, for all i=1, …, N
c. Do for t= 1, …, T
1) Use component learning algorithms to train a classification component, ht, to weight of a training
sample.
2) Calculate the training error on
1) Determine the weight for component classifier
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 35
3) Update weight of a training sample is a normalization
constant.
4) Output:
3. RESULT AND DISCUSSION
In this research a web-based system was made using the Python programming language to find out the results
of applying information gain ratio and Adaboost Ensemble to the C4.5 algorithm in the diagnosis of chronic
kidney disease. To make it necessary data related to the diagnosis of chronic kidney disease that will be used
as a testing system. This research used a chronic kidney disease dataset obtained from the UCI machine
learning repository. This data consists of 24 attributes and 1 class.
At the data processing stage data processing was carried out before the algorithm is applied or commonly
called pre-processing. The Chronic kidney disease dataset obtained is in the form of a file with an extension of
.arff, changes to the file extension to .xlsx for data processing.
3.1 Formatiing Stage
The following formatting stage is the formatting of standards in the dataset used in the study. For example, in
the Rbc (Red Blood Cells) attribute by changing the label on the Rbc attribute to 0 for negative (abnormal)
and 1 for positive (normal).
3.2 Handling Missing Value Stage
Handling of missing value is part of pre-processing which aims to optimize mining results. Missing values in
datasets are usually marked with the symbol "?" As in Table 1, which is an example of a Chronic Kidney
Disease dataset that has a missing value.
Tabel 1. Missing value in the chronic kidney disease dataset
For replacing the missing value in the dataset using the calculation model mean (average) with the Equation
6.
(6)
a. The value to replace the missing value in attribute Sg:
b. The value to replace the missing value in attribute Al:
c. The value to replace the missing value in attribute Su:
The chronic kidney disease dataset that has been replaced is presented in Table 2.
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 36
Tabel 2. The chronic kidney disease dataset
At the stage of class balancing is done by applying the SMOTE algorithm. The SMOTE algorithm is applied
to make new data more balanced. German Credit's initial dataset has 1000 samples with 700 loyal (good)
classes and 300 churn (bad) classes. Therefore it is necessary to balance the class by creating new data in the
churn class. The new dataset of the SMOTE algorithm results in 300 churn class data, so there are 1300 new
sample data. This is done so that data can be classified optimally. The attribute selection stage is done by
selecting attributes in the data used. In this attribute selection stage there is a dimension reduction in the
data in order to optimize attributes that will affect the accuracy of the Naive Bayes algorithm. Dimension
reduction in attributes is done by using Genetic Algorithms. Removal of attributes is done one by one from
attributes that have the smallest fitness value and will be mining. The process of selecting attributes and
mining will stop when the results of the accuracy have exceeded the specified minimum limit.
After going through the pre-processing stage, new data will go through the classification process using the
Naive Bayes algorithm. From the results obtained, there is an increase in the accuracy of the Naive Bayes
algorithm and the Naive Bayes algorithm by applying the SMOTE algorithm and attibutes selection of Genetic
Algorithms.
3.3 Feature Selection Implementation Stage
The stages of applying the feature selection are pre-processing steps in data mining to select features from the
original attributes. In this study the application of a feature selection in the chronic kidney disease dataset aims
to select attributes that fit certain criteria to improve quality so that optimal results are obtained. The results
of information gain ratio for each CKD attribute are shown in Table 3.
Tabel 3. Result of information gain ratio in CKD
After the feature selection calculation stage is carried out using the information gain ratio method, the results
of the selected attributes are obtained. The results of the feature selection using the information gain ratio
method are shown in Table 4.
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 37
Table 4. The results of the feature selection using the information gain ratio method
3.3 Data Mining Stage
3.3.1 Implementation of C4.5 Algorithm
In this stage the model used is to apply the C4.5 algorithm to the CKD. New data that is ready to be processed
is carried out by sharing training data as a model and testing data to measure the ability of the model formed.
In this study the data distribution used a random sub sampling method. Where training data: data testing =
70%: 30% divided randomly. The application of the stand-alone C4.5 algorithm obtained an accuracy of
96.66% is presented in Table 5.
Table 5. The accuracy of C4.5 stand-alone
3.3.1 Implementation of C4.5 Algorithm and Information Gain Ratio
In this stage, the original attribute of the chronic kidney disease dataset consisted of 24 attributes and 1 class,
after the information gain ratio method was applied as selecting attributes 12 attributes were selected. In this
study the data distribution using the splitter method contained in the sklearn library, the method is random sub
sampling. The data sharing system is by sub-sampling random method, where the data is divided into 70%
and 30% and the data is taken randomly at each execution. The application of Information Gain Ratio to
preprocessing data resulted in an accuracy of C4.5 is 97.5%, the results are presented in Table 6.
Table 6. The accuracy of C4.5 and information gain ratio
3.3.1 Implementation of C4.5 Algorithm and Information Gain Ratio and Adaboost
The results of the decision tree will be known the gain value of the attributes that make up the dataset. From
the gain value, each attribute is initialized as the initial weight in the calculation of adaboost. After the
initialization weight is known, then iterations are determined in adaboost. The default iteration in adaboost is
50 iterations. The accuracy of the C4.5 algorithm based on information gain ratio and adaboost by using sub-
random sampling as the splitter is 98.33%. The results of the accuracy obtained are presented in Table 7.
Table 7. The accuracy of C4.5 using information gain ratio and adaboost
The results of this accuracy are far better than just using the C4.5 or C4.5 algorithm based on information
gain ratio only.
4. CONCLUSION
The application of information gain ratio and adaboost ensemble is a combination of two methods that are
useful for increasing accuracy in the C4.5 algorithm. The original attribute of the chronic kidney disease
dataset consisted of 24 attributes and 1 class, after the information gain ratio method was applied as selecting
attributes 12 attributes were selected. The default iteration in adaboost is 50 iterations. The accuracy of stand-
alone C4.5 was 96.66%, for C4.5 with information gain ratio of 97.5%, while C4.5 method was based on
information gain ratio and adaboost was 98.33%. So, it can be concluded that combining information gain
ratio and adaboost methods can improve classification accuracy.
J Soft Comp. Exp., Vol. 1, No. 1, September 2020 38
REFERENCES
[1] Boukenze, B., Mousannif, H., & Haqiq, A. (2016). Performance of Data Mining Techniques to Predict in
Healthcare Case Study: Chronic Kidney Failure Disease. International Journal of Database Management
Systems (IJDMS), 8(3).
[2] Shajahaan, S. S., Shanthi, S., & ManoChitra, V. (2013). Application Data mining Techniques to Model
Breast Cancer Data. International Journal of Emerging Technology and Advanced Engineering, 3(11):
362-369.
[3] Pranatha, A. A. (2012). Analisis Perbandingan Lima Metode Klasifikasi pada Dataset Sensus Penduduk.
Jurnal Sistem Informasi, 4(2): 127-134.
[4] Neeraj, B., Girja, S., Ritu, D. B., & Manisha, M. (2013). Decision Tree Analysis on J48 Algorithm for
Data mining. International Journal of Advanced Research in Computer Science and Software Engineering
(JARCSSE), 3(6): 1114-1119.
[5] Muzakir, A., & Wulandari, R. A. (2016). Model Data mining sebagai Prediksi Penyakit Hipertensi
Kehamilan dengan Teknik Decision Tree. Scientific Journal of Informatics, 3(1): 19-26.
[6] Muslim, M. A., Rukmana, S. H., Sugiharti, E., Prasetiyo, B., & Alimah, S. (2018). Optimization of C4.5
Algorithm-Based Particle Swarm Optimization for Breast Cancer Diagnosis. International Conference on
Mathematics, Science and Education, 983(1): 012-063.
[7] Padmanaban, K. A & Parthiban, G. (2016). Applying Machine Learning Techniques for Predicting the
Risk of Chronic Kidney Disease. Indian Journal of Science and Technology, 4(2): 1-5.
[8] S, T., Bai, M., & Majumdar, J. (2017). Analysis and Prediction of Chronic Kidney Disease Using Data
Mining Techniques. International Journal of Engineering Research in Computer Science and Engineering
(IJERCSE), 4(9): 25-32.
[9] Gola, J., Britz, D., Staudt, T., Winter, M., Schneider, A. S., Ludovici, M., & Mucklich, F. (2018).
Advanced microstructure classification by Data mining methods. Computational Materials Science, 148:
324-335.
[10] Nurzahputra, A., Safitri, A. R., & Muslim, M. A. (2017). Klasifikasi Pelanggan pada Customer Churn
Prediction Menggunakan Decision Tree. Prosiding Seminar Nasional Matematika. Semarang: Universitas
Negeri Semarang: 717- 722.
[11]Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature
Selection Approaches for Predictive Modelling of Groundwater Nitrate Pollution: An Evaluation of
Filters, Embedded and Wrapper Methods. Science of the Total Environment, 624(2018): 661-672.
[12]Prasetyo, E. (2014). Data mining: Konsep dan Aplikasi Menggunakan Matlab. Yogyakarta: Andi Offset.
[13]Kusrini, & Luthfi, E. T. (2009). Algoritma Data Mining. Yogyakarta: CV Andi Offset.
[14]Neeraj, B., Girja, S., Ritu, D. B., & Manisha, M. (2013). Decision Tree Analysis on J48 Algorithm for
Data mining. International Journal of Advanced Research in Computer Science and Software Engineering
(JARCSSE), 3(6): 1114-1119.
[15]Quinland, J. Ross. (1986). Introduction of Decision Tree. Machine Learning. 1(1): 81-106
[16]Han, J. (2012). Data mining Concepts and Techniques. San francisco: Morgan Kauffman.
[17]Listiana, E., & Muslim, M. A. (2017). Penerapan Adaboost Untuk Klasifikasi Support Vector Machine
Guna Mengingkatkan Akurasi Pada Diagnosa Chronic Kidney Disease. Prosiding Seminar Nasional
Teknologi dan Informatika, 875- 881.
[18]Nurzahputra, A., & Muslim, M. A. (2017). Peningkatan Akurasi pada Algoritma C4.5 Menggunakan
Adaboost untuk Meminimalkan Resiko Kredit. Prosiding Seminar Nasional Teknologi dan Informatika.
Kudus: Universitas Muria Kudus: 243-247

More Related Content

PDF
APPLICATION OF DATA MINING TECHNIQUES FOR THE PREDICTION OF CHRONIC KIDNEY DI...
PDF
IRJET- Feature Selection and Classifier Accuracy of Data Mining Algorithms
PDF
IRJET - Chronic Kidney Disease Prediction using Data Mining and Machine Learning
PDF
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
PDF
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
PDF
Advanced Computational Intelligence: An International Journal (ACII)
PDF
Utilizing Machine Learning, Detect Chronic Kidney Disease and Suggest A Healt...
DOCX
Kidney Failure Due to Diabetics – Detection using Classification Algorithm in...
APPLICATION OF DATA MINING TECHNIQUES FOR THE PREDICTION OF CHRONIC KIDNEY DI...
IRJET- Feature Selection and Classifier Accuracy of Data Mining Algorithms
IRJET - Chronic Kidney Disease Prediction using Data Mining and Machine Learning
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
Advanced Computational Intelligence: An International Journal (ACII)
Utilizing Machine Learning, Detect Chronic Kidney Disease and Suggest A Healt...
Kidney Failure Due to Diabetics – Detection using Classification Algorithm in...

Similar to education for computer svm CKD C4.5 .pdf (20)

PDF
Diagnosing Chronic Kidney Disease using Machine Learning
PDF
AN EFFECTIVE PREDICTION OF CHRONIC KIDENY DISEASE USING DATA MINING CLASSIFIE...
PDF
Chronic Kidney Disease Prediction Using Machine Learning
PDF
1 springer format chronic changed edit iqbal qc
PDF
Chronic Kidney Disease Prediction
PDF
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
PDF
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
PDF
Ijcatr04041015
PDF
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
PDF
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
PDF
Analysis of Classification Algorithm in Data Mining
PPTX
Introduction to Machine Learning & Classification
PDF
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
PDF
Diagnosis Of Chronic Kidney Disease Using Machine Learning
PDF
Efficient classification of big data using vfdt (very fast decision tree)
PPTX
Chronic Kidney Disease Prediction Using Machine Learning with Feature Selection
PDF
A comparative analysis of classification techniques on medical data sets
PDF
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
PDF
Hypothyroid Classification using Machine Learning Approaches and Comparative ...
PDF
IRJET- Chronic Kidney Disease Prediction based on Naive Bayes Technique
Diagnosing Chronic Kidney Disease using Machine Learning
AN EFFECTIVE PREDICTION OF CHRONIC KIDENY DISEASE USING DATA MINING CLASSIFIE...
Chronic Kidney Disease Prediction Using Machine Learning
1 springer format chronic changed edit iqbal qc
Chronic Kidney Disease Prediction
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Ijcatr04041015
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
Analysis of Classification Algorithm in Data Mining
Introduction to Machine Learning & Classification
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
Diagnosis Of Chronic Kidney Disease Using Machine Learning
Efficient classification of big data using vfdt (very fast decision tree)
Chronic Kidney Disease Prediction Using Machine Learning with Feature Selection
A comparative analysis of classification techniques on medical data sets
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
Hypothyroid Classification using Machine Learning Approaches and Comparative ...
IRJET- Chronic Kidney Disease Prediction based on Naive Bayes Technique
Ad

Recently uploaded (20)

PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PDF
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
Empowerment Technology for Senior High School Guide
PDF
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
PDF
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf
PPTX
B.Sc. DS Unit 2 Software Engineering.pptx
PDF
Vision Prelims GS PYQ Analysis 2011-2022 www.upscpdf.com.pdf
PDF
Trump Administration's workforce development strategy
PDF
Weekly quiz Compilation Jan -July 25.pdf
PDF
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
PDF
Paper A Mock Exam 9_ Attempt review.pdf.
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PDF
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 1)
PDF
Τίμαιος είναι φιλοσοφικός διάλογος του Πλάτωνα
PPTX
Introduction to pro and eukaryotes and differences.pptx
PDF
1_English_Language_Set_2.pdf probationary
PDF
IGGE1 Understanding the Self1234567891011
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Empowerment Technology for Senior High School Guide
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
Chinmaya Tiranga quiz Grand Finale.pdf
Practical Manual AGRO-233 Principles and Practices of Natural Farming
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf
B.Sc. DS Unit 2 Software Engineering.pptx
Vision Prelims GS PYQ Analysis 2011-2022 www.upscpdf.com.pdf
Trump Administration's workforce development strategy
Weekly quiz Compilation Jan -July 25.pdf
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
Paper A Mock Exam 9_ Attempt review.pdf.
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 1)
Τίμαιος είναι φιλοσοφικός διάλογος του Πλάτωνα
Introduction to pro and eukaryotes and differences.pptx
1_English_Language_Set_2.pdf probationary
IGGE1 Understanding the Self1234567891011
Ad

education for computer svm CKD C4.5 .pdf

  • 1. Journal homepage: https://shmpublisher.com/index.php/joscex 32 Increasing Accuracy of C4.5 Algorithm Using Information Gain Ratio and Adaboost for Classification of Chronic Kidney Disease Aprilia Lestari1 , Alamsyah2 1,2 Computer Science Department, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia Article Info ABSTRACT Article history: Received Aug 2, 2020 Revised Aug 20 , 2020 Accepted Sept 12, 2020 Data information that has been available is very much and will require a very long time to process large amounts of information data. Therefore, data mining is used to process large amounts of data. Data mining methods can be used to classify patient diseases, one of them is chronic kidney disease. This research used the classification tree method classification with the C4.5 algorithm. In the pre-processing process, a feature selection was applied to reduce attributes that did not increase the results of classification accuracy. The feature selection used the gain ratio. The Ensemble method used adaboost, which well known as boosting. The datasets used by Chronic Kidney Dataset (CKD) were obtained from the UCI repository of learning machine. The purpose of this research was applying the information gain ratio and adaboost ensemble to the chronic kidney disease dataset using the C4.5 algorithm and finding out the results of the accuracy of the C4.5 algorithm based on information gain ratio and adaboost ensemble. The results obtained for the default iteration in adaboost which was 50 iterations. The accuracy of C4.5 stand-alone was obtained 96.66%. The accuracy for C4.5 using information gain ratio was obtained 97.5%, while C4.5 method using information gain ratio and adaboost was obtained 98.33%. Keywords: Data Mining; C4.5; Gain Ratio Boosting; CKD; This is an open access article under the CC BY-SA license. Corresponding Author: Aprilia Lestari Computer Science Departement Faculty of Mathematich and Natural Sciences, Universitas Negeri Semarang, Email: aprilialestari11@gmail.com 1. INTRODUCTION Along with the rapid development of technology development, people can easily access information anytime and anywhere. Data information that has been available is very much and will require a very long time to process large amounts of information data. Therefore, to process large amounts of data, data mining techniques are used. Data mining can be applied in various fields. One of them in the health sector is to predict and classify a disease from a patient's medical record data. Data mining methods can be used to classify patient diseases based on the severity of the disease, one of which is to classify patients with kidney disease and not. Chronic kidney disease or kidney failure is a very serious problem in all corners of the world, where the kidneys are damaged and become a cause of non-maximal kidney function [1]. Data mining is used to determine patterns in data mining knowledge and is useful in solving data problems in large data warehouses [2]. The term Data mining is also referred to as knowledge discovery. Data mining has a variety of types of methods, for that the selection of the right method will depend on the purpose and process. One of the methods in data mining is classification. The classification method has input in the form of a collection of records, where each record is marked with tuple (x.y). X is an attribute and Y is a specific attribute / target showing the class label. Classification has several algorithms including Naïve Bayes and C4.5, each
  • 2. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 33 of which has different accuracy [3]. Some techniques that exist in classification, decision trees are classification techniques that are very popular and widely used. Decision tree is the most powerful approach in scientific discovery and data mining, and a very effective tool in various fields such as data and text extraction, information extraction, machine learning, and pattern recognition [4]. One of the most popular decision tree techniques is C4.5. C4.5 algorithm is one of the algorithms developed by J. Ross Quinlan which is the development of algorithm ID3 (Iterative Dichotomiser 3) [5]. This research was conducted using Chronic Kidney Disease Dataset obtained from the UCI repository of learning machine. The following are some of the researches that are relevant to CKD. In the research [6] compared two algorithms namely C4.5 standalone and C4.5 with Pessimistic prunning applied to the Chronic Kidney Disease dataset. Standalone C4.5 has an accuracy of 95% and C4.5 with pressimistic prunning resulting in an accuracy of 96.5%. Based on research [7] discusses the prediction of chronic kidney sufferers using decision tree and naïve bayes algorithms. The dataset used for this research is the chronic kidney disease dataset. The results of this research are decision trees resulting in an accuracy of 91% and Naïve Bayes resulting in an accuracy of 86%. In the research by [8] which states that the C4.5 algorithm has the highest accuracy when applied to the Chronic Kidney Disease dataset compared to the Expectation Maximization (EM) and Artificial Neural Network (ANN) algorithms. The C4.5 algorithm produces an accuracy of 96.75%, EM 70% and ANN 75%. For improving the accuracy of the C4.5 algorithm, this research used methods in pre- processing and ensemble methods. In the pre-processing process, a feature selection is applied to reduce attributes that do not increase the results of classification accuracy. The feature selection used the gain ratio. The Ensemble method used adaboost which well know as boosting. The purpose of this research was applying the Information Gain Ratio and Adabost ensemble to the Chronic Kidney Disease dataset using C4.5 algorithm and finding out the results of the accuracy of the C4.5 algorithm based on Information Gain Ratio and Adaboost ensemble. 2. METHOD 2.1 Feature Selection Feature selection is the process for selecting a subset of original attributes, so that the feature space optimally decreases according to certain criteria. Feature selection which aims to reduce the number of certain features focusing on relevant data and improve quality therefore Feature selection is able to work better than processes that are driven by the selected features [11]. 2.2.1 Information Gain Ratio Information gain ratio is the ratio of obtaining information gain with intrinsic information. To reduce the bias towards multi value attributes by taking the number and size of branches in a calculation when selecting attributes. This is useful as a consideration for logarithmic probabilities to measure the impact of this type of calculation in a dataset. 2.2 Algoritma C4.5 The C4.5 algorithm was introduced by Quinlan as an improved version of ID3. In ID3, induction of decision trees can only be done on categorical type features (nominal / ordinal), while numerical types (internal / ratio) cannot be used. The improvement that distinguishes C4.5 algorithm from ID3 is that it can handle features with numeric types, pruning decision trees, and deriving rule sets. The C4.5 algorithm also uses gain criteria in determining features that are node breakers in the induced tree [12]. In the C4.5 algorithm, building a decision tree the first thing to do is to choose attributes as roots. Then a branch is created for each value in the root. The next step is to divide the case in branches. Then repeat the process for each branch until all the cases in the branch have the same class [13]. (1)
  • 3. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 34 For calculating the gain, Equation 2 is used as follows [15]. (2) Description: S = Set of Case A = Attributs n = The number of A Attribute Partition |Si|= The Case Number in i Partition |S| = The Case Number Meanwhile, calculating the entropy value can be seen in Equation 3. (3) Description: S = Set of case n = The partition number of S pi = The proportion Si to S, which 𝑙𝑜𝑔2𝑝𝑖 can be calculated using Equation 4. (4) Entropy is used to determine which node will be the next training data solver. A higher entropy value will increase the potential for classification. What needs to be considered is that if entropy for nodes is 0 means that all vector data are on the same class label and that node becomes a leaf containing a decision (class label). What also needs to be considered in the calculation of entropy is if one of the elements w_i is 0 then the entropy is confirmed to be 0 too. If the proportion of all w_i elements is equal, it is certain that entropy is worth [16]. For calculate Split Entropy Equation 5 is used as follows. (5) Description: S = Set of Case A = Attributs n = The number of A Attribute Partition |Si|= The Case Number in i Partition |S| = The Case Number 2.3 Adaptive Boosting (Adaboost) Adaboost is a boosting algorithm part of ensemble learning that is used to improve classification performance [17]. According to research conducted by Nurzahputra & Muslim [18] states that adaboost is a part of machine learning introduced by Freud and Schapire (1995) which is used to improve accurate prediction rules by uniting many inaccurate and weak regulations. Adaboost and its variants have been successfully applied in several fields because of their strong theoretical basis, accurate predictions and great simplicity. The steps in the adaboost algorithm are as follows. a. Input: A collection of research samples with labels {(xi,yi), …, (xn,yn)}, a component learn algorithm, the amount of rotation T. b. Initialize: Weight of a training sample 𝑤𝑖 1 = 1 𝑁 , for all i=1, …, N c. Do for t= 1, …, T 1) Use component learning algorithms to train a classification component, ht, to weight of a training sample. 2) Calculate the training error on 1) Determine the weight for component classifier
  • 4. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 35 3) Update weight of a training sample is a normalization constant. 4) Output: 3. RESULT AND DISCUSSION In this research a web-based system was made using the Python programming language to find out the results of applying information gain ratio and Adaboost Ensemble to the C4.5 algorithm in the diagnosis of chronic kidney disease. To make it necessary data related to the diagnosis of chronic kidney disease that will be used as a testing system. This research used a chronic kidney disease dataset obtained from the UCI machine learning repository. This data consists of 24 attributes and 1 class. At the data processing stage data processing was carried out before the algorithm is applied or commonly called pre-processing. The Chronic kidney disease dataset obtained is in the form of a file with an extension of .arff, changes to the file extension to .xlsx for data processing. 3.1 Formatiing Stage The following formatting stage is the formatting of standards in the dataset used in the study. For example, in the Rbc (Red Blood Cells) attribute by changing the label on the Rbc attribute to 0 for negative (abnormal) and 1 for positive (normal). 3.2 Handling Missing Value Stage Handling of missing value is part of pre-processing which aims to optimize mining results. Missing values in datasets are usually marked with the symbol "?" As in Table 1, which is an example of a Chronic Kidney Disease dataset that has a missing value. Tabel 1. Missing value in the chronic kidney disease dataset For replacing the missing value in the dataset using the calculation model mean (average) with the Equation 6. (6) a. The value to replace the missing value in attribute Sg: b. The value to replace the missing value in attribute Al: c. The value to replace the missing value in attribute Su: The chronic kidney disease dataset that has been replaced is presented in Table 2.
  • 5. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 36 Tabel 2. The chronic kidney disease dataset At the stage of class balancing is done by applying the SMOTE algorithm. The SMOTE algorithm is applied to make new data more balanced. German Credit's initial dataset has 1000 samples with 700 loyal (good) classes and 300 churn (bad) classes. Therefore it is necessary to balance the class by creating new data in the churn class. The new dataset of the SMOTE algorithm results in 300 churn class data, so there are 1300 new sample data. This is done so that data can be classified optimally. The attribute selection stage is done by selecting attributes in the data used. In this attribute selection stage there is a dimension reduction in the data in order to optimize attributes that will affect the accuracy of the Naive Bayes algorithm. Dimension reduction in attributes is done by using Genetic Algorithms. Removal of attributes is done one by one from attributes that have the smallest fitness value and will be mining. The process of selecting attributes and mining will stop when the results of the accuracy have exceeded the specified minimum limit. After going through the pre-processing stage, new data will go through the classification process using the Naive Bayes algorithm. From the results obtained, there is an increase in the accuracy of the Naive Bayes algorithm and the Naive Bayes algorithm by applying the SMOTE algorithm and attibutes selection of Genetic Algorithms. 3.3 Feature Selection Implementation Stage The stages of applying the feature selection are pre-processing steps in data mining to select features from the original attributes. In this study the application of a feature selection in the chronic kidney disease dataset aims to select attributes that fit certain criteria to improve quality so that optimal results are obtained. The results of information gain ratio for each CKD attribute are shown in Table 3. Tabel 3. Result of information gain ratio in CKD After the feature selection calculation stage is carried out using the information gain ratio method, the results of the selected attributes are obtained. The results of the feature selection using the information gain ratio method are shown in Table 4.
  • 6. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 37 Table 4. The results of the feature selection using the information gain ratio method 3.3 Data Mining Stage 3.3.1 Implementation of C4.5 Algorithm In this stage the model used is to apply the C4.5 algorithm to the CKD. New data that is ready to be processed is carried out by sharing training data as a model and testing data to measure the ability of the model formed. In this study the data distribution used a random sub sampling method. Where training data: data testing = 70%: 30% divided randomly. The application of the stand-alone C4.5 algorithm obtained an accuracy of 96.66% is presented in Table 5. Table 5. The accuracy of C4.5 stand-alone 3.3.1 Implementation of C4.5 Algorithm and Information Gain Ratio In this stage, the original attribute of the chronic kidney disease dataset consisted of 24 attributes and 1 class, after the information gain ratio method was applied as selecting attributes 12 attributes were selected. In this study the data distribution using the splitter method contained in the sklearn library, the method is random sub sampling. The data sharing system is by sub-sampling random method, where the data is divided into 70% and 30% and the data is taken randomly at each execution. The application of Information Gain Ratio to preprocessing data resulted in an accuracy of C4.5 is 97.5%, the results are presented in Table 6. Table 6. The accuracy of C4.5 and information gain ratio 3.3.1 Implementation of C4.5 Algorithm and Information Gain Ratio and Adaboost The results of the decision tree will be known the gain value of the attributes that make up the dataset. From the gain value, each attribute is initialized as the initial weight in the calculation of adaboost. After the initialization weight is known, then iterations are determined in adaboost. The default iteration in adaboost is 50 iterations. The accuracy of the C4.5 algorithm based on information gain ratio and adaboost by using sub- random sampling as the splitter is 98.33%. The results of the accuracy obtained are presented in Table 7. Table 7. The accuracy of C4.5 using information gain ratio and adaboost The results of this accuracy are far better than just using the C4.5 or C4.5 algorithm based on information gain ratio only. 4. CONCLUSION The application of information gain ratio and adaboost ensemble is a combination of two methods that are useful for increasing accuracy in the C4.5 algorithm. The original attribute of the chronic kidney disease dataset consisted of 24 attributes and 1 class, after the information gain ratio method was applied as selecting attributes 12 attributes were selected. The default iteration in adaboost is 50 iterations. The accuracy of stand- alone C4.5 was 96.66%, for C4.5 with information gain ratio of 97.5%, while C4.5 method was based on information gain ratio and adaboost was 98.33%. So, it can be concluded that combining information gain ratio and adaboost methods can improve classification accuracy.
  • 7. J Soft Comp. Exp., Vol. 1, No. 1, September 2020 38 REFERENCES [1] Boukenze, B., Mousannif, H., & Haqiq, A. (2016). Performance of Data Mining Techniques to Predict in Healthcare Case Study: Chronic Kidney Failure Disease. International Journal of Database Management Systems (IJDMS), 8(3). [2] Shajahaan, S. S., Shanthi, S., & ManoChitra, V. (2013). Application Data mining Techniques to Model Breast Cancer Data. International Journal of Emerging Technology and Advanced Engineering, 3(11): 362-369. [3] Pranatha, A. A. (2012). Analisis Perbandingan Lima Metode Klasifikasi pada Dataset Sensus Penduduk. Jurnal Sistem Informasi, 4(2): 127-134. [4] Neeraj, B., Girja, S., Ritu, D. B., & Manisha, M. (2013). Decision Tree Analysis on J48 Algorithm for Data mining. International Journal of Advanced Research in Computer Science and Software Engineering (JARCSSE), 3(6): 1114-1119. [5] Muzakir, A., & Wulandari, R. A. (2016). Model Data mining sebagai Prediksi Penyakit Hipertensi Kehamilan dengan Teknik Decision Tree. Scientific Journal of Informatics, 3(1): 19-26. [6] Muslim, M. A., Rukmana, S. H., Sugiharti, E., Prasetiyo, B., & Alimah, S. (2018). Optimization of C4.5 Algorithm-Based Particle Swarm Optimization for Breast Cancer Diagnosis. International Conference on Mathematics, Science and Education, 983(1): 012-063. [7] Padmanaban, K. A & Parthiban, G. (2016). Applying Machine Learning Techniques for Predicting the Risk of Chronic Kidney Disease. Indian Journal of Science and Technology, 4(2): 1-5. [8] S, T., Bai, M., & Majumdar, J. (2017). Analysis and Prediction of Chronic Kidney Disease Using Data Mining Techniques. International Journal of Engineering Research in Computer Science and Engineering (IJERCSE), 4(9): 25-32. [9] Gola, J., Britz, D., Staudt, T., Winter, M., Schneider, A. S., Ludovici, M., & Mucklich, F. (2018). Advanced microstructure classification by Data mining methods. Computational Materials Science, 148: 324-335. [10] Nurzahputra, A., Safitri, A. R., & Muslim, M. A. (2017). Klasifikasi Pelanggan pada Customer Churn Prediction Menggunakan Decision Tree. Prosiding Seminar Nasional Matematika. Semarang: Universitas Negeri Semarang: 717- 722. [11]Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature Selection Approaches for Predictive Modelling of Groundwater Nitrate Pollution: An Evaluation of Filters, Embedded and Wrapper Methods. Science of the Total Environment, 624(2018): 661-672. [12]Prasetyo, E. (2014). Data mining: Konsep dan Aplikasi Menggunakan Matlab. Yogyakarta: Andi Offset. [13]Kusrini, & Luthfi, E. T. (2009). Algoritma Data Mining. Yogyakarta: CV Andi Offset. [14]Neeraj, B., Girja, S., Ritu, D. B., & Manisha, M. (2013). Decision Tree Analysis on J48 Algorithm for Data mining. International Journal of Advanced Research in Computer Science and Software Engineering (JARCSSE), 3(6): 1114-1119. [15]Quinland, J. Ross. (1986). Introduction of Decision Tree. Machine Learning. 1(1): 81-106 [16]Han, J. (2012). Data mining Concepts and Techniques. San francisco: Morgan Kauffman. [17]Listiana, E., & Muslim, M. A. (2017). Penerapan Adaboost Untuk Klasifikasi Support Vector Machine Guna Mengingkatkan Akurasi Pada Diagnosa Chronic Kidney Disease. Prosiding Seminar Nasional Teknologi dan Informatika, 875- 881. [18]Nurzahputra, A., & Muslim, M. A. (2017). Peningkatan Akurasi pada Algoritma C4.5 Menggunakan Adaboost untuk Meminimalkan Resiko Kredit. Prosiding Seminar Nasional Teknologi dan Informatika. Kudus: Universitas Muria Kudus: 243-247