International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 09 | Sep -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1160
Analysis On Data Mining Techniques For Heart Disease Dataset
Subhashri.K 1, Arockia Panimalar.S2, Ashwin.S3, Vignesh.P4
1,2 Assistant Professor, Department of BCA & M.Sc SS, Sri Krishna Arts and Science College, Tamilnadu, India
3,4 III BCA, Department of BCA & M.Sc SS, Sri Krishna Arts and Science College, Tamilnadu, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Data mining is an analytic process designed
to explore data (usually large amounts of data, typically
business or market related, also known as "big data") in
search of consistent patterns and/or systematic
relationships between variables, and then to validate the
findings by applying the detected patterns to new subsets of
data. The ultimate goal of data mining is prediction, and
predictive data mining is the most common type of data
mining and the one with the most direct business applications.
Classification trees are used to predict membership of cases or
objects in the classes of a categorical dependent variable from
their measurements on one or more predictor variables.
Classification tree analysis is one of the main techniques
used in data mining. In this work, we analyzed
various classification algorithms and compared their
performance in terms of the time
taken to build the model, using different distance functions.
The results were obtained on a data set taken from the UCI
repository. The aim is to judge the efficiency of different
data mining algorithms on the Heart Disease dataset and
determine the optimum algorithm. The performance
analysis depends on many factors, including validation
mode, distance function and the nature of the dataset.
Key Words: Data Mining, Classification, Classification
Techniques, Distance function, KEEL Tool,
Performance Analysis.
1. INTRODUCTION
The healthcare industry collects huge amounts of
healthcare data which, unfortunately, are not “mined” to
discover hidden information for effective decision making.
Hidden patterns and relationships often go
unexploited. Data mining refers to using a variety of
techniques to identify nuggets of information or
decision-making knowledge in a database and extracting
these in such a way that they can be put to use in areas such as
decision support, prediction, forecasting and estimation.
Discovering relations that connect variables in a database is
the subject of data mining. Data mining is the non-trivial
extraction of implicit, previously unknown and potentially
useful information from data. Data mining technology
provides a user-oriented approach to novel and hidden
patterns in the data. The discovered knowledge can be
used by healthcare administrators to improve the
quality of service, and by medical
practitioners to reduce the number of adverse drug effects.
In information technology, knowledge is one of the most
significant assets of any organization. The role of IT in
healthcare is well established. Knowledge management in
health care poses many challenges in the creation,
dissemination and preservation of health care knowledge
using advanced technologies. Pragmatic use of database
systems, data warehousing and knowledge management
technologies can contribute a lot to decision support
systems in health care. Knowledge discovery in databases
is a well-defined process consisting of several distinct steps.
Data mining is the core step, which results in the discovery
of hidden but useful knowledge from massive databases.
The following are some of the important areas of interest
where data mining techniques can be of tremendous use
in health care management (Gnanadesikan, 1977):
1. Data modelling for health care applications.
2. Executive information systems for health care.
3. Forecasting treatment costs and demand for resources.
4. Anticipating patients’ future behaviour given their
history.
5. Public health Informatics.
6. E-governance structures in health care.
7. Health Insurance.
2. CLASSIFICATION
Classification is the task of generalizing known structure
to apply to new data. For example, an e-mail program
might attempt to classify an e-mail as "legitimate" or as
"spam". An algorithm that implements classification,
especially in a concrete implementation, is known as a
classifier. The term "classifier" sometimes also refers to the
mathematical function, implemented by a classification
algorithm, that maps input data to a category. Classification
and clustering are examples of the more general
problem of pattern recognition, which is the
assignment of some sort of output value to a given input
value.
Classification Algorithm
A. Decision Tree
In decision analysis, a decision tree can be used to visually
and explicitly represent decisions and decision making.
In data mining, a decision tree describes data but not
decisions; rather the resulting classification tree can be an
input for decision making. Decision tree learning is a
method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable
based on several input variables. Each interior node
corresponds to one of the input variables. An example is
shown on the right. Each leaf represents a value of the
target variable given the values of the input variables
represented by the path from the root to the leaf.
A tree can be "learned" by splitting the source set into
subsets based on an attribute value test. In data mining,
decision trees can also be described as the combination of
mathematical and computational techniques to aid the
description, categorisation and generalisation of a given
set of data (Kaur and Wasan, 2006).
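The attribute value test mentioned above can be illustrated with a small sketch. It scores each nominal attribute by information gain and picks the best one for the split. This is only an illustration under stated assumptions: the toy records, the attribute names and the helper names (entropy, information_gain, best_split) are hypothetical and are not taken from the paper or from KEEL. C4.5 itself uses the related gain ratio criterion rather than plain information gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one nominal attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the attribute value test.
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Return the attribute whose value test gives the highest gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records, not taken from the Heart Disease dataset.
rows = [
    {"chest_pain": "typical", "sex": "M"},
    {"chest_pain": "typical", "sex": "F"},
    {"chest_pain": "none", "sex": "M"},
    {"chest_pain": "none", "sex": "F"},
]
labels = ["disease", "disease", "healthy", "healthy"]

chosen = best_split(rows, labels)
```

Applying best_split recursively to each resulting subset, until the subsets are pure or no attributes remain, yields the tree structure described in the text.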
B. Lazy Learning
In artificial intelligence, lazy learning is a learning method in
which generalization beyond the training data is delayed
until a query is made to the system, as opposed to eager
learning, where the system tries to generalize the training
data before receiving queries.
The main advantage gained in employing a lazy
learning method, such as case-based reasoning, is that the target
function is approximated locally, as in the k-
nearest neighbour algorithm. Because the target function is
approximated locally for each query to the system, lazy
learning systems can simultaneously solve multiple
problems and deal successfully with changes in the
problem domain. The disadvantages of lazy learning
include the large space requirement to store the entire
training dataset. Particularly noisy training data increases
the case base unnecessarily, because no abstraction is
made during the training phase.
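The contrast between lazy and eager learning can be made concrete with a minimal k-nearest neighbour sketch. The class name LazyKNN and the toy points are hypothetical, and this is not KEEL's implementation. Note that fit only stores the examples, so the whole training set stays in memory, and all distance computation is deferred to predict.

```python
import math
from collections import Counter


class LazyKNN:
    """Minimal k-nearest neighbour classifier (illustrative sketch only)."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def fit(self, points, labels):
        # "Training" only stores the data; no model is built here.
        self.examples = list(zip(points, labels))
        return self

    def predict(self, query):
        # Generalization happens at query time: rank stored examples by
        # Euclidean distance to the query and take a majority vote.
        nearest = sorted(self.examples,
                         key=lambda ex: math.dist(ex[0], query))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical two-dimensional toy data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
model = LazyKNN(k=3).fit(points, labels)
```

Because every query scans the stored examples, build time is close to zero while prediction cost grows with the size of the training set.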
Selected Distance Functions
• Euclidean Distance Function
• HVDM Distance Function
i. Euclidean Distance
In mathematics, the Euclidean distance or Euclidean
metric is the "ordinary" distance between two points that
one would measure with a ruler, and is given by the
Pythagorean formula. By using this formula as distance,
Euclidean space (or even any inner product space)
becomes a metric space. The associated norm is called the
Euclidean norm. Older literature refers to the metric as
the Pythagorean metric.
Definition
The Euclidean distance between points p and q is the
length of the line segment connecting them (pq).
In Cartesian coordinates,
if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in
Euclidean n-space, then the distance from p to q, or from q to
p, is given by

$$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
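As a quick check of the definition, the Euclidean distance can be computed directly from the coordinates. The function name euclidean_distance is a hypothetical helper introduced for illustration.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# The classic 3-4-5 right triangle.
d = euclidean_distance((0, 0), (3, 4))
```

The measure assumes numeric coordinates on comparable scales, which is why nominal attributes need a different treatment.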
ii. Heterogeneous Value Difference Metric
Instance-based learning techniques typically handle
continuous and linear input values well, but often do not
handle nominal input attributes appropriately. The Value
Difference Metric (VDM) was designed to find
reasonable distance values between nominal attribute
values, but it requires discretization to map continuous values
into nominal values.
Three heterogeneous distance functions have been proposed
to address this: the Heterogeneous Value Difference
Metric (HVDM), the Interpolated Value Difference
Metric (IVDM), and the Windowed Value Difference Metric (WVDM). These
distance functions are designed to handle applications with
nominal attributes, continuous attributes, or both. In
experiments on 48 applications, they
achieved higher classification accuracy on average than
three previous distance functions on those datasets
that have both nominal and continuous attributes.
HVDM is therefore used as the second distance function in this work.
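The following is a minimal sketch of HVDM as it is commonly defined, not KEEL's implementation. The overall distance is the square root of the sum of squared per-attribute distances. A per-attribute distance is 1 when either value is missing, a normalized value difference over class-conditional frequencies for nominal attributes, and |x - y| / (4 * sigma) for continuous attributes. The function names, the toy rows, the use of the population standard deviation and the treatment of unseen nominal values as unknown are all assumptions made for illustration.

```python
import math
from statistics import pstdev


def fit_hvdm(rows, labels, nominal):
    """Collect the statistics HVDM needs; None marks a missing value."""
    model = {"classes": sorted(set(labels)), "nominal": set(nominal),
             "sigma": {}, "counts": {}}
    for a in range(len(rows[0])):
        if a in model["nominal"]:
            # counts[a][value][class] = number of training rows.
            table = {}
            for row, label in zip(rows, labels):
                if row[a] is not None:
                    per_class = table.setdefault(row[a], {})
                    per_class[label] = per_class.get(label, 0) + 1
            model["counts"][a] = table
        else:
            values = [row[a] for row in rows if row[a] is not None]
            model["sigma"][a] = pstdev(values)  # assumption: population std
    return model


def attribute_distance(model, a, x, y):
    if x is None or y is None:
        return 1.0
    if a in model["nominal"]:
        cx = model["counts"][a].get(x, {})
        cy = model["counts"][a].get(y, {})
        nx, ny = sum(cx.values()), sum(cy.values())
        if nx == 0 or ny == 0:
            return 1.0  # assumption: unseen values are treated as unknown
        return math.sqrt(sum((cx.get(c, 0) / nx - cy.get(c, 0) / ny) ** 2
                             for c in model["classes"]))
    sigma = model["sigma"][a]
    return abs(x - y) / (4 * sigma) if sigma > 0 else 0.0


def hvdm(model, x, y):
    return math.sqrt(sum(attribute_distance(model, a, xa, ya) ** 2
                         for a, (xa, ya) in enumerate(zip(x, y))))


# Hypothetical rows: (age, chest pain type); attribute 1 is nominal.
rows = [(63, "typical"), (45, "typical"), (58, "none"), (39, "none")]
labels = ["disease", "disease", "healthy", "healthy"]
model = fit_hvdm(rows, labels, nominal={1})
```

Dividing continuous differences by four standard deviations keeps most of them within a range comparable to the nominal term, so neither attribute type dominates the total. The extra frequency counting also suggests why HVDM costs more to evaluate than plain Euclidean distance.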
3. RELATED WORK
The clinical and physical diagnosis of Chikungunya viral
fever patients, and its comparison with dengue viral fever, has
been proposed.
Table 1: Summary of selected reference with goal
Our project aims to integrate different sources of
information and to discover patterns of diagnosis, for
predicting the virus-infected patients and their results.
The aim is to apply hybrid classification schemes and
create data mining tools well suited to the crucial
demands of medical diagnostic systems. The approaches
in the review are diverse in their data mining methods (Fathima et al.
2011).
A prototype has been described using data mining
techniques, namely Naïve Bayes and WAC (weighted
associative classifier). It enables significant knowledge, e.g.
patterns and relationships between medical factors related to
heart disease, to be established. It can serve as a training tool
to train nurses and medical students to diagnose
patients with heart disease. It is a web-based, user-friendly
system and can be used in hospitals that have a data
warehouse. The models were validated
using a classification matrix (Sundar et al. 2012).
The proposed work is to predict more accurately the
presence of heart disease with a reduced number of
attributes. In a separate study, Artificial Neural
Network and Decision Tree algorithms were applied to meteorological
data collected between 2000 and 2009 from the city of
Ibadan, Nigeria. The data mining
classification techniques RIPPER, Decision Tree,
ANN and SVM have been analyzed on a cardiovascular disease
dataset. The analysis shows that, of these four
classification models, SVM predicts cardiovascular disease
with the lowest error rate and highest accuracy (Kumari and
Godara, 2011).
4. DATASETS AND TOOLS
A. Hardware
We conducted our evaluation on an Intel Pentium P6200
platform with 1 GB of memory and a 320 GB hard
disk.
B. Software
In this experiment, we used the KEEL tool on Windows 7 to
evaluate the performance of the classification algorithms
in terms of the time taken to build the model according to the
respective number of clusters. KEEL is machine learning/data
mining software written in Java (distributed
under the GNU General Public License).
KEEL is a collection of machine learning algorithms for
data mining tasks. It contains tools for developing new
machine learning schemes, and can be used for pre-
processing, classification, clustering, association and
visualization.
C. Data Set
The input data set is an integral part of a data mining
application. The data used in our experiment is real-
world data obtained from the UCI machine learning repository,
a widely accepted data set that is also available in the KEEL toolkit.
The Heart Disease data set comprises 303 instances and 75
attributes in the area of health science, and some of them
contain missing values.
5. EXPERIMENTAL RESULTS AND DISCUSSION
The selected tool is evaluated on the Heart Disease dataset, and
comparisons are performed in two parts. In the first
comparison, we applied the classification algorithms
using two distance functions, namely Euclidean distance
and HVDM distance, with three pre-processing
techniques: CHC adaptive search for instance
selection, GGA-TSS (generational genetic algorithm for
instance selection) and SGGA-TSS (steady-state genetic
algorithm for instance selection). Three validation
modes were used, namely k-fold cross validation, 5-fold validation
and no validation, to find the more efficient
of the two algorithms.
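For readers unfamiliar with fold-based validation, the sketch below shows how a data set can be partitioned into k folds, each serving once as the test set. The function name k_fold_indices, the fixed seed and the choice k = 5 are assumptions for illustration; KEEL's own partitioning may differ, for example by stratifying on the class label.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 303 instances, as in the Heart Disease data set described above.
splits = list(k_fold_indices(303, 5))
```

Each model is built k times under this scheme, which helps explain why build times under fold-based validation are larger than those obtained without validation.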
Table 2: The UCI datasets used for the experiments and
their properties
In K-fold Cross Validation Mode: Using the Euclidean
distance function in k-fold cross validation mode, the
time taken by the C4.5 decision tree algorithm was
lowest with CHC pre-processing, at 44.443,
compared with 106.613 for GGA and 89.485 for
SGGA.
When the C4.5 decision tree algorithm was applied using the
HVDM distance function, the minimum time taken to build
the model in k-fold cross validation mode was again obtained
with CHC pre-processing, at 157.604, compared with
899.583 and 1586.068 for GGA and SGGA
respectively. When the KNN
technique was applied in k-fold validation mode using Euclidean
distance, the time taken was 46.78 with CHC pre-processing,
against 107.458 and 90.752 with GGA and SGGA
respectively.
In 5-fold Validation Mode: The C4.5 decision tree
algorithm was then run in 5-fold validation mode.
The time taken by C4.5 using Euclidean distance with the CHC,
GGA and SGGA techniques is 72.306, 239.223 and 198.667 respectively.
Without Validation Mode: Finally, both
techniques were run without
validation. Using HVDM distance, the time taken is
72.306 with CHC pre-processing, 239.223 with GGA
and 198.667 with SGGA.
The time taken by the KNN technique using Euclidean
distance is 5.523 with CHC pre-processing, and
13.79 and 11.419 with GGA and SGGA respectively.
The running time of each
algorithm and distance function was thus evaluated in each
validation mode.
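The paper does not state how KEEL measures these times. As a rough illustration of the idea, build time can be measured by wrapping the model-building call with a wall-clock timer. The helper name time_model_build and the stand-in workload are hypothetical.

```python
import time


def time_model_build(build, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        build(*args)  # any callable that builds a model from its arguments
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; a real run would pass the classifier's training routine.
elapsed = time_model_build(sorted, list(range(10000, 0, -1)))
```

Taking the minimum over a few repeats reduces the influence of background load on the machine, which matters on a modest platform such as the one described in Section 4.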
6. CONCLUSION
By using Euclidian distance the time taken to build
themodel in K-fold validation mode is 80.180 obtained
by C45.In 5-fold validation mode and 0 validation mode,C45
is attaining 35.109 and 35.109 respectively. When the
algorithm is applied using HVDM distance function the
minimum time taken to build the model by C45 in K-fold
validation mode,5-fold validation mode and 0 validation is
881.085,170.065,170.065 respectively. By using KNN
algorithm the time taken to build the model by using
Euclidian distance in K-fold validation mode is 81.66.In 5-
fold validation mode and 0 validation mode, KNN-lazy
learning is attaining 19.572 and 10.244 respectively.
When the algorithm is applied using HVDM distance
function the minimum time taken to build the model by
KNN algorithm in K-fold validation mode, 5-fold
validation mode and 0 validation is 554.399, 141.261,
170.065 respectively.
Table 3: Time difference between different
validation modes using the C4.5 and KNN algorithms
We have analyzed the heart disease dataset using the KEEL
tool. In KEEL, a validation mode is first selected
for the operation on the dataset. A
pre-processing technique is then used to remove noise from the
dataset. Finally, a classification algorithm is selected,
and the algorithms are compared by the
time each takes on the dataset.
More and more classification algorithms are being made
available, which raises the question of which algorithm
runs fastest on the heart disease dataset. Many
algorithms have been studied by researchers to find
the optimum algorithm. Our focus here is to determine
which classification algorithm gives the best result
in the least time under the different
validation modes available in the tool. This study finds
that lazy learning (KNN) is the most efficient algorithm
on the heart disease dataset
when run without validation. We aim to extend this
study to other machine learning classification algorithms,
with a view to building a predictive system
for heart disease.
7. REFERENCES
[1]. Gnanadesikan R (1977) Methods for Statistical Data
Analysis of Multivariate Observations, Wiley. ISBN 0-471-
30845-5 (p. 83–86).
[2]. Fathima SA, Manimegala D and Hundewale N (2011) A
Review of Data Mining Classifications Applied for Diagnosis
and Prognosis. International Journal of Computer Science, 6:322-328.
[3]. Sundar AN, Latha PP, Chandra RM (2012) Performance
Analysis of Classification Data Mining Techniques Over
Heart Disease.
[4]. Anbarasi M et al. (2010) Enhanced Prediction of Heart
Disease with Feature Subset Selection using Genetic
Algorithm.
[5].Olaiya F and Adeyem BA (2012) Application of Data
Mining Techniques in Weather Prediction Climate
Change Studies. International Journal of Information
Engineering and Electronic Business, 3:51-59.
[6]. Kumari M and Godara S (2011) Comparative
Study of Data Mining Classification Methods in
Cardiovascular Disease Prediction. International Journal of
Computer Science and Technology, 6:304-308.
[7]. Meena K, Subramaniam RK, Gomathy M (2012)
Performance Analysis of Gender Clustering and
Classification Algorithms. International Journal of Computer
Science and Engineering, 5:442-457.
[8]. Kaur H and Wasan KS (2006) Empirical Study on
Applications of Data Mining Techniques in Healthcare.
International Journal of Computer Science, 4:194-200.
[9]. Srinivas K et al. (2010) Applications of Data Mining
Techniques in Healthcare and Prediction of Heart Attacks.
International Journal on Computer Science and
Engineering, 5:250-255.
[10]. Rani M, Singh V and Bhushan B (2013) Performance
Evaluation of Classification Techniques Based on Mean
Absolute Error. International Journal of Computing and
Business Research, 4:1-5.

More Related Content

PDF
IRJET- Medical Data Mining
PDF
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
PDF
Propose a Enhanced Framework for Prediction of Heart Disease
PDF
A comparative analysis of classification techniques on medical data sets
PDF
Early Identification of Diseases Based on Responsible Attribute using Data Mi...
PDF
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
PDF
Assessment of Decision Tree Algorithms on Student’s Recital
PDF
PREDICTION OF MALIGNANCY IN SUSPECTED THYROID TUMOUR PATIENTS BY THREE DIFFER...
IRJET- Medical Data Mining
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
Propose a Enhanced Framework for Prediction of Heart Disease
A comparative analysis of classification techniques on medical data sets
Early Identification of Diseases Based on Responsible Attribute using Data Mi...
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
Assessment of Decision Tree Algorithms on Student’s Recital
PREDICTION OF MALIGNANCY IN SUSPECTED THYROID TUMOUR PATIENTS BY THREE DIFFER...

What's hot (20)

PDF
Decision Tree Models for Medical Diagnosis
PDF
50120130406032
PDF
IRJET-Survey on Data Mining Techniques for Disease Prediction
PDF
An efficient feature selection algorithm for health care data analysis
PDF
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
PDF
50120130406036
PDF
Ijarcet vol-2-issue-4-1393-1397
PDF
IRJET- Disease Prediction System
PDF
IRJET- Predicting Heart Disease using Machine Learning Algorithm
PDF
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
PDF
IRJET- Missing Data Imputation by Evidence Chain
PDF
Preprocessing and Classification in WEKA Using Different Classifiers
PDF
IRJET- Comparative Analysis of Data Mining Classification Techniques for Hear...
PDF
IRJET- A Literature Review on Heart and Alzheimer Disease Prediction
PDF
Effect of Data Size on Feature Set Using Classification in Health Domain
PDF
Comprehensive Survey of Data Classification & Prediction Techniques
PDF
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
PDF
T OP K-O PINION D ECISIONS R ETRIEVAL IN H EALTHCARE S YSTEM
PDF
Heart Disease Prediction Using Associative Relational Classification Techniq...
PDF
Heart Disease Prediction Using Data Mining Techniques
Decision Tree Models for Medical Diagnosis
50120130406032
IRJET-Survey on Data Mining Techniques for Disease Prediction
An efficient feature selection algorithm for health care data analysis
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
50120130406036
Ijarcet vol-2-issue-4-1393-1397
IRJET- Disease Prediction System
IRJET- Predicting Heart Disease using Machine Learning Algorithm
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IRJET- Missing Data Imputation by Evidence Chain
Preprocessing and Classification in WEKA Using Different Classifiers
IRJET- Comparative Analysis of Data Mining Classification Techniques for Hear...
IRJET- A Literature Review on Heart and Alzheimer Disease Prediction
Effect of Data Size on Feature Set Using Classification in Health Domain
Comprehensive Survey of Data Classification & Prediction Techniques
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
T OP K-O PINION D ECISIONS R ETRIEVAL IN H EALTHCARE S YSTEM
Heart Disease Prediction Using Associative Relational Classification Techniq...
Heart Disease Prediction Using Data Mining Techniques
Ad

Similar to Analysis on Data Mining Techniques for Heart Disease Dataset (20)

PDF
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
PDF
Hypothesis on Different Data Mining Algorithms
PDF
Variance rover system
PDF
Variance rover system web analytics tool using data
PDF
IRJET- Prediction of Heart Disease using RNN Algorithm
PDF
Disease prediction in big data healthcare using extended convolutional neural...
PDF
IRJET- Disease Prediction using Machine Learning
PDF
Heart Disease Prediction Using Data Mining
PDF
HEALTH PREDICTION ANALYSIS USING DATA MINING
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PDF
Health Care Application using Machine Learning and Deep Learning
PDF
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
PDF
Intelligent data analysis for medicinal diagnosis
PDF
IRJET- Heart Disease Prediction and Recommendation
PDF
Correlation of artificial neural network classification and nfrs attribute fi...
PDF
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
PDF
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
PDF
Prediction of Diabetes using Probability Approach
PDF
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
PDF
Ez36937941
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
Hypothesis on Different Data Mining Algorithms
Variance rover system
Variance rover system web analytics tool using data
IRJET- Prediction of Heart Disease using RNN Algorithm
Disease prediction in big data healthcare using extended convolutional neural...
IRJET- Disease Prediction using Machine Learning
Heart Disease Prediction Using Data Mining
HEALTH PREDICTION ANALYSIS USING DATA MINING
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Health Care Application using Machine Learning and Deep Learning
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
Intelligent data analysis for medicinal diagnosis
IRJET- Heart Disease Prediction and Recommendation
Correlation of artificial neural network classification and nfrs attribute fi...
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
Prediction of Diabetes using Probability Approach
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
Ez36937941
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Welding lecture in detail for understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Well-logging-methods_new................
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Sustainable Sites - Green Building Construction
PPT
Project quality management in manufacturing
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Welding lecture in detail for understanding
Foundation to blockchain - A guide to Blockchain Tech
Well-logging-methods_new................
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Internet of Things (IOT) - A guide to understanding
Strings in CPP - Strings in C++ are sequences of characters used to store and...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Arduino robotics embedded978-1-4302-3184-4.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
OOP with Java - Java Introduction (Basics)
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Sustainable Sites - Green Building Construction
Project quality management in manufacturing
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx

Analysis on Data Mining Techniques for Heart Disease Dataset

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 09 | Sep -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1160 Analysis On Data Mining Techniques For Heart Disease Dataset Subhashri.K 1, Arockia Panimalar.S2, Ashwin.S3, Vignesh.P4 1,2 Assistant Professor, Department of BCA & M.Sc SS, Sri Krishna Arts and Science College, Tamilnadu, India 3,4 III BCA, Department of BCA & M.Sc SS, Sri Krishna Arts and Science College, Tamilnadu, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct businessapplications. Classification trees are used to predict membershipofcases or objects in the classes of a categorical dependent variablefrom their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in Data Mining. During my research, I had analyzed the various classification algorithms and compared the performance of classification algorithms on aspects for time taken to build the model, by using different distance function. The result is being tested on data set which is taken from UCI repositories. The aim is to judge the efficiency of different data mining algorithms on Heart Disease dataset and determine the optimum algorithm. 
The performance analysis depends on many factors encompassing validation mode, distance function, different nature of dataset. Key Words: Data Mining, Classification, Classification Techniques, Distance function, KEEL Tool, Performance Analysis. 1. INTRODUCTION The healthcare industry collects huge amounts of healthcare data which, unfortunately are not “mined” to discover hidden information for effective decision making. Discovery of hidden patterns and relationships often goes exploited. Data mining refers to using a variety of techniques to identify suggest of information or decision making knowledge in the database and extracting these in a way that they can put to use in areas such as decision support, prediction ,forecasting and estimation. Discovering relations that connect variables in a database is the subject of data mining. Data mining is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. Data mining technology provides a user-oriented approach to novel and hidden patterns in the data.The discovered knowledge can be used by the healthcare administrators to improve the quality of service and also used by the medical practitioners to reduce the number of adverse drug effect. In information technology, knowledge is one of the most significant assets of any organization. The role of IT in healthcare is well established. Knowledge Management in Health care offers many challenges in creation, dissemination and preservation of health care knowledge using advanced technologies. Pragmatic use of database system, Data Warehousing and Knowledge Management technologies can contribute a lot to decision support systems in health care.Knowledge discovery in databases is well- defined process consisting of several distinct steps. Data mining is the core step, which results in the discovery of hidden but useful knowledge from massive databases. 
Following are some of the important areas of interest where data mining techniques can be of tremendous use in health care management (Gnanadesikan, 1977):
1. Data modelling for health care applications.
2. Executive Information Systems for health care.
3. Forecasting treatment costs and demand for resources.
4. Anticipating patients' future behaviour given their history.
5. Public health informatics.
6. E-governance structures in health care.
7. Health insurance.

2. CLASSIFICATION
Classification is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam". An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value.
Classification Algorithm
A. Decision Tree
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making. Decision tree learning is a
method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables. An example is shown on the right. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data (Kaur and Wasan, 2006).
B. Lazy Learning
In artificial intelligence, lazy learning is a learning method in which generalization beyond the training data is delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries. The main advantage gained in employing a lazy learning method, such as case-based reasoning, is that the target function will be approximated locally, as in the k-nearest neighbour algorithm. Because the target function is approximated locally for each query to the system, lazy learning systems can simultaneously solve multiple problems and deal successfully with changes in the problem domain. The disadvantages of lazy learning include the large space requirement to store the entire training dataset. Particularly noisy training data increases the case base unnecessarily, because no abstraction is made during the training phase.
Selected Distance Function
• Euclidean Distance Function
• HVDM Distance Function
i.
Euclidean Distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.
Definition
The Euclidean distance between points p and q is the length of the line segment connecting them ($\overline{pq}$). In Cartesian coordinates, if $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by:
$$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
ii. Heterogeneous Value Difference Metric
Instance-based learning techniques typically handle continuous and linear input values well, but often do not handle nominal input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between nominal attribute values, but it largely ignores continuous attributes, requiring discretization to map continuous values into nominal values. Three heterogeneous distance functions have been proposed to address this: the Heterogeneous Value Difference Metric (HVDM), the Interpolated Value Difference Metric (IVDM), and the Windowed Value Difference Metric (WVDM). These distance functions are designed to handle applications with nominal attributes, continuous attributes, or both. In the reported experiments on 48 applications, these distance metrics achieved higher classification accuracy on average than three previous distance functions on those datasets that have both nominal and continuous attributes. So HVDM is used, in its standard form:
$$HVDM(x, y) = \sqrt{\sum_{a=1}^{m} d_a^2(x_a, y_a)}$$
where m is the number of attributes and $d_a(x_a, y_a)$ is 1 if either value is unknown, the normalized value difference metric if attribute a is nominal, and $|x_a - y_a| / (4\sigma_a)$ if attribute a is continuous, with $\sigma_a$ the standard deviation of attribute a.

3. RELATED WORK
The clinical and physical diagnosis of Chikungunya viral fever patients and its comparison with dengue viral fever has been proposed.
Table 1: Summary of selected reference with goal
Our project aims to integrate different sources of information and to discover patterns of diagnosis, for
predicting the viral infected patients and their results. The aim is to apply hybrid classification schemes and create data mining tools well suited to the crucial demands of medical diagnostic systems. The approaches in review are diverse in data mining methods (Fathima et al. 2011).
The prototype has been described using data mining techniques, namely Naïve Bayes and WAC (weighted associative classifier). It enables significant knowledge, e.g. patterns and relationships between medical factors related to heart disease, to be established. It can serve as a training tool to train nurses and medical students to diagnose patients with heart disease. It is a web-based, user-friendly system and can be used in hospitals if they have a data warehouse. The models were validated using a Classification Matrix (Sundar et al. 2012).
The proposed work is to predict more accurately the presence of heart disease with a reduced number of attributes. This was carried out using Artificial Neural Network and Decision Tree algorithms and meteorological data collected between 2000 and 2009 from the city of Ibadan, Nigeria. It has been described that the data mining classification techniques RIPPER classifier, Decision Tree, ANNs and SVM are analyzed on a cardiovascular disease dataset. Their analysis shows that, out of these four classification models, SVM predicts cardiovascular disease with the least error rate and highest accuracy (Kumari and Godara 2011).

4. DATASETS AND TOOLS
A. Hardware
We conduct our evaluation on an Intel Pentium P6200 platform which consists of 1 GB memory and a 320 GB hard disk.
B.
Software
In this experiment, we used the KEEL tool on Windows 7 to evaluate the performance of classification algorithms using the time taken to build the model according to the respective number of clusters. KEEL is machine learning/data mining software written in the Java language (distributed under the GNU General Public License). KEEL is a collection of machine learning algorithms for data mining tasks. KEEL contains tools for developing new machine learning schemes. It can be used for Pre-processing, Classification, Clustering, Association and Visualization.
C. Data Set
The input data set is an integral part of a data mining application. The data used in my experiment is either real-world data obtained from the UCI machine learning repository or a widely accepted data set available in the KEEL toolkit. The Heart Disease data set comprises 303 instances and 75 attributes in the area of Health Science, and some of them contain missing values.

5. EXPERIMENTS RESULT AND DISCUSSION
To evaluate the selected tool using the Heart Disease dataset, comparisons are performed in two parts. In the first comparison, I applied the classification algorithms using two distance functions, namely Euclidean Distance and HVDM Distance, with three different pre-processing techniques, namely CHC (Adaptive search for instance selection), GGA-TSS (Generational Genetic Algorithm for Instance Selection) and SGGA-TSS (Steady-State Genetic Algorithm for Instance Selection), and with different validation modes, namely K-Fold cross validation, 5-Fold validation and without validation, to find the more efficient of the two algorithms.
Table 2: The UCI datasets used for the experiments and their properties
In K-fold Cross Validation Mode:
Using the Euclidean Distance function in K-fold cross validation mode, the time taken by the C4.5 Decision Tree algorithm was 44.443 with the CHC pre-processing technique, which is the least compared to the GGA and SGGA pre-processing techniques. The times taken with GGA and SGGA are 106.613 and 89.485 respectively.
When the C4.5 Decision Tree algorithm was applied using the HVDM Distance function, the minimum time taken to build the model in k-fold cross validation mode was 157.604 with the CHC pre-processing technique, compared to 899.583 and 1586.068 with the GGA and SGGA pre-processing techniques. When the test was applied using the KNN technique in k-fold validation mode with Euclidean distance, the time taken with the CHC pre-processing technique is 46.78, while with the GGA and SGGA pre-processing techniques the time taken is 107.458 and 90.752 resp.
In 5-fold Validation Mode:
The C4.5 Decision Tree algorithm is then run in 5-fold validation mode. The time taken by C4.5 using Euclidean distance with the CHC, GGA and SGGA techniques is 72.306, 239.223 and 198.667 resp.
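The timings above were obtained with KEEL. As an illustration only, the following minimal Python sketch shows one way such a measurement could be set up for a k-nearest-neighbour classifier under k-fold validation with a pluggable distance function. It is not the KEEL implementation: the function names (`euclidean`, `knn_predict`, `k_fold_indices`, `timed_cross_validation`), the round-robin fold assignment and the toy data are all assumptions made for this example. Because KNN is a lazy learner, the measured time is the time to classify the held-out instances rather than a separate model-building step.

```python
import math
import time
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train, query, k=3, distance=euclidean):
    """Majority label among the k training rows closest to `query`.

    `train` is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_fold_indices(n, folds):
    """Split indices 0..n-1 into `folds` disjoint test sets (round-robin)."""
    return [list(range(f, n, folds)) for f in range(folds)]


def timed_cross_validation(data, folds=5, k=3, distance=euclidean):
    """Return (accuracy, elapsed_seconds) for k-NN under k-fold validation."""
    start = time.perf_counter()
    correct = 0
    for test_idx in k_fold_indices(len(data), folds):
        held_out = set(test_idx)
        # Train on every row that is not in the current test fold.
        train = [row for i, row in enumerate(data) if i not in held_out]
        for i in test_idx:
            features, label = data[i]
            if knn_predict(train, features, k, distance) == label:
                correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(data), elapsed


# Toy, made-up data (not the Heart Disease dataset): two separated groups.
toy = [((0.0, 0.0), "absent"), ((0.1, 0.2), "absent"), ((0.2, 0.1), "absent"),
       ((0.3, 0.3), "absent"), ((0.2, 0.4), "absent"),
       ((5.0, 5.0), "present"), ((5.1, 4.9), "present"), ((4.8, 5.2), "present"),
       ((5.3, 5.1), "present"), ((4.9, 4.7), "present")]

accuracy, seconds = timed_cross_validation(toy, folds=5, k=3)
print(f"accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

Passing a different callable as `distance` (for example, a heterogeneous metric that also handles nominal attributes) would allow the same loop to compare distance functions, which is the kind of comparison reported in this section.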
Without Fold Mode:
Finally, both techniques were run in without-validation mode. Using HVDM distance, the time taken with the CHC pre-processing technique is 72.306, with the GGA pre-processing technique 239.223, and with the SGGA pre-processing technique 198.667. The time taken by the KNN technique using Euclidean distance with the CHC pre-processing technique is 5.523, and with the GGA and SGGA pre-processing techniques the time taken is 13.79 and 11.419 resp. The running time of each algorithm and distance function is then evaluated in each validation mode.

6. CONCLUSION
Using Euclidean distance, the time taken by C4.5 to build the model in K-fold validation mode is 80.180. In 5-fold validation mode and without validation, C4.5 attains 35.109 and 35.109 respectively. When the algorithm is applied using the HVDM distance function, the minimum time taken by C4.5 to build the model in K-fold validation mode, 5-fold validation mode and without validation is 881.085, 170.065 and 170.065 respectively. Using the KNN algorithm with Euclidean distance, the time taken to build the model in K-fold validation mode is 81.66. In 5-fold validation mode and without validation, KNN lazy learning attains 19.572 and 10.244 respectively. When the algorithm is applied using the HVDM distance function, the minimum time taken by the KNN algorithm to build the model in K-fold validation mode, 5-fold validation mode and without validation is 554.399, 141.261 and 170.065 respectively.
Table 2: Time difference between different Validation modes by using C45 and KNN algorithm
We have analyzed the heart disease dataset by using KEEL tool.
In the KEEL tool, different validation modes are selected to perform the operation on a dataset. Then different pre-processing techniques are used to remove the noise in the dataset. Finally, a classification algorithm is selected and the analysis is performed by comparing the time taken by different algorithms on the dataset. More and more classification algorithms are being made available, so it is worth finding which algorithm performs fastest on the heart disease dataset. Many algorithms have been studied by researchers to find the optimum algorithm. Our focus here is to determine which classification algorithm gives the best result in the least time using the different validation modes available in the tool. This study indicates that Lazy Learning (KNN) is the most efficient algorithm on the heart disease dataset in without-validation mode. We aim to carry out this study on other machine learning classification algorithms, and our focus is on building a predictive system with efficient performance for heart disease prediction.

7. REFERENCES
[1]. Gnanadesikan R (1977) Methods for Statistical Data Analysis of Multivariate Observations, Wiley. ISBN 0-471-30845-5 (p. 83–86).
[2]. Fathima SA, Manimegala D and Hundewale N (2011) A Review of Data Mining Classifications Applied for Diagnosis and Prognosis. International Journal of Computer Science, 6:322-328.
[3]. Sundar AN, Latha PP, Chandra RM (2012) Performance Analysis of Classification Data Mining Techniques Over Heart Disease.
[4]. M. Anbarasi et al. (2010) Enhanced Prediction of Heart Disease with Feature Subset Selection using Genetic Algorithm.
[5]. Olaiya F and Adeyem BA (2012) Application of Data Mining Techniques in Weather Prediction Climate Change Studies. International Journal of Information
Engineering and Electronic Business, 3:51-59.
[6]. Milan Kumari M and Godara S (2011) Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction. International Journal of Computer Science and Technology, 6:304-308.
[7]. Meena K, Subramaniam RK, Gomathy M (2012) Performance Analysis of Gender Clustering and Classification Algorithms. International Journal of Computer Science and Engineering, 5:442-457.
[8]. Kaur H and Wasan KS (2006) Empirical Study on Applications of Data Mining Techniques in Healthcare. International Journal of Computer Science, 4:194-200.
[9]. K. Srinivas et al. (2010) Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks. International Journal on Computer Science and Engineering, 5:250-255.
[10]. Rani M, Singh V and Bhushan B (2013) Performance Evaluation of Classification Techniques Based on Mean Absolute Error. International Journal of Computing and Business Research, 4:1-5.