www.edureka.co/data-science | Edureka’s Data Science Certification Training
Decision Tree
What Will You Learn Today?
1. Machine Learning
2. Classification
3. Types Of Classifiers
4. What Is a Decision Tree?
5. How Does a Decision Tree Work?
6. Demo In R: Diabetes Prevention Use Case
Machine Learning
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without
being explicitly programmed.
[Figure: machine learning cycle with the labels Training Data, Learn, Algorithm, Build Model, Perform and Feedback]
Machine Learning - Example
• Amazon has a huge amount of consumer purchasing data.
• The data consists of consumer demographics (age, sex, location), purchasing history and past browsing history.
• Based on this data, Amazon segments its customers, identifies patterns and recommends the right product to the right customer at the right time.
Classification
Introduction To Classification
• Classification is the problem of identifying which of a set of categories a new observation belongs to.
• It is a supervised learning task: the classifier is given a set of already classified examples and learns from them how to assign categories to new, unseen examples.
• Example: assigning a given email to the "spam" or "non-spam" category.
[Figure: Is this A or B?]
Classification - Example
How do I train my model to tell spam mails from genuine mails?
Feed the classifier a training data set with predefined labels. It will learn to categorize particular data under a specific label.
Inputs shown for the spam example:
• Source IP address
• Phrases in the text
• Subject line
• HTML tags
Classification Use Cases
• Banking: identification of loan risk applicants by their probability of defaulting on payments.
• Medicine: identification of at-risk patients and disease trends.
• Remote sensing: identification of areas of similar land use in a GIS database.
• Marketing: identifying customer churn.
Types Of Classifiers
Decision Tree
• A decision tree builds classification models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets.
Random Forest
• Random forest is an ensemble classifier made using many decision tree models.
• Ensemble models combine the results from different models.
Naïve Bayes
• It is a classification technique based on Bayes' theorem with an assumption of independence among attributes.
What Is a Decision Tree?
• A decision tree uses a tree structure to specify sequences of decisions and consequences.
• A decision tree employs a structure of nodes and branches.
• The depth of a node is the minimum number of steps required to reach the node from the root.
• Eventually, a final point (a leaf node) is reached and a prediction is made.
[Figure: example decision tree. Root node: Gender, with branches Female and Male. Internal nodes at depth 1: Age (branches <=40 and >40) and Income (branches <=45000 and >45000). Leaf nodes: Yes or No.]
Use Case - Credit Risk Detection
• To minimize loss, the bank needs a decision rule to predict whose loan applications to approve.
• An applicant’s demographic (income, debts, credit history) and socio-economic profiles are considered.
• Data science can help banks recognize behavior patterns and provide a complete view of individual customers.
How Does a Decision Tree Work?
Let’s take an example. We have a dataset consisting of:
• Weather information for the last 14 days
• Whether a match was played on each of those days
Using a decision tree, we need to predict whether the game will be played under the following weather conditions:
Outlook = Rain
Humidity = High
Wind = Weak
Play = ?
How Does a Decision Tree Work?
From our data, we first choose one variable, “Outlook”, and see how it affects the variable “Play”.
Day Outlook Humidity Wind Play
D1 Sunny High Weak No
D2 Sunny High Strong No
D3 Overcast High Weak Yes
D4 Rain High Weak Yes
D5 Rain Normal Weak Yes
D6 Rain Normal Strong No
D7 Overcast Normal Strong Yes
D8 Sunny High Weak No
D9 Sunny Normal Weak Yes
D10 Rain Normal Weak Yes
D11 Sunny Normal Strong Yes
D12 Overcast High Strong Yes
D13 Overcast Normal Weak Yes
D14 Rain High Strong No
Overall, Play is 9 Yes and 5 No.
Outlook takes 3 values here: Sunny, Overcast and Rain.
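As a minimal sketch, the table can be typed into R and cross-tabulated to see how Play is distributed within each Outlook value. The data frame name `weather` and the variable names are assumptions made for this illustration; the values are copied from the table above.

```r
# Weather table from the slide, typed in by hand (D1..D14 in order).
weather <- data.frame(
  Day      = paste0("D", 1:14),
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Overall class balance, then the balance inside each Outlook value.
overall    <- table(weather$Play)
by_outlook <- table(weather$Outlook, weather$Play)
print(overall)
print(by_outlook)
```

The cross-tabulation shows that every Overcast day is a Yes, while Sunny and Rain each contain a mix of Yes and No.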
How Does a Decision Tree Work?
We can further divide our data based on Outlook. We will split the data until we get pure subsets at every branch.
Outlook (9 Yes / 5 No)
Sunny: 2 Yes / 3 No. Split further.
Day Outlook Humidity Wind
D1 Sunny High Weak
D2 Sunny High Strong
D8 Sunny High Weak
D9 Sunny Normal Weak
D11 Sunny Normal Strong
Overcast: pure subset (4 Yes / 0 No). Will play.
Day Outlook Humidity Wind
D3 Overcast High Weak
D7 Overcast Normal Strong
D12 Overcast High Strong
D13 Overcast Normal Weak
Rain: 3 Yes / 2 No. Split further.
Day Outlook Humidity Wind
D4 Rain High Weak
D5 Rain Normal Weak
D6 Rain Normal Strong
D10 Rain Normal Weak
D14 Rain High Strong
How Does a Decision Tree Work?
We will use the Humidity column to split the “Sunny” subset further.
Sunny, Humidity = High: pure subset. Will not play.
Day Humidity Wind
D1 High Weak
D2 High Strong
D8 High Weak
Sunny, Humidity = Normal: pure subset. Will play.
Day Humidity Wind
D9 Normal Weak
D11 Normal Strong
Overcast: pure subset. Will play.
Day Outlook Humidity Wind
D3 Overcast High Weak
D7 Overcast Normal Strong
D12 Overcast High Strong
D13 Overcast Normal Weak
Rain: 3 Yes / 2 No. Split further.
Day Outlook Humidity Wind
D4 Rain High Weak
D5 Rain Normal Weak
D6 Rain Normal Strong
D10 Rain Normal Weak
D14 Rain High Strong
How Does a Decision Tree Work?
We will use the Wind column to split the “Rain” subset further.
Sunny, Humidity = High: pure subset. Will not play.
Day Humidity Wind
D1 High Weak
D2 High Strong
D8 High Weak
Sunny, Humidity = Normal: pure subset. Will play.
Day Humidity Wind
D9 Normal Weak
D11 Normal Strong
Overcast: pure subset. Will play.
Day Outlook Humidity Wind
D3 Overcast High Weak
D7 Overcast Normal Strong
D12 Overcast High Strong
D13 Overcast Normal Weak
Rain, Wind = Weak: pure subset. Will play.
Day Humidity Wind
D4 High Weak
D5 Normal Weak
D10 Normal Weak
Rain, Wind = Strong: pure subset. Will not play.
Day Humidity Wind
D6 Normal Strong
D14 High Strong
How Does a Decision Tree Work?
Every branch now ends in a pure subset, so the tree is complete:
Outlook
    Sunny: Humidity
        High: Will not play
        Normal: Will play
    Overcast: Will play
    Rain: Wind
        Weak: Will play
        Strong: Will not play
For the query Outlook = Rain, Humidity = High, Wind = Weak, the tree follows the Rain branch and then Wind = Weak, so it predicts Play = Yes.
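The finished tree can be read as a set of nested rules. The sketch below writes those rules out by hand in R. It is not a fitted model, and the function name `predict_play` is an assumption made for this illustration.

```r
# Hand-written version of the final tree on the slide (not a fitted model).
predict_play <- function(outlook, humidity, wind) {
  if (outlook == "Overcast") {
    "Yes"                                      # pure subset: always played
  } else if (outlook == "Sunny") {
    if (humidity == "Normal") "Yes" else "No"  # Sunny is decided by Humidity
  } else if (outlook == "Rain") {
    if (wind == "Weak") "Yes" else "No"        # Rain is decided by Wind
  } else {
    stop("unknown Outlook value: ", outlook)
  }
}

# The query from the start of the example.
answer <- predict_play(outlook = "Rain", humidity = "High", wind = "Weak")
print(answer)
```

Note that Humidity is never consulted on the Rain branch, which is why the High humidity in the query does not affect the answer.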
How will I know which attribute to take?
I’ll show you how.
Problem – Client Subscription
Consider the case of a bank that wants to market its products to the appropriate customers.
Given the demographics of clients and their reactions to previous campaign phone calls, the bank's goal is to predict
which clients would subscribe.
The attributes are:
• Job
• Marital status
• Education
• Housing
• Loan
• Contact
• Poutcome
How To Choose An Attribute?
• A common way to identify the most informative attribute is to use entropy-based methods.
• These methods select the attribute whose split leaves the purest subsets.
• Entropy (H) can be calculated as
$$H = -\sum_{x} p(x) \log_2 p(x)$$
x = data point (one of the possible outcomes)
p(x) = probability of x
H = entropy
How To Choose An Attribute?
Now, let’s do some mathematics on it.
Let’s say that 1,789 clients out of the total population of 2,000 have not subscribed.
P(subscribed = no) = 1789/2000 = 0.8945
P(subscribed = yes) = 1 - 1789/2000 = 0.1055
$$H_{subscribed} = -0.1055 \cdot \log_2 0.1055 - 0.8945 \cdot \log_2 0.8945 \approx 0.4862$$
• Therefore, the root is only 10.55% pure on the subscribed = yes class.
• Conversely, it is 89.45% pure on the subscribed = no class.
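The arithmetic can be checked with a few lines of base R. This is a minimal sketch; the function name `entropy` and the variable names are assumptions made for this illustration, while the counts (1,789 out of 2,000) come from the slide.

```r
# Shannon entropy (base 2) of a vector of class probabilities.
# Zero probabilities are dropped, following the convention 0 * log2(0) = 0.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

n_total <- 2000
n_no    <- 1789
p_no    <- n_no / n_total   # 0.8945
p_yes   <- 1 - p_no         # 0.1055

h_subscribed <- entropy(c(p_yes, p_no))
print(round(h_subscribed, 4))
```

For comparison, a perfectly balanced yes/no split has entropy 1 and a pure node has entropy 0, so a value near 0.49 reflects a root that is already dominated by one class.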
How To Choose An Attribute?
Conditional entropy is the entropy of the outcome Y within each value of an attribute X, weighted by how often that value occurs:
$$H_{Y|X} = -\sum_{x} P(x) \sum_{y} P(y|x) \log_2 P(y|x)$$
Calculating the conditional entropy for subscribed | contact gives us the following result:
$$H_{subscribed|contact} = 0.4661$$
How To Choose An Attribute?
• The information gain of an attribute A is defined as the difference between the base entropy ($H_S$) and the conditional entropy of the attribute ($H_{S|A}$):
$$InfoGain_A = H_S - H_{S|A}$$
• Attribute poutcome has the highest information gain and is therefore the most informative variable, so poutcome is chosen for the first split of the decision tree.
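The bank data is not reproduced in these slides, so the sketch below applies the same definition to the 14-day weather table from the earlier example instead. The function names (`entropy`, `conditional_entropy`, `info_gain`) and the data frame name `weather` are assumptions made for this illustration.

```r
# Weather table from the earlier slide (D1..D14 in order).
weather <- data.frame(
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Base entropy H_S of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# H_{S|A}: entropy of the target within each attribute value,
# weighted by the share of rows that have that value.
conditional_entropy <- function(attribute, target) {
  weights   <- table(attribute) / length(attribute)
  per_value <- tapply(target, attribute, entropy)
  sum(as.numeric(weights[names(per_value)]) * as.numeric(per_value))
}

# InfoGain_A = H_S - H_{S|A}
info_gain <- function(attribute, target) {
  entropy(target) - conditional_entropy(attribute, target)
}

gains <- sapply(weather[c("Outlook", "Humidity", "Wind")],
                info_gain, target = weather$Play)
print(round(gains, 4))
print(names(which.max(gains)))
```

On this table Outlook has the largest gain (roughly 0.25), which is consistent with the weather example splitting on Outlook first.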
How To Choose An Attribute?
Finally, we get the following decision tree.
[Figure: resulting decision tree. Root node: Poutcome, with branches "Failure, Other, Unknown" and "Success", leading to a No leaf and the internal node Education. Education has branches "Secondary, tertiary" and "Primary, Unknown", leading to the internal node Job and a Yes leaf. Job has branches "Admin, blue-collar, management, technician" and "Self-employed, student, unemployed", leading to No and Yes leaves.]
Decision Tree - Pros And Cons
Demo
What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
Sure! Let me take you through the steps to predict the vulnerable patients.
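Demo
The demo follows five steps:
1. Data Acquisition
2. Divide dataset
3. Implement model
4. Visualize
5. Model Validation
Demo: Data Acquisition
The doctor gets the following data from the patient's medical history.
[Data table not reproduced. Fields that appear later in the demo include glucose_conc, blood_pressure, BMI, Age, Diabetes_pedigree_fn and is_diabetic.]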
Demo: Divide dataset
We will divide our entire dataset into two subsets:
• Training dataset -> to train the model
• Testing dataset -> to validate the model and make predictions
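One common way to make such a split in base R is to sample row indices at random. The sketch below is only an illustration: the data frame `diabet` and its values are synthetic stand-ins, `diabet_train` is a hypothetical name, and the 70/30 ratio is an assumption, since the slides do not state the proportion used. Only `diabet_test` is a name that appears later in the deck.

```r
# Hypothetical stand-in for the patient data, which is not reproduced here.
set.seed(123)  # make the random split repeatable
diabet <- data.frame(
  glucose_conc = round(runif(100, min = 70, max = 200)),
  Age          = sample(21:80, size = 100, replace = TRUE)
)

# Draw 70% of the row indices at random for training; the rest is for testing.
n_train      <- round(0.7 * nrow(diabet))
train_rows   <- sample(seq_len(nrow(diabet)), size = n_train)
diabet_train <- diabet[train_rows, ]
diabet_test  <- diabet[-train_rows, ]

print(c(train = nrow(diabet_train), test = nrow(diabet_test)))
```

Fixing the seed keeps the split reproducible, so the reported accuracy does not change from run to run.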
Demo: Implement model
• Here, we implement the decision tree in R using the following commands.
[Screenshot of the R commands not reproduced.]
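The original commands are not included in this transcript. The sketch below shows one plausible way to fit a classification tree, assuming the `rpart` package; the later `plot`/`text` calls with `use.n` and `pretty` suggest an rpart model, but the slides do not confirm it. The training data and the label rule are synthetic and invented purely for illustration, and `diabet_train` is a hypothetical name. Only `diabet_model` and `is_diabetic` appear elsewhere in the deck.

```r
library(rpart)  # assumption: the deck's plot/text calls suggest an rpart model

# Synthetic training data; the real patient data is not reproduced here.
set.seed(42)
n <- 300
diabet_train <- data.frame(
  glucose_conc = round(runif(n, 70, 200)),
  BMI          = round(runif(n, 18, 45), 1),
  Age          = sample(21:80, n, replace = TRUE)
)

# Invented label rule, used only so the example has a pattern to learn.
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc > 140, "YES", "NO"),
  levels = c("NO", "YES")
)

# Fit a classification tree that predicts is_diabetic from all other columns.
diabet_model <- rpart(is_diabetic ~ ., data = diabet_train, method = "class")
print(diabet_model)
```

With `method = "class"`, rpart builds a classification tree rather than a regression tree, which matches the YES/NO outcome in this demo.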
Demo: Implement model
• The printed model output is not easy to read, so let’s visualize it for better understanding.
[Screenshot of the model output not reproduced.]
Demo: Visualize
For plotting, we can use the following commands:
> plot(diabet_model, margin = 0.1)
> text(diabet_model, use.n = TRUE, pretty = TRUE, cex = 0.6)
[Figure: plotted decision tree. Splits shown include glucose_conc < 154.5, glucose_conc < 131, glucose_conc < 100.5, Diabetes_pedigree_fn < 0.315, blood_pressure >= 72, BMI < 26.35, Age < 30.5 and Age >= 53.5. Each leaf is labeled NO or YES together with its record counts, for example NO 107/3 and YES 9/65.]
Demo: Model Validation
Now we can use our model to predict the outcome for our testing dataset, using the following code:
> pred_diabet <- predict(diabet_model, newdata = diabet_test, type = "class")
> pred_diabet
Demo: Model Validation
In the output for our testing dataset:
• “YES” means the patient is predicted to be vulnerable to diabetes.
• “NO” means the patient is predicted not to be vulnerable to diabetes.
Demo: Model Validation
We can create a confusion matrix for the model using the caret library to see how good our model is:
> library(caret)
> confusionMatrix(table(pred_diabet, diabet_test$is_diabetic))
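To show what the confusion matrix contains, the sketch below builds one with base R's `table` instead of caret and derives the accuracy from it. The `actual` and `predicted` vectors are hypothetical values made up for this illustration; they are not the demo's results.

```r
# Hypothetical predictions and true labels, just to show the mechanics.
actual    <- factor(c("NO", "NO", "YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))
predicted <- factor(c("NO", "YES", "YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))

# Rows are predictions, columns are true labels (same orientation as the caret call).
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy = correctly classified records / all records.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The diagonal of the matrix holds the correctly classified records, so the accuracy is the diagonal sum divided by the total count. Here 8 of the 10 made-up records are correct.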
Demo: Model Validation
Accuracy = 71.13%
The accuracy (or overall success rate) is the fraction of records that the model has classified correctly. A good model should have a high accuracy score.
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!
What our learners have to say about us!
Shravan Reddy says- “I would like to recommend any one who
wants to be a Data Scientist just one place: Edureka. Explanations
are clean, clear, easy to understand. Their support team works
very well.. I took the Data Science course and I'm going to take
Machine Learning with Mahout and then Big Data and Hadoop”.
Gnana Sekhar says - “Edureka Data science course provided me a very
good mixture of theoretical and practical training. LMS pre recorded
sessions and assignments were very good as there is a lot of
information in them that will help me in my job. Edureka is my
teaching GURU now...Thanks EDUREKA.”
Balu Samaga says - “It was a great experience to undergo and get
certified in the Data Science course from Edureka. Quality of the
training materials, assignments, project, support and other
infrastructures are a top notch.”
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Science Training | Edureka

  • 1. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Decision Tree
  • 2. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What Will You Learn Today? ClassificationMachine Learning Types Of Classifiers What Is Decision Tree? How Decision Tree Works? Demo In R: Diabetes Prevention Use Case 1 2 3 4 65
  • 3. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Machine Learning
  • 4. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Introduction To Machine Learning Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Training Data Learn Algorithm Build Model Perform Feedback
  • 5. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Machine Learning - Example  Amazon has huge amount of consumer purchasing data.  The data consists of consumer demographics (age, sex, location), purchasing history, past browsing history.  Based on this data, Amazon segments its customers, draws a pattern and recommends the right product to the right customer at the right time.
  • 6. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Classification
  • 7. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Introduction To Classification  Classification is the problem of identifying to which set of categories a new observation belongs.  It is a supervised learning model as the classifier already has a set of classified examples and from these examples, the classifier learns to assign unseen new examples.  Example: Assigning a given email into "spam" or "non-spam" category. Is this A or B ?
  • 8. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Classification - Example Feed the classifier with training data set and predefined labels. It will learn to categorize particular data under a specific label. How to train my model to identify spam mails from genuine mails? Source IP Address Phrases in the text Subject Line HTML Tags
  • 9. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Classification Use Cases Banking Remote sensing Medicine Banking Identification of loan risk applicants by their probability of defaulting payments. Medicine Identification of at-risk patients and disease trends. Remote sensing Identification of areas of similar land use in a GIS database. Marketing Identifying customer churn. Use-cases Marketing
  • 10. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Types Of Classifiers Decision Tree • Decision tree builds classification models in the form of a tree structure. • It breaks down a dataset into smaller and smaller subsets. • Random Forest is an ensemble classifier made using many decision tree models. • Ensemble models combine the results from different models. Random Forest Naïve Bayes • It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
  • 11. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What is Decision Tree?
  • 12. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What Is Decision Tree?  A decision tree uses a tree structure to specify sequences of decisions and consequences.  A decision tree employs a structure of nodes and branches.  The depth of a node is the minimum number of steps required to reach the node from the root.  Eventually, a final point is reached and a prediction is made. Gender AgeIncome Yes No Yes No Root Node Internal Node Branch NodeDepth=1 Female Male <=40 >40 Leaf Node <=45000 >45000
  • 13. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Use Case - Credit Risk Detection  To minimize loss, the bank needs a decision rule to predict whom to give approval of the loan.  An applicant’s demographic (income, debts, credit history) and socio-economic profiles are considered.  Data science can help banks recognize behavior patterns and provide a complete view of individual customers.
  • 14. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works?
  • 15. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? Let’s take an example, We have taken dataset consisting of: • Weather information of last 14 days • Whether match was played or not on that particular day Now using the decision tree we need to predict whether the game will happen if the weather condition is Outlook = Rain Humidity = High Wind = Weak Play = ?
  • 16. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? From our data, we will choose one variable “Outlook” and will see how it affects the variable “Play” Day Outlook Humidity Wind Play D1 Sunny High Weak No D2 Sunny High Strong No D3 Overcast High Weak Yes D4 Rain High Weak Yes D5 Rain Normal Weak Yes D6 Rain Normal Strong No D7 Overcast Normal Strong Yes D8 Sunny High Weak No D9 Sunny Normal Weak Yes D10 Rain Normal Weak Yes D11 Sunny Normal Strong Yes D12 Overcast High Strong Yes D13 Overcast Normal Weak Yes D14 Rain High Strong No Outlook Play: 9 Yes, 5 No Sunny Overcast Rain There are 3 types of Outlook Here
  • 17. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? We can further divide our data based on Outlook. Outlook Overcast Sunny Rain Day Outlook Humidity Wind D1 Sunny High Weak D2 Sunny High Strong D8 Sunny High Weak D9 Sunny Normal Weak D11 Sunny Normal Strong 2 Yes / 3 No Split further Pure subset Will play 3 Yes / 2 No Split further Day Outlook Humidity Wind D4 Rain High Weak D5 Rain Normal Weak D6 Rain Normal Strong D10 Rain Normal Weak D14 Rain High Strong We will split the data until we get pure subsets at every branch 9 Yes / 5 No Day Outlook Humidity Wind D3 Overcast High Weak D7 Overcast Normal Strong D12 Overcast High Strong D13 Overcast Normal Weak
  • 18. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? We will use Humidity column to split the subset “Sunny” further. Will playWill not play Outlook Overcast Sunny Rain Humidity NormalHigh Day Humidity Wind D1 High Weak D2 High Strong D8 High Weak Pure subset Day Humidity Wind D9 Normal Weak D11 Normal Strong Pure subset 3 Yes / 2 No Split further Day Outlook Humidity Wind D4 Rain High Weak D5 Rain Normal Weak D6 Rain Normal Strong D10 Rain Normal Weak D14 Rain High Strong Pure subset Will play Day Outlook Humidity Wind D3 Overcast High Weak D7 Overcast Normal Strong D12 Overcast High Strong D13 Overcast Normal Weak
  • 19. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? We will use Humidity column to split the subset “Sunny” further. Will playWill not play Outlook Overcast Sunny Rain Humidity NormalHigh Weak Strong Wind Will play Will not play Day Humidity Wind D1 High Weak D2 High Strong D8 High Weak Pure subset Day Humidity Wind D9 Normal Weak D11 Normal Strong Pure subset Day Humidity Wind D4 High Weak D5 Normal Weak D10 Normal Weak Pure subset Day Humidity Wind D6 Normal Strong D14 High Strong Pure subset Pure subset Will play Day Outlook Humidity Wind D3 Overcast High Weak D7 Overcast Normal Strong D12 Overcast High Strong D13 Overcast Normal Weak
  • 20. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How Decision Tree Works? We will use Humidity column to split the subset “Sunny” further. Will playWill not play Outlook Overcast Sunny Rain Humidity NormalHigh Weak Strong Wind Will play Will not play Will play
  • 21. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How will I know which attribute to take? I’ll show you how
  • 22. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Problem – Client Subscription Consider the case of a bank that wants to market its products to the appropriate customers. Given the demographics of clients and their reactions to previous campaign phone calls, the bank's goal is to predict which clients would subscribe. The attributes are: • Job • Marital status • Education • Housing • Loan • Contact • Poutcome
  • 23. www.edureka.co/data-scienceEdureka’s Data Science Certification Training How To Choose An Attribute?  A common way to identify the most informative attribute is to use entropy-based methods.  The entropy methods select the most informative attribute.  Entropy (H) can be calculated as, x = Datapoint p(x) = Probability of x H = Entropy of x
  • 24. How To Choose An Attribute? Now, let’s do some mathematics on it. Let’s say that 1,789 of the total population of 2,000 clients have not subscribed. Then P(subscribed = no) = 1789/2000 = 0.8945 and P(subscribed = yes) = 1 − 1789/2000 = 0.1055. The entropy is $H_{subscribed} = -0.1055 \cdot \log_2 0.1055 - 0.8945 \cdot \log_2 0.8945 \approx 0.4862$. Therefore, the root is only 10.55% pure on the subscribed = yes class. Conversely, it is 89.45% pure on the subscribed = no class.
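The entropy arithmetic on this slide can be checked in a few lines of base R. This is a minimal sketch; the helper name `entropy` is an assumption and does not come from the original demo.

```r
# Shannon entropy (in bits) of a vector of class probabilities.
# `entropy` is a hypothetical helper name used for illustration.
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

p_no  <- 1789 / 2000       # clients who have not subscribed
p_yes <- 1 - p_no          # clients who have subscribed

h_subscribed <- entropy(c(p_yes, p_no))
round(h_subscribed, 4)     # 0.4862, matching the slide
```

A 50/50 class split would give the maximum of 1 bit, so a value near 0.49 reflects how strongly the "no" class dominates.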
  • 25. How To Choose An Attribute? Calculating the conditional entropy of subscribed given contact gives the following result: $H_{subscribed|contact} = 0.4661$.
  • 26. How To Choose An Attribute? The information gain of an attribute A is defined as the difference between the base entropy ($H_S$) and the conditional entropy of the attribute ($H_{S|A}$): $InfoGain_A = H_S - H_{S|A}$. Attribute poutcome has the most information gain and is the most informative variable. Therefore, poutcome is chosen for the first split of the decision tree.
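With the two values quoted on the slides, the gain for contact is 0.4862 − 0.4661 = 0.0201. The bank data itself is not reproduced in this transcript, so the sketch below applies the same definition to the 14-day play dataset from the earlier slides instead. The helper names `entropy_of` and `info_gain` are assumptions, and the Play labels are read off the pure subsets shown on those slides.

```r
# The 14 days (D1..D14) from the "How Decision Tree Works?" slides.
weather <- data.frame(
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

# Entropy (in bits) of a vector of class labels.
entropy_of <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# InfoGain_A = H_S - H_{S|A}, where H_{S|A} is the size-weighted
# average entropy of the subsets produced by splitting on A.
info_gain <- function(data, attribute, target) {
  base        <- entropy_of(data[[target]])
  groups      <- split(data[[target]], data[[attribute]])
  weights     <- sapply(groups, length) / nrow(data)
  conditional <- sum(weights * sapply(groups, entropy_of))
  base - conditional
}

gains <- sapply(c("Outlook", "Humidity", "Wind"),
                function(a) info_gain(weather, a, "Play"))
round(gains, 3)
```

Outlook comes out with the largest gain, which is consistent with it being the root of the play tree on the earlier slides.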
  • 27. How To Choose An Attribute? Finally, we get the following decision tree. Root node: Poutcome, with branches “Failure, Other, Unknown” and “Success”. Internal node: Education, with branches “Secondary, Tertiary” and “Primary, Unknown”. Internal node: Job, with branches “Admin, blue-collar, management, technician” and “Self-employed, student, unemployed”. Each leaf node predicts subscribed = No or Yes.
  • 28. Decision Tree - Pros And Cons
  • 30. What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it? Sure! Let me take you through the steps to predict the vulnerable patients.
  • 31. Demo. Steps: Data Acquisition, Divide dataset, Implement model, Visualize, Model Validation. Data Acquisition: the doctor gets the following data from the medical history of the patient.
  • 32. Demo: Divide dataset. We will divide our entire dataset into two subsets: Training dataset -> to train the model; Testing dataset -> to validate and make predictions.
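A train/test split of this kind can be sketched in base R as follows. The patient data is not included in this transcript, so the data frame below is synthetic and purely illustrative. The column names follow the ones visible in the plotted tree, while `diabet`, `diabet_train` and the 70/30 ratio are assumptions.

```r
# Synthetic stand-in for the patient data (values are made up for illustration).
set.seed(123)
n <- 100
diabet <- data.frame(
  glucose_conc = round(runif(n, 70, 200)),
  BMI          = round(runif(n, 18, 45), 1),
  Age          = sample(21:80, n, replace = TRUE)
)
diabet$is_diabetic <- factor(ifelse(diabet$glucose_conc > 140, "YES", "NO"))

# Hold out 30% of the rows for testing (the 70/30 ratio is an assumption).
train_idx    <- sample(seq_len(n), size = floor(0.7 * n))
diabet_train <- diabet[train_idx, ]    # used to train the model
diabet_test  <- diabet[-train_idx, ]   # used to validate and make predictions

dim(diabet_train)
dim(diabet_test)
```

Sampling row indices at random, rather than taking the first 70 rows, guards against any ordering in the source file leaking into the split.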
  • 33. Demo: Implement model. Here, we implement the decision tree in R using the following commands.
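The commands themselves did not survive in this transcript. The later `plot`, `text` and `predict` calls are consistent with a tree fitted by the `rpart` package, so a fit along these lines is a reasonable guess rather than the original code. The training data here is synthetic, and the name `diabet_train` is an assumption.

```r
library(rpart)  # recommended package shipped with standard R installations

# Synthetic stand-in for the training data (values are made up for illustration).
set.seed(42)
n <- 200
diabet_train <- data.frame(
  glucose_conc = round(runif(n, 70, 200)),
  BMI          = round(runif(n, 18, 45), 1),
  Age          = sample(21:80, n, replace = TRUE)
)
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc > 140, "YES", "NO"),
  levels = c("NO", "YES")
)

# Fit a classification tree that predicts is_diabetic from all other columns.
diabet_model <- rpart(is_diabetic ~ ., data = diabet_train, method = "class")

# Text listing of the fitted splits and leaves.
print(diabet_model)
```

Because the synthetic label depends only on `glucose_conc`, the fitted tree needs a single split. Real patient data would produce a deeper tree like the one plotted in the deck.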
  • 34. Demo: Implement model. We get the following output, but it is not easy to read, so let’s visualize it for better understanding.
  • 35. Demo: Visualize. For plotting we can use the following commands: > plot(diabet_model, margin = 0.1) > text(diabet_model, use.n = TRUE, pretty = TRUE, cex = 0.6) [Plot: decision tree with splits on glucose_conc (< 154.5, < 131, < 100.5), Diabetes_pedigree_fn (< 0.315), blood_pressure (>= 72), BMI (< 26.35) and Age (< 30.5, >= 53.5). Leaves with class counts: NO 68/18, NO 12/3, YES 5/11, NO 107/3, NO 6/4, YES 9/65, NO 93/13, NO 5/2, YES 13/39, NO 35/18.]
  • 36. Demo: Model Validation. Now, we can use our model to predict the output for our testing dataset. We can use the following code for predicting the output: > pred_diabet <- predict(diabet_model, newdata = diabet_test, type = "class") > pred_diabet
  • 37. Demo: Model Validation. We get the following output for our testing dataset, where “YES” means the model predicts that the patient is vulnerable to diabetes and “NO” means the model predicts that the patient is not.
  • 38. Demo: Model Validation. We can create a confusion matrix for the model using the caret library to see how good our model is: > library(caret) > confusionMatrix(table(pred_diabet, diabet_test$is_diabetic))
  • 39. Demo: Model Validation. Accuracy = 71.13%. The accuracy (or overall success rate) is the fraction of records that the model has classified correctly. A good model should have a high accuracy score.
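Accuracy can also be read directly off a confusion matrix without any extra package. The sketch below uses base R only. The two label vectors are invented for illustration and are not the demo's predictions, so the resulting figure is not the 71.13% reported above.

```r
# Hypothetical true and predicted labels for ten test patients (illustrative only).
actual    <- factor(c("NO", "NO", "YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))
predicted <- factor(c("NO", "YES", "YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))

# Rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

# Correct predictions sit on the diagonal of the matrix.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

On imbalanced data, accuracy alone can look good even when the minority class is mostly missed, so the off-diagonal cells of the matrix are worth reading as well.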

Editor's Notes