Decision Tree and Random Forest Classifier
Glucose Age Diabetes
78 26 No
85 31 No
89 21 No
100 32 No
103 33 No
107 31 Yes
110 30 No
115 29 Yes
126 27 No
115 32 Yes
116 31 Yes
118 31 Yes
183 50 Yes
189 59 Yes
197 53 Yes
A few sample observations of the diabetes result, along with glucose and age, are given above. Attempting a decision tree prediction model:

[Diagram: a decision tree built on these samples. The root of the tree splits on Glucose (75 < G < 90 vs. G > 90); deeper branches split on further Glucose ranges (100 <= G <= 110, G > 110, 110 < G < 127, G > 180) and Age ranges (20 < Age <= 31, 30 <= Age < 34, 30 < Age < 34, Age >= 50). The leaves hold the label counts and the majority prediction:
Y-0, N-3 -> Prediction: No (majority)
Y-1, N-3 -> Prediction: No (majority)
Y-4, N-1 -> Prediction: Yes (majority)
Y-3, N-0 -> Prediction: Yes (majority)
The diagram labels an example root of the tree, branch, and leaf.]
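To make the example concrete, here is a minimal sketch of fitting such a tree with scikit-learn on the glucose/age samples above. The DataFrame, the names X and y, and the extra sample point (Glucose=120, Age=35) are illustrative additions, not part of the slides; the learned thresholds may differ from the hand-drawn tree.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 15 sample observations from the table above.
data = pd.DataFrame({
    "Glucose": [78, 85, 89, 100, 103, 107, 110, 115, 126, 115, 116, 118, 183, 189, 197],
    "Age":     [26, 31, 21, 32, 33, 31, 30, 29, 27, 32, 31, 31, 50, 59, 53],
    "Diabetes": ["No", "No", "No", "No", "No", "Yes", "No", "Yes", "No",
                 "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
})
X, y = data[["Glucose", "Age"]], data["Diabetes"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Inspect the learned splits and predict a new, unseen sample.
print(export_text(tree, feature_names=["Glucose", "Age"]))
print(tree.predict(pd.DataFrame({"Glucose": [120], "Age": [35]})))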
 In the last slide's example, the first split divided the data into groups of 12 and 3.
 What if the decision tree creates a split for every single sample?
 Then the model can make accurate predictions on the training samples,
 but it may perform much worse on other (unseen) data.
 It is better to require a minimum number of samples before splitting a branch; the default value is 2.
 We can control the minimum number of samples required to split a node with the “min_samples_split” argument.
[Diagram: the first Glucose split. 75 < G < 90 -> no. of samples = 3; G > 90 -> no. of samples = 12.]
 The number of samples in the final splits (the leaves of the decision tree) is also important for the model.
 The default value for the minimum number of samples in a leaf is 1.
 We can control the minimum number of samples in a leaf with the “min_samples_leaf” argument.
 Entropy is a measure of the impurity, disorder, or uncertainty in a set of examples.
 In a decision tree, it measures the impurity of a split.
 For two labels, entropy ranges from 0 to 1; a value of 1 represents maximum impurity.
Entropy H(S) = -P(Yes) * log2(P(Yes)) - P(No) * log2(P(No))
Possible splits of 6 samples (labelled “Yes” or “No”), with the entropy and Gini of each split, are shown in the table below.
Entropy in case of more than 2 labels: H(S) = -∑ Pi * log2(Pi)
Gini
Gini is another measure of impurity for a decision tree split.
Gini = 1 - (P(Yes)^2 + P(No)^2)
In case of more than 2 labels: Gini = 1 - ∑ (Pi)^2
No. of “Yes”   No. of “No”   P(Yes)   P(No)   Entropy   Gini   Notes
0              6             0        1       0         0      pure split
1              5             0.17     0.83    0.65      0.28
2              4             0.33     0.67    0.92      0.44
3              3             0.5      0.5     1         0.50   maximum impurity
4              2             0.67     0.33    0.92      0.44
5              1             0.83     0.17    0.65      0.28
6              0             1        0       0         0      pure split
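A small sketch in plain Python (no external libraries) that reproduces the entropy and Gini columns of the table above; the function names are illustrative.

from math import log2

def entropy(p_yes, p_no):
    # Convention: 0 * log2(0) = 0, so a pure split has entropy 0.
    return sum(-p * log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes, p_no):
    return 1 - (p_yes ** 2 + p_no ** 2)

for n_yes in range(7):                        # 0..6 "Yes" labels out of 6 samples
    p_yes, p_no = n_yes / 6, 1 - n_yes / 6
    print(n_yes, 6 - n_yes, round(entropy(p_yes, p_no), 2), round(gini(p_yes, p_no), 2))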
Information Gain measures the reduction in entropy (surprise) obtained by splitting a dataset according to a given variable.
Information Gain IG(S, a) = H(S) – H(S | a)
IG(S, a) = information gain for dataset S when splitting on variable a
H(S) = entropy of the dataset before the split
H(S | a) = conditional entropy of the dataset given the variable a
A larger information gain means the split produces a group (or groups) of samples with lower entropy.
Overall entropy = -(8/15)*log2(8/15) – (7/15)*log2(7/15) = 0.996
Feature – Gender:
Entropy of ‘Male’ = -(4/8)*log2(4/8) - (4/8)*log2(4/8) = 1
Entropy of ‘Female’ = -(4/7)*log2(4/7) - (3/7)*log2(3/7) = 0.985
Weighted entropy = (8/15)*1 + (7/15)*0.985 = 0.993
Information Gain for the gender feature = 0.996 – 0.993 = 0.003
Feature – Exercise (computed the same way over the Regular, Irregular and No groups):
Weighted entropy = (7/15)*0.985 + (5/15)*0.722 + (3/15)*0.918 = 0.884
Information Gain for the exercise feature = 0.996 – 0.884 = 0.112
(The data table used for these calculations is shown below; a short sketch reproducing the figures follows it.)
Gender Exercise Diabetes
Male Regular No
Female Irregular No
Male Regular No
Male Regular No
Female No No
Male Irregular Yes
Female No No
Female Regular Yes
Male Regular No
Female Regular Yes
Female Regular Yes
Female Irregular Yes
Male Irregular Yes
Male No Yes
Male Irregular Yes
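A sketch reproducing the information gain figures above, assuming pandas is available; the helper names entropy and information_gain are illustrative, and the exact results differ slightly from the slide because the slide rounds intermediate values.

import pandas as pd
from math import log2

def entropy(series):
    # Shannon entropy of a label column.
    probs = series.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(df, feature, target="Diabetes"):
    # H(S) minus the weighted entropy of the groups created by `feature`.
    before = entropy(df[target])
    weighted = sum(
        len(group) / len(df) * entropy(group[target])
        for _, group in df.groupby(feature)
    )
    return before - weighted

df = pd.DataFrame({
    "Gender":   ["Male", "Female", "Male", "Male", "Female", "Male", "Female", "Female",
                 "Male", "Female", "Female", "Female", "Male", "Male", "Male"],
    "Exercise": ["Regular", "Irregular", "Regular", "Regular", "No", "Irregular", "No",
                 "Regular", "Regular", "Regular", "Regular", "Irregular", "Irregular",
                 "No", "Irregular"],
    "Diabetes": ["No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes",
                 "Yes", "Yes", "Yes", "Yes", "Yes"],
})

print(round(information_gain(df, "Gender"), 3))    # ~0.004 (slide rounds to 0.003)
print(round(information_gain(df, "Exercise"), 3))  # ~0.113 (slide rounds to 0.112)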
A Random Forest model creates many decision trees and combines their outputs.

[Diagram: the full training data (rows 1–17 with features x1–x5 and target y) is drawn into several random subsets of rows and columns (Sample-1, Sample-2, Sample-3, …, Sample-n). Each subset trains its own tree (Decision Tree-1, Decision Tree-2, Decision Tree-3, …, Decision Tree-n), and the individual predictions are combined into one output by MAJORITY vote.]

Creating multiple models and combining their outputs is called Bagging (a rough sketch of this idea follows below).
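A rough sketch of the bagging idea itself: bootstrap samples of the rows, one tree per sample, majority vote at the end. It is for intuition only; the function name bagging_predict, the tree count 25, and the reuse of X, y from the decision-tree sketch are all illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_new, n_trees=25, seed=0):
    # Bagging: train each tree on a bootstrap sample of the rows,
    # then combine the per-tree predictions by majority vote.
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))        # rows sampled with replacement
        tree = DecisionTreeClassifier().fit(X.iloc[idx], y.iloc[idx])
        votes.append(tree.predict(X_new))
    votes = np.array(votes)                               # shape: (n_trees, n_new_samples)
    return [max(set(col), key=list(col).count) for col in votes.T]

# Example call (X, y, and the new-sample DataFrame as in the decision-tree sketch):
# print(bagging_predict(X, y, pd.DataFrame({"Glucose": [120], "Age": [35]})))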
The number of trees in a Random Forest can be set with the “n_estimators” argument; the default is 100 trees.
Random Forest reduces the overfitting of individual decision trees and helps improve accuracy.
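A minimal sketch with scikit-learn's RandomForestClassifier, reusing X, y from the decision-tree sketch above; setting n_estimators explicitly is only for illustration, since 100 is already the default.

from sklearn.ensemble import RandomForestClassifier

# n_estimators=100 is the default, written out here for clarity.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# forest.fit(X, y)                                               # X, y from the earlier sketch
# forest.predict(pd.DataFrame({"Glucose": [120], "Age": [35]}))  # majority vote across the trees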