Classification Tree
Earning is in Learning
Data science and AI Certification Course
Visit: Learnbay.co
Enriching training and learning session…
§ Training Checklist
– Sitting arrangement: F2F
– Quality over quantity
– Everyone to have their own machine for hands-on practice
– Illuminated and happy, glowing training room (no candle-light-dinner ambience)
– Anyone wanting to step out, feel free
– Feel free to ask for breaks
– Feel free to ask the same question again till you understand
– Let me know if you want me to skip the practice exercises in between the session
– Brief side-talks are okay
– I don’t speak to walls; respect each other
[Diagram: an enriching training balances Involvement, Content and Duration]
Classification Tree
CART
Learning Objectives
§ What is the Classification Technique?
§ CHAID, CART, C4.5 intro
§ Gini Gain computation
§ Why are Classification Tree algorithms recursive?
§ What is pre-pruning and post-pruning in a Classification Tree?
§ What is Loss?
§ What is Validation? What is Cross-Validation?
§ Why should you avoid over-fitting?
§ Performance measures
Analytics that are actually used
What is Classification?
The action or process of classifying something
according to shared qualities or characteristics.
Defining characteristics of each animal classification
§ Mammals – vertebrates (backboned animals) that are warm-blooded, have hair, and are able to move around using limbs
§ Birds – warm-blooded vertebrates with a body covered in feathers, forelimbs modified into wings, scaly legs, a beak and no teeth, bearing their young in hard-shelled eggs
§ Insects – small invertebrate animals that typically have a well-defined head, thorax and abdomen, only three pairs of legs, and typically one or two pairs of wings
§ Amphibians – cold-blooded vertebrates that live on land but breed in water
§ Reptiles – cold-blooded, air-breathing vertebrates with a completely ossified skeleton and a body usually covered with scales or horny plates
§ Fish – limbless, cold-blooded vertebrate animals with gills and fins, living wholly in water
Why Classify?
To Explain (Profile)
Explaining in the classification world is called Profiling
or
To Predict (Classify)
Predicting the class of new records is called Classifying
Win Back Campaign Classification Analysis
[Tree diagram; node types labelled Root Node, Internal Node and Leaf/Terminal Node. Only the figures below are recoverable from the extract.]
Total: Dud 10,000 (100%); W.B. 3,500 (100%); W.B.% 35.0%
Root split on inactivity: <6 Mths (Dud 4,000; W.B. 2,100; W.B.% 52.5%), 6–12 Mths (Dud 2,574; W.B. 921; W.B.% 35.8%), >12 Mths (Dud 3,426; W.B. 479; W.B.% 14.0%)
Deeper splits use LienChrg (>5K / 1K–5K / <1K), AccBalance (< / >= 1000), AccTypeSAL, Gender and CntTxnsLastActiveMth (< / >= 10); the strongest segment shown is LienChrg < 1K with W.B.% 89.8%, the weakest AccBalance < 1000 with W.B.% 12.3%
Legend: Dud = dud accounts (inactive for a long period); W.B. = Win Back
Main issues of classification tree learning
§ Choosing the splitting criterion
– Impurity based criteria
– Information gain
– Statistical measures of association
§ Binary or multiway splits
– Multiway split
– Binary split
§ Finding the right sized tree
– Pre-pruning
– Post-pruning
Popular Classification Techniques
§ CHAID - CHi-squared Automatic Interaction Detector. The “Chi-squared” part of the name arises because the technique essentially involves automatically constructing many cross-tabs and working out the statistical significance of the proportions. The most significant relationships are used to control the structure of a tree diagram
– CHAID is a non-binary decision tree; a recursive partitioning algorithm
– Continuous variables must be grouped into a finite number of bins to create categories
§ CLASSIFICATION AND REGRESSION TREES (CART) are binary decision trees, which split on a single variable at each node
– The CART algorithm recursively goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node
§ C4.5 builds decision trees from a set of training data using the concept of information entropy
CART
CART | Splitting Criteria
§ CART uses the Gini Index as its measure of impurity
§ Gini of a node t:
Gini(t) = 1 - Σ_j [ p(j | t) ]^2
(NOTE: p(j | t) is the relative frequency of class j at node t)
§ Gini of a split node is computed as the weighted average Gini of each child node:
Gini(split) = Σ_i (n_i / n) · Gini(i)
where n_i = number of records at child i and n = total number of records in the parent node
§ Gini Gain = Gini(t) - Gini(split)
www.cs.kent.edu/~jin/DM07/ClassificationDecisionTree.ppt
Gini calculations
Split on Gender:
Root Node: N = 10; T = 4
Gender = M: N = 6; T = 3
Gender = F: N = 4; T = 1
Cust_ID Gender Occupation Age Target
1 M Sal 22 1
2 M Sal 22 0
3 M Self-Emp 23 1
4 M Self-Emp 23 0
5 M Self-Emp 24 1
6 M Self-Emp 24 0
7 F Sal 25 1
8 F Sal 25 0
9 F Sal 26 0
10 F Self-Emp 26 0
Node Gini Computation Formula Gini Index
Overall = 1 - ( (4/10)^2 + (6/10)^2 ) 0.48
Gender = M = 1 - ( (3/6)^2 + (3/6)^2) 0.50
Gender = F = 1 - ( (1/4)^2 + (3/4)^2) 0.375
Gender = (6/10) * 0.5 + (4/10) *0.375 0.45
Gini Gain = Gini (Overall) – Gini (Gender) 0.03
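The same numbers can be checked in a few lines of Python. A minimal sketch (not from the original deck), using the N and T counts from the table above:

# Gini impurity of a two-class node with n_target responders out of n_total records
def gini(n_target, n_total):
    p1 = n_target / n_total
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

g_root = gini(4, 10)                                   # 0.48
g_split = (6/10) * gini(3, 6) + (4/10) * gini(1, 4)    # 0.45 (weighted avg of 0.50 and 0.375)
print(round(g_root - g_split, 2))                      # Gini Gain: 0.03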
Gini calculations
Split on Occupation:
Root Node: N = 10; T = 4
Occupation = Sal: N = 5; T = 2
Occupation = Self-Emp: N = 5; T = 2
Node            Gini Computation Formula               Gini Index
Overall         = 1 - ( (4/10)^2 + (6/10)^2 )          0.48
Occ = Sal       = 1 - ( (2/5)^2 + (3/5)^2 )            0.48
Occ = Self-Emp  = 1 - ( (2/5)^2 + (3/5)^2 )            0.48
Occupation      = (5/10) * 0.48 + (5/10) * 0.48        0.48
Gini Gain       = Gini (Overall) - Gini (Occupation)   0.0

Split on Age (candidate thresholds):
Age            <=22    <=23    <=24    <=25
Gini (Left)    0.5     0.5     0.5     0.5
Gini (Right)   0.47    0.44    0.38    0
Gini Split     0.48    0.47    0.45    0.40
Gini Gain      0.0     0.01    0.03    0.08
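For a continuous variable like Age, CART scans the candidate thresholds and keeps the best one. A sketch (not from the deck) reproducing the table above from the ten records on the previous slide:

ages    = [22, 22, 23, 23, 24, 24, 25, 25, 26, 26]
targets = [ 1,  0,  1,  0,  1,  0,  1,  0,  0,  0]

def gini(labels):                        # Gini impurity of a list of 0/1 labels
    p1 = sum(labels) / len(labels)
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

for cut in (22, 23, 24, 25):
    left  = [t for a, t in zip(ages, targets) if a <= cut]
    right = [t for a, t in zip(ages, targets) if a > cut]
    g_split = (len(left) * gini(left) + len(right) * gini(right)) / len(ages)
    print(cut, round(gini(targets) - g_split, 3))
# Gains: 0.005, 0.013, 0.03, 0.08 -- the table rounds these to 0.0, 0.01, 0.03, 0.08.
# Age <= 25 gives the largest Gini gain, so CART would pick that split.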
Exercise… Compute Gini Gain
Root Node: N = 100; T = 40
Candidate split 1, Gender: M (N = 25; T = 10) vs. F (N = 75; T = 30)
Candidate split 2, Visits > 3: Y vs. N (counts not preserved in the extract)
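A quick self-check for the Gender split (the extract does not preserve the counts for the Visits > 3 split); a sketch reusing the gini helper from earlier, not part of the original deck:

def gini(n_target, n_total):
    p1 = n_target / n_total
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

g_root  = gini(40, 100)
g_split = (25/100) * gini(10, 25) + (75/100) * gini(30, 75)
print(round(g_root - g_split, 2))   # Gini gain for the Gender split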
Sampling…
## Creating Development and Validation samples
import pandas as pd

## If sampling were needed, we could split one file ourselves, e.g.:
## from sklearn.model_selection import train_test_split
## dummy_df = pd.read_csv("/home/utkarsh/Desktop/bank.csv", na_values=['NA'])
## x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

## Separate Dev & Validation samples are provided, so we import them
## directly rather than running sampling code:
ctdf_dev = pd.read_csv("datafile/DEV_SAMPLE.csv")
ctdf_holdout = pd.read_csv("datafile/HOLDOUT_SAMPLE.csv")
Decision Tree code to build CART Tree
## Imports for building and visualising a CART tree in Python
## (the original R deck installed the rpart package here)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import matplotlib.pyplot as plt
from io import StringIO          # sklearn.externals.six is deprecated
from IPython.display import Image
import pydotplus

## Calling the Decision Tree function to build the tree
model_dt = DecisionTreeClassifier(max_depth=8, criterion="gini",
                                  min_samples_split=100, min_samples_leaf=10)
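Once configured, the classifier still has to be fitted. A hypothetical usage sketch, assuming predictor matrices X_train / X_test and target y_train have been prepared from the dev sample:

model_dt.fit(X_train, y_train)                     # grow the tree
pred_class = model_dt.predict(X_test)              # predicted class per record
pred_prob  = model_dt.predict_proba(X_test)[:, 1]  # predicted P(Target = 1)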
Decision Tree control arguments
§ min_samples_split: the minimum number of observations that must exist in a node in order for a split to be attempted.
§ min_samples_leaf: the minimum number of observations in any terminal (leaf) node. (The rule that specifying only one of the two derives the other, as minbucket*3 or minsplit/3, comes from R's rpart; scikit-learn treats the two parameters independently.)
§ max_depth: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
§ criterion: the function to measure the quality of a split; “gini” for Gini impurity, “entropy” for information gain.
Loss, Mis-Classification Error and Response Rate
§ Loss is the number of cases mis-classified in a given node
§ Mis-classification error is the ratio of the total number of cases mis-classified to the total number of cases
– We are interested in the mis-classification error of the full tree
§ Response rate is the ratio of the number of responders (Target = 1) to the total number of cases
– We are interested in finding nodes where the response rate is very high
Root Node: 14,000 obs (Target = 1: 1,235; Target = 0: 12,765)
Split on Holding Period >= 10:
N branch: 9,182 obs (Target = 1: 443; Target = 0: 8,739)
Y branch: 4,818 obs (Target = 1: 792; Target = 0: 4,026), split further on ABC > X:
- 600 obs (Target = 1: 400; Target = 0: 200)
- 4,218 obs (Target = 1: 392; Target = 0: 3,826)
What is the mis-classification error for the above tree?
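One way to read the answer off the tree, assuming each terminal node predicts its majority class (so its loss is its minority count):

# (target=1, target=0) counts in the three terminal nodes above
leaves = [(443, 8739), (400, 200), (392, 3826)]
loss = sum(min(ones, zeros) for ones, zeros in leaves)   # 443 + 200 + 392 = 1035
print(loss / 14000)                                      # mis-classification error ~ 7.4%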
Plotting the Classification Tree
Let us export the output to PDF format to have a clear view of the tree
Concepts | Greedy Algorithm
Make 31 paise using any combination of the coins shown (1, 5, 10 and 25 paise)
Optimal solution with fewest coins: 25 + 5 + 1
What if the 5 paise coin is not there?
Optimal solution with fewest coins: 10 * 3 + 1
Greedy algorithm solution: 25 + 1 * 6
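CART is greedy in exactly this sense: each split is the locally best choice, with no backtracking. A minimal sketch of the coin example (denominations assumed from the arithmetic above):

def greedy_change(amount, coins):
    # repeatedly take the largest coin that still fits -- no backtracking
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked

print(greedy_change(31, [1, 5, 10, 25]))  # [25, 5, 1] -- happens to be optimal
print(greedy_change(31, [1, 10, 25]))     # [25, 1, 1, 1, 1, 1, 1] vs. optimal [10, 10, 10, 1]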
Concepts | Cross Validation
§ Cross-validation is part of the CART algorithm
§ A method to see how well the model performs on unseen data
§ Typically the xval parameter for cross-validation is set to 10; see the scikit-learn sketch after the table
KFoldCV P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
Fold1 Train Train Train Train Train Train Train Train Train Test
Fold2 Train Train Train Train Train Train Train Train Test Train
Fold3 Train Train Train Train Train Train Train Test Train Train
Fold4 Train Train Train Train Train Train Test Train Train Train
Fold5 Train Train Train Train Train Test Train Train Train Train
Fold6 Train Train Train Train Test Train Train Train Train Train
Fold7 Train Train Train Test Train Train Train Train Train Train
Fold8 Train Train Test Train Train Train Train Train Train Train
Fold9 Train Test Train Train Train Train Train Train Train Train
Fold10 Test Train Train Train Train Train Train Train Train Train
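In scikit-learn the analogue of rpart's xval parameter is an explicit cross-validation call. A sketch, assuming the model_dt, X_train and y_train objects from the earlier slides:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model_dt, X_train, y_train, cv=10)  # one accuracy score per fold
print(scores.mean(), scores.std())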
Concepts | Over-fitting
[Chart: Accuracy (0–100%) vs. Tree Size (No. of Nodes, 0–100), one curve each for Training Data and Test Data]
§ If you grow the tree too large you run the risk of over-fitting
§ The classification model may not work well on unseen data
How do we avoid over-fitting?
Stopping rule: don’t expand a node if the impurity reduction of the best split is below some threshold
Pruning: grow a very large tree and merge back nodes
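One hedged way to express the stopping rule in scikit-learn is min_impurity_decrease, which refuses any split whose weighted impurity reduction falls below a threshold:

from sklearn.tree import DecisionTreeClassifier

# the 0.001 threshold is purely illustrative; tune it on validation data
model_stopped = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.001)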
Concepts | Parsimony Principle & Re-substitution Error
§ The parsimony principle is basic to all science and tells us to choose the simplest scientific explanation that fits the evidence
§ Resubstitution error measures what fraction of the cases in a node is classified incorrectly if we assign every case to the majority class in that node; it always favours a large tree
§ To counterbalance the resubstitution error we need a penalty component that favours a smaller tree
Sub-tree node: n = 530; loss = 113; class 0
Splits on SCR < 334 into:
Node 14: n = 122; loss = 10; class 0
Node 15: n = 408; loss = 103; class 0, which splits on Gender in {M, O} into:
Node 30: n = 388; loss = 90; class 0
Node 31: n = 20; loss = 7; class 1
Re (pruned) = 113 / 530
Re (leaves) = (10 + 90 + 7) / 530 = 107 / 530
Cost Component Pruning
§ “Cost-complexity”: a measure of the average error reduced per leaf
§ Calculate the number of errors for each node if it were collapsed to a leaf
§ Compare to the errors in the leaves, taking into account the extra nodes used
(Same sub-tree as on the previous slide: collapsed, it has 1 leaf and loss 113/530; kept, it has 3 leaves and loss 107/530.)
Pruning is break-even when:
Re (pruned) + 1 · alpha = Re (leaves) + 3 · alpha
113/530 + 1 · alpha = 107/530 + 3 · alpha
alpha = (6/530) / 2 ≈ 0.0057
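The same break-even alpha in two lines (numbers from the sub-tree above; 1 leaf if collapsed vs. 3 leaves if kept):

alpha = (113/530 - 107/530) / (3 - 1)   # error saved per extra leaf
print(round(alpha, 4))                  # 0.0057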
Pruning
§ Pruning works on the average cost-complexity reduced per leaf in a decision tree
§ In practice it is a trial-and-error process: keep (or improve) accuracy while the depth of the tree, or the average number of nodes, is reduced, without over-fitting
§ Practically, we first grow a full tree structure and then refine it under certain pre-assumptions to improve the performance and accuracy of the decision tree classifier; see the scikit-learn sketch below the links
https://stats.stackexchange.com/questions/92547/r-rpart-cross-validation-and-1-se-rule-why-is-the-column-in-cptable-called-xst
https://stats.stackexchange.com/questions/13471/how-to-choose-the-number-of-splits-in-rpart
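scikit-learn (0.22+) ships cost-complexity pruning directly. A sketch, again assuming the X_train / y_train and holdout objects from the sampling slide:

from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
# pick the alpha whose tree scores best on the holdout sample,
# e.g. via tree.score(X_holdout, y_holdout)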
Pruned Classification Tree
Model Evaluation
Various measures to assess model performance
§ Error Matrix
§ Gini Coefficient
§ AUC
§ KS
§ Lift Chart
https://www.youtube.com/watch?v=OAl6eAyP-yo
Demo of the Rattle interface to build the model and generate various model evaluation measures
Confusion Matrix… ☺☺☺
Area Under Curve
Sensitivity = True Positive Rate
= True Positives / Total Positives
= a / (a + b)
Specificity = True Negatives / Total Negatives
= d / (c + d)
False Positive Rate = 1 - Specificity

Classification Matrix:
            Predicted Y   Predicted N
Actual Y    a             b
Actual N    c             d
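These measures map directly onto scikit-learn helpers. A sketch assuming y_test plus the pred_class / pred_prob arrays from the earlier fitting sketch:

from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_test, pred_class).ravel()
sensitivity = tp / (tp + fn)              # a / (a + b): true positive rate
specificity = tn / (tn + fp)              # d / (c + d): true negative rate
auc = roc_auc_score(y_test, pred_prob)    # area under the ROC curve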
Thank you!!!