Classification Tree
Earning is in Learning
Data science and AI Certification Course
Visit: Learnbay.co
Enriching training and learning session…
§ Training Checklist
– Sitting arrangement: F2F
– Quality over quantity
– Everyone to have their own machine for hands-on practice
– Illuminated and happy, glowing training room (no candle-light-dinner ambience)
– Anyone wanting to step out, feel free
– Feel free to ask for breaks
– Feel free to ask the same question again till you understand
– Let me know if you want me to skip the practice exercises in between the session
– Brief side-talks are okay
– I don’t speak to walls; respect each other
[Diagram: an enriching training balances Involvement, Content and Duration]
Classification Tree
CART
Learning Objectives
§ What is the Classification Technique?
§ CHAID, CART, C4.5 intro
§ Gini Gain computation
§ Why are Classification Tree algorithms recursive?
§ What is pre-pruning and post-pruning in a Classification Tree?
§ What is Loss?
§ What is Validation? What is Cross-Validation?
§ Why should you avoid over-fitting?
§ Performance measures
Analytics that are actually used
What is Classification?
The action or process of classifying something
according to shared qualities or characteristics.
Defining characteristics of each animal classification
§ Mammals – vertebrates (backboned animals) that are warm-blooded, have hair, and are able to move around using limbs
§ Birds – warm-blooded vertebrates with a body covered in feathers, forelimbs modified into wings, scaly legs, a beak and no teeth, bearing their young in hard-shelled eggs
§ Insects – small invertebrate animals that typically have a well-defined head, thorax and abdomen, only three pairs of legs, and typically one or two pairs of wings
§ Amphibians – cold-blooded vertebrates that live on land but breed in water
§ Reptiles – cold-blooded, air-breathing vertebrates with a completely ossified skeleton and a body usually covered with scales or horny plates
§ Fish – limbless, cold-blooded vertebrate animals with gills and fins, living wholly in water
Why Classify?
To Explain (Profile)
Explaining in the classification world is called Profiling
or
To Predict (Classify)
Predicting the class of new records is called Classifying
Win Back Campaign Classification Analysis
[Tree diagram; node types labelled Root Node, Internal Node and Leaf/Terminal Node. Only the figures below are recoverable from the extract.]
Total: Dud 10,000 (100%); W.B. 3,500 (100%); W.B.% 35.0%
Root split on inactivity: <6 Mths (Dud 4,000; W.B. 2,100; W.B.% 52.5%), 6–12 Mths (Dud 2,574; W.B. 921; W.B.% 35.8%), >12 Mths (Dud 3,426; W.B. 479; W.B.% 14.0%)
Deeper splits use LienChrg (>5K / 1K–5K / <1K), AccBalance (< / >= 1000), AccTypeSAL, Gender and CntTxnsLastActiveMth (< / >= 10); the strongest segment shown is LienChrg < 1K with W.B.% 89.8%, the weakest AccBalance < 1000 with W.B.% 12.3%
Legend: Dud = dud accounts (inactive for a long period); W.B. = Win Back
Main issues of classification tree learning
§ Choosing the splitting criterion
– Impurity based criteria
– Information gain
– Statistical measures of association
§ Binary or multiway splits
– Multiway split
– Binary split
§ Finding the right sized tree
– Pre-pruning
– Post-pruning
Popular Classification Techniques
§ CHAID - CHi-squared Automatic Interaction Detector. The “Chi-squared” part of the name arises because the technique essentially involves automatically constructing many cross-tabs and working out the statistical significance of the proportions. The most significant relationships are used to control the structure of a tree diagram
– CHAID is a non-binary decision tree; a recursive partitioning algorithm
– Continuous variables must be grouped into a finite number of bins to create categories
§ CLASSIFICATION AND REGRESSION TREES (CART) are binary decision trees, which split on a single variable at each node
– The CART algorithm recursively goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node
§ C4.5 builds decision trees from a set of training data using the concept of information entropy
CART
CART | Splitting Criteria
§ CART uses the Gini Index as its measure of impurity
§ Gini of a node t:
Gini(t) = 1 - Σ_j [ p(j | t) ]^2
(NOTE: p(j | t) is the relative frequency of class j at node t)
§ Gini of a split node is computed as the weighted average Gini of each child node:
Gini(split) = Σ_i (n_i / n) · Gini(i)
where n_i = number of records at child i and n = total number of records in the parent node
§ Gini Gain = Gini(t) - Gini(split)
www.cs.kent.edu/~jin/DM07/ClassificationDecisionTree.ppt
Gini calculations
Split on Gender:
Root Node: N = 10; T = 4
Gender = M: N = 6; T = 3
Gender = F: N = 4; T = 1
Cust_ID Gender Occupation Age Target
1 M Sal 22 1
2 M Sal 22 0
3 M Self-Emp 23 1
4 M Self-Emp 23 0
5 M Self-Emp 24 1
6 M Self-Emp 24 0
7 F Sal 25 1
8 F Sal 25 0
9 F Sal 26 0
10 F Self-Emp 26 0
Node Gini Computation Formula Gini Index
Overall = 1 - ( (4/10)^2 + (6/10)^2 ) 0.48
Gender = M = 1 - ( (3/6)^2 + (3/6)^2) 0.50
Gender = F = 1 - ( (1/4)^2 + (3/4)^2) 0.375
Gender = (6/10) * 0.5 + (4/10) *0.375 0.45
Gini Gain = Gini (Overall) – Gini (Gender) 0.03
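The same numbers can be checked in a few lines of Python. A minimal sketch (not from the original deck), using the N and T counts from the table above:

# Gini impurity of a two-class node with n_target responders out of n_total records
def gini(n_target, n_total):
    p1 = n_target / n_total
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

g_root = gini(4, 10)                                   # 0.48
g_split = (6/10) * gini(3, 6) + (4/10) * gini(1, 4)    # 0.45 (weighted avg of 0.50 and 0.375)
print(round(g_root - g_split, 2))                      # Gini Gain: 0.03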
Gini calculations
Split on Occupation:
Root Node: N = 10; T = 4
Occupation = Sal: N = 5; T = 2
Occupation = Self-Emp: N = 5; T = 2
Node            Gini Computation Formula               Gini Index
Overall         = 1 - ( (4/10)^2 + (6/10)^2 )          0.48
Occ = Sal       = 1 - ( (2/5)^2 + (3/5)^2 )            0.48
Occ = Self-Emp  = 1 - ( (2/5)^2 + (3/5)^2 )            0.48
Occupation      = (5/10) * 0.48 + (5/10) * 0.48        0.48
Gini Gain       = Gini (Overall) - Gini (Occupation)   0.0

Split on Age (candidate thresholds):
Age            <=22    <=23    <=24    <=25
Gini (Left)    0.5     0.5     0.5     0.5
Gini (Right)   0.47    0.44    0.38    0
Gini Split     0.48    0.47    0.45    0.40
Gini Gain      0.0     0.01    0.03    0.08
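For a continuous variable like Age, CART scans the candidate thresholds and keeps the best one. A sketch (not from the deck) reproducing the table above from the ten records on the previous slide:

ages    = [22, 22, 23, 23, 24, 24, 25, 25, 26, 26]
targets = [ 1,  0,  1,  0,  1,  0,  1,  0,  0,  0]

def gini(labels):                        # Gini impurity of a list of 0/1 labels
    p1 = sum(labels) / len(labels)
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

for cut in (22, 23, 24, 25):
    left  = [t for a, t in zip(ages, targets) if a <= cut]
    right = [t for a, t in zip(ages, targets) if a > cut]
    g_split = (len(left) * gini(left) + len(right) * gini(right)) / len(ages)
    print(cut, round(gini(targets) - g_split, 3))
# Gains: 0.005, 0.013, 0.03, 0.08 -- the table rounds these to 0.0, 0.01, 0.03, 0.08.
# Age <= 25 gives the largest Gini gain, so CART would pick that split.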
Exercise… Compute Gini Gain
Root Node: N = 100; T = 40
Candidate split 1, Gender: M (N = 25; T = 10) vs. F (N = 75; T = 30)
Candidate split 2, Visits > 3: Y vs. N (counts not preserved in the extract)
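A quick self-check for the Gender split (the extract does not preserve the counts for the Visits > 3 split); a sketch reusing the gini helper from earlier, not part of the original deck:

def gini(n_target, n_total):
    p1 = n_target / n_total
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

g_root  = gini(40, 100)
g_split = (25/100) * gini(10, 25) + (75/100) * gini(30, 75)
print(round(g_root - g_split, 2))   # Gini gain for the Gender split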
Sampling…
## Creating Development and Validation samples
import pandas as pd

## If sampling were needed, we could split one file ourselves, e.g.:
## from sklearn.model_selection import train_test_split
## dummy_df = pd.read_csv("/home/utkarsh/Desktop/bank.csv", na_values=['NA'])
## x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

## Separate Dev & Validation samples are provided, so we import them
## directly rather than running sampling code:
ctdf_dev = pd.read_csv("datafile/DEV_SAMPLE.csv")
ctdf_holdout = pd.read_csv("datafile/HOLDOUT_SAMPLE.csv")
Decision Tree code to build CART Tree
## Imports for building and visualising a CART tree in Python
## (the original R deck installed the rpart package here)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import matplotlib.pyplot as plt
from io import StringIO          # sklearn.externals.six is deprecated
from IPython.display import Image
import pydotplus

## Calling the Decision Tree function to build the tree
model_dt = DecisionTreeClassifier(max_depth=8, criterion="gini",
                                  min_samples_split=100, min_samples_leaf=10)
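Once configured, the classifier still has to be fitted. A hypothetical usage sketch, assuming predictor matrices X_train / X_test and target y_train have been prepared from the dev sample:

model_dt.fit(X_train, y_train)                     # grow the tree
pred_class = model_dt.predict(X_test)              # predicted class per record
pred_prob  = model_dt.predict_proba(X_test)[:, 1]  # predicted P(Target = 1)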
Decision Tree control arguments
§ min_samples_split: the minimum number of observations that must exist in a node in order for a split to be attempted.
§ min_samples_leaf: the minimum number of observations in any terminal (leaf) node. (The rule that specifying only one of the two derives the other, as minbucket*3 or minsplit/3, comes from R's rpart; scikit-learn treats the two parameters independently.)
§ max_depth: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
§ criterion: the function to measure the quality of a split; “gini” for Gini impurity, “entropy” for information gain.
Loss, Mis-Classification Error and Response Rate
§ Loss is the number of cases mis-classified in a given node
§ Mis-classification error is the ratio of the total number of cases mis-classified to the total number of cases
– We are interested in the mis-classification error of the full tree
§ Response rate is the ratio of the number of responders (Target = 1) to the total number of cases
– We are interested in finding nodes where the response rate is very high
Root Node: 14,000 obs (Target = 1: 1,235; Target = 0: 12,765)
Split on Holding Period >= 10:
N branch: 9,182 obs (Target = 1: 443; Target = 0: 8,739)
Y branch: 4,818 obs (Target = 1: 792; Target = 0: 4,026), split further on ABC > X:
- 600 obs (Target = 1: 400; Target = 0: 200)
- 4,218 obs (Target = 1: 392; Target = 0: 3,826)
What is the mis-classification error for the above tree?
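One way to read the answer off the tree, assuming each terminal node predicts its majority class (so its loss is its minority count):

# (target=1, target=0) counts in the three terminal nodes above
leaves = [(443, 8739), (400, 200), (392, 3826)]
loss = sum(min(ones, zeros) for ones, zeros in leaves)   # 443 + 200 + 392 = 1035
print(loss / 14000)                                      # mis-classification error ~ 7.4%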
Plotting the Classification Tree
Let us export the output to PDF format to have a clear view of the tree
Concepts | Greedy Algorithm
Make 31 paise using any combination of the coins shown (1, 5, 10 and 25 paise)
Optimal solution with fewest coins: 25 + 5 + 1
What if the 5 paise coin is not there?
Optimal solution with fewest coins: 10 * 3 + 1
Greedy algorithm solution: 25 + 1 * 6
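CART is greedy in exactly this sense: each split is the locally best choice, with no backtracking. A minimal sketch of the coin example (denominations assumed from the arithmetic above):

def greedy_change(amount, coins):
    # repeatedly take the largest coin that still fits -- no backtracking
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked

print(greedy_change(31, [1, 5, 10, 25]))  # [25, 5, 1] -- happens to be optimal
print(greedy_change(31, [1, 10, 25]))     # [25, 1, 1, 1, 1, 1, 1] vs. optimal [10, 10, 10, 1]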
Concepts | Cross Validation
§ Cross-validation is part of the CART algorithm
§ A method to see how well the model performs on unseen data
§ Typically the xval parameter for cross-validation is set to 10; see the scikit-learn sketch after the table
KFoldCV P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
Fold1 Train Train Train Train Train Train Train Train Train Test
Fold2 Train Train Train Train Train Train Train Train Test Train
Fold3 Train Train Train Train Train Train Train Test Train Train
Fold4 Train Train Train Train Train Train Test Train Train Train
Fold5 Train Train Train Train Train Test Train Train Train Train
Fold6 Train Train Train Train Test Train Train Train Train Train
Fold7 Train Train Train Test Train Train Train Train Train Train
Fold8 Train Train Test Train Train Train Train Train Train Train
Fold9 Train Test Train Train Train Train Train Train Train Train
Fold10 Test Train Train Train Train Train Train Train Train Train
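In scikit-learn the analogue of rpart's xval parameter is an explicit cross-validation call. A sketch, assuming the model_dt, X_train and y_train objects from the earlier slides:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model_dt, X_train, y_train, cv=10)  # one accuracy score per fold
print(scores.mean(), scores.std())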
Concepts | Over-fitting
[Chart: Accuracy (0–100%) vs. Tree Size (No. of Nodes, 0–100), one curve each for Training Data and Test Data]
§ If you grow the tree too large you run the risk of over-fitting
§ The classification model may not work well on unseen data
How do we avoid over-fitting?
Stopping rule: don’t expand a node if the impurity reduction of the best split is below some threshold
Pruning: grow a very large tree and merge back nodes
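One hedged way to express the stopping rule in scikit-learn is min_impurity_decrease, which refuses any split whose weighted impurity reduction falls below a threshold:

from sklearn.tree import DecisionTreeClassifier

# the 0.001 threshold is purely illustrative; tune it on validation data
model_stopped = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.001)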
Concepts | Parsimony Principle & Re-substitution Error
§ The parsimony principle is basic to all science and tells us to choose the simplest scientific explanation that fits the evidence
§ Resubstitution error measures what fraction of the cases in a node is classified incorrectly if we assign every case to the majority class in that node; it always favours a large tree
§ To counterbalance the resubstitution error we need a penalty component that favours a smaller tree
Sub-tree node: n = 530; loss = 113; class 0
Splits on SCR < 334 into:
Node 14: n = 122; loss = 10; class 0
Node 15: n = 408; loss = 103; class 0, which splits on Gender in {M, O} into:
Node 30: n = 388; loss = 90; class 0
Node 31: n = 20; loss = 7; class 1
Re (pruned) = 113 / 530
Re (leaves) = (10 + 90 + 7) / 530 = 107 / 530
Cost Component Pruning
§ “Cost-complexity”: a measure of the average error reduced per leaf
§ Calculate the number of errors for each node if it were collapsed to a leaf
§ Compare to the errors in the leaves, taking into account the extra nodes used
(Same sub-tree as on the previous slide: collapsed, it has 1 leaf and loss 113/530; kept, it has 3 leaves and loss 107/530.)
Pruning is break-even when:
Re (pruned) + 1 · alpha = Re (leaves) + 3 · alpha
113/530 + 1 · alpha = 107/530 + 3 · alpha
alpha = (6/530) / 2 ≈ 0.0057
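The same break-even alpha in two lines (numbers from the sub-tree above; 1 leaf if collapsed vs. 3 leaves if kept):

alpha = (113/530 - 107/530) / (3 - 1)   # error saved per extra leaf
print(round(alpha, 4))                  # 0.0057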
Pruning
§ Pruning works on the average cost-complexity reduced per leaf in a decision tree
§ In practice it is a trial-and-error process: keep (or improve) accuracy while the depth of the tree, or the average number of nodes, is reduced, without over-fitting
§ Practically, we first grow a full tree structure and then refine it under certain pre-assumptions to improve the performance and accuracy of the decision tree classifier; see the scikit-learn sketch below the links
https://stats.stackexchange.com/questions/92547/r-rpart-cross-validation-and-1-se-rule-why-is-the-column-in-cptable-called-xst
https://stats.stackexchange.com/questions/13471/how-to-choose-the-number-of-splits-in-rpart
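scikit-learn (0.22+) ships cost-complexity pruning directly. A sketch, again assuming the X_train / y_train and holdout objects from the sampling slide:

from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
# pick the alpha whose tree scores best on the holdout sample,
# e.g. via tree.score(X_holdout, y_holdout)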
Pruned Classification Tree
Model Evaluation
Various measures to assess model performance
§ Error Matrix
§ Gini Coefficient
§ AUC
§ KS
§ Lift Chart
https://www.youtube.com/watch?v=OAl6eAyP-yo
Demo of the Rattle interface to build the model and generate various model evaluation measures
Confusion Matrix… ☺☺☺
Area Under Curve
Sensitivity = True Positive Rate
= True Positives / Total Positives
= a / (a + b)
Specificity = True Negatives / Total Negatives
= d / (c + d)
False Positive Rate = 1 - Specificity

Classification Matrix:
            Predicted Y   Predicted N
Actual Y    a             b
Actual N    c             d
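These measures map directly onto scikit-learn helpers. A sketch assuming y_test plus the pred_class / pred_prob arrays from the earlier fitting sketch:

from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_test, pred_class).ravel()
sensitivity = tp / (tp + fn)              # a / (a + b): true positive rate
specificity = tn / (tn + fp)              # d / (c + d): true negative rate
auc = roc_auc_score(y_test, pred_prob)    # area under the ROC curve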
Thank you!!!