MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #2
Alex Rogozhnikov, 2015
RECAPITULATION
classification, regression
kNN classifier and regressor
ROC curve, ROC AUC
Given knowledge about the distributions, we can build the optimal classifier
OPTIMAL BAYESIAN CLASSIFIER
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
But the distributions are complex and contain many parameters.
QDA
QDA follows the generative approach.
LOGISTIC REGRESSION
Decision function: $d(x) = \langle w, x \rangle + w_0$
Sharp rule: $\hat{y} = \operatorname{sgn} d(x)$
LOGISTIC REGRESSION
Smooth rule:
$$d(x) = \langle w, x \rangle + w_0$$
$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$
Optimizing weights $w, w_0$ to maximize the log-likelihood:
$$\mathcal{L} = -\frac{1}{N}\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$$
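As an illustration of the smooth rule, here is a minimal sketch using scikit-learn's LogisticRegression; the toy dataset and all settings are assumptions for demonstration, not taken from the lecture.

```python
# Minimal sketch: logistic regression as a smooth rule p_{+1}(x) = sigma(d(x)).
# The toy data and settings are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels

clf = LogisticRegression()
clf.fit(X, y)

d = clf.decision_function(X[:5])             # d(x) = <w, x> + w0
p = clf.predict_proba(X[:5])[:, 1]           # sigma(d(x)) for the positive class
print(d, p)
```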
LOGISTIC LOSS
Loss penalty for a single observation:
$$L(x_i, y_i) = -\ln(p_{y_i}(x_i)) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{+d(x_i)}), & y_i = -1 \end{cases}$$
GRADIENT DESCENT & STOCHASTIC
OPTIMIZATION
Problem: find $w$ to minimize $\mathcal{L}$.
$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$
$\eta$ is the step size (also `shrinkage`, `learning rate`).
STOCHASTIC GRADIENT DESCENT
$$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$$
On each iteration make a step with respect to only one
event:
1. take $i$ — a random event from the training data
2. $w \leftarrow w - \eta \dfrac{\partial L(x_i, y_i)}{\partial w}$
Each iteration is done much faster, but the training process is less stable.
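A from-scratch sketch of stochastic gradient descent for the logistic loss above; the toy data, the fixed learning rate and the number of iterations are illustrative assumptions.

```python
# Minimal sketch: SGD for the logistic loss L = ln(1 + exp(-y * d(x))), y in {-1, +1}.
# Toy data, eta and the iteration count are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000))

w, w0, eta = np.zeros(3), 0.0, 0.1

for _ in range(10000):
    i = rng.randint(len(X))                        # 1. take a random event
    d = X[i] @ w + w0                              # decision function d(x_i)
    dL_dd = -y[i] / (1.0 + np.exp(y[i] * d))       # derivative of the logistic loss w.r.t. d
    w -= eta * dL_dd * X[i]                        # 2. step along the gradient of L(x_i, y_i)
    w0 -= eta * dL_dd
print(w, w0)
```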
POLYNOMIAL DECISION RULE
$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$
This is again a linear model: introduce new features
$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$$
and reuse logistic regression:
$$d(x) = \sum_i w_i z_i$$
We can add $x_0 = 1$ as one more variable to the dataset and forget about the intercept:
$$d(x) = w_0 + \sum_{i=1}^N w_i x_i = \sum_{i=0}^N w_i x_i$$
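A minimal sketch of the same idea with scikit-learn: expand the features explicitly and reuse ordinary logistic regression (the data and the degree are illustrative assumptions).

```python
# Minimal sketch: polynomial decision rule via explicit feature expansion z,
# then plain logistic regression on z. Toy data; degree=2 is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # not linearly separable

# degree=2 adds the constant 1, the x_i and all products x_i * x_j
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```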
PROJECTING INTO A HIGHER-DIMENSIONAL SPACE
SVM with polynomial kernel: visualization
After adding new features, the classes may become separable.
KERNEL TRICK
$P$ is the projection operator (which adds new features):
$$d(x) = \langle w, P(x) \rangle$$
Assume $w = \sum_i \alpha_i P(x_i)$ and look for the optimal $\alpha_i$:
$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle = \sum_i \alpha_i K(x_i, x)$$
We need only the kernel:
$$K(x, y) = \langle P(x), P(y) \rangle$$
KERNEL TRICK
A popular kernel is the Gaussian Radial Basis Function:
$$K(x, y) = e^{-c\,||x - y||^2}$$
It corresponds to a projection into a Hilbert space.
Exercise: find a corresponding projection.
SUPPORT VECTOR MACHINE
SVM selects the decision rule with the maximal possible margin.
HINGE LOSS FUNCTION
SVM uses a different loss function (only the signal losses are compared):
SVM + RBF KERNEL
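A minimal sketch of an SVM with the RBF kernel in scikit-learn (there gamma plays the role of the constant c above); the toy data and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch: SVM with the Gaussian (RBF) kernel K(x, y) = exp(-gamma * ||x - y||^2).
# Toy data and the gamma/C values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy, number of support vectors
```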
OVERFITTING
kNN with k=1 gives ideal classification of the training data.
OVERFITTING
There are two definitions of overfitting, which often
coincide.
DIFFERENCE-OVERFITTING
There is a significant difference in the quality of predictions between train and test.
COMPLEXITY-OVERFITTING
The formula has too high complexity (e.g. too many parameters); increasing the number of parameters leads to lower quality.
MEASURING QUALITY
To get an unbiased estimate, one should test the formula on independent samples (and be sure that no information about these samples was given to the algorithm during training).
In most cases, simply splitting data into train and holdout is
enough.
More approaches in seminar.
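A minimal sketch of the train/holdout recipe with ROC AUC as the quality measure; the dataset and the classifier are illustrative assumptions.

```python
# Minimal sketch: estimate quality on an independent holdout.
# Toy data and the classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

print('train ROC AUC:', roc_auc_score(y_train, clf.decision_function(X_train)))
print('test  ROC AUC:', roc_auc_score(y_test, clf.decision_function(X_test)))
```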
Difference-overfitting is inessential, provided that we measure quality on the holdout (and it is easy to check).
Complexity-overfitting is a problem — we need to test different parameters for optimality (more examples throughout the course).
Don't use distribution comparison to detect overfitting
REGULARIZATION
When the number of weights is high, overfitting is very probable.
Add a regularization term to the loss function:
$$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) + \text{reg} \to \min$$
$L_2$ regularization: $\text{reg} = \alpha \sum_j |w_j|^2$
$L_1$ regularization: $\text{reg} = \beta \sum_j |w_j|$
$L_1 + L_2$ regularization: $\text{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
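A minimal sketch of L2- and L1-regularized logistic regression in scikit-learn, where the strength is controlled via C (roughly the inverse of alpha or beta); the data and values are illustrative assumptions.

```python
# Minimal sketch: L2 vs L1 regularized logistic regression.
# Toy data and the C values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 50))                  # many features: regularization matters
y = (X[:, 0] - X[:, 1] > 0).astype(int)

l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(X, y)
l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X, y)

print('non-zero weights with L2:', int(np.sum(l2.coef_ != 0)))
print('non-zero weights with L1:', int(np.sum(l1.coef_ != 0)))   # L1 encourages sparsity
```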
$L_2$, $L_1$ REGULARIZATIONS
(figure: $L_2$ regularization; $L_1$ (solid) and $L_1 + L_2$ (dashed))
REGULARIZATIONS
$L_1$ regularization encourages sparsity.
$L_p$ REGULARIZATIONS
$$L_p = \sum_i |w_i|^p$$
What is the expression for $L_0$?
$$L_0 = \sum_i [w_i \neq 0]$$
But nobody uses it, nor even $L_p$ with $0 < p < 1$. Why?
Because it is not convex.
LOGISTIC REGRESSION
classifier based on linear decision rule
training is reduced to convex optimization
other decision rules are achieved by adding new features
stochastic optimization is used
can handle > 1000 features, requires regularization
no interaction between features
[ARTIFICIAL] NEURAL NETWORKS
Based on our understanding of natural neural networks
neurons are organized in networks
receptors activate some neurons, those neurons activate other neurons, etc.
connections are made via synapses
STRUCTURE OF ARTIFICIAL FEED-
FORWARD NETWORK
ACTIVATION OF NEURON
Neuron states:
$$n = \begin{cases} 1, & \text{activated} \\ 0, & \text{not activated} \end{cases}$$
Let $n_i$ be the state of the $i$-th neuron and $w_i$ the weight of the connection between the $i$-th neuron and the output neuron:
$$n = \begin{cases} 1, & \sum_i w_i n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
Problem: find the set of weights that minimizes the error on the training dataset (discrete optimization).
SMOOTH ACTIVATIONS:
ONE HIDDEN
LAYER
$$h_i = \sigma\Big(\sum_j w_{ij} x_j\Big), \qquad y_i = \sigma\Big(\sum_j v_{ij} h_j\Big)$$
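A minimal NumPy sketch of the forward pass through one hidden layer with sigmoid activations; the shapes and random weights are illustrative assumptions.

```python
# Minimal sketch: forward pass h = sigma(W x), y = sigma(V h).
# Shapes and weights are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.normal(size=5)           # input features
W = rng.normal(size=(10, 5))     # input -> hidden weights w_ij
V = rng.normal(size=(1, 10))     # hidden -> output weights v_ij

h = sigmoid(W @ x)               # h_i = sigma(sum_j w_ij x_j)
y = sigmoid(V @ h)               # y = sigma(sum_j v_j h_j)
print(h.shape, y)
```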
VISUALIZATION OF NN
NEURAL NETWORKS
Powerful general-purpose algorithm for classification and regression
Non-interpretable formula
Optimization problem is non-convex, has local optima and many parameters
Stochastic optimization speeds up the process and helps not to get caught in a local minimum
Overfitting due to the large number of parameters — $L_1$, $L_2$ regularizations (and other tricks)
x MINUTES BREAK
DEEP LEARNING
The gradient diminishes as the number of hidden layers grows.
Usually 1-2 hidden layers are used.
But modern ANNs for image recognition have 7-15 layers.
CONVOLUTIONAL NEURAL NETWORK
DECISION TREES
Example: predict outside play based on weather conditions.
DECISION TREES: IDEA
DECISION TREE
fast & intuitive prediction
building an optimal decision tree is NP-complete
the tree is built from the root using greedy optimization:
each time we split one leaf, finding the optimal feature and threshold
we need a criterion to select the best splitting (feature, threshold)
SPLITTING CRITERIA
$$\text{TotalImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$
Misclassification $= \min(p, 1 - p)$
Gini $= p(1 - p)$
Entropy $= -p \log p - (1 - p) \log(1 - p)$
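A minimal sketch of the three impurity criteria as functions of the signal fraction p in a leaf.

```python
# Minimal sketch: impurity criteria as functions of the signal fraction p in a leaf.
import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)     # define 0*log(0) = 0 at the boundaries
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.linspace(0, 1, 5)
print(misclassification(p), gini(p), entropy(p))
```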
SPLITTING CRITERIA
Why use Gini or entropy rather than misclassification?
REGRESSION TREE
Greedy optimization (minimizing MSE):
$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$
Can be rewritten as:
$$\text{GlobalMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$
MSE(leaf) is like the 'impurity' of a leaf:
$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
In most cases, regression trees optimize MSE:
$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$
But other options also exist, e.g. MAE:
$$\text{GlobalMAE} \sim \sum_i |y_i - \hat{y}_i|$$
For MAE the optimal value in a leaf is the median, not the mean.
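A minimal sketch of regression trees optimizing MSE and MAE with scikit-learn (the criterion names follow recent scikit-learn versions; data and depth are illustrative assumptions).

```python
# Minimal sketch: regression trees with MSE vs MAE criteria.
# Toy data; criterion names follow recent scikit-learn versions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

mse_tree = DecisionTreeRegressor(criterion='squared_error', max_depth=4).fit(X, y)
mae_tree = DecisionTreeRegressor(criterion='absolute_error', max_depth=4).fit(X, y)
# MSE leaves predict the mean of y in the leaf, MAE leaves predict the median
print(mse_tree.predict(X[:3]), mae_tree.predict(X[:3]))
```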
DECISION TREES INSTABILITY
A small variation in the training dataset produces a different classification rule.
PRE-STOPPING OF DECISION TREE
The tree keeps splitting until each event is correctly classified.
PRE-STOPPING
We can stop the process of splitting by imposing different
restrictions.
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in leaf
more advanced: maximal number of leaves in tree
Any combination of the rules above is possible.
(figure: trees built with no pre-pruning, with limited max_depth, with a minimal number of samples per leaf, and with a maximal number of leaves)
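A minimal sketch of the pre-stopping restrictions as scikit-learn constructor arguments; the particular values and the toy data are illustrative assumptions.

```python
# Minimal sketch: pre-stopping a decision tree via constructor arguments.
# The toy data and the particular limits are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal number of samples needed to split a leaf
    min_samples_leaf=10,    # minimal number of samples in a leaf
    max_leaf_nodes=16,      # maximal number of leaves in the tree
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```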
POST-PRUNING
When the tree is already built, we can try to optimize it to simplify the formula.
Generally this is much slower than pre-stopping.
SUMMARY OF DECISION TREE
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
MISSING VALUES IN DECISION TREES
If the event being predicted lacks $x_1$, we use prior probabilities.
FEATURE IMPORTANCES
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature ≠ quality provided by that feature alone.
FEATURE IMPORTANCES
tree: count the number of splits made over this feature
tree: count the gain in purity (e.g. Gini)
fast and adequate
common recipe: train without one feature, compare quality on the test set with/without that feature
requires many evaluations
common recipe: feature shuffling
take one column of the test dataset and shuffle it; compare quality with/without shuffling (see the sketch below)
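A minimal sketch of the feature-shuffling recipe; the toy data, the classifier and the use of ROC AUC are illustrative assumptions.

```python
# Minimal sketch: shuffle one test column at a time and compare quality.
# Toy data, model and metric are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # feature 2 carries no signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
base = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

for j in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])   # destroy the information in column j
    score = roc_auc_score(y_test, clf.predict_proba(X_shuffled)[:, 1])
    print('feature %d: drop in ROC AUC = %.3f' % (j, base - score))
```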
THE END
Tomorrow: ensembles and boosting