MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #2
Alex Rogozhnikov, 2015
RECAPITULATION
classification, regression
kNN classifier and regressor
ROC curve, ROC AUC
Given knowledge about the distributions, we can build the optimal classifier
OPTIMAL BAYESIAN CLASSIFIER
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
But the distributions are complex and contain many parameters.
QDA
QDA follows the generative approach.
LOGISTIC REGRESSION
Decision function: $d(x) = \langle w, x \rangle + w_0$
Sharp rule: $\hat{y} = \operatorname{sgn} d(x)$
LOGISTIC REGRESSION
Smooth rule:
$$d(x) = \langle w, x \rangle + w_0$$
$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$
Optimizing weights $w, w_0$ to maximize the log-likelihood:
$$\mathcal{L} = -\frac{1}{N}\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$$
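As an illustration of the smooth rule, here is a minimal sketch using scikit-learn's LogisticRegression; the toy dataset and all settings are assumptions for demonstration, not taken from the lecture.

```python
# Minimal sketch: logistic regression as a smooth rule p_{+1}(x) = sigma(d(x)).
# The toy data and settings are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels

clf = LogisticRegression()
clf.fit(X, y)

d = clf.decision_function(X[:5])             # d(x) = <w, x> + w0
p = clf.predict_proba(X[:5])[:, 1]           # sigma(d(x)) for the positive class
print(d, p)
```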
LOGISTIC LOSS
Loss penalty for a single observation:
$$L(x_i, y_i) = -\ln(p_{y_i}(x_i)) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{+d(x_i)}), & y_i = -1 \end{cases}$$
GRADIENT DESCENT & STOCHASTIC
OPTIMIZATION
Problem: find $w$ to minimize $\mathcal{L}$.
$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$
$\eta$ is the step size (also `shrinkage`, `learning rate`).
STOCHASTIC GRADIENT DESCENT
$$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$$
On each iteration make a step with respect to only one
event:
1. take $i$ — a random event from the training data
2. $w \leftarrow w - \eta \dfrac{\partial L(x_i, y_i)}{\partial w}$
Each iteration is done much faster, but the training process is less stable.
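A from-scratch sketch of stochastic gradient descent for the logistic loss above; the toy data, the fixed learning rate and the number of iterations are illustrative assumptions.

```python
# Minimal sketch: SGD for the logistic loss L = ln(1 + exp(-y * d(x))), y in {-1, +1}.
# Toy data, eta and the iteration count are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000))

w, w0, eta = np.zeros(3), 0.0, 0.1

for _ in range(10000):
    i = rng.randint(len(X))                        # 1. take a random event
    d = X[i] @ w + w0                              # decision function d(x_i)
    dL_dd = -y[i] / (1.0 + np.exp(y[i] * d))       # derivative of the logistic loss w.r.t. d
    w -= eta * dL_dd * X[i]                        # 2. step along the gradient of L(x_i, y_i)
    w0 -= eta * dL_dd
print(w, w0)
```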
POLYNOMIAL DECISION RULE
$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$
This is again a linear model: introduce new features
$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$$
and reuse logistic regression:
$$d(x) = \sum_i w_i z_i$$
We can add $x_0 = 1$ as one more variable to the dataset and forget about the intercept:
$$d(x) = w_0 + \sum_{i=1}^N w_i x_i = \sum_{i=0}^N w_i x_i$$
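A minimal sketch of the same idea with scikit-learn: expand the features explicitly and reuse ordinary logistic regression (the data and the degree are illustrative assumptions).

```python
# Minimal sketch: polynomial decision rule via explicit feature expansion z,
# then plain logistic regression on z. Toy data; degree=2 is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # not linearly separable

# degree=2 adds the constant 1, the x_i and all products x_i * x_j
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```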
PROJECTING INTO A HIGHER-DIMENSIONAL SPACE
SVM with polynomial kernel: visualization
After adding new features, the classes may become separable.
KERNEL TRICK
$P$ is the projection operator (which adds new features):
$$d(x) = \langle w, P(x) \rangle$$
Assume $w = \sum_i \alpha_i P(x_i)$ and look for the optimal $\alpha_i$:
$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle = \sum_i \alpha_i K(x_i, x)$$
We need only the kernel:
$$K(x, y) = \langle P(x), P(y) \rangle$$
KERNEL TRICK
A popular kernel is the Gaussian Radial Basis Function:
$$K(x, y) = e^{-c\,||x - y||^2}$$
It corresponds to a projection into a Hilbert space.
Exercise: find a corresponding projection.
SUPPORT VECTOR MACHINE
SVM selects the decision rule with the maximal possible margin.
HINGE LOSS FUNCTION
SVM uses a different loss function (only the signal losses are compared):
SVM + RBF KERNEL
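A minimal sketch of an SVM with the RBF kernel in scikit-learn (there gamma plays the role of the constant c above); the toy data and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch: SVM with the Gaussian (RBF) kernel K(x, y) = exp(-gamma * ||x - y||^2).
# Toy data and the gamma/C values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy, number of support vectors
```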
OVERFITTING
kNN with k=1 gives ideal classification of the training data.
OVERFITTING
There are two definitions of overfitting, which often
coincide.
DIFFERENCE-OVERFITTING
There is a significant difference in the quality of predictions between train and test.
COMPLEXITY-OVERFITTING
The formula has too high complexity (e.g. too many parameters); increasing the number of parameters leads to lower quality.
MEASURING QUALITY
To get an unbiased estimate, one should test the formula on independent samples (and be sure that no information about these samples was given to the algorithm during training).
In most cases, simply splitting data into train and holdout is
enough.
More approaches in seminar.
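A minimal sketch of the train/holdout recipe with ROC AUC as the quality measure; the dataset and the classifier are illustrative assumptions.

```python
# Minimal sketch: estimate quality on an independent holdout.
# Toy data and the classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

print('train ROC AUC:', roc_auc_score(y_train, clf.decision_function(X_train)))
print('test  ROC AUC:', roc_auc_score(y_test, clf.decision_function(X_test)))
```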
Difference-overfitting is inessential, provided that we measure quality on the holdout (and it is easy to check).
Complexity-overfitting is a problem — we need to test different parameters for optimality (more examples throughout the course).
Don't use distribution comparison to detect overfitting
REGULARIZATION
When the number of weights is high, overfitting is very probable.
Add a regularization term to the loss function:
$$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) + \text{reg} \to \min$$
$L_2$ regularization: $\text{reg} = \alpha \sum_j |w_j|^2$
$L_1$ regularization: $\text{reg} = \beta \sum_j |w_j|$
$L_1 + L_2$ regularization: $\text{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
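A minimal sketch of L2- and L1-regularized logistic regression in scikit-learn, where the strength is controlled via C (roughly the inverse of alpha or beta); the data and values are illustrative assumptions.

```python
# Minimal sketch: L2 vs L1 regularized logistic regression.
# Toy data and the C values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 50))                  # many features: regularization matters
y = (X[:, 0] - X[:, 1] > 0).astype(int)

l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(X, y)
l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X, y)

print('non-zero weights with L2:', int(np.sum(l2.coef_ != 0)))
print('non-zero weights with L1:', int(np.sum(l1.coef_ != 0)))   # L1 encourages sparsity
```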
$L_2$, $L_1$ REGULARIZATIONS
(figure: $L_2$ regularization; $L_1$ (solid) and $L_1 + L_2$ (dashed))
REGULARIZATIONS
$L_1$ regularization encourages sparsity.
$L_p$ REGULARIZATIONS
$$L_p = \sum_i |w_i|^p$$
What is the expression for $L_0$?
$$L_0 = \sum_i [w_i \neq 0]$$
But nobody uses it, nor even $L_p$ with $0 < p < 1$. Why?
Because it is not convex.
LOGISTIC REGRESSION
classifier based on linear decision rule
training is reduced to convex optimization
other decision rules are achieved by adding new features
stochastic optimization is used
can handle > 1000 features, requires regularization
no interaction between features
[ARTIFICIAL] NEURAL NETWORKS
Based on our understanding of natural neural networks
neurons are organized in networks
receptors activate some neurons, those neurons activate other neurons, etc.
connections are made via synapses
STRUCTURE OF ARTIFICIAL FEED-
FORWARD NETWORK
ACTIVATION OF NEURON
Neuron states:
$$n = \begin{cases} 1, & \text{activated} \\ 0, & \text{not activated} \end{cases}$$
Let $n_i$ be the state of the $i$-th neuron and $w_i$ the weight of the connection between the $i$-th neuron and the output neuron:
$$n = \begin{cases} 1, & \sum_i w_i n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
Problem: find the set of weights that minimizes the error on the training dataset (discrete optimization).
SMOOTH ACTIVATIONS:
ONE HIDDEN
LAYER
$$h_i = \sigma\Big(\sum_j w_{ij} x_j\Big), \qquad y_i = \sigma\Big(\sum_j v_{ij} h_j\Big)$$
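A minimal NumPy sketch of the forward pass through one hidden layer with sigmoid activations; the shapes and random weights are illustrative assumptions.

```python
# Minimal sketch: forward pass h = sigma(W x), y = sigma(V h).
# Shapes and weights are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.normal(size=5)           # input features
W = rng.normal(size=(10, 5))     # input -> hidden weights w_ij
V = rng.normal(size=(1, 10))     # hidden -> output weights v_ij

h = sigmoid(W @ x)               # h_i = sigma(sum_j w_ij x_j)
y = sigmoid(V @ h)               # y = sigma(sum_j v_j h_j)
print(h.shape, y)
```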
VISUALIZATION OF NN
NEURAL NETWORKS
Powerful general-purpose algorithm for classification and regression
Non-interpretable formula
Optimization problem is non-convex, has local optima and many parameters
Stochastic optimization speeds up the process and helps not to get caught in a local minimum
Overfitting due to the large number of parameters — $L_1$, $L_2$ regularizations (and other tricks)
x MINUTES BREAK
DEEP LEARNING
The gradient diminishes as the number of hidden layers grows.
Usually 1-2 hidden layers are used.
But modern ANNs for image recognition have 7-15 layers.
CONVOLUTIONAL NEURAL NETWORK
DECISION TREES
Example: predict outside play based on weather conditions.
DECISION TREES: IDEA
DECISION TREE
fast & intuitive prediction
building an optimal decision tree is NP-complete
the tree is built from the root using greedy optimization:
each time we split one leaf, finding the optimal feature and threshold
we need a criterion to select the best splitting (feature, threshold)
SPLITTING CRITERIA
$$\text{TotalImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$
Misclassification $= \min(p, 1 - p)$
Gini $= p(1 - p)$
Entropy $= -p \log p - (1 - p) \log(1 - p)$
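A minimal sketch of the three impurity criteria as functions of the signal fraction p in a leaf.

```python
# Minimal sketch: impurity criteria as functions of the signal fraction p in a leaf.
import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)     # define 0*log(0) = 0 at the boundaries
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.linspace(0, 1, 5)
print(misclassification(p), gini(p), entropy(p))
```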
SPLITTING CRITERIA
Why use Gini or entropy rather than misclassification?
REGRESSION TREE
Greedy optimization (minimizing MSE):
$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$
Can be rewritten as:
$$\text{GlobalMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$
MSE(leaf) is like the 'impurity' of a leaf:
$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
In most cases, regression trees optimize MSE:
$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$
But other options also exist, e.g. MAE:
$$\text{GlobalMAE} \sim \sum_i |y_i - \hat{y}_i|$$
For MAE the optimal value in a leaf is the median, not the mean.
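A minimal sketch of regression trees optimizing MSE and MAE with scikit-learn (the criterion names follow recent scikit-learn versions; data and depth are illustrative assumptions).

```python
# Minimal sketch: regression trees with MSE vs MAE criteria.
# Toy data; criterion names follow recent scikit-learn versions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

mse_tree = DecisionTreeRegressor(criterion='squared_error', max_depth=4).fit(X, y)
mae_tree = DecisionTreeRegressor(criterion='absolute_error', max_depth=4).fit(X, y)
# MSE leaves predict the mean of y in the leaf, MAE leaves predict the median
print(mse_tree.predict(X[:3]), mae_tree.predict(X[:3]))
```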
DECISION TREES INSTABILITY
A small variation in the training dataset produces a different classification rule.
PRE-STOPPING OF DECISION TREE
The tree keeps splitting until each event is correctly classified.
PRE-STOPPING
We can stop the process of splitting by imposing different
restrictions.
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in leaf
more advanced: maximal number of leaves in tree
Any combination of the rules above is possible.
(figure: trees built with no pre-pruning, with limited max_depth, with a minimal number of samples per leaf, and with a maximal number of leaves)
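A minimal sketch of the pre-stopping restrictions as scikit-learn constructor arguments; the particular values and the toy data are illustrative assumptions.

```python
# Minimal sketch: pre-stopping a decision tree via constructor arguments.
# The toy data and the particular limits are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal number of samples needed to split a leaf
    min_samples_leaf=10,    # minimal number of samples in a leaf
    max_leaf_nodes=16,      # maximal number of leaves in the tree
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```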
POST-PRUNING
When the tree is already built, we can try to optimize it to simplify the formula.
Generally this is much slower than pre-stopping.
SUMMARY OF DECISION TREE
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
MISSING VALUES IN DECISION TREES
If the event being predicted lacks $x_1$, we use prior probabilities.
FEATURE IMPORTANCES
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature ≠ quality provided by that feature alone.
FEATURE IMPORTANCES
tree: count the number of splits made over this feature
tree: count the gain in purity (e.g. Gini)
fast and adequate
common recipe: train without one feature, compare quality on the test set with/without that feature
requires many evaluations
common recipe: feature shuffling
take one column of the test dataset and shuffle it; compare quality with/without shuffling (see the sketch below)
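A minimal sketch of the feature-shuffling recipe; the toy data, the classifier and the use of ROC AUC are illustrative assumptions.

```python
# Minimal sketch: shuffle one test column at a time and compare quality.
# Toy data, model and metric are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # feature 2 carries no signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
base = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

for j in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])   # destroy the information in column j
    score = roc_auc_score(y_test, clf.predict_proba(X_shuffled)[:, 1])
    print('feature %d: drop in ROC AUC = %.3f' % (j, base - score))
```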
THE END
Tomorrow: ensembles and boosting