MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #4
Alex Rogozhnikov, 2015
WEIGHTED VOTING
A way to introduce the importance of individual classifiers:
$D(x) = \sum_j \alpha_j d_j(x)$
GENERAL CASE OF ENSEMBLING:
$D(x) = f(d_1(x), d_2(x), \ldots, d_J(x))$
COMPARISON OF DISTRIBUTIONS
good options exist to compare 1d distributions
use the classifier's output to build a discriminating variable
ROC AUC + Mann–Whitney U test to get significance
RANDOM FOREST
composition of independent trees
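A minimal sketch of such a composition in scikit-learn (dataset and parameters are placeholder choices, not from the lecture):

# Random forest: an ensemble of independently grown trees,
# each trained on a bootstrap sample with random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
print(cross_val_score(forest, X, y, scoring='roc_auc', cv=3).mean())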
ADABOOST
After building the $j$-th base classifier:
1. compute its voting weight: $\alpha_j = \frac{1}{2}\ln\frac{w_{\text{correct}}}{w_{\text{wrong}}}$
2. increase the weights of misclassified events: $w_i \leftarrow w_i \times e^{-\alpha_j \, y_i \, d_j(x_i)}$
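A minimal sketch of these two steps (pure NumPy with scikit-learn decision stumps; assumes labels $y_i = \pm 1$, and names are illustrative):

# AdaBoost step: compute alpha from the weighted error,
# then reweight events so misclassified ones count more.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_step(X, y, w, max_depth=1):
    stump = DecisionTreeClassifier(max_depth=max_depth)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)                     # d_j(x_i), values in {-1, +1}
    w_correct = w[pred == y].sum()
    w_wrong = w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / w_wrong)   # classifier voting weight
    w = w * np.exp(-alpha * y * pred)           # increase weight of mistakes
    return stump, alpha, w / w.sum()            # renormalized weights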
GRADIENT BOOSTING
$\mathcal{L} \to \min$

$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x), \qquad D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$

At the $j$-th iteration:
1. compute the pseudo-residuals $z_i = -\left.\frac{\partial \mathcal{L}}{\partial D(x_i)}\right|_{D(x) = D_{j-1}(x)}$
2. train a regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - z_i)^2 \to \min$
3. find the optimal $\alpha_j$
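A minimal from-scratch sketch of this loop with MSE loss, where the pseudo-residuals are simply $z_i = y_i - D_{j-1}(x_i)$ (names and parameters are illustrative, not from the lecture):

# Gradient boosting with MSE loss: each tree fits the residuals
# of the current ensemble, then is added with a shrinkage factor.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gb(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.zeros(len(y))                # D_0(x) = 0
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction               # z_i for MSE loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # regressor minimizing MSE on z
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_gb(trees, X, learning_rate=0.1):
    return learning_rate * sum(tree.predict(X) for tree in trees)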
LOSSES
regression, $y_i \in \mathbb{R}$:
Mean Squared Error: $\sum_i (d(x_i) - y_i)^2$
Mean Absolute Error: $\sum_i |d(x_i) - y_i|$
binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss): $\sum_i e^{-y_i d(x_i)}$
LogLoss: $\sum_i \log(1 + e^{-y_i d(x_i)})$
ranking losses
boosting to uniformity: FlatnessLoss
TUNING GRADIENT BOOSTING OVER
DECISION TREES
Parameters:
loss function
pre-pruning: maximal depth, minimal leaf size
subsample, max_features = $N/3$
$\eta$ (learning_rate)
number of trees
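For reference, these knobs map onto scikit-learn's GradientBoostingClassifier roughly as follows (values are placeholders, not recommendations; the loss name depends on the sklearn version):

# The tuning parameters listed above, as scikit-learn arguments.
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    loss='log_loss',        # loss function (older versions use 'deviance')
    max_depth=4,            # pre-pruning: maximal depth
    min_samples_leaf=50,    # pre-pruning: minimal leaf size
    subsample=0.5,          # fraction of events used per tree
    max_features=0.33,      # ~ N/3 features per split
    learning_rate=0.1,      # eta
    n_estimators=500,       # number of trees
)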
LOSS FUNCTION
different combinations may also be used
TUNING GBDT
1. set a high but feasible number of trees
2. find the optimal parameters by checking combinations
3. decrease the learning rate and increase the number of trees
See also: GBDT tutorial
ENSEMBLES: BAGGING OVER BOOSTING

BaggingClassifier(base_estimator=GradientBoostingClassifier(),
                  n_estimators=100)

Different variations [1], [2] are claimed to outperform a single GBDT.
Training is very expensive; quality is better if the GB estimators are overfitted.
ENSEMBLES: STACKING
Correcting the output of several classifiers with a new classifier:
$D(x) = f(d_1(x), d_2(x), \ldots, d_J(x))$
To obtain unbiased predictions, use a holdout set or k-folding.
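A minimal sketch of stacking with out-of-fold (k-folded) predictions via scikit-learn's cross_val_predict (the choice of base classifiers and data is arbitrary):

# Stacking: out-of-fold predictions of the base classifiers become
# the features of a second-level classifier, keeping inputs unbiased.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
base = [RandomForestClassifier(n_estimators=100), GradientBoostingClassifier()]

meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    for clf in base
])
meta_clf = LogisticRegression().fit(meta_features, y)   # f(d_1(x), ..., d_J(x))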
REWEIGHTING
Given two distributions, target and original, find new weights for the original distribution such that the two distributions coincide.
REWEIGHTING
Solution for 1 or 2 variables: reweight using bins (so-called
'histogram division').
Typical application: Monte-Carlo reweighting.
REWEIGHTING WITH BINS
Split the space into bins; for each bin:
compute $w_{\text{original}}$, $w_{\text{target}}$ — the sums of weights in the bin
multiply the weights of the original distribution in the bin to compensate the difference: $w_i \leftarrow w_i \frac{w_{\text{target}}}{w_{\text{original}}}$

Problems
good in 1d, works sometimes in 2d, nearly impossible in dimensions > 2
too few events in a bin ⇒ the reweighting rule is unstable
we can reweight several times over 1d if the variables aren't correlated
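A minimal 1d sketch of this histogram-division reweighting in pure NumPy (the binning choice is arbitrary):

# Bin reweighting in 1d: per-bin ratio of target to original weight sums.
import numpy as np

def bin_reweight(original, target, original_weight, target_weight, n_bins=20):
    edges = np.linspace(min(original.min(), target.min()),
                        max(original.max(), target.max()), n_bins + 1)
    w_orig, _ = np.histogram(original, bins=edges, weights=original_weight)
    w_tgt, _ = np.histogram(target, bins=edges, weights=target_weight)
    ratio = np.where(w_orig > 0, w_tgt / np.maximum(w_orig, 1e-10), 1.0)
    bin_index = np.clip(np.digitize(original, edges) - 1, 0, n_bins - 1)
    return original_weight * ratio[bin_index]    # w_i <- w_i * w_target / w_original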
reweighting using enhanced version of bin reweighter
reweighting using gradient boosted reweighter
GRADIENT BOOSTED REWEIGHTER
works well in high dimensions
works with correlated variables
produces more stable reweighting rule
HOW DOES THE GB REWEIGHTER WORK?
iteratively:
1. find a tree which is able to discriminate the two distributions
2. correct the weights in each leaf
3. reweight the original distribution and repeat the process
Fewer bins, and each bin is guaranteed to contain high statistics.
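A minimal usage sketch, assuming the hep_ml package's GBReweighter interface (parameter values and the toy data are placeholders):

# Gradient boosted reweighter: iteratively corrects the original weights
# so the multi-dimensional original distribution matches the target.
import numpy as np
from hep_ml.reweight import GBReweighter

rng = np.random.RandomState(0)
original = rng.normal(0.0, 1.0, size=(10000, 2))   # toy original sample
target = rng.normal(0.2, 1.1, size=(10000, 2))     # toy target sample

reweighter = GBReweighter(n_estimators=50, learning_rate=0.1,
                          max_depth=3, min_samples_leaf=1000)
reweighter.fit(original, target)                    # uniform input weights assumed
new_weights = reweighter.predict_weights(original)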
DISCRIMINATING TREE
When looking for a tree with good discrimination, optimize
$\chi^2 = \sum_{l \in \text{leaves}} \frac{(w_{l,\text{target}} - w_{l,\text{original}})^2}{w_{l,\text{target}} + w_{l,\text{original}}}$
As before, the tree is built greedily so that it provides maximal $\chi^2$,
+ introducing all the heuristics from standard GB.
QUALITY ESTIMATION
The quality of a model can be estimated by the value of the optimized loss function, though remember that different classifiers use different optimization targets (e.g. LogLoss vs. ExpLoss).
QUALITY ESTIMATION
different targets of optimization ⇒ use general quality metrics
TESTING HYPOTHESIS WITH CUT
Example: we test the hypothesis that there is a signal channel with fixed $Br \neq 0$ vs. no signal channel ($Br = 0$).
$H_0$: signal channel with $Br = \text{const}$
$H_1$: there is no signal channel
Put a threshold on the classifier output: $d(x) > \text{threshold}$.
Estimate the numbers of signal / background events that will be selected: $s = \alpha \cdot tpr$, $b = \beta \cdot fpr$
$H_0$: $n_{\text{obs}} \sim \text{Poiss}(s + b)$
$H_1$: $n_{\text{obs}} \sim \text{Poiss}(b)$
TESTING HYPOTHESIS WITH CUT
Select some appropriate metric:
$AMS^2 = 2(s + b)\log\left(1 + \frac{s}{b}\right) - 2s$
Maximize it by selecting the best threshold:
a special holdout should be used
or use k-folding
very poor usage of the information from the classifier
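A minimal sketch of scanning thresholds to maximize this metric (pure NumPy; variable names are illustrative):

# Scan classifier thresholds and pick the one maximizing AMS^2,
# with s and b estimated as efficiencies times expected yields.
import numpy as np

def best_threshold(decision, y, alpha, beta, thresholds):
    """alpha, beta: expected numbers of signal / background events."""
    best = (-np.inf, None)
    for thr in thresholds:
        selected = decision > thr
        tpr = np.mean(selected[y == 1])          # signal efficiency
        fpr = np.mean(selected[y == 0])          # background efficiency
        s, b = alpha * tpr, beta * fpr
        if b <= 0:
            continue
        ams2 = 2 * (s + b) * np.log(1 + s / b) - 2 * s
        best = max(best, (ams2, thr))
    return best                                  # (AMS^2, best threshold)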
TESTING HYPOTHESIS WITH BINS
Splitting all data into $n$ bins:
$x \in k\text{-th bin} \iff thr_{k-1} < d(x) \le thr_k$
Expected amounts of signal and background:
$s_k = \alpha\,(tpr_{k-1} - tpr_k), \qquad b_k = \beta\,(fpr_{k-1} - fpr_k)$
$H_0$: $n_k \sim \text{Poiss}(s_k + b_k)$
$H_1$: $n_k \sim \text{Poiss}(b_k)$
Optimal statistic: $\sum_k c_k n_k$, where $c_k = \log\left(1 + \frac{s_k}{b_k}\right)$
Take some approximation to the test power and optimize the thresholds.
FINDING OPTIMAL PARAMETERS
some algorithms have many parameters
not all of them can be guessed
checking all combinations takes too long
FINDING OPTIMAL PARAMETERS
randomly picking parameters is a partial solution
given a target quality metric, we can optimize it, but:
there is no gradient with respect to the parameters
results are noisy
reconstructing the function is a problem
Before running grid optimization, make sure your metric is stable (e.g. by training/testing on different subsets).
Overfitting by making many attempts is a real issue.
OPTIMAL GRID SEARCH
stochastic optimization (Metropolis-Hastings, annealing)
regression techniques, reusing all known information
(ML to optimize ML!)
OPTIMAL GRID SEARCH USING
REGRESSION
General algorithm (a point of the grid = a set of parameters):
1. evaluate quality at several random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate quality at this point
5. go to 2 if not enough evaluations yet
Why not use linear regression?
GAUSSIAN PROCESSES FOR
REGRESSION
Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are the mean and covariance functions: $m(x)$, $K(x_1, x_2)$.
$m(x) = \mathbb{E}\,Y(x)$ represents our prior expectation of quality (may be taken constant).
$K(x_1, x_2) = \mathbb{E}\,Y(x_1)Y(x_2)$ represents the influence of known results on the expectation at new points.
The RBF kernel is the most useful: $K(x_1, x_2) = \exp(-|x_1 - x_2|^2)$.
We can model the posterior distribution of results at each point.
Gaussian Process Demo on Mathematica
Also see: http://www.tmpl.fi/gp/
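A minimal sketch of GP regression over a 1d "parameter" using scikit-learn (kernel, data and the acquisition rule are arbitrary choices):

# Gaussian process regression: fit known (parameter, quality) pairs,
# then predict mean and uncertainty of quality at new parameter values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
params = rng.uniform(0, 10, size=(15, 1))               # already evaluated points
quality = np.sin(params).ravel() + 0.1 * rng.randn(15)  # noisy observed quality

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(params, quality)

grid = np.linspace(0, 10, 200)[:, None]
mean, std = gp.predict(grid, return_std=True)           # posterior at new points
next_point = grid[np.argmax(mean + std)]                # simple acquisition rule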
x MINUTES BREAK
PREDICTION SPEED
Cuts vs classifier
cuts are interpretable
cuts are applied really fast
ML classifiers are applied much slower
SPEED UP WAYS
Method-specific
Logistic regression: sparsity
Neural networks: removing neurons
GBDT: pruning by selecting best trees
Staging of classifiers (a-la triggers)
SPEED UP: LOOKUP TABLES
Split the feature space into bins; the result is a simple piecewise-constant function.
SPEED UP: LOOKUP TABLES
Training:
1. split each variable into bins
2. replace values with index of bin
3. train any classifier on indices of bins
4. create lookup table (evaluate answer for each
combination of bins)
Prediction:
1. convert features to bins' indices
2. take answer from lookup table
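A minimal sketch of this training/prediction procedure (NumPy plus any scikit-learn classifier; bin counts and the classifier are placeholders):

# Lookup-table speedup: train on bin indices, then precompute the
# classifier answer for every combination of bins.
import numpy as np
from itertools import product
from sklearn.ensemble import GradientBoostingClassifier

def fit_lookup(X, y, n_bins=8):
    edges = [np.percentile(X[:, j], np.linspace(0, 100, n_bins + 1)[1:-1])
             for j in range(X.shape[1])]
    indices = np.column_stack([np.digitize(X[:, j], edges[j])
                               for j in range(X.shape[1])])
    clf = GradientBoostingClassifier().fit(indices, y)
    # evaluate the answer for each combination of bins
    grid = np.array(list(product(range(n_bins), repeat=X.shape[1])))
    table = clf.predict_proba(grid)[:, 1].reshape([n_bins] * X.shape[1])
    return edges, table

def predict_lookup(edges, table, X):
    indices = tuple(np.digitize(X[:, j], edges[j]) for j in range(X.shape[1]))
    return table[indices]                      # prediction is just an array lookup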
LOOKUP TABLES
speed is comparable to cuts
allows using any ML model behind it
used in the LHCb topological trigger ('Bonsai BDT')
Problems:
too many combinations when the number of features N > 10 (8 bins in 10 features: $8^{10} \sim 1$ Gb)
finding the optimal thresholds of bins
UNSUPERVISED LEARNING: PCA
[PEARSON, 1901]
PCA finds the axes along which variation is maximal
(based on the principal axis theorem)
PCA DESCRIPTION
PCA is based on the principal axes:
$Q = U \Lambda U^T$
$U$ is an orthogonal matrix, $\Lambda$ is a diagonal matrix,
$\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
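A minimal NumPy sketch of this decomposition applied to a covariance matrix (the data is synthetic):

# PCA via eigendecomposition of the covariance matrix Q = U Lambda U^T.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5) @ rng.randn(5, 5)        # correlated synthetic data
X = X - X.mean(axis=0)                          # center the data

Q = np.cov(X, rowvar=False)                     # covariance matrix
eigenvalues, U = np.linalg.eigh(Q)              # eigh: Q is symmetric
order = np.argsort(eigenvalues)[::-1]           # sort lambda_1 >= lambda_2 >= ...
eigenvalues, U = eigenvalues[order], U[:, order]

projected = X @ U[:, :2]                        # keep two principal components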
PCA: EIGENFACES
Emotion = α[scared] + β[laughs] + γ[angry]+. . .
PCA: CALORIMETER
Electron/jet recognition in ATLAS
AUTOENCODERS
autoencoding: target $y = x$, error $|\hat{y} - y| \to \min$; the hidden layer is smaller than the input
CONNECTION TO PCA
When optimizing MSE:
$\sum_i (\hat{y}_i - x_i)^2$
and the activation is linear:
$h_i = \sum_j w_{ij} x_j, \qquad \hat{y}_i = \sum_j a_{ij} h_j$
the optimal solution of this optimization problem is PCA.
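A minimal sketch illustrating this empirically, assuming scikit-learn (dataset and bottleneck size are arbitrary): a linear autoencoder's reconstruction error approaches that of PCA with the same number of components.

# Linear autoencoder vs. PCA with the same bottleneck size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

X = load_digits().data
X = X - X.mean(axis=0)                      # center, as PCA does

pca = PCA(n_components=8).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))

ae = MLPRegressor(hidden_layer_sizes=(8,), activation='identity',
                  max_iter=2000, random_state=0)
ae.fit(X, X)                                # autoencoding: target = input
X_ae = ae.predict(X)

print('PCA reconstruction MSE:', np.mean((X - X_pca) ** 2))
print('Linear AE reconstruction MSE:', np.mean((X - X_ae) ** 2))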
BOTTLENECK
There is an informational bottleneck ⇒ the algorithm will be eager to pass the most precious information through it.
MANIFOLD LEARNING TECHNIQUES
Preserve neighbourhood/distances.
Digits samples from MNIST
PCA and autoencoder:
PART-OF-SPEECH TAGGING
MARKOV CHAIN
MARKOV CHAIN IN TEXT
TOPIC MODELLING (LDA)
bag of words
a generative model that describes different probabilities $p(x)$
fitting by maximizing the log-likelihood
the parameters of this model are useful
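A minimal sketch of LDA on a bag-of-words representation using scikit-learn (the corpus and sizes are toy placeholders):

# Topic modelling: bag-of-words counts -> LDA fit by (variational)
# likelihood maximization; the fitted topic-word weights are the
# useful parameters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["higgs boson decay channel", "boson signal background separation",
        "gradient boosting decision trees", "boosting trees for classification"]

counts = CountVectorizer().fit_transform(docs)        # bag of words
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

topic_word = lda.components_                           # parameters of the model
doc_topic = lda.transform(counts)                      # per-document topic mixture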
COLLABORATIVE RESEARCH
problem: different environments
problem: data ping-pong
solution: dedicated machine (server) for a team
restart the research = run script or notebook
!ping www.bbc.co.uk
REPRODUCIBLE RESEARCH
readable code
code + description + plots together
keep versions + reviews
Solution: IPython notebook + github
Example notebook
EVEN MORE REPRODUCIBILITY?
Analysis preservation: VM with all needed stuff.
Docker
BIRD'S EYE VIEW OF ML:
Classification
Regression
Ranking
Dimensionality reduction
CLUSTERING
DENSITY ESTIMATION
REPRESENTATION LEARNING
Word2vec finds representations of words.
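A minimal sketch, assuming gensim ≥ 4.0 (the toy corpus is a placeholder):

# Word2vec: learns a dense vector for each word from its contexts.
from gensim.models import Word2Vec

sentences = [["signal", "decay", "channel"],
             ["background", "decay", "channel"],
             ["boosted", "decision", "trees"]]
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, epochs=100)
vector = model.wv["decay"]                    # representation of the word
print(model.wv.most_similar("decay", topn=2))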
THE END