Machine Learning basics
[reused from LDL for KSS]
David Samu
14 Feb 2020
Knowledge Sharing Session
Caveat
These are slides reused from Learning Deep Learning series
New presentation on Deep Learning for Land Cover Segmentation is on the
way :)
Table of Contents
1. Intro: ML Basic concepts
2. ML from a Statistical Learning perspective
3. ML from a Probabilistic perspective
4. ML algorithms in practice
5. Outlook: Challenges for classical ML algorithms
Intro: Machine Learning basic
concepts
What is Machine Learning?
● A machine learning algorithm is an
algorithm that is able to learn from data.
● Machine learning is essentially a form of
applied statistics with increased emphasis
on the use of computers to statistically
estimate complicated functions and a
decreased emphasis on proving
confidence intervals around these
functions.
● Deep learning is a specific kind of machine
learning.
A deluge of options… How to select the (a) right one?
Scikit-Learn classifier comparison (source: scikit-learn.org)
Anatomy of a Machine Learning algorithm
● T: task
● P: performance measure
● E: experience
“A computer program is said to learn from
experience E with respect to some task T and
performance measure P, if its performance at
task T, as measured by P, improves with
experience E.”
Typical main ML/DL components in practice:
● Dataset
● Model
● Objective function
● Optimization algorithm
It’s useful to think of these as independent
components of an ML system!
Common ML methods we are already using
● Classification
○ which of k categories does the input belong to?
○ e.g. tree / no tree, which of n species?
● Regression
○ predict numerical value given some input
○ e.g. age / height / density of tree?
● Structured output
○ predict multiple interrelated variables
○ e.g. pixel-wise segmentation into object
categories
● Anomaly detection
○ flag / predict unusual or atypical events
○ pipeline leak, natural disasters
● Denoising
○ predict clean data from corrupted data
○ SAR denoising
Tang et al, 2018
Other ML tasks we could apply in the future
● Transcription, translation
● Synthesis and sampling
● Imputing missing values
● Density estimation
● ...
For what tasks can we use
these techniques in the future?
E.g. density estimation (GANs)
for missing / corrupted value
imputation: fill in cloud covered
areas using other parts of the
image, timestamps, sensors, ...
Experience (aka the Dataset)
● Supervised learning
○ Regression, classification, …
● Unsupervised learning
○ Learn interesting properties, e.g. clustering
○ Learn entire generative prob. distr.
● Reinforcement learning
○ Models interacting with the environment
● (Energy-based, Adversarial, …)
Although these terms are useful in practice, they
are not formally defined and not always distinct!
They can often be interchanged and combined:
e.g. density estimation to support classification.
Karczewski et al 2014
Example: solving the linear regression problem
● Is this an iterative
optimization
process?
● Why didn’t we
need to use
gradient descent
or some other
method?
● Why would this
not work for deep
networks?
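A minimal sketch of why linear regression needs no iterative optimization: with a squared-error objective the optimum has a closed form, the normal equations. The data and variable names here are illustrative, not from the slide.

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * x[:, 0] + 1.0 + 0.1 * rng.normal(size=50)

# Design matrix with a bias column
X = np.hstack([x, np.ones((x.shape[0], 1))])

# Normal equations: w = (X^T X)^{-1} X^T y -- the closed-form least-squares solution
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [2.0, 1.0]
```

Deep networks have no such closed-form solution, because the loss is non-convex in the weights; that is why iterative methods such as gradient descent are needed there.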
Some thoughts before diving into ML concepts
● Central challenge of statistical learning, ML, DL:
tame complexity and uncertainty
● Many tools and options, some guarantees, lots
of uncertainty, no “best” answer
● Model design is a combination of science and
art (+ engineering in practice)
● A spectrum where we aim towards the left side,
but often find ourselves on the right:
○ deterministic ←→ probabilistic
○ formal ←→ heuristic
○ deductive ←→ inductive
○ analytic ←→ numeric
Situation we want to avoid by understanding through theory
Machine Learning from a
statistical learning
perspective
Capacity, Overfitting, Underfitting
Central challenge in ML: generalization to
unseen input!
Training error ←→ Test error
Test set should be collected separately from
training set!
pdata: the data-generating distribution
Examples are assumed to be i.i.d. draws from pdata for the mathematical analysis
Underfitting ← Capacity → Overfitting
Model’s hypothesis space ~ model capacity
Figure: ground truth f(x) = sin(x), n = 10 noisy data points, M = degree of the polynomial fit
Bishop: Pattern Recognition and Machine Learning, 2006
Capacity, Overfitting, Underfitting (cont’d)
Challenge: find model complexity that fits the
task and the available data!
The model’s effective capacity is usually lower
than its representational capacity, constrained by
insufficient / biased data and by the optimization process
Occam’s razor / principle of parsimony: from
similarly performing models, choose the simpler
one!
No Free Lunch Theorem: averaged over all possible
data-generating distributions, no ML algorithm is
universally better than any other… The modeller
has to find and design an appropriate model for
the task & data at hand!
(Cf. with figure on prev slide)
Bishop: Pattern Recognition and Machine Learning, 2006
Regularization
“Preferences” for certain kinds of solutions
L2 regularization, a.k.a. weight decay
Trade-off between fitting training data and
small weights.
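As a reminder (a standard formulation, reconstructing the objective that appears only as an image on the slide), weight decay adds a squared-norm penalty to the training objective:

$$\tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \lambda\, \mathbf{w}^\top \mathbf{w}$$

where λ controls how strongly small weights are preferred over a good fit to the training data.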
In general, regularization aims to decrease
the generalization error at the expense of
training error.
Central concern in ML, next to optimization
(Chapters 7 and 8)
(Cf. with figure on prev slide)
Bishop: Pattern Recognition and Machine Learning, 2006
Hyperparameters, validation set
Settings of the model / learning algorithm that are
not directly optimized by the learning procedure itself
Difficult or not appropriate to optimize on the
training set, e.g. model capacity or the regularization weight λ
Validated against a validation set of examples,
before using test set to measure generalization
error.
If used repeatedly, test sets can introduce bias
into model selection! We need to update /
extend / replace our test sets regularly!
1. Training set: find model weights w for fixed HP settings
2. Validation set: tune HPs
3. Test set: measure generalization error
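A minimal scikit-learn sketch of this three-way protocol; the dataset, the Ridge model, and the hyperparameter grid are illustrative placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# 1. Split off a held-out test set, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# 2. Tune the hyperparameter (here: ridge penalty alpha) on the validation set
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# 3. Measure generalization error once, on the untouched test set
final = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print(best_alpha, final.score(X_test, y_test))
```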
Bias and variance
Bias: bias(θ̂_m) = E[θ̂_m] − θ, the expected deviation of the
estimator from the real value of the data-generating parameter
Unbiased estimator: bias = 0
Variance: Var(θ̂_m), the variance of the estimate (as a result of
resampling of the data); decreases with the # of samples
Both are errors of an estimator (i.e. bad) that we
can trade off to minimize the MSE
Let’s derive this (see the decomposition below)!
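The derivation referred to above, in standard notation (a reconstruction, since the slide shows the formulas only as images):

$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}_m) &= \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta}_m - \mathbb{E}[\hat{\theta}_m] + \mathbb{E}[\hat{\theta}_m] - \theta)^2\big] \\
&= \mathrm{Var}(\hat{\theta}_m) + \mathrm{bias}(\hat{\theta}_m)^2
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}\big[\hat{\theta}_m - \mathbb{E}[\hat{\theta}_m]\big] = 0$.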
Bias-variance and underfitting-overfitting are (loosely)
related concepts through model capacity.
● Too high capacity: may need to regularize.
● More data should(!) always help (consistency)
● E.g. estimating the mean using only the 1st sample: unbiased but inconsistent
Taking a step back: What is the purpose of all these
abstract concepts again?
Keep in mind: We introduce these concepts to
gain some insight on how well particular models
would generalize for given datasets!
E.g. data-generating distribution / process, “true”
distribution / parameters, etc are all hypothetical
concepts representing the “unknown” to help us
estimate our expected generalization error for
different models!
And there are more abstract concepts to come! :0
[Diagram: “data-generating process” → sampling → observed data]
What we want: find a model that
generalizes to unseen data →
model and approximate the
“data-generating process”
Machine Learning from a
probabilistic perspective
Maximum Likelihood Estimation (MLE)
θML: the parameter value that maximizes the
probability of the observed data
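In standard notation (a reconstruction of the formula that appears only as an image on the slide), for i.i.d. examples X = {x^(1), …, x^(m)}:

$$\theta_{\mathrm{ML}} = \arg\max_{\theta}\, p_{\mathrm{model}}(X;\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}\big(x^{(i)};\theta\big)$$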
Finally, now we’re getting VERY abstract! :-)
Demystifying KL Divergence
https://towardsdatascience.com/
KL divergence, cross-entropy and their other friends
Minimizing the KL divergence between the
empirical distribution pdata and model distribution
pmodel (for some set of model parameters θ) is
equivalent to:
● maximizing the likelihood / minimizing
the negative log likelihood (NLL) of pmodel
● maximizing the expected log-probability of
the observed data under pmodel
● minimizing the cross-entropy between
pdata and pmodel
● maximizing the mutual information
between pdata and pmodel
Different perspectives of the same concept!
Remember those intriguing log functions in the
definition of entropy from Chapter 3? This is why
they are important in ML (theory)!
Cross-entropy (between pdata and pmodel): any loss
consisting of a negative log-likelihood
Most frequently used cost function in DL!
● Why does minimizing this maximize the
probability of the data under the model?
● Why is it useful / why do we want to do that?
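A tiny numerical illustration (the class probabilities and labels below are made up) that the average negative log-likelihood of a dataset equals the cross-entropy between the empirical distribution and the model distribution:

```python
import numpy as np

# Categorical toy example: 3 classes, illustrative model probabilities
p_model = np.array([0.2, 0.5, 0.3])
data = np.array([1, 1, 2, 0, 1, 2, 1, 0])          # observed class labels

# Average negative log-likelihood of the data under the model
nll = -np.mean(np.log(p_model[data]))

# Empirical distribution of the data, then cross-entropy H(p_data, p_model)
p_data = np.bincount(data, minlength=3) / len(data)
cross_entropy = -np.sum(p_data * np.log(p_model))

print(np.isclose(nll, cross_entropy))  # True: the two quantities coincide
```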
Conditional Log-Likelihood
So far we discussed the case of models that
generate the entire observed dataset, which is
useful for unsupervised learning.
What about supervised learning, where we
only want to predict the labels? Conditional LLH!
If the examples are i.i.d., this simplifies to:
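(Reconstructing the formula, which appears only as an image on the slide; this is the standard conditional MLE form:)

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)};\theta\big)$$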
But now the formula became longer. Why is this
a “simplification”?
Minimizing MSE during curve fitting is equivalent to
maximizing the log-likelihood of y given x under model w,
assuming Gaussian distributed “noise”.
Bishop: Pattern Recognition and Machine Learning, 2006
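A sketch of why (a standard derivation, not taken from the slide): if we model $p(y \mid x) = \mathcal{N}\big(y;\, \hat{y}(x;w),\, \sigma^2\big)$ with fixed $\sigma$, then

$$\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) = -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\big(y^{(i)} - \hat{y}^{(i)}\big)^2}{2\sigma^2},$$

so maximizing the log-likelihood over $w$ is the same as minimizing the sum of squared errors.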
“Entropy - shmentropy… Do I really need to know all
this complicated business to train real ML models?”
https://www.inference.vc/
In practice, not really.
But it definitely helps to
understand what others (and
you) are doing, even in purely
practical work ;-)
Why is Maximum Likelihood
estimation useful in ML? It is
consistent, and asymptotically
it has the highest rate of
convergence among all
consistent estimators, i.e.
statistical efficiency!
[Cramér–Rao lower bound]
Frequentist vs Bayesian statistics
Frequentist perspective (so far):
● θ is fixed but unknown
● the estimate θ̂ is random, due to the stochastic
nature of sampling the observed data
Bayesian perspective:
● Observed data is fixed, not random
● Probability: uncertainty in our
belief about the true value of θ: p(θ)
● Observations decrease prior
uncertainty / increase knowledge
via belief update
Belief update in Bayesian statistics
Belief update using Bayes rule (formula sketched below):
● Why does this equation hold?
● Which term is the prior, evidence, posterior,
normalization constant?
● How / why does this belief update process work?
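The rule referred to above, in standard notation (reconstructed here, since the slide shows it only as an image), for data x and parameters θ:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$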
www.tu-chemnitz.de/
● Frequentist’s critique: Prior is a subjective human
choice!
● The Bayesian’s rebuke: State your assumptions
explicitly!
Bayes learning
Predicted distribution after observing m samples:
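(A standard reconstruction of the predictive-distribution formula shown as an image on the slide:)

$$p\big(x^{(m+1)} \mid x^{(1)},\dots,x^{(m)}\big) = \int p\big(x^{(m+1)} \mid \theta\big)\, p\big(\theta \mid x^{(1)},\dots,x^{(m)}\big)\, d\theta$$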
In general, Bayesian estimation is more
conservative than the frequentist maximum-likelihood
point estimate:
● It has lower variance / risk of overfitting,
due to integration over ∀θ,
● but it is potentially biased, underfit model,
e.g. if prior is not well chosen
Max a posteriori estimate and Regularized Max LL
MAP: a less precise but tractable point-estimate
alternative to the full posterior distribution; its objective
is the log-likelihood plus a log-prior “bias” term (sketched below).
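(Standard form of the MAP estimate, reconstructing the formula that is shown only as an image:)

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid x) = \arg\max_{\theta} \big[\log p(x \mid \theta) + \log p(\theta)\big]$$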
Additive regularization terms in Max LL often
correspond to priors in MAP.
E.g. weight decay penalizes larger weights →
biased toward smaller weights → equivalent to a
Gaussian (Bayesian) prior centered around 0.
High / low weight of regularization term ←→
narrow / broad shape of prior distribution
https://towardsdatascience.com
Recap: linear regression and logistic regression
If we know the true data-generating distribution → we have solved the problem!
In practice: limited data → need to generalize out
of sample → we use:
● parametric family of distributions:
● MLE or MAP to find best parameter vector θ
E.g. linear regression:
● a closed-form solution exists
Logistic regression:
● no closed-form solution; need to minimize the
negative LLH by e.g. gradient descent
Wikipedia
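A minimal sketch of that procedure: logistic regression fitted by plain gradient descent on the negative log-likelihood. The data, step size, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy binary labels

w = np.zeros(2)
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w + b)                 # predicted P(y=1 | x)
    # Gradient of the mean negative log-likelihood (cross-entropy) w.r.t. w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```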
Machine learning algorithms
(estimators) in practice
Some supervised methods
Task: associate input x with output y
Example so far: linear regression
Support Vector Machine
Kernel trick: dot product between examples
rewritten to non-linear kernel function
Learning a linear model in a transformed space
● Space transformation is kept fixed
● Transformed linear model is easy to learn
Drawback: the computational cost of learning does not
scale well with the number of training examples!
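A minimal scikit-learn sketch of a kernelized SVM; the RBF kernel and the toy data are illustrative choices, not from the slide.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linearly separable toy labels

# The RBF kernel implicitly maps inputs into a fixed high-dimensional feature space,
# where a linear decision boundary is learned (the "kernel trick").
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))
```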
k-nearest neighbour (KNN)
Quintessential non-parametric, non-
probabilistic “learning” method:
● take average / most frequent of k nearest
neighbours
Pros: High capacity, no training, typically good
accuracy with large training data
Cons: large “model” size (full dataset) and
computational cost, provides zero “insight” into the problem
Figure: 1-NN classification map (Wikipedia)
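A minimal scikit-learn sketch (toy data; k = 5 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)   # toy labels

# "Training" just stores the dataset; prediction votes among the k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(rng.normal(size=(3, 2))))
```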
Decision trees
Basic algorithm: axis-aligned splits with constant
outputs → non-parametric and non-probabilistic
Many variants (making DTs more parametric)
● Regularized DTs (e.g. pruning)
● Ensemble methods
○ Boosted trees
○ Bagged trees / random forests
Pros: simple DTs are easy to interpret (RFs not!)
Cons: unstable (small change in input can cause a
large change in model / output)
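A minimal scikit-learn sketch contrasting a single tree with a bagged ensemble (random forest); the data, depth, and ensemble size are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # toy labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)          # single, interpretable tree
forest = RandomForestClassifier(n_estimators=100).fit(X, y)   # bagged ensemble of trees

print(tree.score(X, y), forest.score(X, y))
```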
Some unsupervised methods
Task: “extract information” from a distribution such as:
● Density estimation
● Sampling
● Denoising
● Manifold learning
● (dimensionality reduction)
● Clustering
Classic task: find “best representation” of data
● Simplify while retaining as much information as possible
Non-linear dimensionality reduction
Wikipedia
“Best” in what sense?
Three most common criteria for best
representation:
● Low-dimensional representation
● Sparse representation
● Independent representation
These properties are useful principles because
they help us:
● manipulate data more effectively
● understand the data-generating process
(ultimately Nature)
We want to disentangle unknown factors of
variation, remove redundancy
Karczewski et al 2014
Principal Component Analysis (PCA)
PCA learns:
1. lower dimensional representations
2. with linearly uncorrelated dimensions
while preserving as much information (variance)
of the original data as possible.
Disentangle factors of variation by finding a
rotation that aligns the principal axes of variation
with the new basis vectors.
Wikipedia
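A minimal numpy sketch of PCA via the singular value decomposition; the toy data and the choice of 2 components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 0.5, 0.0],
                                          [0.0, 0.0, 0.1]])   # correlated toy data

Xc = X - X.mean(axis=0)                     # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt[:2].T                           # project onto the 2 principal components
explained_var = S**2 / (len(X) - 1)         # variance captured by each component
print(explained_var, Z.shape)
```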
K-means clustering
Divide training set into k different clusters of
examples that are “close” to each other.
Number of clusters? Distance metric?
Extreme case of sparse representation (one-hot
vector coding), but loses the advantages of
distributed representations.
Algorithm: iterative refinement of clusters
1. Assign class labels
2. Update centroids
https://scikit-learn.org/
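A minimal numpy sketch of exactly those two alternating steps; k, the cluster centers, and the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in [(-3, 0), (0, 3), (3, 0)]])

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from random data points

for _ in range(20):
    # 1. Assign class labels: each point goes to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # 2. Update centroids: mean of the points assigned to each cluster
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```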
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
No free lunch!
Outlook: Challenges for
classical ML algorithms
Anatomy of a Machine Learning algorithm (revisited)
Typical main ML/DL components in practice:
● Dataset
● Model
● Objective function
● Optimization algorithm
It’s useful to think of these as independent
components of an ML system!
The Curse of Dimensionality
Problem: As the number of dimensions of the
data increases, the number of configurations of
interest may grow exponentially.
Traditional ML methods either
● learned too simple mappings between x
and y (e.g. linear regression), or
● learned local regions of the training data
(k-NN, random forest, SVM) → do not
generalize well to unseen regions of input
space!
Assumption of smoothness / local constancy
does not hold for complex problems!
Manifold learning
Insight (hypothesis): in practice, the distribution
of natural images / sounds / language
sequences / ... occupies only a very small volume
of their total space.
Concentration of probability distributions!
These manifolds (sub-spaces) constitute useful
dimensions of variation (e.g. lighting, rotation,
size, etc. of the same object).
Some fascinating tools that utilize the manifold
learning capabilities of deep neural networks:
https://distill.pub/2017/aia/
Stochastic Gradient Descent
Problem: we need many examples to learn
complex models that are useful in the real-world.
However, iterative Gradient Descent then
becomes too slow!
Solution: Use a minibatch of examples (sampled
uniformly from the training data)
Cornerstone of modern Deep Learning! More
on optimization in Chapter 8.
https://medium.com/ai-society/hello-gradient-descent-ef74434bdfa5
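A minimal numpy sketch of minibatch SGD for linear regression; the batch size, learning rate, and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
lr, batch_size = 0.05, 64

for step in range(2_000):
    # Sample a minibatch uniformly from the training data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error on the minibatch only
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad

print(w)   # close to the true coefficients
```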
What makes DL different from classical ML methods?
● Scales better with more data / model
capacity (# of parameters)
● Lends itself better to end-to-end
supervised learning
● A nice presentation on this by Andrew Ng,
plus practical advice to ML development:
○ https://www.youtube.com/watch?v=F1ka6a13S9I
The End

Editor's Notes

  • #8: Find correspondence between items in left and right column! Identify elements in diagram!
  • #42: Find correspondence between items in left and right column! Identify elements in diagram!