Machine Learning basics
[reused from LDL for KSS]
David Samu
14 Feb 2020
Knowledge Sharing Session
Caveat
These are slides reused from Learning Deep Learning series
New presentation on Deep Learning for Land Cover Segmentation is on the
way :)
Table of Contents
1. Intro: ML Basic concepts
2. ML from a Statistical Learning perspective
3. ML from a Probabilistic perspective
4. ML algorithms in practice
5. Outlook: Challenges for classical ML algorithms
Intro: Machine Learning basic
concepts
What is Machine Learning?
● A machine learning algorithm is an
algorithm that is able to learn from data.
● Machine learning is essentially a form of
applied statistics with increased emphasis
on the use of computers to statistically
estimate complicated functions and a
decreased emphasis on proving
confidence intervals around these
functions.
● Deep learning is a specific kind of machine
learning.
A deluge of options… How to select the (a) right one?
Scikit-Learn classifier comparison (source: scikit-learn.org)
Anatomy of a Machine Learning algorithm
● T: task
● P: performance measure
● E: experience
“A computer program is said to learn from
experience E with respect to some task T and
performance measure P, if its performance at
task T, as measured by P, improves with
experience E.”
Typical main ML/DL components in practice:
● Dataset
● Model
● Objective function
● Optimization algorithm
It’s useful to think of these as independent
components of an ML system!
Common ML methods we are already using
● Classification
○ which of k categories does the input belong to?
○ e.g. tree / no tree, which of n species?
● Regression
○ predict numerical value given some input
○ e.g. age / height / density of tree?
● Structured output
○ predict multiple interrelated variables
○ e.g. pixel-wise segmentation into object
categories
● Anomaly detection
○ flag / predict unusual or atypical events
○ pipeline leak, natural disasters
● Denoising
○ predict clean data from corrupted data
○ SAR denoising
Tang et al, 2018
Other ML tasks we could apply in the future
● Transcription, translation
● Synthesis and sampling
● Imputing missing values
● Density estimation
● ...
For what tasks can we use
these techniques in the future?
E.g. density estimation (GANs)
for missing / corrupted value
imputation: fill in cloud covered
areas using other parts of the
image, timestamps, sensors, ...
Experience (aka the Dataset)
● Supervised learning
○ Regression, classification, …
● Unsupervised learning
○ Learn interesting properties, e.g. clustering
○ Learn entire generative prob. distr.
● Reinforcement learning
○ Models interacting with the environment
● (Energy-based, Adversarial, …)
Although these terms are useful in practice, they
are not formally defined and not always distinct!
They can often be interchanged and combined:
e.g. density estimation to support classification.
Karczewski et al 2014
Example: solving the linear regression problem
● Is this an iterative
optimization
process?
● Why didn’t we
need to use
gradient descent
or some other
method?
● Why would this
not work for deep
networks?
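A minimal sketch of why linear regression needs no iterative optimization: with a squared-error objective the optimum has a closed form, the normal equations. The data and variable names here are illustrative, not from the slide.

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * x[:, 0] + 1.0 + 0.1 * rng.normal(size=50)

# Design matrix with a bias column
X = np.hstack([x, np.ones((x.shape[0], 1))])

# Normal equations: w = (X^T X)^{-1} X^T y -- the closed-form least-squares solution
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [2.0, 1.0]
```

Deep networks have no such closed-form solution, because the loss is non-convex in the weights; that is why iterative methods such as gradient descent are needed there.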
Some thoughts before diving into ML concepts
● Central challenge of statistical learning, ML, DL:
tame complexity and uncertainty
● Many tools and options, some guarantees, lots
of uncertainty, no “best” answer
● Model design is a combination of science and
art (+ engineering in practice)
● A spectrum where we aim towards the left side,
but often find ourselves on the right:
○ deterministic ←→ probabilistic
○ formal ←→ heuristic
○ deductive ←→ inductive
○ analytic ←→ numeric
Situation we want to avoid by understanding through theory
Machine Learning from a
statistical learning
perspective
Capacity, Overfitting, Underfitting
Central challenge in ML: generalization to
unseen input!
Training error ←→ Test error
Test set should be collected separately from
training set!
pdata: the data-generating distribution
Examples are assumed to be i.i.d. draws from pdata for the mathematical analysis
Underfitting ← Capacity → Overfitting
Model’s hypothesis space ~ model capacity
Figure: ground truth f(x) = sin(x), n = 10 noisy data points, M = degree of the polynomial fit
Bishop: Pattern Recognition and Machine Learning, 2006
Capacity, Overfitting, Underfitting (cont’d)
Challenge: find model complexity that fits the
task and the available data!
The model’s effective capacity is usually lower
than its representational capacity, constrained by
insufficient / biased data and by the optimization process
Occam’s razor / principle of parsimony: from
similarly performing models, choose the simpler
one!
No Free Lunch Theorem: averaged over all possible
data-generating distributions, no ML algorithm is
universally better than any other… The modeller
has to find and design an appropriate model for
the task & data at hand!
(Cf. with figure on prev slide)
Bishop: Pattern Recognition and Machine Learning, 2006
Regularization
“Preferences” for certain kinds of solutions
L2 regularization, a.k.a. weight decay
Trade-off between fitting training data and
small weights.
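As a reminder (a standard formulation, reconstructing the objective that appears only as an image on the slide), weight decay adds a squared-norm penalty to the training objective:

$$\tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \lambda\, \mathbf{w}^\top \mathbf{w}$$

where λ controls how strongly small weights are preferred over a good fit to the training data.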
In general, regularization aims to decrease
the generalization error at the expense of
training error.
Central concern in ML, next to optimization
(Chapters 7 and 8)
(Cf. with figure on prev slide)
Bishop: Pattern Recognition and Machine Learning, 2006
Hyperparameters, validation set
Settings of the model / learning algorithm that are
not directly optimized by the learning procedure itself
Difficult or not appropriate to optimize on the
training set, e.g. model capacity or the regularization weight λ
Validated against a validation set of examples,
before using test set to measure generalization
error.
If used repeatedly, test sets can introduce bias
into model selection! We need to update /
extend / replace our test sets regularly!
1. Training set: find model weights w for fixed HP settings
2. Validation set: tune HPs
3. Test set: measure generalization error
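A minimal scikit-learn sketch of this three-way protocol; the dataset, the Ridge model, and the hyperparameter grid are illustrative placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# 1. Split off a held-out test set, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# 2. Tune the hyperparameter (here: ridge penalty alpha) on the validation set
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# 3. Measure generalization error once, on the untouched test set
final = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print(best_alpha, final.score(X_test, y_test))
```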
Bias and variance
Bias: bias(θ̂_m) = E[θ̂_m] − θ, the expected deviation of the
estimator from the real value of the data-generating parameter
Unbiased estimator: bias = 0
Variance: Var(θ̂_m), the variance of the estimate (as a result of
resampling of the data); decreases with the # of samples
Both are errors of an estimator (i.e. bad) that we
can trade off to minimize the MSE
Let’s derive this (see the decomposition below)!
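The derivation referred to above, in standard notation (a reconstruction, since the slide shows the formulas only as images):

$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}_m) &= \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta}_m - \mathbb{E}[\hat{\theta}_m] + \mathbb{E}[\hat{\theta}_m] - \theta)^2\big] \\
&= \mathrm{Var}(\hat{\theta}_m) + \mathrm{bias}(\hat{\theta}_m)^2
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}\big[\hat{\theta}_m - \mathbb{E}[\hat{\theta}_m]\big] = 0$.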
Bias-variance and underfitting-overfitting are (loosely)
related concepts through model capacity.
● Too high capacity: may need to regularize.
● More data should(!) always help (consistency)
● E.g. estimating the mean using only the 1st sample: unbiased but inconsistent
Taking a step back: What is the purpose of all these
abstract concepts again?
Keep in mind: We introduce these concepts to
gain some insight on how well particular models
would generalize for given datasets!
E.g. data-generating distribution / process, “true”
distribution / parameters, etc are all hypothetical
concepts representing the “unknown” to help us
estimate our expected generalization error for
different models!
And there are more abstract concepts to come! :0
[Diagram: “data-generating process” → sampling → observed data]
What we want: find a model that
generalizes to unseen data →
model and approximate the
“data-generating process”
Machine Learning from a
probabilistic perspective
Maximum Likelihood Estimation (MLE)
θML: the parameter value that maximizes the
probability of the observed data
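In standard notation (a reconstruction of the formula that appears only as an image on the slide), for i.i.d. examples X = {x^(1), …, x^(m)}:

$$\theta_{\mathrm{ML}} = \arg\max_{\theta}\, p_{\mathrm{model}}(X;\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}\big(x^{(i)};\theta\big)$$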
Finally, now we’re getting VERY abstract! :-)
Demystifying KL Divergence
https://towardsdatascience.com/
KL divergence, cross-entropy and their other friends
Minimizing the KL divergence between the
empirical distribution pdata and model distribution
pmodel (for some set of model parameters θ) is
equivalent to:
● maximizing the likelihood / minimizing
the negative log likelihood (NLL) of pmodel
● maximizing the expected log-probability of
the observed data under pmodel
● minimizing the cross-entropy between
pdata and pmodel
● maximizing the mutual information
between pdata and pmodel
Different perspectives of the same concept!
Remember those intriguing log functions in the
definition of entropy from Chapter 3? This is why
they are important in ML (theory)!
Cross-entropy (between pdata and pmodel): any loss
consisting of a negative log-likelihood
Most frequently used cost function in DL!
● Why does minimizing this maximize the
probability of the data under the model?
● Why is it useful / why do we want to do that?
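A tiny numerical illustration (the class probabilities and labels below are made up) that the average negative log-likelihood of a dataset equals the cross-entropy between the empirical distribution and the model distribution:

```python
import numpy as np

# Categorical toy example: 3 classes, illustrative model probabilities
p_model = np.array([0.2, 0.5, 0.3])
data = np.array([1, 1, 2, 0, 1, 2, 1, 0])          # observed class labels

# Average negative log-likelihood of the data under the model
nll = -np.mean(np.log(p_model[data]))

# Empirical distribution of the data, then cross-entropy H(p_data, p_model)
p_data = np.bincount(data, minlength=3) / len(data)
cross_entropy = -np.sum(p_data * np.log(p_model))

print(np.isclose(nll, cross_entropy))  # True: the two quantities coincide
```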
Conditional Log-Likelihood
So far we discussed the case of models that
generate the entire observed dataset, which is
useful for unsupervised learning.
What about supervised learning, where we
only want to predict the labels? Conditional LLH!
If the examples are i.i.d., this simplifies to:
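(Reconstructing the formula, which appears only as an image on the slide; this is the standard conditional MLE form:)

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)};\theta\big)$$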
But now the formula became longer. Why is this
a “simplification”?
Minimizing MSE during curve fitting is equivalent to
maximizing the log-likelihood of y given x under model w,
assuming Gaussian distributed “noise”.
Bishop: Pattern Recognition and Machine Learning, 2006
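A sketch of why (a standard derivation, not taken from the slide): if we model $p(y \mid x) = \mathcal{N}\big(y;\, \hat{y}(x;w),\, \sigma^2\big)$ with fixed $\sigma$, then

$$\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) = -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\big(y^{(i)} - \hat{y}^{(i)}\big)^2}{2\sigma^2},$$

so maximizing the log-likelihood over $w$ is the same as minimizing the sum of squared errors.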
“Entropy - shmentropy… Do I really need to know all
this complicated business to train real ML models?”
https://www.inference.vc/
In practice, not really.
But it definitely helps to
understand what others (and
you) are doing, even in purely
practical work ;-)
Why is Maximum Likelihood
estimation useful in ML? It is
consistent, and asymptotically
it has the highest rate of
convergence among all
consistent estimators, i.e.
statistical efficiency!
[Cramér–Rao lower bound]
Frequentist vs Bayesian statistics
Frequentist perspective (so far):
● θ is fixed but unknown
● the estimate θ̂ is random, due to the stochastic
nature of sampling the observed data
Bayesian perspective:
● Observed data is fixed, not random
● Probability: uncertainty in our
belief about the true value of θ: p(θ)
● Observations decrease prior
uncertainty / increase knowledge
via belief update
Belief update in Bayesian statistics
Belief update using Bayes rule (formula sketched below):
● Why does this equation hold?
● Which term is the prior, evidence, posterior,
normalization constant?
● How / why does this belief update process work?
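The rule referred to above, in standard notation (reconstructed here, since the slide shows it only as an image), for data x and parameters θ:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$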
www.tu-chemnitz.de/
● Frequentist’s critique: Prior is a subjective human
choice!
● The Bayesian’s rebuke: State your assumptions
explicitly!
Bayes learning
Predicted distribution after observing m samples:
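(A standard reconstruction of the predictive-distribution formula shown as an image on the slide:)

$$p\big(x^{(m+1)} \mid x^{(1)},\dots,x^{(m)}\big) = \int p\big(x^{(m+1)} \mid \theta\big)\, p\big(\theta \mid x^{(1)},\dots,x^{(m)}\big)\, d\theta$$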
In general, Bayesian estimation is more
conservative than the frequentist maximum-likelihood
point estimate:
● It has lower variance / risk of overfitting,
due to integration over ∀θ,
● but it is potentially biased, underfit model,
e.g. if prior is not well chosen
Max a posteriori estimate and Regularized Max LL
MAP: a less precise but tractable point-estimate
alternative to the full posterior distribution; its objective
is the log-likelihood plus a log-prior “bias” term (sketched below).
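(Standard form of the MAP estimate, reconstructing the formula that is shown only as an image:)

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid x) = \arg\max_{\theta} \big[\log p(x \mid \theta) + \log p(\theta)\big]$$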
Additive regularization terms in Max LL often
correspond to priors in MAP.
E.g. weight decay penalizes larger weights →
biased toward smaller weights → equivalent to a
Gaussian (Bayesian) prior centered around 0.
High / low weight of regularization term ←→
narrow / broad shape of prior distribution
https://towardsdatascience.com
Recap: linear regression and logistic regression
If we know the true data-generating distribution → we have solved the problem!
In practice: limited data → need to generalize out
of sample → we use:
● parametric family of distributions:
● MLE or MAP to find best parameter vector θ
E.g. linear regression:
● a closed-form solution exists
Logistic regression:
● no closed-form solution; need to minimize the
negative LLH by e.g. gradient descent
Wikipedia
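A minimal sketch of that procedure: logistic regression fitted by plain gradient descent on the negative log-likelihood. The data, step size, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy binary labels

w = np.zeros(2)
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w + b)                 # predicted P(y=1 | x)
    # Gradient of the mean negative log-likelihood (cross-entropy) w.r.t. w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```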
Machine learning algorithms
(estimators) in practice
Some supervised methods
Task: associate input x with output y
Example so far: linear regression
Support Vector Machine
Kernel trick: dot product between examples
rewritten to non-linear kernel function
Learning a linear model in a transformed space
● Space transformation is kept fixed
● Transformed linear model is easy to learn
Drawback: the computational cost of learning does not
scale well with the number of training examples!
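A minimal scikit-learn sketch of a kernelized SVM; the RBF kernel and the toy data are illustrative choices, not from the slide.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linearly separable toy labels

# The RBF kernel implicitly maps inputs into a fixed high-dimensional feature space,
# where a linear decision boundary is learned (the "kernel trick").
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))
```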
k-nearest neighbour (KNN)
Quintessential non-parametric, non-
probabilistic “learning” method:
● take average / most frequent of k nearest
neighbours
Pros: High capacity, no training, typically good
accuracy with large training data
Cons: large “model” size (full dataset) and
computational cost, provides zero “insight” into the problem
Figure: 1-NN classification map (Wikipedia)
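A minimal scikit-learn sketch (toy data; k = 5 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)   # toy labels

# "Training" just stores the dataset; prediction votes among the k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(rng.normal(size=(3, 2))))
```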
Decision trees
Basic algorithm: axis-aligned splits with constant
outputs → non-parametric and non-probabilistic
Many variants (making DTs more parametric)
● Regularized DTs (e.g. pruning)
● Ensemble methods
○ Boosted trees
○ Bagged trees / random forests
Pros: simple DTs are easy to interpret (RFs not!)
Cons: unstable (small change in input can cause a
large change in model / output)
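A minimal scikit-learn sketch contrasting a single tree with a bagged ensemble (random forest); the data, depth, and ensemble size are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # toy labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)          # single, interpretable tree
forest = RandomForestClassifier(n_estimators=100).fit(X, y)   # bagged ensemble of trees

print(tree.score(X, y), forest.score(X, y))
```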
Some unsupervised methods
Task: “extract information” from a distribution such as:
● Density estimation
● Sampling
● Denoising
● Manifold learning
● (dimensionality reduction)
● Clustering
Classic task: find “best representation” of data
● Simplify while retaining as much information as possible
Non-linear dimensionality reduction
Wikipedia
“Best” in what sense?
Three most common criteria for best
representation:
● Low-dimensional representation
● Sparse representation
● Independent representation
These properties are useful principles because
they help us:
● manipulate data more effectively
● understand the data-generating process
(ultimately Nature)
We want to disentangle unknown factors of
variation, remove redundancy
Karczewski et al 2014
Principal Component Analysis (PCA)
PCA learns:
1. lower dimensional representations
2. with linearly uncorrelated dimensions
while preserving as much information (variance)
of the original data as possible.
Disentangle factors of variation by finding a
rotation that aligns the principal axes of variation
with the new basis vectors.
Wikipedia
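A minimal numpy sketch of PCA via the singular value decomposition; the toy data and the choice of 2 components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 0.5, 0.0],
                                          [0.0, 0.0, 0.1]])   # correlated toy data

Xc = X - X.mean(axis=0)                     # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt[:2].T                           # project onto the 2 principal components
explained_var = S**2 / (len(X) - 1)         # variance captured by each component
print(explained_var, Z.shape)
```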
K-means clustering
Divide training set into k different clusters of
examples that are “close” to each other.
Number of clusters? Distance metric?
Extreme case of sparse representation (one-hot
vector coding), but loses the advantages of
distributed representations.
Algorithm: iterative refinement of clusters
1. Assign class labels
2. Update centroids
https://scikit-learn.org/
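A minimal numpy sketch of exactly those two alternating steps; k, the cluster centers, and the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in [(-3, 0), (0, 3), (3, 0)]])

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from random data points

for _ in range(20):
    # 1. Assign class labels: each point goes to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # 2. Update centroids: mean of the points assigned to each cluster
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```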
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
No free lunch!
Outlook: Challenges for
classical ML algorithms
Anatomy of a Machine Learning algorithm (revisited)
Typical main ML/DL components in practice:
● Dataset
● Model
● Objective function
● Optimization algorithm
It’s useful to think of these as independent
components of an ML system!
The Curse of Dimensionality
Problem: As the number of dimensions of the
data increases, the number of configurations of
interest may grow exponentially.
Traditional ML methods either
● learned too simple mappings between x
and y (e.g. linear regression), or
● learned local regions of the training data
(k-NN, random forest, SVM) → do not
generalize well to unseen regions of input
space!
Assumption of smoothness / local constancy
does not hold for complex problems!
Manifold learning
Insight (hypothesis): in practice, the distribution
of natural images / sounds / language
sequences / ... occupies only a very small volume
of their total space.
Concentration of probability distributions!
These manifolds (sub-spaces) constitute useful
dimensions of variation (e.g. lighting, rotation,
size, etc. of the same object).
Some fascinating tools that utilize the manifold
learning capabilities of deep neural networks:
https://distill.pub/2017/aia/
Stochastic Gradient Descent
Problem: we need many examples to learn
complex models that are useful in the real-world.
However, iterative Gradient Descent then
becomes too slow!
Solution: Use a minibatch of examples (sampled
uniformly from the training data)
Cornerstone of modern Deep Learning! More
on optimization in Chapter 8.
https://medium.com/ai-society/hello-gradient-descent-ef74434bdfa5
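A minimal numpy sketch of minibatch SGD for linear regression; the batch size, learning rate, and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
lr, batch_size = 0.05, 64

for step in range(2_000):
    # Sample a minibatch uniformly from the training data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error on the minibatch only
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad

print(w)   # close to the true coefficients
```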
What makes DL different from classical ML methods?
● Scales better with more data / model
capacity (# of parameters)
● Lends itself better to end-to-end
supervised learning
● A nice presentation on this by Andrew Ng,
plus practical advice to ML development:
○ https://www.youtube.com/watch?v=F1ka6a13S9I
The End

Editor's Notes

  • #8: Find correspondence between items in left and right column! Identify elements in diagram!
  • #42: Find correspondence between items in left and right column! Identify elements in diagram!