Some Take-Home Messages (THM) about ML....
Data Science Meetup
Gianluca Bontempi
Interuniversity Institute of Bioinformatics in Brussels, (IB)²
Machine Learning Group,
Computer Science Department, ULB
mlg.ulb.ac.be, ibsquare.be
May 20, 2016
Introducing myself
1992: Computer science engineer (Politecnico di Milano, Italy),
1994: Researcher in robotics in IRST, Trento, Italy,
1995: Researcher in IRIDIA, ULB, Brussels,
1996-97: Researcher in IDSIA, Lugano, Switzerland,
1998-2000: Marie Curie fellowship in IRIDIA, ULB,
2000-2001: Scientist in Philips Research, Eindhoven, The
Netherlands,
2001-2002: Scientist in IMEC, Microelectronics Institute,
Leuven, Belgium,
since 2002: professor in Machine Learning, Modeling and
Simulation, Bioinformatics in ULB Computer Science Dept.,
since 2004: head of the ULB Machine Learning Group (MLG).
since 2013: director of the Interuniversity Institute of
Bioinformatics in Brussels (IB)², ibsquare.be.
What is machine learning?
Machine learning is that domain of computational intelligence
which is concerned with the question of how to construct computer
programs that automatically improve with experience. (Mitchell,
97)
Reductionist attitude: ML is just a buzzword that equates to
statistics plus marketing.
Positive attitude: ML paved the way for treating real
data-analysis problems sometimes overlooked by
statisticians (nonlinearity, classification,
pattern recognition, missing variables, adaptivity,
optimization, massive datasets, data management,
causality, representation of knowledge, parallelisation).
Interdisciplinary attitude: ML has its roots in statistics
and complements it by focusing on algorithmic
issues, computational efficiency, and data engineering.
Prediction is pervasive ...
Predict
whether you will like a book/movie (collaborative filtering)
credit applicants as low, medium, or high risk.
which home telephone lines are used for Internet access.
which customers are likely to stop being customers (churn).
the value of a piece of real estate
which telephone subscribers will order a 4G service
which CARREFOUR clients will be most interested in a
discount on Italian products.
the probability that a company is employing undeclared
workers (fraud detection)
the survival risk of a patient on the basis of a genetic signature
the probability of a crime in an urban area.
the key of a cryptographic algorithm on the basis of power
consumption
Supervised learning
First assumption: learning is essentially about prediction !
Second assumption: reality is stochastic, dependency and
uncertainty are well described by conditional probability.
[Diagram: a training dataset of input/output pairs is fed to a prediction model; the model's prediction is compared with the target to measure the prediction error.]
measurable features (inputs)
measurable target variables (outputs) and accuracy criteria
data (in God we trust; all others must bring data)
THM1: formalizing a problem as a prediction problem is often the
most important contribution of a data scientist!
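To make THM1 concrete, here is a minimal sketch (not from the slides) of such a formalization in Python with NumPy; the feature, target, and noise level are invented for illustration:

```python
import numpy as np

# Hypothetical example: predict the price of a piece of real estate (output)
# from its surface (input); both the data and the model are invented.
rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=100)            # measurable feature: surface in m^2
y = 1000.0 * X + rng.normal(0, 20000, 100)    # measurable target: price, with noise

# Prediction model: a least-squares line fitted on the training data.
A = np.column_stack([X, np.ones_like(X)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Accuracy criterion: mean squared prediction error on the training data.
mse = np.mean((y - A @ coef) ** 2)
print(f"slope={coef[0]:.1f}, intercept={coef[1]:.1f}, training MSE={mse:.0f}")
```

Once inputs, outputs, and an accuracy criterion are fixed, any learning algorithm can be plugged into the same frame.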
It is all about ...
1 Probabilistic modeling
it formalizes uncertainty and dependency (regression function)
notions of entropy and information
relevant and irrelevant features (e.g. Markov blanket notion)
Bayesian networks, causal reasoning
2 Estimation
bias/variance notions
generalization issues: underfitting vs overfitting
Bayesian, frequentist, decision theory
validation
combination/averaging of estimators (bagging, boosting)
3 Optimization
Maximum likelihood, least squares, backpropagation
Dual problems (SVM)
L1, L2 norm (lasso)
4 Computer science
implementation, algorithms
parallelism, scalability
data management
So ... how to teach machine learning?
Focus on ...
Formalism ?
Algorithms ?
Coding ?
Applications ?
Of course, all of these matter; but what is the essence, what is
common to the exploding number of algorithms, techniques, and fancy
applications?
Estimation
[Diagram: a stochastic phenomenon generates several independent datasets; the same learner, applied to each, returns a different model and prediction.]
THM2: a predictor is an estimator, i.e. an algorithm (black-box)
which takes data and returns a prediction.
THM3: reality is stochastic, so data is stochastic and prediction is
stochastic.
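A toy illustration of THM2 and THM3 (a sketch, not from the slides; the phenomenon and learner are invented): the same black-box learner, fed three independent datasets drawn from the same stochastic phenomenon, returns three different predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)          # mean of the (in reality unknown) stochastic phenomenon

def learner(x_train, y_train):
    """A black-box estimator: data in, model out (here a degree-3 polynomial)."""
    return np.polyfit(x_train, y_train, 3)

x_query = 1.5
for run in range(3):             # three independent datasets from the same phenomenon
    x = rng.uniform(0, np.pi, 20)
    y = f(x) + rng.normal(0, 0.2, 20)                  # stochastic data ...
    print(f"run {run}: prediction at x={x_query} is "
          f"{np.polyval(learner(x, y), x_query):.3f}")  # ... hence stochastic prediction
```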
Assessing in an uncertain world (Baggio, 1998)
"Don't be afraid of missing a penalty kick: it is not from details
like these that a player is judged" (De Gregori, 1982).
Assessing a learner
The goal of learning is to find a model which is able to
generalize, i.e. able to return good predictions in contexts
with the same distribution but independent of the training set
How to estimate the quality of a model?
It is always possible to find models with such a complicated
structure that they achieve zero training error. Are these models
good?
Typically NOT, since doing very well on the training set can
mean doing badly on new data.
This is the phenomenon of overfitting.
THM4: learning is challenging since data have to be used 1) for
creating prediction models and 2) for assessing them.
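A hedged sketch of THM4 (data and polynomial degrees are invented): an interpolating polynomial reaches near-zero training error yet generalizes worse than a simpler model on independent test data drawn from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(-1, 1, 15)
y_tr = x_tr ** 2 + rng.normal(0, 0.1, 15)
x_te = rng.uniform(-1, 1, 200)                 # same distribution, independent of training
y_te = x_te ** 2 + rng.normal(0, 0.1, 200)

for deg in (2, 14):                            # a simple model vs. an interpolating one
    p = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(p, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(p, x_te)) ** 2)
    print(f"degree {deg:2d}: training MSE={mse_tr:.4f}, test MSE={mse_te:.4f}")
```

The degree-14 fit passes through all 15 training points, so its training error says nothing about its quality; only the held-out error does.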
Bias and variance of a model
Estimation theory: the mean squared error (a measure of the
generalization quality) can be written as
MSE = σ²_w + bias² + variance
where
noise concerns the reality alone,
bias reflects the relation between reality and the learning
algorithm
variance concerns the learning algorithm alone.
This is purely theoretical since these quantities cannot be
measured ....
.. but useful to understand why and in which circumstances
learners work.
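In a simulation, where the true function is known, the decomposition can be checked numerically. A sketch (all settings invented): a deliberately biased learner, a straight line fitted to a sine, is retrained on many independent datasets, and noise + bias² + variance is compared with the Monte Carlo MSE.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x)          # the true function: known only because we simulate
sigma_w, x0, runs = 0.3, 0.8, 2000   # noise level, query point, Monte Carlo repetitions

preds = np.empty(runs)
for r in range(runs):                # retrain on many independent datasets
    x = rng.uniform(0, np.pi, 30)
    y = f(x) + rng.normal(0, sigma_w, 30)
    preds[r] = np.polyval(np.polyfit(x, y, 1), x0)   # a deliberately biased (linear) learner

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
y0 = f(x0) + rng.normal(0, sigma_w, runs)            # fresh noisy targets at x0
mse = np.mean((y0 - preds) ** 2)
print(f"noise={sigma_w**2:.3f}  bias^2={bias2:.3f}  variance={variance:.3f}")
print(f"MSE={mse:.3f}  vs  noise+bias^2+variance={sigma_w**2 + bias2 + variance:.3f}")
```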
The bias/variance dilemma
Noise is all that cannot be learned from data
Bias measures the lack of representational power of the class
of hypotheses.
Too simple model ⇒ large bias ⇒ underfitting
Variance warns us against an excessive complexity of the
approximator.
Too complex model ⇒ large variance ⇒ overfitting
A neural network is less biased than a linear model but
inevitably has higher variance.
Averaging (e.g. bagging, boosting, random forests) is a good
cure for variance.
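A toy check of the last point (a sketch under invented settings): bagging a high-variance learner, here a 1-nearest-neighbour predictor, by averaging models fitted on bootstrap resamples noticeably reduces the variance of its prediction.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)
x0, n, B = 0.8, 40, 50                       # query point, dataset size, bootstrap replicates

def fit_predict(x, y):
    return y[np.argmin(np.abs(x - x0))]      # 1-nearest-neighbour: a high-variance learner

single, bagged = [], []
for _ in range(500):                         # repeat over independent datasets
    x = rng.uniform(0, np.pi, n)
    y = f(x) + rng.normal(0, 0.3, n)
    single.append(fit_predict(x, y))
    idx = rng.integers(0, n, size=(B, n))    # B bootstrap resamples of the same dataset
    bagged.append(np.mean([fit_predict(x[i], y[i]) for i in idx]))

print(f"prediction variance: single={np.var(single):.4f}  bagged={np.var(bagged):.4f}")
```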
Bias/variance trade-off
[Plot: generalization error as a function of model complexity; the bias term decreases and the variance term increases with complexity, with underfitting at low complexity and overfitting at high complexity.]
THM5: think in terms of the bias/variance tradeoff. Take your
preferred learning algorithm and work out how its bias and variance
are managed.
Ockham's Razor
THM6: "Pluralitas non est ponenda sine necessitate", i.e. one
should not increase, beyond what is necessary, the number of
entities required to explain anything.
This is the medieval rule of parsimony, or principle of
economy, known as Ockham’s razor.
In other words, the principle states that one should not make
more assumptions than the minimum needed.
It underlies all scientific modeling and theory building. It
admonishes us to choose from a set of otherwise equivalent
models the simplest one.
Be simple: "shave off" those concepts, variables or constructs
that are not really needed to explain the phenomenon.
Does the best exist?
Given a finite number of samples, are there any reasons to
prefer one learning algorithm over another?
If we make no assumption about the nature of the learning
task, can we expect any learning method to be superior or
inferior overall?
Can we even find an algorithm that is overall superior to (or
inferior to) random guessing?
The No Free Lunch Theorem answers NO to these questions.
No Free Lunch theorem
If the goal is to obtain good generalization performance, there
are no context-independent or usage-independent reasons
to favor one learning method over another.
If one algorithm seems to outperform another in a particular
situation, it is a consequence of its fit to the particular pattern
recognition problem, not the general superiority of the
algorithm.
The theorem also justifies skepticism about studies that
demonstrate the overall superiority of a particular learning or
recognition algorithm.
If a learning method performs well over some set of problems,
then it must perform worse than average elsewhere. No
method can perform well throughout the full set of functions.
THM7: every learning algorithm makes assumptions (most of the
time implicitly), and these assumptions make the difference.
Conclusion
Popper claimed that a theory is scientific if it is falsifiable,
i.e. if it can be contradicted by an observation or the outcome of
a physical experiment. Since prediction is the most falsifiable
aspect of science, it is also the most scientific one.
Effective machine learning is an extension of statistics, in no
way an alternative.
Simplest (i.e. linear) model first.
Modelling is more an art than an automatic process... hence
experienced data analysts are more valuable than expensive
tools.
Expert knowledge matters... and so does data.
Understanding what is predictable is as important as trying to
predict it.
All models are wrong, some of them are useful.
All that we did not discuss...
Dimensionality reduction and feature selection
Causal inference
Unsupervised learning
Active learning
Spatio-temporal prediction
Nonstationary problems
Scalable machine learning
Control and robotics
Libraries and platforms (R, Python, Weka)
Resources
A biased list ...:-)
Scoop-it
www.scoop.it/t/machine-learning-by-gianluca-bontempi
on machine learning
Scoop-it
www.scoop.it/t/probabilistic-reasoning-and-statistics
on Probabilistic reasoning, causal inference and statistics
MLG mlg.ulb.ac.be
MA course INFO-F-422 Statistical foundations of machine
learning
Handbook available at https://www.otexts.org
