Machine learning: deep networks
and MCTS
olivier.teytaud@inria.fr
1. What is machine learning (ML)
2. Critically needed: optimization
3. Two recent algorithms: DN and MCTS
4. The mathematics of ML
5. Conclusion
What is machine learning ?
It's when machines learn :-)
● Learn to recognize, classify, make decisions,
play, speak, translate …
● Can be inductive (from data, using statistics)
and/or deductive
Examples
● Learn to play chess
● Learn to translate French → English
● Learn to recognize bears / planes / …
● Learn to drive a car (from examples ?)
● Learn to recognize handwritten digits
● Learn which ads you like
● Learn to recognize music
Different flavors of learning
● From data: given 100000 pictures of bears and 100000 pictures
of beers, learn to discriminate between a picture of a bear and a
picture of a beer.
● From data, 2: given 10000 pictures (no categories!
“unsupervised”)
– Find categories and classify
– Or find a “good” representation as a vector
● From simulators: given a simulator (~ the rules) of Chess, play
(well) chess.
● From experience: control a robot, and avoid bumps.
Deductive: not much... (was important at the time of your
grandfathers/grandmothers)
Machine learning everywhere ! ! !
Finding ads most likely to get your money.
Local weather forecasts.
Translation.
Handwritten text recognition.
Predicting traffic.
Detecting spam.
...
2. Optimization: a key component of
ML
● Given: a function k: w → k(w)
● Output: w* such that k(w*) is minimal
Usually, only an approximation of w*.
Many algorithms exist; one of the best for ML is
stochastic gradient descent.
2.a. Gradient descent
● w = random
● for m=1,2,3,....
– alpha = 0.01 / square-root(m)
– compute the gradient g of k at w
– w = w – alpha g
Key problem: computing g quickly.
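The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objective k(w) = (w[0] - 3)^2 + (w[1] + 1)^2 and its closed-form gradient `grad_k` are hypothetical, and the starting point is fixed instead of random so that the run is reproducible.

```python
import math

def gradient_descent(grad, w0, steps=50000):
    """Plain gradient descent with the step size alpha = 0.01 / sqrt(m)."""
    w = list(w0)
    for m in range(1, steps + 1):
        alpha = 0.01 / math.sqrt(m)
        g = grad(w)                                   # gradient of k at w
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical objective: k(w) = (w[0] - 3)^2 + (w[1] + 1)^2
def grad_k(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w_star = gradient_descent(grad_k, [0.0, 0.0])
print(w_star)  # close to the minimizer [3, -1]
```

With this slowly decreasing step size, many iterations are needed before w gets close to the minimizer, which is one reason why computing g quickly matters.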
2.b. Stochastic gradient descent
● k(w) = k1(w) + k2(w) + … + kn(w)
● Then at iteration m, use only the gradient of kj, where j = 1 + (m mod n)
==> THE key algorithm for machine learning
● w = random
● for m=1,2,3,....
– alpha = 0.01 / square-root(m)
– compute the gradient g of kj at w, with j = 1 + (m mod n)
– w = w – alpha g
The gradient can often be computed by “reverse-mode differentiation”, termed
“backpropagation” in neural networks (not that hard)
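A minimal sketch of this stochastic variant, assuming a hypothetical sum of one-dimensional terms kj(w) = (w[0] - tj)^2. The sum is minimized at the mean of the targets tj, which gives an easy sanity check. The terms are indexed from 0 in the code, so `m % n` selects the term directly.

```python
import math

def sgd(grads, w0, steps=50000):
    """Cyclic SGD: at iteration m, use only the gradient of one term."""
    n = len(grads)
    w = list(w0)
    for m in range(1, steps + 1):
        alpha = 0.01 / math.sqrt(m)
        g = grads[m % n](w)                           # gradient of a single k_j
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical terms k_j(w) = (w[0] - t_j)^2; the sum is minimal at mean(t_j) = 3.
targets = [1.0, 2.0, 3.0, 6.0]
grads = [lambda w, t=t: [2.0 * (w[0] - t)] for t in targets]

w_hat = sgd(grads, [0.0])
print(w_hat)  # close to [3.0]
```

Each iteration touches a single term, so its cost does not grow with n.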
3. Two ML algorithms
● Part 1: Deep learning (learning to predict)
– Neural networks
– Empirical risk minimization & variants
– Deep networks
● Part 2: MCTS (learning to play)
Neuron
[Figure: a neuron with inputs x1, x2, x3 and a constant input 1, weighted by w1, w2, w3, w4.
Linear part: z = w.(x,1); nonlinear part: σ(z) = σ(w.(x,1))]
(usually, we do not write the link to “1”)
Formally:
Output=σ(w.(input,1))
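As a small illustration of Output=σ(w.(input,1)), here is a sketch of a single neuron. It assumes σ is the logistic sigmoid; the weight and input values are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x):
    """Output = sigma(w . (x, 1)); the last weight multiplies the constant input 1."""
    extended = list(x) + [1.0]
    z = sum(wi * xi for wi, xi in zip(w, extended))   # linear part
    return sigmoid(z)                                 # nonlinear part

# Hypothetical weights (w1, w2, w3, w4) and inputs (x1, x2, x3)
out = neuron([0.5, -0.25, 0.1, 0.0], [1.0, 2.0, 0.0])
print(out)  # z = 0.5 - 0.5 + 0 + 0 = 0, so the output is 0.5
```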
Neural networks
One layer:
f(x,w)=σ(w1.x+w1b) w=(w1,w1b)
(==> matrix notations for short: x=vector, w1=matrix, w1b=vector)
Two layers:
f(x,w)=σ(w2.σ(w1.x+w1b)+w2b) (( = σ(w2.σ(w1.x)) for short, omitting the biases ))
w=(w1,w2,w1b,w2b)
f(x,w)= ….. more layers ….
[Figure: network mapping input x to output f(x,w)]
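The two-layer formula f(x,w)=σ(w2.σ(w1.x+w1b)+w2b) can be written out as follows. This is a sketch assuming a logistic σ applied componentwise, with matrices stored as lists of rows; the numeric weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, x):
    """One layer: sigma(W.x + b), with W a list of rows."""
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def f(x, w):
    """Two-layer network, with w = (w1, w2, w1b, w2b) as in the text."""
    w1, w2, w1b, w2b = w
    return dense(w2, w2b, dense(w1, w1b, x))

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output
w = ([[1.0, -1.0], [0.0, 0.0]],   # w1
     [[0.0, 0.0]],                # w2
     [0.0, 0.0],                  # w1b
     [0.0])                       # w2b
y = f([2.0, 2.0], w)
print(y)  # every pre-activation is 0 here, so the output is [0.5]
```

Adding more layers amounts to chaining more calls to `dense`.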
Neural networks & empirical risk
minimization
Define the model:
f(x,w)=σ(w1.x+w1b) w=(w1,w1b)
f(x,w)=σ(w2.σ(w1.x+w1b)+w2b)
w=(w1,w2,w1b,w2b)
f(x,w)= ….. more layers ….
How to find a good w ?
What is a good w ?
Try to find w such that ||f(xi,w) – yi||² is small
==> finding a predictor of y, given x
[Figure: network mapping input x to output f(x,w)]
Neural networks & empirical risk
minimization
● Inputs = x1,...,xN (vectors in R^d) and y1,...,yN (vectors in R^k)
● Assumption: the (xi,yi) are randomly drawn, i.i.d, for some probability
distribution
● Define a loss:
L(w) = E ||f(x,w) – y||²
and its approximation, the empirical risk:
L'(w) = average over i of ||f(xi,w) – yi||²
● Optimize:
– Computing w = argmin L(w) is impossible (L is unknown)
– So w = argmin L'(w) ==> by stochastic gradient descent: gradient ?
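The empirical risk L'(w) is simply an average of squared errors over the sample. A minimal sketch, assuming a hypothetical linear model in place of the network and a made-up data set of three examples:

```python
def empirical_risk(f, w, xs, ys):
    """L'(w): average over the examples of ||f(x_i, w) - y_i||^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = f(x, w)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(xs)

# Hypothetical model: a linear predictor with a single output
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(w, x))]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [[1.0], [2.0], [3.0]]

risk_good = empirical_risk(linear, [1.0, 2.0], xs, ys)  # fits all three points
risk_bad = empirical_risk(linear, [0.0, 0.0], xs, ys)   # (1 + 4 + 9) / 3
print(risk_good, risk_bad)
```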
Neural networks with SGD
(stochastic gradient descent)
Minimize the sum of the ||f(xi,w) – yi||² by
● w ← w – alpha grad ||f(x1,w) – y1||²
● w ← w – alpha grad ||f(x2,w) – y2||²
● …
● w ← w – alpha grad ||f(xn,w) – yn||²
● + restart
[Figure: network mapping x to f(x,w) ~ y]
The network sees “xi” and “yi” one at a time.
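The "one example at a time, then restart" loop can be sketched as below. To keep the gradient readable, the sketch assumes a linear model f(x,w) = w.x with scalar targets (so grad ||f(x,w) – y||² = 2 (w.x – y) x) and a constant, hypothetical step size alpha; a real network would obtain the gradient by backpropagation instead.

```python
def sgd_linear(xs, ys, alpha=0.05, passes=500):
    """SGD on the sum of (w.x_i - y_i)^2, visiting one example at a time."""
    w = [0.0] * len(xs[0])
    for _ in range(passes):                  # "+ restart": loop over the data again
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # gradient of (w.x - y)^2 with respect to w is 2 * err * x
            w = [wi - alpha * 2.0 * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical data generated by y = 1*x[0] + 2*x[1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
w = sgd_linear(xs, ys)
print(w)  # close to [1.0, 2.0]
```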
Backpropagation ==> gradient
(thanks http://slideplayer.com/slide/5214241)
● Sigmoid function: σ(z) = 1/(1+exp(–z))
● Partial derivatives written in terms of outputs (o)
and activations (z), using the derivatives with respect to z (δ)
[Formulas for δ at an output node and at an internal node: shown as images on the slide]
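For reference, here are the standard textbook forms of these quantities. They are not recovered from the slide itself; they assume the squared loss E = ||o – y||² used elsewhere in this talk, sigmoid units, and the notation w_{ji} for the weight from node i to node j.

```latex
\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,\bigl(1-\sigma(z)\bigr)

% Output node j, with o_j = \sigma(z_j) and E = \sum_j (o_j - y_j)^2:
\delta_j = \frac{\partial E}{\partial z_j} = 2\,(o_j - y_j)\,o_j\,(1 - o_j)

% Internal node i, feeding the nodes j of the next layer:
\delta_i = \frac{\partial E}{\partial z_i} = o_i\,(1 - o_i)\,\sum_j w_{ji}\,\delta_j

% Gradient with respect to a weight:
\frac{\partial E}{\partial w_{ji}} = \delta_j\, o_i
```

The δ values are computed from the output layer backwards, hence the name; each weight gradient is then a product of one δ and one output.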
Neural networks as encoders
Try to find w such that ||f(xi,w) – xi||² is small, then remove the end
==> finding an encoder of x!
i.e. we get a function f such that x should be g(f(x)) (for some g).
… looks crazy ? Just f(x)=x is a solution!
[Figure: network mapping x to f(x,w); the second half is marked “Delete this!”]
Ok, neural networks
We have seen two possibilities:
● Neural networks as predictors (supervised)
● Neural networks as encoders (unsupervised)
Both use stochastic gradient descent (one example at a time)
and ERM (from examples).
Now, let us come back to predictors, but with a
better algorithm, for “deep” learning – using
encoders.
Empirical risk minimization and
numerical optimization
● We would like to optimize the “real” error (expectation; termed
generalization error, GE) but we have only access to the empirical error
(ER).
● For the same ER, we can have different GE.
● Two questions:
– How to reduce the difference between ER and GE ?
Regularization: minimize L'+||w||² (small parameters)
Sparsity: minimize L'+||w||₀ (few parameters)
==> VC theory (no details here)
– Which of the ER optima are best for GE ? ? ? ?
(now known to be an excellent question!)
==> deep network learning by unsupervised tools!
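The two penalized objectives above can be sketched as follows. The weight `lam` that balances the empirical risk against the penalty is a hypothetical addition (the slide uses an implicit weight of 1), and the numbers are made up.

```python
def l2_penalty(w):
    """||w||^2: favours small parameters."""
    return sum(wi * wi for wi in w)

def l0_penalty(w):
    """||w||_0: number of non-zero parameters, favours few parameters."""
    return sum(1 for wi in w if wi != 0.0)

def penalized_risk(empirical_risk_value, w, penalty, lam=1.0):
    """L'(w) + lam * penalty(w); lam is an assumed trade-off weight."""
    return empirical_risk_value + lam * penalty(w)

w = [3.0, 0.0, 4.0]
print(penalized_risk(0.5, w, l2_penalty))   # 0.5 + 25
print(penalized_risk(0.5, w, l0_penalty))   # 0.5 + 2
```

The ||w||² version is differentiable and fits directly into stochastic gradient descent; the ||w||₀ version is not, so in practice it needs other optimization tools.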
Deep neural networks
● What if many layers ?
● Many local minima (proof: symmetries!)
==> does not work
● Two steps:
– unsupervised learning, layer by layer; the network is
growing;
– then, apply ERM for fine tuning.
● Unsupervised pretraining ==> with the same
empirical error, generalization error is better!
Deep networks pretraining
[Figure sequence: a first layer is trained by auto-encoding (x → x); this part is
learnt; its output z is then auto-encoded (z → z) to learn the next layer; this part
is learnt as well; then the network grows, layer by layer.]
Deep networks: supervised!
● Learn (supervised learning) the last layer (x → y).
● Then learn (supervised learning) the whole network (fine tuning).
Deep networks in one slide
● For i = 1, 2, 3, …, k:
– Learn one layer by autoencoding (unsupervised)
– Remove the second part
● Learn one more layer in a supervised manner
● Learn the whole network (supervised as well;
fine tuning)
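The unsupervised loop of this recipe can be sketched as below. It is a toy illustration, not the author's implementation: it assumes sigmoid units, squared reconstruction error, plain SGD with a constant step, and inputs in [0, 1]; the function names and hyperparameters are hypothetical. The final supervised layer and the fine tuning are left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def train_autoencoder(data, n_hidden, epochs=200, alpha=0.5, seed=0):
    """Train x -> h -> x by SGD and return only the encoder half."""
    rng = random.Random(seed)
    d = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(d)]
    b2 = [0.0] * d
    for _ in range(epochs):
        for x in data:
            h = layer(W1, b1, x)
            o = layer(W2, b2, h)
            # Backpropagation for the error 0.5 * ||o - x||^2
            d_out = [(oj - xj) * oj * (1.0 - oj) for oj, xj in zip(o, x)]
            d_hid = [hi * (1.0 - hi) * sum(W2[j][i] * d_out[j] for j in range(d))
                     for i, hi in enumerate(h)]
            for j in range(d):
                for i in range(n_hidden):
                    W2[j][i] -= alpha * d_out[j] * h[i]
                b2[j] -= alpha * d_out[j]
            for i in range(n_hidden):
                for k in range(d):
                    W1[i][k] -= alpha * d_hid[i] * x[k]
                b1[i] -= alpha * d_hid[i]
    return W1, b1          # the decoder (W2, b2) is dropped: "remove the second part"

def pretrain(data, hidden_sizes):
    """Greedy layer-wise pretraining: each new layer auto-encodes the previous codes."""
    encoders, current = [], data
    for n_hidden in hidden_sizes:
        W, b = train_autoencoder(current, n_hidden)
        encoders.append((W, b))
        current = [layer(W, b, x) for x in current]   # the network grows
    return encoders

data = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
encoders = pretrain(data, [3, 2])
```

The returned encoders would then serve as the initial weights of the first layers before the supervised steps.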
Deep networks
● A revolution in vision
● Important point (not developed here): sharing some parameters,
because first layers = low level feature extractors (LLF), and LLF are
the same everywhere ==> convolutional nets
● Link with natural learning: learn simple concepts first;
unsupervised learning.
● Not only “σ”, this was just an example;
output=w0.exp(-w2.||input-w1||²)
● Great success in speech & vision
● Surprising performance in Go (discussed later :-) )
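The alternative unit mentioned above, output=w0.exp(-w2.||input-w1||²), is a radial basis function: it responds most strongly when the input is close to the centre w1. A minimal sketch with made-up parameter values:

```python
import math

def rbf_unit(x, w0, w1, w2):
    """output = w0 * exp(-w2 * ||x - w1||^2); w1 is the centre, w2 the sharpness."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, w1))
    return w0 * math.exp(-w2 * dist2)

at_centre = rbf_unit([1.0, 2.0], w0=3.0, w1=[1.0, 2.0], w2=0.5)   # distance 0
off_centre = rbf_unit([0.0, 0.0], w0=1.0, w1=[1.0, 0.0], w2=1.0)  # squared distance 1
print(at_centre, off_centre)  # 3.0 and exp(-1)
```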
Part 2: MCTS
● MCTS originates in 2006
● UCT = one particular flavor, from 2006, probably the best
known
● A revolution in Computer Go
Part I : The Success Story
(less showing off in part II :-) )
The game of Go is a beautiful
challenge.
We achieved the first wins against
professional players
in the game of Go
But with handicap!
Game of Go (9x9 here)
Game of Go: counting territories
(white has a 7.5 “bonus” as black starts)
Game of Go: the rules
Black plays at the blue circle:
the white group dies (it is
removed)
It's impossible to kill white (two “eyes”).
“Superko” rule: we don't come back to the same
situation.
(without superko: “PSPACE-hard”
with superko: “EXPTIME-hard”)
At the end, we count territories
==> black starts, so +7.5 for white.
The rank of MCTS and classical programs in Go
(Source: Peter Shotwell + computer Go mailing list)
[Figure: rank of programs over time. Annotations: alpha-beta; MCTS; RAVE;
MPI-parallelization; ML + expertise, ...; quasi-solving of 7x7; not over in 9x9...;
stagnation around 5D ?]
MCTS part 2: the UCT algorithm
● MCTS means “Monte Carlo Tree Search”
● UCT means “Upper Confidence Trees”
Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvari (06)
UCT (Upper Confidence Trees)
[Figure sequence: the UCT tree growing step by step; legend: Monte Carlo = random part]
Exploitation ...
SCORE = 5/7 + k.sqrt( log(10)/7 )
... or exploration ?
SCORE = 0/2 + k.sqrt( log(10)/2 )
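The two scores above can be computed directly. The sketch assumes a natural logarithm and reads the numbers as: 10 simulations at the parent, 5 wins out of 7 for the first move, 0 wins out of 2 for the second. The values of k are hypothetical; they only show how k shifts the balance.

```python
import math

def uct_score(wins, visits, parent_visits, k):
    """Average reward plus an exploration bonus that shrinks as visits grow."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

for k in (1.0, 3.0):
    exploit = uct_score(5, 7, 10, k)   # well-explored move with a good win rate
    explore = uct_score(0, 2, 10, k)   # rarely tried move with a poor win rate
    choice = "exploitation" if exploit > explore else "exploration"
    print(k, round(exploit, 3), round(explore, 3), choice)
```

With a small k the well-tested move is simulated again; with a large k the rarely tried move gets another chance.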
UCT in one slide
Great progress in the game of Go and in various other games
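To make the four UCT steps concrete (selection by score, expansion, random simulation, update), here is a compact sketch on a toy game instead of Go. All choices are assumptions made for the illustration: the game is a subtraction game (take 1 to 3 stones, whoever takes the last stone wins), and k, the number of iterations and the seed are arbitrary.

```python
import math
import random

class Node:
    def __init__(self, pile, parent=None, move=None):
        self.pile = pile                       # stones left for the player to move
        self.parent, self.move = parent, move
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= pile]
        self.visits = 0
        self.wins = 0.0                        # wins of the player who moved INTO this node

def select(node, k):
    return max(node.children,
               key=lambda c: c.wins / c.visits
               + k * math.sqrt(math.log(node.visits) / c.visits))

def rollout(pile, rng):
    """Random playout; True if the player to move at `pile` takes the last stone."""
    turn = 0
    while pile > 0:
        pile -= rng.randint(1, min(3, pile))
        if pile == 0:
            return turn == 0
        turn = 1 - turn
    return False                               # no stones left: the previous player won

def uct_best_move(pile, iterations=2000, k=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(pile)
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:      # 1. selection
            node = select(node, k)
        if node.untried:                                # 2. expansion
            m = node.untried.pop()
            child = Node(node.pile - m, parent=node, move=m)
            node.children.append(child)
            node = child
        win = not rollout(node.pile, rng)               # 3. simulation (random part)
        while node is not None:                         # 4. update along the path
            node.visits += 1
            node.wins += 1.0 if win else 0.0
            win = not win                               # players alternate
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move

print(uct_best_move(3))   # taking all 3 stones wins immediately
```

Only a simulator of the game is needed; no evaluation function appears anywhere in the code.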
Why ?
Why “+ square-root( log(...)/ … )” ?
because there are nice maths on this in
completely different settings.
Seriously, no good reason, use whatever
you want :-)
Current status ?
MCTS has invaded game applications:
• For games which have a good simulator
(required!)
• For games for which there is no good
evaluation function, i.e. no simple map
“board → probability that black wins”
Also some hard discrete control tasks.
Current status ?
Go ? Humans still much stronger than
computers.
Deep networks: surprisingly good
performance as an evaluation function.
Still performs far worse than best MCTS.
Merging MCTS and deep networks ?
Current MCTS research ?
Recent years:
• parallelization
• extrapolation (between branches of the
search)
But most progress = human expertise
and tricks in the random part.
4. The maths of ML
One can find theorems justifying regularization
(+||w||² or +||w||₀), or theorems justifying that deep
networks need fewer parameters than shallow
networks for approximating some functions.
Still, MCTS and neural networks were born quite
independently of maths.
Still, you need stochastic gradient descent.
Maybe, in the future of ML, real progress born in
maths ?
Others
Random projection ?
● Randomly project your data (linearly or not)
● Learn on these random projections
● Super fast, not that bad
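A linear random projection takes only a few lines. The sketch below assumes a Gaussian random matrix scaled by 1/sqrt(out_dim), a common choice that preserves squared norms in expectation; the function name, seed and data are hypothetical. Any learner can then be trained on the projected vectors.

```python
import random

def random_projection(data, out_dim, seed=0):
    """Project each vector with a fixed Gaussian random matrix."""
    rng = random.Random(seed)
    in_dim = len(data[0])
    scale = 1.0 / out_dim ** 0.5
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
         for _ in range(out_dim)]
    return [[sum(r * xi for r, xi in zip(row, x)) for row in R] for x in data]

data = [[1.0, 0.0, 2.0, 0.5, 3.0], [0.0, 1.0, 1.0, 1.0, 0.0]]
projected = random_projection(data, out_dim=3)
print(projected)   # 2 vectors of dimension 3
```

No training is involved in the projection itself, which is where the speed comes from.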
Machine learning + encryption
● Statistics on data... without decrypting them
● Critical for applications
– Where we must “know” what you do (predicting
power consumption)
– But we should not know too much (privacy)
Simulation-based + data-based
optimization
● Optimization of models = forgets too many features
from the real world
● Optimization of simulators = better
==> technically, optimization of expensive functions
(the optimization algorithm can spend computational
power) + surrogate model (i.e. ML)
Distributed collaborative
decision making ?
● Power network:
– frequency = 50Hz (deviations ≈ )
– (frequency)' = k x (production – demand) → ≈ 0!
● Too much wind power ==> unstable network
because hard to satisfy “production = demand”
● Solutions ?
– Detect frequency
– Increase/decrease production but also demand
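The relation (frequency)' = k x (production – demand) can be simulated with a simple Euler step. The constant k, the time step and the production and demand values below are hypothetical and carry no realistic units; the sketch only shows that any lasting imbalance makes the frequency drift away from 50Hz.

```python
def simulate_frequency(production, demand, k=0.01, f0=50.0, dt=1.0):
    """Euler integration of f' = k * (production - demand)."""
    f = f0
    trace = [f]
    for p, d in zip(production, demand):
        f += dt * k * (p - d)
        trace.append(f)
    return trace

balanced = simulate_frequency([100.0, 100.0], [100.0, 100.0])
surplus = simulate_frequency([110.0, 110.0], [100.0, 100.0])   # e.g. a gust of wind
print(balanced)   # stays at 50.0
print(surplus)    # drifts upwards until production or demand is adjusted
```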
[Figure labels: limited capacity; ramping constraint (power output smooth)]
Typical example of natural monopoly.
Deregulation + more distributed production
+ more renewable energy
==> who regulates the network ?
More regulation after all ?
Distributed collaborative decision making.
IMHO, distributed collaborative decision making is a great
research area (useful + not well understood)
Power systems must change!
● Tired of buying oil which leads to ?
● Don't want ?(coal)
● Afraid of ?
But unstable ?
COME AND HELP ! ! ! STABILIZATION NEEDED :-)
Conclusions 1: recent
success stories
● MCTS success story
– 2006: immediately reasonably good
– 2007: thanks to fun tricks in the MC part, strong against pros in
9x9
– 2008: with parallelization, good in 19x19
● Deep networks
– Convolutional DN excellent in 1998 (!) in vision, slightly
overlooked for years
– Now widely recognized in many areas
● Both make sense only with strong computers
Conclusions 2: mathematics &
publication & research
● For many years:
– SVM was the big boss of supervised ML (because there were
theorems, whereas there are few theorems in deep learning)
– Alpha-beta was the big boss of games
● MCTS was immediately recognized as a key contribution
to ML; why wasn't it the case for deep learning ? Maybe
because SVM were easier to explain, prove, advertise.
(but highest impact factor = +squareRoot(... / … ) ! )
● Both deep learning and MCTS look like fun exercises
rather than science; still, they are key tools for ML.
==> keep time for “fun” research, don't worry too much about
publications
Conclusions 3: applications are fun!
(important ones :-) )
● Both deep learning and MCTS were born from
applications
● Machine learning came from experiments more than
from pure theory
● Automatic driving, micro-emotions (big
brother ?), bioinformatics, …. and POWER
SYSTEMS (with open source / open data!).
References
● Backpropagation, Rumelhart et al 1986
● MCTS, Coulom 2006 + Kocsis et al 2006 +
Gelly et al 2007
● Conv. Networks Fukushima 1980
● Deep conv. networks LeCun 1998
● Regularization, Vapnik et al 1971

Machine learning 2016: deep networks and Monte Carlo Tree Search
