DERIVATIVE-FREE OPTIMIZATION

                       http://www.lri.fr/~teytaud/dfo.pdf
                           (or Quentin's web page ?)



Olivier Teytaud
Inria Tao, visiting the beautiful city of Liège            (also using slides from A. Auger)
The next slide is the most important of all.




In case of trouble, interrupt me.

In case of trouble, interrupt me.

         Further discussion needed:
            - R82A, Montefiore institute
            - olivier.teytaud@inria.fr
            - or after the lessons (the 25th, not the 18th)
Derivative Free Optimization
I. Optimization and DFO
  II. Evolutionary algorithms
  III. From math. programming
  IV. Using machine learning
  V. Conclusions

Derivative-free optimization of f




                                                              No gradient !
                                                    Only depends on the x's and f(x)'s

Derivative-free optimization of f

      Why derivative-free optimization ?
       Ok, it's slower.
       But sometimes you have no derivative.
       It's simpler (by far) ==> fewer bugs.
       It's more robust (to noise, to strange functions...).

Derivative-free optimization of f

          Optimization algorithms:
           ==> Newton optimization ?
           ==> Quasi-Newton (BFGS)
           ==> Gradient descent
           ==> ...

Derivative-free optimization of f

          Optimization algorithms
             Derivative-free optimization
                (don't need gradients)

Derivative-free optimization of f

          Optimization algorithms
             Derivative-free optimization
                Comparison-based optimization
                   (coming soon),
                   just needing comparisons,
                   including evolutionary algorithms
I. Optimization and DFO
  II. Evolutionary algorithms
  III. From math. programming
  IV. Using machine learning
  V. Conclusions

II. Evolutionary algorithms
          a. Fundamental elements
          b. Algorithms
          c. Math. analysis

Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
   ==> for theoretical analysis
Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
K exp( - p(x) ) with
                      - p(x) a degree 2 polynomial (neg. dom coef)
                      - K a normalization constant



Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
K exp( - p(x) ) with
                                 - p(x) a degree 2 polynomial (neg. dom coef)
                                 - K a normalization constant

        (figure annotations: translation of the Gaussian; size of the Gaussian)

Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains

Isotropic case:
==> general case: density = K exp( - || x - μ ||^2 / (2 σ^2) )
==> level sets are rotationally invariant
==> completely defined by μ and σ
       (do you understand why K is fixed by σ ?)
==> “isotropic” Gaussian
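
A side note, not on the slide, answering the question above: in dimension d the density must
integrate to 1, so the normalization constant K is fixed by σ (and d):

    \text{density}(x) = K \exp\!\left(-\frac{\|x-\mu\|^{2}}{2\sigma^{2}}\right),
    \qquad
    K = (2\pi\sigma^{2})^{-d/2}.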
Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
Step-size different on each axis

       K exp( - p(x) ) with
       - p(x) a quadratic form (--> + infinity)
       - K a normalization constant
Notions that we will see:
- Evolutionary algorithm
- Cross-over
- Truncation selection / roulette wheel
- Linear / log-linear convergence
- Estimation of Distribution Algorithm
- EMNA
- Self-adaptation
- (1+1)-ES with 1/5th rule
- Voronoi representation
- Non-isotropy
Comparison-based optimization


     Observation: we want robustness w.r.t. increasing transformations of the
     objective function: the algorithm Opt is comparison-based if its behavior
     depends on the fitness values only through their comparisons
     (the signs of the differences), not through the values themselves.




Comparison-based optimization

                                                            yi = f(xi)

                  Opt is comparison-based if its decisions depend on the yi
                  only through their pairwise comparisons.




Population-based comparison-based algorithms ?




  x(1) = ( x(1,1), x(1,2), ..., x(1,λ) ) = Opt()
  x(2) = ( x(2,1), x(2,2), ..., x(2,λ) ) = Opt( x(1), signs of diff )
           …             …            ...
  x(n) = ( x(n,1), x(n,2), ..., x(n,λ) ) = Opt( x(n-1), signs of diff )

    ==> let's write it for λ = 2.

Population-based comparison-based algorithms ?



  x(1) = ( x(1,1), x(1,2) ) = Opt()
  x(2) = ( x(2,1), x(2,2) ) = Opt( x(1), sign(y(1,1)-y(1,2)) )
           …           …           ...
  x(n) = ( x(n,1), x(n,2) ) = Opt( x(n-1), sign(y(n-1,1)-y(n-1,2)) )

                   with y(i,j) = f ( x(i,j) )

Population-based comparison-based algorithms ?




  Abstract notations: x(i) is a population.

  x(1) = Opt()
  x(2) = Opt( x(1), sign(y(1,1)-y(1,2)) )
          …           …            ...
  x(n) = Opt( x(n-1), sign(y(n-1,1)-y(n-1,2)) )



Population-based comparison-based algorithms ?



  Abstract notations: x(i) is a population, I(i) is an
                   internal state of the algorithm.

  x(1), I(1) = Opt()
  x(2), I(2) = Opt( x(1), sign(y(1,1)-y(1,2)), I(1) )
            …          …           ...
  x(n), I(n) = Opt( x(n-1), sign(y(n-1,1)-y(n-1,2)), I(n-1) )



Population-based comparison-based algorithms ?




  Abstract notations: x(i) is a population, I(i) is an
                   internal state of the algorithm, and c(i) stands for
                   the comparison information collected at step i.

  x(1), I(1) = Opt()
  x(2), I(2) = Opt( x(1), c(1), I(1) )
            …          …             ...
  x(n), I(n) = Opt( x(n-1), c(n-1), I(n-1) )


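To make the abstract scheme above concrete, here is a minimal Python sketch, assuming a
user-supplied callable opt; the names run_comparison_based and opt are illustrative, not
from the slides.

    import numpy as np

    def run_comparison_based(opt, f, n_iterations):
        """Abstract loop of a comparison-based optimizer: opt maps
        (previous population, comparison signs, internal state) to
        (new population, new state); it never sees the fitness values,
        only the signs of their pairwise differences."""
        population, state = opt(None, None, None)             # x(1), I(1) = Opt()
        for _ in range(n_iterations - 1):
            y = np.array([f(x) for x in population])
            signs = np.sign(y[:, None] - y[None, :])           # comparison information c(i)
            population, state = opt(population, signs, state)  # x(n), I(n) = Opt(...)
        return population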
Comparison-based optimization



           ==> Same behavior on many functions:
               any two functions inducing the same comparisons are
               optimized in exactly the same way.

         Quasi-Newton methods are very poor on this.
Why comparison-based algorithms ?
         ==> more robust
         ==> this can be mathematically
                  formalized: comparison-based opt.
                  are slow ( d log ||xn-x*||/n ~ constant)
                but robust (optimal for some worst
                  case analysis)
II. Evolutionary algorithms
          a. Fundamental elements
          b. Algorithms
          c. Math. analysis

Basic scheme of an Evolution Strategy

Parameters: x, σ

                       Generate λ points around x
                      ( x + σ N, where N is a standard Gaussian)

                       Compute their λ fitness values

                              Select the μ best

                       Let x = average of these μ best
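
A minimal Python sketch of this basic scheme, with a fixed step-size σ (the function name
basic_es and its arguments are illustrative):

    import numpy as np

    def basic_es(f, x0, sigma, lam, mu, n_iterations, seed=0):
        """Basic evolution strategy: sample lam points around x, keep the mu best
        (minimization), and move x to their average; sigma is kept fixed here."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iterations):
            offspring = x + sigma * rng.standard_normal((lam, x.size))  # x + sigma * N
            fitness = np.array([f(z) for z in offspring])               # lam fitness values
            best = offspring[np.argsort(fitness)[:mu]]                  # select the mu best
            x = best.mean(axis=0)                                       # average of the mu best
        return x

    # e.g. basic_es(lambda z: np.sum(z**2), np.ones(10), sigma=0.3, lam=12, mu=3, n_iterations=200)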
Obviously parallel: the λ fitness values can be computed simultaneously
      (multi-cores, clusters, grids...).

Really simple.
Not a negligible advantage.
      When I got access, for the first time, to a crucial industrial code
      of an important company, I believed that it would be clean and bug-free.
      (I was young.)
The (1+1)-ES with the 1/5th rule

Parameters: x, σ

                            Generate 1 point x' around x
                           ( x + σ N, where N is a standard Gaussian)

                              Compute its fitness value

                               Keep the best (x or x'):  x = best(x, x')

                           σ = 2 σ     if x' is the best
                           σ = 0.84 σ  otherwise
This is x...
I generate λ = 6 points.
I select the μ = 3 best points.
x = average of these μ = 3 best points.
Ok.
Choosing an initial x is as in any algorithm.
But how do I choose sigma ?
Ok.
Choosing x is as in any algorithm.
But how do I choose sigma ?


Sometimes by human guess.
But for a large number of iterations,
there are better options.
log || xn – x* || ~ - C n
Usually termed “linear convergence”,
      ==> but it's in log-scale.
     log || xn – x* || ~ - C n
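
Equivalently (a standard reformulation, not on the slide):

    \log \| x_n - x^* \| \approx -\,C\,n
    \;\Longleftrightarrow\;
    \| x_n - x^* \| \approx A\, e^{-C n} \quad \text{for some constant } A > 0,

i.e. the distance to the optimum decreases geometrically; "linear" refers to the straight
line obtained in log-scale.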
Examples of evolutionary algorithms




Estimation of Multivariate Normal Algorithm

   (figure slides, not reproduced)
EMNA is usually non-isotropic

   (figure slides, not reproduced)
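
Since the figures are lost, here is a hedged Python sketch of the usual EMNA loop (fit the
mean and the full, hence generally non-isotropic, covariance of the μ best points, then
resample); function and argument names are illustrative:

    import numpy as np

    def emna(f, x0, sigma0, lam, mu, n_iterations, seed=0):
        """Estimation of Multivariate Normal Algorithm (sketch, mu >= 2):
        sample lam points from the current Gaussian, keep the mu best,
        and refit mean and covariance on them."""
        rng = np.random.default_rng(seed)
        dim = len(x0)
        mean = np.asarray(x0, dtype=float)
        cov = (sigma0 ** 2) * np.eye(dim)
        for _ in range(n_iterations):
            pop = rng.multivariate_normal(mean, cov, size=lam)
            fitness = np.array([f(z) for z in pop])
            best = pop[np.argsort(fitness)[:mu]]
            mean = best.mean(axis=0)                                 # new mean
            cov = np.cov(best, rowvar=False) + 1e-12 * np.eye(dim)   # new (non-isotropic) covariance
        return mean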
Self-adaptation (works in many frameworks)

                                              Can be used for non-isotropic
                                                   multivariate Gaussian
                                                       distributions.
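
A hedged sketch of the standard self-adaptation scheme (each individual carries its own
step-size, which is mutated log-normally before mutating the point; the slides' exact
variant may differ, and all names below are illustrative):

    import numpy as np

    def self_adaptive_es(f, x0, sigma0, lam, mu, n_iterations, seed=0):
        """Self-adaptation sketch: the step-size is part of the genome and is
        itself mutated; selection on fitness then implicitly tunes it."""
        rng = np.random.default_rng(seed)
        dim = len(x0)
        tau = 1.0 / np.sqrt(dim)                            # usual learning rate for log(sigma)
        xs = np.tile(np.asarray(x0, dtype=float), (mu, 1))
        sigmas = np.full(mu, float(sigma0))
        for _ in range(n_iterations):
            parents = rng.integers(0, mu, size=lam)
            new_sigmas = sigmas[parents] * np.exp(tau * rng.standard_normal(lam))
            offspring = xs[parents] + new_sigmas[:, None] * rng.standard_normal((lam, dim))
            order = np.argsort([f(z) for z in offspring])[:mu]
            xs, sigmas = offspring[order], new_sigmas[order]
        return xs[0]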
Let's generalize.

 We have seen algorithms which work as follows:

  - we keep one search point in memory
    (and one step-size)
  - we generate individuals
  - we evaluate these individuals
  - we regenerate a search point and a step-size

Maybe we could keep more than one search point ?
Let's generalize.

  We have seen algorithms which work as follows:

  - we keep one search point in memory (and one step-size)
       ==> mu search points
  - we generate individuals
       ==> lambda generated individuals
  - we evaluate these individuals
  - we regenerate a search point and a step-size

Maybe we could keep more than one search point ?
Parameters: x1, ..., xμ

                      Generate λ points around x1, ..., xμ
                      e.g. each x randomly generated
                             from two points

                      Compute their λ fitness values

                             Select the μ best

                             Don't average...
Generate λ points
       around x1, ..., xμ
e.g. each x randomly generated
       from two points
              ==> this is a cross-over
Example of procedure for generating a point:

  - Randomly draw k parents x1, ..., xk
       (truncation selection: drawn at random among the selected individuals)


  - For generating the i-th coordinate of the new individual z:
                           u = random(1, k)
                           z(i) = i-th coordinate of x(u)
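
A direct Python transcription of this generating procedure (the function name
generate_offspring is illustrative; the selected individuals are assumed to be the rows of
a 2-D array):

    import numpy as np

    def generate_offspring(selected, k, rng):
        """Draw k parents among the selected individuals (truncation selection),
        then take each coordinate z(i) from a uniformly chosen parent x(u)."""
        parents = selected[rng.integers(0, len(selected), size=k)]   # x1, ..., xk
        dim = parents.shape[1]
        u = rng.integers(0, k, size=dim)                             # u = random(1, k), per coordinate
        return parents[u, np.arange(dim)]                            # z(i) = i-th coordinate of x(u)

    # rng = np.random.default_rng(0)
    # selected = rng.standard_normal((5, 3))     # 5 selected individuals in dimension 3
    # child = generate_offspring(selected, k=2, rng=rng)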
Let's summarize:
We have seen a general scheme for optimization:
 - generate a population, either from some distribution (EDA) or from
             a set of search points (EA)
 - select the best = new search points


==> Small difference between an
    Evolutionary Algorithm (EA) and an
    Estimation of Distribution Algorithm (EDA).
==> Some EAs (older than the EDA acronym) are EDAs.
Gives a lot of freedom:
 - choose your representation
        and operators (depending on the problem)
 - if you have a step-size, choose the adaptation rule
 - choose your population size λ (depending on your
                       computer/grid)

 - choose μ (carefully), e.g. μ = min(dimension, λ/4)
Gives a lot of freedom:
 - choose your operators (depending on the problem)
 - if you have a step-size, choose the adaptation rule
 - choose your population size λ (depending on your
                        computer/grid)

 - choose μ (carefully), e.g. μ = min(dimension, λ/4)


Can handle strange things:
  - optimize a physical structure ?
  - structure represented as a Voronoi diagram
  - cross-over makes sense, benefits from local structure
  - not so many algorithms can work on that
Voronoi representation:
                        - a family of points
                           - their labels
                  ==> cross-over makes sense
                  ==> you can optimize a shape
                  ==> not that mathematical,
                          but really useful


Mutations: each label is changed with probability 1/n
Cross-over: each point/label is randomly drawn from one of
      the two parents
Voronoi representation, alternative cross-over:


Mutations: each label is changed with probability 1/n
Cross-over: randomly pick one split in the representation:
                 - left part from parent 1
                 - right part from parent 2
               ==> related to biology
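
A hedged Python sketch of such a Voronoi representation with the two operators above,
assuming binary labels (a cell is inside or outside the shape); all names are illustrative:

    import numpy as np

    def mutate(points, labels, rng):
        """Flip each cell's label with probability 1/n (n = number of cells)."""
        n = len(labels)
        flip = rng.random(n) < 1.0 / n
        return points.copy(), np.where(flip, 1 - labels, labels)

    def crossover(parent1, parent2, rng):
        """One-point cross-over: left part of the (points, labels) list from
        parent 1, right part from parent 2 (both parents have n > 1 cells)."""
        (p1, l1), (p2, l2) = parent1, parent2
        cut = rng.integers(1, len(l1))
        return np.vstack([p1[:cut], p2[cut:]]), np.concatenate([l1[:cut], l2[cut:]])

    def in_shape(points, labels, query):
        """Decode the shape: a query point is inside iff its nearest Voronoi site
        has label 1."""
        nearest = np.argmin(np.linalg.norm(points - query, axis=1))
        return labels[nearest] == 1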
Gives a lot of freedom:
 - choose your operators (depending on the problem)
 - if you have a step-size, choose the adaptation rule
 - choose your population size λ (depending on your
                        computer/grid)

 - choose μ (carefully), e.g. μ = min(dimension, λ/4)


Can handle strange things:
  - optimize a physical structure ?
  - structure represented as a Voronoi diagram
  - cross-over makes sense, benefits from local structure
  - not so many algorithms can work on that
II. Evolutionary algorithms
          a. Fundamental elements
          b. Algorithms
          c. Math. Analysis

Consider the (1+1)-ES.
  x(n) = x(n-1)  or  x(n-1) + σ(n-1) N
  We want to maximize:
              - E log || x(n) - f* ||



Consider the (1+1)-ES.
  x(n) = x(n-1)  or  x(n-1) + σ(n-1) N
  We want to maximize:
              - E log || x(n) - f* ||
           --------------------------
            - E log || x(n-1) – f* ||
Consider the (1+1)-ES.
  x(n) = x(n-1)  or  x(n-1) + σ(n-1) N

  We want to maximize:
              - E log || x(n) - f* ||
           --------------------------
            - E log || x(n-1) – f* ||

  We don't know f*. How can we optimize this ?
  We will observe the acceptance rate,
  and we will deduce whether σ is too large or too small.
  - E log || x(n) - f* ||
 --------------------------                          ON THE NORM FUNCTION
  - E log || x(n-1) – f* ||


   (figure: regions of rejected mutations and accepted mutations)
  - E log || x(n) - f* ||                            For each step-size,
 --------------------------                evaluate this “expected progress rate”
  - E log || x(n-1) – f* ||                     and evaluate “P(acceptance)”


   (figure: regions of rejected mutations and accepted mutations)
(Figure slides: progress rate plotted against the acceptance rate, on the norm function.)

 - We want to be at the peak of the progress-rate curve; we observe (approximately)
   the acceptance rate.
 - Big step-size   ==> small acceptance rate ==> decrease sigma.
 - Small step-size ==> big acceptance rate   ==> increase sigma.
1/5th rule


                                Based on maths showing
                                     that good step-size
                                <==> success rate < 1/5
I. Optimization and DFO
  II. Evolutionary algorithms
  III. From math. programming
  IV. Using machine learning
  V. Conclusions

III. From math. programming
  ==> pattern search method
                                                    Comparison with ES:
                                                    - code more complicated
                                                    - same rate
                                                    - deterministic
                                                    - less robust




III. From math. programming

             Also:
             - Nelder-Mead algorithm (similar to pattern search,
                  better constant in the rate)
             - NEWUOA (using the function values and
                   not only comparisons)
I. Optimization and DFO
  II. Evolutionary algorithms
  III. From math. programming
  IV. Using machine learning
  V. Conclusions

IV. Using machine learning


    What if computing f takes days ?
==> parallelism
==> and “learn” an approximation of f

IV. Using machine learning



Statistical tools: f ' (x) = approximation
                             ( x, x1,f(x1), x2,f(x2), … , xn,f(xn))
                y(n+1) = f ' (x(n+1) )


        e.g. f' = quadratic function closest to f on the x(i)'s.
IV. Using machine learning


 ==> keyword “surrogate models”
 ==> use f' instead of f
 ==> periodically, re-use the real f
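
A hedged Python sketch of this surrogate idea, with a quadratic model fitted by least
squares on the archive of evaluated points; all names are illustrative:

    import numpy as np

    def quadratic_features(x):
        """Features of a full quadratic model: 1, x_i, and x_i * x_j (i <= j)."""
        x = np.asarray(x, dtype=float)
        cross = np.outer(x, x)[np.triu_indices(x.size)]
        return np.concatenate(([1.0], x, cross))

    def fit_surrogate(xs, ys):
        """f'(x) = quadratic function closest (in least squares) to f on the x(i)'s."""
        phi = np.array([quadratic_features(x) for x in xs])
        coef, *_ = np.linalg.lstsq(phi, np.asarray(ys, dtype=float), rcond=None)
        return lambda x: quadratic_features(x) @ coef

    # Inside an optimizer: rank candidates with the cheap surrogate f',
    # and call the real, expensive f only on the most promising ones
    # (and periodically on all of them, to keep the model honest).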
I. Optimization and DFO
  II. Evolutionary algorithms
  III. From math. programming
  IV. Using machine learning
  V. Conclusions

Derivative free optimization is fun.


==> nice maths
==> nice applications + easily parallel algorithms
==> can handle really complicated domains
   (mixed continuous / integer, optimization
   on sets of programs)


Yet,
often suboptimal on highly structured problems (when
       BFGS is easy to use, thanks to fast gradients)
Keywords, readings


==> cross-entropy (so close to evolution strategies)
==> genetic programming (evolutionary algorithms for
               automatically building programs)
==> H.-G. Beyer's book on ES = good starting point
==> many resources on the web
==> keep in mind that representation / operators are
   often the key
==> we only considered isotropic algorithms; sometimes not
   a good idea at all
