Q-Learning
  and Pontryagin's Minimum Principle

Sean Meyn
Department of Electrical and Computer Engineering
and the Coordinated Science Laboratory
    University of Illinois


   Joint work with Prashant Mehta
   NSF support: ECS-0523620
Outline

    Coarse models - what to do with them?

    Q-learning for nonlinear state space models
        Step 1: Recognize
        Step 2: Find a stabilizing policy
        Step 3: Optimality
        Step 4: Adjoint
        Step 5: Interpret

    Example: Local approximation

    Example: Decentralized control
Coarse Models: A rich collection of model reduction techniques


Many of today’s participants have contributed to this research.
A biased list:

    Fluid models: Law of Large Numbers scaling, most likely paths in large deviations
    Workload relaxation for networks
    Heavy-traffic limits

    Clustering: spectral graph theory, Markov spectral theory

    Singular perturbations
    Large population limits: Interacting particle systems
Workload Relaxations

An example from CTCN:




[Figure 7.1: Demand-driven model with routing, scheduling, and re-work.]

Workload at two stations evolves as a two-dimensional system.
Cost is projected onto these coordinates:

    [Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1.
     In each figure the optimal stochastic control region R_STO is compared with the
     optimal region R* obtained for the two-dimensional fluid model. Axes: w1, w2.]

Optimal policy for relaxation = hedging policy for full network
Workload Relaxations and Simulation

An example from CTCN:

[Figure: demand-driven network with Stations 1 and 2 (arrival rates α, service rates µ).]

Decision making at stations 1 & 2, e.g., setting safety-stock levels.

DP and simulations accelerated using the fluid value function for the workload relaxation:

    [Left plot: average cost vs. iteration, for VIA initialized with zero and with the
     fluid value function.]

    [Right plot: simulated mean with and without control variate, vs. safety-stock levels.]
What To Do With a Coarse Model?

[Inset: average cost vs. iteration, for VIA initialized with zero and with the fluid
value function.]

Setting: we have qualitative or partial quantitative insight regarding optimal control.

The network examples relied on specific network structure.
    What about other models?

An answer lies in a new formulation of Q-learning.
Outline

    Coarse models - what to do with them?

    Q-learning for nonlinear state space models
        Step 1: Recognize
        Step 2: Find a stabilizing policy
        Step 3: Optimality
        Step 4: Adjoint
        Step 5: Interpret

    Example: Local approximation

    Example: Decentralized control
What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs:
    Q-Learning. C. J. C. H. Watkins and P. Dayan. Machine Learning, 1992.

Idea is similar to Mayne & Jacobson's differential dynamic programming:
    Differential dynamic programming. D. H. Jacobson and D. Q. Mayne.
    American Elsevier Pub. Co., 1970.

Deterministic formulation: Nonlinear system on Euclidean space,

    d/dt x(t) = f(x(t), u(t)),    t ≥ 0

Infinite-horizon discounted cost criterion,

    J*(x) = inf ∫_0^∞ e^{−γs} c(x(s), u(s)) ds,    x(0) = x

with c a non-negative cost function.

Differential generator: For any smooth function h,

    D^u h(x) := (∇h(x))^T f(x, u)

HJB equation:

    min_u { c(x, u) + D^u J*(x) } = γ J*(x)

The Q-function of Q-learning is this function of two variables.
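For intuition, the discounted cost under a fixed policy can be estimated by direct numerical
integration. A minimal Python sketch, assuming the scalar cubic system and quadratic cost from
the local-learning example later in the deck; the linear feedback, horizon, and step size are
arbitrary illustrative choices.

    import numpy as np

    def discounted_cost(x0, policy, f, c, gamma, T=50.0, dt=1e-3):
        # Approximate J(x0) = int_0^inf e^{-gamma s} c(x(s), u(s)) ds by Euler
        # integration of dx/dt = f(x, u) under the feedback u = policy(x),
        # truncating the integral at time T.
        x, J, t = x0, 0.0, 0.0
        while t < T:
            u = policy(x)
            J += np.exp(-gamma * t) * c(x, u) * dt
            x += f(x, u) * dt
            t += dt
        return J

    # Cubic system and quadratic cost from the local-learning example:
    f = lambda x, u: -x**3 + u
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2

    # An arbitrary stabilizing linear feedback (illustration only):
    policy = lambda x: -0.5 * x

    print(discounted_cost(x0=1.0, policy=policy, f=f, c=c, gamma=0.1))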
Q learning - Steps towards an algorithm

Sequence of five steps:

    Step 1: Recognize fixed point equation for the Q-function
    Step 2: Find a stabilizing policy that is ergodic
    Step 3: Optimality criterion - minimize Bellman error
    Step 4: Adjoint operation
    Step 5: Interpret and simulate!

Goal - seek the best approximation, within a parameterized class.
Q learning - Steps towards an algorithm

Step 1: Recognize fixed point equation for the Q-function

    Q-function:      H*(x, u) = c(x, u) + D^u J*(x)

    Its minimum:     H*(x) := min_{u∈U} H*(x, u) = γ J*(x)

    Fixed point equation:

        D^u H*(x) = −γ( c(x, u) − H*(x, u) )

    Key observation for learning: For any input-output pair,

        D^u H*(x) = d/dt H*(x(t)) |_{x=x(t), u=u(t)}
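The fixed point equation can be checked numerically along any input-output trajectory. A minimal
sketch, assuming a scalar discounted LQR example with arbitrarily chosen constants so that H* is
available in closed form; the time derivative is taken by finite differences, mirroring the key
observation above.

    import numpy as np

    # Scalar discounted LQR: dx/dt = a x + b u,  c(x,u) = (q x^2 + r u^2)/2.
    a, b, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 0.5

    # p > 0 solving the discounted Riccati equation b^2 p^2 / r - (2a - gamma) p - q = 0,
    # so that J*(x) = p x^2 / 2.
    p = ((2*a - gamma) + np.sqrt((2*a - gamma)**2 + 4*b**2*q/r)) * r / (2*b**2)

    c     = lambda x, u: 0.5 * (q * x**2 + r * u**2)
    Hstar = lambda x, u: c(x, u) + (a*x + b*u) * p * x   # H*(x,u) = c + D^u J*
    Hmin  = lambda x: 0.5 * gamma * p * x**2             # min_u H*(x,u) = gamma J*(x)

    dt, T, x = 1e-4, 5.0, 1.0
    for k in range(int(T / dt)):
        t = k * dt
        u = np.sin(t) + np.sin(np.pi * t)                # arbitrary exploratory input
        x_next = x + (a*x + b*u) * dt
        lhs = (Hmin(x_next) - Hmin(x)) / dt              # d/dt H*(x(t)) along the path
        rhs = -gamma * (c(x, u) - Hstar(x, u))           # fixed point equation RHS
        assert abs(lhs - rhs) < 1e-2
        x = x_next
    print("fixed point equation holds along the trajectory")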
Q learning - LQR example

Linear model and quadratic cost,

    Cost:        c(x, u) = ½ x^T Q x + ½ u^T R u

    Q-function:  H*(x, u) = c(x, u) + (Ax + Bu)^T P* x
                          = c(x, u) + D^u J*(x),
                 where J*(x) = ½ x^T P* x solves the Riccati equation.

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i x^T E^i x + Σ_{j=1}^{d_xu} θ^xu_j x^T F^j u

    Minimum:

        H^θ(x) = ½ x^T ( Q + E^θ − (F^θ)^T R^{−1} F^θ ) x

    Minimizer:

        u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
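A minimal sketch of this parameterization, assuming a two-state, one-input model and arbitrarily
chosen basis matrices E^i, F^j; the cross term is coded as x^T F u, so the minimizer appears with
a transpose relative to the slides' F^θ convention.

    import numpy as np

    n, m = 2, 1
    Q = np.eye(n)
    R = np.eye(m)

    # Illustrative basis matrices (not from the slides).
    E_basis = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0]), np.array([[0.0, 1.0], [1.0, 0.0]])]
    F_basis = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]

    def E_theta(th_x):
        return sum(t * Ei for t, Ei in zip(th_x, E_basis))

    def F_theta(th_xu):
        return sum(t * Fj for t, Fj in zip(th_xu, F_basis))

    def H_theta(x, u, th_x, th_xu):
        # H^theta(x,u) = c(x,u) + (1/2) x^T E^theta x + x^T F^theta u
        cost = 0.5 * x @ Q @ x + 0.5 * u @ R @ u
        return cost + 0.5 * x @ E_theta(th_x) @ x + x @ F_theta(th_xu) @ u

    def policy(x, th_xu):
        # argmin_u H^theta(x, u) = -R^{-1} (F^theta)^T x   (transpose reflects the
        # x^T F u convention used here)
        return -np.linalg.solve(R, F_theta(th_xu).T @ x)

    x = np.array([1.0, -0.5])
    th_x, th_xu = [0.2, 0.1, 0.0], [0.3, -0.1]
    u = policy(x, th_xu)
    print("phi_theta(x) =", u, "  H_theta(x, phi(x)) =", H_theta(x, u, th_x, th_xu))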
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Assume the LLN holds for continuous functions F : X × U → R.
As T → ∞,

    (1/T) ∫_0^T F(x(t), u(t)) dt  →  ∫_{X×U} F(x, u) ϖ(dx, du)

where ϖ denotes the resulting steady-state distribution.
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose for example the input is scalar, and the system is stable
[bounded-input / bounded-state].

Can try a linear combination of sinusoids, e.g.,

    u(t) = A( sin(t) + sin(πt) + sin(et) )
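A minimal sketch of this exploration step, assuming the scalar cubic system from the
local-learning example: apply the sinusoidal input and form the time average of a test function,
which should settle if the LLN holds.

    import numpy as np

    A, dt, T = 1.0, 1e-3, 500.0

    def u_explore(t):
        # linear combination of sinusoids with incommensurate frequencies
        return A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

    x, t = 0.0, 0.0
    acc, n = 0.0, 0
    while t < T:
        u = u_explore(t)
        acc += 0.5 * x**2 + 0.5 * u**2     # test function F(x, u); its time average
        n += 1                             # should converge if the LLN holds
        x += (-x**3 + u) * dt              # stable cubic system (BIBS)
        t += dt

    print("time average of F(x, u):", acc / n)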
Q learning - Steps towards an algorithm

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

    E(θ) := ½ ⟨ L^θ, L^θ ⟩

where L^θ(x, u) := D^u H^θ(x) + γ( c(x, u) − H^θ(x, u) ) is the Bellman error associated with H^θ.

First order condition for optimality:

    ⟨ L^θ, D^u ψ^θ_i − γ ψ^θ_i ⟩ = 0,    with ψ^θ_i(x) = ψ_i(x, φ^θ(x)),    1 ≤ i ≤ d
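A minimal sketch of the mean-square Bellman error as a trajectory average, assuming the cubic
system and two-parameter basis from the local-learning example; d/dt H^θ is replaced by a finite
difference along the simulated path.

    import numpy as np

    gamma, dt, T = 0.1, 1e-3, 200.0

    def f(x, u):  return -x**3 + u                       # cubic system
    def c(x, u):  return 0.5 * x**2 + 0.5 * u**2         # quadratic cost

    def H(x, u, th):                                     # H^theta(x, u)
        th_x, th_xu = th
        return c(x, u) + th_x * x**2 + th_xu * x / (1.0 + 2.0 * x**2) * u

    def phi(x, th):                                      # minimizer of H^theta(x, .)
        return -th[1] * x / (1.0 + 2.0 * x**2)

    def Hmin(x, th):                                     # min over u of H^theta
        return H(x, phi(x, th), th)

    def u_explore(t):
        return np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)

    def mean_square_bellman_error(th):
        x, t, acc, n = 0.5, 0.0, 0.0, 0
        while t < T:
            u = u_explore(t)
            x_next = x + f(x, u) * dt
            dH = (Hmin(x_next, th) - Hmin(x, th)) / dt   # d/dt H^theta(x(t))
            L = dH + gamma * (c(x, u) - H(x, u, th))     # Bellman error L^theta_t
            acc, n = acc + L**2, n + 1
            x, t = x_next, t + dt
        return acc / n

    print(mean_square_bellman_error(th=(0.3, -0.5)))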
Q learning - Convex Reformulation

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

    E(θ) := ½ ⟨ L^θ, L^θ ⟩

Convex reformulation: introduce G^θ as an approximation to the minimum of H^θ,
subject to the constraint

    G^θ(x) ≤ H^θ(x, u),    all x, u
Q learning - LQR example

Linear model and quadratic cost,

    Cost:        c(x, u) = ½ x^T Q x + ½ u^T R u

    Q-function:  H*(x, u) = c(x, u) + (Ax + Bu)^T P* x,
                 where P* solves the Riccati equation.

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i x^T E^i x + Σ_{j=1}^{d_xu} θ^xu_j x^T F^j u

    Approximation to the minimum:

        G^θ(x) = ½ x^T G^θ x

    Minimizer:

        u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables, g : R^n × R^w → R, the resolvent gives a new function,

    R_β g (x, w) = ∫_0^∞ e^{−βt} g( x(t), ξ(t) ) dt,    β > 0

with the state controlled using the nominal policy

    u(t) = φ( x(t), ξ(t) ),    t ≥ 0        (stabilizing & ergodic)

Resolvent equation:

    R_β D g = [ βR_β − I ] g

where D denotes the derivative along the controlled trajectory.

Smoothed Bellman error:

    L^{θ,β} = R_β L^θ
            = [ βR_β − I ] H^θ + γ R_β ( c − H^θ )
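A minimal sketch of the causal smoothing step on sampled trajectory data: the resolvent is
approximated by a truncated discounted integral, computed with a backward recursion; the system,
input, and β are illustrative choices.

    import numpy as np

    beta, dt, T = 1.0, 1e-2, 50.0

    def resolvent_along_trajectory(g_vals, beta, dt):
        # Approximates (R_beta g)(x(k dt), xi(k dt)) by the truncated discounted
        # integral over the remaining trajectory, via the backward recursion
        #   R[k] ~= g[k] dt + e^{-beta dt} R[k+1]
        R = np.zeros_like(g_vals)
        decay = np.exp(-beta * dt)
        for k in range(len(g_vals) - 2, -1, -1):
            R[k] = g_vals[k] * dt + decay * R[k + 1]
        return R

    # Example trajectory: cubic system driven by the sinusoidal exploration signal.
    t = np.arange(0.0, T, dt)
    u = np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)
    x = np.zeros_like(t)
    for k in range(len(t) - 1):
        x[k + 1] = x[k] + (-x[k]**3 + u[k]) * dt

    g_vals = 0.5 * x**2 + 0.5 * u**2     # smooth the stage cost along the trajectory
    print(resolvent_along_trajectory(g_vals, beta, dt)[:5])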
Q learning - Steps towards an algorithm

Smoothed Bellman error:

    E_β(θ) := ½ ‖ L^{θ,β} ‖²

    ∇_θ E_β(θ) = ⟨ L^{θ,β}, ∇_θ L^{θ,β} ⟩
               = zero at an optimum

This involves terms of the form ⟨ R_β g, R_β h ⟩.

Adjoint operation:

    R_β^† R_β = (1 / 2β) ( R_β + R_β^† )

    ⟨ R_β g, R_β h ⟩ = (1 / 2β) ( ⟨ g, R_β^† h ⟩ + ⟨ h, R_β^† g ⟩ )

Adjoint realization: time-reversal,

    R_β^† g (x, w) = ∫_0^∞ e^{−βt} E_{x,w}[ g( x°(−t), ξ°(−t) ) ] dt

with the expectation conditional on x°(0) = x, ξ°(0) = w.
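A minimal numerical check of the adjoint identity, assuming the simplest ergodic flow (rotation
on the circle) so that the ergodic inner product can be realized as a time average; the adjoint
resolvent is computed by running the same discounted integral backward in time.

    import numpy as np

    # Check of  <R_b g, R_b h> = (1/2b)( <g, R_b' h> + <h, R_b' g> )  on the flow
    # x(t) = x0 + t (mod 2*pi), with <.,.> the time average along the trajectory,
    # R_b the causal discounted integral, and R_b' its time-reversed counterpart.
    beta, dt, N, burn = 1.0, 1e-2, 200_000, 20_000
    t = np.arange(N) * dt
    x = (0.3 + t) % (2 * np.pi)

    g = np.cos(x)
    h = np.cos(x) + 0.5 * np.sin(2 * x)

    def forward_resolvent(v):
        R = np.zeros_like(v)
        decay = np.exp(-beta * dt)
        for k in range(len(v) - 2, -1, -1):      # R[k] ~= v[k] dt + e^{-beta dt} R[k+1]
            R[k] = v[k] * dt + decay * R[k + 1]
        return R

    def backward_resolvent(v):
        return forward_resolvent(v[::-1])[::-1]  # adjoint realized by time reversal

    def inner(a, b):                             # ergodic inner product, edges trimmed
        return np.mean(a[burn:-burn] * b[burn:-burn])

    lhs = inner(forward_resolvent(g), forward_resolvent(h))
    rhs = (inner(g, backward_resolvent(h)) + inner(h, backward_resolvent(g))) / (2 * beta)
    print(lhs, rhs)     # both close to 0.25 for these choices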
Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control.

[Block diagram: an ergodic input is applied to the complex system; measured behavior is
compared with desired behavior, and the comparison drives learning.]

Based on observations, minimize the mean-square Bellman error.
Deterministic Stochastic Approximation

[Figure: estimated feedback gains vs. time (individual state and ensemble state).]

Gradient descent:

    d/dt θ = −ε ⟨ L^θ, D^u ∇_θ H^θ − γ ∇_θ H^θ ⟩

Converges* to the minimizer of the mean-square Bellman error.

    * Convergence observed in experiments!  For a convex re-formulation
      of the problem, see Mehta & Meyn 2009.

Stochastic Approximation:

    d/dt θ = −ε_t L^θ_t ( d/dt ∇_θ H^θ( x°(t) ) − γ ∇_θ H^θ( x°(t), u°(t) ) )

    L^θ_t := d/dt H^θ( x°(t) ) + γ( c( x°(t), u°(t) ) − H^θ( x°(t), u°(t) ) )

Here d/dt h(x(t)) |_{x=x(t), w=ξ(t)} = D^u h(x), so the derivatives above can be
computed from trajectory observations.
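A minimal sketch of this stochastic-approximation update for the scalar cubic example of the
next section, using its two-parameter basis; the constant gain, horizon, and finite-difference
time derivatives are illustrative choices, and the causal smoothing of Step 4 is omitted.

    import numpy as np

    gamma, dt, T, eps = 0.1, 1e-3, 500.0, 0.05            # illustrative constants

    def f(x, u):  return -x**3 + u                        # cubic system
    def c(x, u):  return 0.5 * x**2 + 0.5 * u**2

    def features(x, u):                                   # basis for H^theta - c
        return np.array([x**2, x * u / (1.0 + 2.0 * x**2)])

    def H(x, u, th):       return c(x, u) + th @ features(x, u)
    def phi(x, th):        return -th[1] * x / (1.0 + 2.0 * x**2)   # argmin_u H^theta
    def Hmin(x, th):       return H(x, phi(x, th), th)
    def grad_Hmin(x, th):  return features(x, phi(x, th))  # envelope theorem
    def grad_H(x, u, th):  return features(x, u)

    def u_explore(t):
        return np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)

    th = np.zeros(2)
    x, t = 0.5, 0.0
    while t < T:
        u = u_explore(t)
        x_next = x + f(x, u) * dt
        # Bellman error L^theta_t, with time derivatives from finite differences
        L = (Hmin(x_next, th) - Hmin(x, th)) / dt + gamma * (c(x, u) - H(x, u, th))
        dgrad = (grad_Hmin(x_next, th) - grad_Hmin(x, th)) / dt
        th = th - eps * dt * L * (dgrad - gamma * grad_H(x, u, th))
        x, t = x_next, t + dt

    print("estimated parameters theta:", th)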
Outline

    Coarse models - what to do with them?

    Q-learning for nonlinear state space models
        Step 1: Recognize
        Step 2: Find a stabilizing policy
        Step 3: Optimality
        Step 4: Adjoint
        Step 5: Interpret

    Example: Local approximation

    Example: Decentralized control
Desired behavior

                                     Compare                                                   Outputs


Q learning - Local Learning          and learn               Inputs


                                                                              Complex system

                                                          Measured behavior




Cubic nonlinearity:

       d
       dt x   = −x3 + u,      c(x, u) = 1 x2 + 1 u2
                                        2      2
Desired behavior

                                             Compare                                                   Outputs


Q learning - Local Learning                  and learn               Inputs


                                                                                      Complex system

                                                                  Measured behavior




Cubic nonlinearity:   d
                      dt x   = −x3 + u,   c(x, u) = 1 x2 + 1 u2
                                                    2      2


HJB:
   min ( 2 x2 + 1 u2 + (−x3 + u) J ∗ (x)) = γJ ∗ (x)
         1
                2
       u
Desired behavior

                                                       Compare                                                        Outputs


Q learning - Local Learning                            and learn                    Inputs


                                                                                                     Complex system

                                                                                 Measured behavior




Cubic nonlinearity:        d
                           dt x   = −x3 + u,       c(x, u) = 1 x2 + 1 u2
                                                             2      2


HJB:                  min ( 2 x2 + 1 u2 + (−x3 + u) J ∗ (x)) = γJ ∗ (x)
                            1
                                   2
                       u


Basis:     θ                             x     x   2               xu
         H (x, u) = c(x, u) + θ x + θ       2
                                              u
                                      1 + 2x
Q learning - Local Learning

[Block diagram: an exploratory input is applied to the complex system; the measured behavior (outputs) is compared with the desired behavior, and the Q-function approximation is learned from the input-output data.]

Cubic nonlinearity:   dx/dt = −x³ + u,    c(x, u) = ½ x² + ½ u²

HJB:   min_u { ½ x² + ½ u² + (−x³ + u) ∇J*(x) } = γ J*(x)

Basis:   H^θ(x, u) = c(x, u) + θ^x x² + θ^xu xu/(1 + 2x²)

[Plots: the optimal policy on x ∈ [−1, 1], for a low amplitude input and for a high amplitude input, u(t) = A(sin(t) + sin(πt) + sin(et)).]
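For concreteness (again, a sketch rather than the authors' code), the cubic example could be fed to the q_learning_gradient_flow routine sketched earlier, using the basis above; the amplitude A, discount γ, step sizes, and horizon are placeholders for illustration.

import numpy as np
# Assumes the q_learning_gradient_flow sketch defined earlier in this document.

A = 0.1                                      # exploration amplitude (low vs. high)

def f(x, u):                                 # cubic nonlinearity
    return -x**3 + u

def c(x, u):                                 # quadratic cost
    return 0.5 * x**2 + 0.5 * u**2

def g(x):                                    # shape function in the basis
    return x / (1.0 + 2.0 * x**2)

def psi(x, u):                               # H_theta = c + theta_x * x^2 + theta_xu * g(x) * u
    return np.array([x**2, g(x) * u])

def explore(x, t):                           # ergodic input from the slide
    return A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

theta = q_learning_gradient_flow(f, c, psi, explore, x0=0.0, theta0=np.zeros(2),
                                 gamma=0.1, eps=0.05, dt=1e-3, T=50.0)
print("theta_x, theta_xu =", theta)          # learned feedback: u(x) = -theta_xu * g(x)

Since H^θ is quadratic in u here, the exact minimizer is u*(x) = −θ^xu x/(1 + 2x²), so the grid search in the generic sketch is only a crude stand-in for this closed form. Running the experiment with small and large A corresponds to the low and high amplitude plots above.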
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control
Multi-agent model

M. Huang, P. E. Caines, and R. P. Malhamé. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Auto. Control, 52(9):1560–1571, 2007.

Huang et al.: local optimization for global coordination
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low-dimensional LQG model

Results from a five-agent model:
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low-dimensional LQG model

Results from a five-agent model:

[Plot: estimated state feedback gains (individual state and ensemble state) vs. time. Gains for agent 4: Q-learning sample paths and gains predicted from the ∞-agent limit.]
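As a schematic of the setting in the Huang–Caines–Malhamé reference above, written for the deterministic, autonomous case described on the slide, each of the N agents might have dynamics and a mean-coupled cost of the form below; the constants a_i, b, r, ρ and the coupling function Φ are generic placeholders rather than values from the talk.

    \dot{x}_i = a_i x_i + b\,u_i, \qquad i = 1, \dots, N,
    \qquad
    J_i(u_i) = \int_0^\infty e^{-\rho t}\Big[\big(x_i(t) - \Phi(\bar{x}(t))\big)^2 + r\,u_i(t)^2\Big]\,dt,
    \qquad
    \bar{x}(t) = \frac{1}{N}\sum_{j=1}^N x_j(t).

Each agent runs Q-learning with a basis consistent with the low-dimensional LQG model obtained in the N → ∞ limit, which is why the learned gains on the individual state and on the ensemble average can be compared with the gains predicted by that limit.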
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control


                                                          ... Conclusions
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
for design in approximate dynamic programming algorithms
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
for design in approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
for design in approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses

Current research: Algorithm analysis and improvements
                  Applications in biology and economics
                  Analysis of game-theoretic issues in coupled systems
References


                                                                                                                    .
      PhD thesis, University of London, London, England, 1967.

      D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, NY, 1970.

      C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.

      C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.


      V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.




      D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484, 2009.


      P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[9]   C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks.
      Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.
