



             On some computational methods for Bayesian
                          model choice

                                            Christian P. Robert

                               CREST-INSEE and Université Paris Dauphine
                               http://www.ceremade.dauphine.fr/~xian


                          MaxEnt 2009, Oxford, July 6, 2009
                      Joint works with M. Beaumont, N. Chopin,
                      J.-M. Cornuet, J.-M. Marin and D. Wraith
On some computational methods for Bayesian model choice




Outline


      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Evidence
     Bayes tests



Formal construction of Bayes tests

      Definition (Test)
      Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
      statistical model, a test is a statistical procedure that takes its
      values in {0, 1}.

      Theorem (Bayes test)
      The Bayes estimator associated with π and with the 0 − 1 loss is

                        δ^π (x) = 1   if π(θ ∈ Θ0 |x) > π(θ ∉ Θ0 |x),
                                  0   otherwise
On some computational methods for Bayesian model choice
  Evidence
     Bayes factor



Bayes factor

      Definition (Bayes factors)
      For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0 , under prior

                            π(Θ0 )π0 (θ) + π(Θ0^c )π1 (θ) ,

      central quantity

         B01 (x) = [π(Θ0 |x)/π(Θ0^c |x)] / [π(Θ0 )/π(Θ0^c )]
                 = ∫_{Θ0} f (x|θ)π0 (θ)dθ / ∫_{Θ0^c} f (x|θ)π1 (θ)dθ = m0 (x)/m1 (x)


                                                                            [Jeffreys, 1939]
On some computational methods for Bayesian model choice
  Evidence
     Model choice



Model choice and model comparison



      Choice between models
      Several models available for the same observation x

                                    Mi : x ∼ fi (x|θi ),   i∈I

      where the family I can be finite or infinite

      Identical setup: Replace hypotheses with models but keep
      marginal likelihoods and Bayes factors
On some computational methods for Bayesian model choice
  Evidence
     Model choice



Bayesian model choice
      Probabilise the entire model/parameter space
          allocate probabilities pi to all models Mi
          define priors πi (θi ) for each parameter space Θi
          compute

               π(Mi |x) = pi ∫_{Θi} fi (x|θi )πi (θi )dθi / Σ_j pj ∫_{Θj} fj (x|θj )πj (θj )dθj


              take largest π(Mi |x) to determine “best” model,
              or use averaged predictive

                         Σ_j π(Mj |x) ∫_{Θj} fj (x′ |θj )πj (θj |x)dθj
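
      As a quick numerical illustration of the formula above (not from the original
      slides): given prior weights pi and evidences Zi, the posterior model probabilities
      are just the normalised products pi Zi. The numbers below are made up.

# --- sketch: posterior model probabilities from prior weights and evidences ---
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # prior model probabilities p_i (made up)
Z = np.array([1.2e-4, 3.5e-4, 0.8e-4])   # evidences Z_i = ∫ f_i(x|θ_i)π_i(θ_i)dθ_i (made up)

post = p * Z                             # π(M_i|x) ∝ p_i Z_i
post /= post.sum()

print("π(M_i|x):", post)
print("model with largest posterior probability:", post.argmax() + 1)  # 1-based index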
On some computational methods for Bayesian model choice
  Evidence
     Evidence



Evidence



      All these problems end up with a similar quantity, the evidence

                              Zk = ∫_{Θk} πk (θk )Lk (θk ) dθk ,

      aka the marginal likelihood.
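
      The crudest approximation of this integral averages the likelihood over prior
      draws. A minimal sketch on a conjugate Gaussian toy model, chosen here (as an
      illustrative assumption) because Zk is then available in closed form for checking:

# --- sketch: naive prior Monte Carlo estimate of the evidence ---
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=20)              # toy data, model x_i ~ N(θ,1), θ ~ N(0,1)
n = len(x)

# exact evidence: marginally x ~ N_n(0, I + 11')
logZ_exact = stats.multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)

theta = rng.normal(0.0, 1.0, size=100_000)     # draws from the prior πk
loglik = stats.norm(theta[:, None], 1.0).logpdf(x).sum(axis=1)
logZ_mc = np.log(np.mean(np.exp(loglik)))      # Ẑk = (1/N) Σ Lk(θ_i)
print(logZ_exact, logZ_mc)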
On some computational methods for Bayesian model choice
  Importance sampling solutions




Importance sampling revisited

      1    Evidence

      2    Importance sampling solutions
             Regular importance
             Harmonic means
             Chib’s solution

      3    Cross-model solutions

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bayes factor approximation

      When approximating the Bayes factor

                 B01 = ∫_{Θ0} f0 (x|θ0 )π0 (θ0 )dθ0 / ∫_{Θ1} f1 (x|θ1 )π1 (θ1 )dθ1

      use of importance functions ϕ0 and ϕ1 and

                 B01 ≈ [ n0^{-1} Σ_{i=1}^{n0} f0 (x|θ0^i )π0 (θ0^i )/ϕ0 (θ0^i ) ]
                       / [ n1^{-1} Σ_{i=1}^{n1} f1 (x|θ1^i )π1 (θ1^i )/ϕ1 (θ1^i ) ]
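
      A sketch of this estimator on a toy comparison of two Gaussian priors; as a
      simplification, a single Student-t importance density stands in for both
      importance functions, and every modelling choice below is an illustrative
      assumption rather than part of the slides:

# --- sketch: importance-sampling approximation of a Bayes factor ---
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=20)                # toy data

# M0: x_i ~ N(θ,1), θ ~ N(0,1);  M1: x_i ~ N(θ,1), θ ~ N(0,10²)
def log_lik(theta):
    return stats.norm(theta[:, None], 1.0).logpdf(x).sum(axis=1)

def evidence(prior, phi, n=50_000):
    """Importance-sampling estimate of Z = ∫ f(x|θ)π(θ)dθ with proposal phi."""
    th = phi.rvs(n, random_state=rng)
    logw = log_lik(th) + prior.logpdf(th) - phi.logpdf(th)
    return np.exp(logw).mean()

phi = stats.t(df=3, loc=x.mean(), scale=1.0)     # common heavy-tailed importance density (assumption)
B01 = evidence(stats.norm(0, 1), phi) / evidence(stats.norm(0, 10), phi)
print("B01 ≈", B01)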
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bridge sampling


      Special case:
      If
                               π1 (θ1 |x) ∝ π̃1 (θ1 |x)
                               π2 (θ2 |x) ∝ π̃2 (θ2 |x)

      live on the same space (Θ1 = Θ2 ), then

                 B12 ≈ (1/n) Σ_{i=1}^{n} π̃1 (θi |x)/π̃2 (θi |x) ,        θi ∼ π2 (θ|x)

                           [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
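
      A sketch of this special-case estimator with two unnormalised Gaussian kernels on
      a common space (toy choices with known normalising constants), using exact draws
      from π2(θ|x) in place of MCMC output:

# --- sketch: bridge sampling, special case π̃1/π̃2 averaged over draws from π2 ---
import numpy as np

rng = np.random.default_rng(1)

# π̃1(θ) = exp(-θ²/2)       → Z1 = √(2π)
# π̃2(θ) = exp(-(θ-1)²/4)   → Z2 = √(4π)   (a N(1,2) kernel)
log_pi1 = lambda th: -0.5 * th**2
log_pi2 = lambda th: -0.25 * (th - 1) ** 2

theta = rng.normal(1.0, np.sqrt(2.0), size=100_000)        # exact draws from π2
B12_hat = np.exp(log_pi1(theta) - log_pi2(theta)).mean()
print(B12_hat, np.sqrt(2 * np.pi) / np.sqrt(4 * np.pi))    # estimate vs exact Z1/Z2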
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bridge sampling variance



      The bridge sampling estimator does poorly if
                var(B12 )/B12 ² ≈ (1/n) E[ {(π1 (θ) − π2 (θ))/π2 (θ)}² ]

      is large, i.e. if π1 and π2 have little overlap...
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



(Further) bridge sampling

      In addition

                 B12 = ∫ π̃2 (θ|x)α(θ)π1 (θ|x)dθ / ∫ π̃1 (θ|x)α(θ)π2 (θ|x)dθ        ∀ α(·)

                     ≈ [ (1/n1 ) Σ_{i=1}^{n1} π̃2 (θ1i |x)α(θ1i ) ]
                       / [ (1/n2 ) Σ_{i=1}^{n2} π̃1 (θ2i |x)α(θ2i ) ] ,        θji ∼ πj (θ|x)
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



An infamous example
      When
                               α(θ) = 1 / { π̃1 (θ) π̃2 (θ) }

      harmonic mean approximation to B12

                 B21 = [ (1/n1 ) Σ_{i=1}^{n1} 1/π̃1 (θ1i |x) ]
                       / [ (1/n2 ) Σ_{i=1}^{n2} 1/π̃2 (θ2i |x) ] ,        θji ∼ πj (θ|x)

                                               [Newton & Raftery, 1994]
      Infamous: Most often leads to an infinite variance!!!
                                             [Radford Neal’s blog, 2008]
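
      A sketch of the within-model variant of the harmonic mean, 1/Ẑ = (1/T) Σ 1/Lk(θ(t))
      with θ(t) from the posterior, on a conjugate Gaussian toy model where the exact
      evidence is known; the point is the instability across replications rather than
      any single value:

# --- sketch: instability of the harmonic mean estimator on a toy model ---
import numpy as np
from scipy import stats

x = np.random.default_rng(2).normal(0.5, 1.0, size=20)   # toy data, model x_i ~ N(θ,1), θ ~ N(0,1)
n = len(x)

# exact evidence: marginally x ~ N_n(0, I + 11')
logZ_exact = stats.multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)
print("exact log Z:", logZ_exact)

post_mean, post_sd = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))   # conjugate posterior N(m, s²)

def log_lik(theta):
    return stats.norm(theta[:, None], 1.0).logpdf(x).sum(axis=1)

for seed in range(5):                                     # replications with fresh posterior samples
    th = np.random.default_rng(seed).normal(post_mean, post_sd, 5_000)
    logZ_hm = -np.log(np.mean(np.exp(-log_lik(th))))      # 1/Ẑ = (1/T) Σ 1/L(θ^(t))
    print("harmonic mean estimate, seed", seed, ":", logZ_hm)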
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



“The Worst Monte Carlo Method Ever”

      “The good news is that the Law of Large Numbers guarantees that
      this estimator is consistent ie, it will very likely be very close to the
      correct answer if you use a sufficiently large number of points from
      the posterior distribution.
      The bad news is that the number of points required for this
      estimator to get close to the right answer will often be greater
      than the number of atoms in the observable universe. The even
      worse news is that it’s easy for people to not realize this, and to
      naively accept estimates that are nowhere close to the correct
      value of the marginal likelihood.”
                                       [Radford Neal’s blog, Aug. 23, 2008]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Optimal bridge sampling

      The optimal choice of auxiliary function is
                    α⋆ = (n1 + n2 ) / { n1 π1 (θ|x) + n2 π2 (θ|x) }

      leading to

                 B12 ≈ [ (1/n1 ) Σ_{i=1}^{n1} π̃2 (θ1i |x) / {n1 π1 (θ1i |x) + n2 π2 (θ1i |x)} ]
                       / [ (1/n2 ) Σ_{i=1}^{n2} π̃1 (θ2i |x) / {n1 π1 (θ2i |x) + n2 π2 (θ2i |x)} ]

                                                                                    Back later!
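
      A sketch of the optimal bridge written in Meng & Wong's iterative form, which
      resolves the dependence on the unknown constants mentioned on the next slide;
      the two unnormalised Gaussian kernels are toy choices with a known ratio Z1/Z2:

# --- sketch: iterative optimal bridge sampling (Meng & Wong formulation) ---
import numpy as np

rng = np.random.default_rng(3)

# π̃1 = N(0,1) kernel (Z1 = √(2π)),  π̃2 = N(1,2) kernel (Z2 = √(4π))
log_q1 = lambda th: -0.5 * th**2
log_q2 = lambda th: -0.25 * (th - 1) ** 2

n1 = n2 = 50_000
th1 = rng.normal(0.0, 1.0, n1)                 # draws from π1
th2 = rng.normal(1.0, np.sqrt(2.0), n2)        # draws from π2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

l1 = np.exp(log_q1(th1) - log_q2(th1))         # π̃1/π̃2 on the π1 sample
l2 = np.exp(log_q1(th2) - log_q2(th2))         # π̃1/π̃2 on the π2 sample

r = 1.0                                        # initial guess for Z1/Z2
for _ in range(50):                            # fixed-point iteration on the optimal bridge
    r = np.mean(l2 / (s1 * l2 + s2 * r)) / np.mean(1.0 / (s1 * l1 + s2 * r))
print("bridge estimate of Z1/Z2:", r, " exact:", 1 / np.sqrt(2))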
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Optimal bridge sampling (2)



      Reason:

        Var(B12 )/B12 ² ≈ (1/n1 n2 ) [ ∫ π1 (θ)π2 (θ){n1 π1 (θ) + n2 π2 (θ)}α(θ)² dθ
                                       / { ∫ π1 (θ)π2 (θ)α(θ) dθ }²  −  1 ]

      δ method
      Drag: Dependence on the unknown normalising constants solved
      iteratively
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Ratio importance sampling


      Another identity:
                         B12 = Eϕ [π̃1 (θ)/ϕ(θ)] / Eϕ [π̃2 (θ)/ϕ(θ)]
      for any density ϕ with sufficiently large support
                                                    [Torrie & Valleau, 1977]
      Use of a single sample θ1 , . . . , θn from ϕ

                  B12 ≈ [ Σ_{i=1}^{n} π̃1 (θi )/ϕ(θi ) ] / [ Σ_{i=1}^{n} π̃2 (θi )/ϕ(θi ) ]
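
      A sketch of this single-sample estimator, reusing the same two toy Gaussian
      kernels; ϕ is an arbitrary wide Student-t density (the optimal |π1 − π2|-based
      choice of the next slide is rarely available in practice):

# --- sketch: ratio importance sampling with a single sample from ϕ ---
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
log_q1 = lambda th: -0.5 * th**2               # π̃1: N(0,1) kernel, Z1 = √(2π)
log_q2 = lambda th: -0.25 * (th - 1) ** 2      # π̃2: N(1,2) kernel, Z2 = √(4π)

phi = stats.t(df=3, loc=0.5, scale=3.0)        # single importance density with wide support (assumption)
th = phi.rvs(100_000, random_state=rng)
log_phi = phi.logpdf(th)

B12 = np.exp(log_q1(th) - log_phi).sum() / np.exp(log_q2(th) - log_phi).sum()
print(B12, 1 / np.sqrt(2))                     # estimate vs exact Z1/Z2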
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Ratio importance sampling (2)

      Approximate variance:
                var(B12 )/B12 ² ≈ (1/n) Eϕ [ (π1 (θ) − π2 (θ))² / ϕ(θ)² ]

      Optimal choice:

                     ϕ∗ (θ) = | π1 (θ) − π2 (θ) | / ∫ | π1 (η) − π2 (η) | dη

                                                             [Chen, Shao & Ibrahim, 2000]
      Formally better than bridge sampling
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Zk from a posterior sample


      Use of the [harmonic mean] identity

         Eπk [ ϕ(θk ) / {πk (θk )Lk (θk )} | x ]
              = ∫ {ϕ(θk )/(πk (θk )Lk (θk ))} {πk (θk )Lk (θk )/Zk } dθk = 1/Zk

      no matter what the proposal ϕ(·) is.
                          [Gelfand & Dey, 1994; Bartolucci et al., 2006]
      Direct exploitation of the MCMC output
                                                                                         RB-RJ
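
      A sketch of this identity on the conjugate Gaussian toy model, with exact
      posterior draws standing in for MCMC output and ϕ taken, as an arbitrary choice,
      to be a normal density fitted to the sample:

# --- sketch: Gelfand & Dey estimate of Zk from a posterior sample ---
import numpy as np
from scipy import stats

x = np.random.default_rng(5).normal(0.5, 1.0, size=20)   # data, model x_i ~ N(θ,1), θ ~ N(0,1)
n = len(x)
logZ_exact = stats.multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)

post_mean, post_sd = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))
th = np.random.default_rng(6).normal(post_mean, post_sd, 20_000)  # stands in for MCMC output

log_prior_lik = stats.norm(0, 1).logpdf(th) + stats.norm(th[:, None], 1).logpdf(x).sum(axis=1)
phi = stats.norm(th.mean(), th.std())                             # ϕ fitted to the sample (assumption)
logZ_gd = -np.log(np.mean(np.exp(phi.logpdf(th) - log_prior_lik)))  # 1/Ẑ = (1/T) Σ ϕ/(πL)
print(logZ_exact, logZ_gd)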
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Comparison with regular importance sampling


      Harmonic mean: Constraint opposed to usual importance sampling
      constraints: ϕ(θ) must have lighter (rather than fatter) tails than
      πk (θk )Lk (θk ) for the approximation
                 Z1k = 1 / [ (1/T) Σ_{t=1}^{T} ϕ(θk^(t) ) / {πk (θk^(t) )Lk (θk^(t) )} ]

      to have a finite variance.
      E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Comparison with regular importance sampling (cont’d)



      Compare Z1k with a standard importance sampling approximation
                 Z2k = (1/T) Σ_{t=1}^{T} πk (θk^(t) )Lk (θk^(t) ) / ϕ(θk^(t) )

      where the θk^(t) ’s are generated from the density ϕ(·) (with fatter
      tails like t’s)
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Zk using a mixture representation



                                                                              Bridge sampling recall

      Design a specific mixture for simulation [importance sampling]
      purposes, with density

                                  ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) ,

      where ϕ(·) is arbitrary (but normalised)
      Note: ω1 is not a probability weight
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Z using a mixture representation (cont’d)

      Corresponding MCMC (=Gibbs) sampler
      At iteration t
          1   Take δ (t) = 1 with probability

                  ω1 πk (θk^(t−1) )Lk (θk^(t−1) ) / { ω1 πk (θk^(t−1) )Lk (θk^(t−1) ) + ϕ(θk^(t−1) ) }

              and δ (t) = 2 otherwise;
          2   If δ (t) = 1, generate θk^(t) ∼ MCMC(θk^(t−1) , θk ) where MCMC(θk , θk′ ) denotes an
              arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk );
          3   If δ (t) = 2, generate θk^(t) ∼ ϕ(θk ) independently
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Evidence approximation by mixtures
      Rao-Blackwellised estimate

          ξ̂ = (1/T) Σ_{t=1}^{T} ω1 πk (θk^(t) )Lk (θk^(t) ) / { ω1 πk (θk^(t) )Lk (θk^(t) ) + ϕ(θk^(t) ) } ,

      converges to ω1 Zk /{ω1 Zk + 1}
      Deduce Ẑ3k from ω1 Ẑ3k /{ω1 Ẑ3k + 1} = ξ̂ , i.e.

          Ẑ3k = [ Σ_{t=1}^{T} ω1 πk (θk^(t) )Lk (θk^(t) ) / {ω1 πk (θk^(t) )Lk (θk^(t) ) + ϕ(θk^(t) )} ]
                / [ ω1 Σ_{t=1}^{T} ϕ(θk^(t) ) / {ω1 πk (θk^(t) )Lk (θk^(t) ) + ϕ(θk^(t) )} ]

                                                                             [Bridge sampler redux]
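
      A sketch putting the previous slide's sampler and this estimator together on the
      conjugate Gaussian toy model: the exact posterior plays the role of the arbitrary
      MCMC(θk, θk′) kernel, and ω1 and ϕ are illustrative assumptions; the closed-form
      evidence is printed for comparison:

# --- sketch: mixture-based evidence estimate (sampler from the previous slide + Ẑ3k) ---
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.5, 1.0, size=20)               # toy data, model x_i ~ N(θ,1), θ ~ N(0,1)
n = len(x)
logZ_exact = stats.multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)

post = stats.norm(x.sum() / (n + 1), np.sqrt(1.0 / (n + 1)))  # exact posterior, stands in for MCMC(θ,θ')
phi = stats.norm(x.mean(), 1.0)                               # arbitrary normalised ϕ (assumption)
omega1 = 1e13                                                 # arbitrary weight, scaled so ω1·π(θ)L(θ) ≈ ϕ(θ)

def log_piL(th):                                              # log πk(θ)Lk(θ)
    return stats.norm(0, 1).logpdf(th) + stats.norm(th, 1.0).logpdf(x).sum()

T, theta, num, den = 5_000, 0.0, 0.0, 0.0
for _ in range(T):
    w = omega1 * np.exp(log_piL(theta))
    if rng.uniform() < w / (w + phi.pdf(theta)):              # δ = 1: move θ within the posterior
        theta = post.rvs(random_state=rng)
    else:                                                     # δ = 2: independent draw from ϕ
        theta = phi.rvs(random_state=rng)
    w = omega1 * np.exp(log_piL(theta))                       # Rao-Blackwellised terms of ξ̂
    num += w / (w + phi.pdf(theta))
    den += phi.pdf(theta) / (w + phi.pdf(theta))

logZ_mix = np.log(num / (omega1 * den))                       # Ẑ3k
print(logZ_exact, logZ_mix)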
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Chib’s representation


      Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and
      θk ∼ πk (θk ),
                                     fk (x|θk ) πk (θk )
                       Zk = mk (x) =
                                         πk (θk |x)
      Use of an approximation to the posterior
                          Zk = mk (x) = fk (x|θk∗ ) πk (θk∗ ) / π̂k (θk∗ |x) .
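
      A sketch of Chib's identity on the conjugate Gaussian toy model, where the
      posterior density at θk∗ is available exactly, so no approximation step is needed;
      in general π̂k(θk∗|x) comes from the Rao-Blackwell estimates of the following slides:

# --- sketch: Chib's marginal likelihood identity on a conjugate toy model ---
import numpy as np
from scipy import stats

x = np.random.default_rng(8).normal(0.5, 1.0, size=20)   # data, model x_i ~ N(θ,1), θ ~ N(0,1)
n = len(x)
logZ_exact = stats.multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)

theta_star = x.sum() / (n + 1)                            # any fixed point works; here the posterior mean
log_lik   = stats.norm(theta_star, 1.0).logpdf(x).sum()
log_prior = stats.norm(0, 1).logpdf(theta_star)
log_post  = stats.norm(x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))).logpdf(theta_star)  # exact π(θ*|x)

logZ_chib = log_lik + log_prior - log_post                # log Zk = log f(x|θ*) + log π(θ*) − log π(θ*|x)
print(logZ_exact, logZ_chib)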
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Case of latent variables



      For missing variable z as in mixture models, natural Rao-Blackwell
      estimate
                     πk (θk∗ |x) = (1/T) Σ_{t=1}^{T} πk (θk∗ |x, zk^(t) ) ,

      where the zk^(t) ’s are Gibbs sampled latent variables
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Label switching impact


      A mixture model [special case of missing variable model] is
      invariant under permutations of the indices of the components.
      E.g., mixtures
                          0.3N (0, 1) + 0.7N (2.3, 1)
      and
                                       0.7N (2.3, 1) + 0.3N (0, 1)
      are exactly the same!
       c The component parameters θi are not identifiable
      marginally since they are exchangeable
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Connected difficulties


          1   Number of modes of the likelihood of order O(k!):
               c Maximization and even [MCMC] exploration of the
              posterior surface harder
          2   Under exchangeable priors on (θ, p) [prior invariant under
              permutation of the indices], all posterior marginals are
              identical:
               c Posterior expectation of θ1 equal to posterior expectation
              of θ2
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



License


      Since Gibbs output does not produce exchangeability, the Gibbs
      sampler has not explored the whole parameter space: it lacks
      energy to switch simultaneously enough component allocations at
      once

      [Figure: Gibbs output — traces of µi , pi and σi against iteration n (0–500),
      with pairwise scatter plots of (µi , pi ), (pi , σi ) and (σi , µi )]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Label switching paradox




      We should observe the exchangeability of the components [label
      switching] to conclude about convergence of the Gibbs sampler.
      If we observe it, then we do not know how to estimate the
      parameters.
      If we do not, then we are uncertain about the convergence!!!
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Compensation for label switching
      For mixture models, zk^(t) usually fails to visit all configurations in a
      balanced way, despite the symmetry predicted by the theory

                 πk (θk |x) = πk (σ(θk )|x) = (1/k!) Σ_{σ∈Sk} πk (σ(θk )|x)

      for all σ’s in Sk , set of all permutations of {1, . . . , k}.
      Consequences on numerical approximation, biased by an order k!
      Recover the theoretical symmetry by using

                 πk (θk∗ |x) = (1/(T k!)) Σ_{σ∈Sk} Σ_{t=1}^{T} πk (σ(θk∗ )|x, zk^(t) ) .

                                                    [Berkhof, Mechelen, & Gelman, 2003]
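
      A sketch of this symmetrisation on a stripped-down two-component Gaussian mixture
      (known weights and variances, unknown means — all illustrative assumptions): the
      Rao-Blackwell average of the conditional posterior density at θ∗ is repeated over
      the k! label permutations:

# --- sketch: symmetrising Chib's posterior-density estimate over label permutations ---
import numpy as np
from itertools import permutations
from scipy import stats

rng = np.random.default_rng(9)
# toy mixture 0.3 N(μ1,1) + 0.7 N(μ2,1): only the means unknown, μ_j ~ N(0, τ²)
x = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(2.3, 1.0, 70)])
w, k, tau2 = np.array([0.3, 0.7]), 2, 4.0

def gibbs(T):
    mu, cond = rng.normal(0, 1, k), []
    for _ in range(T):
        p = w * stats.norm(mu, 1.0).pdf(x[:, None])            # allocation probabilities
        z = (rng.uniform(size=len(x)) < p[:, 1] / p.sum(axis=1)).astype(int)
        step = []
        for j in range(k):                                     # conjugate update of μ_j | z
            v = 1.0 / ((z == j).sum() + 1.0 / tau2)
            m = v * x[z == j].sum()
            mu[j] = rng.normal(m, np.sqrt(v))
            step.append((m, np.sqrt(v)))
        cond.append(step)                                      # parameters of π_k(· | x, z^(t))
    return cond

cond = gibbs(2000)[500:]
mu_star = np.array([0.0, 2.3])                                 # point θ* where the density is needed

def density_at(perm):                                          # (1/T) Σ_t π_k(σ(θ*) | x, z^(t))
    return np.mean([np.prod([stats.norm(m, s).pdf(mu_star[perm[j]])
                             for j, (m, s) in enumerate(step)]) for step in cond])

plain = density_at((0, 1))                                     # un-symmetrised estimate
sym = np.mean([density_at(s) for s in permutations(range(k))]) # averaged over all k! permutations
print(plain, sym)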
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Galaxy dataset

      n = 82 galaxies as a mixture of k normal distributions with both
      mean and variance unknown.
                                                          [Roeder, 1992]
      [Figure: histogram (relative frequency) of the galaxy data with the average
      density estimate overlaid]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Galaxy dataset (k)
      Using only the original estimate, with θk∗ as the MAP estimator,

                               log(m̂k (x)) = −105.1396

      for k = 3 (based on 10³ simulations), while introducing the
      permutations leads to

                               log(m̂k (x)) = −103.3479

      Note that
                          −105.1396 + log(3!) = −103.3479

        k                 2           3             4         5         6         7         8
        mk (x)         -115.68     -103.35       -102.66   -101.93   -102.88   -105.48   -108.44

      Estimations of the marginal likelihoods by the symmetrised Chib’s
      approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100
      permutations selected at random in Sk ).
                                [Lee, Marin, Mengersen & Robert, 2008]
On some computational methods for Bayesian model choice
  Cross-model solutions




Cross-model solutions

      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions
             Reversible jump
             Saturation schemes
             Implementation error

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Reversible jump


                                                                               irreversible jump

      Idea: Set up a proper measure–theoretic framework for designing
      moves between models Mk
                                                         [Green, 1995]
      Create a reversible kernel K on H = ∪k {k} × Θk such that

                ∫_A ∫_B K(x, dy)π(x)dx = ∫_B ∫_A K(y, dx)π(y)dy

      for the invariant density π [x is of the form (k, θ(k) )]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Local moves
      For a move between two models, M1 and M2 , the Markov chain
      being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ)
      the corresponding kernels, under the detailed balance condition

                          π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) ,

      and take, wlog, dim(M2 ) > dim(M1 ).
      Proposal expressed as

                                           θ2 = Ψ1→2 (θ1 , v1→2 )

      where v1→2 is a random variable of dimension
      dim(M2 ) − dim(M1 ), generated as

                                          v1→2 ∼ ϕ1→2 (v1→2 ) .
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Local moves (2)
      In this case, q1→2 (θ1 , dθ2 ) has density

               ϕ1→2 (v1→2 ) | ∂Ψ1→2 (θ1 , v1→2 ) / ∂(θ1 , v1→2 ) |^{−1} ,

      by the Jacobian rule.
                                                                      Reverse importance link

      If probability π1→2 of choosing move to M2 while in M1 ,
      acceptance probability reduces to

          α(θ1 , v1→2 ) = 1 ∧ [ π(M2 , θ2 ) π2→1 / {π(M1 , θ1 ) π1→2 ϕ1→2 (v1→2 )} ]
                              × | ∂Ψ1→2 (θ1 , v1→2 )/∂(θ1 , v1→2 ) | .

      If several models are considered simultaneously, with probability
      π1→2 of choosing move to M2 while in M1 , as in

               K(x, B) = Σ_{m=1}^{∞} ∫_B ρm (x, y)qm (x, dy) + ω(x)IB (x)
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Generic reversible jump acceptance probability


      Acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is

          α(θ1 , v1→2 ) = 1 ∧ [ π(M2 , θ2 ) π2→1 / {π(M1 , θ1 ) π1→2 ϕ1→2 (v1→2 )} ]
                              × | ∂Ψ1→2 (θ1 , v1→2 )/∂(θ1 , v1→2 ) |

      while acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ1→2^{−1} (θ2 ) is

          α(θ1 , v1→2 ) = 1 ∧ [ π(M1 , θ1 ) π1→2 ϕ1→2 (v1→2 ) / {π(M2 , θ2 ) π2→1 } ]
                              × | ∂Ψ1→2 (θ1 , v1→2 )/∂(θ1 , v1→2 ) |^{−1}

                                           c Difficult calibration
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Green’s sampler

      Algorithm
      Iteration t (t ≥ 1): if x(t) = (m, θ(m) ),
          1   Select model Mn with probability πmn
          2   Generate umn ∼ ϕmn (u) and set
              (θ(n) , vnm ) = Ψm→n (θ(m) , umn )
          3   Take x(t+1) = (n, θ(n) ) with probability

                   min{ [ π(n, θ(n) ) πnm ϕnm (vnm ) / π(m, θ(m) ) πmn ϕmn (umn ) ] × | ∂Ψm→n (θ(m) , umn ) / ∂(θ(m) , umn ) | , 1 }

              and take x(t+1) = x(t) otherwise.
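
      To make the algorithm concrete, here is a minimal sketch (assuming Python
      with NumPy, a single observation x, and the uniform-versus-exponential toy
      models used in the later examples); since both models share a
      one-dimensional parameter, the between-model map Ψm→n can be taken as the
      identity, with no auxiliary variable u and a unit Jacobian:

          import numpy as np

          rng = np.random.default_rng(0)
          x = 1.5                                   # single observation

          def log_target(m, th):
              # log π(M_m, θ) up to the common normalising constant
              if th <= 0:
                  return -np.inf
              if m == 1:                            # M1: x|θ ~ U(0, θ), θ ~ Exp(1)
                  return -np.inf if x >= th else np.log(0.5) - np.log(th) - th
              return np.log(0.5) + np.log(th) - th * x - th   # M2: x|θ ~ Exp(θ)

          T, m, th, visits = 100_000, 1, x + 1.0, 0
          for _ in range(T):
              # between-model move: identity map, unit Jacobian, equal jump probabilities
              prop = 2 if m == 1 else 1
              if np.log(rng.uniform()) < log_target(prop, th) - log_target(m, th):
                  m = prop
              # within-model random-walk Metropolis step on θ
              new = th + 0.5 * rng.normal()
              if np.log(rng.uniform()) < log_target(m, new) - log_target(m, th):
                  th = new
              visits += (m == 1)
          print("P(M = 1 | x) estimate:", visits / T)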
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Interpretation


      The representation puts us back in a fixed dimension setting:
              M1 × V1→2 and M2 in one-to-one relation.
              reversibility imposes that θ1 is derived as

                                             (θ1 , v1→2 ) = Ψ1→2^{-1} (θ2 )

              appears like a regular Metropolis–Hastings move from the
              couple (θ1 , v1→2 ) to θ2 when stationary distributions are
              π(M1 , θ1 ) × ϕ1→2 (v1→2 ) and π(M2 , θ2 ), and when proposal
              distribution is deterministic
On some computational methods for Bayesian model choice
  Cross-model solutions
     Saturation schemes



Alternative
      Saturation of the parameter space H = ∪k {k} × Θk by creating
           θ = (θ1 , . . . , θD )
           a model index M
           pseudo-priors πj (θj |M = k) for j ≠ k
                                                  [Carlin & Chib, 1995]
      Validation by

          P(M = k|x) = ∫ P(M = k|x, θ) π(θ|x) dθ = Zk

      where the (marginal) posterior is [not πk ]

          π(θ|x) = Σ_{k=1}^D P(θ, M = k|x)
                  = Σ_{k=1}^D pk Zk πk (θk |x) ∏_{j≠k} πj (θj |M = k) .
On some computational methods for Bayesian model choice
  Cross-model solutions
     Saturation schemes



MCMC implementation
      Run a Markov chain (M (t) , θ1 (t) , . . . , θD (t) ) with stationary
      distribution π(θ, M |x) by
          1   Pick M (t) = k with probability π(θ(t−1) , k|x)
          2   Generate θk (t) from the posterior πk (θk |x) [or MCMC step]
          3   Generate θj (t) (j ≠ k) from the pseudo-prior πj (θj |M = k)
      Approximate P(M = k|x) = Zk by

          p̌k (x) ∝ pk Σ_{t=1}^T [ fk (x|θk (t) ) πk (θk (t) ) ∏_{j≠k} πj (θj (t) |M = k) ]
                             / [ Σ_{ℓ=1}^D pℓ fℓ (x|θℓ (t) ) πℓ (θℓ (t) ) ∏_{j≠ℓ} πj (θj (t) |M = ℓ) ]
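
      A minimal sketch of this Gibbs-type scheme (assuming Python with NumPy and
      the uniform-versus-exponential toy example with a single observation x; the
      pseudo-priors are simply taken equal to the Exp(1) priors, which is valid
      but far from optimal for mixing):

          import numpy as np

          rng = np.random.default_rng(0)
          x = 1.5
          p = {1: 0.5, 2: 0.5}                       # prior model weights

          def lik(k, th):                            # f_k(x | θ)
              if k == 1:
                  return 1.0 / th if th > x else 0.0 # M1: x|θ ~ U(0, θ)
              return th * np.exp(-th * x)            # M2: x|θ ~ Exp(θ)

          T, th = 50_000, {1: x + 1.0, 2: 1.0}       # saturated parameter (θ1, θ2)
          rao_blackwell = np.zeros(2)
          for _ in range(T):
              # 1. model indicator: with pseudo-priors equal to the priors,
              #    π(M = k | θ, x) ∝ p_k f_k(x | θ_k)
              w = np.array([p[1] * lik(1, th[1]), p[2] * lik(2, th[2])])
              w /= w.sum()
              k = 1 + rng.choice(2, p=w)
              # 2. θ_k from (an MCMC step targeting) its posterior π_k(θ_k | x)
              if k == 1:                             # MH step, target ∝ e^{-θ}/θ on (x, ∞)
                  prop = th[1] + 0.3 * rng.normal()
                  if prop > x and rng.uniform() < (np.exp(-prop) / prop) / (np.exp(-th[1]) / th[1]):
                      th[1] = prop
              else:                                  # conjugate: θ2 | x ~ Gamma(2, rate 1 + x)
                  th[2] = rng.gamma(2.0, 1.0 / (1.0 + x))
              # 3. the other component from its pseudo-prior, here Exp(1)
              th[3 - k] = rng.exponential(1.0)
              rao_blackwell += w                     # running estimate of P(M = .|x)
          print("P(M = 1 | x) estimate:", rao_blackwell[0] / T)

      Pseudo-priors closer to the within-model posteriors would make both models
      visited far more often; the Exp(1) choice above is only for brevity.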
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Scott’s (2002) proposal


      Suggests estimating P(M = k|x) by

          Zk ∝ pk Σ_{t=1}^T [ fk (x|θk (t) ) / Σ_{j=1}^D pj fj (x|θj (t) ) ] ,

      based on D simultaneous and independent MCMC chains

          (θk (t) )t ,    1 ≤ k ≤ D,

      with stationary distributions πk (θk |x) [instead of above joint]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Congdon’s (2006) extension


      Selecting flat [prohibited] pseudo-priors, Congdon (2006) uses instead

          Zk ∝ pk Σ_{t=1}^T [ fk (x|θk (t) ) πk (θk (t) ) / Σ_{j=1}^D pj fj (x|θj (t) ) πj (θj (t) ) ] ,

      where again the θk (t) ’s are MCMC chains with stationary
      distributions πk (θk |x)
                                                                 to next section
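
      A sketch contrasting the two plug-in approximations with the exact posterior
      probability on the toy example of the next slide (assuming Python with
      NumPy/SciPy and a single observation x; the within-model posteriors are
      sampled exactly here, so any discrepancy comes from the estimators, not from
      the chains):

          import numpy as np
          from scipy.special import exp1

          rng = np.random.default_rng(0)
          x, T = 1.5, 100_000

          def f1(th): return np.where(th > x, 1.0 / th, 0.0)   # U(0, θ) density at x
          def f2(th): return th * np.exp(-th * x)              # Exp(θ) density at x
          def prior(th): return np.exp(-th)                    # Exp(1) prior density

          # exact draws from the two within-model posteriors
          th2 = rng.gamma(2.0, 1.0 / (1.0 + x), T)             # π2(θ|x) = Gamma(2, rate 1+x)
          th1 = np.empty(T); i = 0
          while i < T:                                         # rejection sampler for π1(θ|x) ∝ e^{-θ}/θ on (x, ∞)
              cand = x + rng.exponential(1.0)
              if rng.uniform() < x / cand:
                  th1[i] = cand; i += 1

          # Scott (2002): likelihoods only
          d = 0.5 * f1(th1) + 0.5 * f2(th2)
          scott = np.array([0.5 * np.mean(f1(th1) / d), 0.5 * np.mean(f2(th2) / d)])
          scott /= scott.sum()

          # Congdon (2006): likelihood times prior
          dc = 0.5 * f1(th1) * prior(th1) + 0.5 * f2(th2) * prior(th2)
          congdon = np.array([0.5 * np.mean(f1(th1) * prior(th1) / dc),
                              0.5 * np.mean(f2(th2) * prior(th2) / dc)])
          congdon /= congdon.sum()

          exact = exp1(x) / (exp1(x) + 1.0 / (1.0 + x) ** 2)   # m1(x) = E1(x), m2(x) = (1+x)^{-2}
          print("exact:", exact, "  Scott:", scott[0], "  Congdon:", congdon[0])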
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples

      Example (Model choice)
      Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model
      M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on
      both models: p1 = p2 = 0.5.

      [Figure: approximations of P(M = 1|x) as a function of y,
      Scott’s (2002) in blue and Congdon’s (2006) in red; N = 10^6
      simulations.]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples (2)
      Example (Model choice (2))
      Normal model M1 : x ∼ N (θ, 1) with θ ∼ N (0, 1) vs. normal
      model M2 : x ∼ N (θ, 1) with θ ∼ N (5, 1)



      [Figure: comparison of both approximations of P(M = 1|x) as a
      function of y, Scott’s (2002) (green, mixed dashes) and
      Congdon’s (2006) (brown, long dashes); N = 10^4 simulations.]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples (3)
      Example (Model choice (3))
      Model M1 : x ∼ N (0, 1/ω) with ω ∼ Exp(a) vs.
      M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b).




      [Figure: comparison of Congdon’s (2006) approximation (brown,
      dashed lines) with P(M = 1|x) as a function of y, when (a, b)
      equals (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081),
      respectively; N = 10^4 simulations.]
On some computational methods for Bayesian model choice
  Nested sampling




Nested sampling
      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions

      4    Nested sampling
             Purpose
             Implementation
             Error rates
             Constraints
             Importance variant
             A mixture comparison
On some computational methods for Bayesian model choice
  Nested sampling
     Purpose



Nested sampling: Goal


      Skilling’s (2007) technique using the one-dimensional
      representation:

          Z = Eπ [L(θ)] = ∫_0^1 ϕ(x) dx

      with
          ϕ−1 (l) = P π (L(θ) > l).

      Note: ϕ(·) is intractable in most cases.
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: First approximation

      Approximate Z by a Riemann sum:

          Ẑ = Σ_{i=1}^N (xi−1 − xi ) ϕ(xi )

      where the xi ’s are either:
              deterministic:   xi = e−i/N
              or random:

                  x0 = 1,   xi+1 = ti xi ,   ti ∼ Be(N, 1)

              so that E[log xi ] = −i/N .
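
      A two-line sketch (assuming Python with NumPy) of the two grids; in both
      cases E[log xi ] = −i/N :

          import numpy as np

          rng = np.random.default_rng(0)
          N, j = 100, 500
          x_det = np.exp(-np.arange(j + 1) / N)            # deterministic: x_i = e^{-i/N}
          t = rng.beta(N, 1, size=j)                       # t_i ~ Be(N, 1)
          x_rand = np.concatenate(([1.0], np.cumprod(t)))  # x_0 = 1, x_{i+1} = t_i x_i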
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Extraneous white noise
      Take

          Z = ∫ e−θ dθ = ∫ δ −1 e−(1−δ)θ δ e−δθ dθ = Eδ [ δ −1 e−(1−δ)θ ]

          Ẑ = (1/N) Σ_{i=1}^N δ −1 e−(1−δ)θi (xi−1 − xi ) ,   θi ∼ E(δ) I(θi ≤ θi−1 )

          N      deterministic      random
          50     4.64 / 4.65        10.5 / 10.5
          100    2.47 / 2.48        4.9  / 5.02
          500    .549 / .550        1.01 / 1.14

      Comparison of variances and MSEs
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Alternative representation

      Another representation is

          Z = Σ_{i=0}^{N−1} {ϕ(xi+1 ) − ϕ(xi )} xi

      which is a special case of

          Z = Σ_{i=0}^{N−1} {L(θ(i+1) ) − L(θ(i) )} π({θ; L(θ) > L(θ(i) )})

      where · · · L(θ(i+1) ) < L(θ(i) ) · · ·
                                          [Lebesgue version of Riemann’s sum]
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Second approximation
      Estimate (intractable) ϕ(xi ) by ϕ̂i :

      Nested sampling
      Start with N values θ1 , . . . , θN sampled from π
      At iteration i,
          1    Take ϕ̂i = L(θk ), where θk is the point with smallest
               likelihood in the pool of θi ’s
          2    Replace θk with a sample from the prior constrained to
               L(θ) > ϕ̂i : the current N points are sampled from the prior
               constrained to L(θ) > ϕ̂i .

      Note that

          π({θ; L(θ) > L(θ(i+1) )}) / π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1/N
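
      A compact sketch of the whole procedure (assuming Python with NumPy) on a
      toy problem where the constrained prior can be simulated exactly:
      θ ∼ U(0, 1), L(θ) = e−θ , so that Z = 1 − e−1 and {θ : L(θ) > l} is the
      interval (0, − log l):

          import numpy as np

          rng = np.random.default_rng(0)

          def L(th):                           # likelihood; prior is U(0, 1), so Z = 1 - e^{-1}
              return np.exp(-th)

          N, eps = 100, 1e-6
          live = rng.uniform(0.0, 1.0, N)      # N live points drawn from the prior
          Z, x_prev, i = 0.0, 1.0, 0
          while np.exp(-i / N) > eps:          # truncate once the remaining prior mass ~ eps
              i += 1
              k = np.argmin(L(live))           # worst live point: φ_i = L(θ_k)
              x_i = np.exp(-i / N)             # deterministic prior-volume grid
              Z += (x_prev - x_i) * L(live[k])
              # replace θ_k by a draw from the prior constrained to L(θ) > L(θ_k);
              # here {L > l} = (0, -log l) = (0, θ_k), so exact sampling is uniform
              live[k] = rng.uniform(0.0, live[k])
              x_prev = x_i

          print("nested sampling:", Z, "  exact:", 1 - np.exp(-1))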
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Third approximation


      Iterate the above steps until a given stopping iteration j is
      reached: e.g.,
              observe very small changes in the approximation Ẑ;
              reach the maximal value of L(θ) when the likelihood is
              bounded and its maximum is known;
              truncate the integral Z at level ε, i.e. replace

                  ∫_0^1 ϕ(x) dx    with    ∫_ε^1 ϕ(x) dx
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



Approximation error


      Error = Ẑ − Z
            = Σ_{i=1}^j (xi−1 − xi ) ϕ̂i − ∫_0^1 ϕ(x) dx

            = − ∫_0^ε ϕ(x) dx                                        (Truncation Error)
              + [ Σ_{i=1}^j (xi−1 − xi ) ϕ(xi ) − ∫_ε^1 ϕ(x) dx ]    (Quadrature Error)
              + [ Σ_{i=1}^j (xi−1 − xi ) {ϕ̂i − ϕ(xi )} ]             (Stochastic Error)

                                                      [Dominated by Monte Carlo]
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



A CLT for the Stochastic Error


      The (dominating) stochastic error is OP (N −1/2 ):

          N 1/2 (Ẑ − Z) →D N (0, V )

      with
          V = − ∫∫_{s,t∈[ε,1]} s ϕ′ (s) t ϕ′ (t) log(s ∨ t) ds dt.

                                          [Proof based on Donsker’s theorem]
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



What of log Z?

      If the interest lies within log Z, Slutsky’s transform of the CLT:

          N 1/2 (log Ẑ − log Z) →D N (0, V /Z2 )

      Note: The number of simulated points equals the number of
      iterations j, and is a multiple of N : if one stops at the first iteration j
      such that e−j/N < ε, then:

          j = N ⌈− log ε⌉ .
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Sampling from constr’d priors



      Exact simulation from the constrained prior is intractable in most
      cases

      Skilling (2007) proposes to use MCMC, but this introduces a bias
      (stopping rule)
      If implementable, then a slice sampler can be devised at the same
      cost [or less]
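
      A sketch of the MCMC device used in practice (assuming Python with NumPy;
      log_prior and L are placeholders for the problem at hand): a few
      random-walk Metropolis steps targeting the prior restricted to the current
      likelihood constraint, started from a surviving live point; using a finite
      number of such steps is precisely what introduces the bias mentioned above:

          import numpy as np

          rng = np.random.default_rng(1)

          def constrained_prior_move(theta, l, log_prior, L, n_steps=10, scale=0.1):
              # random-walk Metropolis targeting π(θ) restricted to {θ : L(θ) > l},
              # started from a point already satisfying the constraint
              th = np.asarray(theta, dtype=float)
              for _ in range(n_steps):
                  prop = th + scale * rng.normal(size=th.shape)
                  if L(prop) > l and np.log(rng.uniform()) < log_prior(prop) - log_prior(th):
                      th = prop
              return th

          # e.g. live[k] = constrained_prior_move(live[j], L(live[k]), log_prior, L)
          # for a randomly chosen surviving point j ≠ k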
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration
      Case of a banana target made of a twisted 2D normal:

          x2 = x2 + β x1^2 − 100β

                                    [Haario, Saksman, Tamminen, 1999]

      [Figure: banana-shaped target with β = .03, σ = (100, 1)]
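
      For reference, a sketch of the corresponding target (assuming Python with
      NumPy and that σ is read as the vector of variances):

          import numpy as np

          def log_banana(x1, x2, beta=0.03, var=(100.0, 1.0)):
              # twisted 2D normal: untwist the second coordinate, then evaluate
              # a centred normal with variances var (up to an additive constant)
              y2 = x2 + beta * x1**2 - 100.0 * beta
              return -0.5 * (x1**2 / var[0] + y2**2 / var[1])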
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration (2)
      Use of nested sampling with N = 1000, 50 MCMC steps with size
      0.1, compared with a population Monte Carlo (PMC) based on 10
      iterations with 5000 points per iteration and final sample of 50000
      points, using nine Student’s t components with 9 df
                [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]




                                            Evidence estimation
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration (2)
      Use of nested sampling with N = 1000, 50 MCMC steps with size
      0.1, compared with a population Monte Carlo (PMC) based on 10
      iterations with 5000 points per iteration and final sample of 50000
      points, using nine Student’s t components with 9 df
                [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]




                                              E[X1 ] estimation
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration (2)
      Use of nested sampling with N = 1000, 50 MCMC steps with size
      0.1, compared with a population Monte Carlo (PMC) based on 10
      iterations with 5000 points per iteration and final sample of 50000
      points, using nine Student’s t components with 9 df
                [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]




                                              E[X2 ] estimation
On some computational methods for Bayesian model choice
  Nested sampling
     Importance variant



A IS variant of nested sampling

      Consider an instrumental prior π̃ and likelihood L̃, weight function

          w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ)

      and weighted NS estimator

          Ẑ = Σ_{i=1}^j (xi−1 − xi ) ϕ̂i w(θi ).

      Then choose (π̃, L̃) so that sampling from π̃ constrained to
      L̃(θ) > l is easy; e.g. N (c, Id ) constrained to ‖c − θ‖ < r.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Mixture pN (0, 1) + (1 − p)N (µ, σ) posterior


           Posterior on (µ, σ) for n
           observations with µ = 2
           and σ = 3/2, when p is
           known
           Use of a uniform prior
           both on (−2, 6) for µ
           and on (.001, 16) for
           log σ 2 .
           occurrences of posterior
           bursts for µ = xi
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment




   MCMC sample for n = 16                                 Nested sampling sequence
   observations from the mixture.                         with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment




   MCMC sample for n = 50                                 Nested sampling sequence
   observations from the mixture.                         with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison

          1   Nested sampling: M = 1000 points, with 10 random walk
              moves at each step, simulations from the constr’d prior and a
              stopping rule at 95% of the observed maximum likelihood
          2   T = 10^4 MCMC (= Gibbs) simulations producing
              non-parametric estimates ϕ̂
                                                   [Diebolt & Robert, 1990]
          3   Monte Carlo estimates Ẑ1 , Ẑ2 , Ẑ3 using a product of two
              Gaussian kernels
          4   numerical integration based on an 850 × 950 grid [reference
              value, confirmed by Chib’s method]
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)
      [Boxplots of the four approximations V1, V2, V3, V4]




      Graph based on a sample of 10 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)



      [Boxplots of the four approximations V1, V2, V3, V4]




      Graph based on a sample of 50 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)



      [Boxplots of the four approximations V1, V2, V3, V4]




      Graph based on a sample of 100 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  ABC model choice




Approximate Bayesian computation

      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions

      4    Nested sampling

      5    ABC model choice
            ABC method
            ABC-PMC
            ABC for model choice in GRFs
            Illustrations
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



Approximate Bayesian Computation

      Bayesian setting: target is π(θ)f (x|θ)
      When likelihood f (x|θ) not in closed form, likelihood-free rejection
      technique:
      ABC algorithm
      For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
      simulating
                            θ′ ∼ π(θ) ,   x ∼ f (x|θ′ ) ,
      until the auxiliary variable x is equal to the observed value, x = y.

                                                          [Pritchard et al., 1999]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



A as approximative


      When y is a continuous random variable, equality x = y is replaced
      with a tolerance condition,

          ρ(x, y) ≤ ε

      where ρ is a distance between summary statistics
      Output distributed from

          π(θ) Pθ {ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
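
      A sketch of the resulting likelihood-free rejection sampler (assuming
      Python with NumPy and, for concreteness, the toy mixture target used at the
      end of this section, with observed y = 0 and an illustrative tolerance):

          import numpy as np

          rng = np.random.default_rng(0)
          y, eps, n_keep = 0.0, 0.05, 2000

          def simulate(th):                       # x | θ ~ 0.5 N(θ, 1) + 0.5 N(θ, 1/100)
              sd = 1.0 if rng.uniform() < 0.5 else 0.1
              return rng.normal(th, sd)

          accepted = []
          while len(accepted) < n_keep:
              th = rng.uniform(-10.0, 10.0)       # θ' ~ π(θ)
              x = simulate(th)                    # x ~ f(x|θ')
              if abs(x - y) <= eps:               # tolerance version of the x = y check
                  accepted.append(th)
          accepted = np.array(accepted)           # ≈ sample from π(θ | ρ(x, y) < ε)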
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



Dynamic area
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC improvements


      Simulating from the prior is often poor in efficiency.
      Either modify the proposal distribution on θ to increase the density
      of x’s within the vicinity of y...
           [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

      ...or view the problem as conditional density estimation and
      develop techniques that allow for a larger ε
                                                  [Beaumont et al., 2002]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC-MCMC


      Markov chain (θ(t) ) created via the transition function

          θ(t+1) = θ′ ∼ K(θ′ |θ(t) )   if x ∼ f (x|θ′ ) is such that x = y
                                       and u ∼ U(0, 1) ≤ π(θ′ )K(θ(t) |θ′ ) / π(θ(t) )K(θ′ |θ(t) ) ,
                   θ(t)                otherwise,

      has the posterior π(θ|y) as stationary distribution
                                                    [Marjoram et al, 2003]
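
      A sketch of this ABC-MCMC transition on the same toy mixture target
      (assuming Python with NumPy; the proposal K is a symmetric random walk, so
      only the prior ratio enters the Metropolis–Hastings bound):

          import numpy as np

          rng = np.random.default_rng(0)
          y, eps, T = 0.0, 0.05, 50_000

          def simulate(th):                       # x | θ ~ 0.5 N(θ, 1) + 0.5 N(θ, 1/100)
              sd = 1.0 if rng.uniform() < 0.5 else 0.1
              return rng.normal(th, sd)

          def in_prior(th):                       # U(-10, 10) prior support
              return -10.0 < th < 10.0

          th, chain = 0.0, np.empty(T)
          for t in range(T):
              prop = th + 0.5 * rng.normal()      # θ' ~ K(θ'|θ^(t)), symmetric
              x = simulate(prop)
              # accept only if the simulated x falls in the tolerance ball around y
              if abs(x - y) <= eps and in_prior(prop):
                  th = prop                       # prior ratio is 1 inside the support
              chain[t] = th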
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC-PRC

      Another sequential version producing a sequence of Markov
      transition kernels Kt and of samples (θ1 (t) , . . . , θN (t) ) (1 ≤ t ≤ T )

      ABC-PRC Algorithm
          1   Pick θ′ at random among the previous θi (t−1) ’s,
              with probabilities ωi (t−1) (1 ≤ i ≤ N ).
          2   Generate

                  θi (t) ∼ Kt (θ|θ′ ) ,   x ∼ f (x|θi (t) ) ,

          3   Check that ρ(x, y) < ε, otherwise start again.

                                                                 [Sisson et al., 2007]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC-PRC weight

      Probability ωi (t) computed as

          ωi (t) ∝ π(θi (t) ) Lt−1 (θ′ |θi (t) ) { π(θ′ ) Kt (θi (t) |θ′ ) }−1 ,

      where Lt−1 is an arbitrary transition kernel.
      In case

          Lt−1 (θ′ |θ) = Kt (θ|θ′ ) ,

      all weights are equal under a uniform prior.
      Inspired from Del Moral et al. (2006), who use backward kernels
      Lt−1 in SMC to achieve unbiasedness
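
      A sketch of a full ABC-PRC pass on the toy mixture target of the following
      example (assuming Python with NumPy, the Lt−1 = Kt choice above so that all
      weights stay equal under the uniform prior, and an illustrative decreasing
      tolerance schedule):

          import numpy as np

          rng = np.random.default_rng(0)
          y, N, tau = 0.0, 1000, 0.15
          eps_seq = [2.0, 0.5, 0.1, 0.025]        # illustrative tolerance schedule

          def simulate(th):                       # x | θ ~ 0.5 N(θ, 1) + 0.5 N(θ, 1/100)
              sd = 1.0 if rng.uniform() < 0.5 else 0.1
              return rng.normal(th, sd)

          # t = 1: plain ABC rejection from the U(-10, 10) prior
          particles = []
          while len(particles) < N:
              th = rng.uniform(-10.0, 10.0)
              if abs(simulate(th) - y) <= eps_seq[0]:
                  particles.append(th)
          particles = np.array(particles)

          # t >= 2: resample a particle, move it with K_t, re-check the tolerance
          for eps in eps_seq[1:]:
              new = np.empty(N)
              for i in range(N):
                  while True:
                      anc = particles[rng.integers(N)]     # equal weights (L_{t-1} = K_t)
                      prop = rng.normal(anc, tau)          # K_t(θ | θ'), Gaussian with sd τ
                      if -10.0 < prop < 10.0 and abs(simulate(prop) - y) <= eps:
                          new[i] = prop
                          break
              particles = new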
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC-PRC bias

                    Lack of unbiasedness of the method
      Joint density of the accepted pair (θ(t−1) , θ(t) ) proportional to

          π(θ(t−1) |y) Kt (θ(t) |θ(t−1) ) f (y|θ(t) ) ,

      For an arbitrary function h(θ), E[ωt h(θ(t) )] is proportional to

          ∫∫ h(θ(t) ) [ π(θ(t) ) Lt−1 (θ(t−1) |θ(t) ) / π(θ(t−1) ) Kt (θ(t) |θ(t−1) ) ]
                      × π(θ(t−1) |y) Kt (θ(t) |θ(t−1) ) f (y|θ(t) ) dθ(t−1) dθ(t)

          ∝ ∫∫ h(θ(t) ) [ π(θ(t) ) Lt−1 (θ(t−1) |θ(t) ) / π(θ(t−1) ) Kt (θ(t) |θ(t−1) ) ]
                        × π(θ(t−1) ) f (y|θ(t−1) ) Kt (θ(t) |θ(t−1) ) f (y|θ(t) ) dθ(t−1) dθ(t)

          ∝ ∫ h(θ(t) ) π(θ(t) |y) { ∫ Lt−1 (θ(t−1) |θ(t) ) f (y|θ(t−1) ) dθ(t−1) } dθ(t) .
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



A mixture example (0)

      Toy model of Sisson et al. (2007): if

          θ ∼ U(−10, 10) ,        x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) ,

      then the posterior distribution associated with y = 0 is the normal
      mixture
                   θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100)
      restricted to [−10, 10].
      Furthermore, the true target is available as

          π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)) .
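
      A sketch of this closed-form benchmark (assuming Python with NumPy/SciPy
      and an illustrative tolerance), normalised on a plotting grid so that it can
      be overlaid on the ABC output:

          import numpy as np
          from scipy.stats import norm

          eps = 0.1                                     # illustrative ABC tolerance
          grid = np.linspace(-3.0, 3.0, 601)
          target = (norm.cdf(eps - grid) - norm.cdf(-eps - grid)
                    + norm.cdf(10 * (eps - grid)) - norm.cdf(-10 * (eps + grid)))
          target /= target.sum() * (grid[1] - grid[0])  # normalise on the grid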
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



A mixture example (1)

      [Figure: two rows of five ABC posterior density estimates of θ, with θ on the horizontal axis]

                      Comparison of τ = 0.15 and τ = 1/0.15 in Kt
On some computational methods for Bayesian model choice
  ABC model choice
     ABC-PMC



A PMC version
      Use of the same kernel idea as ABC-PRC, but with an importance sampling (IS) correction.
      Generate a sample at iteration t by

                       π̂t (θ^(t)) ∝ Σ_{j=1}^{N} ωj^(t−1) Kt (θ^(t) | θj^(t−1))

      modulo acceptance of the associated xt , and use an importance
      weight associated with an accepted simulation θi^(t)

                       ωi^(t) ∝ π(θi^(t)) / π̂t (θi^(t)) .

                                           c Still likelihood free
                                                               [Beaumont et al., 2009]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC-PMC



The ABC-PMC algorithm
      Given a decreasing sequence of approximation levels ε1 ≥ . . . ≥ εT ,

         1. At iteration t = 1,
                     For i = 1, . . . , N
                             Simulate θi^(1) ∼ π(θ) and x ∼ f (x|θi^(1)) until ρ(x, y) < ε1
                             Set ωi^(1) = 1/N
              Take τ2^2 as twice the empirical variance of the θi^(1)’s
         2. At iteration 2 ≤ t ≤ T ,
                     For i = 1, . . . , N , repeat
                             Pick θi∗ from the θj^(t−1)’s with probabilities ωj^(t−1)
                             Generate θi^(t) | θi∗ ∼ N (θi∗ , τt^2) and x ∼ f (x|θi^(t))
                     until ρ(x, y) < εt
                     Set ωi^(t) ∝ π(θi^(t)) / Σ_{j=1}^{N} ωj^(t−1) ϕ(τt^{−1} (θi^(t) − θj^(t−1)))
              Take τ_{t+1}^2 as twice the weighted empirical variance of the θi^(t)’s
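
      A possible rendering of the above in code for the toy mixture example, as a sketch rather
      than the authors' implementation: the particle number N, the tolerance schedule ε1 , . . . , εT
      and the Gaussian kernel with scale τt below are illustrative choices, with y = 0 as in the example.

import numpy as np

rng = np.random.default_rng(1)
eps_seq = [2.0, 1.0, 0.5, 0.25, 0.1]        # decreasing tolerances ε_1 ≥ ... ≥ ε_T
N = 2_000

def sample_x(theta):
    sd = 1.0 if rng.random() < 0.5 else 0.1
    return rng.normal(theta, sd)

def prior_pdf(theta):
    return ((theta > -10) & (theta < 10)) / 20.0

# t = 1: plain rejection from the prior
theta = np.empty(N)
for i in range(N):
    while True:
        th = rng.uniform(-10, 10)
        if abs(sample_x(th)) < eps_seq[0]:
            theta[i] = th
            break
w = np.full(N, 1.0 / N)
tau2 = 2 * np.var(theta)                    # τ_2² = twice the empirical variance

# t = 2, ..., T: move particles with N(θ*, τ_t²), then reweight
for eps in eps_seq[1:]:
    new_theta = np.empty(N)
    for i in range(N):
        while True:
            th_star = rng.choice(theta, p=w)
            th = rng.normal(th_star, np.sqrt(tau2))
            if abs(sample_x(th)) < eps:
                new_theta[i] = th
                break
    # ω_i ∝ π(θ_i) / Σ_j ω_j φ(τ_t^{-1}(θ_i − θ_j))
    kern = np.exp(-0.5 * (new_theta[:, None] - theta[None, :]) ** 2 / tau2)
    kern /= np.sqrt(2 * np.pi * tau2)
    new_w = prior_pdf(new_theta) / (kern * w[None, :]).sum(axis=1)
    w, theta = new_w / new_w.sum(), new_theta
    tau2 = 2 * np.average((theta - np.average(theta, weights=w)) ** 2, weights=w)

print("weighted posterior mean of θ:", np.average(theta, weights=w))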
On some computational methods for Bayesian model choice
  ABC model choice
     ABC-PMC



A mixture example (2)
      Recovery of the target, whether using a fixed standard deviation of
      τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Gibbs random fields

      Gibbs distribution
      The rv y = (y1 , . . . , yn ) is a Gibbs random field associated with
      the graph G if

                              f (y) = (1/Z) exp{ − Σ_{c∈C} Vc (yc ) } ,

      where Z is the normalising constant, C is the set of cliques of G,
      and Vc is an arbitrary function, called the potential;
      U (y) = Σ_{c∈C} Vc (yc ) is the energy function

       c Z is usually unavailable in closed form
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Potts model
      Potts model
      Vc (y) is of the form

                              Vc (y) = θ S(y) = θ Σ_{l∼i} δ_{yl = yi}

      where l∼i denotes a neighbourhood structure

      In most realistic settings, the summation

                              Zθ = Σ_{x∈X} exp{θ^T S(x)}

      involves too many terms to be manageable and numerical
      approximations cannot always be trusted
                             [Cucala, Marin, CPR & Titterington, 2009]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Bayesian Model Choice


      Comparing a model with potential S0 taking values in R^p0 versus a
      model with potential S1 taking values in R^p1 can be done through
      the Bayes factor corresponding to the priors π0 and π1 on each
      parameter space

           Bm0/m1 (x) = ∫ exp{θ0^T S0 (x)}/Zθ0,0 π0 (dθ0 )   /   ∫ exp{θ1^T S1 (x)}/Zθ1,1 π1 (dθ1 )

      Use of Jeffreys’ scale to select most appropriate model
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Neighbourhood relations



      Choice to be made between M neighbourhood relations

                              i ∼m i′            (0 ≤ m ≤ M − 1)

      with
                              Sm (x) = Σ_{i ∼m i′} I{xi = xi′ }

      driven by the posterior probabilities of the models.
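
      As a concrete illustration (a sketch under assumed conventions, not from the slides: x is taken
      to be a label image on a regular lattice and the competing relations are the 4- and 8-neighbour
      systems), each Sm (x) simply counts concordant neighbouring pairs:

import numpy as np

def S_4(x):
    """Number of identical horizontal/vertical neighbour pairs."""
    return int((x[:, :-1] == x[:, 1:]).sum() + (x[:-1, :] == x[1:, :]).sum())

def S_8(x):
    """Adds the two diagonal neighbour directions to S_4."""
    diag = (x[:-1, :-1] == x[1:, 1:]).sum() + (x[:-1, 1:] == x[1:, :-1]).sum()
    return S_4(x) + int(diag)

x = np.random.default_rng(2).integers(0, 2, size=(10, 10))
print(S_4(x), S_8(x))   # the vector (S_0, ..., S_{M-1}) plays the role of S(x)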
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Model index



      Formalisation via a model index M that appears as a new
      parameter with prior distribution π(M = m) and
      π(θ|M = m) = πm (θm )
      Computational target:

                P(M = m|x) ∝ ∫_{Θm} fm (x|θm ) πm (θm ) dθm  π(M = m) ,
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



Sufficient statistics
      By definition, if S(x) is a sufficient statistic for the joint parameters
      (M, θ0 , . . . , θM −1 ), then
                                P(M = m|x) = P(M = m|S(x)) .
      Each model m has its own sufficient statistic Sm (·), and
      S(·) = (S0 (·), . . . , SM −1 (·)) is then also sufficient.
      For Gibbs random fields,

                x|M = m ∼ fm (x|θm ) = fm^1 (x|S(x)) fm^2 (S(x)|θm )
                                     = fm^2 (S(x)|θm ) / n(S(x))

      where
                                n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)}

       c S(x) is therefore also sufficient for the joint parameters
                                     [Specific to Gibbs random fields!]
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



ABC model choice Algorithm


      ABC-MC
              Generate m∗ from the prior π(M = m).
              Generate θ∗m∗ from the prior πm∗ (·).
              Generate x∗ from the model fm∗ (·|θ∗m∗ ).
              Compute the distance ρ(S(x0 ), S(x∗ )).
              Accept (θ∗m∗ , m∗ ) if ρ(S(x0 ), S(x∗ )) < ε.

                                               [Cornuet, Grelaud, Marin & Robert, 2008]

      Note: when ε = 0 the algorithm is exact
On some computational methods for Bayesian model choice
  ABC model choice
     ABC for model choice in GRFs



ABC approximation to the Bayes factor

      Frequency ratio:

            B̂F m0/m1 (x0 ) = [ P̂(M = m0 |x0 ) / P̂(M = m1 |x0 ) ] × [ π(M = m1 ) / π(M = m0 ) ]
                            = [ #{mi∗ = m0 } / #{mi∗ = m1 } ] × [ π(M = m1 ) / π(M = m0 ) ] ,

      replaced with

            B̂F m0/m1 (x0 ) = [ (1 + #{mi∗ = m0 }) / (1 + #{mi∗ = m1 }) ] × [ π(M = m1 ) / π(M = m0 ) ]

      to avoid indeterminacy (also a Bayes estimate).
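
      In code, turning the accepted model indices of an ABC-MC run into these two estimates is
      immediate (a sketch, not from the slides; the function and argument names are illustrative):

import numpy as np

def bf_estimates(accepted_m, m0=0, m1=1, prior_m0=0.5, prior_m1=0.5):
    """Raw frequency-ratio Bayes factor and the (1 + count) corrected version."""
    accepted_m = np.asarray(accepted_m)
    n0 = (accepted_m == m0).sum()
    n1 = (accepted_m == m1).sum()
    prior_odds = prior_m1 / prior_m0
    raw = np.inf if n1 == 0 else (n0 / n1) * prior_odds
    corrected = ((1 + n0) / (1 + n1)) * prior_odds
    return raw, corrected

print(bf_estimates([0, 0, 1, 0, 1, 0]))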
On some computational methods for Bayesian model choice
  ABC model choice
     Illustrations



Toy example

      iid Bernoulli model versus two-state first-order Markov chain, i.e.
                 f0 (x|θ0 ) = exp{ θ0 Σ_{i=1}^{n} I{xi = 1} } / {1 + exp(θ0 )}^n ,

      versus

                 f1 (x|θ1 ) = (1/2) exp{ θ1 Σ_{i=2}^{n} I{xi = xi−1} } / {1 + exp(θ1 )}^{n−1} ,

      with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase
      transition” boundaries).
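
      A possible end-to-end ABC-MC run on this toy example, as a sketch rather than the computation
      behind the next slide: the sample size n = 100, the pseudo-observed data, the number of
      proposals (far fewer than the 4·10^6 used in the slides) and the Euclidean distance on
      (S0 , S1 ) are illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
n = 100

def simulate(m, theta):
    # p = exp(θ)/(1 + exp(θ)): success probability (m = 0) or persistence probability (m = 1)
    p = 1.0 / (1.0 + np.exp(-theta))
    if m == 0:                                   # iid Bernoulli model
        return rng.binomial(1, p, size=n)
    x = np.empty(n, dtype=int)                   # two-state first-order Markov chain
    x[0] = rng.integers(0, 2)                    # x_1 ~ Bernoulli(1/2), hence the 1/2 factor in f_1
    for i in range(1, n):
        x[i] = x[i - 1] if rng.random() < p else 1 - x[i - 1]
    return x

def S(x):
    # concatenated sufficient statistics (S0, S1)
    return np.array([x.sum(), (x[1:] == x[:-1]).sum()])

x0 = simulate(1, 2.0)                            # pseudo-observed data, chosen for illustration
S_obs = S(x0)

n_sim = 100_000
ms = rng.integers(0, 2, size=n_sim)              # m* from the uniform model prior
thetas = np.where(ms == 0, rng.uniform(-5, 5, n_sim), rng.uniform(0, 6, n_sim))
dists = np.array([np.linalg.norm(S(simulate(m, t)) - S_obs) for m, t in zip(ms, thetas)])

eps = np.quantile(dists, 0.01)                   # tolerance = 1% quantile of the distances
kept = ms[dists < eps]
BF01 = (1 + (kept == 0).sum()) / (1 + (kept == 1).sum())   # corrected estimate, prior odds = 1
print("estimated BF01:", BF01)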
On some computational methods for Bayesian model choice
  ABC model choice
     Illustrations



Toy example (2)




      [Figure: ABC approximation of log BF01 plotted against the true log BF01 , two panels]

      (left) Comparison of the true BF m0/m1 (x0 ) with its ABC approximation
      (in logs) over 2,000 simulations and 4·10^6 proposals from the prior.
      (right) Same when using a tolerance corresponding to the 1% quantile
      on the distances.
On some computational methods for Bayesian model choice
  ABC model choice
     Illustrations



Protein folding




      Superposition of the native structure (grey) with the ST1
      structure (red), the ST2 structure (orange), the ST3 structure
      (green), and the DT structure (blue).
On some computational methods for Bayesian model choice
  ABC model choice
     Illustrations



Protein folding (2)


                                     % seq. Id.    TM-score    FROST score
                 1i5nA (ST1)             32          0.86          75.3
                 1ls1A1 (ST2)             5          0.42           8.9
                 1jr8A (ST3)              4          0.24           8.9
                 1s7oA (DT)              10          0.08           7.8

      Characteristics of the dataset. % seq. Id.: percentage of identity with
      the query sequence. TM-score: similarity between predicted and
      native structure (uncertainty between 0.17 and 0.4). FROST score:
      quality of alignment of the query onto the candidate structure
      (uncertainty between 7 and 9).
On some computational methods for Bayesian model choice
  ABC model choice
     Illustrations



Protein folding (3)



                                        NS/ST1            NS/ST2   NS/ST3   NS/DT
               BF                         1.34             1.22     2.42     2.76
           P(M = NS|x0 )                 0.573             0.551    0.708    0.734

      Estimates of the Bayes factors between model NS and models
      ST1, ST2, ST3, and DT, and corresponding posterior
      probabilities of model NS, based on an ABC-MC algorithm using
      1.2·10^6 simulations and a tolerance equal to the 1% quantile of
      the distances.

More Related Content

PDF
4th joint Warwick Oxford Statistics Seminar
PDF
Yes III: Computational methods for model choice
PDF
Jsm09 talk
PDF
45th SIS Meeting, Padova, Italy
PDF
Considerate Approaches to ABC Model Selection
PDF
Colloquium in honor of Hans Ruedi Künsch
PDF
BIRS 12w5105 meeting
PDF
Boston talk
4th joint Warwick Oxford Statistics Seminar
Yes III: Computational methods for model choice
Jsm09 talk
45th SIS Meeting, Padova, Italy
Considerate Approaches to ABC Model Selection
Colloquium in honor of Hans Ruedi Künsch
BIRS 12w5105 meeting
Boston talk

What's hot (20)

PDF
Approximate Bayesian model choice via random forests
PDF
Discussion of ABC talk by Stefano Cabras, Padova, March 21, 2013
PDF
A Maximum Entropy Approach to the Loss Data Aggregation Problem
PDF
von Mises lecture, Berlin
PDF
Coordinate sampler : A non-reversible Gibbs-like sampler
PDF
ABC-Gibbs
PDF
the ABC of ABC
PDF
Convergence of ABC methods
PDF
Presentation on stochastic control problem with financial applications (Merto...
PDF
Principle of Maximum Entropy
PDF
ABC-Gibbs
PPT
Max Entropy
PDF
CISEA 2019: ABC consistency and convergence
PDF
seminar at Princeton University
PDF
Approximating Bayes Factors
PDF
WSC 2011, advanced tutorial on simulation in Statistics
PPT
Bayesian statistics using r intro
PDF
Bayesian computation with INLA
PDF
Computational tools for Bayesian model choice
PDF
Introduction to Bayesian Methods
Approximate Bayesian model choice via random forests
Discussion of ABC talk by Stefano Cabras, Padova, March 21, 2013
A Maximum Entropy Approach to the Loss Data Aggregation Problem
von Mises lecture, Berlin
Coordinate sampler : A non-reversible Gibbs-like sampler
ABC-Gibbs
the ABC of ABC
Convergence of ABC methods
Presentation on stochastic control problem with financial applications (Merto...
Principle of Maximum Entropy
ABC-Gibbs
Max Entropy
CISEA 2019: ABC consistency and convergence
seminar at Princeton University
Approximating Bayes Factors
WSC 2011, advanced tutorial on simulation in Statistics
Bayesian statistics using r intro
Bayesian computation with INLA
Computational tools for Bayesian model choice
Introduction to Bayesian Methods
Ad

Similar to MaxEnt 2009 talk (20)

PDF
ABC model choice
PDF
from model uncertainty to ABC
PDF
An overview of Bayesian testing
PDF
ABC in London, May 5, 2011
PDF
Course on Bayesian computational methods
PDF
San Antonio short course, March 2010
PDF
Statistics symposium talk, Harvard University
PDF
Rss talk for Bayes 250 by Steven Walker
PPT
Elementary Probability and Information Theory
PDF
(Approximate) Bayesian computation as a new empirical Bayes (something)?
PPT
original
PDF
Non-parametric analysis of models and data
PDF
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
PDF
Workshop on Bayesian Inference for Latent Gaussian Models with Applications
PDF
Bayesian model choice (and some alternatives)
PDF
EM algorithm and its application in probabilistic latent semantic analysis
PDF
Columbia workshop [ABC model choice]
PDF
Bayesian Nonparametrics: Models Based on the Dirichlet Process
PDF
Edinburgh, Bayes-250
PDF
MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...
ABC model choice
from model uncertainty to ABC
An overview of Bayesian testing
ABC in London, May 5, 2011
Course on Bayesian computational methods
San Antonio short course, March 2010
Statistics symposium talk, Harvard University
Rss talk for Bayes 250 by Steven Walker
Elementary Probability and Information Theory
(Approximate) Bayesian computation as a new empirical Bayes (something)?
original
Non-parametric analysis of models and data
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
Workshop on Bayesian Inference for Latent Gaussian Models with Applications
Bayesian model choice (and some alternatives)
EM algorithm and its application in probabilistic latent semantic analysis
Columbia workshop [ABC model choice]
Bayesian Nonparametrics: Models Based on the Dirichlet Process
Edinburgh, Bayes-250
MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...
Ad

More from Christian Robert (20)

PDF
Insufficient Gibbs sampling (A. Luciano, C.P. Robert and R. Ryder)
PDF
The future of conferences towards sustainability and inclusivity
PDF
Adaptive Restore algorithm & importance Monte Carlo
PDF
Asymptotics of ABC, lecture, Collège de France
PDF
Workshop in honour of Don Poskitt and Gael Martin
PDF
discussion of ICML23.pdf
PDF
How many components in a mixture?
PDF
restore.pdf
PDF
Testing for mixtures at BNP 13
PDF
Inferring the number of components: dream or reality?
PDF
CDT 22 slides.pdf
PDF
Testing for mixtures by seeking components
PDF
discussion on Bayesian restricted likelihood
PDF
NCE, GANs & VAEs (and maybe BAC)
PDF
eugenics and statistics
PDF
Laplace's Demon: seminar #1
PDF
asymptotics of ABC
PDF
ABC-Gibbs
PDF
Likelihood-free Design: a discussion
PDF
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
Insufficient Gibbs sampling (A. Luciano, C.P. Robert and R. Ryder)
The future of conferences towards sustainability and inclusivity
Adaptive Restore algorithm & importance Monte Carlo
Asymptotics of ABC, lecture, Collège de France
Workshop in honour of Don Poskitt and Gael Martin
discussion of ICML23.pdf
How many components in a mixture?
restore.pdf
Testing for mixtures at BNP 13
Inferring the number of components: dream or reality?
CDT 22 slides.pdf
Testing for mixtures by seeking components
discussion on Bayesian restricted likelihood
NCE, GANs & VAEs (and maybe BAC)
eugenics and statistics
Laplace's Demon: seminar #1
asymptotics of ABC
ABC-Gibbs
Likelihood-free Design: a discussion
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models

Recently uploaded (20)

PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
UNIT III MENTAL HEALTH NURSING ASSESSMENT
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PDF
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
PDF
Weekly quiz Compilation Jan -July 25.pdf
PPTX
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
PPTX
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
PPTX
Introduction to Building Materials
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PPTX
Unit 4 Skeletal System.ppt.pptxopresentatiom
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
A systematic review of self-coping strategies used by university students to ...
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PDF
LNK 2025 (2).pdf MWEHEHEHEHEHEHEHEHEHEHE
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
PDF
Classroom Observation Tools for Teachers
PPTX
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
Supply Chain Operations Speaking Notes -ICLT Program
UNIT III MENTAL HEALTH NURSING ASSESSMENT
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
Weekly quiz Compilation Jan -July 25.pdf
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
Introduction to Building Materials
Orientation - ARALprogram of Deped to the Parents.pptx
Unit 4 Skeletal System.ppt.pptxopresentatiom
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
A systematic review of self-coping strategies used by university students to ...
Chinmaya Tiranga quiz Grand Finale.pdf
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
LNK 2025 (2).pdf MWEHEHEHEHEHEHEHEHEHEHE
Final Presentation General Medicine 03-08-2024.pptx
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
Classroom Observation Tools for Teachers
Chinmaya Tiranga Azadi Quiz (Class 7-8 )

MaxEnt 2009 talk

  • 1. On some computational methods for Bayesian model choice On some computational methods for Bayesian model choice Christian P. Robert CREST-INSEE and Universit´ Paris Dauphine e http://www.ceremade.dauphine.fr/~xian MaxEnt 2009, Oxford, July 6, 2009 Joint works with M. Beaumont, N. Chopin, J.-M. Cornuet, J.-M. Marin and D. Wraith
  • 2. On some computational methods for Bayesian model choice Outline 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling 5 ABC model choice
  • 3. On some computational methods for Bayesian model choice Evidence Bayes tests Formal construction of Bayes tests Definition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is 1 if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), δ π (x) = 0 otherwise,
  • 4. On some computational methods for Bayesian model choice Evidence Bayes tests Formal construction of Bayes tests Definition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is 1 if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), δ π (x) = 0 otherwise,
  • 5. On some computational methods for Bayesian model choice Evidence Bayes factor Bayes factor Definition (Bayes factors) For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 , under prior π(Θ0 )π0 (θ) + π(Θc )π1 (θ) , 0 central quantity f (x|θ)π0 (θ)dθ π(Θ0 |x) π(Θ0 ) Θ0 m0 (x) B01 (x) = = = π(Θc |x) 0 π(Θc ) 0 m1 (x) f (x|θ)π1 (θ)dθ Θc 0 [Jeffreys, 1939]
  • 6. On some computational methods for Bayesian model choice Evidence Model choice Model choice and model comparison Choice between models Several models available for the same observation x Mi : x ∼ fi (x|θi ), i∈I where the family I can be finite or infinite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors
  • 7. On some computational methods for Bayesian model choice Evidence Model choice Model choice and model comparison Choice between models Several models available for the same observation x Mi : x ∼ fi (x|θi ), i∈I where the family I can be finite or infinite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors
  • 8. On some computational methods for Bayesian model choice Evidence Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj j Θj take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj j Θj
  • 9. On some computational methods for Bayesian model choice Evidence Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj j Θj take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj j Θj
  • 10. On some computational methods for Bayesian model choice Evidence Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj j Θj take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj j Θj
  • 11. On some computational methods for Bayesian model choice Evidence Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj j Θj take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj j Θj
  • 12. On some computational methods for Bayesian model choice Evidence Evidence Evidence All these problems end up with a similar quantity, the evidence Zk = πk (θk )Lk (θk ) dθk , Θk aka the marginal likelihood.
  • 13. On some computational methods for Bayesian model choice Importance sampling solutions Importance sampling revisited 1 Evidence 2 Importance sampling solutions Regular importance Harmonic means Chib’s solution 3 Cross-model solutions 4 Nested sampling 5 ABC model choice
  • 14. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bayes factor approximation When approximating the Bayes factor f0 (x|θ0 )π0 (θ0 )dθ0 Θ0 B01 = f1 (x|θ1 )π1 (θ1 )dθ1 Θ1 use of importance functions 0 and 1 and n−1 0 n0 i i i=1 f0 (x|θ0 )π0 (θ0 )/ i 0 (θ0 ) B01 = n−1 1 n1 i i i=1 f1 (x|θ1 )π1 (θ1 )/ i 1 (θ1 )
  • 15. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling Special case: If π1 (θ1 |x) ∝ π1 (θ1 |x) ˜ π2 (θ2 |x) ∝ π2 (θ2 |x) ˜ live on the same space (Θ1 = Θ2 ), then n 1 π1 (θi |x) ˜ B12 ≈ θi ∼ π2 (θ|x) n π2 (θi |x) ˜ i=1 [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
  • 16. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 var(B12 ) 1 π1 (θ) − π2 (θ) 2 ≈ E B12 n π2 (θ) is large, i.e. if π1 and π2 have little overlap...
  • 17. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 var(B12 ) 1 π1 (θ) − π2 (θ) 2 ≈ E B12 n π2 (θ) is large, i.e. if π1 and π2 have little overlap...
  • 18. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance (Further) bridge sampling In addition π2 (θ|x)α(θ)π1 (θ|x)dθ ˜ B12 = ∀ α(·) π1 (θ|x)α(θ)π2 (θ|x)dθ ˜ n1 1 π2 (θ1i |x)α(θ1i ) ˜ n1 i=1 ≈ n2 θji ∼ πj (θ|x) 1 π1 (θ2i |x)α(θ2i ) ˜ n2 i=1
  • 19. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜1 (θ1i |x) π n1 i=1 B21 = n2 θji ∼ πj (θ|x) 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an infinite variance!!! [Radford Neal’s blog, 2008]
  • 20. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜1 (θ1i |x) π n1 i=1 B21 = n2 θji ∼ πj (θ|x) 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an infinite variance!!! [Radford Neal’s blog, 2008]
  • 21. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
  • 22. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
  • 23. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling The optimal choice of auxiliary function is n1 + n2 α = n1 π1 (θ|x) + n2 π2 (θ|x) leading to n1 1 π2 (θ1i |x) ˜ n1 n1 π1 (θ1i |x) + n2 π2 (θ1i |x) i=1 B12 ≈ n2 1 π1 (θ2i |x) ˜ n2 n1 π1 (θ2i |x) + n2 π2 (θ2i |x) i=1 Back later!
  • 24. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: Var(B12 ) 1 π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ 2 ≈ 2 −1 B12 n1 n2 π1 (θ)π2 (θ)α(θ) dθ δ method Drag: Dependence on the unknown normalising constants solved iteratively
  • 25. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: Var(B12 ) 1 π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ 2 ≈ 2 −1 B12 n1 n2 π1 (θ)π2 (θ)α(θ) dθ δ method Drag: Dependence on the unknown normalising constants solved iteratively
  • 26. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with sufficiently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜
  • 27. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with sufficiently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜
  • 28. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 var(B12 ) 1 (π1 (θ) − π2 (θ))2 2 ≈ Eϕ B12 n ϕ(θ)2 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  • 29. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 var(B12 ) 1 (π1 (θ) − π2 (θ))2 2 ≈ Eϕ B12 n ϕ(θ)2 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  • 30. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 var(B12 ) 1 (π1 (θ) − π2 (θ))2 2 ≈ Eϕ B12 n ϕ(θ)2 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  • 31. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x = dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
  • 32. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x = dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
  • 33. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) 1 ϕ(θk ) Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
  • 34. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) 1 ϕ(θk ) Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
  • 35. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling (cont’d) Compare Z1k with a standard importance sampling approximation T (t) (t) 1 πk (θk )Lk (θk ) Z2k = (t) T ϕ(θk ) t=1 (t) where the θk ’s are generated from the density ϕ(·) (with fatter tails like t’s)
  • 36. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling recall Design a specific mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
  • 37. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling recall Design a specific mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
  • 38. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ (t) = 1 with probability (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) 2 If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) 3 If δ (t) = 2, generate θk ∼ ϕ(θk ) independently
  • 39. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ (t) = 1 with probability (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) 2 If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) 3 If δ (t) = 2, generate θk ∼ ϕ(θk ) independently
  • 40. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ (t) = 1 with probability (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) 2 If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) 3 If δ (t) = 2, generate θk ∼ ϕ(θk ) independently
  • 41. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ 1 ξ= (t) ω1 πk (θk )Lk (θk ) (t) (t) (t) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , (t) T t=1 converges to ω1 Zk /{ω1 Zk + 1} 3k ˆ ˆ ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie T (t) (t) (t) (t) (t) t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = T (t) (t) (t) (t) t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler redux]
  • 42. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ 1 ξ= (t) ω1 πk (θk )Lk (θk ) (t) (t) (t) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , (t) T t=1 converges to ω1 Zk /{ω1 Zk + 1} 3k ˆ ˆ ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie T (t) (t) (t) (t) (t) t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = T (t) (t) (t) (t) t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler redux]
  • 43. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ ∗ πk (θk |x)
  • 44. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ ∗ πk (θk |x)
  • 45. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate T ∗ 1 ∗ (t) πk (θk |x) = πk (θk |x, zk ) , T t=1 (t) where the zk ’s are Gibbs sampled latent variables
  • 46. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching impact A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identifiable marginally since they are exchangeable
  • 47. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching impact A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identifiable marginally since they are exchangeable
  • 48. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected difficulties 1 Number of modes of the likelihood of order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
  • 49. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected difficulties 1 Number of modes of the likelihood of order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
  • 50. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution License Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once 0.2 0.3 0.4 0.5 −1 0 1 2 3 µi 0 100 200 n 300 400 500 pi −1 0 µ 1 i 2 3 0.4 0.6 0.8 1.0 0.2 0.3 0.4 0.5 σi pi 0 100 200 300 400 500 0.2 0.3 0.4 0.5 n pi 0.4 0.6 0.8 1.0 −1 0 1 2 3 σi µi 0 100 200 300 400 500 0.4 0.6 0.8 1.0 n σi
  • 51. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  • 52. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  • 53. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  • 54. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T ∗ 1 ∗ (t) πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]
  • 55. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T ∗ 1 ∗ (t) πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]
  • 56. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown. [Roeder, 1992] Average density 0.8 0.6 Relative Frequency 0.4 0.2 0.0 −2 −1 0 1 2 3 data
  • 57. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
  • 58. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
  • 59. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
  • 60. On some computational methods for Bayesian model choice Cross-model solutions Cross-model solutions 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions Reversible jump Saturation schemes Implementation error 4 Nested sampling 5 ABC model choice
  • 61. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump irreversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]
  • 62. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump irreversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]
  • 63. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .
  • 64. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .
• 65. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves (2) In this case, q1→2(θ1, dθ2) has density ϕ1→2(v1→2) |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|^{-1}, by the Jacobian rule. If π1→2 is the probability of choosing a move to M2 while in M1, the acceptance probability reduces to α(θ1, v1→2) = 1 ∧ [π(M2, θ2) π2→1 / {π(M1, θ1) π1→2 ϕ1→2(v1→2)}] × |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|. If several models are considered simultaneously, with probability π1→2 of choosing a move to M2 while in M1, as in K(x, B) = Σ_{m=1}^∞ ∫_B ρm(x, y) qm(x, dy) + ω(x) I_B(x)
• 68. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Generic reversible jump acceptance probability Acceptance probability of θ2 = Ψ1→2(θ1, v1→2) is α(θ1, v1→2) = 1 ∧ [π(M2, θ2) π2→1 / {π(M1, θ1) π1→2 ϕ1→2(v1→2)}] × |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|, while the acceptance probability for the reverse move, with (θ1, v1→2) = Ψ1→2^{-1}(θ2), is 1 ∧ [π(M1, θ1) π1→2 ϕ1→2(v1→2) / {π(M2, θ2) π2→1}] × |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|^{-1}. Difficult calibration
• 71. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Green's sampler Algorithm Iteration t (t ≥ 1): if x(t) = (m, θ(m)), 1 Select model Mn with probability πmn 2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn) 3 Take x(t+1) = (n, θ(n)) with probability min{ [π(n, θ(n)) πnm ϕnm(vnm) / {π(m, θ(m)) πmn ϕmn(umn)}] × |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 } and take x(t+1) = x(t) otherwise.
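To make the mechanics concrete, here is a minimal Python sketch of Green's sampler on a hypothetical two-model problem of my own (M1: x ∼ N(µ, 1) versus M2: x ∼ N(µ, σ²)); the data, priors, dimension-matching map Ψ and proposal scales are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.5, size=30)          # toy data (assumed, not from the slides)

# M1: x ~ N(mu, 1),        prior mu ~ N(0, 10^2)
# M2: x ~ N(mu, sigma^2),  priors mu ~ N(0, 10^2), log sigma ~ N(0, 1)
def log_target(k, theta):
    """log pi(M_k, theta | x) up to a constant, with equal prior model weights."""
    if k == 1:
        mu, = theta
        return stats.norm.logpdf(mu, 0, 10) + stats.norm.logpdf(x, mu, 1).sum()
    mu, logsig = theta
    return (stats.norm.logpdf(mu, 0, 10) + stats.norm.logpdf(logsig, 0, 1)
            + stats.norm.logpdf(x, mu, np.exp(logsig)).sum())

k, theta = 1, np.array([x.mean()])
T, visits = 20_000, {1: 0, 2: 0}
for t in range(T):
    if rng.random() < 0.5:
        # between-model move, dimension-matching map Psi: (mu, u) -> (mu, logsig = u),
        # with u ~ N(0,1), so the Jacobian equals 1
        if k == 1:
            u = rng.normal()
            prop = np.array([theta[0], u])
            log_alpha = (log_target(2, prop) - log_target(1, theta)
                         - stats.norm.logpdf(u))
            if np.log(rng.random()) < log_alpha:
                k, theta = 2, prop
        else:
            u = theta[1]                   # reverse move is deterministic
            prop = theta[:1]
            log_alpha = (log_target(1, prop) - log_target(2, theta)
                         + stats.norm.logpdf(u))
            if np.log(rng.random()) < log_alpha:
                k, theta = 1, prop
    else:
        # within-model random-walk Metropolis step
        prop = theta + 0.2 * rng.normal(size=theta.size)
        if np.log(rng.random()) < log_target(k, prop) - log_target(k, theta):
            theta = prop
    visits[k] += 1

print({m: visits[m] / T for m in visits})  # crude estimate of P(M = m | x)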
• 72. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Interpretation The representation puts us back in a fixed dimension setting: M1 × V1→2 and M2 in one-to-one relation. Reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ1→2^{-1}(θ2). Appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when the stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal distribution is deterministic
• 74. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes Alternative Saturation of the parameter space H = ∪_k {k} × Θ_k by creating θ = (θ1, . . . , θD), a model index M, and pseudo-priors πj(θj|M = k) for j ≠ k [Carlin & Chib, 1995] Validation by P(M = k|x) = ∫ P(M = k|x, θ) π(θ|x) dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = Σ_{k=1}^D P(θ, M = k|x) = Σ_{k=1}^D pk Zk πk(θk|x) Π_{j≠k} πj(θj|M = k).
• 76. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes MCMC implementation Run a Markov chain (M(t), θ1^(t), . . . , θD^(t)) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θk^(t) from the posterior πk(θk|x) [or MCMC step] 3 Generate θj^(t) (j ≠ k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by p̌k(x) ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) πk(θk^(t)) Π_{j≠k} πj(θj^(t)|M = k) ] / [ Σ_{ℓ=1}^D pℓ fℓ(x|θℓ^(t)) πℓ(θℓ^(t)) Π_{j≠ℓ} πj(θj^(t)|M = ℓ) ]
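A minimal sketch of the Carlin & Chib scheme above on a two-model normal toy (a single assumed observation, conjugate updates, and pseudo-priors taken equal to the true priors, a simple legitimate choice under which the conditional on M collapses to p_k f_k(x|θ_k)); all numerical choices are mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = 1.0                                    # a single observation (assumed value)
p = np.array([0.5, 0.5])                   # prior model weights
prior_mean = np.array([0.0, 5.0])          # model 0: theta ~ N(0,1); model 1: theta ~ N(5,1)

def log_f(k, theta):                       # log f_k(x | theta), likelihood N(theta, 1)
    return stats.norm.logpdf(x, theta, 1.0)

T = 50_000
theta = prior_mean.copy()
counts, rao_black = np.zeros(2), np.zeros(2)
for t in range(T):
    # 1. full conditional of the model index; with pseudo-priors equal to the
    #    priors, the prior terms cancel and it is proportional to p_k f_k(x|theta_k)
    logw = np.array([np.log(p[k]) + log_f(k, theta[k]) for k in (0, 1)])
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(2, p=w)
    counts[k] += 1
    rao_black += w                         # Rao-Blackwellised estimate of P(M = k | x)
    # 2. update theta_k from its conjugate posterior N((x + m_k)/2, 1/2)
    theta[k] = rng.normal((x + prior_mean[k]) / 2, np.sqrt(0.5))
    # 3. update the other component from its pseudo-prior (= its prior here)
    j = 1 - k
    theta[j] = rng.normal(prior_mean[j], 1.0)

print("visit frequencies       :", counts / T)
print("Rao-Blackwellised P(M|x):", rao_black / T)
```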
• 79. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Scott's (2002) proposal Suggests estimating P(M = k|x) by Ẑk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t)) ], based on D simultaneous and independent MCMC chains (θk^(t))t, 1 ≤ k ≤ D, with stationary distributions πk(θk|x) [instead of the above joint]
• 81. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Congdon's (2006) extension Selecting flat [prohibited] pseudo-priors, uses instead Ẑk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) πk(θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t)) πj(θj^(t)) ], where again the θk^(t)'s are MCMC chains with stationary distributions πk(θk|x)
• 82. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples Example (Model choice) Model M1: x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model M2: x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights (0.5) on both models. [Figure: approximations of P(M = 1|x) as a function of y, Scott's (2002) in blue and Congdon's (2006) in red, N = 10^6 simulations.]
• 84. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples (2) Example (Model choice (2)) Normal model M1: x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2: x ∼ N(θ, 1) with θ ∼ N(5, 1). [Figure: comparison of both approximations with P(M = 1|x) as a function of y, Scott's (2002) in green and mixed dashes, Congdon's (2006) in brown and long dashes, N = 10^4 simulations.]
• 85. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples (3) Example (Model choice (3)) Model M1: x ∼ N(0, 1/ω) with ω ∼ Exp(a) vs. M2: exp(x) ∼ Exp(λ) with λ ∼ Exp(b). [Figure: comparison of Congdon's (2006) approximation (brown and dashed lines) with P(M = 1|x) when (a, b) equals (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081), resp., N = 10^4 simulations.]
  • 86. On some computational methods for Bayesian model choice Nested sampling Nested sampling 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling Purpose Implementation Error rates Constraints Importance variant A mixture comparison
• 87. On some computational methods for Bayesian model choice Nested sampling Purpose Nested sampling: Goal Skilling's (2007) technique using the one-dimensional representation: Z = Eπ[L(θ)] = ∫_0^1 ϕ(x) dx with ϕ^{-1}(l) = P^π(L(θ) > l). Note: ϕ(·) is intractable in most cases.
• 88. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: First approximation Approximate Z by a Riemann sum: Ẑ = Σ_{i=1}^N (x_{i−1} − x_i) ϕ(x_i) where the x_i's are either deterministic: x_i = e^{−i/N}, or random: x_0 = 1, x_{i+1} = t_i x_i, t_i ∼ Be(N, 1), so that E[log x_i] = −i/N.
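A quick numerical check (not in the slides; N and i are arbitrary choices) that the random shrinkage scheme satisfies E[log x_i] = −i/N, matching the deterministic schedule x_i = e^{−i/N}:

```python
import numpy as np

rng = np.random.default_rng(2)
N, i = 100, 250                            # pool size and iteration index (illustrative)

# random shrinkage: x_0 = 1, x_{j+1} = t_j x_j with t_j ~ Beta(N, 1)
reps = 10_000
log_xi = np.log(rng.beta(N, 1, size=(reps, i))).sum(axis=1)
print("E[log x_i] (Monte Carlo):", log_xi.mean())   # close to -i/N
print("deterministic log x_i   :", -i / N)
```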
• 89. On some computational methods for Bayesian model choice Nested sampling Implementation Extraneous white noise Take Z = ∫ e^{−θ} dθ = ∫ δ^{−1} e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[δ^{−1} e^{−(1−δ)θ}], Ẑ = (1/N) Σ_{i=1}^N δ^{−1} e^{−(1−δ)θ_i} (x_{i−1} − x_i), θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1}). Comparison of variances and MSEs:
N = 50:  deterministic 4.64, 10.5;  random 4.65, 10.5
N = 100: deterministic 2.47, 4.9;   random 2.48, 5.02
N = 500: deterministic .549, 1.01;  random .550, 1.14
• 91. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Alternative representation Another representation is Z = Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i which is a special case of Z = Σ_{i=0}^{N−1} {L(θ_{(i+1)}) − L(θ_{(i)})} π({θ; L(θ) > L(θ_{(i)})}) where · · · L(θ_{(i+1)}) < L(θ_{(i)}) · · · [Lebesgue version of Riemann's sum]
• 93. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Second approximation Estimate (intractable) ϕ(x_i) by ϕ_i: Nested sampling Start with N values θ1, . . . , θN sampled from π. At iteration i, 1 Take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the pool of θ_i's 2 Replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then sampled from the prior constrained to L(θ) > ϕ_i. Note that π({θ; L(θ) > L(θ_{(i+1)})}) / π({θ; L(θ) > L(θ_{(i)})}) ≈ 1 − 1/N
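The following Python sketch runs this scheme on a toy problem of my own, a U(0, 1) prior with a Gaussian-shaped likelihood, where the constrained prior can be sampled by plain rejection; the deterministic schedule x_i = e^{−i/N} and a truncation level ε follow the neighbouring slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma = 0.1
L = lambda th: np.exp(-0.5 * ((th - 0.5) / sigma) ** 2)   # likelihood; prior is U(0, 1)
true_Z = sigma * np.sqrt(2 * np.pi) * (stats.norm.cdf(5) - stats.norm.cdf(-5))

N, eps = 200, 1e-3
theta = rng.uniform(size=N)                # N points from the prior
Z, x_prev, i = 0.0, 1.0, 0
while np.exp(-i / N) > eps:                # stopping rule: truncate at prior mass eps
    i += 1
    k = np.argmin(L(theta))                # worst point in the pool
    phi_i = L(theta[k])
    x_i = np.exp(-i / N)                   # deterministic prior-mass schedule
    Z += (x_prev - x_i) * phi_i
    x_prev = x_i
    # replace the worst point by a prior draw constrained to L(theta) > phi_i
    # (plain rejection suffices in this toy example; MCMC is used in practice)
    while True:
        cand = rng.uniform()
        if L(cand) > phi_i:
            theta[k] = cand
            break

print("nested sampling estimate:", Z, " true Z:", true_Z)
```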
• 97. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Third approximation Iterate the above steps until a given stopping iteration j is reached: e.g., observe very small changes in the approximation Ẑ; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level ε, i.e. replace ∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx
• 98. On some computational methods for Bayesian model choice Nested sampling Error rates Approximation error Error = Ẑ − Z = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i − ∫_0^1 ϕ(x) dx = − ∫_0^ε ϕ(x) dx + [ Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx ] (Quadrature Error) + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ_i − ϕ(x_i)} (Stochastic Error) [Dominated by Monte Carlo]
• 99. On some computational methods for Bayesian model choice Nested sampling Error rates A CLT for the Stochastic Error The (dominating) stochastic error is OP(N^{−1/2}): N^{1/2} (Ẑ − Z) →_D N(0, V) with V = − ∫∫_{s,t∈[ε,1]} s ϕ′(s) t ϕ′(t) log(s ∨ t) ds dt. [Proof based on Donsker's theorem]
• 100. On some computational methods for Bayesian model choice Nested sampling Error rates What of log Z? If the interest lies within log Z, Slutsky's transform of the CLT: N^{1/2} (log Ẑ − log Z) →_D N(0, V/Z²) Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then j = N ⌈− log ε⌉.
  • 103. On some computational methods for Bayesian model choice Nested sampling Constraints Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]
• 106. On some computational methods for Bayesian model choice Nested sampling Constraints Banana illustration Case of a banana target made of a twisted 2D normal: x2 → x2 + β x1² − 100β [Haario, Saksman, Tamminen, 1999] β = .03, σ = (100, 1)
• 107. On some computational methods for Bayesian model choice Nested sampling Constraints Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and a final sample of 50000 points, using nine Student's t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physical Review D] [Figures: estimation of the evidence, of E[X1] and of E[X2]]
• 110. On some computational methods for Bayesian model choice Nested sampling Importance variant An IS variant of nested sampling Consider an instrumental prior π̃ and likelihood L̃, weight function w(θ) = π(θ)L(θ) / {π̃(θ)L̃(θ)} and weighted NS estimator Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i w(θ_i). Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, Id) constrained to ||c − θ|| < r.
• 112. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior Posterior on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ². Occurrences of posterior bursts for µ = x_i
• 113. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment [Figures: MCMC sample for n = 16 observations from the mixture; nested sampling sequence with M = 1000 starting points.]
• 114. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment [Figures: MCMC sample for n = 50 observations from the mixture; nested sampling sequence with M = 1000 starting points.]
• 115. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison 1 Nested sampling: M = 1000 points, with 10 random walk moves at each step, simulations from the constr'd prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 10^4 MCMC (=Gibbs) simulations producing non-parametric estimates ϕ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Z1, Z2, Z3 using a product of two Gaussian kernels 4 Numerical integration based on an 850 × 950 grid [reference value, confirmed by Chib's]
• 116. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Figure: estimates V1–V4 based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).]
• 117. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Figure: estimates V1–V4 based on a sample of 50 observations for µ = 2 and σ = 3/2 (150 replicas).]
• 118. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Figure: estimates V1–V4 based on a sample of 100 observations for µ = 2 and σ = 3/2 (150 replicas).]
  • 119. On some computational methods for Bayesian model choice ABC model choice Approximate Bayesian computation 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling 5 ABC model choice ABC method ABC-PMC ABC for model choice in GRFs Illustrations
• 120. On some computational methods for Bayesian model choice ABC model choice ABC method Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ), x ∼ f(x|θ′), until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
• 123. On some computational methods for Bayesian model choice ABC model choice ABC method A as approximative When y is a continuous random variable, the equality x = y is replaced with a tolerance condition, ρ(x, y) ≤ ε, where ρ is a distance between summary statistics Output distributed from π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
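A minimal sketch of the tolerance-based ABC rejection sampler on a normal-mean toy of my own (data, prior, summary statistic and ε are illustrative assumptions); the exact posterior mean is printed for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, size=20)              # observed data (simulated here)
s_obs = y.mean()                               # summary statistic

def abc_rejection(n_accept, eps):
    """Keep prior draws whose simulated summary falls within eps of the observed one."""
    kept = []
    while len(kept) < n_accept:
        theta = rng.normal(0.0, 5.0)           # theta' ~ pi(theta) = N(0, 5^2)
        x = rng.normal(theta, 1.0, size=20)    # x ~ f(x | theta')
        if abs(x.mean() - s_obs) <= eps:       # tolerance condition rho(x, y) <= eps
            kept.append(theta)
    return np.array(kept)

sample = abc_rejection(500, eps=0.1)
exact_mean = (5**2 / (5**2 + 1 / 20)) * s_obs  # conjugate normal posterior mean
print("ABC posterior mean:", sample.mean(), " exact posterior mean:", exact_mean)
```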
  • 125. On some computational methods for Bayesian model choice ABC model choice ABC method Dynamic area
• 126. On some computational methods for Bayesian model choice ABC model choice ABC method ABC improvements Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y... [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007] ...or view the problem as conditional density estimation and develop techniques to allow for a larger ε [Beaumont et al., 2002]
• 129. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-MCMC Markov chain (θ(t)) created via the transition function: θ(t+1) = θ′ if θ′ ∼ K(θ′|θ(t)), x ∼ f(x|θ′) is such that x = y, and u ∼ U(0, 1) ≤ π(θ′)K(θ(t)|θ′) / {π(θ(t))K(θ′|θ(t))}; θ(t+1) = θ(t) otherwise. This chain has the posterior π(θ|y) as stationary distribution [Marjoram et al, 2003]
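A sketch of the ABC-MCMC kernel, again on an assumed normal-mean toy, with a symmetric random-walk proposal and the tolerance version of the matching condition; starting the chain from a well-matching value (here the observed summary) is a practical choice of mine, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.0, size=20)
s_obs, eps = y.mean(), 0.1

def log_prior(theta):                     # theta ~ N(0, 5^2)
    return -0.5 * (theta / 5.0) ** 2

theta, chain = s_obs, []                  # start from a high-match value
for t in range(50_000):
    prop = theta + 0.3 * rng.normal()     # symmetric random-walk kernel K
    x = rng.normal(prop, 1.0, size=20)    # pseudo-data from f(. | theta')
    # move only if the simulated summary is within eps of the observed one AND the
    # usual Metropolis-Hastings check passes (kernel symmetric, so prior ratio only)
    if abs(x.mean() - s_obs) <= eps and \
       np.log(rng.random()) < log_prior(prop) - log_prior(theta):
        theta = prop
    chain.append(theta)

print("ABC-MCMC posterior mean:", np.mean(chain[5_000:]))
```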
• 131. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC Another sequential version producing a sequence of Markov transition kernels Kt and of samples (θ1^(t), . . . , θN^(t)) (1 ≤ t ≤ T) ABC-PRC Algorithm 1 Pick θ⋆ at random among the previous θi^(t−1)'s, with probabilities ωi^(t−1) (1 ≤ i ≤ N). 2 Generate θi^(t) ∼ Kt(θ|θ⋆), x ∼ f(x|θi^(t)), 3 Check that ρ(x, y) < ε, otherwise start again. [Sisson et al., 2007]
• 133. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC weight Probability ωi^(t) computed as ωi^(t) ∝ π(θi^(t)) Lt−1(θ⋆|θi^(t)) {π(θ⋆) Kt(θi^(t)|θ⋆)}^{-1}, where Lt−1 is an arbitrary transition kernel. In case Lt−1(θ′|θ) = Kt(θ|θ′), all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels Lt−1 in SMC to achieve unbiasedness
• 136. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC bias Lack of unbiasedness of the method Joint density of the accepted pair (θ(t−1), θ(t)) proportional to π(θ(t−1)|y) Kt(θ(t)|θ(t−1)) f(y|θ(t)). For an arbitrary function h(θ), E[ωt h(θ(t))] is proportional to ∫∫ h(θ(t)) [π(θ(t)) Lt−1(θ(t−1)|θ(t)) / {π(θ(t−1)) Kt(θ(t)|θ(t−1))}] π(θ(t−1)|y) Kt(θ(t)|θ(t−1)) f(y|θ(t)) dθ(t−1) dθ(t) ∝ ∫∫ h(θ(t)) π(θ(t−1)) f(y|θ(t−1)) [π(θ(t)) Lt−1(θ(t−1)|θ(t)) / {π(θ(t−1)) Kt(θ(t)|θ(t−1))}] × Kt(θ(t)|θ(t−1)) f(y|θ(t)) dθ(t−1) dθ(t) ∝ ∫ h(θ(t)) π(θ(t)|y) { ∫ Lt−1(θ(t−1)|θ(t)) f(y|θ(t−1)) dθ(t−1) } dθ(t).
• 138. On some computational methods for Bayesian model choice ABC model choice ABC method A mixture example (0) Toy model of Sisson et al. (2007): if θ ∼ U(−10, 10), x|θ ∼ 0.5 N(θ, 1) + 0.5 N(θ, 1/100), then the posterior distribution associated with y = 0 is the normal mixture θ|y = 0 ∼ 0.5 N(0, 1) + 0.5 N(0, 1/100) restricted to [−10, 10]. Furthermore, the true target is available as π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)).
• 139. On some computational methods for Bayesian model choice ABC model choice ABC method A mixture example (1) [Figure: comparison of τ = 0.15 and τ = 1/0.15 in Kt]
• 140. On some computational methods for Bayesian model choice ABC model choice ABC-PMC A PMC version Use of the same kernel idea as ABC-PRC but with IS correction Generate a sample at iteration t by π̂t(θ(t)) ∝ Σ_{j=1}^N ωj^(t−1) Kt(θ(t)|θj^(t−1)) modulo acceptance of the associated xt, and use an importance weight associated with an accepted simulation θi^(t): ωi^(t) ∝ π(θi^(t)) / π̂t(θi^(t)). Still likelihood free [Beaumont et al., 2009]
• 141. On some computational methods for Bayesian model choice ABC model choice ABC-PMC The ABC-PMC algorithm Given a decreasing sequence of approximation levels ε1 ≥ . . . ≥ εT, 1. At iteration t = 1, For i = 1, ..., N: Simulate θi^(1) ∼ π(θ) and x ∼ f(x|θi^(1)) until ρ(x, y) < ε1; Set ωi^(1) = 1/N. Take τ2² as twice the empirical variance of the θi^(1)'s. 2. At iteration 2 ≤ t ≤ T, For i = 1, ..., N, repeat: Pick θi⋆ from the θj^(t−1)'s with probabilities ωj^(t−1); generate θi^(t)|θi⋆ ∼ N(θi⋆, σt²) and x ∼ f(x|θi^(t)); until ρ(x, y) < εt. Set ωi^(t) ∝ π(θi^(t)) / Σ_{j=1}^N ωj^(t−1) ϕ(σt^{-1}(θi^(t) − θj^(t−1))). Take τ²_{t+1} as twice the weighted empirical variance of the θi^(t)'s
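A sketch of the ABC-PMC algorithm on the same assumed normal-mean toy, with an arbitrary decreasing tolerance sequence; note the importance weights π(θ)/Σ_j ω_j ϕ(·) and the "twice the weighted empirical variance" update of the kernel scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(2.0, 1.0, size=20)
s_obs, prior_sd = y.mean(), 5.0
N, eps_seq = 500, [1.0, 0.5, 0.25, 0.1]

def simulate_summary(theta):
    return rng.normal(theta, 1.0, size=20).mean()

# t = 1: plain ABC rejection at the largest tolerance
theta = np.empty(N)
for i in range(N):
    while True:
        cand = rng.normal(0.0, prior_sd)
        if abs(simulate_summary(cand) - s_obs) < eps_seq[0]:
            theta[i] = cand
            break
w = np.full(N, 1.0 / N)
tau2 = 2 * np.var(theta)                       # twice the empirical variance

# t >= 2: move particles with a Gaussian kernel and reweight by prior / mixture
for eps in eps_seq[1:]:
    new_theta = np.empty(N)
    for i in range(N):
        while True:
            pick = rng.choice(N, p=w)
            cand = rng.normal(theta[pick], np.sqrt(tau2))
            if abs(simulate_summary(cand) - s_obs) < eps:
                new_theta[i] = cand
                break
    mix = np.array([np.sum(w * stats.norm.pdf(th, theta, np.sqrt(tau2)))
                    for th in new_theta])
    new_w = stats.norm.pdf(new_theta, 0.0, prior_sd) / mix
    w = new_w / new_w.sum()
    tau2 = 2 * np.average((new_theta - np.average(new_theta, weights=w)) ** 2,
                          weights=w)            # twice the weighted empirical variance
    theta = new_theta

print("ABC-PMC posterior mean:", np.average(theta, weights=w))
```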
  • 142. On some computational methods for Bayesian model choice ABC model choice ABC-PMC A mixture example (2) Recovery of the target, whether using a fixed standard deviation of τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
• 143. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Gibbs random fields Gibbs distribution The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if f(y) = (1/Z) exp{ − Σ_{c∈C} Vc(yc) }, where Z is the normalising constant, C is the set of cliques of G and Vc is any function also called potential. U(y) = Σ_{c∈C} Vc(yc) is the energy function. Z is usually unavailable in closed form
• 145. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Potts model Potts model Vc(y) is of the form Vc(y) = θS(y) = θ Σ_{l∼i} δ_{yl=yi} where l∼i denotes a neighbourhood structure In most realistic settings, the summation Zθ = Σ_{x∈X} exp{θ^T S(x)} involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, 2009]
• 147. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Bayesian Model Choice Comparing a model with potential S0 taking values in R^{p0} versus a model with potential S1 taking values in R^{p1} can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space Bm0/m1(x) = ∫ exp{θ0^T S0(x)} / Zθ0,0 π0(dθ0) / ∫ exp{θ1^T S1(x)} / Zθ1,1 π1(dθ1) Use of Jeffreys' scale to select the most appropriate model
• 149. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Neighbourhood relations Choice to be made between M neighbourhood relations i ∼m i′ (0 ≤ m ≤ M − 1) with Sm(x) = Σ_{i ∼m i′} I{xi = xi′}, driven by the posterior probabilities of the models.
• 150. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm) Computational target: P(M = m|x) ∝ ∫_{Θm} fm(x|θm) πm(θm) dθm π(M = m)
• 152. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Sufficient statistics By definition, if S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), P(M = m|x) = P(M = m|S(x)). For each model m, own sufficient statistic Sm(·) and S(·) = (S0(·), . . . , SM−1(·)) also sufficient. For Gibbs random fields, x|M = m ∼ fm(x|θm) = fm^1(x|S(x)) fm^2(S(x)|θm) = {1/n(S(x))} fm^2(S(x)|θm) where n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)} S(x) is therefore also sufficient for the joint parameters [Specific to Gibbs random fields!]
• 155. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs ABC model choice Algorithm ABC-MC Generate m∗ from the prior π(M = m). Generate θ∗_{m∗} from the prior π_{m∗}(·). Generate x∗ from the model f_{m∗}(·|θ∗_{m∗}). Compute the distance ρ(S(x0), S(x∗)). Accept (θ∗_{m∗}, m∗) if ρ(S(x0), S(x∗)) < ε. [Cornuet, Grelaud, Marin & Robert, 2008] Note: When ε = 0 the algorithm is exact
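A sketch of ABC-MC on a hypothetical two-model normal example of my own (priors N(0, 1) and N(5, 1) on the mean, x itself as summary statistic, an arbitrary observed value and tolerance); the corrected frequency ratio discussed on the following slide is used for the Bayes factor.

```python
import numpy as np

rng = np.random.default_rng(7)
x0 = 1.0                                  # observed value (assumed)
eps, n_sim = 0.05, 200_000

prior_mean = [0.0, 5.0]                   # m0: theta ~ N(0,1); m1: theta ~ N(5,1)
kept = []
for _ in range(n_sim):
    m = rng.integers(2)                   # m* ~ pi(M = m), equal weights
    theta = rng.normal(prior_mean[m], 1)  # theta* ~ pi_{m*}
    x = rng.normal(theta, 1.0)            # x* ~ f_{m*}(. | theta*)
    if abs(x - x0) < eps:                 # here S(x) = x and rho is |.|
        kept.append(m)

kept = np.array(kept)
n0, n1 = np.sum(kept == 0), np.sum(kept == 1)
print("P(M = m0 | x0) approx:", n0 / len(kept))
print("BF m0/m1 estimate    :", (1 + n0) / (1 + n1))   # +1 correction against empty counts
```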
• 156. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs ABC approximation to the Bayes factor Frequency ratio: BF̂ m0/m1(x0) = [P̂(M = m0|x0) / P̂(M = m1|x0)] × [π(M = m1) / π(M = m0)] = [#{mi∗ = m0} / #{mi∗ = m1}] × [π(M = m1) / π(M = m0)], replaced with BF̂ m0/m1(x0) = [1 + #{mi∗ = m0}] / [1 + #{mi∗ = m1}] × π(M = m1) / π(M = m0) to avoid indeterminacy (also a Bayes estimate).
• 158. On some computational methods for Bayesian model choice ABC model choice Illustrations Toy example iid Bernoulli model versus two-state first-order Markov chain, i.e. f0(x|θ0) = exp( θ0 Σ_{i=1}^n I{xi=1} ) / {1 + exp(θ0)}^n, versus f1(x|θ1) = (1/2) exp( θ1 Σ_{i=2}^n I{xi=xi−1} ) / {1 + exp(θ1)}^{n−1}, with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by "phase transition" boundaries).
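A sketch of ABC-MC on this toy, using the two sufficient statistics S0 and S1; the observed sequence is itself simulated, and the sequence length, tolerance and number of proposals are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(8)

def S0(x):                 # sufficient statistic of the Bernoulli model
    return x.sum()
def S1(x):                 # sufficient statistic of the Markov-chain model
    return np.sum(x[1:] == x[:-1])

def sim_bernoulli(theta0, n):
    p = np.exp(theta0) / (1 + np.exp(theta0))
    return (rng.random(n) < p).astype(int)

def sim_markov(theta1, n):
    q = np.exp(theta1) / (1 + np.exp(theta1))   # P(x_i = x_{i-1})
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    for i in range(1, n):
        x[i] = x[i - 1] if rng.random() < q else 1 - x[i - 1]
    return x

n = 100
x0 = sim_markov(2.0, n)                          # pseudo-observed sequence
s_obs = np.array([S0(x0), S1(x0)])               # S(x) = (S_0(x), S_1(x)) is sufficient

kept = []
for _ in range(50_000):
    m = rng.integers(2)
    x = sim_bernoulli(rng.uniform(-5, 5), n) if m == 0 else sim_markov(rng.uniform(0, 6), n)
    if np.linalg.norm(np.array([S0(x), S1(x)]) - s_obs) <= 5:   # tolerance on S
        kept.append(m)

kept = np.array(kept)
print("BF m0/m1 estimate:", (1 + np.sum(kept == 0)) / (1 + np.sum(kept == 1)))
```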
• 159. On some computational methods for Bayesian model choice ABC model choice Illustrations Toy example (2) [Figures: (left) comparison of the true BF m0/m1(x0) with BF̂ m0/m1(x0) (in logs) over 2,000 simulations and 4·10^6 proposals from the prior; (right) same when using the tolerance corresponding to the 1% quantile on the distances.]
• 160. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding Superposition of the native structure (grey) with the ST1 structure (red), the ST2 structure (orange), the ST3 structure (green), and the DT structure (blue).
• 161. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding (2) Characteristics of the dataset:
Structure     | % seq. Id. | TM-score | FROST score
1i5nA (ST1)   | 32         | 0.86     | 75.3
1ls1A1 (ST2)  | 5          | 0.42     | 8.9
1jr8A (ST3)   | 4          | 0.24     | 8.9
1s7oA (DT)    | 10         | 0.08     | 7.8
% seq. Id.: percentage of identity with the query sequence. TM-score: similarity between predicted and native structure (uncertainty between 0.17 and 0.4). FROST score: quality of alignment of the query onto the candidate structure (uncertainty between 7 and 9).
• 162. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding (3)
               | NS/ST1 | NS/ST2 | NS/ST3 | NS/DT
BF             | 1.34   | 1.22   | 2.42   | 2.76
P(M = NS|x0)   | 0.573  | 0.551  | 0.708  | 0.734
Estimates of the Bayes factors between model NS and models ST1, ST2, ST3, and DT, and corresponding posterior probabilities of model NS, based on an ABC-MC algorithm using 1.2 × 10^6 simulations and a tolerance equal to the 1% quantile of the distances.