Let’s Practice What We Preach:
Likelihood Methods for Monte Carlo Data

Xiao-Li Meng
Department of Statistics, Harvard University

September 24, 2011

Based on
Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions);
Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift);
Tan (2004, JASA); ..., Meng and Tan (201X)
Importance sampling (IS)

- Estimand:
  $$c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\, p_2(x)\,\mu(dx).$$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\} \sim p_2 = q_2/c_2$
- Estimating Equation (EE):
  $$r \equiv \frac{c_1}{c_2} = E_2\!\left[\frac{q_1(X)}{q_2(X)}\right].$$
- The EE estimator:
  $$\hat r = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}.$$
- Standard IS estimator for $c_1$ when $c_2 = 1$.
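To make the EE/IS estimator concrete, here is a minimal Python sketch (my own illustration, not from the slides); q1 and q2 are hypothetical one-dimensional kernels, with q2 normalized so that $c_2 = 1$ and $\hat r$ estimates $c_1$ directly.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical choices: q1 is an unnormalized N(1, 0.5^2) kernel (true c1 = 0.5*sqrt(2*pi)),
    # q2 is the standard normal density, so c2 = 1 and p2 = q2.
    q1 = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)
    q2 = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

    n2 = 10_000
    x2 = rng.standard_normal(n2)                 # draws X_{i2} from p2

    r_hat = np.mean(q1(x2) / q2(x2))             # the EE / importance sampling estimator
    print(r_hat, 0.5 * np.sqrt(2.0 * np.pi))     # estimate vs. the true value of c1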
What about MLE?

- The “likelihood” is:
  $$f(X_{12}, \ldots, X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2}) \quad\text{— free of the estimand } c_1!$$
- So why are $\{X_{i2},\ i = 1, \ldots, n_2\}$ even relevant?
- Violation of the likelihood principle?
- What are we “inferring”?
- What is the “unknown” model parameter?
Bridge sampling (BS)

- Data: $\{X_{ij},\ i = 1, \ldots, n_j\} \sim p_j = q_j/c_j,\ j = 1, 2$
- Estimating Equation (Meng and Wong, 1996):
  $$r \equiv \frac{c_1}{c_2} = \frac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}, \qquad \forall\,\alpha:\ 0 < \Big|\int \alpha\, q_1 q_2\, d\mu\Big| < \infty$$
- Optimal choice: $\alpha_O(x) \propto [n_1 q_1(x) + n_2 r q_2(x)]^{-1}$
- Optimal estimator $\hat r_O$, the limit of
  $$\hat r_O^{(t+1)} = \frac{\dfrac{1}{n_2}\displaystyle\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}}{\dfrac{1}{n_1}\displaystyle\sum_{i=1}^{n_1} \dfrac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}},$$
  where $s_j = n_j/(n_1 + n_2)$.
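A small Python sketch of this iteration (my illustration of the displayed recursion; q1, q2, x1, x2 are assumed inputs):

    import numpy as np

    def bridge_ratio(q1, q2, x1, x2, n_iter=50, r_init=1.0):
        """Iterative optimal bridge sampling estimate of r = c1/c2 (Meng and Wong, 1996).
        q1, q2: unnormalized density functions; x1 ~ p1, x2 ~ p2 (arrays of draws)."""
        n1, n2 = len(x1), len(x2)
        s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
        q1_x1, q2_x1 = q1(x1), q2(x1)
        q1_x2, q2_x2 = q1(x2), q2(x2)
        r = r_init
        for _ in range(n_iter):
            num = np.mean(q1_x2 / (s1 * q1_x2 + s2 * r * q2_x2))
            den = np.mean(q2_x1 / (s1 * q1_x1 + s2 * r * q2_x1))
            r = num / den
        return r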
What about MLE?

- The “likelihood” is:
  $$\prod_{j=1}^{2}\prod_{i=1}^{n_j} \frac{q_j(X_{ij})}{c_j} \;\propto\; c_1^{-n_1} c_2^{-n_2} \quad\text{— free of data!}$$
- What went wrong: $c_j$ is not a “free parameter” because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.
- So what is the “unknown” model parameter?
- Turns out $\hat r_O$ is the same as Bennett’s (1976) optimal acceptance ratio estimator, as well as Geyer’s (1994) reversed logistic regression estimator.
- So why is that? Can it be improved upon without any “sleight of hand”?
Pretending the measure is unknown!

- Because
  $$c = \int_\Gamma q(x)\,\mu(dx),$$
  and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ “unknown” is to assume the underlying measure $\mu$ is “unknown”.
- This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence estimate/infer $\mu$ since $q$ is known.
- Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.
Importance Sampling Likelihood

- Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2(x)\,\mu(dx)$
- Likelihood for $\mu$:
  $$L(\mu) = \prod_{i=1}^{n_2} c_2^{-1}\, q_2(X_{i2})\,\mu(X_{i2}).$$
  Note that $c_2$ is a functional of $\mu$.
- The nonparametric MLE of $\mu$ is
  $$\hat\mu(dx) = \frac{\hat P(dx)}{q_2(x)}, \qquad \hat P \text{ — empirical measure.}$$
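A quick derivation of this form (my reconstruction; not spelled out on the slide): restrict $\mu$ to point masses $w_i$ at the observed draws, so that $c_2(\mu) = \sum_k q_2(X_{k2})\,w_k$ and

$$L(w) = \prod_{i=1}^{n_2} \frac{q_2(X_{i2})\,w_i}{\sum_{k=1}^{n_2} q_2(X_{k2})\,w_k}.$$

$L$ is unchanged if all the $w_i$ are rescaled by a common factor, so fix the scale by $\sum_k q_2(X_{k2})\,w_k = 1$; maximizing $\prod_i w_i$ under this constraint (Lagrange multipliers) gives $w_i = [\,n_2\, q_2(X_{i2})\,]^{-1}$, i.e. $\hat\mu(dx) = \hat P(dx)/q_2(x)$, and the MLE of any $c_1$ follows by plugging $\hat\mu$ into $\int_\Gamma q_1\,d\mu$.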
Importance Sampling Likelihood

- Thus the MLE for $r \equiv c_1/c_2$ is
  $$\hat r = \int q_1(x)\,\hat\mu(dx) = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}.$$
- When $c_2 = 1$, $q_2 = p_2$, and the standard IS estimator for $c_1$ is obtained.
- $\{X_{i2},\ i = 1, \ldots, n_2\}$ is (minimally) sufficient for $\mu$ on $S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.
Bridge Sampling Likelihood

- Estimand: $\propto c_j = \int_\Gamma q_j(x)\,\mu(dx),\ j = 1, \ldots, J$.
- Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx),\ 1 \le j \le J$
- Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$
- Writing $\theta(x) = \log \mu(x)$, then
  $$\log L(\mu) = n\int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta),$$
  where $n = \sum_j n_j$ and $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.
Bridge Sampling Likelihood

- The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:
  $$n\hat P(dx) = \sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)\,\hat\mu(dx),
  \qquad\text{i.e.}\qquad
  \hat\mu(dx) = \frac{n\hat P(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)}. \qquad (A)$$
- Consequently, the MLE for $\{c_1, \ldots, c_J\}$ must satisfy
  $$\hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s\, \hat c_s^{-1} q_s(x_{ij})}. \qquad (B)$$
- (B) is the “dual” equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
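As an illustration (my sketch, not the authors' code), equation (B) can be solved by a simple fixed-point iteration over the pooled draws; since the $c_j$'s enter (B) only through ratios, each iterate is rescaled so that $\hat c_1 = 1$.

    import numpy as np

    def mle_constants(q_funcs, samples, n_iter=200):
        """Self-consistent iteration for eq. (B): ratios of normalizing constants
        from multiple samplers.  q_funcs: list of J unnormalized densities;
        samples[j]: draws from q_funcs[j]/c_j.  Returns c_hat with c_hat[0] = 1,
        since the c's are only identified up to a common scale here."""
        J = len(q_funcs)
        n = np.array([len(x) for x in samples], dtype=float)
        x_all = np.concatenate(samples)                        # pool all draws
        q = np.vstack([qs(x_all) for qs in q_funcs])           # q[s, k] = q_s(x_k)
        c = np.ones(J)
        for _ in range(n_iter):
            denom = (n[:, None] / c[:, None] * q).sum(axis=0)  # sum_s n_s c_s^{-1} q_s(x_k)
            c = (q / denom).sum(axis=1)                        # eq. (B) for each r
            c = c / c[0]                                       # fix the scale: c_1 = 1
        return c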
But We Can Ignore Less ...

- Restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$; that is, set up a sub-model.
- The new MLE has a smaller asymptotic variance under the sub-model than under the full model.
- Examples:
    - Group-invariance sub-model
    - Linear sub-model
    - Log-linear sub-model
A Universally Improved IS

- Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2\,\mu(dx)$
- Taking $G = \{I_d, -I_d\}$ leads to
  $$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}.$$
- Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.
- Need twice as many evaluations, but typically this is a small insurance premium.
- Consider $S_1 = \mathbb{R}$ and $S_2 = \mathbb{R}^+$. Then $\hat r_G$ is consistent for $r$:
  $$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})} + \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(-X_{i2})}{q_2(X_{i2})},$$
  but standard IS $\hat r$ only estimates $\int_0^\infty q_1(x)\,\mu(dx)/c_2$.
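A quick numerical illustration of the last two bullets (my own example; the kernels are hypothetical): take $q_1$ to be a standard normal kernel on all of $\mathbb{R}$ and $q_2$ the Exp(1) density on $\mathbb{R}^+$, so standard IS misses the negative half-line while $\hat r_G$ does not.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical example with S1 = R, S2 = R^+ (my choice, not from the slides):
    q1 = lambda x: np.exp(-0.5 * x ** 2)                    # c1 = sqrt(2*pi), support R
    q2 = lambda x: np.where(x > 0, np.exp(-x), 0.0)         # c2 = 1, support R^+ (Exp(1))

    n2 = 100_000
    x = rng.exponential(size=n2)                            # draws from p2

    r_is = np.mean(q1(x) / q2(x))                           # standard IS: targets only int_0^inf q1 dmu / c2
    r_g  = np.mean((q1(x) + q1(-x)) / (q2(x) + q2(-x)))     # group-averaged, G = {I, -I}

    print(r_is, r_g, np.sqrt(2 * np.pi))                    # ~1.25 vs ~2.51 (= true r)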
There are many more improvements ...

- Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.
- The new MLE of $\mu$ is
  $$\hat\mu^G(dx) = \frac{n\hat P^G(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1}\, \bar q_j^{\,G}(x)},$$
  where $\hat P^G(A) = \mathrm{ave}_{g\in G}\,\hat P(gA)$ and $\bar q_j^{\,G}(x) = \mathrm{ave}_{g\in G}\, q_j(gx)$.
- When the draws are i.i.d. within each $p_s\,d\mu$,
  $$\hat\mu^G = E[\hat\mu \mid GX],$$
  i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.
- Consequently,
  $$\hat c_j^{\,G} = \int_\Gamma q_j(x)\,\hat\mu^G(dx) = E[\hat c_j \mid GX].$$
Using Groups to model trade-off

- If $G_1 \supseteq G_2$, then
  $$\mathrm{Var}\big(\hat c^{\,G_1}\big) \le \mathrm{Var}\big(\hat c^{\,G_2}\big).$$
- The statistical efficiency increases with the size of $G_i$, but so does the computational cost needed for function evaluation (but not for sampling, because there are no additional samples involved).
Linear submodel: stratified sampling (Tan 2004)

- Data: $\{X_{ij},\ 1 \le i \le n_j\} \overset{\text{i.i.d.}}{\sim} p_j(x)\,\mu(dx),\ 1 \le j \le J$.
- The sub-model has parameter space
  $$\Big\{\mu:\ \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J,\ \text{are equal (to 1)}\Big\}.$$
- Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$
- The MLE is
  $$\hat\mu_{\mathrm{lin}}(dx) = \frac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x)},$$
  where the $\hat\pi_j$'s are MLEs from a mixture model:
  $$\text{the data} \overset{\text{i.i.d.}}{\sim} \sum_{j=1}^{J} \pi_j\, p_j(\cdot) \quad\text{with the } \pi_j\text{'s unknown.}$$
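A sketch of the computational piece (my reconstruction; the function name mixture_weights_em is mine): the $\hat\pi_j$'s are the MLE of the mixture proportions, which a plain EM iteration delivers since the component densities $p_j$ are known.

    import numpy as np

    def mixture_weights_em(p_funcs, x_all, n_iter=500):
        """MLE of the mixture proportions pi_j in sum_j pi_j p_j(.), treating the
        pooled draws x_all as i.i.d. from the mixture (stratum labels ignored).
        p_funcs are the normalized densities p_j."""
        J = len(p_funcs)
        p = np.vstack([pj(x_all) for pj in p_funcs])       # p[j, k] = p_j(x_k)
        pi = np.full(J, 1.0 / J)
        for _ in range(n_iter):
            w = pi[:, None] * p                            # E-step: unnormalized responsibilities
            w /= w.sum(axis=0, keepdims=True)
            pi = w.mean(axis=1)                            # M-step: update proportions
        return pi

    # mu_hat_lin then puts mass (1/n) / sum_j pi_hat_j p_j(x_k) at each pooled draw x_k, so
    # int q d(mu_hat_lin) = (1/n) sum_k q(x_k) / sum_j pi_hat_j p_j(x_k)  (the "Lik" estimator below).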
So why MLE?

- Goal: to estimate $c = \int_\Gamma q(x)\,\mu(dx)$.
- For an arbitrary vector $b$, consider the control-variate estimator (Owen and Zhou 2000)
  $$\hat c_b \equiv \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q(x_{ji}) - b^{\top} g(x_{ji})}{\sum_{s=1}^{J} n_s\, p_s(x_{ji})},$$
  where $g = (p_2 - p_1, \ldots, p_J - p_1)^{\top}$.
- A more general class: for $\sum_{j=1}^{J}\lambda_j(x) \equiv 1$ and $\sum_{j=1}^{J}\lambda_j(x)\, b_j(x) \equiv b$, consider (Veach and Guibas 1995 for $b_j \equiv 0$; Tan, 2004)
  $$\hat c_{\lambda,B} = \sum_{j=1}^{J}\frac{1}{n_j}\sum_{i=1}^{n_j} \lambda_j(x_{ji})\, \frac{q(x_{ji}) - b_j^{\top}(x_{ji})\, g(x_{ji})}{p_j(x_{ji})}.$$
- Should $\hat c_{\lambda,B}$ be more efficient than $\hat c_b$? Could there be something even more efficient?
Three estimators for $c = \int_\Gamma q(x)\,\mu(dx)$:

- IS:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$
  where $\pi_j = n_j/n$ are the true proportions.
- Reg:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i) - \hat\beta^{\top} g(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$
  where $\hat\beta$ is the estimated regression coefficient, ignoring stratification.
- Lik:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x_i)},$$
  where the $\hat\pi_j$'s are the estimated proportions, ignoring stratification.
- Which one is most efficient? Least efficient?
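A minimal Python sketch of the three estimators (my implementation choices: $\hat\beta$ by ordinary least squares of the IS integrand on the control variates, and $\hat\pi$ by the same label-free EM as in the earlier sketch; the name three_estimators is mine).

    import numpy as np

    def three_estimators(q, p_funcs, samples, em_iters=500):
        """'IS', 'Reg', and 'Lik' estimates of c = int q dmu from stratified draws
        samples[j] ~ p_j; q and the p_j's map an (N, d) array of draws to (N,) values."""
        n_j = np.array([len(s) for s in samples], dtype=float)
        x = np.concatenate(samples)
        n = x.shape[0]
        p = np.vstack([pj(x) for pj in p_funcs])            # p[j, i] = p_j(x_i)
        g = p[1:] - p[0]                                     # control variates g = (p_2 - p_1, ...)
        m_true = (n_j / n) @ p                               # mixture density, true proportions

        y = q(x) / m_true
        c_is = y.mean()                                      # IS (stratified) estimator

        Z = np.column_stack([np.ones(n), (g / m_true).T])    # regress y on the control variates
        beta = np.linalg.lstsq(Z, y, rcond=None)[0][1:]
        c_reg = np.mean((q(x) - beta @ g) / m_true)          # Reg (control variates)

        pi = np.full(len(p_funcs), 1.0 / len(p_funcs))       # EM for pi-hat, labels ignored
        for _ in range(em_iters):
            w = pi[:, None] * p
            w /= w.sum(axis=0, keepdims=True)
            pi = w.mean(axis=1)
        c_lik = np.mean(q(x) / (pi @ p))                     # Lik (constrained MLE)
        return c_is, c_reg, c_lik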
Let’s find out ...

- $\Gamma = \mathbb{R}^{10}$ and $\mu$ is Lebesgue measure.
- The integrand is
  $$q(x) = 0.8\prod_{j=1}^{10}\phi(x^j) + 0.2\prod_{j=1}^{10}\psi(x^j; 4),$$
  where $\phi(\cdot)$ is the standard normal density and $\psi(\cdot; 4)$ is the $t_4$ density.
- Two sampling designs:
    (i) $q_2(x)$ with $n$ draws, or
    (ii) $q_1(x)$ and $q_2(x)$ each with $n/2$ draws,
  where
  $$q_1(x) = \prod_{j=1}^{10}\phi(x^j), \qquad q_2(x) = \prod_{j=1}^{10}\psi(x^j; 1).$$
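A sketch of this design in Python (my reconstruction; scipy's norm, t, and cauchy densities stand in for $\phi$, $\psi(\cdot;4)$, and $\psi(\cdot;1)$). Design (ii) is shown; the pieces can be fed to the three_estimators sketch above.

    import numpy as np
    from scipy.stats import norm, t, cauchy

    rng = np.random.default_rng(2)
    d, n = 10, 500

    # Target integrand: q(x) = 0.8 * prod phi(x^j) + 0.2 * prod psi(x^j; 4), so the true c = 1.
    q  = lambda x: 0.8 * norm.pdf(x).prod(axis=1) + 0.2 * t.pdf(x, df=4).prod(axis=1)
    q1 = lambda x: norm.pdf(x).prod(axis=1)          # 10-dim standard normal density
    q2 = lambda x: cauchy.pdf(x).prod(axis=1)        # 10-dim product-Cauchy (t_1) density

    # Design (ii): n/2 draws from each of q1 and q2.
    x1 = rng.standard_normal((n // 2, d))
    x2 = rng.standard_t(df=1, size=(n // 2, d))      # t_1 = Cauchy
    samples = [x1, x2]

    # These pieces can be plugged into the three_estimators sketch above, e.g.:
    # c_is, c_reg, c_lik = three_estimators(q, [q1, q2], samples)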
A little surprise?



                       Table: Comparison of design and estimator

                              one sampler                    two samplers
                     IS        Reg        Lik         IS        Reg        Lik
    Sqrt MSE       .162     .00942     .00931      .0175     .00881     .00881
    Std Err        .162     .00919     .00920      .0174     .00885     .00884

    Note: Sqrt MSE is √(mean squared error of the point estimates) and
    Std Err is √(mean of the variance estimates), from 10000 repeated
    simulations of size n = 500. (A short sketch of these two summaries
    follows this slide.)




   Xiao-Li Meng (Harvard)              MCMC+likelihood                September 24, 2011   19 / 23
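
For completeness, a small sketch (assumed, simply mirroring the note under the
table) of the two summaries computed over the 10000 replications; the true value
is c = 1 here because both product densities integrate to one.

    import numpy as np

    def summarize(point_estimates, variance_estimates, c_true=1.0):
        point_estimates = np.asarray(point_estimates, dtype=float)
        sqrt_mse = np.sqrt(np.mean((point_estimates - c_true) ** 2))  # Sqrt MSE column
        std_err = np.sqrt(np.mean(np.asarray(variance_estimates)))    # Std Err column
        return sqrt_mse, std_err
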
Comparison of efficiency:




   Statistical efficiency: IS < Reg ≈ Lik
   IS is a stratified estimator, which uses only the labels.
    Reg is the conventional method of control variates (a sketch follows this slide).
    Lik is the constrained MLE, which uses the pj ’s but ignores the labels;
    it is exact if q = pj for any particular j.







  Xiao-Li Meng (Harvard)       MCMC+likelihood          September 24, 2011   20 / 23
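
A minimal sketch (my own illustration of the conventional control-variates recipe,
not code from the talk) of the Reg estimator for J = 2: under the pooled mixture
m = π1 p1 + π2 p2 the ratio (p2 − p1)/m integrates to zero, so it is a control
variate with known mean, and β̂ is the ordinary least-squares slope computed from
the pooled draws, ignoring stratification.

    import numpy as np

    def reg_estimator(q_vals, p1_vals, p2_vals, pi=(0.5, 0.5)):
        m = pi[0] * p1_vals + pi[1] * p2_vals          # pooled mixture density at the draws
        y = q_vals / m                                 # plain IS terms
        h = (p2_vals - p1_vals) / m                    # control variate with E[h] = 0
        beta = (np.sum((h - h.mean()) * (y - y.mean()))
                / np.sum((h - h.mean()) ** 2))         # OLS slope, ignoring stratification
        return np.mean(y - beta * h)                   # = (1/n) sum_i [q - beta*(p2 - p1)] / m
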
Building intuition ...




    Suppose we make n = 2 draws, one from N(0, 1) and one from
    Cauchy (0, 1), hence π1 = π2 = 50%.
    Suppose the draws are {1, 1}, what would be the MLE (π̂1 , π̂2 )?
    Suppose the draws are {1, 3}, what would be the MLE (π̂1 , π̂2 )?
    Suppose the draws are {3, 3}, what would be the MLE (π̂1 , π̂2 )?
    (A small numerical sketch follows this slide.)







   Xiao-Li Meng (Harvard)     MCMC+likelihood         September 24, 2011   21 / 23
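
One way to answer the three questions is to profile the two-draw likelihood
L(π1) = ∏i [π1 φ(xi) + (1 − π1) ψ(xi; 1)] over π1 ∈ [0, 1], with ψ(·; 1) the
Cauchy(0, 1) density; the grid search below is a minimal sketch of mine, not the talk's.

    import numpy as np
    from scipy import stats

    def pi1_mle(draws, grid=np.linspace(0.0, 1.0, 10001)):
        draws = np.asarray(draws, dtype=float)
        f_norm = stats.norm.pdf(draws)                 # N(0,1) density at each draw
        f_cauchy = stats.cauchy.pdf(draws)             # Cauchy(0,1) density at each draw
        loglik = [np.sum(np.log(p * f_norm + (1 - p) * f_cauchy)) for p in grid]
        return grid[int(np.argmax(loglik))]            # MLE of pi1; pi2-hat = 1 - pi1-hat

    for draws in ([1, 1], [1, 3], [3, 3]):
        print(draws, pi1_mle(draws))
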
What Did I Learn?




   Model what we ignore, not what we know!
   Model comparison/selection is not about which model is true (as all
   of them are “true”), but which model represents a better compromise
   among human, computational, and statistical efficiency.
    There is a cure for our “schizophrenia”: we can now analyze Monte
    Carlo data using the same sound statistical principles and methods we use
    for analyzing real data.







  Xiao-Li Meng (Harvard)      MCMC+likelihood         September 24, 2011   22 / 23
If you are looking for theoretical research topics ...




     RE-EXAMINE OLD ONES AND DERIVE NEW ONES!
             Prove it is the MLE, or a good approximation to the MLE.
             Or derive the MLE, or a cost-effective approximation to it.
    Markov chain Monte Carlo (Tan 2006, 2008)
    More ......







   Xiao-Li Meng (Harvard)           MCMC+likelihood           September 24, 2011   23 / 23
