Machine Learning
Algorithms and Theory of Approximate Inference

Eric Xing
Lecture 15, August 15, 2010

Reading:
[Figure: elimination / message-passing examples on a four-node graph with nodes X1–X4]

Inference Problems
 Compute the likelihood of observed data
 Compute the marginal distribution over a particular subset of nodes
 Compute the conditional distribution for disjoint subsets A and B
 Compute a mode of the density

Inference in GM
 HMM
[Figure: an HMM with hidden states y1, ..., yT and observations x1, ..., xT]
 A general BN
[Figure: a general Bayesian network over nodes A–H]

Inference Problems
 Compute the likelihood of observed data
 Compute the marginal distribution over a particular subset of nodes
 Compute the conditional distribution for disjoint subsets A and B
 Compute a mode of the density
 Methods we have: Brute force => Elimination => Message Passing (Forward-backward, Max-product/BP, Junction Tree)
   Brute force: individual computations independent
   Elimination and message passing: sharing intermediate terms

Recall forward-backward on HMM
[Figure: HMM chain with hidden states y1, ..., yT and emissions x1, ..., xT]
 Forward algorithm:
  \alpha_t^k \equiv p(x_{1:t}, y_t^k = 1) = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}
 Backward algorithm:
  \beta_t^k \equiv p(x_{t+1:T} \mid y_t^k = 1) = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
 Combining the two:
  p(y_t^k = 1, x) = \alpha_t^k \beta_t^k, \qquad p(y_t^k = 1 \mid x) = \alpha_t^k \beta_t^k / p(x)

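The recursions above translate directly into code. Below is a minimal NumPy sketch (my own illustration, not part of the lecture); the transition matrix A, emission matrix B, initial distribution pi, and observation sequence obs are hypothetical inputs. For long chains one would rescale each alpha_t and beta_t to avoid numerical underflow.

    import numpy as np

    def forward_backward(A, B, pi, obs):
        """Exact smoothing on an HMM.
        A[i, k]: transition prob i -> k; B[k, o]: emission prob of symbol o in state k;
        pi[k]: initial state distribution; obs: sequence of observed symbol indices."""
        T, K = len(obs), len(pi)
        alpha = np.zeros((T, K))  # alpha[t, k] = p(x_{1:t}, y_t = k)
        beta = np.zeros((T, K))   # beta[t, k]  = p(x_{t+1:T} | y_t = k)

        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)

        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        likelihood = alpha[-1].sum()        # p(x_{1:T})
        gamma = alpha * beta / likelihood   # gamma[t, k] = p(y_t = k | x_{1:T})
        return gamma, likelihood
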
Message passing for trees
Let m_{ji}(x_i) denote the factor resulting from eliminating variables from below up to i, which is a function of x_i:
  m_{ji}(x_i) = \sum_{x_j} \psi(x_j)\, \psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)
This is reminiscent of a message sent from j to i: m_{ji}(x_i) represents a "belief" about x_i from x_j!
[Figure: a tree fragment with node i, its child j, and j's children k and l]

The General Sum-Product Algorithm
 Tree-structured GMs
 Message Passing on Trees
 On trees, the messages converge to a unique fixed point after a finite number of iterations

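A minimal sketch of sum-product on a tree-structured pairwise MRF (my own illustration, not the lecture's code); the node potentials psi, edge potentials psi2, and tree structure are hypothetical inputs.

    import numpy as np
    from collections import defaultdict

    def sum_product_tree(nodes, edges, psi, psi2, root=0):
        """Exact marginals on a tree-structured pairwise MRF.
        nodes: list of node ids; edges: list of (i, j) tree edges;
        psi[i]: node potential (array over states of i);
        psi2[(i, j)]: edge potential matrix, rows = states of i, cols = states of j."""
        nbrs = defaultdict(list)
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)

        msg = {}  # msg[(j, i)]: message from j to i, an array over states of i

        def pot(i, j):
            # Edge potential oriented so rows index states of i, cols states of j.
            return psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

        def collect(j, i):
            # Upward pass: send j -> i after collecting from j's other neighbors.
            incoming = np.ones_like(psi[j])
            for k in nbrs[j]:
                if k != i:
                    collect(k, j)
                    incoming *= msg[(k, j)]
            msg[(j, i)] = pot(i, j) @ (psi[j] * incoming)

        def distribute(i, j):
            # Downward pass: send i -> j, then recurse into j's subtree.
            incoming = np.ones_like(psi[i])
            for k in nbrs[i]:
                if k != j:
                    incoming *= msg[(k, i)]
            msg[(i, j)] = pot(j, i) @ (psi[i] * incoming)
            for k in nbrs[j]:
                if k != i:
                    distribute(j, k)

        for c in nbrs[root]:
            collect(c, root)
        for c in nbrs[root]:
            distribute(root, c)

        beliefs = {}
        for i in nodes:
            b = psi[i].copy()
            for k in nbrs[i]:
                b *= msg[(k, i)]
            beliefs[i] = b / b.sum()   # exact marginal p(x_i)
        return beliefs
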
Junction Tree Revisited
 General Algorithm on Graphs with Cycles
 Steps: => Triangulation => Construct JTs => Message Passing on Clique Trees
[Figure: two cliques B and C joined by a separator S]

Local Consistency
 Given a set of functions associated with the cliques and separator sets
 They are locally consistent if the separator marginals agree:
  \sum_{x_{B \setminus S}} \tau_B(x_B) = \tau_S(x_S) for every clique B and adjacent separator S
 For junction trees, local consistency is equivalent to global consistency!

An Ising model on 2-D image
 Nodes encode hidden information (patch identity).
 They receive local information from the image (brightness, color).
 Information is propagated through the graph over its edges.
 Edges encode 'compatibility' between nodes.
[Figure: image patches on a grid, with the question "air or water?"]

Why Approximate Inference?
 Why can't we just run junction tree on this graph?
  p(X) = \frac{1}{Z} \exp\Big\{ \sum_{(i,j) \in E} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i \Big\}
 If N x N grid, tree width is at least N
 N can be a huge number (~1000s of pixels)
 If N ~ O(1000), we have a clique with 2^1000 entries

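To make the blow-up concrete (my own back-of-envelope, under the slide's assumptions): triangulating an N x N grid yields cliques of roughly N binary variables, and a single clique potential then needs 2^N entries.

    N = 1000                     # grid side length; tree width is at least N
    entries = 2 ** N             # entries in one clique potential over N binary variables
    print(len(str(entries)))     # 302 decimal digits -- far beyond any memory
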
Solution 1: Belief Propagation on loopy graphs
[Figure: node i collects messages M_{ki} from its neighbors k and sends a message to j]
 BP Message-update Rules:
  M_{ji}(x_i) \propto \sum_{x_j} \psi_{ij}(x_i, x_j)\, \psi_j(x_j) \prod_{k \in N(j) \setminus i} M_{kj}(x_j)
  (\psi_{ij}: compatibilities/interactions; \psi_j: external evidence)
  b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} M_{ki}(x_i)
 May not converge, or may converge to a wrong solution

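A minimal sketch of the loopy variant (mine, under the same hypothetical psi/psi2 conventions as above): initialize all messages to uniform and iterate the same update in parallel until the messages stop changing, with no convergence guarantee.

    import numpy as np

    def loopy_bp(nodes, edges, psi, psi2, max_iters=100, tol=1e-6):
        """Loopy BP on a pairwise MRF; returns approximate beliefs.
        Assumes a connected graph; uses a parallel ("flooding") update schedule."""
        nbrs = {i: [] for i in nodes}
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)
        pot = lambda i, j: psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

        # Initialize all directed messages to uniform.
        msg = {(j, i): np.ones_like(psi[i]) / len(psi[i])
               for i in nodes for j in nbrs[i]}
        for _ in range(max_iters):
            new = {}
            for (j, i) in msg:
                if len(nbrs[j]) > 1:
                    incoming = np.prod([msg[(k, j)] for k in nbrs[j] if k != i], axis=0)
                else:
                    incoming = np.ones_like(psi[j])
                m = pot(i, j) @ (psi[j] * incoming)
                new[(j, i)] = m / m.sum()        # normalize for numerical stability
            delta = max(np.abs(new[e] - msg[e]).max() for e in msg)
            msg = new
            if delta < tol:                      # may never trigger on loopy graphs
                break

        beliefs = {}
        for i in nodes:
            b = psi[i] * np.prod([msg[(k, i)] for k in nbrs[i]], axis=0)
            beliefs[i] = b / b.sum()
        return beliefs
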
Recall BP on trees
[Figure: the same message-passing picture, now on a tree]
 BP Message-update Rules (identical to the loopy case):
  M_{ji}(x_i) \propto \sum_{x_j} \psi_{ij}(x_i, x_j)\, \psi_j(x_j) \prod_{k \in N(j) \setminus i} M_{kj}(x_j)
  b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} M_{ki}(x_i)
 BP on trees always converges to the exact marginals

Solution 2: The naive mean field approximation
 Approximate p(X) by a fully factorized q(X) = \prod_i q_i(X_i)
 For the Boltzmann distribution p(X) = \exp\{ \sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i \}/Z, the mean field equation is:
  q_i(X_i) \propto \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i \langle X_j \rangle_{q_j} \Big\} = p\big(X_i \mid \{ \langle X_j \rangle_{q_j} : j \in N_i \}\big)
 \langle X_j \rangle_{q_j} resembles a "message" sent from node j to i
 \{ \langle X_j \rangle_{q_j} : j \in N_i \} forms the "mean field" applied to X_i from its neighborhood

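As a concrete illustration (mine, not the lecture's): coordinate-ascent mean field for a Boltzmann machine with X_i in {-1, +1}, where the mean field equation above reduces to mu_i = tanh(theta_i0 + sum_j theta_ij mu_j). The coupling matrix theta (symmetric, zero diagonal) and field theta0 are hypothetical inputs.

    import numpy as np

    def naive_mean_field(theta, theta0, iters=200, tol=1e-8):
        """Mean field for p(X) ~ exp(sum_{i<j} theta_ij X_i X_j + sum_i theta_i0 X_i),
        X_i in {-1, +1}. Returns mu[i] ~ E_q[X_i]."""
        mu = np.zeros(len(theta0))           # initialize the mean field
        for _ in range(iters):
            old = mu.copy()
            for i in range(len(mu)):         # sweep nodes one at a time
                mu[i] = np.tanh(theta0[i] + theta[i] @ mu)
            if np.abs(mu - old).max() < tol:
                break
        return mu
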
Recall Gibbs sampling
 Mean field approximated p(X) by a fully factorized q(X) = \prod_i q_i(X_i); compare with sampling:
 For the Boltzmann distribution p(X) = \exp\{ \sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i \}/Z, the Gibbs predictive distribution is:
  p(X_i \mid x_{N_i}) \propto \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i x_j \Big\} = p\big(X_i \mid \{ x_j : j \in N_i \}\big)
 where x_j are the current sampled values of the neighbors of i

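For contrast (again my own sketch, same hypothetical theta/theta0 conventions): a Gibbs sampler draws each X_i from the predictive distribution above, given the current values of its neighbors, instead of updating a deterministic mean.

    import numpy as np

    def gibbs_ising(theta, theta0, n_sweeps=1000, rng=None):
        """Gibbs sampling for the +/-1 Boltzmann distribution; one sample per sweep."""
        rng = np.random.default_rng() if rng is None else rng
        x = rng.choice([-1.0, 1.0], size=len(theta0))    # random initial configuration
        samples = []
        for _ in range(n_sweeps):
            for i in range(len(x)):
                m = theta0[i] + theta[i] @ x             # field from current neighbor states
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * m))  # p(X_i = +1 | rest)
                x[i] = 1.0 if rng.random() < p_plus else -1.0
            samples.append(x.copy())
        return np.array(samples)

Averaging the samples estimates the same marginal means that mean field approximates deterministically.
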
Summary So Far
 Exact inference methods are limited to tree-structured graphs
 Junction tree methods are exponentially expensive in the tree-width
 Message passing methods can be applied to loopy graphs, but lack analysis!
 Mean field is convergent, but can get stuck in local optima
 Where do these two algorithms come from? Do they make sense?

Next Step …
 Develop a general theory of variational inference
 Introduce some approximate inference methods
 Provide deep understanding of some popular methods

Exponential Family GMs
 Canonical Parameterization:
  p_\theta(x) = \exp\{ \langle \theta, \phi(x) \rangle - A(\theta) \}
  (\theta: canonical parameters; \phi(x): sufficient statistics; A(\theta): log-normalization function)
 Effective canonical parameters
 Regular family: the domain \{ \theta : A(\theta) < \infty \} is open
 Minimal representation: if there does not exist a nonzero vector a such that \langle a, \phi(x) \rangle is a constant

Examples
 Ising Model (binary r.v.: {-1, +1})
 Gaussian MRF

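The equations on this slide did not survive extraction; the standard exponential-family forms (my reconstruction, following the conventions the lecture appears to use) are:

    % Ising model, x in {-1, +1}^|V|
    p_\theta(x) = \exp\Big\{ \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j - A(\theta) \Big\}

    % Gaussian MRF
    p_\theta(x) = \exp\Big\{ \langle \theta, x \rangle + \tfrac{1}{2}\, \mathrm{tr}(\Theta\, x x^\top) - A(\theta, \Theta) \Big\}
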
Mean Parameterization
 The mean parameter \mu_\alpha associated with a sufficient statistic \phi_\alpha is defined as
  \mu_\alpha = E_p[\phi_\alpha(X)]
 Realizable mean parameter set:
  \mathbb{M} := \{ \mu \in \mathbb{R}^d : \exists\, p \text{ such that } E_p[\phi(X)] = \mu \}
 A convex subset of \mathbb{R}^d
 A convex hull in the discrete case
 A convex polytope when \mathcal{X} is finite

Convex Polytope
 Convex hull representation
 Half-plane based representation
 Minkowski-Weyl Theorem:
   any polytope can be characterized by a finite collection of linear inequality constraints

Example
 Two-node Ising Model
 Convex hull representation
 Half-plane representation
 Probability Theory:

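The slide's equations are lost; for the two-node model with x in {0, 1}^2 and mean parameters \mu_1 = E[X_1], \mu_2 = E[X_2], \mu_{12} = E[X_1 X_2], the standard characterization (my reconstruction) is:

    \mathbb{M} = \mathrm{conv}\{ (0,0,0), (1,0,0), (0,1,0), (1,1,1) \}
               = \{ \mu : \mu_{12} \ge 0, \;\; \mu_1 - \mu_{12} \ge 0, \;\; \mu_2 - \mu_{12} \ge 0, \;\; 1 + \mu_{12} - \mu_1 - \mu_2 \ge 0 \}

Each half-plane inequality is the nonnegativity of one joint probability, e.g. 1 + \mu_{12} - \mu_1 - \mu_2 = p(X_1 = 0, X_2 = 0) \ge 0, which is presumably the "Probability Theory" bullet.
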
Marginal Polytope
 Canonical Parameterization: the standard overcomplete representation with indicator functions
 Mean parameterization: \mu_{i;s} = p(X_i = s), \; \mu_{ij;st} = p(X_i = s, X_j = t)
 Marginal distributions over nodes and edges
 Marginal Polytope: the set of all node and edge marginals realizable by some joint distribution

Conjugate Duality
 Duality between MLE and Max-Ent: the conjugate dual of the log-partition function is
  A^*(\mu) = \sup_\theta \{ \langle \theta, \mu \rangle - A(\theta) \}
 For all \mu \in \mathbb{M}^\circ, there is a unique canonical parameter \theta(\mu) satisfying the moment-matching condition E_{\theta(\mu)}[\phi(X)] = \mu, and A^*(\mu) is the negative entropy of p_{\theta(\mu)}
 The log-partition function has the variational form
  A(\theta) = \sup_{\mu \in \mathbb{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}   (*)
 For all \theta, the supremum in (*) is attained uniquely at the \mu specified by the moment-matching conditions
 Bijection for minimal exponential families

Roles of Mean Parameters
 Forward Mapping:
   From \theta to the mean parameters \mu
   A fundamental class of inference problems in exponential family models
 Backward Mapping:
   From mean parameters \mu back to \theta
   Parameter estimation to learn the unknown \theta

Example
 Bernoulli: \phi(x) = x, \; A(\theta) = \log(1 + e^\theta), \; \mathbb{M} = [0, 1]
 If \mu \in (0, 1): the supremum is attained at \theta(\mu) = \log\frac{\mu}{1 - \mu}. Unique!
 If \mu \in \{0, 1\}: there is no gradient stationary point in the Opt. problem (**); the supremum is approached only as \theta \to \pm\infty
 Reverse mapping: \mu(\theta) = \frac{e^\theta}{1 + e^\theta}. Unique!

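Working the conjugate computation out explicitly (my addition; it follows directly from A(\theta) = \log(1 + e^\theta)):

    A^*(\mu) = \sup_\theta \{ \theta \mu - \log(1 + e^\theta) \}
    stationarity: \mu = \frac{e^\theta}{1 + e^\theta} \;\Rightarrow\; \theta(\mu) = \log\frac{\mu}{1 - \mu}
    A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu)

so A^*(\mu) is exactly the negative entropy of a Bernoulli(\mu) variable, as the duality slide asserts.
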
Variational Inference In General
 An umbrella term that refers to various mathematical tools for optimization-based formulations of problems, as well as associated techniques for their solution
 General idea:
   Express a quantity of interest as the solution of an optimization problem
   The optimization problem can be relaxed in various ways:
     Approximate the function to be optimized
     Approximate the set over which the optimization takes place
 Goes in parallel with MCMC

A Tree-Based Outer-Bound to \mathbb{M}(G)
 Locally Consistent (Pseudo-) Marginal Polytope:
  \mathbb{L}(G) = \Big\{ \tau \ge 0 : \sum_{x_i} \tau_i(x_i) = 1 \;\text{(normalization)}, \;\; \sum_{x_j} \tau_{ij}(x_i, x_j) = \tau_i(x_i) \;\text{(marginalization)} \Big\}
 Relation to \mathbb{M}(G):
   \mathbb{M}(G) \subseteq \mathbb{L}(G) holds for any graph
   \mathbb{M}(G) = \mathbb{L}(G) holds for tree-structured graphs

An Example
 A three-node cycle (binary r.v.)
[Figure: a cycle over nodes 1, 2, 3]
 For any \mu \in \mathbb{M}(G), we have \mu \in \mathbb{L}(G)
 For this cyclic graph, we have \mathbb{L}(G) \ne \mathbb{M}(G): some \tau \in \mathbb{L}(G) lie outside \mathbb{M}(G)
 an exercise?

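A classic counterexample in the spirit of the exercise (my addition): on the binary 3-cycle, take

    \tau_i = (\tfrac{1}{2}, \tfrac{1}{2}) \;\; \forall i, \qquad
    \tau_{ij} = \begin{pmatrix} 0 & \tfrac{1}{2} \\ \tfrac{1}{2} & 0 \end{pmatrix} \;\; \forall (i,j) \in E

These pseudo-marginals satisfy normalization and marginalization, so \tau \in \mathbb{L}(G); but they insist that every pair of neighbors disagree with probability one, which is impossible around an odd cycle, so \tau \notin \mathbb{M}(G).
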
Bethe Entropy Approximation
 Approximate the negative entropy A^*(\mu), which doesn't have a closed form for general graphs
 Entropy on a tree (Marginals):
   recall the tree factorization: p_\mu(x) = \prod_{i \in V} \mu_i(x_i) \prod_{(i,j) \in E} \frac{\mu_{ij}(x_i, x_j)}{\mu_i(x_i)\, \mu_j(x_j)}
   entropy: H(p_\mu) = \sum_{i \in V} H_i(\mu_i) - \sum_{(i,j) \in E} I_{ij}(\mu_{ij})
 Bethe entropy approximation (Pseudo-marginals): apply the same expression to any \tau \in \mathbb{L}(G):
  H_{\mathrm{Bethe}}(\tau) = \sum_{i \in V} H_i(\tau_i) - \sum_{(i,j) \in E} I_{ij}(\tau_{ij})

Bethe Variational Problem (BVP)
 We already have:
   a convex (polyhedral) outer bound \mathbb{L}(G)
   the Bethe approximate entropy H_{\mathrm{Bethe}}
 Combining the two ingredients, we have
  \max_{\tau \in \mathbb{L}(G)} \big\{ \langle \theta, \tau \rangle + H_{\mathrm{Bethe}}(\tau) \big\}
 a simple structured problem (differentiable & the constraint set is a simple polytope)
 Sum-product is the solver!
(Bethe approximation: Hans Bethe, Nobel Prize in Physics, 1967)

Connection to Sum-Product Alg.
 Lagrangian method for BVP: attach Lagrange multipliers \lambda to the local consistency constraints; the stationarity conditions recover the message-update rules
 Sum-product and Bethe Variational (Yedidia et al., 2002):
   For any graph G, any fixed point of the sum-product updates specifies a pair (\tau^*, \lambda^*) such that the Lagrangian stationarity conditions hold
   For a tree-structured MRF, the solution is unique, where \tau^* corresponds to the exact singleton and pairwise marginal distributions of the MRF, and the optimal value of the BVP is equal to the log-partition function A(\theta)

Proof
[Derivation shown on the original slide; not recoverable from the extraction]

Discussions
 The connection provides a principled basis for applying the sum-product algorithm to loopy graphs
 However,
   this connection provides no guarantees on the convergence of the sum-product alg. on loopy graphs
   the Bethe variational problem is usually non-convex, so there are no guarantees on finding the global optimum
   generally, there are no guarantees that the Bethe optimal value is a lower bound on A(\theta)
 However, however,
   the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product alg., via progressively better approximations to the entropy function and outer bounds on the marginal polytope!

Inexactness of Bethe and Sum-Product
 From the Bethe entropy approximation
 Example: [Figure: a four-node cycle 1–2–3–4]
 From the pseudo-marginal outer bound
 strict inclusion: \mathbb{M}(G) \subsetneq \mathbb{L}(G)

Summary of LBP
 Variational methods in general turn inference into an optimization problem
 However, both the objective function and the constraint set are hard to deal with
 The Bethe variational approximation is a tree-based approximation to both the objective function and the marginal polytope
 Belief propagation is a Lagrangian-based solver for the BVP
 Generalized BP extends BP to solve the generalized hyper-tree based variational approximation problem

Tractable Subgraph
 Given a GM with a graph G, a subgraph F is tractable if we can perform exact inference on it
 Example: [Figure: a grid graph G and two tractable subgraphs: the fully disconnected graph and a spanning tree]

Mean Parameterization
 For an exponential family GM defined with graph G and sufficient statistics \phi, the realizable mean parameter set is
  \mathbb{M}(G; \phi) := \{ \mu : \exists\, p \text{ such that } E_p[\phi(X)] = \mu \}
 For a given tractable subgraph F, the subset of mean parameters realizable by distributions that factor according to F is of interest:
  \mathbb{M}(F; \phi) \subseteq \mathbb{M}(G; \phi)
 Inner Approximation: \mathbb{M}(F; \phi) is an inner approximation to \mathbb{M}(G; \phi)

Optimizing a Lower Bound
 Any mean parameter \mu \in \mathbb{M}^\circ yields a lower bound on the log-partition function:
  A(\theta) \ge \langle \theta, \mu \rangle - A^*(\mu)
 Moreover, equality holds iff \theta and \mu are dually coupled, i.e., \mu = E_\theta[\phi(X)]
 Proof Idea: (Jensen's Inequality)
 Optimizing the lower bound gives
  \sup_{\mu \in \mathbb{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}
 This is an inference!

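Filling in the one-line Jensen argument the slide gestures at (my reconstruction): for any distribution q with mean parameter \mu_q = E_q[\phi(X)],

    A(\theta) = \log \sum_x \exp\{ \langle \theta, \phi(x) \rangle \}
              = \log \sum_x q(x)\, \frac{\exp\{ \langle \theta, \phi(x) \rangle \}}{q(x)}
              \ge \langle \theta, \mu_q \rangle + H(q)

and taking q to be the maximum-entropy distribution with mean \mu_q gives H(q) = -A^*(\mu_q), which is the bound on the slide.
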
Mean Field Methods In General
 However, the lower bound can't be explicitly evaluated in general
 Because the dual function A^* typically lacks an explicit form
 Mean Field Methods:
   Approximate the lower bound
   Approximate the realizable mean parameter set by a tractable subset \mathbb{M}(F)
 The MF optimization problem:
  \max_{\tau \in \mathbb{M}(F)} \{ \langle \theta, \tau \rangle - A_F^*(\tau) \}
 Still a lower bound?

KL-divergence
 Kullback-Leibler Divergence:
  D(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}
 For two exponential family distributions with the same sufficient statistics:
   Primal Form: D(\theta_1 \,\|\, \theta_2) = A(\theta_2) - A(\theta_1) - \langle \mu_1, \theta_2 - \theta_1 \rangle
   Mixed Form: D(\mu_1 \,\|\, \theta_2) = A(\theta_2) + A^*(\mu_1) - \langle \mu_1, \theta_2 \rangle
   Dual Form: D(\mu_1 \,\|\, \mu_2) = A^*(\mu_1) - A^*(\mu_2) - \langle \theta_2, \mu_1 - \mu_2 \rangle

Mean Field and KL-divergence
 Optimizing the lower bound
  \max_{\tau \in \mathbb{M}(F)} \{ \langle \theta, \tau \rangle - A^*(\tau) \}
 is equivalent to minimizing a KL-divergence, since by the mixed form
  A(\theta) - \{ \langle \theta, \tau \rangle - A^*(\tau) \} = D(\tau \,\|\, \theta)
 Therefore, we are doing D(q \,\|\, p) minimization

Naïve Mean Field
 Fully factorized variational distribution:
  q(x) = \prod_{i \in V} q_i(x_i)

Naïve Mean Field for Ising Model
 Sufficient statistics and Mean Parameters (x in {0, 1}):
  \mu_i = E[X_i] = p(X_i = 1), \qquad \mu_{ij} = E[X_i X_j] = p(X_i = 1, X_j = 1)
 Naïve Mean Field: under q(x) = \prod_i q_i(x_i), the pairwise means factor, \tau_{ij} = \tau_i \tau_j
 Realizable mean parameter subset:
  \mathbb{M}(F) = \{ \tau : \tau_i \in [0, 1], \; \tau_{ij} = \tau_i \tau_j \}
 Entropy: H(q) = -\sum_i [\, \tau_i \log \tau_i + (1 - \tau_i) \log(1 - \tau_i) \,]
 Optimization Problem:
  \max_{\tau \in [0,1]^{|V|}} \Big\{ \sum_i \theta_i \tau_i + \sum_{(i,j) \in E} \theta_{ij} \tau_i \tau_j + H(q) \Big\}

Naïve Mean Field for Ising Model
 Optimization Problem (from the previous slide)
 Update Rule (coordinate ascent):
  \tau_i \leftarrow \sigma\Big( \theta_i + \sum_{j \in N(i)} \theta_{ij} \tau_j \Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
 \tau_j resembles a "message" sent from node j to i
 \{ \tau_j : j \in N(i) \} forms the "mean field" applied to X_i from its neighborhood

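The update rule is one line of code; a minimal sketch (mine, with hypothetical theta/theta0 inputs, in the {0, 1} parameterization this slide uses):

    import numpy as np

    def naive_mf_ising01(theta, theta0, iters=200):
        """tau[i] ~ q(X_i = 1) for X in {0,1}^n; theta symmetric, zero diagonal."""
        tau = np.full(len(theta0), 0.5)
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        for _ in range(iters):
            for i in range(len(tau)):
                tau[i] = sigmoid(theta0[i] + theta[i] @ tau)  # the slide's update rule
        return tau
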
Non-Convexity of Mean Field
 Mean field optimization is always non-convex for any exponential family in which the state space is finite
   \mathbb{M} is a finite convex hull
   \mathbb{M}(F) contains all the extreme points of \mathbb{M} (the delta distributions are fully factorized)
   If \mathbb{M}(F) were a convex set, then \mathbb{M}(F) = \mathbb{M}
 Mean field has nonetheless been used successfully

Structured Mean Field
 Mean field theory applies to any tractable subgraph
 Naïve mean field is based on the fully disconnected subgraph
 Variants based on structured subgraphs can be derived

Topic models
[Figure: a topic model with prior (\mu, \Sigma), latent topic vector \gamma, topic indicators z, words w, and topics \beta; and its variational counterpart with free parameters \mu^*, \Sigma^*, \phi^*]
 Approximate the Integral: quantities such as p(z_1 \mid D) require an intractable integral over the latent variables
 Approximate the Posterior:
  p(\gamma, z_{1:n} \mid D) \approx q(\gamma, z_{1:n}) = q(\gamma \mid \mu^*, \Sigma^*) \prod_n q(z_n \mid \phi_n^*)
 Optimization Problem:
  (\mu^*, \Sigma^*, \phi_{1:n}^*) = \arg\min_{\mu, \Sigma, \phi_{1:n}} \mathrm{KL}(q \,\|\, p)
 Solve the optimization problem to obtain the variational parameters

Variational Inference With no Tears
[Ahmed and Xing, 2006; Xing et al., 2003]
 Fully Factored Distribution:
  q(\gamma, z_{1:n}) = q(\gamma) \prod_n q(z_n)
 Fixed Point Equations:
  q^*(\gamma) \propto p(\gamma \mid \langle S_z \rangle_{q_z}, \beta), with q(\gamma) \approx N(\mu, \Sigma) via a Laplace approximation
  q^*(z) \propto p(z \mid \langle \gamma \rangle_{q_\gamma}, \ldots), with q(z) = \mathrm{Multi}(\phi)
[Figure: the graphical model, with each factor updated by conditioning on expected sufficient statistics of the others]

Summary of GMF
 Message-passing algorithms (e.g., belief propagation, mean field) are solving approximate versions of an exact variational principle in exponential families
 There are two distinct components to the approximations:
   one can use either inner or outer bounds to \mathbb{M}
   various approximations to the entropy function
 BP: polyhedral outer bound and non-convex Bethe approximation
 MF: non-convex inner bound and exact form of entropy
 Kikuchi: tighter polyhedral outer bound and better entropy approximation
