Machine Learning
Algorithms and Theory of Approximate Inference

Eric Xing
Lecture 15, August 15, 2010

Reading:
[Figure: elimination / message-passing examples on a four-node graph with nodes X1–X4]

Inference Problems
 Compute the likelihood of observed data
 Compute the marginal distribution over a particular subset of nodes
 Compute the conditional distribution for disjoint subsets A and B
 Compute a mode of the density

Inference in GM
 HMM
[Figure: an HMM with hidden states y1, ..., yT and observations x1, ..., xT]
 A general BN
[Figure: a general Bayesian network over nodes A–H]

Inference Problems
 Compute the likelihood of observed data
 Compute the marginal distribution over a particular subset of nodes
 Compute the conditional distribution for disjoint subsets A and B
 Compute a mode of the density
 Methods we have: Brute force => Elimination => Message Passing (Forward-backward, Max-product/BP, Junction Tree)
   Brute force: individual computations independent
   Elimination and message passing: sharing intermediate terms

Recall forward-backward on HMM
[Figure: HMM chain with hidden states y1, ..., yT and emissions x1, ..., xT]
 Forward algorithm:
  \alpha_t^k \equiv p(x_{1:t}, y_t^k = 1) = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}
 Backward algorithm:
  \beta_t^k \equiv p(x_{t+1:T} \mid y_t^k = 1) = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
 Combining the two:
  p(y_t^k = 1, x) = \alpha_t^k \beta_t^k, \qquad p(y_t^k = 1 \mid x) = \alpha_t^k \beta_t^k / p(x)

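The recursions above translate directly into code. Below is a minimal NumPy sketch (my own illustration, not part of the lecture); the transition matrix A, emission matrix B, initial distribution pi, and observation sequence obs are hypothetical inputs. For long chains one would rescale each alpha_t and beta_t to avoid numerical underflow.

    import numpy as np

    def forward_backward(A, B, pi, obs):
        """Exact smoothing on an HMM.
        A[i, k]: transition prob i -> k; B[k, o]: emission prob of symbol o in state k;
        pi[k]: initial state distribution; obs: sequence of observed symbol indices."""
        T, K = len(obs), len(pi)
        alpha = np.zeros((T, K))  # alpha[t, k] = p(x_{1:t}, y_t = k)
        beta = np.zeros((T, K))   # beta[t, k]  = p(x_{t+1:T} | y_t = k)

        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)

        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        likelihood = alpha[-1].sum()        # p(x_{1:T})
        gamma = alpha * beta / likelihood   # gamma[t, k] = p(y_t = k | x_{1:T})
        return gamma, likelihood
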
Message passing for trees
Let m_{ji}(x_i) denote the factor resulting from eliminating variables from below up to i, which is a function of x_i:
  m_{ji}(x_i) = \sum_{x_j} \psi(x_j)\, \psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)
This is reminiscent of a message sent from j to i: m_{ji}(x_i) represents a "belief" about x_i from x_j!
[Figure: a tree fragment with node i, its child j, and j's children k and l]

The General Sum-Product Algorithm
 Tree-structured GMs
 Message Passing on Trees
 On trees, the messages converge to a unique fixed point after a finite number of iterations

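A minimal sketch of sum-product on a tree-structured pairwise MRF (my own illustration, not the lecture's code); the node potentials psi, edge potentials psi2, and tree structure are hypothetical inputs.

    import numpy as np
    from collections import defaultdict

    def sum_product_tree(nodes, edges, psi, psi2, root=0):
        """Exact marginals on a tree-structured pairwise MRF.
        nodes: list of node ids; edges: list of (i, j) tree edges;
        psi[i]: node potential (array over states of i);
        psi2[(i, j)]: edge potential matrix, rows = states of i, cols = states of j."""
        nbrs = defaultdict(list)
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)

        msg = {}  # msg[(j, i)]: message from j to i, an array over states of i

        def pot(i, j):
            # Edge potential oriented so rows index states of i, cols states of j.
            return psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

        def collect(j, i):
            # Upward pass: send j -> i after collecting from j's other neighbors.
            incoming = np.ones_like(psi[j])
            for k in nbrs[j]:
                if k != i:
                    collect(k, j)
                    incoming *= msg[(k, j)]
            msg[(j, i)] = pot(i, j) @ (psi[j] * incoming)

        def distribute(i, j):
            # Downward pass: send i -> j, then recurse into j's subtree.
            incoming = np.ones_like(psi[i])
            for k in nbrs[i]:
                if k != j:
                    incoming *= msg[(k, i)]
            msg[(i, j)] = pot(j, i) @ (psi[i] * incoming)
            for k in nbrs[j]:
                if k != i:
                    distribute(j, k)

        for c in nbrs[root]:
            collect(c, root)
        for c in nbrs[root]:
            distribute(root, c)

        beliefs = {}
        for i in nodes:
            b = psi[i].copy()
            for k in nbrs[i]:
                b *= msg[(k, i)]
            beliefs[i] = b / b.sum()   # exact marginal p(x_i)
        return beliefs
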
Junction Tree Revisited
 General Algorithm on Graphs with Cycles
 Steps: => Triangulation => Construct JTs => Message Passing on Clique Trees
[Figure: two cliques B and C joined by a separator S]

Local Consistency
 Given a set of functions associated with the cliques and separator sets
 They are locally consistent if the separator marginals agree:
  \sum_{x_{B \setminus S}} \tau_B(x_B) = \tau_S(x_S) for every clique B and adjacent separator S
 For junction trees, local consistency is equivalent to global consistency!

An Ising model on 2-D image
 Nodes encode hidden information (patch identity).
 They receive local information from the image (brightness, color).
 Information is propagated through the graph over its edges.
 Edges encode 'compatibility' between nodes.
[Figure: image patches on a grid, with the question "air or water?"]

Why Approximate Inference?
 Why can't we just run junction tree on this graph?
  p(X) = \frac{1}{Z} \exp\Big\{ \sum_{(i,j) \in E} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i \Big\}
 If N x N grid, tree width is at least N
 N can be a huge number (~1000s of pixels)
 If N ~ O(1000), we have a clique with 2^1000 entries

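To make the blow-up concrete (my own back-of-envelope, under the slide's assumptions): triangulating an N x N grid yields cliques of roughly N binary variables, and a single clique potential then needs 2^N entries.

    N = 1000                     # grid side length; tree width is at least N
    entries = 2 ** N             # entries in one clique potential over N binary variables
    print(len(str(entries)))     # 302 decimal digits -- far beyond any memory
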
Solution 1: Belief Propagation on loopy graphs
[Figure: node i collects messages M_{ki} from its neighbors k and sends a message to j]
 BP Message-update Rules:
  M_{ji}(x_i) \propto \sum_{x_j} \psi_{ij}(x_i, x_j)\, \psi_j(x_j) \prod_{k \in N(j) \setminus i} M_{kj}(x_j)
  (\psi_{ij}: compatibilities/interactions; \psi_j: external evidence)
  b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} M_{ki}(x_i)
 May not converge, or may converge to a wrong solution

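A minimal sketch of the loopy variant (mine, under the same hypothetical psi/psi2 conventions as above): initialize all messages to uniform and iterate the same update in parallel until the messages stop changing, with no convergence guarantee.

    import numpy as np

    def loopy_bp(nodes, edges, psi, psi2, max_iters=100, tol=1e-6):
        """Loopy BP on a pairwise MRF; returns approximate beliefs.
        Assumes a connected graph; uses a parallel ("flooding") update schedule."""
        nbrs = {i: [] for i in nodes}
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)
        pot = lambda i, j: psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

        # Initialize all directed messages to uniform.
        msg = {(j, i): np.ones_like(psi[i]) / len(psi[i])
               for i in nodes for j in nbrs[i]}
        for _ in range(max_iters):
            new = {}
            for (j, i) in msg:
                if len(nbrs[j]) > 1:
                    incoming = np.prod([msg[(k, j)] for k in nbrs[j] if k != i], axis=0)
                else:
                    incoming = np.ones_like(psi[j])
                m = pot(i, j) @ (psi[j] * incoming)
                new[(j, i)] = m / m.sum()        # normalize for numerical stability
            delta = max(np.abs(new[e] - msg[e]).max() for e in msg)
            msg = new
            if delta < tol:                      # may never trigger on loopy graphs
                break

        beliefs = {}
        for i in nodes:
            b = psi[i] * np.prod([msg[(k, i)] for k in nbrs[i]], axis=0)
            beliefs[i] = b / b.sum()
        return beliefs
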
Recall BP on trees
[Figure: the same message-passing picture, now on a tree]
 BP Message-update Rules (identical to the loopy case):
  M_{ji}(x_i) \propto \sum_{x_j} \psi_{ij}(x_i, x_j)\, \psi_j(x_j) \prod_{k \in N(j) \setminus i} M_{kj}(x_j)
  b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} M_{ki}(x_i)
 BP on trees always converges to the exact marginals

Solution 2: The naive mean field approximation
 Approximate p(X) by a fully factorized q(X) = \prod_i q_i(X_i)
 For the Boltzmann distribution p(X) = \exp\{ \sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i \}/Z, the mean field equation is:
  q_i(X_i) \propto \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i \langle X_j \rangle_{q_j} \Big\} = p\big(X_i \mid \{ \langle X_j \rangle_{q_j} : j \in N_i \}\big)
 \langle X_j \rangle_{q_j} resembles a "message" sent from node j to i
 \{ \langle X_j \rangle_{q_j} : j \in N_i \} forms the "mean field" applied to X_i from its neighborhood

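As a concrete illustration (mine, not the lecture's): coordinate-ascent mean field for a Boltzmann machine with X_i in {-1, +1}, where the mean field equation above reduces to mu_i = tanh(theta_i0 + sum_j theta_ij mu_j). The coupling matrix theta (symmetric, zero diagonal) and field theta0 are hypothetical inputs.

    import numpy as np

    def naive_mean_field(theta, theta0, iters=200, tol=1e-8):
        """Mean field for p(X) ~ exp(sum_{i<j} theta_ij X_i X_j + sum_i theta_i0 X_i),
        X_i in {-1, +1}. Returns mu[i] ~ E_q[X_i]."""
        mu = np.zeros(len(theta0))           # initialize the mean field
        for _ in range(iters):
            old = mu.copy()
            for i in range(len(mu)):         # sweep nodes one at a time
                mu[i] = np.tanh(theta0[i] + theta[i] @ mu)
            if np.abs(mu - old).max() < tol:
                break
        return mu
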
Recall Gibbs sampling
 Mean field approximated p(X) by a fully factorized q(X) = \prod_i q_i(X_i); compare with sampling:
 For the Boltzmann distribution p(X) = \exp\{ \sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i \}/Z, the Gibbs predictive distribution is:
  p(X_i \mid x_{N_i}) \propto \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i x_j \Big\} = p\big(X_i \mid \{ x_j : j \in N_i \}\big)
 where x_j are the current sampled values of the neighbors of i

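For contrast (again my own sketch, same hypothetical theta/theta0 conventions): a Gibbs sampler draws each X_i from the predictive distribution above, given the current values of its neighbors, instead of updating a deterministic mean.

    import numpy as np

    def gibbs_ising(theta, theta0, n_sweeps=1000, rng=None):
        """Gibbs sampling for the +/-1 Boltzmann distribution; one sample per sweep."""
        rng = np.random.default_rng() if rng is None else rng
        x = rng.choice([-1.0, 1.0], size=len(theta0))    # random initial configuration
        samples = []
        for _ in range(n_sweeps):
            for i in range(len(x)):
                m = theta0[i] + theta[i] @ x             # field from current neighbor states
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * m))  # p(X_i = +1 | rest)
                x[i] = 1.0 if rng.random() < p_plus else -1.0
            samples.append(x.copy())
        return np.array(samples)

Averaging the samples estimates the same marginal means that mean field approximates deterministically.
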
Summary So Far
 Exact inference methods are limited to tree-structured graphs
 Junction tree methods are exponentially expensive in the tree-width
 Message passing methods can be applied to loopy graphs, but lack analysis!
 Mean field is convergent, but can get stuck in local optima
 Where do these two algorithms come from? Do they make sense?

Next Step …
 Develop a general theory of variational inference
 Introduce some approximate inference methods
 Provide deep understanding of some popular methods

Exponential Family GMs
 Canonical Parameterization:
  p_\theta(x) = \exp\{ \langle \theta, \phi(x) \rangle - A(\theta) \}
  (\theta: canonical parameters; \phi(x): sufficient statistics; A(\theta): log-normalization function)
 Effective canonical parameters
 Regular family: the domain \{ \theta : A(\theta) < \infty \} is open
 Minimal representation: if there does not exist a nonzero vector a such that \langle a, \phi(x) \rangle is a constant

Examples
 Ising Model (binary r.v.: {-1, +1})
 Gaussian MRF

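The equations on this slide did not survive extraction; the standard exponential-family forms (my reconstruction, following the conventions the lecture appears to use) are:

    % Ising model, x in {-1, +1}^|V|
    p_\theta(x) = \exp\Big\{ \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j - A(\theta) \Big\}

    % Gaussian MRF
    p_\theta(x) = \exp\Big\{ \langle \theta, x \rangle + \tfrac{1}{2}\, \mathrm{tr}(\Theta\, x x^\top) - A(\theta, \Theta) \Big\}
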
Mean Parameterization
 The mean parameter \mu_\alpha associated with a sufficient statistic \phi_\alpha is defined as
  \mu_\alpha = E_p[\phi_\alpha(X)]
 Realizable mean parameter set:
  \mathbb{M} := \{ \mu \in \mathbb{R}^d : \exists\, p \text{ such that } E_p[\phi(X)] = \mu \}
 A convex subset of \mathbb{R}^d
 A convex hull in the discrete case
 A convex polytope when \mathcal{X} is finite

Convex Polytope
 Convex hull representation
 Half-plane based representation
 Minkowski-Weyl Theorem:
   any polytope can be characterized by a finite collection of linear inequality constraints

Example
 Two-node Ising Model
 Convex hull representation
 Half-plane representation
 Probability Theory:

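The slide's equations are lost; for the two-node model with x in {0, 1}^2 and mean parameters \mu_1 = E[X_1], \mu_2 = E[X_2], \mu_{12} = E[X_1 X_2], the standard characterization (my reconstruction) is:

    \mathbb{M} = \mathrm{conv}\{ (0,0,0), (1,0,0), (0,1,0), (1,1,1) \}
               = \{ \mu : \mu_{12} \ge 0, \;\; \mu_1 - \mu_{12} \ge 0, \;\; \mu_2 - \mu_{12} \ge 0, \;\; 1 + \mu_{12} - \mu_1 - \mu_2 \ge 0 \}

Each half-plane inequality is the nonnegativity of one joint probability, e.g. 1 + \mu_{12} - \mu_1 - \mu_2 = p(X_1 = 0, X_2 = 0) \ge 0, which is presumably the "Probability Theory" bullet.
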
Marginal Polytope
 Canonical Parameterization: the standard overcomplete representation with indicator functions
 Mean parameterization: \mu_{i;s} = p(X_i = s), \; \mu_{ij;st} = p(X_i = s, X_j = t)
 Marginal distributions over nodes and edges
 Marginal Polytope: the set of all node and edge marginals realizable by some joint distribution

Conjugate Duality
 Duality between MLE and Max-Ent: the conjugate dual of the log-partition function is
  A^*(\mu) = \sup_\theta \{ \langle \theta, \mu \rangle - A(\theta) \}
 For all \mu \in \mathbb{M}^\circ, there is a unique canonical parameter \theta(\mu) satisfying the moment-matching condition E_{\theta(\mu)}[\phi(X)] = \mu, and A^*(\mu) is the negative entropy of p_{\theta(\mu)}
 The log-partition function has the variational form
  A(\theta) = \sup_{\mu \in \mathbb{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}   (*)
 For all \theta, the supremum in (*) is attained uniquely at the \mu specified by the moment-matching conditions
 Bijection for minimal exponential families

Roles of Mean Parameters
 Forward Mapping:
   From \theta to the mean parameters \mu
   A fundamental class of inference problems in exponential family models
 Backward Mapping:
   From mean parameters \mu back to \theta
   Parameter estimation to learn the unknown \theta

Example
 Bernoulli: \phi(x) = x, \; A(\theta) = \log(1 + e^\theta), \; \mathbb{M} = [0, 1]
 If \mu \in (0, 1): the supremum is attained at \theta(\mu) = \log\frac{\mu}{1 - \mu}. Unique!
 If \mu \in \{0, 1\}: there is no gradient stationary point in the Opt. problem (**); the supremum is approached only as \theta \to \pm\infty
 Reverse mapping: \mu(\theta) = \frac{e^\theta}{1 + e^\theta}. Unique!

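Working the conjugate computation out explicitly (my addition; it follows directly from A(\theta) = \log(1 + e^\theta)):

    A^*(\mu) = \sup_\theta \{ \theta \mu - \log(1 + e^\theta) \}
    stationarity: \mu = \frac{e^\theta}{1 + e^\theta} \;\Rightarrow\; \theta(\mu) = \log\frac{\mu}{1 - \mu}
    A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu)

so A^*(\mu) is exactly the negative entropy of a Bernoulli(\mu) variable, as the duality slide asserts.
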
Variational Inference In General
 An umbrella term that refers to various mathematical tools for optimization-based formulations of problems, as well as associated techniques for their solution
 General idea:
   Express a quantity of interest as the solution of an optimization problem
   The optimization problem can be relaxed in various ways:
     Approximate the function to be optimized
     Approximate the set over which the optimization takes place
 Goes in parallel with MCMC

A Tree-Based Outer-Bound to \mathbb{M}(G)
 Locally Consistent (Pseudo-) Marginal Polytope:
  \mathbb{L}(G) = \Big\{ \tau \ge 0 : \sum_{x_i} \tau_i(x_i) = 1 \;\text{(normalization)}, \;\; \sum_{x_j} \tau_{ij}(x_i, x_j) = \tau_i(x_i) \;\text{(marginalization)} \Big\}
 Relation to \mathbb{M}(G):
   \mathbb{M}(G) \subseteq \mathbb{L}(G) holds for any graph
   \mathbb{M}(G) = \mathbb{L}(G) holds for tree-structured graphs

An Example
 A three-node cycle (binary r.v.)
[Figure: a cycle over nodes 1, 2, 3]
 For any \mu \in \mathbb{M}(G), we have \mu \in \mathbb{L}(G)
 For this cyclic graph, we have \mathbb{L}(G) \ne \mathbb{M}(G): some \tau \in \mathbb{L}(G) lie outside \mathbb{M}(G)
 an exercise?

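A classic counterexample in the spirit of the exercise (my addition): on the binary 3-cycle, take

    \tau_i = (\tfrac{1}{2}, \tfrac{1}{2}) \;\; \forall i, \qquad
    \tau_{ij} = \begin{pmatrix} 0 & \tfrac{1}{2} \\ \tfrac{1}{2} & 0 \end{pmatrix} \;\; \forall (i,j) \in E

These pseudo-marginals satisfy normalization and marginalization, so \tau \in \mathbb{L}(G); but they insist that every pair of neighbors disagree with probability one, which is impossible around an odd cycle, so \tau \notin \mathbb{M}(G).
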
Bethe Entropy Approximation
 Approximate the negative entropy A^*(\mu), which doesn't have a closed form for general graphs
 Entropy on a tree (Marginals):
   recall the tree factorization: p_\mu(x) = \prod_{i \in V} \mu_i(x_i) \prod_{(i,j) \in E} \frac{\mu_{ij}(x_i, x_j)}{\mu_i(x_i)\, \mu_j(x_j)}
   entropy: H(p_\mu) = \sum_{i \in V} H_i(\mu_i) - \sum_{(i,j) \in E} I_{ij}(\mu_{ij})
 Bethe entropy approximation (Pseudo-marginals): apply the same expression to any \tau \in \mathbb{L}(G):
  H_{\mathrm{Bethe}}(\tau) = \sum_{i \in V} H_i(\tau_i) - \sum_{(i,j) \in E} I_{ij}(\tau_{ij})

Bethe Variational Problem (BVP)
 We already have:
   a convex (polyhedral) outer bound \mathbb{L}(G)
   the Bethe approximate entropy H_{\mathrm{Bethe}}
 Combining the two ingredients, we have
  \max_{\tau \in \mathbb{L}(G)} \big\{ \langle \theta, \tau \rangle + H_{\mathrm{Bethe}}(\tau) \big\}
 a simple structured problem (differentiable & the constraint set is a simple polytope)
 Sum-product is the solver!
(Bethe approximation: Hans Bethe, Nobel Prize in Physics, 1967)

Connection to Sum-Product Alg.
 Lagrangian method for BVP: attach Lagrange multipliers \lambda to the local consistency constraints; the stationarity conditions recover the message-update rules
 Sum-product and Bethe Variational (Yedidia et al., 2002):
   For any graph G, any fixed point of the sum-product updates specifies a pair (\tau^*, \lambda^*) such that the Lagrangian stationarity conditions hold
   For a tree-structured MRF, the solution is unique, where \tau^* corresponds to the exact singleton and pairwise marginal distributions of the MRF, and the optimal value of the BVP is equal to the log-partition function A(\theta)

Proof
[Derivation shown on the original slide; not recoverable from the extraction]

Discussions
 The connection provides a principled basis for applying the sum-product algorithm to loopy graphs
 However,
   this connection provides no guarantees on the convergence of the sum-product alg. on loopy graphs
   the Bethe variational problem is usually non-convex, so there are no guarantees on finding the global optimum
   generally, there are no guarantees that the Bethe optimal value is a lower bound on A(\theta)
 However, however,
   the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product alg., via progressively better approximations to the entropy function and outer bounds on the marginal polytope!

Inexactness of Bethe and Sum-Product
 From the Bethe entropy approximation
 Example: [Figure: a four-node cycle 1–2–3–4]
 From the pseudo-marginal outer bound
 strict inclusion: \mathbb{M}(G) \subsetneq \mathbb{L}(G)

Summary of LBP
 Variational methods in general turn inference into an optimization problem
 However, both the objective function and the constraint set are hard to deal with
 The Bethe variational approximation is a tree-based approximation to both the objective function and the marginal polytope
 Belief propagation is a Lagrangian-based solver for the BVP
 Generalized BP extends BP to solve the generalized hyper-tree based variational approximation problem

Tractable Subgraph
 Given a GM with a graph G, a subgraph F is tractable if we can perform exact inference on it
 Example: [Figure: a grid graph G and two tractable subgraphs: the fully disconnected graph and a spanning tree]

Mean Parameterization
 For an exponential family GM defined with graph G and sufficient statistics \phi, the realizable mean parameter set is
  \mathbb{M}(G; \phi) := \{ \mu : \exists\, p \text{ such that } E_p[\phi(X)] = \mu \}
 For a given tractable subgraph F, the subset of mean parameters realizable by distributions that factor according to F is of interest:
  \mathbb{M}(F; \phi) \subseteq \mathbb{M}(G; \phi)
 Inner Approximation: \mathbb{M}(F; \phi) is an inner approximation to \mathbb{M}(G; \phi)

Optimizing a Lower Bound
 Any mean parameter \mu \in \mathbb{M}^\circ yields a lower bound on the log-partition function:
  A(\theta) \ge \langle \theta, \mu \rangle - A^*(\mu)
 Moreover, equality holds iff \theta and \mu are dually coupled, i.e., \mu = E_\theta[\phi(X)]
 Proof Idea: (Jensen's Inequality)
 Optimizing the lower bound gives
  \sup_{\mu \in \mathbb{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}
 This is an inference!

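Filling in the one-line Jensen argument the slide gestures at (my reconstruction): for any distribution q with mean parameter \mu_q = E_q[\phi(X)],

    A(\theta) = \log \sum_x \exp\{ \langle \theta, \phi(x) \rangle \}
              = \log \sum_x q(x)\, \frac{\exp\{ \langle \theta, \phi(x) \rangle \}}{q(x)}
              \ge \langle \theta, \mu_q \rangle + H(q)

and taking q to be the maximum-entropy distribution with mean \mu_q gives H(q) = -A^*(\mu_q), which is the bound on the slide.
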
Mean Field Methods In General
 However, the lower bound can't be explicitly evaluated in general
 Because the dual function A^* typically lacks an explicit form
 Mean Field Methods:
   Approximate the lower bound
   Approximate the realizable mean parameter set by a tractable subset \mathbb{M}(F)
 The MF optimization problem:
  \max_{\tau \in \mathbb{M}(F)} \{ \langle \theta, \tau \rangle - A_F^*(\tau) \}
 Still a lower bound?

KL-divergence
 Kullback-Leibler Divergence:
  D(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}
 For two exponential family distributions with the same sufficient statistics:
   Primal Form: D(\theta_1 \,\|\, \theta_2) = A(\theta_2) - A(\theta_1) - \langle \mu_1, \theta_2 - \theta_1 \rangle
   Mixed Form: D(\mu_1 \,\|\, \theta_2) = A(\theta_2) + A^*(\mu_1) - \langle \mu_1, \theta_2 \rangle
   Dual Form: D(\mu_1 \,\|\, \mu_2) = A^*(\mu_1) - A^*(\mu_2) - \langle \theta_2, \mu_1 - \mu_2 \rangle

Mean Field and KL-divergence
 Optimizing the lower bound
  \max_{\tau \in \mathbb{M}(F)} \{ \langle \theta, \tau \rangle - A^*(\tau) \}
 is equivalent to minimizing a KL-divergence, since by the mixed form
  A(\theta) - \{ \langle \theta, \tau \rangle - A^*(\tau) \} = D(\tau \,\|\, \theta)
 Therefore, we are doing D(q \,\|\, p) minimization

Naïve Mean Field
 Fully factorized variational distribution:
  q(x) = \prod_{i \in V} q_i(x_i)

Naïve Mean Field for Ising Model
 Sufficient statistics and Mean Parameters (x in {0, 1}):
  \mu_i = E[X_i] = p(X_i = 1), \qquad \mu_{ij} = E[X_i X_j] = p(X_i = 1, X_j = 1)
 Naïve Mean Field: under q(x) = \prod_i q_i(x_i), the pairwise means factor, \tau_{ij} = \tau_i \tau_j
 Realizable mean parameter subset:
  \mathbb{M}(F) = \{ \tau : \tau_i \in [0, 1], \; \tau_{ij} = \tau_i \tau_j \}
 Entropy: H(q) = -\sum_i [\, \tau_i \log \tau_i + (1 - \tau_i) \log(1 - \tau_i) \,]
 Optimization Problem:
  \max_{\tau \in [0,1]^{|V|}} \Big\{ \sum_i \theta_i \tau_i + \sum_{(i,j) \in E} \theta_{ij} \tau_i \tau_j + H(q) \Big\}

Naïve Mean Field for Ising Model
 Optimization Problem (from the previous slide)
 Update Rule (coordinate ascent):
  \tau_i \leftarrow \sigma\Big( \theta_i + \sum_{j \in N(i)} \theta_{ij} \tau_j \Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
 \tau_j resembles a "message" sent from node j to i
 \{ \tau_j : j \in N(i) \} forms the "mean field" applied to X_i from its neighborhood

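The update rule is one line of code; a minimal sketch (mine, with hypothetical theta/theta0 inputs, in the {0, 1} parameterization this slide uses):

    import numpy as np

    def naive_mf_ising01(theta, theta0, iters=200):
        """tau[i] ~ q(X_i = 1) for X in {0,1}^n; theta symmetric, zero diagonal."""
        tau = np.full(len(theta0), 0.5)
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        for _ in range(iters):
            for i in range(len(tau)):
                tau[i] = sigmoid(theta0[i] + theta[i] @ tau)  # the slide's update rule
        return tau
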
Non-Convexity of Mean Field
 Mean field optimization is always non-convex for any exponential family in which the state space is finite
   \mathbb{M} is a finite convex hull
   \mathbb{M}(F) contains all the extreme points of \mathbb{M} (the delta distributions are fully factorized)
   If \mathbb{M}(F) were a convex set, then \mathbb{M}(F) = \mathbb{M}
 Mean field has nonetheless been used successfully

Structured Mean Field
 Mean field theory applies to any tractable subgraph
 Naïve mean field is based on the fully disconnected subgraph
 Variants based on structured subgraphs can be derived

Topic models
[Figure: a topic model with prior (\mu, \Sigma), latent topic vector \gamma, topic indicators z, words w, and topics \beta; and its variational counterpart with free parameters \mu^*, \Sigma^*, \phi^*]
 Approximate the Integral: quantities such as p(z_1 \mid D) require an intractable integral over the latent variables
 Approximate the Posterior:
  p(\gamma, z_{1:n} \mid D) \approx q(\gamma, z_{1:n}) = q(\gamma \mid \mu^*, \Sigma^*) \prod_n q(z_n \mid \phi_n^*)
 Optimization Problem:
  (\mu^*, \Sigma^*, \phi_{1:n}^*) = \arg\min_{\mu, \Sigma, \phi_{1:n}} \mathrm{KL}(q \,\|\, p)
 Solve the optimization problem to obtain the variational parameters

Variational Inference With no Tears
[Ahmed and Xing, 2006; Xing et al., 2003]
 Fully Factored Distribution:
  q(\gamma, z_{1:n}) = q(\gamma) \prod_n q(z_n)
 Fixed Point Equations:
  q^*(\gamma) \propto p(\gamma \mid \langle S_z \rangle_{q_z}, \beta), with q(\gamma) \approx N(\mu, \Sigma) via a Laplace approximation
  q^*(z) \propto p(z \mid \langle \gamma \rangle_{q_\gamma}, \ldots), with q(z) = \mathrm{Multi}(\phi)
[Figure: the graphical model, with each factor updated by conditioning on expected sufficient statistics of the others]

Summary of GMF
 Message-passing algorithms (e.g., belief propagation, mean field) are solving approximate versions of an exact variational principle in exponential families
 There are two distinct components to the approximations:
   one can use either inner or outer bounds to \mathbb{M}
   various approximations to the entropy function
 BP: polyhedral outer bound and non-convex Bethe approximation
 MF: non-convex inner bound and exact form of entropy
 Kikuchi: tighter polyhedral outer bound and better entropy approximation
