Part 2: Introduction to Graphical Models

Sebastian Nowozin and Christoph H. Lampert

Colorado Springs, 25th June 2011
Introduction

- Model: relating observations x to quantities of interest y.
- Example 1: given an RGB image x, infer a depth y for each pixel.
- Example 2: given an RGB image x, infer the presence and positions y of all objects shown.

[Figure: a mapping f : X → Y, where X is the space of images and Y the space of object annotations.]
Introduction (continued)

- General case: mapping x ∈ X to y ∈ Y.
- Graphical models are a concise language to define this mapping.
- The mapping can be ambiguous: measurement noise, lack of well-posedness (e.g. occlusions).
- Probabilistic graphical models: define p(y|x) or p(x, y) for all y ∈ Y.

[Figure: instead of a single prediction f(x), a probabilistic model assigns a conditional distribution p(Y | X = x) over several candidate outputs in Y.]
Graphical Models

A graphical model defines
- a family of probability distributions over a set of random variables,
- by means of a graph,
- so that the random variables satisfy conditional independence assumptions encoded in the graph.

Popular classes of graphical models:
- Undirected graphical models (Markov random fields),
- Directed graphical models (Bayesian networks),
- Factor graphs,
- Others: chain graphs, influence diagrams, etc.
Bayesian Networks

- Graph: G = (V, E), E ⊂ V × V, directed and acyclic.
- Variable domains Y_i.
- Factorization over distributions, conditioning each variable on its parent nodes:

  p(Y = y) = \prod_{i ∈ V} p(y_i | y_{pa_G(i)})

- Example (a simple Bayes net with edges Y_i → Y_k, Y_j → Y_k, Y_k → Y_l):

  p(Y = y) = p(Y_l = y_l | Y_k = y_k) p(Y_k = y_k | Y_i = y_i, Y_j = y_j) p(Y_i = y_i) p(Y_j = y_j).

- The graph and its factorization define a family of distributions.
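To make the factorization concrete, here is a minimal sketch in Python that evaluates the joint of the simple Bayes net above as a product of local conditionals. The variables are assumed binary and all CPT entries are made up for illustration.

```python
import itertools

# Hypothetical CPTs for the net Y_i -> Y_k <- Y_j, Y_k -> Y_l (binary states).
p_i = {0: 0.6, 1: 0.4}                    # p(Y_i)
p_j = {0: 0.7, 1: 0.3}                    # p(Y_j)
p_k = {(0, 0): {0: 0.9, 1: 0.1},          # p(Y_k | Y_i, Y_j)
       (0, 1): {0: 0.5, 1: 0.5},
       (1, 0): {0: 0.4, 1: 0.6},
       (1, 1): {0: 0.2, 1: 0.8}}
p_l = {0: {0: 0.8, 1: 0.2},               # p(Y_l | Y_k)
       1: {0: 0.3, 1: 0.7}}

def joint(yi, yj, yk, yl):
    """p(Y = y) as the product of local conditionals (the factorization above)."""
    return p_i[yi] * p_j[yj] * p_k[(yi, yj)][yk] * p_l[yk][yl]

# Because each local table is normalized, the joint sums to one without
# any global normalization constant.
total = sum(joint(*y) for y in itertools.product([0, 1], repeat=4))
print(total)  # 1.0 (up to floating point)
```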
Undirected Graphical Models

- Also known as Markov random field (MRF) or Markov network.
- Graph: G = (V, E), E ⊂ V × V, undirected, no self-edges.
- Variable domains Y_i.
- Factorization over potentials ψ at the cliques C(G) of the graph:

  p(y) = (1/Z) \prod_{C ∈ C(G)} ψ_C(y_C)

- Normalizing constant Z = \sum_{y ∈ Y} \prod_{C ∈ C(G)} ψ_C(y_C).
- Example (the simple chain MRF Y_i − Y_j − Y_k):

  p(y) = (1/Z) ψ_i(y_i) ψ_j(y_j) ψ_k(y_k) ψ_{i,j}(y_i, y_j) ψ_{j,k}(y_j, y_k)
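The normalization is what distinguishes MRFs computationally. Below is a minimal sketch, assuming binary variables and made-up potential tables, that computes Z and p(y) for the chain Y_i − Y_j − Y_k by exhaustive enumeration.

```python
import itertools

# Arbitrary positive potential tables for the chain MRF Y_i - Y_j - Y_k.
psi_unary = {"i": {0: 1.0, 1: 2.0},
             "j": {0: 1.0, 1: 1.0},
             "k": {0: 2.0, 1: 1.0}}
psi_pair = {("i", "j"): {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
            ("j", "k"): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}}

def unnormalized(y):
    """Product of clique potentials for the assignment y = (y_i, y_j, y_k)."""
    yi, yj, yk = y
    value = psi_unary["i"][yi] * psi_unary["j"][yj] * psi_unary["k"][yk]
    return value * psi_pair[("i", "j")][(yi, yj)] * psi_pair[("j", "k")][(yj, yk)]

# Unlike a Bayes net, the product of potentials is not normalized: the
# partition function Z must be computed by summing over all assignments.
Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=3))

def p(y):
    return unnormalized(y) / Z

print(Z, p((0, 0, 0)))
```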
Example 1

[Figure: the chain MRF Y_i − Y_j − Y_k.]

- Cliques C(G): the set of vertex sets V′ ⊆ V with E ∩ (V′ × V′) = V′ × V′, i.e. the fully connected vertex subsets.
- Here C(G) = {{i}, {i,j}, {j}, {j,k}, {k}}, so

  p(y) = (1/Z) ψ_i(y_i) ψ_j(y_j) ψ_k(y_k) ψ_{i,j}(y_i, y_j) ψ_{j,k}(y_j, y_k)
Example 2

[Figure: the fully connected graph on Y_i, Y_j, Y_k, Y_l.]

- Here C(G) = 2^V: all subsets of V are cliques, so

  p(y) = (1/Z) \prod_{A ∈ 2^{{i,j,k,l}}} ψ_A(y_A).
Factor Graphs

- Graph: G = (V, F, E), E ⊆ V × F, with
  - variable nodes V,
  - factor nodes F,
  - edges E between variable and factor nodes,
  - scope of a factor: N(F) = {i ∈ V : (i, F) ∈ E}.
- Variable domains Y_i.
- Factorization over potentials ψ at the factors:

  p(y) = (1/Z) \prod_{F ∈ F} ψ_F(y_{N(F)})

- Normalizing constant Z = \sum_{y ∈ Y} \prod_{F ∈ F} ψ_F(y_{N(F)}).

[Figure: a factor graph over Y_i, Y_j, Y_k, Y_l.]
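A factor graph is convenient to represent in code as a list of (scope, table) pairs. The sketch below, in which the variables, scopes, and numbers are all made up for illustration, evaluates the factorization and the constant Z by enumeration.

```python
import itertools

# Each factor is a scope (tuple of variable names, the set N(F)) together
# with a potential table over the joint states of that scope.
domains = {"i": [0, 1], "j": [0, 1], "k": [0, 1], "l": [0, 1]}
factors = [
    (("i", "j"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
    (("j", "l"), {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.5}),
    (("i", "k"), {(0, 0): 1.2, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.2}),
]
variables = sorted(domains)

def unnormalized(assignment):
    """Product over factors F of psi_F(y_N(F)) for a {variable: state} dict."""
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(assignment[v] for v in scope)]
    return value

# Brute-force partition function: sum over all joint assignments.
Z = sum(unnormalized(dict(zip(variables, states)))
        for states in itertools.product(*(domains[v] for v in variables)))
print(Z)
```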
Why factor graphs?

[Figure: three models over Y_i, Y_j, Y_k, Y_l: the same undirected graph admits different factorizations, and only the factor graphs make the factorization visible.]

- Factor graphs are explicit about the factorization.
- Hence, they are easier to work with.
- They are universal (just like MRFs and Bayesian networks).
Capacity

[Figure: two factor graphs over Y_i, Y_j, Y_k, Y_l whose factorizations define families of different sizes.]

- A factor graph defines a family of distributions.
- Some families are larger than others.
Four remaining pieces

1. Conditional distributions (CRFs)
2. Parameterization
3. Test-time inference
4. Learning the model from training data
Conditional Distributions

- We have discussed p(y); how do we define p(y|x)?
- Potentials become a function of x_{N(F)}.
- The partition function then depends on x.
- This yields conditional random fields (CRFs).
- x is not part of the probability model, i.e. it is not treated as a random variable.
- The unconditional model

  p(y) = (1/Z) \prod_{F ∈ F} ψ_F(y_{N(F)})

  becomes

  p(y|x) = (1/Z(x)) \prod_{F ∈ F} ψ_F(y_{N(F)}; x_{N(F)}).

[Figure: a CRF in which observed variables X_i, X_j enter the factors of the output variables Y_i, Y_j, defining a conditional distribution.]
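To see how the partition function becomes observation dependent, here is a minimal CRF sketch with two binary outputs. The log-linear potentials and weights are made-up stand-ins, not the parameterization discussed later.

```python
import itertools
import math

# Tiny CRF over outputs (y_i, y_j) in {0,1}^2 conditioned on real-valued
# observations x = (x_i, x_j); the weights are arbitrary illustrative numbers.
w_unary, w_pair = 1.5, 0.8

def psi_unary(y, x):
    # Unary potential: prefers y = 1 when the observation x is large.
    return math.exp(w_unary * x) if y == 1 else 1.0

def psi_pair(yi, yj):
    # Pairwise potential: rewards agreement between the two labels.
    return math.exp(w_pair) if yi == yj else 1.0

def unnormalized(y, x):
    (yi, yj), (xi, xj) = y, x
    return psi_unary(yi, xi) * psi_unary(yj, xj) * psi_pair(yi, yj)

def p(y, x):
    # Z(x) must be recomputed for every observation x.
    Zx = sum(unnormalized(yy, x) for yy in itertools.product([0, 1], repeat=2))
    return unnormalized(y, x) / Zx

print(p((1, 1), (2.0, -0.5)))
```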
Potentials and Energy Functions

- For each factor F ∈ F, let Y_F = ×_{i ∈ N(F)} Y_i and define an energy function

  E_F : Y_{N(F)} → R.

- Potentials and energies are related by (assuming ψ_F(y_F) > 0)

  ψ_F(y_F) = exp(−E_F(y_F))   and   E_F(y_F) = −log ψ_F(y_F).

- Then p(y) can be written as

  p(Y = y) = (1/Z) \prod_{F ∈ F} ψ_F(y_F) = (1/Z) exp(−\sum_{F ∈ F} E_F(y_F)).

- Hence, p(y) is completely determined by the energy E(y) = \sum_{F ∈ F} E_F(y_F).
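The relation between potentials and energies is a plain log/exp transform; a tiny sketch with a made-up pairwise energy table:

```python
import math

# E -> psi -> E round trip for a made-up pairwise energy table.
E_F = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

psi_F = {y: math.exp(-e) for y, e in E_F.items()}      # psi_F = exp(-E_F)
E_back = {y: -math.log(v) for y, v in psi_F.items()}   # E_F = -log(psi_F)

assert all(abs(E_F[y] - E_back[y]) < 1e-12 for y in E_F)
```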
Energy Minimization

  argmax_{y ∈ Y} p(Y = y) = argmax_{y ∈ Y} (1/Z) exp(−\sum_{F ∈ F} E_F(y_F))
                          = argmax_{y ∈ Y} exp(−\sum_{F ∈ F} E_F(y_F))
                          = argmax_{y ∈ Y} −\sum_{F ∈ F} E_F(y_F)
                          = argmin_{y ∈ Y} \sum_{F ∈ F} E_F(y_F)
                          = argmin_{y ∈ Y} E(y).

  (The constant 1/Z > 0 does not change the maximizer, and exp is strictly increasing.)

- Energy minimization can be interpreted as solving for the most likely state of some factor graph model.
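For small models the equivalence can be checked directly. Below is a brute-force sketch with made-up energies on a three-variable chain, where the argmin of E(y) is the MAP state.

```python
import itertools

# Made-up unary and pairwise energies on the chain i - j - k.
E_unary = {"i": {0: 0.2, 1: 0.9}, "j": {0: 0.5, 1: 0.1}, "k": {0: 0.3, 1: 0.3}}
E_pair = {("i", "j"): {(0, 0): 0.0, (0, 1): 0.7, (1, 0): 0.7, (1, 1): 0.0},
          ("j", "k"): {(0, 0): 0.0, (0, 1): 0.7, (1, 0): 0.7, (1, 1): 0.0}}

def energy(y):
    """E(y) = sum of factor energies for y = (y_i, y_j, y_k)."""
    yi, yj, yk = y
    return (E_unary["i"][yi] + E_unary["j"][yj] + E_unary["k"][yk]
            + E_pair[("i", "j")][(yi, yj)] + E_pair[("j", "k")][(yj, yk)])

# Energy minimization by exhaustive search = most likely state of the model.
y_map = min(itertools.product([0, 1], repeat=3), key=energy)
print(y_map, energy(y_map))
```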
Parameterization

- Factor graphs define a family of distributions.
- Parameterization: identifying individual members of the family by parameters w.

[Figure: parameter vectors w index distributions p_{w_1}, p_{w_2}, ... within the family of distributions defined by the factor graph.]
Example: Parameterization

- Image segmentation model.
- Pairwise "Potts" energy function E_F(y_i, y_j; w_1),

  E_F : {0,1} × {0,1} × R → R,

  E_F(0, 0; w_1) = E_F(1, 1; w_1) = 0,
  E_F(0, 1; w_1) = E_F(1, 0; w_1) = w_1.

[Figure: grid-structured image segmentation model.]
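The Potts energy is a one-line function; the following is a direct transcription of the table above:

```python
def potts_energy(yi: int, yj: int, w1: float) -> float:
    """Pairwise Potts energy: 0 for equal labels, w1 otherwise."""
    return 0.0 if yi == yj else w1

# For w1 > 0 the factor penalizes label changes between neighboring pixels,
# which encourages smooth segmentations.
assert potts_energy(0, 0, 2.5) == 0.0
assert potts_energy(0, 1, 2.5) == 2.5
```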
Example: Parameterization (cont)

- Image segmentation model.
- Unary energy function E_F(y_i; x, w),

  E_F : {0,1} × X × R^{{0,1}×D} → R,

  E_F(0; x, w) = ⟨w(0), ψ_F(x)⟩,
  E_F(1; x, w) = ⟨w(1), ψ_F(x)⟩.

- Features ψ_F : X → R^D, e.g. image filters.

[Figure: grid-structured image segmentation model.]
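A sketch of the linear unary energy follows; the feature function below is a hypothetical stand-in for the image filters mentioned above, and the weights are made up.

```python
import numpy as np

D = 3
w = {0: np.array([0.5, -0.2, 0.1]),    # w(0), made-up weights
     1: np.array([-0.5, 0.2, 0.3])}    # w(1)

def psi_F(x: np.ndarray) -> np.ndarray:
    # Hypothetical length-D feature vector of the local image content.
    return np.array([x.mean(), x.std(), 1.0])

def unary_energy(y: int, x: np.ndarray) -> float:
    # E_F(y; x, w) = <w(y), psi_F(x)>
    return float(w[y] @ psi_F(x))

patch = np.array([0.1, 0.4, 0.35, 0.2])  # a fake image patch
print(unary_energy(0, patch), unary_energy(1, patch))
```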
Example: Parameterization (cont)

[Figure: grid model with unary energies ⟨w(0), ψ_F(x)⟩, ⟨w(1), ψ_F(x)⟩ at each pixel and the Potts table (0 on the diagonal, w_1 off the diagonal) at each pairwise factor.]

- Total number of parameters: D + D + 1.
- Parameters are shared across factors, but the energies differ because of the different features ψ_F(x).
- General form, linear in w:

  E_F(y_F; x_F, w) = ⟨w(y_F), ψ_F(x_F)⟩
Making Predictions

- Making predictions: given x ∈ X, predict y ∈ Y.
- How do we measure the quality of a prediction, or of a prediction function f : X → Y?
Loss function

- Define a loss function

  Δ : Y × Y → R_+,

  so that Δ(y, y*) measures the loss incurred by predicting y when y* is true.
- The loss function is application dependent.
Test-time Inference

- Loss function Δ(y, f(x)): correct label y, prediction f(x), with Δ : Y × Y → R.
- True joint distribution d(X, Y) and true conditional d(y|x); model distribution p(y|x).
- Expected loss as the quality of a prediction:

  R_f^Δ(x) = E_{y ∼ d(y|x)} [Δ(y, f(x))]
           = \sum_{y ∈ Y} d(y|x) Δ(y, f(x))
           ≈ E_{y ∼ p(y|x;w)} [Δ(y, f(x))],

  assuming that p(y|x; w) ≈ d(y|x).
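The expected loss can be evaluated exactly when Y is small enough to enumerate. A sketch with a made-up (already normalized) model distribution over two binary variables:

```python
import itertools

Y = list(itertools.product([0, 1], repeat=2))
p_y_given_x = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}  # made up

def expected_loss(y_pred, loss):
    """E_{y ~ p(y|x)} [loss(y, y_pred)] by exhaustive enumeration."""
    return sum(p_y_given_x[y] * loss(y, y_pred) for y in Y)

zero_one = lambda y, yp: 0.0 if y == yp else 1.0
print(expected_loss((0, 0), zero_one))          # 0.5: the mass off (0, 0)

# The best prediction minimizes the expected loss over candidate outputs.
print(min(Y, key=lambda yp: expected_loss(yp, zero_one)))  # (0, 0)
```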
Example 1: 0/1 loss

- Loss 0 iff perfectly predicted, 1 otherwise:

  Δ_{0/1}(y, y*) = I(y ≠ y*) = { 0 if y = y*; 1 otherwise }

- Plugging it in:

  y* := argmin_{y ∈ Y} E_{y' ∼ p(y'|x)} [Δ_{0/1}(y', y)]
      = argmax_{y ∈ Y} p(y|x)
      = argmin_{y ∈ Y} E(y; x).

- Minimizing the expected 0/1 loss → MAP prediction (energy minimization).
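Under the 0/1 loss the expected loss of predicting y is 1 − p(y|x), so the Bayes-optimal prediction is the MAP labeling. A small self-contained check with made-up probabilities:

```python
import itertools

Y = list(itertools.product([0, 1], repeat=2))
p = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}  # made up, normalized

expected_01 = lambda yp: sum(q for y, q in p.items() if y != yp)  # = 1 - p(yp|x)
y_bayes = min(Y, key=expected_01)   # minimize expected 0/1 loss
y_map = max(Y, key=lambda y: p[y])  # maximize p(y|x)
assert y_bayes == y_map == (0, 0)
```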
Example 2: Hamming loss

- Count the number of mislabeled variables:

  Δ_H(y, y*) = (1/|V|) \sum_{i ∈ V} I(y_i ≠ y_i*)

- Plugging it in:

  y* := argmin_{y ∈ Y} E_{y' ∼ p(y'|x)} [Δ_H(y', y)]
      = ( argmax_{y_i ∈ Y_i} p(y_i|x) )_{i ∈ V}

- Minimizing the expected Hamming loss → maximum posterior marginal (MPM, Max-Marg) prediction.
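Because the expected Hamming loss decomposes over variables, the prediction maximizes each posterior marginal separately. A sketch with made-up probabilities, chosen so that the MPM and MAP labelings differ:

```python
# Made-up joint model distribution over two binary variables.
p = {(0, 0): 0.10, (0, 1): 0.35, (1, 0): 0.35, (1, 1): 0.20}

def marginal(i, yi):
    """p(y_i = yi | x): sum the joint over all labelings that agree at i."""
    return sum(q for y, q in p.items() if y[i] == yi)

# Per-variable argmax of the marginals (MPM) versus joint argmax (MAP).
y_mpm = tuple(max([0, 1], key=lambda yi: marginal(i, yi)) for i in range(2))
y_map = max(p, key=p.get)
print(y_mpm, y_map)  # (1, 1) (0, 1): the two predictions disagree here
```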
Example 3: Squared error

- Assume a vector space on Y_i (pixel intensities, optical flow vectors, etc.).
- Sum of squared errors:

  Δ_Q(y, y*) = (1/|V|) \sum_{i ∈ V} ||y_i − y_i*||².

- Plugging it in:

  y* := argmin_{y ∈ Y} E_{y' ∼ p(y'|x)} [Δ_Q(y', y)]
      = ( \sum_{y_i ∈ Y_i} p(y_i|x) · y_i )_{i ∈ V},

  i.e. each component is the mean of that variable's posterior marginal.

- Minimizing the expected squared error → minimum mean squared error (MMSE) prediction.
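The MMSE prediction is the per-variable posterior mean. A sketch with made-up marginals over a small numeric label set, illustrating that the prediction can fall between labels:

```python
# Made-up posterior marginals p(y_i | x) over the numeric labels {0, 1, 2}.
marginals = [
    {0.0: 0.2, 1.0: 0.5, 2.0: 0.3},   # p(y_0 | x)
    {0.0: 0.7, 1.0: 0.2, 2.0: 0.1},   # p(y_1 | x)
]

# Minimizing the expected squared error gives the mean of each marginal.
y_mmse = [sum(prob * yi for yi, prob in m.items()) for m in marginals]
print(y_mmse)  # [1.1, 0.4]: unlike MAP or MPM, not necessarily a valid label
```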
Inference Task: Maximum A Posteriori (MAP) Inference

Definition (Maximum A Posteriori (MAP) Inference)
Given a factor graph, parameterization, and weight vector w, and given the observation x, find

  y* = argmax_{y ∈ Y} p(Y = y | x, w) = argmin_{y ∈ Y} E(y; x, w).
Inference Task: Probabilistic Inference

Definition (Probabilistic Inference)
Given a factor graph, parameterization, and weight vector w, and given the observation x, find

  log Z(x, w) = log \sum_{y ∈ Y} exp(−E(y; x, w)),
  μ_F(y_F) = p(Y_F = y_F | x, w),   ∀F ∈ F, ∀y_F ∈ Y_F.

- This typically includes the variable marginals

  μ_i(y_i) = p(y_i | x, w).
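Both quantities can be computed by brute force for tiny models; the sketch below uses a made-up three-variable chain energy. Real models require the message-passing or sampling algorithms discussed in later parts.

```python
import itertools
import math

# Made-up pairwise energies on the chain i - j - k (unaries omitted).
E_pair = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

def energy(y):
    yi, yj, yk = y
    return E_pair[(yi, yj)] + E_pair[(yj, yk)]

states = list(itertools.product([0, 1], repeat=3))
log_Z = math.log(sum(math.exp(-energy(y)) for y in states))

def factor_marginal(y_ij):
    """mu_F(y_F) = p(Y_i, Y_j = y_ij) for the first pairwise factor."""
    return sum(math.exp(-energy(y) - log_Z)
               for y in states if (y[0], y[1]) == y_ij)

print(log_Z, factor_marginal((0, 0)))
```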
Example: Man-made structure detection

[Figure: left, input image x; middle, ground-truth labeling on 16-by-16 pixel blocks; right, the factor graph model with observations X_i, unary factors ψ_i^1, ψ_i^2 and pairwise factors ψ_{i,k}^3.]

- Features: gradient and color histograms.
- Model parameters estimated from ≈ 60 training images.
Example: Man-made structure detection (cont)

[Figure, three panels:]
- Left: input image x.
- Middle (probabilistic inference): visualization of the variable marginals p(y_i = "man-made" | x, w).
- Right (MAP inference): joint MAP labeling y* = argmax_{y ∈ Y} p(y | x, w).
Training the Model

What can be learned?
- Model structure: the factors.
- Model variables: observed variables are fixed, but we can add unobserved variables.
- Factor energies: the parameters.
Training: Overview

- Assume a fully observed, independent and identically distributed (iid) sample set

  {(x^n, y^n)}_{n=1,...,N},   (x^n, y^n) ∼ d(X, Y).

- Goal: predict well.
- Alternative goal: first model d(y|x) well by p(y|x, w), then predict by minimizing the expected loss.
Probabilistic Learning

Problem (Probabilistic Parameter Learning)
Let d(y|x) be the (unknown) conditional distribution of labels for a problem to be solved. For a parameterized conditional distribution p(y|x, w) with parameters w ∈ R^D, probabilistic parameter learning is the task of finding a point estimate of the parameter w* that makes p(y|x, w*) closest to d(y|x).

- We will discuss probabilistic parameter learning in detail.
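As a preview, here is a deliberately tiny sketch of probabilistic parameter learning: a one-parameter logistic model (a stand-in, not the tutorial's parameterization) fit to made-up training pairs by maximizing the conditional log-likelihood with a crude grid search.

```python
import math

# Made-up training pairs (x^n, y^n) with x real and y in {-1, +1}.
data = [(0.5, 1), (1.2, 1), (-0.3, -1), (-0.8, -1), (0.1, -1)]

def log_p(y, x, w):
    """log p(y|x, w) for the model p(y|x, w) = exp(w*x*y) / Z(x, w)."""
    log_Zx = math.log(sum(math.exp(w * x * yy) for yy in (-1, 1)))
    return w * x * y - log_Zx

def cond_log_lik(w):
    return sum(log_p(y, x, w) for x, y in data)

# Grid search stands in for the gradient-based training discussed later.
w_star = max((i / 10 for i in range(-50, 51)), key=cond_log_lik)
print(w_star, cond_log_lik(w_star))
```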
Loss-Minimizing Parameter Learning

Problem (Loss-Minimizing Parameter Learning)
Let d(x, y) be the unknown distribution of data and labels, and let Δ : Y × Y → R be a loss function. Loss-minimizing parameter learning is the task of finding a parameter value w* such that the expected prediction risk

  E_{(x,y) ∼ d(x,y)} [Δ(y, f_p(x))]

is as small as possible, where f_p(x) = argmax_{y ∈ Y} p(y|x, w*).

- Requires the loss function at training time.
- Directly learns a prediction function f_p(x).

More Related Content

PDF
Ben Gal
PDF
Tro07 sparse-solutions-talk
PDF
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
PDF
Object Detection with Discrmininatively Trained Part based Models
PDF
MATHEON Center Days: Index determination and structural analysis using Algori...
PDF
Lesson 4: Calculating Limits (Section 21 slides)
PDF
Johan Suykens: "Models from Data: a Unifying Picture"
Ben Gal
Tro07 sparse-solutions-talk
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
Object Detection with Discrmininatively Trained Part based Models
MATHEON Center Days: Index determination and structural analysis using Algori...
Lesson 4: Calculating Limits (Section 21 slides)
Johan Suykens: "Models from Data: a Unifying Picture"

What's hot (19)

PDF
Lesson 12: Linear Approximation
PDF
Elementary Landscape Decomposition of Combinatorial Optimization Problems
PPT
JavaYDL13
PDF
Lesson 25: The Definite Integral
PDF
Robust Shape and Topology Optimization - Northwestern
PDF
Identity Based Encryption
PDF
Elementary Landscape Decomposition of Combinatorial Optimization Problems
PDF
Lecture11
PPTX
JOSA TechTalks - Machine Learning on Graph-Structured Data
PDF
Adaptive Signal and Image Processing
PPTX
PDF
Image formation
PDF
Discussion of Faming Liang's talk
PDF
Kernelization algorithms for graph and other structure modification problems
PDF
Optimal Transport in Imaging Sciences
PDF
Camera parameters
PDF
An Introduction to Optimal Transport
PDF
Bayesian Defect Signal Analysis for Nondestructive Evaluation of Materials
PDF
A type system for the vectorial aspects of the linear-algebraic lambda-calculus
Lesson 12: Linear Approximation
Elementary Landscape Decomposition of Combinatorial Optimization Problems
JavaYDL13
Lesson 25: The Definite Integral
Robust Shape and Topology Optimization - Northwestern
Identity Based Encryption
Elementary Landscape Decomposition of Combinatorial Optimization Problems
Lecture11
JOSA TechTalks - Machine Learning on Graph-Structured Data
Adaptive Signal and Image Processing
Image formation
Discussion of Faming Liang's talk
Kernelization algorithms for graph and other structure modification problems
Optimal Transport in Imaging Sciences
Camera parameters
An Introduction to Optimal Transport
Bayesian Defect Signal Analysis for Nondestructive Evaluation of Materials
A type system for the vectorial aspects of the linear-algebraic lambda-calculus
Ad

Similar to 01 graphical models (20)

PDF
A discussion on sampling graphs to approximate network classification functions
PDF
Physics of Algorithms Talk
PDF
YSC 2013
PDF
從 VAE 走向深度學習新理論
PDF
Stochastic Differentiation
PDF
CVPR2010: Advanced ITinCVPR in a Nutshell: part 4: additional slides
PDF
Bayesian case studies, practical 2
PDF
Slides2 130201091056-phpapp01
PDF
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
PDF
cswiercz-general-presentation
PDF
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
PDF
Chapter 3 projection
PDF
Final presentation2-----------------.pdf
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
UCB 2012-02-28
PDF
Slides: A glance at information-geometric signal processing
PDF
Maximum likelihood estimation of regularisation parameters in inverse problem...
PDF
Basics of probability in statistical simulation and stochastic programming
PDF
An introduction to quantum stochastic calculus
PDF
Tuto part2
A discussion on sampling graphs to approximate network classification functions
Physics of Algorithms Talk
YSC 2013
從 VAE 走向深度學習新理論
Stochastic Differentiation
CVPR2010: Advanced ITinCVPR in a Nutshell: part 4: additional slides
Bayesian case studies, practical 2
Slides2 130201091056-phpapp01
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
cswiercz-general-presentation
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Chapter 3 projection
Final presentation2-----------------.pdf
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
UCB 2012-02-28
Slides: A glance at information-geometric signal processing
Maximum likelihood estimation of regularisation parameters in inverse problem...
Basics of probability in statistical simulation and stochastic programming
An introduction to quantum stochastic calculus
Tuto part2
Ad

More from zukun (20)

PDF
My lyn tutorial 2009
PDF
ETHZ CV2012: Tutorial openCV
PDF
ETHZ CV2012: Information
PDF
Siwei lyu: natural image statistics
PDF
Lecture9 camera calibration
PDF
Brunelli 2008: template matching techniques in computer vision
PDF
Modern features-part-4-evaluation
PDF
Modern features-part-3-software
PDF
Modern features-part-2-descriptors
PDF
Modern features-part-1-detectors
PDF
Modern features-part-0-intro
PDF
Lecture 02 internet video search
PDF
Lecture 01 internet video search
PDF
Lecture 03 internet video search
PDF
Icml2012 tutorial representation_learning
PPT
Advances in discrete energy minimisation for computer vision
PDF
Gephi tutorial: quick start
PDF
EM algorithm and its application in probabilistic latent semantic analysis
PDF
Object recognition with pictorial structures
PDF
Iccv2011 learning spatiotemporal graphs of human activities
My lyn tutorial 2009
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Information
Siwei lyu: natural image statistics
Lecture9 camera calibration
Brunelli 2008: template matching techniques in computer vision
Modern features-part-4-evaluation
Modern features-part-3-software
Modern features-part-2-descriptors
Modern features-part-1-detectors
Modern features-part-0-intro
Lecture 02 internet video search
Lecture 01 internet video search
Lecture 03 internet video search
Icml2012 tutorial representation_learning
Advances in discrete energy minimisation for computer vision
Gephi tutorial: quick start
EM algorithm and its application in probabilistic latent semantic analysis
Object recognition with pictorial structures
Iccv2011 learning spatiotemporal graphs of human activities

Recently uploaded (20)

PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PPTX
1. Introduction to Computer Programming.pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
A Presentation on Artificial Intelligence
PDF
Getting Started with Data Integration: FME Form 101
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Spectroscopy.pptx food analysis technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPT
Teaching material agriculture food technology
PPTX
Big Data Technologies - Introduction.pptx
PDF
cuic standard and advanced reporting.pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Approach and Philosophy of On baking technology
Group 1 Presentation -Planning and Decision Making .pptx
SOPHOS-XG Firewall Administrator PPT.pptx
1. Introduction to Computer Programming.pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf
20250228 LYD VKU AI Blended-Learning.pptx
A Presentation on Artificial Intelligence
Getting Started with Data Integration: FME Form 101
Encapsulation_ Review paper, used for researhc scholars
Advanced methodologies resolving dimensionality complications for autism neur...
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Empathic Computing: Creating Shared Understanding
Spectroscopy.pptx food analysis technology
Machine learning based COVID-19 study performance prediction
Network Security Unit 5.pdf for BCA BBA.
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Teaching material agriculture food technology
Big Data Technologies - Introduction.pptx
cuic standard and advanced reporting.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
Approach and Philosophy of On baking technology

01 graphical models

  • 1. Graphical Models Factor Graphs Test-time Inference Training Part 2: Introduction to Graphical Models Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 Sebastian Nowozin and Christoph H. Lampert Part 2: Introduction to Graphical Models
  • 2. Graphical Models Factor Graphs Test-time Inference Training Graphical Models Introduction Model: relating observations x to quantities of interest y f Example 1: given RGB image x, infer depth y for each pixel Example 2: given RGB image x, infer X Y presence and positions y of all objects f :X →Y shown Sebastian Nowozin and Christoph H. Lampert Part 2: Introduction to Graphical Models
  • 3. Graphical Models Factor Graphs Test-time Inference Training Graphical Models Introduction Model: relating observations x to quantities of interest y f Example 1: given RGB image x, infer depth y for each pixel Example 2: given RGB image x, infer X Y presence and positions y of all objects f :X →Y shown X : image, Y: object annotations Sebastian Nowozin and Christoph H. Lampert Part 2: Introduction to Graphical Models
  • 4. Graphical Models Factor Graphs Test-time Inference Training Graphical Models Introduction General case: mapping x ∈ X to y ∈ Y Graphical models are a concise language to define this mapping x Mapping can be ambiguous: f (x) measurement noise, lack of X Y well-posedness (e.g. occlusions) f :X →Y Probabilistic graphical models: define form p(y |x) or p(x, y ) for all y ∈ Y Sebastian Nowozin and Christoph H. Lampert Part 2: Introduction to Graphical Models
  • 5. Graphical Models Factor Graphs Test-time Inference Training Graphical Models Introduction General case: mapping x ∈ X to y ∈ Y Graphical models are a concise ? language to define this mapping x Mapping can be ambiguous: ? measurement noise, lack of X Y well-posedness (e.g. occlusions) p(Y |X = x) Probabilistic graphical models: define form p(y |x) or p(x, y ) for all y ∈ Y Sebastian Nowozin and Christoph H. Lampert Part 2: Introduction to Graphical Models
  • 6–7. Graphical Models
    A graphical model defines a family of probability distributions over a set of random variables, by means of a graph, so that the random variables satisfy the conditional independence assumptions encoded in the graph.
    Popular classes of graphical models:
    - Undirected graphical models (Markov random fields)
    - Directed graphical models (Bayesian networks)
    - Factor graphs
    - Others: chain graphs, influence diagrams, etc.
  • 8–9. Bayesian Networks
    Graph: G = (V, E), E ⊂ V × V, directed acyclic
    Variable domains Y_i
    Factorization over distributions, by conditioning on parent nodes:
        p(Y = y) = ∏_{i ∈ V} p(y_i | y_{pa_G(i)})
    Example (figure: a simple Bayes net with edges Y_i → Y_k, Y_j → Y_k, Y_k → Y_l):
        p(Y = y) = p(Y_l = y_l | Y_k = y_k) p(Y_k = y_k | Y_i = y_i, Y_j = y_j) p(Y_i = y_i) p(Y_j = y_j)
    The factorization defines a family of distributions.
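
To make the factorization concrete, here is a minimal Python sketch (an assumed example with made-up CPT numbers, not taken from the slides) that evaluates the joint of the four-variable network above:

    # Bayes net p(y) = p(yi) p(yj) p(yk|yi,yj) p(yl|yk), binary variables.
    from itertools import product

    p_i = {0: 0.6, 1: 0.4}                                      # p(Yi)
    p_j = {0: 0.7, 1: 0.3}                                      # p(Yj)
    p_k = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
           (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}}  # p(Yk|Yi,Yj)
    p_l = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}            # p(Yl|Yk)

    def joint(yi, yj, yk, yl):
        # Product of the local conditionals -- the Bayes-net factorization.
        return p_i[yi] * p_j[yj] * p_k[(yi, yj)][yk] * p_l[yk][yl]

    # The joint sums to one over all 2^4 configurations, no explicit Z needed.
    assert abs(sum(joint(*y) for y in product([0, 1], repeat=4)) - 1.0) < 1e-12
    print(joint(1, 0, 1, 1))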
  • 10–11. Undirected Graphical Models
    = Markov random field (MRF) = Markov network
    Graph: G = (V, E), E ⊂ V × V, undirected, no self-edges
    Variable domains Y_i
    Factorization over potentials ψ at the cliques:
        p(y) = (1/Z) ∏_{C ∈ C(G)} ψ_C(y_C)
    Normalizing constant Z = Σ_{y ∈ Y} ∏_{C ∈ C(G)} ψ_C(y_C)
    Example (figure: a simple chain MRF over Y_i, Y_j, Y_k):
        p(y) = (1/Z) ψ_i(y_i) ψ_j(y_j) ψ_k(y_k) ψ_{i,j}(y_i, y_j) ψ_{j,k}(y_j, y_k)
  • 12. Example 1
    Cliques C(G): the vertex sets V′ ⊆ V that are fully connected, i.e. E ∩ (V′ × V′) contains every pair of distinct vertices of V′
    For the chain over Y_i, Y_j, Y_k: C(G) = {{i}, {i,j}, {j}, {j,k}, {k}}
        p(y) = (1/Z) ψ_i(y_i) ψ_j(y_j) ψ_k(y_k) ψ_{i,j}(y_i, y_j) ψ_{j,k}(y_j, y_k)
  • 13. Example 2
    Fully connected graph over Y_i, Y_j, Y_k, Y_l: here C(G) = 2^V, all subsets of V are cliques
        p(y) = (1/Z) ∏_{A ∈ 2^{{i,j,k,l}}} ψ_A(y_A)
  • 14–15. Factor Graphs
    Graph: G = (V, F, E), E ⊆ V × F, with variable nodes V, factor nodes F, and edges E between variable and factor nodes
    Scope of a factor: N(F) = {i ∈ V : (i, F) ∈ E}
    Variable domains Y_i
    Factorization over potentials ψ at the factors:
        p(y) = (1/Z) ∏_{F ∈ F} ψ_F(y_{N(F)})
    Normalizing constant Z = Σ_{y ∈ Y} ∏_{F ∈ F} ψ_F(y_{N(F)})
    (figure: a factor graph over Y_i, Y_j, Y_k, Y_l)
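
The factor-graph factorization is easy to state in code. A minimal Python sketch (an assumed example, not from the slides): factors are (scope, potential) pairs over binary variables, and Z is computed by brute-force enumeration, which is feasible only for tiny models:

    from itertools import product

    variables = ["i", "j", "k", "l"]
    # Each factor: (scope, potential psi_F evaluated on the scoped values).
    factors = [
        (("i", "j"), lambda yi, yj: 2.0 if yi == yj else 1.0),
        (("j", "k"), lambda yj, yk: 2.0 if yj == yk else 1.0),
        (("k", "l"), lambda yk, yl: 2.0 if yk == yl else 1.0),
    ]

    def unnormalized(y):  # y: dict mapping variable name -> value
        p = 1.0
        for scope, psi in factors:
            p *= psi(*(y[v] for v in scope))
        return p

    Z = sum(unnormalized(dict(zip(variables, ys)))
            for ys in product([0, 1], repeat=len(variables)))

    def p(y):
        return unnormalized(y) / Z

    print(p({"i": 0, "j": 0, "k": 0, "l": 0}))  # a smooth, high-probability labeling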
  • 16. Why factor graphs?
    (figure: three model representations over Y_i, Y_j, Y_k, Y_l, ending with the factor graph)
    Factor graphs are explicit about the factorization
    Hence, easier to work with
    Universal (just like MRFs and Bayesian networks)
  • 17. Capacity
    (figure: two factor graphs over Y_i, Y_j, Y_k, Y_l with different sets of factors)
    A factor graph defines a family of distributions
    Some families are larger than others
  • 18–19. Four remaining pieces
    1. Conditional distributions (CRFs)
    2. Parameterization
    3. Test-time inference
    4. Learning the model from training data
  • 20–22. Conditional Distributions
    We have discussed p(y); how do we define p(y|x)?
    The potentials become functions of x_{N(F)}, and the partition function depends on x
    Conditional random fields (CRFs): x is not part of the probability model, i.e. it is not treated as a random variable
        p(y) = (1/Z) ∏_{F ∈ F} ψ_F(y_{N(F)})
        p(y|x) = (1/Z(x)) ∏_{F ∈ F} ψ_F(y_{N(F)}; x_{N(F)})
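
A minimal Python sketch of the change (an assumed example): the potentials now also read the observation x, so the partition function Z(x) must be recomputed for every input:

    from itertools import product
    import math

    def unary(yi, xi, w=1.5):
        # Data-dependent potential: pulls y_i towards the sign of feature x_i.
        return math.exp(w * xi if yi == 1 else -w * xi)

    def pairwise(yi, yj, w=0.8):
        # Data-independent smoothness potential.
        return math.exp(w if yi == yj else -w)

    def p_cond(y, x):
        # p(y|x) for a 3-variable chain CRF, normalized by brute force.
        def score(yy):
            s = unary(yy[0], x[0]) * unary(yy[1], x[1]) * unary(yy[2], x[2])
            return s * pairwise(yy[0], yy[1]) * pairwise(yy[1], yy[2])
        Z_x = sum(score(yy) for yy in product([0, 1], repeat=3))
        return score(y) / Z_x

    print(p_cond((1, 1, 0), x=(0.9, 0.2, -1.1)))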
  • 23–25. Potentials and Energy Functions
    For each factor F ∈ F: Y_F = ×_{i ∈ N(F)} Y_i and an energy E_F : Y_{N(F)} → R
    Potentials and energies (assume ψ_F(y_F) > 0):
        ψ_F(y_F) = exp(−E_F(y_F))  and  E_F(y_F) = −log ψ_F(y_F)
    Then p(y) can be written as
        p(Y = y) = (1/Z) ∏_{F ∈ F} ψ_F(y_F) = (1/Z) exp(−Σ_{F ∈ F} E_F(y_F))
    Hence, p(y) is completely determined by the energy function E(y) = Σ_{F ∈ F} E_F(y_F)
  • 26. Energy Minimization
        argmax_{y ∈ Y} p(Y = y) = argmax_{y ∈ Y} (1/Z) exp(−Σ_{F ∈ F} E_F(y_F))
                                = argmax_{y ∈ Y} exp(−Σ_{F ∈ F} E_F(y_F))
                                = argmax_{y ∈ Y} −Σ_{F ∈ F} E_F(y_F)
                                = argmin_{y ∈ Y} Σ_{F ∈ F} E_F(y_F)
                                = argmin_{y ∈ Y} E(y)
    Energy minimization can be interpreted as solving for the most likely state of some factor graph model
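
The chain of identities above says that MAP prediction is energy minimization. A minimal Python sketch (an assumed example with made-up energies): brute-force minimization over a tiny chain, feasible because Y is small:

    from itertools import product

    def E(y):
        # Unary terms prefer label 1 at the two ends; pairwise terms
        # penalize disagreeing neighbours (made-up numbers).
        unary = sum(-1.0 for yi in (y[0], y[2]) if yi == 1)
        pair = sum(0.5 for a, b in ((0, 1), (1, 2)) if y[a] != y[b])
        return unary + pair

    y_map = min(product([0, 1], repeat=3), key=E)
    print(y_map, E(y_map))  # the minimizer of E is also the argmax of p(y)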
  • 27–28. Parameterization
    Factor graphs define a family of distributions
    Parameterization: identifying individual members of the family by parameters w
    (figure: the family of distributions, with individual members p_{w1}, p_{w2} indexed by w)
  • 29. Example: Parameterization
    Image segmentation model
    Pairwise "Potts" energy function E_F(y_i, y_j; w_1), with E_F : {0,1} × {0,1} × R → R:
        E_F(0, 0; w_1) = E_F(1, 1; w_1) = 0
        E_F(0, 1; w_1) = E_F(1, 0; w_1) = w_1
  • 30. Example: Parameterization (cont)
    Image segmentation model
    Unary energy function E_F(y_i; x, w), with E_F : {0,1} × X × R^({0,1}×D) → R:
        E_F(0; x, w) = ⟨w(0), ψ_F(x)⟩
        E_F(1; x, w) = ⟨w(1), ψ_F(x)⟩
    Features ψ_F : X → R^D, e.g. image filters
  • 31–32. Example: Parameterization (cont)
    (figure: grid model with unary energies ⟨w(0), ψ_F(x)⟩ and ⟨w(1), ψ_F(x)⟩ at each node, and the Potts table [0, w_1; w_1, 0] on each edge)
    Total number of parameters: D + D + 1
    Parameters are shared between factors, but the energies differ because of the different ψ_F(x)
    General form, linear in w: E_F(y_F; x_F, w) = ⟨w(y_F), ψ_F(x_F)⟩
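
A minimal Python sketch of this parameterization (an assumed example: a 1-D "segmentation" with D = 2 features per pixel, shared unary weights w(0), w(1) and one Potts weight, so 2D + 1 parameters in total):

    import numpy as np
    from itertools import product

    D = 2
    w_unary = np.array([[1.0, 0.1],     # w(0): weight vector for label 0
                        [-1.0, -0.1]])  # w(1): weight vector for label 1
    w_pair = 0.7                        # Potts weight w_1

    def features(x):
        # psi_F(x): per-pixel features, here the intensity and a bias term.
        return np.stack([x, np.ones_like(x)], axis=1)  # shape (n_pixels, D)

    def energy(y, x):
        psi = features(x)
        e = sum(w_unary[y[i]] @ psi[i] for i in range(len(y)))           # unaries
        e += sum(w_pair for i in range(len(y) - 1) if y[i] != y[i + 1])  # Potts
        return e

    x = np.array([0.9, 0.8, -0.7, -1.0])  # made-up pixel intensities
    y_map = min(product([0, 1], repeat=len(x)), key=lambda y: energy(y, x))
    print(y_map)  # bright pixels take label 1, dark pixels label 0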
  • 33. Making Predictions
    Making predictions: given x ∈ X, predict y ∈ Y (or learn a function f : X → Y)
    How do we measure the quality of a prediction?
  • 34. Loss function
    Define a loss function ∆ : Y × Y → R+ so that ∆(y, y*) measures the loss incurred by predicting y when y* is true.
    The loss function is application dependent
  • 35–37. Test-time Inference
    Loss function ∆(y, f(x)): correct label y, prediction f(x), with ∆ : Y × Y → R
    True joint distribution d(X, Y) with true conditional d(y|x); model distribution p(y|x)
    Expected loss as the quality of a prediction f:
        R_∆^f(x) = E_{y ∼ d(y|x)} [∆(y, f(x))] = Σ_{y ∈ Y} d(y|x) ∆(y, f(x))
                 ≈ E_{y ∼ p(y|x;w)} [∆(y, f(x))]
    assuming that p(y|x; w) ≈ d(y|x)
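
The expected loss also tells us which prediction to make: choose the y that minimizes it under the model. A minimal Python sketch (an assumed example with a made-up posterior), enumerating both the candidate predictions and the posterior:

    from itertools import product

    # Made-up posterior p(y|x) over three binary variables (normalized below).
    posterior = {y: 1.0 for y in product([0, 1], repeat=3)}
    posterior[(1, 1, 0)] = 5.0
    posterior[(1, 1, 1)] = 4.0
    Z = sum(posterior.values())
    posterior = {y: v / Z for y, v in posterior.items()}

    def hamming(y, y_star):
        return sum(a != b for a, b in zip(y, y_star)) / len(y)

    def expected_loss(y_pred, loss):
        return sum(p * loss(y_true, y_pred) for y_true, p in posterior.items())

    y_best = min(posterior, key=lambda y: expected_loss(y, hamming))
    print(y_best, expected_loss(y_best, hamming))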
  • 38–39. Example 1: 0/1 loss
    Loss 0 iff perfectly predicted, 1 otherwise:
        ∆_{0/1}(y, y*) = I(y ≠ y*) = 0 if y = y*, and 1 otherwise
    Plugging it in:
        y* := argmin_{y ∈ Y} E_{y′ ∼ p(y′|x)} [∆_{0/1}(y′, y)] = argmax_{y ∈ Y} p(y|x) = argmin_{y ∈ Y} E(y, x)
    Minimizing the expected 0/1 loss → MAP prediction (energy minimization)
  • 40–41. Example 2: Hamming loss
    Count the fraction of mislabeled variables:
        ∆_H(y, y*) = (1/|V|) Σ_{i ∈ V} I(y_i ≠ y_i*)
    Plugging it in:
        y* := argmin_{y ∈ Y} E_{y′ ∼ p(y′|x)} [∆_H(y′, y)], with components y_i* = argmax_{y_i ∈ Y_i} p(y_i|x) for each i ∈ V
    Minimizing the expected Hamming loss → maximum posterior marginal (MPM, Max-Marg) prediction
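
MPM can differ from MAP: it maximizes each variable marginal p(y_i|x) separately rather than the joint posterior. A minimal Python sketch (an assumed example with a made-up posterior) where the two predictions disagree:

    posterior = {(0, 0): 0.4, (1, 1): 0.3, (1, 0): 0.3, (0, 1): 0.0}

    y_map = max(posterior, key=posterior.get)  # most probable joint labeling

    def marginal(i, v):
        return sum(p for y, p in posterior.items() if y[i] == v)

    y_mpm = tuple(max((0, 1), key=lambda v: marginal(i, v)) for i in range(2))
    print("MAP:", y_map)  # (0, 0)
    print("MPM:", y_mpm)  # (1, 0), since p(y0=1) = 0.6 and p(y1=0) = 0.7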
  • 42–43. Example 3: Squared error
    Assume a vector space on Y_i (pixel intensities, optical flow vectors, etc.). Sum of squared errors:
        ∆_Q(y, y*) = (1/|V|) Σ_{i ∈ V} ‖y_i − y_i*‖²
    Plugging it in:
        y* := argmin_{y ∈ Y} E_{y′ ∼ p(y′|x)} [∆_Q(y′, y)], with components y_i* = Σ_{y_i ∈ Y_i} p(y_i|x) y_i
    Minimizing the expected squared error → minimum mean squared error (MMSE) prediction
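
The MMSE prediction is just the per-variable posterior mean. A minimal Python sketch (an assumed example with a made-up marginal over quantized intensities):

    marginal = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}  # p(y_i|x) over Y_i = {0,1,2,3}
    y_mmse = sum(p * v for v, p in marginal.items())
    print(y_mmse)  # 1.9 -- the posterior mean need not be an element of Y_i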
  • 44. Inference Task: Maximum A Posteriori (MAP) Inference
    Definition (Maximum A Posteriori (MAP) Inference). Given a factor graph, a parameterization, a weight vector w, and an observation x, find
        y* = argmax_{y ∈ Y} p(Y = y | x, w) = argmin_{y ∈ Y} E(y; x, w)
  • 45. Inference Task: Probabilistic Inference
    Definition (Probabilistic Inference). Given a factor graph, a parameterization, a weight vector w, and an observation x, find
        log Z(x, w) = log Σ_{y ∈ Y} exp(−E(y; x, w)),
        µ_F(y_F) = p(Y_F = y_F | x, w), for all F ∈ F and all y_F ∈ Y_F
    This typically includes the variable marginals µ_i(y_i) = p(y_i | x, w)
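
A minimal Python sketch of both quantities (an assumed example, computed by brute-force enumeration; realistic models need message passing or approximate inference instead):

    from itertools import product
    import math

    def E(y):
        # Pairwise energies on a 3-variable chain (made-up numbers).
        return sum(0.8 for a, b in ((0, 1), (1, 2)) if y[a] != y[b])

    configs = list(product([0, 1], repeat=3))
    Z = sum(math.exp(-E(y)) for y in configs)
    log_Z = math.log(Z)

    # Factor marginal mu_F(y_F) for the pairwise factor on variables (0, 1).
    mu = {y01: sum(math.exp(-E(y)) for y in configs if y[:2] == y01) / Z
          for y01 in product([0, 1], repeat=2)}

    print(log_Z, mu)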
  • 46. Example: Man-made structure detection
    (figure: left, input image x; middle, ground truth labeling on 16-by-16 pixel blocks; right, factor graph model with factors ψ_i^1 and ψ_i^2 connecting X_i and Y_i and pairwise factors ψ_{i,k}^3 between neighbouring Y_i, Y_k)
    Features: gradient and color histograms
    Model parameters estimated from ≈ 60 training images
  • 47. Example: Man-made structure detection (cont)
    Left: input image x
    Middle (probabilistic inference): visualization of the variable marginals p(y_i = “manmade” | x, w)
    Right (MAP inference): joint MAP labeling y* = argmax_{y ∈ Y} p(y | x, w)
  • 48–49. Training the Model
    What can be learned?
    - Model structure: the factors
    - Model variables: the observed variables are fixed, but we can add unobserved variables
    - Factor energies: the parameters
  • 50. Training: Overview
    Assume a fully observed, independent and identically distributed (iid) sample set {(x^n, y^n)}_{n=1,…,N} with (x^n, y^n) ∼ d(X, Y)
    Goal: predict well
    Alternative goal: first model d(y|x) well by p(y|x, w), then predict by minimizing the expected loss
  • 51–52. Probabilistic Learning
    Problem (Probabilistic Parameter Learning). Let d(y|x) be the (unknown) conditional distribution of labels for the problem to be solved. For a parameterized conditional distribution p(y|x, w) with parameters w ∈ R^D, probabilistic parameter learning is the task of finding a point estimate of the parameter w* that makes p(y|x, w*) closest to d(y|x).
    We will discuss probabilistic parameter learning in detail.
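
One standard way to make "closest" precise is maximum conditional likelihood, which minimizes the empirical KL divergence from d(y|x) to p(y|x, w). A minimal Python sketch (an assumed example, not the training procedure discussed later): a single scalar weight of a two-variable CRF is fit by grid search on the negative log-likelihood, with Z(x) computed by enumeration:

    from itertools import product
    import math

    # Made-up training pairs: x holds one feature per variable, y the labels.
    data = [((0.8, 0.9), (1, 1)), ((-0.7, 0.6), (0, 1)),
            ((-0.9, -0.8), (0, 0)), ((0.5, -0.4), (0, 0))]  # last pair is noisy

    def E(y, x, w):
        e = sum(-w * (2 * y[i] - 1) * x[i] for i in range(2))  # unary terms
        e += 0.5 if y[0] != y[1] else 0.0                      # fixed Potts term
        return e

    def nll(w):
        total = 0.0
        for x, y in data:
            Zx = sum(math.exp(-E(yy, x, w)) for yy in product([0, 1], repeat=2))
            total += E(y, x, w) + math.log(Zx)  # -log p(y|x, w)
        return total

    w_star = min((k / 10 for k in range(51)), key=nll)  # grid search on [0, 5]
    print(w_star, nll(w_star))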
  • 53–54. Loss-Minimizing Parameter Learning
    Problem (Loss-Minimizing Parameter Learning). Let d(x, y) be the (unknown) distribution of data and labels, and let ∆ : Y × Y → R be a loss function. Loss-minimizing parameter learning is the task of finding a parameter value w* such that the expected prediction risk
        E_{(x,y) ∼ d(x,y)} [∆(y, f_p(x))]
    is as small as possible, where f_p(x) = argmax_{y ∈ Y} p(y|x, w*).
    Requires the loss function at training time
    Directly learns a prediction function f_p(x)