Journal Club: Paper Introduction
Modeling Natural Images using Gated MRF
Ranzato, M. et al.
2014/05/30
shouno@uec.ac.jp
What we want to do
•  Using as much of the same machinery as possible, handle:
– learning image representations
– building image classifiers
– denoising
– inferring occluded scenes, etc.
•  Tools
– a probabilistic machine-learning model:
an MRF extension (gated RBM) + a Deep Belief Net (DBN)
Review of image representation
(illustrated with an image consisting of two pixels: samples vs. the model's representation)
Toy Problem: Reconstructing Image Patches
(two preprocessing settings: patches zero-meaned vs. whole images zero-meaned)
Required latent variables
•  Mean
– not a heavy-tailed structure
•  Covariance (or rather its inverse, the precision matrix)
– heavy-tailed structure (strong correlation between nearby pixels)
•  We need latent variables that can represent both the mean and the covariance
→ Gated MRF (mcRBM, mPoT)
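As a quick, informal check of the heavy-tail claim (my own sketch, not from the slides): the excess kurtosis of neighboring-pixel differences is near 0 for Gaussian noise but strongly positive for natural images.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
img = rng.normal(size=(256, 256))      # Gaussian stand-in; load a real photo
                                       # to see the heavy-tailed case
diffs = np.diff(img, axis=1).ravel()   # horizontal nearest-neighbor differences
print(kurtosis(diffs))                 # ~0 here; natural images give values >> 0
```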
Gated MRF (Markov Random Field)
•  Mean Units: encoders of the pixel values
•  Precision Units: encoders of the inter-pixel correlations
[Fig. 3 schematic: precision units — factors — mean units — pixels, connected through the weight matrices W, C, P; the full caption is on the next slide.]
As described in sec. 2.2, we want the conditional distribution over the pixels to be a Gaussian with not only its covariance but also its mean depending on the states of the latent variables. Since the product of a full covariance Gaussian (like the one in eq. 5) with a spherical non-zero mean Gaussian is a non-zero mean full covariance Gaussian, we simply add the energy function of cRBM in eq. 3 to the energy function of a GRBM [29], yielding:

$$E(\mathbf{x}, \mathbf{h}^m, \mathbf{h}^p) = \frac{1}{2}\mathbf{x}^\top C\,\mathrm{diag}(P\mathbf{h}^p)\,C^\top\mathbf{x} - \mathbf{b}^{p\top}\mathbf{h}^p + \frac{1}{2}\mathbf{x}^\top\mathbf{x} - \mathbf{h}^{m\top}W^\top\mathbf{x} - \mathbf{b}^{m\top}\mathbf{h}^m - \mathbf{b}^{x\top}\mathbf{x} \quad (6)$$

where $\mathbf{h}^m \in \{0,1\}^M$ are called "mean" latent variables because they contribute to control the mean of the conditional distribution over the input:

$$p(\mathbf{x} \mid \mathbf{h}^m, \mathbf{h}^p) = \mathcal{N}\!\left(\Sigma(W\mathbf{h}^m + \mathbf{b}^x),\ \Sigma\right), \quad (7)
\qquad \text{with } \Sigma^{-1} = C\,\mathrm{diag}(P\mathbf{h}^p)\,C^\top + I$$

where $I$ is the identity matrix, $W \in \mathbb{R}^{D \times M}$ is a matrix of trainable parameters and $\mathbf{b}^x \in \mathbb{R}^D$ is a vector of trainable biases.
[Fig. 4: A)–D) compare image-specific conditional Gaussians inferred using only mean hiddens against ones using precision hiddens, showing how the latter capture pair-wise dependencies between pixels.]
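To make eqs. (6)–(7) concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code; all shapes and parameter values are arbitrary) that builds the conditional Gaussian $p(\mathbf{x} \mid \mathbf{h}^m, \mathbf{h}^p)$ and samples a reconstruction from it:

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, N, M = 16, 32, 8, 12        # pixels, factors, precision units, mean units
C = rng.normal(0, 0.1, (D, F))    # pixel-to-factor filters
P = rng.random((F, N))            # factor-to-precision-unit pooling (non-negative)
W = rng.normal(0, 0.1, (D, M))    # mean filters
bx = np.zeros(D)                  # visible biases

h_p = rng.integers(0, 2, N)       # binary precision units
h_m = rng.integers(0, 2, M)       # binary mean units

# eq. (7): Sigma^{-1} = C diag(P h^p) C^T + I
Sigma = np.linalg.inv(C @ np.diag(P @ h_p) @ C.T + np.eye(D))
mu = Sigma @ (W @ h_m + bx)              # conditional mean: Sigma (W h^m + b^x)
x = rng.multivariate_normal(mu, Sigma)   # one reconstruction sample
```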
Demonstration of the Gated MRF
•  Fig. 4A: original image (strong correlation in the lower part)
•  Fig. 4B–D: top row, reconstruction by mRBM;
bottom row, reconstruction by mcRBM
•  Fig. 4C: once the observations are given, mcRBM reproduces the correlation
•  Fig. 4D: even with the phase inverted, the interpretation is (probably) still plausible
Fig. 3. Graphical model representation (with only three input variables): There are two sets of latent variables (the mean and the precision units) that are conditionally independent given the input pixels and a set of deterministic factor nodes that connect triplets of variables (pairs of input variables and one precision unit).
(annotation on the figure: strong correlation)
Bases learned from natural images
•  Precision (C) filters: a topographic-map-like representation
More quantitatively, we have compared the models in terms of their log-probability on a dataset of test image patches using a technique called annealed importance sampling [50]. Tab. 1 shows the improvements of mPoT over both GRBM and PoT. However, the estimated log-probability is lower than that of a mixture of Gaussians (MoG) with diagonal covariance and the same number of parameters as the other models. This is consistent with previous work by Theis et al. We interpret this result with caution since the estimates for GRBM, PoT and mPoT can have a large variance due to noise introduced by the samplers, while the log-probability of the MoG is calculated exactly. This conjecture is confirmed by the results reported in the rightmost column, showing the exact log-probability ratio between the same test images and Gaussian noise with the same covariance. This ratio is indeed larger for mPoT than MoG, despite the inaccuracies in the former estimation. The table also reports results of mPoT models trained by using cheaper but less accurate approximations to the maximum likelihood gradient, namely Contrastive Divergence and Persistent Contrastive Divergence …
Evaluating reconstructions by sampling
•  original images
•  mPoT (the proposed model)
•  GRBM
•  PoT
Fig. 8. Top: Precision filters (matrix C) of size 16x16 pixels learned on color image patches of the same size. Matrix P was initialized with a local connectivity inducing a two dimensional topographic map. This makes nearby filters in the map learn similar features. Bottom left: Random subset of mean filters (matrix W). Bottom right (from top to bottom): independent samples drawn from the training data, mPoT, GRBM and PoT.

TABLE 1
Log-probability estimates of test natural images x, and exact log-probability ratios between the same test images x and random images n.

Model        log p(x)   log p(x)/p(n)
MoG            -85           68
GRBM FPCD     -152           -3
PoT FPCD      -101           65
mPoT CD       -109           82
mPoT PCD       -92           98
mPoT FPCD      -94          102
[Truncated paper excerpt: the sampling experiment was repeated with the convolutional extension of mPoT, trained on large patches cropped at random locations from natural images, using filters with different spatial phases at a diagonal offset of two pixels. Comparing samples, mPoT produces uniform regions separated by sharp edges with a sort of long-range structure, though rather artificial and repetitive; deep layers are then stacked on top of mPoT.]
Scaling up to high resolution
•  As it stands, the model is just a processor for image patches
•  Extending the graphical model to the whole image with full connections (A) is computationally too expensive
•  Instead, split into blocks and handle it with something convolution-like (D):
tiled convolution (C, E)
Visually, it looks like this:
Fig. 7. A toy illustration of how units are combined across layers. Squares are filters, gray planes are channels and circles are latent variables. Left: illustration of how input channels are combined into a single output channel. Input variables at the same spatial location across different channels contribute to determine the state of the same latent variable. Input units falling into different tiles (without overlap) determine the state of nearby units in the hidden layer (here, we have only two spatial locations). Right: illustration of how filters that overlap with an offset contribute to hidden units that are at the same output spatial location but in different hidden channels. In practice, the deep model combines these two methods at each layer.
(Ranzato et al., NIPS 2010)
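To illustrate the tiling idea, here is a toy sketch (my own, not the authors' code): a tiled "convolution" applies each filter with a stride equal to the filter size, so weights are not shared within a tile, and several copies with different diagonal offsets restore coverage of the image.

```python
import numpy as np

def tiled_conv(image, filters, offsets):
    """Toy tiled convolution: each filter is applied on a non-overlapping grid
    (stride = filter size), once per spatial offset, yielding one hidden
    channel per (filter, offset) pair."""
    H, W = image.shape
    channels = []
    for f in filters:
        k = f.shape[0]
        for dy, dx in offsets:
            h = [[np.sum(f * image[y:y + k, x:x + k])
                  for x in range(dx, W - k + 1, k)]
                 for y in range(dy, H - k + 1, k)]
            channels.append(np.array(h))
    return channels

img = np.random.default_rng(0).normal(size=(16, 16))
filt = [np.ones((4, 4)) / 16.0]                          # a single averaging filter
maps = tiled_conv(img, filt, offsets=[(0, 0), (2, 2)])   # diagonal offset of two
print([m.shape for m in maps])                           # [(4, 4), (3, 3)]
```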
Going deep
•  The latent variables are binary anyway,
so surely we can make this deep?
[Figure: "DBNs for Classification" (Salakhutdinov, CVPR 2012 tutorial): a stack of RBMs (500, 500, 2000 units with a 10-way softmax output) is pretrained layer by layer, unrolled, and fine-tuned.]
We are here → the mcRBM/mPoT plays the role of the bottom RBM.
Salakhutdinov CVPR 2012 Tutorial
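A generic sketch of the greedy layer-wise pretraining behind that figure (my own illustration, not the authors' code; a plain Bernoulli RBM with a mean-field CD-1 update, biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=5, lr=0.01):
    """Bernoulli-Bernoulli RBM trained with CD-1, using probabilities in
    place of samples for brevity."""
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)            # up pass
        v1 = sigmoid(h0 @ W.T)          # reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W

# greedy stacking: each RBM models the hidden activities of the layer below
x = (rng.random((256, 64)) > 0.5).astype(float)   # stand-in for mPoT latents
weights, layer_input = [], x
for n_hidden in (128, 64):
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)        # feed activities upward
```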
Reconstruction by mPoT + DBN
•  PoT
•  mPoT
•  DBN (2 layers) + mPoT
•  Time course of DBN+mPoT
TABLE 2
Denoising performance using σ = 20 (PSNR = 22.1 dB).

Model          Barb.  Boats  Fgpt.  House  Lena   Peprs.
mPoT           28.0   30.0   27.6   32.2   31.9   30.7
mPoT+A         29.2   30.2   28.4   32.4   32.0   30.7
mPoT+A+NLM     30.7   30.4   28.6   32.9   32.4   31.0
FoE [11]       28.3   29.8   -      32.3   31.9   30.6
NLM [55]       30.5   29.8   27.8   32.4   32.0   30.3
GSM [56]       30.3   30.4   28.6   32.4   32.7   30.3
BM3D [57]      31.8   30.9   -      33.8   33.1   31.3
LSSC [58]      31.6   30.9   28.8   34.2   32.9   31.4
6.2 Denoising
The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. We restore images by maximum a-posteriori (MAP) estimation. In the log domain, this amounts to solving the following optimization problem: $\arg\min_x \lambda\,\|y - x\|^2 + F(x;\theta)$, where $y$ is the observed noisy image, $F(x;\theta)$ is the mPoT energy function (see eq. 12), $\lambda$ is a hyper-parameter which is inversely proportional to the noise variance, and $x$ is an estimate of the clean image. In our experiments, the optimization is performed by gradient descent.
For images with repetitive texture, generic prior models usually offer only modest denoising performance compared …
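A minimal sketch of that MAP denoising loop (my own illustration; `grad_F` stands in for the gradient of the mPoT energy, which is not reproduced here, so a simple quadratic smoothness prior is used instead):

```python
import numpy as np

def denoise_map(y, grad_F, lam=0.05, step=0.1, n_iters=200):
    """MAP denoising by gradient descent on lam * ||y - x||^2 + F(x; theta)."""
    x = y.copy()                             # initialize at the noisy image
    for _ in range(n_iters):
        x -= step * (2.0 * lam * (x - y) + grad_F(x))
    return x

def grad_smooth(x):
    """Gradient of 0.5 * sum of squared neighbor differences (stand-in prior)."""
    g = np.zeros_like(x)
    g[1:] += x[1:] - x[:-1];  g[:-1] -= x[1:] - x[:-1]
    g[:, 1:] += x[:, 1:] - x[:, :-1];  g[:, :-1] -= x[:, 1:] - x[:, :-1]
    return g

rng = np.random.default_rng(0)
clean = np.zeros((8, 8)); clean[:, 4:] = 1.0         # toy step-edge image
y = clean + rng.normal(0, 20 / 255, clean.shape)     # sigma = 20 on a [0,1] scale
x_hat = denoise_map(y, grad_smooth)
```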
Application 1: Denoising
•  Restore noisy images with mPoT, mPoT + NLM, and so on
Fig. 10. Denoising of "Barbara" (detail). From left to right: original image, noisy image (σ = 20, PSNR=22.1dB), denoising of mPoT (28.0dB), denoising of mPoT adapted to this image (29.2dB) and denoising of adapted mPoT combined with non-local means (30.7dB).
… (over both baselines, mPoT+A and NLM). Second, on this
task a gated MRF with mean latent variables is no better
than a model that lacks them, like FoE [11] (which is a
convolutional version of PoT).¹⁴ Mean hiddens are crucial
when generating images in an unconstrained manner, but
are not needed for denoising since the observation y already
provides (noisy) mean intensities to the model (a similar
argument applies to inpainting). Finally, the performance
of the adapted model is still slightly worse than the best
denoising method.
Barbara + AWGN (σ = 20, PSNR = 22.1 dB)
mPoT (PSNR = 28.0 dB)
mPoT+A (PSNR = 29.2 dB)
mPoT+A+NLM (really…) (PSNR = 30.7 dB)
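For reference, the PSNR numbers quoted above follow the standard definition (sketch; assumes pixel values on a 0-255 scale):

```python
import numpy as np

def psnr(clean, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((np.asarray(clean, float) - np.asarray(restored, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# AWGN with sigma = 20 gives MSE ~ 400, hence ~22.1 dB, matching the slide
```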
Application 2: Scene Classification
•  The scene DB of Lazebnik et al. (2006)
•  mPoT + DBN (2 layers) + K-means → 81.2% accuracy
•  SIFT + SVM → 81.4% accuracy (Lazebnik '06)
•  I'll skip the rest
(the paper doesn't give much data here — that's my excuse)
Application 3: Object Recognition
•  Classification on the CIFAR10 dataset (Torralba et al. '08)
– Train: 5K/class, Test: 1K/class
6.3 Scene Classification
In this experiment, we use a classification task to com-
pare SIFT features with the features learned by adding
a second layer of Bernoulli latent variables that model
the distribution of latent variables of an mPoT generative
model. The task is to classify the natural scenes in the 15
scene dataset [2] into one of 15 categories. The method
of reference on this dataset was proposed by Lazebnik et
al. [2] and it can be summarized as follows: 1) densely
compute SIFT descriptors every 8 pixels on a regular grid,
2) perform K-Means clustering on the SIFT descriptors, 3)
compute histograms of cluster ids over regions of the image
at different locations and spatial scales, and 4) use an SVM
with an intersection kernel for classification.
We use a DBN with an mPoT front-end to mimic this
pipeline. We treat the expected value of the latent variables
as features that describe the input image. We extract first
and second layer features (using the model that produced
the generations in the bottom of fig. 9) from a regular
grid with a stride equal to 8 pixels. We apply K-Means
to learn a dictionary with 1024 prototypes and then assign
each feature to its closest prototype. We compute a spatial
pyramid with 2 levels for the first layer features ({h^m, h^p}) …
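A rough sketch of that bag-of-features pipeline (my own outline, not the authors' code; the feature arrays are random stand-ins for the DBN latent activities, and KMeans comes from scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_pyramid_histogram(ids, n_words):
    """Histograms of visual-word ids over a 2-level spatial pyramid.
    ids: (H, W) array of cluster assignments on the feature grid."""
    H, W = ids.shape
    hists = [np.bincount(ids.ravel(), minlength=n_words)]  # level 0: whole image
    for gy in range(2):                                    # level 1: 2x2 cells
        for gx in range(2):
            cell = ids[gy * H // 2:(gy + 1) * H // 2,
                       gx * W // 2:(gx + 1) * W // 2]
            hists.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(hists)

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 128))   # stand-in: features on a stride-8 grid
km = KMeans(n_clusters=1024, n_init=1).fit(train_feats)

grid_feats = rng.normal(size=(14, 14, 128))  # one image's feature grid
ids = km.predict(grid_feats.reshape(-1, 128)).reshape(14, 14)
descriptor = spatial_pyramid_histogram(ids, n_words=1024)   # feed to an SVM
```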
Fig. 11. Example of images in the CIFAR 10 dataset. Each column shows samples belonging to the same category.

TABLE 3
Test and training (in parentheses) recognition accuracy on the CIFAR 10 dataset. The numbers in italics are the feature dimensionality at each stage.

Method                                     Accuracy %
1) mean (GRBM): 11025                      59.7 (72.2)
2) cRBM (225 factors): 11025               63.6 (83.9)
3) cRBM (900 factors): 11025               64.7 (80.2)
4) mcRBM: 11025                            68.2 (83.1)
5) mcRBM-DBN (11025-8192)                  70.7 (85.4)
6) mcRBM-DBN (11025-8192-8192)             71.0 (83.6)
7) mcRBM-DBN (11025-8192-4096-1024-384)    59.8 (62.0)

These images were downloaded from the web and down-sampled to a very low resolution, just 32x32 pixels. The CIFAR 10 subset has ten object categories, namely airplane, car, bird, cat, deer, dog, frog, horse, ship, and truck. The training set has 5000 samples per class, the test set has 1000 samples per class. The low resolution and extreme …
TABLE 4
Test recognition accuracy on the CIFAR 10 dataset produced by different methods. Features are fed to a multinomial logistic regression classifier for recognition.

Method                                     Accuracy %
384 dimens. GIST                           54.7
10,000 linear random projections           36.0
10K GRBM(*), 1 layer, ZCA'd images         59.6
10K GRBM(*), 1 layer                       63.8
10K GRBM(*), 1 layer with fine-tuning      64.8
10K GRBM-DBN(*), 2 layers                  56.6
11025 mcRBM, 1 layer, PCA'd images         68.2
8192 mcRBM-DBN, 3 layers, PCA'd images     71.0
384 mcRBM-DBN, 5 layers, PCA'd images      59.8
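The classification stage behind Tables 3-4 is plain multinomial logistic regression on the extracted features; a sketch with scikit-learn (the feature arrays here are random stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 384))   # stand-in for mcRBM-DBN features
y_train = rng.integers(0, 10, 5000)      # the 10 CIFAR classes
X_test = rng.normal(size=(1000, 384))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
```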
Application 4: Face Image Generation and Recognition
•  Toronto Face Dataset (TFD, Susskind et al. '10)
– 48x48, 100K unlabeled, 4K labeled
•  mPoT + DBN (up to 4 layers) + multi-class logistic regression (softmax?)
to recognize the object category in the image. Since our
model is unsupervised, we train it on a set of two million
images from the TINY dataset that does not overlap with
the labeled CIFAR 10 subset in order to further improve
generalization [20], [63], [64]. In the default set up we
learn all parameters of the model, we use 81 filters in W
to encode the mean, 576 filters in C to encode covariance
constraints and we pool these filters into 144 hidden units
through matrix P. P is initialized with a two-dimensional
topography that takes 3x3 neighborhoods of filters with
a stride equal to 2. In total, at each location we extract 225 features. Therefore, we represent a 32x32 image with a 225x7x7=11025 dimensional descriptor.
Fig. 12. Top: Samples generated by a five-layer deep model
trained on faces. The top layer has 128 binary latent variables
and images have size 48x48 pixels. Bottom: comparison between
six samples from the model (top row) and the Euclidean distance
nearest neighbor images in the training set (bottom row).
TABLE 5
TFD: Facial expression classification accuracy using features trained without supervision.

Method          layer 1   layer 2   layer 3   layer 4
raw pixels      71.5      -         -         -
Gaussian SVM    76.2      -         -         -
Sparse Coding   74.6      -         -         -
Gabor PCA       80.2      -         -         -
GRBM            80.0      81.5      80.3      79.5
PoT             79.4      79.3      80.6      80.2
mPoT            81.6      82.1      82.5      82.4
Sampling from the 4-layer DBN (+ mPoT)
Comparison of generated images with the originals
Application 4: Face Generation and Recognition (cont. 1)
•  Generation when parts of the face are occluded
– top row: input; bottom row: generated result
[Figure: rows labeled ORIGINAL and TYPE 1-7, showing the seven synthetic occlusion masks applied to face images.]
Fig. 15. Expression recognition accuracy on the TFD dataset when both training and test labeled images are subject to 7 types of occlusion.
… RBMs with 1024 latent variables, and at the fifth layer we have an RBM with 128 hidden units. The deep model was trained entirely generatively on the unlabeled images without using any labeled instances [64]. The discriminative training consisted of training a linear multi-class logistic regression classifier on the top level representation without using back-propagation to jointly optimize the parameters across all layers.
Application 4: Face Generation and Recognition (cont. 2)
•  Generation with face parts occluded
– the contribution of each layer
– feeding the generated image back in as input (see the sketch after the captions below)
Fig. 13. Example of conditional generation performed by a four-
layer deep model trained on faces. Each column is a different
example (not used in the unsupervised training phase). The topmost
row shows some example images from the TFD dataset. The other
rows show the same images occluded by a synthetic mask (on the
top) and their restoration performed by the deep generative model
(on the bottom).
Fig. 14. An example of restoration of unseen images performed
by propagating the input up to first, second, third, fourth layer, and
again through the four layers and re-circulating the input through the
same model for ten times.
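The re-circulation in Fig. 14 amounts to repeated up-down passes through the layer stack with the observed pixels clamped; a schematic sketch (my own; `up[i]` and `down[i]` stand in for the per-layer inference and generation steps, which are not reproduced here):

```python
import numpy as np

def restore(x_occluded, mask, up, down, n_cycles=10):
    """Restore occluded pixels by re-circulating through a layer stack.
    up/down: lists of callables; up[i] maps layer i to i+1, down[i] maps
    layer i+1 back to i. mask is True where pixels are observed."""
    x = x_occluded.copy()
    for _ in range(n_cycles):
        h = x
        for f in up:                        # propagate up to the top layer
            h = f(h)
        for g in reversed(down):            # generate back down to the pixels
            h = g(h)
        x = np.where(mask, x_occluded, h)   # clamp the observed pixels
    return x
```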
Summary
•  With a gated MRF, handling the latent variables becomes easy:
– 2-body interactions → "mean"
– 3-body interactions → "precision"
•  A gated MRF tries to cover the input space with hyper-ellipses →
it looks usable with fairly good accuracy (a GRBM or PPCA could probably do it too, but those use hyper-spheres, which may be a problem when correlations are strong)
•  As a system:
– build models for the low level (the patch-processing part?) and the high level (the scaling-up part?), and
– connect the high-level part to a DBN
•  The combined system was then applied to
– object and scene recognition, and
– face generation and recognition, achieving results approaching the state of the art
Future work
•  Isn't the lack of exact maximum-likelihood estimation a drawback?
– MCMC is computationally expensive too…
•  Beyond that, the model apparently has to be adjusted per task and per compute budget
•  Textures do emerge, but they are still simple ones…
•  The gated MRF:
– leaves room to improve the form of the energy function
– allows the higher-level structure to be explored,
e.g. replacing all the RBMs with gated MRFs
– mixing parametric and non-parametric methods might be the way to go?
(did they actually say that?)
•  How about applying it to video?
  • 19. まとめ •  Gated  MRF 使うと  latent  の取り扱い楽だよ   –  2体相互作用→    “mean”   –  3体相互作用→  “precision”   •  Gated  MRF  は入力空間を  hyper-­‐ellipse  で覆おうとする→ 割と精度良く使えそう(GRBM,  PPCA  でも多分可だけど   hyper-­‐sphere なので相関強い時はちと問題かも)   •  システムとしては   –  Low  level  (patch処理のとこ?)  と  High  level    (高解像度化のと こ?)のモデル作って   –  High  level  の部分は DBN に繋げる,と.   •  で,まとめたシステムを   –  物体,シーン認識   –  顔生成と認識とかに使って,state-­‐of-­‐art  に迫る成績出した  
  • 20. Future  work •  厳密な最尤推定が出来ないのは欠点じゃなかろうか?   –  MCMC  も計算コスト高いので…   •  後はタスクと計算機コストでモデルを修正してくしかなさそ   •  確かにテクスチャとか出てくるけど,まだ単純なものだし…   •  Gated  MRF  は   –  エネルギー関数の形式を改良する余地があるよ   –  Higher  level  の構造を吟味することが出来るよ   全部 RBM  →  gated  MRF  とかにするとか   –  パラメトリックとノンパラをミックスするとかがいいんじゃない? (こんなこと言ってたっけ?)   •  Video  とかへの応用はどうだろう?