Journal Club: Paper Introduction
Modeling Natural Images using Gated MRF
Ranzato, M. et al.
2014/05/30
shouno@uec.ac.jp
What we want to do
•  Using as much of the same machinery as possible, handle:
– learning image representations
– building image classifiers
– denoising
– inferring occluded scenes, etc.
•  Tools
– a probabilistic machine-learning model:
an MRF extension (gated RBM) + a Deep Belief Net (DBN)
Review of image representation
(illustrated with an image consisting of two pixels: samples vs. the model's representation)
Toy Problem: Reconstructing Image Patches
(two preprocessing settings: patches zero-meaned vs. whole images zero-meaned)
Required latent variables
•  Mean
– not a heavy-tailed structure
•  Covariance (or rather its inverse, the precision matrix)
– heavy-tailed structure (strong correlation between nearby pixels)
•  We need latent variables that can represent both the mean and the covariance
→ Gated MRF (mcRBM, mPoT)
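As a quick, informal check of the heavy-tail claim (my own sketch, not from the slides): the excess kurtosis of neighboring-pixel differences is near 0 for Gaussian noise but strongly positive for natural images.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
img = rng.normal(size=(256, 256))      # Gaussian stand-in; load a real photo
                                       # to see the heavy-tailed case
diffs = np.diff(img, axis=1).ravel()   # horizontal nearest-neighbor differences
print(kurtosis(diffs))                 # ~0 here; natural images give values >> 0
```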
Gated MRF (Markov Random Field)
•  Mean Units: encoders of the pixel values
•  Precision Units: encoders of the inter-pixel correlations
[Fig. 3 schematic: precision units — factors — mean units — pixels, connected through the weight matrices W, C, P; the full caption is on the next slide.]
As described in sec. 2.2, we want the conditional distribution over the pixels to be a Gaussian with not only its covariance but also its mean depending on the states of the latent variables. Since the product of a full covariance Gaussian (like the one in eq. 5) with a spherical non-zero mean Gaussian is a non-zero mean full covariance Gaussian, we simply add the energy function of cRBM in eq. 3 to the energy function of a GRBM [29], yielding:

$$E(\mathbf{x}, \mathbf{h}^m, \mathbf{h}^p) = \frac{1}{2}\mathbf{x}^\top C\,\mathrm{diag}(P\mathbf{h}^p)\,C^\top\mathbf{x} - \mathbf{b}^{p\top}\mathbf{h}^p + \frac{1}{2}\mathbf{x}^\top\mathbf{x} - \mathbf{h}^{m\top}W^\top\mathbf{x} - \mathbf{b}^{m\top}\mathbf{h}^m - \mathbf{b}^{x\top}\mathbf{x} \quad (6)$$

where $\mathbf{h}^m \in \{0,1\}^M$ are called "mean" latent variables because they contribute to control the mean of the conditional distribution over the input:

$$p(\mathbf{x} \mid \mathbf{h}^m, \mathbf{h}^p) = \mathcal{N}\!\left(\Sigma(W\mathbf{h}^m + \mathbf{b}^x),\ \Sigma\right), \quad (7)
\qquad \text{with } \Sigma^{-1} = C\,\mathrm{diag}(P\mathbf{h}^p)\,C^\top + I$$

where $I$ is the identity matrix, $W \in \mathbb{R}^{D \times M}$ is a matrix of trainable parameters and $\mathbf{b}^x \in \mathbb{R}^D$ is a vector of trainable biases.
[Fig. 4: A)–D) compare image-specific conditional Gaussians inferred using only mean hiddens against ones using precision hiddens, showing how the latter capture pair-wise dependencies between pixels.]
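To make eqs. (6)–(7) concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code; all shapes and parameter values are arbitrary) that builds the conditional Gaussian $p(\mathbf{x} \mid \mathbf{h}^m, \mathbf{h}^p)$ and samples a reconstruction from it:

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, N, M = 16, 32, 8, 12        # pixels, factors, precision units, mean units
C = rng.normal(0, 0.1, (D, F))    # pixel-to-factor filters
P = rng.random((F, N))            # factor-to-precision-unit pooling (non-negative)
W = rng.normal(0, 0.1, (D, M))    # mean filters
bx = np.zeros(D)                  # visible biases

h_p = rng.integers(0, 2, N)       # binary precision units
h_m = rng.integers(0, 2, M)       # binary mean units

# eq. (7): Sigma^{-1} = C diag(P h^p) C^T + I
Sigma = np.linalg.inv(C @ np.diag(P @ h_p) @ C.T + np.eye(D))
mu = Sigma @ (W @ h_m + bx)              # conditional mean: Sigma (W h^m + b^x)
x = rng.multivariate_normal(mu, Sigma)   # one reconstruction sample
```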
Demonstration of the Gated MRF
•  Fig. 4A: original image (strong correlation in the lower part)
•  Fig. 4B–D: top row, reconstruction by mRBM;
bottom row, reconstruction by mcRBM
•  Fig. 4C: once the observations are given, mcRBM reproduces the correlation
•  Fig. 4D: even with the phase inverted, the interpretation is (probably) still plausible
Fig. 3. Graphical model representation (with only three input variables): There are two sets of latent variables (the mean and the precision units) that are conditionally independent given the input pixels and a set of deterministic factor nodes that connect triplets of variables (pairs of input variables and one precision unit).
(annotation on the figure: strong correlation)
Bases learned from natural images
•  Precision (C) filters: a topographic-map-like representation
More quantitatively, we have compared the models in terms of their log-probability on a dataset of test image patches using a technique called annealed importance sampling [50]. Tab. 1 shows the improvements of mPoT over both GRBM and PoT. However, the estimated log-probability is lower than that of a mixture of Gaussians (MoG) with diagonal covariance and the same number of parameters as the other models. This is consistent with previous work by Theis et al. We interpret this result with caution since the estimates for GRBM, PoT and mPoT can have a large variance due to noise introduced by the samplers, while the log-probability of the MoG is calculated exactly. This conjecture is confirmed by the results reported in the rightmost column, showing the exact log-probability ratio between the same test images and Gaussian noise with the same covariance. This ratio is indeed larger for mPoT than MoG, despite the inaccuracies in the former estimation. The table also reports results of mPoT models trained by using cheaper but less accurate approximations to the maximum likelihood gradient, namely Contrastive Divergence and Persistent Contrastive Divergence …
Evaluating reconstructions by sampling
•  original images
•  mPoT (the proposed model)
•  GRBM
•  PoT
Fig. 8. Top: Precision filters (matrix C) of size 16x16 pixels learned on color image patches of the same size. Matrix P was initialized with a local connectivity inducing a two dimensional topographic map. This makes nearby filters in the map learn similar features. Bottom left: Random subset of mean filters (matrix W). Bottom right (from top to bottom): independent samples drawn from the training data, mPoT, GRBM and PoT.

TABLE 1
Log-probability estimates of test natural images x, and exact log-probability ratios between the same test images x and random images n.

Model        log p(x)   log p(x)/p(n)
MoG            -85           68
GRBM FPCD     -152           -3
PoT FPCD      -101           65
mPoT CD       -109           82
mPoT PCD       -92           98
mPoT FPCD      -94          102
[Truncated paper excerpt: the sampling experiment was repeated with the convolutional extension of mPoT, trained on large patches cropped at random locations from natural images, using filters with different spatial phases at a diagonal offset of two pixels. Comparing samples, mPoT produces uniform regions separated by sharp edges with a sort of long-range structure, though rather artificial and repetitive; deep layers are then stacked on top of mPoT.]
Scaling up to high resolution
•  As it stands, the model is just a processor for image patches
•  Extending the graphical model to the whole image with full connections (A) is computationally too expensive
•  Instead, split into blocks and handle it with something convolution-like (D):
tiled convolution (C, E)
Visually, it looks like this:
Fig. 7. A toy illustration of how units are combined across layers. Squares are filters, gray planes are channels and circles are latent variables. Left: illustration of how input channels are combined into a single output channel. Input variables at the same spatial location across different channels contribute to determine the state of the same latent variable. Input units falling into different tiles (without overlap) determine the state of nearby units in the hidden layer (here, we have only two spatial locations). Right: illustration of how filters that overlap with an offset contribute to hidden units that are at the same output spatial location but in different hidden channels. In practice, the deep model combines these two methods at each layer.
(Ranzato et al., NIPS 2010)
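To illustrate the tiling idea, here is a toy sketch (my own, not the authors' code): a tiled "convolution" applies each filter with a stride equal to the filter size, so weights are not shared within a tile, and several copies with different diagonal offsets restore coverage of the image.

```python
import numpy as np

def tiled_conv(image, filters, offsets):
    """Toy tiled convolution: each filter is applied on a non-overlapping grid
    (stride = filter size), once per spatial offset, yielding one hidden
    channel per (filter, offset) pair."""
    H, W = image.shape
    channels = []
    for f in filters:
        k = f.shape[0]
        for dy, dx in offsets:
            h = [[np.sum(f * image[y:y + k, x:x + k])
                  for x in range(dx, W - k + 1, k)]
                 for y in range(dy, H - k + 1, k)]
            channels.append(np.array(h))
    return channels

img = np.random.default_rng(0).normal(size=(16, 16))
filt = [np.ones((4, 4)) / 16.0]                          # a single averaging filter
maps = tiled_conv(img, filt, offsets=[(0, 0), (2, 2)])   # diagonal offset of two
print([m.shape for m in maps])                           # [(4, 4), (3, 3)]
```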
Going deep
•  The latent variables are binary anyway,
so surely we can make this deep?
[Figure: "DBNs for Classification" (Salakhutdinov, CVPR 2012 tutorial): a stack of RBMs (500, 500, 2000 units with a 10-way softmax output) is pretrained layer by layer, unrolled, and fine-tuned.]
We are here → the mcRBM/mPoT plays the role of the bottom RBM.
Salakhutdinov CVPR 2012 Tutorial
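A generic sketch of the greedy layer-wise pretraining behind that figure (my own illustration, not the authors' code; a plain Bernoulli RBM with a mean-field CD-1 update, biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=5, lr=0.01):
    """Bernoulli-Bernoulli RBM trained with CD-1, using probabilities in
    place of samples for brevity."""
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)            # up pass
        v1 = sigmoid(h0 @ W.T)          # reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W

# greedy stacking: each RBM models the hidden activities of the layer below
x = (rng.random((256, 64)) > 0.5).astype(float)   # stand-in for mPoT latents
weights, layer_input = [], x
for n_hidden in (128, 64):
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)        # feed activities upward
```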
Reconstruction by mPoT + DBN
•  PoT
•  mPoT
•  DBN (2 layers) + mPoT
•  Time course of DBN+mPoT
TABLE 2
Denoising performance using σ = 20 (PSNR = 22.1 dB).

Model          Barb.  Boats  Fgpt.  House  Lena   Peprs.
mPoT           28.0   30.0   27.6   32.2   31.9   30.7
mPoT+A         29.2   30.2   28.4   32.4   32.0   30.7
mPoT+A+NLM     30.7   30.4   28.6   32.9   32.4   31.0
FoE [11]       28.3   29.8   -      32.3   31.9   30.6
NLM [55]       30.5   29.8   27.8   32.4   32.0   30.3
GSM [56]       30.3   30.4   28.6   32.4   32.7   30.3
BM3D [57]      31.8   30.9   -      33.8   33.1   31.3
LSSC [58]      31.6   30.9   28.8   34.2   32.9   31.4
6.2 Denoising
The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. We restore images by maximum a-posteriori (MAP) estimation. In the log domain, this amounts to solving the following optimization problem: $\arg\min_x \lambda\,\|y - x\|^2 + F(x;\theta)$, where $y$ is the observed noisy image, $F(x;\theta)$ is the mPoT energy function (see eq. 12), $\lambda$ is a hyper-parameter which is inversely proportional to the noise variance, and $x$ is an estimate of the clean image. In our experiments, the optimization is performed by gradient descent.
For images with repetitive texture, generic prior models usually offer only modest denoising performance compared …
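A minimal sketch of that MAP denoising loop (my own illustration; `grad_F` stands in for the gradient of the mPoT energy, which is not reproduced here, so a simple quadratic smoothness prior is used instead):

```python
import numpy as np

def denoise_map(y, grad_F, lam=0.05, step=0.1, n_iters=200):
    """MAP denoising by gradient descent on lam * ||y - x||^2 + F(x; theta)."""
    x = y.copy()                             # initialize at the noisy image
    for _ in range(n_iters):
        x -= step * (2.0 * lam * (x - y) + grad_F(x))
    return x

def grad_smooth(x):
    """Gradient of 0.5 * sum of squared neighbor differences (stand-in prior)."""
    g = np.zeros_like(x)
    g[1:] += x[1:] - x[:-1];  g[:-1] -= x[1:] - x[:-1]
    g[:, 1:] += x[:, 1:] - x[:, :-1];  g[:, :-1] -= x[:, 1:] - x[:, :-1]
    return g

rng = np.random.default_rng(0)
clean = np.zeros((8, 8)); clean[:, 4:] = 1.0         # toy step-edge image
y = clean + rng.normal(0, 20 / 255, clean.shape)     # sigma = 20 on a [0,1] scale
x_hat = denoise_map(y, grad_smooth)
```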
Application 1: Denoising
•  Restore noisy images with mPoT, mPoT + NLM, and so on
Fig. 10. Denoising of "Barbara" (detail). From left to right: original image, noisy image (σ = 20, PSNR=22.1dB), denoising of mPoT (28.0dB), denoising of mPoT adapted to this image (29.2dB) and denoising of adapted mPoT combined with non-local means (30.7dB).
… (over both baselines, mPoT+A and NLM). Second, on this
task a gated MRF with mean latent variables is no better
than a model that lacks them, like FoE [11] (which is a
convolutional version of PoT).¹⁴ Mean hiddens are crucial
when generating images in an unconstrained manner, but
are not needed for denoising since the observation y already
provides (noisy) mean intensities to the model (a similar
argument applies to inpainting). Finally, the performance
of the adapted model is still slightly worse than the best
denoising method.
Barbara + AWGN (σ = 20, PSNR = 22.1 dB)
mPoT (PSNR = 28.0 dB)
mPoT+A (PSNR = 29.2 dB)
mPoT+A+NLM (really…) (PSNR = 30.7 dB)
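For reference, the PSNR numbers quoted above follow the standard definition (sketch; assumes pixel values on a 0-255 scale):

```python
import numpy as np

def psnr(clean, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((np.asarray(clean, float) - np.asarray(restored, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# AWGN with sigma = 20 gives MSE ~ 400, hence ~22.1 dB, matching the slide
```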
Application 2: Scene Classification
•  The scene DB of Lazebnik et al. (2006)
•  mPoT + DBN (2 layers) + K-means → 81.2% accuracy
•  SIFT + SVM → 81.4% accuracy (Lazebnik '06)
•  I'll skip the rest
(the paper doesn't give much data here — that's my excuse)
Application 3: Object Recognition
•  Classification on the CIFAR10 dataset (Torralba et al. '08)
– Train: 5K/class, Test: 1K/class
6.3 Scene Classification
In this experiment, we use a classification task to com-
pare SIFT features with the features learned by adding
a second layer of Bernoulli latent variables that model
the distribution of latent variables of an mPoT generative
model. The task is to classify the natural scenes in the 15
scene dataset [2] into one of 15 categories. The method
of reference on this dataset was proposed by Lazebnik et
al. [2] and it can be summarized as follows: 1) densely
compute SIFT descriptors every 8 pixels on a regular grid,
2) perform K-Means clustering on the SIFT descriptors, 3)
compute histograms of cluster ids over regions of the image
at different locations and spatial scales, and 4) use an SVM
with an intersection kernel for classification.
We use a DBN with an mPoT front-end to mimic this
pipeline. We treat the expected value of the latent variables
as features that describe the input image. We extract first
and second layer features (using the model that produced
the generations in the bottom of fig. 9) from a regular
grid with a stride equal to 8 pixels. We apply K-Means
to learn a dictionary with 1024 prototypes and then assign
each feature to its closest prototype. We compute a spatial
pyramid with 2 levels for the first layer features ({h^m, h^p}) …
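A rough sketch of that bag-of-features pipeline (my own outline, not the authors' code; the feature arrays are random stand-ins for the DBN latent activities, and KMeans comes from scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_pyramid_histogram(ids, n_words):
    """Histograms of visual-word ids over a 2-level spatial pyramid.
    ids: (H, W) array of cluster assignments on the feature grid."""
    H, W = ids.shape
    hists = [np.bincount(ids.ravel(), minlength=n_words)]  # level 0: whole image
    for gy in range(2):                                    # level 1: 2x2 cells
        for gx in range(2):
            cell = ids[gy * H // 2:(gy + 1) * H // 2,
                       gx * W // 2:(gx + 1) * W // 2]
            hists.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(hists)

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 128))   # stand-in: features on a stride-8 grid
km = KMeans(n_clusters=1024, n_init=1).fit(train_feats)

grid_feats = rng.normal(size=(14, 14, 128))  # one image's feature grid
ids = km.predict(grid_feats.reshape(-1, 128)).reshape(14, 14)
descriptor = spatial_pyramid_histogram(ids, n_words=1024)   # feed to an SVM
```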
Fig. 11. Example of images in the CIFAR 10 dataset. Each column shows samples belonging to the same category.

TABLE 3
Test and training (in parentheses) recognition accuracy on the CIFAR 10 dataset. The numbers in italics are the feature dimensionality at each stage.

Method                                     Accuracy %
1) mean (GRBM): 11025                      59.7 (72.2)
2) cRBM (225 factors): 11025               63.6 (83.9)
3) cRBM (900 factors): 11025               64.7 (80.2)
4) mcRBM: 11025                            68.2 (83.1)
5) mcRBM-DBN (11025-8192)                  70.7 (85.4)
6) mcRBM-DBN (11025-8192-8192)             71.0 (83.6)
7) mcRBM-DBN (11025-8192-4096-1024-384)    59.8 (62.0)

These images were downloaded from the web and down-sampled to a very low resolution, just 32x32 pixels. The CIFAR 10 subset has ten object categories, namely airplane, car, bird, cat, deer, dog, frog, horse, ship, and truck. The training set has 5000 samples per class, the test set has 1000 samples per class. The low resolution and extreme …
TABLE 4
Test recognition accuracy on the CIFAR 10 dataset produced by different methods. Features are fed to a multinomial logistic regression classifier for recognition.

Method                                     Accuracy %
384 dimens. GIST                           54.7
10,000 linear random projections           36.0
10K GRBM(*), 1 layer, ZCA'd images         59.6
10K GRBM(*), 1 layer                       63.8
10K GRBM(*), 1 layer with fine-tuning      64.8
10K GRBM-DBN(*), 2 layers                  56.6
11025 mcRBM, 1 layer, PCA'd images         68.2
8192 mcRBM-DBN, 3 layers, PCA'd images     71.0
384 mcRBM-DBN, 5 layers, PCA'd images      59.8
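The classification stage behind Tables 3-4 is plain multinomial logistic regression on the extracted features; a sketch with scikit-learn (the feature arrays here are random stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 384))   # stand-in for mcRBM-DBN features
y_train = rng.integers(0, 10, 5000)      # the 10 CIFAR classes
X_test = rng.normal(size=(1000, 384))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
```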
Application 4: Face Image Generation and Recognition
•  Toronto Face Dataset (TFD, Susskind et al. '10)
– 48x48, 100K unlabeled, 4K labeled
•  mPoT + DBN (up to 4 layers) + multi-class logistic regression (softmax?)
to recognize the object category in the image. Since our
model is unsupervised, we train it on a set of two million
images from the TINY dataset that does not overlap with
the labeled CIFAR 10 subset in order to further improve
generalization [20], [63], [64]. In the default set up we
learn all parameters of the model, we use 81 filters in W
to encode the mean, 576 filters in C to encode covariance
constraints and we pool these filters into 144 hidden units
through matrix P. P is initialized with a two-dimensional
topography that takes 3x3 neighborhoods of filters with
a stride equal to 2. In total, at each location we extract 225 features. Therefore, we represent a 32x32 image with a 225x7x7=11025 dimensional descriptor.
Fig. 12. Top: Samples generated by a five-layer deep model
trained on faces. The top layer has 128 binary latent variables
and images have size 48x48 pixels. Bottom: comparison between
six samples from the model (top row) and the Euclidean distance
nearest neighbor images in the training set (bottom row).
TABLE 5
TFD: Facial expression classification accuracy using features trained without supervision.

Method          layer 1   layer 2   layer 3   layer 4
raw pixels      71.5      -         -         -
Gaussian SVM    76.2      -         -         -
Sparse Coding   74.6      -         -         -
Gabor PCA       80.2      -         -         -
GRBM            80.0      81.5      80.3      79.5
PoT             79.4      79.3      80.6      80.2
mPoT            81.6      82.1      82.5      82.4
Sampling from the 4-layer DBN (+ mPoT)
Comparison of generated images with the originals
Application 4: Face Generation and Recognition (cont. 1)
•  Generation when parts of the face are occluded
– top row: input; bottom row: generated result
[Figure: rows labeled ORIGINAL and TYPE 1-7, showing the seven synthetic occlusion masks applied to face images.]
Fig. 15. Expression recognition accuracy on the TFD dataset when both training and test labeled images are subject to 7 types of occlusion.
… RBMs with 1024 latent variables, and at the fifth layer we have an RBM with 128 hidden units. The deep model was trained entirely generatively on the unlabeled images without using any labeled instances [64]. The discriminative training consisted of training a linear multi-class logistic regression classifier on the top level representation without using back-propagation to jointly optimize the parameters across all layers.
Application 4: Face Generation and Recognition (cont. 2)
•  Generation with face parts occluded
– the contribution of each layer
– feeding the generated image back in as input (see the sketch after the captions below)
Fig. 13. Example of conditional generation performed by a four-
layer deep model trained on faces. Each column is a different
example (not used in the unsupervised training phase). The topmost
row shows some example images from the TFD dataset. The other
rows show the same images occluded by a synthetic mask (on the
top) and their restoration performed by the deep generative model
(on the bottom).
Fig. 14. An example of restoration of unseen images performed
by propagating the input up to first, second, third, fourth layer, and
again through the four layers and re-circulating the input through the
same model for ten times.
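The re-circulation in Fig. 14 amounts to repeated up-down passes through the layer stack with the observed pixels clamped; a schematic sketch (my own; `up[i]` and `down[i]` stand in for the per-layer inference and generation steps, which are not reproduced here):

```python
import numpy as np

def restore(x_occluded, mask, up, down, n_cycles=10):
    """Restore occluded pixels by re-circulating through a layer stack.
    up/down: lists of callables; up[i] maps layer i to i+1, down[i] maps
    layer i+1 back to i. mask is True where pixels are observed."""
    x = x_occluded.copy()
    for _ in range(n_cycles):
        h = x
        for f in up:                        # propagate up to the top layer
            h = f(h)
        for g in reversed(down):            # generate back down to the pixels
            h = g(h)
        x = np.where(mask, x_occluded, h)   # clamp the observed pixels
    return x
```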
Summary
•  With a gated MRF, handling the latent variables becomes easy:
– 2-body interactions → "mean"
– 3-body interactions → "precision"
•  A gated MRF tries to cover the input space with hyper-ellipses →
it looks usable with fairly good accuracy (a GRBM or PPCA could probably do it too, but those use hyper-spheres, which may be a problem when correlations are strong)
•  As a system:
– build models for the low level (the patch-processing part?) and the high level (the scaling-up part?), and
– connect the high-level part to a DBN
•  The combined system was then applied to
– object and scene recognition, and
– face generation and recognition, achieving results approaching the state of the art
Future work
•  Isn't the lack of exact maximum-likelihood estimation a drawback?
– MCMC is computationally expensive too…
•  Beyond that, the model apparently has to be adjusted per task and per compute budget
•  Textures do emerge, but they are still simple ones…
•  The gated MRF:
– leaves room to improve the form of the energy function
– allows the higher-level structure to be explored,
e.g. replacing all the RBMs with gated MRFs
– mixing parametric and non-parametric methods might be the way to go?
(did they actually say that?)
•  How about applying it to video?
  • 19. まとめ •  Gated  MRF 使うと  latent  の取り扱い楽だよ   –  2体相互作用→    “mean”   –  3体相互作用→  “precision”   •  Gated  MRF  は入力空間を  hyper-­‐ellipse  で覆おうとする→ 割と精度良く使えそう(GRBM,  PPCA  でも多分可だけど   hyper-­‐sphere なので相関強い時はちと問題かも)   •  システムとしては   –  Low  level  (patch処理のとこ?)  と  High  level    (高解像度化のと こ?)のモデル作って   –  High  level  の部分は DBN に繋げる,と.   •  で,まとめたシステムを   –  物体,シーン認識   –  顔生成と認識とかに使って,state-­‐of-­‐art  に迫る成績出した  
  • 20. Future  work •  厳密な最尤推定が出来ないのは欠点じゃなかろうか?   –  MCMC  も計算コスト高いので…   •  後はタスクと計算機コストでモデルを修正してくしかなさそ   •  確かにテクスチャとか出てくるけど,まだ単純なものだし…   •  Gated  MRF  は   –  エネルギー関数の形式を改良する余地があるよ   –  Higher  level  の構造を吟味することが出来るよ   全部 RBM  →  gated  MRF  とかにするとか   –  パラメトリックとノンパラをミックスするとかがいいんじゃない? (こんなこと言ってたっけ?)   •  Video  とかへの応用はどうだろう?