AET vs. AED:
Unsupervised Representation Learning
by Auto-Encoding Transformations
rather than Data
Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo
■ 53rd Computer Vision Study Group @ Kanto (CVPR2019 reading session)
- https://kantocv.connpass.com/event/133980/
■ Paper introduced in this talk:
- Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo,
“AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations
rather than Data”,
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2019. [Link]
※
AI @DeNA
twitter: @tomoyukun
■ Bio
- ~2019.3: CV
- 2019.4~: DeNA AI, CV
- 2019.6: CVPR2019
■ CVPR2019
- 30 234
- SlideShare: @DeNAxAI_NEWS
■ CVPR2019 Oral
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data
Liheng Zhang¹, Guo-Jun Qi¹,², Liqiang Wang³, Jiebo Luo⁴
¹ Laboratory for MAchine Perception and LEarning (MAPLE), http://maple-lab.net/
² Huawei Cloud, ³ University of Central Florida, ⁴ University of Rochester
guojun.qi@huawei.com
http://maple-lab.net/projects/AET.htm
Abstract
The success of deep neural networks often relies on a large amount of labeled examples, which can be difficult to obtain in many real scenarios. To address this challenge, unsupervised methods are strongly preferred for training …
Paper Project page Code
Unsupervised Representation Learning
■ Learn feature representations from unlabeled data x ~ p(x)
Unsupervised Representation Learning
■ Typical pipeline (figure: cvpaper.challenge)
- Pretext task (ex. ImageNet w/o labels) => Target task
  ➤ Pretrain an encoder (ex. AlexNet) on the pretext task, then transfer it to the target task
- Desirable properties of the learned representation: discriminative, semantic, disentangled, …
Unsupervised Representation Learning
n
- DNN
- (..?)
n
-
-
n ImageNet …
- ImageNet
-
n
-
Previous works (unsupervised)
■ Auto-Encoder [Hinton+, 06] / Variational Auto-Encoder [Kingma+, 13]
■ GANs
- Use the Discriminator as a feature extractor (encoder) [Radford+, 16]
- Learn an encoder E that inverts the generator G (x → z) jointly with the GAN [Donahue+, 17]
■ BiGAN
➤ Besides the generator p(x|z), an encoder p(z|x) is trained within the GAN framework
➤ The generator joint distribution p_G(x, z) = p_G(x|z) p(z) and the encoder joint distribution p_E(x, z) = p_E(z|x) p(x) are matched adversarially
➤ The discriminator D receives (x, z) pairs and decides which of the two joint distributions they come from
- The learned encoder is then used as a feature extractor for downstream tasks
          Cls.   Det.
random    53.3   43.4
BiGAN     60.3   46.9
JP        67.7   53.2
[Figure 1 from the BiGAN paper: the structure of Bidirectional Generative Adversarial Networks (BiGAN); G maps z to G(z), E maps x to E(x), and D scores the pairs (G(z), z) and (x, E(x)).]
Donahue et al., ”Adversarial Feature Learning”, ICLR 2017.
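To make the BiGAN pairing concrete, here is a minimal PyTorch sketch (not the authors' code; the MLP sizes and dimensions are illustrative assumptions). The point is that the discriminator never scores an image alone: it scores the joint pairs (x, E(x)) and (G(z), z), which is what forces the encoder to approximately invert the generator.

```python
# Minimal BiGAN forward-pass sketch (illustrative only; layer sizes are assumptions,
# not the configuration used in Donahue et al., ICLR 2017).
import torch
import torch.nn as nn

x_dim, z_dim = 784, 64  # e.g. flattened 28x28 images; hypothetical sizes

E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))      # encoder E(x) -> z
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))      # generator G(z) -> x
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))  # discriminator on (x, z) pairs

x = torch.rand(8, x_dim)          # a batch of real images
z = torch.randn(8, z_dim)         # a batch of latent codes

# The discriminator judges joint pairs, so matching the two joint distributions
# pushes E toward (approximately) inverting G.
score_real = D(torch.cat([x, E(x)], dim=1))      # pair from the encoder path: (x, E(x))
score_fake = D(torch.cat([G(z), z], dim=1))      # pair from the generator path: (G(z), z)

bce = nn.BCEWithLogitsLoss()
d_loss = bce(score_real, torch.ones_like(score_real)) + bce(score_fake, torch.zeros_like(score_fake))
print(d_loss.item())
```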
Previous works (self-supervised)
■ Context prediction [Doersch+, 15]
[Figure 1: the task involves randomly sampling a patch (blue) and then one of eight possible neighbors (red); can you guess the spatial configuration?]
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
Previous works (self-supervised)
■ Context prediction [Doersch+, 15]
- Randomly sample a patch (blue) and one of its eight possible neighbors (red); the model must guess their spatial configuration
  (the task is much easier once you have recognized the object; answer key: Q1: bottom right, Q2: top center)
- A two-branch SiameseNet with shared weights classifies the relative position (8-way)
- The CNN trained this way learns transferable features
  ➤ Fine-tuning it improves target-task performance
[Figure 3: pair-classification architecture: two shared-weight branches, each conv1 (11x11, 96, 4) → conv2 (5x5, 384, 2) → conv3 (3x3, 384, 1) → conv4 (3x3, 384, 1) → conv5 (3x3, 256, 1) → pool5 → fc6 (4096); the two fc6 outputs feed fc7 (4096) → fc8 (4096) → fc9 (8). 'conv' = convolution, 'fc' = fully connected, 'pool' = max pooling, 'LRN' = local response normalization; numbers in parentheses are kernel size, number of outputs, and stride.]
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
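As a rough illustration of the context-prediction pretext task, the sketch below samples a center patch and one of its eight neighbors and trains a two-branch network with shared weights to classify the relative position. The patch size, the absence of the gap/jitter tricks from the paper, and the tiny CNN are all simplifying assumptions, not the setup of Doersch et al.

```python
# Sketch of the relative-patch-position pretext task (illustrative; patch size,
# offsets and the tiny CNN are assumptions, not the exact setup of Doersch et al.).
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 32  # patch size (assumption)

# The eight neighbor positions, indexed 0..7, as (row offset, col offset) in patch units.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(img):
    """img: (C, H, W) tensor. Returns (center_patch, neighbor_patch, position_label)."""
    c, h, w = img.shape
    y = torch.randint(P, h - 2 * P, (1,)).item()   # top-left corner of the center patch
    x = torch.randint(P, w - 2 * P, (1,)).item()
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = NEIGHBORS[label]
    center = img[:, y:y + P, x:x + P]
    neighbor = img[:, y + dy * P:y + dy * P + P, x + dx * P:x + dx * P + P]
    return center, neighbor, label

class PatchSiamese(nn.Module):
    def __init__(self):
        super().__init__()
        # shared trunk applied to both patches (weights shared by construction)
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64 * 2, 8)  # concatenate the two embeddings, 8-way classification

    def forward(self, p1, p2):
        return self.head(torch.cat([self.trunk(p1), self.trunk(p2)], dim=1))

img = torch.rand(3, 256, 256)            # one unlabeled image
c_p, n_p, y = sample_patch_pair(img)
model = PatchSiamese()
logits = model(c_p.unsqueeze(0), n_p.unsqueeze(0))
loss = F.cross_entropy(logits, torch.tensor([y]))
print(loss.item())
```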
Previous works (self-supervised)
■ Rotation prediction [Gidaris+, 18]
- Rotate each image by 0°/90°/180°/270° and train a ConvNet to predict which rotation was applied
  => less prone to the low-level-artifact trivial solutions that patch-based tasks can suffer from
  ➤ To predict the rotation, the network has to recognize the object and its canonical orientation
  ➤ Strong results on downstream tasks (Cls., Det.) with a simple setup
[Figure: the input X is rotated by g(X, y), y ∈ {0, 1, 2, 3}; a shared ConvNet F(·) maximizes the probability F_y(X_y) of the applied rotation.]
Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotation”, ICLR 2018.
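The rotation pretext task is simple enough to fit in a few lines; the sketch below is an illustration only (the tiny ConvNet stands in for the backbones actually used by Gidaris et al.), showing how the four rotated copies and their labels are produced from unlabeled images.

```python
# Rotation-prediction pretext task in a few lines (sketch; the tiny ConvNet is a
# stand-in for the backbone used by Gidaris et al.).
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(x):
    """Create the four rotated copies g(X, y) for y in {0,1,2,3} (0/90/180/270 degrees)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rots, dim=0), labels

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 4))  # 4-way rotation classifier

x = torch.rand(8, 3, 32, 32)             # unlabeled images
xr, y = rotate_batch(x)                  # 32 images, each with its rotation label
loss = F.cross_entropy(backbone(xr), y)  # the only supervision is the rotation we applied
print(loss.item())
```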
Idea
■ AED (Auto-Encoding Data)
- The conventional auto-encoding approach: train the encoder by reconstructing the input data itself
  l(x, x̂) = ∥x − x̂∥²₂
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Idea
■ AET (Auto-Encoding Transformations)
- Sample a transformation t, apply it to the input x, and train the network to predict (decode) the transformation from the pair (x, t(x)) instead of reconstructing the data
- To recover the transformation, the learned representation must capture the visual structure of the image
  l(t, t̂) = ∥θ − θ̂∥²₂  (for parameterized transformations, e.g. θ, θ̂ ∈ R⁸ for a projective transform)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Network
The transformation loss is l(t, t̂) = ½∥M(θ) − M(θ̂)∥²₂, modeling the difference between the target and the estimated transformation matrices. (The paper also discusses transformations without an explicit geometric parameterization, e.g. GAN-induced transformations, besides the affine and projective ones.)
■ Two-branch encoder with shared weights: one branch encodes the original image, the other the transformed image
■ The features of the two branches are concatenated
■ A decoder predicts the transformation from the concatenated features
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
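A minimal sketch of this setup is given below, assuming a small stand-in CNN encoder and an 8-parameter transformation vector; the paper's actual NIN/AlexNet encoders and decoder design differ, so treat this as an outline of the AET objective rather than the reference implementation.

```python
# Minimal AET sketch: a shared-weight two-branch encoder sees the original and the
# transformed image, and a decoder regresses the transformation parameters.
# (Illustrative; the small CNN stands in for the NIN / AlexNet encoders in the paper.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class AET(nn.Module):
    def __init__(self, n_params=8):                  # e.g. 8 parameters for a projective transform
        super().__init__()
        self.encoder = nn.Sequential(                # shared between the two branches
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.decoder = nn.Linear(64 * 2, n_params)   # predicts theta from the concatenated features

    def forward(self, x, x_t):
        z = torch.cat([self.encoder(x), self.encoder(x_t)], dim=1)
        return self.decoder(z)

model = AET()
x = torch.rand(8, 3, 32, 32)            # original images
theta = torch.randn(8, 8)               # parameters of the sampled transformations t
x_t = torch.rand(8, 3, 32, 32)          # t(x); in practice obtained by warping x with theta
theta_hat = model(x, x_t)
loss = F.mse_loss(theta_hat, theta)     # l(t, t_hat) = ||theta - theta_hat||^2 (up to a constant)
print(loss.item())
```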
Transformations
■ White-box (parameterized) transformations
- AET-affine: rotation [-180°, 180°], translation [-0.2, +0.2] × H or W, scale [0.7, 1.3], shear [-30°, +30°]
- AET-project: scale [0.8, 1.2], rotation {0°, 90°, 180°, 270°}, corners stretched by up to ±0.125 × H or W
[Figure: example transformed images for each setting, alongside the original]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
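The sketch below samples an AET-affine transformation within the ranges listed above and applies it with torchvision's functional affine warp; the parameter normalization used for the regression target is an assumption made for illustration, not necessarily the paper's parameterization.

```python
# Sampling an AET-affine transformation within the ranges on the slide and applying it
# (sketch; uses torchvision's functional affine warp, so the exact warping /
# composition details may differ from the paper).
import random
import torch
import torchvision.transforms.functional as TF

def random_affine_params(h, w):
    angle = random.uniform(-180.0, 180.0)                 # rotation in degrees
    translate = [int(random.uniform(-0.2, 0.2) * w),      # translation, +/- 0.2 * W
                 int(random.uniform(-0.2, 0.2) * h)]      #              +/- 0.2 * H
    scale = random.uniform(0.7, 1.3)                      # scaling factor
    shear = random.uniform(-30.0, 30.0)                   # shear in degrees
    return angle, translate, scale, shear

x = torch.rand(3, 32, 32)                                 # one unlabeled image
angle, translate, scale, shear = random_affine_params(32, 32)
x_t = TF.affine(x, angle=angle, translate=translate, scale=scale, shear=shear)

# theta is the regression target for AET-affine: here simply the raw parameters,
# normalized to comparable ranges (one possible parameterization; an assumption).
theta = torch.tensor([angle / 180.0, translate[0] / 32, translate[1] / 32,
                      scale, shear / 30.0], dtype=torch.float32)
print(x_t.shape, theta)
```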
Transformations
■ Loss: MSE between the target and predicted transformation parameters
  l(t, t̂) = ∥θ − θ̂∥²₂,  θ, θ̂ ∈ R⁸ (projective case)
- The same loss is used for AET-affine and AET-project
■ Ablation over the two transformation types …
[Table excerpt (… semantic segmentation): Rotation 85.07 / 89.06 / 86.21 / 61.73; Proposed 83.11 / 86.33 / 86.94 / 83.21]
- Other candidate transformations: crop, flip, color jitter, random erasing, adversarial perturbation
Evaluation
■ Train AET on the CIFAR-10 / ImageNet train data (without labels)
■ Freeze the learned encoder and use it as a feature extractor
■ Train a classifier on the train data, or apply k-NN directly to the features
■ Evaluate on the val data
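A sketch of the k-NN variant of this protocol, assuming features have already been extracted with the frozen encoder (the choice of cosine similarity and k = 10 here is illustrative, not taken from the paper):

```python
# Sketch of k-NN evaluation on frozen features: classify each val sample by majority
# vote over its nearest train neighbors (cosine similarity is an assumption).
import torch

def knn_eval(train_feat, train_labels, val_feat, val_labels, k=10):
    # L2-normalize so that the dot product equals the cosine similarity
    train_feat = torch.nn.functional.normalize(train_feat, dim=1)
    val_feat = torch.nn.functional.normalize(val_feat, dim=1)
    sims = val_feat @ train_feat.t()                 # (n_val, n_train) similarity matrix
    nn_idx = sims.topk(k, dim=1).indices             # indices of the k nearest train samples
    nn_labels = train_labels[nn_idx]                 # (n_val, k) neighbor labels
    preds = nn_labels.mode(dim=1).values             # majority vote per val sample
    return (preds == val_labels).float().mean().item()

# toy example with random "features"
train_feat, train_labels = torch.randn(1000, 64), torch.randint(0, 10, (1000,))
val_feat, val_labels = torch.randn(200, 64), torch.randint(0, 10, (200,))
print(knn_eval(train_feat, train_labels, val_feat, val_labels))
```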
Details
■ CIFAR-10
- Backbone: Network In Network (NIN)
  • AET model: 4 NIN blocks => GAP => concatenate the two branches => fc (predict the transformation)
  • Encoder used for evaluation: the first 2 blocks + GAP
- Classifier: 3 fc layers, or the 3rd NIN block (conv) on the frozen features
■ ImageNet
- Backbone: AlexNet
  • AET model: AlexNet up to the 2nd fc layer => concatenate the two branches => fc (predict the transformation)
  • Encoder used for evaluation: the convolutional layers
- Classifier: non-linear (3 fc layers) or linear (1 fc layer)
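For the classifier-based evaluation, the sketch below trains only a linear head on top of a frozen stand-in encoder; the paper instead resizes intermediate conv feature maps to roughly 9,000 elements before the linear layer, so this is a simplified illustration of the protocol rather than the exact setup.

```python
# Linear-probe sketch: freeze the unsupervised encoder and train only a linear
# classifier on top of its features (simplified; the small CNN is a stand-in for
# the pretrained AET encoder, and pooling replaces the paper's feature-map resizing).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                       # stand-in for the pretrained AET encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in encoder.parameters():
    p.requires_grad = False                    # the encoder stays frozen during evaluation

classifier = nn.Linear(128, 10)                # the only trainable part (1 fc = "linear")
opt = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)

x, y = torch.rand(32, 3, 32, 32), torch.randint(0, 10, (32,))   # labeled train batch
with torch.no_grad():
    feats = encoder(x)                         # features are computed without gradients
loss = F.cross_entropy(classifier(feats), y)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```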
Results on CIFAR-10
Table 1: Comparison between unsupervised feature learning methods on CIFAR-10. The fully supervised NIN and the random Init. + conv have the same three-block NIN architecture, but the first is fully supervised while the second is trained on top of the first two blocks that are randomly initialized and stay frozen during training.
Method Error rate
Supervised NIN (Lower Bound) 7.20
Random Init. + conv (Upper Bound) 27.50
Roto-Scat + SVM [21] 17.7
ExamplarCNN [7] 15.7
DCGAN [25] 17.2
Scattering [20] 15.3
RotNet + FC [10] 10.94
RotNet + conv [10] 8.84
(Ours) AET-affine + FC 9.77
(Ours) AET-affine + conv 8.05
(Ours) AET-project + FC 9.41
(Ours) AET-project + conv 7.82
Training details (CIFAR-10): SGD, momentum 0.9, weight decay 5 × 10⁻⁴; the learning rate is scheduled to drop at epochs 800 and 1,000. AET-affine samples a composition of a random rotation, a random translation by ±0.2 of the image size in both the vertical and horizontal directions, a scaling factor in [0.7, 1.3], and a shear in [−30°, 30°]. AET-project translates the corners of the image by ±0.125 of its height/width, scales it by [0.8, 1.2], and rotates it by 0°/90°/180°/270°.
■ Both AET-affine and AET-project outperform the previous unsupervised methods
■ AET-project + conv performs best, nearly matching the fully supervised NIN (7.82 vs. 7.20 error rate)
[Table: k-NN evaluation results]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results on ImageNet
■ On ImageNet, AET-project is evaluated.
Table 4: Top-1 accuracy with linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models under comparison. A 1,000-way linear classifier is trained upon various convolutional layers of feature maps that are resized to have about 9,000 elements. Fully supervised and random models are also reported to show the upper and lower bounds of unsupervised model performances. Only a single crop is used and no dropout or local response normalization is used during testing for the AET, except the models denoted with * where ten crops are applied to compare results.
Method Conv1 Conv2 Conv3 Conv4 Conv5
ImageNet Labels (Upper Bound) [10] 19.3 36.3 44.2 48.3 50.5
Random (Lower Bound)[10] 11.6 17.1 16.9 16.3 14.1
Random rescaled [16](Lower Bound) 17.5 23.0 24.5 23.2 20.6
Context [5] 16.2 23.3 30.2 31.7 29.6
Context Encoders [22] 14.1 20.7 21.0 19.8 15.5
Colorization[30] 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles [18] 18.2 28.8 34.0 33.9 27.1
BiGAN [6] 17.7 24.5 31.0 29.9 28.0
Split-Brain [29] 17.7 29.3 35.4 35.2 32.8
Counting [19] 18.0 30.6 34.3 32.5 25.7
RotNet [10] 18.8 31.7 38.7 38.2 36.5
(Ours) AET-project 19.2 32.8 40.6 39.7 37.7
DeepCluster* [4] 13.4 32.3 41.0 39.6 38.2
(Ours) AET-project* 19.3 35.4 44.0 43.6 42.4
Training details (ImageNet): the initial learning rate is 0.01, dropped by a factor of 10 at epochs 100 and 150; AET is trained for 200 epochs in total, with the projective transformations randomly sampled as before. From the comparison, the AET models significantly narrow the performance gap to the upper-bound (fully supervised) model.
Table 3: Top-1 accuracy with non-linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models. After unsupervised features are learned, nonlinear classifiers are trained on top of Conv4 and Conv5 layers with labeled examples to compare their performances. We also compare with the fully supervised models and random models that give upper and lower bounded performances. For a fair comparison, only a single crop is applied in AET and no dropout or local response normalization is applied during the testing.
Method Conv4 Conv5
ImageNet Labels [3](Upper Bound) 59.7 59.7
Random [19] (Lower Bound) 27.1 12.0
Tracking [28] 38.8 29.8
Context [5] 45.6 30.4
Colorization [30] 40.7 35.2
Jigsaw Puzzles [18] 45.3 34.6
BiGAN [6] 41.9 32.2
NAT [3] - 36.0
DeepCluster [4] - 44.0
RotNet [10] 50.0 43.8
(Ours) AET-project 53.2 47.0
(Table 3: non-linear classifier, a 3-layer NN; Table 4: linear classifier, a 1-layer NN)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
■
Counting [19]        23.3   33.9
RotNet [10]          21.5   31.0
(Ours) AET-project   22.1   32.9
[Figure: (a) CIFAR-10, (b) ImageNet]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
n
31.0 35.1 34.6 33.7
32.9 37.1 36.2 34.7
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
n
- “structure information” …
-
-
- AET
• Keypoints
n
- 0
trivial
31.0 35.1 34.6 33.7
32.9 37.1 36.2 34.7
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.