AET vs. AED:
Unsupervised Representation Learning
by Auto-Encoding Transformations
rather than Data
Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo
■ 53rd Computer Vision Study Group @ Kanto (CVPR2019 reading session)
- https://kantocv.connpass.com/event/133980/
■ Paper introduced in this talk:
- Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo,
“AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations
rather than Data”,
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2019. [Link]
※
AI @DeNA
twitter: @tomoyukun
■ Bio
- ~2019.3: CV
- 2019.4~: DeNA AI, CV
- 2019.6: CVPR2019
■ CVPR2019
- 30 234
- SlideShare: @DeNAxAI_NEWS
■ CVPR2019 Oral
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data
Liheng Zhang¹, Guo-Jun Qi¹,², Liqiang Wang³, Jiebo Luo⁴
¹ Laboratory for MAchine Perception and LEarning (MAPLE), http://maple-lab.net/
² Huawei Cloud, ³ University of Central Florida, ⁴ University of Rochester
guojun.qi@huawei.com
http://maple-lab.net/projects/AET.htm
Abstract
The success of deep neural networks often relies on a large amount of labeled examples, which can be difficult to obtain in many real scenarios. To address this challenge, unsupervised methods are strongly preferred for training …
Paper Project page Code
Unsupervised Representation Learning
■ Learn feature representations from unlabeled data x ~ p(x)
Unsupervised Representation Learning
■ Typical pipeline (figure: cvpaper.challenge)
- Pretext task (ex. ImageNet w/o labels) => Target task
  ➤ Pretrain an encoder (ex. AlexNet) on the pretext task, then transfer it to the target task
- Desirable properties of the learned representation: discriminative, semantic, disentangled, …
Unsupervised Representation Learning
n
- DNN
- (..?)
n
-
-
n ImageNet …
- ImageNet
-
n
-
Previous works (unsupervised)
■ Auto-Encoder [Hinton+, 06] / Variational Auto-Encoder [Kingma+, 13]
■ GANs
- Use the Discriminator as a feature extractor (encoder) [Radford+, 16]
- Learn an encoder E that inverts the generator G (x → z) jointly with the GAN [Donahue+, 17]
■ BiGAN
➤ Besides the generator p(x|z), an encoder p(z|x) is trained within the GAN framework
➤ The generator joint distribution p_G(x, z) = p_G(x|z) p(z) and the encoder joint distribution p_E(x, z) = p_E(z|x) p(x) are matched adversarially
➤ The discriminator D receives (x, z) pairs and decides which of the two joint distributions they come from
- The learned encoder is then used as a feature extractor for downstream tasks
          Cls.   Det.
random    53.3   43.4
BiGAN     60.3   46.9
JP        67.7   53.2
[Figure 1 from the BiGAN paper: the structure of Bidirectional Generative Adversarial Networks (BiGAN); G maps z to G(z), E maps x to E(x), and D scores the pairs (G(z), z) and (x, E(x)).]
Donahue et al., ”Adversarial Feature Learning”, ICLR 2017.
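To make the BiGAN pairing concrete, here is a minimal PyTorch sketch (not the authors' code; the MLP sizes and dimensions are illustrative assumptions). The point is that the discriminator never scores an image alone: it scores the joint pairs (x, E(x)) and (G(z), z), which is what forces the encoder to approximately invert the generator.

```python
# Minimal BiGAN forward-pass sketch (illustrative only; layer sizes are assumptions,
# not the configuration used in Donahue et al., ICLR 2017).
import torch
import torch.nn as nn

x_dim, z_dim = 784, 64  # e.g. flattened 28x28 images; hypothetical sizes

E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))      # encoder E(x) -> z
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))      # generator G(z) -> x
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))  # discriminator on (x, z) pairs

x = torch.rand(8, x_dim)          # a batch of real images
z = torch.randn(8, z_dim)         # a batch of latent codes

# The discriminator judges joint pairs, so matching the two joint distributions
# pushes E toward (approximately) inverting G.
score_real = D(torch.cat([x, E(x)], dim=1))      # pair from the encoder path: (x, E(x))
score_fake = D(torch.cat([G(z), z], dim=1))      # pair from the generator path: (G(z), z)

bce = nn.BCEWithLogitsLoss()
d_loss = bce(score_real, torch.ones_like(score_real)) + bce(score_fake, torch.zeros_like(score_fake))
print(d_loss.item())
```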
Previous works (self-supervised)
■ Context prediction [Doersch+, 15]
[Figure 1: the task involves randomly sampling a patch (blue) and then one of eight possible neighbors (red); can you guess the spatial configuration?]
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
Previous works (self-supervised)
■ Context prediction [Doersch+, 15]
- Randomly sample a patch (blue) and one of its eight possible neighbors (red); the model must guess their spatial configuration
  (the task is much easier once you have recognized the object; answer key: Q1: bottom right, Q2: top center)
- A two-branch SiameseNet with shared weights classifies the relative position (8-way)
- The CNN trained this way learns transferable features
  ➤ Fine-tuning it improves target-task performance
[Figure 3: pair-classification architecture: two shared-weight branches, each conv1 (11x11, 96, 4) → conv2 (5x5, 384, 2) → conv3 (3x3, 384, 1) → conv4 (3x3, 384, 1) → conv5 (3x3, 256, 1) → pool5 → fc6 (4096); the two fc6 outputs feed fc7 (4096) → fc8 (4096) → fc9 (8). 'conv' = convolution, 'fc' = fully connected, 'pool' = max pooling, 'LRN' = local response normalization; numbers in parentheses are kernel size, number of outputs, and stride.]
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
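As a rough illustration of the context-prediction pretext task, the sketch below samples a center patch and one of its eight neighbors and trains a two-branch network with shared weights to classify the relative position. The patch size, the absence of the gap/jitter tricks from the paper, and the tiny CNN are all simplifying assumptions, not the setup of Doersch et al.

```python
# Sketch of the relative-patch-position pretext task (illustrative; patch size,
# offsets and the tiny CNN are assumptions, not the exact setup of Doersch et al.).
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 32  # patch size (assumption)

# The eight neighbor positions, indexed 0..7, as (row offset, col offset) in patch units.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(img):
    """img: (C, H, W) tensor. Returns (center_patch, neighbor_patch, position_label)."""
    c, h, w = img.shape
    y = torch.randint(P, h - 2 * P, (1,)).item()   # top-left corner of the center patch
    x = torch.randint(P, w - 2 * P, (1,)).item()
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = NEIGHBORS[label]
    center = img[:, y:y + P, x:x + P]
    neighbor = img[:, y + dy * P:y + dy * P + P, x + dx * P:x + dx * P + P]
    return center, neighbor, label

class PatchSiamese(nn.Module):
    def __init__(self):
        super().__init__()
        # shared trunk applied to both patches (weights shared by construction)
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64 * 2, 8)  # concatenate the two embeddings, 8-way classification

    def forward(self, p1, p2):
        return self.head(torch.cat([self.trunk(p1), self.trunk(p2)], dim=1))

img = torch.rand(3, 256, 256)            # one unlabeled image
c_p, n_p, y = sample_patch_pair(img)
model = PatchSiamese()
logits = model(c_p.unsqueeze(0), n_p.unsqueeze(0))
loss = F.cross_entropy(logits, torch.tensor([y]))
print(loss.item())
```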
Previous works (self-supervised)
■ Rotation prediction [Gidaris+, 18]
- Rotate each image by 0°/90°/180°/270° and train a ConvNet to predict which rotation was applied
  => less prone to the low-level-artifact trivial solutions that patch-based tasks can suffer from
  ➤ To predict the rotation, the network has to recognize the object and its canonical orientation
  ➤ Strong results on downstream tasks (Cls., Det.) with a simple setup
[Figure: the input X is rotated by g(X, y), y ∈ {0, 1, 2, 3}; a shared ConvNet F(·) maximizes the probability F_y(X_y) of the applied rotation.]
Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotation”, ICLR 2018.
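The rotation pretext task is simple enough to fit in a few lines; the sketch below is an illustration only (the tiny ConvNet stands in for the backbones actually used by Gidaris et al.), showing how the four rotated copies and their labels are produced from unlabeled images.

```python
# Rotation-prediction pretext task in a few lines (sketch; the tiny ConvNet is a
# stand-in for the backbone used by Gidaris et al.).
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(x):
    """Create the four rotated copies g(X, y) for y in {0,1,2,3} (0/90/180/270 degrees)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rots, dim=0), labels

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 4))  # 4-way rotation classifier

x = torch.rand(8, 3, 32, 32)             # unlabeled images
xr, y = rotate_batch(x)                  # 32 images, each with its rotation label
loss = F.cross_entropy(backbone(xr), y)  # the only supervision is the rotation we applied
print(loss.item())
```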
Idea
■ AED (Auto-Encoding Data)
- The conventional auto-encoding approach: train the encoder by reconstructing the input data itself
  l(x, x̂) = ∥x − x̂∥²₂
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Idea
■ AET (Auto-Encoding Transformations)
- Sample a transformation t, apply it to the input x, and train the network to predict (decode) the transformation from the pair (x, t(x)) instead of reconstructing the data
- To recover the transformation, the learned representation must capture the visual structure of the image
  l(t, t̂) = ∥θ − θ̂∥²₂  (for parameterized transformations, e.g. θ, θ̂ ∈ R⁸ for a projective transform)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Network
The transformation loss is l(t, t̂) = ½∥M(θ) − M(θ̂)∥²₂, modeling the difference between the target and the estimated transformation matrices. (The paper also discusses transformations without an explicit geometric parameterization, e.g. GAN-induced transformations, besides the affine and projective ones.)
■ Two-branch encoder with shared weights: one branch encodes the original image, the other the transformed image
■ The features of the two branches are concatenated
■ A decoder predicts the transformation from the concatenated features
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
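A minimal sketch of this setup is given below, assuming a small stand-in CNN encoder and an 8-parameter transformation vector; the paper's actual NIN/AlexNet encoders and decoder design differ, so treat this as an outline of the AET objective rather than the reference implementation.

```python
# Minimal AET sketch: a shared-weight two-branch encoder sees the original and the
# transformed image, and a decoder regresses the transformation parameters.
# (Illustrative; the small CNN stands in for the NIN / AlexNet encoders in the paper.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class AET(nn.Module):
    def __init__(self, n_params=8):                  # e.g. 8 parameters for a projective transform
        super().__init__()
        self.encoder = nn.Sequential(                # shared between the two branches
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.decoder = nn.Linear(64 * 2, n_params)   # predicts theta from the concatenated features

    def forward(self, x, x_t):
        z = torch.cat([self.encoder(x), self.encoder(x_t)], dim=1)
        return self.decoder(z)

model = AET()
x = torch.rand(8, 3, 32, 32)            # original images
theta = torch.randn(8, 8)               # parameters of the sampled transformations t
x_t = torch.rand(8, 3, 32, 32)          # t(x); in practice obtained by warping x with theta
theta_hat = model(x, x_t)
loss = F.mse_loss(theta_hat, theta)     # l(t, t_hat) = ||theta - theta_hat||^2 (up to a constant)
print(loss.item())
```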
Transformations
■ White-box (parameterized) transformations
- AET-affine: rotation [-180°, 180°], translation [-0.2, +0.2] × H or W, scale [0.7, 1.3], shear [-30°, +30°]
- AET-project: scale [0.8, 1.2], rotation {0°, 90°, 180°, 270°}, corners stretched by up to ±0.125 × H or W
[Figure: example transformed images for each setting, alongside the original]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
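The sketch below samples an AET-affine transformation within the ranges listed above and applies it with torchvision's functional affine warp; the parameter normalization used for the regression target is an assumption made for illustration, not necessarily the paper's parameterization.

```python
# Sampling an AET-affine transformation within the ranges on the slide and applying it
# (sketch; uses torchvision's functional affine warp, so the exact warping /
# composition details may differ from the paper).
import random
import torch
import torchvision.transforms.functional as TF

def random_affine_params(h, w):
    angle = random.uniform(-180.0, 180.0)                 # rotation in degrees
    translate = [int(random.uniform(-0.2, 0.2) * w),      # translation, +/- 0.2 * W
                 int(random.uniform(-0.2, 0.2) * h)]      #              +/- 0.2 * H
    scale = random.uniform(0.7, 1.3)                      # scaling factor
    shear = random.uniform(-30.0, 30.0)                   # shear in degrees
    return angle, translate, scale, shear

x = torch.rand(3, 32, 32)                                 # one unlabeled image
angle, translate, scale, shear = random_affine_params(32, 32)
x_t = TF.affine(x, angle=angle, translate=translate, scale=scale, shear=shear)

# theta is the regression target for AET-affine: here simply the raw parameters,
# normalized to comparable ranges (one possible parameterization; an assumption).
theta = torch.tensor([angle / 180.0, translate[0] / 32, translate[1] / 32,
                      scale, shear / 30.0], dtype=torch.float32)
print(x_t.shape, theta)
```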
Transformations
■ Loss: MSE between the target and predicted transformation parameters
  l(t, t̂) = ∥θ − θ̂∥²₂,  θ, θ̂ ∈ R⁸ (projective case)
- The same loss is used for AET-affine and AET-project
■ Ablation over the two transformation types …
[Table excerpt (… semantic segmentation): Rotation 85.07 / 89.06 / 86.21 / 61.73; Proposed 83.11 / 86.33 / 86.94 / 83.21]
- Other candidate transformations: crop, flip, color jitter, random erasing, adversarial perturbation
Evaluation
■ Train AET on the CIFAR-10 / ImageNet train data (without labels)
■ Freeze the learned encoder and use it as a feature extractor
■ Train a classifier on the train data, or apply k-NN directly to the features
■ Evaluate on the val data
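A sketch of the k-NN variant of this protocol, assuming features have already been extracted with the frozen encoder (the choice of cosine similarity and k = 10 here is illustrative, not taken from the paper):

```python
# Sketch of k-NN evaluation on frozen features: classify each val sample by majority
# vote over its nearest train neighbors (cosine similarity is an assumption).
import torch

def knn_eval(train_feat, train_labels, val_feat, val_labels, k=10):
    # L2-normalize so that the dot product equals the cosine similarity
    train_feat = torch.nn.functional.normalize(train_feat, dim=1)
    val_feat = torch.nn.functional.normalize(val_feat, dim=1)
    sims = val_feat @ train_feat.t()                 # (n_val, n_train) similarity matrix
    nn_idx = sims.topk(k, dim=1).indices             # indices of the k nearest train samples
    nn_labels = train_labels[nn_idx]                 # (n_val, k) neighbor labels
    preds = nn_labels.mode(dim=1).values             # majority vote per val sample
    return (preds == val_labels).float().mean().item()

# toy example with random "features"
train_feat, train_labels = torch.randn(1000, 64), torch.randint(0, 10, (1000,))
val_feat, val_labels = torch.randn(200, 64), torch.randint(0, 10, (200,))
print(knn_eval(train_feat, train_labels, val_feat, val_labels))
```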
Details
■ CIFAR-10
- Backbone: Network In Network (NIN)
  • AET model: 4 NIN blocks => GAP => concatenate the two branches => fc (predict the transformation)
  • Encoder used for evaluation: the first 2 blocks + GAP
- Classifier: 3 fc layers, or the 3rd NIN block (conv) on the frozen features
■ ImageNet
- Backbone: AlexNet
  • AET model: AlexNet up to the 2nd fc layer => concatenate the two branches => fc (predict the transformation)
  • Encoder used for evaluation: the convolutional layers
- Classifier: non-linear (3 fc layers) or linear (1 fc layer)
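For the classifier-based evaluation, the sketch below trains only a linear head on top of a frozen stand-in encoder; the paper instead resizes intermediate conv feature maps to roughly 9,000 elements before the linear layer, so this is a simplified illustration of the protocol rather than the exact setup.

```python
# Linear-probe sketch: freeze the unsupervised encoder and train only a linear
# classifier on top of its features (simplified; the small CNN is a stand-in for
# the pretrained AET encoder, and pooling replaces the paper's feature-map resizing).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                       # stand-in for the pretrained AET encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in encoder.parameters():
    p.requires_grad = False                    # the encoder stays frozen during evaluation

classifier = nn.Linear(128, 10)                # the only trainable part (1 fc = "linear")
opt = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)

x, y = torch.rand(32, 3, 32, 32), torch.randint(0, 10, (32,))   # labeled train batch
with torch.no_grad():
    feats = encoder(x)                         # features are computed without gradients
loss = F.cross_entropy(classifier(feats), y)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```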
Results on CIFAR-10
Table 1: Comparison between unsupervised feature learning methods on CIFAR-10. The fully supervised NIN and the random Init. + conv have the same three-block NIN architecture, but the first is fully supervised while the second is trained on top of the first two blocks that are randomly initialized and stay frozen during training.
Method Error rate
Supervised NIN (Lower Bound) 7.20
Random Init. + conv (Upper Bound) 27.50
Roto-Scat + SVM [21] 17.7
ExamplarCNN [7] 15.7
DCGAN [25] 17.2
Scattering [20] 15.3
RotNet + FC [10] 10.94
RotNet + conv [10] 8.84
(Ours) AET-affine + FC 9.77
(Ours) AET-affine + conv 8.05
(Ours) AET-project + FC 9.41
(Ours) AET-project + conv 7.82
Training details (CIFAR-10): SGD, momentum 0.9, weight decay 5 × 10⁻⁴; the learning rate is scheduled to drop at epochs 800 and 1,000. AET-affine samples a composition of a random rotation, a random translation by ±0.2 of the image size in both the vertical and horizontal directions, a scaling factor in [0.7, 1.3], and a shear in [−30°, 30°]. AET-project translates the corners of the image by ±0.125 of its height/width, scales it by [0.8, 1.2], and rotates it by 0°/90°/180°/270°.
■ Both AET-affine and AET-project outperform the previous unsupervised methods
■ AET-project + conv performs best, nearly matching the fully supervised NIN (7.82 vs. 7.20 error rate)
[Table: k-NN evaluation results]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results on ImageNet
■ On ImageNet, AET-project is evaluated.
Table 4: Top-1 accuracy with linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models under comparison. A 1,000-way linear classifier is trained upon various convolutional layers of feature maps that are resized to have about 9,000 elements. Fully supervised and random models are also reported to show the upper and lower bounds of unsupervised model performances. Only a single crop is used and no dropout or local response normalization is used during testing for the AET, except the models denoted with * where ten crops are applied to compare results.
Method Conv1 Conv2 Conv3 Conv4 Conv5
ImageNet Labels (Upper Bound) [10] 19.3 36.3 44.2 48.3 50.5
Random (Lower Bound)[10] 11.6 17.1 16.9 16.3 14.1
Random rescaled [16](Lower Bound) 17.5 23.0 24.5 23.2 20.6
Context [5] 16.2 23.3 30.2 31.7 29.6
Context Encoders [22] 14.1 20.7 21.0 19.8 15.5
Colorization[30] 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles [18] 18.2 28.8 34.0 33.9 27.1
BiGAN [6] 17.7 24.5 31.0 29.9 28.0
Split-Brain [29] 17.7 29.3 35.4 35.2 32.8
Counting [19] 18.0 30.6 34.3 32.5 25.7
RotNet [10] 18.8 31.7 38.7 38.2 36.5
(Ours) AET-project 19.2 32.8 40.6 39.7 37.7
DeepCluster* [4] 13.4 32.3 41.0 39.6 38.2
(Ours) AET-project* 19.3 35.4 44.0 43.6 42.4
Training details (ImageNet): the initial learning rate is 0.01, dropped by a factor of 10 at epochs 100 and 150; AET is trained for 200 epochs in total, with the projective transformations randomly sampled as before. From the comparison, the AET models significantly narrow the performance gap to the upper-bound (fully supervised) model.
Table 3: Top-1 accuracy with non-linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models. After unsupervised features are learned, nonlinear classifiers are trained on top of Conv4 and Conv5 layers with labeled examples to compare their performances. We also compare with the fully supervised models and random models that give upper and lower bounded performances. For a fair comparison, only a single crop is applied in AET and no dropout or local response normalization is applied during the testing.
Method Conv4 Conv5
ImageNet Labels [3](Upper Bound) 59.7 59.7
Random [19] (Lower Bound) 27.1 12.0
Tracking [28] 38.8 29.8
Context [5] 45.6 30.4
Colorization [30] 40.7 35.2
Jigsaw Puzzles [18] 45.3 34.6
BiGAN [6] 41.9 32.2
NAT [3] - 36.0
DeepCluster [4] - 44.0
RotNet [10] 50.0 43.8
(Ours) AET-project 53.2 47.0
(Table 3: non-linear classifier, a 3-layer NN; Table 4: linear classifier, a 1-layer NN)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
■
Counting [19]        23.3   33.9
RotNet [10]          21.5   31.0
(Ours) AET-project   22.1   32.9
[Figure: (a) CIFAR-10, (b) ImageNet]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
n
31.0 35.1 34.6 33.7
32.9 37.1 36.2 34.7
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
n
- “structure information” …
-
-
- AET
• Keypoints
n
- 0
trivial
31.0 35.1 34.6 33.7
32.9 37.1 36.2 34.7
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.