[MIRU2018] Attention Branch Network Using the Properties of Global Average Pooling
[Figure: "…n of Depth" comparison of ILSVRC network depths.
- An 8-layer AlexNet-style network: conv 96 /4 + pool/2, conv 256 + pool/2, conv 384, conv 384, conv 256 + pool/2, fc 4096, fc 4096, fc 1000.
- VGG, 19 layers (ILSVRC 2014): stacked 3x3 convolutions (64, 64 + pool/2, 128, 128 + pool/2, 256 x4 + pool/2, 512 x4 + pool/2, 512 x4 + pool/2) followed by fc 4096, fc 4096, fc 1000.
- GoogLeNet, 22 layers (ILSVRC 2014): a 7x7/2 conv, 3x3/2 max-pool, LocalRespNorm stem, then stacked Inception modules (parallel 1x1, 3x3, 5x5 convolutions and a 3x3 max-pool path merged by DepthConcat), two auxiliary classifiers (softmax0, softmax1) branching from 5x5/3 average pools, and a final 7x7 average pool, fc, softmax2.]
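To make the Inception structure listed above concrete, here is a minimal PyTorch sketch of one Inception-style module: four parallel paths (1x1; 1x1 then 3x3; 1x1 then 5x5; 3x3 max-pool then 1x1) merged by depth concatenation. The channel counts in the example are illustrative choices, not necessarily those of any particular GoogLeNet stage.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-style block: four parallel paths concatenated along channels.
    Channel sizes are illustrative, not GoogLeNet's exact configuration."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.p1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.p3 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.p5 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.pp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # DepthConcat: concatenate the four branch outputs along the channel axis
        return torch.cat([self.p1(x), self.p3(x), self.p5(x), self.pp(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```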
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
[Figure: ResNet (He et al.): a 7x7 conv, 64, /2 + pool/2 stem followed by stages of stacked bottleneck blocks (1x1 conv, 3x3 conv, 1x1 conv) with widths 64/64/256 (x3), 128/128/512 (x8, first block /2), 256/256/1024 (first block /2, many blocks), and 512/512/2048 (x3, first block /2), ending in average pooling and fc 1000; the stage counts are consistent with the 152-layer configuration.]
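For reference, the repeated 1x1/3x3/1x1 pattern above is the ResNet bottleneck block; below is a minimal PyTorch sketch of one such block with an identity (or projection) shortcut. It is a simplified illustration, not the full ResNet-152, using the commonly seen post-activation layout.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a shortcut connection."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        # Projection shortcut when the spatial size or channel count changes
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch))
            if (stride != 1 or in_ch != out_ch) else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

# Example: the 64/64/256 block from the first ResNet stage
block = Bottleneck(64, 64, 256)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```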
AUTHOR(S): LEARNING OF OCCLUSION-AWARE ATTENTION FOR PEDESTRIAN DETECTION

…tion, outputting the classification scores using global average pooling or global max pooling from the feature map f(·). However, global average pooling raises the response of the entire feature map for a specific class, because it averages over all pixels of the feature map. Global max pooling, by contrast, does not raise the entire feature map for a specific class, because it uses only the maximum pixel value in the feature map. The response score for each class under global average pooling and global max pooling is calculated as in Eq. (1):

v^c_i = (1 / (M·N)) Σ_{m=1..M} Σ_{n=1..N} f^c_{m,n}(x_i)   (global average pooling)
v^c_i = max_{m,n} f^c_{m,n}(x_i)                           (global max pooling)    (1)
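A minimal sketch of Eq. (1) in PyTorch, computing the per-class response score v^c_i from a class-wise feature map by global average pooling or global max pooling. The tensor names and shapes are assumptions made for illustration, not taken from the paper's code.

```python
import torch

def class_response_scores(f, mode="gap"):
    """f: class-wise feature map of shape (B, C, M, N) -> scores of shape (B, C).
    mode="gap" averages over all M*N positions (Eq. 1, GAP);
    mode="gmp" takes the maximum position (Eq. 1, GMP)."""
    if mode == "gap":
        return f.mean(dim=(2, 3))
    elif mode == "gmp":
        return f.amax(dim=(2, 3))
    raise ValueError("mode must be 'gap' or 'gmp'")

f = torch.randn(2, 10, 7, 7)                   # batch of 2, 10 classes, 7x7 response maps
print(class_response_scores(f, "gap").shape)   # torch.Size([2, 10])
print(class_response_scores(f, "gmp").shape)   # torch.Size([2, 10])
```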
After the score for each class is output, the attention maps of the pedestrian and occlusion regions are generated. First, we fuse the multi-channel feature map into a single channel. In this work, we evaluate the three fusion types shown in Fig. 1(b)–(d): 1) standard fusion, 2) softmax-weighting fusion, and 3) squeeze-and-excitation (SE) block fusion. Standard fusion is simply the summation of the feature maps. In softmax weighting, each channel of the feature map is weighted by its softmax score, as in Eq. (2); this weighting can mask unnecessary channels. In SE block fusion, each channel is weighted by the attention of an SE block, as in the Squeeze-and-Excitation Network. After fusing to one channel, the pedestrian-classification and occlusion-state attentions are combined: we compute the attention by subtracting the occlusion attention from the pedestrian-classification attention. We call the result the attention map because it contains both positive and negative values.

Attention_i = Σ_{c=1..C} f^c(x_i) · exp(v^c_i) / Σ_{j=1..J} exp(v^j_i)    (2)
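A minimal sketch of the softmax-weighting fusion of Eq. (2): each channel map f^c(x_i) is weighted by the softmax of its response score v^c_i and the weighted maps are summed into a one-channel attention map. Shapes and names are again illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_weighted_attention(f, v):
    """f: class-wise feature maps (B, C, M, N); v: class response scores (B, C).
    Returns a one-channel attention map (B, 1, M, N) as in Eq. (2)."""
    w = F.softmax(v, dim=1)                      # exp(v_c) / sum_j exp(v_j) over classes
    att = (f * w[:, :, None, None]).sum(dim=1)   # weight each channel, then sum over classes
    return att.unsqueeze(1)

f = torch.randn(2, 10, 7, 7)
v = f.mean(dim=(2, 3))                            # e.g. GAP scores from Eq. (1)
print(softmax_weighted_attention(f, v).shape)     # torch.Size([2, 1, 7, 7])
```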
3.4 Perception branch

The perception branch outputs the final score using the attention map and the feature map from RoI pooling. The attention map can refine the RoI-pooled feature map, for example by masking unnecessary background features and enhancing the important locations. The converted feature map is obtained as the inner product of the attention map and the feature map from RoI pooling. The perception branch is composed of two fully connected layers, as in Fast R-CNN. The structure of the perception branch is the same as in conventional Fast R-CNN; however, our model …
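A sketch of the refinement step described above, under the assumption that the "inner product" is an element-wise product between the attention map and the RoI-pooled feature map broadcast over channels, followed by a two-layer fully connected head as in Fast R-CNN. All layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class PerceptionBranch(nn.Module):
    """Refine RoI-pooled features with an attention map, then score them (Fast R-CNN-style head)."""
    def __init__(self, channels=512, roi_size=7, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * roi_size * roi_size, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))

    def forward(self, roi_feat, attention):
        # roi_feat: (R, C, 7, 7); attention: (R, 1, 7, 7) with positive and negative values
        refined = roi_feat * attention            # assumed element-wise refinement
        return self.fc(refined)

branch = PerceptionBranch()
scores = branch(torch.randn(5, 512, 7, 7), torch.randn(5, 1, 7, 7))
print(scores.shape)  # torch.Size([5, 2])
```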
[Slide: excerpt from an anonymous CVPR submission (Paper ID ****) showing its Abstract, 1. Introduction, 2. Conclusion, and equations:
t log y + (1 − t) log(1 − y)   (1)
v^c_i = (1 / (M·N)) Σ_{m=1..M} Σ_{n=1..N} f^c_{m,n}(x_i)   (2)
v^1_i, v^2_i, v^3_i, v^c_i   (3)
f(x_i)   (4)
f(x_i, y_i)   (5)]
[Slide: screenshot of an anonymous CVPR submission titled "How Small Network Can Detect Ped…" (Paper ID ****), showing the same template Abstract, Introduction, and equations as above.]
Table 1. Classification error on the ILSVRC validation set.
Networks              top-1 val. error   top-5 val. error
VGGnet-GAP            33.4               12.2
GoogLeNet-GAP         35.0               13.2
AlexNet*-GAP          44.9               20.9
AlexNet-GAP           51.1               26.3
GoogLeNet             31.9               11.3
VGGnet                31.2               11.4
AlexNet               42.6               19.5
NIN                   41.9               19.6
GoogLeNet-GMP         35.6               13.9

Table 2. Localization error on the ILSVRC validation set. Backprop refers to using [23] for localization instead of CAM.
Method                  top-1 val. error   top-5 val. error
GoogLeNet-GAP           56.40              43.00
VGGnet-GAP              57.20              45.14
GoogLeNet               60.09              49.34
AlexNet*-GAP            63.75              49.53
AlexNet-GAP             67.19              52.16
NIN                     65.47              54.19
Backprop on GoogLeNet   61.31              50.55
Training loss of the Attention Branch Network:

L_all(x) = E_att(x) + E_per(x)

where E_att(x) and E_per(x) denote the attention-branch and perception-branch losses, respectively.
Attention mechanism of the Attention Branch Network: the attention map M(x) re-weights the feature map g(x) to give the refined feature map g′(x):

g′(x) = (1 + M(x)) ⋅ g(x)
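A minimal PyTorch sketch of this residual attention step together with the two-term training loss L_all(x) = E_att(x) + E_per(x). The use of cross-entropy for both branch losses and the shapes below are assumptions for illustration, not the released ABN implementation.

```python
import torch
import torch.nn.functional as F

def apply_attention(g, M):
    """g'(x) = (1 + M(x)) * g(x): the attention map re-weights the feature map,
    and the residual '1 +' keeps the original features from being wiped out."""
    return (1.0 + M) * g

def abn_loss(att_logits, per_logits, target):
    """L_all(x) = E_att(x) + E_per(x): attention-branch loss plus perception-branch loss."""
    return F.cross_entropy(att_logits, target) + F.cross_entropy(per_logits, target)

g = torch.randn(2, 512, 14, 14)                    # feature map g(x)
M = torch.sigmoid(torch.randn(2, 1, 14, 14))       # attention map M(x) in [0, 1]
print(apply_attention(g, M).shape)                 # torch.Size([2, 512, 14, 14])

att_logits, per_logits = torch.randn(2, 10), torch.randn(2, 10)
target = torch.randint(0, 10, (2,))
print(abn_loss(att_logits, per_logits, target))    # scalar loss
```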
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He²
¹UC San Diego  ²Facebook AI Research
{s9xie,ztu}@ucsd.edu  {rbg,pdollar,kaiminghe}@fb.com
Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

1. Introduction
[Figure 1. Left: A block of ResNet [14]: 256-d in, (256, 1x1, 64), (64, 3x3, 64), (64, 1x1, 256), added to the input, 256-d out. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity: 32 parallel paths, each (256, 1x1, 4), (4, 3x3, 4), (4, 1x1, 256), summed and added to the input. A layer is shown as (# in channels, filter size, # out channels).]
…ing blocks of the same shape. This strategy is inherited by ResNets [14] which stack modules of the same topology. This simple rule reduces the free choices of hyper-parameters, and depth is exposed as an essential dimension in neural networks. Moreover, we argue that the simplicity of this rule may reduce the risk of over-adapting the hyper-parameters to a specific dataset. The robustness of VGG-nets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20]. Unlike VGG-nets, the family of Inception models [38, 17, 39, 37] have demonstrated that carefully designed … (arXiv:1611.05431v2 [cs.CV] 11 Apr 2017)
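The 32-path ResNeXt block of Figure 1 can also be written with a grouped convolution, an equivalent form discussed in the paper; below is a minimal PyTorch sketch of one such block (cardinality 32, bottleneck width 4 per path). It is an illustrative reimplementation, not the authors' released code.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt block in grouped-convolution form: 32 paths of width 4 == groups=32, 128 mid channels."""
    def __init__(self, channels=256, cardinality=32, path_width=4):
        super().__init__()
        mid = cardinality * path_width   # 32 * 4 = 128 aggregated mid channels
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Aggregated transformations plus the identity shortcut
        return self.relu(x + self.body(x))

block = ResNeXtBlock()
print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```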
Densely Connected Convolutional Networks
Gao Huang* (Cornell University, gh349@cornell.edu), Zhuang Liu* (Tsinghua University, liuzhuang13@mails.tsinghua.edu.cn), Laurens van der Maaten (Facebook AI Research, lvdmaaten@fb.com), Kilian Q. Weinberger (Cornell University, kqw4@cornell.edu)
*Authors contributed equally
Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections—one between each layer and its subsequent layer—our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

1. Introduction

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [29], and only last year Highway
[Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer H1…H4 takes all preceding feature-maps x0…x4 as input.]
Networks [34] and Residual Networks (ResNets) [11] have surpassed the 100-layer barrier.

As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different number of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers. (arXiv:1608.06993v5 [cs.CV] 28 Jan 2018)
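To illustrate the dense connectivity described above, here is a minimal PyTorch sketch of a dense block in which each layer receives the concatenation of all preceding feature maps and adds k = 4 new channels (the growth rate from Figure 1). The BN-ReLU-Conv composition is a simplified stand-in for the paper's composite function.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: layer l sees the concatenation of the input and all previous layer outputs."""
    def __init__(self, in_ch, growth_rate=4, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            ch = in_ch + l * growth_rate            # input grows by k channels per layer
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate all preceding feature maps, then produce k new ones
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=16, growth_rate=4, num_layers=4)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 32, 32, 32])
```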
[Figure: attention mechanism applied to a recurrent feature, combining f(st) and g(st) through tanh, ×, and Σ operations to produce g′(st).]