[DL Reading Group] Residual Attention Network for Image Classification

Residual Attention Network for Image Classification
Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang
Zhang, Xiaogang Wang, Xiaoou Tang
2017-09-04
Reading group @ Matsuo Lab, M1 Koichiro Tamura

Agenda
0. Information
1. Introduction
2. Related work & knowledge
3. Proposed Model
4. Experiment & Result
5. Conclusion
6. *Squeeze-and-Excitation Networks

0. Information
• Author
- Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang
Zhang, Xiaogang Wang, Xiaoou Tang
• Submission date
- Submitted on 23 Apr 2017
• Venue
- accepted to CVPR2017
- https://arxiv.org/abs/1704.06904
• About
- In computer vision, a model that incorporates attention on top of ResNet
- The paper is not out yet, but this looks like a predecessor (?) of Squeeze-and-Excitation Networks, the ILSVRC2017 winner

1. Introduction
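- Background
• Attention models are widely used for sequence models, but have rarely been applied to feedforward networks such as those used for image recognition
• Much of the recent progress in image recognition comes from ResNet, which made much deeper networks trainable
Goal: apply an attention mechanism to a `deep` ResNet-based CNN to improve accuracy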

1. Introduction
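- Model structure and results
1. Stacked network structure
• The network is built by stacking multiple Attention Modules; different Attention Modules can capture different types of attention
2. Attention Residual Learning
• Naively stacking Attention Modules lowers accuracy, so residual learning is used to train very deep networks (hundreds of layers)
3. Bottom-up top-down feedforward attention
• Bottom-up: attention driven by the input (e.g. differences from the background)
• Top-down: attention driven by prior knowledge
Results:
1. Depth can be increased stably while improving accuracy (state of the art as of 2017-04-23)
2. Easily applied to deep end-to-end networks, with efficient computation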

2. Related work & knowledge
- Attention model
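• Attention mechanisms are most often applied to RNNs
Example from Effective Approaches to Attention-based Neural Machine Translation:
1. Compute the hidden state with an RNN
$h_t = \mathrm{RNN}(h_{t-1}, x)$
2. Compute the weights $a_t(s)$, which indicate where in the input sequence to attend, using a score function
$a_t(s) = \dfrac{\exp(\mathrm{score}(\bar{h}_s, h_t))}{\sum_{s'} \exp(\mathrm{score}(\bar{h}_{s'}, h_t))}$
3. Compute the weighted-average vector $c_t$ using the weights $a_t(s)$
$c_t = \sum_s a_t(s)\, \bar{h}_s$
4. Compute a new output vector from the average vector of step 3 and the hidden state of step 1
$\tilde{h}_t = \tanh(W_h h_t + W_c c_t + b)$
5. Compute the output probability of each word
$y_t = \mathrm{softmax}(W_{out} \tilde{h}_t + b_{out})$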
• In computer vision, (something like) soft attention has been used in work such as:
• Spatial Transformer Network [17] -> (an interesting demo:
https://drive.google.com/file/d/0B1nQa_sA3W2iN3RQLXVFRkNXN0k/view)
• Attention to scale: Scale-aware semantic image segmentation [3]
[Source: Effective Approaches to Attention-based Neural Machine Translation]
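Steps 2 and 3 of the attention example can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: it assumes the dot-product score (one of several possible score functions), and the names `attend`, `source_states`, and `h_t` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(source_states, h_t):
    """Return the attention weights a_t(s) and the context vector c_t.

    Assumption: score(h_s, h_t) is a plain dot product.
    """
    scores = [dot(h_s, h_t) for h_s in source_states]
    weights = softmax(scores)                      # step 2
    context = [
        sum(w * h_s[d] for w, h_s in zip(weights, source_states))
        for d in range(len(h_t))
    ]                                              # step 3
    return weights, context

# Toy source hidden states and a toy decoder state.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_t = [2.0, 0.0]
weights, context = attend(source_states, h_t)
```

Source positions whose hidden state aligns with `h_t` receive larger weights, so the context vector is dominated by them. Steps 4 and 5 would add learned projections on top of `context` and are omitted here.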

2. Related work & knowledge
- ResNet
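• In CNNs, network depth contributes greatly to accuracy
• Very deep networks suffered from problems such as vanishing gradients => ResNet
• ResNet
• With $H(x)$ as the desired output, the block learns (minimizes) the residual $F(x) = H(x) - x$
• As the network gets deeper, the input $x$ and the output $H(x)$ become nearly identical.
Driving the residual $F(x)$ toward 0 is easier than fitting $H(x)$ to $x$ directly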
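The residual formulation above can be illustrated with a toy sketch. It only shows the identity shortcut $H(x) = F(x) + x$ on plain lists; real ResNet blocks use convolutions, and the two residual functions here are hypothetical stand-ins.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def zero_residual(x):
    # A branch that has learned F(x) = 0: the block reduces to the identity.
    return [0.0 for _ in x]

def small_residual(x):
    # A hypothetical learned correction; the 0.1 scale is arbitrary.
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)   # same values as x
adjusted_out = residual_block(x, small_residual)  # x plus a small correction
```

When the best mapping is close to the identity, the branch only has to output values near zero, which is the situation the slide describes.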

3. Proposed Model
- Residual Attention Network
1. Attention residual learning
2. Soft mask branch
3. Spatial attention and channel attention

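3. Proposed Model
3.1. Attention Residual Learning
• Simply multiplying the CNN output by an attention mask lowers accuracy, for two reasons:
1. Gradients vanish as the network gets deeper
2. Important CNN features may be weakened
• Attention Residual Learning
• The soft mask branch output $M(x) \in [0, 1]$ serves two purposes:
1. Feature selection
2. Noise suppression
[Equation labels: output of the Attention Module, soft attention mask, output of the convolution, residual term]
** i: spatial position, c: channel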
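As given in the Residual Attention Network paper, attention residual learning computes $H_{i,c}(x) = (1 + M_{i,c}(x)) \cdot F_{i,c}(x)$, whereas naive attention computes $H_{i,c}(x) = M_{i,c}(x) \cdot F_{i,c}(x)$. The sketch below contrasts the two on toy values. Reusing one fixed mask across stacked modules is a simplification made here, and all function names are hypothetical.

```python
def naive_attention(mask, trunk):
    """Naive attention learning (NAL): H = M * F."""
    return [m * f for m, f in zip(mask, trunk)]

def attention_residual(mask, trunk):
    """Attention residual learning: H = (1 + M) * F."""
    return [(1.0 + m) * f for m, f in zip(mask, trunk)]

def stack(module, mask, trunk, depth):
    """Apply the same module `depth` times, reusing one fixed mask."""
    out = trunk
    for _ in range(depth):
        out = module(mask, out)
    return out

trunk = [1.0, 1.0, 1.0]   # toy trunk-branch features F(x)
mask = [0.0, 0.5, 1.0]    # hypothetical mask values M(x) in [0, 1]

naive_deep = stack(naive_attention, mask, trunk, depth=5)
residual_one = attention_residual(mask, trunk)
```

With the naive form, a mask value of 0.5 shrinks the feature to 0.5^5 after five modules and a mask value of 0 erases it. With the residual form, a mask value of 0 leaves the trunk feature unchanged, so the mask can only emphasize features.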

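3. Proposed Model
• Attention Residual Learning preserves good features, but it also weakens the mask branch's ability to select features
• Stacked Attention Modules compensate for this trade-off and progressively refine the feature maps
• Each Attention Module provides a different kind of attention, which makes it possible to go deeper
[Figure: multiple Attention Modules]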

3. Proposed Model
Different Attention Modules learn different attention masks.
A shallow Attention Module suppresses the blue sky in the background, while a deep Attention Module highlights the balloon

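3. Proposed Model
3.2. Soft Mask Branch
• Soft Mask Branch
• Implements the following two steps as a convolutional structure
1. Fast feed-forward sweep -> captures information about the whole image
2. Top-down feedback step -> combines the whole-image information with the original feature maps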
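A heavily simplified sketch of the bottom-up top-down idea on a single 2D feature map: downsample to gather coarse information, upsample back to the input resolution, then squash to (0, 1). It assumes 2x2 max pooling and nearest-neighbour upsampling for brevity, and it omits the convolutional and residual units of the actual mask branch. All names are hypothetical.

```python
import math

def max_pool_2x2(fmap):
    """Downsample a 2D map (even height and width) with 2x2 max pooling."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]

def upsample_2x(fmap):
    """Nearest-neighbour upsampling to twice the height and width."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def soft_mask(fmap):
    coarse = max_pool_2x2(fmap)      # bottom-up: summarize a wider region
    restored = upsample_2x(coarse)   # top-down: return to input resolution
    return [[sigmoid(v) for v in row] for row in restored]

features = [
    [0.0, 1.0, -1.0, -2.0],
    [2.0, 0.5, -3.0, -1.5],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mask = soft_mask(features)
```

The resulting mask has the same size as the input, so it can be multiplied element-wise with the trunk features.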

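3. Proposed Model
3.3. Spatial Attention and Channel Attention
• Changing the activation function changes the constraint placed on the attention
1. Mixed attention => sigmoid
2. Channel attention => normalize at each spatial position (across channels)
3. Spatial attention => normalize within each channel (across positions)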
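The three activation choices can be sketched as below, following my reading of the paper's definitions: a plain sigmoid, L2 normalization across channels at each position, and per-channel standardization followed by a sigmoid. The data layout `x[i][c]` (position `i`, channel `c`) and the function names are assumptions made for this illustration, and the sketch does not guard against a zero norm or zero standard deviation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mixed_attention(x):
    """Sigmoid on every element, with no extra constraint."""
    return [[sigmoid(v) for v in pos] for pos in x]

def channel_attention(x):
    """L2-normalize across channels at each spatial position."""
    out = []
    for pos in x:
        norm = math.sqrt(sum(v * v for v in pos))
        out.append([v / norm for v in pos])
    return out

def spatial_attention(x):
    """Standardize each channel over all positions, then apply a sigmoid."""
    n_pos, n_ch = len(x), len(x[0])
    out = [[0.0] * n_ch for _ in range(n_pos)]
    for c in range(n_ch):
        col = [x[i][c] for i in range(n_pos)]
        mean = sum(col) / n_pos
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n_pos)
        for i in range(n_pos):
            out[i][c] = sigmoid((col[i] - mean) / std)
    return out

# x[i][c]: value at spatial position i, channel c (toy numbers)
x = [[3.0, 4.0], [1.0, 0.0], [-1.0, 2.0]]
mixed = mixed_attention(x)
channel = channel_attention(x)
spatial = spatial_attention(x)
```

Channel attention makes channels compete at a fixed position, while spatial attention makes positions compete within a fixed channel. Mixed attention imposes neither constraint.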

4.1. CIFAR and Analysis
1. Validating the effectiveness of Attention Residual Learning
• Baseline: a model with a naive attention mechanism that does not use Attention Residual Learning (NAL: naive attention learning)
• [Table: mean output of each Attention Module stage] With NAL, the output response vanishes at stage 2

2. Comparison with other mask branch structures
• Compares accuracy against plain convolutions without downsampling and upsampling, to verify the advantage of the mask branch structure

3. Robustness to label noise
• Following Training convolutional networks with noisy labels [31], the probabilities are defined as follows
$r$ = probability that a label is correct, $q_{ij}$ = probability that the true label is j and the observed noisy label is i
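The label-noise setup can be sketched as a confusion matrix plus a sampler. The slide only defines $r$ and $q_{ij}$, so this sketch assumes the remaining probability $1 - r$ is spread uniformly over the other classes. The function names and the value `r=0.7` are hypothetical choices for illustration.

```python
import random

def noise_matrix(num_classes, r):
    """Build Q where Q[i][j] is the probability that true label j is seen as i.

    Assumption: off-diagonal mass (1 - r) is split evenly among wrong labels.
    """
    off = (1.0 - r) / (num_classes - 1)
    return [[r if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def corrupt(label, q, rng):
    """Sample a noisy label for a true label from the matching column of Q."""
    column = [q[i][label] for i in range(len(q))]
    return rng.choices(range(len(q)), weights=column, k=1)[0]

q = noise_matrix(num_classes=10, r=0.7)
rng = random.Random(0)                     # fixed seed for repeatability
noisy = [corrupt(3, q, rng) for _ in range(1000)]
kept = sum(1 for y in noisy if y == 3) / len(noisy)
```

Here `kept` estimates the fraction of labels left intact, which should be close to `r`. Lowering `r` produces the noisier training sets used to test robustness.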

4. Accuracy comparison with other state-of-the-art models

4.2. ImageNet Classification
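1. Not only is accuracy better, the model is also more efficient
1. Trains with fewer parameters
2. Requires fewer FLOPs (floating-point operations)
2. Comparing against ResNet units,
1. At comparable accuracy, AttentionNeXt-56 is more efficient
2. At comparable efficiency, AttentionNeXt-56 is more accurate
3. Outperforms state-of-the-art algorithms as well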

5. Conclusion
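• Adds an attention mechanism to ResNet
• Different Attention Modules provide different attention mechanisms
• Uses a bottom-up top-down feedforward convolutional structure for the attention mechanism
• Depth can be increased more stably, improving accuracy
• More refined feature selection and robustness to noise
• Requires lower model complexity (parameter count and computation) than existing models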

6. Squeeze-and-Excitation Networks
• The model that won ILSVRC2017 (paper not yet published)
• Very similar to Residual Attention Network for Image Classification
• The difference is that attention is applied per channel
[Source: https://github.com/hujie-frank/SENet]
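Per-channel attention can be sketched as follows. This follows the squeeze (global average pooling) and excitation (two fully connected layers with a sigmoid gate) design of the later published SENet work, not anything shown in these slides. The weights `w1` and `w2` are hand-picked toy values rather than learned parameters, and all names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def se_block(fmaps, w1, w2):
    """Rescale each channel by one gate; fmaps[c] is the 2D map of channel c."""
    # Squeeze: one number per channel via global average pooling.
    squeezed = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
                for fm in fmaps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    gates = [sigmoid(g) for g in matvec(w2, hidden)]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in fm]
              for g, fm in zip(gates, fmaps)]
    return gates, scaled

fmaps = [
    [[2.0, 2.0], [2.0, 2.0]],   # channel 0
    [[1.0, 1.0], [1.0, 1.0]],   # channel 1
]
w1 = [[1.0, -1.0]]              # toy weights: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]            # toy weights: 1 hidden unit -> 2 channels
gates, scaled = se_block(fmaps, w1, w2)
```

Unlike the soft mask in the Residual Attention Network, which has one value per position and channel, this gate is a single scalar per channel and carries no spatial information.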

~References for these slides~
**Excluding works cited within the paper
• Squeeze-and-Excitation networks (ILSVRC 2017 winner) at CVPR2017
https://photos.google.com/share/AF1QipNRXiNDP9tw-B_kyKk4hnXL_N283IaWNxSYH7jtAN1N0m62Uydh3MnpWFPh2GQYUw?key=STNBSU5XRkpKLXBSbmE2Um9GbGRUSm9aME1naFF3
• Trends in Convolutional Neural Networks
https://www.slideshare.net/sheemap/convolutional-neural-networks-wbafl2
• Introduction to ResNet and derived research
https://www.slideshare.net/masatakanishimori/res-net
• Understanding Residual Network (ResNet) and best practices for tuning
https://deepage.net/deep_learning/2016/11/30/resnet.html
• Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning
https://arxiv.org/abs/1508.04025
