Restricting the Flow: Information Bottlenecks for Attribution

1
Fundamental Team
고형권, 김동희, 김창연, 송헌, 이민경, 이재윤
2021.02.07
ICLR 2020
Deep Learning Reading Group

2
- What is an attribution map? A visualization that gives insight into a DNN's decision-making.
Understanding Deep Neural Networks
How can we "explain" why the network predicts Dog or Cat?
→ Find the region the network is looking at (visual explanation).
[Figure: Input Image → DNNs → Dog / Cat]

3
- Visualizing Attribution Maps
Visual Explanation
White-box approach
- Pros: simple and fast.
- Cons: needs tractable internal components (e.g., gradients, activations); depends on the network architecture.
Black-box approach
- Pros: model-agnostic; can interpret black-box models.
- Cons: difficult to optimize.
[Selvaraju et al. 2017; Fong et al. 2017]

4
- Restricting the flow: Information Bottlenecks for Attribution (ICLR 20’)
Introduction
Existing attribution heatmaps often highlight areas that look subjectively irrelevant, and this might correctly reflect the network's unexpected way of processing the data.
→ The authors therefore propose a novel attribution method that uses the information bottleneck concept to estimate how much information each image region provides for the network's prediction.

5
- Information Bottleneck:
Preliminaries
[Tishby et al. 2000]
$$\max\; I[Y; Z] - \beta\, I[X; Z]$$
- $X$: input, $Y$: label, $Z$: new random variable
- $\beta$: controls the trade-off between predicting the labels well and using little information about $X$
Goal: minimize the information flow while maximizing the original model objective.
A common way to reduce the amount of information: replace part of the signal with noise.
$$Z = \lambda(X)\, R + (1 - \lambda(X))\, \epsilon, \qquad \lambda_i \in [0, 1]$$
- $R = f_l(X)$: output of the $l$-th layer
- $\epsilon \sim \mathcal{N}(\mu_R, \sigma_R^2)$: noise with the same mean and variance as $R$
1. $\lambda_i(X) = 1$ → all information is transmitted ($Z_i = R_i$)
2. $\lambda_i(X) = 0$ → all information in $R_i$ is lost ($Z_i = \epsilon_i$)
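The noise-injection step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `apply_bottleneck` is a hypothetical helper name, `R` is a flattened toy feature vector, and the noise statistics are estimated from that single vector purely for simplicity.

```python
import random
import statistics

def apply_bottleneck(R, lam, rng):
    """Return Z = lam * R + (1 - lam) * eps, element-wise.

    eps is drawn from a normal distribution with the mean and standard
    deviation of R. Estimating them from R itself is an assumption of
    this sketch.
    """
    mu = statistics.fmean(R)
    sigma = statistics.pstdev(R)
    Z = []
    for r_i, lam_i in zip(R, lam):
        eps_i = rng.gauss(mu, sigma)
        Z.append(lam_i * r_i + (1.0 - lam_i) * eps_i)
    return Z

rng = random.Random(0)
R = [0.2, 1.5, -0.7, 3.1]
Z_open = apply_bottleneck(R, [1.0] * 4, rng)    # lambda = 1: Z equals R
Z_closed = apply_bottleneck(R, [0.0] * 4, rng)  # lambda = 0: Z is pure noise
```

With `lam` between the two extremes, each dimension of `Z` is a mix of signal and noise, which is what lets the mask be optimized continuously.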

6
- Information Bottleneck for attribution:
Proposed Method
Now we need to estimate how much information $Z$ still contains about $R$.
Variational approximation: $Q(Z) = \mathcal{N}(\mu_R, \sigma_R)$ (assumption: all dimensions of $Z$ are normally distributed and independent).
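$$I[R, Z] = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,P(Z)]\big]$$
$p(z) = \int p(z|r)\, p(r)\, dr$ → intractable
$$
\begin{aligned}
I[R, Z] &= \int p(r) \int p(z|r) \log \frac{p(z|r)}{p(z)}\, dz\, dr \\
&= \iint p(r, z) \log \left( \frac{p(z|r)}{p(z)} \cdot \frac{q(z)}{q(z)} \right) dz\, dr \\
&= \iint p(r, z) \log \frac{p(z|r)}{q(z)}\, dz\, dr + \iint p(r, z) \log \frac{q(z)}{p(z)}\, dz\, dr \\
&= \iint p(r, z) \log \frac{p(z|r)}{q(z)}\, dz\, dr + \int p(z) \left( \int p(r|z)\, dr \right) \log \frac{q(z)}{p(z)}\, dz \\
&= \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big] - D_{KL}[P(Z)\,\|\,Q(Z)] \\
&\le \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big]
\end{aligned}
$$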
[Klambauer et al. 2017]
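Under the Gaussian assumptions on this slide, the KL term inside the upper bound has a closed form per dimension. The sketch below assumes that $Z \mid R = r$ is normal with mean $\lambda r + (1-\lambda)\mu_R$ and standard deviation $(1-\lambda)\sigma_R$ (which follows from the noise-injection formula), and that $\sigma_R$ in $Q(Z)$ denotes a standard deviation. Plugging both into the standard KL divergence between two univariate normals gives $-\log(1-\lambda) + \tfrac{1}{2}\big[(1-\lambda)^2 + (\lambda \hat r)^2\big] - \tfrac{1}{2}$ with $\hat r = (r - \mu_R)/\sigma_R$. The function name `kl_per_dim` is hypothetical.

```python
import math

def kl_per_dim(r, lam, mu, sigma):
    """KL[P(Z|R=r) || Q(Z)] for one dimension, in nats.

    Assumes Z | R=r ~ N(lam*r + (1-lam)*mu, ((1-lam)*sigma)^2) and
    Q(Z) = N(mu, sigma^2). Requires 0 <= lam < 1, because the KL
    diverges as lam approaches 1.
    """
    r_norm = (r - mu) / sigma
    var_ratio = (1.0 - lam) ** 2          # variance of Z|R relative to Q
    mean_term = (lam * r_norm) ** 2       # squared shift of the mean
    return -math.log(1.0 - lam) + 0.5 * (var_ratio + mean_term) - 0.5
```

When `lam` is 0 the two distributions coincide and the function returns 0, matching the case where no information is transmitted.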

7
- Information Bottleneck for attribution:
Proposed Method
$$I[R, Z] = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big] - D_{KL}[P(Z)\,\|\,Q(Z)]$$
$$\mathcal{L}_I = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big] \ge I[R, Z]$$
If $\mathcal{L}_I = 0$ for an area → information from this area is not necessary for the network's prediction.
The goal is to keep only the information necessary for correct classification. Thus, the mutual information should be minimal while the classification score remains high.
Total objective function:
$$\mathcal{L} = \mathcal{L}_{CE} + \beta\, \mathcal{L}_I$$
$\beta$ controls the relative importance of the two objectives (e.g., for a small $\beta$ more bits of information flow, and fewer for a larger $\beta$).
[Alemi et al. 2017]
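A small sketch of how the two loss terms combine, in plain Python. The function names are hypothetical, and taking $\mathcal{L}_I$ as the mean of the per-dimension KL values (rather than their sum) is an assumption of this example; it only rescales $\beta$.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy (in nats) of a single sample from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def total_loss(logits, target, kl_values, beta):
    """L = L_CE + beta * L_I, with L_I taken as the mean KL per dimension."""
    info_loss = sum(kl_values) / len(kl_values)
    return cross_entropy(logits, target) + beta * info_loss
```

Raising `beta` makes every transmitted nat more expensive relative to the classification term, so the optimized mask closes more of the bottleneck.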

8
- Per-Sample Bottleneck
Proposed Method
Parameterization
: The bottleneck parameters $\lambda$ must lie in $[0, 1]$. Therefore, they parametrize $\lambda = \mathrm{sigmoid}(\alpha)$, where $\alpha \in \mathbb{R}^d$.
Initialization
: At the start, they want to retain all the information → initialize $\alpha_i = 5$, so $\lambda_i \approx 0.993 \Rightarrow Z \approx R$.
Initially the bottleneck has practically no impact on model performance. Optimization then moves away from this starting point to suppress unimportant regions.
Optimization
: 10 iterations of Adam with learning rate 1 to fit the mask $\alpha$. To stabilize training, they copy the single sample 10 times and apply different noise to each copy.
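The parameterization, the initialization, and the 10 noisy copies can be illustrated as follows. This sketch uses toy values (`d`, `R`, `mu`, `sigma` are assumptions, not values from the paper) and deliberately leaves out the Adam updates, which need a differentiable model.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

d = 6                      # number of feature dimensions (toy value)
alpha = [5.0] * d          # initialization from the slide
lam = [sigmoid(a) for a in alpha]   # each entry is close to 1

rng = random.Random(0)
R = [rng.uniform(-1.0, 1.0) for _ in range(d)]   # stand-in feature vector
mu, sigma = 0.0, 1.0       # assumed feature statistics for the noise

# Copy the single sample 10 times and draw different noise for each copy.
batch = []
for _ in range(10):
    Z = [l * r + (1.0 - l) * rng.gauss(mu, sigma) for l, r in zip(lam, R)]
    batch.append(Z)
```

Because every `lam` entry starts near 1, each copy in `batch` is almost identical to `R`, and the copies differ only by a small amount of noise.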

9
- Per-Sample Bottleneck
Proposed Method
Measure of information in $Z$ (i.e., $D_{KL}(P(Z|R)\,\|\,Q(Z))$ per dimension)
: Sum over the channel axis → $m_{[h,w]} = \sum_{i=0}^{c} D_{KL}\big(P(Z_{[i,h,w]} \mid R_{[i,h,w]}) \,\|\, Q(Z_{[i,h,w]})\big)$
Enforcing local smoothness
: Pooling and strided convolutions ignore parts of the input, which causes the Per-Sample Bottleneck (PSB) to overfit to a grid structure.
→ Convolve the sigmoid output with a fixed Gaussian kernel with standard deviation $\sigma_s$:
$$\lambda = \mathrm{blur}(\sigma_s, \mathrm{sigmoid}(\alpha))$$
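Both operations on this slide, the channel sum and the Gaussian smoothing, can be sketched on nested lists. The helper names, the 3-sigma kernel truncation, and the border clamping are assumptions of this example; a real implementation would use a framework convolution.

```python
import math

def channel_sum(kl):
    """Sum a [c][h][w] nested list of KL values over the channel axis."""
    c, h, w = len(kl), len(kl[0]), len(kl[0][0])
    return [[sum(kl[i][y][x] for i in range(c)) for x in range(w)]
            for y in range(h)]

def gaussian_kernel(sigma_s):
    """Normalized 1-D Gaussian kernel truncated at 3 standard deviations."""
    radius = max(1, int(math.ceil(3 * sigma_s)))
    weights = [math.exp(-0.5 * (k / sigma_s) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [wgt / total for wgt in weights]

def blur_rows(img, kernel):
    """Convolve each row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for x in range(n)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def blur(sigma_s, img):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma_s)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))
```

Since the kernel is normalized, blurring a mask whose values lie in $[0, 1]$ keeps it in that range, so the smoothed $\lambda$ remains a valid bottleneck parameter.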

10
- Readout Bottleneck
Proposed Method
Collect feature maps from different depths and combine them with 1×1 convolutions.
① In a first forward pass, no noise is added; feature maps from different depths are collected and bilinearly interpolated to a common spatial size.
② In a second forward pass, the bottleneck layer is inserted into the network to restrict the flow of information.
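The first pass (resize, stack, 1×1 convolution) can be illustrated with a toy example. The sketch assumes single-channel feature maps, corner-aligned bilinear interpolation, and a single 1×1 layer with hand-picked weights; the helper names are hypothetical, and the trained readout network is not reproduced here.

```python
def resize_bilinear(fmap, out_h, out_w):
    """Bilinearly resize a 2-D nested list to out_h x out_w (corner-aligned)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for y in range(out_h):
        sy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for x in range(out_w):
            sx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
            bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def conv1x1(channels, weights, bias=0.0):
    """1x1 convolution to one output channel: a per-pixel weighted sum."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[bias + sum(wgt * ch[y][x] for wgt, ch in zip(weights, channels))
             for x in range(w)] for y in range(h)]

# Feature maps from two depths with different spatial sizes (toy values).
shallow = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]   # 4 x 4
deep = [[0.0, 3.0], [0.0, 3.0]]                      # 2 x 2
stacked = [shallow, resize_bilinear(deep, 4, 4)]
alpha = conv1x1(stacked, weights=[0.5, 0.5])         # readout output
```

In the second pass, `alpha` would be passed through the sigmoid to obtain the mask that the bottleneck layer applies.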

11
- Qualitative Assessment
Evaluation
Subjectively, both the Per-Sample Bottleneck (PSB) and the Readout Bottleneck (RB) identify the areas relevant to the classification well, and their heatmaps are more specific (fewer pixels receive high scores).

13
- Sanity Check (Randomization of Model Parameters)
Evaluation
: Starting from the last layer, an increasing proportion of the network parameters is re-initialized until all parameters are random. The difference between the original heatmap and the heatmap obtained from the randomized model is quantified using SSIM.
For their methods, randomizing the final dense layer drops the mean SSIM by around 0.4.
[Adebayo et al. 2018]
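To make the SSIM comparison concrete, here is a simplified sketch. Standard SSIM averages the statistic over local windows; this example uses one global window over the whole heatmap, which is an assumption made to keep it short. The name `global_ssim` is hypothetical.

```python
def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two equally sized, flattened heatmaps."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    # Usual stabilizing constants for the luminance and contrast terms.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

In the sanity check, `a` would be the heatmap of the original model and `b` the heatmap after randomizing some layers; a lower value means the attribution reacts more strongly to the randomized parameters.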

14
- Sensitivity-N
Evaluation
: Randomly masks the network's input and then measures how strongly the amount of attribution inside the mask correlates with the drop in classifier score.
$$\mathrm{corr}\left( \sum_{i \in T_n} R_i(x),\; S_c(x) - S_c\big(x_{[x_{T_n} = 0]}\big) \right)$$
- $S_c(x)$: classifier logit output for class $c$
- $x_{[x_{T_n} = 0]}$: the input with all pixels in $T_n$ set to zero
PSB ($\beta = 10/k$) performs best for both models above $n = 2 \cdot 10^3$ pixels (i.e., when more than 2% of all pixels are masked).
[Ancona et al. 2018]
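The metric can be sketched in plain Python with a toy linear "classifier". The function names, the number of random masks, and the linear model are assumptions for illustration only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def sensitivity_n(score_fn, attribution, x, n, num_masks, rng):
    """Correlate summed attribution in random masks T_n with the score drop."""
    attr_sums, drops = [], []
    base = score_fn(x)
    for _ in range(num_masks):
        T = rng.sample(range(len(x)), n)     # random index set T_n
        masked = list(x)
        for i in T:
            masked[i] = 0.0                  # pixels in T_n set to zero
        attr_sums.append(sum(attribution[i] for i in T))
        drops.append(base - score_fn(masked))
    return pearson(attr_sums, drops)

# Toy check: for S_c(x) = w . x, the attribution R_i = w_i * x_i
# explains every score drop exactly.
w = [0.5, -1.0, 2.0, 0.3, -0.7, 1.2]
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

def score(v):
    return sum(wi * vi for wi, vi in zip(w, v))

attr = [wi * xi for wi, xi in zip(w, x)]
corr = sensitivity_n(score, attr, x, n=3, num_masks=50, rng=random.Random(0))
```

For this linear toy model the score drop equals the summed attribution for every mask, so the correlation is 1 up to floating-point error. For a real network it is lower, and comparing it across attribution methods is the point of the metric.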

15
- Localization
Evaluation
1. Bounding box: If the bounding box contains $n$ pixels, count how many of the $n$ highest-scored pixels lie inside the bounding box, then divide by $n$ to obtain a ratio.
2. Image degradation: Tiles are removed in order of relevance as ranked by the attribution method, either the most relevant first (MoRF) or the least relevant first (LeRF). The model output is normalized as
$$s(x) = \frac{p(y|x) - b}{t_1 - b}$$
- $t_1$: top-1 probability on the original samples
- $b$: mean model output on the fully degraded images
The LeRF and MoRF degradations yield curves that measure different qualities of the attribution method.
→ Calculate the integral between the two curves.
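Both localization measures can be sketched briefly. The function names, the box format `(top, left, bottom, right)` with exclusive bottom and right edges, and the trapezoidal rule over equally spaced degradation steps are assumptions of this example.

```python
def bbox_ratio(heatmap, box):
    """Fraction of the n highest-scored pixels that lie inside the box.

    n is the number of pixels in the box.
    """
    top, left, bottom, right = box
    n = (bottom - top) * (right - left)
    pixels = [(v, y, x) for y, row in enumerate(heatmap)
              for x, v in enumerate(row)]
    pixels.sort(key=lambda p: p[0], reverse=True)
    inside = sum(1 for _, y, x in pixels[:n]
                 if top <= y < bottom and left <= x < right)
    return inside / n

def normalized_score(p, t1, b):
    """s(x) = (p(y|x) - b) / (t1 - b)."""
    return (p - b) / (t1 - b)

def area_between(lerf, morf):
    """Trapezoidal integral of (LeRF - MoRF) over the degraded fraction in [0, 1]."""
    steps = len(lerf) - 1
    diffs = [l - m for l, m in zip(lerf, morf)]
    return sum((diffs[i] + diffs[i + 1]) / 2 for i in range(steps)) / steps
```

A good attribution method keeps the LeRF curve high (removing irrelevant tiles barely hurts) and drives the MoRF curve down quickly, so a larger area between the curves indicates a better ranking of the tiles.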

Thank You
Q&A
16
arkimjh@naver.com
