Regularizing Class-wise Predictions via Self-knowledge Distillation
김 성 철
Contents
1. Introduction
2. Class-wise Self-Knowledge Distillation (CS-KD)
3. Experiments
4. Conclusion
Introduction
• Regularization
• As the number of network parameters grows, overfitting and poor generalization become more likely
• early stopping, 𝐿1/𝐿2-regularization, dropout, batch normalization, data augmentation
• Regularizing the predictive distribution of DNNs
• Label-smoothing, entropy maximization, angular-margin
• Also affects network calibration, novelty detection, and exploration in reinforcement learning
• Dark knowledge
• The knowledge contained in wrong (non-target) predictions
• Its importance was first demonstrated in knowledge distillation
Introduction
• Class-wise Self-Knowledge Distillation (CS-KD)
• Matches or distills the predictive distributions between different samples of the same class
• Encourages samples of the same class to produce similar predictions, even when those predictions are wrong
• Consistency of the predictive distribution
• Preventing overconfident predictions & reducing intra-class variation
• Lower top-1 error rates than other output-regularization methods
• Better top-5 error rates and expected calibration error
• Better top-1 error rates than recent self-distillation methods
• Further gains when combined with methods such as Mixup and knowledge distillation
Class-wise Self-Knowledge Distillation
• Softmax classifier
$P(y \mid x; \theta, T) = \frac{\exp(f_y(x;\theta)/T)}{\sum_{i=1}^{C} \exp(f_i(x;\theta)/T)}$
• Class-wise regularization
• Encourages a consistent predictive distribution across samples of the same class
• Class-wise regularization loss
$\mathcal{L}_{\mathrm{cls}}(x, x'; \theta, T) := \mathrm{KL}\left(P(y \mid x'; \tilde{\theta}, T) \,\|\, P(y \mid x; \theta, T)\right)$
• $x, x'$ : an input and another randomly sampled input having the same label $y$
• KL : the Kullback-Leibler divergence
• $\tilde{\theta}$ : a fixed copy of the parameters $\theta$; the gradient through $\tilde{\theta}$ is stopped to avoid the model collapse issue
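A minimal PyTorch sketch of this loss (the function name, the `model` interface returning logits of shape (B, C), and the batching are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def class_wise_loss(model, x, x_same_class, temperature=4.0):
    """KL(P(y|x'; theta~, T) || P(y|x; theta, T)), with a stop-gradient on the x' branch."""
    logits = model(x)                          # gradients flow through this branch only
    with torch.no_grad():                      # fixed copy of the parameters (theta~)
        target_logits = model(x_same_class)    # same-class sample, no gradient -> no collapse
    log_p = F.log_softmax(logits / temperature, dim=1)
    q = F.softmax(target_logits / temperature, dim=1)
    # F.kl_div(input, target) expects log-probabilities as input and probabilities as target,
    # and computes KL(target || input), matching KL(P(y|x') || P(y|x)) above
    return F.kl_div(log_p, q, reduction="batchmean")
```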
Class-wise Self-Knowledge Distillation
• Class-wise regularization (Cont.)
• Total training loss ℒCS-KD
$\mathcal{L}_{\mathrm{CS\text{-}KD}}(x, x', y; \theta, T) := \mathcal{L}_{\mathrm{CE}}(x, y; \theta) + \lambda_{\mathrm{cls}} \cdot T^2 \cdot \mathcal{L}_{\mathrm{cls}}(x, x'; \theta, T)$
• $\mathcal{L}_{\mathrm{CE}}$ : the standard cross-entropy loss
• $\lambda_{\mathrm{cls}} > 0$ : a loss weight for the class-wise regularization
• The class-wise term is scaled by the square of the temperature, $T^2$, as in the original KD
https://arxiv.org/pdf/1503.02531.pdf
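Continuing the sketch above, the total objective can be assembled roughly as follows (a sketch only; default values mirror the hyper-parameters listed later in the experiments):

```python
def cs_kd_loss(model, x, x_same_class, y, temperature=4.0, lambda_cls=1.0):
    """Cross-entropy plus the T^2-scaled class-wise regularization term."""
    ce = F.cross_entropy(model(x), y)                    # standard cross-entropy loss
    cls = class_wise_loss(model, x, x_same_class, temperature)
    return ce + lambda_cls * (temperature ** 2) * cls    # T^2 scaling as in the original KD
```

In practice the batch sampler must pair each input x with another sample x' of the same class, e.g. by keeping a per-class index of training examples.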
Class-wise Self-Knowledge Distillation
• Effects of class-wise regularization
• Preventing overconfident predictions
• Uses the model's predictions on other samples as soft labels
• More realistic than label smoothing
• Reducing the intra-class variations
• Minimizes the distance between the logits of two samples of the same class
• Inspection of the softmax prediction values
• PreAct ResNet-18 trained on CIFAR-100
• Examined on misclassified CIFAR-100 samples
• Mitigates overconfident predictions
• Strengthens the prediction value of the ground-truth class
Class-wise Self-Knowledge Distillation
• Effects of class-wise regularization (Cont.)
• Log-probabilities of the softmax scores
• (a) Confidence on misclassified samples is lower
• (b) The ground-truth class score on misclassified samples is higher
Experiments
• Experimental setup
• Datasets
• CIFAR-100, TinyImageNet : datasets for conventional classification tasks
• CUB-200-2011, Stanford Dogs, MIT67 : datasets for fine-grained classification tasks
• Contain visually similar classes and few training samples per class
• ImageNet : a large-scale classification task
• Network architecture
• ResNet-18 with 64 filters, DenseNet-121 with a growth rate of 32 : fine-grained classification
• PreAct ResNet-18 : conventional classification
• Hyper-parameters
• SGD with momentum 0.9, weight decay 0.0001, an initial learning rate of 0.1 (divided by 10 after epochs 100 and 150)
• 200 epochs / batch size : 128 (conventional), 32 (fine-grained) / flipping and random cropping
• T ∈ {1, 4} / λ_cls ∈ {1, 2, 3, 4}
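As a rough illustration of this schedule (a sketch under the stated settings; the authors' actual training script is not reproduced here):

```python
import torch

# `model` is assumed to be one of the networks above (e.g. PreAct ResNet-18).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 after epochs 100 and 150; train for 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one pass over the training set, computing cs_kd_loss on each same-class pair ...
    scheduler.step()
```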
Experiments
• Experimental setup (Cont.)
• Baselines
https://arxiv.org/pdf/1905.00292.pdf
https://arxiv.org/pdf/1811.12611.pdf
https://arxiv.org/pdf/1809.05934.pdf
https://arxiv.org/pdf/1906.02629.pdf
https://www.aaai.org/ojs/index.php/AAAI/article/view/4498
https://arxiv.org/pdf/1905.08094.pdf
Experiments
• Experimental setup (Cont.)
• Evaluation metric
https://arxiv.org/pdf/1706.04599.pdf
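The reference above is Guo et al.'s calibration paper; the calibration metric reported in this work is the expected calibration error (ECE). A minimal sketch of the usual binned ECE computation (bin count and interface are assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: |accuracy - mean confidence| per equal-width confidence bin,
    averaged with weights equal to the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)   # max softmax probability per sample
    correct = np.asarray(correct, dtype=float)           # 1 if the prediction was correct, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```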
Experiments
• Classification accuracy
• Comparison with output regularization methods
Experiments
• Classification accuracy
• Comparison with self-distillation methods
Experiments
• Classification accuracy
• Evaluation on large-scale datasets
Experiments
• Classification accuracy
• Compatibility with other regularization methods
Experiments
• Ablation study
• Feature embedding analysis
Experiments
• Ablation study
• Hierarchical image classification
• 387 fine-grained labels and three hierarchy labels : bird (CUB-200-2011), dog (Stanford Dogs), indoor (MIT67)
• Trained on 30 randomly sampled images per fine-grained label; tested on the original test set
• The model predicts fine-grained labels, and hierarchical classification accuracy is measured
Experiments
• Calibration effects
• The plotted identity function (dashed diagonal) corresponds to perfect calibration
Experiments
• Calibration effects (Cont.)
• Combined with a consistency loss: $\mathcal{L}_{\mathrm{CS\text{-}KD\text{-}E}}$
$\mathcal{L}_{\mathrm{CS\text{-}KD\text{-}E}}(x, x', y; \theta, T) := \mathcal{L}_{\mathrm{CS\text{-}KD}}(x_{\mathrm{aug}}, x'_{\mathrm{aug}}, y; \theta, T) + \lambda_{\mathrm{E}} \cdot T^2 \cdot \mathrm{KL}\left(P(y \mid x; \tilde{\theta}, T) \,\|\, P(y \mid x_{\mathrm{aug}}; \theta, T)\right)$
• $x_{\mathrm{aug}}$ : an augmented sample generated by a data augmentation technique
• $\lambda_{\mathrm{E}} > 0$ : the loss weight for balancing the two terms
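A hedged sketch of this combined objective, reusing the helpers above (the `augment` callable is a placeholder for whatever augmentation is applied, not a specific technique from the paper):

```python
def cs_kd_e_loss(model, x, x_same_class, y, augment,
                 temperature=4.0, lambda_cls=1.0, lambda_e=1.0):
    """CS-KD on augmented views plus a consistency KL term between the
    clean prediction (stop-gradient) and the augmented prediction."""
    x_aug = augment(x)
    x_aug_same = augment(x_same_class)
    base = cs_kd_loss(model, x_aug, x_aug_same, y, temperature, lambda_cls)

    with torch.no_grad():                     # fixed parameters for the clean view
        clean_logits = model(x)
    log_p_aug = F.log_softmax(model(x_aug) / temperature, dim=1)
    q_clean = F.softmax(clean_logits / temperature, dim=1)
    consistency = F.kl_div(log_p_aug, q_clean, reduction="batchmean")
    return base + lambda_e * (temperature ** 2) * consistency
```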
Conclusion
• A simple regularization method to enhance the generalization performance of DNNs
• Minimizes the Kullback-Leibler divergence between the predictive distributions of different samples with the same label
• Generalization and calibration of neural network
• Applicable with a broader range of applications
• Exploration in deep reinforcement learning
• Transfer learning
• Face verification
• Detection of out-of-distribution samples
Thank you