Regularizing Class-wise Predictions via Self-knowledge Distillation
김 성 철
Contents
1. Introduction
2. Class-wise Self-Knowledge Distillation (CS-KD)
3. Experiments
4. Conclusion
Introduction
• Regularization
• As the number of network parameters grows, overfitting and poor generalization become more likely
• early stopping, 𝐿1/𝐿2-regularization, dropout, batch normalization, data augmentation
• Regularizing the predictive distribution of DNNs
• Label-smoothing, entropy maximization, angular-margin
• Also affects network calibration, novelty detection, and exploration in reinforcement learning
• Dark knowledge
• The knowledge contained in wrong (non-target) predictions
• Its importance was first demonstrated in knowledge distillation
Introduction
• Class-wise Self-Knowledge Distillation (CS-KD)
• Matches or distills the predictive distributions between different samples of the same class
• Encourages samples of the same class to produce similar predictions, even when those predictions are wrong
• Consistency of the predictive distribution
• Preventing overconfident predictions & reducing intra-class variation
• Lower top-1 error rates than other output-regularization methods
• Better top-5 error rates and expected calibration error
• Better top-1 error rates than recent self-distillation methods
• Further gains when combined with methods such as Mixup and knowledge distillation
Class-wise Self-Knowledge Distillation
• Softmax classifier
$P(y \mid x; \theta, T) = \frac{\exp(f_y(x;\theta)/T)}{\sum_{i=1}^{C} \exp(f_i(x;\theta)/T)}$
• Class-wise regularization
• Encourages a consistent predictive distribution across samples of the same class
• Class-wise regularization loss
$\mathcal{L}_{\mathrm{cls}}(x, x'; \theta, T) := \mathrm{KL}\left(P(y \mid x'; \tilde{\theta}, T) \,\|\, P(y \mid x; \theta, T)\right)$
• $x, x'$ : an input and another randomly sampled input having the same label $y$
• KL : the Kullback-Leibler divergence
• $\tilde{\theta}$ : a fixed copy of the parameters $\theta$; the gradient through $\tilde{\theta}$ is stopped to avoid the model collapse issue
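A minimal PyTorch sketch of this loss (the function name, the `model` interface returning logits of shape (B, C), and the batching are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def class_wise_loss(model, x, x_same_class, temperature=4.0):
    """KL(P(y|x'; theta~, T) || P(y|x; theta, T)), with a stop-gradient on the x' branch."""
    logits = model(x)                          # gradients flow through this branch only
    with torch.no_grad():                      # fixed copy of the parameters (theta~)
        target_logits = model(x_same_class)    # same-class sample, no gradient -> no collapse
    log_p = F.log_softmax(logits / temperature, dim=1)
    q = F.softmax(target_logits / temperature, dim=1)
    # F.kl_div(input, target) expects log-probabilities as input and probabilities as target,
    # and computes KL(target || input), matching KL(P(y|x') || P(y|x)) above
    return F.kl_div(log_p, q, reduction="batchmean")
```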
Class-wise Self-Knowledge Distillation
• Class-wise regularization (Cont.)
• Total training loss ℒCS-KD
$\mathcal{L}_{\mathrm{CS\text{-}KD}}(x, x', y; \theta, T) := \mathcal{L}_{\mathrm{CE}}(x, y; \theta) + \lambda_{\mathrm{cls}} \cdot T^2 \cdot \mathcal{L}_{\mathrm{cls}}(x, x'; \theta, T)$
• $\mathcal{L}_{\mathrm{CE}}$ : the standard cross-entropy loss
• $\lambda_{\mathrm{cls}} > 0$ : a loss weight for the class-wise regularization
• The class-wise term is scaled by the square of the temperature, $T^2$, as in the original KD
https://arxiv.org/pdf/1503.02531.pdf
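Continuing the sketch above, the total objective can be assembled roughly as follows (a sketch only; default values mirror the hyper-parameters listed later in the experiments):

```python
def cs_kd_loss(model, x, x_same_class, y, temperature=4.0, lambda_cls=1.0):
    """Cross-entropy plus the T^2-scaled class-wise regularization term."""
    ce = F.cross_entropy(model(x), y)                    # standard cross-entropy loss
    cls = class_wise_loss(model, x, x_same_class, temperature)
    return ce + lambda_cls * (temperature ** 2) * cls    # T^2 scaling as in the original KD
```

In practice the batch sampler must pair each input x with another sample x' of the same class, e.g. by keeping a per-class index of training examples.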
Class-wise Self-Knowledge Distillation
• Effects of class-wise regularization
• Preventing overconfident predictions
• Uses the model's predictions on other samples as soft labels
• More realistic than label smoothing
• Reducing the intra-class variations
• Minimizes the distance between the logits of two samples of the same class
• Inspection of the softmax prediction values
• PreAct ResNet-18 trained on CIFAR-100
• Examined on misclassified CIFAR-100 samples
• Mitigates overconfident predictions
• Strengthens the prediction value of the ground-truth class
Class-wise Self-Knowledge Distillation
• Effects of class-wise regularization (Cont.)
• Log-probabilities of the softmax scores
• (a) Confidence on misclassified samples is lower
• (b) The ground-truth class score on misclassified samples is higher
Experiments
• Experimental setup
• Datasets
• CIFAR-100, TinyImageNet : datasets for conventional classification tasks
• CUB-200-2011, Stanford Dogs, MIT67 : datasets for fine-grained classification tasks
• Contain visually similar classes and few training samples per class
• ImageNet : a large-scale classification task
• Network architecture
• ResNet-18 with 64 filters, DenseNet-121 with a growth rate of 32 : fine-grained classification
• PreAct ResNet-18 : conventional classification
• Hyper-parameters
• SGD with momentum 0.9, weight decay 0.0001, an initial learning rate of 0.1 (divided by 10 after epochs 100 and 150)
• 200 epochs / batch size : 128 (conventional), 32 (fine-grained) / flipping and random cropping
• T ∈ {1, 4} / λ_cls ∈ {1, 2, 3, 4}
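As a rough illustration of this schedule (a sketch under the stated settings; the authors' actual training script is not reproduced here):

```python
import torch

# `model` is assumed to be one of the networks above (e.g. PreAct ResNet-18).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 after epochs 100 and 150; train for 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one pass over the training set, computing cs_kd_loss on each same-class pair ...
    scheduler.step()
```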
Experiments
• Experimental setup (Cont.)
• Baselines
https://arxiv.org/pdf/1905.00292.pdf
https://arxiv.org/pdf/1811.12611.pdf
https://arxiv.org/pdf/1809.05934.pdf
https://arxiv.org/pdf/1906.02629.pdf
https://www.aaai.org/ojs/index.php/AAAI/article/view/4498
https://arxiv.org/pdf/1905.08094.pdf
Experiments
• Experimental setup (Cont.)
• Evaluation metric
https://arxiv.org/pdf/1706.04599.pdf
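The reference above is Guo et al.'s calibration paper; the calibration metric reported in this work is the expected calibration error (ECE). A minimal sketch of the usual binned ECE computation (bin count and interface are assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: |accuracy - mean confidence| per equal-width confidence bin,
    averaged with weights equal to the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)   # max softmax probability per sample
    correct = np.asarray(correct, dtype=float)           # 1 if the prediction was correct, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```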
Experiments
• Classification accuracy
• Comparison with output regularization methods
Experiments
• Classification accuracy
• Comparison with self-distillation methods
Experiments
• Classification accuracy
• Evaluation on large-scale datasets
Experiments
• Classification accuracy
• Compatibility with other regularization methods
Experiments
• Ablation study
• Feature embedding analysis
Experiments
• Ablation study
• Hierarchical image classification
• 387 fine-grained labels and three hierarchy labels : bird (CUB-200-2011), dog (Stanford Dogs), indoor (MIT67)
• Trained on 30 randomly sampled images per fine-grained label; tested on the original test set
• The model predicts fine-grained labels, and hierarchical classification accuracy is measured
Experiments
• Calibration effects
• The plotted identity function (dashed diagonal) corresponds to perfect calibration
Experiments
• Calibration effects (Cont.)
• Combined with a consistency loss: $\mathcal{L}_{\mathrm{CS\text{-}KD\text{-}E}}$
$\mathcal{L}_{\mathrm{CS\text{-}KD\text{-}E}}(x, x', y; \theta, T) := \mathcal{L}_{\mathrm{CS\text{-}KD}}(x_{\mathrm{aug}}, x'_{\mathrm{aug}}, y; \theta, T) + \lambda_{\mathrm{E}} \cdot T^2 \cdot \mathrm{KL}\left(P(y \mid x; \tilde{\theta}, T) \,\|\, P(y \mid x_{\mathrm{aug}}; \theta, T)\right)$
• $x_{\mathrm{aug}}$ : an augmented sample generated by a data augmentation technique
• $\lambda_{\mathrm{E}} > 0$ : the loss weight for balancing the two terms
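A hedged sketch of this combined objective, reusing the helpers above (the `augment` callable is a placeholder for whatever augmentation is applied, not a specific technique from the paper):

```python
def cs_kd_e_loss(model, x, x_same_class, y, augment,
                 temperature=4.0, lambda_cls=1.0, lambda_e=1.0):
    """CS-KD on augmented views plus a consistency KL term between the
    clean prediction (stop-gradient) and the augmented prediction."""
    x_aug = augment(x)
    x_aug_same = augment(x_same_class)
    base = cs_kd_loss(model, x_aug, x_aug_same, y, temperature, lambda_cls)

    with torch.no_grad():                     # fixed parameters for the clean view
        clean_logits = model(x)
    log_p_aug = F.log_softmax(model(x_aug) / temperature, dim=1)
    q_clean = F.softmax(clean_logits / temperature, dim=1)
    consistency = F.kl_div(log_p_aug, q_clean, reduction="batchmean")
    return base + lambda_e * (temperature ** 2) * consistency
```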
Conclusion
• A simple regularization method to enhance the generalization performance of DNNs
• Minimizes the Kullback-Leibler divergence between the predictive distributions of different samples with the same label
• Generalization and calibration of neural network
• Applicable with a broader range of applications
• Exploration in deep reinforcement learning
• Transfer learning
• Face verification
• Detection of out-of-distribution samples
Thank you