PR-373: Revisiting ResNets: Improved Training and Scaling Strategies (NeurIPS 2021)
Sunghoon Joo, VUNO Inc.


February 27, 2022
1. Research Background
2. Experimental Results & Methodology
3. Conclusions
1. Research Background (3/24)
The performance of a vision model
• The paper points out that new architectures trained with modern training methodology are typically compared against older architectures trained with outdated training methodology.
• A vision model's performance is determined by three factors: architecture, training methods, and scaling strategy.
1. Research Background (4/24)
Objective: provide new perspectives and practical advice for scaling vision architectures.
• Apply modern training and regularization techniques to ResNet, together with a much larger number of training epochs.
• Re-scale ResNet along depth, width, and resolution: ResNet-RS.
1. Research Background (5/24)
Previous works: architecture
• Hand-designed convolutional networks: VGG, Inception, ResNet, ResNeXt
• Architectures found by automated architecture search: NASNet [67], AmoebaNet [41] (83.9%), EfficientNet-B7 [55] (84.4%, 2019)
• Adapting self-attention to the visual domain: AA-ResNet-152 (79.1%, 2019), ViT-L/16 (87.76±0.03%, 2020), LambdaResNet200 (84.3%, 2021)
1. Research Background (6/24)
Previous works: architecture, EfficientNet (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML 2019; PR-169)
• Compound scaling: depth, width, and resolution are scaled jointly with a single coefficient (the rule is restated below).
• Shortcomings pointed out in this paper:
  • Small number of training epochs
  • A baseline network (EfficientNet-B0) designed by optimizing for FLOPs
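For context on the compound-scaling bullet above, the rule as defined in the EfficientNet paper cited on this slide (it is not restated in the deck) is:

```latex
% EfficientNet compound scaling: a single coefficient \phi scales depth, width,
% and resolution together, with \alpha, \beta, \gamma found by a small grid
% search under the constraint that FLOPs grow roughly as 2^{\phi}.
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{s.t.}\ \ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```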
1. Research Background (7/24)
Previous works: training/regularization methodology
• Innovations in training (e.g. improved learning rate schedules)
• Regularization methods:
  • dropout
  • label smoothing (see the sketch after this slide; illustration: https://arxiv.org/pdf/2011.12562.pdf)
  • stochastic depth
  • dropblock
  • data augmentation
• Regularization methods have become especially useful to prevent overfitting when training ever-larger models on limited data (e.g. 1.2M ImageNet images).
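To make the label-smoothing bullet concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper); the smoothing value 0.1 matches the setting the deck reports later for the scaling search.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits: torch.Tensor, targets: torch.Tensor,
                       smoothing: float = 0.1) -> torch.Tensor:
    """Cross-entropy with label smoothing: the one-hot target is mixed with a
    uniform distribution so the model is not pushed toward fully confident outputs."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        # Smoothed target: (1 - eps) weight on the true class, eps spread uniformly.
        smooth_targets = torch.full_like(log_probs, smoothing / num_classes)
        smooth_targets.scatter_(-1, targets.unsqueeze(-1),
                                1.0 - smoothing + smoothing / num_classes)
    return torch.mean(torch.sum(-smooth_targets * log_probs, dim=-1))

# Example: a batch of 4 logits over 1000 ImageNet classes.
loss = label_smoothing_ce(torch.randn(4, 1000), torch.randint(0, 1000, (4,)))
```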
1. Research Background (8/24)
Previous works: additional training data, pre-training on large-scale datasets
• Revisiting Unreasonable Effectiveness of Data in Deep Learning Era (ICCV 2017): 76.4% -> 79.2% from large-scale pre-training.
• ViT (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021): 88.6%
  • Transformer architecture
  • Pre-training on ImageNet-21k (14M images, roughly 10x ImageNet) and JFT-300M
• NFNets (Brock, A. et al., High-Performance Large-Scale Image Recognition Without Normalization, 2021): 89.2%
  • ResNet-based architecture
  • Pre-training on JFT-300M
  • Batch normalization replaced with Adaptive Gradient Clipping (AGC)
1. Research Background (9/24)
Previous works: additional training data, semi-supervised learning
• Noisy Student (EfficientNet-L2): 88.4% (Self-training with Noisy Student improves ImageNet classification, CVPR)
• Meta Pseudo Labels (EfficientNet-L2): 90.2% (PR-336)
2. Experimental Results & Methodology
2. Experimental Results & Methodology (11/24)
Additive Study of Improvements
• Top-1 accuracy of 82.2% (+3.2%) is reached without any architecture change.
• Baseline: ResNet-200, 256x256 input, 90 epochs, stepwise learning rate decay.
• Hyperparameter selection uses a held-out validation set comprising 2% of the ImageNet training set (minival-set); a sketch of such a split follows this slide.
• Increasing the training epochs (90 -> 350) is useful only after regularization has been applied.
[Table] Top-1 accuracy on the ImageNet validation set, averaged over 2 runs; improvements grouped into Training, Regularization, and Architecture changes.
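A minimal sketch (my own illustration, assuming a torchvision-style ImageFolder layout; the actual data pipeline is not described in the deck) of carving a 2% minival split out of the ImageNet training set for hyperparameter selection:

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hypothetical path; replace with the actual ImageNet training directory.
train_full = datasets.ImageFolder("/data/imagenet/train", transform=transforms.ToTensor())

# Hold out ~2% of the training images as a "minival" set, so the official
# ImageNet validation set is only touched for final reporting.
n_minival = int(0.02 * len(train_full))
generator = torch.Generator().manual_seed(0)  # fixed seed -> reproducible split
train_set, minival_set = random_split(
    train_full, [len(train_full) - n_minival, n_minival], generator=generator
)
```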
2. Experimental Results & Methodology (12/24)
Additive Study of Improvements: regularization and data augmentation
• Top-1 accuracy of 82.2% (+3.2%) without architecture changes (baseline ResNet-200, 256x256, 90 epochs, stepwise learning rate decay).
• Dropout applied after the global average pooling layer.
• Results differ depending on whether the decreased weight decay is applied.
• Stochastic depth (G. Huang et al., ECCV 2016).
• RandAugment: random image transformations (e.g. translate, shear, color distortions).
• Momentum SGD with a cosine learning rate schedule (see the sketch below).
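A minimal sketch (my own illustration, not the paper's training code) of momentum SGD with a cosine learning rate schedule; the 350-epoch horizon matches the extended schedule on the previous slide, while the base learning rate and momentum values here are assumptions.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(2048, 1000)  # stand-in for the ResNet classifier head

# Momentum SGD; lr and momentum values are illustrative, not the paper's exact settings.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5)

# Cosine decay of the learning rate over the full run.
epochs = 350
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training epoch over ImageNet would go here ...
    scheduler.step()  # decay the learning rate once per epoch
```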
2. Experimental Results & Methodology (13/24)
Additive Study of Improvements: architecture changes
• Top-1 accuracy of 82.2% (+3.2%) without architecture changes.
• Adding two simple architecture changes (SE, ResNet-D) raises Top-1 accuracy to 83.4% (+4.4%).
• Baseline: ResNet-200, 256x256, 90 epochs, stepwise learning rate decay.
• Architecture changes:
  • ResNet-D (PR-201)
  • Squeeze-and-Excitation (Hu, J. et al., 2020); a sketch of an SE block follows this slide.
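A minimal PyTorch sketch (my own illustration) of the Squeeze-and-Excitation block referenced above: global average pooling squeezes each channel to a scalar, a small bottleneck MLP produces per-channel gates, and the input feature map is rescaled channel-wise. The reduction factor of 4 is an assumption; the original SE paper defaults to 16.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation block (Hu et al.): channel-wise feature recalibration."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average per channel
        self.fc = nn.Sequential(                 # excitation: bottleneck MLP with gates in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # rescale each channel of the input feature map

# Example: recalibrate a 64-channel feature map.
out = SqueezeExcite(64)(torch.randn(2, 64, 56, 56))
```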
2. Experimental Results & Methodology (14/24)
Importance of decreasing weight decay when combining regularization methods
(RA: RandAugment, LS: Label Smoothing, DO: Dropout on FC, SD: Stochastic Depth)
• When only RA or LS is applied, the default weight decay (1e-4) does not need to be changed.
• When more regularization is added, performance does not improve unless the weight decay is lowered.
• In other words, when several regularizers are combined they must be balanced so that the model is not over-regularized (see the configuration sketch below).
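A minimal sketch of this rule of thumb (my own simplification, not the paper's code; the exact switching condition is an assumption, while 1e-4 and 4e-5 are the values mentioned in the deck):

```python
import torch
from torch.optim import SGD

def make_optimizer(model: torch.nn.Module, use_randaugment: bool, use_label_smoothing: bool,
                   use_dropout: bool, use_stochastic_depth: bool) -> SGD:
    # With only RA or LS the default weight decay (1e-4) is fine; once several
    # regularizers are combined, lower the weight decay to avoid over-regularization.
    n_regularizers = sum([use_randaugment, use_label_smoothing,
                          use_dropout, use_stochastic_depth])
    weight_decay = 1e-4 if n_regularizers <= 1 else 4e-5
    return SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=weight_decay)
```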
2. Experimental Results & Methodology (15/24)
Extensive search on ImageNet over scaling strategies
• Extensive search over width multipliers in [0.25, 0.5, 1.0, 1.5, 2.0], depths in [26, 50, 101, 200, 300, 350, 400], and resolutions in [128, 160, 224, 320, 448].
• 350 epochs; regularization is increased with model size in an effort to limit overfitting (the rules are sketched as code after this list).
• Smaller models follow a power-law trend; at equal FLOPs, accuracy depends on the image resolution, which motivates slow image-resolution scaling.
• RandAugment magnitude:
  • 10 for filter multipliers in [0.25, 0.5] or image resolutions in [64, 160]
  • 15 for image resolutions in [224, 320]
  • 20 otherwise
• Stochastic depth:
  • drop rate of 0.2 for image resolutions 224 and above
  • not applied for filter multiplier 0.25 or images smaller than 224
• All models use label smoothing of 0.1 and a weight decay of 4e-5.
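The per-model regularization rules above can be read as a small lookup; a minimal sketch (my own rendering of the slide's rules, not code from the paper; boundary handling between the listed resolutions is my interpolation):

```python
def regularization_config(width_multiplier: float, resolution: int) -> dict:
    """Regularization settings as a function of model size, following the rules above."""
    # RandAugment magnitude grows with model capacity / input resolution.
    if width_multiplier in (0.25, 0.5) or resolution <= 160:
        ra_magnitude = 10
    elif resolution <= 320:
        ra_magnitude = 15
    else:
        ra_magnitude = 20

    # Stochastic depth only for sufficiently large inputs and widths.
    if width_multiplier == 0.25 or resolution < 224:
        stochastic_depth_rate = 0.0
    else:
        stochastic_depth_rate = 0.2

    return {
        "randaugment_magnitude": ra_magnitude,
        "stochastic_depth_rate": stochastic_depth_rate,
        "label_smoothing": 0.1,   # used for all models
        "weight_decay": 4e-5,     # used for all models
    }

# Example: a width-1.0 model trained at 320x320 input resolution.
print(regularization_config(1.0, 320))
```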
2. Experimental Results & Methodology (16/24)
Depth Scaling in Regimes Where Overfitting Can Occur
• Depth scaling improves performance in long-epoch regimes, while width scaling helps in short-epoch regimes.
• Width scaling is more prone to overfitting, since widening the model increases the parameter count more sharply.
• The paper notes that previous works chose width scaling even in training regimes of only about 40 epochs:
  • BiT (2019) scales the width of ResNet-152 with a 4x filter multiplier
  • NFNet (2021) scales the width with a ~1.5x filter multiplier
• Search space: image resolution [128, 160, 224, 320], depth [101, 200, 300, 400], width multiplier [1.0, 1.5, 2.0]
2. Experimental Results & Methodology (17/24)
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• Confirms that FLOPs do not reflect actual training and inference latency (a wall-clock timing sketch follows this slide).
• On GPUs/TPUs, memory access cost has a larger impact on latency, and FLOPs do not capture it.
• EfficientNet's depthwise convolutions operate on larger activations, so it loses latency to ResNet on GPU/TPU: more memory consumption -> more memory accesses -> higher latency.
• ResNet-RS-350 has 1.8x more FLOPs, yet 2.7x lower TPU-v3 latency.
• ResNet-RS-350 has 3.8x more parameters, yet uses 2.3x less memory (a consequence of its depth-resolution scaling).
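As a companion to the FLOPs-vs-latency point, a minimal sketch (my own illustration, not the paper's benchmark code) of measuring wall-clock GPU inference latency directly; the model, batch size, and resolution are arbitrary stand-ins:

```python
import time
import torch
from torchvision.models import resnet50

model = resnet50().eval().cuda()                 # stand-in; the paper benchmarks ResNet-RS / EfficientNet
x = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                          # warm-up iterations are not timed
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()                     # wait for queued GPU work before reading the clock
    latency_ms = (time.perf_counter() - start) / 50 * 1000

print(f"average forward latency: {latency_ms:.1f} ms per batch")
```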
2. Experimental Results & Methodology (18/24)
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• The core of ResNet-RS is to scale the image resolution more slowly than width and depth, which reduces the number of activations and therefore the memory footprint (a back-of-the-envelope calculation follows this slide).
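A rough back-of-the-envelope illustration (my own, under the simplifying assumption that a conv layer's activation memory is proportional to H x W x C) of why resolution is the most expensive scaling axis in terms of activations:

```python
def conv_activation_elements(resolution: int, channels: int) -> int:
    """Activation count of one conv feature map, ignoring batch size and downsampling."""
    return resolution * resolution * channels

base = conv_activation_elements(224, 256)

# Doubling the resolution quadruples every layer's activations, while doubling
# the width only doubles them; adding depth adds layers whose per-layer cost
# stays the same.
print(conv_activation_elements(448, 256) / base)  # 4.0  (2x resolution)
print(conv_activation_elements(224, 512) / base)  # 2.0  (2x width)
```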
2. Experimental Results & Methodology (19/24)
Improving the scaling of EfficientNet (EfficientNet-RS)
• The paper argues that EfficientNet's strategy of jointly increasing model depth, width, and resolution at a constant rate is suboptimal.
• Applying the slow image-resolution scaling strategy to EfficientNets, with no change to width or depth, improves their performance.
2. Experimental Results & Methodology (20/24)
ResNet-RS performance in a large-scale semi-supervised learning setup
• In a semi-supervised learning setup, ResNet-RS also reaches high accuracy while offering 4.7x - 5.5x faster inference than EfficientNet.
• ResNet-RS is trained on 1.3M labeled ImageNet images plus 130M pseudo-labeled images, as in Noisy Student (a sketch of the pseudo-labeling setup follows this slide).
• The 130M pseudo-labeled images are generated with the EfficientNet-L2 model (88.4%) from the Noisy Student paper.
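A minimal sketch (my own simplification of the Noisy-Student-style setup, not the paper's pipeline) of generating pseudo-labels with a teacher and mixing them into the student's training data; `teacher`, `labeled_set`, and `unlabeled_loader` are hypothetical objects standing in for the EfficientNet-L2 checkpoint, the 1.3M labeled images, and the 130M unlabeled images.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

@torch.no_grad()
def pseudo_label(teacher: torch.nn.Module, unlabeled_loader: DataLoader) -> TensorDataset:
    """Run the (stronger) teacher over unlabeled images and keep its argmax predictions as labels."""
    teacher.eval()
    images, labels = [], []
    for x in unlabeled_loader:          # assumes the loader yields image batches only
        images.append(x)
        labels.append(teacher(x).argmax(dim=-1))
    return TensorDataset(torch.cat(images), torch.cat(labels))

# Hypothetical usage:
# pseudo_set = pseudo_label(teacher, unlabeled_loader)
# student_data = ConcatDataset([labeled_set, pseudo_set])  # train ResNet-RS (the student) on both
```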
2. Experimental Results & Methodology (21/24)
Transfer Learning to Downstream Tasks with ResNet-RS
• Supervised learning with the re-scaled training recipe outperforms SimCLR/SimCLRv2 (self-supervised learning).
• This contradicts the notion that self-supervised learning yields more universal representations than supervised learning.
• Experimental setup for a fair comparison with SimCLR:
  • ResNet-RS: 400 epochs with cosine learning rate decay, data augmentation (RandAugment), label smoothing, dropout, and decreased weight decay, but no stochastic depth or exponential moving average (EMA) of the weights.
  • SimCLR/SimCLRv2: longer training (800 epochs) with cosine learning rate decay, a tailored data augmentation strategy, a tuned temperature parameter in the contrastive loss, and a tuned weight decay.
  • Vanilla ResNet architecture pre-trained on ImageNet.
2. Experimental Results & Methodology (22/24)
Revised 3D ResNet for Video Classification (Kinetics-400)
• Results mirror image classification: large performance gains even without any architecture change.
3. Conclusion
3. Conclusions (24/24)
• Scaling and training strategies alone can deliver performance gains comparable to those obtained from complex architectural changes.
• An empirical study of regularization techniques and training strategies that achieve strong performance (e.g. +3.2% top-1 ImageNet accuracy, +4.0% top-1 Kinetics-400 accuracy) without changing the model architecture.
• A new network scaling strategy:
  • (1) Scale depth when overfitting can occur (scaling width can be preferable otherwise)
  • (2) Scale the image resolution more slowly
• Confirms that FLOPs do not reflect actual training and inference latency.
• From the paper: "Our work suggests that the field has myopically overemphasized architectural innovations at the expense of experimental diligence, and we hope it encourages further scrutiny in maintaining consistent methodology for both proposed innovations and baselines alike."
