How to train your ViT?
Data, Augmentation, and Regularization in Vision Transformers
Andreas Steiner et al., “How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers”
11th July, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
Introduction
• ViT has recently emerged as a competitive alternative to
convolutional neural networks.
• Without the translational equivariance of CNNs, ViT models are
generally found to perform best in settings with large amounts of
training data [ViT] or to require strong AugReg (Augmentation and
Regularization) schemes [DeiT] to avoid overfitting.
• There was no comprehensive study of the trade-offs between model
regularization, data augmentation, training data size and compute
budget in Vision Transformers.
Introduction
• The authors pre-train a large collection of ViT models on datasets of
different sizes, while at the same time performing comparisons across
different amounts of regularization and data augmentation.
• The homogeneity of the performed study constitutes one of the key
contributions of this paper.
• The insights from this study constitute another important
contribution of this paper.
More than 50,000 ViT Models
• https://github.com/google-research/vision_transformer
• https://github.com/rwightman/pytorch-image-models
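As a quick orientation (not from the slides), the released AugReg checkpoints can be loaded through the timm library; the model name below is real, but whether the default weights are the AugReg ones depends on the timm version, so treat this as a sketch.

```python
# Minimal sketch: load a ViT-B/16 checkpoint via timm and run a dummy forward
# pass. Recent timm versions ship AugReg weights for this model name, but the
# exact default weight tag is version-dependent (assumption).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)      # one dummy 224x224 RGB image
with torch.no_grad():
    logits = model(x)
print(logits.shape)                   # torch.Size([1, 1000]) for ImageNet-1k weights
```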
Scope of the Study
• Pre-training models on large datasets once, and re-using their
parameters as initialization or parts of the model as feature
extractors in models trained on a broad variety of other tasks, has
become common practice in computer vision.
• In this setup, there are multiple ways to characterize computational
and sample efficiency.
▪ One approach is to look at the overall computational and sample cost of both
pre-training and fine-tuning. Normally, pre-training cost will dominate overall
costs. This interpretation is valid in specific scenarios, especially when pre-
training needs to be done repeatedly or reproduced for academic/industrial
purposes.
Scope of the Study
▪ However, in the majority of cases the pre-trained model can be downloaded
or, in the worst case, trained once in a while. In these cases, the
budget required for adapting the model may become the main bottleneck.
▪ A more extreme viewpoint is that the training cost is not crucial, and all that
matters is the eventual inference cost of the trained model, i.e. the deployment
cost, which amortizes all other costs.
• Overall, there are three major viewpoints on what is considered to be
the central cost of training a vision model. In this study we touch on
all three of them, but mostly concentrate on the “practitioner’s” and
“deployment” costs.
Experimental Setup
• Datasets and metrics
▪ For pre-training
➢ImageNet-21k – approximately 14M images with about 21,000 categories.
➢ImageNet-1k – a subset of ImageNet-21k consisting of about 1.3M training images and
1000 categories.
➢Images in ImageNet-21k are de-duplicated with respect to the test sets of the downstream
tasks.
➢ImageNet-V2 is used for evaluation purposes.
▪ For transfer learning
➢4 popular computer vision datasets from the VTAB benchmark
• CIFAR-100, Oxford-IIIT Pets (or Pets37 for short), Resisc45, and KITTI-distance
▪ Top-1 classification accuracy is used as the main metric.
Experimental Setup
• Models
▪ 4 different configurations: ViT-Ti, ViT-S, ViT-B, and ViT-L.
▪ Patch-size 16 for all models, and additionally patch-size 32 for the ViT-S and ViT-B.
▪ The hidden layer in the head of ViT is dropped, as empirically it does not lead to
more accurate models and often results in optimization instabilities.
▪ Hybrid models that first process images with a ResNet and then feed the spatial
output to a ViT as the initial patch embeddings are used.
▪ These hybrids are denoted Rn+{Ti,S,L}/p, where n counts the number of convolutions
and p denotes the patch-size in the input image.
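For reference, the Ti/S/B/L variants can be summarized in a small config table. The values below follow the original ViT/DeiT papers, not these slides, and are listed only as a reminder.

```python
# Reference sizes of the ViT variants named above (hidden width, depth,
# MLP dimension, attention heads). Values follow the original ViT/DeiT papers
# and are a reminder, not the authors' exact configuration files.
VIT_CONFIGS = {
    "ViT-Ti": dict(width=192,  depth=12, mlp_dim=768,  heads=3),
    "ViT-S":  dict(width=384,  depth=12, mlp_dim=1536, heads=6),
    "ViT-B":  dict(width=768,  depth=12, mlp_dim=3072, heads=12),
    "ViT-L":  dict(width=1024, depth=24, mlp_dim=4096, heads=16),
}
```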
Experimental Setup
• Regularization and data augmentations
▪ Dropout on intermediate activations of ViT and the stochastic depth
regularization technique are applied.
▪ For data augmentation, Mixup and RandAugment are applied. 𝛼 is the Mixup
parameter, and 𝑙, 𝑚 are the number of augmentation layers and the magnitude,
respectively, in RandAugment.
▪ Weight decay is used too.
▪ The sweep contains 28 configurations, the cross-product of the following (see the
sketch after this list).
➢No dropout/no stochastic depth, or dropout with prob. 0.1 and stochastic depth with
maximal layer dropping prob. 0.1
➢7 data augmentation setups for (𝑙, 𝑚, 𝛼): none (0,0,0), light1 (2,0,0), light2 (2,10,0.2),
medium1 (2,15,0.2), medium2 (2,15,0.5), strong1 (2,20,0.5), strong2 (2,20,0.8)
➢Weight decay: 0.1 or 0.03
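A minimal sketch of how this 28-configuration cross-product can be enumerated. Names mirror the slide; this is illustrative, not the authors' actual sweep definition.

```python
# Enumerate the AugReg sweep: {no reg, dropout 0.1 + stochastic depth 0.1}
# x 7 RandAugment/Mixup settings x {weight decay 0.1, 0.03} = 28 configs.
import itertools

REG = [
    {"dropout": 0.0, "stochastic_depth": 0.0},
    {"dropout": 0.1, "stochastic_depth": 0.1},
]
# (l, m, alpha) = (RandAugment layers, RandAugment magnitude, Mixup alpha)
AUG = {
    "none":    (0, 0,  0.0),
    "light1":  (2, 0,  0.0),
    "light2":  (2, 10, 0.2),
    "medium1": (2, 15, 0.2),
    "medium2": (2, 15, 0.5),
    "strong1": (2, 20, 0.5),
    "strong2": (2, 20, 0.8),
}
WEIGHT_DECAY = [0.1, 0.03]

sweep = [
    {**reg, "aug": name, "randaug_layers": l, "randaug_magnitude": m,
     "mixup_alpha": a, "weight_decay": wd}
    for reg, (name, (l, m, a)), wd in itertools.product(REG, AUG.items(), WEIGHT_DECAY)
]
print(len(sweep))  # 28
```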
Experimental Setup
• Pre-training
▪ Models were pre-trained with Adam, with a batch size of 4096 and a cosine
learning rate schedule with a linear warmup.
▪ To stabilize training, gradients were clipped at global norm 1.
▪ The images are pre-processed by Inception-style cropping and random
horizontal flipping.
▪ ImageNet-1k was trained for 300 epochs, and ImageNet-21k was trained
for 30 and 300 epochs. This allows examining the effects of the
increased dataset size while keeping the total compute used for
pre-training roughly constant.
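The learning-rate schedule can be sketched as a simple function of the step index. The warmup length, base learning rate, and end learning rate below are placeholders, not values taken from the paper.

```python
# Cosine learning-rate decay with linear warmup, as used for pre-training.
# Hyperparameter values here are placeholders (assumption).
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-3,
               warmup_steps: int = 10_000, end_lr: float = 1e-5) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to end_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

# Inspect a few points of the schedule.
for s in (0, 5_000, 10_000, 50_000, 100_000):
    print(s, round(lr_at_step(s, total_steps=100_000), 6))
```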
Experimental Setup
• Fine-tuning
▪ Models were fine-tuned with SGD with a momentum of 0.9, sweeping over
2-3 learning rates and 1-2 training durations per dataset.
▪ A fixed batch size of 512 was used, gradients were clipped at global norm 1
and a cosine learning rate schedule with linear warmup was also used.
▪ Fine-tuning was done both at the original resolution (224), as well as at a
higher resolution (384).
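A hedged sketch of one fine-tuning step with these settings in PyTorch; the learning rate and loss are placeholders, since the paper sweeps 2-3 learning rates per dataset.

```python
# One fine-tuning step: SGD with momentum 0.9 and gradient clipping at
# global norm 1, as described above. Model, data, and LR are placeholders.
import torch

def finetune_step(model, images, labels, optimizer, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at global norm 1
    optimizer.step()
    return loss.item()

# Example wiring (hypothetical values):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss_fn = torch.nn.CrossEntropyLoss()
```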
Findings - Scaling datasets with AugReg and compute
• Best models trained on AugReg ImageNet-1k perform about as well as the
same models pre-trained on the 10x larger plain ImageNet-21k dataset.
Similarly, best models trained on AugReg ImageNet-21k, when compute is
also increased, match or outperform those trained on the plain
JFT-300M dataset with 25x more images.
Findings - Scaling datasets with AugReg and compute
• It is possible to match these private results with a publicly available
dataset, and it is imaginable that training longer and with AugReg on
JFT-300M might further increase performance.
• These results cannot hold for arbitrarily small datasets. Training a
ResNet50 on only 10% of ImageNet-1k with heavy data augmentation
improves results, but does not recover the performance of training on the full dataset.
Table 5. from “Unsupervised Data Augmentation for Consistency Training”
Findings – Transfer is the better option
• For most practical purposes, transferring a pre-trained model is both
more cost-efficient and leads to better results.
• The most striking finding is that, no matter how much training time is
spent, for the tiny Pet37 dataset, it does not seem possible to train
ViT models from scratch to reach accuracy anywhere near that of
transferred models.
Findings – Transfer is the better option
• For the larger Resisc45 dataset, this result still holds, although spending
two orders of magnitude more compute and performing a heavy search
may come close to (but not reach) the accuracy of pre-trained models.
• Notably, this does not account for the exploration cost which is difficult
to quantify.
Findings – More data yields more generic models
• Interestingly, the model pre-trained on ImageNet-21k (30 ep) is
significantly better than the ImageNet-1k (300 ep) one, across all
three VTAB categories.
• As the compute budget keeps growing, we observe consistent
improvements on the ImageNet-21k dataset with the 10x longer schedule.
• Overall, we conclude that more data yields more generic models; the
trend holds across very diverse tasks.
Findings – Prefer augmentation to regularization
• The authors aim to discover general patterns for data augmentation and
regularization that can be used as rules of thumb when applying Vision
Transformers to a new task.
• The colour of a cell encodes its improvement or deterioration in score
when compared to the unregularized, unaugmented setting.
Findings – Prefer augmentation to regularization
• The first observation is that, for the mid-sized
ImageNet-1k dataset, any kind of AugReg helps.
• However, when using the 10x larger ImageNet-21k dataset and
keeping compute fixed, i.e. running for 30 epochs, any kind of AugReg
hurts performance for all but the largest models.
Findings – Prefer augmentation to regularization
• It is only when also increasing the computation budget to 300 epochs
that AugReg helps more models, although even then, it continues
hurting the smaller ones.
• Generally speaking, there are significantly more cases where adding
augmentation helps than where adding regularization helps.
Findings – Prefer augmentation to regularization
• The figure below shows that when using ImageNet-21k, regularization
hurts almost across the board.
Findings – Choosing which pre-trained model to transfer
• When pre-training ViT models, various regularization and data
augmentation settings result in models with drastically different
performance.
• Then, from the practitioner’s point of view, a natural question
emerges: how should one select a model for further adaptation for an end
application?
• One way is to run the adaptation for all available pre-trained models and then select
the best-performing one, based on the validation score on the
downstream task of interest. This can be quite expensive in practice.
• Alternatively, one can select a single pre-trained model based on the
upstream validation accuracy and then only use this model for
adaptation, which is much cheaper.
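The cheaper strategy amounts to a simple argmax over upstream validation accuracy; a toy sketch with made-up checkpoint names and scores follows.

```python
# Pick the single checkpoint with the best upstream (ImageNet) validation
# accuracy and fine-tune only that one. Names and numbers are made up.
checkpoints = {
    "vit_b16_augreg_medium1": {"upstream_val_acc": 0.842},
    "vit_b16_augreg_light2":  {"upstream_val_acc": 0.838},
    "vit_b16_no_augreg":      {"upstream_val_acc": 0.815},
}
best = max(checkpoints, key=lambda name: checkpoints[name]["upstream_val_acc"])
print("fine-tune only:", best)
```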
Findings – Choosing which pre-trained model to transfer
• The figure below shows the performance difference between the cheaper
strategy and the more expensive strategy.
• The results are mixed, but generally show that the cheaper strategy works
about as well as the more expensive strategy in the majority of scenarios.
• Selecting a single pre-trained model based on the upstream score is a cost-
effective practical strategy.
Findings – Choosing which pre-trained model to transfer
• For every architecture and upstream dataset, the best model is selected
by upstream validation accuracy.
• Bold numbers indicate results that are on par with or surpass the
published JFT-300M results without AugReg for the same models.
Findings – Prefer increasing patch-size to shrinking model-size
• Models containing the “Tiny” variants perform significantly worse than
the similarly fast larger models with “/32” patch-size.
• For a given resolution, the patch-size determines the number of tokens
on which self-attention is performed and, thus, is a contributor to model
capacity that is not reflected by parameter count.
• Parameter count is reflective neither of speed nor of capacity.
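The arithmetic behind this point: at a fixed 224x224 resolution, a /16 model attends over 196 patch tokens while a /32 model sees only 49, even when parameter counts are similar.

```python
# Number of patch tokens (excluding the class token) at a given resolution.
def num_tokens(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

print(num_tokens(224, 16))  # 196 tokens for /16 models
print(num_tokens(224, 32))  # 49 tokens for /32 models
```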
Conclusion
• This paper conducts the first systematic, large-scale study of the
interplay between regularization, data augmentation, model size, and
training data size when pre-training ViTs.
• These experiments yield a number of surprising insights around the
impact of various techniques and the situations when augmentation
and regularization are beneficial and when not.
• Across a wide range of datasets, even if the downstream data of
interest appears to only be weakly related to the data used for pre-
training, transfer learning remains the best available option.
• For transfer learning, among similarly performing pre-trained models,
one trained on more data should likely be preferred over one trained
with more data augmentation.
Thank you