How to train your ViT?
Data, Augmentation, and Regularization in Vision Transformers
Andreas Steiner et al., “How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers”
11th July, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
Introduction
• ViT has recently emerged as a competitive alternative to
convolutional neural networks.
• Without the translational equivariance of CNNs, ViT models are
generally found to perform best in settings with large amounts of
training data [ViT] or to require strong AugReg (Augmentation and
Regularization) schemes [DeiT] to avoid overfitting.
• There was no comprehensive study of the trade-offs between model
regularization, data augmentation, training data size and compute
budget in Vision Transformers.
Introduction
• The authors pre-train a large collection of ViT models on datasets of
different sizes, while at the same time performing comparisons across
different amounts of regularization and data augmentation.
• The homogeneity of the performed study constitutes one of the key
contributions of this paper.
• The insights from this study constitute another important
contribution of this paper.
More than 50,000 ViT Models
• https://github.com/google-research/vision_transformer
• https://github.com/rwightman/pytorch-image-models
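As a quick orientation (not from the slides), the released AugReg checkpoints can be loaded through the timm library; the model name below is real, but whether the default weights are the AugReg ones depends on the timm version, so treat this as a sketch.

```python
# Minimal sketch: load a ViT-B/16 checkpoint via timm and run a dummy forward
# pass. Recent timm versions ship AugReg weights for this model name, but the
# exact default weight tag is version-dependent (assumption).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)      # one dummy 224x224 RGB image
with torch.no_grad():
    logits = model(x)
print(logits.shape)                   # torch.Size([1, 1000]) for ImageNet-1k weights
```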
Scope of the Study
• Pre-training models on large datasets once, and re-using their
parameters as initialization or parts of the model as feature
extractors in models trained on a broad variety of other tasks, has
become common practice in computer vision.
• In this setup, there are multiple ways to characterize computational
and sample efficiency.
▪ One approach is to look at the overall computational and sample cost of both
pre-training and fine-tuning. Normally, pre-training cost will dominate overall
costs. This interpretation is valid in specific scenarios, especially when pre-
training needs to be done repeatedly or reproduced for academic/industrial
purposes.
Scope of the Study
▪ However, in the majority of cases the pre-trained model can be downloaded
or, in the worst case, trained once in a while. In these cases, the
budget required for adapting the model may become the main bottleneck.
▪ A more extreme viewpoint is that the training cost is not crucial, and all that
matters is the eventual inference cost of the trained model, i.e. the deployment
cost, which amortizes all other costs.
• Overall, there are three major viewpoints on what is considered to be
the central cost of training a vision model. In this study we touch on
all three of them, but mostly concentrate on the “practitioner’s” and
“deployment” costs.
Experimental Setup
• Datasets and metrics
▪ For pre-training
➢ImageNet-21k – approximately 14M images with about 21,000 categories.
➢ImageNet-1k – a subset of ImageNet-21k consisting of about 1.3M training images and
1000 categories.
➢Images in ImageNet-21k are de-duplicated with respect to the test sets of the downstream
tasks.
➢ImageNet-V2 is used for evaluation purposes.
▪ For transfer learning
➢4 popular computer vision datasets from the VTAB benchmark
• CIFAR-100, Oxford-IIIT Pets (or Pets37 for short), Resisc45, and KITTI-distance
▪ Top-1 classification accuracy is used as the main metric.
Experimental Setup
• Models
▪ 4 different configurations: ViT-Ti, ViT-S, ViT-B, and ViT-L.
▪ Patch-size 16 for all models, and additionally patch-size 32 for the ViT-S and ViT-B.
▪ The hidden layer in the head of ViT is dropped, as empirically it does not lead to
more accurate models and often results in optimization instabilities.
▪ Hybrid models that first process images with a ResNet and then feed the spatial
output to a ViT as the initial patch embeddings are used.
▪ These hybrids are denoted Rn+{Ti,S,L}/p, where n counts the number of convolutions
and p denotes the patch-size in the input image.
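For reference, the Ti/S/B/L variants can be summarized in a small config table. The values below follow the original ViT/DeiT papers, not these slides, and are listed only as a reminder.

```python
# Reference sizes of the ViT variants named above (hidden width, depth,
# MLP dimension, attention heads). Values follow the original ViT/DeiT papers
# and are a reminder, not the authors' exact configuration files.
VIT_CONFIGS = {
    "ViT-Ti": dict(width=192,  depth=12, mlp_dim=768,  heads=3),
    "ViT-S":  dict(width=384,  depth=12, mlp_dim=1536, heads=6),
    "ViT-B":  dict(width=768,  depth=12, mlp_dim=3072, heads=12),
    "ViT-L":  dict(width=1024, depth=24, mlp_dim=4096, heads=16),
}
```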
Experimental Setup
• Regularization and data augmentations
▪ Dropout on intermediate activations of ViT and the stochastic depth
regularization technique are applied.
▪ For data augmentation, Mixup and RandAugment are applied. 𝛼 is the Mixup
parameter, and 𝑙, 𝑚 are the number of augmentation layers and the magnitude,
respectively, in RandAugment.
▪ Weight decay is used too.
▪ The sweep contains 28 configurations, the cross-product of the following (see the
sketch after this list).
➢No dropout/no stochastic depth, or dropout with prob. 0.1 and stochastic depth with
maximal layer dropping prob. 0.1
➢7 data augmentation setups for (𝑙, 𝑚, 𝛼): none (0,0,0), light1 (2,0,0), light2 (2,10,0.2),
medium1 (2,15,0.2), medium2 (2,15,0.5), strong1 (2,20,0.5), strong2 (2,20,0.8)
➢Weight decay: 0.1 or 0.03
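A minimal sketch of how this 28-configuration cross-product can be enumerated. Names mirror the slide; this is illustrative, not the authors' actual sweep definition.

```python
# Enumerate the AugReg sweep: {no reg, dropout 0.1 + stochastic depth 0.1}
# x 7 RandAugment/Mixup settings x {weight decay 0.1, 0.03} = 28 configs.
import itertools

REG = [
    {"dropout": 0.0, "stochastic_depth": 0.0},
    {"dropout": 0.1, "stochastic_depth": 0.1},
]
# (l, m, alpha) = (RandAugment layers, RandAugment magnitude, Mixup alpha)
AUG = {
    "none":    (0, 0,  0.0),
    "light1":  (2, 0,  0.0),
    "light2":  (2, 10, 0.2),
    "medium1": (2, 15, 0.2),
    "medium2": (2, 15, 0.5),
    "strong1": (2, 20, 0.5),
    "strong2": (2, 20, 0.8),
}
WEIGHT_DECAY = [0.1, 0.03]

sweep = [
    {**reg, "aug": name, "randaug_layers": l, "randaug_magnitude": m,
     "mixup_alpha": a, "weight_decay": wd}
    for reg, (name, (l, m, a)), wd in itertools.product(REG, AUG.items(), WEIGHT_DECAY)
]
print(len(sweep))  # 28
```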
Experimental Setup
• Pre-training
▪ Models were pre-trained with Adam, with a batch size of 4096 and a cosine
learning rate schedule with a linear warmup.
▪ To stabilize training, gradients were clipped at global norm 1.
▪ The images are pre-processed by Inception-style cropping and random
horizontal flipping.
▪ ImageNet-1k was trained for 300 epochs, and ImageNet-21k was trained
for 30 and 300 epochs. This allows examining the effects of the
increased dataset size while keeping the total compute used for
pre-training roughly constant.
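The learning-rate schedule can be sketched as a simple function of the step index. The warmup length, base learning rate, and end learning rate below are placeholders, not values taken from the paper.

```python
# Cosine learning-rate decay with linear warmup, as used for pre-training.
# Hyperparameter values here are placeholders (assumption).
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-3,
               warmup_steps: int = 10_000, end_lr: float = 1e-5) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to end_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

# Inspect a few points of the schedule.
for s in (0, 5_000, 10_000, 50_000, 100_000):
    print(s, round(lr_at_step(s, total_steps=100_000), 6))
```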
Experimental Setup
• Fine-tuning
▪ Models were fine-tuned with SGD with a momentum of 0.9, sweeping over
2-3 learning rates and 1-2 training durations per dataset.
▪ A fixed batch size of 512 was used, gradients were clipped at global norm 1
and a cosine learning rate schedule with linear warmup was also used.
▪ Fine-tuning was done both at the original resolution (224), as well as at a
higher resolution (384).
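A hedged sketch of one fine-tuning step with these settings in PyTorch; the learning rate and loss are placeholders, since the paper sweeps 2-3 learning rates per dataset.

```python
# One fine-tuning step: SGD with momentum 0.9 and gradient clipping at
# global norm 1, as described above. Model, data, and LR are placeholders.
import torch

def finetune_step(model, images, labels, optimizer, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at global norm 1
    optimizer.step()
    return loss.item()

# Example wiring (hypothetical values):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss_fn = torch.nn.CrossEntropyLoss()
```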
Findings - Scaling datasets with AugReg and compute
• Best models trained on AugReg ImageNet-1k perform about as well as the
same models pre-trained on the 10x larger plain ImageNet-21k dataset.
Similarly, best models trained on AugReg ImageNet-21k, when compute is
also increased, match or outperform those trained on the plain
JFT-300M dataset with 25x more images.
Findings - Scaling datasets with AugReg and compute
• It is possible to match these private results with a publicly available
dataset, and it is imaginable that training longer and with AugReg on
JFT-300M might further increase performance.
• These results cannot hold for arbitrarily small datasets. Training a
ResNet50 on only 10% of ImageNet-1k with heavy data augmentation
improves results, but does not recover the performance of training on the full dataset.
Table 5. from “Unsupervised Data Augmentation for Consistency Training”
Findings – Transfer is the better option
• For most practical purposes, transferring a pre-trained model is both
more cost-efficient and leads to better results.
• The most striking finding is that, no matter how much training time is
spent, for the tiny Pet37 dataset, it does not seem possible to train
ViT models from scratch to reach accuracy anywhere near that of
transferred models.
Findings – Transfer is the better option
• For the larger Resisc45 dataset, this result still holds, although spending
two orders of magnitude more compute and performing a heavy search
may come close to (but not reach) the accuracy of pre-trained models.
• Notably, this does not account for the exploration cost which is difficult
to quantify.
Findings – More data yields more generic models
• Interestingly, the model pre-trained on ImageNet-21k (30 ep) is
significantly better than the ImageNet-1k (300 ep) one, across all
three VTAB categories.
• As the compute budget keeps growing, we observe consistent
improvements on the ImageNet-21k dataset with the 10x longer schedule.
• Overall, we conclude that more data yields more generic models; the
trend holds across very diverse tasks.
Findings – Prefer augmentation to regularization
• The authors aim to discover general patterns for data augmentation and
regularization that can be used as rules of thumb when applying Vision
Transformers to a new task.
• The colour of a cell encodes its improvement or deterioration in score
when compared to the unregularized, unaugmented setting.
Findings – Prefer augmentation to regularization
• The first observation is that, for the mid-sized
ImageNet-1k dataset, any kind of AugReg helps.
• However, when using the 10x larger ImageNet-21k dataset and
keeping compute fixed, i.e. running for 30 epochs, any kind of AugReg
hurts performance for all but the largest models.
Findings – Prefer augmentation to regularization
• It is only when also increasing the computation budget to 300 epochs
that AugReg helps more models, although even then, it continues
hurting the smaller ones.
• Generally speaking, there are significantly more cases where adding
augmentation helps than where adding regularization helps.
Findings – Prefer augmentation to regularization
• The figure below shows that when using ImageNet-21k, regularization
hurts almost across the board.
Findings – Choosing which pre-trained model to transfer
• When pre-training ViT models, various regularization and data
augmentation settings result in models with drastically different
performance.
• Then, from the practitioner’s point of view, a natural question
emerges: how should one select a model for further adaptation for an end
application?
• One way is to run the adaptation for all available pre-trained models and then select
the best-performing one, based on the validation score on the
downstream task of interest. This can be quite expensive in practice.
• Alternatively, one can select a single pre-trained model based on the
upstream validation accuracy and then only use this model for
adaptation, which is much cheaper.
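The cheaper strategy amounts to a simple argmax over upstream validation accuracy; a toy sketch with made-up checkpoint names and scores follows.

```python
# Pick the single checkpoint with the best upstream (ImageNet) validation
# accuracy and fine-tune only that one. Names and numbers are made up.
checkpoints = {
    "vit_b16_augreg_medium1": {"upstream_val_acc": 0.842},
    "vit_b16_augreg_light2":  {"upstream_val_acc": 0.838},
    "vit_b16_no_augreg":      {"upstream_val_acc": 0.815},
}
best = max(checkpoints, key=lambda name: checkpoints[name]["upstream_val_acc"])
print("fine-tune only:", best)
```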
Findings – Choosing which pre-trained model to transfer
• The figure below shows the performance difference between the cheaper
strategy and the more expensive strategy.
• The results are mixed, but generally show that the cheaper strategy works
about as well as the more expensive strategy in the majority of scenarios.
• Selecting a single pre-trained model based on the upstream score is a cost-
effective practical strategy.
Findings – Choosing which pre-trained model to transfer
• For every architecture and upstream dataset, the best model is selected
by upstream validation accuracy.
• Bold numbers indicate results that are on par with or surpass the
published JFT-300M results without AugReg for the same models.
Findings – Prefer increasing patch-size to shrinking model-size
• Models containing the “Tiny” variants perform significantly worse than
the similarly fast larger models with “/32” patch-size.
• For a given resolution, the patch-size determines the number of tokens
on which self-attention is performed and, thus, is a contributor to model
capacity that is not reflected by parameter count.
• Parameter count is reflective neither of speed nor of capacity.
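The arithmetic behind this point: at a fixed 224x224 resolution, a /16 model attends over 196 patch tokens while a /32 model sees only 49, even when parameter counts are similar.

```python
# Number of patch tokens (excluding the class token) at a given resolution.
def num_tokens(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

print(num_tokens(224, 16))  # 196 tokens for /16 models
print(num_tokens(224, 32))  # 49 tokens for /32 models
```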
Conclusion
• This paper conducts the first systematic, large-scale study of the
interplay between regularization, data augmentation, model size, and
training data size when pre-training ViTs.
• These experiments yield a number of surprising insights around the
impact of various techniques and the situations when augmentation
and regularization are beneficial and when not.
• Across a wide range of datasets, even if the downstream data of
interest appears to only be weakly related to the data used for pre-
training, transfer learning remains the best available option.
• For transfer learning, among similarly performing pre-trained models,
one trained on more data should likely be preferred over one trained
with more data augmentation.
Thank you