Online vs Offline Data Augmentation for Deep Learning
Nobody has so far given me a mathematical answer to why we do online augmentation rather than offline augmentation for our data transformations, so I decided to write one. This article assumes you know deep learning and data augmentation.
Introduction
Data Augmentation is very important in Deep Learning for two reasons:
Essentially, if the training features X follow a distribution F, then data augmentation changes it to F’, where F’ is a slight perturbation of F. For example, in images we can rotate, zoom in, and crop; these transformations make the model invariant to rotation, zooming, and cropping, which helps it generalize to future unseen test cases.
Essentially, we rely on the notion that a larger training set in deep learning is better for optimization and helps avoid overfitting, so we feed the model more training data to learn from.
However, these two points are intricately related to each other. You can say the main goal is to increase the generalizability of the model.
This is also a gentle reminder that data augmentation is a kind of regularizer: it makes sure the model doesn’t learn that an upright cat is a cat but a rotated cat is not. There have been many discussions asking what the model actually learns when it sees multiple variations of the same sample. The answer is that the main features stay the same across the variations, for example the eyes and the whiskers in both the original and the rotated cat images. Equivalently, in the loss function the output label of the rotated cat image is the same as that of the original cat image. You can read some discussions here - Link 1, Link 2, Link 3, and more. (Search for data augmentation in PyTorch and read mostly the Stack Exchange, Reddit, and PyTorch community discussions.)
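To make this concrete, here is a minimal sketch, assuming a hypothetical PyTorch classifier model, an image tensor cat_image, and an integer class index cat_label (none of these names come from the article): the rotated view of the image is scored against the same label as the original.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def augmented_loss(model, cat_image, cat_label):
    # Original view of the image.
    logits_orig = model(cat_image.unsqueeze(0))
    # Rotated view: the class label does not change under rotation.
    rotated = TF.rotate(cat_image, angle=30.0)
    logits_rot = model(rotated.unsqueeze(0))
    target = torch.tensor([cat_label])
    # Both views are penalized against the *same* label in the loss.
    return F.cross_entropy(logits_orig, target) + F.cross_entropy(logits_rot, target)
```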
Offline vs Online Augmentation
Offline and online data augmentation refer to two distinct approaches to how data transformations are applied during the training process.
Offline Data Augmentation: In offline data augmentation, transformations are applied to the entire dataset before training begins. This means that multiple variations of each original data sample are generated and stored in memory or on disk. During training, these augmented versions are then fed into the model as if they were distinct data samples. For example, if you have an original image of a cat, offline augmentation might create multiple rotated, flipped, and color-adjusted versions of that image beforehand. These augmented images are then used directly during the training process without further modification. Offline augmentation is computationally intensive during preprocessing, as it requires generating and storing all variations of the data in advance. However, once prepared, training can proceed more efficiently since the augmented data is readily available.
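As a rough sketch of what offline augmentation looks like in code, every augmented copy is generated and written to disk before training starts. The folder names data/raw and data/augmented and the value k = 4 are illustrative assumptions, not anything prescribed by the article.

```python
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Random transforms used to create the stored variants.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def augment_offline(raw_dir="data/raw", aug_dir="data/augmented", k=4):
    out = Path(aug_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(raw_dir).glob("*.jpg"):
        image = Image.open(path).convert("RGB")
        for i in range(k):
            # Each variant is written to disk up front and later treated
            # as an independent training sample.
            augment(image).save(out / f"{path.stem}_aug{i}.jpg")
```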
Online Data Augmentation: Conversely, online data augmentation applies transformations to the data on-the-fly, during the training process. This means that each time a data sample is accessed during training, it is randomly transformed before being passed through the model. For instance, when an image of a cat is retrieved during training, online augmentation might randomly rotate it, flip it horizontally, or adjust its colors before feeding it into the model. These transformations are applied dynamically, ensuring that the model encounters different variations of each data sample across different epochs or batches. Online augmentation is more computationally efficient during preprocessing, as it doesn't require storing multiple copies of the data. However, it introduces a slight overhead during training because transformations must be applied in real time before each data sample is processed.
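A comparable online-augmentation sketch, assuming the images live in a data/raw folder with one subfolder per class (the usual ImageFolder layout, again my assumption): the random transforms are re-drawn every time a sample is loaded, so no extra copies are stored.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Every epoch re-draws the random transforms, so the model sees a
# different variant of each image without any extra copies on disk.
train_set = datasets.ImageFolder("data/raw", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```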
In summary, while offline augmentation preprocesses and stores augmented data before training begins, online augmentation applies transformations dynamically during the training process. Each approach has its trade-offs in terms of computational efficiency and flexibility in handling data variations during model training.
Why are they the same, and how do we make it work?
I had this question for a long time, but nobody explained it to me properly; the explanations were all hand-wavy. Even after going through the top articles on Google Search, I still had no real understanding. I would suggest you read the following carefully.
The main questions are: why should online and offline augmentation give the same result in expectation, and what change do we need to make to the training setup to actually make them equivalent?
Now I will briefly explain why they are the same in expectation, in a stochastic sense. They are not the same if the experimental setup is kept identical; a small change has to be made. But before I continue, let me remind you how optimization is done in deep learning.
Deep Learning Process of Optimization
Let’s say you have the following parameters: a training set of size T = 1024, a batch size B = 32, E = 256 epochs, and (for the augmentation later) a factor k = 4.
Steps in Mini-Batch Optimization: in every epoch, the training set is shuffled and split into T/B mini-batches of size B; for each mini-batch, we run a forward pass, compute the loss, backpropagate, and update the parameters once. One epoch therefore produces T/B gradient updates, as sketched below.
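A minimal sketch of such a training loop, assuming a hypothetical model and train_loader; this is standard PyTorch rather than anything specific to this article.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, epochs, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):                 # E epochs
        for inputs, targets in train_loader:    # T / B mini-batches per epoch
            optimizer.zero_grad()               # 1. reset gradients
            loss = F.cross_entropy(model(inputs), targets)  # 2. forward pass + loss
            loss.backward()                     # 3. backpropagate
            optimizer.step()                    # 4. one gradient update
```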
Putting all the steps together, we have in total T/B * E = 1024/32 * 256 = 2^13 = 8192 steps where gradients are updated. Now suppose you want to increase the size of the training dataset through offline augmentation by a factor of k = 4 = 2^2.
You will therefore need kT/B * E = 4096/32 * 256 = 2^15 = 32768 steps with offline data augmentation, plus the storage for the augmented copies. How do we achieve the same thing with online data augmentation?
In online data augmentation, the training set of size T changes on the fly while its distribution stays the same. That is the good part, but to achieve the same result we need to make one important change to the training setup: multiply the number of epochs by k so that the number of gradient updates stays the same.
We will therefore get T/B * kE = 1024/32 * 1024 = 2^15 = 32768 gradient updates.
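As a quick sanity check of these counts, here is the same arithmetic in a few lines of Python (the variable names are mine):

```python
# T = training set size, B = batch size, E = epochs, k = augmentation factor.
T, B, E, k = 1024, 32, 256, 4

baseline = (T // B) * E          # 32 * 256  = 8192  gradient updates
offline  = (k * T // B) * E      # 128 * 256 = 32768 updates, k*T samples stored
online   = (T // B) * (k * E)    # 32 * 1024 = 32768 updates, no extra storage

print(baseline, offline, online)  # 8192 32768 32768
```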
Long story short: to get the effect of a k-fold increase of the training dataset that offline augmentation provides, you need to run k times as many epochs with online augmentation, so that both setups end up with the same number of gradient updates and hence the same result.