Online vs Offline Data Augmentation for Deep Learning

No one has been able to give me a mathematical answer for why online augmentation is preferred over offline augmentation for simple data transformations, so I decided to write one. This article assumes you are familiar with deep learning and data augmentation.

Introduction

Data Augmentation is very important in Deep Learning for two reasons:

  • increasing the variability of the dataset for better generalization

Essentially, if the training features X follow a distribution F, then data augmentation changes it to F’, where F’ is a slight perturbation of F. For example, with images we can rotate, zoom, and crop. These transformations make the model invariant to rotation, zooming, and cropping, which helps it generalize to unseen test cases.

  • increasing the training data size for better optimization and to avoid overfitting

Essentially, the notion is that a larger training set in deep learning is better for optimization and helps avoid overfitting, so we feed more training samples to the model.

However, these two points are intricately related to each other. You can say the main goal is to increase the generalizability of the model.

This is also a reminder that data augmentation acts as a regularizer: it ensures the model does not learn that an upright cat is a cat while a rotated cat is not. There have been many discussions asking what the model actually learns when it sees multiple variations of the same sample. The answer is that the main features remain the same across variations; for example, the original and rotated cat images share the same eyes, whiskers, and so on. In terms of the loss function, the output label of the rotated cat is the same as that of the original cat image. You can read some discussions here - Link 1, Link 2, Link 3, and more. (Search "data augmentation in PyTorch" and read mostly the Stack Exchange, Reddit, and PyTorch community discussions.)
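
To make this concrete, here is a minimal PyTorch sketch (the toy model and dummy tensors are placeholders for illustration, not the article's setup) showing that an augmented view is paired with the same label in the loss:

```python
# Minimal sketch (toy model, dummy data): an augmented view keeps the original
# label, so the loss treats a "rotated cat" exactly like a "cat".
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # toy classifier
criterion = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 64, 64)                # dummy "cat" image
label = torch.tensor([1])                       # class 1 = cat

rotated = torch.rot90(image, k=1, dims=(2, 3))  # 90-degree rotated view of the same image

loss_original = criterion(model(image), label)
loss_rotated = criterion(model(rotated), label)  # same label as the original image
```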


Offline vs Online Augmentation

Offline and online data augmentation refer to two distinct approaches to how data transformations are applied during the training process.

Offline Data Augmentation: In offline data augmentation, transformations are applied to the entire dataset before training begins. This means that multiple variations of each original data sample are generated and stored in memory or on disk. During training, these augmented versions are then fed into the model as if they were distinct data samples. For example, if you have an original image of a cat, offline augmentation might create multiple rotated, flipped, and color-adjusted versions of that image beforehand. These augmented images are then used directly during the training process without further modification. Offline augmentation is computationally intensive during preprocessing, as it requires generating and storing all variations of the data in advance. However, once prepared, training can proceed more efficiently since the augmented data is readily available.
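
As a rough illustration, here is a minimal sketch of offline augmentation with torchvision; the directory names and the value k = 4 are assumptions for illustration, not part of any specific pipeline.

```python
# Minimal sketch of offline augmentation: each original image is expanded into
# k augmented copies saved to disk before training (paths and k are illustrative).
import os
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(30),        # random rotation up to 30 degrees
    transforms.RandomHorizontalFlip(),    # random horizontal flip
    transforms.ColorJitter(0.2, 0.2),     # brightness/contrast jitter
])

k = 4  # each original image yields k stored variants
src_dir, dst_dir = "data/original", "data/augmented"
os.makedirs(dst_dir, exist_ok=True)

for fname in os.listdir(src_dir):
    img = Image.open(os.path.join(src_dir, fname)).convert("RGB")
    for i in range(k):
        augmented = augment(img)  # PIL image in, PIL image out
        augmented.save(os.path.join(dst_dir, f"{i}_{fname}"))
```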

Online Data Augmentation: Conversely, online data augmentation applies transformations to the data on-the-fly, during the training process. This means that each time a data sample is accessed during training, it is randomly transformed before being passed through the model. For instance, when an image of a cat is retrieved during training, online augmentation might randomly rotate it, flip it horizontally, or adjust its colors before feeding it into the model. These transformations are applied dynamically, ensuring that the model encounters different variations of each data sample across different epochs or batches. Online augmentation is more computationally efficient during preprocessing, as it doesn't require storing multiple copies of the data. However, it introduces a slight overhead during training because transformations must be applied in real time before each data sample is processed.
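
For contrast, a minimal sketch of online augmentation with torchvision and a DataLoader; the dataset path and transform choices are assumptions for illustration (ImageFolder expects one subfolder per class).

```python
# Minimal sketch of online augmentation: random transforms run each time a
# sample is drawn, so every epoch sees a freshly perturbed version of each image.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2),
    transforms.ToTensor(),
])

# ImageFolder applies the transform lazily in __getitem__.
train_set = datasets.ImageFolder("data/original", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

for images, labels in train_loader:  # one pass = one epoch over the (transformed) data
    pass  # forward / backward / optimizer step goes here
```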

In summary, while offline augmentation preprocesses and stores augmented data before training begins, online augmentation applies transformations dynamically during the training process. Each approach has its trade-offs in terms of computational efficiency and flexibility in handling data variations during model training.


Why are they the same, and how to make it work?

I had this question for a long time, but nobody explained it to me properly; the explanations I found were hand-wavy. Even after reading the top articles from a Google search, I still did not understand it. I suggest you read the following list carefully.

The main questions are:

  • In online augmentation, the dataset size is not increasing, so how is the model learning from more data?
  • Also, in online augmentation, the data is transformed in every epoch, so how does the model learn from it?
  • How is offline augmentation theoretically similar to online augmentation?


Now I will briefly explain why they are the same in expectation, in a stochastic sense. They are not the same if the experimental setup is kept identical; a small change has to be made. But before I continue, I should remind you of how optimization is done in deep learning.

Deep Learning Process of Optimization

Let’s say you have the following parameters:

  • Training Data: T = 2^10 = 1024
  • Model M
  • Batch Size: B = 2^5 = 32
  • Number of Epochs: E = 2^8 = 256
  • Optimization Style: Mini Batch Mode
  • Optimizer: Optim

Steps in Mini-Batch Optimization:

  • Initialization: Initialize the model parameters randomly or using pre-trained weights.
  • Epoch Iteration: Iterate through the entire training dataset for a fixed number of epochs (E = 256 in this case).
  • Mini-Batch Iteration: For each epoch, partition the training data into mini-batches of size B = 32.
  • Forward Pass: For each mini-batch, compute the forward pass through the network to obtain the predictions.
  • Compute Loss: Calculate the loss function that measures the difference between the predicted outputs and the actual targets (labels).
  • Backward Pass (Gradient Calculation): Backpropagate the loss to compute the gradients of the loss with respect to the model parameters.
  • Gradient Update (Parameter Update): Use the optimizer (Optim) to update the model parameters with the computed gradients.
  • Epoch Completion: After all mini-batches are processed within an epoch, repeat the process for the next epoch (a total of E = 256 epochs in this case).


Putting all the steps together, we have in total T/B * E = 1024/32 * 256 = 2^13 = 8192 steps where the gradients are updated. Now, suppose you want to increase the size of the training dataset through offline augmentation by a factor of k = 4 = 2^2.
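
To make the step count concrete, here is a minimal sketch of a mini-batch training loop; the linear model, dummy tensors, and SGD optimizer are placeholders, not the article's setup. The number of optimizer steps it performs is exactly T/B * E.

```python
# Minimal sketch of mini-batch optimization: (T/B) * E gradient updates in total.
# The model, data, and optimizer below are toy placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

T, B, E = 1024, 32, 256
X = torch.randn(T, 10)                  # dummy features
y = torch.randint(0, 2, (T,))           # dummy labels
loader = DataLoader(TensorDataset(X, y), batch_size=B, shuffle=True)

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

steps = 0
for epoch in range(E):                  # E epochs
    for xb, yb in loader:               # T/B mini-batches per epoch
        optim.zero_grad()
        loss = criterion(model(xb), yb) # forward pass + loss
        loss.backward()                 # backward pass (gradients)
        optim.step()                    # parameter update
        steps += 1

print(steps)  # 1024/32 * 256 = 8192 gradient updates
```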

  • Training Data: kT = 2^2 * 2^10 = 2^12 = 4096
  • Model M
  • Batch Size: B = 2^5 = 32
  • Number of Epochs: E = 2^8 = 256
  • Optimization Style: Mini Batch Mode
  • Optimizer: Optim

You will therefore need kT/B * E = 4096/32 * 256 = 2^15 = 32768 steps with offline data augmentation, along with the extra storage. How do we achieve the same thing with online data augmentation?

In online data augmentation, the training set of size T changes on the fly while the distribution stays the same. This is the good part, but to achieve the same result we need to make an important change to the training setup: multiply the number of epochs by k to keep the same number of gradient updates.

  • Training Data: T = 2^10 = 1024
  • Model M
  • Batch Size: B = 2^5 = 32
  • Number of Epochs: kE = 2^2 * 2^8 = 1024
  • Optimization Style: Mini Batch Mode
  • Optimizer: Optim

We will therefore get T/B * kE = 1024/32 * 1024 = 2^15 = 32768 gradient updates.

Long story short: to get the effect of a k-fold increase of the training dataset under offline augmentation, you need to run k times as many epochs with online augmentation to reach the same result.
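
A quick sanity check of the arithmetic, using the article's numbers:

```python
# Gradient-update counts for the three setups discussed above.
T, B, E, k = 1024, 32, 256, 4

baseline = T // B * E          # 8192: original data, original epochs
offline  = (k * T) // B * E    # 32768: k times more data, same epochs
online   = T // B * (k * E)    # 32768: same data, k times more epochs

assert offline == online == 32768
```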
