Deep Double Descent
kevin
Modern Learning Theory
● Bigger models tend to overfit
Modern Learning Theory
● Bigger models tend to overfit
○ Bias-Variance trade-off
○ Weight Regularization
○ Augmentation
○ Dropout
○ BatchNorm
○ Early stop
○ Data-dependent regularization (mixup, etc.)
○ ...
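To make the list above concrete, here is a minimal PyTorch sketch (not from the slides; the architecture and hyperparameters are illustrative, and train_one_epoch / evaluate are hypothetical helpers) combining weight decay, BatchNorm, dropout, and early stopping:

```python
# Minimal sketch (illustrative architecture/hyperparameters) combining several
# of the regularizers listed above; train_one_epoch / evaluate are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512),
    nn.BatchNorm1d(512),   # BatchNorm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # Dropout
    nn.Linear(512, 10),
)

# Weight regularization via an L2 penalty (weight_decay)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Early stopping on validation loss
best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    train_one_epoch(model, optimizer)   # hypothetical helper
    val_loss = evaluate(model)          # hypothetical helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```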
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
Reconciling modern machine learning practice and the bias-variance trade-off
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
● Bigger models can be worse in some regimes
https://mltheory.org/deep.pdf
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
● Bigger models can be worse in some regimes
● Even more data can hurt!
https://mltheory.org/deep.pdf
TL;DR
- Model-wise double descent
- There is a regime where bigger models are worse
- Sample-wise non-monotonicity
- There is a regime where more samples hurt
- Epoch-wise double descent
- There is a regime where training longer reverses overfitting
Generalization in Deep Learning Era
- Networks can fit `anything`, even random noise
- Larger capacity than people previously imagined
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
Generalization in Deep Learning Era
- Over-parameterized networks still perform well (role of implicit regularization)
IN SEARCH OF THE REAL INDUCTIVE BIAS: ON THE ROLE OF IMPLICIT REGULARIZATION IN DEEP LEARNING
Generalization in Deep Learning Era
- Deep networks regularize themselves (have a better loss landscape)
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
Generalization in Deep Learning Era
SENSITIVITY AND GENERALIZATION IN NEURAL NETWORKS: AN EMPIRICAL STUDY
Model-wise double descent
Architecture
- ResNet18, CNN, Transformers
Label Noise
● `Hard` distribution
● Label noise is sampled only once, not per epoch
Model-wise double descent
Label Noise
● `Hard` distribution
● Label noise is sampled only once, not per epoch
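A minimal sketch of this setup, assuming CIFAR-10 via torchvision and an illustrative noise fraction: a subset of labels is corrupted to a random different class once, before training, so the same noisy labels are reused every epoch.

```python
# Sketch: corrupt a fraction of CIFAR-10 labels ONCE, before training, so the
# same noisy labels are seen every epoch (noise fraction is illustrative).
import numpy as np
from torchvision.datasets import CIFAR10

def corrupt_labels_once(dataset, noise_frac=0.15, num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    targets = np.array(dataset.targets)
    idx = rng.choice(len(targets), size=int(noise_frac * len(targets)), replace=False)
    # Replace each selected label with a uniformly random *different* class.
    new = rng.integers(0, num_classes - 1, size=len(idx))
    new[new >= targets[idx]] += 1
    targets[idx] = new
    dataset.targets = targets.tolist()
    return dataset

train_set = corrupt_labels_once(CIFAR10(root="./data", train=True, download=True))
```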
Model-wise double descent
- Model-wise double descent occurs across different architectures, datasets, optimizers, and training procedures (a width-sweep sketch follows below)
- Also in adversarial training
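A hedged sketch of the model-wise sweep behind these plots; make_resnet18k, train, test_error, and the data loaders are hypothetical helpers, not the paper's released code.

```python
# Hedged sketch of a model-wise sweep; make_resnet18k, train, test_error and the
# data loaders are hypothetical helpers, not the paper's released code.
widths = [1, 2, 4, 8, 16, 32, 64]        # width multiplier k
test_errs = {}
for k in widths:
    model = make_resnet18k(k)            # ResNet18-style model with k base channels
    train(model, train_loader)           # identical training procedure for every k
    test_errs[k] = test_error(model, test_loader)
# With label noise, test_errs vs. k is expected to peak near the interpolation
# threshold and descend again for larger k.
```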
Model-wise double descent
Model-wise & Epoch-wise double descent
Epoch-wise double descent
Sufficiently large models can undergo a "double descent" behavior where test error first decreases, then increases near the interpolation threshold, and then decreases again.
Increasing the train time increases the EMC (Effective Model Complexity); thus a sufficiently large model transitions from under- to over-parameterized over the course of training.
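A hedged sketch of how EMC could be estimated from its definition; train_and_report_train_error is a hypothetical helper that trains the procedure from scratch on n samples and returns the final train error.

```python
# Hedged sketch of estimating the Effective Model Complexity (EMC): the largest
# number of samples n on which the training procedure still reaches ~zero train
# error.
def estimate_emc(train_and_report_train_error, candidate_ns, eps=0.01):
    emc = 0
    for n in sorted(candidate_ns):
        if train_and_report_train_error(n) <= eps:  # still (nearly) interpolates n samples
            emc = n
        else:
            break  # assumes train error is non-decreasing in n
    return emc
```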
Epoch-wise double descent
Conventional training is split into two phases:
1. In the first phase, the network learns a function with a small generalization gap
2. In the second phase, the network starts to over-fit the data leading to an increase in test error
Not the complete picture
- In some regimes, the test error decreases again and may reach a lower value at the end of training than at the first minimum (see the sketch below)
Reminiscent of:
- Information bottleneck
- Lottery ticket hypothesis
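A hedged sketch of how the epoch-wise curve is obtained: simply log test error at every epoch of a long run (model, optimizer, num_epochs, and the train/test helpers are hypothetical).

```python
# Hedged sketch of observing the epoch-wise effect: log test error every epoch
# over a long run (model, optimizer, num_epochs and the helpers are hypothetical).
test_errors = []
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)                 # hypothetical helper
    test_errors.append(test_error(model, test_loader))

early_window = test_errors[: max(1, len(test_errors) // 10)]
first_minimum = min(early_window)                     # rough proxy for the first descent
final_error = test_errors[-1]
# Epoch-wise double descent: the curve can rise after first_minimum and then
# descend again, sometimes ending below it.
```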
Epoch-wise double descent
Epoch-wise double descent
(Figure panels: CIFAR-10, CIFAR-100)
Sample-wise non-monotonicity
More data doesn’t improve performance
For both models, more data hurts performance
Sample-wise non-monotonicity
Transformers
- Language-translation task with no added label noise
Two effects combined
- More samples
- Larger models
4.5x more samples hurts performance for the intermediate-size model (see the sketch below)
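A hedged sketch of the sample-wise sweep: a fixed, intermediate-size model is trained on nested subsets of the data and its test error recorded (make_model, train, test_error, train_set, and test_loader are hypothetical).

```python
# Hedged sketch of a sample-wise sweep: a fixed, intermediate-size model trained
# on nested subsets of the data (make_model, train, test_error, train_set, and
# test_loader are hypothetical).
from torch.utils.data import DataLoader, Subset

sizes = [5_000, 10_000, 25_000, 50_000]
errors = {}
for n in sizes:
    loader = DataLoader(Subset(train_set, range(n)), batch_size=128, shuffle=True)
    model = make_model()
    train(model, loader)
    errors[n] = test_error(model, test_loader)
# In the transition regime, errors[n] need not decrease monotonically with n.
```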
Sample-wise non-monotonicity
Conclusion
Take home message :
Models behave unexpectedly in the transition regime
- Training longer reverses overfitting
- Doubling the training epochs is a technique used in some tasks (e.g., object detection)
- Bigger models are worse
- Whether the training set can be fit is an indicator
- Formalized as the Effective Model Complexity (EMC)
- More data hurts
- sticky :(
- Generalization is still the Holy Grail in deep learning
- Remains an open question (both experimentally and theoretically)
- Connecting data complexity with model complexity is still difficult
- NAS in some sense systematically addresses this problem
Know your data & model
- noise level (problem difficulty)
- model capacity (fitting power)
