Deep Double Descent
kevin
Modern Learning Theory
● Bigger models tend to overfit
Modern Learning Theory
● Bigger models tend to overfit
○ Bias-Variance trade-off
○ Weight Regularization
○ Augmentation
○ Dropout
○ BatchNorm
○ Early stop
○ Data-dependent regularization (mixup, etc.)
○ ...
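To make the list above concrete, here is a minimal PyTorch sketch (not from the slides; the architecture and hyperparameters are illustrative, and train_one_epoch / evaluate are hypothetical helpers) combining weight decay, BatchNorm, dropout, and early stopping:

```python
# Minimal sketch (illustrative architecture/hyperparameters) combining several
# of the regularizers listed above; train_one_epoch / evaluate are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512),
    nn.BatchNorm1d(512),   # BatchNorm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # Dropout
    nn.Linear(512, 10),
)

# Weight regularization via an L2 penalty (weight_decay)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Early stopping on validation loss
best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    train_one_epoch(model, optimizer)   # hypothetical helper
    val_loss = evaluate(model)          # hypothetical helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```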
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
Reconciling modern machine learning practice and the bias-variance trade-off
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
● Bigger models can be worse in some regimes
https://mltheory.org/deep.pdf
Modern Learning Theory
● Bigger models tend to overfit
● Bigger models are always better
● Bigger models can be worse in some regimes
● Even more data can hurt!
https://mltheory.org/deep.pdf
TL;DR
- Model-wise double descent
- There is a regime where bigger models are worse
- Sample-wise non-monotonicity
- There is a regime where more samples hurt
- Epoch-wise double descent
- There is a regime where training longer reverses overfitting
Generalization in Deep Learning Era
- Networks can fit `anything`, even random noise
- Larger capacity than people previously imagined
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
Generalization in Deep Learning Era
- Over-parameterized networks still perform well (role of implicit regularization)
IN SEARCH OF THE REAL INDUCTIVE BIAS: ON THE ROLE OF IMPLICIT REGULARIZATION IN DEEP LEARNING
Generalization in Deep Learning Era
- Deep networks regularize themselves (have a better loss landscape)
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
Generalization in Deep Learning Era
SENSITIVITY AND GENERALIZATION IN NEURAL NETWORKS: AN EMPIRICAL STUDY
Model-wise double descent
Architecture
- ResNet18, CNN, Transformers
Label Noise
● `Hard` distribution
● Label noise is sampled only once, not per epoch
Model-wise double descent
Label Noise
● `Hard` distribution
● Label noise is sampled only once, not per epoch
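A minimal sketch of this setup, assuming CIFAR-10 via torchvision and an illustrative noise fraction: a subset of labels is corrupted to a random different class once, before training, so the same noisy labels are reused every epoch.

```python
# Sketch: corrupt a fraction of CIFAR-10 labels ONCE, before training, so the
# same noisy labels are seen every epoch (noise fraction is illustrative).
import numpy as np
from torchvision.datasets import CIFAR10

def corrupt_labels_once(dataset, noise_frac=0.15, num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    targets = np.array(dataset.targets)
    idx = rng.choice(len(targets), size=int(noise_frac * len(targets)), replace=False)
    # Replace each selected label with a uniformly random *different* class.
    new = rng.integers(0, num_classes - 1, size=len(idx))
    new[new >= targets[idx]] += 1
    targets[idx] = new
    dataset.targets = targets.tolist()
    return dataset

train_set = corrupt_labels_once(CIFAR10(root="./data", train=True, download=True))
```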
Model-wise double descent
- Model-wise double descent occurs across different architectures, datasets, optimizers, and training procedures (a width-sweep sketch follows below)
- Also in adversarial training
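A hedged sketch of the model-wise sweep behind these plots; make_resnet18k, train, test_error, and the data loaders are hypothetical helpers, not the paper's released code.

```python
# Hedged sketch of a model-wise sweep; make_resnet18k, train, test_error and the
# data loaders are hypothetical helpers, not the paper's released code.
widths = [1, 2, 4, 8, 16, 32, 64]        # width multiplier k
test_errs = {}
for k in widths:
    model = make_resnet18k(k)            # ResNet18-style model with k base channels
    train(model, train_loader)           # identical training procedure for every k
    test_errs[k] = test_error(model, test_loader)
# With label noise, test_errs vs. k is expected to peak near the interpolation
# threshold and descend again for larger k.
```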
Model-wise double descent
Model-wise & Epoch-wise double descent
Epoch-wise double descent
Sufficiently large models can undergo a "double descent" behavior where test error first decreases, then increases near the interpolation threshold, and then decreases again.
Increasing the train time increases the EMC (Effective Model Complexity); thus a sufficiently large model transitions from under- to over-parameterized over the course of training.
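A hedged sketch of how EMC could be estimated from its definition; train_and_report_train_error is a hypothetical helper that trains the procedure from scratch on n samples and returns the final train error.

```python
# Hedged sketch of estimating the Effective Model Complexity (EMC): the largest
# number of samples n on which the training procedure still reaches ~zero train
# error.
def estimate_emc(train_and_report_train_error, candidate_ns, eps=0.01):
    emc = 0
    for n in sorted(candidate_ns):
        if train_and_report_train_error(n) <= eps:  # still (nearly) interpolates n samples
            emc = n
        else:
            break  # assumes train error is non-decreasing in n
    return emc
```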
Epoch-wise double descent
Conventional training is split into two phases:
1. In the first phase, the network learns a function with a small generalization gap
2. In the second phase, the network starts to over-fit the data leading to an increase in test error
Not the complete picture
- In some regimes, the test error decreases again and may reach a lower value at the end of training than at the first minimum (see the sketch below)
Reminiscent of:
- Information bottleneck
- Lottery ticket hypothesis
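A hedged sketch of how the epoch-wise curve is obtained: simply log test error at every epoch of a long run (model, optimizer, num_epochs, and the train/test helpers are hypothetical).

```python
# Hedged sketch of observing the epoch-wise effect: log test error every epoch
# over a long run (model, optimizer, num_epochs and the helpers are hypothetical).
test_errors = []
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)                 # hypothetical helper
    test_errors.append(test_error(model, test_loader))

early_window = test_errors[: max(1, len(test_errors) // 10)]
first_minimum = min(early_window)                     # rough proxy for the first descent
final_error = test_errors[-1]
# Epoch-wise double descent: the curve can rise after first_minimum and then
# descend again, sometimes ending below it.
```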
Epoch-wise double descent
Epoch-wise double descent
(Figure panels: CIFAR-10, CIFAR-100)
Sample-wise non-monotonicity
More data doesn’t improve performance
For both models, more data hurts performance
Sample-wise non-monotonicity
Transformers
- Language-translation task with no added label noise
Two effects combined
- More samples
- Larger models
4.5x more samples hurts performance for the intermediate-size model (see the sketch below)
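A hedged sketch of the sample-wise sweep: a fixed, intermediate-size model is trained on nested subsets of the data and its test error recorded (make_model, train, test_error, train_set, and test_loader are hypothetical).

```python
# Hedged sketch of a sample-wise sweep: a fixed, intermediate-size model trained
# on nested subsets of the data (make_model, train, test_error, train_set, and
# test_loader are hypothetical).
from torch.utils.data import DataLoader, Subset

sizes = [5_000, 10_000, 25_000, 50_000]
errors = {}
for n in sizes:
    loader = DataLoader(Subset(train_set, range(n)), batch_size=128, shuffle=True)
    model = make_model()
    train(model, loader)
    errors[n] = test_error(model, test_loader)
# In the transition regime, errors[n] need not decrease monotonically with n.
```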
Sample-wise non-monotonicity
Conclusion
Take home message :
Models behave unexpectedly in the transition regime
- Training longer reverses overfitting
- Doubling the training epochs is a technique used in some tasks (e.g., object detection)
- Bigger models are worse
- Whether the training set can be fit is an indicator
- Formalized as the Effective Model Complexity (EMC)
- More data hurts
- sticky :(
- Generalization is still the Holy Grail in deep learning
- Remains an open question (both experimentally and theoretically)
- Connecting data complexity with model complexity is still difficult
- NAS in some sense systematically addresses this problem
Know your data & model
- noise level (problem difficulty)
- model capacity (fitting power)
