WeightWatcher LLM Update

weight|watcher
Data Free Diagnostics for Deep Learning
charles@calculationconsulting.com
Slide 3: Who Are We?

Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years of experience in applied Machine Learning and AI

ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: Barclays, BlackRock
Fortune 500: Roche, France Telecom, Walmart
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)

www.calculationconsulting.com
charles@calculationconsulting.com
Slide 4: weight|watcher
Slide 5: Motivations: WeightWatcher Theory

"Understanding deep learning requires rethinking generalization"

The weightwatcher theory is a quant/physics-based approach built on:
the Statistical Mechanics of Generalization,
Random Matrix Theory, and
the theory of Strongly Correlated Systems.

Motivated by Self-Organized Criticality / the Critical Brain Hypothesis.
Slide 6: Research: Implicit Self-Regularization in Deep Learning

Selected publications (with UC Berkeley):

• Implicit Self-Regularization in Deep Neural Networks: Evidence from
Random Matrix Theory and Implications for Learning (JMLR 2021)
• Predicting trends in the quality of state-of-the-art neural networks
without access to training or testing data (Nature Communications 2021)
• … (ICML 2019, SDM 2020, KDD 2023)
Slide 7: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

The tail of the ESD contains the information.
Slide 8: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

Well trained layers are heavy-tailed and well shaped.

GPT-2 fits a Power Law (or Truncated Power Law), with alpha in [2, 6]:

    watcher.analyze(plot=True)

Good quality of fit (D is small).
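For context, a minimal end-to-end usage sketch of the weightwatcher tool (Python). The torchvision model and the details-DataFrame column names follow the tool's public README, but treat the exact names as assumptions rather than as content from this deck:

    # A minimal sketch, assuming `pip install weightwatcher` and a PyTorch model.
    import weightwatcher as ww
    from torchvision.models import vgg11  # any pre-trained model works here

    model = vgg11(weights="IMAGENET1K_V1")
    watcher = ww.WeightWatcher(model=model)

    # analyze() fits a (truncated) power law to each layer's ESD and returns
    # per-layer metrics as a pandas DataFrame; plot=True also renders each fit.
    details = watcher.analyze(plot=False)
    print(details[["layer_id", "alpha", "D"]])  # alpha: PL exponent; D: fit quality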
Slide 9: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

Better trained layers are more heavy-tailed and better shaped.
(Figure: layer ESDs for GPT vs. GPT-2)
Slide 10: Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations

RMT says that if W is a simple random Gaussian matrix,
then the ESD will have a very simple, known form:
the shape depends on Q = N/M (and variance ~ 1),
the eigenvalues are tightly bounded with very crisp edges,
and a few spikes may appear.
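As a quick illustration (my own sketch, not from the deck), the Marchenko-Pastur bulk edge can be checked numerically for a pure Gaussian W:

    # Sketch: the ESD of a Gaussian W stays inside the Marchenko-Pastur bulk
    # [lambda_-, lambda_+], with lambda_± = (1 ± 1/sqrt(Q))^2 at unit variance.
    import numpy as np

    N, M = 2000, 500                          # Q = N/M = 4
    W = np.random.randn(N, M)                 # simple random Gaussian weight matrix
    evals = np.linalg.eigvalsh(W.T @ W / N)   # the ESD

    Q = N / M
    lam_minus = (1 - 1 / np.sqrt(Q)) ** 2
    lam_plus = (1 + 1 / np.sqrt(Q)) ** 2
    print(f"ESD range: [{evals.min():.3f}, {evals.max():.3f}]")
    print(f"MP bulk:   [{lam_minus:.3f}, {lam_plus:.3f}]")   # edges match closely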
Slide 11: RMT: AlexNet

Marchenko-Pastur bulk-decay | Heavy-Tailed
(Figure: ESDs of the FC1 and FC2 layers, each shown zoomed in)
Slide 12: Random Matrix Theory: Heavy-Tailed

But if W is heavy-tailed, the ESD will also have heavy tails
(i.e. it's all spikes; the bulk vanishes).

If W is strongly correlated, then the ESD can be modeled as if W were drawn
from a heavy-tailed distribution.

Nearly all pre-trained DNNs display heavy tails… as we shall soon see.
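A companion sketch to the Gaussian check above (again my own illustration): drawing W from a heavy-tailed (Pareto) distribution instead pushes much of the ESD far past the MP edge:

    # Sketch: a heavy-tailed W yields a heavy-tailed ESD; eigenvalues escape
    # the Marchenko-Pastur bulk instead of staying tightly bounded.
    import numpy as np

    N, M = 2000, 500
    signs = np.random.choice([-1.0, 1.0], size=(N, M))
    W = signs * np.random.pareto(a=2.5, size=(N, M))   # Pareto entries, shape a=2.5
    W /= W.std()                                       # normalize to unit variance

    evals = np.linalg.eigvalsh(W.T @ W / N)
    lam_plus = (1 + 1 / np.sqrt(N / M)) ** 2           # Gaussian-case MP edge
    print(f"largest eigenvalue: {evals.max():.1f}  vs  MP edge: {lam_plus:.3f}")
    print(f"fraction above MP edge: {(evals > lam_plus).mean():.3f}")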
Slide 13: Experiments: just apply to pre-trained models

LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)

(Diagram: Conv2D, MaxPool, Conv2D, MaxPool, FC, FC)
Slide 14: Heavy-Tailed Self-Regularization (HTSR)

All large, well trained, modern DNNs exhibit heavy-tailed (scale free) self-regularization:

AlexNet,
VGG11, VGG13, …
ResNet, …
Inception,
DenseNet,
BERT, RoBERTa, …
GPT, GPT-2, …
…
Slide 15: Heavy-Tailed Metrics: GPT vs GPT-2

The original GPT is poorly trained on purpose; GPT-2 is well trained.

alpha for every layer: smaller alpha is better; large alphas are bad fits.
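To reproduce this kind of comparison, one can run weightwatcher over both checkpoints. A sketch, assuming the transformers and weightwatcher packages (the Hugging Face model ids are my assumption):

    # Sketch: compare per-layer power-law exponents for GPT vs GPT-2.
    import weightwatcher as ww
    from transformers import AutoModel

    for name in ["openai-gpt", "gpt2"]:       # assumed Hugging Face model ids
        model = AutoModel.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name, "mean layer alpha:", details["alpha"].mean())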
Slide 16: Power Law Universality: ImageNet

All ImageNet models display remarkable Heavy-Tailed Universality:

500 matrices
~50 architectures
Linear layers & Conv2D feature maps

80-90% of the layer alphas are < 4.
Slide 17: Random Matrix Theory: detailed insight into W_L

DNN training induces a breakdown of the Gaussian random structure
and the onset of a new kind of heavy-tailed self-regularization:

Gaussian random matrix  →  Bulk + Spikes  →  Heavy-Tailed
(small, older NNs)  →  (large, modern DNNs, and/or small batch sizes)
Slide 18: HT-SR Theory: 5+1 Phases of Training

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning.
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1-73, 2021.
Slide 19: Heavy-Tailed RMT: Universality Classes

The familiar Wigner/MP Gaussian class is not the only universality class in RMT.
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1-73, 2021.
Slide 20: WeightWatcher: predict trends in generalization

Predict test accuracies across variations in hyper-parameters.

The average Power Law exponent alpha predicts generalization at fixed depth:
smaller average alpha is better, and better models are easier to treat.

Charles H. Martin, Michael W. Mahoney; [Contest post-mortem paper]
Slide 21: WeightWatcher: Shape vs Scale metrics

Purely norm-based (scale) metrics (from SLT) can be correlated with depth
but anti-correlated with hyper-parameter changes.
Slide 22: WeightWatcher: treat architecture changes

Predict test accuracies across variations in hyper-parameters and depth.

The alpha-hat metric combines shape and scale metrics and corrects
for different depths (grey line); it can be derived from theory…
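For reference, the alpha-hat metric of the Nature Communications 2021 paper is, to the best of my reading, a scale-weighted average of the layer exponents, roughly:

    \hat{\alpha} = \frac{1}{N_L} \sum_{l=1}^{N_L} \alpha_l \, \log \lambda_l^{\max}

where alpha_l is the fitted power-law exponent of layer l's ESD and lambda_l^max is that layer's largest eigenvalue: the log lambda^max factor carries the scale (norm) information, while alpha carries the shape.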
Slide 23: WeightWatcher: predict test accuracies

alpha-hat works for 100s of different CV and NLP models. (Nature Communications 2021)

We do not have access to the training or test data,
but we can still predict trends in the generalization.
Slide 24: WeightWatcher: predict test accuracies

ResNet, DenseNet, etc. (Nature Communications 2021)
Slide 25: Predicting test accuracies: 100 pretrained models

The heavy-tailed (shape) metrics perform best.

From an open-source sandbox of nearly 500 pretrained CV models
(picked >= 5 models per regression): https://github.com/osmr/imgclsmob
(Nature Communications 2021)
Slide 26: Correlation Flow: CV Models

We can study correlation flow by looking at alpha vs. depth.
(Figures: VGG, ResNet, DenseNet) (Nature Communications 2021)
Slide 27: Comparing Transformers: BERT vs XLNet

weightwatcher layer alphas: very heavy-tailed layers look over-fit;
weakly heavy-tailed layers look under-fit.
Slide 28: Comparing Transformers: Bloom vs 560m

weightwatcher layer alphas: smaller alpha is better.
Slide 29: Comparing LLMs: Falcon vs Llama

weightwatcher layer alphas (Figures: Falcon, Llama)
Slide 30: Comparing LLMs: Correlation Flow

(Figures: Falcon, Llama)
Slide 31: Fine-Tuned LLMs: Only the deltas

weightwatcher layer alphas (Figures: Vicuna, Dromedary)
Slide 32: LLM Base Models: Truthfulness

(Figure: TruthfulQA metric vs. layer-averaged model alpha)
Slide 33: WeightWatcher: why Power Law fits?

Spiking (i.e. real) neurons exhibit power-law behavior.

weightwatcher supports several PL fits from experimental neuroscience,
plus totally new shape metrics we have invented (and published).
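Under the hood, fits like these can be done with the powerlaw Python package (a sketch; weightwatcher's own fitting code may differ in detail):

    # Sketch: MLE power-law fit to the tail of an ESD with the `powerlaw`
    # package, in the style weightwatcher uses per layer.
    import numpy as np
    import powerlaw

    evals = np.random.pareto(a=2.0, size=5000) + 1.0   # stand-in for real eigenvalues
    fit = powerlaw.Fit(evals, verbose=False)

    print("alpha:", fit.power_law.alpha)   # fitted tail exponent
    print("D:", fit.power_law.D)           # KS distance; small D = good fit
    print("xmin:", fit.xmin)               # start of the fitted tail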
Slide 34: WeightWatcher: why Power Law fits?

Spiking (i.e. real) neurons exhibit (truncated) power-law behavior.

The Critical Brain Hypothesis: evidence of Self-Organized Criticality (SOC),
Per Bak (How Nature Works).

As neural systems become more complex, they exhibit power-law behavior
and then truncated power-law behavior. We see exactly this behavior in DNNs,
and it is predictive of learning capacity.
Slide 35: WeightWatcher: open-source, open-science

We are looking for early adopters and collaborators.
https://github.com/CalculatedContent/WeightWatcher
100K+ downloads; 1000 stars

We have a Discord channel to support the tool:
https://discord.com/invite/uVVsEAcfyF
Slide 36: Statistical Mechanics derivation of the alpha-hat metric
Slide 37: WeightWatcher: predict test accuracies (recap of Slide 23)

alpha-hat works for 100s of different CV and NLP models (Nature Communications 2021);
we can predict trends in generalization without access to the training or test data.
Slide 38: Classic Set Up: Student-Teacher model

Statistical Mechanics of Learning, Engle & Van den Broeck (2001)
(Figures: Multilayer Feed-Forward Network; Perceptron)
Slide 39: Classic Set Up: Student-Teacher model

Average the overlap over random students J,
and use it to compute the typical / average-case generalization error.
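For the classic perceptron version, the standard result (in the style of Engle & Van den Broeck) ties the generalization error to the student-teacher overlap; as a reminder:

    R = \frac{\mathbf{J} \cdot \mathbf{T}}{\|\mathbf{J}\|\,\|\mathbf{T}\|},
    \qquad \epsilon_g = \frac{1}{\pi} \arccos(R)

where J is the student weight vector, T the teacher's, and epsilon_g is the probability that student and teacher disagree on a random input.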
Slide 40: New Set Up: Matrix-generalized Student-Teacher

Use real DNN matrices: N x M, strongly correlated, heavy-tailed correlation matrices.

Solve for the Free Energy associated with the generalization error.
This Free Energy is the weightwatcher layer quality metric.
Slide 41: Layer Quality Metrics: SemiEmpirical Theory

The "Generalized Norm" has a simple, functional form whose parameters can be
inferred from an empirical fit to the eigenvalues of the Teacher,
i.e. the WeightWatcher PowerLaw metric.

"Asymptotics of HCIZ integrals …", Tanaka (2008)
Slide 42: WeightWatcher: global and local convexity metrics

Smaller alpha corresponds to more convex energy landscapes.
(Figure annotations: Transformers at alpha ~ 3-4 or more; alpha ~ 2-3 or less)

"Rational Decisions, Random Matrices and Spin Glasses" (1998),
Galluccio, Bouchaud, and Potters
Slide 43: WeightWatcher: global and local convexity metrics

When a layer's alpha < 2, we think this means the layer is overfit.
This is predicted by our HTSR theory.

We suspect that the early layers of some Convolutional Nets may be slightly
overtrained: some have alpha < 2. A quick way to flag such layers is sketched below.
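A minimal sketch for flagging such layers, reusing the details DataFrame from watcher.analyze() shown earlier (column names assumed as before):

    # Sketch: flag potentially overfit layers (alpha < 2).
    details = watcher.analyze()
    suspect = details[details["alpha"] < 2.0]
    print(suspect[["layer_id", "alpha"]])   # candidate overtrained layers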
charles@calculationconsulting.com