WeightWatcher LLM Update

weight|watcher
Data Free Diagnostics for Deep Learning
charles@calculationconsulting.com
Slide 3: Who Are We?

Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years of experience in applied Machine Learning and AI

ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: Barclays, BlackRock
Fortune 500: Roche, France Telecom, Walmart
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)

www.calculationconsulting.com
charles@calculationconsulting.com
Slide 4: weight|watcher
Slide 5: Motivations: WeightWatcher Theory

"Understanding deep learning requires rethinking generalization"

The weightwatcher theory is a quant/physics-based approach built on:
the Statistical Mechanics of Generalization,
Random Matrix Theory, and
the theory of Strongly Correlated Systems.

Motivated by Self-Organized Criticality / the Critical Brain Hypothesis.
Slide 6: Research: Implicit Self-Regularization in Deep Learning

Selected publications (with UC Berkeley):

• Implicit Self-Regularization in Deep Neural Networks: Evidence from
Random Matrix Theory and Implications for Learning (JMLR 2021)
• Predicting trends in the quality of state-of-the-art neural networks
without access to training or testing data (Nature Communications 2021)
• … (ICML 2019, SDM 2020, KDD 2023)
Slide 7: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

The tail of the ESD contains the information.
Slide 8: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

Well trained layers are heavy-tailed and well shaped.

GPT-2 fits a Power Law (or Truncated Power Law), with alpha in [2, 6]:

    watcher.analyze(plot=True)

Good quality of fit (D is small).
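For context, a minimal end-to-end usage sketch of the weightwatcher tool (Python). The torchvision model and the details-DataFrame column names follow the tool's public README, but treat the exact names as assumptions rather than as content from this deck:

    # A minimal sketch, assuming `pip install weightwatcher` and a PyTorch model.
    import weightwatcher as ww
    from torchvision.models import vgg11  # any pre-trained model works here

    model = vgg11(weights="IMAGENET1K_V1")
    watcher = ww.WeightWatcher(model=model)

    # analyze() fits a (truncated) power law to each layer's ESD and returns
    # per-layer metrics as a pandas DataFrame; plot=True also renders each fit.
    details = watcher.analyze(plot=False)
    print(details[["layer_id", "alpha", "D"]])  # alpha: PL exponent; D: fit quality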
Slide 9: WeightWatcher analyzes the ESD (eigenvalues) of the layer weight matrices

Better trained layers are more heavy-tailed and better shaped.
(Figure: layer ESDs for GPT vs. GPT-2)
Slide 10: Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations

RMT says that if W is a simple random Gaussian matrix,
then the ESD will have a very simple, known form:
the shape depends on Q = N/M (and variance ~ 1),
the eigenvalues are tightly bounded with very crisp edges,
and a few spikes may appear.
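As a quick illustration (my own sketch, not from the deck), the Marchenko-Pastur bulk edge can be checked numerically for a pure Gaussian W:

    # Sketch: the ESD of a Gaussian W stays inside the Marchenko-Pastur bulk
    # [lambda_-, lambda_+], with lambda_± = (1 ± 1/sqrt(Q))^2 at unit variance.
    import numpy as np

    N, M = 2000, 500                          # Q = N/M = 4
    W = np.random.randn(N, M)                 # simple random Gaussian weight matrix
    evals = np.linalg.eigvalsh(W.T @ W / N)   # the ESD

    Q = N / M
    lam_minus = (1 - 1 / np.sqrt(Q)) ** 2
    lam_plus = (1 + 1 / np.sqrt(Q)) ** 2
    print(f"ESD range: [{evals.min():.3f}, {evals.max():.3f}]")
    print(f"MP bulk:   [{lam_minus:.3f}, {lam_plus:.3f}]")   # edges match closely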
Slide 11: RMT: AlexNet

Marchenko-Pastur bulk-decay | Heavy-Tailed
(Figure: ESDs of the FC1 and FC2 layers, each shown zoomed in)
Slide 12: Random Matrix Theory: Heavy-Tailed

But if W is heavy-tailed, the ESD will also have heavy tails
(i.e. it's all spikes; the bulk vanishes).

If W is strongly correlated, then the ESD can be modeled as if W were drawn
from a heavy-tailed distribution.

Nearly all pre-trained DNNs display heavy tails… as we shall soon see.
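A companion sketch to the Gaussian check above (again my own illustration): drawing W from a heavy-tailed (Pareto) distribution instead pushes much of the ESD far past the MP edge:

    # Sketch: a heavy-tailed W yields a heavy-tailed ESD; eigenvalues escape
    # the Marchenko-Pastur bulk instead of staying tightly bounded.
    import numpy as np

    N, M = 2000, 500
    signs = np.random.choice([-1.0, 1.0], size=(N, M))
    W = signs * np.random.pareto(a=2.5, size=(N, M))   # Pareto entries, shape a=2.5
    W /= W.std()                                       # normalize to unit variance

    evals = np.linalg.eigvalsh(W.T @ W / N)
    lam_plus = (1 + 1 / np.sqrt(N / M)) ** 2           # Gaussian-case MP edge
    print(f"largest eigenvalue: {evals.max():.1f}  vs  MP edge: {lam_plus:.3f}")
    print(f"fraction above MP edge: {(evals > lam_plus).mean():.3f}")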
Slide 13: Experiments: just apply to pre-trained models

LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)

(Diagram: Conv2D, MaxPool, Conv2D, MaxPool, FC, FC)
Slide 14: Heavy-Tailed Self-Regularization (HTSR)

All large, well trained, modern DNNs exhibit heavy-tailed (scale free) self-regularization:

AlexNet,
VGG11, VGG13, …
ResNet, …
Inception,
DenseNet,
BERT, RoBERTa, …
GPT, GPT-2, …
…
Slide 15: Heavy-Tailed Metrics: GPT vs GPT-2

The original GPT is poorly trained on purpose; GPT-2 is well trained.

alpha for every layer: smaller alpha is better; large alphas are bad fits.
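To reproduce this kind of comparison, one can run weightwatcher over both checkpoints. A sketch, assuming the transformers and weightwatcher packages (the Hugging Face model ids are my assumption):

    # Sketch: compare per-layer power-law exponents for GPT vs GPT-2.
    import weightwatcher as ww
    from transformers import AutoModel

    for name in ["openai-gpt", "gpt2"]:       # assumed Hugging Face model ids
        model = AutoModel.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name, "mean layer alpha:", details["alpha"].mean())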
Slide 16: Power Law Universality: ImageNet

All ImageNet models display remarkable Heavy-Tailed Universality:

500 matrices
~50 architectures
Linear layers & Conv2D feature maps

80-90% of the layer alphas are < 4.
Slide 17: Random Matrix Theory: detailed insight into W_L

DNN training induces a breakdown of the Gaussian random structure
and the onset of a new kind of heavy-tailed self-regularization:

Gaussian random matrix  →  Bulk + Spikes  →  Heavy-Tailed
(small, older NNs)  →  (large, modern DNNs, and/or small batch sizes)
Slide 18: HT-SR Theory: 5+1 Phases of Training

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning.
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1-73, 2021.
Slide 19: Heavy-Tailed RMT: Universality Classes

The familiar Wigner/MP Gaussian class is not the only universality class in RMT.
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1-73, 2021.
Slide 20: WeightWatcher: predict trends in generalization

Predict test accuracies across variations in hyper-parameters.

The average Power Law exponent alpha predicts generalization at fixed depth:
smaller average alpha is better, and better models are easier to treat.

Charles H. Martin, Michael W. Mahoney; [Contest post-mortem paper]
Slide 21: WeightWatcher: Shape vs Scale metrics

Purely norm-based (scale) metrics (from SLT) can be correlated with depth
but anti-correlated with hyper-parameter changes.
Slide 22: WeightWatcher: treat architecture changes

Predict test accuracies across variations in hyper-parameters and depth.

The alpha-hat metric combines shape and scale metrics and corrects
for different depths (grey line); it can be derived from theory…
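For reference, the alpha-hat metric of the Nature Communications 2021 paper is, to the best of my reading, a scale-weighted average of the layer exponents, roughly:

    \hat{\alpha} = \frac{1}{N_L} \sum_{l=1}^{N_L} \alpha_l \, \log \lambda_l^{\max}

where alpha_l is the fitted power-law exponent of layer l's ESD and lambda_l^max is that layer's largest eigenvalue: the log lambda^max factor carries the scale (norm) information, while alpha carries the shape.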
Slide 23: WeightWatcher: predict test accuracies

alpha-hat works for 100s of different CV and NLP models. (Nature Communications 2021)

We do not have access to the training or test data,
but we can still predict trends in the generalization.
Slide 24: WeightWatcher: predict test accuracies

ResNet, DenseNet, etc. (Nature Communications 2021)
Slide 25: Predicting test accuracies: 100 pretrained models

The heavy-tailed (shape) metrics perform best.

From an open-source sandbox of nearly 500 pretrained CV models
(picked >= 5 models per regression): https://github.com/osmr/imgclsmob
(Nature Communications 2021)
Slide 26: Correlation Flow: CV Models

We can study correlation flow by looking at alpha vs. depth.
(Figures: VGG, ResNet, DenseNet) (Nature Communications 2021)
Slide 27: Comparing Transformers: BERT vs XLNet

weightwatcher layer alphas: very heavy-tailed layers look over-fit;
weakly heavy-tailed layers look under-fit.
Slide 28: Comparing Transformers: Bloom vs 560m

weightwatcher layer alphas: smaller alpha is better.
Slide 29: Comparing LLMs: Falcon vs Llama

weightwatcher layer alphas (Figures: Falcon, Llama)
Slide 30: Comparing LLMs: Correlation Flow

(Figures: Falcon, Llama)
Slide 31: Fine-Tuned LLMs: Only the deltas

weightwatcher layer alphas (Figures: Vicuna, Dromedary)
Slide 32: LLM Base Models: Truthfulness

(Figure: TruthfulQA metric vs. layer-averaged model alpha)
Slide 33: WeightWatcher: why Power Law fits?

Spiking (i.e. real) neurons exhibit power-law behavior.

weightwatcher supports several PL fits from experimental neuroscience,
plus totally new shape metrics we have invented (and published).
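Under the hood, fits like these can be done with the powerlaw Python package (a sketch; weightwatcher's own fitting code may differ in detail):

    # Sketch: MLE power-law fit to the tail of an ESD with the `powerlaw`
    # package, in the style weightwatcher uses per layer.
    import numpy as np
    import powerlaw

    evals = np.random.pareto(a=2.0, size=5000) + 1.0   # stand-in for real eigenvalues
    fit = powerlaw.Fit(evals, verbose=False)

    print("alpha:", fit.power_law.alpha)   # fitted tail exponent
    print("D:", fit.power_law.D)           # KS distance; small D = good fit
    print("xmin:", fit.xmin)               # start of the fitted tail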
Slide 34: WeightWatcher: why Power Law fits?

Spiking (i.e. real) neurons exhibit (truncated) power-law behavior.

The Critical Brain Hypothesis: evidence of Self-Organized Criticality (SOC),
Per Bak (How Nature Works).

As neural systems become more complex, they exhibit power-law behavior
and then truncated power-law behavior. We see exactly this behavior in DNNs,
and it is predictive of learning capacity.
Slide 35: WeightWatcher: open-source, open-science

We are looking for early adopters and collaborators.
https://github.com/CalculatedContent/WeightWatcher
100K+ downloads; 1000 stars

We have a Discord channel to support the tool:
https://discord.com/invite/uVVsEAcfyF
Slide 36: Statistical Mechanics derivation of the alpha-hat metric
Slide 37: WeightWatcher: predict test accuracies (recap of Slide 23)

alpha-hat works for 100s of different CV and NLP models (Nature Communications 2021);
we can predict trends in generalization without access to the training or test data.
Slide 38: Classic Set Up: Student-Teacher model

Statistical Mechanics of Learning, Engle & Van den Broeck (2001)
(Figures: Multilayer Feed-Forward Network; Perceptron)
Slide 39: Classic Set Up: Student-Teacher model

Average the overlap over random students J,
and use it to compute the typical / average-case generalization error.
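For the classic perceptron version, the standard result (in the style of Engle & Van den Broeck) ties the generalization error to the student-teacher overlap; as a reminder:

    R = \frac{\mathbf{J} \cdot \mathbf{T}}{\|\mathbf{J}\|\,\|\mathbf{T}\|},
    \qquad \epsilon_g = \frac{1}{\pi} \arccos(R)

where J is the student weight vector, T the teacher's, and epsilon_g is the probability that student and teacher disagree on a random input.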
Slide 40: New Set Up: Matrix-generalized Student-Teacher

Use real DNN matrices: N x M, strongly correlated, heavy-tailed correlation matrices.

Solve for the Free Energy associated with the generalization error.
This Free Energy is the weightwatcher layer quality metric.
Slide 41: Layer Quality Metrics: SemiEmpirical Theory

The "Generalized Norm" has a simple, functional form whose parameters can be
inferred from an empirical fit to the eigenvalues of the Teacher,
i.e. the WeightWatcher PowerLaw metric.

"Asymptotics of HCIZ integrals …", Tanaka (2008)
Slide 42: WeightWatcher: global and local convexity metrics

Smaller alpha corresponds to more convex energy landscapes.
(Figure annotations: Transformers at alpha ~ 3-4 or more; alpha ~ 2-3 or less)

"Rational Decisions, Random Matrices and Spin Glasses" (1998),
Galluccio, Bouchaud, and Potters
Slide 43: WeightWatcher: global and local convexity metrics

When a layer's alpha < 2, we think this means the layer is overfit.
This is predicted by our HTSR theory.

We suspect that the early layers of some Convolutional Nets may be slightly
overtrained: some have alpha < 2. A quick way to flag such layers is sketched below.
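A minimal sketch for flagging such layers, reusing the details DataFrame from watcher.analyze() shown earlier (column names assumed as before):

    # Sketch: flag potentially overfit layers (alpha < 2).
    details = watcher.analyze()
    suspect = details[details["alpha"] < 2.0]
    print(suspect[["layer_id", "alpha"]])   # candidate overtrained layers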
charles@calculationconsulting.com