STATISTICAL PHYSICS MODELLING
OF MACHINE LEARNING
Lenka Zdeborová
(IPhT; CEA Saclay & CNRS)
WiMLDS meetup, November 29, 2018
Long history of physics influencing machine learning.
Examples:
Gibbs-Bogoliubov-Feynman’60s - the physics behind variational
inference.
Hopfield model’82. Spin glass models of neural networks Amit,
Gutfreund, Sompolinsky’85.
Boltzmann machine Hinton, Sejnowski’86 - named after the
Boltzmann distribution.
Gardner’87 - Maximum storage capacity in neural networks
(related to VC dimension).
SVMs by Boser, Guyon, Vapnik’92 were inspired by Krauth, Mézard’87.
Many papers on neural networks in physics in 80s and 90s.
PHYSICS IN MACHINE
LEARNING
Les Houches, 1985
THE PUZZLE OF GENERALIZATION
According to PAC bounds (via VC dimension or Rademacher
complexity), neural networks that generalize well should not be
able to fit random labels. Yet in practice they fit them perfectly
(Zhang et al., ICLR’16).
THEORETICAL QUESTIONS
IN DEEP LEARNING
Why the lack of overfitting?
The intuition “more parameters = more overfitting”
does not seem to hold in deep learning.
SAMPLE COMPLEXITY
CIFAR-10: 50,000 training samples.
How many samples are really needed?
How low is the optimal sample complexity? Are we achieving it?
If not, is it because of the architectures or the algorithms?
AND THE PHYSICS HERE?
THEORETICAL-PHYSICS
ROADMAP
1. Experimental observation or fundamental hypothesis.
2. Unreasonably simple model for which toughest questions
can be understood mathematically.
3. Generalize to more realistic models, relying on universality
(= important laws of nature rarely depend on many details).
MODELS
H = −J ∑_{(ij)∈E} S_i S_j,  P({S_i}_{i=1,…,N}) = e^{−H} / Z
(the Ising model: magnetism of materials)
In data science, models are used to fit the data. (e.g. linear
regression: What is the best straight line that captures the
dependence of y on x?)
In physics, models are the main tool for understanding.
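The Ising-model distribution above can be sampled in a few lines of Metropolis dynamics; a minimal numpy sketch, where the 2-D lattice size, the coupling J = 1, and the explicit inverse temperature β (absorbed into H on the slide) are illustrative choices, not from the slides:

```python
import numpy as np

def metropolis_ising(L=16, beta=0.6, J=1.0, steps=20000, seed=0):
    """Sample spins S_i in {-1,+1} on an L x L periodic lattice from
    P({S_i}) ∝ exp(-beta * H), with H = -J * sum over edges of S_i S_j."""
    rng = np.random.default_rng(seed)
    S = rng.choice([-1, 1], size=(L, L))
    for _ in range(steps):
        i, j = rng.integers(0, L, size=2)
        # Sum of the four nearest neighbours (periodic boundaries).
        nb = (S[(i + 1) % L, j] + S[(i - 1) % L, j]
              + S[i, (j + 1) % L] + S[i, (j - 1) % L])
        dE = 2.0 * J * S[i, j] * nb  # energy change if spin (i, j) flips
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            S[i, j] = -S[i, j]
    return S

S = metropolis_ising()
print(abs(S.mean()))  # magnetisation, large at low temperature (high beta)
```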
MODELS
P({S_i}_{i=1,…,N}) = e^{−H} / Z,  H = −∑_{(ijk)∈E} J_{ijk} S_i S_j S_k,  J_{ijk} ∼ 𝒩(0, 1)
(the p-spin model: a model of the glass transition)
IS THIS USEFUL IN MACHINE LEARNING?
Example: Single layer neural network = generalized linear regression.
Given (X, y), find w such that

y_μ = φ( ∑_{i=1}^{p} X_{μi} w_i ),  μ = 1,…,n,  i = 1,…,p

where X are the data (n rows, p columns), y the labels, w the weights,
and φ a (noisy) activation function.
TEACHER-STUDENT MODEL
Take X_{μi} random i.i.d. Gaussian and w*_i random i.i.d. from P_w.
Create y_μ = φ( ∑_{i=1}^{p} X_{μi} w*_i ).
Goal: Compute the best possible generalisation error achievable
with n samples of dimension p.
High-dimensional regime: p → ∞, n → ∞, n/p = Ω(1).
Gardner, Derrida’89, Gyorgyi’90
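The teacher-student setup above is easy to simulate; a hedged numpy sketch with φ = sign, Gaussian prior P_w, and a naive least-squares student as a baseline (the dimensions and the least-squares estimator are illustrative choices, not the optimal estimator discussed in the talk):

```python
import numpy as np

def teacher_student_data(n, p, seed=0):
    """Teacher-student GLM: X iid Gaussian, w* iid from P_w = N(0,1),
    labels y = sign(X w*)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    w_star = rng.standard_normal(p)  # teacher weights, drawn from P_w
    y = np.sign(X @ w_star)
    return X, y, w_star

def generalisation_error(w_hat, w_star, n_test=10000, seed=1):
    """Fraction of sign mismatches on fresh data from the same teacher."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_test, len(w_star)))
    return np.mean(np.sign(X @ w_hat) != np.sign(X @ w_star))

X, y, w_star = teacher_student_data(n=400, p=100)      # n/p = 4
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]            # naive student
print(generalisation_error(w_ls, w_star))
```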
What did we win? The posterior P(w|X, y) is tractable with the replica
and cavity methods, developed in the theory of spin glasses.
NEW W.R.T. 1990
Optimal generalisation error for any non-linearity
and prior on the weights.
Proof of the replica formula for the optimal
generalisation error.
Approximate message passing provably reaching the
optimal generalisation error (outside the hard region).
Barbier, Krzakala, Macris, Miolane, LZ, COLT’18, arXiv:1708.03395
LEARNING CURVES
[Figure: generalisation error vs. # of samples per dimension n/p,
for φ(z) = sign(z), P_w = 𝒩(0, 1), in the limit p → ∞, n → ∞,
n/p = Ω(1); curves: optimal, AMP algorithm, logistic regression.]
PHASE TRANSITIONS
[Figure: generalisation error vs. # of samples per dimension n/p,
for binary weights w_i ∈ {−1, +1}, φ(z) = sign(z), p → ∞, n → ∞,
n/p = Ω(1); curves: optimal, optimal & achievable, AMP algorithm,
logistic regression, with a marked “hard” region where AMP does not
reach the optimal error.]
HARD REGIME
INCLUDING HIDDEN VARIABLES
[Figure: two-layer network with p input units, K hidden units, and one
output unit (L = 3 layers); data X, labels y, n training samples;
first-layer weights w are learned, second-layer weights v1 & v2 are fixed.]
Committee machine: model from Schwarze’92.
Limit: K = O(1), p → ∞, n → ∞, α = n/p = Ω(1).
Proof of the replica formula, and approximate message passing:
Aubin, Maillard, Barbier, Macris, Krzakala, LZ, spotlight at NeurIPS’18.
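The committee machine above can be written down directly; a minimal numpy sketch in which the fixed second-layer weights are taken as all ones, i.e. a majority vote of the hidden units (an assumption consistent with the K-unit formula on the following slides, not a detail stated on this one):

```python
import numpy as np

def committee_machine(X, W):
    """Two-layer committee machine: p inputs, K sign hidden units,
    output = sign of the sum of hidden activations (second layer fixed).
    X: (n, p) data; W: (p, K) first-layer weights (the learned ones)."""
    hidden = np.sign(X @ W)            # (n, K) hidden-unit outputs
    return np.sign(hidden.sum(axis=1))

rng = np.random.default_rng(0)
p, K, n = 50, 3, 10
W_star = rng.standard_normal((p, K))   # teacher's first-layer weights
X = rng.standard_normal((n, p))
y = committee_machine(X, W_star)
print(y)  # labels in {-1, 0, +1}; 0 only on ties (cf. sign(0) = 0)
```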
PHASE TRANSITIONS
Specialization phase transition = hidden units specialise to
correlate with specific features.
K = 2, with the convention sign(0) = 0:

y_μ = sign[ sign(∑_i X_{μi} w_{i,1}) + sign(∑_i X_{μi} w_{i,2}) ]

[Figure: generalisation error ε_g(α) and overlaps q_{00}, q_{01} as
functions of α ∈ [0, 4], from state evolution (SE) and AMP; the
specialization transition is marked.]
PHASE TRANSITIONS
For K ≫ 1 hidden units,

y_μ = sign[ ∑_{a=1}^{K} sign(∑_i X_{μi} w_{i,a}) ],

there is a large algorithmic gap:
IT threshold: n > 7.65 K p
Algorithmic threshold: n > const · K² p
[Figure: generalisation error ε_g(α) vs. α = (# of samples)/(# hidden
units × input size); Bayes-optimal and AMP curves; non-specialized vs.
specialized hidden units; a discontinuous specialization transition
and a computational gap.]
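The gap between the two thresholds grows linearly with K; a quick numeric illustration (the unspecified constant in the algorithmic threshold is set to 1 purely to show the scaling, not its true value):

```python
# IT vs. (conjectured) algorithmic sample thresholds for the K-unit
# committee machine, per the slide: n > 7.65*K*p and n > const*K^2*p.

def it_threshold(K, p):
    return 7.65 * K * p

def algo_threshold(K, p, const=1.0):
    # const = 1 is an assumption for illustration only.
    return const * K**2 * p

p = 1000
for K in [2, 10, 100]:
    n_it, n_algo = it_threshold(K, p), algo_threshold(K, p)
    print(K, n_it, n_algo, n_algo / n_it)  # the ratio grows like K
```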
[Diagram: as the # of samples grows, a good generalisation error goes
from impossible, to hard, to doable; only part of the doable region is
doable today.]
Our goal: Quantify this in more realistic models.
Design algorithms working in the doable region.
REFERENCES
LZ, F. Krzakala, Statistical physics of inference: Thresholds and algorithms,
Advances in Physics (2016), arXiv:1511.02476.
J. Barbier, N. Macris, L. Miolane, F. Krzakala, LZ, Phase Transitions, Optimal
Errors and Optimality of Message-Passing in Generalized Linear Models,
COLT’18, arXiv:1708.03395.
B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, LZ, The committee
machine: Computational to statistical gaps in learning a two-layers neural
network, NeurIPS’18, arXiv:1806.05451.
Thank you for your attention!
