MiWoCI IEEE 2018 1
The statistical physics of learning - revisited
www.cs.rug.nl/~biehl
Michael Biehl
Bernoulli Institute for
Mathematics, Computer Science
and Artificial Intelligence
University of Groningen / NL
MiWoCI IEEE 2018 2
machine learning theory ?
Computational Learning Theory
performance bounds & guarantees
independent of
- specific task
- statistical properties of data
- details of the training
...
Statistical Physics of Learning:
typical properties & phenomena
for models of specific
- systems/network architectures
- statistics of data and noise
- training algorithms / cost functions
...
MiWoCI IEEE 2018 3
[Figure: A Neural Networks timeline, from Perceptrons (Minsky & Papert), Widrow & Hoff's Adaline to SOM, LVQ and SVM;
source: www.ibm.com/developerworks/library/cc-cognitive-neural-networks-deep-dive/]
mathematical analogies with the theory of disordered magnetic materials
statistical physics
- of network dynamics (neurons)
- of learning processes (weights)
MiWoCI IEEE 2018 4
news from the stone age of neural networks
Statistical Physics of Neural Networks: Two ground-breaking papers
Training, feed-forward networks:
Elizabeth Gardner (1957-1988).
The space of interactions in neural
networks. J. Phys. A 21:257-270 (1988)
Dynamics, attractor neural networks:
John Hopfield. Neural Networks and
physical systems with emergent
collective computational abilities.
PNAS 79(8):2554-2558 (1982)
MiWoCI IEEE 2018 5
overview
From stochastic optimization (Monte Carlo, Langevin dynamics)
.... to thermal equilibrium: temperature, free energy, entropy, ...
(.... and back): formal application to optimization
Machine learning: typical properties of large learning systems
- training: stochastic optimization of (many) weights, guided by a data-dependent cost function
- randomized data ( frozen disorder )
- models: student/teacher scenarios
- analysis: order parameters, disorder average, replica trick, annealed approximation, high-temperature limit
Examples: perceptron classifier, "Ising" perceptron, layered networks
Outlook
MiWoCI IEEE 2018 6
stochastic optimization
objective/cost/energy function H(W) with many degrees of freedom W = (W_1, ..., W_N),
e.g. discrete (W_j ∈ {-1,+1}) or continuous (W_j ∈ ℝ)
Metropolis algorithm (discrete W):
• suggest a (small) change W → W', e.g. a "single spin flip" W_j → -W_j for a random j
• compute ΔH = H(W') - H(W)
• acceptance of the change
  - always if ΔH ≤ 0
  - with probability exp(-β ΔH) if ΔH > 0
  the parameter β = 1/T controls the acceptance rate for "uphill" moves
Langevin dynamics (continuous W):
• continuous temporal change, "noisy gradient descent":
  dW/dt = -∇_W H(W) + η(t)
• with delta-correlated white noise η(t) (spatial + temporal independence),
  ⟨η_i(t) η_j(t')⟩ = 2 T δ_ij δ(t - t'); the temperature T controls the noise level,
  i.e. the random deviation from pure gradient descent
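A minimal numerical sketch of both schemes (the toy energies, dimensions and parameter values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Metropolis algorithm for discrete (binary) weights --------------------
def toy_energy_binary(W):
    """Hypothetical toy cost for W in {-1,+1}^N (a mean-field 'ferromagnet')."""
    return -0.5 * np.sum(W) ** 2 / len(W)

def metropolis_step(W, beta, energy=toy_energy_binary):
    """Single-spin-flip Metropolis update."""
    j = rng.integers(len(W))
    W_new = W.copy()
    W_new[j] *= -1                       # suggest a small change: flip one component
    dH = energy(W_new) - energy(W)
    if dH <= 0 or rng.random() < np.exp(-beta * dH):
        return W_new                     # accept: always downhill, sometimes uphill
    return W                             # reject

# --- Langevin dynamics for continuous weights -------------------------------
def langevin_step(W, T, dt, grad=lambda W: W):
    """Discretized noisy gradient descent; the default grad is that of H(W)=|W|^2/2."""
    noise = rng.normal(0.0, np.sqrt(2.0 * T * dt), size=W.shape)
    return W - grad(W) * dt + noise

# usage sketch
W_bin = rng.choice([-1.0, 1.0], size=100)
for _ in range(2000):
    W_bin = metropolis_step(W_bin, beta=2.0)

W_cont = rng.normal(size=100)
for _ in range(2000):
    W_cont = langevin_step(W_cont, T=0.5, dt=0.01)
```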
MiWoCI IEEE 2018 7
thermal equilibrium
both the Markov chain (Metropolis) and the continuous (Langevin) dynamics converge to the same
stationary density of configurations, the Gibbs-Boltzmann density of states
P(W) = exp[-β H(W)] / Z
with normalization Z = Σ_W exp[-β H(W)] (or the corresponding integral over W):
the "Zustandssumme", or partition function
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T
note: additional constraints can be imposed on the weights, for instance: normalization
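A compact way to see why the Metropolis rule produces exactly this density is detailed balance; a sketch of the standard argument (not spelled out on the slide):

```latex
% Metropolis acceptance  a(W \to W') = \min\{1,\, e^{-\beta\,[H(W')-H(W)]}\}
% satisfies detailed balance with respect to the Gibbs-Boltzmann density:
P(W)\,a(W \to W')
= \frac{e^{-\beta H(W)}}{Z}\,\min\{1, e^{-\beta[H(W')-H(W)]}\}
= \frac{1}{Z}\,\min\{e^{-\beta H(W)}, e^{-\beta H(W')}\}
= P(W')\,a(W' \to W),
% hence P(W)=e^{-\beta H(W)}/Z is stationary under the Markov chain.
```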
MiWoCI IEEE 2018 8
thermal averages and entropy
the role of Z: thermal averages ⟨...⟩_T in equilibrium, e.g. ⟨H⟩_T,
... can be expressed as derivatives of ln Z, for instance ⟨H⟩_T = -∂ ln Z / ∂β
assume extensive energy, proportional to system size N: E = N e
Ω(E) ~ volume of states with energy E, defining the (microcanonical) entropy
per degree of freedom: s(e) = (1/N) ln Ω(Ne)
re-write Z as an integral over all possible energies: Z = ∫ de exp( N [ s(e) - β e ] )
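In formulas (a standard reconstruction of the steps named above, under the stated extensivity assumption):

```latex
\langle H\rangle_T \;=\; \frac{1}{Z}\sum_{W} H(W)\,e^{-\beta H(W)}
\;=\; -\,\frac{\partial \ln Z}{\partial \beta},
\qquad
Z \;=\; \sum_{W} e^{-\beta H(W)}
\;=\; \int \! dE\; \Omega(E)\, e^{-\beta E}
\;=\; \int \! de\; e^{\,N\,[\,s(e)-\beta e\,]},
\quad s(e)=\tfrac{1}{N}\ln \Omega(Ne).
```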
MiWoCI IEEE 2018 9
Darwin-Fowler method, a.k.a. saddle point (Laplace) integration:
for an integrand exp( N g(e) ) with a maximum at e = e*, consider the thermodynamic limit N → ∞;
the integral is dominated by e*, and (1/N) ln Z is given by the minimum of the free energy (per degree of freedom)
f(e) = e - s(e) / β
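Written out (the standard Laplace estimate, consistent with the slide's f = e - s(e)/β):

```latex
\frac{1}{N}\ln Z \;=\; \frac{1}{N}\ln \int \! de\; e^{\,N[\,s(e)-\beta e\,]}
\;\xrightarrow{\;N\to\infty\;}\; \max_{e}\,\bigl[\,s(e)-\beta e\,\bigr]
\;=\; -\,\beta\, \min_{e} f(e),
\qquad f(e)\;=\;e-\frac{s(e)}{\beta}\;=\;e - T\,s(e).
```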
MiWoCI IEEE 2018 10
free energy and temperature
in large systems (thermodynamic limit) ln Z is dominated by the states with minimal free energy
T controls the competition between - smaller energies
- a larger number of available states
T → 0 (β → ∞): singles out the lowest energy (ground state);
Metropolis: only down-hill moves, Langevin: true gradient descent
T → ∞ (β → 0): all states occur with equal probability, independent of energy;
Metropolis: accept all random changes, Langevin: the noise term dominates the gradient
T = 1/β is the temperature at which ⟨H⟩_T = N e_o, with e_o the location of the free energy minimum
assumption: ergodicity (all states can be reached in the dynamics)
MiWoCI IEEE 2018 11
statistical physics & optimization
theory of stochastic optimization by means of statistical physics:
- development of algorithms (e.g. Simulated Annealing)
- analysis of problem properties, even in the absence of practical algorithms
  (number of ground states, minima, ...)
- applicable in many different contexts, universality
MiWoCI IEEE 2018 12
machine learning
special case machine learning: choice of adaptive parameters W,
e.g. all weights in a neural network, prototype
components in LVQ, centers in an RBF network, ...
cost function: defined w.r.t. a given data set D = {x^μ, σ^μ}:
a sum over examples, with feature vectors x^μ and target labels σ^μ (if supervised),
of costs or an error measure ε(...) per example, e.g. the number of misclassifications
training:
• consider the weights as the outcome of a stochastic optimization process
• formal (thermal) equilibrium given by the Gibbs-Boltzmann density P(W) ∝ exp[-β H_D(W)]
• ⟨...⟩_T : thermal average over the training process for one particular data set
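In formulas (a standard reconstruction; the number of examples P and the symbol H_D for the data-dependent cost are notational assumptions):

```latex
H_D(W) \;=\; \sum_{\mu=1}^{P} \varepsilon\!\left(W;\, x^{\mu}, \sigma^{\mu}\right),
\qquad
P(W) \;=\; \frac{e^{-\beta H_D(W)}}{Z_D},
\qquad
Z_D \;=\; \int \! d\mu(W)\; e^{-\beta H_D(W)} .
```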
MiWoCI IEEE 2018 13
quenched average over training data
• note: the energy/cost function is defined for one particular data set;
typical properties are obtained by an additional average over randomized data
• typical properties on average over randomized data sets: derivatives of the
quenched free energy ~ -⟨ln Z_D⟩_D / β yield averages of the form ⟨⟨...⟩_T⟩_D
• the simplest assumption: i.i.d. input vectors x^μ
with i.i.d. components
• training labels given by a target function, σ^μ = σ_*(x^μ),
for instance provided by a teacher network
• student / teacher scenarios
control the complexity of the target rule and of the learning system;
analyse training by (stochastic) optimization
[Figure: teacher network with unknown ("?") internal representation generating the labels]
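The corresponding quenched average, as it is usually written (a standard reconstruction; ⟨...⟩_D denotes the average over random data sets):

```latex
-\,\beta F \;=\; \bigl\langle \ln Z_D \bigr\rangle_{D},
\qquad
\bigl\langle\,\langle A(W)\rangle_{T}\,\bigr\rangle_{D}
\;=\;
\left\langle
\frac{1}{Z_D}\int \! d\mu(W)\; A(W)\, e^{-\beta H_D(W)}
\right\rangle_{\!D}.
```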
MiWoCI IEEE 2018 14
average over training data: the "replica trick"
Z^n corresponds to n non-interacting "copies" of the system (replicas);
the quenched average ⟨Z^n⟩_D introduces effective interactions between the replicas
... saddle point integration for ⟨Z^n⟩_D with integer n; the quenched free energy
requires the analytic continuation to n → 0
mathematical subtleties: replica symmetry breaking, order parameter functions, ...
Marc Mézard, Giorgio Parisi, Miguel Virasoro.
Spin Glass Theory and Beyond. World Scientific (1987)
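The identity behind the trick (standard, not shown explicitly on the slide):

```latex
\bigl\langle \ln Z \bigr\rangle_{D}
\;=\; \lim_{n\to 0}\frac{\bigl\langle Z^{\,n}\bigr\rangle_{D}-1}{n}
\;=\; \lim_{n\to 0}\frac{\partial}{\partial n}\,\ln \bigl\langle Z^{\,n}\bigr\rangle_{D},
```
where ⟨Z^n⟩_D is computed for integer n and then continued analytically to n → 0.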
MiWoCI IEEE 2018 15
annealed approximation and high-T limit
annealed approximation: ⟨ln Z⟩_D ≈ ln ⟨Z⟩_D (average taken in the exponent, justified for β ≈ 0);
becomes exact (=) in the high-temperature limit (the replicas decouple)
• independent single examples: ⟨Z⟩_D factorizes into identical single-example averages
• extensive number of examples: P = α N (proportional to the number of weights)
• saddle point integration: ⟨ln Z⟩_D / N is dominated by the minimum of the resulting free energy
• high-T limit: β → 0, α → ∞ with α̃ = β α finite:
"learn almost nothing..." (high T) "...from infinitely many examples" (α → ∞);
in this limit the generalization error plays the role of the energy (i.e. of the training error, which it approaches)
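The resulting high-temperature free energy, written for a generic order parameter R (a standard reconstruction of the result used on the following slides; s(R) denotes the entropy of weights with a given R):

```latex
\beta f(R) \;=\; \tilde{\alpha}\;\varepsilon_{g}(R)\;-\;s(R),
\qquad \tilde{\alpha}=\beta\,\alpha,\quad \alpha = P/N,
```
and the typical ("physical") value of R at a given α̃ follows from minimizing f(R).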
MiWoCI IEEE 2018 16
example: perceptron training
• student: S(ξ) = sign(J · ξ) with weight vector J ∈ ℝ^N
• teacher: S_T(ξ) = sign(B · ξ) with weight vector B ∈ ℝ^N
• training data: {ξ^μ, σ^μ = sign(B · ξ^μ)} with independent components ξ_j^μ of
zero mean and unit variance
• Central Limit Theorem (CLT), for large N: the fields x = J·ξ and y = B·ξ are
normally distributed with zero means and covariances ⟨x²⟩ = J·J, ⟨y²⟩ = B·B, ⟨xy⟩ = J·B;
for normalized J and B the overlap R = J·B fully specifies the generalization error ε_g
MiWoCI IEEE 2018 17
example: perceptron training
i.i.d. isotropic data, geometry:
[Figure: student vector J and teacher vector B enclosing an angle; the region of the input space on which student S and teacher disagree]
• or, more intuitively: for isotropic inputs the generalization error is the probability of disagreement,
i.e. the angle between J and B divided by π:
ε_g = (1/π) arccos(R), with order parameter R = J·B / (|J| |B|)
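A quick numerical sanity check of this geometric result (a sketch; the dimension, sample size and the construction of a student with prescribed overlap R are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 200_000          # input dimension and number of test inputs (illustrative)
R_target = 0.6               # prescribed student/teacher overlap

# teacher B and a student J with overlap R = J.B (both normalized)
B = rng.normal(size=N)
B /= np.linalg.norm(B)
perp = rng.normal(size=N)
perp -= (perp @ B) * B
perp /= np.linalg.norm(perp)
J = R_target * B + np.sqrt(1 - R_target**2) * perp

# empirical generalization error: fraction of random inputs with sign(J.x) != sign(B.x)
X = rng.normal(size=(P, N))
eps_empirical = np.mean(np.sign(X @ J) != np.sign(X @ B))
eps_theory = np.arccos(R_target) / np.pi

print(f"empirical: {eps_empirical:.4f}   arccos(R)/pi: {eps_theory:.4f}")
```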
MiWoCI IEEE 2018 18
example: perceptron training
• entropy: s(R) = (1/2) ln(1 - R²) (+ irrelevant constants)
- all weights with order parameter R lie on a hypersphere
with radius ~ √(1 - R²) and volume ~ (1 - R²)^{N/2}
- or: exponential representation of the δ-functions + saddle point integration...
note: the result carries over to more general overlap matrices C (many students and teachers)
• high-T free energy: β f(R) = α̃ ε_g(R) - s(R)
• re-scaled number of examples α̃ = β α, with α = P/N
[Figure: entropy s(R) as a function of R]
MiWoCI IEEE 2018 19
example: perceptron training
• "physical state": (arg-)minimum of the free energy f(R)
• typical learning curves: R(α̃) and ε_g(α̃) obtained from ∂f/∂R = 0
[Figure: free energy f(R) for increasing α̃ and the resulting learning curve R(α̃)]
• perfect generalization (R → 1, ε_g → 0) is achieved asymptotically as α̃ → ∞
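A minimal sketch that reproduces such a learning curve numerically by minimizing the high-temperature free energy β f(R) = α̃ arccos(R)/π - ½ ln(1 - R²) over R (the grid resolution and the chosen values of α̃ are arbitrary):

```python
import numpy as np

def beta_f(R, alpha_tilde):
    """High-T free energy (times beta) of the spherical perceptron student."""
    eps_g = np.arccos(R) / np.pi          # generalization error for overlap R
    entropy = 0.5 * np.log(1.0 - R**2)    # entropy of weights with given R
    return alpha_tilde * eps_g - entropy

R_grid = np.linspace(-0.999, 0.999, 20001)
for alpha_tilde in (0.5, 1.0, 2.0, 5.0, 10.0):
    R_star = R_grid[np.argmin(beta_f(R_grid, alpha_tilde))]
    print(f"alpha~ = {alpha_tilde:5.1f}   R* = {R_star:.3f}   "
          f"eps_g = {np.arccos(R_star)/np.pi:.3f}")
# eps_g decreases monotonically with alpha~ and vanishes only asymptotically
```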
MiWoCI IEEE 2018 20
perceptron learning curve
a very simple model:
- linearly separable rule (teacher)
- i.i.d. isotropic random data
- high temperature stochastic training
with perfect generalization only in the limit α̃ → ∞
Modifications/extensions:
- noisy data, unlearnable rules
- low-T results (annealed, replica...)
- unsupervised learning
- structured input data (clusters)
- large margin perceptron and SVM
- variational optimization of energy
function (i.e. training algorithm)
- binary weights (“Ising Perceptron”)
[Figure: typical learning curve, on average over random linearly separable data sets of a given size]
MiWoCI IEEE 2018 21
example: Ising perceptron
• student: binary weights J ∈ {-1, +1}^N
• teacher: binary weights B ∈ {-1, +1}^N
• generalization error unchanged: ε_g = (1/π) arccos(R), with R = J·B / N
• entropy: s(R) = - (1+R)/2 ln[(1+R)/2] - (1-R)/2 ln[(1-R)/2]
(1 ± R)/2 is the probability for alignment/misalignment of a single component (J_j = ±B_j), so s(R) is the
entropy of mixing N(1+R)/2 aligned and N(1-R)/2 misaligned components
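Putting the pieces together gives the corresponding high-T free energy (a reconstruction in the same notation as before; the location of the transition follows from comparing the competing minima and is not quoted here):

```latex
\beta f(R) \;=\; \tilde{\alpha}\,\frac{\arccos R}{\pi}
\;+\;\frac{1+R}{2}\ln\frac{1+R}{2}
\;+\;\frac{1-R}{2}\ln\frac{1-R}{2},
```
with a minimum at R < 1 and one at R = 1 competing as α̃ grows.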
MiWoCI IEEE 2018 22
example: Ising perceptron
• competing minima in the free energy f(R): one at R < 1 and one at R = 1
• for small α̃ the R < 1 minimum is the global one (poor generalization)
• for intermediate α̃: co-existing phases of poor/perfect
generalization; the lower minimum is stable,
the higher minimum is meta-stable
• for large α̃ only one minimum (R = 1) remains
"first order phase transition"
to perfect generalization:
the "system freezes" in the fully aligned state R = 1
[Figure: free energy f(R) for small, intermediate and large α̃]
MiWoCI IEEE 2018 23
Monte Carlo results (no prior knowledge)
results carry over (qualitatively) to low (zero) temperature training:
e.g. the nature of the phase transitions etc.
[Figure: Monte Carlo simulations of the first order phase transition; local and global minima exchange roles where the free energies are equal ("equal f"); rounding due to finite size effects]
MiWoCI IEEE 2018 24
soft committee machine
[Figure: adaptive student network with N input units and K hidden units; teacher network with M hidden units and unknown ("?") weights]
macroscopic properties of the student network given by order parameters R_km = J_k·B_m / N and Q_kl = J_k·J_l / N;
model parameters: teacher overlaps T_mn = B_m·B_n / N
training: minimization of the corresponding data-dependent cost function
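For reference, one common parameterization of the soft committee machine and its order parameters (the normalization by N and the unspecified activation g are conventions assumed here, not given on the slide):

```latex
\sigma(\xi) \;=\; \sum_{k=1}^{K} g\!\left(\frac{J_k\cdot \xi}{\sqrt{N}}\right),
\qquad
\tau(\xi) \;=\; \sum_{m=1}^{M} g\!\left(\frac{B_m\cdot \xi}{\sqrt{N}}\right),
\qquad
R_{km} = \frac{J_k\cdot B_m}{N},\quad
Q_{kl} = \frac{J_k\cdot J_l}{N},\quad
T_{mn} = \frac{B_m\cdot B_n}{N}.
```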
MiWoCI IEEE 2018 25
soft committee machine
exploit the thermodynamic limit: CLT for the fields x_k = J_k·ξ/√N and y_m = B_m·ξ/√N, which are
normally distributed with zero means and a covariance matrix given by the order parameters Q, R, T
(+ constant)
MiWoCI IEEE 2018 26
soft committee machine: hidden unit specialization
• K = M = 2: symmetry-breaking phase transition (2nd order)
• K = M > 2: 1st order phase transition with metastable states
[Figure: specialization curves, e.g. for K=2 and K=5]
MiWoCI IEEE 2018 27-28
soft committee machine
[Figure: adaptive student and teacher networks; the teacher's internal representation is unknown ("?")]
• initial training phase: unspecialized hidden unit weights;
all student units represent the "mean teacher"
• transition to specialization makes perfect agreement possible
• successful training requires a critical number of examples
• the hidden unit permutation symmetry has to be broken:
all permutations of the student's hidden units are equivalent (they yield the same output)
MiWoCI IEEE 2018 29
large hidden layer: many hidden units K
the unspecialized state remains meta-stable up to a large (rescaled) number of examples:
is perfect generalization without prior knowledge
impossible with of order O(NK) examples?
MiWoCI IEEE 2018 30
what's next ?
network architecture and design:
• activation functions (ReLU etc.)
• deep networks
• tree-like architectures as models of convolution & pooling
dynamics of network training:
• online training by stochastic gradient descent
• mathematical description in terms of ODEs
• learning rates, momentum etc.
• regularization, e.g. drop-out, weight decay etc.
other topics:
• concept drift: time-dependent statistics of data and target
... a lot more & new ideas to come