MiWoCI IEEE 2018 1
The statistical physics of learning - revisited
www.cs.rug.nl/~biehl
Michael Biehl
Bernoulli Institute for
Mathematics, Computer Science
and Artificial Intelligence
University of Groningen / NL
MiWoCI IEEE 2018 2
machine learning theory ?
Computational Learning Theory
performance bounds & guarantees
independent of
- specific task
- statistical properties of data
- details of the training
...
Statistical Physics of Learning:
typical properties & phenomena
for models of specific
- systems/network architectures
- statistics of data and noise
- training algorithms / cost functions
...
MiWoCI IEEE 2018 3
[Figure: A Neural Networks timeline, from Perceptrons (Minsky & Papert), Widrow & Hoff's Adaline to SOM, LVQ and SVM;
source: www.ibm.com/developerworks/library/cc-cognitive-neural-networks-deep-dive/]
mathematical analogies with the theory of disordered magnetic materials
statistical physics
- of network dynamics (neurons)
- of learning processes (weights)
MiWoCI IEEE 2018 4
news from the stone age of neural networks
Statistical Physics of Neural Networks: Two ground-breaking papers
Training, feed-forward networks:
Elizabeth Gardner (1957-1988).
The space of interactions in neural
networks. J. Phys. A 21:257-270 (1988)
Dynamics, attractor neural networks:
John Hopfield. Neural Networks and
physical systems with emergent
collective computational abilities.
PNAS 79(8):2554-2558 (1982)
MiWoCI IEEE 2018 5
overview
From stochastic optimization (Monte Carlo, Langevin dynamics)
.... to thermal equilibrium: temperature, free energy, entropy, ...
(.... and back): formal application to optimization
Machine learning: typical properties of large learning systems
- training: stochastic optimization of (many) weights, guided by a data-dependent cost function
- randomized data ( frozen disorder )
- models: student/teacher scenarios
- analysis: order parameters, disorder average, replica trick, annealed approximation, high-temperature limit
Examples: perceptron classifier, "Ising" perceptron, layered networks
Outlook
MiWoCI IEEE 2018 6
stochastic optimization
objective/cost/energy function H(W) with many degrees of freedom W = (W_1, ..., W_N),
e.g. discrete (W_j ∈ {-1,+1}) or continuous (W_j ∈ ℝ)
Metropolis algorithm (discrete W):
• suggest a (small) change W → W', e.g. a "single spin flip" W_j → -W_j for a random j
• compute ΔH = H(W') - H(W)
• acceptance of the change
  - always if ΔH ≤ 0
  - with probability exp(-β ΔH) if ΔH > 0
  the parameter β = 1/T controls the acceptance rate for "uphill" moves
Langevin dynamics (continuous W):
• continuous temporal change, "noisy gradient descent":
  dW/dt = -∇_W H(W) + η(t)
• with delta-correlated white noise η(t) (spatial + temporal independence),
  ⟨η_i(t) η_j(t')⟩ = 2 T δ_ij δ(t - t'); the temperature T controls the noise level,
  i.e. the random deviation from pure gradient descent
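A minimal numerical sketch of both schemes (the toy energies, dimensions and parameter values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Metropolis algorithm for discrete (binary) weights --------------------
def toy_energy_binary(W):
    """Hypothetical toy cost for W in {-1,+1}^N (a mean-field 'ferromagnet')."""
    return -0.5 * np.sum(W) ** 2 / len(W)

def metropolis_step(W, beta, energy=toy_energy_binary):
    """Single-spin-flip Metropolis update."""
    j = rng.integers(len(W))
    W_new = W.copy()
    W_new[j] *= -1                       # suggest a small change: flip one component
    dH = energy(W_new) - energy(W)
    if dH <= 0 or rng.random() < np.exp(-beta * dH):
        return W_new                     # accept: always downhill, sometimes uphill
    return W                             # reject

# --- Langevin dynamics for continuous weights -------------------------------
def langevin_step(W, T, dt, grad=lambda W: W):
    """Discretized noisy gradient descent; the default grad is that of H(W)=|W|^2/2."""
    noise = rng.normal(0.0, np.sqrt(2.0 * T * dt), size=W.shape)
    return W - grad(W) * dt + noise

# usage sketch
W_bin = rng.choice([-1.0, 1.0], size=100)
for _ in range(2000):
    W_bin = metropolis_step(W_bin, beta=2.0)

W_cont = rng.normal(size=100)
for _ in range(2000):
    W_cont = langevin_step(W_cont, T=0.5, dt=0.01)
```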
MiWoCI IEEE 2018 7
thermal equilibrium
both the Markov chain (Metropolis) and the continuous (Langevin) dynamics converge to the same
stationary density of configurations, the Gibbs-Boltzmann density of states
P(W) = exp[-β H(W)] / Z
with normalization Z = Σ_W exp[-β H(W)] (or the corresponding integral over W):
the "Zustandssumme", or partition function
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T
note: additional constraints can be imposed on the weights, for instance: normalization
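A compact way to see why the Metropolis rule produces exactly this density is detailed balance; a sketch of the standard argument (not spelled out on the slide):

```latex
% Metropolis acceptance  a(W \to W') = \min\{1,\, e^{-\beta\,[H(W')-H(W)]}\}
% satisfies detailed balance with respect to the Gibbs-Boltzmann density:
P(W)\,a(W \to W')
= \frac{e^{-\beta H(W)}}{Z}\,\min\{1, e^{-\beta[H(W')-H(W)]}\}
= \frac{1}{Z}\,\min\{e^{-\beta H(W)}, e^{-\beta H(W')}\}
= P(W')\,a(W' \to W),
% hence P(W)=e^{-\beta H(W)}/Z is stationary under the Markov chain.
```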
MiWoCI IEEE 2018 8
thermal averages and entropy
the role of Z: thermal averages ⟨...⟩_T in equilibrium, e.g. ⟨H⟩_T,
... can be expressed as derivatives of ln Z, for instance ⟨H⟩_T = -∂ ln Z / ∂β
assume extensive energy, proportional to system size N: E = N e
Ω(E) ~ volume of states with energy E, defining the (microcanonical) entropy
per degree of freedom: s(e) = (1/N) ln Ω(Ne)
re-write Z as an integral over all possible energies: Z = ∫ de exp( N [ s(e) - β e ] )
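In formulas (a standard reconstruction of the steps named above, under the stated extensivity assumption):

```latex
\langle H\rangle_T \;=\; \frac{1}{Z}\sum_{W} H(W)\,e^{-\beta H(W)}
\;=\; -\,\frac{\partial \ln Z}{\partial \beta},
\qquad
Z \;=\; \sum_{W} e^{-\beta H(W)}
\;=\; \int \! dE\; \Omega(E)\, e^{-\beta E}
\;=\; \int \! de\; e^{\,N\,[\,s(e)-\beta e\,]},
\quad s(e)=\tfrac{1}{N}\ln \Omega(Ne).
```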
MiWoCI IEEE 2018 9
Darwin-Fowler method, a.k.a. saddle point (Laplace) integration:
for an integrand exp( N g(e) ) with a maximum at e = e*, consider the thermodynamic limit N → ∞;
the integral is dominated by e*, and (1/N) ln Z is given by the minimum of the free energy (per degree of freedom)
f(e) = e - s(e) / β
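Written out (the standard Laplace estimate, consistent with the slide's f = e - s(e)/β):

```latex
\frac{1}{N}\ln Z \;=\; \frac{1}{N}\ln \int \! de\; e^{\,N[\,s(e)-\beta e\,]}
\;\xrightarrow{\;N\to\infty\;}\; \max_{e}\,\bigl[\,s(e)-\beta e\,\bigr]
\;=\; -\,\beta\, \min_{e} f(e),
\qquad f(e)\;=\;e-\frac{s(e)}{\beta}\;=\;e - T\,s(e).
```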
MiWoCI IEEE 2018 10
free energy and temperature
in large systems (thermodynamic limit) ln Z is dominated by the states with minimal free energy
T controls the competition between - smaller energies
- a larger number of available states
T → 0 (β → ∞): singles out the lowest energy (ground state);
Metropolis: only down-hill moves, Langevin: true gradient descent
T → ∞ (β → 0): all states occur with equal probability, independent of energy;
Metropolis: accept all random changes, Langevin: the noise term dominates the gradient
T = 1/β is the temperature at which ⟨H⟩_T = N e_o, with e_o the location of the free energy minimum
assumption: ergodicity (all states can be reached in the dynamics)
MiWoCI IEEE 2018 11
statistical physics & optimization
theory of stochastic optimization by means of statistical physics:
- development of algorithms (e.g. Simulated Annealing)
- analysis of problem properties, even in the absence of practical algorithms
  (number of ground states, minima, ...)
- applicable in many different contexts, universality
MiWoCI IEEE 2018 12
machine learning
special case machine learning: choice of adaptive parameters W,
e.g. all weights in a neural network, prototype
components in LVQ, centers in an RBF network, ...
cost function: defined w.r.t. a given data set D = {x^μ, σ^μ}:
a sum over examples, with feature vectors x^μ and target labels σ^μ (if supervised),
of costs or an error measure ε(...) per example, e.g. the number of misclassifications
training:
• consider the weights as the outcome of a stochastic optimization process
• formal (thermal) equilibrium given by the Gibbs-Boltzmann density P(W) ∝ exp[-β H_D(W)]
• ⟨...⟩_T : thermal average over the training process for one particular data set
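In formulas (a standard reconstruction; the number of examples P and the symbol H_D for the data-dependent cost are notational assumptions):

```latex
H_D(W) \;=\; \sum_{\mu=1}^{P} \varepsilon\!\left(W;\, x^{\mu}, \sigma^{\mu}\right),
\qquad
P(W) \;=\; \frac{e^{-\beta H_D(W)}}{Z_D},
\qquad
Z_D \;=\; \int \! d\mu(W)\; e^{-\beta H_D(W)} .
```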
MiWoCI IEEE 2018 13
quenched average over training data
• note: the energy/cost function is defined for one particular data set;
typical properties are obtained by an additional average over randomized data
• typical properties on average over randomized data sets: derivatives of the
quenched free energy ~ -⟨ln Z_D⟩_D / β yield averages of the form ⟨⟨...⟩_T⟩_D
• the simplest assumption: i.i.d. input vectors x^μ
with i.i.d. components
• training labels given by a target function, σ^μ = σ_*(x^μ),
for instance provided by a teacher network
• student / teacher scenarios
control the complexity of the target rule and of the learning system;
analyse training by (stochastic) optimization
[Figure: teacher network with unknown ("?") internal representation generating the labels]
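The corresponding quenched average, as it is usually written (a standard reconstruction; ⟨...⟩_D denotes the average over random data sets):

```latex
-\,\beta F \;=\; \bigl\langle \ln Z_D \bigr\rangle_{D},
\qquad
\bigl\langle\,\langle A(W)\rangle_{T}\,\bigr\rangle_{D}
\;=\;
\left\langle
\frac{1}{Z_D}\int \! d\mu(W)\; A(W)\, e^{-\beta H_D(W)}
\right\rangle_{\!D}.
```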
MiWoCI IEEE 2018 14
average over training data: the "replica trick"
Z^n corresponds to n non-interacting "copies" of the system (replicas);
the quenched average ⟨Z^n⟩_D introduces effective interactions between the replicas
... saddle point integration for ⟨Z^n⟩_D with integer n; the quenched free energy
requires the analytic continuation to n → 0
mathematical subtleties: replica symmetry breaking, order parameter functions, ...
Marc Mézard, Giorgio Parisi, Miguel Virasoro.
Spin Glass Theory and Beyond. World Scientific (1987)
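The identity behind the trick (standard, not shown explicitly on the slide):

```latex
\bigl\langle \ln Z \bigr\rangle_{D}
\;=\; \lim_{n\to 0}\frac{\bigl\langle Z^{\,n}\bigr\rangle_{D}-1}{n}
\;=\; \lim_{n\to 0}\frac{\partial}{\partial n}\,\ln \bigl\langle Z^{\,n}\bigr\rangle_{D},
```
where ⟨Z^n⟩_D is computed for integer n and then continued analytically to n → 0.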
MiWoCI IEEE 2018 15
annealed approximation and high-T limit
annealed approximation: ⟨ln Z⟩_D ≈ ln ⟨Z⟩_D (average taken in the exponent, justified for β ≈ 0);
becomes exact (=) in the high-temperature limit (the replicas decouple)
• independent single examples: ⟨Z⟩_D factorizes into identical single-example averages
• extensive number of examples: P = α N (proportional to the number of weights)
• saddle point integration: ⟨ln Z⟩_D / N is dominated by the minimum of the resulting free energy
• high-T limit: β → 0, α → ∞ with α̃ = β α finite:
"learn almost nothing..." (high T) "...from infinitely many examples" (α → ∞);
in this limit the generalization error plays the role of the energy (i.e. of the training error, which it approaches)
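The resulting high-temperature free energy, written for a generic order parameter R (a standard reconstruction of the result used on the following slides; s(R) denotes the entropy of weights with a given R):

```latex
\beta f(R) \;=\; \tilde{\alpha}\;\varepsilon_{g}(R)\;-\;s(R),
\qquad \tilde{\alpha}=\beta\,\alpha,\quad \alpha = P/N,
```
and the typical ("physical") value of R at a given α̃ follows from minimizing f(R).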
MiWoCI IEEE 2018 16
example: perceptron training
• student: S(ξ) = sign(J · ξ) with weight vector J ∈ ℝ^N
• teacher: S_T(ξ) = sign(B · ξ) with weight vector B ∈ ℝ^N
• training data: {ξ^μ, σ^μ = sign(B · ξ^μ)} with independent components ξ_j^μ of
zero mean and unit variance
• Central Limit Theorem (CLT), for large N: the fields x = J·ξ and y = B·ξ are
normally distributed with zero means and covariances ⟨x²⟩ = J·J, ⟨y²⟩ = B·B, ⟨xy⟩ = J·B;
for normalized J and B the overlap R = J·B fully specifies the generalization error ε_g
MiWoCI IEEE 2018 17
example: perceptron training
i.i.d. isotropic data, geometry:
[Figure: student vector J and teacher vector B enclosing an angle; the region of the input space on which student S and teacher disagree]
• or, more intuitively: for isotropic inputs the generalization error is the probability of disagreement,
i.e. the angle between J and B divided by π:
ε_g = (1/π) arccos(R), with order parameter R = J·B / (|J| |B|)
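A quick numerical sanity check of this geometric result (a sketch; the dimension, sample size and the construction of a student with prescribed overlap R are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 200_000          # input dimension and number of test inputs (illustrative)
R_target = 0.6               # prescribed student/teacher overlap

# teacher B and a student J with overlap R = J.B (both normalized)
B = rng.normal(size=N)
B /= np.linalg.norm(B)
perp = rng.normal(size=N)
perp -= (perp @ B) * B
perp /= np.linalg.norm(perp)
J = R_target * B + np.sqrt(1 - R_target**2) * perp

# empirical generalization error: fraction of random inputs with sign(J.x) != sign(B.x)
X = rng.normal(size=(P, N))
eps_empirical = np.mean(np.sign(X @ J) != np.sign(X @ B))
eps_theory = np.arccos(R_target) / np.pi

print(f"empirical: {eps_empirical:.4f}   arccos(R)/pi: {eps_theory:.4f}")
```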
MiWoCI IEEE 2018 18
example: perceptron training
• entropy: s(R) = (1/2) ln(1 - R²) (+ irrelevant constants)
- all weights with order parameter R lie on a hypersphere
with radius ~ √(1 - R²) and volume ~ (1 - R²)^{N/2}
- or: exponential representation of the δ-functions + saddle point integration...
note: the result carries over to more general overlap matrices C (many students and teachers)
• high-T free energy: β f(R) = α̃ ε_g(R) - s(R)
• re-scaled number of examples α̃ = β α, with α = P/N
[Figure: entropy s(R) as a function of R]
MiWoCI IEEE 2018 19
example: perceptron training
• "physical state": (arg-)minimum of the free energy f(R)
• typical learning curves: R(α̃) and ε_g(α̃) obtained from ∂f/∂R = 0
[Figure: free energy f(R) for increasing α̃ and the resulting learning curve R(α̃)]
• perfect generalization (R → 1, ε_g → 0) is achieved asymptotically as α̃ → ∞
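A minimal sketch that reproduces such a learning curve numerically by minimizing the high-temperature free energy β f(R) = α̃ arccos(R)/π - ½ ln(1 - R²) over R (the grid resolution and the chosen values of α̃ are arbitrary):

```python
import numpy as np

def beta_f(R, alpha_tilde):
    """High-T free energy (times beta) of the spherical perceptron student."""
    eps_g = np.arccos(R) / np.pi          # generalization error for overlap R
    entropy = 0.5 * np.log(1.0 - R**2)    # entropy of weights with given R
    return alpha_tilde * eps_g - entropy

R_grid = np.linspace(-0.999, 0.999, 20001)
for alpha_tilde in (0.5, 1.0, 2.0, 5.0, 10.0):
    R_star = R_grid[np.argmin(beta_f(R_grid, alpha_tilde))]
    print(f"alpha~ = {alpha_tilde:5.1f}   R* = {R_star:.3f}   "
          f"eps_g = {np.arccos(R_star)/np.pi:.3f}")
# eps_g decreases monotonically with alpha~ and vanishes only asymptotically
```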
MiWoCI IEEE 2018 20
perceptron learning curve
a very simple model:
- linearly separable rule (teacher)
- i.i.d. isotropic random data
- high temperature stochastic training
with perfect generalization only in the limit α̃ → ∞
Modifications/extensions:
- noisy data, unlearnable rules
- low-T results (annealed, replica...)
- unsupervised learning
- structured input data (clusters)
- large margin perceptron and SVM
- variational optimization of energy
function (i.e. training algorithm)
- binary weights (“Ising Perceptron”)
[Figure: typical learning curve, on average over random linearly separable data sets of a given size]
MiWoCI IEEE 2018 21
example: Ising perceptron
• student: binary weights J ∈ {-1, +1}^N
• teacher: binary weights B ∈ {-1, +1}^N
• generalization error unchanged: ε_g = (1/π) arccos(R), with R = J·B / N
• entropy: s(R) = - (1+R)/2 ln[(1+R)/2] - (1-R)/2 ln[(1-R)/2]
(1 ± R)/2 is the probability for alignment/misalignment of a single component (J_j = ±B_j), so s(R) is the
entropy of mixing N(1+R)/2 aligned and N(1-R)/2 misaligned components
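Putting the pieces together gives the corresponding high-T free energy (a reconstruction in the same notation as before; the location of the transition follows from comparing the competing minima and is not quoted here):

```latex
\beta f(R) \;=\; \tilde{\alpha}\,\frac{\arccos R}{\pi}
\;+\;\frac{1+R}{2}\ln\frac{1+R}{2}
\;+\;\frac{1-R}{2}\ln\frac{1-R}{2},
```
with a minimum at R < 1 and one at R = 1 competing as α̃ grows.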
MiWoCI IEEE 2018 22
example: Ising perceptron
• competing minima in the free energy f(R): one at R < 1 and one at R = 1
• for small α̃ the R < 1 minimum is the global one (poor generalization)
• for intermediate α̃: co-existing phases of poor/perfect
generalization; the lower minimum is stable,
the higher minimum is meta-stable
• for large α̃ only one minimum (R = 1) remains
"first order phase transition"
to perfect generalization:
the "system freezes" in the fully aligned state R = 1
[Figure: free energy f(R) for small, intermediate and large α̃]
MiWoCI IEEE 2018 23
Monte Carlo results (no prior knowledge)
results carry over (qualitatively) to low (zero) temperature training:
e.g. the nature of the phase transitions etc.
[Figure: Monte Carlo simulations of the first order phase transition; local and global minima exchange roles where the free energies are equal ("equal f"); rounding due to finite size effects]
MiWoCI IEEE 2018 24
soft committee machine
[Figure: adaptive student network with N input units and K hidden units; teacher network with M hidden units and unknown ("?") weights]
macroscopic properties of the student network given by order parameters R_km = J_k·B_m / N and Q_kl = J_k·J_l / N;
model parameters: teacher overlaps T_mn = B_m·B_n / N
training: minimization of the corresponding data-dependent cost function
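For reference, one common parameterization of the soft committee machine and its order parameters (the normalization by N and the unspecified activation g are conventions assumed here, not given on the slide):

```latex
\sigma(\xi) \;=\; \sum_{k=1}^{K} g\!\left(\frac{J_k\cdot \xi}{\sqrt{N}}\right),
\qquad
\tau(\xi) \;=\; \sum_{m=1}^{M} g\!\left(\frac{B_m\cdot \xi}{\sqrt{N}}\right),
\qquad
R_{km} = \frac{J_k\cdot B_m}{N},\quad
Q_{kl} = \frac{J_k\cdot J_l}{N},\quad
T_{mn} = \frac{B_m\cdot B_n}{N}.
```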
MiWoCI IEEE 2018 25
soft committee machine
exploit the thermodynamic limit: CLT for the fields x_k = J_k·ξ/√N and y_m = B_m·ξ/√N, which are
normally distributed with zero means and a covariance matrix given by the order parameters Q, R, T
(+ constant)
MiWoCI IEEE 2018 26
soft committee machine: hidden unit specialization
• K = M = 2: symmetry-breaking phase transition (2nd order)
• K = M > 2: 1st order phase transition with metastable states
[Figure: specialization curves, e.g. for K=2 and K=5]
MiWoCI IEEE 2018 27-28
soft committee machine
[Figure: adaptive student and teacher networks; the teacher's internal representation is unknown ("?")]
• initial training phase: unspecialized hidden unit weights;
all student units represent the "mean teacher"
• transition to specialization makes perfect agreement possible
• successful training requires a critical number of examples
• the hidden unit permutation symmetry has to be broken:
all permutations of the student's hidden units are equivalent (they yield the same output)
MiWoCI IEEE 2018 29
large hidden layer: many hidden units K
the unspecialized state remains meta-stable up to a large (rescaled) number of examples:
is perfect generalization without prior knowledge
impossible with of order O(NK) examples?
MiWoCI IEEE 2018 30
what's next ?
network architecture and design:
• activation functions (ReLU etc.)
• deep networks
• tree-like architectures as models of convolution & pooling
dynamics of network training:
• online training by stochastic gradient descent
• mathematical description in terms of ODEs
• learning rates, momentum etc.
• regularization, e.g. drop-out, weight decay etc.
other topics:
• concept drift: time-dependent statistics of data and target
... a lot more & new ideas to come