CITEC June 2020
Phase transitions in layered neural networks:
rectified linear units vs. sigmoidal activation
Elisa Oostwal
Michiel Straat
Michael Biehl
Bernoulli Institute for Mathematics,
Computer Science and Artificial Intelligence
University of Groningen / NL
arXiv:1910.07476: Hidden unit Specialization in Layered Neural Networks:
ReLU vs. Sigmoidal Activation [updated May 27, 2020]
motivation/research question
ReLU activation, e.g. in deep learning
folklore:
• computationally cheap and fast
• circumvents “vanishing gradient” problem
• sparse activity (biologically plausible)
? more efficient training
? favorable generalization ability
Theoretical study
• model scenarios, student/teacher setup
• typical learning curves, statistical physics based approach
• systematic comparison of ReLU and “classical” sigmoidal activation
• significant differences or “only” practical issues?
statistical physics (of learning) in a nutshell
objective/cost/energy function of the adaptive weights
• training by stochastic optimization of adaptive weights (thresholds etc.)
• Metropolis algorithm, noisy gradient descent (Langevin) with long-time equilibrium (Gibbs-Boltzmann); control parameter: „inverse temperature“ β = 1 / T
• equilibrium state: compromise/competition between minimal energy (ground state) vs. huge number of available states with higher energy
• „thermal avg.“ over Peq , (microcanonical) entropy, minimal free energy (per weight)
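For orientation, a minimal sketch of the Gibbs-Boltzmann formalism assumed here (the symbols E for the cost function, w for the weights and Z for the partition function are standard notation, not read off the slides):

$$ P_{\mathrm{eq}}(\mathbf{w}) = \frac{e^{-\beta E(\mathbf{w})}}{Z}, \qquad Z = \int \! d\mathbf{w}\; e^{-\beta E(\mathbf{w})}, \qquad \langle A \rangle = \int \! d\mathbf{w}\; A(\mathbf{w})\, P_{\mathrm{eq}}(\mathbf{w}), \qquad -\beta f = \frac{1}{N} \ln Z . $$

Minimal free energy then means that the equilibrium balances low energy against a large entropy of accessible configurations.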
statistical physics (of learning) in a nutshell
• computation of system properties in equilibrium: thermal averages
• T = 1 / β controls the competition between energy and entropy:
T → 0 singles out the lowest energy (ground state)
T → ∞: all states occur with equal probability, independent of energy; thermal noise dominates the training
• practical machine learning: algorithm (hyper-) parameters control the minimization of the cost function, e.g. learning rate, weight decay ...
• statistical physics / stochastic optimization
 design of algorithms (Metropolis, simulated annealing)
 theoretical analysis of model scenarios
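As a compact worked example of the two limits (standard statistical mechanics, not specific to these slides):

$$ \beta \to \infty:\quad P_{\mathrm{eq}}(\mathbf{w}) \to \text{uniform on } \arg\min_{\mathbf{w}} E(\mathbf{w}); \qquad \beta \to 0:\quad P_{\mathrm{eq}}(\mathbf{w}) \to \text{const.},\ \text{independent of } E(\mathbf{w}) . $$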
machine learning
• optimization of adaptive quantities, e.g. the weights of a network, based on a given specific set of example data (input/output pairs); the cost function is defined w.r.t. this data set
• typical results: an additional average of the thermal averages over the data set is needed (difficult!), even for the simplest model densities: input components independent identically distributed (zero mean, unit variance), i.e. an unstructured input density, with information about the target entering only through the labels
• a proper data average of the free energy requires the replica trick or an annealed approximation
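A minimal sketch of what this data average looks like, assuming a per-example error ε summed over a data set D of P examples (the symbols D, ξ^μ, τ^μ are my notation, not taken from the slides):

$$ E_{D}(\mathbf{w}) = \sum_{\mu=1}^{P} \epsilon\!\left(\mathbf{w};\, \boldsymbol{\xi}^{\mu}, \tau^{\mu}\right), \qquad -\beta f = \frac{1}{N}\,\bigl\langle \ln Z_{D} \bigr\rangle_{D} \;\;\text{(quenched, replica trick)} \quad \text{vs.} \quad -\beta f_{\mathrm{ann}} = \frac{1}{N}\, \ln \bigl\langle Z_{D} \bigr\rangle_{D} \;\;\text{(annealed approximation)} . $$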
machine learning at high temperatures
• a simplifying limit: training at high (formal) temperature
• independent i.i.d. training data
• extensive number of examples (proportional to the number of weights), combined with high T such that the rescaled training set size stays finite:
“ learn almost nothing ... (high T ) ... from infinitely many examples ”
limitations:
- training error and generalization error are identical
- number of examples and training temperature are coupled
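A sketch of the resulting free energy in the standard high-temperature form (going back to Seung, Sompolinsky and Tishby; the symbol α̃ for the rescaled training set size is my choice of notation):

$$ \beta f = \tilde{\alpha}\, \epsilon_{g}(\{\text{order parameters}\}) \;-\; s(\{\text{order parameters}\}), \qquad \tilde{\alpha} = \beta\, \alpha \;\; \text{finite for } \beta \to 0,\ \alpha \to \infty , $$

where ε_g is the generalization error and s the (microcanonical) entropy of weight configurations with the given order parameters.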
student teacher scenario: “soft committees”
[diagram: adaptive student with N inputs and K hidden units, weights unknown (“?”), vs. teacher with M hidden units]
training: minimization of the deviation from the teacher outputs over the example data
here: learnable rules, reliable data (outputs provided by the teacher), perfectly matching complexity K=M
consider two prototypical activation functions: sigmoidal / ReLU in student and teacher
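A minimal sketch of the soft committee architecture in the usual notation (the 1/√N field normalization and the quadratic per-example error are assumptions of this sketch, not read off the slides):

$$ \sigma(\boldsymbol{\xi}) = \sum_{k=1}^{K} g\!\left(\frac{\mathbf{w}_{k} \cdot \boldsymbol{\xi}}{\sqrt{N}}\right), \qquad \tau(\boldsymbol{\xi}) = \sum_{m=1}^{M} g\!\left(\frac{\mathbf{w}^{*}_{m} \cdot \boldsymbol{\xi}}{\sqrt{N}}\right), \qquad \epsilon(\mathbf{w}; \boldsymbol{\xi}) = \tfrac{1}{2} \bigl[ \sigma(\boldsymbol{\xi}) - \tau(\boldsymbol{\xi}) \bigr]^{2} , $$

with fixed hidden-to-output weights (the “committee”) and g either sigmoidal, e.g. g(x) = erf(x/√2), or ReLU, g(x) = max(0, x).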
large N: Central Limit Theorem
exploit the thermodynamic limit: by the CLT the hidden unit fields are normally distributed with zero mean and a covariance matrix given by the order parameters
order parameters and model parameters: macroscopic properties of the system
entropy (+ constant) independent of details (e.g. the activation)
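For concreteness, the standard definitions in this setup (notation follows the usual student/teacher literature and matches the fields x, x* used below):

$$ x_{k} = \frac{\mathbf{w}_{k} \cdot \boldsymbol{\xi}}{\sqrt{N}}, \quad x^{*}_{m} = \frac{\mathbf{w}^{*}_{m} \cdot \boldsymbol{\xi}}{\sqrt{N}}; \qquad Q_{ik} = \frac{\mathbf{w}_{i} \cdot \mathbf{w}_{k}}{N}, \quad R_{im} = \frac{\mathbf{w}_{i} \cdot \mathbf{w}^{*}_{m}}{N}, \quad T_{mn} = \frac{\mathbf{w}^{*}_{m} \cdot \mathbf{w}^{*}_{n}}{N} , $$

so that for large N the fields (x_k, x*_m) become jointly Gaussian with zero mean and covariances given by the order parameters Q, R and the fixed teacher overlaps T.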
generalization error
on average over P({x_i , x_j*}); closed-form expressions are available for sigmoidal activation [D. Saad, S. Solla, 1995] and for ReLU activation [M. Straat, 2019]
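For reference, the well-known closed form for sigmoidal units g(x) = erf(x/√2), as derived by Saad and Solla (quoted here from the literature, not transcribed from the slide):

$$ \epsilon_{g} = \frac{1}{\pi} \left[ \sum_{i,k} \arcsin \frac{Q_{ik}}{\sqrt{(1+Q_{ii})(1+Q_{kk})}} + \sum_{m,n} \arcsin \frac{T_{mn}}{\sqrt{(1+T_{mm})(1+T_{nn})}} - 2 \sum_{i,m} \arcsin \frac{R_{im}}{\sqrt{(1+Q_{ii})(1+T_{mm})}} \right] . $$

The corresponding expression for ReLU activation is the one cited from [M. Straat, 2019].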
site symmetry
simplification: orthonormal teacher vectors, isotropic input density
the site-symmetric ansatz reflects the permutation symmetry and allows for hidden unit specialization
generalization error evaluated for sigmoidal hidden units and for ReLU activations; entropy (+ constant)
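A sketch of the site-symmetric ansatz in the usual notation (R and S denote the diagonal and off-diagonal student–teacher overlaps as on the following slides; Q and C are my labels for the student–student overlaps):

$$ T_{mn} = \delta_{mn}, \qquad R_{im} = R\,\delta_{im} + S\,(1 - \delta_{im}), \qquad Q_{ik} = Q\,\delta_{ik} + C\,(1 - \delta_{ik}) , $$

with R = S describing unspecialized and R > S (partially) specialized hidden units.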
typical learning curves
given: K, g(x), and the (scaled) training set size
solve: determine the (global and local) minima of βf over the order parameters
obtain learning curves: order parameters and generalization error (typical, average) as a function of the (scaled) training set size
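Schematically, using the high-temperature form sketched above (again with α̃ as my notation for the rescaled training set size):

$$ (\hat{R}, \hat{S}, \dots) = \arg\min \; \beta f = \arg\min \; \bigl[ \tilde{\alpha}\, \epsilon_{g}(R, S, \dots) - s(R, S, \dots) \bigr], \qquad \epsilon_{g}^{\mathrm{typ}}(\tilde{\alpha}) = \epsilon_{g}(\hat{R}, \hat{S}, \dots) . $$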
sigmoidal (K=2)
invariance under exchange of the two hidden units
R=S: both units ~ (w1* + w2*) + noise
symmetry breaking phase transition (continuous, “second order”) ... results in a kink in the typical learning curve
continuous transition (schematic)
[schematic plots of βf: below the transition a single minimum at R=S; above it, equivalent minima with R>S and R<S emerge continuously]
ReLU (K=2)
qualitatively identical behavior
Note: the absolute numerical values are irrelevant; their scale depends, among other things, on the pre-factor of g(z)
sigmoidal (K>2), K=5
permutation symmetry of h.u.: initial R=S phase
first order transition, local min.: R>S competes with R=S; R>S becomes the global minimum and facilitates perfect learning
additional transition: “anti-specialization” S>R (overlooked in 1998!)
discontinuous jump in ε_g: coexistence of poor and good generalization; weak / no effect of the additional anti-specialization on the generalization error
discontinuous transition (schematic)
[schematic plots of βf for increasing training set size, minima labeled R=S, R>S, R<S: the specialized R>S minimum first appears as a local minimum, becomes degenerate with, and finally lower than, the unspecialized R=S minimum]
ReLU (K>2), K=10
permutation symmetry of h.u.: initial R=S phase
continuous phase transition: global minimum R>S, local minimum R<S
continuous kink(s) in ε_g: competing minima of poor* vs. good generalization (* pretty good)
ReLU (large K)
permutation symmetry of h.u.: initial R=S phase
continuous phase transition at a critical value of the (scaled) training set size; degenerate minima: R>S, R<S
specialized and anti-specialized branch both achieve perfect generalization, asymptotically!
R=1: perfect agreement of x with x*; perfectly aligned, specialized student = teacher
R≈0: conditional avg. (linear!); “anti-specialized” student, large K
large hidden layer: sigmoidal (large K)
the unspecialized R=S state remains meta-stable up to very large (scaled) training set sizes
perfect generalization without prior knowledge impossible with order O(NK) examples?
Monte Carlo simulations
continuous Metropolis dynamics, K=4, N=50, β=1 (=T)
generalization error vs. time, specialized and unspecialized initialization
histogram of observed R: unspecialized, anti-specialized and specialized configurations
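A minimal, self-contained sketch of such a Metropolis training run (Python; the quadratic per-example error, the ReLU choice, the number of examples and the step size are illustrative assumptions, not the exact settings behind these plots):

import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def output(W, X, act=relu):
    # soft committee: sum of hidden unit activations, fields normalized by sqrt(N)
    return act(X @ W.T / np.sqrt(W.shape[1])).sum(axis=1)

N, K, P, beta, step = 50, 4, 500, 1.0, 0.05
W_star = rng.standard_normal((K, N))
W_star *= np.sqrt(N) / np.linalg.norm(W_star, axis=1, keepdims=True)  # |w*_m|^2 = N

X_train = rng.standard_normal((P, N))          # i.i.d. inputs, zero mean, unit variance
y_train = output(W_star, X_train)              # reliable labels from the teacher
X_test = rng.standard_normal((5000, N))        # large test set to estimate eps_g
y_test = output(W_star, X_test)

def energy(W):                                  # quadratic training error
    return 0.5 * np.sum((output(W, X_train) - y_train) ** 2)

W = rng.standard_normal((K, N))                # unspecialized (random) initialization
E = energy(W)

for t in range(50_000):
    k, i = rng.integers(K), rng.integers(N)     # perturb a single weight component
    W_new = W.copy()
    W_new[k, i] += step * rng.standard_normal()
    E_new = energy(W_new)                       # full recomputation keeps the sketch simple
    if E_new < E or rng.random() < np.exp(-beta * (E_new - E)):   # Metropolis acceptance
        W, E = W_new, E_new
    if t % 5_000 == 0:
        eps_g = 0.5 * np.mean((output(W, X_test) - y_test) ** 2)
        R = W @ W_star.T / N                    # student-teacher overlap matrix
        print(t, round(float(eps_g), 4), np.round(np.sort(R.max(axis=1)), 2))

Starting instead from W close to W_star (specialized initialization) and recording R over many runs allows the two initializations of the slide to be compared and R-histograms to be collected.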
Monte Carlo simulations
sigmoidal activation vs. ReLU, K=4
Summary
• formal equilibrium of training at high temperature in student/teacher model situations of supervised learning
• unspecialized and (partially) specialized configurations compete as local/global minima of the free energy
• phase transitions with temperature / number of examples:
 K=2: continuous symmetry-breaking transitions with equivalent competing states
 K>2, sigmoidal activations: first order transition with competing states of distinct generalization ability
 K>2, ReLU networks: continuous transition with competing states of similar performance
Outlook
• consider various activation functions (leaky ReLU ✓, swish ...)
most important question: which is the decisive property of the activation?
piece-wise linear „sigmoidal“ activation vs. ReLU: with increasing slope, change from discontinuous to continuous transition
Outlook
• annealed approximation / replica trick:
- low temperatures, vary number of examples and temperature independently
- mismatched student/teacher networks
- overfitting / underfitting effects
• universal approximators
- adaptive thresholds in hidden units
- adaptive hidden-to-output weights
• deep networks
- multi-layered networks
- tree-like architectures
• realistic input data
- clustered / correlated data
- recent developments: Zdeborova, Mezard, Goldt et al.
