CITEC June 2020
Phase transitions in layered neural networks:
rectified linear units vs. sigmoidal activation
Elisa Oostwal
Michiel Straat
Michael Biehl
Bernoulli Institute for Mathematics,
Computer Science and Artificial Intelligence
University of Groningen / NL
arXiv:1910.07476: Hidden unit Specialization in Layered Neural Networks:
ReLU vs. Sigmoidal Activation [updated May 27, 2020]
motivation/research question
ReLU activation, e.g. in deep learning
folklore:
• computationally cheap and fast
• circumvents “vanishing gradient” problem
• sparse activity (biologically plausible)
? more efficient training
? favorable generalization ability
Theoretical study
• model scenarios, student/teacher setup
• typical learning curves, statistical physics based approach
• systematic comparison of ReLU and “classical” sigmoidal activation
• significant differences or “only” practical issues?
statistical physics (of learning) in a nutshell
objective/cost/energy function of the adaptive weights
• training by stochastic optimization of adaptive weights (thresholds etc.)
• Metropolis algorithm, noisy gradient descent (Langevin) with long-time equilibrium (Gibbs-Boltzmann); control parameter: „inverse temperature“ β = 1 / T
• equilibrium state: compromise/competition between minimal energy (ground state) vs. huge number of available states with higher energy
• „thermal avg.“ over Peq , (microcanonical) entropy, minimal free energy (per weight)
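For orientation, a minimal sketch of the Gibbs-Boltzmann formalism assumed here (the symbols E for the cost function, w for the weights and Z for the partition function are standard notation, not read off the slides):

$$ P_{\mathrm{eq}}(\mathbf{w}) = \frac{e^{-\beta E(\mathbf{w})}}{Z}, \qquad Z = \int \! d\mathbf{w}\; e^{-\beta E(\mathbf{w})}, \qquad \langle A \rangle = \int \! d\mathbf{w}\; A(\mathbf{w})\, P_{\mathrm{eq}}(\mathbf{w}), \qquad -\beta f = \frac{1}{N} \ln Z . $$

Minimal free energy then means that the equilibrium balances low energy against a large entropy of accessible configurations.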
statistical physics (of learning) in a nutshell
• computation of system properties in equilibrium: thermal averages
• T = 1 / β controls the competition between energy and entropy:
T → 0 singles out the lowest energy (ground state)
T → ∞: all states occur with equal probability, independent of energy; thermal noise dominates the training
• practical machine learning: algorithm (hyper-) parameters control the minimization of the cost function, e.g. learning rate, weight decay ...
• statistical physics / stochastic optimization
 design of algorithms (Metropolis, simulated annealing)
 theoretical analysis of model scenarios
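As a compact worked example of the two limits (standard statistical mechanics, not specific to these slides):

$$ \beta \to \infty:\quad P_{\mathrm{eq}}(\mathbf{w}) \to \text{uniform on } \arg\min_{\mathbf{w}} E(\mathbf{w}); \qquad \beta \to 0:\quad P_{\mathrm{eq}}(\mathbf{w}) \to \text{const.},\ \text{independent of } E(\mathbf{w}) . $$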
machine learning
• optimization of adaptive quantities, e.g. the weights of a network, based on a given specific set of example data (input/output pairs); the cost function is defined w.r.t. this data set
• typical results: an additional average of the thermal averages over the data set is needed (difficult!), even for the simplest model densities: input components independent identically distributed (zero mean, unit variance), i.e. an unstructured input density, with information about the target entering only through the labels
• a proper data average of the free energy requires the replica trick or an annealed approximation
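A minimal sketch of what this data average looks like, assuming a per-example error ε summed over a data set D of P examples (the symbols D, ξ^μ, τ^μ are my notation, not taken from the slides):

$$ E_{D}(\mathbf{w}) = \sum_{\mu=1}^{P} \epsilon\!\left(\mathbf{w};\, \boldsymbol{\xi}^{\mu}, \tau^{\mu}\right), \qquad -\beta f = \frac{1}{N}\,\bigl\langle \ln Z_{D} \bigr\rangle_{D} \;\;\text{(quenched, replica trick)} \quad \text{vs.} \quad -\beta f_{\mathrm{ann}} = \frac{1}{N}\, \ln \bigl\langle Z_{D} \bigr\rangle_{D} \;\;\text{(annealed approximation)} . $$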
machine learning at high temperatures
• a simplifying limit: training at high (formal) temperature
• independent i.i.d. training data
• extensive number of examples (proportional to the number of weights), combined with high T such that the rescaled training set size stays finite:
“ learn almost nothing ... (high T ) ... from infinitely many examples ”
limitations:
- training error and generalization error are identical
- number of examples and training temperature are coupled
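A sketch of the resulting free energy in the standard high-temperature form (going back to Seung, Sompolinsky and Tishby; the symbol α̃ for the rescaled training set size is my choice of notation):

$$ \beta f = \tilde{\alpha}\, \epsilon_{g}(\{\text{order parameters}\}) \;-\; s(\{\text{order parameters}\}), \qquad \tilde{\alpha} = \beta\, \alpha \;\; \text{finite for } \beta \to 0,\ \alpha \to \infty , $$

where ε_g is the generalization error and s the (microcanonical) entropy of weight configurations with the given order parameters.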
student teacher scenario: “soft committees”
[diagram: adaptive student with N inputs and K hidden units, weights unknown (“?”), vs. teacher with M hidden units]
training: minimization of the deviation from the teacher outputs over the example data
here: learnable rules, reliable data (outputs provided by the teacher), perfectly matching complexity K=M
consider two prototypical activation functions: sigmoidal / ReLU in student and teacher
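A minimal sketch of the soft committee architecture in the usual notation (the 1/√N field normalization and the quadratic per-example error are assumptions of this sketch, not read off the slides):

$$ \sigma(\boldsymbol{\xi}) = \sum_{k=1}^{K} g\!\left(\frac{\mathbf{w}_{k} \cdot \boldsymbol{\xi}}{\sqrt{N}}\right), \qquad \tau(\boldsymbol{\xi}) = \sum_{m=1}^{M} g\!\left(\frac{\mathbf{w}^{*}_{m} \cdot \boldsymbol{\xi}}{\sqrt{N}}\right), \qquad \epsilon(\mathbf{w}; \boldsymbol{\xi}) = \tfrac{1}{2} \bigl[ \sigma(\boldsymbol{\xi}) - \tau(\boldsymbol{\xi}) \bigr]^{2} , $$

with fixed hidden-to-output weights (the “committee”) and g either sigmoidal, e.g. g(x) = erf(x/√2), or ReLU, g(x) = max(0, x).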
large N: Central Limit Theorem
exploit the thermodynamic limit: by the CLT the hidden unit fields are normally distributed with zero mean and a covariance matrix given by the order parameters
order parameters and model parameters: macroscopic properties of the system
entropy (+ constant) independent of details (e.g. the activation)
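For concreteness, the standard definitions in this setup (notation follows the usual student/teacher literature and matches the fields x, x* used below):

$$ x_{k} = \frac{\mathbf{w}_{k} \cdot \boldsymbol{\xi}}{\sqrt{N}}, \quad x^{*}_{m} = \frac{\mathbf{w}^{*}_{m} \cdot \boldsymbol{\xi}}{\sqrt{N}}; \qquad Q_{ik} = \frac{\mathbf{w}_{i} \cdot \mathbf{w}_{k}}{N}, \quad R_{im} = \frac{\mathbf{w}_{i} \cdot \mathbf{w}^{*}_{m}}{N}, \quad T_{mn} = \frac{\mathbf{w}^{*}_{m} \cdot \mathbf{w}^{*}_{n}}{N} , $$

so that for large N the fields (x_k, x*_m) become jointly Gaussian with zero mean and covariances given by the order parameters Q, R and the fixed teacher overlaps T.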
generalization error
on average over P({x_i , x_j*}); closed-form expressions are available for sigmoidal activation [D. Saad, S. Solla, 1995] and for ReLU activation [M. Straat, 2019]
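For reference, the well-known closed form for sigmoidal units g(x) = erf(x/√2), as derived by Saad and Solla (quoted here from the literature, not transcribed from the slide):

$$ \epsilon_{g} = \frac{1}{\pi} \left[ \sum_{i,k} \arcsin \frac{Q_{ik}}{\sqrt{(1+Q_{ii})(1+Q_{kk})}} + \sum_{m,n} \arcsin \frac{T_{mn}}{\sqrt{(1+T_{mm})(1+T_{nn})}} - 2 \sum_{i,m} \arcsin \frac{R_{im}}{\sqrt{(1+Q_{ii})(1+T_{mm})}} \right] . $$

The corresponding expression for ReLU activation is the one cited from [M. Straat, 2019].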
site symmetry
simplification: orthonormal teacher vectors, isotropic input density
the site-symmetric ansatz reflects the permutation symmetry and allows for hidden unit specialization
generalization error evaluated for sigmoidal hidden units and for ReLU activations; entropy (+ constant)
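A sketch of the site-symmetric ansatz in the usual notation (R and S denote the diagonal and off-diagonal student–teacher overlaps as on the following slides; Q and C are my labels for the student–student overlaps):

$$ T_{mn} = \delta_{mn}, \qquad R_{im} = R\,\delta_{im} + S\,(1 - \delta_{im}), \qquad Q_{ik} = Q\,\delta_{ik} + C\,(1 - \delta_{ik}) , $$

with R = S describing unspecialized and R > S (partially) specialized hidden units.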
typical learning curves
given: K, g(x), and the (scaled) training set size
solve: determine the (global and local) minima of βf over the order parameters
obtain learning curves: order parameters and generalization error (typical, average) as a function of the (scaled) training set size
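Schematically, using the high-temperature form sketched above (again with α̃ as my notation for the rescaled training set size):

$$ (\hat{R}, \hat{S}, \dots) = \arg\min \; \beta f = \arg\min \; \bigl[ \tilde{\alpha}\, \epsilon_{g}(R, S, \dots) - s(R, S, \dots) \bigr], \qquad \epsilon_{g}^{\mathrm{typ}}(\tilde{\alpha}) = \epsilon_{g}(\hat{R}, \hat{S}, \dots) . $$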
sigmoidal (K=2)
invariance under exchange of the two hidden units
R=S: both units ~ (w1* + w2*) + noise
symmetry breaking phase transition (continuous, “second order”) ... results in a kink in the typical learning curve
continuous transition (schematic)
[schematic plots of βf: below the transition a single minimum at R=S; above it, equivalent minima with R>S and R<S emerge continuously]
ReLU (K=2)
qualitatively identical behavior
Note: the absolute numerical values are irrelevant; their scale depends, among other things, on the pre-factor of g(z)
sigmoidal (K>2), K=5
permutation symmetry of h.u.: initial R=S phase
first order transition, local min.: R>S competes with R=S; R>S becomes the global minimum and facilitates perfect learning
additional transition: “anti-specialization” S>R (overlooked in 1998!)
discontinuous jump in ε_g: coexistence of poor and good generalization; weak / no effect of the additional anti-specialization on the generalization error
discontinuous transition (schematic)
[schematic plots of βf for increasing training set size, minima labeled R=S, R>S, R<S: the specialized R>S minimum first appears as a local minimum, becomes degenerate with, and finally lower than, the unspecialized R=S minimum]
ReLU (K>2), K=10
permutation symmetry of h.u.: initial R=S phase
continuous phase transition: global minimum R>S, local minimum R<S
continuous kink(s) in ε_g: competing minima of poor* vs. good generalization (* pretty good)
ReLU (large K)
permutation symmetry of h.u.: initial R=S phase
continuous phase transition at a critical value of the (scaled) training set size; degenerate minima: R>S, R<S
specialized and anti-specialized branch both achieve perfect generalization, asymptotically!
R=1: perfect agreement of x with x*; perfectly aligned, specialized student = teacher
R≈0: conditional avg. (linear!); “anti-specialized” student, large K
large hidden layer: sigmoidal (large K)
the unspecialized R=S state remains meta-stable up to very large (scaled) training set sizes
perfect generalization without prior knowledge impossible with order O(NK) examples?
Monte Carlo simulations
continuous Metropolis dynamics, K=4, N=50, β=1 (=T)
generalization error vs. time, specialized and unspecialized initialization
histogram of observed R: unspecialized, anti-specialized and specialized configurations
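A minimal, self-contained sketch of such a Metropolis training run (Python; the quadratic per-example error, the ReLU choice, the number of examples and the step size are illustrative assumptions, not the exact settings behind these plots):

import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def output(W, X, act=relu):
    # soft committee: sum of hidden unit activations, fields normalized by sqrt(N)
    return act(X @ W.T / np.sqrt(W.shape[1])).sum(axis=1)

N, K, P, beta, step = 50, 4, 500, 1.0, 0.05
W_star = rng.standard_normal((K, N))
W_star *= np.sqrt(N) / np.linalg.norm(W_star, axis=1, keepdims=True)  # |w*_m|^2 = N

X_train = rng.standard_normal((P, N))          # i.i.d. inputs, zero mean, unit variance
y_train = output(W_star, X_train)              # reliable labels from the teacher
X_test = rng.standard_normal((5000, N))        # large test set to estimate eps_g
y_test = output(W_star, X_test)

def energy(W):                                  # quadratic training error
    return 0.5 * np.sum((output(W, X_train) - y_train) ** 2)

W = rng.standard_normal((K, N))                # unspecialized (random) initialization
E = energy(W)

for t in range(50_000):
    k, i = rng.integers(K), rng.integers(N)     # perturb a single weight component
    W_new = W.copy()
    W_new[k, i] += step * rng.standard_normal()
    E_new = energy(W_new)                       # full recomputation keeps the sketch simple
    if E_new < E or rng.random() < np.exp(-beta * (E_new - E)):   # Metropolis acceptance
        W, E = W_new, E_new
    if t % 5_000 == 0:
        eps_g = 0.5 * np.mean((output(W, X_test) - y_test) ** 2)
        R = W @ W_star.T / N                    # student-teacher overlap matrix
        print(t, round(float(eps_g), 4), np.round(np.sort(R.max(axis=1)), 2))

Starting instead from W close to W_star (specialized initialization) and recording R over many runs allows the two initializations of the slide to be compared and R-histograms to be collected.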
Monte Carlo simulations
sigmoidal activation vs. ReLU, K=4
Summary
• formal equilibrium of training at high temperature in student/teacher model situations of supervised learning
• unspecialized and (partially) specialized configurations compete as local/global minima of the free energy
• phase transitions with temperature / number of examples:
 K=2: continuous symmetry-breaking transitions with equivalent competing states
 K>2, sigmoidal activations: first order transition with competing states of distinct generalization ability
 K>2, ReLU networks: continuous transition with competing states of similar performance
Outlook
• consider various activation functions (leaky ReLU ✓, swish ...)
most important question: which is the decisive property of the activation?
piece-wise linear „sigmoidal“ activation vs. ReLU: with increasing slope, change from discontinuous to continuous transition
Outlook
• annealed approximation / replica trick:
- low temperatures, vary number of examples and temperature independently
- mismatched student/teacher networks
- overfitting / underfitting effects
• universal approximators
- adaptive thresholds in hidden units
- adaptive hidden-to-output weights
• deep networks
- multi-layered networks
- tree-like architectures
• realistic input data
- clustered / correlated data
- recent developments: Zdeborova, Mezard, Goldt et al.
