Emergence of Invariance and
Disentangling in Deep Representations
2018.02.14.
Sangwoo Mo
Overview
• Emergence of Invariance and Disentangling in Deep Representations
• Authors: Achille & Soatto (UCLA)
• Appeared in ICML 2017 Workshop
• Contribution
• Investigates the relation between desirable properties of representations
• Proposes a measure of network complexity
1. Relation between Properties
Properties for Representation
• A representation z is a stochastic function of the data x that is useful for a given task y, while a
nuisance n affects the data
• A “good representation” should satisfy
• sufficient: I(z; y) = I(x; y)
• minimal: minimize I(z; x) among sufficient z
• invariant: minimize I(z; n)
• disentangled: minimize TC(z) = KL(p(z) ∥ ∏_i p(z_i))
• However, we will show that only minimal sufficiency is essential; i.e., invariance and
disentanglement are automatically satisfied under some model assumptions
* TC: total correlation
** Actually, the assumption is not mild; still, the result is quite interesting
(Graphical model from the slide: the task y and the nuisance n generate the data x, from which the representation z is computed.)
Minimal Sufficiency ⇒ IB Lagrangian
• Information Bottleneck (IB) Lagrangian
ℒ = H(y ∣ z) + β ⋅ I(z; x)
• Minimizing the IB Lagrangian yields the minimal sufficient representation
• By the Data Processing Inequality (DPI), a deep network
x → z_1 → ⋯ → z_L
• satisfies I(z_L; x) ≤ I(z_1; x); i.e., stacking layers increases minimality
• Q. In real scenarios we do not optimize the IB Lagrangian. Does the result still apply?
• A. SGD implicitly enforces minimality
* A ResNet also satisfies the Markov chain when we define z as a “block”
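The deck does not show how the IB Lagrangian is optimized in practice. A common approach (standard variational IB, not taken from these slides) assumes a Gaussian encoder q(z|x) and upper-bounds I(z; x) by KL(q(z|x) ∥ N(0, I)). A minimal PyTorch-style sketch, with the hypothetical function ib_lagrangian:

```python
import torch
import torch.nn.functional as F

def ib_lagrangian(logits, y, z_mu, z_logvar, beta):
    """Variational surrogate for L = H(y|z) + beta * I(z; x).
    The I(z; x) term is replaced by its standard upper bound
    KL(N(z_mu, exp(z_logvar)) || N(0, I)) under a Gaussian encoder assumption."""
    ce = F.cross_entropy(logits, y)  # sufficiency term H(y|z)
    kl = 0.5 * (z_mu.pow(2) + z_logvar.exp() - z_logvar - 1.0).sum(dim=1).mean()
    return ce + beta * kl
```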
Model Assumption
• Now we will show that, under some model assumptions,
1. minimality implies invariance and disentanglement
2. SGD implicitly implies minimality
• Model Assumption
• Assume a log-uniform prior on w; i.e., p(w_i) ∝ 1/|w_i|
• Assume the posterior w_i ∣ D = ε_i ⋅ ŵ_i, where ε_i ∼ log N(−α_i/2, α_i)
• α_i will also be optimized (Variational Dropout; Kingma et al., 2015)
• Then the weight information is
I(w; D) = −(1/2) ∑_{i=1}^{dim(w)} log α_i + const
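Under these assumptions the weight information has the closed form above; the sketch below (hypothetical helper names weight_information and sample_weights) simply evaluates it and draws a posterior sample:

```python
import numpy as np

def weight_information(alpha, const=0.0):
    """I(w; D) = -0.5 * sum_i log(alpha_i) + C under the slide's assumptions
    (log-uniform prior, multiplicative log-normal posterior). Small alpha_i
    (little noise on w_i) means that weight stores more information."""
    return -0.5 * np.sum(np.log(alpha)) + const

def sample_weights(w_hat, alpha, rng=None):
    """Sample w_i = eps_i * w_hat_i with eps_i ~ logN(-alpha_i / 2, alpha_i),
    treating alpha_i as the variance of the underlying normal."""
    rng = rng or np.random.default_rng(0)
    eps = rng.lognormal(mean=-alpha / 2.0, sigma=np.sqrt(alpha))
    return eps * w_hat
```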
Minimality ⇒ Invariance & Disentanglement
• Proposition 1. For a single layer z = W ⋅ x,
g(I(W; D)) ≤ I(z; x) + TC(z) ≤ g(I(W; D)) + c
• where g is some strictly increasing function and c = O(1/dim(x))
• Corollary 1. For an MLP,
I(z_L; x) ≤ min_{k<L} I(z_{k+1}; z_k) ≤ min_{k<L} I(W_k ⋅ z_k; z_k)
• Here, we can only obtain the upper bound
• ⇒ minimality implies invariance and disentanglement
SGD ⇒ Minimize Weight Information
• Proposition 2. Let H be the Hessian at the local minimum ŵ. Assume (ŵ, β) is an optimal
solution of the IB Lagrangian ℒ = H_{p,q}(y ∣ x, w) + β ⋅ I(w; D). Then
I(w; D) ≤ K [log ‖H‖_* + log ‖ŵ‖_2^2 − log βK^2]
• where K = dim(w) and ‖⋅‖_* is the nuclear norm
• Empirical evidence: SGD converges to flat minima; i.e., ‖H‖_* = tr(H) is small
• ⇒ SGD implicitly minimizes the weight information
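To see what the bound says numerically, here is a small sketch (hypothetical name weight_info_upper_bound) that evaluates the right-hand side as reconstructed above, using tr(H) for the nuclear norm since the Hessian is positive semi-definite at a minimum:

```python
import numpy as np

def weight_info_upper_bound(H, w_hat, beta):
    """Right-hand side of Proposition 2 (as reconstructed on this slide):
    K * [log ||H||_* + log ||w_hat||_2^2 - log(beta * K^2)], with K = dim(w).
    For a PSD Hessian at a local minimum, ||H||_* = tr(H)."""
    K = w_hat.size
    nuc = np.trace(H)  # nuclear norm of a PSD matrix
    return K * (np.log(nuc) + np.log(np.sum(w_hat ** 2)) - np.log(beta * K ** 2))
```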
2. Measure Network Complexity
Revisit Overfitting
• Let p_θ(x, y) be the data distribution and q(⋅ ∣ x; w) be the neural network
• Decompose the cross-entropy loss
H_{p,q}(D ∣ w) = H(D ∣ θ) + I(θ; D ∣ w) + E[KL(p ∥ q)] − I(D; w ∣ θ)
(the four terms: intrinsic error, sufficiency, model efficiency, overfitting)
• Since I(D; w ∣ θ) is intractable, we use I(D; w) as a regularizer; i.e., solve the IB Lagrangian
ℒ = H_{p,q}(D ∣ w) + β ⋅ I(D; w)
• Also, we will use I(D; w) as a measure of model complexity
• I(D; w) is small when underfitting, large when overfitting
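Putting the pieces together, the regularized objective of this slide can be sketched as below (hypothetical name regularized_loss), plugging in the variational-dropout closed form for I(D; w) from the Model Assumption slide; this is an illustration, not the authors' implementation:

```python
import numpy as np

def regularized_loss(cross_entropy, alpha, beta):
    """L = H_{p,q}(D | w) + beta * I(D; w), with I(D; w) approximated by
    -0.5 * sum_i log(alpha_i) (additive constant dropped).
    `alpha` holds the per-weight noise variances of the dropout posterior."""
    weight_info = -0.5 * np.sum(np.log(alpha))
    return cross_entropy + beta * weight_info
```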
Revisit Rethinking Generalization
• [Zhang’ 2017] claimed that we need a new generalization theory for deep learning
• Random-Label Test: deep networks easily fit random labels
• With random labels the network overfits, and the weight information I(w; D) increases
• …and this recovers the bias-variance tradeoff
* [Zhang’ 2017] Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
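For reference, the random-label test only requires replacing the training labels with uniform noise and retraining; a minimal sketch (hypothetical helper randomize_labels):

```python
import numpy as np

def randomize_labels(y, num_classes, seed=0):
    """Random-label test of Zhang et al. (2017): replace the true labels with
    uniformly random ones and retrain. A network that still reaches ~zero
    training error is memorizing, and (per this deck) I(w; D) grows accordingly."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=len(y))
```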
Effect of β
• As β increases, I(w; D) decreases, and z becomes more invariant (= loses information)
Conclusion
• Conclusion
1. The authors proposed properties of a “good representation”, and showed that minimal sufficiency
is sufficient for invariance and disentanglement
2. The authors proposed a complexity measure for neural networks (the weight information), which resolves
the paradox of the rethinking-generalization paper
• Research Questions
1. Minimality ⇒ invariance holds in general, but what about minimality ⇒ disentanglement?
Under which assumptions can we guarantee disentanglement?
2. Weight information seems to be an alternative measure for generalization theory.
How can we estimate I(w; D) efficiently for a general neural network?