Learning Sparse Networks Using Targeted Dropout
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Google Brain, University of Oxford, for.ai, Geoffrey Hinton | NeurIPS 2018
2020.08.09
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
Targeted Dropout
Introduction – Background
• A large number of learnable parameters can lead to overfitting.
• There has been a lot of work on compressing neural networks.
<Sparsification Techniques>
1. Sparsity-inducing regularisers
- L1 penalty
- L2 penalty
2. Post hoc pruning (train a full-size network → prune)
- Removing the weights with the smallest magnitude [1]
- Ranking the weights by the sensitivity of task performance to their removal and pruning the least sensitive [2]
[1] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
[2] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
3. Dropout Regularisation
- Standard Dropout
- Variational Dropout
Targeted Dropout
Introduction – Proposal
• Issues with previous work
- Standard training does not encourage networks to be amenable to pruning
- Applying sparsification techniques with little negative impact on task performance is difficult
• Propose “Targeted Dropout”
- Apply dropout specifically to the set of units that are believed to be less useful
- Rank weights or units and apply dropout primarily to those elements with small magnitudes
- The network learns to be robust to the choice of post hoc pruning strategy.
Targeted Dropout
Introduction – Contribution
• Makes networks extremely robust to the post hoc pruning strategy of choice
• Gives intimate control over the desired sparsity patterns
• Easy to implement
• Achieved impressive sparsity rates on a wide range of architectures and datasets
- 99% sparsity on ResNet-32 with less than a 4% drop in test set accuracy on CIFAR-10
Related Work
Dropout
• Two kinds of Bernoulli dropout techniques (both are sketched below)
1. Unit Dropout
- Randomly drops units at each training step to reduce dependence between units and prevent overfitting
2. Weight Dropout
- Randomly drops individual weights in the weight matrices at each training step, i.e., drops individual connections between layers
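A minimal NumPy sketch of the two Bernoulli variants, written for this review rather than taken from the paper; the layer shape, drop rate, and inverted-dropout rescaling are my own illustrative choices.

```python
# Illustrative sketch of unit dropout vs. weight dropout for one dense layer.
import numpy as np

rng = np.random.default_rng(0)

def unit_dropout(W, drop_rate=0.5):
    """Zero entire columns (output units) of the weight matrix."""
    keep = rng.random(W.shape[1]) >= drop_rate      # one Bernoulli draw per unit
    return W * keep[None, :] / (1.0 - drop_rate)    # rescale so expectations match

def weight_dropout(W, drop_rate=0.5):
    """Zero individual entries (connections) of the weight matrix."""
    keep = rng.random(W.shape) >= drop_rate         # one Bernoulli draw per weight
    return W * keep / (1.0 - drop_rate)

W = rng.normal(size=(64, 32))                       # a 64 -> 32 dense layer
W_unit = unit_dropout(W)                            # whole columns are zeroed
W_weight = weight_dropout(W)                        # scattered zeros
```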
Related Work
Magnitude-based pruning
• Treat the top-k largest-magnitude elements as important (selected with an argmax-k operation); both criteria are sketched below
1. Unit Pruning
- Considers the units (column vectors) of the weight matrices under the L2 norm (usually faster, with less computation)
2. Weight Pruning
- Considers the entries of each feature vector (column) under the L1 norm, i.e., each weight's absolute value (usually more accurate)
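The two criteria above can be written in a few lines; this is my own NumPy sketch (treating units as the columns of a single weight matrix, as on this slide), not code from the paper.

```python
# Illustrative sketch of magnitude-based pruning: keep the top-k, zero the rest.
import numpy as np

def prune_units(W, k):
    """Unit pruning: rank columns by their L2 norm, keep the k largest."""
    norms = np.sqrt((W ** 2).sum(axis=0))            # L2 norm of each column
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[np.argsort(norms)[-k:]] = True              # argmax-k over unit norms
    return W * mask[None, :]

def prune_weights(W, k):
    """Weight pruning: rank individual entries by absolute value, keep the k largest."""
    flat = np.abs(W).ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[-k:]] = True               # argmax-k over |w|
    return W * mask.reshape(W.shape)
```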
Related Work
Sparsification Methods
• L1 regularisation [3]
- A cost added to the loss function, intended to drive unimportant weights to zero (a minimal sketch of this penalty follows after the L0 item below).
[3] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems,
pages 1135–1143, 2015
[4] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
• L0 regularisation [4]
- Apply an augmentation of Concrete Dropout to parameters.
- Weights follow a Hard-Concrete distribution where each weight is associated with a gating
parameter that determines the drop rate.
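As a concrete and deliberately minimal example of the L1 approach above, the penalty is just an extra term added to the task loss; the coefficient and the placeholder loss value below are my own illustrative choices.

```python
# Sketch of an L1 sparsity penalty added to a task loss (illustrative only).
import numpy as np

def l1_penalty(W, lam=1e-4):
    """lam * sum(|w|): its gradient pushes unimportant weights towards zero."""
    return lam * np.sum(np.abs(W))

W = np.random.default_rng(0).normal(size=(64, 32))
task_loss = 0.42                                   # placeholder value
total_loss = task_loss + l1_penalty(W)             # minimised jointly during training
```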
Related Work
Sparsification Methods
• Variational Dropout [5]
- Applies Gaussian dropout with trainable drop rates to the weights and interprets the model as a variational posterior with a particular prior.
[5] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[6] Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, and Samuel Madden. Smallify: Learning network size while training. arXiv preprint arXiv:1806.03723, 2018.
• Smallify [6]
- Uses trainable gates on weights/units and regularises the gates towards zero with L1 regularisation (sketched below).
- Shown to be extremely effective at reaching high prune rates on VGG networks.
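A rough sketch of the Smallify idea as described in the bullet above; the gate placement, penalty coefficient, and pruning threshold are my assumptions, not details from the paper.

```python
# Smallify-style gating, sketched in NumPy: one trainable scalar gate per unit,
# an L1 penalty on the gates, and post-training removal of near-zero gates.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
gates = np.ones(32)                          # one (trainable) gate per output unit

def forward(x, W, gates):
    return (x @ W) * gates                   # gates scale each unit's output

def gate_penalty(gates, lam=1e-3):
    return lam * np.sum(np.abs(gates))       # L1 on the gates drives them to zero

# After training, units whose gates collapsed to ~0 can be pruned outright:
prunable = np.abs(gates) < 1e-3
```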
Methods and Experiments
Targeted Dropout
• Want low-magnitude elements to be able to increase in value if they become important during training.
• Introduce stochasticity into the process using two parameters: the targeting proportion γ and the drop rate α.
• The targeting proportion γ selects the bottom γ|θ| weights (by magnitude) as candidates for dropout; each candidate is then dropped independently with probability α.
• The expected number of units kept during each round of targeted dropout is therefore (1 − γα)|θ|.
• The result is a reduction in the important subnetwork’s dependency on the unimportant subnetwork (the procedure is sketched below).
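Putting the two parameters together, here is a minimal sketch of targeted weight dropout in NumPy; the default γ and α values are illustrative, and this is my reading of the procedure rather than the authors' code.

```python
# Targeted weight dropout, sketched: the bottom gamma-fraction of weights by
# magnitude are dropout candidates; each candidate is dropped with prob. alpha.
import numpy as np

rng = np.random.default_rng(0)

def targeted_weight_dropout(W, gamma=0.75, alpha=0.5):
    flat_mag = np.abs(W).ravel()
    n_target = int(gamma * flat_mag.size)             # size of the candidate set
    candidates = np.argsort(flat_mag)[:n_target]      # lowest-magnitude weights

    drop = np.zeros(flat_mag.size, dtype=bool)
    drop[candidates] = rng.random(n_target) < alpha   # Bernoulli(alpha) per candidate

    W_new = W.ravel().copy()
    W_new[drop] = 0.0
    return W_new.reshape(W.shape)

W = rng.normal(size=(64, 32))
kept = np.mean(targeted_weight_dropout(W) != 0.0)
print(kept)   # close to 1 - gamma * alpha = 0.625 in expectation
```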
Methods and Experiments
Dependence between the important and unimportant subnetworks
• The goal is for the important subnetwork to be completely separated from (independent of) the unimportant one.
• Estimate the effect of pruning weights with the second-order Taylor expansion of the change in loss (written out below).
• At the end of training the parameters sit near a critical point θ*, where the gradients of the loss with respect to the parameters are approximately zero, leaving only the Hessian term.
• Compute a Hessian-weight product matrix as an estimate of weight correlations and network dependence.
• Empirically confirm that targeted dropout reduces the dependence between the important and unimportant subnetworks by an order of magnitude.
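Written out, the estimate referred to above is the standard second-order expansion; the notation θ, Δθ, H below is mine, since the slide's own formulas were not captured in the transcript.

```latex
% Second-order estimate of the change in loss caused by pruning (setting the
% pruned weights to zero corresponds to a perturbation \Delta\theta).
\Delta\mathcal{L}
  \approx \nabla_{\theta}\mathcal{L}(\theta)^{\top}\Delta\theta
        + \tfrac{1}{2}\,\Delta\theta^{\top} H\,\Delta\theta,
\qquad H = \nabla_{\theta}^{2}\mathcal{L}(\theta).
% At a critical point \theta^* reached at the end of training,
% \nabla_{\theta}\mathcal{L}(\theta^*) \approx 0, so only the Hessian term remains:
\Delta\mathcal{L} \approx \tfrac{1}{2}\,\Delta\theta^{\top} H\,\Delta\theta.
```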
Methods and Experiments
Experiments
• Perform experiments using the original ResNet, Wide ResNet, and Transformer architectures applied to the CIFAR-10, ImageNet, and WMT English-German translation datasets.
Methods and Experiments
Experiments
(results tables and figures)
Methods and Experiments
Experiments – scheduling targeted dropout
• In the evaluation against Smallify, the authors found that scheduling the targeting proportion and the drop rate can dramatically improve accuracy.
- The targeting proportion is annealed from 0% to 95% and the drop rate from 0% to 100% over the course of training (a sketch of such a schedule follows).
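A small sketch of what such a schedule could look like, reusing the targeted dropout parameters from the earlier sketch; linear annealing and the step counts are my assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative linear ramp of the targeting proportion (to 95%) and the drop
# rate (to 100%) over training; the step counts are placeholder values.
def ramp(step, total_steps, final_value):
    return final_value * min(step / total_steps, 1.0)

total_steps = 10_000
for step in (0, 2_500, 5_000, 10_000):
    gamma = ramp(step, total_steps, 0.95)    # targeting proportion: 0% -> 95%
    alpha = ramp(step, total_steps, 1.00)    # drop rate: 0% -> 100%
    # targeted_weight_dropout(W, gamma, alpha) would then be applied this step
    print(step, round(gamma, 3), round(alpha, 3))
```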
Methods and Experiments
Experiments
• Comparison with the random pruning method (pruning away a random subnetwork before training).
Conclusion
• Targeted dropout is a simple and effective regularisation tool for training
neural networks that are robust to post hoc pruning.
• Targeted dropout performs well across a range of network architectures
and tasks.
• Showed how dropout can be used as a tool to encode prior structural assumptions into neural networks.
