Learning Sparse Networks Using Targeted Dropout
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Google Brain, University of Oxford, for.ai, Geoffrey Hinton | NeurIPS 2018
2020.08.09
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
Targeted Dropout
Introduction – Background
• A large number of learnable parameters can lead to overfitting.
• There has been a lot of work on compressing neural networks.
<Sparsification Techniques>
1. Sparsity-inducing regularisers
- L1 penalty
- L2 penalty
2. Post hoc pruning (train a full-size network → prune)
- Removing the weights with the smallest magnitude [1]
- Ranking the weights by the sensitivity of task performance to their removal and pruning the least sensitive [2]
[1] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
[2] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
3. Dropout Regularisation
- Standard Dropout
- Variational Dropout
Targeted Dropout
Introduction – Proposal
• Issues with previous work
- Standard training does not encourage networks to be amenable to pruning
- Applying sparsification techniques with little negative impact on task performance is difficult
• Propose “Targeted Dropout”
- Apply dropout specifically to the set of units that are believed to be less useful
- Rank weights or units and apply dropout primarily to those elements with small magnitudes
- The network learns to be robust to the choice of post hoc pruning strategy.
Targeted Dropout
Introduction – Contribution
• Makes networks extremely robust to the post hoc pruning strategy of choice
• Gives intimate control over the desired sparsity patterns
• Easy to implement
• Achieved impressive sparsity rates on a wide range of architectures and datasets
- 99% sparsity on ResNet-32 with less than a 4% drop in test set accuracy on CIFAR-10
Related Work
Dropout
• Two kinds of Bernoulli dropout techniques (both are sketched below)
1. Unit Dropout
- Randomly drops units at each training step to reduce dependence between units and prevent overfitting
2. Weight Dropout
- Randomly drops individual weights in the weight matrices at each training step, i.e., drops individual connections between layers
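A minimal NumPy sketch of the two Bernoulli variants, written for this review rather than taken from the paper; the layer shape, drop rate, and inverted-dropout rescaling are my own illustrative choices.

```python
# Illustrative sketch of unit dropout vs. weight dropout for one dense layer.
import numpy as np

rng = np.random.default_rng(0)

def unit_dropout(W, drop_rate=0.5):
    """Zero entire columns (output units) of the weight matrix."""
    keep = rng.random(W.shape[1]) >= drop_rate      # one Bernoulli draw per unit
    return W * keep[None, :] / (1.0 - drop_rate)    # rescale so expectations match

def weight_dropout(W, drop_rate=0.5):
    """Zero individual entries (connections) of the weight matrix."""
    keep = rng.random(W.shape) >= drop_rate         # one Bernoulli draw per weight
    return W * keep / (1.0 - drop_rate)

W = rng.normal(size=(64, 32))                       # a 64 -> 32 dense layer
W_unit = unit_dropout(W)                            # whole columns are zeroed
W_weight = weight_dropout(W)                        # scattered zeros
```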
Related Work
Magnitude-based pruning
• Treat the top-k largest-magnitude elements as important (selected with an argmax-k operation); both criteria are sketched below
1. Unit Pruning
- Considers the units (column vectors) of the weight matrices under the L2 norm (usually faster, with less computation)
2. Weight Pruning
- Considers the entries of each feature vector (column) under the L1 norm, i.e., each weight's absolute value (usually more accurate)
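The two criteria above can be written in a few lines; this is my own NumPy sketch (treating units as the columns of a single weight matrix, as on this slide), not code from the paper.

```python
# Illustrative sketch of magnitude-based pruning: keep the top-k, zero the rest.
import numpy as np

def prune_units(W, k):
    """Unit pruning: rank columns by their L2 norm, keep the k largest."""
    norms = np.sqrt((W ** 2).sum(axis=0))            # L2 norm of each column
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[np.argsort(norms)[-k:]] = True              # argmax-k over unit norms
    return W * mask[None, :]

def prune_weights(W, k):
    """Weight pruning: rank individual entries by absolute value, keep the k largest."""
    flat = np.abs(W).ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[-k:]] = True               # argmax-k over |w|
    return W * mask.reshape(W.shape)
```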
Related Work
Sparsification Methods
• L1 regularisation [3]
- A cost added to the loss function, intended to drive unimportant weights to zero (a minimal sketch of this penalty follows after the L0 item below).
[3] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems,
pages 1135–1143, 2015
[4] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
• L0 regularisation [4]
- Apply an augmentation of Concrete Dropout to parameters.
- Weights follow a Hard-Concrete distribution where each weight is associated with a gating
parameter that determines the drop rate.
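As a concrete and deliberately minimal example of the L1 approach above, the penalty is just an extra term added to the task loss; the coefficient and the placeholder loss value below are my own illustrative choices.

```python
# Sketch of an L1 sparsity penalty added to a task loss (illustrative only).
import numpy as np

def l1_penalty(W, lam=1e-4):
    """lam * sum(|w|): its gradient pushes unimportant weights towards zero."""
    return lam * np.sum(np.abs(W))

W = np.random.default_rng(0).normal(size=(64, 32))
task_loss = 0.42                                   # placeholder value
total_loss = task_loss + l1_penalty(W)             # minimised jointly during training
```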
Related Work
Sparsification Methods
• Variational Dropout [5]
- Applies Gaussian dropout with trainable drop rates to the weights and interprets the model as a variational posterior with a particular prior.
[5] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[6] Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, and Samuel Madden. Smallify: Learning network size while training. arXiv preprint arXiv:1806.03723, 2018.
• Smallify [6]
- Uses trainable gates on weights/units and regularises the gates towards zero with L1 regularisation (sketched below).
- Shown to be extremely effective at reaching high prune rates on VGG networks.
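A rough sketch of the Smallify idea as described in the bullet above; the gate placement, penalty coefficient, and pruning threshold are my assumptions, not details from the paper.

```python
# Smallify-style gating, sketched in NumPy: one trainable scalar gate per unit,
# an L1 penalty on the gates, and post-training removal of near-zero gates.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
gates = np.ones(32)                          # one (trainable) gate per output unit

def forward(x, W, gates):
    return (x @ W) * gates                   # gates scale each unit's output

def gate_penalty(gates, lam=1e-3):
    return lam * np.sum(np.abs(gates))       # L1 on the gates drives them to zero

# After training, units whose gates collapsed to ~0 can be pruned outright:
prunable = np.abs(gates) < 1e-3
```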
Methods and Experiments
Targeted Dropout
• Want low-magnitude elements to be able to increase in value if they become important during training.
• Introduce stochasticity into the process using two parameters: the targeting proportion γ and the drop rate α.
• The targeting proportion γ selects the bottom γ|θ| weights (by magnitude) as candidates for dropout; each candidate is then dropped independently with probability α.
• The expected number of units kept during each round of targeted dropout is therefore (1 − γα)|θ|.
• The result is a reduction in the important subnetwork’s dependency on the unimportant subnetwork (the procedure is sketched below).
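Putting the two parameters together, here is a minimal sketch of targeted weight dropout in NumPy; the default γ and α values are illustrative, and this is my reading of the procedure rather than the authors' code.

```python
# Targeted weight dropout, sketched: the bottom gamma-fraction of weights by
# magnitude are dropout candidates; each candidate is dropped with prob. alpha.
import numpy as np

rng = np.random.default_rng(0)

def targeted_weight_dropout(W, gamma=0.75, alpha=0.5):
    flat_mag = np.abs(W).ravel()
    n_target = int(gamma * flat_mag.size)             # size of the candidate set
    candidates = np.argsort(flat_mag)[:n_target]      # lowest-magnitude weights

    drop = np.zeros(flat_mag.size, dtype=bool)
    drop[candidates] = rng.random(n_target) < alpha   # Bernoulli(alpha) per candidate

    W_new = W.ravel().copy()
    W_new[drop] = 0.0
    return W_new.reshape(W.shape)

W = rng.normal(size=(64, 32))
kept = np.mean(targeted_weight_dropout(W) != 0.0)
print(kept)   # close to 1 - gamma * alpha = 0.625 in expectation
```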
Methods and Experiments
Dependence between the important and unimportant subnetworks
• The goal is for the important subnetwork to be completely separated from (independent of) the unimportant one.
• Estimate the effect of pruning weights with the second-order Taylor expansion of the change in loss (written out below).
• At the end of training the parameters sit near a critical point θ*, where the gradients of the loss with respect to the parameters are approximately zero, leaving only the Hessian term.
• Compute a Hessian-weight product matrix as an estimate of weight correlations and network dependence.
• Empirically confirm that targeted dropout reduces the dependence between the important and unimportant subnetworks by an order of magnitude.
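Written out, the estimate referred to above is the standard second-order expansion; the notation θ, Δθ, H below is mine, since the slide's own formulas were not captured in the transcript.

```latex
% Second-order estimate of the change in loss caused by pruning (setting the
% pruned weights to zero corresponds to a perturbation \Delta\theta).
\Delta\mathcal{L}
  \approx \nabla_{\theta}\mathcal{L}(\theta)^{\top}\Delta\theta
        + \tfrac{1}{2}\,\Delta\theta^{\top} H\,\Delta\theta,
\qquad H = \nabla_{\theta}^{2}\mathcal{L}(\theta).
% At a critical point \theta^* reached at the end of training,
% \nabla_{\theta}\mathcal{L}(\theta^*) \approx 0, so only the Hessian term remains:
\Delta\mathcal{L} \approx \tfrac{1}{2}\,\Delta\theta^{\top} H\,\Delta\theta.
```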
Methods and Experiments
Experiments
• Perform experiments using the original ResNet, Wide ResNet, and Transformer architectures applied to the CIFAR-10, ImageNet, and WMT English-German translation datasets.
Methods and Experiments
Experiments
(results tables and figures)
Methods and Experiments
Experiments – scheduling targeted dropout
• In the evaluation against Smallify, the authors found that scheduling the targeting proportion and the drop rate can dramatically improve accuracy.
- The targeting proportion is annealed from 0% to 95% and the drop rate from 0% to 100% over the course of training (a sketch of such a schedule follows).
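A small sketch of what such a schedule could look like, reusing the targeted dropout parameters from the earlier sketch; linear annealing and the step counts are my assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative linear ramp of the targeting proportion (to 95%) and the drop
# rate (to 100%) over training; the step counts are placeholder values.
def ramp(step, total_steps, final_value):
    return final_value * min(step / total_steps, 1.0)

total_steps = 10_000
for step in (0, 2_500, 5_000, 10_000):
    gamma = ramp(step, total_steps, 0.95)    # targeting proportion: 0% -> 95%
    alpha = ramp(step, total_steps, 1.00)    # drop rate: 0% -> 100%
    # targeted_weight_dropout(W, gamma, alpha) would then be applied this step
    print(step, round(gamma, 3), round(alpha, 3))
```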
Methods and Experiments
Experiments
• Comparison with the random pruning method (pruning away a random subnetwork before training).
Conclusion
• Targeted dropout is a simple and effective regularisation tool for training
neural networks that are robust to post hoc pruning.
• Targeted dropout performs well across a range of network architectures
and tasks.
• Showed how dropout can be used as a tool to encode prior structural assumptions into neural networks.
