E0 294: Systems for Machine Learning
Lecture #12
27th February 2025
2
Date  | Topic covered                               | Important event
07/01 | Introduction                                |
09/01 | No Class                                    | Systems Day
14/01 | Holiday                                     | List of project releases
16/01 | Basics of Deep Neural Networks              |
21/01 | Convolutional Neural Network (1)            | Assignment-1 released
23/01 | Convolutional Neural Network (2)            |
28/01 | Examples of CNN / CNN Accelerators (1)      | Assignment-1 submission deadline
30/01 | No Class (Tentative) / CNN Accelerators (2) |
04/02 | Parallelism in ML Training                  | Assignment-2 released
06/02 | Basics of Large Language Models (LLM)       |
11/02 | No Class (Tentative)                        | Assignment-2 submission deadline
13/02 | No Class (Tentative)                        |
18/02 | Extreme scale training                      | Assignment-3 released; Phase-1 due
20/02 | LLM Serving Infra: vLLM                     |
Timeline of the Course
3
Date  | Topic covered                                                                    | Deadline (if any)
25/02 | Scheduling: Efficient LLM scheduling and routing / Checkpointing in ML Training | Assignment-3 submission deadline
27/02 | CNN Accelerators (1) / Pruning and Quantization                                  | Assignment-3 released
04/03 | Scheduling: Efficient LLM scheduling and routing                                 |
06/03 | Memory optimizations: FlashAttention 1/2/3, FlexGen                              | Assignment-3 submission deadline
11/03 | Multimodality: Large Multimodal Models                                           | Assignment-4 released
13/03 | CNN Accelerators (2) / Accelerating decoding: Speculative decoding, lookahead decoding |
18/03 | Introduction to AMD AIPC                                                         | Assignment-4 submission deadline
20/03 | Pruning and Quantization / CNN Accelerators (3)                                  |
25/03 | Checkpointing in ML Training / Workload Characterization                         | Phase-2 due
27/03 | Accelerators with Emerging Technology (1)                                        | Assignment-5 released
01/04 | Accelerators for Graph Algorithms                                                |
03/04 | Accelerators for Graph Convolutional Network (GCN)                               |
08/04 | GCN Accelerators with Emerging Technology                                        | Assignment-5 submission deadline
Timeline of the Course
4
Date  | Topic covered            | Deadline (if any)
16/04 | Project presentation (1) |
17/04 | Project presentation (2) |
18/04 | Project presentation (3) |
26/04 |                          | Project report submission deadline
Timeline of the Course
Lectures by Industry Experts
5
▪ DNN pruning
Today’s Class
6
▪ The need for smaller models
▪ How to select which parameter to prune
▪ Develop ‘a’ recipe for DNN pruning
▪ Problems due to pruning
▪ Role of underlying hardware on pruning
Topics in DNN Pruning
Slide Courtesy: Sourav Mazumdar, IISc
7
▪ Increasing size of neural networks
– Overparameterization
▪ High cost of deployment and sustained inference
▪ Unable to run on edge devices
– High inference time
– High memory
– Energy consumption
The Need for Smaller Models
8
▪ ‘Somehow’ create compressed models with faster execution times without losing much accuracy
▪ Knowledge Distillation (KD)
– A smaller (student) model is made to learn from a bigger (teacher) model
▪ KD suffers from the drawback of having to define a completely new model and
train that model from scratch
▪ Better approach
– Compress existing models without training any new model from scratch
In search of Smaller Models
9
Overview of DNN Pruning (1)
Remove Edges
Original network Pruned network
10
Overview of DNN Pruning (2)
Remove Nodes
Original network Pruned network
11
▪ Fine Grained: Individual Weights
▪ Vector Level
– Sub-kernel vectors of size W
▪ Kernel Level
– Kernels of shape H x W
▪ Filter Level
– Filter of shape C x H x W
Pruning Granularity in DNNs (1)
Unstructured Structured
12
▪ Unstructured
– Finer control over selecting elements to prune
– Can provide a better sparsity-to-accuracy ratio, i.e., a smaller network with higher accuracy
– Difficult to implement
▪ Structured
– Less control over selecting elements to prune
– Relatively lower sparsity-to-accuracy ratio
– Easier to implement
Pruning Granularity in DNNs (2)
Unstructured Structured
13
▪ DNN pruning can be defined as

$$\underset{W_p}{\arg\min}\; L(W_p, D) \quad \text{s.t.} \quad \phi(W_p) \le C$$

▪ $L$ is the loss/objective function of the DNN
▪ $D$ is the target dataset
▪ $W$ is the set of original weights; $W_p \subseteq W$ is the subset of weights remaining after pruning
▪ $\phi(\cdot)$ is a function that maps the network to some constraint $C$, such as inference time, FLOPs, memory, etc.
Mathematical Definition of DNN Pruning
14
▪ Select the less important parameters to prune
– Define a saliency term, which can map each parameter to some importance
score
– The importance score can be the loss in performance (degradation of
accuracy) when removing that particular parameter
– Remove the least salient parameters from the network
Which Parameters to Prune?
15
▪ $f(\cdot) = \mathrm{ReLU}(\cdot)$
▪ $W = [10, -8, 0.1]$
▪ $Y = f(10x_0 - 8x_1 + 0.1x_2)$
▪ If you can afford to remove one weight parameter, which one do you remove?
▪ What is the basis for your selection?
Magnitude-based Pruning
Importance ∝ |W|
16
▪ Remove the parameters whose absolute values are closer to 0
▪ For element-wise pruning
Magnitude-based Pruning
Importance = |W|
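A minimal PyTorch sketch of this criterion (the layer, sparsity level, and helper name are illustrative, not part of the lecture): weights whose magnitude falls at or below the k-th smallest absolute value are zeroed out.

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Linear, sparsity: float = 0.5):
    """Zero out the fraction `sparsity` of weights with the smallest |W| (in place)."""
    w = module.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return
    # Threshold = k-th smallest absolute value; weights at or below it are pruned.
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    module.weight.data.mul_(mask)

layer = nn.Linear(128, 64)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # roughly 0.5 of the weights are now zero
```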
17
▪ Remove the parameter groups whose summed absolute values are closest to 0
▪ For structured (group-wise) pruning
Magnitude-based Pruning
Importance = $\sum_{i \in S} |W_i|$, the L1 norm over a structural set $S$

Worked example with two structural sets of four weights each:
Set 1: [10, -5, 3, 14] → |10| + |-5| + |3| + |14| = 32
Set 2: [1, -5, 3, -2] → |1| + |-5| + |3| + |-2| = 11
Importance scores: 32 and 11. The lower-importance set is pruned, leaving importances 32 and 0 in the pruned weights.
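A possible PyTorch sketch of the same criterion at filter granularity (the Conv2d setup and function names are assumptions for illustration). It only zeroes the lowest-L1 filters; a full implementation would also remove the corresponding input channels of the following layer.

```python
import torch
import torch.nn as nn

def filter_l1_importance(conv: nn.Conv2d) -> torch.Tensor:
    # One L1 norm per output filter: sum |W| over (in_channels, kH, kW).
    return conv.weight.data.abs().sum(dim=(1, 2, 3))

def prune_filters_(conv: nn.Conv2d, n_prune: int):
    importance = filter_l1_importance(conv)
    # Indices of the n_prune least-important filters.
    prune_idx = torch.argsort(importance)[:n_prune]
    conv.weight.data[prune_idx] = 0.0
    if conv.bias is not None:
        conv.bias.data[prune_idx] = 0.0

conv = nn.Conv2d(3, 16, kernel_size=3)
prune_filters_(conv, n_prune=4)   # zero the 4 filters with the smallest L1 norm
```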
18
Result of Pruning
Can we recover the lost accuracy?
19
Results of Pruning and Fine tuning
Ref: Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
20
Iterative Pruning
21
▪ Estimate the importance of a neuron by the error induced when removing that
neuron
▪ Removing each neuron and calculating the error is computationally expensive
▪ Use Taylor series expansion to approximate the importance
▪ A perturbation $\delta U$ of the parameter vector will change the objective function by

$$\delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i\,\delta u_j + O\!\left(\lVert\delta U\rVert^3\right)$$

where $g_i = \dfrac{\partial E}{\partial u_i}$ and $h_{ij} = \dfrac{\partial^2 E}{\partial u_i\,\partial u_j}$
Which Parameters to Prune: Derivative-based
22

$$\delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i\,\delta u_j + O\!\left(\lVert\delta U\rVert^3\right)$$

The following assumptions can be made to simplify the approximation:
1. Diagonal: the change in error caused by deleting one filter is independent of the others, so the cross terms are dropped
2. Extremal: the network has already converged, so the first-order term vanishes
3. Quadratic: the cost function is nearly quadratic, so the last (higher-order) term is dropped

$$\delta E = \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2$$

The parameters with the smallest $|\delta E|$ contribution are removed.
Which Parameters to Prune: Derivative-based
Ref: Optimal Brain Damage, LeCun et al.
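A rough sketch of turning this into a per-parameter saliency $s_i = \tfrac{1}{2} h_{ii} u_i^2$. Computing the exact Hessian diagonal is costly, so this illustration substitutes the mean squared gradient (the empirical Fisher diagonal) for $h_{ii}$; that substitution is an assumption made for the sketch, not the exact Optimal Brain Damage procedure.

```python
import torch
import torch.nn as nn

def obd_saliency(model: nn.Module, loss_fn, data_loader, device="cpu"):
    """Return {param_name: saliency} with s_i = 0.5 * h_ii * u_i**2,
    where h_ii is approximated by the mean squared gradient over the batches."""
    model.to(device)
    sq_grads = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sq_grads[n] += p.grad.detach() ** 2
        n_batches += 1
    return {
        n: 0.5 * (sq_grads[n] / max(n_batches, 1)) * p.detach() ** 2
        for n, p in model.named_parameters()
    }
```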
23
▪ Zero activation neurons are redundant and can be removed without affecting
the overall accuracy of the network
▪ The Average Percentage of Zero activations (APoZ) can be used to measure the importance of filters
▪ Filters with a higher APoZ can be removed
Which Parameters to Prune: Activation Based
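A hedged PyTorch sketch of estimating APoZ per channel with a forward hook on a chosen ReLU layer (the layer choice, data loader, and function name are placeholders):

```python
import torch
import torch.nn as nn

def apoz_per_channel(relu: nn.Module, model: nn.Module, data_loader):
    """Average Percentage of Zero activations per output channel after `relu`."""
    zero_fracs = []

    def hook(_module, _inputs, out):
        # out has shape (N, C, H, W); take the fraction of zeros per channel.
        zero_fracs.append((out == 0).float().mean(dim=(0, 2, 3)))

    handle = relu.register_forward_hook(hook)
    with torch.no_grad():
        for x, _ in data_loader:
            model(x)
    handle.remove()
    # Channels (filters) with a higher APoZ are better candidates for removal.
    return torch.stack(zero_fracs).mean(dim=0)
```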
24
▪ The aim of scaling-based pruning is to identify the important and redundant channels while training the network itself
▪ For channel pruning of a CNN, each channel can be associated with a scaling factor, which is a trainable parameter
▪ The channels with low scaling factors can be removed
Which Parameters to Prune: Scaling-based Pruning
Ref: Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
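A minimal sketch of the network-slimming idea, assuming the BatchNorm scale γ serves as the per-channel scaling factor (the penalty weight and training loop are left as placeholders):

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of |gamma| over all BatchNorm2d layers (the per-channel scaling factors)."""
    gammas = [m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.BatchNorm2d)]
    return torch.stack(gammas).sum() if gammas else torch.zeros(())

# During training (sketch):
#   loss = task_loss + lam * bn_l1_penalty(model)
# After training, channels whose gamma falls below a chosen threshold are removed.
```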
25
▪ Many modern network architectures, like ResNet, make use of skip connections
▪ These skip connections create coupled channels
▪ Channels of multiple layers are coupled and need to be pruned simultaneously
▪ Prior pruning algorithms skip pruning the coupled channels due to their complexity, but this leads to lower benefits in execution time
The Case of Coupled Channels
Grouped Saliencies
● L. Liu et al., PMLR 2021 proposed a layer-grouping algorithm to find the coupled channels. Layers in the same group share the same pruning mask and are pruned simultaneously.
● T. Narshana et al., ICLR 2023 propose DFCs (Data Flow Couplings), which abstract the end-to-end transformation and the associated layers for an instance of coupling in a network.
26
▪ Select the pruning granularity
▪ Repeat
– Select the parameters to prune
– Remove the parameters
– Fine tune the model
Is this the only procedure for pruning neural network models?
Recipe for Neural Network Pruning
Trained Model → Prune → Fine Tune → (repeat)
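A schematic sketch of this recipe (all helper names are placeholders; any saliency criterion from the earlier slides can be plugged in as prune_step):

```python
def iterative_pruning(model, prune_step, fine_tune, evaluate,
                      target_sparsity=0.8, step=0.1):
    """Repeatedly prune a little, then fine-tune, until the sparsity target is met."""
    sparsity = 0.0
    while sparsity < target_sparsity:
        prune_step(model, amount=step)   # e.g. magnitude- or saliency-based pruning
        fine_tune(model)                 # recover the accuracy lost in this round
        sparsity = min(sparsity + step, target_sparsity)
        print(f"sparsity={sparsity:.2f}  accuracy={evaluate(model):.3f}")
    return model
```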
27
▪ Static Pruning
– Static pruning permanently removes network parameters which, in general, cannot be recovered
– May decrease model capacity, leading to a non-recoverable loss of accuracy
– Static pruning doesn't take the current input into account
▪ Dynamic pruning
– Determines at runtime which parameters (filters, connections, layers) will be skipped in the forward pass
– The network is pruned more for images that are easier to predict
– Since the full capacity of the network is preserved, the balance point can be adjusted according to the available resources
Dynamic Pruning
28
▪ Dynamic Pruning (or Runtime Pruning) can consist of 2 sub-networks
– CNN Backbone
– Decision Network
▪ A mask variable can be used to denote the filters that will be used in the forward pass for the current iteration
Dynamic Pruning (1)
Ref: Pruning and Quantization for Deep Neural Network Acceleration: A Survey, Liang et al.
29
▪ The (i-1)-th layer of the backbone network produces a feature map
▪ Based on this feature map, the decision network dynamically prunes the kernels of the i-th layer in the backbone network
Dynamic Pruning (2)
Ref: Pruning and Quantization for Deep Neural Network Acceleration: A Survey, Liang et al.
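A hedged sketch of such a structure: a small decision network looks at the incoming feature map and emits a channel mask for the next convolution. The gating architecture, threshold, and shapes here are illustrative assumptions; practical runtime pruners use differentiable or RL-based gating rather than a plain hard threshold.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Conv layer whose output channels are masked at runtime by a decision network."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.decision = nn.Sequential(        # decision network on the input feature map
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, out_ch), nn.Sigmoid()
        )

    def forward(self, x):
        gate = self.decision(x)                # (N, out_ch) values in [0, 1]
        mask = (gate > 0.5).float()            # hard mask; straight-through in practice
        y = self.conv(x)
        return y * mask[:, :, None, None]      # masked channels contribute nothing

block = GatedConvBlock(16, 32)
out = block(torch.randn(2, 16, 8, 8))          # (2, 32, 8, 8) with some channels zeroed
```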
30
▪ Static pruning is performed offline, prior to inference, while dynamic pruning is performed at runtime
▪ Dynamic pruning requires no fine tuning after pruning
▪ Dynamic pruning may result in higher overhead compared to static pruning
Static Pruning vs Dynamic Pruning
31
▪ LTH asks whether we can train a sparse subnetwork of some dense network from scratch and achieve the same level of accuracy
▪ But how do we identify the sparse subnetwork?
– Iterative magnitude pruning is one such way
The Lottery Ticket Hypothesis (LTH)
Flow 1: Initial Model → (Train) → Converged Model → (Prune) → Pruned Model → Retrain Model: unable to recover accuracy
Flow 2: Initial Model → (Train) → Converged Model → (Prune) → Pruned Model → Reinitialize to original values → Retrain Model: similar accuracy performance
Ref: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
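A schematic sketch of iterative magnitude pruning with rewinding to the original initialization, as in the second flow above (the training and masking helpers are placeholders):

```python
import copy
import torch

def find_lottery_ticket(model, train, global_magnitude_mask,
                        rounds=5, prune_per_round=0.2):
    """Iterative magnitude pruning: train, prune, rewind to the initial weights, repeat."""
    init_state = copy.deepcopy(model.state_dict())    # theta_0, kept for rewinding
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, mask)                             # train with the current mask applied
        mask = global_magnitude_mask(model, mask, prune_per_round)  # drop lowest |W|
        model.load_state_dict(init_state)              # rewind weights to theta_0
        for n, p in model.named_parameters():
            p.data.mul_(mask[n])                       # re-apply the pruning mask
    return model, mask
```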
32
▪ A deep neural network consists of many layers which hypothetically specialize at identifying different levels of patterns in the data
▪ How do different layers react to pruning? Can we prune different layers with the
same pruning ratio or are there some layers which are more sensitive to pruning
compared to others?
Pruning in Different Layers
Pruning different layers of VGG on CIFAR 10
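A sketch of how such a per-layer sensitivity plot can be produced (the model, pruning, and evaluation helpers are placeholders): each layer is pruned in isolation at increasing ratios and the resulting accuracy drop is recorded.

```python
import copy

def layer_sensitivity(model, prunable_layers, prune_layer, evaluate,
                      ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Accuracy drop vs. pruning ratio, pruning one layer at a time."""
    baseline = evaluate(model)
    results = {}
    for name in prunable_layers:
        results[name] = []
        for r in ratios:
            trial = copy.deepcopy(model)   # prune a copy so layers are tested in isolation
            prune_layer(trial, name, ratio=r)
            results[name].append((r, baseline - evaluate(trial)))  # accuracy drop
    return results
```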
33
▪ We see that some layers are more sensitive to pruning than others
▪ Layers deeper in the network can be pruned more without much loss in accuracy
compared to the initial layers
Pruning in Different Layers
34
▪ Different layers react differently to pruning in terms of loss in accuracy
▪ Should we choose only the layers with a lower loss in accuracy to prune?
▪ But do all layers have the same execution time?
▪ Pruning a layer with a low loss in accuracy may fail to yield much reduction in total execution time
▪ We need to strike a balance between the execution time saved and the loss in accuracy incurred
Pruning in Different Layers
35
▪ As we prune a layer, we reduce its number of parameters; this should result in a gradual decrease in its execution time
▪ But is this decrease in execution time linear?
How Does Execution Time Change with Pruning?
▪ It is observed that even for a uniformly increasing pruning ratio, the execution time does not decrease linearly but in a staircase pattern
36
▪ A non-linear decrease in execution time is observed as the layers are pruned
▪ The exact trend may differ from one execution platform to another
▪ When the execution time decreases in a staircase pattern, we want the pruned model to take advantage of the sharp drops in execution time
Platform Awareness in Pruning
▪ To get the benefits of pruning, we need to prune at or near the drop points
▪ Pruning in the flat regions won't translate to a reduction in execution time but may still result in a loss of accuracy
▪ A pruned model at point A or C is desirable, whereas a pruned model at a point between A and B is not
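A sketch of how the staircase can be observed empirically (the timing setup is illustrative and CPU-only; on a GPU one would synchronize or use CUDA events): the forward latency of a convolution is measured as its output-channel count shrinks, and pruning targets are then chosen at the measured drop points.

```python
import time
import torch
import torch.nn as nn

def conv_latency(out_channels, in_channels=64, hw=56, iters=20):
    """Average forward latency (seconds) of a 3x3 conv with the given width."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
    x = torch.randn(1, in_channels, hw, hw)
    with torch.no_grad():
        conv(x)                                   # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
    return (time.perf_counter() - start) / iters

# Latency tends to drop in steps rather than linearly as channels are removed.
for c in range(64, 0, -8):
    print(c, f"{conv_latency(c) * 1e3:.2f} ms")
```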
37
▪ FLOPs need to be reduced comparatively more to gain a similar benefit in execution time
▪ There are also irregularities in the speed-up trend that we may benefit from
Platform Awareness in Pruning
Δt represents the normalised speedup; Δf represents the normalised reduction in FLOPs
Q & A
THANK YOU