E0 294: Systems for Machine Learning
Lecture #12
27th February 2025
2
Date  | Topic covered                               | Important event
07/01 | Introduction                                |
09/01 | No Class                                    | Systems Day
14/01 | Holiday                                     | List of project releases
16/01 | Basics of Deep Neural Networks              |
21/01 | Convolutional Neural Network (1)            | Assignment-1 released
23/01 | Convolutional Neural Network (2)            |
28/01 | Examples of CNN / CNN Accelerators (1)      | Assignment-1 submission deadline
30/01 | No Class (Tentative) / CNN Accelerators (2) |
04/02 | Parallelism in ML Training                  | Assignment-2 released
06/02 | Basics of Large Language Models (LLM)       |
11/02 | No Class (Tentative)                        | Assignment-2 submission deadline
13/02 | No Class (Tentative)                        |
18/02 | Extreme scale training                      | Assignment-3 released; Phase-1 due
20/02 | LLM Serving Infra: vLLM                     |
Timeline of the Course
3
Date  | Topic covered                                                                    | Deadline (if any)
25/02 | Scheduling: Efficient LLM scheduling and routing / Checkpointing in ML Training | Assignment-3 submission deadline
27/02 | CNN Accelerators (1) / Pruning and Quantization                                  | Assignment-3 released
04/03 | Scheduling: Efficient LLM scheduling and routing                                 |
06/03 | Memory optimizations: FlashAttention 1/2/3, FlexGen                              | Assignment-3 submission deadline
11/03 | Multimodality: Large Multimodal Models                                           | Assignment-4 released
13/03 | CNN Accelerators (2) / Accelerating decoding: Speculative decoding, lookahead decoding |
18/03 | Introduction to AMD AIPC                                                         | Assignment-4 submission deadline
20/03 | Pruning and Quantization / CNN Accelerators (3)                                  |
25/03 | Checkpointing in ML Training / Workload Characterization                         | Phase-2 due
27/03 | Accelerators with Emerging Technology (1)                                        | Assignment-5 released
01/04 | Accelerators for Graph Algorithms                                                |
03/04 | Accelerators for Graph Convolutional Network (GCN)                               |
08/04 | GCN Accelerators with Emerging Technology                                        | Assignment-5 submission deadline
Timeline of the Course
4
Date  | Topic covered            | Deadline (if any)
16/04 | Project presentation (1) |
17/04 | Project presentation (2) |
18/04 | Project presentation (3) |
26/04 |                          | Project report submission deadline
Timeline of the Course
Lectures by Industry Experts
5
▪ DNN pruning
Today’s Class
6
▪ The need for smaller models
▪ How to select which parameter to prune
▪ Develop ‘a’ recipe for DNN pruning
▪ Problems due to pruning
▪ Role of underlying hardware on pruning
Topics in DNN Pruning
Slide Courtesy: Sourav Mazumdar, IISc
7
▪ Increasing size of neural networks
– Overparameterization
▪ High cost of deployment and sustained inference
▪ Unable to run on edge devices
– High inference time
– High memory
– Energy consumption
The Need for Smaller Models
8
▪ ‘Somehow’ create compressed models with faster execution times without losing much accuracy
▪ Knowledge Distillation (KD)
– A smaller (student) model is made to learn from a bigger (teacher) model
▪ KD suffers from the drawback of having to define a completely new model and
train that model from scratch
▪ Better approach
– Compress existing models without training any new model from scratch
In search of Smaller Models
9
Overview of DNN Pruning (1)
Remove Edges
Original network Pruned network
10
Overview of DNN Pruning (2)
Remove Nodes
Original network Pruned network
11
▪ Fine Grained: Individual Weights
▪ Vector Level
– Sub-kernel vectors of size W
▪ Kernel Level
– Kernels of shape H x W
▪ Filter Level
– Filter of shape C x H x W
Pruning Granularity in DNNs (1)
Unstructured Structured
12
▪ Unstructured
– Finer control over selecting elements to prune
– Can provide a better sparsity-to-accuracy ratio, i.e., a smaller network with higher accuracy
– Difficult to implement
▪ Structured
– Less control over selecting elements to prune
– Relatively lower sparsity-to-accuracy ratio
– Easier to implement
Pruning Granularity in DNNs (2)
Unstructured Structured
13
▪ DNN pruning can be defined as

$$\underset{W_p}{\arg\min}\; L(W_p, D) \quad \text{s.t.} \quad \phi(W_p) \le C$$

▪ $L$ is the loss/objective function of the DNN
▪ $D$ is the target dataset
▪ $W$ is the set of original weights; $W_p \subseteq W$ is the subset of weights remaining after pruning
▪ $\phi(\cdot)$ is a function that maps the network to some constraint $C$, such as inference time, FLOPs, memory, etc.
Mathematical Definition of DNN Pruning
14
▪ Select the less important parameters to prune
– Define a saliency term, which can map each parameter to some importance
score
– The importance score can be the loss in performance (degradation of
accuracy) when removing that particular parameter
– Remove the least salient parameters from the network
Which Parameters to Prune?
15
▪ $f(\cdot) = \mathrm{ReLU}(\cdot)$
▪ $W = [10, -8, 0.1]$
▪ $Y = f(10x_0 - 8x_1 + 0.1x_2)$
▪ If you can afford to remove one weight parameter, which one do you remove?
▪ What is the basis for your selection?
Magnitude-based Pruning
Importance ∝ |W|
16
▪ Remove the parameters whose absolute values are closer to 0
▪ For element-wise pruning
Magnitude-based Pruning
Importance = |W|
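A minimal PyTorch sketch of this criterion (the layer, sparsity level, and helper name are illustrative, not part of the lecture): weights whose magnitude falls at or below the k-th smallest absolute value are zeroed out.

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Linear, sparsity: float = 0.5):
    """Zero out the fraction `sparsity` of weights with the smallest |W| (in place)."""
    w = module.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return
    # Threshold = k-th smallest absolute value; weights at or below it are pruned.
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    module.weight.data.mul_(mask)

layer = nn.Linear(128, 64)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # roughly 0.5 of the weights are now zero
```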
17
▪ Remove the parameter groups whose summed absolute values are closest to 0
▪ For structured (group-wise) pruning
Magnitude-based Pruning
Importance = $\sum_{i \in S} |W_i|$, the L1 norm over a structural set $S$

Worked example with two structural sets of four weights each:
Set 1: [10, -5, 3, 14] → |10| + |-5| + |3| + |14| = 32
Set 2: [1, -5, 3, -2] → |1| + |-5| + |3| + |-2| = 11
Importance scores: 32 and 11. The lower-importance set is pruned, leaving importances 32 and 0 in the pruned weights.
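A possible PyTorch sketch of the same criterion at filter granularity (the Conv2d setup and function names are assumptions for illustration). It only zeroes the lowest-L1 filters; a full implementation would also remove the corresponding input channels of the following layer.

```python
import torch
import torch.nn as nn

def filter_l1_importance(conv: nn.Conv2d) -> torch.Tensor:
    # One L1 norm per output filter: sum |W| over (in_channels, kH, kW).
    return conv.weight.data.abs().sum(dim=(1, 2, 3))

def prune_filters_(conv: nn.Conv2d, n_prune: int):
    importance = filter_l1_importance(conv)
    # Indices of the n_prune least-important filters.
    prune_idx = torch.argsort(importance)[:n_prune]
    conv.weight.data[prune_idx] = 0.0
    if conv.bias is not None:
        conv.bias.data[prune_idx] = 0.0

conv = nn.Conv2d(3, 16, kernel_size=3)
prune_filters_(conv, n_prune=4)   # zero the 4 filters with the smallest L1 norm
```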
18
Result of Pruning
Can we recover the lost accuracy?
19
Results of Pruning and Fine tuning
Ref: Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
20
Iterative Pruning
21
▪ Estimate the importance of a neuron by the error induced when removing that
neuron
▪ Removing each neuron and calculating the error is computationally expensive
▪ Use Taylor series expansion to approximate the importance
▪ A perturbation $\delta U$ of the parameter vector will change the objective function by

$$\delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i\,\delta u_j + O\!\left(\lVert\delta U\rVert^3\right)$$

where $g_i = \dfrac{\partial E}{\partial u_i}$ and $h_{ij} = \dfrac{\partial^2 E}{\partial u_i\,\partial u_j}$
Which Parameters to Prune: Derivative-based
22

$$\delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i\,\delta u_j + O\!\left(\lVert\delta U\rVert^3\right)$$

The following assumptions can be made to simplify the approximation:
1. Diagonal: the change in error caused by deleting one filter is independent of the others, so the cross terms are dropped
2. Extremal: the network has already converged, so the first-order term vanishes
3. Quadratic: the cost function is nearly quadratic, so the last (higher-order) term is dropped

$$\delta E = \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2$$

The parameters with the smallest $|\delta E|$ contribution are removed.
Which Parameters to Prune: Derivative-based
Ref: Optimal Brain Damage, LeCun et al.
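A rough sketch of turning this into a per-parameter saliency $s_i = \tfrac{1}{2} h_{ii} u_i^2$. Computing the exact Hessian diagonal is costly, so this illustration substitutes the mean squared gradient (the empirical Fisher diagonal) for $h_{ii}$; that substitution is an assumption made for the sketch, not the exact Optimal Brain Damage procedure.

```python
import torch
import torch.nn as nn

def obd_saliency(model: nn.Module, loss_fn, data_loader, device="cpu"):
    """Return {param_name: saliency} with s_i = 0.5 * h_ii * u_i**2,
    where h_ii is approximated by the mean squared gradient over the batches."""
    model.to(device)
    sq_grads = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sq_grads[n] += p.grad.detach() ** 2
        n_batches += 1
    return {
        n: 0.5 * (sq_grads[n] / max(n_batches, 1)) * p.detach() ** 2
        for n, p in model.named_parameters()
    }
```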
23
▪ Zero activation neurons are redundant and can be removed without affecting
the overall accuracy of the network
▪ The Average Percentage of Zero activations (APoZ) can be used to measure the importance of filters
▪ Filters with a higher APoZ can be removed
Which Parameters to Prune: Activation Based
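A hedged PyTorch sketch of estimating APoZ per channel with a forward hook on a chosen ReLU layer (the layer choice, data loader, and function name are placeholders):

```python
import torch
import torch.nn as nn

def apoz_per_channel(relu: nn.Module, model: nn.Module, data_loader):
    """Average Percentage of Zero activations per output channel after `relu`."""
    zero_fracs = []

    def hook(_module, _inputs, out):
        # out has shape (N, C, H, W); take the fraction of zeros per channel.
        zero_fracs.append((out == 0).float().mean(dim=(0, 2, 3)))

    handle = relu.register_forward_hook(hook)
    with torch.no_grad():
        for x, _ in data_loader:
            model(x)
    handle.remove()
    # Channels (filters) with a higher APoZ are better candidates for removal.
    return torch.stack(zero_fracs).mean(dim=0)
```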
24
▪ The aim of scaling-based pruning is to identify the important and redundant channels while training the network itself
▪ For channel pruning of a CNN, each channel can be associated with a scaling factor, which is a trainable parameter
▪ The channels with low scaling factors can be removed
Which Parameters to Prune: Scaling-based Pruning
Ref: Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
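A minimal sketch of the network-slimming idea, assuming the BatchNorm scale γ serves as the per-channel scaling factor (the penalty weight and training loop are left as placeholders):

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of |gamma| over all BatchNorm2d layers (the per-channel scaling factors)."""
    gammas = [m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.BatchNorm2d)]
    return torch.stack(gammas).sum() if gammas else torch.zeros(())

# During training (sketch):
#   loss = task_loss + lam * bn_l1_penalty(model)
# After training, channels whose gamma falls below a chosen threshold are removed.
```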
25
▪ Many modern network architectures, like ResNet, make use of skip connections
▪ These skip connections create coupled channels
▪ Channels of multiple layers are coupled and need to be pruned simultaneously
▪ Prior pruning algorithms skip pruning the coupled channels due to their complexity, but this leads to lower benefits in execution time
The Case of Coupled Channels
Grouped Saliencies
● L. Liu et al., PMLR 2021 proposed a layer-grouping algorithm to find the coupled channels. Layers in the same group share the same pruning mask and are pruned simultaneously.
● T. Narshana et al., ICLR 2023 propose DFCs (Data Flow Couplings), which abstract the end-to-end transformation and the associated layers for an instance of coupling in a network.
26
▪ Select the pruning granularity
▪ Repeat
– Select the parameters to prune
– Remove the parameters
– Fine tune the model
Is this the only procedure for pruning neural network models?
Recipe for Neural Network Pruning
Trained Model → Prune → Fine Tune → (repeat)
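A schematic sketch of this recipe (all helper names are placeholders; any saliency criterion from the earlier slides can be plugged in as prune_step):

```python
def iterative_pruning(model, prune_step, fine_tune, evaluate,
                      target_sparsity=0.8, step=0.1):
    """Repeatedly prune a little, then fine-tune, until the sparsity target is met."""
    sparsity = 0.0
    while sparsity < target_sparsity:
        prune_step(model, amount=step)   # e.g. magnitude- or saliency-based pruning
        fine_tune(model)                 # recover the accuracy lost in this round
        sparsity = min(sparsity + step, target_sparsity)
        print(f"sparsity={sparsity:.2f}  accuracy={evaluate(model):.3f}")
    return model
```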
27
▪ Static Pruning
– Static pruning permanently removes network parameters which, in general, cannot be recovered
– May decrease model capacity, leading to a non-recoverable loss of accuracy
– Static pruning doesn't take the current input into account
▪ Dynamic pruning
– Determines at runtime which parameters (filters, connections, layers) will be skipped in the forward pass
– The network is pruned more for images that are easier to predict
– Since the full capacity of the network is preserved, the balance point can be adjusted according to the available resources
Dynamic Pruning
28
▪ Dynamic Pruning (or Runtime Pruning) can consist of 2 sub-networks
– CNN Backbone
– Decision Network
▪ A mask variable can be used to denote the filters that will be used in the forward pass for the current iteration
Dynamic Pruning (1)
Ref: Pruning and Quantization for Deep Neural Network Acceleration: A Survey, Liang et al.
29
▪ The (i-1)-th layer of the backbone network produces a feature map
▪ Based on this feature map, the decision network dynamically prunes the kernels of the i-th layer in the backbone network
Dynamic Pruning (2)
Ref: Pruning and Quantization for Deep Neural Network Acceleration: A Survey, Liang et al.
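A hedged sketch of such a structure: a small decision network looks at the incoming feature map and emits a channel mask for the next convolution. The gating architecture, threshold, and shapes here are illustrative assumptions; practical runtime pruners use differentiable or RL-based gating rather than a plain hard threshold.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Conv layer whose output channels are masked at runtime by a decision network."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.decision = nn.Sequential(        # decision network on the input feature map
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, out_ch), nn.Sigmoid()
        )

    def forward(self, x):
        gate = self.decision(x)                # (N, out_ch) values in [0, 1]
        mask = (gate > 0.5).float()            # hard mask; straight-through in practice
        y = self.conv(x)
        return y * mask[:, :, None, None]      # masked channels contribute nothing

block = GatedConvBlock(16, 32)
out = block(torch.randn(2, 16, 8, 8))          # (2, 32, 8, 8) with some channels zeroed
```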
30
▪ Static pruning is performed offline, prior to inference, while dynamic pruning is performed at runtime
▪ Dynamic pruning requires no fine tuning after pruning
▪ Dynamic pruning may result in higher overhead compared to static pruning
Static Pruning vs Dynamic Pruning
31
▪ LTH asks whether we can train a sparse subnetwork of some dense network from scratch and achieve the same level of accuracy
▪ But how do we identify the sparse subnetwork?
– Iterative magnitude pruning is one such way
The Lottery Ticket Hypothesis (LTH)
Flow 1: Initial Model → (Train) → Converged Model → (Prune) → Pruned Model → Retrain Model: unable to recover accuracy
Flow 2: Initial Model → (Train) → Converged Model → (Prune) → Pruned Model → Reinitialize to original values → Retrain Model: similar accuracy performance
Ref: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
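A schematic sketch of iterative magnitude pruning with rewinding to the original initialization, as in the second flow above (the training and masking helpers are placeholders):

```python
import copy
import torch

def find_lottery_ticket(model, train, global_magnitude_mask,
                        rounds=5, prune_per_round=0.2):
    """Iterative magnitude pruning: train, prune, rewind to the initial weights, repeat."""
    init_state = copy.deepcopy(model.state_dict())    # theta_0, kept for rewinding
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, mask)                             # train with the current mask applied
        mask = global_magnitude_mask(model, mask, prune_per_round)  # drop lowest |W|
        model.load_state_dict(init_state)              # rewind weights to theta_0
        for n, p in model.named_parameters():
            p.data.mul_(mask[n])                       # re-apply the pruning mask
    return model, mask
```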
32
▪ A deep neural network consists of many layers which hypothetically specialize at identifying different levels of patterns in the data
▪ How do different layers react to pruning? Can we prune different layers with the
same pruning ratio or are there some layers which are more sensitive to pruning
compared to others?
Pruning in Different Layers
Pruning different layers of VGG on CIFAR 10
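A sketch of how such a per-layer sensitivity plot can be produced (the model, pruning, and evaluation helpers are placeholders): each layer is pruned in isolation at increasing ratios and the resulting accuracy drop is recorded.

```python
import copy

def layer_sensitivity(model, prunable_layers, prune_layer, evaluate,
                      ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Accuracy drop vs. pruning ratio, pruning one layer at a time."""
    baseline = evaluate(model)
    results = {}
    for name in prunable_layers:
        results[name] = []
        for r in ratios:
            trial = copy.deepcopy(model)   # prune a copy so layers are tested in isolation
            prune_layer(trial, name, ratio=r)
            results[name].append((r, baseline - evaluate(trial)))  # accuracy drop
    return results
```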
33
▪ We see that some layers are more sensitive to pruning than others
▪ Layers deeper in the network can be pruned more without much loss in accuracy
compared to the initial layers
Pruning in Different Layers
34
▪ Different layers react differently to pruning in terms of loss in accuracy
▪ Should we choose only the layers with a lower loss in accuracy to prune?
▪ But do all layers have the same execution time?
▪ Pruning a layer with a low loss in accuracy may fail to yield much reduction in total execution time
▪ We need to strike a balance between the execution time saved and the loss in accuracy incurred
Pruning in Different Layers
35
▪ As we prune a layer, we reduce its number of parameters; this should result in a gradual decrease in its execution time
▪ But is this decrease in execution time linear?
How Does Execution Time Change with Pruning?
▪ It is observed that even for a uniformly increasing pruning ratio, the execution time does not decrease linearly but in a staircase pattern
36
▪ A non-linear decrease in execution time is observed as the layers are pruned
▪ The exact trend may differ from one execution platform to another
▪ When the execution time decreases in a staircase pattern, we want the pruned model to take advantage of the sharp drops in execution time
Platform Awareness in Pruning
▪ To get the benefits of pruning, we need to prune at or near the drop points
▪ Pruning in the flat regions won't translate to a reduction in execution time but may still result in a loss of accuracy
▪ A pruned model at point A or C is desirable, whereas a pruned model at a point between A and B is not
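A sketch of how the staircase can be observed empirically (the timing setup is illustrative and CPU-only; on a GPU one would synchronize or use CUDA events): the forward latency of a convolution is measured as its output-channel count shrinks, and pruning targets are then chosen at the measured drop points.

```python
import time
import torch
import torch.nn as nn

def conv_latency(out_channels, in_channels=64, hw=56, iters=20):
    """Average forward latency (seconds) of a 3x3 conv with the given width."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
    x = torch.randn(1, in_channels, hw, hw)
    with torch.no_grad():
        conv(x)                                   # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
    return (time.perf_counter() - start) / iters

# Latency tends to drop in steps rather than linearly as channels are removed.
for c in range(64, 0, -8):
    print(c, f"{conv_latency(c) * 1e3:.2f} ms")
```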
37
▪ FLOPs need to be reduced comparatively more to gain a similar benefit in execution time
▪ There are also irregularities in the speed-up trend that we may benefit from
Platform Awareness in Pruning
Δt represents the normalised speedup; Δf represents the normalised reduction in FLOPs
Q & A
THANK YOU