Accelerated Inferencing of a Pretrained Model on Edge IoT Devices
• Introduction
• Challenges
• Solutions
• Limitations
• Proposed Solution
• Results of Experimentation
• Conclusion
Introduction
• This work addresses one of the most complex problems in the AI domain: deploying pretrained models on edge devices
• Explains those challenges/problems
• Surveys the work done by the research community to address these challenges
• Identifies limitations and gaps in the existing solutions
Challenges in deploying pretrained models
• Hardware architectural differences between the processors the model was trained on and the edge device
• Smaller memory and less computational capacity
• Poor energy efficiency
• No source code available for the pretrained model
• No knowledge of how it was trained or of its hyperparameters
• Need to maintain accuracy as close as possible to that of the pretrained model
Pretrained AI Model: to be deployed on a resource-constrained edge device
Present research work (cluster of heterogeneous devices)
Literature work 1
[Diagram: the input test data (AI workload) is partitioned across a cluster of microcontrollers (MCUs) and a Raspberry Pi; the partial outputs are then combined and the results displayed.]
Disadvantages
• If one of the nodes fails, the whole system collapses
• The devices used are small microcontrollers (MCUs) and Raspberry Pis, which have few cores, low clock speeds, and little computational capacity
• Inference speed cannot match that of the pretrained model on its original hardware
• Inference takes more time and hence consumes more energy (power)
• This results in poor energy efficiency and a larger carbon footprint
Edge-Cloud Cooperation
• Disadvantages
- Although the cloud has high computational capacity, there is always a delay in exchanging data between edge and cloud
- The data rate is not constant, and the delay grows when the public network is congested
- The data is at risk because it is exchanged over a public network
[Diagram: Edge Device ↔ Public IP Network ↔ Remote Cloud]
Deploying the model on an FPGA
Advantages
• Able to deploy the model and improve inference speed
Disadvantages
• Supports only the specific AI models for which the FPGA is designed
• Cannot deploy other AI models
[Diagram: Pretrained Model → Convert to FPGA-specific format → Run on FPGA]
Deploying the model on GPU cores
Proposed Solution
• Reduce model size by reducing the precision (bit width) of the weights and biases
• Make the network simpler by reducing the number of layers in the CNN/DNN
• Run the model in parallel on hundreds of GPU cores
• Accelerate inference using parallel execution, CUDA Graphs, and batch processing
• Improve the model's processor occupancy using CUDA computing
• Achieve energy efficiency by making use of the DLA core
Proposed Solution
The pretrained model's size is reduced using the following optimization techniques:
• Using FP16 or INT8 precision instead of FP32
• Using layer fusion
The model is optimized for inference acceleration using:
• CUDA computing
• CUDA Graphs
• Batch processing
The model is optimized to achieve energy efficiency using:
• The DLA core
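Layer fusion can be illustrated with a minimal numpy sketch (hypothetical weights, not the deck's actual model): folding a batch-normalization layer into the preceding convolution's weights and bias, so two layers collapse into one.

```python
import numpy as np

# Hypothetical 1x1 conv over 4 channels (weights W, bias b), followed by batch norm.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)   # (out_ch, in_ch)
b = rng.standard_normal(4).astype(np.float32)

gamma = rng.standard_normal(4).astype(np.float32)    # BN scale
beta = rng.standard_normal(4).astype(np.float32)     # BN shift
mean = rng.standard_normal(4).astype(np.float32)     # BN running mean
var = rng.random(4).astype(np.float32) + 0.5         # BN running variance
eps = 1e-5

# Fold BN into the conv: W' = (gamma/sqrt(var+eps)) * W,
#                        b' = gamma*(b-mean)/sqrt(var+eps) + beta
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mean) + beta

x = rng.standard_normal(4).astype(np.float32)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + eps) + beta  # two layers
y_fused = W_fused @ x + b_fused                                   # one fused layer
assert np.allclose(y_ref, y_fused, atol=1e-5)
```

TensorRT performs fusions of this kind (and others, such as conv+bias+ReLU) automatically; the sketch only shows why a fused network is smaller and faster with unchanged outputs.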
How the model size is reduced
• The size of any CNN or DNN depends on the number of layers, the parameters in each layer, and the size of the weights and biases in each layer; the default size is 32-bit floating point (FP32)
• Reduce the precision (bit width) of the weights and biases
• Fuse CNN/DNN layers together to make the network simpler
• What if we reduce it to 16-bit floating point (FP16) or 8-bit integer (INT8)?
32 bits → 16 bits → 8 bits
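The memory saving from reducing precision can be sketched in a few lines of numpy (illustrative random weights, not the deck's actual model):

```python
import numpy as np

# A hypothetical weight tensor, stored at the default FP32 precision.
w_fp32 = np.random.default_rng(1).standard_normal(10_000).astype(np.float32)

# Half precision: 2x smaller.
w_fp16 = w_fp32.astype(np.float16)

# Symmetric INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.round(w_fp32 / scale).astype(np.int8)

print(w_fp32.nbytes, w_fp16.nbytes, w_int8.nbytes)  # 40000 20000 10000

# Dequantized values stay close to the originals (error bounded by scale/2).
err = np.abs(w_int8.astype(np.float32) * scale - w_fp32).max()
assert err <= scale / 2 + 1e-6
```

This is the 32 bits → 16 bits → 8 bits reduction above: a 2x, then 4x, shrink in weight storage, at the cost of bounded rounding error.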
Proposed Solution
• Reduce model size by reducing the precision (bit width) of the weights and biases
• Make the network simpler by reducing the number of layers in the CNN/DNN
• Run the model in parallel on hundreds of GPU cores
• Accelerate inference using parallel execution, CUDA Graphs, and batch processing
• Improve the model's processor occupancy using CUDA computing
• Achieve energy efficiency by making use of the DLA core
Create the TensorRT Builder
From the Builder, create the TensorRT Parser and Config components
• TensorRT Parser: imports the input pretrained model into a TensorRT Network
• TensorRT Config: holds the optimization input parameters (FP32, FP16, INT8, CUDA Graph, layer fusion, DLA core)
Create the TensorRT Engine (from the Network and the Config)
• Result: a TensorRT Engine for inference on the GPU
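The Builder → Parser/Config → Engine flow above can be sketched with the TensorRT Python API. This is a sketch under assumptions (TensorRT 8.x, a model already exported to ONNX; `onnx_path` is a placeholder), not the deck's exact build script:

```python
def build_engine(onnx_path, use_fp16=True, use_dla=False):
    """Sketch of the slide's Builder -> Parser/Config -> Engine flow."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)                        # 1. create the Builder
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)             # 2. Parser imports the model
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()             # 3. Config holds optimizations
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)            # reduced precision
    if use_dla:
        config.default_device_type = trt.DeviceType.DLA  # energy-efficient DLA core
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # fall back to GPU for
                                                         # layers the DLA can't run
    # 4. build the serialized engine from the network and config
    return builder.build_serialized_network(network, config)
```

Layer fusion happens automatically during the build; INT8 would additionally require a calibrator on the config, which is omitted here for brevity.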
NVIDIA Jetson Xavier Family GPU
• 6 CPU cores
• 384 GPU (CUDA) cores
• 6 streaming multiprocessors
• 40 Tensor Cores
• 1 DLA (Deep Learning Accelerator)
• Compute clock rate: 1.109 GHz
[Diagram: CPU (host) with its own memory and GPU (device) with its own memory. The input matrices are transferred from CPU (host) memory to GPU (device) memory while the GPU is idle; the GPU then performs the parallel execution while the CPU is idle; finally, the resultant matrix is transferred back from GPU (device) memory to CPU (host) memory.]
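The host → device → host sequence in the diagram can be sketched with PyCUDA (a sketch assuming PyCUDA and a CUDA-capable GPU; the vector-add kernel is an illustrative stand-in for the real inference workload):

```python
def gpu_vector_add(a, b):
    """Host->device transfer, parallel GPU execution, device->host transfer."""
    import numpy as np
    import pycuda.autoinit                 # initialize a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void add(float *out, float *a, float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }
    """)
    add = mod.get_function("add")

    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    out = np.empty_like(a)

    # Transfer contents from CPU (host) to GPU (device) memory
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    out_gpu = cuda.mem_alloc(out.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Perform parallel execution in the GPU (the CPU is idle here)
    n = np.int32(a.size)
    add(out_gpu, a_gpu, b_gpu, n,
        block=(256, 1, 1), grid=((a.size + 255) // 256, 1))

    # Transfer the result from GPU (device) back to CPU (host) memory
    cuda.memcpy_dtoh(out, out_gpu)
    return out
```

The two transfers bracketing the kernel launch are exactly the idle periods the diagram highlights; CUDA Graphs and batching reduce the per-launch overhead of this pattern.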
Results from the experiment

Model size (KB) at each batch size (BS):

Model       BS=1    BS=32   BS=64   BS=128  BS=256
CPU_FP32    55831   -       -       -       -
GPU_FP32    1715    1823    1823    1771    1768
GPU_FP16    877     919     917     919     917
GPU_INT8    487     533     537     532     538