Accelerated Inferencing of a Pretrained Model on Edge IoT Devices
• Introduction
• Challenges
• Solutions
• Limitations
• Proposed Solution
• Results of Experimentation
• Conclusion
Introduction
• This work addresses one of the most complex problems in the AI domain: deploying pretrained models on edge devices
• Explains those challenges/problems
• Surveys the work done by the research community to address these challenges
• Identifies limitations and gaps in the existing solutions
Challenges in deploying pretrained models
• Hardware architectural differences between the processors the model was trained on and the edge device
• Smaller memory and less computational capacity
• Poor energy efficiency
• No source code available for the pretrained model
• No knowledge of how it was trained or of its hyperparameters
• Need to maintain accuracy as close as possible to that of the pretrained model
Pretrained AI Model: to be deployed on a resource-constrained edge device
Present research work (cluster of heterogeneous devices)
Literature work 1
[Diagram: the input test data (AI workload) is partitioned across a cluster of microcontrollers (MCUs) and a Raspberry Pi; the partial outputs are then combined and the results displayed.]
Disadvantages
• If one of the nodes fails, the whole system collapses
• The devices used are small microcontrollers (MCUs) and Raspberry Pis, which have few cores, low clock speeds, and little computational capacity
• Inference speed cannot match that of the pretrained model on its original hardware
• Inference takes more time and hence consumes more energy (power)
• This results in poor energy efficiency and a larger carbon footprint
Edge-Cloud Cooperation
• Disadvantages
- Although the cloud has high computational capacity, there is always a delay in exchanging data between edge and cloud
- The data rate is not constant, and the delay grows when the public network is congested
- The data is at risk because it is exchanged over a public network
[Diagram: Edge Device ↔ Public IP Network ↔ Remote Cloud]
Deploying the model on an FPGA
Advantages
• Able to deploy the model and improve inference speed
Disadvantages
• Supports only the specific AI models for which the FPGA is designed
• Cannot deploy other AI models
[Diagram: Pretrained Model → Convert to FPGA-specific format → Run on FPGA]
Deploying the model on GPU cores
Proposed Solution
• Reduce model size by reducing the precision (bit width) of the weights and biases
• Make the network simpler by reducing the number of layers in the CNN/DNN
• Run the model in parallel on hundreds of GPU cores
• Accelerate inference using parallel execution, CUDA Graphs, and batch processing
• Improve the model's processor occupancy using CUDA computing
• Achieve energy efficiency by making use of the DLA core
Proposed Solution
The pretrained model's size is reduced using the following optimization techniques:
• Using FP16 or INT8 precision instead of FP32
• Using layer fusion
The model is optimized for inference acceleration using:
• CUDA computing
• CUDA Graphs
• Batch processing
The model is optimized to achieve energy efficiency using:
• The DLA core
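Layer fusion can be illustrated with a minimal numpy sketch (hypothetical weights, not the deck's actual model): folding a batch-normalization layer into the preceding convolution's weights and bias, so two layers collapse into one.

```python
import numpy as np

# Hypothetical 1x1 conv over 4 channels (weights W, bias b), followed by batch norm.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)   # (out_ch, in_ch)
b = rng.standard_normal(4).astype(np.float32)

gamma = rng.standard_normal(4).astype(np.float32)    # BN scale
beta = rng.standard_normal(4).astype(np.float32)     # BN shift
mean = rng.standard_normal(4).astype(np.float32)     # BN running mean
var = rng.random(4).astype(np.float32) + 0.5         # BN running variance
eps = 1e-5

# Fold BN into the conv: W' = (gamma/sqrt(var+eps)) * W,
#                        b' = gamma*(b-mean)/sqrt(var+eps) + beta
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mean) + beta

x = rng.standard_normal(4).astype(np.float32)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + eps) + beta  # two layers
y_fused = W_fused @ x + b_fused                                   # one fused layer
assert np.allclose(y_ref, y_fused, atol=1e-5)
```

TensorRT performs fusions of this kind (and others, such as conv+bias+ReLU) automatically; the sketch only shows why a fused network is smaller and faster with unchanged outputs.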
How the model size is reduced
• The size of any CNN or DNN depends on the number of layers, the parameters in each layer, and the size of the weights and biases in each layer; the default size is 32-bit floating point (FP32)
• Reduce the precision (bit width) of the weights and biases
• Fuse CNN/DNN layers together to make the network simpler
• What if we reduce it to 16-bit floating point (FP16) or 8-bit integer (INT8)?
32 bits → 16 bits → 8 bits
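The memory saving from reducing precision can be sketched in a few lines of numpy (illustrative random weights, not the deck's actual model):

```python
import numpy as np

# A hypothetical weight tensor, stored at the default FP32 precision.
w_fp32 = np.random.default_rng(1).standard_normal(10_000).astype(np.float32)

# Half precision: 2x smaller.
w_fp16 = w_fp32.astype(np.float16)

# Symmetric INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.round(w_fp32 / scale).astype(np.int8)

print(w_fp32.nbytes, w_fp16.nbytes, w_int8.nbytes)  # 40000 20000 10000

# Dequantized values stay close to the originals (error bounded by scale/2).
err = np.abs(w_int8.astype(np.float32) * scale - w_fp32).max()
assert err <= scale / 2 + 1e-6
```

This is the 32 bits → 16 bits → 8 bits reduction above: a 2x, then 4x, shrink in weight storage, at the cost of bounded rounding error.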
Proposed Solution
• Reduce model size by reducing the precision (bit width) of the weights and biases
• Make the network simpler by reducing the number of layers in the CNN/DNN
• Run the model in parallel on hundreds of GPU cores
• Accelerate inference using parallel execution, CUDA Graphs, and batch processing
• Improve the model's processor occupancy using CUDA computing
• Achieve energy efficiency by making use of the DLA core
Create the TensorRT Builder
From the Builder, create the TensorRT Parser and Config components
• TensorRT Parser: imports the input pretrained model into a TensorRT Network
• TensorRT Config: holds the optimization input parameters (FP32, FP16, INT8, CUDA Graph, layer fusion, DLA core)
Create the TensorRT Engine (from the Network and the Config)
• Result: a TensorRT Engine for inference on the GPU
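The Builder → Parser/Config → Engine flow above can be sketched with the TensorRT Python API. This is a sketch under assumptions (TensorRT 8.x, a model already exported to ONNX; `onnx_path` is a placeholder), not the deck's exact build script:

```python
def build_engine(onnx_path, use_fp16=True, use_dla=False):
    """Sketch of the slide's Builder -> Parser/Config -> Engine flow."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)                        # 1. create the Builder
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)             # 2. Parser imports the model
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()             # 3. Config holds optimizations
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)            # reduced precision
    if use_dla:
        config.default_device_type = trt.DeviceType.DLA  # energy-efficient DLA core
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # fall back to GPU for
                                                         # layers the DLA can't run
    # 4. build the serialized engine from the network and config
    return builder.build_serialized_network(network, config)
```

Layer fusion happens automatically during the build; INT8 would additionally require a calibrator on the config, which is omitted here for brevity.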
NVIDIA Jetson Xavier Family GPU
• 6 CPU cores
• 384 GPU (CUDA) cores
• 6 streaming multiprocessors
• 40 Tensor Cores
• 1 DLA (Deep Learning Accelerator)
• Compute clock rate: 1.109 GHz
[Diagram: CPU (host) with its own memory and GPU (device) with its own memory. The input matrices are transferred from CPU (host) memory to GPU (device) memory while the GPU is idle; the GPU then performs the parallel execution while the CPU is idle; finally, the resultant matrix is transferred back from GPU (device) memory to CPU (host) memory.]
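The host → device → host sequence in the diagram can be sketched with PyCUDA (a sketch assuming PyCUDA and a CUDA-capable GPU; the vector-add kernel is an illustrative stand-in for the real inference workload):

```python
def gpu_vector_add(a, b):
    """Host->device transfer, parallel GPU execution, device->host transfer."""
    import numpy as np
    import pycuda.autoinit                 # initialize a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void add(float *out, float *a, float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }
    """)
    add = mod.get_function("add")

    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    out = np.empty_like(a)

    # Transfer contents from CPU (host) to GPU (device) memory
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    out_gpu = cuda.mem_alloc(out.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Perform parallel execution in the GPU (the CPU is idle here)
    n = np.int32(a.size)
    add(out_gpu, a_gpu, b_gpu, n,
        block=(256, 1, 1), grid=((a.size + 255) // 256, 1))

    # Transfer the result from GPU (device) back to CPU (host) memory
    cuda.memcpy_dtoh(out, out_gpu)
    return out
```

The two transfers bracketing the kernel launch are exactly the idle periods the diagram highlights; CUDA Graphs and batching reduce the per-launch overhead of this pattern.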
Results from the experiment

Model size (KB) at each batch size (BS):

Model       BS=1    BS=32   BS=64   BS=128  BS=256
CPU_FP32    55831   -       -       -       -
GPU_FP32    1715    1823    1823    1771    1768
GPU_FP16    877     919     917     919     917
GPU_INT8    487     533     537     532     538