Visual Semantics
DATE : 5/2/21
Machine Vision on Embedded Hardware
YoloV3
 53 convolutional layers (Darknet-53 backbone)
 Multi-label classification with independent logistic classifiers instead of softmax
 3 anchor boxes per scale (9 in total across the 3 scales)
 Uses a ResNet-style structure with residual connections.
 Darknet-53 has similar performance to ResNet-152 and is 2× faster
 No hard negative mining (worth looking into further)
 Multi-scale prediction (boxes are predicted at three scales), which makes it
better for smaller objects but worse for medium and large sized objects.
 Focal loss didn’t work (the paper reports that it lowered mAP), which suggests
training should weight examples uniformly.
 IoU thresholding can be explored. It uses a single threshold (0.5).
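The difference between softmax and independent logistic classifiers can be sketched in a few lines. This is a minimal illustration, not the Darknet implementation; the logits and the three label names are hypothetical.

```python
import math

def softmax(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def independent_logistic(logits):
    """One sigmoid per class; scores are independent and need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical logits for one box over the labels ("person", "woman", "car").
logits = [3.0, 2.5, -4.0]
soft = softmax(logits)
sig = independent_logistic(logits)

# With a 0.5 cut-off the sigmoid scores keep both overlapping labels,
# while softmax makes them compete for the same probability mass.
labels_sigmoid = [i for i, p in enumerate(sig) if p > 0.5]
labels_softmax = [i for i, p in enumerate(soft) if p > 0.5]
print(labels_sigmoid, labels_softmax)
```

Independent scores matter when labels overlap (for example "person" and "woman"), which softmax cannot express.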
Paper 2: Embedded System based NN
 Proper System design. Some examples include MobileNets [8], Single-Shot
Detectors (SSD) [9], Yolo [10], and SqueezeNet [11], with the state of the art
that is evolving rapidly. We consider Yolov3 and Yolov3-tiny [12],
Mobilenetv2-SSDLite [13], Centernet-Resnet101 and Centernet-DLA34 [14],
designed to achieve high throughput.
 There are several popular model compression methods: parameter
quantization, parameter pruning, and knowledge distillation. In this paper, we
use quantization such as half floating-point precision (FP16) or INT8 inference
 Target platforms include FPGA boards, ASIC designs and x86 CPUs, the latter
providing a simple sequential flow of instructions.
 Compared to the reference results on the GPU, they notice that in most
cases inference speed, as well as accuracy, is higher on the GPU than on the FPGA.
However, the FPGA always achieves higher power efficiency than the GPU.
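To make the FP16 and INT8 options concrete, here is a minimal sketch of both conversions on a handful of weights. It assumes symmetric per-tensor INT8 quantization with the scale taken from the largest absolute value; real toolchains typically calibrate scales on data, and the weight values are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: q = round(x / scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [v * scale for v in q]

weights = [0.02, -0.75, 0.31, 1.27]  # hypothetical layer weights
fp16 = [to_fp16(w) for w in weights]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

for w, h, r in zip(weights, fp16, restored):
    print(f"fp32={w:+.4f}  fp16={h:+.6f}  int8->fp32={r:+.4f}")
```

FP16 keeps a floating-point format with less precision, while INT8 stores integer codes plus one scale factor, so its rounding error is bounded by half the scale.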
 FP16 was the best data type.
 Yolov3-tiny was best in terms of power and latency, and CenterNet was the most accurate.
 MobileNets were affected by confidence intervals.
 On all the hardware, some of the layers had to be coded by hand to optimize
them. For example, mathematical operations like softmax had to be done on
the PL (programmable logic) instead of the PS (processing system) and DPUs.
 CenterNet [14] proposes modeling an object as a single point. It uses keypoint
estimation to find center points and regresses all other object properties
including 3D location, pose orientation, and size. In this model, an image is fed
to a CNN which generates a heatmap, whose maximum values represent the
centers of the objects in the image. The objects’ size and pose are regressed
from features of the image at the center location. CenterNet was tested with
four different backbones, i.e. ResNet18, ResNet101, DLA34 and Hourglass,
substituting the convolutional layers with deformable convolutional layers v2
[39]. Deformable convolutional networks (DCN) [40] are detectors able to
adapt to the geometric variations of objects. Regular convolutional networks
can only focus on features of fixed square size (according to the kernel), thus
the receptive field does not properly cover each pixel of a target object to
represent it. A DCN produces a deformable kernel, and the offset from the
initial convolution kernel (of fixed size) is learned during training.
https://github.com/xingyizhou/CenterNet
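The "maximum values of the heatmap are the object centers" step can be sketched as a local-maximum search. This is a simplified illustration on a toy single-class heatmap, not code from the CenterNet repository; the 3x3 neighbourhood and the 0.3 score threshold are assumptions.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are the maximum of their
    3x3 neighbourhood and reach the (hypothetical) score threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # The neighbourhood includes the cell itself, clipped at the border.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score >= max(neighbours):
                peaks.append((r, c, score))
    return peaks

# Toy heatmap with two objects.
heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.3, 0.8, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.1],
]
centers = heatmap_peaks(heatmap)
print(centers)
```

Size and pose would then be read from the regression outputs at each returned (row, col) location.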
 The execution time of a method can be divided into 3 phases: (i) pre-processing
to convert the image into the NN input, (ii) NN inference, (iii) post-processing
to convert the output of the NN into BBs.
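A minimal sketch of how the three phases could be timed separately. The `preprocess`, `infer` and `postprocess` functions are hypothetical stand-ins for a real pipeline, and the box format (x1, y1, x2, y2, score) is an assumption.

```python
import time

def time_phases(preprocess, infer, postprocess, image):
    """Run the three phases once and return (boxes, seconds per phase)."""
    t0 = time.perf_counter()
    net_input = preprocess(image)
    t1 = time.perf_counter()
    raw_output = infer(net_input)
    t2 = time.perf_counter()
    boxes = postprocess(raw_output)
    t3 = time.perf_counter()
    timings = {"pre": t1 - t0, "inference": t2 - t1, "post": t3 - t2}
    return boxes, timings

# Hypothetical stand-ins for a real detector pipeline.
def preprocess(image):
    # Scale 8-bit pixel values to [0, 1].
    return [[p / 255.0 for p in row] for row in image]

def infer(net_input):
    # Pretend the network returns two candidate boxes with scores.
    return [(0.1, 0.2, 0.6, 0.8, 0.9), (0.3, 0.3, 0.4, 0.4, 0.2)]

def postprocess(raw_output):
    # Keep boxes whose score exceeds 0.5.
    return [b for b in raw_output if b[4] > 0.5]

image = [[0, 128, 255], [64, 32, 16]]
boxes, timings = time_phases(preprocess, infer, postprocess, image)
print(boxes, timings)
```

Reporting the phases separately shows whether an accelerator actually helps, since pre- and post-processing often run on the CPU.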
MobileNets
 MobileNetv2 [13] is an efficient CNN model with depthwise convolution layers, which
have fewer weights than normal convolution layers. It is one of the most used
models for embedded systems because it is lightweight and can achieve high FPS even
on mobile devices.
 It takes a 192x192 color bitmap as input
 It can identify 1000 different types of objects
 It makes the correct prediction 70% of the time
 Its internal state is represented by 32-bit floating-point numbers
 It is 16.9 MB in size
 We use less regularization and fewer data augmentation techniques because small models
have less trouble with overfitting. When training MobileNets we do not use side heads
or label smoothing and additionally reduce the amount of image distortions by limiting
the size of small crops.
 Need a backbone architecture.
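The claim that depthwise convolutions have fewer weights can be checked with simple arithmetic. The sketch below compares a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution; the layer sizes are hypothetical and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1x1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
print(standard, separable, round(standard / separable, 2))
```

For this layer the separable version needs 33,920 weights instead of 294,912, a reduction of roughly 8.7 times.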
SSD
 SSD [9] is a one-stage detector which divides images into grid cells, and
for each grid cell, uses a pre-generated set of anchors with multiple scales
and aspect-ratios to discretize the output space of BBs. SSD predicts
objects on multiple feature maps, and each of them is responsible for
detecting a certain scale of objects, according to its receptive fields.
 For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS
on an Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP,
outperforming a comparable state-of-the-art Faster R-CNN model.
 SSD is just a detection architecture, so it needs a backbone.
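The pre-generated anchors described above can be sketched for a single feature map. This is an illustrative simplification, not the SSD reference code: one scale per feature map, anchors in normalised (cx, cy, w, h) form, and a 0.5 IoU cut-off for matching a hypothetical ground-truth box.

```python
import math

def make_anchors(grid_size, scale, aspect_ratios):
    """Anchors as (cx, cy, w, h) in normalised [0, 1] image coordinates."""
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                anchors.append((cx, cy, w, h))
    return anchors

def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A 4x4 feature map with 3 aspect ratios gives 4 * 4 * 3 = 48 anchors.
anchors = make_anchors(grid_size=4, scale=0.3, aspect_ratios=[1.0, 2.0, 0.5])

# Hypothetical ground-truth box; anchors above the IoU cut-off are positives.
ground_truth = (0.4, 0.4, 0.3, 0.3)
matches = [a for a in anchors if iou(a, ground_truth) > 0.5]
print(len(anchors), matches)
```

Repeating this on coarser feature maps with larger scales is what lets each map handle a different object size.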
Comparison
 YoloV3: 33.0 mAP at 608 × 608 input and 28.2 mAP at 320 × 320 input
My Idea
 Try something similar to ResNets, but making use of different activation
functions. This could help when detecting different properties of an image.
 Use sequential data from the video and stitch it together to get meaningful
data with a smaller network.
 Keep some kind of memory effect, because the sign boards are not very
accurate in showing the directions, i.e. image localization.
 For semantics we could use a cognitive approach, i.e. it would actively learn
and detect the next action. For example: Sign + Stop = Stop + Red => Red is
for stopping.

Editor's Notes

  • #4: Yolov3 [12] is a one-stage detector which divides images into grid cells and predicts BBs using dimension clusters as anchor boxes. It adopts independent logistic classifiers to output an object score for each BB. The BBs are predicted at three different scales by extracting features from these scales. Yolov3 uses a backbone network, named Darknet-53, for feature extraction, which is a residual network with 53 convolutional layers. Due to the introduction of Darknet-53 and multi-scale feature maps, Yolov3 achieves a great speed improvement and improves the detection accuracy of small-sized objects when compared with Yolov2.
  • #5: If you are using Nvidia hardware, use the TensorRT framework, and also look for strategies specific to our board, which has GPGPUs. The other FPGA board, from Xilinx, has a DPU with special engines for convolution and pooling.