Machine Vision on Embedded Hardware

Visual Semantics
DATE : 5/2/21

YoloV3
 53 Layers
 Multi-class classification as compared to SoftMax
 3 anchor boxes
 Uses ResNet kind of structure.
 Darknet-53 has similar performance to ResNet-152 and is 2× faster.
 No hard negative mining (worth looking into further).
 Multi-scale prediction makes it better for small objects but comparatively
worse for medium and large objects.
 Focal loss didn't work, which suggests the training has to be uniform.
 IoU thresholding can be explored. It uses a single IoU threshold (0.5).
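
Two of the points above, independent logistic scores instead of softmax and the single 0.5 IoU threshold, can be illustrated with a minimal sketch. The logits, class names, and boxes below are hypothetical values chosen for illustration, and the functions are not taken from the Darknet code base.

```python
import math

def softmax(logits):
    """Softmax scores compete with each other and always sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_scores(logits):
    """Independent logistic classifiers: each class is scored on its own."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) corner boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

IOU_THRESHOLD = 0.5  # the single threshold mentioned in the notes

# Hypothetical logits for overlapping labels, e.g. ("person", "woman", "car").
logits = [2.0, 1.5, -3.0]
multi_label = [score > 0.5 for score in sigmoid_scores(logits)]

# Two hypothetical boxes that share half of their width.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
is_match = overlap >= IOU_THRESHOLD
```

With sigmoid scores, both of the first two labels can be active at once, which softmax cannot express because its scores compete. Exploring other IoU thresholds would only require changing `IOU_THRESHOLD`.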

Paper 2: Embedded System Based NN
 Proper system design is needed. Example architectures include MobileNets [8],
Single-Shot Detectors (SSD) [9], Yolo [10], and SqueezeNet [11], and the state
of the art is evolving rapidly. The paper considers Yolov3 and Yolov3-tiny [12],
Mobilenetv2-SSDLite [13], Centernet-Resnet101, and Centernet-DLA34 [14], all
designed to achieve high throughput.
 There are several popular model compression methods: parameter
quantization, parameter pruning, and knowledge distillation. The paper uses
quantization, i.e. half-precision floating-point (FP16) or INT8 inference.
 Platforms include FPGA boards, ASIC designs, and x86 processors providing a
simple sequential flow of instructions.
 Compared with the reference results on the GPU, the authors note that in most
cases both inference speed and accuracy are higher on the GPU than on the FPGA.
However, the FPGA always achieves higher power efficiency than the GPU.
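
The quantization options mentioned above can be sketched as follows. This is a minimal illustration that assumes symmetric per-tensor INT8 quantization, which is one common scheme and not necessarily the one used in the paper or by any vendor toolchain. The weight values are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: q = round(x / scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error introduced by each data type on these weights.
max_err_int8 = max(abs(w - r) for w, r in zip(weights, restored))
max_err_fp16 = max(abs(w - to_fp16(w)) for w in weights)
```

On this toy tensor the smallest weight collapses to zero under INT8, while FP16 keeps it with a much smaller error. This is the kind of accuracy loss that has to be weighed against the lower cost of INT8 inference.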

 FP16 was the best data type.
 Yolov3-tiny was best in terms of power and latency, and CenterNet was the
most accurate.
 MobileNets were affected by confidence intervals.
 On all hardware platforms, some layers had to be hand-coded to optimize
them. For example, mathematical operations such as softmax had to be done on
the PL (programmable logic) instead of the PS (processing system) and DPUs.
 CenterNet [14] models an object as a single point. It uses keypoint
estimation to find center points and regresses all other object properties,
including 3D location, pose orientation, and size. In this model, an image is fed
to a CNN that generates a heatmap whose peaks represent the centers of the
objects in the image. Each object's size and pose are regressed from the image
features at the center location. CenterNet was tested with four different
backbones, i.e. ResNet18, ResNet101, DLA34, and Hourglass, substituting the
convolutional layers with deformable convolutional layers v2 [39]. Deformable
convolutional networks (DCN) [40] are able to adapt to the geometric variations
of objects. Regular convolutions can only sample features on a fixed square
grid (set by the kernel size), so the receptive field does not properly cover
every pixel of a target object. A DCN instead uses a deformable kernel whose
offsets from the initial fixed-size convolution kernel are learned during
training.
https://github.com/xingyizhou/CenterNet
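
The step from heatmap to object centers described above can be sketched as a search for local maxima. This is a pure-Python illustration and not code from the CenterNet repository. The function name `find_centers`, the 0.3 threshold, and the toy heatmap are assumptions made for the example.

```python
def find_centers(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are 3x3 local maxima.

    Cells on a flat plateau are all reported, since ties count as maxima.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the up-to-8 neighbours, clipped at the image border.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    # Strongest detections first.
    return sorted(peaks, key=lambda p: -p[2])

# Toy single-class heatmap containing two objects.
heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.3, 0.6],
    [0.0, 0.0, 0.1, 0.2, 0.3],
]
centers = find_centers(heatmap)
```

In a full detector, the size and pose outputs would then be read at each returned `(row, col)` position. That lookup is omitted here.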

 The execution time of a method can be divided into 3 phases: (i)
pre-processing to convert the image into the NN input, (ii) NN inference, and
(iii) post-processing to convert the NN output into bounding boxes (BBs).
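
The three-phase breakdown can be measured with a small timing harness like the sketch below. The `preprocess`, `infer`, and `postprocess` functions are hypothetical placeholders standing in for a real pipeline. Only the timing structure is the point.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def preprocess(image):
    """Placeholder pre-processing: scale 8-bit pixels to [0, 1]."""
    return [[p / 255.0 for p in row] for row in image]

def infer(tensor):
    """Placeholder for NN inference: one score per input row."""
    return [max(row) for row in tensor]

def postprocess(outputs):
    """Placeholder post-processing: keep confident scores as dummy boxes."""
    return [(i, 0, i + 1, 1, s) for i, s in enumerate(outputs) if s > 0.5]

def profile(image):
    """Time each phase separately and report the total."""
    timings = {}
    x, timings["pre"] = timed(preprocess, image)
    y, timings["inference"] = timed(infer, x)
    boxes, timings["post"] = timed(postprocess, y)
    timings["total"] = timings["pre"] + timings["inference"] + timings["post"]
    return boxes, timings

boxes, timings = profile([[0, 255, 10], [20, 30, 40]])
```

Reporting the phases separately matters on embedded targets because pre- and post-processing often run on a different processor than the inference itself.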

MobileNets
 MobileNetv2 [13] is an efficient CNN model built on depthwise convolution layers, which
have fewer weights than standard convolution layers. It is one of the most widely used
models for embedded systems because it is lightweight and can achieve high FPS even
on mobile devices.
 It takes a 192x192 color bitmap as input
 It can identify 1000 different types of objects
 It makes the correct prediction 70% of the time
 Its internal state is represented by 32-bit floating-point numbers
 It is 16.9 MB in size
 Less regularization and data augmentation are used because small models
have less trouble with overfitting. When training MobileNets, side heads and label
smoothing are not used, and the amount of image distortion is reduced by limiting
the size of small crops.
 Need a backbone architecture.
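
The claim that depthwise convolution layers have fewer weights can be checked with simple arithmetic. The sketch below compares a standard convolution with a depthwise separable one (a depthwise convolution followed by a 1×1 pointwise convolution, as used in MobileNets). Bias terms are ignored, and the layer sizes are hypothetical.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: k * k * C_in * C_out."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise (k * k * C_in) plus 1x1 pointwise (C_in * C_out) weights."""
    return kernel * kernel * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)          # 18432 weights
separable = depthwise_separable_params(3, 32, 64)   # 2336 weights
ratio = separable / standard                        # equals 1/64 + 1/9
```

For this layer the separable version needs roughly one eighth of the weights. In general the ratio is 1/C_out + 1/k².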

SSD
 SSD [9] is a one-stage detector that divides images into grid cells and,
for each grid cell, uses a pre-generated set of anchors with multiple scales
and aspect ratios to discretize the output space of BBs. SSD predicts
objects on multiple feature maps, each of which is responsible for
detecting objects of a certain scale, according to its receptive field.
 For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS
on an Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP,
outperforming a comparable state-of-the-art Faster R-CNN model.
 SSD is only a detection architecture, so it needs a backbone network.
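
The pre-generated anchors described above can be sketched for a single square feature map as follows. The width and height formulas (w = s·√ar, h = s/√ar) follow the SSD default-box definition. The feature map sizes, scales, and aspect ratios are hypothetical, and the extra intermediate-scale box that SSD adds for aspect ratio 1 is left out for brevity.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Generate (cx, cy, w, h) anchors, normalized to [0, 1] image units."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Anchor centers sit in the middle of each grid cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

ratios = [1.0, 2.0, 0.5]
# A coarse map with large anchors and a fine map with small anchors.
coarse = default_boxes(2, 0.6, ratios)
fine = default_boxes(4, 0.2, ratios)
```

Each aspect ratio keeps the anchor area at `scale ** 2`, so a feature map is tied to one object scale while still covering differently shaped objects.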

Comparison
 YoloV3: 33 mAP at 608 × 608 input and 28.2 mAP at 320 × 320 input.

My Idea
 Try something similar to ResNets, but with different activation functions,
which could help when detecting different properties of an image.
 Use sequential data from the video and stitch it together to get meaningful
data with a smaller network.
 Keep some kind of memory effect, because the sign boards are not very
accurate in showing the directions, i.e. image localization.
 For semantics, we could use a cognitive approach, i.e. the system would
actively learn and detect the next action. For example: Sign + Stop = Stop +
Red => Red is for stopping.
