Yolo v2 ai_tech_20190421

陳穗碧(Mora Chen)
20190421@AI Tech
YOLO v2
Better, Faster, Stronger

YOLO-one-stage object detection
• Object Detection = Feature Extraction + Image Classification + Object Localization
• YOLO series
  - YOLO v1 [2015]
  - YOLO v2, YOLO9000 [2016/12]
  - YOLO v3 [2018]
  - Spiking-YOLO: Spiking Neural Network for Real-time Object Detection [2019], accuracy 97%

What Is Covered
• YOLOv1 review
• YOLOv2 Improvements Over YOLOv1 (Better)
• YOLOv2 Using Darknet-19 (Faster)
• YOLO9000 by WordTree (Stronger)
• Deep Learning on Event Detection for River Surveillance in Taiwan

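Three key steps in YOLO
1. Resize the input image to 448*448
2. Run a single convolutional neural network
3. Threshold the model's output confidence and apply non-max suppression to get the detections
The activation function is Leaky ReLU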

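Output tensor dimension: S × S × (B × 5 + C)
• Figure labels: S × S grid, B = 2 boxes per cell, 5 values per box, C = 20 classes
• Each bounding box has 5 values: center (x, y), width, height, and confidence. The bbox center is in 0~1 and the bbox width and height are in 0~1
• Box confidence: $C = P(\text{object}) \times \text{IOU}$
• Each box has a probability for each specific class: $p(C) = p(\text{Class}_i \mid \text{object})$
• Class-specific confidence: $p(\text{Class}_i) \times \text{IOU} = p(\text{Class}_i \mid \text{object}) \times P(\text{object}) \times \text{IOU}$
Model assumption: each grid cell is responsible for detecting one object, no matter how many bounding boxes it has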
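A minimal sketch of how the output tensor above can be sliced, assuming S = 7 (the value used in the YOLOv1 paper; the slide only gives B = 2 and C = 20) and a random array standing in for a real network output. The variable names are illustrative only.

```python
import numpy as np

S, B, C = 7, 2, 20            # S = 7 is an assumption taken from the YOLOv1 paper
depth = B * 5 + C             # each box carries x, y, w, h, confidence
output_shape = (S, S, depth)  # 7 x 7 x 30

rng = np.random.default_rng(0)
pred = rng.random(output_shape)        # random stand-in for a network output

# Split one cell's vector into its B boxes and its shared class probabilities.
cell = pred[3, 3]
boxes = cell[: B * 5].reshape(B, 5)    # rows: (x, y, w, h, confidence)
class_probs = cell[B * 5:]             # p(Class_i | object), one set per cell

# Class-specific confidence per box: p(Class_i | object) * confidence
class_scores = boxes[:, 4:5] * class_probs[np.newaxis, :]   # shape (B, C)
print(output_shape, class_scores.shape)
```

Because the class probabilities are stored once per cell, both boxes in a cell share them, which is the reason a cell can only report one kind of object.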

Loss function
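• $\lambda_{\text{noobj}} = 0.5$: when a cell contains no object the target is C = 0; a predicted C ≠ 0 means the model wrongly took background for an object. We do not want the model to dwell on these cases, so λ < 1
• $\mathbb{1}_{ij}^{\text{obj}}$: cell i mainly uses box j to predict the object
• $\lambda_{\text{coord}} = 5$: we want the predicted center to be as close as possible to the true center
• $\mathbb{1}_{i}^{\text{obj}}$: cell i contains an object
• Q: Cells that contain an object are the minority, so is there an imbalance problem?
• A large object is more likely than a small object to produce the same absolute error, so the loss takes the square root of the width and height to compensate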
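The effect of the square root can be checked with a small worked example. This is only a sketch: the box sizes are made-up values in normalized image units, and `size_error` is a hypothetical helper, not part of any YOLO implementation.

```python
import math

def size_error(true_size, pred_size, use_sqrt):
    """Squared error on one box dimension, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2

# Both boxes are off by the same absolute amount (0.05).
small_plain = size_error(0.10, 0.15, use_sqrt=False)
large_plain = size_error(0.80, 0.85, use_sqrt=False)
small_sqrt = size_error(0.10, 0.15, use_sqrt=True)
large_sqrt = size_error(0.80, 0.85, use_sqrt=True)

print(f"plain: small={small_plain:.5f} large={large_plain:.5f}")
print(f"sqrt:  small={small_sqrt:.5f} large={large_sqrt:.5f}")
```

Without the square root the two errors are equal; with it, the small box is penalized more than the large one for the same absolute deviation.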


Non-max Suppression (NMS)
Paper: Improving Object Detection With One Line of Code
Purpose: remove redundant bounding boxes and keep the best one
Score: $p(\text{Class}_i) \times \text{IOU}$
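A minimal sketch of classic (hard) greedy NMS, which is the baseline that the cited paper modifies; it is not the soft variant from that paper. Boxes are assumed to be `(x1, y1, x2, y2)` corners, and the 0.5 threshold and the sample boxes are illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],       # near-duplicate of the first box
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IOU of about 0.82, so it is suppressed, while the distant third box survives.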

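Limitation of YOLOv1
• Although YOLOv1 outputs B bounding boxes per cell, there is only one set of class probabilities per cell, so each cell can predict only one kind of object. If the centers of several larger objects fall in the same cell, the prediction may be wrong. (This limitation is removed in YOLOv2.)
• Non-max Suppression: if several objects overlap heavily, NMS may treat them as the same object.
• Very small objects are harder to detect.
• Compared with Fast R-CNN, YOLO makes more bounding box localization errors.
• Compared with detection systems based on region proposals, YOLO has lower recall.
• If the objects in the image have unusual aspect ratios or arrangements (uncommon in the training data), YOLO does not generalize well and tends to make wrong predictions.
• The CNN classifier architecture contains several max-pooling layers, which cause some features to be lost.
• Although term (2) of the loss function accounts for the same error mattering less for larger boxes, the IoU computation cannot be treated in the same way.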

YOLO v2
Better, Faster, Stronger

How YOLOv2 gets Better
YOLO V2
  - 1. Batch Normalization
  - 2. High Resolution Classifier
  - 3. Convolution with Anchor Boxes
  - 4. Dimension Clusters
  - 5. Direct location prediction
  - 6. Fine-Grained Features
  - 7. Multi-Scale Training

YOLOv21.Batch Normalization
14
dropout
• Dropout通常放在激活函數之後
• 卷基層訓練參數本來就少，Dropout應用在卷基層中，降低模型參數效果有限
• 目前網絡中，全連接層被全局平均持化層(Global Average Pooling)所取代，不但能降低模型尺
寸，還可以提升性能，因此，漸漸大家都不採用Dropout

YOLOv21.Batch Normalization
15
What : dropout batch normalization
How : adding batch normalization on all of the convolution layers
Why : without overfitting+加速訓練的速度+避免梯度消失
Value : 2% mAP
Batch Normalizationdropout
Sigmoid
derivative
Near zero
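A minimal NumPy sketch of what batch normalization does to the output of a convolutional layer, in training mode only (batch statistics, no running averages). The tensor layout `(N, H, W, C)`, the Leaky ReLU slope of 0.1, and the random stand-in for a convolution output are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize an (N, H, W, C) batch per channel, then scale and shift."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the 0.1 slope is an assumed value."""
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
# Stand-in for a convolution output with a large offset and spread.
conv_out = rng.normal(loc=3.0, scale=4.0, size=(8, 13, 13, 16))

normed = batch_norm(conv_out, gamma=np.ones(16), beta=np.zeros(16))
activated = leaky_relu(normed)
print(normed.mean(), normed.std())
```

After normalization each channel has roughly zero mean and unit variance, which keeps activations away from the saturated regions where gradients vanish.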

YOLOv22.Convolution with Anchor Boxes
16
Anchor Boxesbounding boxes
一個格子預
測一個物體
一個格式預
測多個物體
（grid cell，anchor box）grid cell
Q:選擇anchor box 數量? 5到10個anchor box 越多涵蓋比例種類越多
Q:選擇anchor box比例?
anchor boxes 的尺寸該怎麼挑選? Faster RCNN是人為選定(1:1,2:1,1:2)
 為了處理兩個對象出現在同一個格子的情況
YoloV2 (5 anchors)
正確地調整Anchor Boxes1比例可以大大提高模型檢測某些位置大小和形狀的對象能力

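bbox → Anchor Boxes
• Each anchor box has a probability for each specific class
$p(\text{Class}_i) \times \text{IOU} = p(\text{Class}_i \mid \text{object}) \times P(\text{object}) \times \text{IOU}$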

448 × 448 → 416 × 416
Odd-numbered spatial dimension
What: 448 → 416
How: Shrink the network to operate on 416 × 416 input images
Why: Large objects tend to occupy the center of the image, so it's good to have a single location right at the center to predict these objects.
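The arithmetic behind the odd grid can be sketched as follows, assuming the network downsamples by a total factor of 32 (consistent with the 13x13 grid and the multiples of 32 mentioned elsewhere in these slides). `grid_size` is a hypothetical helper.

```python
STRIDE = 32  # assumed total downsampling factor of the network

def grid_size(input_size, stride=STRIDE):
    """Side length of the final feature map for a square input."""
    return input_size // stride

for size in (448, 416):
    g = grid_size(size)
    parity = "odd, has a single center cell" if g % 2 else "even, no single center cell"
    print(f"{size} x {size} input -> {g} x {g} grid ({parity})")
```

A 448 input gives a 14 × 14 grid with four cells around the center, while a 416 input gives a 13 × 13 grid with exactly one center cell.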

YOLOv24.Dimensional Clusters
19
Q:網絡自身可以學着不斷調節box的大小,給一個好的Anchor boxes的
真的會比較好嗎?

YOLOv24.Dimensional Clusters Step1: 先決定群組數k
Step2:決定群中心，
Step3:重複做圖3–5
Step4:直到群心不太變動(收斂)Hand picked
K-means
clustering
參考資料來源:

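Hand picked → K-means clustering
Q: How should the distance between clusters be defined?
Q: Standard k-means uses Euclidean distance; is it suitable here?
• Euclidean distance makes larger boxes generate more error than smaller boxes.
• What we really want are priors that lead to good IOU scores, which is independent of the size of the box.
• We choose k = 5 as a good tradeoff between model complexity and high recall.
• Prior information: the learning algorithm incorporates information from the application scenario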
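A minimal sketch of k-means over box sizes with the distance d(box, centroid) = 1 − IOU(box, centroid) that the YOLOv2 paper uses in place of Euclidean distance. The box data is synthetic, the mean-based centroid update and the function names are assumptions, and real anchors would come from the (width, height) of the training set's ground-truth boxes.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (N, 2) boxes and (k, 2) centroids given as (w, h),
    with every box aligned at the same corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    box_area = (boxes[:, 0] * boxes[:, 1])[:, None]
    cen_area = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (box_area + cen_area - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k and initialise the centers from random boxes.
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each box to the nearest center under d = 1 - IOU.
        assign = (1.0 - iou_wh(boxes, centroids)).argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

rng = np.random.default_rng(1)
boxes = rng.uniform(0.05, 1.0, size=(200, 2))   # synthetic (w, h) pairs
centroids = kmeans_anchors(boxes, k=5)
avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()
print(centroids, avg_iou)
```

The average best IOU printed at the end is the quantity the paper trades off against k when it settles on k = 5.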

Hand picked → K-means clustering
This indicates that using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.
The cluster centroids are significantly different than hand-picked anchor boxes.
There are fewer short, wide boxes and more tall, thin boxes.

YOLOv25.Direct location prediction
23
unconstrained constrained
In region proposal networks the network predicts values 𝑡 and
𝑡 and the (x, y) center coordinates are calculated as:
For example, a prediction of 𝑡 = 1 would shift the box to the
right by the width of the anchor box, a prediction of 𝑡 = −1
would shift it to the left by the same amount. This formulation is
unconstrained so any anchor box can end up at any point in the
image, regardless of what location predicted the box. With
random initialization the model takes a long time to stabilize to
predicting sensible offsets.
predict location coordinates relative to the location of the grid
cell

In YOLOv2, anchors (width, height) are sizes of objects relative to the final feature map
ANCHORS = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434,
7.88282, 3.52778, 9.77052, 9.16828]
prior

• The final values are expressed in grid cells, not in pixel coordinates.
• YOLO default set:
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
This means the width and height of the first anchor are slightly over one grid cell [1.3221, 1.73145], and the last anchor covers almost the whole image [11.2364, 10.0071], given that the image is a 13 × 13 grid.
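To make the grid units concrete, the sketch below pairs up the default anchor list and converts each anchor to pixels, assuming a 416 × 416 input so that one grid cell is 32 pixels wide, and assuming the flat list is ordered as (width, height) pairs.

```python
anchors_flat = [1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892,
                9.47112, 4.84053, 11.2364, 10.0071]
GRID = 13     # final feature map is 13 x 13
STRIDE = 32   # assumed pixels per grid cell (416 / 13)

# Group the flat list into (width, height) pairs, in grid-cell units.
anchors = list(zip(anchors_flat[0::2], anchors_flat[1::2]))

for w, h in anchors:
    print(f"{w:8.4f} x {h:8.4f} cells"
          f" = {w * STRIDE:6.1f} x {h * STRIDE:6.1f} px"
          f" ({w / GRID:.0%} x {h / GRID:.0%} of the grid)")
```

Under these assumptions the first anchor is about 42 × 55 pixels and the last is about 360 × 320 pixels of a 416 × 416 image.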

What: unconstrained → constrained
How: predict location coordinates relative to the location of the grid cell
Why: using anchor boxes causes model instability, especially during early iterations, mostly from predicting the (x, y) location of the box
Value: 5% mAP gain
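A minimal sketch of the constrained decoding, following the parameterization given in the YOLOv2 paper: $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, where $(c_x, c_y)$ is the top-left corner of the grid cell and $(p_w, p_h)$ is the anchor size. The function name and the sample inputs are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box in grid-cell units."""
    b_x = sigmoid(t_x) + c_x      # center stays inside cell (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # anchor size scaled by a positive factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

# Even extreme offsets cannot push the center out of cell (6, 6).
box = decode_box(t_x=8.0, t_y=-8.0, t_w=0.0, t_h=0.0,
                 c_x=6, c_y=6, p_w=3.19275, p_h=4.00944)
centered = decode_box(0.0, 0.0, 0.0, 0.0, 6, 6, 3.19275, 4.00944)
print(box, centered)
```

Because the sigmoid bounds the offset to (0, 1), each prediction is tied to its own cell, which is what removes the early-training instability of the unconstrained form.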

YOLOv26.Fine-Grained Features
27
Add passthrough
layer
• Faster R-CNN and SSD both run their proposal networks at various feature
maps in the network o get a range of resolutions.
• Add a pass-through layer that brings features from an earlier layer at 26*26
resolution.

YOLOv26.Fine-Grained Features
28
The passthrough layer concatenates the higher resolution features with the
low resolution features by stacking adjacent features into different channels
instead of spatial locations, similar to the identity mappings in ResNet.

YOLOv26.Fine-Grained Features[reorg]
• 這步驟是YOLOv2先用到的功能，全名是reorganization。目的是希望能將圖的大小在減少1/2，
但又不希望像maxpool的作法會丟掉訊息，因此利用相對位置的資訊，將1張圖拆成4小張，這
樣大圖的訊息就能完全保留下來。因此如果feature map大小是100*100*126 (長*寬*通道數)經
由[reorg]，feature map將會變成 50*50*(126*4)= 50*50*504。
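The reshaping described above can be sketched as a space-to-depth operation in NumPy. This is an illustration of the idea only: the `(H, W, C)` layout and the channel ordering are assumptions and may differ from the ordering used by Darknet's reorg layer.

```python
import numpy as np

def reorg(x, stride=2):
    """Space-to-depth: move each stride x stride spatial block into channels."""
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (h/s, w/s, s, s, c)
    return x.reshape(h // stride, w // stride, stride * stride * c)

fmap = np.arange(100 * 100 * 126, dtype=np.float32).reshape(100, 100, 126)
out = reorg(fmap)
print(fmap.shape, "->", out.shape)
```

Every input value appears exactly once in the output, so unlike max pooling nothing is thrown away; the spatial size is halved and the channel count is multiplied by 4.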

YOLOv26.Fine-Grained Features[concatenates]
30
Reshape
concatenates
Pass through
Q:為什麼要先pass through要先經過conv 64 filters?(擷取較重要的features?)
Q: 為什麼filters的數量一直增加?(增加feature的細微度?)
Q: 3*3和1*1為什麼要交互使用?

YOLOv27.Multi-Scale Training
31
228
352
416
480544
What : Fixing the input images size Multi-Scale Training
How : Every 10 batches our network randomly chooses a new image dimension size from the multiples of 32:{320,352,…,608}
Why : Be robust to running on images of different sizes.
[Since our model only uses convolutional and pooling layers it can be resized on the fly]
Value : 1% mAP
Fixing the input
images size
Multi-Scale
Training
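A minimal sketch of the size schedule described above: a new input size is drawn from the multiples of 32 between 320 and 608 every 10 batches. The function name, the seed, and the use of Python's `random` module are illustrative; a real training loop would also resize the images and labels to the chosen size.

```python
import random

SIZES = list(range(320, 608 + 1, 32))   # 320, 352, ..., 608

def size_schedule(num_batches, every=10, seed=0):
    """Return the input size to use for each batch."""
    rng = random.Random(seed)
    size = None
    schedule = []
    for batch in range(num_batches):
        if batch % every == 0:          # re-draw the size every `every` batches
            size = rng.choice(SIZES)
        schedule.append(size)
    return schedule

schedule = size_schedule(30)
print(SIZES)
print(schedule)
```

Since every candidate size is a multiple of 32, the final feature map always has an integer side length, from 10 × 10 up to 19 × 19.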

How YOLOv2 gets Faster
Training for classification on ImageNet → Training for detection
Darknet-19: 19 convolutional layers and 5 maxpooling layers

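• YOLOv2 aims at high-accuracy real-time detection, so the goal is to reduce localization errors and improve the low recall without enlarging the network or lowering accuracy.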

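2019 Planned Activities
Invited external speakers
4/28 (with Women Techmakers Taichung)
Prof. 胡筱薇: social network analysis and sentiment analysis
杜岳華: review of Graph Neural Network
Meetup
• Paper reading
• Sharing hands-on experience
Courses & competitions
• Google MLCC
• Julia (3/31: Neural ODE & DiffEqFlux)
• Team competitions
Sister communities
• Pytorch Taichung (4/27: LSTM)
• Women Techmakers Taichung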
