IEEE ICIP'22: Efficient Content-Adaptive Feature-based Shot Detection for HTTP Adaptive Streaming

Efficient Content-Adaptive Feature-based Shot Detection for
HTTP Adaptive Streaming
Vignesh V Menon, Hadi Amirpour, Mohammad Ghanbari, Christian Timmerer
Christian Doppler Laboratory ATHENA, Institute of Information Technology (ITEC), University of Klagenfurt, Austria
19-22 September 2021

Outline
1 Introduction
2 Shot detection
3 Proposed Algorithm
4 Evaluation
5 Conclusions and Future Directions

Introduction
Background of HTTP Adaptive Streaming (HAS) [1]
Source: https://bitmovin.com/adaptive-streaming/
Why Adaptive Streaming?
Adapt for a wide range of devices
Adapt for a broad set of Internet speeds
What does HAS do?
Each source video is split into segments
Each segment is encoded at multiple bitrates, resolutions, and codecs
Segments are delivered to the client based on the device capability, network speed, etc.
[1] A. Bentaleb et al. “A Survey on Bitrate Adaptation Schemes for Streaming Media Over HTTP”. In: IEEE Communications Surveys & Tutorials 21.1 (2019), pp. 562–585. doi: 10.1109/COMST.2018.2862938.

Multi-shot encoding framework for VoD HAS applications [2]
Figure: block diagram of the framework. Stages: Input Video, Shot Detection, Shot Encodings, Video Quality Measure, Convex Hull Determination, Encoding Set Generation, Multi-shot Encoding. Intermediate data: Encoded Shots, Bitrate Quality Pairs, Bitrate Resolution Pairs, Target Encoding Set.
[2] Venkata Phani Kumar M, Christian Timmerer, and Hermann Hellwagner. “MiPSO: Multi-Period Per-Scene Optimization For HTTP Adaptive Streaming”. In: 2020 IEEE International Conference on Multimedia and Expo (ICME). 2020, pp. 1–6. doi: 10.1109/ICME46284.2020.9102775.

Shot detection
The boundaries between video shots are commonly known as shot transitions or shot-cuts.
The act of segmenting a video sequence into shots is called shot detection.
Objective:
Detect the first picture of each shot and encode it as an Instantaneous Decoder Refresh (IDR) frame.
Encode the subsequent frames of the new shot based on the first one via motion compensation and prediction. [3]
[3] J.-R. Ding and Jar-Ferr Yang. “Adaptive group-of-pictures and scene change detection methods based on existing H.264 advanced video coding information”. In: Image Processing, IET 2 (May 2008), pp. 85–94. doi: 10.1049/iet-ipr:20070014.

Shot transitions occur in two forms:
hard shot-cuts
gradual shot transitions
Gradual transitions are much harder to detect because the change in visual information is difficult to quantify.
Note
1 The ratio of IDR frames to non-IDR frames is skewed, i.e., the distribution is uneven.
2 Missed shot-cut detections and wrong IDR placements cause low compression efficiency, i.e., the cost of an error is large.

Proposed Algorithm
Phase 1: Feature Extraction
Compute texture energy per Coding Tree Unit (CTU)
A DCT-based energy function is used to determine the block-wise feature of each frame, defined as:
$$H_k = \sum_{i=1}^{w} \sum_{j=1}^{h} e^{\left|\left(\frac{ij}{wh}\right)^2 - 1\right|} \, |DCT(i-1, j-1)| \qquad (1)$$
where w and h are the width and height of the block, and DCT(i, j) is the (i, j)th DCT component when i + j > 2, and 0 otherwise.
The energy values of the CTUs in a frame are averaged to determine the energy per frame. [4]
[4] Michael King, Zinovi Tauber, and Ze-Nian Li. “A New Energy Function for Segmentation and Compression”. In: July 2007, pp. 1647–1650. doi: 10.1109/ICME.2007.4284983.
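The energy function of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `dct2`, `texture_energy`, and `frame_energy` are hypothetical, an orthonormal DCT-II is assumed (the slides do not state the normalization), and the first DCT index is assumed to run along the block width. The pure-Python transform is meant for small demo blocks; real CTU sizes would call for an optimized DCT.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a block given as a list of rows (assumed normalization)."""
    h, w = len(block), len(block[0])

    def alpha(n, size):
        return math.sqrt(1.0 / size) if n == 0 else math.sqrt(2.0 / size)

    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            s = 0.0
            for y in range(h):
                for x in range(w):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * w))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * h)))
            out[v][u] = alpha(u, w) * alpha(v, h) * s
    return out

def texture_energy(block):
    """Eq. (1): exponentially weighted sum of absolute DCT coefficients, DC term excluded."""
    h, w = len(block), len(block[0])
    coeffs = dct2(block)
    energy = 0.0
    for i in range(1, w + 1):
        for j in range(1, h + 1):
            if i + j <= 2:  # only (i, j) = (1, 1), i.e. the DC coefficient, is zeroed
                continue
            weight = math.exp(abs((i * j / (w * h)) ** 2 - 1))
            energy += weight * abs(coeffs[j - 1][i - 1])
    return energy

def frame_energy(blocks):
    """Per-frame energy: average of the block (CTU) energies."""
    return sum(texture_energy(b) for b in blocks) / len(blocks)

# Demo on 8x8 blocks: a flat block has (numerically) zero energy, a checkerboard does not.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
print(texture_energy(flat), texture_energy(checker))
```

Because the DC coefficient is excluded, the measure responds to texture rather than to average brightness.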

Figure: Hk of Tears of Steel sequence. Black circles denote the regions of shot transitions.

h_k: Mean Squared Error (MSE) of the CTU-level energy values of frame k with respect to those of the previous frame k − 1, normalized to H_k:
$$h_k = \frac{\sum_{i=1}^{M} \left(H_k(i) - H_{k-1}(i)\right)^2}{M H_k} \qquad (2)$$
where M denotes the number of CTUs in frame k.
ε_k: gradient of h per frame, given by:
$$\epsilon_k = \frac{h_{k-1} - h_k}{h_{k-1}} \qquad (3)$$
Note
If h_k = 0, the kth frame is a duplicate of the (k − 1)th frame.
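A minimal sketch of Eqs. (2) and (3), written literally as they appear on the slide. The function names are hypothetical, the per-frame energy is taken to be the average of the CTU energies, and returning `None` when the previous h is zero is an assumption of this sketch (the slides do not say how that case is handled).

```python
def frame_energy_difference(H_curr, H_prev):
    """Eq. (2): squared CTU-energy differences, normalized by M times the frame energy.

    H_curr and H_prev are lists of per-CTU energies for frames k and k-1.
    """
    M = len(H_curr)
    frame_energy = sum(H_curr) / M  # per-frame energy: average CTU energy
    sse = sum((c - p) ** 2 for c, p in zip(H_curr, H_prev))
    return sse / (M * frame_energy)

def gradient(h_prev, h_curr):
    """Eq. (3): relative change of h between consecutive frames.

    Assumption of this sketch: None is returned when h_prev is zero,
    because the ratio is undefined in that case.
    """
    if h_prev == 0:
        return None
    return (h_prev - h_curr) / h_prev

# Identical CTU energies give h = 0, which the slide interprets as a duplicate frame.
print(frame_energy_difference([10, 10, 10, 10], [10, 10, 10, 10]))
print(frame_energy_difference([10, 20, 10, 20], [10, 10, 10, 10]))
print(gradient(2.0, 1.0))
```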

Phase 2: Successive Elimination Algorithm
Step 1: while parsing all video frames do
    if ε_k > T1 then
        frame k ← IDR-frame, a new shot.
    else if ε_k ≤ T2 then
        frame k ← P-frame or B-frame, not a new shot.
T1, T2: maximum and minimum thresholds for ε_k, the gradient of h_k
Note
The frames are classified into three categories in this step:
1 a new shot
2 not a new shot
3 not decided
In the next steps of the algorithm, only frames of category (3) are considered.
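The three-way classification of Step 1 can be sketched as below. The sketch assumes that the comparison in the first condition is "greater than", which matches T1 being the maximum threshold, and it uses the values T1 = 50 and T2 = 10 reported in the evaluation as defaults. The names `Decision` and `classify` are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    NEW_SHOT = "new shot (IDR-frame)"
    NOT_NEW_SHOT = "not a new shot (P- or B-frame)"
    UNDECIDED = "not decided"

def classify(eps, t1=50.0, t2=10.0):
    """Step 1: classify a frame from its gradient value and the two thresholds."""
    if eps > t1:
        return Decision.NEW_SHOT
    if eps <= t2:
        return Decision.NOT_NEW_SHOT
    return Decision.UNDECIDED  # t2 < eps <= t1: handled by the later steps

for value in (60.0, 21.68, 10.0):
    print(value, classify(value).value)
```

Only the frames that come back as `UNDECIDED` need to be passed on to the successive elimination steps.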

f: video frame rate (fps)
Q: set of frames k where T1 ≥ ε_k > T2
q0: current frame number in the set Q
q−1: previous frame number in the set Q
q1: next frame number in the set Q
Step 2: while parsing Q do
    if q0 − q−1 > f and q1 − q0 > f then
        q0 ← IDR-frame, a new shot.
        Eliminate q0 from Q.
Step 3: while parsing Q do
    if q0 − q−1 > f and q1 − q0 ≤ f then
        compare ε of q0 with that of each q from the subset of Q where q1 − q0 ≤ f
        Frame q with the highest ε value ← IDR-frame, a new shot.

Proposed Algorithm
Working Example
Table: Step 1.
Frame   H_k     ε_k
33      52162   21.68
54      52119   13.51
65      52625   19.21
86      52038   10.12
97      52499   17.34
161     47790   11.53
833     48644   11.49
1409    40367   14.51
1665    35321   19.93
1686    40463   10.72
1889    38475   12.16
2205    37218   10.08
2536    35793   10.49

Table: Step 2.
Frame   H_k     ε_k     q0 − q−1   q1 − q0
33      52162   21.68   33         21
54      52119   13.51   21         11
65      52625   19.21   11         21
86      52038   10.12   21         11
97      52499   17.34   11         64
161     47790   11.53   64         672
833     48644   11.49   672        576
1409    40367   14.51   576        256
1665    35321   19.93   256        21
1686    40463   10.72   21         203
1889    38475   12.16   203        316
2205    37218   10.08   316        331
2536    35793   10.49   331        -

Table: Step 3.
Frame   H_k     ε_k     q0 − q−1   q1 − q0
33      52162   21.68   33         21
54      52119   13.51   21         11
65      52625   19.21   11         21
86      52038   10.12   21         11
97      52499   17.34   11         64
1665    35321   19.93   256        21
1686    40463   10.72   21         203
2536    35793   10.49   331        -
This example uses the FunOnTheRiver (24 fps) test sequence. The frames detected to be encoded as IDR-frames in each step are:
Step 1: none
Step 2: 161, 833, 1409, 1889, 2205
Step 3: 33, 1665, 2536
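Steps 2 and 3 can be replayed on the candidate frames of this example with a short sketch. The function name `successive_elimination` is hypothetical, and three points are assumptions read off the tables rather than statements from the slides: frame 0 is taken as the predecessor of the first candidate, the gaps are those of the original set Q (as shown in the Step 3 table), and a trailing candidate with no successor is treated as a cluster of one in Step 3.

```python
def successive_elimination(candidates, fps):
    """candidates: list of (frame_number, gradient) sorted by frame number.

    Returns (step2_idr, step3_idr) as lists of frame numbers.
    """
    frames = [f for f, _ in candidates]
    eps = dict(candidates)
    n = len(frames)
    # Gap to the previous and next candidate; the last one has no next gap (None).
    prev_gap = [frames[i] - (frames[i - 1] if i > 0 else 0) for i in range(n)]
    next_gap = [frames[i + 1] - frames[i] if i + 1 < n else None for i in range(n)]

    # Step 2: isolated candidates, more than fps frames away from both neighbours.
    step2 = [frames[i] for i in range(n)
             if prev_gap[i] > fps and next_gap[i] is not None and next_gap[i] > fps]

    # Step 3: a remaining candidate that follows a gap > fps opens a cluster, which
    # extends while consecutive candidates are at most fps frames apart. The frame
    # with the highest gradient in each cluster becomes the IDR-frame.
    step3 = []
    i = 0
    while i < n:
        if frames[i] in step2 or prev_gap[i] <= fps:
            i += 1
            continue
        j = i
        while next_gap[j] is not None and next_gap[j] <= fps:
            j += 1
        cluster = frames[i:j + 1]
        step3.append(max(cluster, key=lambda f: eps[f]))
        i = j + 1
    return step2, step3

# Candidate frames and gradient values from the Step 1 table (FunOnTheRiver, 24 fps).
Q = [(33, 21.68), (54, 13.51), (65, 19.21), (86, 10.12), (97, 17.34),
     (161, 11.53), (833, 11.49), (1409, 14.51), (1665, 19.93),
     (1686, 10.72), (1889, 12.16), (2205, 10.08), (2536, 10.49)]
step2_idr, step3_idr = successive_elimination(Q, fps=24)
print("Step 2:", step2_idr)  # [161, 833, 1409, 1889, 2205]
print("Step 3:", step3_idr)  # [33, 1665, 2536]
```

Under these assumptions the sketch yields the same IDR frames as listed for Steps 2 and 3.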

Evaluation
Test Methodology
Test videos: JVET test sequences [5] and professionally produced UHD HDR cinematic content [6] having typical multi-scene content
System: Dual-processor server with Intel Xeon Gold 5218R (80 cores, 2.10 GHz)
Benchmark algorithm: default shot detection algorithm in x265
T1 = 50 and T2 = 10 for the proposed algorithm; determined experimentally
Metrics: accuracy, precision, recall, [7] and F-measure [8]
[5] Jill Boyce et al. JVET-J1010: JVET common test conditions and software reference configurations. July 2018.
[6] M. H. Pinson. “The Consumer Digital Video Library [Best of the Web]”. In: IEEE Signal Processing Magazine 30.4 (2013), pp. 172–174. doi: 10.1109/MSP.2013.2258265.
[7] Markus Junker, Rainer Hoch, and Andreas Dengel. “On the Evaluation of Document Analysis Components by Recall, Precision, and Accuracy”. In: (Apr. 2000). doi: 10.1109/ICDAR.1999.791887.
[8] Yutaka Sasaki. “The truth of the F-measure”. In: https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf. 2007.

Experimental Results
Table: Shot detection results
                 Actual     | Benchmark algorithm                       | Proposed algorithm
Video            shot-cuts  | Accuracy  Precision  Recall   F-measure   | Accuracy  Precision  Recall    F-measure
BigBuckBunny     10         | 99.88%    100.00%    80.00%   88.89%      | 100.00%   100.00%    100.00%   100.00%
Dinner           4          | 99.89%    100.00%    75.00%   85.71%      | 99.89%    100.00%    75.00%    85.71%
FoodMarket4      2          | 99.72%    -          0%       -           | 99.86%    100.00%    50.00%    66.67%
sintel trailer   14         | 99.86%    100.00%    85.71%   92.31%      | 99.93%    100.00%    92.86%    96.30%
snow mnt         3          | 99.47%    -          0%       -           | 99.65%    100.00%    33.33%    50.00%
Tears of Steel   13         | 99.93%    100.00%    92.31%   96.00%      | 100.00%   100.00%    100.00%   100.00%
Busy City        11         | 99.64%    50.00%     18.18%   26.67%      | 99.87%    100.00%    63.64%    77.78%
FunOnTheRiver    12         | 99.60%    0%         0%       -           | 99.80%    85.71%     50.00%    63.16%
Remarks
1 Actual shot-cuts: the ground truth, i.e., the number of real shot transitions in the considered test videos, determined manually.
2 The recall of the proposed algorithm is 25% higher than that of the benchmark algorithm.
3 The F-measure of the proposed algorithm is 20% higher than that of the benchmark algorithm.
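For reference, precision, recall, and F-measure follow from the counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses a hypothetical helper `precision_recall_f`; the TP/FP/FN counts in the demo are not stated on the slides but are inferred from the table (for example, 80% recall on 10 actual shot-cuts implies 8 detected and 2 missed). Undefined ratios are returned as `None`, matching the "-" entries in the table.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) as fractions; None where undefined."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f_measure = None
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# BigBuckBunny, benchmark: inferred counts TP = 8, FP = 0, FN = 2.
p, r, f = precision_recall_f(8, 0, 2)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 100.00% 80.00% 88.89%

# FunOnTheRiver, proposed: inferred counts TP = 6, FP = 1, FN = 6.
p, r, f = precision_recall_f(6, 1, 6)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 85.71% 50.00% 63.16%
```

Accuracy is left out of the sketch because it also needs the number of true negatives, which depends on the frame count of each video.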

Table: Detection rate statistics of the algorithms
Algorithm TPR FPR
Benchmark 53.62% 0.03%
Proposed 78.26% 0.01%
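The TPR values appear to be the recall pooled over all eight test videos. The sketch below checks this under that assumption: it recovers the number of detected shot-cuts per video by rounding recall times the actual count (an inference, not data from the slides) and divides the total by the 69 actual shot-cuts. The function name `pooled_tpr` is hypothetical.

```python
# Actual shot-cuts and recall (%) per video, copied from the shot detection results table.
actual = {"BigBuckBunny": 10, "Dinner": 4, "FoodMarket4": 2, "sintel trailer": 14,
          "snow mnt": 3, "Tears of Steel": 13, "Busy City": 11, "FunOnTheRiver": 12}
recall_benchmark = {"BigBuckBunny": 80.00, "Dinner": 75.00, "FoodMarket4": 0.0,
                    "sintel trailer": 85.71, "snow mnt": 0.0, "Tears of Steel": 92.31,
                    "Busy City": 18.18, "FunOnTheRiver": 0.0}
recall_proposed = {"BigBuckBunny": 100.00, "Dinner": 75.00, "FoodMarket4": 50.00,
                   "sintel trailer": 92.86, "snow mnt": 33.33, "Tears of Steel": 100.00,
                   "Busy City": 63.64, "FunOnTheRiver": 50.00}

def pooled_tpr(actual, recall_pct):
    """Pooled true positive rate in percent, from per-video recall and actual counts."""
    true_positives = sum(round(actual[v] * recall_pct[v] / 100) for v in actual)
    return 100 * true_positives / sum(actual.values())

print(round(pooled_tpr(actual, recall_benchmark), 2))  # 53.62
print(round(pooled_tpr(actual, recall_proposed), 2))   # 78.26
```

Both results agree with the TPR column, i.e., 37 of 69 shot-cuts detected by the benchmark and 54 of 69 by the proposed algorithm.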
Runtime per frame: 0.1% of the total time taken for encoding each frame.
The algorithm needs to be run only once for a video. The decisions made can be used for
all remaining representations in HAS applications.

Conclusions and Future Directions

Conclusions
Proposed a shot detection algorithm as a feature-based pre-processing step for x265-based
HEVC encoding in VoD HAS applications.
Identified a DCT-based energy function as a feature to determine shot cuts.
Proposed a successive elimination algorithm to remove the false detections during gradual
shot transitions.
The proposed algorithm gives better-balanced shot detections compared to the benchmark
algorithm.

Future Directions
We can extend the work in this paper to compute the relative complexity of the shots to
that of the entire video sequence using the feature metric and predict the ideal bitrate per
resolution for each shot.
As an extension of this work, more encoding parameter decisions, such as optimal block partitioning and quantization offsets, can be predicted.
This work can be extended to support more recent codecs, e.g., VVC.

Q & A
Thank you for your attention!
Vignesh V Menon (vignesh.menon@aau.at)
Hadi Amirpour (hadi.amirpourazarian@aau.at)
Mohammad Ghanbari (ghan@essex.ac.uk)
Christian Timmerer (Christian.Timmerer@aau.at)
