OPSE: Online Per-Scene Encoding for Adaptive HTTP Live
Streaming
Vignesh V Menon¹, Hadi Amirpour¹, Christian Feldmann², Mohammad Ghanbari¹,³, and Christian Timmerer¹
¹ Christian Doppler Laboratory ATHENA, Alpen-Adria-Universität, Klagenfurt, Austria
² Bitmovin, Klagenfurt, Austria
³ School of Computer Science and Electronic Engineering, University of Essex, UK
21 July 2022

Outline
1 Introduction
2 OPSE
3 Evaluation
4 Q & A

Introduction
Motivation
Per-scene encoding schemes are based on the fact that each resolution performs better
than others in a scene for a given bitrate range, and these regions depend on the video
complexity.
The goal is to increase the Quality of Experience (QoE) or decrease the bitrate of the representations, as already introduced for VoD services.¹
Figure: The bitrate ladder prediction envisioned using OPSE.
¹ J. De Cock et al. “Complexity-based consistent-quality encoding in the cloud”. In: 2016 IEEE International Conference on Image Processing (ICIP). 2016, pp. 1484–1488. doi: 10.1109/ICIP.2016.7532605.

Introduction
Why not in live yet?
Though per-title encoding schemes² enhance the quality of video delivery, determining the convex hull is computationally expensive, which makes them suitable only for VoD streaming applications.
Some methods pre-analyze the video content.³
Katsenou et al.⁴ introduced a content-gnostic method that employs machine learning to find, for each resolution, the bitrate range in which it outperforms the other resolutions. Bhat et al.⁵ proposed a Random Forest (RF) classifier to decide the encoding resolution best suited to different quality ranges and studied machine-learning-based adaptive resolution prediction.
However, these approaches still yield latency much higher than what is acceptable in live streaming.
² De Cock et al., “Complexity-based consistent-quality encoding in the cloud”; Hadi Amirpour et al. “PSTR: Per-Title Encoding Using Spatio-Temporal Resolutions”. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). 2021, pp. 1–6. doi: 10.1109/ICME51207.2021.9428247.
³ https://bitmovin.com/whitepapers/Bitmovin-Per-Title.pdf, last access: May 10, 2022.
⁴ A. V. Katsenou et al. “Content-gnostic Bitrate Ladder Prediction for Adaptive Video Streaming”. In: 2019 Picture Coding Symposium (PCS). 2019. doi: 10.1109/PCS48520.2019.8954529.
⁵ Madhukar Bhat et al. “Combining Video Quality Metrics To Select Perceptually Accurate Resolution In A Wide Quality Range: A Case Study”. In: 2021 IEEE International Conference on Image Processing (ICIP). 2021, pp. 2164–2168. doi: 10.1109/ICIP42928.2021.9506310.

OPSE
Figure: OPSE architecture. Blocks: Input Video, Video Complexity Feature Extraction, Scene Detection, Resolution Prediction, Per-Scene Encoding. Signals: features (E, h, ϵ) and (E, h), scenes, resolutions (R), bitrates (B), and predicted pairs (r̂, b).
The E, h, and ϵ features are extracted using VCA, an open-source video complexity analyzer.⁶
⁶ Vignesh V Menon et al. “VCA: Video Complexity Analyzer”. In: Proceedings of the 13th ACM Multimedia Systems Conference. 2022. isbn: 9781450392839. doi: 10.1145/3524273.3532896. url: https://doi.org/10.1145/3524273.3532896.

OPSE
Phase 1: Feature Extraction
Compute texture energy per block
A DCT-based energy function is used to determine the block-wise feature of each frame. For block k of frame p, it is defined as:
$$H_{p,k} = \sum_{i=0}^{w-1} \sum_{j=0}^{w-1} e^{\left|\left(\frac{ij}{w^2}\right)^2 - 1\right|} \, |DCT(i, j)| \tag{1}$$
where w × w is the size of the block, and DCT(i, j) is the (i, j)th DCT component when i + j > 0, and 0 otherwise.
The energy values of the blocks in a frame are averaged to determine the energy per frame:⁷
$$E = \sum_{k=0}^{C-1} \frac{H_{p,k}}{C \cdot w^2} \tag{2}$$
⁷ Michael King et al. “A New Energy Function for Segmentation and Compression”. In: 2007 IEEE International Conference on Multimedia and Expo. 2007, pp. 1647–1650. doi: 10.1109/ICME.2007.4284983.
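As a rough illustration of Eqs. (1) and (2), the following Python sketch computes the block energy and the per-frame average for a list of square luma blocks. It is not the VCA implementation: the orthonormal DCT-II scaling, the use of the block area w·w as the denominator inside the exponent, and the names `dct2`, `block_energy`, and `frame_energy` are assumptions made for this example.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows.

    The orthonormal scaling is an assumption; the slides do not specify it.
    """
    w = len(block)

    def alpha(u):
        return math.sqrt(1.0 / w) if u == 0 else math.sqrt(2.0 / w)

    # 1-D basis: basis[u][x] = alpha(u) * cos((2x + 1) * u * pi / (2w))
    basis = [[alpha(u) * math.cos((2 * x + 1) * u * math.pi / (2 * w))
              for x in range(w)] for u in range(w)]
    # coeffs = basis * block * basis^T, computed in two passes
    tmp = [[sum(basis[u][x] * block[x][y] for x in range(w))
            for y in range(w)] for u in range(w)]
    return [[sum(tmp[u][y] * basis[v][y] for y in range(w))
             for v in range(w)] for u in range(w)]


def block_energy(block):
    """Texture energy of one w x w block, following Eq. (1)."""
    w = len(block)
    coeffs = dct2(block)
    total = 0.0
    for i in range(w):
        for j in range(w):
            if i + j == 0:
                continue  # the DC component is defined as 0
            weight = math.exp(abs((i * j / (w * w)) ** 2 - 1))
            total += weight * abs(coeffs[i][j])
    return total


def frame_energy(blocks):
    """Average energy E over the C blocks of a frame, following Eq. (2)."""
    w = len(blocks[0])
    return sum(block_energy(b) for b in blocks) / (len(blocks) * w * w)
```

Under these assumptions a flat block has an energy of (numerically) zero because all of its AC coefficients vanish, while any block with texture yields a positive value.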

OPSE
Phase 1: Feature Extraction
h_p: the sum of absolute differences (SAD) between the block-level energy values of frame p and those of the previous frame p − 1.
$$h_p = \sum_{k=0}^{C-1} \frac{\left| H_{p,k} - H_{p-1,k} \right|}{C \cdot w^2} \tag{3}$$
where C denotes the number of blocks in frame p.
The gradient of h per frame p, ϵ_p, is also defined:
$$\epsilon_p = \frac{h_{p-1} - h_p}{h_{p-1}} \tag{4}$$
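The two temporal features can be sketched in Python as follows, reading the numerator of Eq. (3) as an absolute difference, in line with the SAD description. The function name `temporal_features`, the input layout (`block_energies[p][k]` holding the energy of block k in frame p), and the choice to leave ϵ undefined when the previous h is zero are assumptions for this illustration only.

```python
def temporal_features(block_energies, w):
    """Return (h, eps) lists aligned with the frames.

    block_energies[p][k] is the texture energy of block k in frame p.
    Entries are None where a feature is undefined (no previous frame,
    or a zero denominator in Eq. (4)).
    """
    n = len(block_energies)
    h = [None] * n
    eps = [None] * n
    for p in range(1, n):
        prev, cur = block_energies[p - 1], block_energies[p]
        c = len(cur)  # number of blocks C in frame p
        # Eq. (3): SAD of block energies, normalized by C * w^2
        h[p] = sum(abs(a - b) for a, b in zip(cur, prev)) / (c * w * w)
    for p in range(2, n):
        if h[p - 1]:  # assumption: skip frames where h_{p-1} is zero
            # Eq. (4): relative change of h between consecutive frames
            eps[p] = (h[p - 1] - h[p]) / h[p - 1]
    return h, eps
```

For instance, three frames with block energies `[0, 0]`, `[8, 8]`, and `[8, 16]` and w = 2 give h = 2.0 and 1.0 for the second and third frames, and ϵ = 0.5 for the third.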
Latency
Speed of feature extraction: 1480 fps for Full HD (1080p) video with 8 CPU threads and x86 SIMD optimization.

OPSE
Phase 2: Scene Detection
Objective:
Detect the first picture of each shot and encode it as an Instantaneous Decoder Refresh (IDR) frame.
Encode the subsequent frames of the new shot based on the first one via motion compensation and prediction.
Shot transitions can be present in two ways:
hard shot-cuts
gradual shot transitions
Detecting gradual transitions is much harder because the change in visual information is difficult to quantify.

OPSE
Phase 2: Scene Detection
Notation:
  k: frame index
  T1, T2: maximum and minimum thresholds for ϵk
  f: video frame rate (fps)
  Q: set of frames where T1 ≥ ϵ > T2 and ∆h > T3
  q0: current frame number in the set Q
  q−1: previous frame number in the set Q
  q1: next frame number in the set Q
Step 1: while parsing all video frames do
  if ϵk > T1 then
    k ← IDR-frame, a new shot.
  else if ϵk ≤ T2 then
    k ← P-frame or B-frame, not a new shot.
Step 2: while parsing Q do
  if q0 − q−1 > f and q1 − q0 > f then
    q0 ← IDR-frame, a new shot.
  Eliminate q0 from Q.
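A minimal Python sketch of the two-step procedure above is given below. Several details are assumptions rather than part of the slides: the function name `detect_idr_frames`, the fact that ∆h is passed in precomputed by the caller (its definition and the threshold T3 are not given here), the lookup of neighbours in the original set Q instead of a progressively shrinking one, and the treatment of a missing neighbour at either end of Q as satisfying the distance test.

```python
def detect_idr_frames(eps, delta_h, fps, t1, t2, t3):
    """Return the sorted frame indices marked as IDR frames (shot starts).

    eps[k] and delta_h[k] are per-frame features; None means undefined.
    """
    idr = set()
    candidates = []  # the set Q of ambiguous frames

    # Step 1: classify every frame by its gradient eps[k]
    for k, e in enumerate(eps):
        if e is None:
            continue
        if e > t1:
            idr.add(k)  # clear shot change
        elif e > t2 and delta_h[k] is not None and delta_h[k] > t3:
            candidates.append(k)  # T1 >= eps > T2 and delta_h > T3
        # otherwise eps <= T2: regular P- or B-frame

    # Step 2: keep a candidate only if it is more than one second
    # (fps frames) away from both neighbouring candidates
    for n, q0 in enumerate(candidates):
        prev_ok = n == 0 or q0 - candidates[n - 1] > fps
        next_ok = n == len(candidates) - 1 or candidates[n + 1] - q0 > fps
        if prev_ok and next_ok:
            idr.add(q0)
    return sorted(idr)
```

With this reading, two candidates that fall within the same second suppress each other, which is one plausible way to avoid placing several IDR frames inside a single gradual transition.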

OPSE
Phase 3: Resolution Prediction
For each detected scene, the optimized bitrate ladder is predicted using the E and h features of the first GOP of the scene and the sets R and B. The optimized resolution r̂ is predicted for each target bitrate b ∈ B. The resolution scaling factor s is defined as:
$$s = \frac{r}{r_{max}}; \quad r \in R \tag{5}$$
where r_max is the maximum resolution in R.
Figure: Neural network structure to predict the optimized resolution scaling factor ŝ for a maximum resolution r_max and framerate f. Input layer (∈ ℝ³): E, h, log(b); two hidden layers (each ∈ ℝ⁴); output layer (∈ ℝ¹): ŝ.
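To make the shape of this predictor concrete, here is a small Python sketch of a 3-4-4-1 fully connected network that maps (E, h, log(b)) to ŝ and then snaps ŝ to the nearest scaling factor available in R. Everything beyond the layer sizes is an assumption for illustration: the ReLU and sigmoid activations, the natural logarithm, the random untrained weights, and all function names. A trained model would be needed to obtain meaningful predictions.

```python
import math
import random

RESOLUTIONS = [360, 432, 540, 720, 1080]  # the set R, as frame heights


def scaling_factors(resolutions):
    """Eq. (5): s = r / r_max for every r in R."""
    r_max = max(resolutions)
    return {r: r / r_max for r in resolutions}


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    # numerically stable form, avoids overflow for large |v|
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would load trained ones."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_resolution(E, h, bitrate, layers, resolutions=RESOLUTIONS):
    """Return (r_hat, s_hat) for one target bitrate."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    x = [E, h, math.log(bitrate)]       # input layer, 3 values
    x = dense(x, w1, b1, relu)          # hidden layer, 4 units
    x = dense(x, w2, b2, relu)          # hidden layer, 4 units
    s_hat = dense(x, w3, b3, sigmoid)[0]  # output layer, 1 unit
    factors = scaling_factors(resolutions)
    r_hat = min(resolutions, key=lambda r: abs(factors[r] - s_hat))
    return r_hat, s_hat


rng = random.Random(0)
LAYERS = [init_layer(3, 4, rng), init_layer(4, 4, rng), init_layer(4, 1, rng)]
```

Calling `predict_resolution(31.96, 11.12, 1600, LAYERS)` once per target bitrate in B would yield one (r̂, b) pair per rung of the ladder; with the placeholder weights the result only demonstrates the data flow.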

Evaluation
R = {360p, 432p, 540p, 720p, 1080p}
B = {145, 300, 600, 900, 1600, 2400, 3400, 4500, 5800, 8100}.
Figure: BDR_V results for scenes characterized by various average E and h.
BDR_V: Bjøntegaard delta rate⁸ refers to the average increase in bitrate of the representations compared with that of the fixed bitrate ladder encoding to maintain the same VMAF.
⁸ G. Bjontegaard. “Calculation of average PSNR differences between RD-curves”. In: VCEG-M33 (2001).
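The idea behind a delta-rate figure can be sketched in Python as below. This is a simplified stand-in, not the original Bjøntegaard procedure: it uses piecewise-linear interpolation of log-bitrate over quality instead of the cubic polynomial fit, and the function names and input layout (lists of `(bitrate, quality)` points with distinct quality values) are assumptions for this example.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over strictly increasing xs."""
    for n in range(len(xs) - 1):
        x0, x1 = xs[n], xs[n + 1]
        if x0 <= x <= x1:
            return ys[n] + (ys[n + 1] - ys[n]) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def bd_rate(reference, test, steps=1000):
    """Average bitrate change (percent) of `test` versus `reference`
    at equal quality. Negative values mean bitrate savings.

    Each curve is a list of (bitrate, quality) points, e.g. quality = VMAF.
    """
    reference = sorted(reference, key=lambda p: p[1])
    test = sorted(test, key=lambda p: p[1])
    q_ref = [q for _, q in reference]
    q_test = [q for _, q in test]
    log_ref = [math.log(b) for b, _ in reference]
    log_test = [math.log(b) for b, _ in test]

    # only the quality range covered by both curves is compared
    lo = max(q_ref[0], q_test[0])
    hi = min(q_ref[-1], q_test[-1])
    if hi <= lo:
        raise ValueError("the curves do not overlap in quality")

    # midpoint rule for the mean log-bitrate difference over [lo, hi]
    total = 0.0
    for n in range(steps):
        q = lo + (hi - lo) * (n + 0.5) / steps
        total += _interp(q, q_test, log_test) - _interp(q, q_ref, log_ref)
    return (math.exp(total / steps) - 1.0) * 100.0
```

As a sanity check of the definition, a test curve that reaches every quality level of the reference at exactly half the bitrate gives a value of about −50%.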

Evaluation
(a) Scene1 (b) Scene2
Figure: Comparison of RD curves for encoding two sample scenes, Scene1 (E = 31.96, h = 11.12) and
Scene2 (E = 67.96, h = 5.12) using the fixed bitrate ladder and OPSE.

Q & A
Thank you for your attention!
Vignesh V Menon (vignesh.menon@aau.at)
