Fixation Prediction for
360˚ Video Streaming in
Head-Mounted Virtual Reality
Ching-Ling Fan1, Jean Lee1, Wen-Chih Lo1, Chun-Ying Huang2, Kuan-Ta Chen3, and Cheng-Hsin Hsu1
1Department of Computer Science, National Tsing Hua University
2Department of Computer Science, National Chiao Tung University
3Institute of Information Science, Academia Sinica
360˚ Videos
1
360˚ Videos Streaming to HMD
• 360˚ videos contain a wider view than conventional videos
 much more information
 extremely high resolutions and large file sizes
⇒ Insufficient bandwidth & degraded user experience
2
360˚ Videos Streaming to HMD
• Solution: stream only the current Field-of-View (FoV) of the viewer
 The HMD viewer only gets to see a small part of the whole 360˚ video
• Question: which FoV should we stream to meet the viewer’s needs in the next moment (a few seconds)?
⇒ Fixation Prediction
3
Fixation Prediction
• Videos are split into tiles (sub-videos)
 Encoded using H.264 and streamed using MPEG Dynamic Adaptive Streaming over HTTP (DASH); see the tiling sketch below
• Goal: predict which tiles are most likely to be viewed by the viewers
➔ which tiles should be included in the next segment
4
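To make the tiling step concrete, below is a minimal sketch of cutting an equirectangular frame into fixed-size square tiles; it assumes the 192x192 tile size used in the simulation setup later in the talk, while the frame dimensions and the `split_into_tiles` helper are illustrative. Each resulting tile would then be encoded as an independent H.264 sub-video and offered as a DASH representation.

```python
import numpy as np

TILE = 192  # tile width/height in pixels, per the simulation setup

def split_into_tiles(frame, tile=TILE):
    """Map (row, col) tile indices to pixel blocks of an equirectangular frame."""
    h, w = frame.shape[:2]
    return {(r, c): frame[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            for r in range(h // tile) for c in range(w // tile)}

# Example: a 3840x1920 equirectangular frame yields a 20x10 grid of tiles.
frame = np.zeros((1920, 3840, 3), dtype=np.uint8)
print(len(split_into_tiles(frame)))  # 200
```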
Proposed Approach
• Neural network trained with viewing features
 content-related: saliency maps and motion maps
 sensor-related: viewer’s yaw, roll, and pitch
5
[Figure: HMD with the roll and yaw rotation axes illustrated.]
System Overview
• Image saliency network: predicts the saliency of images [1]
• Motion feature detector: analyzes the Lucas-Kanade optical flow of consecutive frames (sketched below)
• Orientation extractor: extracts the orientation data from raw HMD sensor data
6
[Figure: system overview pipeline, including the image saliency network.]
[1] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. 2016. A Deep Multi-Level Network for
Saliency Prediction. In Proc. of International Conference on Pattern Recognition (ICPR’16).
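The slide names Lucas-Kanade optical flow but no implementation, so the following is only a sketch using OpenCV's pyramidal Lucas-Kanade tracker; the feature-detection parameters are illustrative assumptions. A per-tile motion map could then be built from the displacements of the tracked features falling in each tile.

```python
import cv2

def motion_vectors(prev_frame, next_frame):
    """Track corner features between consecutive frames with Lucas-Kanade flow."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Pick up to 500 corner features worth tracking in the previous frame.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=8)
    # Pyramidal Lucas-Kanade: estimate where each feature moved to.
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    ok = status.reshape(-1) == 1
    return p0.reshape(-1, 2)[ok], p1.reshape(-1, 2)[ok]  # start/end positions
```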
System Overview
• Feature buffer: stores the features in a sliding window
• Fixation prediction network: predicts the video fixations
• Tile rate selector: performs rate allocation among video tiles (a sketch of one plausible policy follows below)
9
[Figure: system overview pipeline, highlighting the feature buffer, fixation prediction network, and tile rate selector.]
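The slides do not spell out the tile rate selector's policy, so the sketch below shows just one plausible allocation under an assumed fixed bandwidth budget: split the budget across the predicted tiles in proportion to their predicted viewing probabilities. The function name and interface are illustrative.

```python
def allocate_rates(view_probs, total_kbps):
    """Split a bandwidth budget across the predicted tiles.

    view_probs: dict mapping tile id -> predicted viewing probability.
    Tiles that are more likely to be viewed receive a larger share.
    """
    total_p = sum(view_probs.values()) or 1.0
    return {t: total_kbps * p / total_p for t, p in view_probs.items()}
```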
Fixation Prediction Network
• Recurrent Neural Network (a minimal Keras sketch follows the figure below)
• Goal: predict the viewing probability of each tile in the next few seconds
 Orientation-based network
 Tile-based network
12
[Figure: an RNN unrolled across time steps t-2 to t+1; each step takes an input and the previous state, and produces an output and an updated state.]
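As a concrete starting point, here is a minimal Keras sketch of such a network, assuming a single LSTM layer with one sigmoid output per tile (one grid point from the training slide later). The window length m, feature dimension, and tile count are illustrative, and the actual networks output probabilities for n future frames rather than a single horizon.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

m, feature_dim, n_tiles = 30, 256, 200  # illustrative sizes

model = Sequential([
    # Consume the features of the past m frames (orientation, motion, and
    # saliency for the orientation-based variant; viewed tiles, motion, and
    # saliency for the tile-based one).
    LSTM(512, input_shape=(m, feature_dim)),
    Dropout(0.5),                          # the "dropout = True" grid option
    Dense(n_tiles, activation='sigmoid'),  # per-tile viewing probability
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```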
Fixation Prediction Network:
Orientation-Based Network
13
[Figure: orientation-based network. An LSTM consumes the features (orientation, motion, saliency) of the past m frames F_{f-m}..F_f and outputs the predicted viewing probabilities P_{f+1}..P_{f+n} for the next n frames.]
Fixation Prediction Network:
Tile-Based Network
14
[Figure: tile-based network. An LSTM consumes the features (viewed tiles, motion, saliency) of the past m frames F_{f-m}..F_f; for the n future steps, the previously predicted tiles are fed back in place of viewed tiles, and the network outputs the predicted viewing probabilities P_{f+1}..P_{f+n}.]
Ground Truth
• The tiles viewed by the viewers at each frame in the equirectangular mapping model
 Calculate the FoV on the sphere from the orientation
 Sample the points within the FoV and map them from the sphere to the equirectangular model (sketched after the figure below)
15
[Figure: the FoV on the sphere, parameterized by the angles α, β, and θ, mapped onto the equirectangular frame as a 0/1 tile mask; tiles overlapping the sampled points are marked 1 (viewed).]
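A hedged sketch of this procedure: sample directions inside the FoV around the viewer's orientation, map each onto the equirectangular frame, and mark the tiles the points land in. The geometry is deliberately simplified (yaw/pitch offsets only, no roll, and no exact spherical correction near the poles), and the FoV width, tile grid, and sample count are illustrative assumptions.

```python
import numpy as np

def viewed_tile_mask(yaw, pitch, fov_deg=100.0, rows=10, cols=20, samples=40):
    """Return a rows x cols 0/1 mask of tiles covered by the FoV (simplified)."""
    mask = np.zeros((rows, cols), dtype=np.uint8)
    half = np.radians(fov_deg) / 2.0
    for d_lon in np.linspace(-half, half, samples):   # sample points in the FoV
        for d_lat in np.linspace(-half, half, samples):
            lon = yaw + d_lon
            lat = np.clip(pitch + d_lat, -np.pi / 2, np.pi / 2)
            u = (lon / (2 * np.pi) + 0.5) % 1.0       # equirectangular mapping
            v = 0.5 - lat / np.pi
            mask[min(int(v * rows), rows - 1), int(u * cols) % cols] = 1
    return mask  # 1 = viewed tile (ground truth), 0 = not viewed
```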
Testbed
• HMD: Oculus Rift DK2
• Sensor Logger: OpenTrack[1]
• Frame Capturer: GamingAnywhere[2]
• 25 viewers and 10 360˚ videos
 12 viewers for training and the rest for testing
16
[1] https://github.com/opentrack/opentrack
[2] http://gaminganywhere.org
Network Training
• Implemented the proposed neural network architecture using Keras
• Within the training set, 80% of the traces are used for training and the rest for cross-validation
• Training parameters (grid loop sketched below)
 Number of neurons in {256, 512, 1024}
 Number of layers in {1, 2}
 Dropout in {True, False}
17
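The parameter grid above is small enough to sweep exhaustively. Below is a sketch of such a loop, where `build_model` stands in for a constructor along the lines of the earlier Keras sketch, and `x_train`/`y_train` are the prepared feature windows and tile labels; all of these names are assumptions.

```python
from itertools import product

best = None
for neurons, layers, dropout in product([256, 512, 1024], [1, 2], [True, False]):
    model = build_model(neurons, layers, dropout)  # assumed helper
    hist = model.fit(x_train, y_train, epochs=50,
                     validation_split=0.2)         # 80/20 split, as above
    val_loss = min(hist.history['val_loss'])
    if best is None or val_loss < best[0]:
        best = (val_loss, neurons, layers, dropout)
```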
Training Results
• Orientation-based network
  Neurons  Layers  Dropout | Rank. Loss  Accuracy  F-score (Training) | Rank. Loss  Accuracy  F-score (Testing)
  256      1       T       | 0.10        88.20%    0.67               | 0.15        85.72%    0.60
  512      1       T       | 0.09        89.25%    0.70               | 0.14        86.35%    0.62
  1024     1       T       | 0.09        89.28%    0.71               | 0.14        86.06%    0.62
• Tile-based network
  Neurons  Layers  Dropout | Rank. Loss  Accuracy  F-score (Training) | Rank. Loss  Accuracy  F-score (Testing)
  256      2       F       | 0.14        86.58%    0.57               | 0.20        83.94%    0.52
  512      2       F       | 0.13        86.91%    0.58               | 0.19        84.11%    0.52
  1024     2       F       | 0.12        87.29%    0.60               | 0.19        84.22%    0.53
18
Performance Metrics
• Missing ratio
 the fraction of missed tiles over all viewed tiles
• Bandwidth
 the bandwidth consumed to stream the predicted tiles
• Initial buffering time
 the minimum buffering time needed for smooth playout
• Video quality
 the video quality seen by the viewers
• Running time
 the time consumed to predict the viewed tiles
19
Simulation Setup
• Each viewer randomly selects a 360˚ video to watch (traces from the testing set)
• Each simulation lasts for 1 min and is repeated 8 times
 Bandwidth of 150 Mbps (for 13 viewers), latency of 2 secs, 4-sec segments, and a tile size of 192x192
• Baselines
 Current (Cur), Dead Reckoning (DR), and Saliency (Sal)
• Tune each solution to ensure a <10% average missing ratio
 𝜌: the threshold for rounding the predicted probability to a Boolean decision (Our); see the sketch below
 𝛿: the number of times to iteratively add new tiles at the edge of the predicted tiles (Cur and DR)
 𝜆: the percentile of the saliency value that decides whether a tile is transmitted (Sal)
20
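To make the role of 𝜌 concrete, here is a small sketch that rounds predicted probabilities into Boolean streaming decisions and computes the missing ratio defined on the metrics slide; the array names and shapes are illustrative.

```python
import numpy as np

def select_tiles(pred_probs, rho):
    """Stream a tile iff its predicted viewing probability is at least rho."""
    return pred_probs >= rho

def missing_ratio(streamed, viewed):
    """Fraction of actually viewed tiles that were not streamed."""
    viewed = viewed.astype(bool)
    missed = np.count_nonzero(viewed & ~streamed)
    return missed / max(np.count_nonzero(viewed), 1)

# Lowering rho streams more tiles, trading bandwidth for a lower missing ratio.
```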
Our Fixation Prediction Network Outperforms Other Solutions
• Results in comparable video quality (assessed with OpenVQ [1])
• Requires shorter initial buffering time: up to 43% reduction
21
[Figure: video quality and initial buffering time comparison across solutions; annotated value of 2.38 s.]
[1] K. Skarseth, H. Bjørlo, P. Halvorsen, M. Riegler, and C. Griwodz. 2016. OpenVQ: A Video Quality Assessment Toolkit. In Proc. of ACM International Conference on Multimedia (MM’16), OSSC paper. 1197–1200.
Overhead of Our Fixation Prediction Network
• Consumes less bandwidth: about 22-36% reduction in bandwidth consumption
• Runs in real-time: < 50 ms per prediction
22
[Figure: bandwidth consumption comparison across solutions; annotated value of 4 Mbps.]
Conclusion
• Fixation prediction for 360˚ video streaming to HMDs using neural networks
 leverages both sensor- and content-related features
• Dataset collection and trace-driven simulations
• Our fixation prediction network outperforms other solutions
• The prediction is performed in real-time (< 50 ms)
23
Future Work
• Larger-scale datasets and more extensive simulations
• Eye-tracking HMDs
• The negative impact of 360˚ video projection
• Bitrate allocation algorithms and foveated rendering
24
Our Dataset
• 50 subjects, each of whom watched 10 360˚ videos using an HMD
 sensor data: raw sensor data, viewer orientation, and viewed tiles
 content data: detected saliency maps and motion maps of each video
• W. Lo, C. Fan, J. Lee, C. Huang, K. Chen, and C. Hsu. 2017. 360˚ Video Viewing Dataset in Head-Mounted Virtual Reality. In Proc. of ACM International Conference on Multimedia Systems (MMSys’17), Dataset Track.
25
Q&A
ch.ling.fan@gmail.com
Editor's Notes
• #4: Above 15 Mbps
• #16: Directly predict n future frames; the viewed tiles in the future frames are not ground truth but the predicted tiles from the previous prediction.
• #17: The tiles that overlap with these points are considered the viewed tiles, which form the ground truth.