Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019

@DocXavi
[http://pagines.uab.cat/mcv/]
Module 6 - Day 8 - Lecture 2
Deep Video Object Segmentation
28th March 2019
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya

Acknowledgements
Carles Ventura
Miriam Bellver
Amaia Salvador
Andreu Girbau

Outline
● Motivation
● Datasets & Benchmarks
● Online learning (Frame-based)
● Mask propagation
● Flow Propagation
● RNN

Video Object Segmentation (VOS)
Semi-supervised (“one-shot”) video object segmentation
vs.
Unsupervised (“zero-shot”) video object segmentation

#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Semi-supervised VOS (one-shot VOS)
Unsupervised VOS (zero-shot VOS)

Datasets and Benchmarks
Perazzi, Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. "A
benchmark dataset and evaluation methodology for video object segmentation." CVPR 2016.
DAVIS-2017
● 90 training videos (train+val)
● 30 testing videos (test-dev set)

#YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen,
and Thomas Huang. "YouTube-VOS: Sequence-to-sequence video object segmentation." ECCV 2018.

#YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang.
"YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark." arXiv preprint arXiv:1809.03327 (2018).

Online learning (frame-based)
A neural network is fine-tuned with the provided mask for the first frame (online
learning). Each frame is then processed separately.
#OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool.
"One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]

Frame-based processing introduces temporal inconsistencies...
...but results are still very convincing.
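
The online-learning workflow can be illustrated with a toy stand-in. This is a minimal sketch, not the OSVOS implementation: it assumes a two-parameter logistic classifier on pixel intensity in place of a ConvNet, and the names `fine_tune` and `segment` and all data values are hypothetical. Only the workflow mirrors the slides: fit on the annotated first frame, then segment every later frame on its own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(frame, mask, steps=2000, lr=0.5):
    """Fit p(object | intensity) = sigmoid(w * x + b) on the first frame only."""
    pixels = [x for row in frame for x in row]
    labels = [y for row in mask for y in row]
    w, b = 0.0, 0.0
    n = len(pixels)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(pixels, labels):
            err = sigmoid(w * x + b) - y  # gradient of the cross-entropy loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def segment(frame, params):
    """Segment one frame independently of all the others."""
    w, b = params
    return [[1 if sigmoid(w * x + b) > 0.5 else 0 for x in row] for row in frame]

# Toy data (assumption): the object is bright, the background is dark.
first_frame = [[0.9, 0.8, 0.1], [0.85, 0.2, 0.1]]
first_mask = [[1, 1, 0], [1, 0, 0]]
params = fine_tune(first_frame, first_mask)

later_frames = [[[0.1, 0.9, 0.8], [0.2, 0.85, 0.1]]]
masks = [segment(frame, params) for frame in later_frames]
```

Because `segment` never sees the neighbouring frames or masks, nothing ties the prediction at one frame to the next, which is where the temporal inconsistencies come from.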

What are the limitations of online learning (OL)?
Qualitative evolution of the fine-tuning:
Results at 10 seconds and 1 minute per sequence.

How is it possible to fine-tune a ConvNet with just a single frame?

Mask Propagation
[Figure: a CNN applied at each frame along the time axis]

#MaskTrack Perazzi, Federico, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung.
"Learning video object segmentation from static images." CVPR 2017. [talk]
Mask Propagation
The ConvNet is trained to refine the previous mask to the current frame.
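
The propagation loop can be sketched as below. This is a minimal sketch under stated assumptions: `refine` is a hypothetical stand-in for the trained ConvNet (it keeps bright pixels that lie within one pixel of the previous mask), and `dilate`, `propagate` and the toy frames are illustrative names and data, not MaskTrack code. What it shows is the data flow: each frame is segmented from the frame itself plus the mask estimated for the previous frame.

```python
def dilate(mask):
    """Grow a binary mask by one pixel (4-neighbourhood)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                for di, dj in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        out[ni][nj] = 1
    return out

def refine(frame, prev_mask, threshold=0.5):
    """Hypothetical stand-in for the ConvNet: bright pixels near the previous mask."""
    near = dilate(prev_mask)
    return [[1 if near[i][j] and frame[i][j] > threshold else 0
             for j in range(len(frame[0]))] for i in range(len(frame))]

def propagate(frames, first_mask):
    """Segment a sequence by refining the previous mask at every frame."""
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(refine(frame, masks[-1]))
    return masks

# Toy sequence (assumption): the object moves right; a bright distractor sits at the far end.
frames = [
    [[0.9, 0.1, 0.1, 0.1, 0.9]],
    [[0.1, 0.9, 0.1, 0.1, 0.9]],
    [[0.1, 0.1, 0.9, 0.1, 0.9]],
]
first_mask = [[1, 0, 0, 0, 0]]
masks = propagate(frames, first_mask)
```

In this toy run the distractor is never segmented, because the previous mask restricts where the object may appear.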

Flow Propagation
Jain, Suyog Dutt, Bo Xiong, and Kristen Grauman. "FusionSeg: Learning to combine motion and appearance for fully
automatic segmentation of generic objects in videos." CVPR 2017.

#MaskRNN Hu, Yuan-Ting, Jia-Bin Huang, and Alexander Schwing. "MaskRNN: Instance level video object segmentation."
NIPS 2017.
Mask + Flow Propagation
The masks of the N objects in the previous frame are warped with the optical flow.
Each mask is fed separately into another NN that detects & segments instances.
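
The warping step can be sketched as follows. This is a minimal sketch, not the MaskRNN implementation: it assumes a dense forward flow stored as `flow[y][x] = (dx, dy)` in pixels and uses nearest-neighbour splatting, whereas real systems usually rely on bilinear sampling. `warp_mask` and the toy data are hypothetical.

```python
def warp_mask(prev_mask, flow):
    """Move every foreground pixel of prev_mask along its flow vector."""
    h, w = len(prev_mask), len(prev_mask[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if prev_mask[y][x]:
                dx, dy = flow[y][x]
                nx, ny = x + round(dx), y + round(dy)
                if 0 <= nx < w and 0 <= ny < h:  # drop pixels that leave the frame
                    warped[ny][nx] = 1
    return warped

# Toy flow (assumption): row 0 moves one pixel right, row 1 moves one pixel down.
flow = [[(1, 0)] * 4, [(0, 1)] * 4, [(0, 0)] * 4]

# One binary mask per object in the previous frame (N = 2).
prev_masks = [
    [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
]

# Each object's mask is warped separately, as on the slide.
warped_masks = [warp_mask(mask, flow) for mask in prev_masks]
```

Each entry of `warped_masks` would then be passed, one object at a time, to the network that detects and segments the instance.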

Where is the RNN in the MaskRNN architecture?

The mask from the previous frame is warped and concatenated with the optical flow in two setups:
two streams and one stream.
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.

How could these architectures deal with multiple objects in a single pass?

Multiple object tracking is handled by adding more mask channels.
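
One way to picture the extra mask channels is shown below. This is a minimal sketch under stated assumptions: the input is taken to be the three colour channels followed by one binary mask channel per object, and `to_channels`, `build_input` and the channel order are hypothetical choices, not the layout used in LucidTracker.

```python
def to_channels(label_map, num_objects):
    """Split a label map (0 = background, k = object k) into one binary mask per object."""
    return [[[1 if value == k else 0 for value in row] for row in label_map]
            for k in range(1, num_objects + 1)]

def build_input(rgb_frame, mask_channels):
    """Concatenate the colour channels with the per-object mask channels."""
    return list(rgb_frame) + list(mask_channels)

# Toy data (assumption): a 2x3 frame with two objects.
label_map = [[1, 1, 0],
             [0, 2, 2]]
rgb_frame = [[[0.0] * 3 for _ in range(2)] for _ in range(3)]  # 3 channels of 2x3 pixels

masks = to_channels(label_map, num_objects=2)
net_input = build_input(rgb_frame, masks)  # 3 + 2 = 5 input channels
```

With this layout the number of input channels grows with the number of objects, so all objects are handled in one forward pass.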

Which architecture do you think will perform better: two streams or one stream?

Which architecture would you use: two streams or one stream?
“The lighter one stream network
performs as well as a network with two
streams. We will thus use the one
stream architecture”
“One stream network is more affordable
to train and allows to easily add extra
input channels, e.g. providing additional
semantic information about objects.”
One stream

RNN (ConvLSTM)
Limitations
● Each instance is trained and segmented independently
● Designed only for one-shot video object segmentation.
#S2S Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas
Huang. "YouTube-VOS: Sequence-to-sequence video object segmentation." ECCV 2018.

RNN
Tokmakov, Pavel, Karteek Alahari, and Cordelia Schmid. "Learning video object segmentation with visual memory." ICCV
2017. [talk]
Limitations
● Each instance is trained and segmented independently
● Optical flow depends on a network trained for another task: model is not end-to-end trainable

RNN (Spatial + Temporal)
time (frame sequence)
space (object sequence)
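
The two recurrence axes can be sketched as a grid of hidden states. This is a minimal sketch, assuming scalar states and a hypothetical `cell` function in place of a ConvLSTM; `run_grid` and the weights are illustrative only. It shows the dependency pattern: the state for object i at frame t reads the state of the previous object in the same frame (space) and the state of the same object in the previous frame (time).

```python
def cell(feature, h_space, h_time):
    """Hypothetical scalar stand-in for a recurrent cell."""
    return 0.5 * feature + 0.25 * h_space + 0.25 * h_time

def run_grid(features, num_objects):
    """features[t] is a scalar summary of frame t; returns hidden[t][i]."""
    hidden = []
    for t, feature in enumerate(features):
        row = []
        for i in range(num_objects):
            h_space = row[i - 1] if i > 0 else 0.0        # previous object, same frame
            h_time = hidden[t - 1][i] if t > 0 else 0.0   # same object, previous frame
            row.append(cell(feature, h_space, h_time))
        hidden.append(row)
    return hidden

# Two frames, two objects (toy values).
hidden = run_grid([1.0, 1.0], num_objects=2)
```

In a real model each `hidden[t][i]` would be decoded into the mask of object i at frame t.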

RNN (Spatial)
space (object sequence)
Previous work
#RSIS Salvador, Amaia, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier
Giro-i-Nieto. "Recurrent neural networks for semantic instance segmentation." arXiv preprint arXiv:1712.00617 (2017).

Quality vs. inference time for the semi-supervised (one-shot) task
Speed values were measured on a K80 (*) or P100 (♱) GPU; the others were obtained from the YouTube-VOS paper.

Why are techniques using online learning (OL) much slower than those that don’t?

RVOS can naturally solve both the semi-supervised (one-shot) & unsupervised
(zero-shot) tasks:
[Figure: one-shot RVOS and zero-shot RVOS, each drawn as CNNs along the time axis]

[Figure: zero-shot RVOS, CNNs along the time axis]
In zero-shot RVOS, masks were not propagated because of their low quality (also
for the first frame). How could this limitation be addressed?
Results on seen and unseen classes:
                   J seen   J unseen   F seen   F unseen
Semi-supervised    63.6     45.5       67.2     51.0
Unsupervised       44.7     21.2       45.0     23.9
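
The J columns are the region similarity (Jaccard index, or intersection over union) between predicted and ground-truth masks; F is the contour accuracy, which is not covered here. The following is a minimal sketch of J for binary masks stored as nested lists. The function names are hypothetical, returning 1.0 when both masks are empty is an assumed convention, and the values are fractions, while the table appears to report percentages.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks of the same size."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union  # assumed convention for empty masks

def mean_jaccard(preds, gts):
    """Average J over the frames of a sequence."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
j_single = jaccard(pred, gt)                  # 2 shared pixels out of 4 in the union
j_mean = mean_jaccard([pred, gt], [gt, gt])   # second frame is a perfect match
```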
MSc thesis

An alternative semi-supervised signal is a language description of the object to
segment, instead of a binary mask. How could RVOS be adapted to it?
MSc thesis
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions."
ACCV 2018.

An alternative task would be an interactive setup in which the user draws
scribbles over the object to segment. How could RVOS be adapted to it?
MSc thesis
“The interactive scenario assumes the user gives iterative refinement inputs to the algorithm, in our case in
the form of a scribble, to segment the object of interest. Methods have to produce the segmentation mask
for that object in all the frames of a video sequence taking into account all the user interactions.”

We released the RVOS PyTorch source code today, so feel free to play with it (maybe
even for your final M6 project deliverable?).

Deep Learning courses @ UPC TelecomBCN:
● MSc course [2017] [2018]
● BSc course [2018] [2019]
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 4th edition (2019)
● 1st edition (2017)
● 2nd edition (2018)
● 3rd edition - NLP (2019)
Next edition: Autumn 2019. Registration open for 2019.

Deep Learning for Professionals @ UPC School
Next edition starts November 2019. Sign up here.
