@DocXavi
[http://pagines.uab.cat/mcv/]
Module 6 - Day 8 - Lecture 2
Deep Video Object Segmentation
28th March 2019
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Acknowledgements
2
Carles Ventura
Miriam Bellver
Amaia Salvador
Andreu Girbau
Outline
3
● Motivation
● Datasets & Benchmarks
● Online learning (Frame-based)
● Mask propagation
● Flow Propagation
● RNN
Video Object Segmentation (VOS)
Semi-supervised (“one-shot”) video object segmentation
vs.
Unsupervised (“zero-shot”) video object segmentation
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Semi-supervised VOS
One-shot VOS
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Unsupervised VOS
Zero-shot VOS
Datasets and Benchmarks
8
Perazzi, Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. "A
benchmark dataset and evaluation methodology for video object segmentation." CVPR 2016.
DAVIS-2017
● 90 training videos (train+val)
● 30 testing videos (test-dev set)
Datasets and Benchmarks
9
#DAVIS Perazzi, Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung.
"A benchmark dataset and evaluation methodology for video object segmentation." CVPR 2016.
Datasets and Benchmarks
10
#YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen,
and Thomas Huang. "Youtube-vos: Sequence-to-sequence video object segmentation." ECCV 2018.
Datasets and Benchmarks
11
#YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang.
"YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark." arXiv preprint arXiv:1809.03327 (2018).
Online learning (frame-based)
13
A neural network is fine-tuned with the provided mask for the first frame (online
learning). Each frame is processed separately
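The online-learning idea above can be sketched with a toy stand-in. Here a per-pixel logistic regression on grayscale intensity replaces the ConvNet; this is an assumption made for brevity, not the OSVOS architecture, and the names `fine_tune` and `segment` are hypothetical. The model is fitted on the first frame and its mask only, then applied to every frame independently, with no temporal information.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(frame, mask, w=0.0, b=0.0, lr=0.5, steps=2000):
    """Fit p(object | intensity) = sigmoid(w * x + b) on ONE annotated frame."""
    pixels = [(x, y) for row_x, row_y in zip(frame, mask)
              for x, y in zip(row_x, row_y)]
    n = len(pixels)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in pixels:
            err = sigmoid(w * x + b) - y  # gradient of the logistic loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def segment(frame, w, b):
    """Segment one frame on its own: no information from other frames is used."""
    return [[1 if sigmoid(w * x + b) > 0.5 else 0 for x in row] for row in frame]

# Toy data: the object is bright, the background is dark.
first_frame = [[0.1, 0.2, 0.9], [0.1, 0.8, 0.9], [0.2, 0.1, 0.8]]
first_mask = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
w, b = fine_tune(first_frame, first_mask)

video = [first_frame, [[0.9, 0.8, 0.1], [0.9, 0.2, 0.1], [0.8, 0.1, 0.2]]]
masks = [segment(frame, w, b) for frame in video]
```

The fine-tuning loop runs once per sequence before any frame can be segmented, which is also where the extra inference cost of online learning comes from.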
#OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool.
"One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]
Online learning (frame-based)
14
Frame-based processing introduces temporal inconsistencies, but the results are still very convincing.
#OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool.
"One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]
15
Online learning (frame-based)
Qualitative evolution of the fine-tuning: results at 10 seconds and 1 minute per sequence.
What are the limitations of online learning (OL)?
#OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool.
"One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]
16
How is it possible to fine-tune a ConvNet with just a single frame ?
#OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool.
"One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]
Online learning (frame-based)
Mask Propagation
18
[Diagram: a CNN applied to each frame along the time axis]
19
#MaskTrack Perazzi, Federico, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung.
"Learning video object segmentation from static images." CVPR 2017. [talk]
Mask Propagation
The ConvNet is trained to refine the previous mask to the current frame.
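The propagation loop described above can be sketched as follows. The function `refine` is a hypothetical stand-in for the trained ConvNet: it keeps bright pixels that lie within one pixel of the previous mask. Only the loop structure (the previous mask is fed together with the current frame) reflects the slide; everything else is an assumption for illustration.

```python
def dilate(mask):
    """Grow a binary mask by one pixel (8-neighbourhood)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            out[rr][cc] = 1
    return out

def refine(frame, prev_mask, threshold=0.5):
    """Hypothetical stand-in for the ConvNet: (frame, previous mask) -> current mask."""
    near = dilate(prev_mask)
    return [[1 if near[r][c] and frame[r][c] > threshold else 0
             for c in range(len(frame[0]))] for r in range(len(frame))]

def propagate(frames, first_mask):
    """Start from the annotated first frame and refine the mask frame by frame."""
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(refine(frame, masks[-1]))
    return masks

# One-row video: the object moves right; a bright distractor sits at the last pixel.
frames = [
    [[0.9, 0.1, 0.1, 0.9]],
    [[0.1, 0.9, 0.1, 0.9]],
    [[0.1, 0.1, 0.9, 0.9]],
]
masks = propagate(frames, first_mask=[[1, 0, 0, 0]])
```

In this toy run the previous mask acts as a location prior, so the equally bright distractor is never selected.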
21
Jain, Suyog Dutt, Bo Xiong, and Kristen Grauman. "Fusionseg: Learning to combine motion and appearance for fully
automatic segmentation of generic objects in videos." CVPR 2017.
Flow Propagation
22
Jain, Suyog Dutt, Bo Xiong, and Kristen Grauman. "Fusionseg: Learning to combine motion and appearance for fully
automatic segmentation of generic objects in videos." CVPR 2017.
24
#MaskRNN Hu, Yuan-Ting, Jia-Bin Huang, and Alexander Schwing. "MaskRNN: Instance level video object segmentation."
NIPS 2017.
Mask + Flow Propagation
The masks of the N objects in the previous frame are warped with the optical flow.
Each mask is fed separately into another NN that detects & segments instances.
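The warping step can be sketched as below, as a minimal illustration rather than the MaskRNN implementation. It assumes that `flow[r][c] = (dr, dc)` is the displacement of the pixel at `(r, c)` from the previous frame to the current one, and it uses nearest-pixel forward warping; practical systems often use bilinear backward warping instead. The helper name `warp_mask` is hypothetical.

```python
def warp_mask(mask, flow):
    """Move every foreground pixel of a binary mask along its flow vector."""
    h, w = len(mask), len(mask[0])
    warped = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                dr, dc = flow[r][c]
                rr, cc = r + round(dr), c + round(dc)
                if 0 <= rr < h and 0 <= cc < w:  # drop pixels that leave the image
                    warped[rr][cc] = 1
    return warped

# N = 2 objects, each with its own binary mask from the previous frame.
prev_masks = [
    [[1, 1, 0, 0], [0, 0, 0, 0]],  # object 1
    [[0, 0, 0, 0], [0, 0, 1, 0]],  # object 2
]
# Assumed uniform motion: every pixel moves one column to the right.
flow = [[(0.0, 1.0)] * 4 for _ in range(2)]

# Each mask is warped separately, as on the slide.
warped = [warp_mask(m, flow) for m in prev_masks]
```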
25
#MaskRNN Hu, Yuan-Ting, Jia-Bin Huang, and Alexander Schwing. "MaskRNN: Instance level video object segmentation."
NIPS 2017.
Mask + Flow Propagation
Where is the RNN in the MaskRNN architecture ?
26
The mask from the previous frame is warped and concatenated with optical flow in two setups:
Two streams One stream
Mask + Flow Propagation
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
27
Two streams One stream
How could these architectures deal with multiple objects in a single pass ?
Mask + Flow Propagation
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
28
Multiple object tracking is handled by adding more mask channels.
Mask + Flow Propagation
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
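How extra mask channels enter a single-stream input can be sketched as follows. The channel layout (RGB, then flow magnitude, then one channel per object mask) and the helper name `one_stream_input` are assumptions for illustration, not the LucidTracker code.

```python
import math

def one_stream_input(rgb, flow, masks):
    """Concatenate per-pixel channels: RGB, flow magnitude, one channel per object mask."""
    h, w = len(rgb), len(rgb[0])
    return [[list(rgb[r][c])
             + [math.hypot(*flow[r][c])]
             + [m[r][c] for m in masks]
             for c in range(w)] for r in range(h)]

# A 1x2 image with two tracked objects.
rgb = [[(0.2, 0.4, 0.6), (0.1, 0.1, 0.1)]]
flow = [[(3.0, 4.0), (0.0, 0.0)]]
masks = [[[1, 0]], [[0, 1]]]

x = one_stream_input(rgb, flow, masks)
num_channels = len(x[0][0])  # 3 (RGB) + 1 (flow magnitude) + 2 (objects)
```

Under this layout, tracking one more object only adds one input channel, so a single forward pass can cover all objects.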
29
Two streams One stream
Which architecture do you think will perform better?
Mask + Flow Propagation
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
31
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
Two streams One stream
Which architecture would you use?
Mask + Flow Propagation
32
Which architecture would you use?
One stream:
“The lighter one stream network performs as well as a network with two streams. We will thus use the one stream architecture”
“One stream network is more affordable to train and allows to easily add extra input channels, e.g. providing additional semantic information about objects.”
Mask + Flow Propagation
#LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object
tracking." IJCV 2019.
RNN
34
[Diagram: an RNN unrolled over the frames along the time axis]
RNN (ConvLSTM)
35
Limitations
● Each instance is trained and segmented independently
● Designed only for one-shot video object segmentation.
#S2S Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas
Huang. "Youtube-vos: Sequence-to-sequence video object segmentation." ECCV 2018.
RNN
36
Tokmakov, Pavel, Karteek Alahari, and Cordelia Schmid. "Learning video object segmentation with visual memory." ICCV
2017. [talk]
Limitations
● Each instance is trained and segmented independently
● Optical flow depends on a network trained for another task: model is not end-to-end trainable
RNN (Spatial + Temporal)
37
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
time (frame sequence)
space (object sequence)
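The two recurrence axes can be illustrated with a toy scalar recurrence. This is only a sketch of the data dependencies: the actual model works on feature maps with learned recurrent layers, and the weights `w_in`, `w_space` and `w_time` and the function name `recurrent_grid` are hypothetical.

```python
import math

def recurrent_grid(frame_features, num_objects, w_in=1.0, w_space=0.5, w_time=0.5):
    """Toy recurrence over time (frames) and space (object slots)."""
    states = []  # states[t][i]: hidden state for frame t, object slot i
    for t, x in enumerate(frame_features):
        row = []
        for i in range(num_objects):
            h_space = row[i - 1] if i > 0 else 0.0       # previous object, same frame
            h_time = states[t - 1][i] if t > 0 else 0.0  # same object, previous frame
            row.append(math.tanh(w_in * x + w_space * h_space + w_time * h_time))
        states.append(row)
    return states

states = recurrent_grid(frame_features=[0.5, -0.2, 0.1], num_objects=2)
```

Each state depends on the previous object in the same frame and on the same object slot in the previous frame, so one pass can handle several objects while keeping them consistent over time.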
RNN (Spatial)
38
space (object sequence)
Previous work
#RSIS Salvador, Amaia, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier
Giro-i-Nieto. "Recurrent neural networks for semantic instance segmentation." arXiv preprint arXiv:1712.00617 (2017).
RNN (Spatial)
39
#RSIS Salvador, Amaia, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier
Giro-i-Nieto. "Recurrent neural networks for semantic instance segmentation." arXiv preprint arXiv:1712.00617 (2017).
Previous work
RNN (Spatial + Temporal)
40
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
time (frame sequence)
space (object sequence)
RNN (Spatial + Temporal)
41
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
RNN (Spatial + Temporal)
42
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
One-shot Quality vs Inference Time for the Semi-supervised (one-shot) task
Speed values measured on a K80 GPU (*) and a P100 GPU (♱); otherwise obtained from the YouTube-VOS paper.
RNN (Spatial + Temporal)
43
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Why are techniques using online learning (OL) much slower than those that don’t ?
RNN (Spatial + Temporal)
44
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
RVOS can naturally solve both the semi-supervised (one-shot) and unsupervised (zero-shot) tasks.
[Diagrams: one-shot RVOS and zero-shot RVOS, each applying a CNN per frame along the time axis]
RNN (Spatial + Temporal)
45
[Diagram: zero-shot RVOS, a CNN applied per frame along the time axis]
In zero-shot RVOS, masks were not propagated because of their low quality (also
for the first frame). How could this limitation be addressed ?
J and F on seen and unseen classes:

                  J seen   J unseen   F seen   F unseen
Semi-supervised   63.6     45.5       67.2     51.0
Unsupervised      44.7     21.2       45.0     23.9
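In the DAVIS evaluation methodology, J is the region similarity (the Jaccard index, or intersection over union, between the predicted and ground-truth masks) and F is the contour accuracy. A minimal sketch of J for one frame follows; the function name `jaccard` and the convention of returning 1.0 when both masks are empty are assumptions of this example.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pairs = [(p, g) for row_p, row_g in zip(pred, gt) for p, g in zip(row_p, row_g)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return inter / union if union else 1.0  # assumed convention for two empty masks

pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
j = jaccard(pred, gt)  # 2 shared pixels out of 4 in the union
```

Benchmark scores such as those in the table are averages of per-frame values, reported as percentages.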
MSc thesis
RNN (Spatial + Temporal)
46
An alternative semi-supervised signal is a language description of the object to
segment, instead of a binary mask. How could RVOS be adapted to it ?
MSc thesis
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions."
ACCV 2018.
RNN (Spatial + Temporal)
47
An alternative task would be an interactive set up in which the user draws
scribbles over the object to segment. How could RVOS be adapted to it ?
MSc thesis
“The interactive scenario assumes the user gives iterative refinement inputs to the algorithm, in our case in
the form of a scribble, to segment the object of interest. Methods have to produce the segmentation mask
for that object in all the frames of a video sequence taking into account all the user interactions.”
RNN (Spatial + Temporal)
48
We released the RVOS PyTorch source code today, so feel free to play with it (maybe
even for your final M6 project deliverable ?).
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS:
End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
50
Questions ?
51
Deep Learning courses @ UPC TelecomBCN:
● MSc course [2017] [2018]
● BSc course [2018] [2019]
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 4th edition (2019)
● 1st edition (2017)
● 2nd edition (2018)
● 3rd edition - NLP (2019)
Next edition: Autumn 2019. Registration open for 2019.
52
Deep Learning for Professionals @ UPC School
Next edition starts November 2019. Sign up here.

Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019

  • 1. @DocXavi [http://pagines.uab.cat/mcv/] Module 6 - Day 8 - Lecture 2 Deep Video Object Segmentation 28th March 2019 Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya
  • 3. Outline 3 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 4. Video Object Segmentation (VOS) Semi-supervised (“One-shot”) video object segmentation Unsupervised (“zero-shot”) video object segmentation VS
  • 5. #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019. Semi- supervised VOS One-shot VOS
  • 6. #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019. Un supervised VOS Zero-shot VOS
  • 7. Outline 7 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 8. Datasets and Benchmarks 8 Perazzi, Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. "A benchmark dataset and evaluation methodology for video object segmentation." CVPR 2016. DAVIS-2017 ● 90 training videos (train+val) ● 30 testing videos (test-dev set)
  • 9. Datasets and Benchmarks 9 #DAVIS Perazzi, Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. "A benchmark dataset and evaluation methodology for video object segmentation." CVPR 2016.
  • 10. Datasets and Benchmarks 10 #YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. "Youtube-vos: Sequence-to-sequence video object segmentation." ECCV 2018.
  • 11. Datasets and Benchmarks 11 #YouTube-VOS Xu, Ning, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. "YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark." arXiv preprint arXiv:1809.03327 (2018).
  • 12. Outline 12 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 13. Online learning (frame-based) 13 A neural network is fine-tuned with the provided mask for the first frame (online learning). Each frame is processed separately #OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. "One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé]
  • 14. Online learning (frame-based) 14 ...but results are still very convincing. #OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. "One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé] Frame-based processing introduces temporal inconsistencies...
  • 15. 15 What are the limitations of online learning (OL) ? #OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. "One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé] Online learning (frame-based) Qualitative evolution of the fine tuning: Results at 10 seconds and 1 minute per sequence.
  • 16. 16 How is it possible to fine-tune a ConvNet with just a single frame ? #OSVOS Caelles, Sergi, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. "One-shot video object segmentation." CVPR 2017. [Talk by Laura Leal-Taixé] Online learning (frame-based)
  • 17. Outline 17 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 19. 19 #MaskTrack Perazzi, Federico, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. "Learning video object segmentation from static images." CVPR 2017. [talk] Mask Propagation The ConvNet is trained to refine the previous mask to the current frame.
  • 20. Outline 20 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 21. 21 Jain, Suyog Dutt, Bo Xiong, and Kristen Grauman. "Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos." CVPR 2017. Flow Propagation
  • 22. 22 Jain, Suyog Dutt, Bo Xiong, and Kristen Grauman. "Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos." CVPR 2017.
  • 23. Outline 23 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 24. 24 #MaskRNN Hu, Yuan-Ting, Jia-Bin Huang, and Alexander Schwing. "MaskRNN: Instance level video object segmentation." NIPS 2017. Mask + Flow Propagation The masks of the N objects in the previous frame are warped with the optical flow. Each mask is fed separately into another NN that detects & segments instances.
  • 25. 25 #MaskRNN Hu, Yuan-Ting, Jia-Bin Huang, and Alexander Schwing. "MaskRNN: Instance level video object segmentation." NIPS 2017. Mask + Flow Propagation Where is the RNN in the MaskRNN architecture ?
  • 26. 26 Mask from previous frame is warped & concatenated with optical flow in two set ups: Two streams One stream Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 27. 27 Two streams One stream How could these architectures deal with multiple objects in a single pass ? Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 28. 28 Multiple object tracking is handled by adding more mask channels. Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 29. 29 Two streams One stream Which architecture do you think it will perform better ? Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 30. 30 Two streams One stream Which architecture do you think it will perform better ? Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 31. 31 #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019. Two streams One stream Which architecture would you use ? Mask + Flow Propagation
  • 32. 32 Which architecture would you use ? “The lighter one stream network performs as well as a network with two streams. We will thus use the one stream architecture” “One stream network is more affordable to train and allows to easily add extra input channels, e.g. providing additional semantic information about objects.” One stream Mask + Flow Propagation #LucidTracker Khoreva, Anna, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. "Lucid data dreaming for object tracking." IJCV 2019.
  • 33. Outline 33 ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 35. RNN (ConvLSTM) 35 Limitations ● Each instance is trained and segmented independently ● Designed only for one-shot video object segmentation. #S2S Xu, Ning, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. "Youtube-vos: Sequence-to-sequence video object segmentation." ECCV 2018.
  • 36. RNN 36 Tokmakov, Pavel, Karteek Alahari, and Cordelia Schmid. "Learning video object segmentation with visual memory." ICCV 2017. [talk] Limitations ● Each instance is trained and segmented independently ● Optical flow depends on a network trained for another task: model is not end-to-end trainable
  • 37. RNN (Spatial + Temporal) 37 #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019. time (frame sequence) space (object sequence)
  • 38. RNN (Spatial) 38 space (object sequence) Previous work #RSIS Salvador, Amaia, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier Giro-i-Nieto. "Recurrent neural networks for semantic instance segmentation." arXiv preprint arXiv:1712.00617 (2017).
  • 39. RNN (Spatial) Previous work. #RSIS Salvador, Amaia, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier Giro-i-Nieto. "Recurrent neural networks for semantic instance segmentation." arXiv preprint arXiv:1712.00617 (2017).
  • 40. RNN (Spatial + Temporal) Recurrence along time (frame sequence) and space (object sequence). #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
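The two recurrence axes named on this slide can be sketched as a nested loop: an outer loop over frames (time) and an inner loop over object instances (space). This is only a schematic under stated assumptions, not the RVOS implementation: `step` is a hypothetical stand-in for the recurrent decoder, and the hidden states are plain Python values rather than ConvLSTM tensors.

```python
def spatio_temporal_recurrence(frames, num_objects, step):
    """Run a toy double recurrence over time and over object instances.

    step(frame, h_space, h_time) -> (mask, hidden) is a hypothetical callable:
      h_space: hidden state of the previous object in the same frame (or None)
      h_time:  hidden state of the same object slot in the previous frame (or None)
    """
    prev_hidden = [None] * num_objects       # states carried along time
    outputs = []
    for frame in frames:                     # temporal recurrence
        h_space = None                       # spatial state restarts at each frame
        frame_masks, new_hidden = [], []
        for i in range(num_objects):         # spatial recurrence
            mask, hidden = step(frame, h_space, prev_hidden[i])
            frame_masks.append(mask)
            new_hidden.append(hidden)
            h_space = hidden
        prev_hidden = new_hidden
        outputs.append(frame_masks)
    return outputs


def toy_step(frame, h_space, h_time):
    """Toy state update that just accumulates its two incoming states."""
    hidden = 1 + (h_space or 0) + (h_time or 0)
    return hidden, hidden


masks = spatio_temporal_recurrence(["frame0", "frame1"], 2, toy_step)
# masks == [[1, 2], [2, 5]]: the last value depends on both neighbours.
```

Because each state receives input from both the previous object and the previous frame, all instances of a sequence are handled in one pass instead of one independent run per instance.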
  • 41. RNN (Spatial + Temporal) #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
  • 42. RNN (Spatial + Temporal) One-shot: quality vs. inference time for the semi-supervised (one-shot) task. Speed values were measured on a K80 GPU (*) and a P100 GPU (♱); all other values were obtained from the YouTube-VOS paper. #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
  • 43. RNN (Spatial + Temporal) Why are techniques using online learning (OL) much slower than those that don’t? #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
  • 44. RNN (Spatial + Temporal) RVOS can naturally solve both the semi-supervised (one-shot) and unsupervised (zero-shot) tasks. [Figure: one-shot RVOS and zero-shot RVOS, each applying a CNN to every frame along the time axis] #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
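One way to picture how a single model can serve both tasks is a loop in which the only difference is whether a mask is fed forward from one frame to the next. The sketch below is an assumption-laden illustration, not the released RVOS code: `predict` is a hypothetical stand-in for the network, and returning the given annotation for the first frame in one-shot mode is a simplification.

```python
def run_sequence(frames, predict, first_mask=None):
    """Segment a sequence in one-shot or zero-shot mode.

    predict(frame, prev_mask) -> mask is a hypothetical network call.
    first_mask given -> one-shot: each prediction is fed to the next frame.
    first_mask None  -> zero-shot: no mask is ever given to the network.
    """
    one_shot = first_mask is not None
    masks = []
    prev_mask = first_mask
    for t, frame in enumerate(frames):
        if one_shot and t == 0:
            masks.append(first_mask)          # the annotation is already known
            continue
        mask = predict(frame, prev_mask if one_shot else None)
        masks.append(mask)
        if one_shot:
            prev_mask = mask                  # propagate the mask along time
    return masks


def toy_predict(frame, prev_mask):
    """Toy predictor that records which mask it was conditioned on."""
    return f"{frame}|{prev_mask}"


one_shot = run_sequence(["a", "b", "c"], toy_predict, first_mask="GT")
zero_shot = run_sequence(["a", "b", "c"], toy_predict)
```

In the toy trace, every one-shot output carries the history back to the first-frame annotation, while every zero-shot output is produced without any mask input.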
  • 45. RNN (Spatial + Temporal) In zero-shot RVOS, masks were not propagated because of their low quality (also for the first frame). How could this limitation be addressed? MSc thesis
      Results on seen vs. unseen classes:
                        J_seen   J_unseen   F_seen   F_unseen
      Semi-supervised   63.6     45.5       67.2     51.0
      Unsupervised      44.7     21.2       45.0     23.9
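In the DAVIS evaluation methodology cited earlier, J denotes region similarity (the Jaccard index, i.e. intersection over union between predicted and ground-truth masks) and F denotes contour accuracy. A minimal sketch of J for a single pair of binary masks follows; the function name, the nested-list mask layout, and the choice to return 1.0 when both masks are empty are assumptions made for illustration.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    pred, gt: H x W nested lists of 0/1 values with the same shape.
    Returns 1.0 when both masks are empty (assumed convention).
    """
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else intersection / union


# One overlapping pixel out of three covered pixels gives J = 1/3.
score = jaccard([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

A benchmark score would average such per-frame, per-object values over a dataset; F would need a separate boundary-matching step that is not sketched here.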
  • 46. RNN (Spatial + Temporal) An alternative semi-supervised signal is a language description of the object to segment, instead of a binary mask. How could RVOS be adapted to it? MSc thesis Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018.
  • 47. RNN (Spatial + Temporal) An alternative task would be an interactive setup in which the user draws scribbles over the object to segment. How could RVOS be adapted to it? MSc thesis “The interactive scenario assumes the user gives iterative refinement inputs to the algorithm, in our case in the form of a scribble, to segment the object of interest. Methods have to produce the segmentation mask for that object in all the frames of a video sequence taking into account all the user interactions.”
  • 48. RNN (Spatial + Temporal) We released the RVOS PyTorch source code today, so feel free to play with it (maybe even for your final M6 project deliverable?). #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
  • 49. Outline ● Motivation ● Datasets & Benchmarks ● Online learning (Frame-based) ● Mask propagation ● Flow Propagation ● RNN
  • 51. Deep Learning courses @ UPC TelecomBCN: ● MSc course [2017] [2018] ● BSc course [2018] [2019] ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 4th edition (2019) ● 1st edition (2017) ● 2nd edition (2018) ● 3rd edition - NLP (2019) Next edition: Autumn 2019. Registration open for 2019.
  • 52. Deep Learning for Professionals @ UPC School. Next edition starts November 2019. Sign up here.