Ivan Laptev 
ivan.laptev@inria.fr 
WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision: Weakly-supervised learning from video and images
CS Club, Saint Petersburg, November 17, 2014
Joint work with: 
Piotr Bojanowski – Rémi Lajugie – Maxime Oquab – Francis Bach – Léon Bottou – Jean Ponce – Cordelia Schmid – Josef Sivic
Contacts:
Official site: http://visionlabs.ru/
Contact person: Alexander Khanin
E-mail: a.khanin@visionlabs.ru
Tel.: +7 (926) 988-7891
VisionLabs is a team of professionals with deep knowledge and substantial hands-on experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening up new possibilities for changing the world around us for the better.
About the company – Advertisement –
Team
Alexander Khanin, Chief Executive Officer
Alexey Nekhaev, Executive Officer
Slava Kazmin, Chief Technical Officer
Ivan Laptev, Scientific advisor
Sergey Milyaev, Senior CV engineer
Alexey Kordichev, Financial advisor
Ivan Truskov, Software developer
Sergey Cherepanov, Software developer
Our team is a symbiosis of science and business
Areas of activity
Face recognition technology
A system for detecting fraudsters in banks
License plate recognition technology
A system for automating and tracking vehicle access
Technologies for a safe city
A system for detecting violations and dangerous situations – Advertisement –
– Advertisement –
Nation-scale projects
Achievements
– Advertisement –
We are looking for like-minded people
Creating and deploying intelligent systems
Solving interesting practical problems
Working in a friendly, ambitious team
Thank you for your attention!
Contacts:
Official site: http://visionlabs.ru/
Contact person: Alexander Khanin
E-mail: a.khanin@visionlabs.ru
Tel.: +7 (926) 988-7891
What is Computer Vision?
What is the recent progress?
1990s:
Recognition at the level of a few toy objects (COIL-20 dataset)
Industry / research: automated quality inspection (controlled lighting, scale, …)
Now:
Face recognition in social media
ImageNet: 14M images, 21K classes
6% Top-5 error rate in the 2014 Challenge
Why image and video analysis?
Data:
~5K image uploads every min.
>34K hours of video uploaded every day
TV channels recorded since the ’60s
~30M surveillance cameras in the US => ~700K video hours/day
~2.5 billion new images / month
And even more with future wearable devices
Why looking at people?
How many person-pixels are in the video?
Movies: 40%
TV: 35%
YouTube: 34%
How many person pixels in our daily life?
Wearable camera data: Microsoft SenseCam dataset
~4%
What are the difficulties?
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing…
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• Action vocabulary is not well-defined
Action Open: …
Action Hugging: …
This talk: 
Brief overview of recent techniques 
Weakly-supervised learning from video and scripts 
Weakly-supervised learning with convolutional neural networks
Standard visual recognition pipeline 
GetOutCar 
AnswerPhone 
Kiss 
HandShake 
StandUp 
DriveCar 
Collect image/video samples and corresponding class labels 
Design appropriate data representation, with certain invariance properties 
Design / use existing machine learning methods for learning and classification
Bag-of-Features action recognition [Laptev, Marszałek, Schmid, Rozenfeld 2008]
Extraction of local features: space-time patches
Feature description
Feature quantization: k-means clustering (k=4000)
Occurrence histogram of visual words
Non-linear SVM with χ² kernel
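A minimal sketch of this pipeline, assuming scikit-learn is available; toy random vectors stand in for real space-time descriptors and a tiny vocabulary replaces k=4000:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for local space-time descriptors (e.g. HOG/HOF patches):
# each "video" yields a variable number of 16-dim features.
def extract_features(n_patches, offset):
    return rng.normal(offset, 1.0, size=(n_patches, 16))

videos = [extract_features(int(n), c)
          for n, c in zip(rng.integers(20, 40, size=4), (0, 0, 3, 3))]
labels = np.array([0, 0, 1, 1])

# Feature quantization: k-means visual vocabulary (k=4000 in the paper).
vocab = KMeans(n_clusters=8, n_init=10, random_state=0).fit(np.vstack(videos))

# Each video -> normalized occurrence histogram of visual words.
def bof_histogram(feats):
    words = vocab.predict(feats)
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([bof_histogram(v) for v in videos])

# Non-linear SVM with a chi-squared kernel, via a precomputed Gram matrix.
clf = SVC(kernel="precomputed").fit(chi2_kernel(X, X), labels)
pred = clf.predict(chi2_kernel(X, X))
```

The precomputed-kernel route mirrors how a χ² kernel is typically plugged into an off-the-shelf SVM, since it is not one of the built-in kernels.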
Action classification 
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
Where to get training data? 
Shoot actions in the lab
• KTH dataset, Weizmann dataset, …
- Limited variability
- Unrealistic
Manually annotate existing content
• HMDB, Olympic Sports, UCF50, UCF101, …
- Very time-consuming
Use readily available video scripts
• www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
- Scripts are available for 1000s of hours of movies and TV series
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 
… 
1172 
01:20:17,240 --> 01:20:20,437 
Why weren't you honest with me? 
Why'd you keep your marriage a secret?
1173 
01:20:20,640 --> 01:20:23,598 
It wasn't my secret, Richard.
Victor wanted it that way. 
1174 
01:20:23,800 --> 01:20:26,189 
Not even our closest friends 
knew about our marriage. 
… 
… 
RICK 
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa. 
ILSA 
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even 
our closest friends knew about our 
marriage. 
… 
01:20:17 
01:20:23 
subtitles 
movie script 
• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment
Script-based video annotation [Laptev, Marszałek, Schmid, Rozenfeld 2008]
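Transferring subtitle timestamps to script text by matching can be sketched as follows; the data and the `align` helper are hypothetical, and a real system would align the two sequences monotonically (e.g. by dynamic programming) rather than matching each line independently:

```python
import difflib

# Hypothetical toy data: subtitles carry timestamps; the script carries
# speaker names and scene text but no timing.
subtitles = [
    ("01:20:17", "Why weren't you honest with me?"),
    ("01:20:20", "It wasn't my secret, Richard."),
    ("01:20:23", "Not even our closest friends knew about our marriage."),
]
script_lines = [
    "RICK: Why weren't you honest with me?",
    "Rick sits down with Ilsa.",
    "ILSA: Oh, it wasn't my secret, Richard.",
    "Not even our closest friends knew about our marriage.",
]

def align(subtitles, script_lines, min_ratio=0.6):
    """Transfer each subtitle's timestamp to its best-matching script line."""
    def sim(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    timed_script = {}
    for time, text in subtitles:
        best = max(script_lines, key=lambda s: sim(text, s))
        if sim(text, best) >= min_ratio:   # reject weak matches
            timed_script[best] = time
    return timed_script

timed = align(subtitles, script_lines)
```

Once script lines carry timestamps, the scene descriptions between two dialogue lines inherit the corresponding time interval, which is exactly the (noisy) supervision used above.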
Scripts as weak supervision
Uncertainty: 24:25 – 24:51
Challenges:
• Imprecise temporal localization
• No explicit spatial localization
• NLP problems, scripts ≠ training labels:
“… Will gets out of the Chevrolet. …”, “… Erin exits her new truck …” vs. Get-out-car
Previous work 
Sivic, Everingham, and Zisserman, “Who are you?” – Learning Person Specific Classifiers from Video, CVPR 2009.
Buehler, Everingham, and Zisserman, “Learning sign language by watching TV (using weakly aligned subtitles)”, CVPR 2009.
Duchenne, Laptev, Sivic, Bach, and Ponce, “Automatic Annotation of Human Actions in Video”, ICCV 2009.
…wanted to know about the history of the trees
Joint Learning of Actors and Actions 
Rick? 
Rick? 
Walks? 
Walks? 
[Bojanowski et al. ICCV 2013]
Rick walks up behind Ilsa
Rick 
Walks 
Rick walks up behind Ilsa
Joint Learning of Actors and Actions
[Bojanowski et al. ICCV 2013]
Formulation: Cost function 
Rick, Ilsa, Sam
Actor labels 
Actor image features 
Actor classifier
Formulation: Cost function 
Weak supervision from scripts:
Person p appears at least once in clip N:
p = Rick
Formulation: Cost function
Weak supervision from scripts:
Action a appears at least once in clip N:
a = Walk
Formulation: Cost function
Weak supervision from scripts:
Person p appears in clip N:
Action a appears in clip N:
Person p and action a appear in clip N:
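These at-least-once constraints can be written out explicitly; the notation below (binary assignment variables z over person tracks i in clip N) is an illustrative sketch, not the paper's exact formulation:

```latex
% Illustrative notation: z_{ip} = 1 iff track i in clip N is person p,
%                        z_{ia} = 1 iff track i performs action a.
\text{person } p \text{ appears in clip } N:\qquad \sum_{i \in N} z_{ip} \;\ge\; 1
\text{action } a \text{ appears in clip } N:\qquad \sum_{i \in N} z_{ia} \;\ge\; 1
\text{person } p \text{ does action } a \text{ in } N:\qquad \sum_{i \in N} z_{ip}\, z_{ia} \;\ge\; 1
```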
Image and video features
• Facial features [Everingham ’06]
• HOG descriptor on normalized face image
• Dense Trajectory features in person bounding box [Wang et al. ’11]
Face features
Action features
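The HOG face descriptor can be sketched in a few lines of NumPy; this toy version computes per-cell orientation histograms only, without the block normalization of the full descriptor:

```python
import numpy as np

def hog_descriptor(img, n_bins=9, cell=8):
    """Minimal HOG on a grayscale image: per-cell histograms of gradient
    orientations, weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    H, W = img.shape
    cells = []
    for y in range(0, H - cell + 1, cell):
        for x in range(0, W - cell + 1, cell):
            a = ang[y:y + cell, x:x + cell].ravel()
            m = mag[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 180), weights=m)
            cells.append(hist)
    v = np.concatenate(cells)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# A 32x32 "face crop" with a vertical edge: gradients are horizontal,
# so all the energy lands in a single orientation bin.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
desc = hog_descriptor(img)
```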
Results for Person Labelling
American Beauty (11 character names)
Casablanca (17 character names)
Results for Person + Action Labelling
Casablanca, 
Walking
Finding Actions and Actors in Movies 
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
Action Learning with Ordering Constraints
[Bojanowski et al. ECCV 2014]
Cost Function
Weak supervision from ordering constraints on Z:
an ordered sequence of action indices (e.g. 2, 4, 1, 2, 3, 2) assigned to successive video time intervals
Is the optimization tractable? 
•Path constraints are implicit 
•Cannot use off-the-shelf solvers 
•Frank-Wolfe optimization algorithm
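Frank-Wolfe is attractive here precisely because it needs only a linear minimization oracle over the feasible set, not a projection. A generic sketch on a toy problem (projecting a point onto the probability simplex; the step size is an exact line search valid for this quadratic objective):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, n_iter=300):
    """Generic Frank-Wolfe loop: needs only a gradient oracle and a linear
    minimization oracle (LMO) over the feasible set; iterates stay feasible
    as convex combinations of the oracle's vertices."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        g = grad(x)
        s = lmo(g)                 # vertex minimizing <g, s>
        d = s - x
        denom = float(d @ d)
        if denom == 0.0:
            break
        # Exact line search; this closed form is valid for quadratic
        # objectives with Hessian 2*I, such as f(x) = ||x - b||^2.
        gamma = float(np.clip(-(g @ d) / (2.0 * denom), 0.0, 1.0))
        x = x + gamma * d
    return x

# Toy instance: minimize ||x - b||^2 over the probability simplex.
b = np.array([0.5, 0.3, 0.2])
grad = lambda x: 2.0 * (x - b)

def lmo(g):
    s = np.zeros_like(g)
    s[int(np.argmin(g))] = 1.0     # simplex LMO returns a vertex
    return s

x_star = frank_wolfe(grad, lmo, np.array([1.0, 0.0, 0.0]))
```

In the action-ordering setting, the LMO over the implicit path constraints is what replaces the simplex oracle here; that step is the problem-specific part.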
Results
• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated by a sequence of n actions (2 ≤ n ≤ 11)
Computer Vision
Object recognition
Convolutional Neural Networks 
• ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
• Krizhevsky et al.’s ILSVRC’12 results outperform other methods by a large margin
2012
2014, GoogLeNet: 6%
CNN of Krizhevsky et al., NIPS’12
• Learns low-level features at the first layer.
• Has some tricks, but the main principle is similar to LeCun ’88.
• Has 60M parameters and 650K neurons.
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementations. Neither (a) nor (b) was available until very recently.
Approach 
1. Design training/test procedure using sliding windows
2. Train adaptation layers to map labels
See also [Girshick et al. ’13], [Donahue et al. ’13], [Sermanet et al. ’14], [Zeiler and Fergus ’13], the Transfer Learning workshop at ICCV’13, and the ImageNet workshop at ICCV’13
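The adaptation idea can be sketched with a frozen feature extractor and a small trainable layer; everything below is a toy stand-in (a fixed random ReLU projection instead of the pretrained C1–FC7 stack, a single logistic unit instead of the FCa/FCb layers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" feature extractor standing in for C1..FC7:
# a fixed random ReLU projection that is never updated.
W_pre = rng.normal(size=(32, 20))
def pretrained_features(x):
    return np.maximum(W_pre @ x, 0.0)

# Toy target-task data: two well-separated classes in input space.
X = np.vstack([rng.normal(-1.0, 1.0, (50, 20)),
               rng.normal(+1.0, 1.0, (50, 20))])
y = np.array([0] * 50 + [1] * 50)
F = np.array([pretrained_features(x) for x in X])   # frozen features

# Adaptation layer (a single logistic unit here) trained by gradient
# descent on the target-task labels; W_pre stays fixed throughout.
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(1000):
    z = np.clip(F @ w + b, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-z))     # sigmoid scores
    g = p - y                        # logistic-loss gradient
    w -= 0.05 * (F.T @ g) / len(y)
    b -= 0.05 * g.mean()

train_acc = float(((F @ w + b > 0).astype(int) == y).mean())
```

Only `w` and `b` are updated; the point of the approach is that the expensive representation is reused and only the cheap mapping to the new label set is learned.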
Approach – sliding window training / testing
Results
Object localization
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Vision works?
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
VOC Action Classification Taster Challenge 
Given the bounding box of a person, predict whether they are performing a given action 
Playing Instrument? 
Reading? 
Encourage research on still-image activity recognition: more detailed understanding of images
Nine Action Classes 
Phoning 
Playing Instrument 
Reading 
Riding Bike 
Riding Horse 
Running 
Taking Photo 
Using Computer 
Walking
CNN action recognition and localization 
Qualitative results: reading
CNN action recognition and localization 
Qualitative results: phoning
CNN action recognition and localization 
Qualitative results: playing instrument
Results PASCAL VOC 2012 
Object classification 
Action classification 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Are bounding boxes needed for training CNNs? 
Image-level labels: Bicycle, Person 
[Oquab, Bottou, Laptev, Sivic, 2014]
Motivation: labeling bounding boxes is tedious
Motivation: image-level labels are plentiful 
“Beautiful red leaves in a back street of Freiburg” 
[Kuznetsova et al., ACL 2013]
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
Motivation: image-level labels are plentiful 
“Public bikes in Warsaw during night” 
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
Let the algorithm localize the object in the image 
[Oquab, Bottou, Laptev, Sivic, 2014] 
Example training images with bounding boxes 
The locations of objects or their parts learnt by the CNN 
NB: Related to multiple instance learning, e.g. [Viola et al. ’05], and weakly supervised object localization, e.g. [Pandey and Lazebnik ’11], [Prest et al. ’12], [Oh Song et al. ICML’14], …
Approach: search over object’s location 
1. Efficient window sliding to find object location hypotheses
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allowing multiple objects per image)
See also [Sermanet et al. ’14] and [Chatfield et al. ’14]
[Figure: C1–C5 and FC6–FC7 map each window to a 4096-dim vector; adaptation layers FCa and FCb produce per-class scores (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …), which are max-pooled over the image into a per-image score.]
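Steps 2 and 3 can be sketched in a few lines of NumPy; the score maps below are hypothetical random values standing in for the output of the convolutional layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-class score maps over an image grid, as a fully
# convolutional net would produce (K classes, H x W window locations).
K, H, W = 4, 6, 6
score_maps = rng.normal(0.0, 1.0, size=(K, H, W))

# Step 2: image-level aggregation -- a global max-pool keeps, per class,
# the single best-scoring window; its position is the location hypothesis.
image_scores = score_maps.reshape(K, -1).max(axis=1)
locations = [np.unravel_index(int(m.argmax()), m.shape) for m in score_maps]

# Step 3: multi-label loss -- one log-loss per class with labels in {+1,-1},
# so several objects may be present in the same image at once.
y = np.array([+1, -1, +1, -1])
loss = float(np.log1p(np.exp(-y * image_scores)).sum())
```

Because the gradient flows only through the max-scoring window of each class, training pushes up the score at the best location for present classes and pushes it down for absent ones.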
1. Efficient window sliding to find object location 
[Figure 2: Network architecture. Convolutional feature extraction layers C1–C5 and FC6–FC7 are trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers FCa and FCb are trained on Pascal VOC. The layer legend indicates the number of maps and whether a layer performs cross-map normalization (norm), pooling (pool), or dropout, and reports its subsampling ratio with respect to the input image.]
2. Image-level aggregation using global max-pool 
3. Multi-label loss function 
(to allow for multiple objects in image)
Sum of K (= 20) log-loss functions, one for each of the K classes:
f(x): K-vector of network outputs for image x
y: K-vector of (+1, −1) labels indicating the presence/absence of each class
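Written out, the loss just described is a sum of K independent per-class log-losses (the notation f(x) and y for the two K-vectors is mine):

```latex
\ell\big(f(x), y\big) \;=\; \sum_{k=1}^{K} \log\!\left(1 + e^{-y_k f_k(x)}\right),
\qquad y_k \in \{+1, -1\},\quad K = 20
```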
Search for objects using max-pooling
aeroplane map, car map: max-pool each per-class score map over the image; the receptive field of the maximum-scoring neuron says “Found something there!”
Correct label: “Keep up the good work!” (increase the score for this class)
Incorrect label: “Wrong!” (decrease the score for this class)
Search for objects using max-pooling
What is the effect of errors?
Multi-scale training and testing 
[Figure: weakly supervised training and multiscale object recognition; the input image is rescaled and passed through the convolutional and adaptation layers, yielding per-class score maps (chair, diningtable, sofa, pottedplant, person, car, bus, train, …).]
Training videos
Test results on 80 classes in Microsoft COCO dataset

More Related Content

PDF
20140427 parallel programming_zlobin_lecture11
PDF
E03404025032
PPTX
PDF
Activity detection at TRECVID
PDF
Video search by deep-learning
PDF
Breaking Through The Challenges of Scalable Deep Learning for Video Analytics
PDF
Human Action Recognition in Videos
PDF
Video Classification: Human Action Recognition on HMDB-51 dataset
20140427 parallel programming_zlobin_lecture11
E03404025032
Activity detection at TRECVID
Video search by deep-learning
Breaking Through The Challenges of Scalable Deep Learning for Video Analytics
Human Action Recognition in Videos
Video Classification: Human Action Recognition on HMDB-51 dataset

Similar to Computer Vision (20)

PDF
A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Ac...
PDF
Video+Language: From Classification to Description
PPTX
Action_recognition-topic.pptx
PPTX
Andrii Boichuk: Video-based action recognition - past, present and future
PDF
Video + Language 2019
PPTX
Matt Feiszli at AI Frontiers : Video Understanding
PDF
cvpr2011: human activity recognition - part 1: introduction
PPTX
Transformer in Vision
PPTX
A general survey of previous works on action recognition
PDF
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
PPT
Fcv scene schmid
PDF
Volume 2-issue-6-1960-1964
PDF
Volume 2-issue-6-1960-1964
PDF
Flow Trajectory Approach for Human Action Recognition
PDF
Human Action Recognition Using Deep Learning
PDF
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
PDF
Deep Learning from Videos (UPC 2018)
PDF
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
PDF
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Ac...
Video+Language: From Classification to Description
Action_recognition-topic.pptx
Andrii Boichuk: Video-based action recognition - past, present and future
Video + Language 2019
Matt Feiszli at AI Frontiers : Video Understanding
cvpr2011: human activity recognition - part 1: introduction
Transformer in Vision
A general survey of previous works on action recognition
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
Fcv scene schmid
Volume 2-issue-6-1960-1964
Volume 2-issue-6-1960-1964
Flow Trajectory Approach for Human Action Recognition
Human Action Recognition Using Deep Learning
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
Deep Learning from Videos (UPC 2018)
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017

More from Computer Science Club (20)

PDF
20141223 kuznetsov distributed
PDF
20140531 serebryany lecture01_fantastic_cpp_bugs
PDF
20140531 serebryany lecture02_find_scary_cpp_bugs
PDF
20140531 serebryany lecture01_fantastic_cpp_bugs
PDF
20140511 parallel programming_kalishenko_lecture12
PDF
20140420 parallel programming_kalishenko_lecture10
PDF
20140413 parallel programming_kalishenko_lecture09
PDF
20140329 graph drawing_dainiak_lecture02
PDF
20140329 graph drawing_dainiak_lecture01
PDF
20140310 parallel programming_kalishenko_lecture03-04
PDF
20140223-SuffixTrees-lecture01-03
PDF
20140216 parallel programming_kalishenko_lecture01
PDF
20131106 h10 lecture6_matiyasevich
PDF
20131027 h10 lecture5_matiyasevich
PDF
20131027 h10 lecture5_matiyasevich
PDF
20131013 h10 lecture4_matiyasevich
PDF
20131006 h10 lecture3_matiyasevich
PDF
20131006 h10 lecture3_matiyasevich
PDF
20131006 h10 lecture2_matiyasevich
PDF
20130922 h10 lecture1_matiyasevich
20141223 kuznetsov distributed
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs
20140511 parallel programming_kalishenko_lecture12
20140420 parallel programming_kalishenko_lecture10
20140413 parallel programming_kalishenko_lecture09
20140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture01
20140310 parallel programming_kalishenko_lecture03-04
20140223-SuffixTrees-lecture01-03
20140216 parallel programming_kalishenko_lecture01
20131106 h10 lecture6_matiyasevich
20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich
20131013 h10 lecture4_matiyasevich
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture2_matiyasevich
20130922 h10 lecture1_matiyasevich

Recently uploaded (20)

PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Pharma ospi slides which help in ospi learning
PDF
Business Ethics Teaching Materials for college
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
Cell Types and Its function , kingdom of life
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
RMMM.pdf make it easy to upload and study
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
Anesthesia in Laparoscopic Surgery in India
Microbial disease of the cardiovascular and lymphatic systems
O7-L3 Supply Chain Operations - ICLT Program
01-Introduction-to-Information-Management.pdf
Pharma ospi slides which help in ospi learning
Business Ethics Teaching Materials for college
Module 4: Burden of Disease Tutorial Slides S2 2025
Cell Types and Its function , kingdom of life
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Insiders guide to clinical Medicine.pdf
RMMM.pdf make it easy to upload and study
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
FourierSeries-QuestionsWithAnswers(Part-A).pdf

Computer Vision

  • 1. Ivan Laptev ivan.laptev@inria.fr WILLOW, INRIA/ENS/CNRS, ParisComputer Vision: Weakly-supervised learning from video and images CSClubSaint PetersburgNovember 17, 2014 Joint work with: Piotr Bojanowski–RémiLajugie–MaximeOquab– Francis Bach –Leon Bottou–Jean Ponce – Cordelia Schmid–Josef Sivic
  • 2. Контакты: Официальный сайт:http://visionlabs.ru/ Контактное лицо:Ханин Александр E-mail: a.khanin@visionlabs.ru Тел.: +7 (926) 988-7891 VisionLabs–командапрофессионалов,обладающихзначительнымизнаниямиисущественнымпрактическимопытомвсфереразработкиалгоритмовкомпьютерногозренияиинтеллектуальныхсистем. Мы создаем и внедряем технологии компьютерного зрения, открывая новые возможности для изменения окружающего нас мира к лучшему. О компании–Advertisement –
  • 3. Команда Александр Ханин Chief Executive Officer Алексей Нехаев Executive Officer Слава Казьмин Chief Technical Officer Иван Лаптев Scientific advisor Сергей Миляев Senior CV engineer Алексей Кордичев Financial advisor Иван Трусков Software developer Сергей Черепанов Software developer Наша команда – симбиоз науки и бизнеса Направления деятельности Технологияраспознаваниялиц Системавыявлениямошенниковвбанках Технологияраспознаванияномеров Системаучетаиавтоматизациидоступатранспорта Технологиидлябезопасногогорода Системавыявлениянарушенийиопасныхситуаций–Advertisement –
  • 4. –Advertisement – Проекты масштаба государства Достижения
  • 5. –Advertisement – Мы ищем единомышленников Создание и внедрение интеллектуальных систем Решение интересных практических задач Работа в дружной амбициозной команде Спасибо за внимание! Контакты: Официальный сайт:http://visionlabs.ru/ Контактное лицо:Ханин Александр E-mail: a.khanin@visionlabs.ru Тел.: +7 (926) 988-7891
  • 7. 7 What is Computer Vision?
  • 9. What is the recent progress? 1990s: Recognition at the level of a few toy objects (COIL 20 dataset) Industry Research Automated quality inspection (controlled lighting, scale,…) Now: Face recognition in social media ImageNet: 14M images, 21K classes 6% Top-5 error rate in 2014 Challenge
  • 10. ~5K image uploads every min. >34K hours of video upload every day TV-channels recorded since 60’s ~30M surveillance cameras in US => ~700K video hours/day ~2.5 Billion new images / month And even more with future wearable devicesWhy image and video analysis? Data:
  • 11. Movies TV YouTubeWhy looking at people? How many person-pixels are in the video?
  • 12. Movies TV YouTube How many person-pixels are in the video? 40% 35% 34% Why looking at people?
  • 13. How many person pixels in our daily life? Wearable camera data: Microsoft SenseCamdataset 
  • 14. How many person pixels in our daily life? Wearable camera data: Microsoft SenseCamdataset  ~4%
  • 15.  Large variations in appearance: occlusions, non-rigid motion, view-point changes, clothing… What are the difficulties?  Manual collection of training samples is prohibitive: many action classes, rare occurrence  Action vocabulary is not well-defined … Action Open: … … Action Hugging:
  • 16. This talk: Brief overview of recent techniques Weakly-supervised learning from video and scripts Weakly-supervised learning with convolutional neural networks
  • 17. Standard visual recognition pipeline GetOutCar AnswerPhone Kiss HandShake StandUp DriveCar Collect image/video samples and corresponding class labels Design appropriate data representation, with certain invariance properties Design / use existing machine learning methods for learning and classification
  • 18. Occurrence histogram of visual words space-time patches Extraction of Local features Feature description K-means clustering (k=4000) Feature quantization Non-linear SVM with χ2kernel [Laptev, Marszałek, Schmid, Rozenfeld2008] Bag-of-Features action recognition
  • 19. Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
  • 20. Where to get training data? Shoot actions in the lab • KTH datasetWeizmandataset,… -Limited variability -Unrealistic Manually annotate existing content • HMDB, Olympic Sports, UCF50, UCF101, … -Very time-consuming Use readily-available video scripts • www.dailyscript.com, www.movie-page.com, www.weeklyscript.com -Scripts are available for 1000’s of hours of movies and TV-series -Scripts describe dynamic and static content of videos
  • 21. As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 21
  • 22. As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 22
  • 23. As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past.The headwaiter seats Ilsa... 23
  • 24. As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 24
  • 25. … 1172 01:20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'dyou keep your marriage a secret? 1173 01:20:20,640 --> 01:20:23,598 lt wasn't my secret, Richard. Victor wanted it that way. 1174 01:20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage. … … RICK Why weren't you honest with me? Why didyou keep your marriage a secret? Rick sits down with Ilsa. ILSA Oh,it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage. … 01:20:17 01:20:23 subtitles movie script •Scripts available for >500 movies (no time synchronization) www.dailyscript.com, www.movie-page.com, www.weeklyscript.com … •Subtitles (with time info.) are available for the most of movies •Can transfer time to scripts by text alignmentScript-based video annotation [Laptev, Marszałek, Schmid, Rozenfeld2008]
  • 26. Scripts as weak supervision Uncertainty 24:25 24:51 Imprecise temporal localization • No explicit spatial localization • NLP problems, scripts ≠ training labels • “… Will gets out of the Chevrolet. …” “… Erin exits her new truck…” vs. Get-out-car Challenges:
  • 27. Previous work Sivic, Everingham, and Zisserman, ''Who are you?'' --Learning Person Specific Classifiers from Video, In CVPR 2009. Buehler, Everingham, and Zisserman"Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009. Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009. …wanted to know about the history of the trees
  • 28. Joint Learning of Actors and Actions Rick? Rick? Walks? Walks? [Bojanowskiet al. ICCV 2013] Rick walks up behind Ilsa
  • 29. Rick Walks Rick walks up behind IlsaJoint Learning of Actors and Actions [Bojanowskiet al. ICCV 2013]
  • 30. Formulation: Cost function RickIlsaSam Actor labels Actor image features Actor classifier
  • 31. Formulation: Cost function Person pappears at least once in clipN: p = Rick Weak supervision from scripts:
  • 32. Action aappears at least once in clipN: a = Walk Weak supervision from scripts: Formulation: Cost function
  • 33. Formulation: Cost function Action aappears in clipN: Weak supervision from scripts: Person pappears in clipN: Person pand Action aappear in clipN:
  • 34. 34 Image and video features •Facial features [Everingham’06] •HOG descriptor on normalized face image •Dense Trajectory features in person bounding box [Wang et al.,’11] Face features Action features
  • 35. 35 Results for Person Labelling American beauty (11 character names) Casablanca (17 character names)
  • 36. 36 Results for Person + Action Labelling Casablanca, Walking
  • 37. Finding Actions and Actors in Movies [Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
  • 38. 38 Action Learning with Ordering Constraints [Bojanowskiet al. ECCV 2014]
  • 39. 39 Action Learning with Ordering Constraints [Bojanowskiet al. ECCV 2014]
  • 40. Cost Function Weak supervision from ordering constraints on Z: Action label Action index 2 4 1 2 3 2 Video time intervals
  • 41. Cost Function Weak supervision from ordering constraints on Z: Action label Action index 2 4 1 2 3 2 Video time intervals
  • 42. Cost Function Weak supervision from ordering constraints on Z: Action label Action index 2 4 1 2 3 2 Video time intervals
  • 43. Is the optimization tractable? •Path constraints are implicit •Cannot use off-the-shelf solvers •Frank-Wolfe optimization algorithm
  • 44. Results 937 video clips from 60 Hollywood movies • 16 action classes • Each clip is annotated by a sequence of n actions (2≤n≤11) •
  • 47. Convolutional Neural Networks. The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images. The ILSVRC'12 results of Krizhevsky et al. improved on other methods by a large margin (2012); by 2014, GoogLeNet reached ~6% error.
  • 48. CNN of Krizhevsky et al., NIPS'12. Learns low-level features at the first layer; has some tricks, but the main principle is similar to LeCun '88; has 60M parameters and 650K neurons. Success seems to be determined by (a) lots of labeled images and (b) a very fast GPU implementation; neither was available until very recently.
  • 49. Approach. 1. Design the training/test procedure using sliding windows. 2. Train adaptation layers to map labels. See also [Girshick et al. '13], [Donahue et al. '13], [Sermanet et al. '14], [Zeiler and Fergus '13]; Transfer Learning workshop at ICCV'13, ImageNet workshop at ICCV'13.
  • 50. Approach: sliding-window training / testing
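The sliding-window scheme can be sketched as scoring every fixed-size crop of the image with a pretrained classifier. The window size, stride, and the toy mean-intensity scorer below are illustrative stand-ins for the CNN:

```python
import numpy as np

def sliding_window_scores(score_patch, image, win=224, stride=32):
    """Slide a fixed-size window over the image and score every crop
    with the given classifier; returns one score per window position."""
    scores = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            scores.append(score_patch(image[y:y + win, x:x + win]))
    return np.array(scores)

# toy scorer: mean intensity stands in for a CNN class score
img = np.zeros((288, 288)); img[:100, :100] = 1.0
s = sliding_window_scores(lambda p: p.mean(), img)  # 3 x 3 = 9 windows
```

At test time, the per-window scores are aggregated into an image-level prediction; the later slides replace this explicit loop with a fully convolutional evaluation.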
  • 52. Results [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 55. Vision works? [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 56. VOC Action Classification Taster Challenge. Given the bounding box of a person, predict whether they are performing a given action: playing instrument? reading? The challenge encourages research on still-image activity recognition and a more detailed understanding of images.
  • 57. Nine Action Classes Phoning Playing Instrument Reading Riding Bike Riding Horse Running Taking Photo Using Computer Walking
  • 58. CNN action recognition and localization Qualitative results: reading
  • 59. CNN action recognition and localization Qualitative results: phoning
  • 60. CNN action recognition and localization Qualitative results: playing instrument
  • 61. Results. PASCAL VOC 2012: object classification, action classification. [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 62. Are bounding boxes needed for training CNNs? Image-level labels: Bicycle, Person [Oquab, Bottou, Laptev, Sivic, 2014]
  • 63. Motivation: labeling bounding boxes is tedious
  • 64. Motivation: image-level labels are plentiful. "Beautiful red leaves in a back street of Freiburg" [Kuznetsova et al., ACL 2013] http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
  • 65. Motivation: image-level labels are plentiful “Public bikes in Warsaw during night” https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
  • 66. Let the algorithm localize the object in the image [Oquab, Bottou, Laptev, Sivic, 2014]. Example training images with bounding boxes; the locations of objects or their parts learnt by the CNN. NB: related to multiple instance learning, e.g. [Viola et al. '05], and weakly supervised object localization, e.g. [Pandey and Lazebnik '11], [Prest et al. '12], [Oh Song et al., ICML '14], …
  • 67. Approach: search over object location. 1. Efficient window sliding to find object-location hypotheses. 2. Image-level aggregation (max-pool). 3. Multi-label loss function (allowing multiple objects per image). See also [Sermanet et al. '14] and [Chatfield et al. '14]. [Figure: network C1-C2-C3-C4-C5 → FC6 (9216-dim) → FC7 (4096-dim) → FCa → FCb; per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …) are max-pooled over the image into per-image scores.]
  • 68. 1. Efficient window sliding to find object location. [Figure 2: Network architecture. Convolutional feature extraction layers C1–C5, FC6, FC7 trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers FCa, FCb trained on Pascal VOC. The layer legend indicates the number of maps, cross-map normalization (norm), pooling (pool), dropout, and the subsampling ratio with respect to the input image.]
  • 69. 2. Image-level aggregation using global max-pool. [Same network figure as on the previous slide.]
  • 70. 3. Multi-label loss function (to allow for multiple objects in an image). [Same network figure as on the previous slide.] The loss is a sum of K (= 20) log-losses, one per class, computed from the K-vector of network outputs for image x and the K-vector of (+1, −1) labels indicating the presence or absence of each class.
  • 71. Search for objects using max-pooling. Aeroplane map, car map: the receptive field of the maximum-scoring neuron ("Found something there!") localizes the object. For a correct label, its score is increased ("Keep up the good work!"); for an incorrect label, it is decreased ("Wrong!").
  • 72. Search for objects using max-pooling: what is the effect of errors?
  • 73. Multi-scale training and testing. [Figure 3: weakly supervised training. Figure 4: multiscale object recognition: the image is rescaled to multiple scales (roughly ×0.7 to ×1.4) and the convolutional adaptation layers are run at each scale; classes include chair, diningtable, sofa, pottedplant, person, car, bus, train, …]
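A sketch of the test-time multi-scale scheme: rescale the input, score each scale with the (fully convolutional) network, and max-pool across scales as well as locations. The nearest-neighbour resizing and the toy mean-intensity scorer are stand-ins, and the scale set is illustrative:

```python
import numpy as np

def multiscale_scores(score_fn, image, scales=(0.7, 1.0, 1.4)):
    """Score the image at several scales and max-pool the per-class
    score vectors across scales (score_fn must return a fixed-size
    vector regardless of input size, e.g. via global pooling)."""
    per_scale = []
    for s in scales:
        h = max(1, int(round(image.shape[0] * s)))
        w = max(1, int(round(image.shape[1] * s)))
        ys = np.arange(h) * image.shape[0] // h   # nearest-neighbour rows
        xs = np.arange(w) * image.shape[1] // w   # nearest-neighbour cols
        per_scale.append(score_fn(image[np.ix_(ys, xs)]))
    return np.max(per_scale, axis=0)

img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
res = multiscale_scores(lambda im: np.array([im.mean()]), img)
```

Max-pooling over scales lets the fixed receptive field match objects of different sizes without retraining.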
  • 75. Test results on 80 classes in Microsoft COCO dataset