Large-Scale Video Understanding:
YouTube and Beyond
Rahul Sukthankar
Machine Perception, Google Research
https://research.google.com/teams/perception/
AI Frontiers Conference - Nov. 3, 2017
Machine Perception
Really Works!
(better than I expected)
Sample of Perception tech in products
● Signals for Image Search ranking, related images, search-by-image, etc.
● Cloud Video API
● Cloud Vision API
● HDR+ in Android Camera (Seth LaForge, Nexus 5X)
● Mobile Vision API
● Organizing Photos image & video collections and making them searchable by content
● Microvideo tech in Photos & Motion Stills
● De-reflection & tracking in Photo Scanner
● Personalized sticker packs in Allo
● On-device handwriting input & recognition
● OCR for lots of languages
● Visual & auditory annotation & signals on YouTube
● Thumbnail/preview selection & optimization for YouTube
● Non-speech sound captions on YouTube
● Region tracking for custom blurring tool on YouTube
● Mobile creative effects on YouTube
watch, listen, understand | capture a moment | improve & manipulate
Useful Applications for Video Technology
Help users create, enhance, organize, and discover videos.
Privacy Region Tracking & Blurring for YouTube
Fun Effects from Tracking (on Mobile) for YouTube
Large-Scale Video Annotation for YouTube
Video understanding pipeline as of ~5 years ago:
pixels & sound samples → extract features (hand-designed descriptors) → frame features → quantize & aggregate (codebook histogram) → video features → train model (e.g., AdaBoost) with training data → “Roller-blading”
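The "quantize & aggregate" step of the older pipeline can be illustrated with a small bag-of-words sketch: each frame descriptor is assigned to its nearest codeword, and the assignments are counted into a normalized histogram that serves as the video feature. This is only a minimal illustration under assumed details. The function names, the Euclidean nearest-neighbour rule, the L1 normalization, and the toy data are all assumptions, since the slides do not specify them.

```python
import math

def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest (squared Euclidean) to descriptor."""
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def codebook_histogram(frame_descriptors, codebook):
    """Quantize every descriptor, then aggregate counts into an L1-normalized histogram."""
    counts = [0] * len(codebook)
    for descriptor in frame_descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data (hypothetical): three 2-d codewords and four frame descriptors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.8]]
video_feature = codebook_histogram(descriptors, codebook)
print(video_feature)  # [0.25, 0.5, 0.25]
```

The resulting fixed-length histogram is what a classifier such as AdaBoost would then be trained on, regardless of how many frames the video has.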
Modern video understanding pipeline:
pixels & sound samples → extract features (magic box containing many convolutional, deep, end-to-end buzzwords :-) → “Roller-blading”
The magic box is learned from training data.
Deep-learned visual features
Inception model trained on noisy data (images)
Bottleneck embedding layer (1000-d)
Videos with noisy labels
Frame-level → Video-level:
- Max pooling
- Avg pooling
- VLAD pooling
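The three frame-to-video aggregation options listed above can be sketched as follows. This is a toy illustration, not the production implementation: the cluster centers for VLAD are assumed to be given (in practice they would be learned, for example with k-means), and the final L2 normalization is one common convention rather than something the slides state.

```python
import math

def max_pool(frames):
    """Element-wise maximum over all frame embeddings."""
    return [max(column) for column in zip(*frames)]

def avg_pool(frames):
    """Element-wise mean over all frame embeddings."""
    return [sum(column) / len(frames) for column in zip(*frames)]

def vlad_pool(frames, centers):
    """VLAD: sum residuals (frame - nearest center) per center, concatenate, L2-normalize."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for frame in frames:
        # Index of the center nearest to this frame (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, centers[i])),
        )
        for j in range(dim):
            residuals[k][j] += frame[j] - centers[k][j]
    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm else flat

# Toy frame embeddings (hypothetical 2-d values; real embeddings are much larger).
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]

video_max = max_pool(frames)
video_avg = avg_pool(frames)
video_vlad = vlad_pool(frames, centers)
print(video_max, video_avg, video_vlad)
```

Max and average pooling keep the embedding dimensionality, while VLAD produces a vector of size (number of centers) x (embedding dimension), so its size is controlled by the number of centers.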
Deep-learned vs. handcrafted features
[Plot: mean average precision (MAP, axis 0 to 0.40) vs. feature dimensionality]
● Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP
● Handcrafted audio-visual features: ~40K-d, 0.153 MAP
● +80% mean avg. precision
● 40x more compact features
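The two headline numbers on this slide follow from the reported figures. A quick check, assuming "~40K-d" means roughly 40,000 dimensions (the exact value is not given), shows that the slide rounds both results:

```python
# Figures as reported on the slide.
deep_map, handcrafted_map = 0.272, 0.153
deep_dim = 1024
handcrafted_dim = 40_000  # assumption: "~40K-d" read as 40,000

relative_gain = deep_map / handcrafted_map - 1.0
compaction = handcrafted_dim / deep_dim

print(f"relative MAP gain: {relative_gain:.1%}")  # 77.8%, rounded to +80% on the slide
print(f"compaction factor: {compaction:.1f}x")    # 39.1x, rounded to 40x on the slide
```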
Personal video search in Google Photos
Lots of videos
Almost no metadata
“Dancing” on the web
“Dancing” in home videos
Domain adaptation: Finding home videos on YouTube
● By capture device
● By video frame rate
● By video orientation
The technology behind personal video search
Input: video and audio. Example output labels: toddler, dancing.
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails); domain-adapted frame-level vision model
3. YouTube video annotation model (trained on YouTube videos); domain-adapted video-level vision model
4. YouTube audio annotation model (trained on YouTube videos); domain-adapted audio model
5. Fusion & calibration; trained on home videos; domain-adapted personal video model
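The slides do not say how the "Fusion & calibration" stage works. One common way to combine several per-model scores is late fusion: calibrate each raw score to a probability, then take a weighted average. The sketch below illustrates that general idea only. The sigmoid (Platt-style) calibration, the weights, the model names, and the scores are all hypothetical and are not taken from the talk.

```python
import math

def calibrate(score, slope, offset):
    """Platt-style calibration: map a raw model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(slope * score + offset)))

def fuse(calibrated, weights):
    """Late fusion: weighted average of calibrated per-model probabilities."""
    total_weight = sum(weights[name] for name in calibrated)
    return sum(weights[name] * p for name, p in calibrated.items()) / total_weight

# Hypothetical raw scores for the label "dancing" from four upstream models.
raw_scores = {"photo": 0.2, "frame": 1.5, "video": 2.0, "audio": -0.5}
# Hypothetical per-model calibration parameters (slope, offset) and fusion weights.
calibration = {"photo": (1.0, 0.0), "frame": (1.2, -0.3), "video": (1.5, -0.5), "audio": (0.8, 0.1)}
weights = {"photo": 1.0, "frame": 2.0, "video": 3.0, "audio": 1.0}

calibrated = {name: calibrate(score, *calibration[name]) for name, score in raw_scores.items()}
fused = fuse(calibrated, weights)
print(f"fused probability for 'dancing': {fused:.3f}")
```

In such a setup the calibration parameters and weights would be fitted on labeled examples from the target domain, which is consistent with the slide's note that stage 5 is trained on home videos.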
Evolution of personal video annotation models
1. Photo annotation model applied on video frames
2. Domain adaptation + fusion across frames
3. Fusion across multiple vision models
4. Fusion across multiple audio-visual models
> 2x recall gain
Learning aesthetics: YouTube Thumbnails
YouTube thumbnail quality model
Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015
Video retargeting (spatial)
Original video. Reframed for a banner aspect ratio.
Video retargeting (temporal)
Video preview:
(duration: 6 secs)
Motion Stabilization
Motion Stills app
Stream One-Up
Motion Stills examples: cinemagraphs
Motion Stills examples: gifs / memes
Motion Stills examples: timelapse
Promising Directions for
Future Research:
Learning from Video
Sermanet, Self-Supervised Imitation, Google Brain
Self-Supervised Imitation
Pierre Sermanet* Corey Lynch* Yevgen Chebotar*
Jasmine Hsu Eric Jang Stefan Schaal Sergey Levine
Google Brain + University of Southern California
* equal contribution
Multi-view capture
Time-Contrastive Networks (TCN)
(source: [Rippel et al 2015])
arxiv.org/abs/1704.06888v2
sermanet.github.io/imitate
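As described in the cited arXiv paper, time-contrastive networks learn an embedding with a triplet loss: frames captured at the same moment from different viewpoints are pulled together, while frames from the same video at a different time are pushed apart. The sketch below shows that loss on toy embeddings. The margin value and the example vectors are assumptions for illustration, and the embedding network itself is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: the positive should be closer to the anchor than the negative by `margin`."""
    return max(0.0, squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin)

# Toy embeddings (hypothetical values).
anchor = [0.0, 0.0]          # frame at time t, view 1
positive = [0.1, 0.0]        # same time t, view 2
far_negative = [1.0, 0.0]    # different time, well separated from the anchor
near_negative = [0.2, 0.0]   # different time, but embedded too close to the anchor

loss_satisfied = triplet_loss(anchor, positive, far_negative)
loss_violated = triplet_loss(anchor, positive, near_negative)
print(loss_satisfied, loss_violated)
```

The first triplet already satisfies the margin and contributes zero loss, while the second produces a positive loss that would drive the embedding to separate the two time steps.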
Approach (pouring, real)
* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning,
Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]
Resulting policies
Pose imitation (real robot)
Useful Datasets for Video Understanding
● Large-scale video annotation
○ Sports-1M: > 1M videos from ~500 classes [with Stanford]
○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
○ THUMOS: Temporal localization in untrimmed videos [with UCF, INRIA]
○ Kinetics: 400+ short clips for each of 400 actions [with DeepMind]
○ AVA: Spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
○ YouTube-BB: Spatially localized objects in video (80 classes)
○ Open Images: Spatially localized objects in images (600 classes)
Sports-1M: 1.1M videos from 487 sports classes (video classification)
YouTube-8M Video Research Dataset
research.google.com/youtube8m/
THUMOS Challenge Series: Temporal Localization in Untrimmed Videos
YouTube Bounding Boxes: Spatial localization of one object through time
AVA: Spatial localization of an actor performing atomic actions
Atomic action: “Paint”
Open Images v3 - detailed spatial annotations in images
Example validation images
Conclusion
● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video
(I think there’s a lot more progress to be made in video understanding)
