Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Large-Scale Video Understanding:
YouTube and Beyond
Rahul Sukthankar
Machine Perception, Google Research
https://research.google.com/teams/perception/
AI Frontiers Conference - Nov. 3, 2017

Machine Perception
Really Works!
(better than I expected)

Sample of Perception tech in products
Signals for Image Search ranking, related images, search-by-image, etc.

Cloud Video API
Cloud Vision API

(Seth LaForge, Nexus 5X)
HDR+ in Android Camera
Mobile Vision API

Organizing Photos image & video collections and making them searchable by content
Microvideo tech in Photos & Motion Stills
De-reflection & tracking in Photo Scanner

Personalized sticker packs in Allo
On-device handwriting input & recognition
OCR for lots of languages

Visual & auditory annotation & signals on YouTube
Thumbnail/preview selection & optimization for YouTube
Non-speech sound captions on YouTube

Region tracking for custom blurring tool on YouTube
Mobile creative effects on YouTube

Useful Applications for Video Technology
Help users create, enhance, organize, and discover videos.
● watch, listen, understand
● capture a moment
● improve & manipulate

Privacy Region Tracking & Blurring for YouTube

Fun Effects from Tracking (on Mobile) for YouTube

Large-Scale Video Annotation for YouTube

Large-Scale Video Annotation for YouTube
Video understanding pipeline as of ~5 years ago
Input: pixels & sound samples
1. Extract features (hand-designed descriptors), giving frame features
2. Quantize & aggregate (codebook histogram), giving video features
3. Train model (e.g., AdaBoost) on training data
Output: a label such as “Roller-blading”
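The "quantize & aggregate" stage of the older pipeline can be sketched as a bag-of-words step: each frame descriptor is assigned to its nearest codeword, and the assignments are counted into a histogram that serves as the video feature. This is a minimal illustration only; the function names, the 2-d toy descriptors, and the three-entry codebook are assumptions, not details from the talk.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def codebook_histogram(frame_descriptors, codebook):
    """Quantize every descriptor, then aggregate into an L1-normalized histogram."""
    counts = [0] * len(codebook)
    for descriptor in frame_descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy data (hypothetical): a 3-word codebook and 4 frame descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
video_feature = codebook_histogram(descriptors, codebook)
print(video_feature)
```

The resulting fixed-length histogram is what a classifier such as AdaBoost would consume, regardless of how many frames the video has.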

Large-Scale Video Annotation for YouTube
Modern video understanding pipeline
Input: pixels & sound samples
1. Extract features: magic box containing many convolutional, deep, end-to-end buzzwords :-), learned from training data
Output: a label such as “Roller-blading”

Deep-learned visual features
Frame-level: Inception model trained on noisy data (images); bottleneck embedding layer (1000-d)
Video-level: videos with noisy labels; frame embeddings aggregated by one of:
- Max pooling
- Avg pooling
- VLAD pooling
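The three pooling options turn a variable number of frame-level embeddings into one video-level vector. The sketch below shows one plain-Python reading of each; the tiny 2-d embeddings and the two VLAD cluster centers are made-up values, and the VLAD variant shown (sum of residuals to the nearest center, then L2 normalization) is the textbook formulation rather than necessarily the one used in the talk.

```python
import math


def max_pool(frames):
    """Per-dimension maximum over all frame embeddings."""
    return [max(column) for column in zip(*frames)]


def avg_pool(frames):
    """Per-dimension mean over all frame embeddings."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vlad_pool(frames, centers):
    """Sum residuals to each frame's nearest center, concatenate, L2-normalize."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for frame in frames:
        distances = [squared_distance(frame, center) for center in centers]
        k = distances.index(min(distances))
        for j in range(dim):
            residuals[k][j] += frame[j] - centers[k][j]
    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm else flat


# Toy frame embeddings and cluster centers (hypothetical values).
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]

print(max_pool(frames))
print(avg_pool(frames))
print(vlad_pool(frames, centers))
```

Max and average pooling keep the embedding dimensionality, while VLAD produces a vector of size (number of centers) x (embedding dimension).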

Deep-learned vs. handcrafted features
[Plot: mean average precision (MAP, axis from 0 to 0.40) vs. feature dimensionality]
● Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP
● Handcrafted audio-visual features: ~40K-d, 0.153 MAP
● +80% mean avg. precision, 40x more compact features
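The two headline numbers on this slide follow from the MAP and dimensionality figures. A quick check, assuming "~40K-d" means roughly 40,000 dimensions (the exact value is not given), yields a relative MAP gain of about 78% and a size ratio of about 39x, consistent with the rounded "+80%" and "40x".

```python
# Figures quoted on the slide.
deep_map, handcrafted_map = 0.272, 0.153
deep_dim = 1024
handcrafted_dim = 40_000  # assumption: "~40K-d" taken as exactly 40,000

# Relative improvement in mean average precision.
relative_gain = (deep_map - handcrafted_map) / handcrafted_map

# How many times smaller the deep-learned feature vector is.
compactness = handcrafted_dim / deep_dim

print(f"relative MAP gain: {relative_gain:.0%}")
print(f"compactness: {compactness:.0f}x")
```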

Personal video search in Google Photos
Lots of videos
Almost no metadata

Domain adaptation: Finding home videos on YouTube
● By capture device
● By video frame rate
● By video orientation

The technology behind personal video search
Inputs: video and audio
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails): domain-adapted frame-level vision model
3. YouTube video annotation model (trained on YouTube videos): domain-adapted video-level vision model
4. YouTube audio annotation model: domain-adapted audio model
5. Fusion & calibration (trained on home videos): domain-adapted personal video model
Example output: toddler, dancing
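The final "fusion & calibration" stage combines the per-model scores for a label into a single confidence. The slides do not say how this is done; the sketch below assumes one simple possibility, a weighted average followed by a logistic (Platt-style) calibration. The model names, weights, and the calibration parameters `scale` and `bias` are all hypothetical.

```python
import math


def fuse(model_scores, weights):
    """Weighted average of per-model scores for a single label."""
    total_weight = sum(weights[name] for name in model_scores)
    weighted_sum = sum(weights[name] * score for name, score in model_scores.items())
    return weighted_sum / total_weight


def calibrate(fused_score, scale, bias):
    """Map a fused score to a probability with a logistic function."""
    return 1.0 / (1.0 + math.exp(-(scale * fused_score + bias)))


# Hypothetical scores for the label "dancing" from the four upstream models.
scores = {"photo": 0.4, "frame": 0.7, "video": 0.8, "audio": 0.6}
weights = {"photo": 1.0, "frame": 1.0, "video": 2.0, "audio": 1.0}

fused = fuse(scores, weights)
probability = calibrate(fused, scale=4.0, bias=-2.0)
print(fused, probability)
```

In practice the weights and calibration parameters would be fit on labeled data from the target domain, which is where the "trained on home videos" step would come in.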

Evolution of personal video annotation models
1. Photo annotation model applied on video frames
2. Domain adaptation + fusion across frames
3. Fusion across multiple vision models
4. Fusion across multiple audio-visual models
Result: > 2x recall gain
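For reference, recall is the fraction of truly relevant videos that a model retrieves, so a "recall gain" is the ratio of two such fractions. The sketch below uses made-up video IDs purely to show the arithmetic; the numbers are not results from the talk.

```python
def recall(retrieved, relevant):
    """Fraction of relevant items that appear in the retrieved set."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Toy example (hypothetical IDs): six videos truly contain "dancing".
relevant = {"v1", "v2", "v3", "v4", "v5", "v6"}
baseline_retrieved = {"v1", "v2", "v9"}               # finds 2 of 6
fused_retrieved = {"v1", "v2", "v3", "v4", "v5", "v8"}  # finds 5 of 6

baseline_recall = recall(baseline_retrieved, relevant)
fused_recall = recall(fused_retrieved, relevant)
recall_gain = fused_recall / baseline_recall
print(baseline_recall, fused_recall, recall_gain)
```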

Learning aesthetics: YouTube Thumbnails

YouTube thumbnail
quality model

Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015

Video retargeting (spatial)
Original video. Reframed for a banner aspect ratio.

Video retargeting (temporal)
Video preview:
(duration: 6 secs)
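One simple way to think about temporal retargeting is as a search for the most interesting fixed-length window. The sketch below assumes a hypothetical per-second "interest" score (how such a score is produced is not covered here) and picks the 6-second window with the highest total using a sliding sum. It is an illustration of the idea, not the method used for YouTube previews.

```python
def best_preview_window(scores_per_second, duration=6):
    """Return (start, end) in seconds of the window with the highest total score."""
    if len(scores_per_second) <= duration:
        return 0, len(scores_per_second)
    window_sum = sum(scores_per_second[:duration])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(scores_per_second) - duration + 1):
        # Slide the window by one second: add the new second, drop the old one.
        window_sum += scores_per_second[start + duration - 1] - scores_per_second[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_start + duration


# Hypothetical interest scores for a 10-second video.
scores = [1, 0, 2, 5, 6, 7, 4, 3, 1, 0]
start, end = best_preview_window(scores)
print(start, end)
```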

Motion Stills app
Stream One-Up

Motion Stills examples: cinemagraphs

Motion Stills examples: gifs / memes

Motion Stills examples: timelapse

Promising Directions for
Future Research:
Learning from Video

Sermanet, Self-Supervised Imitation, Google Brain
Self-Supervised Imitation
Pierre Sermanet* Corey Lynch* Yevgen Chebotar*
Jasmine Hsu Eric Jang Stefan Schaal Sergey Levine
Google Brain + University of Southern California
* equal contribution

Multi-view capture

Time-Contrastive Networks (TCN)
(source: [Rippel et al 2015])
arxiv.org/abs/1704.06888v2
sermanet.github.io/imitate
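The core training signal in a time-contrastive setup can be sketched as a triplet loss: embeddings of the same moment seen from two camera views should be close, while an embedding from a different moment in the same view should be farther away. The code below is a minimal sketch of that loss on toy 2-d embeddings; the margin value and the example vectors are assumptions, and the real system learns the embedding with a deep network.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull the co-occurring view closer than the other time step.

    anchor:   embedding of view 1 at time t
    positive: embedding of view 2 at the same time t
    negative: embedding of view 1 at a different time
    """
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


anchor = [0.0, 1.0]
positive = [0.1, 0.9]       # same moment, other viewpoint
far_negative = [1.0, 0.0]   # different moment, already well separated
near_negative = [0.1, 1.0]  # different moment, still too close

satisfied_loss = time_contrastive_loss(anchor, positive, far_negative)
violated_loss = time_contrastive_loss(anchor, positive, near_negative)
print(satisfied_loss, violated_loss)
```

The loss is zero once the negative is at least `margin` farther (in squared distance) than the positive, so only violating triplets contribute gradient.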

Approach (pouring, real)
* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning,
Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]

Resulting policies

Pose imitation (real robot)

Useful Datasets for Video Understanding
● Large-scale video annotation
○ Sports-1M: > 1M videos from ~500 classes [with Stanford]
○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
○ THUMOS: Temporal localization in untrimmed videos [with UCF, INRIA]
○ Kinetics: 400+ short clips for each of 400 actions [with DeepMind]
○ AVA: Spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
○ YouTube-BB: Spatially localized objects in video (80 classes)
○ Open Images: Spatially localized objects in images (600 classes)

Sports-1M: 1.1M videos from 487 sports classes (video classification)

YouTube-8M Video Research Dataset
research.google.com/youtube8m/

THUMOS Challenge Series: Temporal Localization in Untrimmed Videos
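Temporal localization is commonly scored by comparing a predicted time segment with a ground-truth segment using temporal intersection-over-union (IoU). The sketch below shows that computation on made-up segments; the function name and the example times are assumptions, and the exact evaluation protocol of any particular challenge may differ.

```python
def temporal_iou(predicted, truth):
    """IoU of two (start, end) segments, in seconds."""
    intersection = max(0.0, min(predicted[1], truth[1]) - max(predicted[0], truth[0]))
    union = (predicted[1] - predicted[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical segments: the action truly spans 15 s to 25 s.
overlapping = temporal_iou((10.0, 20.0), (15.0, 25.0))
disjoint = temporal_iou((0.0, 5.0), (15.0, 25.0))
exact = temporal_iou((15.0, 25.0), (15.0, 25.0))
print(overlapping, disjoint, exact)
```

A detection is typically counted as correct when its IoU with a ground-truth segment exceeds a chosen threshold.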

YouTube Bounding Boxes: Spatial localization of one object through time

AVA: Spatial localization of an actor performing atomic actions
Atomic action: “Paint”

Open Images v3 - detailed spatial annotations in images
Example validation images

Conclusion
● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video
(I think there’s a lot more progress to be made in video understanding)
