YouTube-8M: A Large-Scale Video Classification Benchmark (UPC Reading Group)

YouTube-8M: A Large-Scale Video Classification
Benchmark (and Google Cloud ML Engine)
Slides by Dídac Surís
ReadAI Reading Group, UPC
13th March, 2017
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul
Natsev, George Toderici, Balakrishnan Varadarajan,
Sudheendra Vijayanarasimhan
[arxiv] (27 Sep 2016) [web]

Index
1. YouTube-8M
a. Dataset
b. Baseline approaches
c. Results
2. Google Cloud ML Engine

YouTube-8M: Dataset
Main features
● Multi-label (average 1.8)
● 4800 entities (24 top-level categories)
● 8,264,650 videos
● 500K hours of video
● Only visual entities
● Remove computational barriers

YouTube-8M: Dataset
Collection
● YouTube video annotation system (metadata, context, …)
● First step: define entities
○ Human ratings to define entities (only visual ones)
○ At least 200 videos per entity
● Second step: collect videos
○ 10 M randomly sampled videos
○ Discard according to several criteria
○ Split into train/validate/test

YouTube-8M: Dataset
Feature Extraction
● 500K hours is more than 50 years of video: processing it in real time is impractical
● Sampling at 1 frame per second
● Frame-level feature extraction: take the ReLU activations of the last hidden
layer of an Inception network trained on ImageNet
● 2048 dimensions per frame; PCA + quantization reduce the size 8x
● Audio features also extracted later:
https://www.kaggle.com/c/youtube8m/discussion/29475
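The 8x reduction can be checked with simple arithmetic, and the quantization step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' pipeline: the slide does not give the PCA output size or the quantization range, so the 1024 dimensions, the 1 byte per value and the clipping range [-2, 2] below are assumptions, and `quantize` / `dequantize` are hypothetical helper names.

```python
def quantize(values, lo=-2.0, hi=2.0):
    """Map floats to integer codes 0..255 (1 byte each), clipping to [lo, hi].

    The range [lo, hi] is an assumption made for this sketch.
    """
    scale = 255.0 / (hi - lo)
    return [int(round((min(max(v, lo), hi) - lo) * scale)) for v in values]


def dequantize(codes, lo=-2.0, hi=2.0):
    """Approximate inverse of quantize(): map codes 0..255 back to floats."""
    step = (hi - lo) / 255.0
    return [c * step + lo for c in codes]


# Size arithmetic per frame (assumed layout, see the lead-in):
raw_bytes = 2048 * 4         # 2048 float32 activations
compressed_bytes = 1024 * 1  # assumed: 1024 PCA dimensions, 1 byte each
ratio = raw_bytes // compressed_bytes

restored = dequantize(quantize([0.3, -1.7, 5.0]))
```

With these assumptions the ratio comes out at 8, and each restored value inside the clipping range is within half a quantization step of the original. Values outside the range (5.0 here) are clipped.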

YouTube-8M: Dataset
Imperfect ground truth
● 78.8% precision
● 14.5% recall

YouTube-8M: Baseline approaches
Frame-level
Training of 4800 independent one-vs-all classifiers
1. Average pooling + logistic
○ The frame-level probabilities are aggregated
to the video-level using a simple average
2. Deep Bag of Frame (DBoF) Pooling
○ k frames projected to an M-dimensional space
with ReLU activations
○ Batch normalization
○ Aggregation of frames with max-pooling
3. LSTM
○ 2 LSTM layers with 1024 hidden units
○ Linearly increasing per-frame weights going
from 1/N to 1 for the last frame.
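The aggregation steps named above can be sketched in plain Python. This is an illustrative sketch, not the paper's code: the function names are hypothetical, and the slide does not say how the linearly increasing weights are applied in the LSTM model, so `weighted_pool` (a weighted average of per-frame predictions) is only one possible reading.

```python
def average_pool(frame_probs):
    """Average per-frame class probabilities into one video-level vector."""
    n = len(frame_probs)
    return [sum(col) / n for col in zip(*frame_probs)]


def max_pool(frame_features):
    """Element-wise max over frames, as in the DBoF aggregation step."""
    return [max(col) for col in zip(*frame_features)]


def linear_frame_weights(n):
    """Weights rising linearly from 1/n (first frame) to 1 (last frame)."""
    return [(t + 1) / n for t in range(n)]


def weighted_pool(frame_probs, weights):
    """Weighted average of per-frame probabilities (assumed usage)."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, col)) / total
        for col in zip(*frame_probs)
    ]


# Two frames, two classes.
probs = [[0.2, 0.8], [0.4, 0.6]]
video_avg = average_pool(probs)
video_weighted = weighted_pool(probs, linear_frame_weights(len(probs)))
```

Average pooling treats every frame equally, while the linear weights make later frames count more, which fits a recurrent model whose state has seen more of the video by then.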

YouTube-8M: Baseline approaches
Video-level
The only difference is that frame features are now aggregated before the
classifier, into fixed-length video-level features:
● Mean, standard deviation, top 5 ordinal statistics
● Normalization afterwards (subtract mean, PCA)
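A minimal sketch of how such a fixed-length feature could be built from a variable number of frames, assuming the statistics are computed independently per feature dimension and concatenated. The function name, the use of the population standard deviation and the output ordering are assumptions. The normalization step (mean subtraction, PCA) is not shown.

```python
import statistics


def video_level_features(frames, k=5):
    """Concatenate per-dimension mean, std and top-k values over frames.

    frames: list of equal-length feature vectors, one per frame.
    Needs at least k frames, otherwise fewer than k top values exist.
    """
    columns = list(zip(*frames))  # one tuple of values per dimension
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]  # population std (assumed)
    top_k = [v for c in columns for v in sorted(c, reverse=True)[:k]]
    return means + stds + top_k


# Six frames with 2-dimensional features -> 2 + 2 + 2*5 = 14 values.
frames = [[float(i), float(-i)] for i in range(6)]
feats = video_level_features(frames)
```

For D-dimensional frame features the output length is D * (2 + k), regardless of how many frames the video has.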
Online learning algorithms instead of batch optimization (?)
1. Logistic regression
2. SVM (online) + hinge loss
3. Mixture of Experts
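For the third classifier, a hedged sketch of a per-entity Mixture of Experts: a softmax gate weights several logistic experts, and one extra gate state acts as a dummy expert that always predicts 0. The slide gives no formula, so this form, the function names and the weight layout are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def moe_probability(x, gate_weights, expert_weights):
    """p(entity | x) = sum_h gate_h(x) * sigmoid(expert_h . x).

    gate_weights has one more row than expert_weights: the last gate state
    is a dummy expert that contributes probability 0.
    """
    if len(gate_weights) != len(expert_weights) + 1:
        raise ValueError("need one gate row per expert plus one dummy row")
    logits = [dot(w, x) for w in gate_weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    gates = [e / z for e in exps]
    # zip() stops after the real experts, so the dummy gate adds nothing.
    return sum(g * sigmoid(dot(u, x)) for g, u in zip(gates, expert_weights))


# Two experts plus the dummy state, all-zero weights: each gate is 1/3 and
# each expert outputs 0.5, so p = 2 * (1/3) * 0.5 = 1/3.
p = moe_probability([1.0, 0.0], [[0.0, 0.0]] * 3, [[0.0, 0.0]] * 2)
```

One such model would be trained per entity, in line with the one-vs-all setup used for the other baselines.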

YouTube-8M: Results
Evaluation metrics and comparison
● Mean Average Precision (Precision, Recall)
● Hit@k
● Precision at equal recall rate (PERR)
These are results on the validation set. On the human-rated test set the
results are consistent.
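The three metrics can be illustrated on a toy example. This is a sketch under common definitions, not the official evaluation code: Hit@k checks whether any true label is among the k top-scoring ones, PERR is the precision when retrieving exactly as many labels as the ground truth contains, and average precision is computed over one ranked list (for mAP, that would be the videos ranked for one entity, then averaged over entities). Function names are hypothetical.

```python
def ranked_indices(scores):
    """Indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hit_at_k(scores, true_labels, k):
    """1.0 if any ground-truth label is among the k highest-scoring ones."""
    return 1.0 if any(i in true_labels for i in ranked_indices(scores)[:k]) else 0.0


def perr(scores, true_labels):
    """Precision when retrieving len(true_labels) labels (needs >= 1 label)."""
    n = len(true_labels)
    return sum(i in true_labels for i in ranked_indices(scores)[:n]) / n


def average_precision(scores, relevant):
    """Mean of precision@rank taken at each relevant item in the ranking."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked_indices(scores), start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# One video, four candidate labels, ground truth = labels 0 and 3.
scores = [0.9, 0.1, 0.8, 0.3]
truth = {0, 3}
h1 = hit_at_k(scores, truth, k=1)      # label 0 is ranked first
p = perr(scores, truth)                # top 2 = labels 0 and 2, one correct
ap = average_precision(scores, truth)  # hits at ranks 1 and 3
```

Hit@k and PERR are per-video values that would be averaged over all videos in the evaluation set.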

YouTube-8M: Results
Results on other databases (transfer learning)
● Sports-1M
● ActivityNet

Google Cloud Machine Learning Engine
Basics
● Google Cloud Platform: $300 free trial
● Google Cloud Shell
● Pricing
○ Training: ML units (depending on scale tier) × hours
○ Prediction: per hour + number of predictions
● Google Cloud Storage for the results
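The pricing rules above amount to simple arithmetic, sketched here for budgeting a job. All rates are hypothetical parameters, not actual Google Cloud prices, and the function names are made up for this example.

```python
def training_cost(ml_units, hours, price_per_ml_unit_hour):
    """Training: ML units (set by the scale tier) x hours x unit price."""
    return ml_units * hours * price_per_ml_unit_hour


def prediction_cost(hours, price_per_hour, n_predictions, price_per_1000):
    """Prediction: an hourly charge plus a charge per batch of predictions."""
    return hours * price_per_hour + (n_predictions / 1000.0) * price_per_1000


# Placeholder rates chosen only to make the arithmetic easy to follow.
train = training_cost(ml_units=2.0, hours=3.0, price_per_ml_unit_hour=0.5)
predict = prediction_cost(hours=1.0, price_per_hour=0.4,
                          n_predictions=2000, price_per_1000=0.1)
```

Comparing the estimate against the trial credit gives a rough idea of how many training runs fit in the budget.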

Task submission

TensorBoard
