YouTube-8M: A Large-Scale Video Classification
Benchmark (and Google Cloud ML Engine)
Slides by Dídac Surís
ReadAI Reading Group, UPC
13th March, 2017
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul
Natsev, George Toderici, Balakrishnan Varadarajan,
Sudheendra Vijayanarasimhan
[arxiv] (27 Sep 2016) [web]
Index
1. YouTube-8M
a. Dataset
b. Baseline approaches
c. Results
2. Google Cloud ML Engine
YouTube-8M: Dataset
Main features
● Multi-label (average 1.8)
● 4800 entities (24 top-level categories)
● 8,264,650 videos
● 500K hours of video
● Only visual entities
● Remove computational barriers
YouTube-8M: Dataset
Collection
● YouTube video annotation system (metadata, context, …)
● First step: define entities
○ Human ratings to define entities (only visual ones)
○ At least 200 videos per entity
● Second step: collect videos
○ 10 M randomly sampled videos
○ Discard according to several criteria
○ Split into train/validate/test
YouTube-8M: Dataset
Feature Extraction
● About 50 years of video: processing it in real time is impractical
● Sampling at 1 frame per second
● Frame-level feature extraction: fetch the ReLU activation of the last hidden
layer from the Inception network trained on ImageNet
● 2048 dimensions per frame; PCA + quantization reduce the size by 8x
● Audio features were also extracted later:
https://www.kaggle.com/c/youtube8m/discussion/29475
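The compression step can be sketched in NumPy. This is only an illustration, not the authors' pipeline: the 1024-dimensional PCA output, the 1-byte-per-coefficient quantization, the float32 input, and all function names are assumptions chosen so that the arithmetic matches the 8x reduction on the slide.

```python
import numpy as np

def fit_pca(features, out_dim):
    """Fit PCA on the rows of `features`; return (mean, projection matrix)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # The right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim].T

def quantize_uint8(x, lo, hi):
    """Map values in [lo, hi] linearly onto the 256 levels of a uint8."""
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)

def dequantize_uint8(q, lo, hi):
    """Approximate inverse of quantize_uint8."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

def compression_ratio(in_dim, out_dim, in_bytes=4, out_bytes=1):
    """Bytes per frame before compression divided by bytes per frame after."""
    return (in_dim * in_bytes) / (out_dim * out_bytes)

# Assumed setting: 2048 float32 values -> 1024 one-byte values per frame.
print(compression_ratio(2048, 1024))  # 8.0
```

Under these assumptions a frame shrinks from 8192 bytes to 1024 bytes; the quantization range `lo`/`hi` would have to be chosen from the data.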
YouTube-8M: Dataset
Imperfect ground truth
● 78.8% precision
● 14.5% recall
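To make the two figures concrete, here is a minimal sketch of how precision and recall of automatic labels could be computed for one video, assuming they are measured against labels given by human raters. The label sets below are made up for illustration.

```python
def precision_recall(machine_labels, human_labels):
    """Precision and recall of machine labels against human labels (both sets)."""
    true_positives = len(machine_labels & human_labels)
    precision = true_positives / len(machine_labels) if machine_labels else 0.0
    recall = true_positives / len(human_labels) if human_labels else 0.0
    return precision, recall

# Hypothetical video: the annotation system is mostly right but misses many labels.
machine = {"Guitar", "Concert"}
human = {"Guitar", "Musician", "Stage"}
print(precision_recall(machine, human))
```

High precision with low recall means that the labels that are present are usually correct, but many valid labels are missing.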
YouTube-8M: Baseline approaches
Frame-level
Training of 4800 independent one-vs-all classifiers
1. Average pooling + logistic
○ The frame-level probabilities are aggregated
to the video-level using a simple average
2. Deep Bag of Frames (DBoF) Pooling
○ k frames projected to an M-dimensional space
with ReLU activations
○ Batch normalization
○ Aggregation of frames with max-pooling
3. LSTM
○ 2 LSTM layers with 1024 hidden units
○ Linearly increasing per-frame weights going
from 1/N to 1 for the last frame.
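The pooling operations of these three baselines can be sketched in NumPy. This is a simplified forward pass under assumptions, not the authors' code: weights are passed in rather than trained, batch normalization is omitted in the DBoF sketch, and the LSTM itself is left out (only its per-frame weights are shown). All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_pooled_logistic(frames, w, b):
    """Score every frame with logistic models, then average over frames.

    frames: (N, D) frame features, w: (D, C), b: (C,).
    Returns (C,) video-level probabilities.
    """
    return sigmoid(frames @ w + b).mean(axis=0)

def dbof_pool(frames, w_proj, b_proj):
    """Project each frame to M dimensions, apply ReLU, max-pool over frames.

    frames: (k, D), w_proj: (D, M), b_proj: (M,). Returns an (M,) vector.
    Batch normalization is omitted here for brevity.
    """
    hidden = np.maximum(frames @ w_proj + b_proj, 0.0)
    return hidden.max(axis=0)

def linear_frame_weights(n):
    """Per-frame weights growing linearly from 1/N to 1 at the last frame."""
    return np.arange(1, n + 1) / n

print(linear_frame_weights(4))  # [0.25 0.5  0.75 1.  ]
```

The increasing weights make later frames count more, which fits an LSTM whose state has seen more of the video by then.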
YouTube-8M: Baseline approaches
Video-level
The only difference is that the frame features are now combined before the
model into a fixed-length video-level feature
● Mean, standard deviation, top 5 ordinal statistics
● Followed by normalization (subtract mean, PCA)
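A minimal sketch of the aggregation, assuming "top 5 ordinal statistics" means the five largest values of each feature dimension across frames. The normalization step (mean subtraction, PCA) is not shown, and the function name is hypothetical.

```python
import numpy as np

def video_level_features(frames, top_k=5):
    """Aggregate (N, D) frame features into one vector of length (2 + top_k) * D."""
    if frames.shape[0] < top_k:
        raise ValueError("need at least top_k frames")
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    # Sort each dimension in descending order over frames, keep the k largest values.
    top = -np.sort(-frames, axis=0)[:top_k]
    return np.concatenate([mean, std, top.ravel()])

frames = np.arange(12, dtype=float).reshape(6, 2)  # 6 toy frames, 2 dimensions
print(video_level_features(frames).shape)  # (14,)
```

Whatever the video length, the output size depends only on the feature dimension, which is what lets standard classifiers consume it.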
Online learning algorithms instead of batch optimization (?)
1. Logistic regression
2. SVM (online) + Hinge loss
3. Mixture of Experts
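To illustrate what "online" means here, the sketch below updates a model from one example at a time instead of optimizing over the full dataset at once. It shows a plain stochastic-gradient step for logistic regression and a subgradient step for the hinge loss; the learning rate, the lack of regularization, and the function names are assumptions, not the authors' training setup.

```python
import numpy as np

def online_logistic_step(w, b, x, y, lr=0.1):
    """One SGD step on the log loss for a single example, with y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad = p - y  # derivative of the log loss with respect to the logit
    return w - lr * grad * x, b - lr * grad

def online_hinge_step(w, b, x, y, lr=0.1):
    """One subgradient step on max(0, 1 - y * (w.x + b)), with y in {-1, +1}."""
    if y * (x @ w + b) < 1:  # only margin violations change the model
        return w + lr * y * x, b + lr * y
    return w, b

# One model per entity would be updated as examples stream in.
w, b = np.zeros(2), 0.0
w, b = online_logistic_step(w, b, np.array([1.0, 2.0]), 1)
print(w, b)
```

Each step touches a single example, so memory use does not grow with the number of videos.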
YouTube-8M: Results
Evaluation metrics and comparison
● Mean Average Precision (Precision, Recall)
● Hit @k
● Precision at equal recall rate (PERR)
These are results on the validation set. On the human-rated test set
the results are consistent.
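The three metrics can be sketched as follows, using their usual definitions; the exact evaluation details of the benchmark (for example, how many predictions per video are considered) may differ. `scores` is a (videos, classes) array of model scores and `labels` a 0/1 array of the same shape; both names are illustrative.

```python
import numpy as np

def hit_at_k(scores, labels, k):
    """Fraction of videos with at least one true label among the k top-scoring classes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(labels, top_k, axis=1).any(axis=1).mean()

def perr(scores, labels):
    """For a video with G true labels, precision among its top G predictions, averaged."""
    order = np.argsort(-scores, axis=1)
    values = []
    for row_order, row_labels in zip(order, labels):
        g = int(row_labels.sum())
        if g > 0:  # videos without labels are skipped
            values.append(row_labels[row_order[:g]].mean())
    return float(np.mean(values))

def average_precision(scores, labels):
    """AP for one class: mean precision at the ranks where a positive video appears."""
    hits = labels[np.argsort(-scores)].astype(float)
    if hits.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())

def mean_average_precision(scores, labels):
    """Average of the per-class AP values."""
    return float(np.mean([average_precision(scores[:, c], labels[:, c])
                          for c in range(scores.shape[1])]))
```

Hit@k and PERR are computed per video, while mean Average Precision is computed per class and then averaged.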
YouTube-8M: Results
Results on other datasets (transfer learning)
● Sports-1M
● ActivityNet
Google Cloud Machine Learning Engine
Basics
● Google Cloud Platform: $300 trial
● Google Cloud Shell
● Pricing
○ Training: in ML units (depending on scale tier) * hours
○ Prediction: Per hour + # of predictions
● Google Cloud Storage for the results
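The pricing rule can be written as a small calculation. The prices and the number of ML units below are placeholders, not actual Google Cloud rates; only the structure of the formulas follows the slide.

```python
def training_cost(ml_units, hours, price_per_unit_hour):
    """Training: ML units (set by the scale tier) times hours times the unit price."""
    return ml_units * hours * price_per_unit_hour

def prediction_cost(hours, price_per_hour, n_predictions, price_per_prediction):
    """Prediction: an hourly charge plus a charge per prediction."""
    return hours * price_per_hour + n_predictions * price_per_prediction

# Placeholder numbers: a tier consuming 2 ML units, running 3 hours at 0.5 per unit-hour.
print(training_cost(ml_units=2, hours=3, price_per_unit_hour=0.5))  # 3.0
```

Current rates and the ML units consumed by each scale tier should be taken from the official pricing page.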
Google Cloud Machine Learning Engine
Task submission
Google Cloud Machine Learning Engine
TensorBoard
