LARGE-SCALE VIDEO CLASSIFICATION WITH
CONVOLUTIONAL NEURAL NETWORK
ANDREJ KARPATHY LI FEI-FEI SANKETH SHETTY
RAHUL SUKTHANKAR THOMAS LEUNG GEORGE TODERICI
PRESENTED BY: KHALID KHAN
SUMMARY
• Convolutional neural networks (CNNs) have been established as a powerful class of models for
image recognition problems.
• This paper provides an extensive empirical evaluation of CNNs on large-scale video
classification using a new dataset of approximately 1 million YouTube videos.
• Multiple approaches were studied for extending the connectivity of a CNN in the time domain.
• A multiresolution architecture is suggested as a promising way of speeding up training.
• Significant performance improvements were observed compared to strong feature-based
baselines (55.3% to 63.9%), but only a modest improvement compared to single-frame models
(59.3% to 60.9%).
CONVOLUTIONAL NEURAL NETWORK
• Like other neural networks, a CNN consists of several layers, each containing neurons that
operate independently of one another within the layer.
• Each neuron has learnable weights; it receives some input, performs an operation, and
passes its output to neurons in the next layer.
• A CNN consists of an input layer, an output layer, and multiple hidden layers, which include
convolutional, pooling, and fully connected layers.
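The layer types listed above can be illustrated with a minimal forward pass. This is only a sketch in plain Python, assuming a single-channel image, one convolution kernel, and one fully connected layer; the function names are hypothetical and it is not the architecture used in the paper.

```python
def conv2d(image, kernel):
    """Convolutional layer: slide the kernel over the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Non-linearity applied element-wise to a feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Pooling layer: keep the maximum of each non-overlapping size x size block."""
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def fully_connected(features, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def forward(image, kernel, fc_weights, fc_biases):
    """Input -> convolution -> ReLU -> pooling -> fully connected output."""
    pooled = max_pool(relu(conv2d(image, kernel)))
    flat = [v for row in pooled for v in row]
    return fully_connected(flat, fc_weights, fc_biases)
```

In a trained network the kernel and the fully connected weights would be learned from data; here they are simply passed in by the caller.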
VIDEO CLASSIFICATION USING CNN
• A new dataset, Sports-1M, is used to train the CNN architectures.
• Sports-1M consists of 1 million YouTube videos belonging to 487 classes of sports.
• The paper proposes an architecture that processes the input at two resolutions, a low-resolution
context stream and a high-resolution fovea stream, to improve runtime performance.
• The network trained on Sports-1M was then applied to another dataset, UCF-101, and showed a
significant improvement over networks trained on UCF-101 alone.
RELATED WORK
• CNNs have been applied to small-scale image recognition problems on datasets such as MNIST,
CIFAR-10/100, NORB, and Caltech-101/256.
• Little to no work existed on applying CNNs to video classification.
• Available video datasets contain only a few thousand clips and a few dozen classes, which
may explain the lack of work on CNN-based video classification.
MODELS
• Videos are divided into short clips, from which frames are extracted.
• Three broad categories of connectivity pattern are described: Early Fusion, Late Fusion, and
Slow Fusion.
• Early Fusion combines information across an entire time window immediately, at the first layer.
• Late Fusion places two separate single-frame networks and merges their outputs in the
fully connected layer.
• Slow Fusion is a mix of Early and Late Fusion that fuses temporal information gradually, so
that higher layers get access to progressively more global information.
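To make the three fusion patterns more concrete, the sketch below computes which input frames each layer can "see". The specific numbers are assumptions chosen for illustration (a 10-frame clip, a Slow Fusion temporal extent of 4 with stride 2 in the first layer and 2 with stride 2 in the next two layers, and a 15-frame gap for Late Fusion), and all function names are hypothetical.

```python
def temporal_windows(num_steps, extent, stride):
    """Indices covered by each temporal position of a layer."""
    return [list(range(start, start + extent))
            for start in range(0, num_steps - extent + 1, stride)]


def slow_fusion_receptive_fields(num_frames=10, layers=((4, 2), (2, 2), (2, 2))):
    """For each layer, list the input frames visible to each temporal position.

    `layers` holds (temporal extent, temporal stride) pairs; the defaults
    are illustrative assumptions, not values stated on the slides.
    """
    fields = [[t] for t in range(num_frames)]  # each input step is one frame
    per_layer = []
    for extent, stride in layers:
        windows = temporal_windows(len(fields), extent, stride)
        # A position sees the union of the frames seen by its inputs.
        fields = [sorted({f for idx in window for f in fields[idx]})
                  for window in windows]
        per_layer.append(fields)
    return per_layer


def early_fusion_frames(num_frames=10):
    """Early Fusion: the first layer already sees the whole clip."""
    return list(range(num_frames))


def late_fusion_frames(gap=15):
    """Late Fusion: two single-frame towers, merged only at the top."""
    return [0, gap]


if __name__ == "__main__":
    for depth, fields in enumerate(slow_fusion_receptive_fields(), start=1):
        print(f"layer {depth}: {fields}")
```

With these settings the number of temporal positions shrinks from 4 to 2 to 1, and only the third layer covers all 10 frames, which is the "progressively more global" behaviour described above.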
MULTIRESOLUTION CNN
• Training CNNs on a large-scale dataset takes weeks, so runtime performance is critically
important.
• One approach is to reduce the number of layers, which lowers performance.
• Another approach is to reduce the size of the images, which lowers accuracy.
• The chosen solution is to feed the network two streams per frame: a context stream, which
receives the frame at half resolution, and a fovea stream, which receives a center crop of the
original frame at full resolution.
• An improvement was observed because, in most online videos, the object of interest
occupies the center region.
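The two input streams can be sketched as follows. This is a minimal illustration, assuming a frame stored as a list of pixel rows with even height and width, and assuming the context stream is downsampled by simply keeping every second pixel (averaging neighbouring pixels would be an equally valid choice). The function names are hypothetical.

```python
def context_stream(frame):
    """Whole frame at half resolution: keep every second row and column."""
    return [row[::2] for row in frame[::2]]


def fovea_stream(frame):
    """Center crop at full resolution, half the height and width of the frame."""
    height, width = len(frame), len(frame[0])
    crop_h, crop_w = height // 2, width // 2
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]


if __name__ == "__main__":
    # A 178 x 178 frame (an illustrative size) yields two 89 x 89 inputs.
    frame = [[0] * 178 for _ in range(178)]
    context, fovea = context_stream(frame), fovea_stream(frame)
    print(len(context), len(context[0]), len(fovea), len(fovea[0]))
```

Each stream holds a quarter of the original pixels, so the two streams together carry half the input of the full-resolution frame, which is where the runtime saving comes from.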
RESULTS
[Two results slides; their content was not captured in the text.]
UCF-101
• A transfer learning experiment was done on another dataset, the UCF-101 Activity
Recognition dataset, which consists of 13,320 videos belonging to 101 categories.
• The following scenarios were considered:
  • Fine-tune top layer: only the last layer is retrained
  • Fine-tune top 3 layers: the last layer and the two fully connected layers are retrained
  • Fine-tune all layers: all layers, including the convolutional layers, are retrained
  • Train from scratch: the full network is trained from scratch
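The four scenarios differ only in which layers are updated and whether the network starts from Sports-1M weights. The sketch below encodes that as a lookup. The layer names (five convolutional layers, two fully connected layers, and a classifier) and the scenario keys are hypothetical labels chosen for illustration.

```python
# Hypothetical layer names, ordered from input to output.
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "classifier"]


def trainable_layers(scenario):
    """Return the layers whose weights are updated on UCF-101."""
    if scenario == "fine_tune_top_layer":
        return LAYERS[-1:]   # classifier only
    if scenario == "fine_tune_top_3_layers":
        return LAYERS[-3:]   # two fully connected layers plus classifier
    if scenario in ("fine_tune_all_layers", "train_from_scratch"):
        return list(LAYERS)  # every layer is updated
    raise ValueError(f"unknown scenario: {scenario}")


def uses_pretrained_weights(scenario):
    """Only training from scratch ignores the weights learned on Sports-1M."""
    return scenario != "train_from_scratch"


if __name__ == "__main__":
    for name in ("fine_tune_top_layer", "fine_tune_top_3_layers",
                 "fine_tune_all_layers", "train_from_scratch"):
        print(name, trainable_layers(name), uses_pretrained_weights(name))
```

In a deep learning framework, the layers not returned by `trainable_layers` would be frozen, for example by excluding their parameters from the optimizer.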
CONCLUSION
• CNNs are capable of learning not only image recognition but also video classification.
• The Slow Fusion model consistently performs better than the Early and Late Fusion alternatives.
• The transfer learning experiment on UCF-101 suggests that the highest transfer learning
performance is obtained by retraining the top 3 layers.
FUTURE WORK
• Incorporate broader categories in the dataset to obtain more powerful and generic
features.
• Explore recurrent neural networks as a more powerful technique for combining clip-level
predictions into a global video-level prediction.
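For context on combining clip-level predictions, the simplest baseline is to average the per-clip class probabilities and pick the top class. The sketch below shows only that averaging baseline, not a recurrent model; the function name and the example numbers are made up for illustration.

```python
def video_level_prediction(clip_probs):
    """Average per-clip class probabilities and return (best class, averages).

    `clip_probs` is a list with one probability vector per clip.
    """
    num_clips = len(clip_probs)
    num_classes = len(clip_probs[0])
    averaged = [sum(clip[c] for clip in clip_probs) / num_clips
                for c in range(num_classes)]
    best_class = max(range(num_classes), key=averaged.__getitem__)
    return best_class, averaged


if __name__ == "__main__":
    # Two clips from the same video, three classes (made-up numbers).
    clips = [[0.5, 0.5, 0.0], [0.25, 0.75, 0.0]]
    print(video_level_prediction(clips))
```

Averaging treats every clip equally and ignores their order, which is the limitation a recurrent model over the clip sequence would aim to address.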
