Learning object dynamics in video generation
Anant Gupta
Motivation
● Unsupervised video generation is an important problem
● Recent progress has opened up avenues toward the next set of challenges in the area
● In this work, we take one of the best-performing models and try to address these challenges
● We find that current metrics do not suffice to measure the progress of our models
Methods used previously
● Direct pixel-level prediction
● Learning a geometric transformation function
● Autoregressive generation
● Supervised learning
● Adversarial methods
● Learning a distribution for the uncertainty from
○ Residual Error
○ Past frames
Stochastic Video Generation (Baseline)
● The uncertainty in the future frames is learned as a prior distribution
● This is then combined with the deterministic part
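As a rough illustration of the learned-prior idea, the sketch below assumes the latent variable at each timestep is a diagonal Gaussian, that a posterior (which sees the target frame) is pulled toward a learned prior (which sees only past frames) through a KL term, and that a sample is drawn with the reparameterisation trick. The function names and the diagonal-Gaussian parameterisation are assumptions made for this example, not details taken from the slides.

```python
import math
import random


def sample_gaussian(mu, logvar, rng=random):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    q plays the role of the posterior and p the learned prior
    (an assumption of this sketch).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


if __name__ == "__main__":
    # Hypothetical 3-dimensional latent for a single timestep.
    prior_mu, prior_logvar = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    post_mu, post_logvar = [0.5, -0.2, 0.1], [-1.0, -1.0, -1.0]
    z = sample_gaussian(post_mu, post_logvar)
    print("sample:", z)
    print("KL:", kl_diag_gaussians(post_mu, post_logvar, prior_mu, prior_logvar))
```

In a full model the sampled z would be passed, together with the deterministic features of the past frames, to a frame decoder; that part is omitted here.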
Issues
● Generation of interaction between objects
● Generation of previously occluded scenes
Methods
1. Hierarchical Latent Model
○ Latent variables at each layer in the hierarchy are dependent on the previous ones
○ Latent variables in each layer learn uncertainties lying at particular frequency levels
○ Similar to multi-scale signal representation in Computer Vision
○ Training can be done jointly or layer-wise
2. Pixel-Level Masking
○ Hard negative sampling of pixel-level prediction errors
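One possible reading of pixel-level masking with hard negative sampling is sketched below: compute the per-pixel error, keep only the fraction of pixels with the largest error, and average the loss over those. The slides do not say how the mask is built, so the top-k selection, the `keep_frac` parameter, and the function name are assumptions of this example.

```python
def masked_pixel_loss(pred, target, keep_frac=0.25, norm="l2"):
    """Average the per-pixel error over the hardest pixels only.

    pred, target: 2-D lists of floats with the same shape (one frame).
    keep_frac:    fraction of pixels kept in the mask (hypothetical knob).
    norm:         "l2" for squared error, "l1" for absolute error.
    """
    if norm not in ("l1", "l2"):
        raise ValueError("norm must be 'l1' or 'l2'")

    # Flatten the frame into a list of per-pixel errors.
    errors = []
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            d = p - t
            errors.append(d * d if norm == "l2" else abs(d))

    # Hard negatives: the k pixels with the largest error.
    k = max(1, int(round(keep_frac * len(errors))))
    hardest = sorted(errors, reverse=True)[:k]
    return sum(hardest) / k


if __name__ == "__main__":
    pred = [[0.0, 0.0], [0.0, 0.0]]
    target = [[1.0, 0.0], [0.0, 2.0]]
    print(masked_pixel_loss(pred, target, keep_frac=0.5, norm="l2"))
    print(masked_pixel_loss(pred, target, keep_frac=0.5, norm="l1"))
```

Setting `keep_frac=1.0` recovers the ordinary unmasked mean error, which makes it easy to compare the masked and unmasked losses on the same frame.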
Experiments and Evaluation
● Model Variants
○ Trained layer-wise (HLM1)
○ Trained jointly (HLM2)
○ Pixel-wise masked L2 loss with LR decay (PM1)
○ Pixel-wise masked L1 loss with LR decay (PM2)
● Models initialized from the pretrained baseline model
● Dataset: BAIR Robot Push dataset
● Evaluation Methods:
○ Peak Signal-to-Noise Ratio (PSNR)
○ Structural Similarity (SSIM)
○ Qualitative Analysis
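For reference, PSNR can be computed per frame from the mean squared error, which also gives a per-timestep curve for comparing models. The sketch below is a minimal version that assumes frames are 2-D lists of floats scaled to [0, max_val]; the function names are chosen for this example. SSIM is usually computed over local windows and is not shown.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    sq_err = 0.0
    n = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            sq_err += (p - t) ** 2
            n += 1
    mse = sq_err / n
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def psnr_per_timestep(pred_frames, target_frames, max_val=1.0):
    """PSNR for each (predicted, ground-truth) frame pair in a sequence."""
    return [psnr(p, t, max_val) for p, t in zip(pred_frames, target_frames)]


if __name__ == "__main__":
    truth = [[[0.1, 0.1]], [[0.5, 0.5]]]   # two tiny 1x2 frames
    pred = [[[0.0, 0.0]], [[0.5, 0.5]]]
    print(psnr_per_timestep(pred, truth))
```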
Results
● HLM1 outperforms the baseline model on SSIM at later timesteps
[Figure: qualitative comparison; panels labeled Ground Truth, HLM1, and Baseline]
References
1. E. Denton and R. Fergus, “Stochastic video generation with a learned prior,”
arXiv preprint arXiv:1802.07687, 2018.
2. M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine,
“Stochastic variational video prediction,” arXiv preprint arXiv:1710.11252,
2017.
3. F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning
with temporal skip connections,” arXiv preprint arXiv:1710.05268, 2017.
4. C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical
interaction through video prediction,” in Advances in neural information
processing systems, pp. 64–72, 2016.
