Predicting Media Memorability Using
Ensemble Models at MediaEval2019
David Azcona, Enric Moreu, Feiyan Hu,
Tomás Ward, Alan F. Smeaton
Dublin City University, Insight Centre for Data Analytics
david.azcona@dcu.ie
The MediaEval Predicting Media Memorability Task
Predicting how memorable a video is to viewers
● Dataset: 10,000 soundless short videos, each with two
scores for memorability: short-term and long-term
● Literature: In 2018, CNNs trained on large datasets
(ImageNet) performed better than video captions and
pre-computed features
https://github.com/dazcona/memorability
In 2018 ..
• We participated and did badly, so we stepped up our efforts to perform
better and to try to understand memorability
• Builds on our long-standing work on memory and recall …
mindfulness, mind wandering, cognitive stimulation therapy, BCI and
EEG
• Used 8,000 videos, divided into 1,000 evaluation and 7,000
training/testing and set the task to a Masters class ... 135 students
in total
• Some of them did very well …
They used a wide range of techniques, submitted
a 2-page paper description, and were graded
based on approach, and performance
Our approach
Divided the 8,000 videos into training and validation
Developed individual models per set of features & combined
with ensemble models using:
● Traditional Machine Learning:
Support Vector Regression & Bayesian Ridge Regression
● Deep Learning (highly regularized):
Embeddings for words (captions) & Transfer Learning w/ Neural
Network activations as features and fine-tuning our own networks
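The training/validation split mentioned at the top of this slide could look roughly like the sketch below. It is only an illustration: the ids, the seed, and the 1,000-video hold-out size are assumptions (the slide does not state the split ratio), and `split_ids` is a hypothetical helper, not code from the repository.

```python
import random


def split_ids(video_ids, n_validation, seed=42):
    """Shuffle ids reproducibly and hold out n_validation of them."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # private RNG, global state untouched
    return ids[n_validation:], ids[:n_validation]


# Hypothetical ids 0..7999 standing in for the 8,000 labelled videos.
# The hold-out size of 1,000 is an assumption, not a figure from the slides.
train_ids, val_ids = split_ids(range(8000), n_validation=1000)
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the same videos in validation across runs, so individual models and ensembles are compared on identical data.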
Our approach
Extracted 8 frames per video (first + one per second)
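As a rough sketch of the sampling rule above (first frame plus one per second), the timestamps can be computed as follows. The 7-second clip length is an assumption chosen so that the rule yields 8 frames, and `frame_timestamps` is a hypothetical helper name.

```python
def frame_timestamps(duration_s, n_frames=8):
    """Return sampling times in seconds: t=0, then one per elapsed second."""
    stamps = [0.0]
    for second in range(1, n_frames):
        if second <= duration_s:  # never sample past the end of the clip
            stamps.append(float(second))
    return stamps


# Assumed clip length of 7 seconds (not stated on the slide).
print(frame_timestamps(7.0))
```

The timestamps would then be handed to whatever video reader is in use to grab the actual frames.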
● Off the shelf pre-computed features:
C3D, HMP, LBP, InceptionV3, Color Histogram & Aesthetic
● Our pre-computed features: Aesthetics & Emotions
● Textual information: bag-of-words TF-IDF with linear models &
GloVe embeddings + RNN (GRU) + high dropout
● Pre-trained CNNs as feature extractors: transfer learning
with ImageNet: VGG16, DenseNet121, ResNet50 & ResNet152
● Fine-tuning our own CNN: ResNet - head + FC + sigmoid
● Ensemble models: combinations of individual models’ predictions
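To make the bag-of-words TF-IDF representation of captions listed above concrete, here is a minimal pure-Python sketch. It uses one common TF-IDF variant (term frequency times `log(N/df) + 1`), which is an assumption: the slides do not say which weighting or library was used, and the captions below are made up.

```python
import math
from collections import Counter


def tfidf(captions):
    """Return (vocabulary, one TF-IDF row per caption)."""
    docs = [caption.lower().split() for caption in captions]
    n_docs = len(docs)
    # Document frequency: in how many captions each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty captions
        rows.append([counts[term] / length * idf[term] for term in vocab])
    return vocab, rows


# Made-up captions, for illustration only.
captions = ["woman walking on beach", "man walking dog", "dog running on beach"]
vocab, rows = tfidf(captions)
print(vocab)
```

Each row is a fixed-length vector that a linear model can take as input; rarer words such as "woman" receive a larger weight than words shared across captions such as "walking".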
Why Emotions?
Long-term scores: 0.727 (left), 0.273 (right)
MediaEval 2018: Duy-Tue Tran-Van et al. @ HCMUS’s paper: "Predicting
Media Memorability Using Deep Features and Recurrent Network"
Our 7 emotions: anger, disgust, fear, happiness, sadness,
surprise, neutral, plus gender scores & spatial information
Our pre-submission results informed us of the relative importance of different techniques
Our official performance figures …
Our relative weightings for different techniques in the ensembles
Official Results
Team                 Best Short Term   Best Long Term
Insight@DCU          0.528             0.27
MeMAD                0.522             0.277
Best 2018            0.497             0.257
UPB-L25 (*)          0.477             0.232
RUC                  0.472             0.216
EssexHubTV           0.467             0.203
TCNJ-CS              0.445             0.218
HCMUS                0.445             0.208
GIBIS                0.438             0.199
Baseline (MemNet)    0.39              0.17
Average 2018         0.359             0.173
(*) Organiser Team
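The slides do not name the metric behind these scores. Assuming they are Spearman rank correlations between predicted and ground-truth memorability (an assumption based on how this task is usually evaluated), a score of that kind can be computed with the sketch below. The function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values: any monotone increasing relationship gives a correlation of 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only the ordering matters, a model is rewarded for ranking videos correctly even when its raw scores are on a different scale from the ground truth.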
Findings & Contributions
● DL CNN models typically outperform models trained with
captions and other visual features for short-term
memorability; however, techniques such as embeddings
and RNNs can achieve very high results for captions
● We believe fine-tuned CNN models will outperform pre-trained
models used as feature extractors, given enough
training samples (not proven in this paper)
Findings & Contributions
● Ensembling models by using predictions instead of
training models with very long vectors of features is an
alternative we used to counteract memory limitations
● Ensembling models with different modalities such as
emotions with captions, high-level representations from
CNNs and visual pre-computed features achieves best
results as they represent different high-level
abstractions
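Ensembling on predictions rather than on concatenated feature vectors can be sketched as a weighted average of each model's per-video score. This is only one possible combination rule and is an assumption here. The model names, scores, and weights below are made up and are not the weightings reported in the slides.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model score lists into one weighted-average score list."""
    total = sum(weights[name] for name in predictions)
    n_videos = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * scores[i] for name, scores in predictions.items()) / total
        for i in range(n_videos)
    ]


# Made-up predictions for three videos from three hypothetical models.
preds = {
    "captions": [0.80, 0.60, 0.90],
    "cnn": [0.70, 0.90, 0.85],
    "emotions": [0.75, 0.70, 0.80],
}
weights = {"captions": 0.5, "cnn": 0.3, "emotions": 0.2}
fused = weighted_ensemble(preds, weights)
print(fused)
```

Each model contributes a single number per video, so the ensemble's input stays small regardless of how long the underlying feature vectors are.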
Visualizations for captions using Wordclouds
Class Activation Maps for Most Memorable Videos
Class activation maps illustrate which parts of a video led a CNN
(ResNet152 trained w/ ImageNet) to its final classification decision,
which allows us to explore what makes a video memorable
GitHub repository & Tables
https://github.com/dazcona/memorability
Insight@DCU Results
Short-term Ensembles & Long-term Ensembles
