Predicting Media Memorability Using Ensemble Models

Predicting Media Memorability Using Ensemble Models at MediaEval 2019
David Azcona, Enric Moreu, Feiyan Hu,
Tomás Ward, Alan F. Smeaton
Dublin City University, Insight Centre for Data Analytics
david.azcona@dcu.ie

The MediaEval Predicting Media Memorability Task
Predicting how memorable a video is to viewers
● Dataset: 10,000 soundless short videos, each with two
memorability scores: short-term and long-term
● Literature: in 2018, CNNs pre-trained on large datasets
(ImageNet) performed better than video captions and
pre-computed features
https://github.com/dazcona/memorability

In 2018 ..
• We participated and did badly, so we upped our efforts to perform
better and to try to understand memorability
• Builds on our long-standing work on memory and recall …
mindfulness, mind wandering, cognitive stimulation therapy, BCI and
EEG
• Used 8,000 videos, divided into 1,000 for evaluation and 7,000 for
training/testing, and set the task to a Masters class ... 135 students
in total
• Some of them did very well …

The students used a wide range of techniques, submitted
a 2-page paper describing their work, and were graded
on approach and performance

Our approach
Divided the 8,000 videos into training and validation
Developed individual models per set of features & combined
with ensemble models using:
● Traditional Machine Learning:
Support Vector Regression & Bayesian Ridge Regression
● Deep Learning (highly regularized):
Word embeddings (captions) & transfer learning, using neural
network activations as features and fine-tuning our own networks

Our approach
Extracted 8 frames per video (first + one per second)
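
The "first + one per second" sampling can be sketched as a small index calculation. This is only an illustration: the function name `frame_indices` is hypothetical, the frame rate is assumed to be known for each video, and the slides do not say which tool was used to decode the frames.

```python
from typing import List


def frame_indices(fps: float, n_frames: int = 8) -> List[int]:
    """Return the index of the first frame plus one frame per elapsed second.

    Assumption: the video is at least (n_frames - 1) seconds long.
    """
    return [round(second * fps) for second in range(n_frames)]


# Hypothetical 30 fps clip: frame 0, then one frame at each whole second.
indices = frame_indices(30.0)
print(indices)
```

The resulting indices can then be passed to whichever video reader is in use to grab the corresponding frames.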
● Off-the-shelf pre-computed features:
C3D, HMP, LBP, InceptionV3, Color Histogram & Aesthetic
● Our pre-computed features: Aesthetics & Emotions
● Textual information: bag-of-words TF-IDF with linear models &
GloVe embeddings + RNN (GRU) + high dropout
● Pre-trained CNNs as feature extractors: transfer learning
with ImageNet: VGG16, DenseNet121, ResNet50 & ResNet152
● Fine-tuning our own CNN: ResNet - head + FC + sigmoid
● Ensemble models: combinations of individual models’ predictions
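
As an illustration of the bag-of-words TF-IDF representation used for the captions, here is a minimal sketch in plain Python. The captions are made up, and the smoothed IDF and L2 normalisation are assumptions rather than the settings used in the submission.

```python
import math
from collections import Counter


def tfidf(captions):
    """Turn captions into L2-normalised TF-IDF vectors over a shared vocabulary."""
    docs = [caption.lower().split() for caption in captions]
    n_docs = len(docs)
    # Document frequency: in how many captions does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(doc_freq)
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Hypothetical captions, for illustration only.
captions = ["woman walking on beach", "man walking dog", "dog on beach"]
vocab, vectors = tfidf(captions)
```

Each row of `vectors` could then be fed to a linear model that regresses the memorability score.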

Why Emotions?
Long-term scores: 0.727 (left), 0.273 (right)
MediaEval 2018: Duy-Tue Tran-Van et al. @ HCMUS’s paper: "Predicting
Media Memorability Using Deep Features and Recurrent Network"

Our 7 emotions: anger, disgust, fear, happiness, sadness,
surprise, neutral, plus gender scores & spatial information

Our pre-submission results informed us of the relative
importance of different techniques

Our official performance figures …

Our relative weightings for different techniques in the ensembles

Official Results
Team                 Best Short Term   Best Long Term
Insight@DCU          0.528             0.27
MeMAD                0.522             0.277
Best 2018            0.497             0.257
UPB-L25 (*)          0.477             0.232
RUC                  0.472             0.216
EssexHubTV           0.467             0.203
TCNJ-CS              0.445             0.218
HCMUS                0.445             0.208
GIBIS                0.438             0.199
Baseline (MemNet)    0.39              0.17
Average 2018         0.359             0.173
(*) Organiser Team

Findings & Contributions
● DL CNN models typically outperform models trained with
captions and other visual features for short-term
memorability; however, techniques such as embeddings
and RNNs can achieve very high results for captions
● We believe fine-tuned CNN models will outperform pre-
trained models as feature extractors given enough
training samples (not proven in this paper)

Findings & Contributions
● Ensembling models by using predictions instead of
training models with very long vectors of features is an
alternative we used to counteract memory limitations
● Ensembling models with different modalities such as
emotions with captions, high-level representations from
CNNs and visual pre-computed features achieves best
results as they represent different high-level
abstractions
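
Ensembling on predictions rather than on concatenated features can be sketched as a weighted average of per-model outputs, with the weights chosen on the validation set. Everything specific here is an assumption for illustration: the grid search, the use of Spearman's rank correlation as the selection criterion, the function names, and the toy scores.

```python
from itertools import product


def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def blend(predictions, weights):
    """Weighted average of each model's prediction for every video."""
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(len(predictions[0]))]


def best_weights(predictions, targets, steps=10):
    """Grid-search integer weights that maximise validation Spearman."""
    best_score, best_combo = float("-inf"), None
    for combo in product(range(steps + 1), repeat=len(predictions)):
        if sum(combo) == 0:
            continue  # all-zero weights define no blend
        score = spearman(blend(predictions, combo), targets)
        if score > best_score:
            best_score, best_combo = score, combo
    return best_score, best_combo


# Toy validation scores (hypothetical, not taken from the dataset).
targets = [0.90, 0.70, 0.80, 0.60, 0.95]
captions_model = [0.80, 0.75, 0.70, 0.65, 0.85]
cnn_model = [0.85, 0.60, 0.90, 0.70, 0.90]
score, weights = best_weights([captions_model, cnn_model], targets)
```

Only one prediction vector per model has to be kept in memory, which is the practical advantage over training a single model on very long concatenated feature vectors.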

Visualizations for captions using Wordclouds

Class Activation Maps for Most Memorable Videos
Class activation maps illustrate which parts of a video led a CNN
(ResNet152 trained on ImageNet) to its final classification decision,
which allows us to explore what makes a video memorable
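
For readers unfamiliar with the technique, a class activation map in its basic form is the class-weighted sum of the final convolutional feature maps. The sketch below assumes that basic formulation (the slides do not say which variant was used) and works on tiny made-up feature maps instead of real ResNet activations.

```python
def class_activation_map(feature_maps, class_weights):
    """Combine K feature maps (each H x W) into one H x W map scaled to [0, 1].

    feature_maps: list of K maps from the last convolutional layer.
    class_weights: the K weights linking each map to the chosen class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(w * fmap[r][c] for w, fmap in zip(class_weights, feature_maps))
            for c in range(width)]
           for r in range(height)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / span for v in row] for row in cam]


# Two hypothetical 2x2 feature maps and their class weights.
maps = [[[1, 0], [0, 0]],
        [[0, 0], [0, 2]]]
cam = class_activation_map(maps, [0.5, 1.0])
```

In practice the low-resolution map would be upsampled to the frame size and overlaid on the frame as a heat map.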

GitHub repository & Tables
Insight@DCU Results
Short-term Ensembles & Long-term Ensembles
