Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models

Corresponding author: ricardo.kleinlein@upm.es
R. Kleinlein et al -- MediaEval’20, December 14-15 2020, Online

Table of contents
1. About the dataset
2. Approach
• Feature extraction methods
• Single-modality learning models
• Late-fusion ensemble of modalities
3. Results
4. Conclusion

About the dataset
• 1500 short videos with sound
§ 1000 development
§ 500 test
• Every video
§ ~5-7 seconds long
§ Quick action happening
§ Between 2 and 5 text captions
§ Short-term memorability score
§ Long-term memorability score
Further details in the official task paper [1].
[1]: Alba García Seco de Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu, and Alan F. Smeaton. 2020. Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable?. In Working Notes Proceedings of the MediaEval 2020 Workshop.

Our approach

Feature extraction - captions
Raw captions:
"A group of kids wearing the color orange play a game of football."
"Boys in orange jerseys run across the field in a football game."
After lowercasing and lemmatization:
a group of kid wear the color orange play a game of football
boy in orange jersey run across the field in a football game
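The caption example above can be reproduced with a minimal sketch, assuming the preprocessing amounts to lowercasing, punctuation removal, and lemmatization. The lemma table `TOY_LEMMAS` and the function `normalize_caption` are hypothetical stand-ins written for this illustration; the slides cite WordNet [3], and the actual lemmatizer used by the authors is not shown here.

```python
import string

# Hypothetical toy lemma table covering only the words in the slide example.
# A real pipeline would query a lexical database such as WordNet instead.
TOY_LEMMAS = {
    "kids": "kid",
    "wearing": "wear",
    "boys": "boy",
    "jerseys": "jersey",
}

# Translation table that deletes ASCII punctuation and curly quotes.
_STRIP = str.maketrans("", "", string.punctuation + "“”")


def normalize_caption(caption):
    """Lowercase, drop punctuation, and map each token to its lemma."""
    tokens = caption.lower().translate(_STRIP).split()
    return " ".join(TOY_LEMMAS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(normalize_caption(
        "A group of kids wearing the color orange play a game of football."
    ))
```

The normalized token sequence is what would then be mapped to word vectors such as the subword-aware embeddings of [2].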
[2]: Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016).
[3]: George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38 (1995), 39–41.
[2]
[3]

Feature extraction - audio signal
[5]
[4]
[4]: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. (2017). arXiv:cs.SD/1609.09430
[5]: J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 776–780. https://doi.org/10.1109/ICASSP.2017.7952261

Feature extraction - visual
[6]: Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. 2018. LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation. In IEEE Conference on Computer Vision and Pattern Recognition.
[7]: Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572.
[6] [7]

Results
• DEVELOPMENT SET
§ Single-modality models do not reach high Spearman correlation values
§ Ensemble of predictions + SVR significantly improves the Spearman correlation
• TEST SET
§ The ensemble does not show correlation between video features and memorability scores
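The results are reported as Spearman correlation, which compares the ranking of predicted scores with the ranking of ground-truth scores rather than their raw values. A minimal pure-Python sketch of the metric follows; the function names and the toy numbers are made up for illustration and are not results from the task.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(predictions, targets):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predictions), average_ranks(targets))


if __name__ == "__main__":
    # Hypothetical scores, not real data.
    targets = [0.70, 0.80, 0.90, 0.85]
    predictions = [0.10, 0.20, 0.40, 0.30]
    print(spearman(predictions, targets))
```

Because only the ordering matters, predictions on a different scale than the targets can still score well, while a predictor that outputs the same value for every video has no defined rank correlation (the sketch returns NaN in that case).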

Conclusions
Analysis
• Shortage in development data
• Memorability scores do not show much variability
• Learning capabilities not fully exploited
• System ends up learning the average memorability score
Future lines of research
• More compact and specific data encodings, suited for small datasets
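The analysis notes that memorability scores vary little and that the system ends up learning the average score. A minimal sketch of why this is an attractive solution for a regressor: when targets are tightly clustered, always predicting the mean already yields a small squared error (equal to the variance of the scores) while carrying no ranking information. The scores below are hypothetical and `mean_baseline_mse` is a helper written for this illustration.

```python
from statistics import mean, pvariance


def mean_baseline_mse(scores):
    """Mean squared error of a predictor that always outputs mean(scores)."""
    mu = mean(scores)
    return sum((s - mu) ** 2 for s in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical memorability scores clustered in a narrow range (not real data).
    scores = [0.82, 0.85, 0.88, 0.84, 0.86, 0.83]
    # The baseline error coincides with the population variance of the scores.
    print(mean_baseline_mse(scores), pvariance(scores))
```

A model trained with a squared-error objective can therefore reach a low loss by collapsing to the mean, which is consistent with a rank-based metric such as Spearman correlation staying near zero.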
