Corresponding author: ricardo.kleinlein@upm.es
R. Kleinlein et al -- MediaEval’20, December 14-15 2020, Online
Table of contents
1. About the dataset
2. Approach
• Feature extraction methods
• Single-modality learning models
• Late-fusion ensemble of modalities
3. Results
4. Conclusion
About the dataset
• 1500 short videos with sound
§ 1000 development
§ 500 test
• Every video
§ About 5-7 seconds long
§ Shows a quick action
§ Has between 2 and 5 text captions
§ Has a short-term memorability score
§ Has a long-term memorability score
Further details in the official task paper [1].
[1]: Alba García Seco de Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu, and Alan F. Smeaton. 2020. Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable? In Working Notes Proceedings of the MediaEval 2020 Workshop.
Our approach
Feature extraction - captions
Original captions:
• "A group of kids wearing the color orange play a game of football."
• "Boys in orange jerseys run across the field in a football game."
After preprocessing:
• a group of kid wear the color orange play a game of football
• boy in orange jersey run across the field in a football game.
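The caption normalization shown above can be sketched roughly as follows. This is only an illustration: the `LEMMAS` table and the `normalize_caption` function are hypothetical stand-ins for whatever lemmatizer the authors used (the slide cites WordNet [3]), and the exact preprocessing steps are an assumption inferred from the before/after example.

```python
import string

# Hypothetical lemma table: a tiny stand-in for a WordNet-backed lemmatizer.
# It only covers the inflected words that occur in the two example captions.
LEMMAS = {"kids": "kid", "wearing": "wear", "boys": "boy", "jerseys": "jersey"}


def normalize_caption(caption: str) -> str:
    """Strip punctuation, lowercase, and map each token to its lemma."""
    # Remove ASCII punctuation plus the typographic quotes seen on the slide.
    strip_chars = string.punctuation + "\u201c\u201d"
    cleaned = caption.translate(str.maketrans("", "", strip_chars)).lower()
    # Tokens missing from the table are kept unchanged.
    return " ".join(LEMMAS.get(token, token) for token in cleaned.split())


if __name__ == "__main__":
    print(normalize_caption("A group of kids wearing the color orange play a game of football."))
    print(normalize_caption("Boys in orange jerseys run across the field in a football game."))
```

Unlike the second example on the slide, this sketch also drops the trailing period, since it removes all punctuation uniformly.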
[Figure: caption feature-extraction pipeline, citing [2] and [3]]
[2]: Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016).
[3]: George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38 (1995), 39–41.
Feature extraction - audio signal
[Figure: audio feature-extraction pipeline, citing [4] and [5]]
[4]: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. (2017). arXiv:cs.SD/1609.09430
[5]: J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 776–780. https://doi.org/10.1109/ICASSP.2017.7952261
Feature extraction - visual
[Figure: visual feature-extraction pipeline, citing [6] and [7]]
[6]: Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. 2018. LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation. In IEEE Conference on Computer Vision and Pattern Recognition.
[7]: Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572.
Results
• Development set
§ Single-modality models do not reach high Spearman correlation values
§ An ensemble of predictions combined with SVR significantly improves the Spearman correlation
• Test set
§ The ensemble shows no correlation between video features and memorability scores
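To make the evaluation metric concrete, here is a minimal sketch of Spearman's rank correlation together with a very simple late fusion of per-modality predictions. It is an illustration under stated assumptions, not the authors' pipeline: the slides combine predictions with an SVR, whereas `late_fusion_mean` below swaps in a plain average, and all function names and scores are hypothetical.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


def late_fusion_mean(per_modality_preds):
    """Average the predictions of several models, video by video.

    Plain averaging is a stand-in here; the slides use an SVR instead.
    """
    return [mean(video_preds) for video_preds in zip(*per_modality_preds)]


if __name__ == "__main__":
    # Hypothetical memorability scores and per-modality predictions.
    truth = [0.91, 0.78, 0.85, 0.69]
    text_preds = [0.80, 0.82, 0.79, 0.75]
    audio_preds = [0.84, 0.76, 0.83, 0.80]
    fused = late_fusion_mean([text_preds, audio_preds])
    print(spearman(truth, fused))
```

Note that a model predicting the same score for every video has zero rank variance, so its Spearman correlation is undefined (NaN in this sketch).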
Conclusions
Analysis
• Shortage of development data
• Memorability scores do not show much variability
• Learning capabilities not fully exploited
• System ends up learning the average memorability score
Future lines of research
• More compact and specific data encodings, suited for small datasets
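One way to see why a model can drift toward the average score when the targets vary little: a constant prediction equal to the mean already achieves a squared error equal to the (small) variance of the scores. The sketch below uses made-up scores, and the `mse` helper is hypothetical; it is not taken from the authors' code.

```python
from statistics import mean, pvariance


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Hypothetical memorability scores with little variability.
scores = [0.82, 0.85, 0.88, 0.86, 0.84]

# Baseline that predicts the average score for every video.
baseline = [mean(scores)] * len(scores)
baseline_mse = mse(scores, baseline)

if __name__ == "__main__":
    # The baseline error matches the population variance of the scores.
    print(baseline_mse, pvariance(scores))
```

With so little error left to reduce, a regression loss gives the model weak incentive to rank videos correctly, which is consistent with the low rank correlations reported above.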

Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models
