Emotion and Theme Recognition in Music
Using Jamendo
Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, Minz Won
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Multimedia Evaluation Benchmark 2020 Workshop
14-15 December 2020, Online
Emotions and Themes in Music
● Same format as Emotions and Themes in Music 2019
● A popular task in Music Information Retrieval: Music Emotion Recognition (MER)
● Emotion in Music Task at MediaEval: 2013 - 2015 (Aljanaki et al. 2015)
○ Per-second arousal/valence annotations
● Audio mood classification task at MIREX (since 2007)
○ 600 tracks, 5 emotion clusters from tags
● Theme recognition remains underexplored (Bischoff et al. 2009)
○ Epic, dark, christmas, etc.
Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. F. (2008). The 2007 MIREX audio mood classification task: Lessons learned. In Proc. 9th Int. Conf. Music Inf. Retrieval (pp. 462-467).
Aljanaki, A., Yang, Y. H., & Soleymani, M. (2015, October). Emotion in Music Task at MediaEval 2015. In MediaEval.
Bischoff, K., Firan, C. S., Paiu, R., Nejdl, W., Laurier, C., & Sordo, M. (2009, October). Music mood and theme classification-a hybrid approach. In ISMIR (pp. 657-662).
Jamendo.com
MTG-Jamendo Dataset
● Creative Commons license
● Quality audio and labels (curated by Jamendo)
● Tag categories and subsets (top50, toy)
● Tag pre-processing (“sadness” → “sad”)
● Five standardized splits with no artist overlap between partitions (see the sketch below)
● Reproducible pre-processing and baseline
Bogdanov, D., Won, M., Tovstogan, P., Porter, A. & Serra, X. The MTG-Jamendo dataset for automatic music tagging. In Proceedings of the Machine Learning for Music Discovery
Workshop, 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA (2019)
github.com/MTG/mtg-jamendo-dataset
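A minimal sketch of how artist-disjoint folds can be constructed with scikit-learn, similar in spirit to the dataset's standardized splits (the official split files ship with the dataset; the file name and the artist_id column are assumptions):

import pandas as pd
from sklearn.model_selection import GroupKFold

tracks = pd.read_csv("autotagging_moodtheme.tsv", sep="\t")   # hypothetical metadata file
gkf = GroupKFold(n_splits=5)                                   # five standardized splits
for fold, (train_idx, test_idx) in enumerate(gkf.split(tracks, groups=tracks["artist_id"])):
    train_artists = set(tracks.iloc[train_idx]["artist_id"])
    test_artists = set(tracks.iloc[test_idx]["artist_id"])
    assert not train_artists & test_artists                    # no artist overlap between partitions
    print(f"split-{fold}: {len(train_idx)} train / {len(test_idx)} test tracks")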
Moods and themes
● Mood/theme subset: 56 tags (after applying the splits); the task uses the fixed split-0
● Wide spectrum of interesting tags
○ 224 tags before keeping only those used by at least 50 different artists
○ E.g. ethereal (23 artists, 25 tracks), water (28, 43), suspense (32, 108), halloween (24, 101)
● What is the difference between mood and theme?
○ Deep? Soft vs calm? Dark?
○ No distinction in this task
Emotions and Themes in Music Task
Content-based emotion and theme recognition in music
Goal: Automatically recognize the emotions and themes conveyed in a music
recording using machine learning algorithms.
Task introduced in 2019:
● New category of tags: themes
● New open dataset with higher-quality tags and audio
○ Multiple categories of tags including mood/theme
● Audio is available under CC licenses
Examples
Emotions? Moods? Themes?
● Example 1: Veaceslav Draganov - Motivation
○ commercial, corporate, happy, motivational
● Example 2: XCYRIL - Wandering Heart
○ documentary, emotional, space
● Example 3: Bendjamin Lambert - Everyone’s Disease
○ melancholic, melodic, sad
Data
Available: 18,486 tracks with 56 tags
● 320 kbps MP3 (152 GB)
● Compressed NPY files with spectrograms (68 GB)
● Essentia features (0.4 GB)
Pre-processing (sketched below):
● Keep only tracks longer than 30 s
● Merge similar tags (stemming, translation)
● Keep only tags used by at least 50 different artists
D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis
Library for Music Information Retrieval. In International Society for Music Information Retrieval Conference. Curitiba, Brazil
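A minimal sketch of this kind of filtering with pandas (the column names, input file, and any merges beyond "sadness" → "sad" are assumptions; the actual pre-processing scripts live in the GitHub repository):

import pandas as pd

meta = pd.read_csv("raw_tracks.tsv", sep="\t")                       # hypothetical raw metadata
meta = meta[meta["duration_s"] > 30]                                 # keep tracks longer than 30 s

pairs = meta.assign(tag=meta["tags"].str.split(",")).explode("tag")  # one row per (track, tag)
merge_map = {"sadness": "sad"}                                       # example merge from the slides
pairs["tag"] = pairs["tag"].replace(merge_map)

artists_per_tag = pairs.groupby("tag")["artist_id"].nunique()        # distinct artists per tag
kept_tags = artists_per_tag[artists_per_tag >= 50].index
pairs = pairs[pairs["tag"].isin(kept_tags)]                          # drop tags used by < 50 artists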
Submission and evaluation metrics
Submissions (on the test partition):
● Tag activation values
● Binary tag predictions (optional)
● Basic script provided to generate binary predictions from activations
Evaluation metrics:
● Macro ROC-AUC and PR-AUC on tag activations
● Micro- and macro-averaged precision, recall and F-score for binary predictions
● Main leaderboard metric: Macro PR-AUC
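A minimal, self-contained sketch of the metrics with scikit-learn and dummy data (the 0.5 threshold is an assumption; the task provided its own scripts for binarization and evaluation):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 56))        # dummy ground truth, 56 mood/theme tags
activations = rng.random((1000, 56))                # dummy submitted tag activations

macro_pr_auc = average_precision_score(y_true, activations, average="macro")   # main leaderboard metric
macro_roc_auc = roc_auc_score(y_true, activations, average="macro")

y_pred = (activations >= 0.5).astype(int)           # naive binarization of activations
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print(macro_pr_auc, macro_roc_auc, f1)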
VGG-ish Baseline
● 5 CNN layers + dense
Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep
convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016)
● Uses only a centered 29.1 s audio segment
● Mel-spectrograms (12 kHz sample rate, 96 bands, 21 ms hop)
● ADAM / SGD, optimizing for ROC-AUC
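A minimal sketch of the baseline's input pipeline using librosa (the exact baseline code is in the task repository; the 256-sample hop approximates a 21 ms hop at 12 kHz, and the file name is hypothetical):

import librosa

SR, N_MELS, HOP = 12000, 96, 256                      # ~21 ms hop at 12 kHz
SEGMENT_SECONDS = 29.1

y, _ = librosa.load("track.mp3", sr=SR, mono=True)    # hypothetical input track
n = int(SEGMENT_SECONDS * SR)
start = max(0, (len(y) - n) // 2)
y = y[start:start + n]                                # centered 29.1 s segment

mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS, hop_length=HOP)
log_mel = librosa.power_to_db(mel)                    # (96, ~1365) input to the 5-layer CNN
print(log_mel.shape)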
VGG-ish Baseline: best and worst tags
Top 10 PR-AUC # tracks*
deep 0.5761 429
summer 0.4466 261
film 0.3441 746
corporate 0.3373 349
epic 0.308 601
happy 0.2534 927
advertising 0.2477 363
dark 0.2183 620
motivational 0.2012 372
meditative 0.1669 374
Bottom 10 PR-AUC # tracks*
travel 0.0097 89
holiday 0.0153 98
cool 0.0215 170
groovy 0.0238 78
sexy 0.0238 59
retro 0.0247 139
movie 0.025 207
drama 0.0253 448
fast 0.0266 62
funny 0.0279 109
* in training set
Proposed solutions
● 21 runs by 6 teams
● All submissions use deep learning (expected from the task design)
Compared to 2019:
● Attention everywhere
● SpecAugment is very popular (sketch below)
● Late fusion is even more frequent
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and
Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech
recognition. arXiv preprint arXiv:1904.08779 (2019).
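A minimal sketch of SpecAugment-style masking (frequency and time masking only, no time warping); the mask sizes are arbitrary assumptions, not the teams' actual settings:

import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=12, max_t=40, rng=None):
    # spec: (n_mels, n_frames) log-mel spectrogram; masked bands/frames are zeroed out
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, max_f + 1))            # mask width in mel bands
        f0 = int(rng.integers(0, n_mels - f + 1))
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))            # mask width in frames
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[:, t0:t0 + t] = 0.0
    return out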
Results (best run per team)
Rank Team PR-AUC ROC-AUC F-Score Approach, Focus, External datasets
1 SAIL-MiM-USC 0.1609 0.7812 0.2203 VGGish (MSD, Music4All), Mixup, experimenting w/ losses
4 SAIL-MiM-USC 0.1421 0.7625 0.1976 Same, without external data
5 HCMUS 0.1414 0.7663 0.0594 EfficientNet, WaveNet (NSynth), MobileNetV2, SpecAug
8 AugsBurger 0.1313 0.7533 0.1901 CBAM, Self-Attention + RNNs
9 UAI-CNRL 0.1275 0.7360 0.1883 ResNet, Self-Attention, Mixup, SpecAug
12 AUGment 0.1178 0.7353 0.1738 VGGish, Self-Attention, AReLU, smaller nets
15 baseline-vgg 0.1077 0.7258 0.1656 VGGish
19 UIBK-DBIS 0.0965 0.7043 0.1040 CRNN, pre-processing, moods vs themes
24 baseline-pop 0.0319 0.5 0.0026 -
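Mixup, reported by several of the teams above, blends pairs of training examples and their multi-hot labels; a minimal PyTorch sketch (the alpha value is an assumption):

import torch

def mixup(x, y, alpha=0.2):
    # x: (batch, ...) spectrogram batch, y: (batch, n_tags) multi-hot labels
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient in (0, 1)
    perm = torch.randperm(x.size(0))                        # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix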
Results: Precision vs Recall
Results: PR-AUC and ROC-AUC
Rank Run PR-AUC ROC-AUC F-Score Approach
1 Best 2020 0.1609 0.7812 0.2203 VGGish (MSD, Music4All), Mixup, experimenting w/ losses
3 Best 2019 0.1546 0.7729 0.2124 Shake-FA-ResNet + FA-ResNet
15 Baseline VGG 0.1077 0.7258 0.1656 VGG
Improvement of Best 2020 over Best 2019: +0.0063 +0.0083 +0.0079
Conclusions: The Task is Challenging!
● Improvements are present, but are they significant?
● Is it about having more data or having a better approach?
● The dataset is not large, so data augmentation is commonly used
● The winning team, SAIL-MiM-USC, used an ensemble with different losses designed to compensate for the unbalanced tag representation; maybe that is the direction to follow? (see the weighting sketch below)
● Team AUGment focused on more lightweight models with fewer FLOPs
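One common way to compensate for unbalanced tag representation is a per-tag positive weight in the binary cross-entropy loss; this is only an illustrative sketch (not necessarily the winning team's losses), using four tag counts from the baseline table above:

import torch

# Positive training-track counts for a few tags (from the baseline table): happy, dark, travel, sexy
tag_counts = torch.tensor([927., 620., 89., 59.])
pos_weight = tag_counts.max() / tag_counts             # rarer tags get larger positive weights

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
logits = torch.randn(8, 4)                             # dummy model outputs for a batch of 8 tracks
targets = torch.randint(0, 2, (8, 4)).float()          # dummy multi-hot labels
loss = criterion(logits, targets)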
Future directions
● Considering the 2021 edition of the task
● Several datasets?
● Improved tag balance of the dataset?
● Individual tag analysis?
● Additional tag metadata (genre, instruments)
Reproducibility
● Open auto-tagging dataset: MTG-Jamendo
● Pre-processing, download, and baseline scripts available on GitHub
● Audio and metadata available under Creative Commons licenses
● Open source code for the baseline and submissions
○ 4 out of 6 teams published their source code
All info is available on the website: tinyurl.com/mediaeval2020emotions
Questions?
Thank you!
Multimedia Evaluation Benchmark 2020 Workshop
14-15 December 2020, Online
