MediaEval 2015 - Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks

Automatically Estimating Emotion in Music with
Deep Long-Short Term Memory Recurrent Neural
Networks
George Trigeorgis, Eduardo Coutinho, Stefanos Zafeiriou,
Björn Schuller
Department of Computing
Imperial College London
MediaEval 2015, Wurzen, Germany

Method: feature sets
Feature Set 1 (FS1):
Based on the 2013 INTERSPEECH Computational Paralinguistics Challenge feature set
65 low-level descriptors (LLDs; energy, spectrum and voice-related) plus their
first-order derivatives, covering a broad set of descriptors from the fields
of speech processing, Music Information Retrieval, and general
sound analysis
We computed the mean and standard deviation functionals of
each feature over 1 s time windows with 50% overlap
Final set: 260 features (65 LLDs × 2 with derivatives × 2 functionals), extracted at a rate of 2 Hz
All features were extracted with openSMILE
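
The windowing step can be illustrated with a minimal NumPy sketch. It is not the openSMILE configuration used by the authors: the function name `window_functionals`, the LLD frame rate of 100 Hz, and the synthetic input are all assumptions made for illustration. It only shows how 130 frame-level descriptors (65 LLDs plus derivatives) become 260 features at 2 Hz.

```python
import numpy as np

def window_functionals(lld, frame_rate_hz=100, win_s=1.0, overlap=0.5):
    """Mean and standard deviation of each LLD column over sliding windows.

    lld: array of shape (n_frames, n_llds).
    frame_rate_hz is an assumed LLD frame rate, not a value from the slides.
    """
    win = int(round(win_s * frame_rate_hz))   # frames per 1 s window
    hop = int(round(win * (1.0 - overlap)))   # 50% overlap gives a 0.5 s hop (2 Hz)
    starts = range(0, lld.shape[0] - win + 1, hop)
    rows = [np.concatenate([lld[s:s + win].mean(axis=0),
                            lld[s:s + win].std(axis=0)]) for s in starts]
    return np.vstack(rows)

rng = np.random.default_rng(0)
lld = rng.normal(size=(1000, 130))   # 10 s of synthetic data: 65 LLDs + 65 derivatives
feats = window_functionals(lld)
print(feats.shape)                   # (19, 260)
```

Each output row stacks the 130 window means followed by the 130 window standard deviations.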

Method: feature sets
Feature Set 2 (FS2):
FS1 plus four new features:
Roughness (R) and Sensory Dissonance (SDiss)
Tempo (T) and Event Density (ED)
These correspond to two psychoacoustic dimensions consistently
associated with the communication of emotion in music and
speech: Roughness and Duration (Coutinho & Dibben, 2013)
The four features were extracted with the MIR Toolbox:
mirroughness: SDiss (Sethares formula) and R (Vassilakis algorithm)
mirtempo: T
mireventdensity: ED

Method: regressor
Given the importance of temporal context to the
perception of emotion in music, we consider temporal models:
Deep Recurrent Neural Networks. RNNs are neural networks that
also operate over the time domain, rather than only over the
spatial domain.
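
To make "operating over the time domain" concrete, here is a minimal sketch of a plain tanh recurrent layer in NumPy. It is deliberately simpler than the LSTM blocks used in this work. The function name `rnn_forward`, the layer sizes and the random weights are illustrative assumptions. The point is only that the hidden state at time t depends on all earlier inputs.

```python
import numpy as np

def rnn_forward(x, W_in, W_rec, b):
    """Run a plain tanh recurrent layer over a sequence x of shape (T, n_in)."""
    h = np.zeros(W_rec.shape[0])          # initial hidden state
    states = []
    for x_t in x:                         # iterate over time steps
        # The new state mixes the current input with the previous state.
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        states.append(h)
    return np.stack(states)               # shape (T, n_hidden)

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 8, 4               # hypothetical sizes
W_in = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))
W_rec = rng.uniform(-0.1, 0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=(T, n_in))
states = rnn_forward(x, W_in, W_rec, b)

# Changing only the first input frame alters the state at the last time step.
x_changed = x.copy()
x_changed[0] += 1.0
states_changed = rnn_forward(x_changed, W_in, W_rec, b)
```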

Model training
Joint learning of time-continuous Arousal and Valence values
(multitask).
Cross-validation
1. The fold subdivision followed a modulus-based scheme
(instance ID modulo 11).
2. The instances yielding a remainder of 10 were left out to
create a small test set for performance estimation.
3. On the remaining instances, a 10-fold cross-validation was
performed.
4. We ran 4 trials of the same model, each with
initial weights randomized in the range [-0.1, 0.1].
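
The modulus-based split can be sketched as follows. The slides state that remainder 10 forms the held-out set and that the rest is split into 10 folds. Mapping remainders 0 to 9 directly to fold indices is an assumption, as are the function names and the example IDs.

```python
def assign_split(instance_id, modulus=11):
    """Return "test" for remainder 10, otherwise a fold index in 0..9."""
    remainder = instance_id % modulus
    if remainder == modulus - 1:
        return "test"
    return remainder   # assumed: the remainder doubles as the fold index

def split_instances(instance_ids):
    """Group instance IDs into a held-out test list and 10 cross-validation folds."""
    test, folds = [], {k: [] for k in range(10)}
    for i in instance_ids:
        split = assign_split(i)
        if split == "test":
            test.append(i)
        else:
            folds[split].append(i)
    return test, folds

ids = list(range(1, 45))                  # hypothetical instance IDs 1..44
test_ids, folds = split_instances(ids)
print(test_ids)                           # [10, 21, 32, 43]

# One cross-validation round: fold k validates, the other nine folds train.
k = 0
val_ids = folds[k]
train_ids = [i for j, fold in folds.items() if j != k for i in fold]
```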

Model training (cont.)
Basic architecture: Deep LSTM-RNN (2 hidden layers)
Optimised parameters:
number of LSTM blocks in each hidden layer
learning rate
standard deviation of the Gaussian noise applied to the input
activations (used to alleviate the effects of over-fitting)
A momentum of 0.9 was used for all tests
Early stopping strategy (to avoid overfitting the training data):
training was stopped after 20 iterations without improvement
of the validation set performance
For each fold, instances were presented in random order
The input (acoustic features) and output (emotion features) data
were standardised to zero mean and unit variance (using statistics from the
corresponding training set of each cross-validation fold)
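
Two of the training details above, per-fold standardisation and early stopping with a patience of 20 iterations, can be sketched as below. This is an illustrative sketch, not the authors' training code. The helper names, the synthetic data and the simulated validation errors are assumptions.

```python
import numpy as np

def fit_standardiser(train):
    """Estimate per-feature mean and standard deviation on the training fold only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0                   # guard against constant features
    return mean, std

def standardise(data, mean, std):
    """Apply training-fold statistics to any split (training or validation)."""
    return (data - mean) / std

class EarlyStopping:
    """Signal a stop after `patience` iterations without validation improvement."""
    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_iterations = 0

    def step(self, val_error):
        if val_error < self.best:
            self.best = val_error
            self.bad_iterations = 0
        else:
            self.bad_iterations += 1
        return self.bad_iterations >= self.patience

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(500, 4))   # synthetic training fold
val = rng.normal(loc=5.0, scale=3.0, size=(50, 4))      # synthetic validation fold
mean, std = fit_standardiser(train)
train_z = standardise(train, mean, std)
val_z = standardise(val, mean, std)                     # reuses training statistics

stopper = EarlyStopping(patience=20)
errors = [1.0, 0.9] + [0.95] * 25                       # simulated validation errors
stopped_at = next(i for i, e in enumerate(errors) if stopper.step(e))
```

Reusing the training-fold statistics for the validation fold keeps information about the validation data out of the model inputs.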

Pretraining using denoising autoencoders
The unsupervised pre-training strategy consisted of denoising
LSTM-RNN auto-encoders (DAE).
We first created an LSTM-RNN with a single hidden layer
trained to predict the input features (y(t) = x(t)).
In order to avoid over-fitting, in each training epoch and
timestep t, we added a noise vector n to x(t), sampled from a
Gaussian distribution with zero mean and variance σ².
Both the development and test set instances were used to
train the DAE.
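
The input corruption used for the denoising auto-encoder can be sketched in a few lines. The function name `denoising_pair`, the value of `sigma` and the synthetic features are assumptions. The auto-encoder itself is not shown, only how a noisy input and its clean target would be paired.

```python
import numpy as np

def denoising_pair(x, sigma, rng):
    """Return (noisy input, clean target) for one pass over a sequence x.

    x: array of shape (T, n_features). Noise is zero-mean Gaussian with
    standard deviation sigma (variance sigma**2), drawn independently for
    every timestep and feature.
    """
    noise = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return x + noise, x                   # the target stays y(t) = x(t)

rng = np.random.default_rng(0)
x = np.zeros((20000, 4))                  # synthetic feature sequence
sigma = 0.3                               # hypothetical noise level

# Calling the function once per epoch draws fresh noise each time.
noisy_epoch1, target = denoising_pair(x, sigma, rng)
noisy_epoch2, _ = denoising_pair(x, sigma, rng)
```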

Results
Results on the official test set.
CB: challenge baseline features

Metric  Run  Arousal        Valence
RMSE    2    0.242±0.116    0.373±0.195
RMSE    3    0.234±0.114    0.372±0.190
RMSE    4    0.236±0.114    0.375±0.191
RMSE    CB   0.270±0.110    0.366±0.180
r       2    0.611±0.254    0.004±0.505
r       3    0.599±0.287    0.017±0.492
r       4    0.613±0.278    0.026±0.500
r       CB   0.360±0.260    0.010±0.380
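
For reference, the two metrics can be computed as in the sketch below. It assumes that r is Pearson's correlation coefficient and that each "mean±std" entry summarises per-song scores. Neither is stated on the slide. The function names and the toy data are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two time series."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two time series."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def summarise(true_by_song, pred_by_song, metric):
    """Mean and standard deviation of a metric computed song by song (assumed)."""
    scores = np.array([metric(t, p) for t, p in zip(true_by_song, pred_by_song)])
    return float(scores.mean()), float(scores.std())

# Toy example with two hypothetical songs.
true_by_song = [np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.0, -1.0])]
pred_by_song = [np.array([0.1, 0.6, 1.1]), np.array([0.5, 0.0, -0.5])]
rmse_mean, rmse_std = summarise(true_by_song, pred_by_song, rmse)
r_mean, r_std = summarise(true_by_song, pred_by_song, pearson_r)
```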

Questions.
