Deep learning for
music recommendation
Aloïs Gruson
@nilandmusic @aloisgr niland.io
Who we are
• Founded in 2013 by 2 PhDs who worked at IRCAM
• Won MIREX 2011 in Music Similarity Estimation and Music Classification
• We sell our technology through our API
• A team of 9 today
What we want to do
• Create a high-dimensional space where every song is a vector
• Use this space to find similar songs and to classify songs
• Each query must take <50 ms over millions of tracks
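As a rough illustration of the "every song is a vector" idea, here is a minimal sketch of a similarity query over such a space. The track ids, the 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for the example; the slides do not say which distance or index Niland uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, track_vectors, k=2):
    """Return the k track ids closest to query_id, best first (brute force)."""
    query = track_vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), track_id)
        for track_id, vec in track_vectors.items()
        if track_id != query_id
    ]
    scored.sort(reverse=True)
    return [track_id for _, track_id in scored[:k]]

# Hypothetical 3-dimensional embeddings; real ones would be much larger.
track_vectors = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 1.0, 0.0],
    "track_d": [0.7, 0.7, 0.0],
}
print(most_similar("track_a", track_vectors, k=2))
```

A linear scan like this one would probably not fit a <50 ms budget over millions of tracks; an approximate nearest-neighbour index is a common way around that, though the slides do not describe the one used here.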
How music information retrieval worked in 2011
• Short-term descriptors: MFCCs, Fluctuation Patterns ("Block-level audio features for music genres classification", Seyerlehner et al.) and many more!
• Pooling techniques: VQ, GMM-SV ("GMM Supervector for content based music similarity", Charbuillet et al.), VLAD ("Aggregating local descriptors into a compact image representation", Jégou et al.)...
[Diagram: boxes labeled Audio, MFCCs, VLAD, FP, GMM-SV]
One of our evaluation datasets
• Evaluation metrics for a search engine: Precision at k (P@k) or mean Average Precision
• Evaluation set presented here: 8500 tracks in 141 playlists of mainstream music
P@k        1      5      10     20     50
mirex2011  17.48  15.39  13.87  12.23  10.00
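For readers unfamiliar with the metric, here is a minimal sketch of Precision at k averaged over query tracks. It assumes that a retrieved track counts as relevant when it belongs to the same playlist as the query, which the slide implies but does not state; the playlists and rankings below are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k retrieved ids that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for track_id in top_k if track_id in relevant_ids)
    return hits / k

def mean_precision_at_k(rankings, playlist_of, k):
    """Average P@k over all query tracks.

    Assumption: a result is relevant if it shares the query's playlist.
    """
    scores = []
    for query_id, ranked_ids in rankings.items():
        relevant = {
            track_id
            for track_id, playlist in playlist_of.items()
            if playlist == playlist_of[query_id] and track_id != query_id
        }
        scores.append(precision_at_k(ranked_ids, relevant, k))
    return sum(scores) / len(scores)

# Hypothetical data: five tracks in two playlists, two ranked result lists.
playlist_of = {"a": "rock", "b": "rock", "c": "jazz", "d": "rock", "e": "jazz"}
rankings = {
    "a": ["b", "c", "d", "e"],
    "c": ["a", "e", "b", "d"],
}
print(mean_precision_at_k(rankings, playlist_of, k=2))
```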
From 2013 to 2014 @niland
• How to make a product out of research work!
• And a lot of work on short-term descriptors and pooling techniques
• But still completely unsupervised, with no real way to match outputs with human perception!
P@k        1       5       10      20      50
mirex2011  17.48   15.39   13.87   12.23   10.00
2014       19.70   16.81   15.37   13.57   11.01
gain (%)   +12.70  +9.23   +10.81  +10.96  +10.10
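The % row reads as the relative gain of the 2014 system over the mirex2011 baseline. A short check of that reading, using only the numbers from the table (the function name is made up for the example):

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# P@k values copied from the table, for k = 1, 5, 10, 20, 50.
mirex2011 = [17.48, 15.39, 13.87, 12.23, 10.00]
system_2014 = [19.70, 16.81, 15.37, 13.57, 11.01]

gains = [
    round(relative_gain_percent(old, new), 2)
    for old, new in zip(mirex2011, system_2014)
]
print(gains)
```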
Matching algorithm outputs with human perception
• Learn the outputs of a collaborative filtering model
"Deep content-based music recommendation", van den Oord et al.
• Or use a network trained to classify into groups of similar tracks
Integrating a human idea of similarity
• 150k tracks in 3500 theme-based albums from our clients
• Each album represents a genre, a mood or a usage
• Each gathers socially similar tracks
Learning with theme-based albums
• We use the outputs of our previous system as input
• We train a network on them with a classification cost
• And remove the classification layer!
P@k       1       5       10      20      50
2014      19.70   16.81   15.37   13.57   11.01
+deep     23.40   21.09   19.68   18.07   15.19
gain (%)  +18.78  +25.46  +28.04  +33.16  +37.97
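The "train with a classification cost, then remove the classification layer" recipe can be sketched as follows. This is only an illustration of the general idea: the layer sizes, the weights (imagined as already trained), and the ReLU and softmax choices are assumptions, not details given in the slides.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: 3 input features -> 2 hidden units -> 3 albums.
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
HIDDEN_B = [0.0, 0.1]
ALBUM_W = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
ALBUM_B = [0.0, 0.0, 0.0]

def embed(features):
    """Hidden activations: what remains once the classification layer is removed."""
    return relu(dense(features, HIDDEN_W, HIDDEN_B))

def classify(features):
    """Album probabilities: only needed while training with the classification cost."""
    return softmax(dense(embed(features), ALBUM_W, ALBUM_B))

features = [1.0, 2.0, 3.0]  # stand-in for the previous system's output vector
print(embed(features))      # the vector used for similarity search
print(classify(features))   # the output used during training only
```

After training, only `embed` would be kept, so the space used for search is the last hidden layer rather than the album probabilities.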
What if we want to remove the highly engineered features and pooling techniques?
Convolutional Neural Networks for Image Recognition:
Source: http://www.clarifai.com/technology
And for music?
• Mel-spectrogram (time-frequency representation) as input: the two axes have different meanings!
Should we really use square filters?
• Labels apply to the whole track (>= 30 seconds): the input is 128x1200 for a 30-second song!
We have to pool along the time axis!
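A quick worked calculation shows why pooling along time is needed. It uses only the 128x1200 and 30-second figures from the slide and assumes the frames are evenly spaced; the variable names are made up for the example.

```python
N_MEL_BANDS = 128   # frequency axis, from the slide
N_FRAMES = 1200     # time axis, from the slide
DURATION_S = 30     # clip length in seconds, from the slide

# Time resolution of the input.
frames_per_second = N_FRAMES / DURATION_S
# Spacing between consecutive frames, assuming they are evenly spaced.
frame_step_ms = 1000 / frames_per_second
# Number of input values covered by a single track-level label.
values_per_label = N_MEL_BANDS * N_FRAMES

print(frames_per_second, frame_step_ms, values_per_label)
```

One label therefore covers on the order of 150,000 input values spread over 1200 time steps, so the network has to collapse the time axis somewhere before the classification layer.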
And for music?
Source: Sander Dieleman, http://benanne.github.io/2014/08/05/spotify-cnns.html
And for music?
Some ideas to slightly improve it:
• Multi-scale pooling
• Reduce max pooling
• Add batch-norm
P@k        1      5      10     20     50
2014+deep  23.40  21.09  19.68  18.07  15.19
CNN        23.85  21.31  19.81  18.06  15.18
Okay, so?
• Our 2014 system is a mix of 6 different short-term descriptors + 6 different "smart" pooling functions: 10 years of research!
• Has the engineering problem become a data problem?
P@k        1      5      10     20     50
2014+deep  23.40  21.09  19.68  18.07  15.19
CNN        23.85  21.31  19.81  18.06  15.18
From Fisher Vectors to simple pooling functions?
• A very simple pooling function can give great results!
P@k           1      5      10     20     50
Mean          20.94  19.04  17.69  16.17  13.74
Max           22.21  19.90  18.58  17.07  14.61
Var           21.66  19.46  18.14  16.58  14.13
Mean+Max+Var  23.85  21.31  19.81  18.06  15.18
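A minimal sketch of what mean, max and variance pooling over the time axis could look like. It reads "Mean+Max+Var" as a concatenation of the three statistics and uses the population variance; both are assumptions, and the feature maps are toy values.

```python
from statistics import fmean, pvariance

def pool_over_time(feature_maps):
    """Summarise each feature map (values over time) by its mean, max and variance.

    Returns one fixed-length vector, whatever the duration of the track.
    """
    means = [fmean(fm) for fm in feature_maps]
    maxes = [max(fm) for fm in feature_maps]
    variances = [pvariance(fm) for fm in feature_maps]
    return means + maxes + variances

# Two hypothetical tracks with 2 feature maps each but different lengths.
short_track = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
long_track = [[0.0, 1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 2.0, 0.0, 2.0]]

print(pool_over_time(short_track))
print(len(pool_over_time(short_track)), len(pool_over_time(long_track)))
```

Both tracks end up as vectors of the same length (3 statistics x 2 feature maps), which is what lets a track-level label be attached to inputs of varying duration.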
And with square filters?
• Square filters also seem to work!
P@k    1      5      10     20     50
CNN    23.85  21.31  19.81  18.06  15.18
CNNsq  22.94  20.84  19.79  18.15  15.52
A transferable model for music
• Also works for world music, library music…
• This dataset: 10k tracks of library music, 300 groups
P@k        1      5      10     20     50
2014+deep  30.66  19.99  15.57  11.81  7.93
CNN        29.76  19.82  15.55  11.85  7.80
The spectrogram is still an engineered feature…
Could we learn a better temporal filter bank to replace FFT and mel-filtering?
"End-to-end learning for music audio", Dieleman and Schrauwen
"Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
Source: "Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
The spectrogram is still an engineered feature…
P@k      1      5      10     20     50
Raw      20.11  18.95  17.23  15.91  14.26
Spectro  23.85  21.31  19.81  18.06  15.18
Maybe we need more data?
We can improve!
• Add more albums!
• With 500k tracks? 1M?
P@k          1      5      10     20     50
25k tracks   19.84  17.98  15.21  14.06  13.41
150k tracks  23.85  21.31  19.81  18.06  15.18
And…
• Add more layers!
"Deep Residual Learning for Image Recognition", He et al.
P@k        1      5      10     20     50
PlainNet9  23.85  21.31  19.81  18.06  15.18
ResNet78   23.87  22.17  20.98  19.38  16.68
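The difference between a plain stack and a residual one comes down to the shortcut connection from He et al. The sketch below uses small dense layers instead of the convolutions and batch normalisation of the paper, purely to keep it short; the weights are toy values and nothing here reflects the actual PlainNet9 or ResNet78 architectures.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def plain_block(x, w1, b1, w2, b2):
    """Two stacked layers: y = relu(F(x))."""
    return relu(dense(relu(dense(x, w1, b1)), w2, b2))

def residual_block(x, w1, b1, w2, b2):
    """Same two layers plus an identity shortcut: y = relu(F(x) + x)."""
    fx = dense(relu(dense(x, w1, b1)), w2, b2)
    return relu([f + xi for f, xi in zip(fx, x)])

# With all-zero weights, F(x) = 0: the plain block loses the input,
# while the residual block passes it through unchanged.
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
x = [1.0, 2.0]
print(plain_block(x, zeros_w, zeros_b, zeros_w, zeros_b))
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))
```

Because each block only has to learn a correction on top of the identity, stacking many of them tends to stay trainable, which is the usual argument for going much deeper than a plain network.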
And?
• Data augmentation?
"Exploring data augmentation for improved singing voice detection with neural networks", Schlüter and Grill
• Recurrent Neural Networks?
• Siamese Network?
"An exploration of deep learning in music informatics", Humphrey et al.
• More data! Or a semi-supervised approach?
"Semi-supervised learning with ladder networks", Rasmus et al.
Questions?
@aloisgr @nilandmusic niland.io
Try it for yourself: http://demo.niland.io

Deep Learning Meetup #5
