TEACHING COMPUTERS TO LISTEN TO MUSIC

Eric Battenberg
ebattenberg@gracenote.com
http://ericbattenberg.com

Gracenote, Inc.
Previously: UC Berkeley, Dept. of EECS; Parallel Computing Laboratory; CNMAT (Center for New Music and Audio Technologies)
2. BUSINESS VERTICALS

Rich, diverse multimedia metadata for:
- Cloud music services, mobile apps, and devices
- Smart TVs, second-screen apps, targeted ads
- Connected infotainment systems on the road
3. SOME OF GRACENOTE'S PRODUCTS

- Original product: CDDB (1998), which recognizes CDs and retrieves the associated metadata.
- Scan & Match: quickly and precisely identify digital music in large catalogs.
- ACR (Automatic Content Recognition): audio/video fingerprinting enabling second-screen experiences, plus real-time audio classification systems for mobile devices.
- Contextually aware cross-modal personalization:
  - Taste profiling for music, TV, movies, and social networks.
  - Music mood descriptors that combine machine learning and editorial training data.
4. GRACENOTE'S CUSTOMERS

- Music
- Video
- Auto
6. MACHINE LISTENING

- Speech processing
  - Speech processing makes up the vast majority of funded machine listening research.
  - Just as there's more to computer vision than OCR, there's more to machine listening than speech recognition!
- Audio content analysis
  - Audio classification (music, speech, noise, laughter, cheering)
  - Audio fingerprinting (e.g., Gracenote, Shazam)
  - Audio event detection (new song, channel change, hotword)
- Content-based Music Information Retrieval (MIR)
  - Today's topic
7. GETTING COMPUTERS TO "LISTEN" TO MUSIC

- Not trying to get computers to "listen" for enjoyment.
- More accurate: analyzing music with computers.
- What kind of information do we want out of the analysis?
  - What instruments are playing?
  - What is the mood?
  - How fast or slow is it (tempo)?
  - What does the singer sound like?
  - How can I play this song on my instrument?

(Artists pictured: Ben Harper, James Brown.)
8. CONTENT-BASED MUSIC INFORMATION RETRIEVAL

Many tasks:
- Genre and mood classification, auto-tagging
- Beat tracking, tempo detection
- Music similarity, playlisting, recommendation (heavily aided by collaborative filtering)
- Automatic music transcription
- Source separation (voice, drums, bass, ...)
- Music segmentation (verse, chorus)
9. TALK SUMMARY

- Introduction to Music Information Retrieval (MIR)
  - Some common techniques: auto-tagging, onset detection
- Exciting new research directions
  - Deep learning!
- Example: Live Drum Understanding
  - Drum detection/transcription
11. QUICK LESSON: THE SPECTROGRAM

The spectrogram is a very common feature used in audio analysis.
- It is a time-frequency representation of audio.
- Take the FFT of adjacent frames of audio samples and stack them in a matrix.
- Each column shows the frequency content at a particular time.

(Figure: a spectrogram, with frequency on the vertical axis and time on the horizontal axis. A NumPy sketch follows.)
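To make this concrete, here is a minimal NumPy sketch of spectrogram computation; the frame length, hop size, and window choice are illustrative assumptions, not values from the talk.

```python
import numpy as np

def spectrogram(x, frame_len=1024, hop=512):
    """Magnitude spectrogram: FFT of overlapping, windowed frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Each row is one frame's spectrum; transpose so columns index time.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Usage: one second of a 440 Hz tone at 22.05 kHz.
sr = 22050
t = np.arange(sr) / sr
S = spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)  # (513, 42): frequency bins x time frames
```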
12. MUSIC AUTO-TAGGING: TYPICAL APPROACHES

- Typical approach (sketched in code below):
  - Extract a bunch of hand-designed features describing small windows of the signal (e.g., spectral centroid, kurtosis, harmonicity, percussiveness, MFCCs, and hundreds more).
  - Train a GMM or SVM to predict genre/mood/tags.
- Pros:
  - Works fairly well; was state of the art for a while.
  - Well-understood models, with implementations widely available.
- Cons:
  - The bag-of-frames approach cannot describe rhythm or temporal dynamics.
  - Getting further improvements requires hand-designing more features.
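As a hedged illustration of the bag-of-frames recipe, here is a sketch using librosa for MFCCs and scikit-learn for the SVM; the feature set, clip-level statistics, file paths, and tags are placeholders, not the exact systems surveyed in the talk.

```python
import librosa
import numpy as np
from sklearn.svm import SVC

def bag_of_frames_features(path):
    """Summarize per-frame MFCCs with clip-level statistics."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, n_frames)
    # Bag of frames: discard time ordering, keep mean and std per coefficient.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled clips (paths and tags are placeholders).
paths = ["rock1.wav", "rock2.wav", "jazz1.wav", "jazz2.wav"]
tags = ["rock", "rock", "jazz", "jazz"]

X = np.stack([bag_of_frames_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, tags)
print(clf.predict(X[:1]))
```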
13. LEARNING FEATURES: NEURAL NETWORKS

- Each layer computes a non-linear transformation of the previous layer.
- Train to minimize output error; each hidden layer can be thought of as a set of learned features.
- Train using backpropagation, iterating these steps (sketched in code below):
  - Feedforward: compute activations.
  - Compute the output error.
  - Backpropagate the error signal.
  - Compute gradients.
  - Update all weights.

(Figure: a network with sigmoid non-linearities and outputs between 0.0 and 1.0; arrows show the feedforward and backpropagation passes.)

Resurgence of neural networks:
- More compute (GPUs, Google, etc.)
- More data
- A few new tricks...
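Here is a minimal sketch of that feedforward/backpropagate/update loop; the single hidden layer, squared-error loss, and full-batch gradient descent are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 3 inputs, 1 target (all illustrative).
X = rng.random((4, 3))
y = rng.random((4, 1))

W1, W2 = rng.standard_normal((3, 5)), rng.standard_normal((5, 1))
lr = 0.5

for step in range(1000):
    # Feedforward: each layer is a non-linear transform of the previous one.
    h = sigmoid(X @ W1)          # hidden layer = learned "features"
    out = sigmoid(h @ W2)
    # Output error, then backpropagate the error signal.
    err = out - y
    d_out = err * out * (1 - out)            # sigmoid derivative
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Compute gradients and update all weights.
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h

print(float(np.mean(err ** 2)))  # training error shrinks toward zero
```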
14. DEEP NEURAL NETWORKS

- Deep neural networks: millions to billions of parameters and many layers of "features"; achieving state-of-the-art performance in vision and speech tasks.
- Problem: the vanishing error signal.
  - The weights of the lower layers do not change much.
- Solutions:
  - Train for a really long time.
  - Pre-train each hidden layer as an autoencoder [Hinton, 2006].
  - Rectified Linear Units [Krizhevsky, 2012].

(Figure: the same feedforward/backpropagation diagram, contrasting a sigmoid non-linearity with a rectifier non-linearity.)
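A small numeric illustration of why rectifiers help (my example, not from the slides): the sigmoid derivative is at most 0.25, so error signals shrink geometrically as they pass down through sigmoid layers, while the rectifier's derivative is exactly 1 wherever the unit is active.

```python
import numpy as np

z = np.linspace(-6, 6, 5)
sig = 1 / (1 + np.exp(-z))
print(sig * (1 - sig))          # sigmoid gradient: <= 0.25 everywhere
print((z > 0).astype(float))    # ReLU gradient: exactly 1 when active
# Through 10 sigmoid layers the error signal scales by at most 0.25**10:
print(0.25 ** 10)               # ~1e-6, i.e., a vanished error signal
```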
15. AUTOENCODERS AND UNSUPERVISED FEATURE LEARNING

Many ways to learn features in an unsupervised way:
- Autoencoders: train a network to reconstruct the input (sketch below).
  - Denoising autoencoders [Vincent, 2008]
  - Sparse autoencoders
- Restricted Boltzmann Machines (RBMs) [Hinton, 2006]
- Clustering: k-means, mixture models, etc.
- Sparse coding: learn an overcomplete dictionary of features under a sparsity constraint.

(Figure: data -> hiddens (features) -> reconstructed data; the autoencoder is trained to reconstruct its input.)
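A minimal sketch of the autoencoder idea, reusing the training loop from the earlier network (layer sizes and data are illustrative): the target is the input itself, and the hidden code becomes the learned features.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.random((20, 8))                      # toy "data" to reconstruct
W_enc = rng.standard_normal((8, 3)) * 0.1    # 8 inputs -> 3 features
W_dec = rng.standard_normal((3, 8)) * 0.1

for step in range(2000):
    h = sigmoid(X @ W_enc)       # hidden code = learned features
    recon = sigmoid(h @ W_dec)   # reconstruction of the input
    err = recon - X              # train to reconstruct the input
    d_r = err * recon * (1 - recon)
    d_h = (d_r @ W_dec.T) * h * (1 - h)
    W_dec -= 0.1 * h.T @ d_r
    W_enc -= 0.1 * X.T @ d_h

print(float(np.mean(err ** 2)))  # reconstruction error should be small
```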
16. MUSIC AUTO-TAGGING: NEWER APPROACHES

Newer approaches to feature extraction:
- Learn spectral features using Restricted Boltzmann Machines (RBMs) and Deep Neural Networks (DNNs) [Hamel, 2010]: good genre performance.
- Learn sparse features using Predictive Sparse Decomposition (PSD) [Henaff, 2011]: good genre performance.
- Learn beat-synchronous rhythm and timbre features with RBMs and DNNs [Schmidt, 2013]: improved mood performance.

Pros:
- Hand-designing individual features is not required.
- Computers can learn complex high-order features that humans cannot hand-code.

(Figure: some learned frequency features, from [Henaff, 2011].)

Missing pieces:
- More work on incorporating time (context, rhythm, and long-term structure) into feature learning.
17. RECURRENT NEURAL NETWORKS

- Non-linear sequence model.
- Hidden units have connections to the previous time step.
- Unlike HMMs, RNNs can model long-term dependencies using distributed hidden state.
- Recent developments (plus more compute) have made them much more feasible to train.

(Figure: a standard recurrent neural network with input units, distributed hidden state, and output units; from [Sutskever, 2013]. A one-step sketch of the recurrence follows.)
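To ground the idea, here is a minimal sketch of one recurrent step (dimensions and initialization are illustrative): the distributed hidden state h carries information forward in time.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 4, 16
W_xh = rng.standard_normal((n_in, n_hid)) * 0.1
W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)

def rnn_step(x_t, h_prev):
    # Hidden units see the current input AND the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

h = np.zeros(n_hid)
for x_t in rng.random((10, n_in)):   # a 10-step toy input sequence
    h = rnn_step(x_t, h)             # state is distributed across 16 units
print(h.shape)
```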
18. APPLYING RNNS TO ONSET DETECTION

Onset detection:
- Detect the beginnings of notes.
- Important for music transcription, beat tracking, etc.
- Onset detection can be hard for instruments with ambiguous attacks or in the presence of background interference.

(Figure: an audio signal and the corresponding onset envelope, from [Bello, 2005]. A classic non-neural baseline is sketched below.)
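Before the RNN approach on the next slide, here is a sketch of one classic way to compute an onset envelope: spectral flux, the summed positive spectral change between frames. This baseline is my illustration, not a method from the slides; the STFT parameters and peak threshold are arbitrary.

```python
import librosa
import numpy as np

sr = 22050
# Synthetic test signal: clicks at known times stand in for note onsets.
y = librosa.clicks(times=[0.2, 0.5, 0.9], sr=sr, length=sr)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Spectral flux: positive magnitude change between frames, summed over frequency.
flux = np.maximum(0.0, np.diff(S, axis=1)).sum(axis=0)
envelope = flux / (flux.max() + 1e-12)

# Simple peak picking over the envelope yields candidate onsets.
peaks = np.flatnonzero((envelope[1:-1] > envelope[:-2]) &
                       (envelope[1:-1] >= envelope[2:]) &
                       (envelope[1:-1] > 0.5)) + 1
print(librosa.frames_to_time(peaks, sr=sr, hop_length=512))  # ~0.2, 0.5, 0.9
```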
19. ONSET DETECTION: STATE-OF-THE-ART

- Uses recurrent neural networks (RNNs) [Eyben, 2010], [Böck, 2012].
- Pipeline: spectrogram + Δ's at 3 time scales → 3-layer RNN → onset detection function.
- The RNN output is trained to predict onset locations.
- 80-90% accurate (state of the art), compared to 60-80% for earlier approaches.
- Can improve with more labeled training data, or possibly more unsupervised training.

A rough sketch of such a network appears below.
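As a rough modern sketch of this kind of architecture in Keras (the layer sizes, two-layer depth, bidirectionality, and training setup are my assumptions; [Eyben, 2010] used bidirectional LSTMs): frame-wise spectrogram features in, a per-frame onset probability out.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 3 * 80  # e.g., 80 bands at 3 time scales (illustrative)

model = keras.Sequential([
    keras.Input(shape=(None, n_features)),             # variable-length sequences
    layers.Bidirectional(layers.LSTM(25, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(25, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy batch: 8 clips, 200 frames each; targets mark onset frames with 1s.
X = np.random.rand(8, 200, n_features).astype("float32")
y = (np.random.rand(8, 200, 1) > 0.95).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
```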
20. OTHER EXAMPLES OF RNNS IN MUSIC

Examples:
- Classical music generation and transcription [Boulanger-Lewandowski, 2012]
- Polyphonic piano note transcription [Böck, 2012]
- NMF-based source separation with predictive constraints from an RNN [ICASSP 2014, "Deep Learning in Music"]

RNNs are a promising way to model the long-term contextual and temporal dependencies present in music.
22. LIVE DRUM TRANSCRIPTION

- Real-time/live operation.
- Useful with any percussion setup:
  - Before a performance, we can quickly train the system for a particular percussion setup.
  - Or train a more general model for common drum types.
- Amplitude (dynamics) information: very important for musical understanding.
23. DRUM DETECTION SYSTEM

(System diagram; the labels below reconstruct its flow.)
- Feature extraction (training and performance): audio → Onset Detection → Spectrogram Slice Extraction.
- Drum modeling (training): training data (drum-wise audio) → Gamma Mixture Model → cluster parameters (drum templates).
- Source separation (performance): performance (raw audio) → Non-negative Vector Decomposition against the drum templates → drum activations.
25. SPECTROGRAM SLICES

- Slices are extracted at detected onsets.
- Each slice contains 100 ms of audio across 80 frequency bands (the figure marks 33 ms and 67 ms segments around the detected onset and highlights the "head" slice).

A sketch of slice extraction follows.
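A minimal sketch of slice extraction; the hop size and linear band indexing are illustrative (per the editor's notes, the actual system uses 80 Bark-scale bands).

```python
import numpy as np

def extract_slices(S, onset_frames, sr=44100, hop=256, slice_ms=100):
    """Grab a fixed-length block of spectrogram columns at each onset."""
    n_cols = int(round(slice_ms / 1000 * sr / hop))  # ~100 ms of frames
    slices = []
    for f in onset_frames:
        if f + n_cols <= S.shape[1]:
            slices.append(S[:, f : f + n_cols])      # (n_bands, n_cols)
    return np.stack(slices)

# Toy usage: an 80-band spectrogram with onsets at frames 10 and 50.
S = np.random.rand(80, 200)
print(extract_slices(S, [10, 50]).shape)  # (2, 80, 17)
```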
27. MODELING DRUM SOUNDS WITH GAMMA MIXTURE MODELS

- Instead of taking an "average" of all training slices for a single drum, model each drum using the means from a gamma mixture model.
  - This captures the range of sounds produced by each drum.
- Cheaper to train than a GMM (no covariance matrix).
- More stable than a GMM (no covariance matrix).
- Clusters according to human auditory perception, not Euclidean distance.
- Keeps the audio spectrogram in the linear domain (not the log domain), which is important for linear source separation.

A simplified template-learning sketch follows.
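As a simplified stand-in for the gamma mixture model (which I won't reproduce here), this sketch learns multiple templates per drum by k-means clustering of linear-domain slices; note the real system clusters under a perceptually motivated divergence rather than Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_templates(slices, max_templates=4):
    """Cluster a drum's training slices; cluster means become templates.

    slices: (n_hits, n_bands, n_cols) linear-magnitude spectrogram slices.
    Stand-in: k-means with Euclidean distance; the talk's system instead
    fits a gamma mixture model with a perceptual clustering criterion.
    """
    X = slices.reshape(len(slices), -1)          # flatten each slice
    k = min(max_templates, len(slices))
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    templates = km.cluster_centers_.reshape(k, *slices.shape[1:])
    return np.maximum(templates, 0)              # keep non-negative, linear domain

# Toy usage: 50 noisy hits of one "drum".
hits = np.abs(np.random.randn(50, 80, 17))
print(learn_templates(hits).shape)  # (4, 80, 17)
```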
29. DECOMPOSING ONSETS ONTO TEMPLATES

- Non-negative Vector Decomposition (NVD): a simplification of Non-negative Matrix Factorization (NMF).
- The W matrix contains the drum templates in its columns: bass, snare, hi-hat (closed), hi-hat (open), ride.
- Solve for the non-negative activation vector h in x ≈ W h, where x is the input spectrogram slice.
- Adding a sparsity penalty (L1) on h improves NVD.

(Figure: the input slice x decomposed as W * h; h is the output vector of drum activations. A sketch of the sparse NVD update follows.)
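Here is a sketch of sparse NVD using the standard multiplicative NMF update with W held fixed; the L1 weight and iteration count are illustrative, and the exact solver in the talk may differ.

```python
import numpy as np

def nvd(x, W, l1=0.1, n_iter=200, eps=1e-9):
    """Solve min_h ||x - W h||^2 + l1 * ||h||_1 subject to h >= 0.

    Multiplicative update for Euclidean NMF with a fixed dictionary W;
    the L1 term in the denominator pushes h toward sparsity.
    """
    h = np.full(W.shape[1], 1.0)
    for _ in range(n_iter):
        h *= (W.T @ x) / (W.T @ (W @ h) + l1 + eps)
    return h

# Toy usage: 5 random templates; the input is 1.0x template 0 + 0.5x template 3.
rng = np.random.default_rng(3)
W = rng.random((80 * 17, 5))          # flattened drum templates in columns
x = W[:, 0] + 0.5 * W[:, 3]
print(np.round(nvd(x, W), 2))          # activations largest for drums 0 and 3
```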
30. EVALUATION

Test data:
- Recorded to stereo using multiple microphones.
- 8 different drums/cymbals.
- 50-100 training hits per drum.
- Vary the maximum number of templates per drum.
31. DETECTION RESULTS

- Varying the maximum number of templates per drum: more templates = better accuracy.

(Figure: detection accuracy (F-score) and amplitude accuracy (cosine similarity), each plotted against the number of templates per drum. Both metrics are sketched below.)
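For reference, here is a sketch of the two reported metrics; this is my implementation of the standard definitions, with made-up numbers.

```python
import numpy as np

def f_score(n_correct, n_detected, n_actual):
    """Detection accuracy: harmonic mean of precision and recall."""
    precision = n_correct / n_detected
    recall = n_correct / n_actual
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(a, b):
    """Amplitude accuracy: cosine of the angle between amplitude vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f_score(90, 100, 110))                         # ~0.857
print(cosine_similarity(np.array([1.0, 0.5, 0.2]),
                        np.array([0.9, 0.6, 0.1])))  # close to 1.0
```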
32. DRUM DETECTION: MAIN POINTS

- Gamma mixture model, for learning spectral drum templates:
  - Cheaper to train than a GMM.
  - More stable than a GMM.
  - Keeps the audio spectrogram in the linear domain (not the log domain).
- Non-negative Vector Decomposition (NVD), for computing template activations from drum onsets.
- Learning multiple templates per drum improves source separation and detection.
33. AUDIO EXAMPLES

- 100 BPM, rock, syncopated snare drum, fills/flams.
- Note the messy fills in the single-template version.

(Audio clips: original performance; multiple templates per drum; single template per drum.)
34. AUDIO EXAMPLES

- 181 BPM, fast rock, open hi-hat.
- Note the extra hi-hat notes in the single-template version.

(Audio clips: original performance; multiple templates per drum; single template per drum.)
35. AUDIO EXAMPLES

- 94 BPM, snare drum march, accented notes.
- Note the extra bass drum notes and many extra cymbals in the single-template version.

(Audio clips: original performance; multiple templates per drum; single template per drum.)
36. SUMMARY

- Content-based Music Information Retrieval:
  - Mood, genre, onsets, beats, transcription, recommendation, and much, much more!
  - Exciting directions include deep/feature learning and RNNs.
- Drum detection:
  - Gamma mixture model template learning.
  - Learning multiple templates = better source separation performance.
38. GETTING INVOLVED IN MUSIC INFORMATION RETRIEVAL

- Check out the proceedings of ISMIR (free online): http://www.ismir.net/
- Participate in MIREX (the annual MIR evaluation): http://www.music-ir.org/mirex/wiki/MIREX_HOME
- Join the Music-IR mailing list: http://listes.ircam.fr/wws/info/music-ir
- Join the Music Information Retrieval Google Plus community (just started): https://plus.google.com/communities/109771668656894350107
39. REFERENCES

Genre/Mood Classification:
- [1] P. Hamel and D. Eck, "Learning features from music audio with deep belief networks," Proc. of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 339–344, 2010.
- [2] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, "Unsupervised learning of sparse features for scalable audio classification," Proceedings of the International Symposium on Music Information Retrieval (ISMIR '11), 2011.
- [3] J. Anden and S. Mallat, "Deep Scattering Spectrum," arXiv.org, 2013.
- [4] E. Schmidt and Y. Kim, "Learning rhythm and melody features with deep belief networks," ISMIR, 2013.
40. REFERENCES

Onset Detection:
- [1] J. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, p. 1035, 2005.
- [2] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 342, 2006.
- [3] J. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556, 2004.
- [4] S. Böck, A. Arzt, F. Krebs, and M. Schedl, "Online real-time onset detection with recurrent neural networks," presented at the International Conference on Digital Audio Effects (DAFx-12), 2012.
- [5] F. Eyben, S. Böck, B. Schuller, and A. Graves, "Universal onset detection with bidirectional long short-term memory neural networks," Proc. of ISMIR, 2010.
41. REFERENCES

Neural Networks:
- [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.

Unsupervised Feature Learning:
- [2] G. E. Hinton and S. Osindero, "A fast learning algorithm for deep belief nets," Neural Computation, 2006.
- [3] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," pp. 1096–1103, 2008.

Recurrent Neural Networks:
- [4] I. Sutskever, "Training Recurrent Neural Networks," 2013.
- [6] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," pp. 747–756, 2002.
- [7] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," pp. 121–124, 2012.
- [8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint arXiv:1206.6392, 2012.
- [9] G. Taylor and G. E. Hinton, "Two distributed-state models for generating high-dimensional time series," Journal of Machine Learning Research, 2011.
42. REFERENCES

Drum Understanding:
- [1] E. Battenberg, "Techniques for Machine Understanding of Live Drum Performances," PhD thesis, University of California, Berkeley, 2012.
- [2] E. Battenberg and D. Wessel, "Analyzing drum patterns using conditional deep belief networks," presented at the International Society for Music Information Retrieval Conference, Porto, Portugal, 2012.
- [3] E. Battenberg, V. Huang, and D. Wessel, "Toward live drum separation using probabilistic spectral clustering based on the Itakura-Saito divergence," presented at the Audio Engineering Society Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio, 2012, vol. 3.
43. THANK YOU!


Editor's Notes

- #4: And ultimately, transform how people experience their favorite movies, TV shows, and music.
- #13: GMM = Gaussian Mixture Model; SVM = Support Vector Machine.
- #23: Snare, bass, and hi-hat are not all there is to drumming.
- #24: Feature extraction -> machine learning.
- #25: Feature extraction -> machine learning.
- #26: 80 Bark bands, down from 513 positive FFT coefficients. Work in 2008 shows that this type of spectral dimensionality reduction speeds up NMF while retaining/improving separation performance.
- #27: Feature extraction -> machine learning.
- #29: Feature extraction -> machine learning.
- #30: The \vec{h}_i are the template activations.
- #31: Superior Drummer 2.0 is highly multi-sampled, so it is a good approximation of a real-world signal (though it is probably a little too professional-sounding for most people).
- #34-#36: Actual optimal number of templates per drum (bass, snare, closed hi-hat, open hi-hat, ride): head slices {3, 4, 3, 2, 2}; tail slices {2, 4, 3, 2, 2}.