Deep Learning with
Audio Signals
Prepare, Process, Design, Expect
Keunwoo Choi
Keunwoo Choi
QMUL, UK

ETRI, S. Korea

SNU, S. Korea

@keunwoochoi (twtr, github)
Research Scientist
WARNING
THIS MATERIAL IS WRITTEN FOR ATTENDEES IN
QCON.AI, NAMELY, SOFTWARE ENGINEERS AND DEEP
LEARNING PRACTITIONERS TO PROVIDE AN OFF-THE-
SHELF GUIDE. MY ADVICE MIGHT NOT BE THE FINAL
SOLUTION FOR YOUR PROBLEM, BUT WOULD BE A
GOOD STARTING POINT.

..ALSO, THERE'S NO SPOTIFY SECRET HERE :P
Content
• Prepare the dataset

• Pre-process the signal

• Design your network

• Expect the result
Prepare the datasets
or, know your data
Q. How to start an audio task?
LMGTFY
• Google them, of course

• But....
Audio dataset
• Lucky → exactly the same class(es), many of them, yay!

• Meh → same or similar classes, sounds alright..

• Ugh.. → there are 2 in freesound.org and 3 on youtube
Audio (or, sound) dataset
• Our algorithm is living in the
digital space

• So are the .wav files

• But,

the sound is in the real world
Our lovely cyberspace
Audio dataset
Source
Noise
Reverberation
Microphone
• Room reverberation image from https://johnlsayers.com/Recmanual/Pages/Reverb.htm
Audio dataset
Dear everyone,
YOU ARE ALWAYS IN THE
"UGH..." SITUATION
→ HOW TO BUILD A CORRECT
AUDIO DATASET?
What we can do
• Know your real situation

• You can mimic noise/reverberation/mic if you have

• clean/dry/high-quality source signals
DL models are robust only within the variance they've seen.
→ Good at interpolation.. only.
E.g., a model trained with clean signals probably can't deal with noisy signals
noisy environment cheap mic
Simulate the real world
clean signal + noise signal → noisy signal
dry signal * room impulse response → wet signal (convolution)
original signal → band-pass filter → recorded signal
What to Google
Noise
babble noise recording
home noise recording
cafe noise recording
street noise recording
white noise, brown noise
x_noise = x + alpha * noise
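A minimal sketch of the noise-mixing line above, assuming you want to pick alpha from a target signal-to-noise ratio rather than by ear. The function name `mix_at_snr`, the synthetic sine "recording" and the random "noise file" are stand-ins for illustration, not part of the original slides.

```python
import numpy as np

def mix_at_snr(x, noise, snr_db):
    """Return (x + alpha * noise, alpha), with alpha set by the requested SNR in dB."""
    x = np.asarray(x, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Loop or crop the noise so it matches the signal length.
    reps = int(np.ceil(len(x) / len(noise)))
    noise = np.tile(noise, reps)[: len(x)]
    signal_power = np.mean(x ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(signal_power / (alpha**2 * noise_power)), solved for alpha.
    alpha = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return x + alpha * noise, alpha

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # stand-in for a clean recording
noise = rng.standard_normal(sr // 2)         # stand-in for a cafe/street noise file
noisy, alpha = mix_at_snr(clean, noise, snr_db=10.0)
```

Drawing `snr_db` at random from the range you expect in deployment for every training example is one way to make the model see that variance.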
Reverberation

(maybe skip it)
room impulse responses, RIR
reverberation simulators
x_wet = np.convolve(x, rir)
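A rough sketch of the reverberation step, assuming scipy is available. The decaying-noise impulse response below is a toy stand-in (a measured RIR loaded from a file would normally take its place), and `add_reverb` is a hypothetical helper name.

```python
import numpy as np
from scipy import signal

def add_reverb(x, rir):
    """Convolve a dry signal with a room impulse response (RIR)."""
    wet = signal.fftconvolve(x, rir, mode="full")
    # Keep the original length; the reverb tail past the end is dropped.
    return wet[: len(x)]

sr = 16000
rng = np.random.default_rng(0)

# Toy RIR: exponentially decaying noise, 0.3 s long (an assumption for the demo).
n_rir = int(0.3 * sr)
rir = rng.standard_normal(n_rir) * np.exp(-np.arange(n_rir) / (0.05 * sr))
rir[0] = 1.0                      # direct sound
rir /= np.sqrt(np.sum(rir ** 2))  # unit energy keeps levels comparable

dry = np.zeros(sr)
dry[1000] = 1.0                   # a single click as the dry source
wet = add_reverb(dry, rir)        # the click now carries the room's decay
```

`fftconvolve` is used here because impulse responses are long, and FFT-based convolution is usually much faster than direct convolution at those lengths.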
Microphone
band pass filter
scipy.signal filtering
microphone specification
speaker specification
microphone frequency response
scipy.signal.convolve

scipy.signal.fftconvolve

Or trimming-off your
spectrograms
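One possible way to mimic a cheap microphone with scipy.signal filtering, as a sketch only. The 300-3400 Hz passband and the filter order are assumptions chosen for the demo (roughly telephone-like); a real device's frequency response from its specification sheet should replace them. `simulate_cheap_mic` is a hypothetical name.

```python
import numpy as np
from scipy import signal

def simulate_cheap_mic(x, sr, low_hz=300.0, high_hz=3400.0, order=4):
    """Band-pass the signal to imitate a narrow-band microphone."""
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass",
                        fs=sr, output="sos")
    return signal.sosfilt(sos, x)

sr = 16000
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 100.0 * t)   # below the passband
voice = np.sin(2 * np.pi * 1000.0 * t)   # inside the passband

def rms(v):
    return np.sqrt(np.mean(v ** 2))

rumble_gain = rms(simulate_cheap_mic(rumble, sr)) / rms(rumble)
voice_gain = rms(simulate_cheap_mic(voice, sr)) / rms(voice)
print(f"gain at 100 Hz: {rumble_gain:.3f}, gain at 1 kHz: {voice_gain:.3f}")
```

The 100 Hz tone is strongly attenuated while the 1 kHz tone passes almost unchanged, which is the band-limiting effect the slide refers to.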
Pre-process the signals
or, log(melgram)
Q. What to do after loading the signals?
Digital Audio 101
• 1 second of digital audio:

size=(44100, ), dtype=int16

• MNIST: (28, 28, 1), uint8

CIFAR10: (32, 32, 3), uint8

ImageNet: (256, 256, 3), uint8

• Audio: Lots of data points in
one item!
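The comparison above as plain arithmetic. The 30-second clip is an added assumption, included only because real audio items are rarely one second long.

```python
import math

items = {
    "MNIST image": (28, 28, 1),
    "CIFAR10 image": (32, 32, 3),
    "ImageNet image (resized)": (256, 256, 3),
    "1 s of audio @ 44.1 kHz": (44100,),
    "30 s of audio @ 44.1 kHz": (30 * 44100,),  # assumed clip length
}

# Number of values in one item of each kind.
counts = {name: math.prod(shape) for name, shape in items.items()}

for name, n in counts.items():
    print(f"{name:26s} {n:>9,d} values")
```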
Audio representations
(for e.g., 1 second, sampling rate=44100)

Type          Description                          Data shape and size
Waveform      x                                    44100 x [int16]
Spectrograms  STFT(x)                              513 x 87 x [float32]
              Melspectrogram(x)                    128 x 87 x [float32]
              CQT(x)                               72 x 87 x [float32]
Features      MFCC(x) = some process on STFT(x)    20 x 87 x [float32]
Spoiler: log10(Melspectrograms) for the win,
but let's see some details
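Where the 513 and 87 in these shapes plausibly come from, as a small sketch. The slide does not state the analysis parameters; n_fft=1024 and hop=512 are assumptions that happen to reproduce its numbers, and the frame-count formula is the one for a centre-padded STFT (librosa's default behaviour).

```python
def n_stft_bins(n_fft):
    # A real-valued signal gives n_fft // 2 + 1 non-redundant frequency bins.
    return n_fft // 2 + 1

def n_frames(n_samples, hop_length):
    # Frame count when the signal is centre-padded before framing.
    return 1 + n_samples // hop_length

sr = 44100               # 1 second of audio
n_fft, hop = 1024, 512   # assumed; not given on the slide

shapes = {
    "STFT": (n_stft_bins(n_fft), n_frames(sr, hop)),
    "Melspectrogram": (128, n_frames(sr, hop)),  # 128 mel bands
    "CQT": (72, n_frames(sr, hop)),              # e.g. 6 octaves x 12 bins
    "MFCC": (20, n_frames(sr, hop)),             # 20 coefficients
}
for name, shape in shapes.items():
    print(f"{name:15s} {shape[0]} x {shape[1]}")
```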
Spectrograms
• 2-dim representation of audio signal
TODO: IMAGE
Practitioner's choice
• Rule of thumb: DISCARD ALL THE REDUNDANCY

• Sample rate, or bandwidth

• Goal: To optimize the input audio data for your model

• by resampling - can be computationally heavier

• by discarding some freq bands - can be storage heavy
https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
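The two options above, sketched with scipy under assumed numbers: a 44.1 kHz source, a 16 kHz target rate, and an 8 kHz cutoff. The random signal is only a stand-in for one second of audio.

```python
import math
import numpy as np
from scipy import signal

sr_in, sr_out = 44100, 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(sr_in)  # 1 s stand-in for a recording

# Option 1: resample. More computation up front, smaller data afterwards.
g = math.gcd(sr_in, sr_out)
x_16k = signal.resample_poly(x, up=sr_out // g, down=sr_in // g)

# Option 2: keep the sample rate, compute the STFT, and discard the
# frequency bins above the band of interest. No resampling cost, but
# the stored waveform stays at full size.
n_fft, hop = 1024, 512
freqs, _, Z = signal.stft(x, fs=sr_in, nperseg=n_fft, noverlap=n_fft - hop)
keep = freqs <= 8000.0
S_cropped = np.abs(Z[keep])

print(len(x), "->", len(x_16k), "samples")
print(Z.shape[0], "->", S_cropped.shape[0], "frequency bins")
```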
Practitioner's choice
• Melspectrogram

- in decibel scale

- which only covers the frequency range you're
interested in.

• Why?

- smaller, therefore easier and faster training

- perceptual - weighing more on the freq region where
humans are more interested

- faster than CQT to compute

- decibel scale - another perceptually motivated choice
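A numpy/scipy sketch of the recipe above: a mel-spectrogram, in decibels, restricted to a chosen frequency range. Libraries such as librosa provide this directly; the version here only spells out the steps. The HTK-style mel formula, the triangular filters, and the values n_mels=64, fmin=50 and fmax=8000 are assumptions for the demo, and the absolute dB level depends on the STFT scaling used.

```python
import numpy as np
from scipy import signal

def hz_to_mel(f):
    # HTK-style mel formula (libraries differ in the exact variant).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_melspectrogram(x, sr, n_fft=1024, hop=512, n_mels=64,
                       fmin=50.0, fmax=8000.0):
    """Return a (n_mels, n_frames) mel-spectrogram in dB."""
    _, _, Z = signal.stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Z) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels, fmin, fmax) @ power
    # Floor before the log so silent bands stay finite.
    return 10.0 * np.log10(np.maximum(mel, 1e-10))

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)   # 1 s test tone
logmel = log_melspectrogram(x, sr)
print(logmel.shape)
```

Only the bands between fmin and fmax are ever computed, so the frequency range you care about is a parameter of the representation rather than a separate cropping step.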
Q. Ok, how can I compute them?
import librosa
import madmom
• Python libraries - librosa/madmom/scipy/.. 

• Computations on CPU

• Best when all the processing will be done before
the training
import kapre
• Keras Audio Preprocessing layers

• CPU and GPU

• Best when you want to do things on the fly/GPU

= Best to optimize audio-related parameters
• pip install kapre

• There's also pytorch-audio!

Disclaimer: I'm the maintainer of kapre
Design your network
or, know the assumptions
Q. What kind of network structure do I need?
A dumb-but-strong-therefore-good-while-
annoying-since-it's-from-computer-vision
baseline approach
• Trim the signals properly (e.g. 1-sec)

• Do the classification with 2D
convnet, 3x3 kernel (aka VGGNet)

• Raise $1B
• Retire
• Post "why i retired.." on Medium
• Happy life!
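A small sketch of the "trim the signals properly" step, since a 2D convnet baseline usually wants inputs of one fixed size. The helper names and the choice between centre-cropping and splitting into segments are illustrative assumptions; which one fits depends on where the target sound sits in your recordings.

```python
import numpy as np

def to_fixed_length(x, n_target):
    """Centre-crop or zero-pad a 1-D signal to exactly n_target samples."""
    n = len(x)
    if n >= n_target:
        start = (n - n_target) // 2
        return x[start:start + n_target]
    pad = n_target - n
    return np.pad(x, (pad // 2, pad - pad // 2))

def split_into_segments(x, n_target):
    """Cut a long signal into non-overlapping segments, dropping the tail."""
    n_seg = len(x) // n_target
    return x[: n_seg * n_target].reshape(n_seg, n_target)

sr = 16000
rng = np.random.default_rng(0)
long_clip = rng.standard_normal(int(3.7 * sr))   # stand-in for a 3.7 s recording
short_clip = rng.standard_normal(int(0.4 * sr))  # stand-in for a 0.4 s recording

one_sec = to_fixed_length(long_clip, sr)         # shape (16000,)
padded = to_fixed_length(short_clip, sr)         # shape (16000,), zeros at both ends
segments = split_into_segments(long_clip, sr)    # shape (3, 16000)
```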
Go even dumber
• Just download some pre-trained networks for..

- music

- audio

- image (?)

• Re-use it for your task (aka transfer learning)

• 1B - retire - Medium - happy - repeat
Better and stronger,
by understanding assumptions
• assert "Receptive field" size == size of the target pattern

• How sparse is the target pattern?

- Bird singing sparse? 

- Voice-in-music sparse? 

- Distortion-guitar-in-Metallica sparse?
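To check the receptive-field assumption above against a concrete network, the standard recurrence for stacked convolution and pooling layers can be applied along one axis. The layer stack below (four blocks of 3-wide conv followed by 2-wide, stride-2 pooling) and the hop/window values are assumptions for illustration, not a network from the slides.

```python
def receptive_field(layers):
    """Receptive field along one axis.

    layers: list of (kernel_size, stride), ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) input steps
        jump *= stride              # striding spreads later layers further apart
    return rf

# VGG-like stack along the time axis: (3x3 conv, 2x2 max-pool) repeated 4 times.
vgg_like = [(3, 1), (2, 2)] * 4
rf_frames = receptive_field(vgg_like)

# Convert spectrogram frames to seconds (assumed analysis parameters).
sr, n_fft, hop = 44100, 1024, 512
rf_seconds = ((rf_frames - 1) * hop + n_fft) / sr

print(f"{rf_frames} frames, about {rf_seconds:.2f} s of audio")
```

If the pattern you want to detect lasts much longer than this span, the network cannot see it in one piece unless depth, pooling, or the hop size changes.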
Have no idea?
• Go see how computer vision people are doing

• Clone it

• It's ok, it's a good baseline at least
My spectrogram is 28x28 bc
the model I downloaded is
trained on MNIST
Don't use spectrograms as if
they are images
It all boils down to the
pattern recognition, they're
actually similar tasks.
the time and frequency axes
have totally different
meanings
I don't know how to
incorporate them into my
model.. BUT IT WORKS!
Expecting the result
or, know the problem
Q. How would it work?
YOU
• You are responsible for the feasibility

• Is it a task you can do?

• Is the information in the input (mel-spectrogram)?

• Are similar tasks being solved?
Think about it!
• Is it possible? To what extent? E.g., 

• Baby crying detection

• Baby crying recognition and classification

• Dog barking translation

• Hit song detection
Conclusion
Conclusion..
Conclusion!
Conclusion
• Sound is analog, you might need to think about some
analog process, too.

• Pre-process: Follow others when you're lost

• Audio is big in data size, but sparse in information.
Reduce the size. Don't start with end-to-end.

• Design: Follow others when you're lost

• Expect: Make sure it's doable
Deep Learning with
Audio Signal
Prepare, Process, Design, Expect
Keunwoo Choi
Q&A
PS. See you soon at the panel talk!
