Deep Learning with Audio Signals: Prepare, Process, Design, Expect

Deep Learning with
Audio Signals
Prepare, Process, Design, Expect
Keunwoo Choi

Keunwoo Choi
QMUL, UK

ETRI, S. Korea

SNU, S. Korea

@keunwoochoi (Twitter, GitHub)
Research Scientist

WARNING
THIS MATERIAL IS WRITTEN FOR ATTENDEES OF
QCON.AI, NAMELY SOFTWARE ENGINEERS AND DEEP
LEARNING PRACTITIONERS, TO PROVIDE AN OFF-THE-SHELF
GUIDE. MY ADVICE MIGHT NOT BE THE FINAL
SOLUTION FOR YOUR PROBLEM, BUT IT SHOULD BE A
GOOD STARTING POINT.

..ALSO, THERE'S NO SPOTIFY SECRET HERE :P

Content
• Prepare the dataset

• Pre-process the signal

• Design your network

• Expect the result

Prepare the datasets
or, know your data
Q. How to start an audio task?

LMGTFY
• Google them, of course

• But....

Audio dataset
• Lucky → exactly the same class(es), many of them, yay!

• Meh → same or similar classes, sounds alright..

• Ugh.. → there are 2 in freesound.org and 3 on youtube

Audio (or, sound) dataset
• Our algorithm lives in the
digital space

• So do the .wav files

• But the sound is in the real world
[Image: our lovely cyberspace]

Audio dataset
Source
Noise
Reverberation
Microphone
• Room reverberation image from https://johnlsayers.com/Recmanual/Pages/Reverb.htm

Audio dataset
Dear everyone,
YOU ARE ALWAYS IN THE
"UGH..." SITUATION
→ HOW TO BUILD A CORRECT
AUDIO DATASET?

What we can do
• Know your real situation

• You can mimic noise/reverberation/mic if you have
clean/dry/high-quality source signals
DL models are robust only within the variance they've seen.
→ Good at interpolation.. only.
E.g., a model trained on clean signals probably can't deal with noisy signals
(noisy environment, cheap mic)

Simulate the real world
• clean signal + noise signal → noisy signal
• dry signal * room impulse response → wet signal (* = convolution)
• original signal → band-pass filter → recorded signal
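
The first two simulations above (adding noise, adding reverberation) can be sketched with NumPy alone. This is a minimal sketch, not a definitive recipe: the helper names `mix_at_snr` and `add_reverb`, the 10 dB SNR, and the synthetic sine, noise, and decaying-noise "RIR" are all assumptions for illustration. In practice you would load real noise recordings and measured room impulse responses.

```python
import numpy as np

def mix_at_snr(x, noise, snr_db):
    """Return x + alpha * noise, with alpha chosen to hit the target SNR (dB).

    Assumes float arrays (int16 audio should be converted first).
    """
    # Loop or crop the noise so that it matches the length of x.
    reps = int(np.ceil(len(x) / len(noise)))
    noise = np.tile(noise, reps)[: len(x)]
    power_x = np.mean(x ** 2)
    power_n = np.mean(noise ** 2)
    # SNR = 10 * log10(power_x / (alpha^2 * power_n)), solved for alpha.
    alpha = np.sqrt(power_x / (power_n * 10 ** (snr_db / 10)))
    return x + alpha * noise

def add_reverb(x, rir):
    """Convolve a dry signal with a room impulse response, keeping len(x)."""
    return np.convolve(x, rir)[: len(x)]

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for a clean, dry source
noise = rng.standard_normal(sr // 2)        # stand-in for a noise recording
# Toy RIR: exponentially decaying noise (hypothetical, not a measured room).
rir = rng.standard_normal(sr // 4) * np.exp(-np.arange(sr // 4) / (0.05 * sr))
rir /= np.abs(rir).max()

noisy = mix_at_snr(clean, noise, snr_db=10.0)
wet = add_reverb(clean, rir)
```

Randomizing the SNR and the choice of noise clip per training example is one way to widen the variance the model sees.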

What to Google
Noise
babble noise recording
home noise recording
cafe noise recording
street noise recording
white noise, brown noise
x_noise = x + alpha * noise
Reverberation

(maybe skip it)
room impulse responses, RIR
reverberation simulators
x_wet = np.convolve(x, rir)
Microphone
band pass filter
scipy.signal filtering
microphone specification
speaker specification
microphone frequency response
scipy.signal.convolve

scipy.signal.fftconvolve

Or trim off the unwanted frequency bins of your
spectrograms
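
A rough way to mimic a cheap microphone is a band-pass filter from `scipy.signal`. The sketch below is only an approximation under stated assumptions: the helper name `simulate_microphone`, the 300 to 3400 Hz pass band, and the filter order are made-up illustrative values, not a real microphone specification. Look up the frequency response of your actual device.

```python
import numpy as np
from scipy import signal

def simulate_microphone(x, sr, low_hz=300.0, high_hz=3400.0, order=4):
    """Band-pass x to imitate a band-limited microphone (illustrative values)."""
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass",
                        fs=sr, output="sos")
    # Zero-phase filtering: runs the filter forward and backward.
    return signal.sosfiltfilt(sos, x)

sr = 16000
t = np.arange(sr) / sr
in_band = np.sin(2 * np.pi * 1000 * t)   # inside the pass band
rumble = np.sin(2 * np.pi * 50 * t)      # below the pass band

recorded_in_band = simulate_microphone(in_band, sr)
recorded_rumble = simulate_microphone(rumble, sr)
```

The 1 kHz tone should come out almost unchanged, while the 50 Hz tone is strongly attenuated.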

Pre-process the signals
or, log(melgram)
Q. What to do after loading the signals?

Digital Audio 101
• 1 second of digital audio: 
size=(44100, ), dtype=int16

• MNIST: (28, 28, 1), uint8
CIFAR10: (32, 32, 3), uint8
ImageNet: (256, 256, 3), uint8

• Audio: Lots of data points in
one item!

Audio representations
Type          Description                          Data shape and size
                                                   (e.g., 1 second, sampling rate=44100)
Waveform      x                                    44100 x [int16]
Spectrograms  STFT(x)                              513 x 87 x [float32]
              Melspectrogram(x)                    128 x 87 x [float32]
              CQT(x)                               72 x 87 x [float32]
Features      MFCC(x) = some process on STFT(x)    20 x 87 x [float32]
Spoiler: log10(Melspectrograms) for the win,
but let's see some details
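
Where do numbers like 513 x 87 come from? A small sketch, assuming an FFT size of 1024, a hop of 512 samples, and centered (padded) framing. These settings are not stated on the slide; they are an assumption that happens to reproduce its shapes. `spectrogram_shape` is a hypothetical helper name.

```python
def spectrogram_shape(n_samples, n_fft=1024, hop_length=512):
    """Shape (frequency bins, time frames) of an STFT with centered framing."""
    n_freq_bins = n_fft // 2 + 1             # real-valued input: bins 0..n_fft/2
    n_frames = 1 + n_samples // hop_length   # one frame per hop, plus the first
    return n_freq_bins, n_frames

one_second = 44100
stft_shape = spectrogram_shape(one_second)
print("STFT shape:", stft_shape)

# Rough storage per second of audio, in bytes.
waveform_bytes = one_second * 2                   # int16
stft_bytes = stft_shape[0] * stft_shape[1] * 4    # float32 magnitudes
mel_bytes = 128 * stft_shape[1] * 4               # 128 mel bands, float32
print(waveform_bytes, stft_bytes, mel_bytes)
```

The number of frames depends only on the hop size; the mel and MFCC rows simply shrink the frequency axis.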

Spectrograms
• 2-dim representation of audio signal
TODO: IMAGE

Practitioner's choice
• Rule of thumb: DISCARD ALL THE REDUNDANCY

• Sample rate, or bandwidth

• Goal: To optimize the input audio data for your model

• by resampling - can be computationally heavier

• by discarding some frequency bands - can be storage-heavier
https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
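
Discarding frequency bands can be as simple as slicing rows off a spectrogram. A minimal sketch, assuming a spectrogram laid out as (frequency bins, time frames) from an FFT of size 1024 at 44100 Hz; the helper name `keep_band` and the 8 kHz cutoff are illustrative assumptions.

```python
import numpy as np

def keep_band(spec, sr, n_fft, fmin=0.0, fmax=8000.0):
    """Keep only the rows (frequency bins) of spec between fmin and fmax Hz."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)   # center frequency of each bin
    mask = (freqs >= fmin) & (freqs <= fmax)
    return spec[mask, :], freqs[mask]

sr, n_fft = 44100, 1024
spec = np.random.default_rng(0).random((n_fft // 2 + 1, 87)).astype(np.float32)

trimmed, kept_freqs = keep_band(spec, sr, n_fft, fmax=8000.0)
print(spec.shape, "->", trimmed.shape)
```

Note that the full-rate audio still has to be stored and transformed; only the model input gets smaller.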

Practitioner's choice
• Melspectrogram 
- in decibel scale 
- which only covers the frequency range you're
interested in.

• Why? 
- smaller, therefore easier and faster training 
- perceptual - puts more weight on the frequency regions
humans are more sensitive to 
- faster than CQT to compute 
- decibel scale - another perceptually motivated choice
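
The decibel scaling itself is a one-liner plus some clipping. A minimal NumPy sketch, assuming a power (squared-magnitude) spectrogram as input; the floor `amin` and the 80 dB dynamic range `top_db` are common but arbitrary choices, and this `power_to_db` is a hand-written helper, not any library's implementation.

```python
import numpy as np

def power_to_db(power_spec, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels relative to its maximum."""
    # Floor tiny values so that log10 never sees zero.
    log_spec = 10.0 * np.log10(np.maximum(power_spec, amin))
    # Use the loudest bin as the 0 dB reference.
    log_spec = log_spec - log_spec.max()
    # Limit the dynamic range to top_db below the maximum.
    return np.maximum(log_spec, -top_db)

power = np.array([[1.0, 0.1],
                  [1e-3, 1e-12]])
db = power_to_db(power)
```

Clipping the dynamic range keeps near-silent bins from dominating the input scale.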
Q. Ok, how can I compute them?

import librosa
import madmom
• Python libraries - librosa/madmom/scipy/..

• Computations on CPU

• Best when all the processing will be done before
the training

import kapre
• Keras Audio Preprocessing layers

• CPU and GPU

• Best when you want to do things on the fly/GPU 
= Best to optimize audio-related parameters
• pip install kapre

• There's also pytorch-audio!
Disclaimer: I'm the maintainer (of kapre)

Design your network
or, know the assumptions
Q. What kind of network structure do I need?

A dumb-but-strong-therefore-good-while-annoying-since-it's-from-computer-vision
baseline approach
• Trim the signals properly (e.g. 1-sec)

• Do the classification with a 2D
convnet, 3x3 kernels (aka VGGNet)

• Raise $1B
• Retire
• Post "why i retired.." on Medium
• Happy life!
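
The trimming step of the baseline above can be sketched as follows: cut each recording into fixed-length chunks so that every training example has the same shape. The helper name `split_into_chunks` and the zero-padding of the last chunk are assumptions; dropping the remainder is an equally valid choice.

```python
import numpy as np

def split_into_chunks(x, sr, seconds=1.0):
    """Split a 1-D signal into equal chunks, zero-padding the last one."""
    chunk_len = int(round(sr * seconds))
    n_chunks = int(np.ceil(len(x) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=x.dtype)
    padded[: len(x)] = x
    return padded.reshape(n_chunks, chunk_len)

sr = 16000
x = np.ones(int(2.5 * sr), dtype=np.float32)   # stand-in for 2.5 s of audio
chunks = split_into_chunks(x, sr, seconds=1.0)
print(chunks.shape)
```

Each row can then be converted to a log-melspectrogram and fed to the 2D convnet.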

Go even dumber
• Just download some pre-trained networks for.. 
- music 
- audio 
- image (?)

• Re-use it for your task (aka transfer learning)

• 1B - retire - Medium - happy - repeat

Better and stronger,
by understanding assumptions
• assert "Receptive field" size == size of the target pattern

• How sparse is the target pattern? 
- Bird singing sparse?  
- Voice-in-music sparse?  
- Distortion-guitar-in-Metallica sparse?
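
To check the receptive-field assumption, it helps to compute how many spectrogram frames one output unit actually sees. A small sketch along one axis, using the standard recurrence for stacked convolution and pooling layers; the three-block VGG-like stack and the hop of 512 samples at 44100 Hz are illustrative assumptions, not a recommended architecture.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of (kernel, stride) layers.

    Returns (size, jump): size is in input units (e.g., frames); jump is the
    distance in input units between neighbouring outputs.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # each layer widens the view
        jump *= stride                # strides make later layers widen faster
    return size, jump

# Three blocks of [3x3 conv, stride 1] followed by [2x2 max-pool, stride 2].
vgg_like = [(3, 1), (2, 2)] * 3
frames, jump = receptive_field(vgg_like)

# Rough duration covered, ignoring the extra width of the STFT window itself.
hop_length, sr = 512, 44100
seconds = frames * hop_length / sr
print(frames, jump, round(seconds, 3))
```

If the target pattern (say, a bird call) lasts much longer than this, the stack needs to be deeper or to pool more along time.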

Have no idea?
• Go see how computer vision people are doing

• Clone it

• It's ok, it's a good baseline at least

"My spectrogram is 28x28 bc the model I downloaded is trained on MNIST"
"Don't use spectrograms as if they are images"
"It all boils down to the pattern recognition, they're actually similar tasks."
"The time and frequency axes have totally different meanings"
"I don't know how to incorporate them into my model.. BUT IT WORKS!"

Expecting the result
or, know the problem
Q. How would it work?

YOU
• You are responsible for the feasibility

• Is it a task you can do?

• Is the information in the input (mel-spectrogram)?

• Are similar tasks being solved?

Think about it!
• Is it possible? To what extent? E.g.,

• Baby crying detection

• Baby crying recognition and classification

• Dog barking translation

• Hit song detection

Conclusion
Conclusion..
Conclusion!

Conclusion
• Sound is analog; you might need to think about some
analog processes, too.

• Pre-process: Follow others when you're lost

• Audio is big in data size, but sparse in information.
Reduce the size. Don't start with end-to-end.

• Design: Follow others when you're lost

• Expect: Make sure it's doable

Q&A
PS. See you soon at the panel talk!
