Techfest jan17

Example-Based Audio Editing
Ramin Anushiravani
TechFest - Jan’17

Who are you?
• Bachelor’s in Electrical and Computer Engineering
§ University of Illinois at Urbana-Champaign
§ Thesis: 3D Audio Playback through Two Loudspeakers
• Master’s in Electrical and Computer Engineering
§ University of Illinois at Urbana-Champaign
§ Thesis: Example-Based Audio Editing
• Internships
– Advanced Digital Sciences Center in Singapore (x2)
• 3D Audio Recording through microphone arrays and Playback through two loudspeakers
– GN-Resound
• Spatial Hearing with Hearing Aids
– Adobe
• Acoustics Matching
– AARP
• Recommender Systems
• Now at Dolby
– Patent Engineer for Audio
• MPEG Standards, Audio and Speech Codecs, and Dynamic Range Control

Outline
• Example-Based What?
• Acoustic Matching
– Manual/Automatic Equalization
– Manual/Automatic Noise
– Manual/Automatic Reverberation
• User Study

Acoustics Matching
• Equalization
• Background Noise
• Reverberation

How?
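[Diagram: signal flow with the blocks "Matching" and "Reconstruct". Signals shown: X_ex,wet, X_ex,dry, X_ex,effect (example), X_in,wet, X_in,dry, X_in,effect (input), and the output Y_mat.]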

Graphic Equalizer
iTunes Equalizer setting

Equalizer Matching
$$P[k] = \frac{1}{L} \sum_{\tau=1}^{L} \left| \mathrm{STFT}\{x[n]\}(k, \tau) \right|^2$$
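A minimal sketch of the averaged power spectrum P[k], assuming a Hann window of 1024 samples with 75% overlap. The function names and the square-root gain rule used for the matching step are illustrative assumptions, not something stated on the slides.

```python
import numpy as np

def average_power_spectrum(x, frame_len=1024, hop=256):
    """P[k]: squared STFT magnitude averaged over the L frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)      # shape (L, frame_len // 2 + 1)
    return np.mean(np.abs(spectrum) ** 2, axis=0)

def matching_gain(p_input, p_example, eps=1e-12):
    """Per-bin amplitude gain that moves the input's average spectrum
    toward the example's (assumed matching rule)."""
    return np.sqrt((p_example + eps) / (p_input + eps))

# Toy check: the "example" is the input made twice as loud.
rng = np.random.default_rng(0)
x_in = rng.standard_normal(16000)
x_ex = 2.0 * x_in
p_in = average_power_spectrum(x_in)
p_ex = average_power_spectrum(x_ex)
gain = matching_gain(p_in, p_ex)
```

In practice the gain curve would likely be smoothed over log-spaced frequency bands before it is applied, so that it behaves like a graphic equalizer rather than a per-bin filter.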

Equalizer Matching
[Figure: log magnitude (dB) versus log-spaced frequency (Hz)]

Denoising
Spectral Subtraction
(Philipos C. Loizou, Speech Enhancement: Theory and Practice, 2013)
Additive stationary noise:
$$Z(\omega) = X(\omega) + D(\omega)$$
where $Z(\omega)$ is the Fourier transform of the noisy signal in one frame.
Estimate of the clean power spectrum:
$$|\hat{X}(\omega)|^2 = H^2(\omega)\,|Z(\omega)|^2$$
Noise suppression factor, where $|\hat{D}(\omega)|^2$ is the noise profile estimate:
$$H(\omega) = \sqrt{1 - \frac{|\hat{D}(\omega)|^2}{|Z(\omega)|^2}}$$
In practice,
• The noise profile is estimated over multiple frequency bands.
• Spectral subtraction fails in low-SNR regions by creating musical noise. This artifact is reduced by post-filtering the spectral subtraction output.
(Esch and Vary, Efficient Musical Noise Suppression for Speech Enhancement Systems, 2009)

Spectral Subtraction
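(Philipos C. Loizou, Speech Enhancement: Theory and Practice, 2013)
$$y(n) = x(n) + d(n)$$
where $y(n)$ is the noisy signal, $x(n)$ the clean signal, and $d(n)$ the noise (AWGN, the same over all clean input segments).
$$Y(\omega) = X(\omega) + D(\omega)$$
where $X(\omega)$ is the Fourier transform over a segment of $x(n)$.
$$|Y(\omega)|^2 = |X(\omega)|^2 + |D(\omega)|^2 + X(\omega) D^*(\omega) + X^*(\omega) D(\omega)$$
with the cross terms $X(\omega) D^*(\omega) + X^*(\omega) D(\omega) = 2\,\mathrm{Re}\{X(\omega) D^*(\omega)\}$.
A common assumption in most papers: noise and the clean signal are uncorrelated, so the cross terms are dropped.
$$|\hat{X}(\omega)|^2 = H^2(\omega)\,|Y(\omega)|^2 = \begin{cases} |Y(\omega)|^2 - |\hat{D}(\omega)|^2, & |Y(\omega)|^2 > |\hat{D}(\omega)|^2 \\ 0, & |Y(\omega)|^2 \le |\hat{D}(\omega)|^2 \end{cases}$$
where $|\hat{D}(\omega)|^2$ is the estimated noise PSD.
$$H(\omega) = \sqrt{1 - \frac{|\hat{D}(\omega)|^2}{|Y(\omega)|^2}}$$
In practice H is learned over different frequency bands.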
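The suppression factor can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the noise power estimate is taken as given, and the noisy phase is reused for the enhanced frame, which the slides do not spell out.

```python
import numpy as np

def suppression_factor(noisy_power, noise_power_est):
    """H = sqrt(1 - |D_hat|^2 / |Y|^2), set to 0 where |Y|^2 <= |D_hat|^2."""
    ratio = noise_power_est / np.maximum(noisy_power, 1e-12)
    return np.sqrt(np.clip(1.0 - ratio, 0.0, 1.0))

def spectral_subtraction(noisy_spectrum, noise_power_est):
    """Scale one frame's complex spectrum Y by H, keeping the noisy phase
    (an assumption of this sketch)."""
    h = suppression_factor(np.abs(noisy_spectrum) ** 2, noise_power_est)
    return h * noisy_spectrum

# Toy frame with three bins: only the first bin is above the noise estimate.
noisy_power = np.array([4.0, 1.0, 0.5])
noise_power = np.array([1.0, 1.0, 1.0])
h = suppression_factor(noisy_power, noise_power)
clean_power_est = h ** 2 * noisy_power
```

Bins where the noise estimate exceeds the observed power are zeroed, which is the hard floor that tends to produce isolated spectral peaks, heard as musical noise.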

Musical Noise Reduction
( Esch and Vary, Efficient Musical Noise Suppression for
Speech Enhancement Systems, 2009)
Aim: Retain the naturalness of the remaining background noise.
How?
1. Detect low-SNR frames based on the noisy signal and the estimated clean signal.
2. Design a smoothing window based on step 1. The lower the SNR, the longer the window.
3. Design a post-filter to smooth the low-SNR frames, i.e. an FIR low-pass filter designed based on step 2.
4. Element-wise multiply the noise suppression factor by the result of step 2.
[Figure: Step 3]
Enhanced Spectral Subtraction

SS + Musical Noise Reduction
[Figure: noisy input (SNR = 22 dB) and the enhanced output. The musical suppression post-filter G is multiplied element-wise with the suppression factor H (G.*H). Much better!]

Non-Stationary Noise
(Z. Duan, G. J. Mysore, and P. Smaragdis, "Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments," in Interspeech Conference, 2012.)

Reverberation
Krannert Center for the Performing Arts, Foellinger Great Hall

Reverberation
Falkland Palace Bottle Dungeon
(OpenAir database, www.openairlib.net)
reverb sound = dry sound ⋆ reverb kernel
$$y(n) = \sum_{k=0}^{N-1} x(k)\, r(n-k)$$
Approximate in the magnitude STFT domain: a convolution between the time frames of the magnitudes of X and R at each frequency index.
$$|Y(t_\tau, k)| \approx \sum_{\tau=0}^{L_h-1} |R(\tau, k)|\, |X(t_\tau - \tau, k)| = |X| \star |R|$$
(R. Talmon, I. Cohen, and S. Gannot, “Relative transfer function identification using convolutive transfer function approximation,” IEEE Trans. Audio, Speech, and Language Process., 2009.)
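The magnitude-domain approximation amounts to one short convolution along time in every frequency bin. A minimal sketch, assuming spectrograms are stored as (time frames × frequency bins) arrays; the function name and the toy values are illustrative.

```python
import numpy as np

def convolve_magnitudes(x_mag, r_mag):
    """|Y| ~ |X| * |R|: convolve along time, independently per frequency bin.

    x_mag: (Lx, K) dry magnitude spectrogram, frames x bins
    r_mag: (Lh, K) kernel magnitude spectrogram, frames x bins
    returns: (Lx + Lh - 1, K)
    """
    n_x, n_bins = x_mag.shape
    n_h = r_mag.shape[0]
    y_mag = np.zeros((n_x + n_h - 1, n_bins))
    for tau in range(n_h):
        # Kernel frame tau delays the dry spectrogram by tau frames.
        y_mag[tau:tau + n_x] += r_mag[tau] * x_mag
    return y_mag

x_mag = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
r_mag = np.array([[1.0, 1.0], [0.5, 0.0]])
y_mag = convolve_magnitudes(x_mag, r_mag)
```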

Metrics for Ideal Reverberation
Energy Decay Curve (EDC)
Energy Decay Relief (EDR): EDC at multiple frequency bands
[Figure: magnitude (dB) versus time]
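A minimal sketch of an energy decay curve, computed here with backward (Schroeder) integration of the squared impulse response, which is the usual definition but is not named on the slide. The synthetic impulse response, the fit range of -5 to -35 dB, and the function names are assumptions made for illustration.

```python
import numpy as np

def energy_decay_curve_db(impulse_response):
    """Energy remaining after each sample, in dB relative to the total energy."""
    remaining = np.cumsum(impulse_response[::-1] ** 2)[::-1]
    return 10.0 * np.log10(remaining / remaining[0])

def rt60_from_edc(edc_db, fs, start_db=-5.0, end_db=-35.0):
    """Fit a line to the EDC between start_db and end_db and
    extrapolate the slope to a 60 dB drop."""
    idx = np.where((edc_db <= start_db) & (edc_db >= end_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second
    return -60.0 / slope

# Synthetic impulse response: Gaussian noise whose amplitude envelope
# falls by 60 dB after rt60_true seconds.
fs = 8000
t = np.arange(fs) / fs
rt60_true = 0.5
rng = np.random.default_rng(0)
h = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / rt60_true)

edc_db = energy_decay_curve_db(h)
rt60_est = rt60_from_edc(edc_db, fs)
```

Applying the same function to band-pass filtered copies of the impulse response would give one curve per band, which is the idea behind the energy decay relief.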

Reverberation Model
• Time Domain Statistical Model
where b(t) is zero-mean Gaussian noise and the decay rate is related to the reverberation time.
• Reverberation time (RT60): the time it takes the sound level to drop 60 dB below its original level.
Sabine Formula, in terms of:
– Volume of the enclosure
– Effective absorbing area
– Area of each wall
– Absorption coefficient
Reflection Coefficients:
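The formula itself did not survive on this slide, so the sketch below uses the standard metric form of Sabine's formula, RT60 = 0.161 V / A, with the effective absorbing area A taken as the sum of each wall's area times its absorption coefficient. The relation between reflection and absorption coefficients, beta = sqrt(1 - alpha), is a common convention and is an assumption here, as are the room dimensions and function names.

```python
import math

def sabine_rt60(volume_m3, wall_areas_m2, absorption_coeffs):
    """RT60 in seconds from Sabine's formula (metric units)."""
    effective_area = sum(s * a for s, a in zip(wall_areas_m2, absorption_coeffs))
    return 0.161 * volume_m3 / effective_area

def reflection_coefficient(absorption_coeff):
    """Amplitude reflection coefficient for a given energy absorption
    coefficient (assumed convention: beta = sqrt(1 - alpha))."""
    return math.sqrt(1.0 - absorption_coeff)

# Hypothetical 5 m x 4 m x 3 m room, all six surfaces with alpha = 0.2.
length, width, height = 5.0, 4.0, 3.0
areas = [length * width] * 2 + [length * height] * 2 + [width * height] * 2
alphas = [0.2] * 6
rt60 = sabine_rt60(length * width * height, areas, alphas)
beta = reflection_coefficient(0.2)
```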

Image Source Method
(Allen, J. and Berkley, D., 'Image method for efficiently simulating small-room acoustics', The Journal of the Acoustical Society of America, Vol. 65, No. 4, pp. 943-950, 1979)
[Figure: a source, a microphone, and the mirror image of the original source, with the actual path and the perceived path. An image source produces another image source. Pictures from: Alex Tu, Reverberation simulation from impulse response using the Image Source Method]
Terms of the image source impulse response:
• Parameters that control which image source in which dimension
• Reflection coefficients of the six surfaces in a rectangular room
• Time delay of the considered image source

Reverberation Matching
[Diagram: the input A and the example B each pass through dereverberation, giving A_dry and R_a for the input and B_dry and R_b for the example.]
Ideal case – perfect decomposition of reverb sounds into dry sounds and reverb kernels.
$$C = A_{dry} \star R_b \approx \text{result}$$
Running out of letters!
Focus is on decomposing the magnitude spectrograms into magnitude spectrograms.
I took the signals back to the time domain using the phase of the reverberated input.

Non-Negative Matrix Factorization
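(Lee and Seung, 1999)
$$X \in \mathbb{R}^{\ge 0,\, m \times n}, \quad W \in \mathbb{R}^{\ge 0,\, m \times k}, \quad H \in \mathbb{R}^{\ge 0,\, k \times n}, \quad \text{where } k < \min(m, n)$$
$$X \approx \sum_{k=0}^{T-1} W_k H_k$$
$$\min_{W \ge 0,\, H \ge 0} \|X - WH\|^2$$
• Applying gradient descent under positive initial conditions for W and H and a ‘clever’ learning rate results in the following multiplicative update rules,
$$H = H \otimes \left( W^T \cdot \frac{X}{W \cdot H} \right), \qquad W = W \otimes \left( \frac{X}{W \cdot H} \cdot H^T \right)$$
Normalize W:
$$W_{mk} = \frac{W_{mk}}{\sum_j W_{jk}}$$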
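A minimal NumPy sketch of multiplicative updates in the ratio form shown above, where X is divided element-wise by W·H. Two details are assumptions of this sketch rather than something the slide states: the normalising denominators are written out explicitly, and progress is monitored with the generalised KL divergence, the cost commonly paired with this ratio-form update. The function names and toy data are illustrative.

```python
import numpy as np

def kl_divergence(x, x_hat, eps=1e-12):
    """Generalised KL divergence between non-negative arrays."""
    return np.sum(x * np.log((x + eps) / (x_hat + eps)) - x + x_hat)

def nmf(x, k, n_iter=200, seed=0, eps=1e-12):
    """Factor x (m x n) into w (m x k) and h (k x n), all non-negative."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    w = rng.random((m, k)) + eps          # positive initial conditions
    h = rng.random((k, n)) + eps
    history = []
    for _ in range(n_iter):
        h *= (w.T @ (x / (w @ h + eps))) / (w.sum(axis=0)[:, None] + eps)
        w *= ((x / (w @ h + eps)) @ h.T) / (h.sum(axis=1)[None, :] + eps)
        scale = w.sum(axis=0)             # normalise the columns of W ...
        w /= scale
        h *= scale[:, None]               # ... and rescale H so W @ H is unchanged
        history.append(kl_divergence(x, w @ h))
    return w, h, history

# Toy data: an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
x = rng.random((20, 2)) @ rng.random((2, 30))
w, h, history = nmf(x, k=2)
```

Because every factor in the updates is non-negative, W and H stay non-negative without any explicit projection step.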

Why NMF? (Lee and Seung, 1999)
Visually meaningful. The decomposition can only be positive: a parts-based representation.
Statistically meaningful. Eigenfaces are in the direction of the largest variance. Subtraction can occur.

Why NMF?
[Figure: a spectrogram X (m frequency bins × n time frames, hann(1024) window, 75% overlap) approximated by W (m × k) times H (k × n), with k = 2 components.]

Temporal Failure!
(Adapted from: Paul O’Grady & Barak Pearlmutter, Convolutive NMF with a Sparseness Constraint, MLSP Conference, 2006)

Convolutive NMF
(Adapted from: Paul O’Grady & Barak Pearlmutter, Convolutive NMF with a Sparseness Constraint, MLSP Conference, 2006)

Convolutive NMF
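$$X \approx \sum_{t=0}^{T-1} W(t) \cdot \overset{t\rightarrow}{H}$$
$$X \in \mathbb{R}^{\ge 0,\, m \times n}, \quad W(t) \in \mathbb{R}^{\ge 0,\, m \times k}, \quad H \in \mathbb{R}^{\ge 0,\, k \times n}$$
[Diagram: X (m × n) approximated by a stack of T matrices W(t), each m × k, applied to shifted copies of H (k × n), e.g. W(1) with $\overset{1\rightarrow}{H}$.]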

Convolutive Non-negative Matrix Factorization
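(Paul O’Grady & Barak Pearlmutter, Convolutive NMF with a Sparseness Constraint, MLSP Conference, 2006)
$$Y \approx \hat{Y} = \hat{X} \star \hat{R}$$
$$\hat{X} \in \mathbb{R}^{\ge 0,\, L_x \times k}, \quad \hat{R} \in \mathbb{R}^{\ge 0,\, L_h \times k}, \quad Y \text{ and } \hat{Y} \in \mathbb{R}^{\ge 0,\, L_y \times k}, \quad L_y = L_x + L_h - 1$$
Update Equations:
$$\hat{R} = \hat{R} \otimes \frac{\hat{X}_t^T \cdot \overset{\leftarrow t}{\left\{ \frac{Y}{\hat{Y}} \right\}}}{\hat{X}_t^T \cdot \mathbf{1}}, \qquad \hat{X}_t = \hat{X}_t \otimes \frac{\frac{Y}{\hat{Y}} \cdot \hat{R}^{T,\, t\rightarrow}}{\mathbf{1} \cdot \hat{R}^{T,\, t\rightarrow}}$$
• $\star$: convolution of non-negative matrices
• $t\rightarrow$, $\leftarrow t$: shift operator
• $\hat{X}_t$: spectrum at time frame t
• $\mathbf{1}$: matrix of size $L_y \times k$ with all its elements set to 1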

Convolutive NMF
[Figure: estimates at iterations 1, 2, 3 and 10]

Dereverberation
• Initialize $\hat{X}$ with positive random values.
• Initialize $\hat{R}$ with positive exponential decays.
• Apply CNMF on Y: $Y \approx \hat{X} \star \hat{R}$
• On each iteration, enforce anti-sparsity on $\hat{R}$: $\hat{R} \rightarrow \hat{R}^{\alpha}$, $\alpha \in [0.85, 0.98]$
I dropped indices and absolute values, but they’re there.

Dereverberation
We can do better by using more prior knowledge.
(Paris Smaragdis, “Convolutive speech bases and their application to supervised speech separation,” in Speech And Audio Processing. IEEE, 2007)
$$\hat{X} \approx W_c \star H_c$$
• $W_c$: set of dry speech bases (trained offline)
• $H_c$: corresponding activation
$$Y \approx (W_c \star H_c) \star R_\Sigma$$
• $R_\Sigma$: average R over multiple frequency bands
Convolution is associative:
$$Y \approx W_c \star (H_c \star R_\Sigma)$$
• $H_r$: reverberated activation matrix
37𝑌" ≈ 𝑊%. 𝐻%

Demo
[Figure panels:
• PSD for kernels: log amplitude versus time (sec), showing the original, estimated and exponentiated kernels
• Estimated bases W_d: frequency (Hz) versus component
• Dereverb sound and reverb sound: spectrograms, frequency (Hz) versus time (sec)
• Estimated activations H_d and H_r: component versus time (sec)]

User Study
[Screenshot: GUI]
• 40 people
• Beginner to expert listeners
• 3 tasks, 3 recordings each
• Ranging from easy to hard
• Match equalization and reverberation

User Study
[Charts: ease of use for the Reverb, EQ and EQ+Reverb tasks]

Any Questions?
• Equalization matching
• Noise matching
• Reverberation matching
