WEAKLY-SUPERVISED SOUND EVENT DETECTION
WITH SELF-ATTENTION
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This work was done during an internship at LINE Corporation
ICASSP 2020
Session WE1.L5: Acoustic Event Detection
Outline of this work
● Goal
– Improve sound event detection (SED) performance
– Utilize weak label data for training
● Contributions
– Propose self-attention based weakly-supervised SED
– Introduce a special tag token to handle weak label information
● Evaluation
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
[Figure: SED detects events such as "Alarm" with onset/offset on the time axis; the weak label gives only the tags "Alarm, Dog, Speech"]
Background
● Sound event detection (SED)
– Identifying environmental sounds with timestamps
● Collecting an annotated dataset
– Strong label → includes timestamps
• Easy to handle ✓
• Expensive annotation cost ✗
– Weak label → does NOT include timestamps (only tags are available)
• Hard to handle ✗
• Cheap annotation cost ✓
[Figure: strong label — per-event onset/offset for Alarm, Dog, and Speech on the time axis; weak label — only the tags "Alarm, Dog, Speech". The weak-label case is the problem addressed in this work]
Weakly-supervised training for SED
● Multi-instance learning (MIL)
– An effective approach for training with weak labels
– Predict frame by frame, then aggregate the predictions into a sequence-level prediction
[Figure: frame-level predicted scores for class1–class3 are aggregated in the time domain into a sequence-level score, and the loss is calculated against the weak label]
Which aggregation approach is effective?
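The MIL pipeline above can be illustrated with a minimal numpy sketch (not the authors' implementation): frame-level scores per class are aggregated over time into a clip-level prediction, and binary cross-entropy is computed against the weak label. Max pooling is used here as one possible aggregator; the next slides compare alternatives.

```python
import numpy as np

def mil_loss(frame_scores, weak_label, eps=1e-7):
    """Aggregate frame-level scores (T x C, in [0, 1]) into a
    sequence-level prediction and compute BCE against the weak label."""
    clip_scores = frame_scores.max(axis=0)          # aggregate in the time domain (max pooling here)
    clip_scores = np.clip(clip_scores, eps, 1 - eps)
    bce = -(weak_label * np.log(clip_scores)
            + (1 - weak_label) * np.log(1 - clip_scores))
    return bce.mean()

# toy example: 5 frames, 3 classes; only class 0 is active in the clip
scores = np.array([[0.9, 0.1, 0.2],
                   [0.8, 0.2, 0.1],
                   [0.1, 0.1, 0.1],
                   [0.2, 0.3, 0.1],
                   [0.1, 0.2, 0.2]])
weak = np.array([1.0, 0.0, 0.0])
print(round(mil_loss(scores, weak), 4))  # → 0.2284
```

The loss is small when the aggregated scores agree with the clip tags and grows when they disagree, so only weak labels are needed for training.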
How to aggregate frame-level predictions
● Global max pooling
– Captures short-duration events
– Sensitive to noise
● Global average pooling
– Captures long-duration events
– Ignores short-duration events
● Attention pooling
– Flexible decisions via the attention mechanism
[Figure: frame-level predictions over time are reduced to a sequence-level prediction by max, average, or weighted sum]
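The max/average trade-off above can be seen on a toy sequence (an illustrative numpy example, not from the paper): a short event dominates under max pooling but is diluted under average pooling.

```python
import numpy as np

# toy frame-level scores for one class: a short event (2 active frames out of 10)
frame_scores = np.array([0.05, 0.05, 0.9, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

max_pool = frame_scores.max()    # 0.95 -> catches the short event, but a single noisy frame would too
avg_pool = frame_scores.mean()   # 0.225 -> the short event is diluted by the inactive frames
print(max_pool, round(avg_pool, 3))
```

Attention pooling, introduced next, sidesteps this fixed trade-off by learning per-frame weights.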
Attention pooling
● Calculates the prediction and confidence of each frame according to the input
[Figure: from the input frame-level features, a sigmoid head gives frame-level predictions and a softmax head gives frame-level confidences (attention weights); their weighted sum over time is the sequence-level prediction]
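Attention pooling as described above can be sketched in numpy (a minimal illustration with hypothetical shapes, not the authors' code): frame-level class predictions (sigmoid) are weighted by frame-level confidences (softmax over time) and summed.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(logits_pred, logits_att):
    """Sequence-level prediction as a confidence-weighted sum of
    frame-level predictions (both inputs are T x C logits)."""
    pred = sigmoid(logits_pred)           # frame-level prediction per class
    weight = softmax(logits_att, axis=0)  # frame-level confidence, normalized over time
    return (weight * pred).sum(axis=0)    # weighted sum over frames -> one score per class

rng = np.random.default_rng(0)
T, C = 8, 3
clip_pred = attention_pool(rng.normal(size=(T, C)), rng.normal(size=(T, C)))
print(clip_pred.shape)  # one score per class
```

Because the weights sum to one over time, the result is a convex combination of frame predictions: the model can behave like max pooling (peaked weights) or average pooling (uniform weights) as needed.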
Self-attention
● Transformer [Vaswani+17]
– Makes effective use of self-attention
– Captures both local and global context information
– Great success in NLP and various audio/speech tasks
• ASR, speaker recognition, speaker diarization, TTS, etc.
[Figure: Transformer encoder — input + positional encoding, then N× (multi-head attention → add & norm → feed-forward → add & norm) → output]
In this work, we use the Transformer encoder
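The core of the encoder is scaled dot-product self-attention, sketched below in numpy as a single head (the actual model is multi-head with residual connections; the weight matrices here are random placeholders). Each output frame is a weighted sum over all frames, which is why both local and global context can be captured.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x (T x d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                         # T x T frame-to-frame similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                   # softmax: each row sums to 1
    return w @ v                                            # each output frame attends to all frames

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # same sequence length, projected dimension
```

In the paper's setting the attention dimension is 128 split over 16 heads (8 dims each), per the experimental conditions slide.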
Overview of self-attention
[Figure: dense layers project the input frame-level features into event features and attention weights; attention weight × event feature produces the output frame-level features]
In weakly-supervised SED, how should weak label data be handled?
Proposed method
● Weakly-supervised training for SED with self-attention and a tag token
– Introduce a Transformer encoder as the self-attention module for sequence modeling
– Introduce a tag token dedicated to weak label estimation
[Figure: a tag token is appended at the first frame of the input feature sequence; a stacked Transformer encoder processes the sequence, and sigmoid classifiers predict the strong label from the frame outputs and the weak label from the tag token output]
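The tag-token mechanism can be sketched as follows (numpy, hypothetical shapes, not the authors' code): a constant-valued token is prepended to the feature sequence before the encoder, and after encoding the first output is read by the weak-label head while the remaining frames feed the strong-label head.

```python
import numpy as np

def add_tag_token(features, tag_value=0.0):
    """Prepend a constant tag token to a (T x d) feature sequence -> (T+1) x d."""
    tag = np.full((1, features.shape[1]), tag_value)
    return np.concatenate([tag, features], axis=0)

def split_outputs(encoded):
    """After the stacked encoder, the first frame is the tag token output
    (-> weak label prediction); the rest are the frame outputs (-> strong label)."""
    return encoded[0], encoded[1:]

T, d = 10, 4
x = np.random.default_rng(0).normal(size=(T, d))
seq = add_tag_token(x)                   # (11, 4): tag token at the first frame
tag_out, frame_out = split_outputs(seq)  # encoder omitted here; identity for illustration
print(seq.shape, tag_out.shape, frame_out.shape)
```

Because every encoder layer lets the tag token attend to all frames, tag information is aggregated into it layer by layer, which is what the following build-up figure depicts.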
Self-attention with tag token
[Figure (built up over slides 10–12): a tag token with a constant value is appended to the input frame-level features; through encoder 1 … encoder N, self-attention models the relationship between the tag token and the input, and tag information is aggregated into the tag token in each encoder; the frame-level outputs yield the strong label prediction and the tag token output yields the weak label prediction]
Experiments
● DCASE2019 Task 4
– Sound event detection in domestic environments
– Evaluation metrics: event-based and segment-based macro F1
– Baseline model: CRNN
– Dataset: the official dataset provided for the task
Experimental conditions
● Network training configuration
– Feature: 64-dim log mel filterbank
– Transformer settings: 128 attention dims, 16 heads (each head handles 8 dims)
Experimental results
Method            | Event-based [%] | Segment-based [%] | Frame-based [%]
CRNN (baseline)   | 30.61           | 62.21             | 60.94
Transformer (E=3) | 34.27           | 65.07             | 61.85
Transformer (E=4) | 33.05           | 65.14             | 62.00
Transformer (E=5) | 31.81           | 63.90             | 60.78
Transformer (E=6) | 34.28           | 64.33             | 61.26
→ Transformer models outperformed the CRNN model
Experimental results
[Figure: per-class scores for CRNN vs. Transformer]
Experimental results
– The Blender and Dishes classes improved in particular (+10.4% and +13.5%)
⇒ effective for repeatedly appearing sounds
Experimental results
Attention pooling vs. tag token

Method                             | Encoder stack | Event-based [%] | Segment-based [%] | Frame-based [%]
Self-attention + attention pooling | 3             | 33.99           | 65.95             | 62.36
Self-attention + attention pooling | 6             | 33.84           | 65.61             | 62.10
Self-attention + tag token         | 3             | 34.27           | 65.07             | 61.85
Self-attention + tag token         | 6             | 34.28           | 64.33             | 61.26
→ The two approaches perform comparably
Prediction example
Visualization of attention weights
Conclusion
● Proposed method
– Weakly-supervised training for SED with self-attention and a tag token
• Self-attention: effective sequence modeling using local and global context
• Tag token: aggregates tag information through self-attention
● Results
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
– Effective for repeatedly appearing sounds