WEAKLY-SUPERVISED SOUND EVENT DETECTION
WITH SELF-ATTENTION
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This work was done during an internship at LINE Corporation
ICASSP 2020
Session WE1.L5: Acoustic Event Detection
Outline of this work
● Goal
– Improve sound event detection (SED) performance
– Utilize weak label data for training
● Contributions
– Propose self-attention based weakly-supervised SED
– Introduce a special tag token to handle weak label information
● Evaluation
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
[Figure: SED detects events such as "Alarm" with onset/offset on the time axis; the weak label gives only the tags "Alarm, Dog, Speech"]
Background
● Sound event detection (SED)
– Identifying environmental sounds with timestamps
● Collecting an annotated dataset
– Strong label → includes timestamps
• Easy to handle ✓
• Expensive annotation cost ✗
– Weak label → does NOT include timestamps (only tags are available)
• Hard to handle ✗
• Cheap annotation cost ✓
[Figure: strong label — per-event onset/offset for Alarm, Dog, and Speech on the time axis; weak label — only the tags "Alarm, Dog, Speech". The weak-label case is the problem addressed in this work]
Weakly-supervised training for SED
● Multi-instance learning (MIL)
– An effective approach for training with weak labels
– Predict frame by frame, then aggregate the predictions into a sequence-level prediction
[Figure: frame-level predicted scores for class1–class3 are aggregated in the time domain into a sequence-level score, and the loss is calculated against the weak label]
Which aggregation approach is effective?
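The MIL pipeline above can be illustrated with a minimal numpy sketch (not the authors' implementation): frame-level scores per class are aggregated over time into a clip-level prediction, and binary cross-entropy is computed against the weak label. Max pooling is used here as one possible aggregator; the next slides compare alternatives.

```python
import numpy as np

def mil_loss(frame_scores, weak_label, eps=1e-7):
    """Aggregate frame-level scores (T x C, in [0, 1]) into a
    sequence-level prediction and compute BCE against the weak label."""
    clip_scores = frame_scores.max(axis=0)          # aggregate in the time domain (max pooling here)
    clip_scores = np.clip(clip_scores, eps, 1 - eps)
    bce = -(weak_label * np.log(clip_scores)
            + (1 - weak_label) * np.log(1 - clip_scores))
    return bce.mean()

# toy example: 5 frames, 3 classes; only class 0 is active in the clip
scores = np.array([[0.9, 0.1, 0.2],
                   [0.8, 0.2, 0.1],
                   [0.1, 0.1, 0.1],
                   [0.2, 0.3, 0.1],
                   [0.1, 0.2, 0.2]])
weak = np.array([1.0, 0.0, 0.0])
print(round(mil_loss(scores, weak), 4))  # → 0.2284
```

The loss is small when the aggregated scores agree with the clip tags and grows when they disagree, so only weak labels are needed for training.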
How to aggregate frame-level predictions
● Global max pooling
– Captures short-duration events
– Sensitive to noise
● Global average pooling
– Captures long-duration events
– Ignores short-duration events
● Attention pooling
– Flexible decisions via the attention mechanism
[Figure: frame-level predictions over time are reduced to a sequence-level prediction by max, average, or weighted sum]
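The max/average trade-off above can be seen on a toy sequence (an illustrative numpy example, not from the paper): a short event dominates under max pooling but is diluted under average pooling.

```python
import numpy as np

# toy frame-level scores for one class: a short event (2 active frames out of 10)
frame_scores = np.array([0.05, 0.05, 0.9, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

max_pool = frame_scores.max()    # 0.95 -> catches the short event, but a single noisy frame would too
avg_pool = frame_scores.mean()   # 0.225 -> the short event is diluted by the inactive frames
print(max_pool, round(avg_pool, 3))
```

Attention pooling, introduced next, sidesteps this fixed trade-off by learning per-frame weights.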
Attention pooling
● Calculates the prediction and confidence of each frame according to the input
[Figure: from the input frame-level features, a sigmoid head gives frame-level predictions and a softmax head gives frame-level confidences (attention weights); their weighted sum over time is the sequence-level prediction]
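Attention pooling as described above can be sketched in numpy (a minimal illustration with hypothetical shapes, not the authors' code): frame-level class predictions (sigmoid) are weighted by frame-level confidences (softmax over time) and summed.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(logits_pred, logits_att):
    """Sequence-level prediction as a confidence-weighted sum of
    frame-level predictions (both inputs are T x C logits)."""
    pred = sigmoid(logits_pred)           # frame-level prediction per class
    weight = softmax(logits_att, axis=0)  # frame-level confidence, normalized over time
    return (weight * pred).sum(axis=0)    # weighted sum over frames -> one score per class

rng = np.random.default_rng(0)
T, C = 8, 3
clip_pred = attention_pool(rng.normal(size=(T, C)), rng.normal(size=(T, C)))
print(clip_pred.shape)  # one score per class
```

Because the weights sum to one over time, the result is a convex combination of frame predictions: the model can behave like max pooling (peaked weights) or average pooling (uniform weights) as needed.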
Self-attention
● Transformer [Vaswani+17]
– Makes effective use of self-attention
– Captures both local and global context information
– Great success in NLP and various audio/speech tasks
• ASR, speaker recognition, speaker diarization, TTS, etc.
[Figure: Transformer encoder — input + positional encoding, then N× (multi-head attention → add & norm → feed-forward → add & norm) → output]
In this work, we use the Transformer encoder
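The core of the encoder is scaled dot-product self-attention, sketched below in numpy as a single head (the actual model is multi-head with residual connections; the weight matrices here are random placeholders). Each output frame is a weighted sum over all frames, which is why both local and global context can be captured.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x (T x d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                         # T x T frame-to-frame similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                   # softmax: each row sums to 1
    return w @ v                                            # each output frame attends to all frames

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # same sequence length, projected dimension
```

In the paper's setting the attention dimension is 128 split over 16 heads (8 dims each), per the experimental conditions slide.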
Overview of self-attention
[Figure: dense layers project the input frame-level features into event features and attention weights; attention weight × event feature produces the output frame-level features]
In weakly-supervised SED, how should weak label data be handled?
Proposed method
● Weakly-supervised training for SED with self-attention and a tag token
– Introduce a Transformer encoder as the self-attention module for sequence modeling
– Introduce a tag token dedicated to weak label estimation
[Figure: a tag token is appended at the first frame of the input feature sequence; a stacked Transformer encoder processes the sequence, and sigmoid classifiers predict the strong label from the frame outputs and the weak label from the tag token output]
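The tag-token mechanism can be sketched as follows (numpy, hypothetical shapes, not the authors' code): a constant-valued token is prepended to the feature sequence before the encoder, and after encoding the first output is read by the weak-label head while the remaining frames feed the strong-label head.

```python
import numpy as np

def add_tag_token(features, tag_value=0.0):
    """Prepend a constant tag token to a (T x d) feature sequence -> (T+1) x d."""
    tag = np.full((1, features.shape[1]), tag_value)
    return np.concatenate([tag, features], axis=0)

def split_outputs(encoded):
    """After the stacked encoder, the first frame is the tag token output
    (-> weak label prediction); the rest are the frame outputs (-> strong label)."""
    return encoded[0], encoded[1:]

T, d = 10, 4
x = np.random.default_rng(0).normal(size=(T, d))
seq = add_tag_token(x)                   # (11, 4): tag token at the first frame
tag_out, frame_out = split_outputs(seq)  # encoder omitted here; identity for illustration
print(seq.shape, tag_out.shape, frame_out.shape)
```

Because every encoder layer lets the tag token attend to all frames, tag information is aggregated into it layer by layer, which is what the following build-up figure depicts.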
Self-attention with tag token
[Figure (built up over slides 10–12): a tag token with a constant value is appended to the input frame-level features; through encoder 1 … encoder N, self-attention models the relationship between the tag token and the input, and tag information is aggregated into the tag token in each encoder; the frame-level outputs yield the strong label prediction and the tag token output yields the weak label prediction]
Experiments
● DCASE2019 Task 4
– Sound event detection in domestic environments
– Evaluation metrics: event-based and segment-based macro F1
– Baseline model: CRNN
– Dataset: the official dataset provided for the task
Experimental conditions
● Network training configuration
– Feature: 64-dim log mel filterbank
– Transformer settings: 128 attention dims, 16 heads (each head handles 8 dims)
Experimental results
Method            | Event-based [%] | Segment-based [%] | Frame-based [%]
CRNN (baseline)   | 30.61           | 62.21             | 60.94
Transformer (E=3) | 34.27           | 65.07             | 61.85
Transformer (E=4) | 33.05           | 65.14             | 62.00
Transformer (E=5) | 31.81           | 63.90             | 60.78
Transformer (E=6) | 34.28           | 64.33             | 61.26
→ Transformer models outperformed the CRNN model
Experimental results
[Figure: per-class scores for CRNN vs. Transformer]
Experimental results
– The Blender and Dishes classes improved in particular (+10.4% and +13.5%)
⇒ effective for repeatedly appearing sounds
Experimental results
Attention pooling vs. tag token

Method                             | Encoder stack | Event-based [%] | Segment-based [%] | Frame-based [%]
Self-attention + attention pooling | 3             | 33.99           | 65.95             | 62.36
Self-attention + attention pooling | 6             | 33.84           | 65.61             | 62.10
Self-attention + tag token         | 3             | 34.27           | 65.07             | 61.85
Self-attention + tag token         | 6             | 34.28           | 64.33             | 61.26
→ The two approaches perform comparably
Prediction example
Visualization of attention weights
Conclusion
● Proposed method
– Weakly-supervised training for SED with self-attention and a tag token
• Self-attention: effective sequence modeling using local and global context
• Tag token: aggregates tag information through self-attention
● Results
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
– Effective for repeatedly appearing sounds