Audio tagging system using densely connected convolutional networks
Presented by: Il-Young Jeong
Authors: Il-Young Jeong and Hyungui Lim
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
20 November 2018, Surrey, UK
Introduction: DCASE 2018 challenge task 2
General-purpose audio tagging of Freesound content with AudioSet labels
• Classifying sound events of very diverse nature including:
- musical instruments
- human sounds
- domestic sounds
- animals
- etc.
• Dataset: Subset of Freesound Dataset with AudioSet Ontology
Introduction: DCASE 2018 challenge task 2
Difficulty of the task was due to:
• Varied input length
  - from 300 ms to 30 s
• Insufficient training data
  - ~9.5k recordings for 41 classes
• Imbalanced class distribution
  - from 94 to 300 samples per class
• Unreliable annotation
  - only ~40% of labels were verified

Introduction: DCASE 2018 challenge task 2
Our Solutions
• Segment-wise learning
  - addresses varied input length (from 300 ms to 30 s)
• Strong augmentation (mixup)
  - addresses insufficient training data (~9.5k recordings for 41 classes)
• Evenly-distributed batch
  - addresses imbalanced class distribution (from 94 to 300 samples per class)
• Batch-wise loss masking
  - addresses unreliable annotation (only ~40% of labels were verified)
• Ensemble approach
Framework: (On-the-fly) Preprocessing
• All preprocessing steps are performed at each batch generation.
  - Pros: fast to experiment with various settings
  - Cons: extra computation during batch generation
• Pipeline: Segmentation → Mixup augmentation → T-F representation
  - Segmentation: long data → take excerpts; short data → zero-padding (see the sketch below)
  - Mixup augmentation: new data generated by mixing two segments
  - T-F representation: raw waveform or logmel; faster operation on GPU, thanks to kapre
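As a rough illustration of the segmentation step (not the authors' exact code), the sketch below crops a random excerpt from long clips and zero-pads short ones; the name `segment_len` and the random-crop policy are assumptions.

```python
import numpy as np

def segment(waveform: np.ndarray, segment_len: int,
            rng: np.random.Generator) -> np.ndarray:
    """Random excerpt for long clips, zero-padding for short ones."""
    n = len(waveform)
    if n >= segment_len:
        start = rng.integers(0, n - segment_len + 1)   # random crop position (assumed)
        return waveform[start:start + segment_len]
    padded = np.zeros(segment_len, dtype=waveform.dtype)  # zero-padding for short clips
    padded[:n] = waveform
    return padded

rng = np.random.default_rng(0)
clip = rng.standard_normal(30 * 16000)         # a 30 s clip at 16 kHz
x = segment(clip, segment_len=64000, rng=rng)  # 64,000 samples, as in the experiments
```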
Framework: Evenly distributed batch generation
• Mini-batch learning: updates the model using a subset of the training data.
• Randomly selected batch: randomly selects N samples from the training set.
  - Does not guarantee that a mini-batch contains all the classes.
  - Inherits the imbalanced class distribution of the whole training set.
• Evenly distributed batch: choose M samples per class, so N = M * C (a sampler sketch follows this list).
  - Every mini-batch contains all the classes.
  - Has a balanced class distribution.
  - (Empirically) shows more stable and faster convergence.
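A minimal sketch of an evenly distributed batch sampler, assuming the training set is indexed per class in a dict `by_class` (a name chosen here); with M samples per class and C classes, the batch size is N = M * C.

```python
import numpy as np

def even_batch(by_class: dict, m_per_class: int,
               rng: np.random.Generator) -> np.ndarray:
    """Draw M indices per class so each mini-batch covers all C classes."""
    picks = [rng.choice(idxs, size=m_per_class,
                        replace=len(idxs) < m_per_class)   # oversample tiny classes
             for idxs in by_class.values()]
    batch = np.concatenate(picks)   # N = M * C indices, balanced by construction
    rng.shuffle(batch)
    return batch

# toy example: 3 classes of different sizes -> balanced batch of N = 2 * 3 = 6
by_class = {0: np.arange(0, 94), 1: np.arange(94, 300), 2: np.arange(300, 500)}
batch = even_batch(by_class, m_per_class=2, rng=np.random.default_rng(0))
```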
Framework: Mixup augmentation
• Mixup: data augmentation using linear interpolation between two data points.
• We used mixup to train the model to predict the relative scale of the data, rather than as a binary classification.
• Notation: x: data, t: label, λ: mixing parameter, w: scale parameter
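The equations themselves were images in the original slides; for reference, standard mixup (Zhang et al., 2018) interpolates inputs and labels as below. The authors' variant predicts the relative scale of the mixed data, and how the scale parameter w enters their formulation is not recoverable from this export, so this sketch shows only the standard form.

```python
import numpy as np

def mixup(x1, t1, x2, t2, alpha: float, rng: np.random.Generator):
    """Standard mixup: linearly interpolate two samples and their labels."""
    lam = rng.beta(alpha, alpha)       # mixing parameter lambda ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2    # mixed input
    t = lam * t1 + (1.0 - lam) * t2    # soft label reflecting the mixing ratio
    return x, t
```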
Framework: Architecture
[Architecture diagram: Waveform → Low-level-k0 → DenseNet-k1 → … → DenseNet-kh (h modules) → n-head Classifier → prediction, e.g. 'Cello']
[Module details from the diagram: (a) Low-level-k module: Logmel, BN + Reshape, 3x3 Conv (k), Concatenate, BN; (b) DenseNet-k module: BN + ReLU + 1x1 Conv (k), BN + ReLU + 3x3 Conv (k), Concatenate, SE, 2x2 MaxPool; (c) n-head classifier module: Dense (n multi-head), GAP + Softmax, Average]
• End-to-end DenseNet
• Frequency-wise BN
• Squeeze-and-Excitation network
• Multi-head softmax
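Of the listed components, the Squeeze-and-Excitation block follows a well-known recipe (Hu et al., 2018); a minimal Keras sketch is below, with the reduction `ratio` being an illustrative choice rather than the paper's setting.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio: int = 8):
    """Squeeze-and-Excitation: reweight channels using global statistics."""
    ch = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)            # squeeze: per-channel mean
    s = layers.Dense(ch // ratio, activation="relu")(s)
    s = layers.Dense(ch, activation="sigmoid")(s)     # excitation: channel weights
    s = layers.Reshape((1, 1, ch))(s)
    return layers.Multiply()([x, s])                  # rescale the feature maps
```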
Framework: End-to-end DenseNet
• DenseNet: densely connected network
  f_dense(x) = concatenate(f(x), x)
• Allows a direct path for backpropagation.
• End-to-end DenseNet: all layers, from the input (logmel) to the output (loss), are densely connected via concatenation.
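A minimal Keras sketch of one densely connected step, i.e. f_dense(x) = concatenate(f(x), x); the layer shapes and growth rate are illustrative, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_step(x, growth_k: int):
    """One dense connection: concatenate new features with the input."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(growth_k, 3, padding="same")(y)
    return layers.Concatenate()([y, x])   # f_dense(x) = concat(f(x), x)

inp = layers.Input(shape=(64, 128, 8))    # (freq, time, channels), illustrative
out = dense_step(inp, growth_k=16)        # output carries 16 + 8 channels
model = tf.keras.Model(inp, out)
```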
Framework: Multi-head softmax
• Replaces the single softmax layer with the average of multiple softmax outputs.
• Why?
  - Gives a good initialization, with predictions close to 0.5, which helps especially for mixup targets.
  - Makes predictions near 0.5 easier to produce.
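A sketch of the idea in Keras, simplified to n dense+softmax heads over a pooled feature vector and then averaged; the paper's exact head layout (Dense → GAP + softmax → average) may differ slightly from this.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_head_softmax(features, n_heads: int, n_classes: int):
    """Average of n independent softmax heads instead of a single softmax."""
    heads = [layers.Dense(n_classes, activation="softmax")(features)
             for _ in range(n_heads)]
    return layers.Average()(heads)   # averaging pulls early predictions toward uniform

inp = layers.Input(shape=(256,))     # pooled feature vector (illustrative size)
out = multi_head_softmax(inp, n_heads=4, n_classes=41)  # 41 classes in this task
model = tf.keras.Model(inp, out)
```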
Framework: Batch-wise loss masking (1)
• Categorical cross-entropy for a mini-batch:
  L = -(1/N) Σ_n Σ_c t_{n,c} log y_{n,c}
• Masked loss, when falsely annotated data is known:
  L_masked = -(1/Σ_n m_n) Σ_n m_n Σ_c t_{n,c} log y_{n,c}
  where m_n = 1 when the n-th data point has a true label, and m_n = 0 when it has a false label.
Framework: Batch-wise loss masking (2)
• Our solution: remove the outliers with the highest loss from the gradient calculation.
  - x may be falsely annotated if:
    1) it is non-verified, and
    2) it shows the highest (or a similarly high) loss in the current batch/iteration.
• Efficient computation of max(loss) using batch-wise calculation (a sketch follows).
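A numpy sketch of the batch-wise masking rule: compute per-sample cross-entropy, then drop non-verified samples whose loss is at (or near) the batch maximum from the average; the tolerance `rel_tol` is a hypothetical knob, not a value from the slides.

```python
import numpy as np

def masked_cross_entropy(y_pred, t, verified, rel_tol: float = 0.9):
    """Mask the highest-loss non-verified samples out of the batch loss."""
    eps = 1e-7
    per_sample = -np.sum(t * np.log(y_pred + eps), axis=1)   # CE per sample
    threshold = rel_tol * per_sample.max()                   # near the batch max (assumed rule)
    suspect = (~verified) & (per_sample >= threshold)        # likely false labels
    keep = ~suspect
    return per_sample[keep].mean() if keep.any() else 0.0
```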
Experimental results
• Audio segment: 64,000 samples for all experiments
  - 16 kHz / 4 s, 32 kHz / 2 s, 44.1 kHz / 1.45 s
• Input domain: logmel or waveform
• MAP@3 results
[Results table shown as an image in the original slides.]
Images from https://www.kaggle.com/fizzbuzz/beginner-s-guide-to-audio-data
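MAP@3, the task's evaluation metric, scores each clip 1, 1/2, or 1/3 according to the rank of the true label among three guesses (0 if absent); a minimal sketch, assuming one true label per clip:

```python
import numpy as np

def map_at_3(y_true, y_prob):
    """Mean average precision at 3 for single-label clips."""
    top3 = np.argsort(-y_prob, axis=1)[:, :3]     # top-3 class indices per clip
    scores = []
    for truth, preds in zip(y_true, top3):
        hits = np.where(preds == truth)[0]
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))

# toy check: true class ranked 2nd -> score 0.5
print(map_at_3(np.array([1]), np.array([[0.5, 0.4, 0.1]])))
```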
Future work
• Verifying the ideas with additional experiments
• Model size minimization
• Implementation for real-world applications
Thank you!
• We thank @Zafar and @daisukelab, who provided wonderful kernels and discussions for the task.
• If you are interested in Cochlear.ai, please visit www.cochlear.ai