Audio tagging system using densely connected convolutional networks
Presented by: Il-Young Jeong
Authors: Il-Young Jeong and Hyungui Lim
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
20 November 2018, Surrey, UK
Introduction: DCASE 2018 challenge task 2
General-purpose audio tagging of Freesound content with AudioSet labels
• Classifying sound events of very diverse nature including:
- musical instruments
- human sounds
- domestic sounds
- animals
- etc.
• Dataset: Subset of Freesound Dataset with AudioSet Ontology
Introduction: DCASE 2018 challenge task 2
Difficulty of the task was due to:
• Varied input length
  - from 300 ms to 30 s
• Insufficient training data
  - ~9.5k recordings for 41 classes
• Imbalanced class distribution
  - from 94 to 300 samples per class
• Unreliable annotation
  - only ~40% of labels were verified

Introduction: DCASE 2018 challenge task 2
Our Solutions
• Segment-wise learning
  - addresses varied input length (from 300 ms to 30 s)
• Strong augmentation (mixup)
  - addresses insufficient training data (~9.5k recordings for 41 classes)
• Evenly-distributed batch
  - addresses imbalanced class distribution (from 94 to 300 samples per class)
• Batch-wise loss masking
  - addresses unreliable annotation (only ~40% of labels were verified)
• Ensemble approach
Framework: (On-the-fly) Preprocessing
• All preprocessing steps are performed at each batch generation.
  - Pros: fast to experiment with various settings
  - Cons: extra computation during batch generation
• Pipeline: Segmentation → Mixup augmentation → T-F representation
  - Segmentation: long data → take excerpts; short data → zero-padding (see the sketch below)
  - Mixup augmentation: new data generated by mixing two segments
  - T-F representation: raw waveform or logmel; faster operation on GPU, thanks to kapre
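As a rough illustration of the segmentation step (not the authors' exact code), the sketch below crops a random excerpt from long clips and zero-pads short ones; the name `segment_len` and the random-crop policy are assumptions.

```python
import numpy as np

def segment(waveform: np.ndarray, segment_len: int,
            rng: np.random.Generator) -> np.ndarray:
    """Random excerpt for long clips, zero-padding for short ones."""
    n = len(waveform)
    if n >= segment_len:
        start = rng.integers(0, n - segment_len + 1)   # random crop position (assumed)
        return waveform[start:start + segment_len]
    padded = np.zeros(segment_len, dtype=waveform.dtype)  # zero-padding for short clips
    padded[:n] = waveform
    return padded

rng = np.random.default_rng(0)
clip = rng.standard_normal(30 * 16000)         # a 30 s clip at 16 kHz
x = segment(clip, segment_len=64000, rng=rng)  # 64,000 samples, as in the experiments
```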
Framework: Evenly distributed batch generation
• Mini-batch learning: updates the model using a subset of the training data.
• Randomly selected batch: randomly selects N samples from the training set.
  - Does not guarantee that a mini-batch contains all the classes.
  - Inherits the imbalanced class distribution of the whole training set.
• Evenly distributed batch: choose M samples per class, so N = M * C (a sampler sketch follows this list).
  - Every mini-batch contains all the classes.
  - Has a balanced class distribution.
  - (Empirically) shows more stable and faster convergence.
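A minimal sketch of an evenly distributed batch sampler, assuming the training set is indexed per class in a dict `by_class` (a name chosen here); with M samples per class and C classes, the batch size is N = M * C.

```python
import numpy as np

def even_batch(by_class: dict, m_per_class: int,
               rng: np.random.Generator) -> np.ndarray:
    """Draw M indices per class so each mini-batch covers all C classes."""
    picks = [rng.choice(idxs, size=m_per_class,
                        replace=len(idxs) < m_per_class)   # oversample tiny classes
             for idxs in by_class.values()]
    batch = np.concatenate(picks)   # N = M * C indices, balanced by construction
    rng.shuffle(batch)
    return batch

# toy example: 3 classes of different sizes -> balanced batch of N = 2 * 3 = 6
by_class = {0: np.arange(0, 94), 1: np.arange(94, 300), 2: np.arange(300, 500)}
batch = even_batch(by_class, m_per_class=2, rng=np.random.default_rng(0))
```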
Framework: Mixup augmentation
• Mixup: data augmentation using linear interpolation between two data points.
• We used mixup to train the model to predict the relative scale of the data, rather than as a binary classification.
• Notation: x: data, t: label, λ: mixing parameter, w: scale parameter
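The equations themselves were images in the original slides; for reference, standard mixup (Zhang et al., 2018) interpolates inputs and labels as below. The authors' variant predicts the relative scale of the mixed data, and how the scale parameter w enters their formulation is not recoverable from this export, so this sketch shows only the standard form.

```python
import numpy as np

def mixup(x1, t1, x2, t2, alpha: float, rng: np.random.Generator):
    """Standard mixup: linearly interpolate two samples and their labels."""
    lam = rng.beta(alpha, alpha)       # mixing parameter lambda ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2    # mixed input
    t = lam * t1 + (1.0 - lam) * t2    # soft label reflecting the mixing ratio
    return x, t
```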
Framework: Architecture
[Architecture diagram: Waveform → Low-level-k0 → DenseNet-k1 → … → DenseNet-kh (h modules) → n-head Classifier → prediction, e.g. 'Cello']
[Module details from the diagram: (a) Low-level-k module: Logmel, BN + Reshape, 3x3 Conv (k), Concatenate, BN; (b) DenseNet-k module: BN + ReLU + 1x1 Conv (k), BN + ReLU + 3x3 Conv (k), Concatenate, SE, 2x2 MaxPool; (c) n-head classifier module: Dense (n multi-head), GAP + Softmax, Average]
• End-to-end DenseNet
• Frequency-wise BN
• Squeeze-and-Excitation network
• Multi-head softmax
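Of the listed components, the Squeeze-and-Excitation block follows a well-known recipe (Hu et al., 2018); a minimal Keras sketch is below, with the reduction `ratio` being an illustrative choice rather than the paper's setting.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio: int = 8):
    """Squeeze-and-Excitation: reweight channels using global statistics."""
    ch = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)            # squeeze: per-channel mean
    s = layers.Dense(ch // ratio, activation="relu")(s)
    s = layers.Dense(ch, activation="sigmoid")(s)     # excitation: channel weights
    s = layers.Reshape((1, 1, ch))(s)
    return layers.Multiply()([x, s])                  # rescale the feature maps
```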
Framework: End-to-end DenseNet
• DenseNet: densely connected network
  f_dense(x) = concatenate(f(x), x)
• Allows a direct path for backpropagation.
• End-to-end DenseNet: all layers, from the input (logmel) to the output (loss), are densely connected via concatenation.
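A minimal Keras sketch of one densely connected step, i.e. f_dense(x) = concatenate(f(x), x); the layer shapes and growth rate are illustrative, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_step(x, growth_k: int):
    """One dense connection: concatenate new features with the input."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(growth_k, 3, padding="same")(y)
    return layers.Concatenate()([y, x])   # f_dense(x) = concat(f(x), x)

inp = layers.Input(shape=(64, 128, 8))    # (freq, time, channels), illustrative
out = dense_step(inp, growth_k=16)        # output carries 16 + 8 channels
model = tf.keras.Model(inp, out)
```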
Framework: Multi-head softmax
• Replaces the single softmax layer with the average of multiple softmax outputs.
• Why?
  - Gives a good initialization, with predictions close to 0.5, which helps especially for mixup targets.
  - Makes predictions near 0.5 easier to produce.
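A sketch of the idea in Keras, simplified to n dense+softmax heads over a pooled feature vector and then averaged; the paper's exact head layout (Dense → GAP + softmax → average) may differ slightly from this.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_head_softmax(features, n_heads: int, n_classes: int):
    """Average of n independent softmax heads instead of a single softmax."""
    heads = [layers.Dense(n_classes, activation="softmax")(features)
             for _ in range(n_heads)]
    return layers.Average()(heads)   # averaging pulls early predictions toward uniform

inp = layers.Input(shape=(256,))     # pooled feature vector (illustrative size)
out = multi_head_softmax(inp, n_heads=4, n_classes=41)  # 41 classes in this task
model = tf.keras.Model(inp, out)
```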
Framework: Batch-wise loss masking (1)
• Categorical cross-entropy for a mini-batch:
  L = -(1/N) Σ_n Σ_c t_{n,c} log y_{n,c}
• Masked loss, when falsely annotated data is known:
  L_masked = -(1/Σ_n m_n) Σ_n m_n Σ_c t_{n,c} log y_{n,c}
  where m_n = 1 when the n-th data point has a true label, and m_n = 0 when it has a false label.
Framework: Batch-wise loss masking (2)
• Our solution: remove the outliers with the highest loss from the gradient calculation.
  - x may be falsely annotated if:
    1) it is non-verified, and
    2) it shows the highest (or a similarly high) loss in the current batch/iteration.
• Efficient computation of max(loss) using batch-wise calculation (a sketch follows).
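A numpy sketch of the batch-wise masking rule: compute per-sample cross-entropy, then drop non-verified samples whose loss is at (or near) the batch maximum from the average; the tolerance `rel_tol` is a hypothetical knob, not a value from the slides.

```python
import numpy as np

def masked_cross_entropy(y_pred, t, verified, rel_tol: float = 0.9):
    """Mask the highest-loss non-verified samples out of the batch loss."""
    eps = 1e-7
    per_sample = -np.sum(t * np.log(y_pred + eps), axis=1)   # CE per sample
    threshold = rel_tol * per_sample.max()                   # near the batch max (assumed rule)
    suspect = (~verified) & (per_sample >= threshold)        # likely false labels
    keep = ~suspect
    return per_sample[keep].mean() if keep.any() else 0.0
```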
Experimental results
• Audio segment: 64,000 samples for all experiments
  - 16 kHz / 4 s, 32 kHz / 2 s, 44.1 kHz / 1.45 s
• Input domain: logmel or waveform
• MAP@3 results
[Results table shown as an image in the original slides.]
Images from https://www.kaggle.com/fizzbuzz/beginner-s-guide-to-audio-data
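MAP@3, the task's evaluation metric, scores each clip 1, 1/2, or 1/3 according to the rank of the true label among three guesses (0 if absent); a minimal sketch, assuming one true label per clip:

```python
import numpy as np

def map_at_3(y_true, y_prob):
    """Mean average precision at 3 for single-label clips."""
    top3 = np.argsort(-y_prob, axis=1)[:, :3]     # top-3 class indices per clip
    scores = []
    for truth, preds in zip(y_true, top3):
        hits = np.where(preds == truth)[0]
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))

# toy check: true class ranked 2nd -> score 0.5
print(map_at_3(np.array([1]), np.array([[0.5, 0.4, 0.1]])))
```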
Future work
• Verifying the ideas with additional experiments
• Model size minimization
• Implementation for real-world applications
Thank you!
• We thank @Zafar and @daisukelab, who provided wonderful kernels and discussions for the task.
• If you are interested in Cochlear.ai, please visit www.cochlear.ai