CNN architectures for large-scale audio classification CONFERENCE PAPER REVIEW, EXPLANATION

Marmara University,
Electrical and Electronics Engineering
Spring 2020, EEE7000 – Seminar
CONFERENCE PAPER
S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO
CLASSIFICATION," 2017 IEEE INTERNATIONAL CONFERENCE ON
ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW
ORLEANS, LA, 2017, PP. 131-135.
by Mehmet Çağrı Aksoy
24/04/2020

What is Audio Classification and CNN?
Audio classification is the process of analyzing audio recordings and assigning them to
categories. Also known as sound classification, this process is at the heart of a variety of modern
AI technologies, including virtual assistants, automatic speech recognition, and text-to-speech
applications. [1]
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural
networks, most commonly applied to analyzing visual imagery. They are also known as shift
invariant or space invariant artificial neural networks (SIANN), based on their shared-weights
architecture and translation invariance characteristics. [2]

Index Terms of this Paper
Acoustic Event Detection, Acoustic Scene Classification, Convolutional Neural Networks, Deep
Neural Networks, Video Classification

What is the main event?
The main purpose of this work is “Acoustic Event Detection” (AED), also referred to as
“Audio Classification”.
Historically, audio classification tasks have been addressed with classical classifiers such as
GMMs, HMMs, and SVMs. More recent approaches use some form of DNN (Deep Neural Network),
including CNNs and RNNs.

What is the main event? Cont.
Prior work has been reported on datasets such as TRECVid, ActivityNet, Sports1M and DCASE
Acoustic scenes 2016, all of which are much smaller than the dataset used in this paper.

What are the issues they are facing?
Problem Statements
Datasets
Audio file overview
Data Exploration
Data Pre-processing
Extract Features
Building the model
Observing the results

YouTube-100M
The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M
evaluation videos, and a pool of 20M videos that they use for validation. Videos average 4.6
minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more
topic identifiers from a set of 30,871 labels.

YouTube-100M Cont.
Some labels in the dataset are wrong, so the authors also have to cope with this label noise.
Being machine generated, the labels are not 100% accurate and of the 30K labels, some are
clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear
annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are
often labeled “Entertainment” as well, although no hierarchy is enforced.

Audio file overview & Data Exploration
The audio is divided into non-overlapping 960 ms frames.
This gave approximately 20 billion examples from the 70M videos.
Each frame inherits all the labels of its parent video.
The 960 ms frames are decomposed with a short-time Fourier transform applying 25 ms
windows every 10 ms.
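The example count can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted on these slides (70M training videos, 4.6 minutes average length, 960 ms frames, 10 ms hop); the variable names are illustrative and not taken from the paper.

```python
# Back-of-the-envelope check of the dataset figures quoted on the slides.
TRAIN_VIDEOS = 70_000_000   # training videos
AVG_MINUTES = 4.6           # average video length in minutes
FRAME_SECONDS = 0.960       # length of one non-overlapping frame
HOP_SECONDS = 0.010         # STFT hop (25 ms windows every 10 ms)

total_seconds = TRAIN_VIDEOS * AVG_MINUTES * 60
num_examples = total_seconds / FRAME_SECONDS
training_hours = total_seconds / 3600

# One 960 ms frame spans 960 / 10 = 96 STFT hops.
hops_per_frame = round(FRAME_SECONDS / HOP_SECONDS)

print(f"{num_examples / 1e9:.1f} billion examples")
print(f"{training_hours / 1e6:.2f} million training hours")
print(f"{hops_per_frame} STFT hops per frame")
```

The result is roughly 20 billion examples and about 5.4 million hours, in line with the numbers on the slides.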

The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude
of each bin is log transformed after adding a small offset to avoid numerical issues.
This gives log-mel spectrogram patches of 96 x 64 bins that form the input to all classifiers.
During training they fetch mini-batches of 128 input examples by randomly sampling from all
patches.
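The feature pipeline described above can be sketched with NumPy alone. This is an illustration, not the authors' code: the 16 kHz sample rate, Hann window, 512-point FFT, 125-7500 Hz mel range, and the 0.01 log offset are all assumptions, since the slides do not state them. For simplicity the sketch computes the spectrogram over the whole clip and then cuts it into 96-frame patches.

```python
import numpy as np

# Assumed settings (not given on the slides).
SAMPLE_RATE = 16_000
N_FFT = 512
LOG_OFFSET = 0.01
# Settings taken from the slides.
WINDOW = 400        # 25 ms at 16 kHz
HOP = 160           # 10 ms at 16 kHz
N_MELS = 64
PATCH_FRAMES = 96   # 96 frames x 10 ms = 960 ms

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(fmin=125.0, fmax=7500.0):
    """Triangular filters, shape (N_FFT // 2 + 1, N_MELS)."""
    fft_freqs = np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), N_MELS + 2))
    bank = np.zeros((N_FFT // 2 + 1, N_MELS))
    for i in range(N_MELS):
        low, center, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_patches(wave):
    """Return log-mel patches of shape (n_patches, 96, 64)."""
    n_frames = 1 + (len(wave) - WINDOW) // HOP
    index = HOP * np.arange(n_frames)[:, None] + np.arange(WINDOW)[None, :]
    frames = wave[index] * np.hanning(WINDOW)
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    # Integrate into mel bins, then log-transform with a small offset.
    log_mel = np.log(magnitude @ mel_filterbank() + LOG_OFFSET)
    n_patches = n_frames // PATCH_FRAMES
    usable = log_mel[: n_patches * PATCH_FRAMES]
    return usable.reshape(n_patches, PATCH_FRAMES, N_MELS)

# Usage: 3 seconds of noise stands in for a real soundtrack.
rng = np.random.default_rng(0)
wave = rng.standard_normal(3 * SAMPLE_RATE)
patches = log_mel_patches(wave)
# Mini-batch of 128 patches sampled at random (with replacement here).
batch = patches[rng.integers(0, len(patches), size=128)]
print(patches.shape, batch.shape)
```

In a real pipeline each patch would also carry the multi-hot label vector of its parent video.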

Spectrogram examples

System Model
They have used various CNN architectures to classify the soundtracks of a dataset of 70M
training videos (5.24 million hours) with 30,871 video-level labels.
They examine fully connected Deep Neural Networks (DNNs), AlexNet, VGG, Inception, and
ResNet.

System Model Cont.
All experiments used TensorFlow and were trained asynchronously on multiple GPUs using the
Adam optimizer.
Batch normalization was applied after all convolutional layers.
All models used a final sigmoid layer rather than a softmax layer since each example can have
multiple labels. Cross-entropy was the loss function.
In view of the large training set size, they did not use dropout, weight decay, or other common
regularization techniques.
For the models trained on 7M or more examples, they saw no evidence of overfitting during
training.
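The reason for a sigmoid output with cross-entropy, rather than a softmax, can be seen in a small NumPy sketch. It is not the authors' TensorFlow code; the function name and the toy logits are made up for illustration. Each class gets an independent probability, so several labels can be "on" for the same example.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean binary cross-entropy over all (example, class) pairs.

    Uses the numerically stable form
        max(x, 0) - x * z + log(1 + exp(-|x|))
    which equals -[z * log(s) + (1 - z) * log(1 - s)] with s = sigmoid(x).
    """
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    per_class = np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return per_class.mean()

# Two examples, three classes; labels are multi-hot, not one-hot.
logits = np.array([[4.0, -3.0, 2.5],
                   [-2.0, -4.0, 3.0]])
labels = np.array([[1, 0, 1],
                   [0, 0, 1]])

probs = 1.0 / (1.0 + np.exp(-logits))
loss = sigmoid_cross_entropy(logits, labels)

# Sigmoid outputs are independent, so a row need not sum to 1.
print(probs.sum(axis=1))
print(loss)
```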

CNN Architectures
Their baseline is a fully connected DNN, which they compared to several networks closely
modeled on successful image classifiers: AlexNet, VGG, Inception, and ResNet.
For their baseline experiments, they trained and evaluated using only the 10% most frequent
labels of the original 30K (i.e., 3K labels).
For each experiment, they tuned the number of GPUs and the learning rate for frame-level
classification accuracy.

Fully Connected
Their baseline network is a fully connected model with ReLU activations, N layers, and M units
per layer.
They swept over N = [2, 3, 4, 5, 6] and M = [500, 1000, 2000, 3000, 4000].
Their best performing model had N = 3 layers, M = 1000 units, a learning rate of 3e-5, 10 GPUs
and 5 parameter servers.
This network has approximately 11.2M weights and 11.2M multiplies.
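The 11.2M figure can be reproduced by counting the weight matrices. The sketch below assumes the input is the flattened 96 x 64 log-mel patch and the output layer has 3087 units (the 10% label subset); biases are ignored. The helper name is illustrative.

```python
def dense_weight_count(layer_sizes):
    """Number of weights in a stack of fully connected layers (no biases)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

INPUT_UNITS = 96 * 64       # flattened log-mel patch
HIDDEN = [1000] * 3         # best model: N = 3 layers, M = 1000 units
OUTPUT_UNITS = 3087         # assumption: one output per label in the 3K subset

weights = dense_weight_count([INPUT_UNITS, *HIDDEN, OUTPUT_UNITS])
print(weights)  # 11231000
```

In a fully connected network every weight is used in exactly one multiplication per example, which is why the weight and multiply counts are the same.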

AlexNet
The original AlexNet architecture was designed for a 224 x 224 x 3 input with an initial 11 x 11
convolutional layer with a stride of 4. Because their inputs are 96 x 64, they use a stride of 2 x 1
so that the number of activations is similar after the initial layer.
They also use batch normalization after each convolutional layer instead of local response
normalization (LRN) and replace the final 1000-unit layer with a 3087-unit layer.
While the original AlexNet has approximately 62.4M weights and 1.1G multiplies, their version
has 37.3M weights and 767M multiplies.
Also, for simplicity, unlike the original AlexNet, they do not split filters across multiple devices.
They trained with 20 GPUs and 10 parameter servers.
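The effect of the 2 x 1 stride on the first layer can be checked numerically. This sketch assumes "same"-style padding, so a stride-s layer produces ceil(size / s) outputs per dimension; the exact padding used in the paper is not stated on the slides, and the function name is illustrative.

```python
import math

def first_layer_positions(height, width, stride_h, stride_w):
    """Spatial positions after one strided conv, assuming 'same' padding."""
    return math.ceil(height / stride_h) * math.ceil(width / stride_w)

image_positions = first_layer_positions(224, 224, 4, 4)   # original AlexNet
audio_positions = first_layer_positions(96, 64, 2, 1)     # audio variant
audio_unchanged = first_layer_positions(96, 64, 4, 4)     # if stride 4 were kept

print(image_positions, audio_positions, audio_unchanged)  # 3136 3072 384
```

With the original stride of 4, the much smaller audio patch would keep only about an eighth as many positions per filter, which is what the 2 x 1 stride avoids.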

VGG
The only changes they made to VGG were to the final layer (3087 units with a sigmoid) as well as
the use of batch normalization instead of LRN.
While the original network had 144M weights and 20B multiplies, the audio variant uses 62M
weights and 2.4B multiplies.
They tried another variant that reduced the initial strides (as they did with AlexNet), but found
that not modifying the strides resulted in faster training and better performance.
With their setup, parallelizing beyond 10 GPUs did not help significantly, so they trained with 10
GPUs and 5 parameter servers.

Inception V3
They modified the Inception V3 network by removing the first four layers of the stem, up to and
including the MaxPool, as well as removing the auxiliary network.
They changed the Average Pool size to 10 x 6 to reflect the change in activations.
The original network has 27M weights with 5.6B multiplies, and the audio variant has 28M
weights and 4.7B multiplies.
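The 10 x 6 pooling size follows from tracing the spatial size of a 96 x 64 patch through the layers that remain. The sketch assumes the commonly published Inception V3 layout after the removed stem layers: a 1 x 1 conv, a 3 x 3 conv, a 3 x 3 max pool with stride 2, and two stride-2 reduction blocks, all unpadded ("valid"). That layer list is an assumption, not something stated on the slides.

```python
def valid_out(size, kernel, stride):
    """Output length of an unpadded ('valid') conv or pooling layer."""
    return (size - kernel) // stride + 1

# (name, kernel, stride) for the layers that change the spatial size.
REMAINING_LAYERS = [
    ("conv 1x1", 1, 1),
    ("conv 3x3", 3, 1),
    ("max pool 3x3 / 2", 3, 2),
    ("reduction block A", 3, 2),
    ("reduction block B", 3, 2),
]

def final_map_size(height, width):
    for _name, kernel, stride in REMAINING_LAYERS:
        height = valid_out(height, kernel, stride)
        width = valid_out(width, kernel, stride)
    return height, width

print(final_map_size(96, 64))   # (10, 6)
```

As a cross-check under the same assumption, a 73 x 73 map (the size reaching these layers in the original 299 x 299 network) ends at 8 x 8, the usual Inception V3 pooling size.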

ResNet-50
They modified ResNet-50 by removing the stride of 2 from the first 7 x 7 convolution so that the
number of activations was not too different in the audio version.
They changed the Average Pool size to 6 x 4 to reflect the change in activations.
The original network has 26M weights and 3.8B multiplies. The audio variant has 30M weights
and 1.9B multiplies.
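The 6 x 4 pooling size can be derived from the strides alone. The sketch assumes the standard ResNet-50 stride pattern (first conv, max pool, then four residual stages) with padding such that each stride-s layer yields ceil(size / s) outputs; the names are illustrative.

```python
import math

# Strides of: conv1, max pool, conv2_x, conv3_x, conv4_x, conv5_x.
ORIGINAL_STRIDES = [2, 2, 1, 2, 2, 2]
AUDIO_STRIDES = [1, 2, 1, 2, 2, 2]   # stride of 2 removed from conv1

def final_map_size(height, width, strides):
    for stride in strides:
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
    return height, width

print(final_map_size(224, 224, ORIGINAL_STRIDES))  # (7, 7)
print(final_map_size(96, 64, AUDIO_STRIDES))       # (6, 4)
```

Removing one stride of 2 halves the total downsampling from 32 to 16, which keeps the final feature map of the small audio patch close in size to the 7 x 7 map of the image network.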

Performance Metrics
mAP -> mean Average Precision
AUC -> area under the Receiver Operating Characteristic (ROC) curve
d-prime -> the separation between the means of the signal and noise
distributions
Higher mAP values are better.
Higher d-prime values are better.
Perfect classification achieves AUC of 1.0, and random guessing gives an AUC of 0.5
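A small sketch of two of these metrics, using only the Python standard library. The d-prime formula used here, d' = sqrt(2) * Phi^-1(AUC), is the standard equal-variance Gaussian relation; the slides do not give the formula, so treat it as an assumption. The function names and toy data are illustrative.

```python
import math
from statistics import NormalDist

def d_prime(auc):
    """d-prime from AUC, assuming equal-variance Gaussian score distributions."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Random guessing (AUC 0.5) gives d-prime 0; higher AUC gives higher d-prime.
print(d_prime(0.5), d_prime(0.904))

# Positives at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

mAP is this average precision computed per class and then averaged over classes. For an AUC of 0.904 the formula gives a d-prime of roughly 1.85.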

Results
Table 2 shows the evaluation results calculated over the 100K balanced videos.
All CNNs beat the fully-connected baseline. 
Inception and ResNet achieve the best performance:
◦ They provide high model capacity, and their convolutional units can efficiently capture common
structures that may occur in different areas of the input array, for both image and audio
representations.

Results of comparison between architectures

Results of varying label set size

Results of training with different amount of data

AED with the Audio Set Dataset
Audio Set is a dataset of over 1 million 10-second excerpts labeled with a vocabulary of acoustic
events.
They train two fully connected models to predict labels for Audio Set.
The first model uses 64 x 20 log-mel patches as inputs, and the second uses the output of the
“embedding” layer of their best ResNet model.
The log-mel baseline achieves a balanced mAP of 0.137 and AUC of 0.904 (equivalent to a
d-prime of 1.846).
The model trained on embeddings achieves mAP / AUC / d-prime of 0.314 / 0.959 / 2.452.
This jump in performance reflects the benefit of the larger YouTube-100M training set embodied
in the ResNet classifier outputs.


Conclusions
The results show that state-of-the-art image networks are capable of excellent results on audio
classification when compared to a simple fully connected network or earlier image classification
architectures.
Their results show that training on larger label set vocabularies can improve
performance, albeit modestly, when evaluating on smaller label sets.
They saw that increasing the number of videos up to 7M improves performance for the
best-performing ResNet-50 architecture. They note that regularization could have reduced the gap
between the models trained on smaller datasets and the 7M and 70M datasets.
They see a significant increase over their baseline when training a model for AED with ResNet
embeddings on the Audio Set dataset.

What do we need to do to move forward?
Creating more precise architectures.
Training times are very long; either faster hardware or faster architectures are needed.
Removing noisy data from the dataset.
Training the model with more labels so that it can detect a wider variety of sounds.

What have you learned?
The importance of dataset size.
Differences between CNN architectures and their responses.
CNN behavior on audio classification.
How label set size and data size affect the results.

References
[1] https://lionbridge.ai/articles/what-is-audio-classification/
[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
[3] https://www.researchgate.net/figure/Polyphonic-acoustic-event-detection-task_fig2_322910427
[4] http://150.162.46.34:8080/icassp2017/pdfs/0000131.pdf

Q&A