Marmara University,
Electrical and Electronics Engineering
Spring 2020, EEE7000 – Seminar
CONFERENCE PAPER
S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO
CLASSIFICATION," 2017 IEEE INTERNATIONAL CONFERENCE ON
ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW
ORLEANS, LA, 2017, PP. 131-135.
by Mehmet Çağrı Aksoy
24/04/2020
What is Audio Classification and CNN?
Audio classification is the process of listening to and analyzing audio recordings. Also known as
sound classification, this process is at the heart of a variety of modern AI technology including
virtual assistants, automatic speech recognition, and text to speech applications. [1]
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural
networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or
space invariant artificial neural networks (SIANN), based on their shared-weights architecture
and translation invariance characteristics. [2]
Index Terms of this Paper
Acoustic Event Detection, Acoustic Scene Classification, Convolutional Neural Networks, Deep
Neural Networks, Video Classification
What is the main event?
The main purpose of this work is “Acoustic Event Detection” (AED), also referred to as
“Audio Classification”.
Historically, audio classification has been addressed with features such as MFCCs and classifiers
based on GMMs, HMMs, NMF, or SVMs. More recent approaches use some form of DNN (Deep Neural
Network), including CNNs and RNNs.
What is the main event? Cont.
Prior work has been reported on datasets such as TRECVid, ActivityNet, Sports1M and DCASE
Acoustic Scenes 2016, all of which are much smaller than the dataset used in this paper.
What are the issues they are facing?
Problem Statements
Datasets
Audio file overview
Data Exploration
Data Pre-processing
Extract Features
Building the model
Observing the results
YouTube-100M
The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M
evaluation videos, and a pool of 20M videos that they use for validation. Videos average 4.6
minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more
topic identifiers from a set of 30,871 labels.
YouTube-100M Cont.
Some of the labels in the dataset are incorrect, so the models have to cope with this label noise.
Being machine generated, the labels are not 100% accurate and of the 30K labels, some are
clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear
annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are
often labeled “Entertainment” as well, although no hierarchy is enforced.
Audio file overview & Data Exploration
The audio is divided into non-overlapping 960 ms frames.
This gave approximately 20 billion examples from the 70M videos.
Each frame inherits all the labels of its parent video.
The 960 ms frames are decomposed with a short-time Fourier transform applying 25 ms
windows every 10 ms.
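The framing numbers above can be checked with a little arithmetic. The following is only a sketch: it assumes every training video has exactly the average length of 4.6 minutes and that each 960 ms frame yields one STFT column per 10 ms hop (how the 25 ms windows at the frame edges are handled is not stated on the slides).

```python
# Back-of-the-envelope check of the framing numbers quoted on the slides.
TRAIN_VIDEOS = 70_000_000   # training videos (from the slides)
AVG_MINUTES = 4.6           # average video length (from the slides)
FRAME_MS = 960              # length of one non-overlapping example
HOP_MS = 10                 # one STFT column every 10 ms


def examples_per_video(duration_ms: int) -> int:
    """Number of whole 960 ms frames that fit into one soundtrack."""
    return duration_ms // FRAME_MS


def stft_columns_per_frame() -> int:
    """STFT columns per frame, assuming one column per 10 ms hop."""
    return FRAME_MS // HOP_MS


avg_ms = round(AVG_MINUTES * 60_000)
total_examples = TRAIN_VIDEOS * examples_per_video(avg_ms)
total_hours = TRAIN_VIDEOS * AVG_MINUTES / 60

print(f"examples per average video: {examples_per_video(avg_ms)}")
print(f"total training examples:    {total_examples / 1e9:.1f} billion")
print(f"total training hours:       {total_hours / 1e6:.2f} million")
print(f"STFT columns per frame:     {stft_columns_per_frame()}")
```

Under these assumptions the estimate lands close to the 20 billion examples quoted above, and each frame contributes 96 spectrogram columns.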
The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude
of each bin is log transformed after adding a small offset to avoid numerical issues.
This gives log-mel spectrogram patches of 96 × 64 bins that form the input to all classifiers.
During training they fetch mini-batches of 128 input examples by randomly sampling from all
patches.
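A minimal sketch of three ideas from this slide: mel-spaced bins, the log with a small offset, and random mini-batch sampling. The band limits (125 Hz to 7500 Hz), the offset value of 0.01, the HTK-style mel formula and sampling without replacement are all assumptions made for illustration; the slides do not specify them, and the helper names are hypothetical.

```python
import math
import random

NUM_MEL_BINS = 64                         # from the slides
LOG_OFFSET = 0.01                         # assumed; slides say "a small offset"
MEL_MIN_HZ, MEL_MAX_HZ = 125.0, 7500.0    # assumed band limits


def hz_to_mel(hz: float) -> float:
    """HTK-style mel scale (one common convention, assumed here)."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_band_centers(num_bins: int, lo_hz: float, hi_hz: float) -> list:
    """Centre frequencies of bins spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(lo_hz), hz_to_mel(hi_hz)
    step = (hi - lo) / (num_bins + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_bins)]


def log_compress(magnitude: float, offset: float = LOG_OFFSET) -> float:
    """log(magnitude + offset); the offset keeps silent bins finite."""
    return math.log(magnitude + offset)


def sample_minibatch(patches: list, batch_size: int = 128) -> list:
    """Draw a mini-batch uniformly at random, without replacement."""
    return random.sample(patches, batch_size)


centers = mel_band_centers(NUM_MEL_BINS, MEL_MIN_HZ, MEL_MAX_HZ)
print(f"lowest / highest bin centre: {centers[0]:.0f} Hz / {centers[-1]:.0f} Hz")
print(f"log of a silent bin: {log_compress(0.0):.3f}")
print(f"mini-batch size: {len(sample_minibatch(list(range(1000))))}")
```

The bins are narrow at low frequencies and wide at high frequencies, which is the point of mel spacing.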
Spectrogram examples
System Model
They have used various CNN architectures to classify the soundtracks of a dataset of 70M
training videos (5.24 million hours) with 30,871 video-level labels.
They examine fully connected Deep Neural Networks (DNNs), AlexNet, VGG, Inception, and
ResNet.
System Model cont.
All experiments used TensorFlow and were trained asynchronously on multiple GPUs using the
Adam optimizer.
Batch normalization was applied after all convolutional layers.
All models used a final sigmoid layer rather than a softmax layer since each example can have
multiple labels. Cross-entropy was the loss function.
In view of the large training set size, they did not use dropout, weight decay, or other common
regularization techniques.
For the models trained on 7M or more examples, they saw no evidence of overfitting.
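The choice of a sigmoid output with cross-entropy can be illustrated in a few lines. This is a plain-Python sketch of a multi-label loss, not the authors' TensorFlow code; the logits and the three example labels are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def multilabel_cross_entropy(logits, targets) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the stable form max(x, 0) - x * t + log(1 + exp(-|x|)),
    which equals -[t * log(p) + (1 - t) * log(1 - p)] with p = sigmoid(x).
    """
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical frame carrying two of three labels at once.
labels = ["Trumpet", "Web Page", "Entertainment"]
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]

probabilities = [sigmoid(x) for x in logits]
for name, p in zip(labels, probabilities):
    print(f"{name}: {p:.3f}")
print(f"sum of probabilities: {sum(probabilities):.3f}")
print(f"loss: {multilabel_cross_entropy(logits, targets):.3f}")
```

Because each label gets its own independent sigmoid, the probabilities need not sum to 1, which is what allows several labels to be active for the same frame. A softmax would force the labels to compete.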
CNN Architectures
Their baseline is a fully connected DNN, which they compared to several networks closely
modeled on successful image classifiers.
The image-style networks they compared are AlexNet, VGG, Inception and ResNet.
For their baseline experiments, they trained and evaluated using only the 10% most frequent
labels of the original 30K (i.e., 3K labels).
For each experiment, they optimized the number of GPUs and the learning rate for frame-level
classification accuracy.
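Restricting the vocabulary to the most frequent 10% of labels can be sketched as follows. The function name and the toy video list are hypothetical; only the label-set size of 30,871 comes from the slides.

```python
from collections import Counter


def most_frequent_labels(video_labels, percent: int = 10) -> list:
    """Return the top `percent` % of labels, ranked by how many videos use them."""
    counts = Counter(label for labels in video_labels for label in labels)
    keep = len(counts) * percent // 100
    return [label for label, _ in counts.most_common(keep)]


# Size of the reduced vocabulary for the full 30,871-label set.
TOTAL_LABELS = 30_871
print(TOTAL_LABELS * 10 // 100)  # 3087

# Toy data: ten distinct labels, where label "L0" is the most frequent.
toy_videos = [[f"L{j}" for j in range(i, 10)] for i in range(10)]
toy_videos += [["L0"]] * 20
print(most_frequent_labels(toy_videos, percent=10))  # ['L0']
```

The count of 3087 matches the size of the output layer used by the networks in the baseline experiments.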
Fully Connected
Their baseline network is a fully connected model with ReLU activations, N layers, and M units
per layer.
The values they tried were N ∈ {2, 3, 4, 5, 6} and M ∈ {500, 1000, 2000, 3000, 4000}.
Their best performing model had N = 3 layers, M = 1000 units, learning rate of 3e-5, 10 GPUs and 5
parameter servers.
This network has approximately 11.2M weights and 11.2M multiplies.
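The weight count of the best baseline can be reproduced with simple arithmetic. This sketch assumes that N counts hidden layers, that the input is a flattened 96 × 64 patch, that the output layer has 3087 units (one per label in the reduced vocabulary), and that biases are not counted; none of these details is spelled out on the slide.

```python
def dense_weight_count(input_size: int, hidden_units: int,
                       hidden_layers: int, output_size: int) -> int:
    """Number of weights (biases excluded) in a fully connected network."""
    sizes = [input_size] + [hidden_units] * hidden_layers + [output_size]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


total = dense_weight_count(input_size=96 * 64, hidden_units=1000,
                           hidden_layers=3, output_size=3087)
print(f"{total:,}")  # 11,231,000
```

In a fully connected network every weight is used exactly once per example, which is consistent with the number of multiplies being the same as the number of weights.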
AlexNet
The original AlexNet architecture was designed for a 224 × 224 × 3 input with an initial 11 × 11
convolutional layer with a stride of 4. Because the audio inputs are 96 × 64, they use a stride of
2 × 1 so that the number of activations is similar after the initial layer.
They also use batch normalization after each convolutional layer instead of local response
normalization (LRN) and replace the final 1000-unit layer with a 3087-unit layer.
While the original AlexNet has approximately 62.4M weights and 1.1G multiplies, their version
has 37.3M weights and 767M multiplies.
Also, for simplicity, unlike the original AlexNet, they do not split filters across multiple devices.
They trained with 20 GPUs and 10 parameter servers.
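The stride choice can be checked by counting spatial positions after the first convolution. The sketch below assumes TensorFlow-style "SAME" padding, where the output size is ceil(input / stride); the padding actually used by the authors is not stated on the slides.

```python
import math


def same_padding_output(size: int, stride: int) -> int:
    """Output length of a strided convolution with 'SAME' padding."""
    return math.ceil(size / stride)


def first_layer_positions(height: int, width: int,
                          stride_h: int, stride_w: int) -> int:
    """Spatial positions per feature map after the first convolution."""
    return (same_padding_output(height, stride_h)
            * same_padding_output(width, stride_w))


image = first_layer_positions(224, 224, 4, 4)           # original AlexNet
audio = first_layer_positions(96, 64, 2, 1)             # audio variant
audio_unmodified = first_layer_positions(96, 64, 4, 4)  # stride 4 kept

print(image, audio, audio_unmodified)  # 3136 3072 384
```

Keeping the original stride of 4 would leave roughly an eighth of the activations, while the 2 × 1 stride brings the audio variant to within a few percent of the image network.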
VGG
The only changes they made to VGG were to the final layer (3087 units with a sigmoid) as well as
the use of batch normalization instead of LRN.
While the original network had 144M weights and 20B multiplies, the audio variant uses 62M
weights and 2.4B multiplies.
They tried another variant that reduced the initial strides (as they did with AlexNet), but found
that not modifying the strides resulted in faster training and better performance.
With their setup, parallelizing beyond 10 GPUs did not help significantly, so they trained with 10
GPUs and 5 parameter servers.
Inception V3
They modified the Inception V3 network by removing the first four layers of the stem, up to and
including the MaxPool, as well as removing the auxiliary network.
They changed the Average Pool size to 10 × 6 to reflect the change in activations.
The original network has 27M weights with 5.6B multiplies, and the audio variant has 28M
weights and 4.7B multiplies.
They trained with 40 GPUs and 20 parameter servers.
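The 10 × 6 pool size can be traced by following the spatial size through the layers that shrink it. This sketch assumes the commonly published Inception V3 layout: after the removed stem layers, one unpadded 3 × 3 convolution, one unpadded 3 × 3 max pool with stride 2, and two grid-reduction blocks that each apply an unpadded 3 × 3 stride-2 operation. Layers that preserve the size are omitted, and the layout itself is an assumption rather than something stated on the slides.

```python
def valid_output(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded ('VALID') convolution or pooling."""
    return (size - kernel) // stride + 1


def audio_variant_grid(size: int) -> int:
    """Size after the layers kept in the audio variant (assumed layout)."""
    size = valid_output(size, 3, 1)  # 3x3 convolution
    size = valid_output(size, 3, 2)  # 3x3 max pool, stride 2
    size = valid_output(size, 3, 2)  # first grid reduction
    size = valid_output(size, 3, 2)  # second grid reduction
    return size


def original_grid(size: int) -> int:
    """Size for the unmodified network, including the four stem layers."""
    size = valid_output(size, 3, 2)  # 3x3 convolution, stride 2
    size = valid_output(size, 3, 1)  # 3x3 convolution
    # a padded 3x3 convolution follows and keeps the size unchanged
    size = valid_output(size, 3, 2)  # 3x3 max pool, stride 2
    return audio_variant_grid(size)


print(audio_variant_grid(96), audio_variant_grid(64))  # 10 6
print(original_grid(299))                              # 8
```

Under this layout a 96 × 64 patch ends at 10 × 6, while the standard 299 × 299 image ends at 8 × 8, so the final average pool has to be resized accordingly.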
ResNet-50
They modified ResNet-50 by removing the stride of 2 from the first 7 × 7 convolution so that the
number of activations was not too different in the audio version.
They changed the Average Pool size to 6 × 4 to reflect the change in activations.
The original network has 26M weights and 3.8B multiplies. The audio variant has 30M weights
and 1.9B multiplies.
They trained with 20 GPUs and 10 parameter servers.
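The 6 × 4 pool size follows from the overall downsampling factor. The sketch assumes the standard ResNet-50 total stride of 32 (five stride-2 stages) and padded layers, so that each stage divides the size by its stride, rounding up.

```python
import math


def final_grid(height: int, width: int, total_stride: int) -> tuple:
    """Spatial size of the last feature map for a given overall stride."""
    return (math.ceil(height / total_stride), math.ceil(width / total_stride))


ORIGINAL_STRIDE = 32                 # standard ResNet-50 (assumed)
AUDIO_STRIDE = ORIGINAL_STRIDE // 2  # stride 2 removed from the first conv

print(final_grid(224, 224, ORIGINAL_STRIDE))  # (7, 7)
print(final_grid(96, 64, AUDIO_STRIDE))       # (6, 4)
```

The image network ends in a 7 × 7 map, whereas the audio variant ends in 6 × 4, which is why the average pool is resized.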
Performance Metrics
mAP -> mean Average Precision
AUC -> the area under the Receiver Operating Characteristic (ROC) curve.
D-prime -> It provides the separation between the means of the signal and the noise
distributions
Higher mAP values are better.
Higher D-prime values are better.
Perfect classification achieves AUC of 1.0, and random guessing gives an AUC of 0.5
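Two of these metrics are easy to compute by hand. The sketch below uses the standard conversion d' = sqrt(2) · Φ⁻¹(AUC), which holds under an equal-variance Gaussian assumption, and a simple ranking-based average precision for a single class; mAP is then the mean of that value over classes. Both functions are illustrative helpers, not the authors' evaluation code.

```python
import math
from statistics import NormalDist


def d_prime_from_auc(auc: float) -> float:
    """d' = sqrt(2) * inverse normal CDF of AUC (requires 0 < auc < 1)."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)


def average_precision(scores, labels) -> float:
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


print(f"d-prime at AUC 0.5:   {d_prime_from_auc(0.5):.3f}")
print(f"d-prime at AUC 0.904: {d_prime_from_auc(0.904):.3f}")
print(f"AP of a toy ranking:  "
      f"{average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]):.3f}")
```

Random guessing (AUC 0.5) gives a d-prime of 0, and d-prime grows without bound as AUC approaches 1, which spreads out differences between strong classifiers that look nearly identical in AUC.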
Results
Table 2 shows the evaluation results calculated over the 100K balanced videos.
All CNNs beat the fully-connected baseline. 
Inception and ResNet achieve the best performance;
◦ They provide high model capacity, and their convolutional units can efficiently capture common
structures that may occur in different areas of the input array, for both image and audio
representations.
Results of comparison between architectures
Results of varying label set size
Results of training with different amounts of data
AED with the Audio Set Dataset
Audio Set is a dataset of over 1 million 10-second excerpts labeled with a vocabulary of acoustic
events.
They train two fully-connected models to predict labels for Audio Set.
The first model uses 64 × 20 log-mel patches and the second uses the output of the “embedding”
layer of the best ResNet model as inputs.
The log-mel baseline achieves a balanced mAP of 0.137 and AUC of 0.904 (equivalent to a d-prime
of 1.846).
The model trained on embeddings achieves mAP / AUC / d-prime of 0.314 / 0.959 / 2.452.
This jump in performance reflects the benefit of the larger YouTube-100M training set embodied
in the ResNet classifier outputs.
Conclusions
The results show that state-of-the-art image networks are capable of excellent results on audio
classification when compared to a simple fully connected network or earlier image classification
architectures.
They saw results showing that training on larger label set vocabularies can improve
performance, albeit modestly, when evaluating on smaller label sets.
They saw that increasing the number of videos up to 7M improves performance for the
best-performing ResNet-50 architecture. They note that regularization could have reduced the gap
between the models trained on smaller datasets and the 7M and 70M datasets.
They see a significant increase over their baseline when training a model for AED with ResNet
embeddings on the Audio Set dataset.
What do we need to do to move forward?
Create more accurate architectures.
Training times are very long, so either faster hardware has to become available or faster
architectures have to be designed.
Remove noisy data from the dataset.
Train the model with more labels so that it can detect a wider range of distinct sounds.
What have you learned?
The importance of dataset size.
Differences between CNN architectures and their responses.
CNN behavior on audio classification.
How label set size and data size affect the results.
References
[1] https://lionbridge.ai/articles/what-is-audio-classification/
[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
[3] https://www.researchgate.net/figure/Polyphonic-acoustic-event-detection-task_fig2_322910427
[4] http://150.162.46.34:8080/icassp2017/pdfs/0000131.pdf
Q&A

More Related Content

PDF
Umeå University -- Supercharging research to enable ground-breaking innovation
PDF
Audio insights
PPTX
Final_Presentation_ENDSEMFORNITJSRI.pptx
PDF
Slides of my presentation at EUSIPCO 2017
PDF
“Comparing ML-Based Audio with ML-Based Vision: An Introduction to ML Audio f...
PDF
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
PDF
SMART APP FOR PHYSICALLY CHALLENGED PEOPLE USING INTERNET OF THINGS
PPTX
Sound is not speech
Umeå University -- Supercharging research to enable ground-breaking innovation
Audio insights
Final_Presentation_ENDSEMFORNITJSRI.pptx
Slides of my presentation at EUSIPCO 2017
“Comparing ML-Based Audio with ML-Based Vision: An Introduction to ML Audio f...
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
SMART APP FOR PHYSICALLY CHALLENGED PEOPLE USING INTERNET OF THINGS
Sound is not speech

Similar to CNN architectures for large-scale audio classification CONFERENCE PAPER REVIEW, EXPLANATION (20)

PDF
Exploring Real-Time Audio Dataset Applications in AI and Machine Learning
PPTX
SoundSense
PDF
IV_WORKSHOP_NVIDIA-Audio_Processing
PDF
How Real-World Audio Datasets Are Shaping AI Breakthroughs
PDF
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
PDF
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
PDF
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
PDF
Performance analysis of the convolutional recurrent neural network on acousti...
PDF
Automatic speech recognition system using deep learning
PDF
Issues in AI product development and practices in audio applications
PDF
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
PDF
Deep learning for music classification, 2016-05-24
PDF
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
PDF
SMART SOUND SYSTEM APPLIED FOR THE EXTENSIVE CARE OF PEOPLE WITH HEARING IMPA...
PDF
SMART SOUND SYSTEM APPLIED FOR THE EXTENSIVE CARE OF PEOPLE WITH HEARING IMPA...
PDF
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
PPTX
WaveNet
PDF
IRJET-speech emotion.pdf
PDF
Automated Speech Recognition
PDF
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Exploring Real-Time Audio Dataset Applications in AI and Machine Learning
SoundSense
IV_WORKSHOP_NVIDIA-Audio_Processing
How Real-World Audio Datasets Are Shaping AI Breakthroughs
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
Performance analysis of the convolutional recurrent neural network on acousti...
Automatic speech recognition system using deep learning
Issues in AI product development and practices in audio applications
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Deep learning for music classification, 2016-05-24
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
SMART SOUND SYSTEM APPLIED FOR THE EXTENSIVE CARE OF PEOPLE WITH HEARING IMPA...
SMART SOUND SYSTEM APPLIED FOR THE EXTENSIVE CARE OF PEOPLE WITH HEARING IMPA...
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
WaveNet
IRJET-speech emotion.pdf
Automated Speech Recognition
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Ad

Recently uploaded (20)

PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PDF
Introduction to the R Programming Language
PPT
Predictive modeling basics in data cleaning process
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PDF
annual-report-2024-2025 original latest.
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
Introduction to Data Science and Data Analysis
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PDF
Mega Projects Data Mega Projects Data
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Introduction to the R Programming Language
Predictive modeling basics in data cleaning process
Galatica Smart Energy Infrastructure Startup Pitch Deck
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
annual-report-2024-2025 original latest.
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
IB Computer Science - Internal Assessment.pptx
Optimise Shopper Experiences with a Strong Data Estate.pdf
IBA_Chapter_11_Slides_Final_Accessible.pptx
Reliability_Chapter_ presentation 1221.5784
ISS -ESG Data flows What is ESG and HowHow
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
Introduction to Data Science and Data Analysis
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Introduction-to-Cloud-ComputingFinal.pptx
Mega Projects Data Mega Projects Data
Ad

CNN architectures for large-scale audio classification CONFERENCE PAPER REVIEW, EXPLANATION

  • 1. Marmara University, Electrical and Electronics Engineering Spring 2020, EEE7000 – Seminar CONFERENCE PAPER S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. by Mehmet Çağrı Aksoy 24/04/2020
  • 2. What is Audio Classification and CNN? Audio classification is the process of listening to and analyzing audio recordings. Also known as sound classification, this process is at the heart of a variety of modern AI technology including virtual assistants, automatic speech recognition, and text to speech applications. [1] In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics. [2] Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 2
  • 3. Index Terms of this Paper Acoustic Event Detection, Acoustic Scene Classification, Convolutional Neural Networks, Deep Neural Networks, Video Classification Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 3
  • 4. What is the main event? The main event and purpose of this task is to do “Acoustic Event Detection” also named as “Audio Classification” Historically, audio classification tasks has been addressed with another methods named LSTM, SVM etc. More recent approaches use some form of DNN ( Deep Neural Network ) including CNNs and RNNs. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 4
  • 5. What is the main event? Cont. Prior work has been reported on datasets such as TRECVid, ActivityNet, Sports1M and DCASE Acoustic scenes 2016 which are much smaller than the dataset that are using in this paper. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 5
  • 6. What are the issues they are facing? Problem Statements Datasets Audio file overview Data Exploratory Data Pre-processing Extract Features Building the model Observing the results Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 6
  • 7. YouTube100M The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and a pool of 20M videos that they use for validation. Videos average 4.6 minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more topic identifiers from a set of 30,871 labels. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 7
  • 8. YouTube-100M Cont. The dataset has some labels that wrongly named. They also need to handle these bugs. Being machine generated, the labels are not 100% accurate and of the 30K labels, some are clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are often labeled “Entertainment” as well, although no hierarchy is enforced. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 8
  • 9. Audio file overview & Data Exploratory The audio is divided into non-overlapping 960 ms frames. This gave approximately 20 billion examples from the 70M videos. Each frame inherits all the labels of its parent video. The 960 ms frames are decomposed with a short-time Fourier transform applying 25 ms windows every 10 ms. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 9
  • 10. The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log transformed after adding a small offset to avoid numerical issues. This gives log-mel spectrogram patches of 96 x64 bins that form the input to all classifiers. During training we fetch mini-batches of 128 input examples by randomly sampling from all patches. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 10
  • 11. Spectrogram examples Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 11
  • 12. System Model They have used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. They examine fully connected Deep Neural Networks (DNNs), AlexNet, VGG, Inception, and ResNet. S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. Friday, April 24, 2020 12
  • 13. System Model cont. All experiments used TensorFlow and were trained asynchronously on multiple GPUs using the Adam optimizer. Batch normalization was applied after all convolutional layers. All models used a final sigmoid layer rather than a softmax layer since each example can have multiple labels. Cross-entropy was the loss function. In view of the large training set size, they did not use dropout, weight decay, or other common regularization techniques. For the models trained on 7M or more examples, they saw no evidence of overfitting. During training. S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. Friday, April 24, 2020 13
  • 14. CNN Architectures Their baseline is a fully connected DNN, which they compared to several networks closely modeled on successful image classifiers. Also they have used, AlexNet, ResNet, Inception and VGG. For their baseline experiments, they trained and evaluated using only the 10% most frequent labels of the original 30K (i.e, 3K labels). For each experiment, they optimized number of GPUs and learning rate for the frame level classification accuracy. Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 14
  • 15. Fully Connected Friday, April 24, 2020 S. HERSHEY ET AL., "CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION,“ 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), NEW ORLEANS, LA, 2017, PP. 131-135. 15
  • 16. Fully Connected Their baseline network is a fully connected model with ReLU activations, N layers, and M units per layer. They tried N = 2, 3, 4, 5, 6 and M = 500, 1000, 2000, 3000, 4000. Their best-performing model had N = 3 layers, M = 1000 units, a learning rate of 3e-5, 10 GPUs, and 5 parameter servers. This network has approximately 11.2M weights and 11.2M multiplies.
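The 11.2M figure can be reproduced with a quick count, under two assumptions taken from other slides rather than from this one: the input is a flattened 96 × 64 patch, and the output layer has 3087 units. Biases are ignored, and the helper name is hypothetical.

```python
def dense_weight_count(layer_sizes):
    """Number of weights in a stack of fully connected layers (no biases)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Assumed sizes: 96 x 64 input, N = 3 hidden layers of M = 1000 units,
# and a 3087-unit output layer.
input_size = 96 * 64
layers = [input_size, 1000, 1000, 1000, 3087]

weights = dense_weight_count(layers)
print(f"{weights / 1e6:.1f}M weights")  # prints "11.2M weights"
```

In a dense layer every weight is used in exactly one multiplication per forward pass, which is consistent with the slide reporting the same number for weights and multiplies.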
  • 17. AlexNet
  • 18. AlexNet The original AlexNet architecture was designed for a 224 × 224 × 3 input with an initial 11 × 11 convolutional layer with a stride of 4. Because their inputs are 96 × 64, they use a stride of 2 × 1 so that the number of activations is similar after the initial layer. They also use batch normalization after each convolutional layer instead of local response normalization (LRN) and replace the final 1000-unit layer with a 3087-unit layer. While the original AlexNet has approximately 62.4M weights and 1.1G multiplies, their version has 37.3M weights and 767M multiplies. Also, for simplicity, unlike the original AlexNet, they do not split filters across multiple devices. They trained with 20 GPUs and 10 parameter servers.
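To see why a 2 × 1 stride keeps the activation count close to the original, the size of the first layer's output grid can be estimated as follows. This is only a sketch: the slide does not state the padding scheme, so both common conventions are shown, and the stride of 2 is assumed to apply along the 96-length axis.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Output length of a 1-D convolution under 'same' or 'valid' padding."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def activation_grid(height, width, kernel, strides, padding):
    """Spatial output shape of a square-kernel 2-D convolution."""
    return (
        conv_output_size(height, kernel, strides[0], padding),
        conv_output_size(width, kernel, strides[1], padding),
    )


for padding in ("same", "valid"):
    image = activation_grid(224, 224, 11, (4, 4), padding)  # original AlexNet
    audio = activation_grid(96, 64, 11, (2, 1), padding)    # audio variant
    print(padding, image, image[0] * image[1], audio, audio[0] * audio[1])
```

Under "same" padding the two grids are 56 × 56 = 3136 and 48 × 64 = 3072 positions per filter; under "valid" padding they are 54 × 54 = 2916 and 43 × 54 = 2322. Either way the counts are of the same order, which matches the stated motivation.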
  • 19. VGG
  • 20. VGG The only changes they made to VGG were to the final layer (3087 units with a sigmoid) and the use of batch normalization instead of LRN. While the original network had 144M weights and 20B multiplies, the audio variant uses 62M weights and 2.4B multiplies. They tried another variant that reduced the initial strides (as they did with AlexNet), but found that not modifying the strides resulted in faster training and better performance. With their setup, parallelizing beyond 10 GPUs did not help significantly, so they trained with 10 GPUs and 5 parameter servers.
  • 21. Inception V3
  • 22. Inception V3 They modified the Inception V3 network by removing the first four layers of the stem, up to and including the MaxPool, as well as removing the auxiliary network. They changed the Average Pool size to 10 × 6 to reflect the change in activations. The original network has 27M weights and 5.6B multiplies, and the audio variant has 28M weights and 4.7B multiplies. They trained with 40 GPUs and 20 parameter servers.
  • 23. ResNet-50
  • 24. ResNet-50 They modified ResNet-50 by removing the stride of 2 from the first 7 × 7 convolution so that the number of activations was not too different in the audio version. They changed the Average Pool size to 6 × 4 to reflect the change in activations. The original network has 26M weights and 3.8B multiplies. The audio variant has 30M weights and 1.9B multiplies. They trained with 20 GPUs and 10 parameter servers.
  • 25. Performance Metrics mAP: mean Average Precision. AUC: the area under the Receiver Operating Characteristic (ROC) curve. d-prime: the separation between the means of the signal and noise distributions. Higher values are better for all three metrics. Perfect classification achieves an AUC of 1.0, and random guessing gives an AUC of 0.5.
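As a rough illustration of mAP, the sketch below uses one common definition of average precision (precision averaged over the ranks of the positive examples) and then averages it over classes. The slides do not give the exact formula the authors used, so treat the function names and the definition as assumptions.

```python
def average_precision(scores, labels):
    """Average of precision@k taken at the rank k of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class):
    """Mean of per-class average precision; per_class is [(scores, labels), ...]."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)


# Hypothetical scores for two classes over four clips.
perfect = ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # positives ranked first
mixed = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])    # one negative ranked above a positive
print(mean_average_precision([perfect, mixed]))
```

A class whose positives are all ranked first scores 1.0, and every negative ranked above a positive pulls the value down, which is why higher mAP is better.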
  • 26. Results Table 2 shows the evaluation results calculated over the 100K balanced videos. All CNNs beat the fully connected baseline. Inception and ResNet achieve the best performance: they provide high model capacity, and their convolutional units can efficiently capture common structures that may occur in different areas of the input array, for both images and audio representations.
  • 27. Results of comparison between architectures
  • 28. Results of varying label set size
  • 29. Results of training with different amounts of data
  • 30. AED with the Audio Set Dataset Audio Set is a dataset of over 1 million 10-second excerpts labeled with a vocabulary of acoustic events. They train two fully connected models to predict labels for Audio Set. The first model uses 64 × 20 log-mel patches as inputs, and the second uses the output of the "embedding" layer of the best ResNet model. The log-mel baseline achieves a balanced mAP of 0.137 and an AUC of 0.904 (equivalent to a d-prime of 1.846). The model trained on embeddings achieves mAP / AUC / d-prime of 0.314 / 0.959 / 2.452. This jump in performance reflects the benefit of the larger YouTube-100M training set embodied in the ResNet classifier outputs.
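The stated equivalence between AUC and d-prime can be checked numerically. The sketch below assumes the standard equal-variance Gaussian relation, d' = sqrt(2) · Φ⁻¹(AUC), where Φ⁻¹ is the inverse CDF of the standard normal distribution; the slides do not spell out the formula, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc):
    """d-prime under the assumed relation d' = sqrt(2) * inverse_normal_cdf(AUC).

    Requires 0 < auc < 1; NormalDist.inv_cdf raises StatisticsError otherwise.
    """
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


for auc in (0.5, 0.904, 0.959):
    print(auc, round(d_prime_from_auc(auc), 3))
```

An AUC of 0.904 maps to a d-prime of roughly 1.85, in line with the reported 1.846, and an AUC of 0.959 maps to roughly 2.46, close to the reported 2.452; the small differences are plausibly due to the AUC values being rounded to three decimals. An AUC of 0.5 (random guessing) gives a d-prime of 0.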
  • 32. Conclusions The results show that state-of-the-art image networks are capable of excellent results on audio classification when compared to a simple fully connected network or earlier image classification architectures. Training on larger label set vocabularies can improve performance, albeit modestly, when evaluating on smaller label sets. Increasing the number of videos up to 7M improves performance for the best-performing ResNet-50 architecture. The authors note that regularization could have reduced the gap between the models trained on smaller datasets and those trained on the 7M and 70M datasets. They see a significant increase over their baseline when training a model for AED with ResNet embeddings on the Audio Set dataset.
  • 33. What do we need to do to move forward? Create more precise architectures. Training times are very long, so either faster hardware must become available or faster architectures should be created. Remove noisy data from the dataset. Train the model with more labels so that it can detect a wider range of distinct sounds.
  • 34. What have you learned? The importance of dataset size. The differences between CNN architectures and how they respond. How CNNs behave on audio classification. How label set size and data size affect the results.
  • 35. References [1] https://lionbridge.ai/articles/what-is-audio-classification/ [2] https://en.wikipedia.org/wiki/Convolutional_neural_network [3] https://www.researchgate.net/figure/Polyphonic-acoustic-event-detection-task_fig2_322910427 [4] http://150.162.46.34:8080/icassp2017/pdfs/0000131.pdf
  • 36. Q&A