Bulletin of Electrical Engineering and Informatics
Vol. 9, No. 4, August 2020, pp. 1387~1393
ISSN: 2302-9285, DOI: 10.11591/eei.v9i4.2230
Journal homepage: http://beei.org
Performance analysis of the convolutional recurrent neural
network on acoustic event detection
Suk-Hwan Jung, Yong-Joo Chung
Department of Electronics Engineering, Keimyung University, South Korea
Article Info
Article history:
Received Nov 11, 2019
Revised Feb 3, 2020
Accepted Mar 23, 2020

ABSTRACT
In this study, we attempted to find the optimal hyper-parameters of
the convolutional recurrent neural network (CRNN) by investigating its
performance on acoustic event detection. Important hyper-parameters such
as the input segment length, learning rate, and criterion for the convergence test
were determined experimentally. Additionally, the effects of batch normalization
and dropout on the performance were measured experimentally to obtain their
optimal combination. Further, we studied the effects of varying the batch
data on every iteration during the training. From the experimental results
using the TUT sound events synthetic 2016 database, we obtained optimal
performance with a learning rate of . We found that a longer input
segment length aided performance improvement, and batch normalization
was far more effective than dropout. Finally, performance improvement was
clearly observed by varying the starting points of the batch data for each iteration
during the training.
Keywords:
Acoustic event detection
Convolutional neural network
Convolutional recurrent neural
network
Hyper-parameters
Recurrent neural network
This is an open access article under the CC BY-SA license.
Corresponding Author:
Yong-Joo Chung,
Department of Electronics Engineering,
Keimyung University,
1095 Dalgubul-Daero, Dalseo-Gu, Daegu, South Korea.
Email: yjjung@kuu.ac.kr
1. INTRODUCTION
Recently, there has been increasing research interest in acoustic event detection, where the existence
and occurrence times of the various sounds in our daily lives are identified. There are many applications for
acoustic event detection, including surveillance [1, 2], urban sound analysis [3, 4], information retrieval from
multimedia contents [5], health care monitoring [6, 7] and bird call detection [8, 9]. Deep neural networks
(DNNs) have demonstrated superior performance to conventional machine learning techniques in image
classification [10-12], speech recognition [13-15], and machine translation [16, 17]. In [18], we see that
the feedforward neural network (FNN) now outperforms the Gaussian mixture model (GMM) and support
vector machine (SVM), which have traditionally been employed for acoustic event detection. FNN has
also been shown to outperform the conventional GMM-HMM-based methods in polyphonic acoustic
event detection [19]. Therefore, we can say that current studies on acoustic event detection mainly focus on
DNN-based approaches.
However, due to the fixed connection between the input and hidden layers, FNN is apparently
inadequate to overcome the signal distortions in image classification. Similarly, it is apparent that FNN
is also insufficient for acoustic event detection as audio signal distortions are frequently encountered in
the 2-dimensional time-frequency domain of the signal. Another problem with FNN lies in modeling the correlation
between the time-frames of the audio signal. As FNN only concatenates several input frames together to model
the time correlation, it often fails to model the long-term time correlations of the audio signal.
Convolutional neural networks (CNNs) can alleviate this limitation of FNN using 2-dimensional
filters, whose parameters are shared across time and frequency shifts [20], and they have exhibited superior
performance to FNN in various pattern recognition tasks. Particularly, due to its structural characteristics,
CNN can efficiently handle image distortions in the 2-dimensional space [10]. Similarly, we expect that
the time-frequency domain distortions occurring in the audio signal can be accurately modeled by CNN.
Nevertheless, CNN is inefficient for modeling long-term time correlations between audio signal samples.
Recurrent neural networks (RNNs) have been used successfully in speech recognition [13],
and are superior to other neural networks in modeling the time-correlation information of the time-series
signals such as speech and audio. However, because RNN is unable to tackle 2-dimensional distortions in
the time-frequency domain of the audio signal, its performance is usually inferior to CNN when used alone in
acoustic event detection. Recently, there have been approaches that combine CNN and RNN to exploit their
complementary merits. Among them, convolutional recurrent neural networks (CRNNs) have been used
successfully for acoustic event detection [21], speech recognition [22], and music classification [23].
As CRNN is constructed by connecting CNN, RNN, and FNN in series, it is more complex than other neural
networks; therefore, the combined effect of the networks is difficult to predict. Moreover, as the use of CRNN
on acoustic event detection is in its early stages, there are few research studies on optimizing the various
hyper-parameters of CRNN.
Therefore, we attempted to find the optimal hyper-parameters of CRNN for acoustic event
detection in this study. Several experiments were performed to identify the optimal hyper-parameters of
CRNN, and we used the test results on the validation data to determine the hyper-parameters.
Important hyper-parameters, such as the input segment length, learning rate, and criterion for
the convergence test were determined from the experiments. Additionally, the effects of batch normalization
and dropout on the performance were observed. We also studied the effects of varying the batch data in every
iteration during the training. This paper is organized as follows. In Section 2, we introduce the feature
extraction method for audio signals as well as the architecture of the CRNN used for acoustic event detection.
In Section 3, we present and discuss various experimental results, and conclusions are given in Section 4.
2. FEATURE EXTRACTION AND CRNN ARCHITECTURE
2.1. Feature extraction
In this study, we used log-mel filterbank (LMFB) outputs as input features for the CRNN; the entire
process of feature extraction is shown in Figure 1. We first computed the short-time Fourier transform (STFT)
from 40-ms frames of the audio signal, which is sampled at 44.1 kHz. STFTs were computed every 20 ms,
i.e., with 50% overlap [21]. Further, 40-dimensional mel filterbanks spanning 0 to 22050 Hz were extracted
from the STFTs, and they were log-transformed to obtain the LMFBs, which were normalized by subtracting
the mean and dividing by the standard deviation of the entire training data.
Figure 1. Extraction process of log-mel filterbank (LMFB)
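The extraction pipeline above can be expressed compactly in code. The following is a minimal sketch, assuming the librosa library is available; the frame and hop lengths are derived from the 40-ms window and 20-ms hop stated above, and the normalization statistics are assumed to be pre-computed on the training set. It is illustrative rather than the implementation actually used.

```python
# Minimal sketch of the LMFB extraction described above (librosa/numpy assumed).
import numpy as np
import librosa

SR = 44100                      # sampling rate (44.1 kHz)
N_FFT = int(0.040 * SR)         # 40-ms analysis window
HOP = int(0.020 * SR)           # 20-ms hop (50% overlap)
N_MELS = 40                     # 40 mel bands spanning 0-22050 Hz

def extract_lmfb(audio, train_mean=None, train_std=None):
    """Return a (frames, 40) matrix of normalized log-mel filterbank outputs."""
    mel = librosa.feature.melspectrogram(y=audio, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS,
                                         fmin=0, fmax=SR // 2)
    lmfb = np.log(mel + 1e-10).T          # log-compress; shape (frames, 40)
    if train_mean is not None:            # normalize with training-set statistics
        lmfb = (lmfb - train_mean) / train_std
    return lmfb
```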
2.2. The architecture of CRNN
Figure 2 presents the architecture of the CRNN used in this paper. The CRNN consists of CNN layers
followed by an RNN and an FNN in sequence. The CNN layers act as audio feature extractors, which are robust
against distortions in the time-frequency domain. The RNN utilizes the time-correlation information of the audio
signal. Finally, the FNN serves as an output layer, which produces the posterior probabilities for each sound class at
each time frame.
As the CNN takes 40-dimensional LMFBs as input features, the dimension of the CNN input
is T×40, where T is the input segment length. The CNN consists of 3 convolution layers, each of
which has 256 feature maps with 5×5 filters. Each convolution output is processed by batch normalization
and then passed through a ReLU activation function. To maintain the time-domain dimension, max pooling
is performed only in the frequency domain. Further, dropout is applied at a rate of 0.25 after the max pooling
layer [10]. The input segment length T is set to 1024 frames (20.48 s). We experimented with different values
of T to find the one that produced the best result on the validation dataset.
The output of the CNN is input to the RNN, which consists of 256 gated recurrent units (GRUs).
The output layer consists of K units with a sigmoid activation function, where K represents the number of
acoustic event classes. The sigmoid activation function produces the posterior probability for each class at
each time frame, from which we decide whether an acoustic event is active based on a threshold of 0.5.
Figure 2. Architecture of the CRNN used for acoustic event detection
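A minimal Keras sketch of this architecture is given below for illustration. The number of convolution layers, feature maps, filter size, dropout rate, GRU width, and sigmoid output layer follow the description above; the frequency pooling factors (5, 4, 2), which collapse the 40 mel bands, are an assumption not specified in the text, and the Adam/binary cross-entropy training setup anticipates the settings reported in Section 3.

```python
# Illustrative sketch of the CRNN described above (TensorFlow/Keras assumed).
import tensorflow as tf
from tensorflow.keras import layers, models

T, N_MELS, K = 1024, 40, 16     # input frames, mel bands, event classes

def build_crnn():
    inp = layers.Input(shape=(T, N_MELS, 1))
    x = inp
    for pool in (5, 4, 2):                        # 3 convolution blocks
        x = layers.Conv2D(256, (5, 5), padding='same')(x)
        x = layers.BatchNormalization()(x)        # BN before the ReLU
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D(pool_size=(1, pool))(x)   # pool frequency only
        x = layers.Dropout(0.25)(x)
    x = layers.Reshape((T, -1))(x)                # (T, features) for the RNN
    x = layers.GRU(256, return_sequences=True)(x)
    out = layers.TimeDistributed(layers.Dense(K, activation='sigmoid'))(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```

Frame-wise event activity would then be obtained by thresholding the sigmoid outputs of model.predict() at 0.5, as described above.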
3. EXPERIMENTAL RESULTS
3.1. Database and evaluation metric
To evaluate the performance of CRNN on acoustic event detection, we used the TUT sound events
synthetic 2016 (TUT-SED Synthetic) database, which is popularly used in this area [24]. TUT-SED Synthetic
contains artificially generated audio data, because it is difficult to obtain enough data using only audio
recorded in real environments. Moreover, subjective labelling errors can be mitigated by artificial data. TUT-SED
Synthetic was generated by mixing isolated sound events from 16 different classes. The total length of the data is 566
minutes, which was divided into training, testing, and validation data with proportions of 60%, 20%,
and 20%, respectively. Segments of length 3-15 seconds were used for the training, testing, and validation,
and there were no common acoustic event instances between them. The detailed sound classes and their total
duration in the database are shown in Table 1.
For the evaluation of the acoustic event detection, we used both the error rate (ER) and the F-score [25].
We adopted two types of evaluation methods: segment-based and event-based. In the segment-based method,
the binarized outputs of the CRNN are compared with the ground truth in every segment of length 1 s.
In the event-based method, the outputs of the CRNN are compared with the labels in the ground truth
whenever an acoustic event has been detected by the CRNN [25]. We sought the optimal hyper-parameters of
the CRNN by applying various conditions during the training. To find the optimal learning rate,
we trained the CRNN with different learning rates. The results are shown in Table 2.
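For reference, the segment-based scoring described above can be sketched as follows, following the definitions of [25]. The sketch assumes the CRNN outputs have already been binarized at the 0.5 threshold and pooled into 1-s segments, giving binary activity matrices of shape (segments, classes); it is illustrative rather than the evaluation code actually used.

```python
# Minimal numpy sketch of segment-based F-score and error rate (Mesaros et al. [25]).
import numpy as np

def segment_based_scores(pred, ref):
    """pred, ref: binary arrays of shape (num_segments, num_classes)."""
    tp = np.sum(np.logical_and(pred == 1, ref == 1))
    fp = np.sum(np.logical_and(pred == 1, ref == 0))
    fn = np.sum(np.logical_and(pred == 0, ref == 1))
    f_score = 2 * tp / (2 * tp + fp + fn)

    # Per-segment substitutions, deletions and insertions for the error rate
    fn_seg = np.sum(np.logical_and(pred == 0, ref == 1), axis=1)
    fp_seg = np.sum(np.logical_and(pred == 1, ref == 0), axis=1)
    subs = np.minimum(fn_seg, fp_seg)
    dels = fn_seg - subs
    ins = fp_seg - subs
    error_rate = (subs + dels + ins).sum() / ref.sum()
    return f_score, error_rate
```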
We applied batch normalization and dropout in all cases, and binary cross entropy was employed as
the loss function, which acts as the criterion for the convergence of the weights. The Adam optimizer was
used to optimize the neural networks. As seen in Table 2, the optimal CRNN performance was achieved at
a learning rate of . Generally, as the optimal learning rate for neural networks is not pre-determined
and varies with both the network architecture and the amount of training data, we searched for the optimal learning
rate by observing the performance on the validation data set. We can see in Table 2 that the best learning rate
for the testing data is the same as the learning rate that achieves the best performance on the validation data.
This indicates that it is reasonable to determine the optimal learning rate of the CRNN by its performance on
the validation data. Table 2 also shows that as the learning rate decreases, the number of epochs that gives
the best performance increases. For example, the number of epochs is 33 when the learning rate is ,
whereas it increases to 191 with the learning rate of . The larger number of epochs indicates a slower
convergence of the CRNN parameters, which causes performance degradation due to underfitting of the neural
networks. In contrast, when the learning rate increases to , the number of epochs decreases dramatically to 16,
which causes overfitting and results in performance degradation.
To investigate the performance variation with the learning rate further, Figure 3 shows
the variation of the loss function and accuracy at the output of the CRNN during the training as the learning
rate is varied from to . When the learning rate is , it can be seen that the loss function on
the validation data reaches its minimum at approximately 30 epochs (33 precisely) and subsequently
fluctuates (but never falls below the minimum). However, for the training data, the loss function decreases
from the beginning to the end of the training (we set the maximum number of epochs to 200). As it
is important for the networks not to be overfitted, we stopped the iteration at 33 epochs by the early stopping
algorithm mentioned previously. Meanwhile, we see different characteristics when the learning rate is .
The loss function on the validation data decreases for longer and reaches its minimum at epoch 157. The longer
iterations decrease the performance of the CRNN on both the validation and test data due to
the underfitting problem. This phenomenon becomes more pronounced as we further decrease the learning rate.
When the learning rate is , the loss function does not reach its minimum until the end of the training. A similar
trend is observed when we monitor the accuracy instead of the loss function.
We investigated the effect on performance of changing the input batch data in every epoch of
the training. The starting points of the batch data were shifted at each epoch, making the input segments at
successive epochs differ by the shift length. For the evaluation, both segment-based and event-based methods
were used, and the results are reported as F-score and ER. We used ER as the convergence criterion for
the training. Further, early stopping was employed, where we stopped the training when the convergence
criterion did not improve for more than 100 epochs on the validation data. This was to prevent overfitting,
and the maximum number of epochs was set to 200. We also investigated the effects of batch normalization (BN)
and dropout. BN has been used to mitigate the vanishing and exploding gradient problems in the backpropagation
algorithm, and we used BN before the ReLU activation function in the convolution layers. Dropout is a popular
regularization method in neural networks, which excludes neurons from training with
a predefined probability (dropout rate). In this study, dropout is used in all layers of the CNN and RNN (but not
the FNN) [18].
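A minimal sketch of the epoch-wise batch shifting and early stopping described above is given below. The shift length and the way segments are cut from a longer feature stream are assumptions (the text only states that the starting points were shifted at every epoch), and validation loss is used here as a stand-in for the ER-based convergence criterion.

```python
# Illustrative sketch of epoch-wise batch shifting and early stopping (Keras assumed).
import tensorflow as tf

def shifted_segments(features, labels, seg_len, epoch, shift=128):
    """Yield (segment, label) pairs whose starting points move by `shift` frames per epoch."""
    offset = (epoch * shift) % seg_len
    for start in range(offset, len(features) - seg_len + 1, seg_len):
        yield features[start:start + seg_len], labels[start:start + seg_len]

# Stop training if the criterion has not improved for 100 epochs (maximum of 200 epochs);
# the paper monitors ER on the validation data, val_loss is used here for simplicity.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100,
                                              restore_best_weights=True)
# model.fit(train_data, validation_data=val_data, epochs=200, callbacks=[early_stop])
```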
In Table 3, the results of shifting the batch data are presented. In the segment-based evaluation,
we see an improved average F-score/ER of 63.18%/0.52 using the shift compared to the F-score/ER
of 62.11%/0.54 without the shift (non-shift). We also observe a slight performance improvement in
the event-based evaluation. From these results, we confirm that shifting the batch data during training
improves the performance of the CRNN.
Table 3 also shows the effects of BN and dropout on performance. As expected, the best average F-score/ER
is obtained when applying both BN and dropout (56.67%/0.71). If we apply only BN without dropout, we see
a slight performance degradation (54.54%/0.79), which implies that the effect of dropout on performance
is not significant. Meanwhile, if we do not apply BN, the performance degrades significantly (regardless of
whether we apply dropout), and the poorest result is obtained when we apply neither BN nor dropout
(47.15%/0.81). In Table 4, we compare the performance of the CRNN between two convergence criteria (ER and
F-score) for training. BN and dropout were applied and the overlap method was used. In the table, we observe
a slight performance improvement with ER, but the performance difference appears negligible, and we
conclude that either ER or F-score can be used as the convergence criterion without significantly affecting
the performance.
Table 1. Sound classes and their total duration in seconds in the TUT-SED synthetic 2016 database
Classes Duration(s) Classes Duration(s)
Glass 621 Motor cycle 3691
Gun 534 Foot steps 1173
Cat 941 Claps 3278
Dog 716 Bus 3464
Thunder 3007 Mixer 4020
Bird 2298 Crowd 4825
Horse 1614 Alarm 4405
Crying 2007 Rain 3975
Table 2. Performance of CRNN as learning rate changes
Learning rate Validation data (F-score/ER) Testing data (F-score/ER) Epoch
Segment Event Segment Event
61.69%/ 0.52 37.69%/0.96 60.61%/0.53 37.05%/0.97 16
68.75%/ 0.45 43.49%/0.88 64.21%/0.50 40.50%/0.96 33
66.44%/ 0.49 39.10%/0.96 63.76%/0.52 36.48%/1.04 157
44.16%/ 0.69 9.83%/1.24 43.38%/0.71 10.82%/1.27 191
Figure 3. Variation of loss function and accuracy as learning rate changes
Table 3. Performance of CRNN with various training conditions
BN Dropout Segment-based (F-score/ER) Event-based (F-score/ER) Average
Shift Non-shift Shift Non-shift
Yes No 66.10%/0.48 65.28%/0.54 43.80%/1.01 42.99%/1.12 54.54%/0.79
Yes Yes 67.24%/0.49 66.62%/0.49 45.93%/0.94 46.88%/0.92 56.67%/0.71
No No 58.58%/0.57 58.88%/0.58 35.47%/1.06 35.68%/1.04 47.15%/0.81
No Yes 60.80%/0.54 57.67%/0.56 40.02%/0.97 39.18%/1.00 49.4%/0.77
Average 63.18%/0.52 62.11%/0.54 41.3%/1.0 41.18%/1.02
Table 4. Performance of CRNN depending on the convergence criterion
Convergence Criterion Segment-based (F-score/ER) Event-based (F-score/ER) Average
ER 67.24%/0.49 45.93%/0.94 56.59%/0.72
F-score 66.45%/0.51 44.97%/0.98 55.71%/0.75
Finally, in Table 5, we show the performance variation as we changed the length of the input segment of
the CRNN. Compared to shorter segments (2.56 s, 5.12 s), longer segments (10.24 s, 20.56 s) show better
performance. This may be because many of the acoustic events in TUT-SED Synthetic have long
durations, and the RNN can efficiently capture the time-correlations in the long segments.
Table 5. Performance of CRNN depending on the length of the input segment
Segment length (s) Segment-based (F-score/ER) Event-based (F-score/ER) Average
2.56 66.84%/0.51 42.75%/1.13 54.80%/0.82
5.12 67.47%/0.50 44.27%/1.04 55.87%/0.77
10.24 68.01%/0.47 45.33%/0.97 56.67%/0.72
20.56 67.24%/0.49 45.93%/0.94 56.54%/0.70
4. CONCLUSION
For acoustic event detection, approaches based on deep neural networks have demonstrated
superior performances to conventional machine learning methods, such as GMM and SVM. Among them,
CRNN is thought to be well suited for acoustic event detection due to its ability to reduce signal distortions in
the time-frequency domain as well as in exploiting the temporal-correlation information of the audio signal.
In this study, we employed the CRNN as the classifier for acoustic event detection, and several of
its training conditions were tested by extensive experiments to determine its optimal hyper-parameters.
In the experiments, by varying the learning rate, we found that the optimum performance is obtained
when the learning rate is set to . From the results, we could also see that the learning rate that exhibits
optimum performance on the validation data also performs best on the testing data. This suggests that it
is reasonable to determine the optimal learning rate based on performance tests on the validation data.
We further confirmed that BN and dropout contributed to improving the performance of CRNN.
Particularly, BN had a larger impact on the performance improvement than dropout.
Instead of using identical batch data at every iteration, we obtained improved performance by
changing the batch data in every iteration, which effectively increases the number of distinct training samples for
the CRNN. We further found that the length of the input segments of the CRNN also affects the performance.
We obtained better performance using longer segments, as the acoustic events used in this paper had relatively
long durations.
ACKNOWLEDGEMENTS
This research was supported by the Basic Science Research Program through the National Research
Foundation of Korea (NRF), funded by the Ministry of Education (No. 2018R1A2B6009328).
REFERENCES
[1] M. K. Nandwana, et al., “Robust Unsupervised Detection of Human Screams In Noisy Acoustic Environments,”
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD,
pp. 161-165, 2015.
[2] M. Crocco, et al., “Audio Surveillance: A Systematic Review,” ACM Computing Surveys, no. 52, 2016.
[3] J. Salamon and J. P. Bello, “Feature Learning with Deep Scattering for Urban Sound Analysis,” 2015 23rd
European Signal Processing Conference (EUSIPCO), Nice, pp. 724-728, 2015.
[4] S. Ntalampiras, et al., "On Acoustic Surveillance of Hazardous Situations", 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing, Taipei, pp. 165-168, 2009.
[5] Y. Wang, et al., “Audio-based Multimedia Event Detection Using Deep Recurrent Neural Networks,” 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, pp. 2742-2746, 2016.
[6] D. Stowell and D. Clayton, “Acoustic Event Detection for Multiple Overlapping Similar Sources,” 2015 IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, pp. 1-5, 2015.
[7] G. Dekkers, et al., “DCASE 2018 Challenge - Task 5: Monitoring of domestic activities based on multi-channel
acoustics,” KU Leuven, Tech. Rep., 2018.
[8] D. Stowell, et al., “Automatic Acoustic Detection of Birds through Deep Learning: the First Bird Audio Detection
Challenge,” Methods in Ecology and Evolution, vol. 10, no. 3, pp.1-21, 2018.
[9] F. Briggs, et al., “Acoustic Classification of Multiple Simultaneous Bird Species: A Multi-instance Multi-Label
Approach,” The Journal of the Acoustical Society of America, vol. 131, no. 6, pp.4640-4640, 2012.
[10] A. Krizhevsky, et al., “Imagenet Classification with Deep Convolutional Neural Networks,” Advances in Neural
Information Processing Systems, vol. 25, no. 2, pp. 1097-1105, 2012.
[11] K. He, et al., “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, pp. 770-778, 2016.
[12] Olga Russakovsky, et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of
Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[13] A. Graves, et al., “Speech Recognition with Deep Recurrent Neural Networks,” 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, pp. 6645-6649, 2013.
[14] J. Chorowski, et al., “Attention-Based Models for Speech Recognition,” in Proceedings of the 28th International
Conference on Neural Information Processing Systems (NIPS), vol. 1, pp. 577-585, 2015.
[15] A. Hannun, et al., “Deep Speech: Scaling up End-to-End Speech Recognition,” arXiv: 1412.5567, 2014.
[16] K. Cho, et al., “Learning Phrase Representations Using Rnn Encoder-Decoder for Statistical Machine Translation,”
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1724-1734, 2014.
[17] I. Sutskever, et al., “Sequence to Sequence Learning with Neural Networks”, in Proceedings of the 27th
International Conference on Neural Information Processing Systems (NIPS), pp. 3104-3112, 2014.
[18] S. H. Chung and Y. J. Chung, “Comparison of Audio Event Detection Performance using DNN,” Journal of the
Korea Institute of Electronic Communication Sciences, vol. 13, no. 3, pp. 571-577, 2018.
[19] E. Cakir, et al., “Polyphonic sound event detection using multilabel deep neural networks,” 2015 International
Joint Conference on Neural Networks (IJCNN), Killarney, pp. 1-7, 2015.
[20] Y. LeCun, et al., “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278-2324, 1998.
[21] E. Cakir, et al., “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection,” in IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291-1303, 2017.
[22] T. N. Sainath, et al., “Convolutional, Long Short-term Memory, Fully Connected Deep Neural Networks,” 2015
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD,
pp. 4580-4584, 2015.
[23] K. Choi, et al., “Convolutional Recurrent Neural Networks for Music Classification,” 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, pp. 2392-2396, 2017.
[24] “TUT-SED Synthetic,” 2016. [Online]. Available at: http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016
[25] A. Mesaros, et al., “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6,
pp. 162-178, 2016.
BIOGRAPHIES OF AUTHORS
Sukwhan Jung received his B.Sc. and M.Sc. degrees in Electronics Engineering from Keimyung
University, Daegu, South Korea in 2016 and 2018, respectively. He has been with Samju
Electronics Co. since March 2018. His main research interests are audio event detection under
noisy environments and deep learning for artificial intelligence.
Yongjoo Chung received his B.Sc. degree in Electronics Engineering from Seoul National
University, Seoul, South Korea in 1988. He earned his M.Sc. and Ph.D. degrees in Electrical and
Electronics Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon,
South Korea, in 1995. He is currently a Professor with the Department of Electronics Engineering at
Keimyung University, Daegu, South Korea. His research interests are in the areas of speech
recognition, audio event detection, machine learning and pattern recognition.