Signals and Communication Technology

Anupam Biswas
Emile Wennekes
Alicja Wieczorkowska
Rabul Hussain Laskar
Editors

Advances in Speech and Music Technology
Computational Aspects and Applications
Signals and Communication Technology
Series Editors
Emre Celebi, Department of Computer Science, University of Central Arkansas,
Conway, AR, USA
Jingdong Chen, Northwestern Polytechnical University, Xi’an, China
E. S. Gopi, Department of Electronics and Communication Engineering, National
Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA
H. Vincent Poor, Department of Electrical Engineering, Princeton University,
Princeton, NJ, USA
Antonio Liotta, University of Bolzano, Bolzano, Italy
Mario Di Mauro, University of Salerno, Salerno, Italy
This series is devoted to fundamentals and applications of modern methods of
signal processing and cutting-edge communication technologies. The main topics
are information and signal theory, acoustical signal processing, image processing
and multimedia systems, mobile and wireless communications, and computer and
communication networks. Volumes in the series address researchers in academia
and industrial R&D departments. The series is application-oriented. The level of
presentation of each individual volume, however, depends on the subject and can
range from practical to scientific.
Indexing: All books in “Signals and Communication Technology” are indexed
by Scopus and zbMATH.
For general information about this book series, comments or suggestions, please
contact Mary James at mary.james@springer.com or Ramesh Nath Premnath at
ramesh.premnath@springer.com.
Anupam Biswas • Emile Wennekes •
Alicja Wieczorkowska • Rabul Hussain Laskar
Editors
Advances in Speech and
Music Technology
Computational Aspects and Applications
Editors
Anupam Biswas
Department of Computer Science &
Engineering
National Institute of Technology Silchar
Cachar, Assam, India
Emile Wennekes
Department of Media and Culture Studies
Utrecht University
Utrecht, Utrecht, The Netherlands
Alicja Wieczorkowska
Multimedia Department
Polish-Japanese Academy of Information
Technology
Warsaw, Poland
Rabul Hussain Laskar
Department of Electronics &
Communication Engineering
National Institute of Technology Silchar
Cachar, India
ISSN 1860-4862 ISSN 1860-4870 (electronic)
Signals and Communication Technology
ISBN 978-3-031-18443-7 ISBN 978-3-031-18444-4 (eBook)
https://doi.org/10.1007/978-3-031-18444-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Speech and music are two prominent research areas in the domain of audio signal
processing. With recent advancements in speech and music technology, the area has grown tremendously, bringing together interdisciplinary researchers from computer science, musicology, and speech analysis. The language we speak propagates as sound waves through various media, enabling communication among us humans as well as entertainment. The music we hear or create can be perceived in terms of aspects such as rhythm, melody, harmony, timbre, or mood. The multifaceted nature of speech and music information requires algorithms and systems that use sophisticated signal processing and machine learning techniques to optimally extract useful information. This book provides both profound technological knowledge and a comprehensive treatment of essential and innovative topics in speech and music processing.
Recent computational developments have opened up several avenues to further
explore the domains of speech and music. A profound understanding of both speech
and music in terms of perception, emotion, mood, gesture, and cognition is at the forefront, and many researchers are working in these domains. In this digital
age, overwhelming data have been generated across the world that require efficient
processing for better maintenance and retrieval. Machine learning and artificial
intelligence are best suited for these computational tasks.
The book comprises four parts. The first part covers state of the art in com-
putational aspects of speech and music. The second part covers machine learning
techniques applied in various music information retrieval tasks. The third part
comprises chapters dealing with perception, health, and emotion involving music.
The last part includes several case studies.
Audio technology, covering speech, music, and other signals, is a very broad
domain. Part I contains five review chapters, presenting state of the art in selected
aspects of speech and music research, namely automatic speaker recognition, music
composition based on artificial intelligence, music recommendation systems, and
investigations on Indian classical music, which is very different from Western music
that most of us are used to.
Chapter “A Comprehensive Review on Speaker Recognition”, written by Banala
Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar, offers a
comprehensive review on speaker recognition techniques, mainly focusing on text-
dependent methods, where predefined text is used in the identification process. The
authors review feature extraction techniques often applied as pre-processing, and
then present various models that can be trained for speaker identification, with a
special section devoted to deep learning. Measures that can be applied to assess the
speaker recognition quality are also briefly discussed.
Chapter “Music Composition with Deep Learning: A Review”, authored by
Carlos Hernandez-Olivan and Jose R. Beltran, presents a review of music compo-
sition techniques based on deep learning. Artificial intelligence has been applied
to music composition since the previous millennium, as briefly reviewed in this
chapter. Obviously, deep neural networks are also applied for this purpose, and
these techniques are presented in this chapter. Next, the authors delve into the
details of the music composition process, including musical form and style, melody,
harmony, and instrumentation. Evaluation metrics are also provided in this chapter.
Finally, the authors pose and answer interesting questions regarding automatic
music composition: how creative it is, what network architectures perform best,
how much data is needed for training, etc. Possible directions of future works in this
area conclude this chapter.
Chapters “Music Recommendation Systems: Overview and Challenges” and
“Music Recommender Systems: A Review Centered on Biases” describe music
recommendation systems. Chapter “Music Recommendation Systems: Overview
and Challenges”, written by Yesid Ospitia-Medina, Sandra Baldassarri, Cecilia
Sanz, and José Ramón Beltrán, offers a general overview of such systems,
whereas chapter Music Recommender Systems: A Review Centered on Biases,
by Makarand Velankar and Parag Kulkarni, presents a review focusing on biases.
Chapter “Music Recommendation Systems: Overview and Challenges” presents very broadly the content-based approach to music recommendation systems, as well as the collaborative approach and the hybrid approach to creating such systems.
Context-aware recommendation systems, which result in better recommendations,
are also briefly presented in this chapter. The authors discuss business aspects
of music recommendation systems as well. A special section is devoted to user
profiling and psychological aspects, as the type of music the users want to listen to
depends on their mood and emotional state. The chapter is concluded with current
challenges and trends in music recommendation.
Chapter “Music Recommender Systems: A Review Centered on Biases” presents an overview of biases in music recommendation systems. The authors set off by presenting the research questions that underpin work in this area, noting that the way these questions are answered can introduce biases. These research questions concern the main characteristics of music recommender systems (approaches to the creation of such systems are presented in the chapter) and how new songs are introduced. The authors review the main biases in such systems and the relationships between these biases and both the recommendation strategies and the music datasets used.
Biases are classified into three categories, namely pre-existing, technical, and emerging biases, the last being detected during use of the system. Works on biases, as well as general works on music recommendation systems, are reviewed here. The authors discuss how biases impact these systems and propose guidelines for handling biases in such systems.
Chapter “Computational Approaches for Indian Classical Music: A Comprehen-
sive Review” presents a review of research on computational techniques applied in
Indian classical music, by Yeshwant Singh and Anupam Biswas. This traditional
music has roots in singing swaras. Nowadays, it is divided into Hindustani music, with ragas (raags), mainly practiced in northern India, and Carnatic music in the southern part of the country. Microtones called shruti are specific to Indian classical music and make it very different from Western music. The authors review papers
on tonic identification in classical Indian music, including feature extraction and
distribution, and melody processing, with segmentation, similarity analysis, and
melody representation. The automatic recognition of ragas is also covered in this
chapter. The authors also describe datasets of classical Indian music, and evaluation
metrics for the research on this music. Before concluding the chapter, the authors
present open challenges in this interesting research area.
Machine learning is helpful in understanding and learning from data, identifying
patterns, and making decisions with minimal human interaction. This is why
machine learning for audio signal processing has attracted attention recently for its
applications in both speech and music processing, presented in the five chapters of
Part II. Two chapters are focused on speech and multimodal audio signal processing,
and three on music, including instruments, raags, shruti, and emotion recognition
from music.
Chapter “A Study on Effectiveness of Deep Neural Networks for Speech Signal
Enhancement in Comparison with Wiener Filtering Technique” by Vijaya Kumar
Padarti, Gnana Sai Polavarapu, Madhurima Madiraju, V. V. Naga Sai Nuthalapati,
Vinay Babu Thota, and V.D. Subramanyam Veeravalli explores speech signal
enhancement with deep learning and Wiener filtering techniques. The speech signal
in general is highly susceptible to various noises. Therefore, speech denoising is
essential to produce noise-free speech signals from noisy recordings, thus improving
the perceived speech quality and increasing its intelligibility. A common approach is to remove high-frequency components from the noisy signal, but this also removes parts of the original signal, resulting in undesirable quality degradation.
In this chapter, Wiener filtering and neural networks are compared as tools for
speech signal enhancement. The output signal quality is assessed in terms of signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR). Advanced MATLAB toolboxes, namely the Deep Learning, Audio, and Signal Processing toolboxes, are utilized for the analysis.
Chapter “Video Soundtrack Evaluation with Machine Learning: Data Availabil-
ity, Feature Extraction, and Classification” by Georgios Touros and Theodoros
Giannakopoulos evaluates multimodal signals using machine learning techniques,
with a combined analysis of both video and audio data, in order to find satisfactory
accompaniment music for video content. The availability of data, feature extraction,
and classification are discussed in this chapter. Creating or choosing music that
accompanies visual content, i.e., video soundtracks, is an artistic task usually taken up by dedicated professionals, namely a composer and a music supervisor, so that the musical content best accentuates each scene. In this chapter,
a method is proposed for collecting and combining relevant data from three
modalities: audio, video, and symbolic representations of music, in an end-to-end
classification pipeline. A comprehensive multimodal feature library is described,
together with a database that has been obtained by applying the proposed method
on a small dataset representing movie scenes. Furthermore, a classifier that aims
to discriminate between real and fake examples of video soundtracks from movies
has been implemented. This chapter also presents potential research directions and
possible improvements in the investigated area.
Chapter “Deep Learning Approach to Joint Identification of Instrument Pitch and
Raga for Indian Classical Music” by Ashwini Bhat, Karrthik G. K., Vishal Mahesh,
and Vijaya Krishna A. explores deep learning approaches for joint identification of
instruments, pitch, and ragas in Indian classical music. The concept of raag and
shruti is fundamental in Indian classical music, so their identification, although
difficult, is crucial for the analysis of a very complex Indian classical music. The
chapter offers a comprehensive comparison of Convolution Neural Network (CNN),
Recurrent Neural Network (RNN), and XGboost as tools to achieve the goal. Three
feature sets have been created for each task at hand, three models trained, and next
a combined RNN model created, yielding approximately 97% accuracy.
Chapter “Comparison of Convolutional Neural Networks and K-Nearest Neigh-
bors for Music Instrument Recognition” by Dhivya S and Prabu Mohandas analyses
convolutional neural networks and k-nearest neighbours (k-NN) for identifying
instruments from music. Music instrument recognition is one of the main tasks of
music information retrieval, as it can enhance the performance of other tasks like
automatic music transcription, music genre identification, and source separation.
Identification of instruments from the recording is a challenging task in the case
of polyphonic music, but it is feasible in the monophonic case. Temporal, spectral,
and perceptual features are used for identifying instruments. The chapter compares
a convolutional neural network architecture and k-nearest neighbour classifier
to identify the musical instrument from monophonic music. Mel-spectrogram
representation is used to extract features for the neural network model, and mel-
frequency cepstral coefficients are the basis for the k-NN classification. The models
were trained on the London Philharmonic dataset consisting of six classes of musical
instruments, yielding up to 99% accuracy.
Chapter “Emotion Recognition in Music Using Deep Neural Networks” written
by Angelos Geroulanos and Theodoros Giannakopoulos deals with the emotion
recognition in music, using deep learning techniques. Although accessing music
content online is easy nowadays, and streaming platforms provide automatic
recommendations to the users, the suggested list often does not match the current
emotional state of the listener; even the classification of emotions poses difficulty,
due to the lack of universal definitions. In this chapter, the task of music emotion
recognition is investigated using deep neural networks, and adversarial architectures
are applied for music data augmentation. Traditional classifiers such as support
vector machines, k-NN, random forests, and trees have also been applied, using
hand-crafted features representing the audio signals. Mel scale spectrograms were
used as a basis to create inputs to the deep convolutional networks. Six archi-
tectures (AlexNet, VGG16bn, Inception v3, DenseNet121, SqueezeNet1.0, and
ResNeXt101-32x8d) with an equal number of ImageNet pre-trained models were
applied in transfer learning. The classification was evaluated for the recognition of
valence, energy, tension, and emotions (anger, fear, happy, sad, and tender).
In the era of deep learning, speech and music signal processing offers unprece-
dented opportunities to transform the healthcare industry. In addition, the quality of
the perceived speech and music signals, both for normal-hearing and hard-of-hearing people,
is one of the most important requirements of the end users. Music can help deal with
stress, anxiety, and various emotions, and influence activity-related brain plasticity.
Part III comprises five chapters that explore the potential use of speech and music
technology for our well-being. The first three chapters focus on music processing
for the hearing impaired, as well as on music therapy addressed to relieve anxiety
in diabetic patients and stress in the era of pandemic. The fourth chapter sheds light
on the plasticity of the brain when learning music, and the fifth chapter is focused
on expressing emotions in speech automatically generated from text.
Chapter “Music to Ears in Hearing Impaired: Signal Processing Advancements in
Hearing Amplification Devices” by Kavassery Venkateswaran Nisha, Neelamegara-
jan Devi, and Sampath Sridhar explores music perception in the hearing impaired,
using hearing aids and cochlear implants. Hearing aids improve the auditory perception of speech sounds using various signal processing techniques. However, music perception is usually not improved, as hearing aids do not compensate for the nonlinear response of the human cochlea, a prerequisite for music perception. The limited input dynamic range and higher crest ratio of the analogue-to-digital converters in hearing aids fall short of processing live music. Cochlear implants were developed to improve speech perception rather than music perception, and they have
limitations for music perception in terms of encoding fine structure information
in music. The electrode array that is surgically implanted results in difficulty in
perceiving pitch and higher harmonics of musical sounds. This chapter provides
elaborate discussion on the advancements in signal processing techniques in hearing
amplification devices such as hearing aids and cochlear implants that can address
their drawbacks.
Chapter “Music Therapy: A Best Way to Solve Anxiety and Depression in
Diabetes Mellitus Patients” by Anchana P. Belmon and Jeraldin Auxillia evaluates
the potential of music therapy as an alternative solution towards the anxiety and
depression in diabetic patients. There are pharmacological and non-pharmacological
treatments available to deal with anxiety and depression. Music therapy, along with relaxation and patient training, is the main non-pharmacological method. The effect of music on the human body is remarkable. There are two types of music therapy,
namely passive and active music therapy. In this chapter, the effectiveness of music
therapy in 50 diabetic patients has been assessed using Beck Anxiety Inventory and
Beck Depression Inventory, reporting 0.67 reliability. The anxiety and depression
measures were assessed in pre-evaluation, post-evaluation, and follow-up stages.
The statistical analysis suggests that music is an effective tool to accelerate the
recovery of patients.
Chapter “Music and Stress During Covid-19 Lockdown: Influence of Locus of
Control and Coping Styles on Musical Preferences” by Junmoni Borgohain, Rashmi
Ranjan Behera, Chirashree Srabani Rath, and Priyadarshi Patnaik explores music as
one of the effective strategies to enhance well-being during the lockdown. It tries
to analyse the relation between stress during Covid-19 lockdown and preferences
towards various types of music as a remedial tool. Music helps to reduce stress,
but the ways people deal with stress are influenced by individual traits and people’s
musical tastes. The reported study was conducted on 138 Indian participants, repre-
senting various age, social, and demographic groups. Several quantitative measures
(scaled from 1 to 5), such as the Brief-COPE Inventory, the Perceived Stress Scale, and the Cantril Scale, are used for parametric representation of the various activities performed by the subjects, and statistical analysis is applied for
data analysis. This study has observed several patterns in music preference during
lockdown period. The study shows how music can be used as a tool for socio-
emotional management during stressful times, and it can be helpful for machine
learning experts to develop music-recommendation systems.
Chapter “Biophysics of Brain Plasticity and Its Correlation to Music Learning”
authored by Sandipan Talukdar and Subhendu Ghosh explores the correlation
of brain plasticity with learning music, based on experimental evidence. Brain
plasticity is one of the key mechanisms of learning new things through growth
and reorganization of neural networks in the brain. Human brains can change both
structurally and functionally, which is a basis of the remarkable capacity of the brain
to learn and memorize or unlearn and forget. The plasticity of the brain manifests
itself at the level of synapses, single neurons, and networks. Music learning involves
all of these mechanisms of brain plasticity, which requires intensive brain activities
at different regions, whether it is simply listening to a music pattern, or performing,
or even imagining music. The chapter investigates the possibility of any correlation
between the biological changes induced in the brain and the sound wave during
music perception and learning. Biophysical mechanisms involved in brain plasticity
at the level of synapses and single neurons are considered for experimentation. The
ways in which this plasticity is involved in music learning are discussed in this
chapter, taking into account the experimental evidence.
Chapter “Analyzing Emotional Speech and Text: A Special Focus on Bengali
Language” by Krishanu Majumder and Dipankar Das deals with the development
of a text-to-speech (TTS) system for the Bengali language that incorporates as many naturalistic features as possible using deep learning techniques. The existing multi-
lingual state-of-the-art TTS systems that produce speech for given text have several
limitations. Most of them lack naturalness and sound artificial. Also, very limited
work has been carried out on regional languages like Bengali, and no standard
database is available to carry out the research work. This has motivated the authors
to collect a database in the Bengali language, with different emotions, for developing a TTS engine. TTS systems are generally trained on a single language, and the
possibility of training a TTS on multiple languages has also been explored. The
chapter explores the possibility of including the contextual emotional aspects in the
synthesized speech to enhance its quality. Another contribution of this chapter is to
develop a bilingual TTS in Bengali and English languages. The objectives of the
chapter have been validated in several experiments.
The concluding part of this volume comprises six chapters addressing an equal number of case studies. They span from research addressing the duplication of
audio material to Dutch song structures, and from album covers to measurements
of the tabla’s timbre. Musical influence on visual aesthetics as well as the study on
emotions in audio-visual domain complete the picture of this part’s contents.
Chapter “Duplicate Detection for Digital Audio Archive Management: Two Case
Studies”, by Joren Six, Federica Bressan, and Koen Renders, presents research
aimed at identifying duplicate audio material in large digital music archives. The
recent and rapid developments of Music Information Retrieval (MIR) have yet to be
exploited by digital music archive management, but there is promising potential for
this technology to aid such tasks as duplicate management. This research comprises
two case studies to explore the effectiveness of MIR for this task. The first case
study is based on a VRT shellac disc archive at the Belgian broadcasting institute.
Based on 15,243 digitized discs (out of about 100,000 total), the study attempts to
determine the amount of unique versus duplicate audio material. The results show
difficulties in discriminating between a near exact noisy duplicate and a translated
version of a song with the same orchestral backing, when based on duplicate
detection only. The second case study uses an archive of tapes from the Institute
for Psychoacoustic and Electronic Music (IPEM). This study had the benefit of the archive having been digitized twice, first in 2001 and then in 2016. The results showed that in this
case, MIR was highly effective at correctly identifying tracks and assigning meta
data. This chapter concludes with a deeper dive into the recent Panako system for
acoustic fingerprinting (i.e., the technology for identifying the same or similar audio data in a database), to show its virtues.
Yke Paul Schotanus shows in chapter “How a Song’s Section Order Affects Both
‘Refrein’ Perception and the Song’s Perceived Meaning” how digital restructuring
of song sections influences the lyrical meaning, as well as our understanding of the
song’s structure. When the section order of a song is manipulated, the listeners’
understanding of a song is primarily based on how and where they perceive the
chorus (refrein in Dutch), and/or the leading refrain line. A listening experiment was
conducted, involving 111 listeners and two songs. Each participant listened to one
of three different versions of the same Dutch cabaret song. The experiment showed
that section order affects refrain perception and (semantic) meaning of Western pop
songs. Manipulating musical properties such as pitch, timing, phrasing, or section
order shows that popular music is more complex than thus far presumed; “the refrain
of a song cannot be detected on the basis of strict formal properties”, he concludes.
The objective of chapter “Musical Influence on Visual Aesthetics: An Explo-
ration on Intermediality from Psychological, Semiotic, and Fractal Approach” (by
Archi Banerjee, Pinaki Gayen, Shankha Sanyal, Sayan Nag, Junmoni Borgohain,
Souparno Roy, Priyadarshi Patnaik, and Dipak Ghosh) is to determine the degree
to which music and visuals interact in terms of human psychological perception.
Auditory and visual senses are not isolated aspects of psychological perception;
rather, they are entwined in a complex process known as intermediality. Further-
more, the senses are not always equal in their impact on each other in a multimodal
experience, as some senses may dominate others. This study attempts to investigate
the relationship between auditory and visual senses to discover which is more
dominant in influencing the total emotional outcome for a certain audience. To
this end, abstract paintings have been used (chosen because of the lack of semantic
dominance and the presence of pure, basic visual elements – lines, colours, shapes,
orientation, etc.) and short piano clips of different tempo and complexity. Forty-
five non-artist participants are then exposed to various combinations of the music
and art – both complementary and contradictory – and asked to rate their response
using predefined emotion labels. The results are then analysed using a detrended
fluctuation analysis to determine the nature of association between music and
visuals – indifferent, compatible, or incompatible. It is found that music has a
more significant influence on the total emotional outcome. This study reveals that
intermediality is scientifically quantifiable and merits additional research.
Chapter “Influence of Musical Acoustics on Graphic Design: An Exploration
with Indian Classical Music Album Cover Design”, by Pinaki Gayen, Archi
Banerjee, Shankha Sanyal, Priyadarshi Patnaik, and Dipak Ghosh, analyses graphic design strategies for Indian classical music album covers and explores possible new design strategies that move beyond status quo conventions. The study
is conducted with 30 design students who are asked to draw their own designs
upon hearing two types of musical examples – Komal Rishav Asavari and Jaunpuri,
which had been rated as “sad music” and “happy music”, respectively, by a previous
70-person experiment. The design students were split into two groups, and each
given 1 hour to complete their own designs while listening to the music. The
resulting designs were then analysed using semiotic analysis and fractal analysis
(detrended fluctuation analysis) to identify patterns of intermediality. This semiotic
analysis entailed analysing iconic or symbolic representation (direct association
to objects) versus indexical representation (cause and effect relationships). The
findings showed that album cover designs fell into three categories: direct mood or emotional representation using symbolic followed by indexical representation; visual imageries derived from indexical followed by iconic representation; and
musical feature representation primarily relying on iconic representation. In sum-
mary, the study provides new design possibilities for Indian classical music album
covers and offers a quantitative approach to establishing effective intermediality
towards successful designs.
Shankha Sanyal, Sayan Nag, Archi Banerjee, Souparno Roy, Ranjan Sengupta,
and Dipak Ghosh present in chapter “A Fractal Approach to Characterize Emotions
in Audio and Visual Domain: A Study on Cross-Modal Interaction” their study
about classifying the emotional cues of sound and visual stimuli solely from their
source characteristics. The study uses as a sample data set a collection of six audio signals of 15 seconds each and six affective pictures, three each of positive and negative valence (“excited”, “happy”, “pleased”, etc., versus “sad”, “bored”, “angry”, etc.). Then, using detrended fluctuation analysis (DFA), the study calculates the long-range temporal correlations (the Hurst exponent) corresponding to the audio signals. The DFA technique was then applied to the arrays of pixels of the affective pictures of contrasting emotions, yielding a single scaling exponent for each audio signal and three scaling exponents, corresponding to the red/green/blue (RGB) components, for each image. Finally, the detrended cross-correlation (DCCA)
technique was used to calculate the degree of nonlinear correlation between the
sample audio and visual clips. The results were next confirmed by a follow-up
human response study based on the emotional Likert scale ratings. The study
presents an original algorithm to automatically classify and compare emotional
appraisal from cross-modal stimuli based on the amount of long-range temporal
correlations between the auditory and visual stimulus.
The closing chapter, chapter “Inharmonic Frequency Analysis of Tabla Strokes
in North Indian Classical Music”, by Shambhavi Shivraj Shete and Saurabh Harish
Deshmukh, features the tabla. It is one of the essential instruments in North Indian Classical Music (NICM), and its timbre is highly distinctive compared with Western drums. The tabla's timbre is related to inharmonicity (i.e., its overtones depart from the harmonic series of integer multiples of the fundamental), which is due to a complex craft involving the application of ink (Syahi) to the tabla drum surface. This study aims
to create a set of standard measurements of the tabla’s timbre as this could be useful
for instrument makers and performers. This measurement process is accomplished
in two steps. First, a recording session collects 10 samples of a tabla playing the 9
common strokes within NICM for a total of 90 audio samples. These samples are
then processed by a fast Fourier transform function to extract a frequency spectrum
and determine the fundamental. The results are then compiled and organized by
stroke type with comments about which overtones are most defining and which
aspects of the stroke technique are especially important in shaping those overtone responses.
We cordially thank all the authors for their valuable contributions. We also
thank the reviewers for their input and valuable suggestions, and the Utrecht
University intern Ethan Borshansky, as well as Mariusz Kleć from the Polish-
Japanese Academy of Information Technology for their editorial assistance.
Finally, we thank all the stakeholders who have contributed directly or indirectly
to making this book a success.
Cachar, Assam, India Anupam Biswas
Utrecht, The Netherlands Emile Wennekes
Warsaw, Poland Alicja Wieczorkowska
Cachar, India Rabul Hussain Laskar
December 2022
Contents
Part I State-of-the-Art
A Comprehensive Review on Speaker Recognition .......................... 3
Banala Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar
Music Composition with Deep Learning: A Review ......................... 25
Carlos Hernandez-Olivan and José R. Beltrán
Music Recommendation Systems: Overview and Challenges ............... 51
Makarand Velankar and Parag Kulkarni
Music Recommender Systems: A Review Centered on Biases .............. 71
Yesid Ospitia-Medina, Sandra Baldassarri, Cecilia Sanz,
and José Ramón Beltrán
Computational Approaches for Indian Classical Music: A
Comprehensive Review .......................................................... 91
Yeshwant Singh and Anupam Biswas
Part II Machine Learning
A Study on Effectiveness of Deep Neural Networks for Speech
Signal Enhancement in Comparison with Wiener Filtering Technique.... 121
Vijay Kumar Padarti, Gnana Sai Polavarapu, Madhurima Madiraju,
V. V. Naga Sai Nuthalapati, Vinay Babu Thota,
and V. D. Subramanyam Veeravalli
Video Soundtrack Evaluation with Machine Learning: Data
Availability, Feature Extraction, and Classification .......................... 137
Georgios Touros and Theodoros Giannakopoulos
Deep Learning Approach to Joint Identification of Instrument
Pitch and Raga for Indian Classical Music.................................... 159
Ashwini Bhat, Karrthik Gopi Krishnan, Vishal Mahesh,
and Vijaya Krishna Ananthapadmanabha
Comparison of Convolutional Neural Networks and K-Nearest
Neighbors for Music Instrument Recognition ................................ 175
S. Dhivya and Prabu Mohandas
Emotion Recognition in Music Using Deep Neural Networks............... 193
Angelos Geroulanos and Theodoros Giannakopoulos
Part III Perception, Health and Emotion
Music to Ears in Hearing Impaired: Signal Processing
Advancements in Hearing Amplification Devices ............................ 217
Kavassery Venkateswaran Nisha, Neelamegarajan Devi,
and Sampath Sridhar
Music Therapy: A Best Way to Solve Anxiety and Depression
in Diabetes Mellitus Patients .................................................... 237
Anchana P. Belmon and Jeraldin Auxillia
Music and Stress During COVID-19 Lockdown: Influence of
Locus of Control and Coping Styles on Musical Preferences ............... 249
Junmoni Borgohain, Rashmi Ranjan Behera, Chirashree Srabani Rath,
and Priyadarshi Patnaik
Biophysics of Brain Plasticity and Its Correlation to Music Learning ..... 269
Sandipan Talukdar and Subhendu Ghosh
Analyzing Emotional Speech and Text: A Special Focus on
Bengali Language ................................................................ 283
Krishanu Majumder and Dipankar Das
Part IV Case Studies
Duplicate Detection for Digital Audio Archive Management:
Two Case Studies ................................................................. 311
Joren Six, Federica Bressan, and Koen Renders
How a Song’s Section Order Affects Both ‘Refrein’ Perception
and the Song’s Perceived Meaning ............................................. 331
Yke Paul Schotanus
Musical Influence on Visual Aesthetics: An Exploration on
Intermediality from Psychological, Semiotic, and Fractal Approach ...... 353
Archi Banerjee, Pinaki Gayen, Shankha Sanyal, Sayan Nag,
Junmoni Borgohain, Souparno Roy, Priyadarshi Patnaik, and Dipak Ghosh
Influence of Musical Acoustics on Graphic Design: An
Exploration with Indian Classical Music Album Cover Design ............ 379
Pinaki Gayen, Archi Banerjee, Shankha Sanyal, Priyadarshi Patnaik,
and Dipak Ghosh
A Fractal Approach to Characterize Emotions in Audio
and Visual Domain: A Study on Cross-Modal Interaction .................. 397
Shankha Sanyal, Archi Banerjee, Sayan Nag, Souparno Roy,
Ranjan Sengupta, and Dipak Ghosh
Inharmonic Frequency Analysis of Tabla Strokes in North Indian
Classical Music ................................................................... 415
Shambhavi Shivraj Shete and Saurabh Harish Deshmukh
Index............................................................................... 441
Part I
State-of-the-Art
A Comprehensive Review on Speaker
Recognition
Banala Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar
1 Introduction
Speech is the universal mode of human communication. In addition to exchanging
thoughts and ideas, speech is considered to be useful for extracting a lot of other
information like language identity, gender, age, emotion, cognitive behavior, and
speaker identity. One of the goals in speech technology is to make human-machine
interaction as natural as possible, with systems like intelligent assistants, e.g., Apple
Siri, Cortana, and Google Now. Speaker recognition also has a huge scope in this
line of products. Every human has a unique speech production system [1]. The
unique characteristics of the speech production system help to find a speaker’s
identity based on his or her speech signal. The task of recognizing the identity
of a person from speech signal is called speaker recognition. It may be classified
into speaker identification and speaker verification. The process of identifying
an unknown speaker from a set of known speakers is speaker identification,
while authentication of an unknown speaker claiming a person’s identity already
registered with the system is called speaker verification [2]. Speaker recognition
finds numerous applications across different fields like biometrics, forensics, and
access control systems [3]. Further, speaker recognition is classified into text-
dependent and text-independent tasks based on whether the test subject is required
to use a particular fixed utterance or is free to utter any valid text for recognition
purposes [4]. Research on speaker recognition has been carried out since the
1960s [5]. Significant advancements have been made in this field over the recent
decades where various aspects like features, modeling techniques, and scoring have
been explored. The advances in deep learning and machine learning techniques
B. Saritha · M. A. Laskar · R. H. Laskar
Department of Electronics and Communication Engineering, National Institute of Technology
Silchar, Silchar, India
e-mail: rhlaskar@ece.nits.ac.in
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Biswas et al. (eds.), Advances in Speech and Music Technology, Signals and
Communication Technology, https://doi.org/10.1007/978-3-031-18444-4_1
have helped to promote speaker recognition and develop renewed interest among
researchers in this field. Owing to its ease of use and higher accuracy, text-dependent
speaker verification has been one of the focus areas. It plays a vital role in fraud
prevention and access control. This chapter presents a comprehensive review of the
techniques and methods employed for speaker recognition, with emphasis on text-
dependent speaker verification.
The chapter’s organization is as follows: Sect. 2 describes the basic structure
of a speaker recognition system. Section 3 presents a review of feature extraction
techniques with an emphasis on the Mel-frequency cepstral coefficient (MFCC)
feature extraction method. Speaker modeling involving classical techniques is
discussed in Sect. 4. Advancements in speaker recognition with deep learning
are discussed in Sect. 5. It also describes the performance metric for speaker
recognition. The last section concludes the chapter.
2 Basic Overview of a Speaker Recognition System
Figure 1 represents the basic block diagram of the speaker verification system. The
design of the speaker verification system mainly consists of two modules, namely,
frontend and backend. Frontend takes the speech as an input signal and extracts
the features. Generally, features are a more convenient representation of a given speech signal; they take the form of a set of vectors and are termed acoustic features.
The acoustic features are fed to the backend, which consists of a pre-designed
speaker model along with classification and decision-making modules. The model
based on these features is then compared with the registered speakers' models to determine the match between speakers. In text-dependent speaker verification
(TDSV), the systems need to model both the text and the speaker characteristics.
Figure 2 presents the block diagram representation of a text-dependent speaker
verification system.
The speech signal input is first pre-processed using pre-emphasis filter followed
by windowing and voice activity detection. Feature extraction is carried out using
the voiced speech frames. These features are further modeled using techniques like
Gaussian mixture model (GMM), identity vector (i-vector), or neural network and
Fig. 1 Basic block diagram of a speaker verification system
Fig. 2 Text-dependent speaker verification system block diagram representation
used for enrollment and verification. In the enrollment phase, speech utterance of
adequate duration is taken and subjected to the said feature extraction and modeling
modules to obtain the speaker model. Generally, a background or development
set data is also required in conjunction with the enrollment data for training.
During verification, the test utterance undergoes similar transformations and is
then compared with the model corresponding to the claimed speaker identity. The
comparison results in a score that helps to decide whether to accept or reject the
claim [6].
3 Review on Feature Extraction
Every speaker has a unique speech production system. The process of capturing
vocal characteristics is called feature extraction. Features can be classified into
two types, namely, behavior-based (learned) features and physiological features.
Behavior-based features include prosodic, spectro-temporal, and high-level features.
Rhythm, energy, duration, pitch, and temporal features constitute the prosodic and
spectro-temporal features. Phones, accents, idiolect, semantics, and pronunciation
are the high-level features [7]. Figure 3 shows a classification of feature character-
istics.
The physiological features are representative of the vocal tract length, dimension,
and vocal fold size. Short-term spectral feature representations are commonly
used to characterize these speaker-specific attributes. Some of the commonly used
spectral features include Mel-frequency cepstral coefficients (MFCCs), gammatone
feature, gammatone frequency cepstral coefficients (GFCCs), relative spectral-
perceptual linear prediction (RASTA-PLP), Hilbert envelope of gammatone filter
bank, and mean Hilbert envelope coefficients (MHECs) [8]. Of these features,
MFCCs are the most widely used spectral features in the state-of-the-art speaker
identification and verification systems.
Fig. 3 Classification of feature characteristics [7]
3.1 MFCC’s Extraction Method
MFCCs are based on human auditory perception, which is nonlinear and can be approximated by the Mel scale: roughly linear at low frequencies and logarithmic at higher frequencies [9]. For a given frequency f in Hz, the corresponding Mel-scale frequency can be determined by the following
formula:
mel(f) = 2595 · log10(1 + f/700)    (1)
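As a concrete illustration of Eq. (1), the short Python sketch below converts between Hz and the Mel scale; the function names are ours, not from any particular toolkit.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale, per Eq. (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving Eq. (1) for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is roughly linear below ~1 kHz and logarithmic above it:
print(hz_to_mel(1000.0))  # ~1000 mel
print(hz_to_mel(8000.0))  # ~2840 mel
```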
Figure 4 represents the commonly followed feature extraction process. Speech
signals are non-stationary in nature, i.e., the spectral content of the signals varies
with time. Hence, in order to process a speech signal, it is divided into short
(overlapping) temporal segments called frames, 10–30 ms long, since the speech signal is quasi-stationary over such short intervals, and the short-time Fourier transform can be applied to analyze it. Further, to reduce the artifacts due to sudden signal truncation at the frame boundaries, windowing is done. Generally, the Hamming window (2) is applied to all frames to obtain smooth boundaries and to minimize spectral distortion.
w(n) = 0.54 − 0.46 cos(2πn/(M − 1)),  0 ≤ n ≤ M − 1    (2)

where M is the number of samples in the frame.
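To make the framing and windowing steps concrete, here is a minimal sketch, assuming a 16 kHz signal with 20 ms frames and a 10 ms hop (the typical values shown in Fig. 5); all names and defaults are illustrative.

```python
import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=20, hop_ms=10):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_ms / 1000)  # M samples per frame
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # Eq. (2)
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```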
The short segments can be assumed to be stationary and are used for short-
term frequency analysis. To obtain MFCCs, the windowed frames are subjected
to fast Fourier transform (FFT) followed by the Mel filterbank, as shown in
Fig. 5. The Mel spectrum is then represented on log scale before performing
discrete cosine transform (DCT) to obtain the MFCC features. Usually, the first 13
coefficients, C0, C1, . . . , C12, are considered. The coefficients are then normalized
Fig. 4 Process of extracting feature vectors
Fig. 5 MFCC feature extraction process
using techniques like cepstral mean subtraction (CMS), relative spectral filtering (RASTA), and feature warping. Once the features are normalized, the
difference between C0 coefficients of a frame and its subsequent frame is calculated
to obtain the delta parameter d0. Similarly, d1, d2, . . . , dn are obtained from
C1, C2, . . . , Cn coefficients, respectively, as shown in Fig. 6. These are known
as delta features. In the same way, acceleration or double-delta features are obtained
by using difference of delta features [10]. The 13 MFCCs, the 13 delta features,
and the 13 double-delta features are concatenated to obtain 39-dimensional feature
vector for every frame.
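In practice, the full 39-dimensional feature vectors can be obtained with an off-the-shelf library. The sketch below uses librosa; the 16 kHz sampling rate, the file name, and the per-utterance mean subtraction are assumptions mirroring the text rather than requirements.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static MFCCs per frame
delta = librosa.feature.delta(mfcc)                 # delta (velocity) features
delta2 = librosa.feature.delta(mfcc, order=2)       # double-delta (acceleration)

features = np.vstack([mfcc, delta, delta2])         # shape: (39, n_frames)
features -= features.mean(axis=1, keepdims=True)    # simple CMS normalization
```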
Fig. 6 Delta and acceleration features
4 Speaker Modeling
Once the acoustic features are extracted, speaker models are trained on them. The traditional speaker modeling techniques fall into two categories: template models and stochastic models. Vector quantization (VQ) and dynamic
time warping (DTW) approaches are the most popular template-based modeling
techniques [11]. For text-dependent speaker verification, DTW template matching
is a widely used technique. The acoustic feature sequence is obtained from the
enrollment utterance and stored as a speaker-phrase template model. During the
testing phase, the feature sequence corresponding to the test utterance is compared
with the speaker-phrase template model using the DTW algorithm. DTW helps to
time-align the two sequences and gives a similarity score, which is then used for
the decision-making process. Another popular system for text-dependent speaker
verification is realized using the VQ technique. In this method, the feature vectors are mapped to a finite number of regions in the vector space. These regions form clusters and are represented by centroids. The set of centroids that represents the entire vector space is known as a codebook. Hence, the
speaker-phrase models are prepared in terms of codebooks. This technique allows
more flexibility in modeling the speaker-phrase model.
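For the DTW template matching described above, a minimal dynamic-programming sketch is given below; the frame-level Euclidean distance and the length normalization are common choices, shown here as assumptions rather than the exact recipe of any cited system. The returned cost acts as the (dis)similarity score used in the decision.

```python
import numpy as np

def dtw_distance(template, test):
    """Align two feature sequences (n_frames x dim) and return the DTW cost."""
    n, m = len(template), len(test)
    # Pairwise Euclidean distances between all frame pairs
    dist = np.linalg.norm(template[:, None, :] - test[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)  # length-normalized alignment cost
```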
The stochastic models make use of probability theory. The most popular
stochastic models are the Gaussian mixture model-universal background model
(GMM-UBM) and hidden Markov model (HMM).
GMM-UBM is also a commonly used method for text-dependent speaker
verification [12]. A UBM is built to represent a world model. Using the maximum
a posteriori (MAP) adaptation technique, speaker-phrase-specific GMM is built
from UBM using class-specific data [13]. The log-likelihood ratio is used to make
the decision of whether to accept or reject the speaker-phrase subject. A number
of HMM-based methods have also been proposed for text-dependent speaker
verification. Such models are good at modeling the temporal information in the
utterance and help to provide improved results. An unsupervised HMM-UBM and
temporal GMM-UBM have also been proposed for TDSV [14]. In the HMM-UBM-based method, a speaker-specific HMM is built through MAP adaptation from a speaker-independent HMM-UBM trained in an unsupervised manner without using
any transcription. In the temporal GMM-UBM approach, however, the temporal
information is incorporated by computing the transition probability among the
GMM-UBM mixture components using the speaker-specific training data. The
HMM-UBM-based method is found to outperform the other systems by virtue
of more effective modeling of the temporal information. Hierarchical multi-layer
acoustic model (HiLAM) is an example of an HMM-based speaker-phrase model. It
is a hierarchical acoustic model that adapts a text-independent, speaker-dependent
GMM from UBM and then adapts the different HMM states from the mid-level
model.
4.1 Gaussian Mixture Model-Universal Background Model
The GMM-UBM model, shown in Fig. 7, is a popular method for text-dependent speaker verification. When all features are populated in a high-dimensional space, clusters are formed. Each cluster can be represented by a Gaussian distribution specified by mean and variance parameters. The overall data may be represented by a mixture of such Gaussian distributions, known as a GMM, which is defined by a set of mean, variance, and weight parameters. The universal background model is a general speaker-independent model. It is used to obtain speaker-dependent GMMs by adapting the means, variances, and weights using target-specific data [15].
A training set is required to build up the UBM model, alongside a target set for
which we are designing the system and a test set to evaluate the performance of the
system.
A representation of a GMM-based speaker model is given in Fig. 8. For a speaker model X1, each of the 1024 mixture components has a scalar weight, a 39×1 mean vector, and a 39×1 covariance vector. Similarly, for speaker models 2 to M, we have to store 1024 weights and 39×1024 mean and covariance matrices per speaker.
Fig. 7 A basic system of GMM-UBM model
Fig. 8 Gaussian mixture model for speaker
When a speaker claims the identity of a registered speaker, the speaker verification system first extracts the features and compares them with the speaker model (GMM), determines the level of the match by comparing the log-likelihood ratio with a predefined threshold, and decides whether to accept or reject the claimed speaker [16]. This process is shown in Fig. 9. The problem with GMM-UBM is that a large number of vectors and matrices need to be stored. A single-vector representation called a supervector S is introduced to represent a speaker and overcome this difficulty.
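A compact sketch of this pipeline is given below using scikit-learn: a diagonal-covariance UBM is fitted on pooled background features, the component means are MAP-adapted toward a speaker's enrollment data, and verification uses the average per-frame log-likelihood ratio. The mean-only adaptation and the relevance factor of 16 are conventional simplifications, not the book's exact settings.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats, n_components=64):
    """Fit a diagonal-covariance UBM on pooled background features (frames x dim)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(background_feats)
    return ubm

def map_adapt_means(ubm, enroll_feats, relevance=16.0):
    """Mean-only MAP adaptation of the UBM toward one speaker's enrollment data."""
    post = ubm.predict_proba(enroll_feats)      # posteriors: (frames, components)
    n_c = post.sum(axis=0) + 1e-10              # soft counts per component
    e_c = post.T @ enroll_feats / n_c[:, None]  # posterior-weighted mean per component
    alpha = (n_c / (n_c + relevance))[:, None]  # data-dependent adaptation weights
    speaker = copy.deepcopy(ubm)
    speaker.means_ = alpha * e_c + (1.0 - alpha) * ubm.means_
    return speaker

def llr_score(speaker_gmm, ubm, test_feats):
    """Average per-frame log-likelihood ratio; accept if above a tuned threshold."""
    return speaker_gmm.score(test_feats) - ubm.score(test_feats)
```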
Fig. 9 Representation of universal background model
4.2 Supervector
A supervector is formed by concatenating all individual mixture means, scaled by the corresponding weights and covariances, resulting in a 39×1024-dimensional feature vector, as shown in Fig. 10. Each GMM (speaker model) can be represented by a super-
vector. Speech of different durations can be represented by supervectors of fixed
size. Model normalization is typically carried out on a supervector when the speaker
model is built. Supervectors are further modeled using various techniques available
like nuisance attribute projection (NAP), joint factor analysis (JFA), i-vector, within-
class covariance (WCCN), linear discriminant analysis, and probabilistic linear
discriminant analysis (PLDA) [17].
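A supervector can be formed from such a GMM as sketched below; the scaling of each mean by its weight and covariance follows the common Kullback-Leibler-motivated supervector recipe, which we show as an assumption rather than the chapter's exact formula.

```python
import numpy as np

def gmm_supervector(gmm):
    """Concatenate the scaled component means of a diagonal GMM into one vector.

    Each mean is scaled by sqrt(weight)/sqrt(variance), the usual normalization
    for supervector kernels; e.g., 1024 components x 39 dims -> 39936-dim vector.
    """
    scaled = np.sqrt(gmm.weights_)[:, None] * gmm.means_ / np.sqrt(gmm.covariances_)
    return scaled.ravel()
```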
Joint factor analysis is one of the popular techniques used to model GMM
supervectors. The supervector is assumed to embed speaker information, channel
information, some residual component, and speaker-independent components [18].
Accordingly, the supervector S is decomposed into different components given by
S = m + Vy + Ux + Dz (3)
Fig. 10 The process of supervector formation
where m is the speaker-independent component, Vy represents the speaker infor-
mation, Ux represents the channel information, and Dz represents the speaker-
dependent residual component.
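Read generatively, Eq. (3) states that a supervector is the sum of a speaker-independent offset, a low-rank speaker term, a low-rank channel term, and a diagonally scaled residual. The toy NumPy sketch below synthesizes a supervector this way; all dimensions (64 mixtures, 39 features, 50 speaker factors, 20 channel factors) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, ry, rx = 64, 39, 50, 20          # toy sizes: mixtures, feature dims, factor ranks
CD = C * D                             # supervector dimension

m = rng.normal(size=CD)                # speaker-independent component (UBM means)
V = rng.normal(size=(CD, ry))          # eigenvoice matrix: speaker subspace
U = rng.normal(size=(CD, rx))          # eigenchannel matrix: channel subspace
d = np.abs(rng.normal(size=CD))        # diagonal residual scaling (matrix D)

y, x, z = rng.normal(size=ry), rng.normal(size=rx), rng.normal(size=CD)
S = m + V @ y + U @ x + d * z          # Eq. (3): S = m + Vy + Ux + Dz
```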
4.3 i-vector
In this technique, factor analysis is performed considering a common subspace
for channel and speaker information as research indicated that relevant speaker
information is also present in the channel feature obtained in JFA [19]. A lower-
dimensional identity vector (w) is used to represent the speaker model. The model may
be represented as follows:
S = m + T w (4)
where S is the GMM supervector, m is the UBM supervector mean, T represents the
total variability matrix, and w indicates the standard normal distributed vector.
The posterior of w given the utterance X may be written as p(w|X) = N(φ, L⁻¹),
where w carries all the speaker information and follows a normal distribution
with mean φ and covariance L⁻¹; the i-vector is taken as the posterior mean φ.
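For illustration, the i-vector (the posterior mean φ) can be computed in closed form from the zero- and first-order Baum-Welch statistics of an utterance. The sketch below assumes a diagonal-covariance UBM and dense matrices; the variable names and shapes are our own conventions, not the chapter's.

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F):
    """Posterior mean of w for one utterance (the i-vector).
    T: (C*D, R) total variability matrix; Sigma_inv: (C*D,) diagonal UBM precisions;
    N: (C,) zero-order stats; F: (C*D,) centered first-order stats (F - N*m).
    A minimal dense sketch; production systems exploit the per-mixture block structure."""
    CD, R = T.shape
    D = CD // N.shape[0]
    N_big = np.repeat(N, D)                      # expand mixture counts to feature dims
    TtSig = T.T * (Sigma_inv * N_big)            # T^T diag(N Sigma^-1)
    L = np.eye(R) + TtSig @ T                    # posterior precision L
    return np.linalg.solve(L, T.T @ (Sigma_inv * F))   # posterior mean phi
```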
4.4 Trends in Speaker Recognition
Table 1 attempts to report the trends and progress in the field of speaker recognition.
Techniques like NAP and WCCN have been used to normalize channel effects
and session variability. Also, PLDA has been a popular backend model for many
systems.
Table 1 Trends in speaker recognition with adopted techniques
S. no. Year Techniques adopted
1 2005 Supervector + NAP + support vector machine (SVM) scoring
2 2007 Supervector + JFA + y (reduced-dimension vector) + WCCN +
SVM scoring
3 2007 i-vector + WCCN + LDA (reduced-dimension i-vector) + cosine
distance scoring
4 2009 i-vector + PLDA (divides into channel and speaker space/channel
compensation) + cosine distance scoring
5 2010 i-vector replaced with DNN + cosine distance/PLDA as
decision-making
5 Deep Learning Methods for Speaker Recognition
The significant advancements in deep learning and machine learning techniques
have generated renewed interest in speaker recognition among researchers. The different
deep learning architectures have received impetus from the availability of increased
data and high computational power and have resulted in state-of-the-art systems.
The DTW framework has also been implemented using deep neural network (DNN)
posteriors extracted from the DNN-HMM automatic speech recognition (ASR)
model [20]. The system leverages the discriminative power of the DNN-based
model and is able to achieve enhanced performance. The deep neural network
framework has two approaches in speaker recognition. The leading approach is
feature extraction with deep learning methods. Another approach is classification
and decision-making using deep learning methods [21]. In the first approach, Mel-
frequency cepstral coefficients or spectra are taken as inputs and used to train a DNN
with speaker IDs as the target variable.
Speaker feature embeddings are then obtained from the last layers of the trained
DNN. The second approach replaces the cosine distance and probabilistic linear
discriminant analysis scoring: a deep network is used instead for classification and decision-
making.
2014 d-vector In the d-vector framework shown in Fig. 11, stacked filter bank
energies are used as input, instead of MFCCs, to train a DNN in a supervised way
[22]. The averaged activation function outputs from the last hidden layer of the
trained network are used as the d-vector [23]. The 13-dimensional perceptual linear
prediction coefficients (PLP) and delta and double-delta features were used in the
training phase. The “OK Google” database was used for experimentation [24].
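A minimal PyTorch sketch of this idea is given below: a frame-level DNN is trained to classify speakers, and the d-vector is the average of the last hidden layer's outputs over the frames of an utterance. Layer sizes and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DvectorNet(nn.Module):
    """Sketch of a d-vector extractor: a frame-level DNN trained with speaker IDs
    as targets; at test time the last hidden layer is averaged over frames [23]."""
    def __init__(self, feat_dim=40, hidden=256, n_speakers=1000):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_speakers)    # softmax classifier, training only

    def forward(self, frames):                       # frames: (T, feat_dim)
        return self.head(self.body(frames))

    def d_vector(self, frames):
        return self.body(frames).mean(dim=0)         # average over frames -> (hidden,)
```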
2015 j-vector The multi-task learning approach shown in Fig. 12 extends the “d-
vector” concept and leads to the j-vector framework [25]. The network is trained to
discriminate both speakers and text at the same time.
Fig. 11 d-vector framework in Variani et al. [23]
Fig. 12 Multi-task DNN in Chen et al. [25]
Like d-vectors, once the supervised training is finished, the output layer is
discarded. Then a joint feature vector, called the j-vector, is obtained from the last hidden layer
[26].
2018–2019 x-vector The authors D. Snyder and D. Garcia-Romero have proposed
DNN embeddings in their work, called “x-vectors,” to replace i-vectors for text-
independent speaker verification [27, 28]. The main idea is to take variable-length
audio and to get a single vector representation for that entire audio. This single
vector is capable of capturing speaker discriminating features. This architecture
shown in Fig. 13 uses time delay neural network (TDNN) layers with a statistics pooling layer.
Fig. 13 x-vector DNN embedding architecture [26]
Fig. 14 TDNN framework [28]
TDNN operates at frame level, as shown in Fig. 14, and works better
for a smaller amount of data. Speech input (utterances) is fed to the TDNN layer in
frames (x1, x2, . . . , xT), and it generates a sequence of frame-level features.
The statistics pooling layer determines the mean and standard deviation for the
sequence of vectors. Concatenation of these two vectors is passed on as input to the
next layer, which operates at the segment level. In the end, the softmax layer predicts
the probability of each speaker for a particular utterance. Additionally, data augmentation
is performed, where noise and reverberation are added to the original data at different SNRs
[29]. This makes the system more robust and improves accuracy compared to the i-
and d-vectors, particularly for short-duration utterances.
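The statistics pooling step itself is simple to express in code; the sketch below concatenates the per-dimension mean and standard deviation of the frame-level TDNN outputs, omitting batch handling and the attentive weighting used in later variants.

```python
import torch

def statistics_pooling(frame_feats: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Aggregate frame-level TDNN outputs into one segment-level vector by
    concatenating the per-dimension mean and standard deviation.
    frame_feats: (T, D) -> returns (2*D,). A minimal sketch of the pooling
    layer described above."""
    mean = frame_feats.mean(dim=0)
    std = frame_feats.var(dim=0, unbiased=False).clamp_min(eps).sqrt()
    return torch.cat([mean, std], dim=0)

# Usage: segment_vec = statistics_pooling(tdnn(frames))  # fed to segment-level layers
```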
2018–2019 End-to-End System In the paper “End-to-end text-dependent speaker
verification” by Heigold et al. [30], Google introduced DNN embedding-based
speaker verification, which is one of the state-of-the-art systems.
Fig. 15 End-to-end architecture used in [29]: a DNN/LSTM produces the speaker
representation, N enrollment utterances are averaged into a speaker model, and a cosine
similarity score followed by logistic regression yields the accept/reject decision
In this architecture, long short-term memory (LSTM) is used to process the “OK Google”
kind of utterances. It gives speaker representations which are highly discriminative
feature vectors. There are two inputs, namely, enrollment and evaluation utterances,
applied to the network. As shown in Fig. 15, the network aggregates N vectors
corresponding to N enrollment utterances to obtain a representation of the enrolled
speaker. When a speaker makes a claim during the verification stage, the system compares the
newly generated vector with the vector previously stored from the enrollment data, using
cosine similarity. If the similarity is greater than the
pre-determined threshold, the claimed speaker is accepted; otherwise, it is rejected. This
end-to-end architecture's performance on the “OK Google” database is similar to that
of the “d-vector” technique.
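The enrollment-and-scoring logic described above can be sketched in a few lines; the threshold value below is an arbitrary assumption, and real systems calibrate it on development data.

```python
import numpy as np

def verify(enroll_vecs, test_vec, threshold=0.7):
    """Sketch of the enrollment/verification logic described above: average N
    enrollment embeddings into a speaker model, score the test embedding with
    cosine similarity, and threshold. The threshold value is an assumption."""
    model = np.mean(enroll_vecs, axis=0)
    cos = model @ test_vec / (np.linalg.norm(model) * np.linalg.norm(test_vec))
    return cos >= threshold, cos

# accept, score = verify([e1, e2, e3], t)   # N = 3 enrollment utterances
```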
2019–2020 Advancements in TDNN System A number of variants have been
introduced to improve the performance of TDNN. They are factorized TDNN
(F-TDNN), crossed TDNN (C-TDNN), densely connected TDNN (D-TDNN),
extended TDNN (E-TDNN) [31], and emphasized channel attention and propaga-
tion and aggregation TDNN (ECAPA-TDNN). The current state-of-the-art TDNN-
based speaker recognition system is the ECAPA-TDNN, represented in Fig. 16. This architec-
ture was proposed by Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck
[32]. The complete architecture of the squeeze-excitation (SE)-based Res2Net block
(SE-Res2Block) of the ECAPA-TDNN is given in Fig. 17.
Fig. 16 Network topology of the ECAPA-TDNN [31]: a Conv1D + ReLU + BN frontend
(k=5, d=1) on 80-dimensional input features, three SE-Res2Blocks (k=3, d=3), feature
aggregation into 1536 channels, attentive statistics pooling + BN (3072-dimensional), and a
fully connected layer + BN giving a 192-dimensional embedding trained with AAM-softmax
Fig. 17 The SE-Res2Block of the ECAPA-TDNN [31] architecture: Conv1D + ReLU + BN, a
Res2 dilated Conv1D + ReLU + BN, another Conv1D + ReLU + BN, followed by an SE block
5.1 Deep Learning for Classification
Classification models with deep learning are presented in Table 2.
6 Performance Measure
To decide whether a speaker is accepted or rejected by the system, a threshold is
required against which the score is compared.
Correct Decision If the system correctly identifies the true speaker, this is a
“correct decision.”
Miss If the system rejects the true speaker, this is a “miss.”
False Acceptance If the system accepts an imposter, the system makes an error
known as a false acceptance.
Detection Error Trade-Off (DET) Curve
To plot the DET curve, first record the number of times the system rejected the true
speaker (miss) and the number of times the imposter is accepted (false acceptance).
Then express these two parameters as percentages. Take false acceptance
on the x-axis and miss rate on the y-axis for a particular threshold value θ to
obtain a point in two-dimensional space. By varying the θ value continuously,
we obtain a curve known as the detection error trade-off curve. As an example,
the dot on the DET curve in Fig. 18 indicates that the miss rate is very high and
the false acceptance rate is very low [48, 49]. This shows that the system rejects true speakers
a good number of times while admitting few imposters. This operating point is preferable in
high-security systems like banking.
On the other end of the curve, the miss rate is very low, and false acceptance is
very high. This shows the system is letting imposters in easily while rarely missing the
true speaker. Such a scenario is useful in low-security applications [50].
The point at which the miss rate is equal to the false acceptance rate is the equal
error rate (EER). Any system can be operated at this point. For example, a system
with an EER of 1% has a 1% miss rate and a 1% false acceptance rate at that point
[51]. In Fig. 19, the closer the EER point moves toward the origin, the better the system;
the farther it moves up along the y=x line, the worse the system.
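The following sketch computes DET-curve points and the EER from two sets of trial scores by sweeping the threshold; finer grids or interpolation yield a more precise EER.

```python
import numpy as np

def det_points_and_eer(target_scores, imposter_scores):
    """Sweep the threshold over all observed scores and compute (false acceptance,
    miss) pairs, then take the point where the two rates are closest as the EER.
    A minimal sketch of the procedure described above."""
    thresholds = np.sort(np.concatenate([target_scores, imposter_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(imposter_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return fa, miss, (fa[i] + miss[i]) / 2.0      # DET curve points and EER

# fa, miss, eer = det_points_and_eer(true_trials, imposter_trials)
```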
Table 2 Classification models with deep learning
S. no. Techniques adopted Key concept Merits/demerits
1 Variational
autoencoder (VAE)
[33–35]
VAE is used for voice conversion,
speech, and speaker recognition, and
it consists of stochastic neurons along
with deterministic layers. The
log-likelihood ratio scoring is used to
discriminate between the same and
different speakers
The performance of
VAE is not superior to
PLDA scoring
2 Multi-domain
features [36]
Automatic speech recognition (ASR)
output is given as input to the speaker
recognition evaluation (SRE) system
and vice versa. The extracted frame-level
features are the inputs to both ASR and
SRE
For the WSJ database,
EERs are low compared
to the i-vector
3 DNN in place of
UBM [25]
DNN is used instead of UBMs.
Universal deep belief networks are
developed and are used as a backend.
The target i-vector and imposter
i-vector are used to derive a vector with
good discriminating
properties, which is stored in a
discriminative target model
This model did not
accomplish better
performance compared
to the PLDA-based i-vector
4 Unlabeled data
[37–39]
Starting from a labeled corpus, it
continuously introduces unlabeled
samples to learn the DNN model
Proposed LSTM- and
TDNN-based systems
outperform the
traditional methods
5 Hybrid framework
[40]
Zero-order feature statistics are fed to
a standard i-vector model through
DNN. Also, speech is segmented into
senones by training the DNN using
the HMM-GMM model
A low equal error rate is
achieved with this
method
6 SincNet [41–45] SincNet is a distinctive convolutional
neural network (CNN) architecture
that takes one-dimensional raw audio
as input. A filter at the first layer of
CNN acquires the knowledge of lower
(fl) and higher (fu) cut-off
frequencies. Also, the convolutional
layer can adjust these frequencies
before applying them to further
standard layers
Fast convergence,
improved accuracy, and
computational
efficiency. SincNet
method outperforms the
other DNN solutions in
speaker verification
7 Far-field
ResNet-BAM [46]
It is a simple and effective novel
speaker embedding architecture,
the ResNet-BAM method, in which
the bottleneck attention module
(BAM) is combined with a residual
neural network (ResNet). It focuses on
short speech and domain mismatch;
the frontend includes a data
processing unit, and the backend consists
of speaker embedding with a DAT
extractor
Adversarial domain
training with a
gradient reversal layer
handles the domain
mismatch
8 Bidirectional
attention [47]
Proposed to unite CNN-based feature
knowledge with a bidirectional
attention method to attain improved
performance with merely a single
enrollment utterance
It outperforms the
sequence-to-sequence
and vector-based models
Fig. 18 Detection error trade-off (DET) curve
Fig. 19 Performance of system with EER on DET curve
7 Conclusion
This chapter attempts to present a comprehensive review of speaker recognition
with more emphasis on text-dependent speaker verification. It discusses the most
commonly used feature extraction and modeling techniques used for this task. It
also surveys all the recent advancements proposed in the field of text-independent
speaker recognition. The deep learning models have been found to outperform many
classical techniques, and there remains considerable scope to further improve the perfor-
mance of the systems with more advanced architectures and data augmentation
techniques.
References
1. Kinnunen, T. and Li, H.Z. (2010). An Overview of Text-Independent Speaker Recognition:
From Features to Super vectors. Speech Communication, 52, 12–40. https://doi.org/10.1016/j.
specom.2009.08.009.
2. Hansen, John and Hasan, Taufiq. (2015). Speaker Recognition by Machines and Humans: A
tutorial review. Signal Processing Magazine, IEEE. 32. 74–99. https://doi.org/10.1109/MSP.
2015.2462851.
3. Todkar, S.P., Babar, S.S., Ambike, R.U., Suryakar, P.B.,  Prasad, J.R. (2018). Speaker
Recognition Techniques: A Review. 2018 3rd International Conference for Convergence in
Technology (I2CT), 1–5.
4. Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S.,
... and Reynolds, D.A. (2004). A tutorial on text-independent speaker verification. EURASIP
Journal on Advances in Signal Processing, 2004 (4), 1–22.
5. Pruzansky, S., Mathews, M., and Britner P.B. (1963). Talker-Recognition Procedure Based on
Analysis of Variance. Journal of the Acoustical Society of America, 35, 1877–1877.
6. Nguyen, M.S., Vo, T. (2015). Vietnamese Voice Recognition for Home Automation using
MFCC and DTW Techniques. 2015 International Conference on Advanced Computing and
Applications (ACOMP), 150–156.
7. Tirumala, S.S., Shahamiri, S.R., Garhwal, A.S., Wang, R. Speaker identification features
extraction methods: A systematic review, Expert Systems with Applications, Volume 90, 2017,
Pages 250–271, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2017.08.015.
8. Islam, M.A., et al. A Robust Speaker Identification System Using the Responses from a Model
of the Auditory Periphery. PLoS ONE, 11 (7): e0158520, 8 Jul. 2016, https://doi.org/10.1371/
journal.pone.0158520.
9. Sujiya, S., Chandra, E. (2017). A Review on Speaker Recognition. International journal of
engineering and technology, 9, 1592–1598.
10. Alim,S.A., and Rashid,N.K.A. (December 12th 2018). Some Commonly Used Speech Feature
Extraction Algorithms, From Natural to Artificial Intelligence - Algorithms and Applications,
Ricardo Lopez-Ruiz, IntechOpen, https://doi.org/10.5772/intechopen.80419.
11. Brew, A., and Cunningham, P. (2010). Vector Quantization Mappings for Speaker Verification.
2010 20th International Conference on Pattern Recognition, 560–564.
12. Reynolds, D.A., Quatieri, T.F., and Dunn, R.B., “Speaker verification using adapted Gaussian
mixture models”, Digital Signal Processing, vol.10, no.1–3, pp. 19–41, 2000.
13. Larcher, A., Lee, K.A., Ma, B., Li, H., Text-dependent speaker verification: Classifiers,
databases and RSR2015. Speech Communication, Elsevier: North-Holland, 2014.
14. Sarkar, A.K., and Tan, Z.H. “Text dependent speaker verification using un-supervised HMM-
UBM and temporal GMM-UBM”, in Seventeenth Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp. 425–429, 2016.
15. Zheng, R., Zhang, S., and Xu, B. (2004). Text-independent speaker identification using GMM-
UBM and frame level likelihood normalization. 2004 International Symposium on Chinese
Spoken Language Processing, 289–292.
16. Yin, S., Rose, R., and Kenny, P. “A Joint Factor Analysis Approach to Progressive Model
Adaptation in Text-Independent Speaker Verification”, in IEEE Transactions on Audio,
Speech, and Language Processing, vol. 15, no. 7, pp. 1999–2010, Sept. 2007, https://doi.org/
10.1109/TASL.2007.902410.
17. Campbell, W.M., Sturim, D.E., Reynolds, D.A. and Solomonoff, A. “SVM based speaker
verification using a GMM supervector kernel and NAP variability compensation”, in 2006
IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
(ICASSP), vol. 1, pp. I-I, IEEE, May 2006.
18. Kanagasundaram, A., Vogt, R., and Dean, D., Sridharan, S., Mason, M. (2011). i-vector
Based Speaker Recognition on Short Utterances. Proceedings of the Annual Conference of
the International Speech Communication Association, INTERSPEECH.
19. Li, W., Fu, T., Zhu, J. An improved i-vector extraction algorithm for speaker verification. J
Audio Speech Music Proc. 2015, 18 (2015). https://doi.org/10.1186/s13636-015-0061-x
20. Dey, S., Motlicek, P., Madikeri, S., and Ferras, M., “Template-matching for text-dependent
speaker verification”, Speech Communication, vol.88, pp. 96–105, 2017.
21. Sztahó, D., Szaszák, G., Beke, A. (2019). Deep learning methods in speaker recognition: a
review.
22. Bai, Z., and Zhang, X.-L. (2021). Speaker recognition based on deep learning: An overview.
Neural Networks, 140, 65–99. https://doi.org/10.1016/j.neunet.2021.03.004.
23. Variani, E., Lei, X., McDermott, E., Moreno, I.L., and Gonzalez-Dominguez, J. “Deep neural
networks for small footprint text-dependent speaker verification”, 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056,
https://doi.org/10.1109/ICASSP.2014.6854363.
24. Wan, L., Wang, Q., Papir, A., and Lopez-Moreno, I. (2018). Generalized End-to-End Loss for
Speaker Verification. 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 4879–4883.
25. Chen, D., Mak, B., Leung, C., and Sivadas, S. “Joint acoustic modeling of triphones and tri-
graphemes by multi-task learning deep neural networks for low-resource speech recognition”,
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2014, pp. 5592–5596, https://doi.org/10.1109/ICASSP.2014.6854673.
26. Bai, Z., Zhang, X. (2021). Speaker recognition based on deep learning:An overview. Neural
networks: the official journal of the International Neural Network Society, 140, 65–99.
27. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S. (2018). X-Vectors:
Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 5329–5333.
28. Fang, F., Wang, X., Yamagishi, J., Echizen, I., Todisco, M., Evans, N.,Bonastre, J. (2019).
Speaker Anonymization Using X-vector and Neural Waveform Models. arXiv preprint arXiv:
1905.13561.
29. Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S. (2017). Deep Neural Network
Embeddings for Text-Independent Speaker Verification. INTERSPEECH.
30. Heigold, G., Moreno, I., Bengio, S., Shazeer, N.(2016). End-to-end text-dependent speaker ver-
ification.In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). pp. 5115–5119.
31. Yu, Y., and Li, W. (2020). Densely Connected Time Delay Neural Network for Speaker
Verification. INTERSPEECH.
32. Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). ECAPA-TDNN: Emphasized
Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification.
INTERSPEECH.
33. Kingma, D.P., Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:
1312.6114.
34. Rezende, D.J., Mohamed, S., Wierstra, D. (2014). Stochastic Backpropagation and Approxi-
mate Inference in Deep Generative Models. Proceedings of the 31st International Conference
on Machine Learning,in PMLR 32 (2), pp. 1278–1286.
35. Villalba, J., Brümmer, N., Dehak, N. (2017). Tied Variational Autoencoder Backends for i-
Vector Speaker Recognition. In INTERSPEECH 2017, pp. 1004–1008.
36. Tang, Z., Li, L., Wang, D. (2016). Multi-task recurrent model for speech and speaker
recognition. In 2016 Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA), pp. 1–4.
37. Marchi, E., Shum, S., Hwang, K., Kajarekar,S., Sigtia, S., Richards, H., Haynes, R.,Kim,
Y.,Bridle J. (2018). Generalised discriminative transform via curriculum learning for speaker
recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Process-
ing, ICASSP 2018, pp. 5324–5328.
38. Ranjan, S., Hansen, J.H., Ranjan, S., Hansen, J.H. (2018). Curriculum learning based
approaches for noise robust speaker recognition. IEEE/ACM Transactions on Audio, Speech
and Language Processing (TASLP), 26 (1), pp. 197–210.
39. Zheng, S., Liu, G., Suo, H., Lei, Y. (2019). Auto encoder-based Semi-Supervised Curriculum
Learning For Out-of-domain Speaker Verification. In: INTERSPEECH. 2019. pp. 4360–4364.
40. Lei, Y., Scheffer, N., Ferrer, L., McLaren, M. (2014). A novel scheme for speaker recognition
using a phonetically-aware deep neural network. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). pp. 1695–1699.
41. Nagrani, A., Chung, J.S., Zisserman, A. (2017). Voxceleb: a large-scale speaker identification
dataset. arXiv preprint arXiv:1706.08612.
42. Ravanelli, M., Bengio, Y. (2018). Speaker recognition from raw waveform with sincnet. In:
2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
43. Hajavi, A., Etemad A. (2019). A Deep Neural Network for Short-Segment Speaker Recogni-
tion.In: Proc.Interspeech 2019, pp. 2878–2882.
44. Ravanelli, M., Bengio,Y. (2019). Learning speaker representations with mutual information.
In: Proc.Interspeech 2019, pp. 1153–1157.
45. Salvati, D., Drioli, C., Foresti, G.L. (2019). End-to-End Speaker Identification in Noisy
and Reverberant Environments Using Raw Waveform Convolutional Neural Networks. Proc.
Interspeech 2019, pp. 4335–4339.
46. Zhang, Li, Wu, Jian, Xie, Lei. (2021). NPU Speaker Verification System for INTERSPEECH
2020 Far-Field Speaker Verification Challenge.
47. Fang, X., Gao, T., Zou, L., Ling, Z.: Bidirectional Attention for Text-Dependent Speaker
Verification. Sensors. 20, 6784 (2020).
48. Doddington, G.R., Przybocki, M.A., Martin, A.F., and Reynolds, “The NIST speaker recog-
nition evaluation–overview, methodology, systems, results, perspective”, Speech Communica-
tion, vol.31, no.2–3, pp. 225–254, 2000.
49. Zeinali, H., Sameti, H., and Burget, L. “Text-dependent speaker verification based on i-vectors,
Neural Networks and Hidden Markov Models”, Computer Speech and Language, vol.46, pp.
53–71, 2017.
50. Bimbot, F., Bonastre, J F. Fredouille, C. et al. A Tutorial on Text-Independent Speaker
Verification. EURASIP J.Adv.Signal Process.2004, 101962 (2004). https://doi.org/10.1155/
S1110865704310024.
51. Cheng, J., and Wang, H. (2004). A method of estimating the equal error rate for automatic
speaker verification. 2004 International Symposium on Chinese Spoken Language Processing,
285–288.
Music Composition with Deep Learning:
A Review
Carlos Hernandez-Olivan and José R. Beltrán
1 Introduction
Music is generally defined as a succession of pitches or rhythms, or both, in some
definite patterns [1]. Music composition (or generation) is the process of creating
or writing a new piece of music. The music composition term can also refer to an
original piece or work of music [1]. Music composition requires creativity. Chomsky
defines creativity as “the unique human capacity to understand and produce an
indefinitely large number of sentences in a language, most of which have never been
encountered or spoken before” [9]. On the other hand, Root-Bernstein defines this
concept as “creativity comes from finding the unexpected connections, from making
use of skills, ideas, insights and analogies from disparate fields” [51]. Regarding
music creativity, Gordon declares that there is no clear definition of this concept.
He states that music creativity cannot be taught, but the readiness for one to fulfill one's
potential for music creativity can be, that is, the audiation vocabulary of tonal patterns and the
varied and large rhythmic patterns [22]. This is a very important aspect that needs to
be taken into account when designing or proposing an AI-based music composition
algorithm. More specifically, music composition is an important topic in the music
information retrieval (MIR) field. It comprises subtasks such as melody generation,
multi-track or multi-instrument generation, style transfer, or harmonization. These
aspects will be covered in this chapter from the point of view of the multitude of
techniques that have flourished in recent years based on AI and DL.
C. Hernandez-Olivan () · J. R. Beltrán ()
Department of Electronic Engineering and Communications, University of Zaragoza, Zaragoza,
Spain
e-mail: carloshero@unizar.es; jrbelbla@unizar.es
1.1 From Algorithmic Composition to Deep Learning
Since the 1980s, the interest in computer-based music composition has never stopped
growing. Some experiments came up in the early 1980s, such as the Experiments in
Musical Intelligence (EMI) [12] by David Cope from 1983 to 1989 or Analogiques
A and B by Iannis Xenakis, which follow the author's previous work from 1963
[68]. Later, in the 2000s, David Cope also proposed the combination of Markov
chains with grammars for automatic music composition, and other relevant works
such as Project1 (PR1) by Koenig [2] were born. These techniques can be grouped
in the field of algorithmic music composition which is a way of composing by means
of formalizable methods [46, 58]. This type of composing consists of a controlled
procedure which is based on mathematical instructions that must be followed in
a fixed order. There are several methods inside the algorithmic composition such
as Markov models, generative grammars, cellular automata, genetic algorithms,
transition networks, or chaos theory [28]. Sometimes, these techniques and other
probabilistic methods are combined with deep neural networks (NNs) in order to
condition them or help them to better model music, as is the case of DeepBach
[25]. These models can generate and harmonize melodies in different styles, but
their limited capacity to generalize and the rule-based definitions
that must be written by hand make these methods less powerful and generalizable in
comparison with DL-based models.
From the 1980s to the early 2000s, the first works which tried to model music
with NNs were born [3, 17, 44]. In recent years, with the growth of deep learning
(DL), many studies have tried to model music with deep NNs. DL models for music
generation normally use NN architectures that are proven to perform well in other
fields such as computer vision or natural language processing (NLP). Models
pre-trained in these fields can also be reused for music generation; this
is called transfer learning [74]. Some NN techniques and architectures will be shown
later in this chapter. Music composition today is taking input representations and
NN architectures from large-scale NLP applications, such as transformer-based
models, which are demonstrating very good performance in this task. This is due to
genre has its own rules.
1.2 Neural Network Architectures for Music Composition with
Deep Learning
First of all, we will provide an overview of the most widely used NN architectures
that are providing the best results in the task of musical composition so far. The
most used NN architectures in the music composition task are generative models such
as variational autoencoders (VAEs) or generative adversarial networks (GANs) and
NLP-based models such as long short-term memory (LSTM) or transformers. The
following is an overview of these models.
1.2.1 Variational Autoencoders (VAEs)
The original VAE model [37] uses an encoder-decoder architecture to produce a
latent space by reconstructing the input (see Fig. 1a). A latent space is a multidi-
mensional space of compressed data in which the most similar elements are located
closest to each other. In a VAE, the encoder approximates the posterior, and the
decoder parameterizes the likelihood. The posterior and likelihood approximations
are parametrized by a NN with λ and θ parameters for the encoder and decoder,
respectively. The posterior inference is done by minimizing the Kullback-Leibler
(KL) divergence between the encoder and approximate posterior, and the true
posterior by maximizing the evidence lower bound (ELBO). The gradient is
computed with the so-called reparametrization trick. There are variations of the
original VAE model such as the β-VAE [27], which adds a penalty weight β to the
KL term in order to improve the latent space distribution. In Fig. 1a,
we show the general VAE architecture. An example of a DL model for music
composition based on a VAE is MusicVAE [50] which we describe in further
sections in this chapter.
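As a concrete reference, the sketch below writes the (negative) ELBO as a reconstruction term plus a KL term, together with the reparametrization trick; a Gaussian likelihood (MSE reconstruction) is an assumption here, and setting β > 1 gives the β-VAE variant.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO for a VAE: reconstruction term plus beta-weighted KL between
    the approximate posterior N(mu, diag(exp(logvar))) and the prior N(0, I).
    beta = 1 recovers the original VAE [37]; beta > 1 gives the beta-VAE [27]."""
    recon = F.mse_loss(x_hat, x, reduction="sum")            # Gaussian log-likelihood term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def reparametrize(mu, logvar):
    """The reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```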
1.2.2 Generative Adversarial Networks (GANs)
GANs [21] are generative models composed of two NNs: the generator G and
the discriminator D. The generator learns a distribution pg over the input data.
The training is done in order to let the discriminator maximize the probability of
assigning the correct label to the training samples and the samples generated by
the generator. This training idea can be understood as if D and G follow the two-
player minimax game that Goodfellow et al. [21] described. In Fig. 1b, we show the
general GAN architecture. The generator and the discriminator can be formed by
different NN layers such as multi-layer perceptrons (MLPs) [52], LSTMs [30], or
convolutional neural networks (CNNs) [19, 40].
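The minimax game can be sketched as one alternating training step, shown below in PyTorch; the non-saturating generator loss and the latent dimension are standard practical choices assumed here rather than details from [21].

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=128):
    """One iteration of the two-player minimax game [21]: D maximizes correct
    real/fake labeling; G minimizes it (via the usual non-saturating form).
    G and D are any generator/discriminator modules returning logits."""
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # Discriminator update: label real samples 1, generated samples 0
    opt_d.zero_grad()
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake.detach()), zeros))
    d_loss.backward()
    opt_d.step()

    # Generator update: make D label fakes as real
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```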
1.2.3 Transformers
Transformers [61] are currently being used in NLP applications due to their strong
performance, not only in NLP but also in computer vision models. Transformers can
be used as auto-regressive models, like LSTMs, which allows them to be used in
generative tasks. The basic idea behind transformers is the attention mechanism.
There are several variations of the original attention mechanism proposed by
Vaswani et al. [61] that have been used in music composition [33]. The combination
of the attention layer with feedforward layers forms the encoder
and decoder of the transformer, which differ from those of purely autoencoder models
despite the shared encoder-decoder structure. Transformers are trained on tokens,
which are structured representations of the inputs. In Fig. 1c, we show the general
transformer architecture.
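At the core of the transformer is scaled dot-product attention, sketched below; a causal mask is what makes the model auto-regressive and hence usable for generation.

```python
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """The attention mechanism at the core of the transformer [61]:
    softmax(Q K^T / sqrt(d_k)) V. A causal mask zeroes out future positions,
    which is how transformers are used auto-regressively for generation."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```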
1.3 Challenges in Music Composition with Deep Learning
There are different points of view on the challenges in music
composition with DL that make us ask questions related to the input
representations and DL models that have been used in this field, the output quality
of the current state-of-the-art methods, and the way that researchers have measured
the quality of the generated music. In this chapter, we ask ourselves the following
questions that involve the composition process and output: Are the current DL
models capable of generating music with a certain level of creativity? What is the
best NN architecture to perform music composition with DL? Could end-to-end
methods generate entire structured music pieces? Are the composed pieces with
DL just an imitation of the inputs or can NNs generate new music in styles that
are not present in the training data? Should NNs compose music by following the
same logic and process as humans do? How much data do DL models for music
generation need? Are current evaluation methods good enough to compare and
measure the creativity of the composed music?
To answer these questions, we approach music composition or generation from
the point of view of the process followed to obtain the final composition and
the output of DL models, i.e., the comparison between the human composition
process and the music generation process with DL and the artistic and creative
characteristics presented by the generated music. We also analyze recent state-of-
the-art models of music composition with DL to show the result provided by these
models (motifs, complete compositions, etc.). Another important aspect analyzed
is the input representation that these models use to generate music to understand if
these representations are suitable for composing. This gives us some insights into how
these models could be improved, whether these NN architectures are powerful enough to
compose new music with a certain level of creativity, and the directions and future
work that should be pursued in music composition with DL.
1.4 Chapter Structure
In this chapter, we make an analysis of the symbolic music composition task from
the composition process and the type of generated output perspectives. Therefore,
we do not cover the performance or synthesis tasks. This chapter is structured as
follows. Section 2 introduces a general view of the music composition process and
music basic principles. In Sect. 3, we give an overview of state-of-the-art methods
Fig. 1 (a) VAE [37], (b)
GAN [21], and (c)
transformer general
architecture. Reproduced
from [61]
from the melodic composition perspective. We also examine how these models deal
with the harmony and structure. In Sect. 4, we describe DL models that generate
multi-track or multi-instrument music. In Sect. 5, we show different methods and
metrics that are commonly used to evaluate the output of a music generation model.
In Sect. 6, we describe the open questions in music generation field by analyzing the
models described in the previous sections. Finally, in Sect. 7, we expose future work
and challenges that are still being studied in the music generation with DL field.
2 The Music Composition Process
Much like written language, the music composition process is a complex process
that depends on a large number of decisions [41]. In the music field, this process
[11] depends on the music style we are working with. As an example, it is very
common in Western classical music to start with a small unit of one or two bars
called motif and develop it to compose a melody or music phrase, and in styles like
pop or jazz, it is more common to take a harmonic progression and compose or
improvise a melody ahead of it. In spite of the music style we are composing in,
when a composer starts a piece of music, there is some basic melodic or harmonic
idea behind it. From the Western classical music perspective, this idea (or motif)
is developed by the composer to construct the melody or phrase that generates or
follows a certain harmonic progression, and then these phrases are structured in
sections. The melody can be constructed after the harmonic progression is set, or it
can also be generated in the first place and then be harmonized. How the melody
is constructed and the way it is harmonized are decisions made by the composer.
Each section has its own purpose which means that it can be written in different
tonalities and its phrases usually follow different harmonic progressions than the
other sections. Sometimes, music pieces have a melodic part and an accompaniment
part. The melodic part of a music piece can be played by different instruments whose
frequency range may or may not be similar, and the harmonic part gives the piece
a deep and structured feel. The instruments, which are not necessarily in the same
frequency range, are combined with Instrumentation and Orchestration techniques
(see Sect. 3.2). These elements are crucial in musical composition, and they are
also important keys when defining the style or genre of a piece of music. Music
has two dimensions, the time and the harmony dimensions. The time dimension is
represented by the notes' durations or rhythm, which is the lowest level on this axis.
In this dimension, notes can be grouped or measured in units called bars, which are
ordered groups of notes. The other dimension, harmony, is related to the note values
or pitch. If we think of an image, time dimension would be the horizontal axis and
harmony dimension the vertical axis. Harmony also has a temporal evolution,
but this is not represented in music scores. There is a very common software-based
music representation called piano-roll that follows this logic.
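As a concrete illustration of the piano-roll representation, the sketch below fills a binary pitch-by-time matrix from (pitch, onset, duration) triples; the inputs and resolution are hypothetical.

```python
import numpy as np

def to_piano_roll(notes, n_steps, n_pitches=128):
    """Build a binary piano-roll matrix (pitch x time) from (pitch, onset, duration)
    triples expressed in time steps; the horizontal axis is time and the vertical
    axis is pitch, following the logic described above."""
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    for pitch, onset, duration in notes:
        roll[pitch, onset:onset + duration] = 1
    return roll

# Beginning of a C major arpeggio, 4 steps per note:
# roll = to_piano_roll([(60, 0, 4), (64, 4, 4), (67, 8, 4)], n_steps=16)
```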
Fig. 2 (a) General music composition scheme and (b) an example of the beginning of Beethoven's
fifth symphony with music levels or categories
The music time dimension is structured in low-level units, the notes. Notes
are grouped in bars that form motifs. At the high level of the time dimension, we can find
sections, which are composed of phrases that last eight or more bars (this depends
on the style and composer). The lowest level in the harmony dimension is the note
level. The superposition of notes played by different instruments creates chords. The
sequence of chords is called a harmonic progression or chord progression; progressions are
relevant to the composition, and they also have dependencies in the time dimension.
Having said that, we can think about music as a complex language that
consists of short- and long-term relationships. These relationships extend in two
dimensions, the time dimension which is related to music structure and the harmonic
dimension which is related to the notes or pitches and chords, that is, the harmony.
From the symbolic music generation and analysis points of view, based on the
ideas of Walton [63], some of the basic music principles or elements are (see
Fig. 2):
– Harmony. It is the superposition of notes that form chords that compose a chord
progression. The note level can be considered as the lowest level in harmony, and
the next level that can be considered is the chord level. The highest level is the
progression level which usually belongs to a certain tonality in tonal music.
– Music Form or Structure. It is the highest level that music presents, and it is
related to the time dimension. The smallest part of a music piece is the motif
which is developed in a music phrase, and the combination of music phrases
forms a section. Sections in music are ordered depending on the music style
such as intro-verse-chorus-verse-outro for some pop songs (also represented
as ABCBA) or exposition-development-recapitulation or ABA for sonatas. The
concatenation of sections that can be in different scales and modes gives us the
entire composition.
– Melody and Texture. Texture in music terms refers to the melodic, rhythmic, and
harmonic contents that have to be combined in a composition in order to form
the music piece. Music can be monophonic or polyphonic depending on the number of
notes that are played at the same time step, and homophonic or heterophonic depending
on whether the melody has an accompaniment or not.
– Instrumentation and Orchestration. These are music techniques that take into
account the number of instruments or tracks in a music piece. Whereas instru-
mentation is related to the combination of musical instruments which compose
a music piece, orchestration refers to the assignment of melodies and accompa-
niment to the different instruments that compose a determined music piece. In
recording or software-based music representation, instruments are organized as
tracks. Each track contains the collection of notes played on a single instrument
[18]. Therefore, we can call a piece with more than one instrument as multi-track
which refers to the information that contains two or more tracks where each track
is played by a single instrument. Each track can contain one note or multiple
notes that sound simultaneously, leading to monophonic tracks and polyphonic
tracks, respectively.
Music categories are related to one another. Harmony is related to the structure
because a section is usually played in the same scale and mode. There are cadences
between sections, and there can also be modulations which change the scale of
the piece. Texture and instrumentation are related to timbral features, and their
relationship is based on the fact that not all the instruments can play the same
melodies. An example of that is a melody with lots of ornamentation
elements which cannot be played by certain instrument families (because of
each instrument's technical possibilities or for stylistic reasons).
Another important music attribute is the dynamics, but they are related to the
performance rather than the composition itself, so we will not cover them in this
chapter. In Fig. 2, we show the aspects of the music composition process that we
cover in this chapter, and the relationships between categories and the sections of
the chapter in which each topic is discussed are depicted.
3 Melody Generation
A melody is a sequence of notes with a certain rhythm ordered in an aesthetic way.
Melodies can be monophonic or polyphonic. Monophonic refers to melodies in
which only one note is played at a time step, whereas in polyphonic melodies, there
is more than one note being played at the same time step.
Fig. 3 Scheme of an output-like score of melody generation models
Melody generation is
an important part of music composition, and it has been attempted with algorithmic
composition and with several of the NN architectures that include generative models
such as VAEs or GANs, recurrent neural networks (RNNs) used for auto-regression
tasks such as LSTM, neural autoregressive distribution estimators (NADEs) [38], or
current models used in natural language processing like transformers [61]. In Fig. 3,
we show the scheme, with the basic music principles, of an output-like score of a
melody generation model.
3.1 Deep Learning Models for Melody Generation: From
Motifs to Melodic Phrases
Depending on the music genre of our domain, the human composition process
usually begins with the creation of a motif or a chord progression that is then
expanded to a phrase or melody. When it comes to DL methods for music
generation, several models can generate short note sequences. In 2016, the
very first DL models attempted to generate short melodies with recurrent neural
networks (RNNs) and semantic models such as unit selection [4]. These models
worked for short sequences, so the interest to create entire melodies grew in parallel
to the birth of new NNs. Derived from these first works and with the aim of creating
longer sequences (or melodies), other models that combined NNs with probabilistic
methods came up. An example of this is Google’s Magenta Melody RNN models
[62] released in 2016 and the Anticipation-RNN [24] and DeepBach [25] both
published in 2017. DeepBach is currently considered one of the current state-of-the-
art models for music generation because of its capacity to generate 4-voice chorales
in the style of Bach.
However, these methods cannot generate new melodies with a high level of
creativity from scratch. In order to improve the generation task, generative models
were chosen by researchers to perform music composition. In fact, nowadays, one
of the best-performing models to generate motifs or short melodies from 2 to 16
bars is MusicVAE1 [50] which was published in 2018. MusicVAE is a model for
music generation based on a VAE [37]. With this model, music can be generated by
interpolating in a latent space. This model is trained with approximately 1.5 million
songs from the Lakh MIDI Dataset (LMD)2 [49], and it can generate polyphonic
melodies for up to 3 instruments: melody, bass, and drums. After the creation of
the MusicVAE model, along with the birth of new NN architectures in other fields, the
demand for new DL-based models that can create longer melodies
grew, and this led to the birth of new transformer-based models for music generation.
Examples of these models are the Music Transformer [33] in 2018, and models that
use pre-trained transformers such as MuseNet in 2019 proposed by OpenAI [47]
which uses the GPT-2 to generate music. These transformer-based models, such as
Music Transformer, can generate longer melodies and continue a given sequence,
but after a few bars or seconds, the melody ends up being a bit random, that is, there
are notes and harmonies that do not follow the musical sense of the piece.
In order to overcome this problem and develop models that can generate longer
sequences without losing the sense of the music generated in the previous bars
or the main motifs, new models were born in 2020 and 2021 as combinations of
VAEs, transformers, or other NNs or machine learning algorithms. Some examples
of these models are the TransformerVAE [36] and PianoTree [66]. These models
perform well even in polyphonic music, and they can generate music phrases. One
of the latest released models to generate entire phrases is the model proposed in
2021 by Mittal et al. [42], which is based on denoising diffusion probabilistic models
(DDPMs) [29], which are new generative models that generate high-quality samples
by learning to invert a diffusion process from data to Gaussian noise. This model
uses a MusicVAE 2-bar model and then trains a diffusion model to capture the temporal
relationships among the VAE latents zk, with k = 32 latent variables,
which allows generating 64 bars (2 bars per latent). Although longer polyphonic
melodies can be generated in this way, they do not follow a central motif, so they tend to lose
the sense of a certain direction.
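Interpolation in a latent space, as used by MusicVAE, can be sketched as follows; spherical interpolation (slerp) is a common choice for Gaussian latents, though the specific scheme is an assumption here, and `encode`/`decoder` are hypothetical functions.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors, a common way to
    interpolate in a VAE latent space (as MusicVAE does between melodies).
    Assumes z0 and z1 are not collinear (omega != 0)."""
    cos = np.clip(z0 @ z1 / (np.linalg.norm(z0) * np.linalg.norm(z1)), -1.0, 1.0)
    omega = np.arccos(cos)
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Decode a sequence morphing melody A into melody B:
# steps = [decoder(slerp(encode(a), encode(b), t)) for t in np.linspace(0, 1, 9)]
```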
3.2 Structure Awareness
As we mentioned in Sect. 1, music is a structured language. Once melodies have
been created, they must be grouped into bigger sections (see Fig. 2) which play a
fundamental role in a composition. These sections have different names that vary
depending on the music style such as introduction, chorus, or verse for pop or trap
genres and exposition, development, or recapitulation for classical sonatas.
1 https://magenta.tensorflow.org/music-vae, accessed August 2021.
2 https://colinraffel.com/projects/lmd/, accessed August 2021.
Sections
can also be named with capital letters, and song structures can be expressed as
ABAB, for example. Generating music with structure is one of the most difficult
tasks in music composition with DL because structure implies an aesthetic sense
of rhythm, chord progressions, and melodies that are concatenated with bridges and
cadences [39].
In DL, there have been models that have tried to generate structured music by
imposing the high-level structure with self-similarity constraints. An example of
that is the model proposed by Lattner et al. in 2018 [39], which uses a convolutional
restricted Boltzmann machine (C-RBM) to generate music and a self-similarity
constraint with a self-similarity matrix [45] to impose the structure of the piece as
if it were a template. This method of imposing a structure template is similar to
the composition process that a composer follows when composing music, and the
resulting music pieces followed the imposed structure template. Although new DL
models are trending toward end-to-end designs, and new studies about modeling music with
structure are being released [7], there have not been DL models that are capable of
generating structured music by themselves, that is, without the help of a template or
high-level structure information that is passed to the NN.
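For reference, a self-similarity matrix of the kind used for such structure constraints can be computed directly from per-bar feature vectors, as in the sketch below; `bar_embeddings` is a hypothetical input.

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine self-similarity matrix over a sequence of per-bar feature vectors
    (T, D); repeated sections show up as bright off-diagonal blocks. Such a
    matrix is what template-based structure constraints compare against [45]."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    return unit @ unit.T                          # (T, T), values in [-1, 1]

# ssm = self_similarity_matrix(bar_embeddings)
```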
3.3 Harmony and Melody Conditioning
Within music composition with DL, there is a task that is the harmonization of
a given melody, which differs from the task of creating a polyphonic melody from
scratch. On the one hand, if we analyze the harmony of a melody created from
scratch with a DL model, we see that music generated with DL is not well
structured, as models do not yet compose different sections or write aesthetic cadences
and bridges between the sections in an end-to-end way. In spite of that, the
harmony generated by transformer-based models that compose polyphonic melodies
is coherent in the first bars of the generated pieces [33] because it follows a
certain key. We have to emphasize here that these melodies are written for piano,
which differs from multi-instrument music that presents added challenges such as
generating appropriate melodies or accompaniments for each instrument or deciding
which instruments make up the ensemble (see Sect. 4).
On the other hand, the task of melody harmonization consists of generating the
harmony that accompanies a given melody. The accompaniment can be a chord
accompaniment, regardless of the instrument or the track where the chords are placed, or a
multi-track accompaniment, where the notes in each chord belong to a specific
instrument. The first models for harmonization used HMMs, but these models were
improved by RNNs. Some models predicted chord functions [70], and other models
matched chord accompaniments to a given melody [69]. Regarding the generation
of accompaniment with different tracks, GAN-based
models which implement lead sheet arrangements have been proposed. In 2018, a Multi-Instrument Co-
Arrangement Model called MICA [72] and its improvement, MSMICA, in 2020 [73]
were proposed to generate multi-track accompaniment. There is also a model called
Random documents with unrelated
content Scribd suggests to you:
Erano delle morse, chiamate anche trichechi, e dagli esquimesi:
awak. (Cap. XVII).
— Fate fuoco!... Voi chiacchierate troppo.
— Non sono un baleniere io, mastro Dik. I lupi di mare, si sa già da
lungo tempo, sono assai avari di parole.
— E prendono le balene.
— Ed io, pur chiacchierando, vi mostrerò, mastro Dik, come si
ammazzano i trichechi. —
L'allegro campione di Cambridge balzò giù dall'automobile e sparò
sei colpi uno dietro l'altro, facendo stramazzare altrettanti giganti
polari. Sparava con una sicurezza meravigliosa ed anche con una
calma stupefacente, che strappava grida d'ammirazione al canadese.
Perfino il baleniere pareva estremamente stupito.
Sei morse, colpite tutte alla testa, erano cadute l'una vicina all'altra,
senza però far indietreggiare le altre, le quali anzi parevano decise a
scagliarsi contro il treno e tentare di metterlo a pezzi.
— Mastro Dik, — disse lo studente. — Che vogliano, queste otri
d'olio, provare su di noi la robustezza delle loro zanne?
— Continuate, — rispose l'ex-baleniere. — Poi lancerò il treno a tutta
velocità e passeremo su quei mastodonti che creperanno appunto
come otri.
— E manderete l'automobile a sprofondarsi nella baia, — disse il
canadese, il quale era pure balzato a terra, armato di un fucile.
— Non ci pensate, signore: io rispondo di tutto. —
Walter aveva riempito il serbatoio ed imbracciato nuovamente il
mauser.
— Voi a destra, signor Gastone, ed io a sinistra. Per tutti i fulmini di
Giove!... Dovremo preparare agli orsi bianchi un banchetto
colossale?
— Sparate, — disse Dik. — Io mi preparo a caricare a fondo. —
Fu un vero fuoco di fila che rimbombò ai due lati dell'automobile. Lo
studente ed il canadese gareggiavano in abilità e le morse cadevano
sotto i loro colpi, fulminate da quei colpi meravigliosi che le
toccavano al cuore o al cervello.
Le compagne, rese furiose per le perdite subite, non accennavano
affatto a cedere il campo.
In file compatte, s'avanzavano verso l'automobile, trascinandosi
penosamente, con delle contrazioni furiose, tentando di venire a
contatto.
Ruggivano ferocemente, facendo rintronare la galleria tutta,
provocando perfino delle piccole frane nello strato superiore.
— Salite, — disse ad un tratto l'ex-baleniere. — Giacchè si ostinano,
passeremo egualmente.
Tanto peggio per quelli che rimarranno sotto le nostre ruote.
Tenetevi saldi!... —
Walter ed il signor di Montcalm si erano slanciati sui predellini, poi si
erano messi dietro all'ex-baleniere, ricaricando prontamente i fucili.
— Bestie dannate!... — esclamò lo studente. — Non credevo che
fossero così ostinati i barilotti d'olio. —
L'automobile prese lo slancio e si scagliò innanzi colla violenza d'un
ariete che sfonda la porta d'una fortezza, passando quasi di volo
attraverso quell'ammasso mostruoso di corpi.
Fu una serie di salti spaventosi che mise a dura prova i muscoli dei
tre esploratori, poichè l'automobile passava, insieme al pesante
carrozzone, sopra i disgraziati anfibi, lasciandosi dietro un vero fiume
di sangue e d'olio.
Lo slancio era stato così improvviso e così fulmineo, che i trichechi
non avevano avuto il tempo di tentare nessun attacco, nemmeno
contro le pneumatiche, che passavano sui loro corpacci aprendo dei
solchi sanguinosi.
In tre o quattro secondi il treno attraversò il campo e si precipitò
verso l'uscita della galleria, slanciandosi sui banchi di ghiaccio che si
erano formati lungo le spiaggie della baia d'Hudson.
L'ex-baleniere, che conservava il suo meraviglioso sangue freddo,
virò quasi sul posto, a meno di trecento metri dai giganteschi ice-
bergs che si erano già accumulati in gran numero, trasportati dalla
corrente polare e spinti dai venti di levante, poi risalì di volata la
costa riguadagnando la sconfinata pianura.
— Ventre di Giove!... — esclamò lo studente, il quale pareva che in
quel momento si fosse dimenticati i suoi eterni fulmini. — Questo
marinaio è diventato uno chaffeur prodigioso.
Non potevate trovarne uno migliore, signor Gastone.
È
— È vero, — rispose il canadese. — Colpo d'occhio, mano sicura e
un'audacia straordinaria.
Dik, come si presenta la pianura?
— Buona, signore, almeno per ora, — rispose l'ex-baleniere, sempre
appoggiato al volante.
— Potremo giungere, prima che la notte scenda, al lago di Yath-
kyed?
— Lo spero.
— Allora spingete pure, giacchè il ghiaccio è abbastanza liscio. Non
dimentichiamo che Torpon corre pure verso il Polo.
— Lasciate fare a me, signore, — rispose l'ex-baleniere, con un
risolino un po' sardonico. — Andremo più presto dell'americano.
— Fate attenzione ai corsi d'acqua. Un altro salto potrebbe riuscirci
fatale.
— Aprirò gli occhi, signore. —
The automobile ran on at top speed without shocks or jolts, for the polar plain lay before them as smooth as a racetrack. Only a few hummocks showed here and there, thrown up by the last snowstorm, insignificant obstacles that the ex-whaler avoided with ease.

Flocks of polar birds rose before the train, frightened by the roar of the engine. There were gulls come from nearby Hudson Bay, burgomasters, little plectrophanes nivales eternally chirping, and most graceful auks, birds that live in immense flocks and that the Eskimos take in great numbers with a net much like the one our boys use to catch butterflies.

The small game too started up, fleeing with lightning speed and diving under the hummocks.

Now it was a magnificent marten, of the kind the hunters of the Fur Company call charsa, half a meter long, with a tail of forty centimeters and more and brilliant yellow fur; at other times it was a pair of polar lynxes that sprang out of the snow and made off, hissing angrily and shaking the two odd whitish tufts that adorn their ears.

Walter did not fail, now and then, to fire a shot, but the speed of the automobile did not let him send his bullets to their proper mark.

Two hours before sunset the train, still launched at great speed, reached the North Lined, one of the most beautiful lakes of the high Hudsonian country, always peopled by immense flocks of swans trumpeting from morning to night, the favorite resort of Canadian hunters in the summer season, but at that moment utterly empty, without so much as an Eskimo.

Two rifle shots fired by the student secured a copious supper of excellent, well-fattened meat.

At six o'clock, just as the sun was vanishing into a thick fog that the icy north wind drove furiously across those desolate tracts covered with snow and ice, the automobile came to a halt at the southern end of the Yath-kyed, another lake lost among the high Hudsonian lands.
CHAPTER XVIII.
A Polar Drama.
That night, spent on the shores of the frozen lake, was perfectly calm, and the three explorers were able to rest peacefully on their small but soft bunks, snoring in concert with the stove, which had been left burning after supper.

The next morning an intense fog covered the boundless polar plain. The Canadian, after fixing their position as best he could by the compass and having the two powerful headlamps lit, gave the signal for departure.

It was perhaps imprudent to venture into that dense layer of vapor, which a bitterly cold north wind now tore apart and now thickened. Had Torpon not already pushed so far ahead, the Canadian would have granted a day of rest, for he had set no fixed date for reaching the Pole; but the thought that his rival might arrive before him at the meeting-point of all the meridians and unfurl there the star-spangled banner of the American Union drove him to make haste.

"Keep your eyes wide open, Dik," he said, taking his usual place beside the student. "Do not push the pace too hard, at least until the fog has thinned. If all goes well, we shall cover a fine stretch of road today and leave the Arctic Circle behind us."

And the automobile had flung itself into that chaos of whirling vapors, proceeding at a speed of thirty miles an hour, a speed which, had it lasted a mere ten hours, could have brought the explorers to the southern shores of the vast Gulf of Boothia.
The plain still remained fairly good, although from time to time crevasses appeared, which the ex-whaler avoided with great effort or sometimes cleared almost at a bound, giving the two vehicles frightful jolts.

At noon the train reached the shores of the Chesterfield, a kind of fiord which, branching off from Hudson Bay, runs inland for several dozen leagues.

As its surface was frozen over, the automobile ventured onto it without wasting time in rounding it to the west.

Gigantic icebergs, two and even three hundred meters high, had piled up here and there, at times forming barriers that looked insurmountable.

The intense cold, however, had locked them fast in the pak, so there was no danger of their suddenly losing their balance and crushing the train.

Describing great curves and wide angles, the automobile pressed ever onward, pursued by veritable clouds of birds, which even dared to swoop down upon the explorers with their hoarse, discordant cries.

They were so fearless that, even when fired upon, they returned to the charge after a few minutes.

Walter had even managed to strangle a few of them with his bare hands and had set them aside, meaning to have them cooked for supper, though the Canadian, and the ex-whaler too, had pulled very meaning faces. The flesh of the white bear was indeed far better, more savory and less leathery.
An hour later the automobile was running once more over the northern plains, making for the Wager River, which is not a river at all, but another very long and broad fiord opening opposite Southampton Island.

The ground had improved again, so that Dik, who for the moment seemed to have forgotten the promises made to Mister Torpon, pushed the speed at times up to sixty miles an hour.

Had it not been for the wagon, that devil of a man would not have hesitated to drive her at a hundred, there being no danger of running anybody down; but he could not forget the considerable weight the motor had to haul.

Three more hours of headlong racing through a small hurricane of snow, and the train reached the shores of the Gulf of Boothia, a degree and a half above the Arctic Circle.

"If we go on like this and nothing breaks down, in five or six days at most we shall breakfast at the Pole, my dear Walter," said the Canadian, as the automobile came to a stop.

"We are indeed far along, Signor Gastone. I can tell by the intense cold. How many degrees have we?"

"Thirty-five below."

"Brrr!... And yet one can still hold out fairly well."

"Because the north wind is not blowing."

"Do we stop here?"

"First I want to make sure of the state of the ice."

"Shall we run over the sea?"

"It will be far better, Walter; that way we shall reach Devon Island sooner. Let Dik see to the cooking today, while we go and make a little exploration along the coast."

"Do you hear, master whaler?" cried the student. "I commend my gulls to you."

"Which you will eat all by yourself," replied the chauffeur. "I prefer a fillet of white bear."

"As you please: I hold to my birds."

"Come along, Walter," said the Canadian.
They took their rifles and their long knives, since meeting some white bear was not improbable, and went down the shore, which was broken by inlets and tiny fiords.

The whole gulf, which runs between the land of the same name, all jagged to the west, and Baffin Land to the east, was frozen over. Here and there, however, wide channels opened, along which the icebergs filed past.

An intensely bright light, of a diaphanous whiteness, thrown back by the sky covered with snow-laden clouds, lit the whole of it: it was the ice-blink.

"What an early winter," said the Canadian. "Woe to any whalers who have lingered this year."

"And if we find one shut up in the ice?" asked the student.

"It may happen, Walter. This intense cold, however, is in our favor, for we shall be able to run without any danger through the channels of the Regent and of Lancaster and reach the lands of Lincoln and Ellesmere. Oh!... Look there!... What is that dark mass one can make out down yonder, wedged fast in the pak?"

"Some walrus, perhaps?"

"I think not, Walter. It looks rather like a piece of wreckage."

"Can it be?"

"Let us go and see."

A mass more grayish than blackish, as a walrus or a seal would have been, could be made out amid the ice, some two hundred meters from a little fiord. An animal it could certainly not be, for on seeing the two men advance it would not have been slow to take flight.

"Yes, it must be a piece of wreckage, or at least a boat," said the Canadian.

"A boat, sir," added Walter, whose eyesight was perhaps the keener.
They quickened their pace, advancing over the ice, and found they had not been mistaken.

A small whaleboat lay wedged in the pak, one side already staved in by the first pressures, and in it was a man by now reduced to a skeleton, half wrapped in an old fur coat.

The skull jutted out at one end; the two legs at the other, without the feet, which had come away and lay in the bottom of the boat.

Beside the unfortunate man were a rusty rifle, an axe, and a small cask that must once have held provisions and was now completely empty.

"Who can this man be?" asked the student in a voice full of emotion. "Can he have died of hunger and cold?"

Instead of answering, the Canadian, once the first moment of sorrowful stupor had passed, had climbed into the boat and laid hands on a piece of yellowed paper on which a few lines had been traced in some reddish substance, probably blood.

Many of the words were utterly indecipherable, but two struck the Canadian at once:

"Sarya and Baron de Tolt."
A cry of deep surprise had escaped him.

"A boat of the Sarya here!... How have the polar currents brought it to these parts? Ah!... I remember very well the ill-fated expedition of Baron de Tolt, which stirred not only all Russia but every polar navigator of Europe and America."

"What are you saying, Signor Gastone?" asked Walter, who kept staring, with a mingling of terror and compassion, at that human skull whose empty eye-sockets seemed to stare back at him. "Who can this castaway be, lost on the seas of the Great North?"

"Who knows? Perhaps Baron de Tolt himself, or one of the sailors who followed him."

"There is a dramatic story here which you seem to know."

"True, Walter."

"Who was this baron, then?"

"A bold Russian explorer who in 1900, that is, five years ago, undertook to explore the islands of New Siberia, where he hoped still to find some mammoth living, or at least well preserved in the deep beds of sand.

"Many gigantic tusks, of an ivory far finer than that of elephants, had already been found on those desolate lands by Siberian natives driven up there by storms."

"Tell on, Signor Gastone. Polar dramas interest me, and they have always made a deep impression on me, ever since I read the Admiralty reports on the terrible end of the Erebus and the Terror, which Admiral Franklyn was leading to the Pole and which England mourns still."

"Let us follow the coast, Walter," replied the Canadian. "The sight of this poor wretch distresses me."

"And me no less than you," replied the student.
"I was telling you, then," resumed the Canadian, setting off again, "that that unfortunate man of science wished to visit those islands lying so near the Pole.

"To that end he had fitted out a ship named the Sarya. In the first days of June 1900 the expedition had passed the strait of Kara, running along the Siberian coasts.

"The ice was very bad and hindered the ship continually, threatening at every moment to shut her up in some wake or among the great paks.

"At the end of September the Sarya was held fast on the northern coast of Taimer Island, beyond the mouth of the Jenissik. She wintered there, in the hope that the following summer would loosen the great banks of ice, but only toward the end of August could she move, and after a frightful struggle with the icebergs she steered for the islands of New Siberia.

"In September the Sarya reached Bennett Island, the famous island discovered by the crew of the Yeannette, which perished so tragically, almost to a man, of hunger and cold at the mouth of the Lena, the great Siberian river.

"The ice surrounding that inhospitable island forced Baron de Tolt to look for another refuge, and he found one in a bay of Kotelinoi Island.

"It was by then too late to think of returning. The pak was closing in on the luckless ship from every side, and a second wintering was resolved upon.

"In the spring of 1902 the brave baron set out with sledges and a boat, probably the very one we have just found, in the company of an astronomer and a doctor, determined to reach Bennett Island.

"He had warned the captain of the ship that if after three months they had not returned, he was to go in search of them."
"The expedition, however, seemed dogged by an evil destiny.

"Once more the Sarya, which in the meantime had managed to reach the Siberian coast, taking advantage of the brief summer to restock with provisions and coal, was caught by the ice.

"To reach Bennett Island was impossible, and five months had already gone by.

"A brave man, Lieutenant Kolchak, having first devoted himself to what was all but certain death, set out aboard a small boat and, threading the channels open between the banks of ice, after superhuman efforts succeeded in reaching the island and searched it through its length and breadth.

"At last he found a cairn, that is, a pyramid built of stones; he pulled it down and found within a zinc box holding a letter written by the baron and dated the year before, that is, November 1902, if my memory does not deceive me.

"In those lines the unfortunate explorer said that he was about to set out southward, having provisions for only three weeks.

"He was searched for in vain, and nothing more was ever heard of him or of his companions.

"It was supposed that the ice had opened under the sledges and swallowed them up; whereas we now have proof that he, alone or together with the astronomer or the doctor, embarked in this whaleboat, perhaps in the hope of reaching the coasts of Siberia."
"But how can that boat have come here?" asked the student. "I have never been very strong in geography, yet it seems to me that the islands of New Siberia are a long way off."

"Thousands of miles away, my dear Walter. But you must take into account the currents, which circle round the Pole from west to east.

"Consider, besides, that this boat took a good two years to thread its way through the north-west passage discovered by Mac-Clure and to end up here."

"And what can have become of the two others who are missing?"

"Who can say? Perhaps those poor wretches, gnawed by hunger, devoured each other."

"Ah!..."

"Did not the survivors of the Franklyn expedition do as much? They murdered one another to fill their kettles with human flesh."

"It is horrible!..."

"The Pole, my dear fellow, has had its hundreds upon hundreds, perhaps its thousands, of human victims.

"Come, let us turn back. It is beginning to snow again, and the fog is coming on, rolling down along the gulf. We shall not make a step forward this evening."
After assuring themselves of the strength of the ice, they went back up the bank, burning a few cartridges at the polar birds, and reached the train at the very moment when the snow began falling in broad flakes, drifting silently through the fog, which came on at great speed, wrapping everything in its folds.

The sun had already disappeared and night had fallen, but a cheerful light shone through the windows of the wagon, and appetizing odors rose from the stovepipe.

"My gulls?" asked the student as he entered.

"Ready, sir," replied the ex-whaler, who was bustling about the stove, giving all his attention to an enormous piece of white bear already roasted to a turn.

"To table!..." concluded the Canadian, barring the door and throwing off his heavy fur coat.
All night the snow fell without interruption; but the cold was so intense that it froze almost as it fell.

Nevertheless, the next morning, though with much delay, the three explorers having had to break the ice for a good stretch, the automobile and the wagon set off again, going down onto the gulf, which, as we have said, was frozen as far as the eye could reach, to make for the strait of the Regent.

The weather was as bad as ever, and the cold so intense that one could no longer lean against a piece of metal, or grasp any object of iron, without one's hands coming away truly burned.

The big sealskin gloves had been put on, with little pleasure on Walter's part, who found himself hampered in the use of his rifles.

And game was not scarce on the ice-banks, far from it!... From time to time parties of seals and walruses rose from the channels, rising only to vanish again the moment the train drew near them.

Above all, the blue foxes abounded, with their most precious fur, animals now almost vanished from the neighborhood of Hudson Bay in consequence of the relentless hunting of the Company's men.
"Are those furs really worth so much, Signor Gastone?" asked the Cambridge champion, who followed the beasts with burning eyes, without so much as attempting a shot, for the cunning creatures hid themselves at once in the snow.

"They fetch as much as two thousand five hundred lire apiece," replied the Canadian, "and the price will certainly go up, for they have become very rare.

"At one time they were hunted not only in these territories, in Alaska and the islands of Behring Strait, but in Siberia as well, and even in northern Europe; now, however, they are found no longer.

"They have become exceedingly rare: of the 25,000 foxes killed every year in the district of Beresow, it is much if fifty blue ones are among them.

"In Siberia, out of a hundred foxes not more than three or four are found, whereas they were once far more numerous.

"Only in Greenland are they still in fair numbers, but there too they will not be long in disappearing."

"And what gives them that splendid bluish tint?"

"Some believed it depended chiefly on the change of the seasons; now, however, it is supposed to come from sex and age."
"And are those foxes too, Signor Gastone?" asked the student, who had sprung abruptly to his feet.

"Which?"

"By all the thunderbolts of Jove!... Can the automobile, without our knowing it, have made a dash into the very middle of the Far West? One would say those animals preparing to cut across our road were bisons."

"Stop, Dik!..." cried the Canadian. "We have before us a big herd of musk-oxen!..."
.... they had to break the ice so that the automobile and the wagon could set off again. (Chap. XVIII).
CHAPTER XIX.
The Charge of the Musk-Oxen.
A long file of animals of imposing appearance was advancing across the ice, heading for the coast.

Though musk-oxen have not the monstrous dimensions of the bisons of the American prairies, they are splendid beasts, as big and as tall as common bulls, armed with more strongly arched horns and wilder in aspect, owing also to the thick and magnificent fleece, brown with yellowish glints, which falls like a mantle down to the ground, so that their white and very sturdy hoofs can barely be seen.

These animals, once numerous even in Canada, are now found only on the great polar islands or in the territories north of Hudson Bay.

They seem to have grown perfectly used to the great cold, for they breed fairly well and brave the terrible snowstorms without taking much harm.

Like the bisons they are wanderers, traveling continually in search of the lichens and mosses on which they feed, never having managed, like the horses of Iceland, to accustom themselves to a diet of fish.

Their flesh would be no less excellent than that of common oxen did it not taste of musk, owing perhaps to the nature of their food.
If the Canadian had ordered Dik to stop the automobile at once, he had had his good reasons.

Musk-oxen, unlike the American bisons, which let themselves be slaughtered almost without ever turning on their foes, are as touchy as the buffaloes of India, and if they believe themselves threatened they charge with irresistible fury, heads down, presenting their formidable horns. Woe to him who has the misfortune to stand in their path!... He is tossed into the air and then finished off under their hoofs.

The troop advancing over the gulf, come probably from Baffin Land in search of a better refuge, numbered two dozen animals, all adults, with no young among them.

Probably they were all males, to judge also by the growth of their horns.

"Signor Gastone," said the student, who had already taken up his mauser, "shall we let that splendid game, which I have never tasted, go off in peace without spending half a dozen bullets on it?"

"They are too many, my dear Walter," replied the Canadian. "You do not know the strength those animals possess. Once they have taken their momentum nothing stops them, and they would be capable of doing serious damage to our train. Is it not so, Dik?"

"They are terrible indeed," replied the ex-whaler, who had calmly lighted his pipe.
"Suppose we try, Signor Gastone?" insisted the student. "They are barely four hundred meters off, and we are skilled hunters."

The Canadian, who knew what those great brutes, no less terrible than the white bears, were capable of, hesitated; then at last the hunter's passion won the day.

"Yes," he said. "To come to the Pole and not taste the strong emotions of the chase would be folly.

"Let us try, Walter. After all, they will not toss our train into the air on their horns.

"Dik!... Take a rifle yourself."

"Ready, sir," replied the ex-whaler, who loved strong emotions no less than the others.

While he went toward the wagon to fetch the big hunting carbine, the Canadian and the student had leaped down onto the ice-field, kneeling side by side.

The musk-oxen had already noticed the presence of that monster unknown to them, and had halted their march toward the shore, drawing up in a double line, heads low, as if preparing to charge.

"Can they really be so terrible?" the student asked himself. "Now we shall know."

He glanced behind him. Dik was just coming up, carrying three big hunting carbines, already loaded.

"By Jove!..." exclaimed the student. "May all the thunderbolts fall upon me if this evening I do not taste a piece of polar ox. If it is as musky as a crocodile, so much the worse for the kitchen."
He aimed with extreme care and fired. The Canadian had imitated him at once.

One of the oxen, struck down by one bullet or the other, had dropped at once to its knees, uttering a long bellow.

The others stood for a moment as if stupefied, perhaps a little frightened by those two reports, which they had perhaps never heard before; then, with one lightning impulse, they hurled themselves at a furious gallop upon the train, while their companion rolled heavily onto its side, pouring a torrent of blood from its mouth.

Dik had uttered a cry:

"Into the wagon, gentlemen!..."

The Canadian and the student, before obeying, swiftly emptied the magazines of their mausers in the hope of checking the charge that
Advances In Speech And Music Technology Computational Aspects And Applications Anupam Biswas
Signals and Communication Technology

Series Editors: Emre Celebi, Department of Computer Science, University of Central Arkansas, Conway, AR, USA; Jingdong Chen, Northwestern Polytechnical University, Xi'an, China; E. S. Gopi, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India; Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA; H. Vincent Poor, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA; Antonio Liotta, University of Bolzano, Bolzano, Italy; Mario Di Mauro, University of Salerno, Salerno, Italy.
This series is devoted to fundamentals and applications of modern methods of signal processing and cutting-edge communication technologies. The main topics are information and signal theory, acoustical signal processing, image processing and multimedia systems, mobile and wireless communications, and computer and communication networks. Volumes in the series address researchers in academia and industrial R&D departments. The series is application-oriented. The level of presentation of each individual volume, however, depends on the subject and can range from practical to scientific.

Indexing: All books in "Signals and Communication Technology" are indexed by Scopus and zbMATH.

For general information about this book series, comments or suggestions, please contact Mary James at mary.james@springer.com or Ramesh Nath Premnath at ramesh.premnath@springer.com.
Editors: Anupam Biswas, Department of Computer Science & Engineering, National Institute of Technology Silchar, Cachar, Assam, India; Emile Wennekes, Department of Media and Culture Studies, Utrecht University, Utrecht, The Netherlands; Alicja Wieczorkowska, Multimedia Department, Polish-Japanese Academy of Information Technology, Warsaw, Poland; Rabul Hussain Laskar, Department of Electronics & Communication Engineering, National Institute of Technology Silchar, Cachar, India.

ISSN 1860-4862; ISSN 1860-4870 (electronic). Signals and Communication Technology. ISBN 978-3-031-18443-7; ISBN 978-3-031-18444-4 (eBook). https://doi.org/10.1007/978-3-031-18444-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface

Speech and music are two prominent research areas in the domain of audio signal processing. With recent advancements in speech and music technology, the field has grown tremendously, bringing together interdisciplinary researchers from computer science, musicology, and speech analysis. The language we speak propagates as sound waves through various media and allows communication between, or entertainment for, us humans. Music we hear or create can be perceived in different aspects, such as rhythm, melody, harmony, timbre, or mood. The multifaceted nature of speech and music information requires algorithms and systems using sophisticated signal processing and machine learning techniques to optimally extract useful information. This book provides both profound technological knowledge and a comprehensive treatment of essential and innovative topics in speech and music processing.

Recent computational developments have opened up several avenues to further explore the domains of speech and music. A profound understanding of both speech and music in terms of perception, emotion, mood, gesture, and cognition is at the forefront, and many researchers are working in these domains. In this digital age, overwhelming amounts of data are generated across the world that require efficient processing for better maintenance and retrieval. Machine learning and artificial intelligence are best suited for these computational tasks.

The book comprises four parts. The first part covers the state of the art in computational aspects of speech and music. The second part covers machine learning techniques applied in various music information retrieval tasks. The third part comprises chapters dealing with perception, health, and emotion involving music. The last part includes several case studies.

Audio technology, covering speech, music, and other signals, is a very broad domain. Part I contains five review chapters, presenting the state of the art in selected aspects of speech and music research, namely automatic speaker recognition, music composition based on artificial intelligence, music recommendation systems, and investigations on Indian classical music, which is very different from the Western music that most of us are used to.
Chapter "A Comprehensive Review on Speaker Recognition", written by Banala Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar, offers a comprehensive review of speaker recognition techniques, mainly focusing on text-dependent methods, where predefined text is used in the identification process. The authors review feature extraction techniques often applied as pre-processing, and then present various models that can be trained for speaker identification, with a special section devoted to deep learning. Measures that can be applied to assess speaker recognition quality are also briefly discussed.

Chapter "Music Composition with Deep Learning: A Review", authored by Carlos Hernandez-Olivan and Jose R. Beltran, presents a review of music composition techniques based on deep learning. Artificial intelligence has been applied to music composition since the previous millennium, as briefly reviewed in this chapter. Deep neural networks are now also applied for this purpose, and these techniques are presented here. Next, the authors delve into the details of the music composition process, including musical form and style, melody, harmony, and instrumentation. Evaluation metrics are also provided. Finally, the authors pose and answer interesting questions regarding automatic music composition: how creative it is, which network architectures perform best, how much data is needed for training, and so on. Possible directions for future work in this area conclude the chapter.

Chapters "Music Recommendation Systems: Overview and Challenges" and "Music Recommender Systems: A Review Centered on Biases" describe music recommendation systems. Chapter "Music Recommendation Systems: Overview and Challenges", written by Yesid Ospitia-Medina, Sandra Baldassarri, Cecilia Sanz, and José Ramón Beltrán, offers a general overview of such systems, whereas chapter "Music Recommender Systems: A Review Centered on Biases", by Makarand Velankar and Parag Kulkarni, presents a review focusing on biases. Chapter "Music Recommendation Systems: Overview and Challenges" broadly presents the content-based approach to music recommendation systems, as well as the collaborative approach and the hybrid approach to creating such systems. Context-aware recommendation systems, which result in better recommendations, are also briefly presented. The authors discuss business aspects of music recommendation systems as well. A special section is devoted to user profiling and psychological aspects, as the type of music users want to listen to depends on their mood and emotional state. The chapter concludes with current challenges and trends in music recommendation.
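To give one concrete flavor of the content-based approach mentioned above, here is a minimal sketch; none of it comes from the chapter. The feature dimensions, the three-track catalogue, and the user profile are all invented for illustration, and real systems would use far richer audio descriptors and learned models.

```python
import numpy as np

# Minimal content-based recommendation sketch: each track is described by a
# small audio-feature vector (hypothetically tempo, energy, valence), and
# tracks are ranked by cosine similarity to a user-profile vector.
catalogue = {
    "track_a": np.array([0.8, 0.9, 0.7]),  # hypothetical feature vectors
    "track_b": np.array([0.2, 0.3, 0.9]),
    "track_c": np.array([0.7, 0.8, 0.2]),
}

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(user_profile, k=2):
    # Score every track against the user profile and return the top-k names.
    scores = {name: cosine(user_profile, feats) for name, feats in catalogue.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A user profile could be, e.g., the mean features of recently liked tracks.
print(recommend(np.array([0.75, 0.85, 0.5])))
```

A collaborative system would instead score tracks from other users' listening histories, and a hybrid system would blend both signals.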
Chapter "Music Recommender Systems: A Review Centered on Biases" presents an overview of biases in music recommendation systems. The authors set off by presenting the research questions that underlie work in this area, and whose answers can expose biases. These research questions include the main characteristics of music recommender systems (approaches to the creation of such systems are presented in the chapter) and how new songs are introduced. The authors review the main biases in such systems, and the relationships between the biases and both recommendation strategies and the music datasets used. Biases are classified into three categories, namely pre-existing, technical, and emerging biases, the last detected in use of the system. Works on biases, as well as general works on music recommendation systems, are reviewed here. The authors discuss how biases impact these systems, and also propose guidelines for handling biases in such systems.

Chapter "Computational Approaches for Indian Classical Music: A Comprehensive Review", by Yeshwant Singh and Anupam Biswas, presents a review of research on computational techniques applied to Indian classical music. This traditional music has roots in singing swaras. Nowadays, it is divided into Hindustani music, with ragas (raags), mainly practiced in northern India, and Carnatic music in the southern part of the country. Microtones called shruti are specific to Indian classical music and make it very different from Western music. The authors review papers on tonic identification in classical Indian music, including feature extraction and distribution, and on melody processing, with segmentation, similarity analysis, and melody representation. The automatic recognition of ragas is also covered in this chapter. The authors also describe datasets of classical Indian music and evaluation metrics for research on this music. Before concluding the chapter, the authors present open challenges in this interesting research area.

Machine learning is helpful in understanding and learning from data, identifying patterns, and making decisions with minimal human interaction. This is why machine learning for audio signal processing has recently attracted attention for its applications in both speech and music processing, presented in the five chapters of Part II. Two chapters are focused on speech and multimodal audio signal processing, and three on music, including instruments, raags, shruti, and emotion recognition from music.

Chapter "A Study on Effectiveness of Deep Neural Networks for Speech Signal Enhancement in Comparison with Wiener Filtering Technique" by Vijaya Kumar Padarti, Gnana Sai Polavarapu, Madhurima Madiraju, V. V. Naga Sai Nuthalapati, Vinay Babu Thota, and V.D. Subramanyam Veeravalli explores speech signal enhancement with deep learning and Wiener filtering techniques. The speech signal in general is highly susceptible to various noises. Therefore, speech denoising is essential to produce noise-free speech from noisy recordings, improving the perceived speech quality and increasing its intelligibility. A common approach is to remove high-frequency components from the original signal, but this also removes parts of the original signal, resulting in undesirable quality degradation. In this chapter, Wiener filtering and neural networks are compared as tools for speech signal enhancement. The output signal quality is assessed in terms of signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR). Advanced MATLAB toolboxes such as the Deep Learning toolbox, Audio toolbox, and Signal Processing toolbox are utilized for the analysis.
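Since the chapter's comparison turns on these two metrics, a minimal sketch may help fix ideas. The chapter itself works in MATLAB; the version below is a NumPy re-statement under that caveat, and the sine-plus-noise pair merely stands in for a denoiser's clean reference and processed output.

```python
import numpy as np

def snr_db(clean, processed):
    # SNR in dB: energy of the clean reference over energy of the residual noise.
    noise = clean - processed
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def psnr_db(clean, processed):
    # PSNR in dB: squared peak amplitude of the reference over the mean squared error.
    mse = np.mean((clean - processed) ** 2)
    return 10.0 * np.log10(np.max(np.abs(clean)) ** 2 / mse)

# Toy example: a 440 Hz tone as the "clean" signal and a lightly noisy copy
# standing in for an enhancer's output; a real test would use recorded speech.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
denoised = clean + 0.01 * np.random.randn(t.size)
print(f"SNR:  {snr_db(clean, denoised):.1f} dB")
print(f"PSNR: {psnr_db(clean, denoised):.1f} dB")
```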
Chapter "Video Soundtrack Evaluation with Machine Learning: Data Availability, Feature Extraction, and Classification" by Georgios Touros and Theodoros Giannakopoulos evaluates multimodal signals using machine learning techniques, with a combined analysis of both video and audio data, in order to find satisfactory accompaniment music for video content. The availability of data, feature extraction, and classification are discussed in this chapter. Creating or choosing music that accompanies visual content, i.e. video soundtracks, is an artistic task usually taken up by dedicated professionals, namely a composer and a music supervisor, so as to have the musical content that best accentuates each scene. In this chapter, a method is proposed for collecting and combining relevant data from three modalities: audio, video, and symbolic representations of music, in an end-to-end classification pipeline. A comprehensive multimodal feature library is described, together with a database obtained by applying the proposed method to a small dataset representing movie scenes. Furthermore, a classifier that aims to discriminate between real and fake examples of video soundtracks from movies has been implemented. The chapter also presents potential research directions and possible improvements in the investigated area.

Chapter "Deep Learning Approach to Joint Identification of Instrument Pitch and Raga for Indian Classical Music" by Ashwini Bhat, Karrthik G. K., Vishal Mahesh, and Vijaya Krishna A. explores deep learning approaches for the joint identification of instruments, pitch, and ragas in Indian classical music. The concepts of raag and shruti are fundamental in Indian classical music, so their identification, although difficult, is crucial for the analysis of this very complex music. The chapter offers a comprehensive comparison of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and XGBoost as tools to achieve this goal. Three feature sets were created for each task at hand, three models trained, and then a combined RNN model created, yielding approximately 97% accuracy.

Chapter "Comparison of Convolutional Neural Networks and K-Nearest Neighbors for Music Instrument Recognition" by Dhivya S and Prabu Mohandas analyses convolutional neural networks and k-nearest neighbours (k-NN) for identifying instruments from music. Music instrument recognition is one of the main tasks of music information retrieval, as it can enhance the performance of other tasks like automatic music transcription, music genre identification, and source separation. Identification of instruments from a recording is challenging in the case of polyphonic music, but feasible in the monophonic case. Temporal, spectral, and perceptual features are used for identifying instruments. The chapter compares a convolutional neural network architecture and a k-nearest neighbour classifier for identifying the musical instrument from monophonic music. The mel-spectrogram representation is used to extract features for the neural network model, and mel-frequency cepstral coefficients are the basis for the k-NN classification. The models were trained on the London Philharmonic dataset, consisting of six classes of musical instruments, yielding up to 99% accuracy.
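As a concrete illustration of the second pipeline (mean MFCC vectors fed to a k-NN classifier), here is a minimal sketch using librosa and scikit-learn. The file names and the three-clip training set are hypothetical, and the chapter's actual features, dataset split, and choice of k may well differ.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_vector(path, sr=22050, n_mfcc=13):
    # Load a monophonic clip and summarise it as its mean MFCC vector,
    # a common fixed-length representation for distance-based classifiers.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical clip paths mapped to instrument labels (stand-ins for a real
# training set such as the London Philharmonic samples).
train_files = {"flute_01.wav": "flute",
               "cello_01.wav": "cello",
               "trumpet_01.wav": "trumpet"}

X = np.stack([mfcc_vector(path) for path in train_files])
y = list(train_files.values())

clf = KNeighborsClassifier(n_neighbors=1)  # k=1 only because the toy set is tiny
clf.fit(X, y)
print(clf.predict([mfcc_vector("unknown_clip.wav")]))
```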
Chapter "Emotion Recognition in Music Using Deep Neural Networks", written by Angelos Geroulanos and Theodoros Giannakopoulos, deals with emotion recognition in music using deep learning techniques. Although accessing music content online is easy nowadays, and streaming platforms provide automatic recommendations to users, the suggested list often does not match the current emotional state of the listener; even the classification of emotions poses difficulty, due to the lack of universal definitions. In this chapter, the task of music emotion recognition is investigated using deep neural networks, and adversarial architectures are applied for music data augmentation. Traditional classifiers such as support vector machines, k-NN, random forests, and trees have also been applied, using hand-crafted features representing the audio signals. Mel-scale spectrograms were used as a basis to create inputs to the deep convolutional networks. Six architectures (AlexNet, VGG16bn, Inception v3, DenseNet121, SqueezeNet1.0, and ResNeXt101-32x8d) with an equal number of ImageNet pre-trained models were applied in transfer learning. The classification was evaluated for the recognition of valence, energy, tension, and emotions (anger, fear, happy, sad, and tender).
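The transfer-learning recipe just described can be sketched in a few lines of PyTorch. This is a hedged illustration, not the authors' code: it assumes a ResNet-18 backbone as a stand-in for the six architectures listed (and torchvision 0.13 or later for the weights API), freezes the ImageNet-pre-trained features, and attaches a new five-class head for the five emotions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5  # anger, fear, happy, sad, tender

# Start from an ImageNet-pre-trained backbone and replace the classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                                # freeze features
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new trainable head

# A batch of mel spectrograms (1 x n_mels x frames), repeated to 3 channels
# to match the RGB input the ImageNet backbone expects; values are dummies.
spec = torch.randn(8, 1, 128, 128).repeat(1, 3, 1, 1)
logits = model(spec)
print(logits.shape)  # torch.Size([8, 5])
```

Only the new head is trained here; fine-tuning deeper layers is the usual next step once the head has converged.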
In the era of deep learning, speech and music signal processing offers unprecedented opportunities to transform the healthcare industry. In addition, the quality of the perceived speech and music signals, both for normal-hearing and hard-of-hearing people, is one of the most important requirements of the end users. Music can help deal with stress, anxiety, and various emotions, and can influence activity-related brain plasticity. Part III comprises five chapters that explore the potential use of speech and music technology for our well-being. The first three chapters focus on music processing for the hearing impaired, as well as on music therapy addressed at relieving anxiety in diabetic patients and stress in the era of the pandemic. The fourth chapter sheds light on the plasticity of the brain when learning music, and the fifth chapter is focused on expressing emotions in speech automatically generated from text.

Chapter "Music to Ears in Hearing Impaired: Signal Processing Advancements in Hearing Amplification Devices" by Kavassery Venkateswaran Nisha, Neelamegarajan Devi, and Sampath Sridhar explores music perception in the hearing impaired, using hearing aids and cochlear implants. Hearing aids improve the auditory perception of speech sounds using various signal processing techniques. Music perception, however, is usually not improved, as hearing aids do not compensate for the non-linear response of the human cochlea, a prerequisite for music perception. The limited input dynamic range and higher crest ratio in the analogue-to-digital converters of hearing aids fall short of processing live music. Cochlear implants were developed to improve speech perception rather than music perception, and they have limitations in encoding the fine-structure information in music. The surgically implanted electrode array results in difficulty in perceiving the pitch and higher harmonics of musical sounds. This chapter provides an elaborate discussion of the advancements in signal processing techniques in hearing amplification devices, such as hearing aids and cochlear implants, that can address these drawbacks.

Chapter "Music Therapy: A Best Way to Solve Anxiety and Depression in Diabetes Mellitus Patients" by Anchana P. Belmon and Jeraldin Auxillia evaluates the potential of music therapy as an alternative remedy for anxiety and depression in diabetic patients. There are pharmacological and non-pharmacological treatments available to deal with anxiety and depression. Music therapy, along with relaxation and patient training, is the main non-pharmacological method, and the effect of music on the human body is remarkable. There are two types of music therapy, namely passive and active. In this chapter, the effectiveness of music therapy in 50 diabetic patients has been assessed using the Beck Anxiety Inventory and the Beck Depression Inventory, reporting 0.67 reliability. The anxiety and depression measures were assessed in pre-evaluation, post-evaluation, and follow-up stages. The statistical analysis suggests that music is an effective tool to accelerate the recovery of patients.

Chapter "Music and Stress During Covid-19 Lockdown: Influence of Locus of Control and Coping Styles on Musical Preferences" by Junmoni Borgohain, Rashmi Ranjan Behera, Chirashree Srabani Rath, and Priyadarshi Patnaik explores music as one of the effective strategies to enhance well-being during the lockdown. It analyses the relation between stress during the Covid-19 lockdown and preferences for various types of music as a remedial tool. Music helps to reduce stress, but the ways people deal with stress are influenced by individual traits and musical tastes. The reported study was conducted on 138 Indian participants, representing various age, social, and demographic groups. Several quantitative measures (scaled from 1 to 5), such as the Brief-COPE Inventory, the Perceived Stress Scale, and the Cantril scale, were used for parametric representation of the activities performed by the subjects, and statistical analysis was applied to the data. The study observed several patterns in music preference during the lockdown period. It shows how music can be used as a tool for socio-emotional management during stressful times, and it can be helpful for machine learning experts developing music recommendation systems.

Chapter "Biophysics of Brain Plasticity and Its Correlation to Music Learning", authored by Sandipan Talukdar and Subhendu Ghosh, explores the correlation of brain plasticity with learning music, based on experimental evidence. Brain plasticity is one of the key mechanisms of learning new things, through the growth and reorganization of neural networks in the brain. Human brains can change both structurally and functionally, which is the basis of the brain's remarkable capacity to learn and memorize, or to unlearn and forget. The plasticity of the brain manifests itself at the level of synapses, single neurons, and networks. Music learning involves all of these mechanisms of brain plasticity and requires intensive brain activity in different regions, whether one is simply listening to a musical pattern, performing, or even imagining music. The chapter investigates the possibility of a correlation between the biological changes induced in the brain and the sound wave during music perception and learning. Biophysical mechanisms involved in brain plasticity at the level of synapses and single neurons are considered for experimentation, and the ways in which this plasticity is involved in music learning are discussed in the light of the experimental evidence.

Chapter "Analyzing Emotional Speech and Text: A Special Focus on Bengali Language" by Krishanu Majumder and Dipankar Das deals with the development of a text-to-speech (TTS) system for the Bengali language, incorporating as many naturalistic features as possible using deep learning techniques. The existing multilingual state-of-the-art TTS systems that produce speech for a given text have several limitations: most of them lack naturalness and sound artificial. Also, very limited work has been carried out on regional languages like Bengali, and no standard database is available to carry out the research. This motivated the authors to collect a database in the Bengali language, with different emotions, for developing a TTS engine. TTS systems are generally trained on a single language, and the possibility of training a TTS on multiple languages has also been explored.
chapter explores the possibility of including contextual emotional aspects in the synthesized speech to enhance its quality. Another contribution of this chapter is the development of a bilingual TTS in the Bengali and English languages. The objectives of the chapter have been validated in several experiments.

The concluding part of this volume comprises six chapters addressing an equal number of case studies. They span from research addressing the duplication of audio material to Dutch song structures, and from album covers to measurements of the tabla’s timbre. Musical influence on visual aesthetics as well as a study on emotions in the audio-visual domain complete the picture of this part’s contents.

Chapter “Duplicate Detection for Digital Audio Archive Management: Two Case Studies”, by Joren Six, Federica Bressan, and Koen Renders, presents research aimed at identifying duplicate audio material in large digital music archives. The recent and rapid developments of Music Information Retrieval (MIR) have yet to be exploited by digital music archive management, but there is promising potential for this technology to aid such tasks as duplicate management. This research comprises two case studies to explore the effectiveness of MIR for this task. The first case study is based on a VRT shellac disc archive at the Belgian broadcasting institute. Based on 15,243 digitized discs (out of about 100,000 total), the study attempts to determine the amount of unique versus duplicate audio material. The results show difficulties in discriminating between a near-exact noisy duplicate and a translated version of a song with the same orchestral backing, when based on duplicate detection only. The second case study uses an archive of tapes from the Institute for Psychoacoustic and Electronic Music (IPEM). This study had the benefit of the archive having been digitized twice, first in 2001 and then in 2016. The results showed that in this case, MIR was highly effective at correctly identifying tracks and assigning metadata. This chapter concludes with a deeper dive into the recent Panako system for acoustic fingerprinting (i.e., the technology for identifying the same or similar audio data in a database), to show its virtues.

Yke Paul Schotanus shows in chapter “How a Song’s Section Order Affects Both ‘Refrein’ Perception and the Song’s Perceived Meaning” how digital restructuring of song sections influences the lyrical meaning, as well as our understanding of the song’s structure. When the section order of a song is manipulated, the listeners’ understanding of a song is primarily based on how and where they perceive the chorus (refrein in Dutch), and/or the leading refrain line. A listening experiment was conducted, involving 111 listeners and two songs. Each participant listened to one of three different versions of the same Dutch cabaret song. The experiment showed that section order affects refrain perception and the (semantic) meaning of Western pop songs. Manipulating musical properties such as pitch, timing, phrasing, or section order shows that popular music is more complex than thus far presumed; “the refrain of a song cannot be detected on the basis of strict formal properties”, he concludes.
The objective of chapter “Musical Influence on Visual Aesthetics: An Exploration on Intermediality from Psychological, Semiotic, and Fractal Approach” (by Archi Banerjee, Pinaki Gayen, Shankha Sanyal, Sayan Nag, Junmoni Borgohain, Souparno Roy, Priyadarshi Patnaik, and Dipak Ghosh) is to determine the degree to which music and visuals interact in terms of human psychological perception.
Auditory and visual senses are not isolated aspects of psychological perception; rather, they are entwined in a complex process known as intermediality. Furthermore, the senses are not always equal in their impact on each other in a multimodal experience, as some senses may dominate others. This study attempts to investigate the relationship between auditory and visual senses to discover which is more dominant in influencing the total emotional outcome for a certain audience. To this end, abstract paintings have been used (chosen because of the lack of semantic dominance and the presence of pure, basic visual elements – lines, colours, shapes, orientation, etc.) together with short piano clips of different tempo and complexity. Forty-five non-artist participants are then exposed to various combinations of the music and art – both complementary and contradictory – and asked to rate their response using predefined emotion labels. The results are then analysed using a detrended fluctuation analysis to determine the nature of the association between music and visuals – indifferent, compatible, or incompatible. It is found that music has a more significant influence on the total emotional outcome. This study reveals that intermediality is scientifically quantifiable and merits additional research.

Chapter “Influence of Musical Acoustics on Graphic Design: An Exploration with Indian Classical Music Album Cover Design”, by Pinaki Gayen, Archi Banerjee, Sankha Sanyal, Priyadarshi Patnaik, and Dipak Ghosh, analyses strategies for graphic design for Indian classical music album covers and options to determine new possible design strategies to move beyond status quo conventions. The study is conducted with 30 design students who are asked to draw their own designs upon hearing two types of musical examples – Komal Rishav Asavari and Jaunpuri, which had been rated as “sad music” and “happy music”, respectively, by a previous 70-person experiment. The design students were split into two groups, and each given 1 hour to complete their own designs while listening to the music. The resulting designs were then analysed using semiotic analysis and fractal analysis (detrended fluctuation analysis) to identify patterns of intermediality. This semiotic analysis entailed analysing iconic or symbolic representation (direct association to objects) versus indexical representation (cause and effect relationships). The findings showed that album cover designs fell into three categories: direct mood or emotional representation using symbolic followed by indexical representation; visual imageries derived from indexical followed by iconic representation; and musical feature representation primarily relying on iconic representation. In summary, the study provides new design possibilities for Indian classical music album covers and offers a quantitative approach to establishing effective intermediality towards successful designs.

Shankha Sanyal, Sayan Nag, Archi Banerjee, Souparno Roy, Ranjan Sengupta, and Dipak Ghosh present in chapter “A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction” their study about classifying the emotional cues of sound and visual stimuli solely from their source characteristics.
The study uses as a sample data set a collection of six audio signals of 15 seconds each and six affective pictures, of which three belonged to positive and negative valence, respectively (“excited”, “happy”, “pleased”, etc., versus “sad”, “bored”, “angry”, etc.). Then, using detrended fluctuation analysis
(DFA), the study calculates the long-range temporal correlations (the Hurst exponent) corresponding to the audio signals. The same DFA technique was then applied to the array of pixels corresponding to the affective pictures of contrasting emotions, to obtain a single unique scaling exponent corresponding to each audio signal and three scaling exponents corresponding to the red/green/blue (RGB) components in each of the images. Finally, the detrended cross-correlation analysis (DCCA) technique was used to calculate the degree of nonlinear correlation between the sample audio and visual clips. The results were then confirmed by a follow-up human response study based on emotional Likert scale ratings. The study presents an original algorithm to automatically classify and compare emotional appraisal from cross-modal stimuli based on the amount of long-range temporal correlation between the auditory and visual stimulus.

The closing chapter, “Inharmonic Frequency Analysis of Tabla Strokes in North Indian Classical Music”, by Shambhavi Shivraj Shete and Saurabh Harish Deshmukh, features the tabla. It is one of the essential instruments in North Indian Classical Music (NICM) and is highly unique compared to Western drums for its timbre. The tabla’s timbre is related to inharmonicity (i.e., its overtones departing from the harmonic series of integer multiples of the fundamental), which is due to a complex art involving the application of ink (Syahi) to the tabla drum surface. This study aims to create a set of standard measurements of the tabla’s timbre, as this could be useful for instrument makers and performers. This measurement process is accomplished in two steps. First, a recording session collects 10 samples of a tabla playing the 9 common strokes within NICM, for a total of 90 audio samples. These samples are then processed by a fast Fourier transform function to extract a frequency spectrum and determine the fundamental. The results are then compiled and organized by stroke type, with comments about which overtones are most defining and which aspects of the stroke technique are especially important in affecting those overtone responses.

We cordially thank all the authors for their valuable contributions. We also thank the reviewers for their input and valuable suggestions, and the Utrecht University intern Ethan Borshansky, as well as Mariusz Kleć from the Polish-Japanese Academy of Information Technology, for their editorial assistance. Finally, we thank all the stakeholders who have contributed directly or indirectly to making this book a success.

Cachar, Assam, India    Anupam Biswas
Utrecht, The Netherlands    Emile Wennekes
Warsaw, Poland    Alicja Wieczorkowska
Cachar, India    Rabul Hussain Laskar

December 2022
Contents

Part I State-of-the-Art

A Comprehensive Review on Speaker Recognition .......................... 3
Banala Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar

Music Composition with Deep Learning: A Review ......................... 25
Carlos Hernandez-Olivan and José R. Beltrán

Music Recommendation Systems: Overview and Challenges ............... 51
Makarand Velankar and Parag Kulkarni

Music Recommender Systems: A Review Centered on Biases .............. 71
Yesid Ospitia-Medina, Sandra Baldassarri, Cecilia Sanz, and José Ramón Beltrán

Computational Approaches for Indian Classical Music: A Comprehensive Review .......................................................... 91
Yeshwant Singh and Anupam Biswas

Part II Machine Learning

A Study on Effectiveness of Deep Neural Networks for Speech Signal Enhancement in Comparison with Wiener Filtering Technique .... 121
Vijay Kumar Padarti, Gnana Sai Polavarapu, Madhurima Madiraju, V. V. Naga Sai Nuthalapati, Vinay Babu Thota, and V. D. Subramanyam Veeravalli

Video Soundtrack Evaluation with Machine Learning: Data Availability, Feature Extraction, and Classification .......................... 137
Georgios Touros and Theodoros Giannakopoulos

Deep Learning Approach to Joint Identification of Instrument Pitch and Raga for Indian Classical Music .................................... 159
Ashwini Bhat, Karrthik Gopi Krishnan, Vishal Mahesh, and Vijaya Krishna Ananthapadmanabha
Comparison of Convolutional Neural Networks and K-Nearest Neighbors for Music Instrument Recognition ................................ 175
S. Dhivya and Prabu Mohandas

Emotion Recognition in Music Using Deep Neural Networks ............... 193
Angelos Geroulanos and Theodoros Giannakopoulos

Part III Perception, Health and Emotion

Music to Ears in Hearing Impaired: Signal Processing Advancements in Hearing Amplification Devices ............................ 217
Kavassery Venkateswaran Nisha, Neelamegarajan Devi, and Sampath Sridhar

Music Therapy: A Best Way to Solve Anxiety and Depression in Diabetes Mellitus Patients .................................................... 237
Anchana P. Belmon and Jeraldin Auxillia

Music and Stress During COVID-19 Lockdown: Influence of Locus of Control and Coping Styles on Musical Preferences ............... 249
Junmoni Borgohain, Rashmi Ranjan Behera, Chirashree Srabani Rath, and Priyadarshi Patnaik

Biophysics of Brain Plasticity and Its Correlation to Music Learning ..... 269
Sandipan Talukdar and Subhendu Ghosh

Analyzing Emotional Speech and Text: A Special Focus on Bengali Language ................................................................ 283
Krishanu Majumder and Dipankar Das

Part IV Case Studies

Duplicate Detection for Digital Audio Archive Management: Two Case Studies ................................................................. 311
Joren Six, Federica Bressan, and Koen Renders

How a Song’s Section Order Affects Both ‘Refrein’ Perception and the Song’s Perceived Meaning ............................................. 331
Yke Paul Schotanus

Musical Influence on Visual Aesthetics: An Exploration on Intermediality from Psychological, Semiotic, and Fractal Approach ...... 353
Archi Banerjee, Pinaki Gayen, Shankha Sanyal, Sayan Nag, Junmoni Borgohain, Souparno Roy, Priyadarshi Patnaik, and Dipak Ghosh

Influence of Musical Acoustics on Graphic Design: An Exploration with Indian Classical Music Album Cover Design ............ 379
Pinaki Gayen, Archi Banerjee, Shankha Sanyal, Priyadarshi Patnaik, and Dipak Ghosh
A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction .................. 397
Shankha Sanyal, Archi Banerjee, Sayan Nag, Souparno Roy, Ranjan Sengupta, and Dipak Ghosh

Inharmonic Frequency Analysis of Tabla Strokes in North Indian Classical Music ................................................................... 415
Shambhavi Shivraj Shete and Saurabh Harish Deshmukh

Index ............................................................................... 441
A Comprehensive Review on Speaker Recognition

Banala Saritha, Mohammad Azharuddin Laskar, and Rabul Hussain Laskar

Department of Electronics and Communication Engineering, National Institute of Technology Silchar, Silchar, India
e-mail: rhlaskar@ece.nits.ac.in

1 Introduction

Speech is the universal mode of human communication. In addition to exchanging thoughts and ideas, speech is useful for extracting a lot of other information, like language identity, gender, age, emotion, cognitive behavior, and speaker identity. One of the goals in speech technology is to make human-machine interaction as natural as possible, with systems like intelligent assistants, e.g., Apple Siri, Cortana, and Google Now. Speaker recognition also has a huge scope in this line of products. Every human has a unique speech production system [1]. The unique characteristics of the speech production system help to find a speaker’s identity based on his or her speech signal. The task of recognizing the identity of a person from the speech signal is called speaker recognition. It may be classified into speaker identification and speaker verification. The process of identifying an unknown speaker from a set of known speakers is speaker identification, while authentication of an unknown speaker claiming a person’s identity already registered with the system is called speaker verification [2]. Speaker recognition finds numerous applications across different fields like biometrics, forensics, and access control systems [3]. Further, speaker recognition is classified into text-dependent and text-independent tasks based on whether the test subject is required to use a particular fixed utterance or is free to utter any valid text for recognition purposes [4]. Research on speaker recognition has been carried out since the 1960s [5]. Significant advancements have been made in this field over recent decades, where various aspects like features, modeling techniques, and scoring have been explored. The advances in deep learning and machine learning techniques
have helped to promote speaker recognition and develop renewed interest among researchers in this field. Owing to its ease of use and higher accuracy, text-dependent speaker verification has been one of the focus areas. It plays a vital role in fraud prevention and access control. This chapter presents a comprehensive review of the techniques and methods employed for speaker recognition, with emphasis on text-dependent speaker verification. The chapter’s organization is as follows: Sect. 2 describes the basic structure of a speaker recognition system. Section 3 presents a review of feature extraction techniques with an emphasis on the Mel-frequency cepstral coefficient (MFCC) feature extraction method. Speaker modeling involving classical techniques is discussed in Sect. 4. Advancements in speaker recognition with deep learning are discussed in Sect. 5, which also describes the performance metrics for speaker recognition. The last section concludes the chapter.

2 Basic Overview of a Speaker Recognition System

Figure 1 represents the basic block diagram of the speaker verification system.

Fig. 1 Basic block diagram of a speaker verification system

The design of the speaker verification system mainly consists of two modules, namely, the frontend and the backend. The frontend takes the speech as an input signal and extracts the features. Generally, features are a more convenient representation of a given speech signal; they take the form of a set of vectors and are termed acoustic features. The acoustic features are fed to the backend, which consists of a pre-designed speaker model along with classification and decision-making modules. The model based on these features is then compared with the registered speakers’ models to determine the match between speakers.

In text-dependent speaker verification (TDSV), the systems need to model both the text and the speaker characteristics. Figure 2 presents the block diagram representation of a text-dependent speaker verification system.

Fig. 2 Text-dependent speaker verification system block diagram representation

The speech signal input is first pre-processed using a pre-emphasis filter followed by windowing and voice activity detection. Feature extraction is carried out using the voiced speech frames. These features are further modeled using techniques like the Gaussian mixture model (GMM), identity vector (i-vector), or neural network and
used for enrollment and verification. In the enrollment phase, a speech utterance of adequate duration is taken and subjected to the said feature extraction and modeling modules to obtain the speaker model. Generally, background or development set data is also required in conjunction with the enrollment data for training. During verification, the test utterance undergoes similar transformations and is then compared with the model corresponding to the claimed speaker identity. The comparison results in a score that helps to decide whether to accept or reject the claim [6].

3 Review on Feature Extraction

Every speaker has a unique speech production system. The process of capturing vocal characteristics is called feature extraction. Features can be classified into two types, namely, behavior-based (learned) features and physiological features. Behavior-based features include prosodic, spectro-temporal, and high-level features. Rhythm, energy, duration, pitch, and temporal features constitute the prosodic and spectro-temporal features. Phones, accents, idiolect, semantics, and pronunciation are the high-level features [7]. Figure 3 shows a classification of feature characteristics. The physiological features are representative of the vocal tract length, dimension, and vocal fold size. Short-term spectral feature representations are commonly used to characterize these speaker-specific attributes. Some of the commonly used spectral features include Mel-frequency cepstral coefficients (MFCCs), the gammatone feature, gammatone frequency cepstral coefficients (GFCCs), relative spectral-perceptual linear prediction (RASTA-PLP), the Hilbert envelope of the gammatone filter bank, and mean Hilbert envelope coefficients (MHECs) [8]. Of these features, MFCCs are the most widely used spectral features in state-of-the-art speaker identification and verification systems.
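As a rough illustration of the frontend/backend split described above, the enrollment and verification flow can be sketched as follows. The feature extractor and scoring here are deliberately naive placeholders (the real features and models are the subject of Sects. 3-5), and all function names and the threshold are our illustrative assumptions, not the chapter's.

```python
# A minimal sketch of the enrollment/verification pipeline of Figs. 1 and 2.
import numpy as np

def extract_features(signal: np.ndarray, sr: int) -> np.ndarray:
    """Placeholder frontend: returns a (num_frames, dim) matrix of acoustic
    feature vectors. A log-magnitude spectrum stands in for the 39-dim
    MFCC+delta features of Sect. 3.1. Assumes signal is longer than a frame."""
    frame_len, hop = int(0.020 * sr), int(0.010 * sr)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.log(np.abs(np.fft.rfft(f))[:39] + 1e-8) for f in frames])

def enroll(signal: np.ndarray, sr: int) -> np.ndarray:
    """Backend enrollment: collapse frame-level features into a speaker model
    (here simply the mean vector; real systems use GMMs, i-vectors, etc.)."""
    return extract_features(signal, sr).mean(axis=0)

def verify(signal: np.ndarray, sr: int, claimed_model: np.ndarray,
           threshold: float = 0.9):
    """Score a test utterance against the claimed speaker's model."""
    test = extract_features(signal, sr).mean(axis=0)
    score = np.dot(test, claimed_model) / (
        np.linalg.norm(test) * np.linalg.norm(claimed_model))
    return score, bool(score >= threshold)  # accept if the score clears the threshold
```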
Fig. 3 Classification of feature characteristics [7]

3.1 MFCC’s Extraction Method

MFCCs are based on human auditory perception, which is nonlinear: the Mel scale that models it is roughly linear at low frequencies and logarithmic at higher frequencies [9]. For a given frequency f in Hz, the corresponding Mel scale frequency can be determined by the following formula:

mel(f) = 2595 · log10(1 + f/700)   (1)

Figure 4 represents the commonly followed feature extraction process.

Fig. 4 Process of extracting feature vectors

Speech signals are non-stationary in nature, i.e., the spectral content of the signals varies with time. Hence, in order to process a speech signal, it is divided into short (overlapping) temporal segments called frames, 10–30 ms long, as the speech signal is quasi-stationary over a very short time frame, and the short-time Fourier transform can be applied to analyze it. Further, to reduce the artifacts due to sudden signal truncation at the boundaries of the frames, windowing is done. Generally, the Hamming window (2) is applied to all frames to obtain smooth boundaries and to minimize spectral distortion:

w(n) = 0.54 − 0.46 cos(2πn/(M − 1)), 0 ≤ n ≤ M − 1   (2)

where M is the number of samples in the frame. The short segments can be assumed to be stationary and are used for short-term frequency analysis. To obtain MFCCs, the windowed frames are subjected to the fast Fourier transform (FFT) followed by the Mel filterbank, as shown in Fig. 5.

Fig. 5 MFCC feature extraction process

The Mel spectrum is then represented on a log scale before performing the discrete cosine transform (DCT) to obtain the MFCC features. Usually, the first 13 coefficients, C0, C1, . . . , C12, are considered. The coefficients are then normalized
using techniques like cepstral mean subtraction (CMS), relative spectral filtering (RASTA), and feature warping. Once the features are normalized, the difference between the C0 coefficient of a frame and that of its subsequent frame is calculated to obtain the delta parameter d0. Similarly, d1, d2, . . . , dn are obtained from the C1, C2, . . . , Cn coefficients, respectively, as shown in Fig. 6. These are known as delta features. In the same way, acceleration or double-delta features are obtained by taking the difference of the delta features [10]. The 13 MFCCs, the 13 delta features, and the 13 double-delta features are concatenated to obtain a 39-dimensional feature vector for every frame.

Fig. 6 Delta and acceleration features
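As an illustration, the 39-dimensional front end described above can be sketched with the librosa library (our choice of toolkit, not the chapter's); the framing parameters follow the 20-30 ms windows discussed earlier, and the input file path is an assumption.

```python
# A sketch of the 39-dimensional MFCC front end: 13 MFCCs plus delta and
# double-delta features, with cepstral mean subtraction on the statics.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # any 16 kHz speech recording

# 13 MFCCs per 25 ms frame with a 10 ms hop, Hamming-windowed as in Eq. (2).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),
                            hop_length=int(0.010 * sr),
                            window="hamming")

# Cepstral mean subtraction (CMS) on the static coefficients.
mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)

delta = librosa.feature.delta(mfcc)            # first-order (velocity) features
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)

features = np.vstack([mfcc, delta, delta2])    # shape: (39, num_frames)
```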
4 Speaker Modeling

Once the acoustic features are extracted, speaker models are trained on them. The traditional speaker modeling techniques are categorized into two types: template models and stochastic models. Vector quantization (VQ) and dynamic time warping (DTW) approaches are the most popular template-based modeling techniques [11]. For text-dependent speaker verification, DTW template matching is a widely used technique. The acoustic feature sequence is obtained from the enrollment utterance and stored as a speaker-phrase template model. During the testing phase, the feature sequence corresponding to the test utterance is compared with the speaker-phrase template model using the DTW algorithm. DTW helps to time-align the two sequences and gives a similarity score, which is then used for the decision-making process. Another popular system for text-dependent speaker verification is realized using the VQ technique. In this method, the feature vectors populate a vector space that is mapped to a finite number of regions. These regions are formed as clusters and represented by centroids. The set of centroids that represents the entire vector space is known as a codebook. Hence, the
speaker-phrase models are prepared in terms of codebooks. This technique allows more flexibility in modeling the speaker-phrase model.

The stochastic models make use of probability theory. The most popular stochastic models are the Gaussian mixture model-universal background model (GMM-UBM) and the hidden Markov model (HMM). GMM-UBM is also a commonly used method for text-dependent speaker verification [12]. A UBM is built to represent a world model. Using the maximum a posteriori (MAP) adaptation technique, a speaker-phrase-specific GMM is built from the UBM using class-specific data [13]. The log-likelihood ratio is used to make the decision of whether to accept or reject the speaker-phrase subject. A number of HMM-based methods have also been proposed for text-dependent speaker verification. Such models are good at modeling the temporal information in the utterance and help to provide improved results. An unsupervised HMM-UBM and a temporal GMM-UBM have also been proposed for TDSV [14]. In the case of the HMM-UBM-based method, a speaker-specific HMM is built through MAP adaptation from a speaker-independent HMM-UBM trained in an unsupervised manner without using any transcription. In the temporal GMM-UBM approach, however, the temporal information is incorporated by computing the transition probability among the GMM-UBM mixture components using the speaker-specific training data. The HMM-UBM-based method is found to outperform the other systems by virtue of more effective modeling of the temporal information. The hierarchical multi-layer acoustic model (HiLAM) is an example of an HMM-based speaker-phrase model. It is a hierarchical acoustic model that adapts a text-independent, speaker-dependent GMM from the UBM and then adapts the different HMM states from the mid-level model.

4.1 Gaussian Mixture Model-Universal Background Model

The GMM-UBM model, as shown in Fig. 7, is a popular method for text-dependent speaker verification.

Fig. 7 A basic system of the GMM-UBM model

When all features are populated in a large-dimensional space, clusters are formed. Each cluster can be represented by a Gaussian distribution specified by mean and variance parameters. The overall data may be represented by a mixture of such Gaussian distributions, known as a GMM, which is defined by a set of mean, variance, and weight parameters. The universal background model is a general speaker-independent model. It is used to obtain speaker-dependent GMMs with adaptation of the mean, variance, and weight using target-specific data [15]. A training set is required to build up the UBM model, alongside a target set for which we are designing the system and a test set to evaluate the performance of the system. A representation of the GMM-based speaker model is given in Fig. 8.

Fig. 8 Gaussian mixture model for a speaker

For a speaker model X1, each mixture ranging from 1 to 1024 has a scalar weight, a 39×1 mean vector, and a 39×1 covariance vector. Similarly, for developing a system for speaker models 2 to M, we have to store 1024 weight vectors and 39×1024 mean and covariance matrices.
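As a rough illustration of the UBM training and MAP adaptation just described, the following sketch uses scikit-learn's GaussianMixture (our choice of toolkit; production systems typically rely on dedicated tools such as Kaldi). The component count and relevance factor are illustrative assumptions.

```python
# A minimal sketch of GMM-UBM enrollment via relevance MAP adaptation of
# the mixture means, following the scheme described above.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats: np.ndarray, n_components: int = 64):
    """Fit the UBM on pooled frame-level features (N, 39) from many
    background speakers."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    ubm.fit(background_feats)
    return ubm

def map_adapt_means(ubm, speaker_feats: np.ndarray, relevance: float = 16.0):
    """Adapt only the UBM means toward the target speaker's data."""
    post = ubm.predict_proba(speaker_feats)        # (N, K) responsibilities
    n_k = post.sum(axis=0)                         # zero-order (soft count) stats
    f_k = post.T @ speaker_feats                   # first-order stats, (K, 39)
    e_k = f_k / np.maximum(n_k[:, None], 1e-10)    # expected features per mixture
    alpha = (n_k / (n_k + relevance))[:, None]     # data-dependent adaptation
    return alpha * e_k + (1.0 - alpha) * ubm.means_  # adapted means
```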
When a speaker claims the identity of a registered speaker, the speaker verification system first extracts the features and compares them with the speaker model (GMM), determines the level of the match based on the log-likelihood ratio against a predefined threshold, and makes a decision whether to accept or reject the claimed speaker [16]. This process is shown in Fig. 9.

Fig. 9 Representation of the universal background model
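Continuing the MAP-adaptation sketch above, the accept/reject decision based on the log-likelihood ratio can be written as follows; the decision threshold is an illustrative assumption.

```python
# A sketch of GMM-UBM scoring: the average per-frame log-likelihood of the
# MAP-adapted speaker model minus that of the UBM.
import copy
import numpy as np

def llr_score(ubm, adapted_means: np.ndarray, test_feats: np.ndarray) -> float:
    spk = copy.deepcopy(ubm)
    spk.means_ = adapted_means   # speaker model = UBM with adapted means
    return float(np.mean(spk.score_samples(test_feats)
                         - ubm.score_samples(test_feats)))

def decide(ubm, adapted_means, test_feats, threshold: float = 0.0) -> bool:
    return llr_score(ubm, adapted_means, test_feats) >= threshold
```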
The problem with GMM-UBM is that a large number of vectors and matrices needs to be stored. A single-vector concept, called a supervector S, is introduced to represent a speaker and overcome the abovementioned difficulty.

4.2 Supervector

The supervector is formed by concatenating all the individual mixture means, scaled by the corresponding weights and covariances, resulting in a 39×1024-dimensional feature vector, as shown in Fig. 10.

Fig. 10 The process of supervector formation

Each GMM (speaker model) can be represented by a supervector. Speech of different durations can be represented by supervectors of fixed size. Model normalization is typically carried out on a supervector when the speaker model is built. Supervectors are further modeled using various available techniques like nuisance attribute projection (NAP), joint factor analysis (JFA), the i-vector, within-class covariance normalization (WCCN), linear discriminant analysis, and probabilistic linear discriminant analysis (PLDA) [17]. Joint factor analysis is one of the popular techniques used to model GMM supervectors. The supervector is assumed to embed speaker information, channel information, some residual component, and speaker-independent components [18]. Accordingly, the supervector S is decomposed into different components, given by

S = m + Vy + Ux + Dz   (3)
where m is the speaker-independent component, Vy represents the speaker information, Ux represents the channel information, and Dz represents the speaker-dependent residual component.

4.3 i-vector

In this technique, factor analysis is performed considering a common subspace for channel and speaker information, as research indicated that relevant speaker information is also present in the channel factor obtained in JFA [19]. A lower-dimensional identity vector (w) is used to represent the speaker model. The model may be represented as follows:

S = m + Tw   (4)

where S is the GMM supervector, m is the UBM supervector mean, T represents the total variability matrix, and w indicates the standard normally distributed vector. The posterior of w given the utterance data X may be represented as p(w|X) = N(φ, L⁻¹), where w captures all the speaker information and follows a normal distribution with mean φ and covariance L⁻¹.
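As a condensed sketch of the i-vector computation in Eq. (4): given an already-trained total variability matrix T and the zero- and first-order statistics of an utterance (computed from UBM posteriors as in the MAP-adaptation sketch above), the i-vector is the posterior mean φ. Training T itself (via EM) is omitted here, and all shapes are illustrative assumptions.

```python
# i-vector extraction for a UBM with K diagonal-covariance components of
# dimension D and a total variability matrix T of shape (K*D, ivec_dim).
import numpy as np

def extract_ivector(T, ubm_means, ubm_vars, n_k, f_k, ivec_dim):
    """n_k: (K,) zero-order stats; f_k: (K, D) first-order stats;
    ubm_means, ubm_vars: (K, D) UBM means and diagonal variances."""
    K, D = ubm_means.shape
    # Centered first-order statistics, whitened by the UBM covariances.
    f_centered = (f_k - n_k[:, None] * ubm_means) / ubm_vars
    # Posterior precision: L = I + sum_k n_k * T_k^T Sigma_k^-1 T_k
    L = np.eye(ivec_dim)
    for k in range(K):
        T_k = T[k * D:(k + 1) * D, :]                  # (D, ivec_dim) sub-block
        L += n_k[k] * T_k.T @ (T_k / ubm_vars[k][:, None])
    # Posterior mean phi = L^-1 T^T Sigma^-1 (f - N m); this is the i-vector.
    rhs = T.T @ f_centered.reshape(-1)
    return np.linalg.solve(L, rhs)
```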
4.4 Trends in Speaker Recognition

Table 1 attempts to report the trends and progress in the field of speaker recognition. Techniques like NAP and WCCN have been used to normalize channel effects and session variability. Also, PLDA has been a popular backend model for many systems.

Table 1 Trends in speaker recognition with adopted techniques (S. no. / Year / Techniques adopted)

1. 2005: Supervector + NAP + support vector machine (SVM) scoring
2. 2007: Supervector + JFA + y (reduced-dimension vector) + WCCN + SVM scoring
3. 2007: i-vector + WCCN + LDA (reduced-dimension i-vector) + cosine distance scoring
4. 2009: i-vector + PLDA (divides into channel and speaker space / channel compensation) + cosine distance scoring
5. 2010: i-vector replaced with DNN + cosine distance/PLDA for decision-making

5 Deep Learning Methods for Speaker Recognition

The significant advancements in deep learning and machine learning techniques have developed renewed interest among researchers in speaker recognition. The different deep learning architectures have received impetus from the availability of increased data and high computational power and have resulted in state-of-the-art systems. The DTW framework has also been implemented using deep neural network (DNN) posteriors extracted from the DNN-HMM automatic speech recognition (ASR) model [20]. The system leverages the discriminative power of the DNN-based model and is able to achieve enhanced performance. The deep neural network framework has two approaches in speaker recognition. The leading approach is feature extraction with deep learning methods. The other approach is classification and decision-making using deep learning methods [21]. In the first approach, Mel-frequency cepstral coefficients or spectra are taken as inputs and used to train a DNN with speaker IDs as the target variable. Speaker feature embeddings are then obtained from the last layers of the trained DNN. In the second approach, cosine distance and probabilistic linear discriminant analysis are replaced, and a deep network is used for classification and decision-making.

2014 d-vector

In the d-vector framework shown in Fig. 11, stacked filter bank energies are used as input instead of MFCCs to train a DNN in a supervised way [22]. The averaged activation outputs from the last hidden layer of the trained network are used as the d-vector [23]. The 13-dimensional perceptual linear prediction coefficients (PLP) and delta and double-delta features were used in the training phase. The “OK Google” database was used for experimentation [24].

Fig. 11 d-vector framework in Variani et al. [23]

2015 j-vector

The multi-task learning approach shown in Fig. 12 extends the “d-vector” concept and leads to the j-vector framework [25]. The network is trained to discriminate both speakers and text at the same time. Like d-vectors, once the supervised training is finished, the output layer is discarded. Then a joint feature vector called the j-vector is obtained from the last layer [26].

Fig. 12 Multi-task DNN in Chen et al. [25]
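A minimal PyTorch sketch of the d-vector idea follows; the layer sizes and context window are illustrative assumptions and do not reproduce Variani et al.'s exact configuration.

```python
# d-vector extraction: a frame-level DNN trained with a speaker-ID softmax;
# at test time the softmax layer is dropped and the last hidden layer's
# activations are averaged over the frames of an utterance.
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    def __init__(self, in_dim=40 * 21, hidden=256, n_speakers=1000):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_speakers)  # used only in training

    def forward(self, frames):                  # frames: (num_frames, in_dim)
        return self.classifier(self.hidden_layers(frames))

    def d_vector(self, frames):
        with torch.no_grad():
            h = self.hidden_layers(frames)      # last hidden layer outputs
        return h.mean(dim=0)                    # average over frames

net = DVectorNet()
utterance = torch.randn(120, 40 * 21)   # 120 frames of stacked filterbanks
embedding = net.d_vector(utterance)     # the d-vector for this utterance
```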
2018–2019 x-vector

The authors (D. Snyder, D. Garcia-Romero) have proposed DNN embeddings in their work, called “x-vectors,” to replace i-vectors for text-independent speaker verification [27, 28]. The main idea is to take variable-length audio and obtain a single vector representation for the entire audio. This single vector is capable of capturing speaker-discriminating features. The architecture shown in Fig. 13 uses time delay neural network (TDNN) layers with a statistics pooling layer. The TDNN operates at the frame level, as shown in Fig. 14, and works better for smaller amounts of data.

Fig. 13 x-vector DNN embedding architecture [26]

Fig. 14 TDNN framework [28]

Speech input (utterances) is fed to the TDNN layers in frames (x1, x2, . . . , xT), and a sequence of frame-level features is generated. The statistics pooling layer determines the mean and standard deviation of this sequence of vectors. The concatenation of these two vectors is passed on as input to the next layer, which operates at the segment level. In the end, the softmax layer predicts the probability of each speaker for a particular utterance. Additionally, data augmentation is done, where noise and reverberation are added to the original data at different SNRs [29]. This makes the system more robust and improves accuracy compared to the i- and d-vector, particularly for short-duration utterances.
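The frame-level TDNN, statistics pooling, and segment-level embedding can be sketched as follows; this is a loose PyTorch approximation of the x-vector architecture, and the layer sizes are illustrative assumptions.

```python
# x-vector sketch: TDNN layers (Conv1d with dilation) over frames, then
# statistics pooling (mean and standard deviation over time), then
# segment-level layers producing the embedding and speaker posteriors.
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=24, n_speakers=1000, emb_dim=512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU())
        self.segment = nn.Linear(2 * 1500, emb_dim)   # after stats pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                   # x: (batch, feat_dim, num_frames)
        h = self.frame_layers(x)            # frame-level features
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # pooling
        xvec = self.segment(stats)          # the x-vector embedding
        return self.classifier(xvec), xvec

net = XVectorNet()
logits, xvector = net(torch.randn(8, 24, 300))  # 8 utterances, 300 frames each
```

Because the pooling collapses the time axis, utterances of any duration map to a fixed-size embedding, which is the property emphasized above.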
2018–2019 End-to-End System

In the paper “End-to-end text-dependent speaker verification” by Heigold et al. [30], Google introduced DNN embedding-based speaker verification, which is one of the state-of-the-art systems. In this architecture, long short-term memory (LSTM) is used to process “OK Google”-style utterances. It gives speaker representations which are highly discriminative feature vectors. There are two inputs, namely, enrollment and evaluation utterances, applied to the network. As shown in Fig. 15, the network aggregates N vectors corresponding to N enrollment utterances to obtain a representation of the enrolled speaker.

Fig. 15 End-to-end architecture used in [29]

When the same speaker makes a claim during the verification stage, the system compares the generated vector with the previously stored vector from the enrollment data, using cosine similarity to measure the difference between the vectors. If the similarity is greater than a pre-determined threshold, the claimed speaker is accepted; otherwise, the claim is rejected. The performance of this end-to-end architecture on the “OK Google” database is similar to that of the “d-vector” technique.
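The enrollment-averaging and cosine-similarity decision just described can be sketched as follows; the embedding network is assumed given (e.g., the LSTM or DNN above), and the threshold is an illustrative assumption.

```python
# End-to-end verification step: average N enrollment embeddings into a
# speaker model and compare the evaluation embedding by cosine similarity.
import torch
import torch.nn.functional as F

def enroll_speaker(enroll_embeddings: torch.Tensor) -> torch.Tensor:
    """enroll_embeddings: (N, emb_dim) from the embedding network."""
    return enroll_embeddings.mean(dim=0)

def verify(eval_embedding: torch.Tensor, speaker_model: torch.Tensor,
           threshold: float = 0.7):
    score = F.cosine_similarity(eval_embedding, speaker_model, dim=0).item()
    return score, score >= threshold   # accept if above the threshold

speaker = enroll_speaker(torch.randn(5, 256))   # 5 enrollment utterances
score, accepted = verify(torch.randn(256), speaker)
```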
2019–2020 Advancements in TDNN System

A number of variants have been introduced to improve the performance of the TDNN. They are the factorized TDNN (F-TDNN), crossed TDNN (C-TDNN), densely connected TDNN (D-TDNN), extended TDNN (E-TDNN) [31], and emphasized channel attention, propagation and aggregation TDNN (ECAPA-TDNN). The current state of the art in TDNN-based speaker recognition is the ECAPA-TDNN, represented in Fig. 16. This architecture was proposed by Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck [32]. The complete architecture of the squeeze-excitation (SE)-based Res2Net block (SE-Res2Block) of the ECAPA-TDNN is given in Fig. 17.
Fig. 16 Network topology of the ECAPA-TDNN [31]

Fig. 17 The SE-Res2Block of the ECAPA-TDNN [31] architecture
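As a sketch of the squeeze-excitation operation at the heart of the SE-Res2Block in Fig. 17: a temporal average pooling ("squeeze"), a small bottleneck network ("excitation"), and per-channel rescaling. The bottleneck size is an illustrative assumption, and the full ECAPA-TDNN additionally uses Res2 dilated convolutions, attentive statistics pooling, and an AAM-softmax output head.

```python
# Squeeze-and-excitation over 1-D (time-series) feature maps.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, channels), nn.Sigmoid())

    def forward(self, x):                   # x: (batch, channels, frames)
        s = x.mean(dim=2)                   # squeeze: average over time
        w = self.gate(s).unsqueeze(-1)      # excitation: per-channel weights
        return x * w                        # rescale each channel

out = SEBlock1d(512)(torch.randn(4, 512, 200))   # shape preserved
```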
5.1 Deep Learning for Classification

Classification models with deep learning are presented in Table 2.

Table 2 Classification models with deep learning

1. Variational autoencoder (VAE) [33–35]. Key concept: the VAE is used for voice conversion, speech, and speaker recognition; it consists of stochastic neurons along with deterministic layers. Log-likelihood ratio scoring is used to discriminate between same and different speakers. Merits/demerits: the performance of the VAE is not superior to PLDA scoring.

2. Multi-domain features [36]. Key concept: automatic speech recognition (ASR) output is given as input to the speaker recognition evaluation (SRE) system and vice versa; extracted frame-level features are the inputs to ASR and SRE. Merits/demerits: for the WSJ database, EERs are low compared to the i-vector.

3. DNN in place of UBM [25]. Key concept: a DNN is used instead of UBMs. Universal deep belief networks are developed and used as a backend. The target i-vector and imposter i-vector develop a vector with good discriminating properties, which is stored in a discriminative target model. Merits/demerits: this model did not accomplish better performance compared to the PLDA-based i-vector.

4. Unlabeled data [37–39]. Key concept: unlabeled samples are introduced continuously and, together with a labeled corpus, are used to learn the DNN model. Merits/demerits: the proposed LSTM- and TDNN-based systems outperform the traditional methods.

5. Hybrid framework [40]. Key concept: zero-order feature statistics are fed to a standard i-vector model through a DNN; also, speech is segmented into senones by training the DNN using the HMM-GMM model. Merits/demerits: a low equal error rate is achieved with this method.

6. SincNet [41–45]. Key concept: SincNet is a distinctive convolutional neural network (CNN) architecture that takes one-dimensional raw audio as input. A filter at the first layer of the CNN acquires the knowledge of the lower (fl) and higher (fu) cut-off frequencies, and the convolutional layer can adjust these frequencies before passing them to further standard layers. Merits/demerits: fast convergence, improved accuracy, and computational efficiency; SincNet outperforms the other DNN solutions in speaker verification.

7. Far-field ResNet-BAM [46]. Key concept: an easy and effective novel speaker embedding architecture in which the bottleneck attention module (BAM) is mixed with a residual neural network (ResNet). It focuses on short speech and domain mismatch; the frontend includes a data processing unit, and the backend consists of speaker embedding with a domain adversarial training (DAT) extractor. Merits/demerits: adversarial domain training with a gradient reversal layer mitigates the domain mismatch.

8. Bidirectional attention [47]. Key concept: proposed to unite CNN-based feature knowledge with a bidirectional attention method to attain improved performance with merely a single enrollment speech. Merits/demerits: it outperforms the sequence-to-sequence and vector models.

6 Performance Measure

To decide whether a speaker is accepted or rejected by the system, a threshold is required against which the generated score is compared.

Correct Decision: If the system correctly accepts the true speaker, the outcome is a correct decision.

Miss: If the system rejects the true speaker, the outcome is a miss.

False Acceptance: If the system accepts an imposter, it makes an error known as false acceptance.

Detection Error Trade-Off (DET) Curve: To plot the DET curve, first record the number of times the system rejected the true speaker (misses) and the number of times an imposter was accepted (false acceptances). Then express these two parameters as percentages. For a particular threshold value θ, take the false acceptance rate on the x-axis and the miss rate on the y-axis to obtain one point in a two-dimensional space. By varying θ continuously, we obtain a curve known as the detection error trade-off curve. As an example, the dot on the DET curve in Fig. 18 indicates a very high miss rate and a very low false acceptance rate [48, 49]. Such a system rejects true speakers fairly often and admits few imposters, which is preferable in high-security applications like banking. At the other end of the curve, the miss rate is very low and the false acceptance rate is very high; such a system admits imposters easily and rarely misses the true speaker, a scenario useful in low-security applications [50]. The point at which the miss rate equals the false acceptance rate is the equal error rate (EER). Any system can be operated at this point. For example, a system with an EER of 1% has a 1% miss rate and a 1% false acceptance rate [51]. In Fig. 19, the closer the EER point lies to the origin, the better the system; an EER point rising toward the y = x line indicates a worse system.
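The DET/EER computation described in this section can be sketched as follows, given arrays of genuine (target) and imposter scores; the scores below are synthetic stand-ins for real trial scores.

```python
# Sweep a threshold over the score range, record miss and false-acceptance
# rates (the DET points), and locate the operating point where they cross.
import numpy as np

def det_points(target_scores, imposter_scores, n_thresholds=1000):
    lo = min(target_scores.min(), imposter_scores.min())
    hi = max(target_scores.max(), imposter_scores.max())
    thresholds = np.linspace(lo, hi, n_thresholds)
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(imposter_scores >= t).mean() for t in thresholds])
    return thresholds, miss, fa

def equal_error_rate(target_scores, imposter_scores):
    _, miss, fa = det_points(target_scores, imposter_scores)
    idx = np.argmin(np.abs(miss - fa))   # point where miss rate = FA rate
    return (miss[idx] + fa[idx]) / 2

rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(2, 1, 5000),   # genuine-trial scores
                       rng.normal(0, 1, 5000))   # imposter-trial scores
print(f"EER = {eer:.2%}")
```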
Fig. 18 Detection error trade-off (DET) curve

Fig. 19 Performance of a system with EER on the DET curve

7 Conclusion

This chapter attempts to present a comprehensive review of speaker recognition, with more emphasis on text-dependent speaker verification. It discusses the most commonly used feature extraction and modeling techniques for this task. It also surveys the recent advancements proposed in the field of text-independent speaker recognition. The deep learning models have been found to outperform many classical techniques, and there exists a huge scope to further improve the performance of these systems with more advanced architectures and data augmentation techniques.
References

1. Kinnunen, T., and Li, H.Z. (2010). An Overview of Text-Independent Speaker Recognition: From Features to Supervectors. Speech Communication, 52, 12–40. https://doi.org/10.1016/j.specom.2009.08.009
2. Hansen, J., and Hasan, T. (2015). Speaker Recognition by Machines and Humans: A Tutorial Review. IEEE Signal Processing Magazine, 32, 74–99. https://doi.org/10.1109/MSP.2015.2462851
3. Todkar, S.P., Babar, S.S., Ambike, R.U., Suryakar, P.B., and Prasad, J.R. (2018). Speaker Recognition Techniques: A Review. 2018 3rd International Conference for Convergence in Technology (I2CT), 1–5.
4. Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., ... and Reynolds, D.A. (2004). A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing, 2004(4), 1–22.
5. Pruzansky, S., Mathews, M., and Britner, P.B. (1963). Talker-Recognition Procedure Based on Analysis of Variance. Journal of the Acoustical Society of America, 35, 1877–1877.
6. Nguyen, M.S., and Vo, T. (2015). Vietnamese Voice Recognition for Home Automation using MFCC and DTW Techniques. 2015 International Conference on Advanced Computing and Applications (ACOMP), 150–156.
7. Tirumala, S.S., Shahamiri, S.R., Garhwal, A.S., and Wang, R. (2017). Speaker identification features extraction methods: A systematic review. Expert Systems with Applications, 90, 250–271. https://doi.org/10.1016/j.eswa.2017.08.015
8. Islam, M.A., et al. (2016). A Robust Speaker Identification System Using the Responses from a Model of the Auditory Periphery. PLoS ONE, 11(7), e0158520. https://doi.org/10.1371/journal.pone.0158520
9. Sujiya, S., and Chandra, E. (2017). A Review on Speaker Recognition. International Journal of Engineering and Technology, 9, 1592–1598.
10. Alim, S.A., and Rashid, N.K.A. (2018). Some Commonly Used Speech Feature Extraction Algorithms. In: From Natural to Artificial Intelligence - Algorithms and Applications, Ricardo Lopez-Ruiz (ed.), IntechOpen. https://doi.org/10.5772/intechopen.80419
11. Brew, A., and Cunningham, P. (2010). Vector Quantization Mappings for Speaker Verification. 2010 20th International Conference on Pattern Recognition, 560–564.
12. Reynolds, D.A., Quatieri, T.F., and Dunn, R.B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.
13. Larcher, A., Lee, K.A., Ma, B., and Li, H. (2014). Text-dependent speaker verification: Classifiers, databases and RSR2015. Speech Communication, Elsevier.
14. Sarkar, A.K., and Tan, Z.H. (2016). Text dependent speaker verification using un-supervised HMM-UBM and temporal GMM-UBM. In: Seventeenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 425–429.
15. Zheng, R., Zhang, S., and Xu, B. (2004). Text-independent speaker identification using GMM-UBM and frame level likelihood normalization. 2004 International Symposium on Chinese Spoken Language Processing, 289–292.
16. Yin, S., Rose, R., and Kenny, P. (2007). A Joint Factor Analysis Approach to Progressive Model Adaptation in Text-Independent Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 1999–2010. https://doi.org/10.1109/TASL.2007.902410
17. Campbell, W.M., Sturim, D.E., Reynolds, D.A., and Solomonoff, A. (2006). SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-I.
18. Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., and Mason, M. (2011). i-vector Based Speaker Recognition on Short Utterances. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
19. Li, W., Fu, T., and Zhu, J. (2015). An improved i-vector extraction algorithm for speaker verification. EURASIP Journal on Audio, Speech, and Music Processing, 2015, 18. https://doi.org/10.1186/s13636-015-0061-x
20. Dey, S., Motlicek, P., Madikeri, S., and Ferras, M. (2017). Template-matching for text-dependent speaker verification. Speech Communication, 88, 96–105.
21. Sztahó, D., Szaszák, G., and Beke, A. (2019). Deep learning methods in speaker recognition: a review.
22. Bai, Z., and Zhang, X.-L. (2021). Speaker recognition based on deep learning: An overview. Neural Networks, 140, 65–99. https://doi.org/10.1016/j.neunet.2021.03.004
23. Variani, E., Lei, X., McDermott, E., Moreno, I.L., and Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4052–4056. https://doi.org/10.1109/ICASSP.2014.6854363
24. Wan, L., Wang, Q., Papir, A., and Lopez-Moreno, I. (2018). Generalized End-to-End Loss for Speaker Verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4879–4883.
25. Chen, D., Mak, B., Leung, C., and Sivadas, S. (2014). Joint acoustic modeling of triphones and tri-graphemes by multi-task learning deep neural networks for low-resource speech recognition. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5592–5596. https://doi.org/10.1109/ICASSP.2014.6854673
26. Bai, Z., and Zhang, X. (2021). Speaker recognition based on deep learning: An overview. Neural Networks: The Official Journal of the International Neural Network Society, 140, 65–99.
27. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329–5333.
28. Fang, F., Wang, X., Yamagishi, J., Echizen, I., Todisco, M., Evans, N., and Bonastre, J. (2019). Speaker Anonymization Using X-vector and Neural Waveform Models. arXiv preprint arXiv:1905.13561.
29. Snyder, D., Garcia-Romero, D., Povey, D., and Khudanpur, S. (2017). Deep Neural Network Embeddings for Text-Independent Speaker Verification. INTERSPEECH.
30. Heigold, G., Moreno, I., Bengio, S., and Shazeer, N. (2016). End-to-end text-dependent speaker verification. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5115–5119.
31. Yu, Y., and Li, W. (2020). Densely Connected Time Delay Neural Network for Speaker Verification. INTERSPEECH.
32. Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. INTERSPEECH.
33. Kingma, D.P., and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
34. Rezende, D.J., Mohamed, S., and Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2), 1278–1286.
35. Villalba, J., Brümmer, N., and Dehak, N. (2017). Tied Variational Autoencoder Backends for i-Vector Speaker Recognition. In: INTERSPEECH 2017, 1004–1008.
36. Tang, Z., Li, L., and Wang, D. (2016). Multi-task recurrent model for speech and speaker recognition.
In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 1–4.
37. Marchi, E., Shum, S., Hwang, K., Kajarekar, S., Sigtia, S., Richards, H., Haynes, R., Kim, Y., and Bridle, J. (2018). Generalised discriminative transform via curriculum learning for speaker recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5324–5328.
38. Ranjan, S., and Hansen, J.H. (2018). Curriculum learning based approaches for noise robust speaker recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 26(1), 197–210.
39. Zheng, S., Liu, G., Suo, H., and Lei, Y. (2019). Autoencoder-based Semi-Supervised Curriculum Learning for Out-of-domain Speaker Verification. In: INTERSPEECH 2019, 4360–4364.
40. Lei, Y., Scheffer, N., Ferrer, L., and McLaren, M. (2014). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1695–1699.
41. Nagrani, A., Chung, J.S., and Zisserman, A. (2017). VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
42. Ravanelli, M., and Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In: 2018 IEEE Spoken Language Technology Workshop (SLT), 1021–1028.
43. Hajavi, A., and Etemad, A. (2019). A Deep Neural Network for Short-Segment Speaker Recognition. In: Proc. Interspeech 2019, 2878–2882.
44. Ravanelli, M., and Bengio, Y. (2019). Learning speaker representations with mutual information. In: Proc. Interspeech 2019, 1153–1157.
45. Salvati, D., Drioli, C., and Foresti, G.L. (2019). End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks. Proc. Interspeech 2019, 4335–4339.
46. Zhang, L., Wu, J., and Xie, L. (2021). NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge.
47. Fang, X., Gao, T., Zou, L., and Ling, Z. (2020). Bidirectional Attention for Text-Dependent Speaker Verification. Sensors, 20, 6784.
48. Doddington, G.R., Przybocki, M.A., Martin, A.F., and Reynolds, D.A. (2000). The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Communication, 31(2–3), 225–254.
49. Zeinali, H., Sameti, H., and Burget, L. (2017). Text-dependent speaker verification based on i-vectors, Neural Networks and Hidden Markov Models. Computer Speech and Language, 46, 53–71.
50. Bimbot, F., Bonastre, J.F., Fredouille, C., et al. (2004). A Tutorial on Text-Independent Speaker Verification. EURASIP Journal on Advances in Signal Processing, 2004, 101962. https://doi.org/10.1155/S1110865704310024
51. Cheng, J., and Wang, H. (2004). A method of estimating the equal error rate for automatic speaker verification. 2004 International Symposium on Chinese Spoken Language Processing, 285–288.
Music Composition with Deep Learning: A Review

Carlos Hernandez-Olivan and José R. Beltrán

Department of Electronic Engineering and Communications, University of Zaragoza, Zaragoza, Spain
e-mail: carloshero@unizar.es; jrbelbla@unizar.es

1 Introduction

Music is generally defined as a succession of pitches or rhythms, or both, in some definite patterns [1]. Music composition (or generation) is the process of creating or writing a new piece of music. The term music composition can also refer to an original piece or work of music [1]. Music composition requires creativity. Chomsky defines creativity as “the unique human capacity to understand and produce an indefinitely large number of sentences in a language, most of which have never been encountered or spoken before” [9]. On the other hand, Root-Bernstein defines this concept as follows: “creativity comes from finding the unexpected connections, from making use of skills, ideas, insights and analogies from disparate fields” [51]. Regarding music creativity, Gordon declares that there is no clear definition of this concept. He states that music creativity itself cannot be taught, but the readiness for one to fulfill one’s potential for music creativity can be, that is, the audiation vocabulary of tonal patterns and of varied and large rhythmic patterns [22]. This is a very important aspect that needs to be taken into account when designing or proposing an AI-based music composition algorithm. More specifically, music composition is an important topic in the music information retrieval (MIR) field. It comprises subtasks such as melody generation, multi-track or multi-instrument generation, style transfer, or harmonization. These aspects will be covered in this chapter from the point of view of the multitude of techniques that have flourished in recent years based on AI and DL.
1.1 From Algorithmic Composition to Deep Learning

Since the 1980s, the interest in computer-based music composition has never stopped growing. Some experiments came up in the early 1980s, such as the Experiments in Musical Intelligence (EMI) [12] by David Cope, from 1983 to 1989, or Analogiques A and B by Iannis Xenakis, which followed the author’s previous work from 1963 [68]. Later, in the 2000s, David Cope also proposed the combination of Markov chains with grammars for automatic music composition, and other relevant works such as Project1 (PR1) by Koenig [2] were born. These techniques can be grouped in the field of algorithmic music composition, which is a way of composing by means of formalizable methods [46, 58]. This type of composing consists of a controlled procedure based on mathematical instructions that must be followed in a fixed order. There are several methods within algorithmic composition, such as Markov models, generative grammars, cellular automata, genetic algorithms, transition networks, or chaos theory [28]. Sometimes, these techniques and other probabilistic methods are combined with deep neural networks (NNs) in order to condition them or help them to better model music, which is the case of DeepBach [25]. These models can generate and harmonize melodies in different styles, but the limited generalization capacity of these models and the rule-based definitions that must be crafted by hand make these methods less powerful and generalizable in comparison with DL-based models.

From the 1980s to the early 2000s, the first works that tried to model music with NNs were born [3, 17, 44]. In recent years, with the growth of deep learning (DL), many studies have tried to model music with deep NNs. DL models for music generation normally use NN architectures that are proven to perform well in other fields such as computer vision or natural language processing (NLP). Models pre-trained in these fields can also be reused for music generation; this is called transfer learning [74]. Some NN techniques and architectures will be shown later in this chapter. Music composition today takes input representations and NN architectures from large-scale NLP applications, such as transformer-based models, which are demonstrating very good performance in this task. This is due to the fact that music can be understood as a language in which every style or music genre has its own rules.

1.2 Neural Network Architectures for Music Composition with Deep Learning

First of all, we will provide an overview of the most widely used NN architectures that have provided the best results in the task of musical composition so far. The most used NN architectures in the music composition task are generative models such as variational autoencoders (VAEs) or generative adversarial networks (GANs) and
1.2 Neural Network Architectures for Music Composition with Deep Learning

First of all, we provide an overview of the most widely used NN architectures, which are providing the best results in the task of musical composition so far. The most used NN architectures in the music composition task are generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), and NLP-based models, such as long short-term memory (LSTM) networks and transformers. The following is an overview of these models.

1.2.1 Variational Autoencoders (VAEs)

The original VAE model [37] uses an encoder-decoder architecture to produce a latent space by reconstructing the input (see Fig. 1a). A latent space is a multidimensional space of compressed data in which the most similar elements are located closest to each other. In a VAE, the encoder approximates the posterior, and the decoder parameterizes the likelihood; the two approximations are parametrized by NNs with parameters λ for the encoder and θ for the decoder. Posterior inference is done by minimizing the Kullback-Leibler (KL) divergence between the approximate posterior and the true posterior, which is achieved by maximizing the evidence lower bound (ELBO). The gradient is computed with the so-called reparametrization trick. There are variations of the original VAE model, such as the β-VAE [27], which weights the KL term of the loss with a penalty factor β in order to improve the latent space distribution. In Fig. 1a, we show the general VAE architecture. An example of a DL model for music composition based on a VAE is MusicVAE [50], which we describe in further sections of this chapter.
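As a concrete, deliberately minimal sketch of these ideas, the PyTorch code below implements the reparametrization trick and the negative ELBO for a tiny VAE over binary input vectors. The layer sizes, the Bernoulli likelihood, and the class name are our own illustrative assumptions, not the design of any specific model from the literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 64)          # encoder body
        self.mu = nn.Linear(64, z_dim)           # posterior mean
        self.logvar = nn.Linear(64, z_dim)       # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients can flow through the sampling step.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, logits, mu, logvar, beta=1.0):
    # Reconstruction term: Bernoulli log-likelihood of the input.
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form; beta > 1 gives a beta-VAE.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

model = TinyVAE()
x = (torch.rand(8, 256) > 0.9).float()            # fake binary batch
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
loss.backward()                                   # gradients via the trick
```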
1.2.2 Generative Adversarial Networks (GANs)

GANs [21] are generative models composed of two NNs: the generator G and the discriminator D. The generator learns a distribution p_g over the input data. Training lets the discriminator maximize the probability of assigning the correct label both to the training samples and to the samples produced by the generator. This training scheme can be understood as D and G playing the two-player minimax game that Goodfellow et al. [21] described. In Fig. 1b, we show the general GAN architecture. The generator and the discriminator can be built from different NN layers, such as multi-layer perceptrons (MLPs) [52], LSTMs [30], or convolutional neural networks (CNNs) [19, 40].
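The sketch below shows one alternating update of this minimax game in PyTorch, using the commonly used non-saturating generator loss. The tiny MLP shapes and random stand-in data are illustrative assumptions; the music GANs cited in this chapter use far larger CNN or RNN generators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, x_dim = 16, 128
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, x_dim)              # stand-in for real training data

# --- Discriminator step: push D(real) -> 1 and D(G(z)) -> 0 ---
z = torch.randn(32, z_dim)
fake = G(z).detach()                       # stop gradients flowing into G
loss_d = (F.binary_cross_entropy_with_logits(D(real), torch.ones(32, 1))
          + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(32, 1)))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# --- Generator step: fool D, i.e., push D(G(z)) -> 1 ---
z = torch.randn(32, z_dim)
loss_g = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```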
1.2.3 Transformers

Transformers [61] are currently used in NLP applications due to their strong performance, not only in NLP but also in computer vision models. Transformers can be used as auto-regressive models, like LSTMs, which allows them to be used in generative tasks. The basic idea behind transformers is the attention mechanism. There are several variations of the original attention mechanism proposed by Vaswani et al. [61] that have been used in music composition [33]. The combination of the attention layer with feedforward layers forms the encoder and decoder of the transformer, which differs from purely autoencoder models that are also composed of an encoder and a decoder. Transformers are trained with tokens, which are structured representations of the inputs. In Fig. 1c, we show the general transformer architecture.

Fig. 1 (a) VAE [37], (b) GAN [21], and (c) transformer general architecture. Reproduced from [61]
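To make the attention mechanism behind Fig. 1c concrete, the following NumPy sketch computes scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, for a single head. It is a generic illustration: the random matrices stand in for learned projections of token embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model = 8, 32                             # e.g., 8 note tokens
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 32)
```

For auto-regressive generation, a causal mask is additionally applied to the score matrix so that each token can only attend to earlier positions in the sequence.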
1.3 Challenges in Music Composition with Deep Learning

There are different points of view on the challenges of music composition with DL, which lead us to ask questions about the input representations and DL models that have been used in this field, the output quality of current state-of-the-art methods, and the way researchers have measured the quality of the generated music. In this chapter, we ask ourselves the following questions, which involve the composition process and its output: Are current DL models capable of generating music with a certain level of creativity? What is the best NN architecture for music composition with DL? Could end-to-end methods generate entire structured music pieces? Are the pieces composed with DL just an imitation of the inputs, or can NNs generate new music in styles that are not present in the training data? Should NNs compose music by following the same logic and process as humans do? How much data do DL models for music generation need? Are current evaluation methods good enough to compare and measure the creativity of the composed music?

To answer these questions, we approach music composition or generation from the point of view of the process followed to obtain the final composition and of the output of DL models, i.e., we compare the human composition process with the DL music generation process and examine the artistic and creative characteristics of the generated music. We also analyze recent state-of-the-art models of music composition with DL to show the kind of result these models provide (motifs, complete compositions, etc.). Another important aspect analyzed is the input representation that these models use to generate music, in order to understand whether these representations are suitable for composing. This gives us some insights into how these models could be improved, whether these NN architectures are powerful enough to compose new music with a certain level of creativity, and the directions that future work in music composition with DL should take.

1.4 Chapter Structure

In this chapter, we analyze the symbolic music composition task from the perspectives of the composition process and the type of output generated. Therefore, we do not cover the performance or synthesis tasks. This chapter is structured as follows. Section 2 introduces a general view of the music composition process and basic music principles. In Sect. 3, we give an overview of state-of-the-art methods from the melodic composition perspective; we also examine how these models deal with harmony and structure. In Sect. 4, we describe DL models that generate multi-track or multi-instrument music. In Sect. 5, we show different methods and metrics that are commonly used to evaluate the output of a music generation model. In Sect. 6, we describe the open questions in the music generation field by analyzing the models described in the previous sections. Finally, in Sect. 7, we present future work and the challenges still being studied in music generation with DL.

2 The Music Composition Process

Much like written language, the music composition process is a complex process that depends on a large number of decisions [41]. In the music field, this process [11] depends on the music style we are working with. For example, in Western classical music it is very common to start with a small unit of one or two bars, called a motif, and develop it to compose a melody or music phrase, whereas in styles like pop or jazz it is more common to take a harmonic progression and compose or improvise a melody over it. Whatever the music style we are composing in, when a composer starts a piece of music, there is some basic melodic or harmonic idea behind it. From the Western classical music perspective, this idea (or motif) is developed by the composer to construct the melody or phrase that generates or follows a certain harmonic progression, and these phrases are then structured in sections. The melody can be constructed after the harmonic progression is set, or it can be generated first and then harmonized. How the melody is constructed and the way it is harmonized are decisions made by the composer. Each section has its own purpose, which means that it can be written in a different tonality, and its phrases usually follow different harmonic progressions than those of the other sections.

Sometimes music pieces have a melodic part and an accompaniment part. The melodic part of a music piece can be played by different instruments whose frequency ranges may or may not be similar, and the harmonic part gives the piece a deep and structured feel. The instruments, which are not necessarily in the same frequency range, are combined with Instrumentation and Orchestration techniques (see Sect. 3.2). These elements are crucial in musical composition, and they are also important keys when defining the style or genre of a piece of music.

Music has two dimensions: time and harmony. The time dimension is represented by note durations or rhythm, which is the lowest level on this axis. In this dimension, notes can be grouped or measured in units called bars, which are ordered groups of notes. The other dimension, harmony, is related to note values or pitch. If we think of an image, the time dimension would be the horizontal axis and the harmony dimension the vertical axis. Harmony also has a temporal evolution, but this is not represented in music scores. There is a very common software-based music representation, called piano-roll, that follows this logic.
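As a simple illustration of this representation, the NumPy sketch below builds a piano-roll as a binary matrix whose horizontal axis is time (here, a sixteenth-note grid) and whose vertical axis is MIDI pitch. The short C-major motif encoded in it is our own example.

```python
import numpy as np

N_PITCHES, N_STEPS = 128, 16        # MIDI pitch range x sixteenth-note grid
roll = np.zeros((N_PITCHES, N_STEPS), dtype=np.uint8)

# (midi_pitch, start_step, duration_steps) for a short illustrative motif.
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 4), (72, 12, 4)]  # C4 E4 G4 C5
for pitch, start, dur in notes:
    roll[pitch, start:start + dur] = 1          # note "on" over its duration

# Chords are simply several simultaneous pitches in the same columns:
roll[[60, 64, 67], 12:16] = 1                   # C-major triad under the C5

print(roll.shape, int(roll.sum()))              # (128, 16), active cells
```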
The time dimension of music is structured in low-level units, the notes. Notes are grouped in bars, which form motifs. At the high level of the time dimension, we find sections, which are composed of phrases that last eight or more bars (this depends on the style and composer). The lowest level in the harmony dimension is the note level. The superposition of notes played by different instruments creates chords, and a sequence of chords is called a harmonic progression or chord progression. Progressions are relevant to the composition, and they also have dependencies in the time dimension.

Having said that, we can think about music as a complex language model that consists of short- and long-term relationships. These relationships extend in two dimensions: the time dimension, which is related to music structure, and the harmonic dimension, which is related to the notes or pitches and chords, that is, the harmony.

Fig. 2 (a) General music composition scheme and (b) an example of the beginning of Beethoven's fifth symphony with music levels or categories

From the symbolic music generation and analysis points of view, and based on the ideas of Walton [63], some of the basic music principles or elements are (see Fig. 2):

– Harmony. It is the superposition of notes that form chords, which compose a chord progression. The note level can be considered the lowest level in harmony, and the next level is the chord level. The highest level is the progression level, which in tonal music usually belongs to a certain tonality.

– Music Form or Structure. It is the highest level that music presents, and it is related to the time dimension. The smallest part of a music piece is the motif, which is developed into a music phrase, and the combination of music phrases forms a section. Sections in music are ordered depending on the music style, such as intro-verse-chorus-verse-outro (also represented as ABCBA) for some pop songs or exposition-development-recapitulation (ABA) for sonatas. The concatenation of sections, which can be in different scales and modes, gives us the entire composition.
– Melody and Texture. Texture in music refers to the melodic, rhythmic, and harmonic contents that are combined in a composition in order to form the music piece. Music can be monophonic or polyphonic depending on the number of notes played at the same time step, and homophonic or heterophonic depending on whether or not the melody has an accompaniment.

– Instrumentation and Orchestration. These are music techniques that take into account the number of instruments or tracks in a music piece. Whereas instrumentation is related to the combination of musical instruments that make up a music piece, orchestration refers to the assignment of melodies and accompaniment to the different instruments of a given music piece. In recordings or software-based music representations, instruments are organized as tracks. Each track contains the collection of notes played on a single instrument [18]. Therefore, we can call a piece with more than one instrument multi-track, meaning that it contains two or more tracks, each played by a single instrument. Each track can contain one note or multiple notes sounding simultaneously, leading to monophonic and polyphonic tracks, respectively (see the sketch after this list).
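A minimal way to make the track and piece vocabulary above concrete is the following Python data model. The class and field names are our own illustrative choices rather than a format from the literature (standard formats such as MIDI encode equivalent information).

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    pitch: int        # MIDI pitch, 0-127
    start: float      # onset time in beats
    duration: float   # length in beats

@dataclass
class Track:
    instrument: str
    notes: list[Note] = field(default_factory=list)

    def is_polyphonic(self) -> bool:
        """True if any two notes on this track overlap in time."""
        spans = sorted((n.start, n.start + n.duration) for n in self.notes)
        return any(s2 < e1 for (_, e1), (s2, _) in zip(spans, spans[1:]))

@dataclass
class Piece:
    tracks: list[Track] = field(default_factory=list)

    @property
    def is_multitrack(self) -> bool:
        return len(self.tracks) > 1

melody = Track("flute", [Note(72, 0, 1), Note(74, 1, 1)])   # monophonic
chords = Track("piano", [Note(60, 0, 2), Note(64, 0, 2)])   # polyphonic
piece = Piece([melody, chords])
print(piece.is_multitrack, melody.is_polyphonic(), chords.is_polyphonic())
```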
Music categories are interrelated. Harmony is related to structure because a section is usually played in the same scale and mode; there are cadences between sections, and there can also be modulations, which change the scale of the piece. Texture and instrumentation are related to timbral features, and their relationship is based on the fact that not all instruments can play the same melodies. An example is a melody with many ornamentation elements that cannot be played by certain instrument families (because of the technical possibilities of each instrument or for stylistic reasons). Another important music attribute is dynamics, but dynamics are related to the performance rather than to the composition itself, so we will not cover them in this chapter. In Fig. 2, we show the aspects of the music composition process that we cover in this chapter, together with the relationships between categories and the sections of the chapter in which each topic is discussed.

3 Melody Generation

A melody is a sequence of notes with a certain rhythm, ordered in an aesthetic way. Melodies can be monophonic or polyphonic. Monophonic refers to melodies in which only one note is played at a time step, whereas in polyphonic melodies more than one note can be played at the same time step. Melody generation is an important part of music composition, and it has been attempted with algorithmic composition and with several NN architectures, including generative models such as VAEs and GANs, recurrent neural networks (RNNs) used for auto-regression tasks such as LSTMs, neural autoregressive distribution estimators (NADEs) [38], and models currently used in natural language processing such as transformers [61]. In Fig. 3, we show a scheme, annotated with the basic music principles, of an output-like score of a melody generation model.

Fig. 3 Scheme of an output-like score of melody generation models

3.1 Deep Learning Models for Melody Generation: From Motifs to Melodic Phrases

Depending on the music genre of our domain, the human composition process usually begins with the creation of a motif or a chord progression that is then expanded into a phrase or melody. When it comes to DL methods for music generation, several models can generate short note sequences. In 2016, the very first DL models attempted to generate short melodies with recurrent neural networks (RNNs) and semantic models such as unit selection [4]. These models worked for short sequences, so interest in creating entire melodies grew in parallel with the birth of new NNs. Derived from these first works, and with the aim of creating longer sequences (or melodies), other models that combined NNs with probabilistic methods came up. Examples of this are Google Magenta's Melody RNN models [62], released in 2016, and the Anticipation-RNN [24] and DeepBach [25], both published in 2017. DeepBach is considered one of the current state-of-the-art models for music generation because of its capacity to generate 4-voice chorales in the style of Bach.

However, these methods cannot generate new melodies with a high level of creativity from scratch. In order to improve the generation task, researchers turned to generative models to perform music composition. In fact, one of the best-performing models nowadays for generating motifs or short melodies of 2 to 16 bars is MusicVAE¹ [50], which was published in 2018. MusicVAE is a model for music generation based on a VAE [37]. With this model, music can be generated by interpolating in a latent space. It is trained with approximately 1.5 million songs from the Lakh MIDI Dataset (LMD)² [49], and it can generate polyphonic melodies for three instruments: melody, bass, and drums.

¹ https://magenta.tensorflow.org/music-vae, accessed August 2021.
² https://colinraffel.com/projects/lmd/, accessed August 2021.
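The appeal of a latent space is that generation reduces to vector arithmetic. The sketch below interpolates between two latent codes with spherical interpolation (slerp), which is often preferred over linear interpolation for Gaussian latent spaces. The random vectors and the 512-dimensional size are illustrative stand-ins; in practice, the codes would come from a trained encoder, and each interpolated point would be passed through a trained decoder such as MusicVAE's to obtain a melody.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors z0 and z1."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z0 + t * z1          # vectors nearly parallel
    return (np.sin((1.0 - t) * omega) * z0 +
            np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(42)
z_a = rng.standard_normal(512)   # latent code of melody A (e.g., from encoder)
z_b = rng.standard_normal(512)   # latent code of melody B

# Eleven evenly spaced points on the path from A to B; decoding each point
# would yield melodies that gradually morph from A into B.
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 11)]
print(len(path), path[0].shape)
```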
After the creation of the MusicVAE model, and with the birth of new NN architectures in other fields, the need for, and availability of, new DL-based models that can create longer melodies grew, and this led to the birth of new transformer-based models for music generation. Examples of these models are the Music Transformer [33] in 2018 and models that use pre-trained transformers, such as MuseNet, proposed by OpenAI in 2019 [47], which uses GPT-2 to generate music. These transformer-based models, such as the Music Transformer, can generate longer melodies and continue a given sequence, but after a few bars or seconds the melody ends up being somewhat random; that is, there are notes and harmonies that do not follow the musical sense of the piece. In order to overcome this problem and develop models that can generate longer sequences without losing the sense of the music generated in the previous bars or the main motifs, new models were born in 2020 and 2021 as combinations of VAEs, transformers, and other NNs or machine learning algorithms. Some examples of these models are the TransformerVAE [36] and PianoTree [66]. These models perform well even on polyphonic music, and they can generate music phrases. One of the latest models released to generate entire phrases is the model proposed in 2021 by Mittal et al. [42], which is based on denoising diffusion probabilistic models (DDPMs) [29], new generative models that produce high-quality samples by learning to invert a diffusion process from data to Gaussian noise. This model uses a 2-bar MusicVAE model and then trains a diffusion model to capture the temporal relationships among the VAE latents z_k, with k = 32, i.e., the 32 latent variables that allow it to generate 64 bars (2 bars per latent). Although longer polyphonic melodies can be generated this way, they do not follow a central motif, so they tend to lose the sense of a certain direction.

3.2 Structure Awareness

As we mentioned in Sect. 1, music is a structured language. Once melodies have been created, they must be grouped into bigger sections (see Fig. 2), which play a fundamental role in a composition. These sections have different names that vary depending on the music style, such as introduction, chorus, or verse for pop or trap genres and exposition, development, or recapitulation for classical sonatas. Sections can also be named with capital letters, and song structures can then be expressed as ABAB, for example. Generating music with structure is one of the most difficult tasks in music composition with DL, because structure implies an aesthetic sense of rhythm, chord progressions, and melodies that are concatenated with bridges and cadences [39]. In DL, there have been models that have tried to generate structured music by imposing the high-level structure with self-similarity constraints. An example of that is the model proposed by Lattner et al. in 2018 [39], which uses a convolutional restricted Boltzmann machine (C-RBM) to generate music and a self-similarity constraint with a self-similarity matrix [45] to impose the structure of the piece as if it were a template. This method, which imposes a structure template, is similar to the composition process that a composer follows when composing music, and the resulting music pieces followed the imposed structure template.
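To illustrate the self-similarity idea, the sketch below computes a self-similarity matrix over per-bar feature vectors, here bags of active pitch classes; repeated sections show up as high-similarity off-diagonal entries, and a template of such a matrix is the kind of constraint used in the approach described above. The feature choice and the toy AABA sequence are our own illustrative assumptions.

```python
import numpy as np

def bar_features(bars):
    """Each bar -> 12-dim pitch-class histogram (a crude bar descriptor)."""
    feats = np.zeros((len(bars), 12))
    for i, bar in enumerate(bars):
        for midi_pitch in bar:
            feats[i, midi_pitch % 12] += 1
    # L2-normalize so cosine similarity becomes a plain dot product.
    return feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9)

# Toy AABA form: bars 0, 1, and 3 share material, bar 2 contrasts.
A = [60, 64, 67, 72]          # C-major material
B = [62, 65, 69, 74]          # contrasting material
bars = [A, A, B, A]

F = bar_features(bars)
S = F @ F.T                   # self-similarity matrix, S[i, j] in [0, 1]
print(np.round(S, 2))         # high off-diagonal entries reveal the repeats
```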
Although new DL models are trending toward end-to-end designs, and new studies on modeling music with structure are being released [7], there have not yet been DL models capable of generating structured music by themselves, that is, without the help of a template or high-level structure information passed to the NN.

3.3 Harmony and Melody Conditioning

Within music composition with DL, there is the task of harmonizing a given melody, which differs from the task of creating a polyphonic melody from scratch. On the one hand, if we analyze the harmony of a melody created from scratch with a DL model, we see that music generated with DL is not yet well structured, as such models do not compose different sections or write aesthetic cadences and bridges between the sections in an end-to-end way. In spite of that, the harmony generated by transformer-based models that compose polyphonic melodies is coherent in the first bars of the generated pieces [33], because it follows a certain key. We have to emphasize here that these melodies are written for piano, which differs from multi-instrument music; the latter presents added challenges, such as generating appropriate melodies or accompaniments for each instrument and deciding which instruments make up the ensemble (see Sect. 4).

On the other hand, the task of melody harmonization consists of generating the harmony that accompanies a given melody. The accompaniment can be a chord accompaniment, regardless of the instrument or track the chords are on, or a multi-track accompaniment, where the notes in each chord belong to a specific instrument. The first models for harmonization used hidden Markov models (HMMs), but these models were improved upon by RNNs. Some models predicted chord functions [70], and other models match chord accompaniments to a given melody [69]. Regarding the generation of accompaniment with different tracks, GAN-based models that implement lead sheet arrangements have been proposed. In 2018, a Multi-Instrument Co-Arrangement Model called MICA [72] was proposed, and its improvement MSMICA followed in 2020 [73], both generating multi-track accompaniment. There is also a model called