International Journal of Trend in Scientific Research and Development (IJTSRD)
Volume 6 Issue 1, November-December 2021 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470
Speech Emotion Recognition Using Neural Networks
Anirban Chakraborty
Research Scholar, Department of Artificial Intelligence, Lovely Professional University, Jalandhar, Punjab, India
ABSTRACT
Speech is the most natural and easy method for people to
communicate, and interpreting speech is one of the most
sophisticated tasks that the human brain conducts. The goal of
Speech Emotion Recognition (SER) is to identify human emotion
from speech. This is due to the fact that tone and pitch of the voice
frequently reflect underlying emotions. Librosa was used to analyse audio and music, the soundfile library was used to read and write sampled sound file formats, and sklearn was used to build the model. The current study examined the effectiveness of Convolutional Neural Networks (CNN) in recognising spoken emotions. The networks'
input characteristics are spectrograms of voice samples. Mel-
Frequency Cepstral Coefficients (MFCC) are used to extract
characteristics from audio. Our own voice dataset is utilised to train
and test our algorithms. The emotions of the speech (happy, sad,
angry, neutral, shocked, disgusted) will be determined based on the
evaluation.
KEYWORDS: Speech emotion, Energy, Pitch, Librosa, Sklearn,
Sound file, CNN, Spectrogram, MFCC
How to cite this paper: Anirban Chakraborty, "Speech Emotion Recognition Using Neural Networks", Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-1, December 2021, pp. 922-927, URL: www.ijtsrd.com/papers/ijtsrd47958.pdf

Copyright © 2021 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)
I. INTRODUCTION
Speech emotion recognition (SER) is a technique that extracts emotional features from speech by analysing its distinctive characteristics and the changes they undergo with emotion. Speech emotion recognition is currently a developing cross-disciplinary area of artificial intelligence [1]. A
voice emotion processing and recognition system is
made up of three parts: speech signal acquisition,
feature extraction, and emotion recognition. In this
method, the extraction quality has a direct impact on
the accuracy of speech emotion identification. In
feature extraction, the entire emotional utterance is frequently treated as the unit from which features are extracted. The neural networks of the
human brain are highly capable of learning high-level
abstract notions from low-level information acquired
by the sensory periphery. Humans communicate
through voice, and interpreting speech is one of the
most sophisticated operations that the human brain
conducts. It has been argued that children who are not able to understand the emotional states of speakers develop poor social skills and, in some cases, show psychopathological symptoms [2, 3].
This highlights the importance of recognizing the
emotional states of speech in effective
communication. Detection of emotion from facial
expressions and biological measurements such as
heartbeats or skin resistance formed the preliminary framework of research in emotion recognition [4].
More recently, emotion recognition from speech
signal has received growing attention. The traditional
approach toward this problem was based on the fact
that there are relationships between acoustic features
and emotion. In other words, the emotion is encoded
by acoustic and prosodic correlates of speech signals
such as speaking rate, intonation, energy, formant
frequencies, fundamental frequency (pitch), intensity
(loudness), duration (length), and spectral
characteristic (timbre) [5, 6]. There are a variety of
machine learning algorithms that have been examined
to classify emotions based on their acoustic correlates
in speech utterances. In the current study, we
investigated the capability of convolutional neural
networks in classifying speech emotions using our
own dataset.
The specific contribution of this study is using
wide-band spectrograms instead of narrow-band spectrograms, as well as assessing the effect of data augmentation on the accuracy of models. Our results revealed that wide-band spectrograms and data augmentation equipped CNNs to achieve state-of-the-art accuracy and surpass human performance.
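To make the wide-band versus narrow-band distinction concrete, the sketch below computes a spectrogram with a short analysis window (the wide-band case) using librosa and applies a simple additive-noise augmentation. The sampling rate, window length, and noise level are illustrative assumptions; the paper does not report its exact settings.

import numpy as np
import librosa

def wideband_spectrogram(path, sr=16000, win_ms=5):
    # Short window (~5 ms) -> wide-band spectrogram: fine time resolution,
    # coarse frequency resolution.
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(sr * win_ms / 1000)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2)) ** 2
    return librosa.power_to_db(S, ref=np.max)

def add_noise(y, noise_factor=0.005):
    # One simple augmentation choice: add low-level Gaussian noise to the waveform.
    return y + noise_factor * np.random.randn(len(y))

A narrow-band spectrogram would instead use a longer window (roughly 25-30 ms), improving frequency resolution at the cost of temporal resolution.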
Fig.1. Speech emotion recognition block diagram
II. RELATED WORK
Most of the papers published in the last decade use spectral and prosodic features extracted from raw audio signals.
The process of emotion recognition from speech involves extracting the characteristics from a corpus of
emotional speech selected or implemented, and after that, the classification of emotions is done on the basis of
the extracted characteristics. The performance of the classification of emotions strongly depends on the good
extraction of the characteristics (such as the combination of the MFCC acoustic feature with the energy prosodic feature) [7]. Yixiong Pan in [8] used SVM for three-class emotion classification on the Berlin Database of Emotional Speech
[9] and achieved 95.1% accuracy.
Noroozi et al. proposed a versatile emotion recognition system based on the analysis of visual and auditory signals. They used 88 features (Mel frequency cepstral coefficients (MFCC), filter bank energies (FBEs)) and applied the
Principal Component Analysis (PCA) in feature extraction to reduce the dimension of features previously
extracted [10]. S. Lalitha in [11] used pitch and prosody features with an SVM classifier, reporting 81.1% accuracy on 7 classes of the whole Berlin Database of Emotional Speech. Zamil et al. also used spectral characteristics, namely the 13 MFCCs obtained from the audio data, in their proposed system to classify 7 emotions with the Logistic Model Tree (LMT) algorithm at an accuracy of 70% [12]. Yu Zhou in [13] combined prosodic and spectral features, used a Gaussian mixture model supervector based SVM, and reported 88.35% accuracy on 5 classes of the Chinese-LDC corpus.
H. M. Fayek in [14] explored various DNN architectures and reported accuracies around 60% on two different databases, eNTERFACE [15] and SAVEE [16], with 6 and 7 classes respectively. Fei Wang used a combination of a deep autoencoder, various features, and SVM in [17] and reported 83.5% accuracy on 6 classes of the Chinese emotion corpus CASIA. In contrast to these traditional approaches, more recent papers have employed Deep Neural Networks in their experiments with promising results. Many authors agree that the most important audio characteristics for recognizing emotions are the spectral energy distribution, the Teager Energy Operator (TEO) [18], MFCC, the Zero Crossing Rate (ZCR), and the energy parameters of the filter bank energies (FBEs) [19].
III. TRADITIONAL SYSTEM
The traditional system was based on the analysis and comparison of all kinds of emotional characteristic parameters, selecting characteristics with high emotional resolution for feature extraction. In general, traditional emotional feature extraction concentrates on analysing the emotional features of speech in terms of its temporal structure, amplitude structure, fundamental frequency structure, and other signal features [28].
IV. PROPOSED METHOD
A Convolutional Neural Network (CNN) is used to classify the emotions (happy, sad, angry, neutral, surprised, disgusted) and to predict the output along with its accuracy.
The given speech is plotted as a spectrogram using the matplotlib library, and this spectrogram is used as the input on which the CNN model is built.
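A minimal sketch of this step, assuming librosa and matplotlib are available; the file paths and figure size are hypothetical.

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

# Load one labelled recording (the file name is a hypothetical example).
y, sr = librosa.load("recordings/happy_001.wav", sr=None)

# Compute a dB-scaled magnitude spectrogram.
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# Render and save the spectrogram as an image that can later be fed to the CNN.
fig, ax = plt.subplots(figsize=(3, 3))
librosa.display.specshow(S_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("spectrograms/happy_001.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)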
Fig.2. Flow diagram of proposed system
A. Data Set Collection
The first step is to create an empty dataset that will hold the training data for the model. After creating the empty dataset, the audio data have to be recorded and labeled into different classes. Once the labeling is done, the data have to be preprocessed, which yields a cleaner signal by removing unwanted background noise. After preprocessing, the data are split into a training dataset and a test dataset, where the training dataset holds 75% of the data and the test dataset holds 25%.
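A minimal sketch of this 75%/25% split using sklearn; the feature and label arrays below are placeholders standing in for arrays built from the preprocessed audio.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for feature vectors and emotion labels derived from the
# recorded, labelled, and preprocessed audio.
features = np.random.rand(100, 40)          # e.g. 100 clips x 40 features
labels = np.random.randint(0, 6, size=100)  # 6 emotion classes

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.25,   # 25% held out for testing, 75% for training
    stratify=labels,  # keep class proportions similar in both splits
    random_state=42,
)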
B. Feature Extraction of Speech Emotion
Human speech contains many parameters that reflect the emotion embedded in it. As the emotion changes, these parameters also change; hence it is necessary to select a proper feature vector to identify the emotions. Features are categorized as excitation source features, spectral features, and prosodic features. Excitation source features are obtained by suppressing the characteristics of the vocal tract (VT). Spectral features used for emotion recognition include linear prediction coefficients (LPC), perceptual linear prediction coefficients (PLPC), Mel-frequency cepstral coefficients (MFCC), linear prediction cepstrum coefficients (LPCC), and perceptual linear prediction (PLP). Good accuracy in differentiating emotions can be achieved by using MFCC and LFPC features [20, 21].
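As one possible way to realise this step, the following sketch extracts MFCCs with librosa and averages them over time into a fixed-length vector per utterance; the sampling rate and the choice of 13 coefficients are common defaults assumed here, not values stated in the paper.

import numpy as np
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=13):
    # Load the (already preprocessed) audio and compute MFCCs per frame.
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    # Average over time to obtain one fixed-length vector per utterance.
    return np.mean(mfcc, axis=1)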
C. Mel-Frequency Cepstral Coefficients
The Mel-Frequency Cepstral Coefficients (MFCC) feature extraction method is a leading approach for speech
feature extraction. The various steps involved in MFCC feature extraction are:
Fig.3. Flow of MFCC
A/D conversion:
This converts the analog signal into a discrete digital signal.
Pre-emphasis:
This boosts the amount of energy in the high frequencies.
Windowing:
Windowing involves slicing the audio waveform into overlapping sliding frames.
Discrete Fourier Transform:
The DFT is used to extract information in the frequency domain [22, 23].
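A minimal numpy sketch of the pre-emphasis, windowing, and DFT steps listed above; the pre-emphasis coefficient, frame length, and hop size are conventional values assumed for illustration.

import numpy as np

def preemphasize(signal, alpha=0.97):
    # Pre-emphasis boosts high-frequency energy: y[n] = x[n] - alpha * x[n-1].
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_transform(signal, frame_len=400, hop=160):
    # Slice the waveform into overlapping frames (assumes the signal is at
    # least one frame long), apply a Hamming window, and take the magnitude
    # spectrum of each frame with the DFT.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))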
D. Classifiers
After extracting features from the speech, it is essential to select a proper classifier. Classifiers are used to classify emotions. In the current study, we use a Convolutional Neural Network (CNN). The term convolutional comes from the fact that convolution, the mathematical operation, is employed in these networks. Convolutional Neural Networks are among the most popular deep learning models and have shown remarkable success across research areas. A CNN is a deep learning algorithm that takes an image as input, assigns importance to various aspects of the image, and learns to differentiate one class from another. Generally, CNNs have three building blocks: the convolutional layer, the pooling layer, and the fully connected layer. Below, we describe these building blocks along with some basic concepts such as the softmax unit, the rectified linear unit (ReLU), and dropout.
Input layer: This layer holds the raw input image.
Convolution Layer: This layer computes the output volume by taking the dot product between each filter and the corresponding image patch.
Activation Function Layer: This layer applies an element-wise activation function to the output of the convolution layer.
Pool Layer: This layer is periodically inserted in a CNN; its main function is to reduce the size of the volume, which speeds up computation and reduces memory. The two common types are max pooling and average pooling.
Fully-Connected Layer: This layer takes input from the previous layer, computes the class scores, and outputs a 1-D array whose size equals the number of classes [24, 25].
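To illustrate how these building blocks fit together, the following is a minimal Keras sketch of such a network; the input shape, filter counts, kernel sizes, and dropout rate are assumptions made for illustration, not the exact architecture reported in this paper.

from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), num_classes=6):
    # Input -> convolution + ReLU activation -> pooling (repeated), then a
    # fully connected softmax layer that outputs one score per emotion class.
    model = models.Sequential([
        layers.Input(shape=input_shape),                  # spectrogram image
        layers.Conv2D(32, (3, 3), activation="relu"),     # convolution layer
        layers.MaxPooling2D((2, 2)),                      # pooling layer
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                              # dropout regularisation
        layers.Dense(num_classes, activation="softmax"),  # class scores
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model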
Fig.4. CNN Algorithm
V. APPLICATION
The applications of a speech emotion recognition system include psychiatric diagnosis, conversation with robots, intelligent toys, mobile-based emotion recognition, emotion recognition in call centres (where customers' emotions can be identified to help improve service quality), intelligent tutoring systems, lie detection, and games [26, 27]. It is also used in healthcare, psychology, cognitive science, marketing, and voice-based virtual assistants.
VI. CONCLUSION
In this research, we suggested a technique for
extracting the emotional characteristic parameter
from an emotional speech signal using the CNN
algorithm, one of the Deep Learning methods.
Previous research relied heavily on narrow-band
spectrograms, which offer better frequency resolution
than wide-band spectrograms and can discern
individual harmonics. Wide-band spectrograms, on
the other hand, offer better temporal resolution than
narrow-band spectrograms and reveal distinct glottal pulses that are associated with the fundamental frequency and pitch. On training data, CNNs perform admirably. The current study's findings demonstrated CNNs' ability to learn the fundamental emotional properties of speech signals from their low-level representation utilising wide-band spectrograms.
VII. FUTURE SCOPE
For future work, we suggest using audio-visual or audio-visual-linguistic databases to train Deep Learning models in which facial expressions and semantic information are taken into account alongside the speech signal, which can improve the recognition rate of each emotion. In the future, we can also consider using other types of features, applying our system to other, larger databases, and using other methods for feature extraction.
REFERENCES
[1] Z. Yongzhao and C. Peng, “Research and
implementation of emotional feature extraction
and recognition in speech signal,” Journal of Jiangsu University, Vol. 26, No. 1, pp. 72-75, 2005.
[2] Monita Chatterjee, Danielle J Zion, Mickael L
Deroche, Brooke A Burianek, Charles J Limb,
Alison P Goren, Aditya M Kulkarni, and Julie
A Christensen. Voice emotion recognition by
Cochlear-implanted children and their
normally-hearing peers. Hearing research,
322:151-162, 2015.
[3] Nancy Eisenberg, Tracy L Spinrad, and Natalie D Eggum. Emotion-related self-regulation and its relation to children’s maladjustment. Annual Review of Clinical Psychology, 6:495-525, 2010.
[4] Harold Schlosberg. Three dimensions of
emotion. Psychological review, 61(2):81, 1954.
[5] Louis Ten Bosch. Emotions, speech and the asr
framework. Speech communication, 40(1-
2):213-225, 2003.
[6] Thomas S Polzin and Alex Waibel. Detecting
emotions in speech. In proceedings of the
CMC, volume 16. Citeseer, 1998.
[7] Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla, “Speech emotion recognition using kernel sparse representation based classifier,” in 2016 24th European Signal Processing Conference (EUSIPCO), pp. 374-377, 2016.
[8] Pan, Y., Shen, P. and Shen, L., 2012. Speech
emotion recognition using support vector
machine. International Journal of Smart Home,
6(2), pp.101-108.
[9] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F. and Weiss, B., 2005, September. A database of German emotional speech. In Interspeech (Vol. 5, pp. 1517-1520).
[10] Noroozi, F., Marjanovic, M., Njegus, A., Escalera, S., & Anbarjafari, G. Audio-visual emotion recognition in video clips. IEEE Transactions on Affective Computing, 2017.
[11] Lalitha, S., Madhavan, A., Bhushan, B., and Saketh, S., 2014, October. Speech emotion recognition. In Advances in Electronics, Computers and Communications (ICAECC), 2014 International Conference on (pp. 1-4). IEEE.
[12] Zamil, Adib Ashfaq A., et al. “Emotion Detection from Speech Signals using Voting Mechanism on Classified Frames.” 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST). IEEE, 2019.
[13] Zhou, Y., Sun, Y., Zhang, J., and Yan, Y.,
2009, December. Speech emotion recognition
using both spectral and prosodic features. In
2009 International Conference on Information
Engineering and Computer Science (pp. 1-4).
IEEE.
[14] Fayek, H. M., M. Lech, and L. Cavedon.
“Towards real-time speech emotion recognition
using Deep Neural Networks.” Signal
Processing and Communication Systems (ICSPCS), 2015 9th International Conference on. IEEE, 2015.
[15] Martin, O., Kotsia, I., Macq, B., and Pitas, I., 2006, April. The eNTERFACE’05 audio-visual emotion database. In 22nd International Conference on Data Engineering Workshops (ICDEW’06) (pp. 8-8). IEEE.
[16] Sanjitha B. R., Nipunika A., Rohita Desai. “Speech Emotion Recognition using MLP”, IJESC.
[17] Ray Kurzweil. The singularity is near. Gerald
Duckworth & Co, 2010.
[18] Hadhami Aouani and Yassine Ben Ayed, “Speech Emotion Recognition with Deep Learning”, 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems.
[19] Pavol Harar, Radim Burget and Malay Kishore Dutta, “Efficiency of chosen speech descriptors in relation to emotion recognition,” EURASIP Journal on Audio, Speech, and Music Processing, 2017.
[20] Idris I., Salam M.S. “Improved Speech
Emotion Classification from Spectral
Coefficient Optimization”. Lecture Notes in
Electrical Engineering, vol 387. Springer, 2016.
[21] Pao TL., Chen YT., Yeh JH., Cheng YM.,
Chien C.S. “Feature Combination for Better
Differentiating Anger from Neutral in
Mandarin Emotional Speech”, LNCS: Vol.
4738 Berlin: Springer 2007.
[22] Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, Pearson Education (2nd edition).
[23] Hynek Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, Vol. 87, No. 4, pp. 1738-1752, 1990.
[24] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010.