International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 659
Review On Speech Recognition using Deep Learning
Anushree Raj1, Sahir Abdulla2, Vishwas N3
1 Assistant Professor, IT Department, AIMIT, Mangaluru, anushreeraj@staloysius.ac.in
2 MCA Student, AIMIT, Mangaluru, 2117044sahir@staloysius.ac.in
3 MCA Student, AIMIT, Mangaluru, 2117117vishwas@staloysius.ac.in
---------------------------------------------------------------------***--------------------------------------------------------------------
Abstract: Speech is the most effective means for humans to communicate their ideas and emotions across a variety of languages. Every language has a different set of speech characteristics, and tempo and dialect vary from person to person even when speaking the same language. This can make it difficult for some listeners to understand the message being delivered, and long speeches can be challenging to follow because of inconsistent pronunciation, tempo, and other factors. Speech recognition, an interdisciplinary area of computational linguistics, aids the development of technology that recognizes and transcribes voice into text. Text summarization then takes the most crucial information from a text source and summarizes it adequately.
Key words: Speech recognition, deep learning, computational linguistics, feature extraction, feature vectors.
1. INTRODUCTION
Voice is frequently used and regarded as important information when engaging with others. Through comprehension and recognition, voice recognition technology enables machines to convert human vocal signals into equivalent commands. Speech is the most effective form of expression for thoughts and feelings when learning new languages, and, according to the survey we conducted, it is very useful when we want to communicate with others. This project converts speech to text, or text to speech, using a deep learning technique based on a convolutional neural network (CNN), much like Google's Google Assistant, Apple's Siri, and Samsung's Bixby. The suggested work combines speech-to-text conversion with text summarization. Applications that call for concise summaries of lengthy talks will benefit from this hybrid approach, which is quite helpful for documentation. Deep learning is a branch of AI and machine learning that mimics how people learn certain types of information. Nowadays, numerous applications use human-machine interaction [1], and speech is one of the interactional media. The primary difficulty in human-machine interaction is identifying emotions in speech.
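As a rough, hypothetical illustration of the filtering a convolutional neural network applies to audio (not the paper's actual model), the sketch below runs a single invented 1-D convolution filter followed by a ReLU over a toy signal using NumPy:

```python
import numpy as np

# Toy 1-D "speech" signal: a low tone with a higher-frequency burst in the middle.
t = np.linspace(0.0, 1.0, 800)
signal = np.sin(2 * np.pi * 5 * t)
signal[350:450] += np.sin(2 * np.pi * 40 * t[350:450])

# A single convolutional filter (a hypothetical onset/edge detector).
kernel = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])

# One CNN-style layer: convolution followed by a ReLU non-linearity.
feature_map = np.convolve(signal, kernel, mode="valid")
feature_map = np.maximum(feature_map, 0.0)  # ReLU keeps only positive responses

print(feature_map.shape)  # (796,) = 800 - 5 + 1
```

A real CNN stacks many such learned filters and pools their outputs; this sketch only shows the core convolution-plus-nonlinearity step.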
2. OBJECTIVES
The objective of voice recognition is to use linguistic and phonetic data to convert the input sequence of speech feature vectors into a sequence of words. Structurally, a full voice recognition system consists of a feature extraction algorithm, an acoustic model, a language model, and a search algorithm. The speech recognition system is, in essence, a multidimensional pattern recognition system.
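The four components named above can be sketched as a toy pipeline. Everything here (the log-energy "features", the word templates, the bigram table, the two-word vocabulary) is invented for illustration and is not from the paper:

```python
import numpy as np

def extract_features(frames):
    # Stand-in feature extraction: log-energy per frame (real systems use MFCCs etc.).
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)

def acoustic_score(feature, word):
    # Hypothetical acoustic model: closeness of the feature to a per-word template.
    templates = {"on": 2.0, "off": -2.0}
    return -abs(feature - templates[word])

def language_score(word, history):
    # Hypothetical bigram language model (log-probability-like scores).
    bigrams = {("turn", "on"): 0.0, ("turn", "off"): -0.5}
    return bigrams.get((history, word), -1.0)

def search(feature, history, vocab=("on", "off")):
    # Search component: choose the word maximizing acoustic + language score.
    return max(vocab, key=lambda w: acoustic_score(feature, w) + language_score(w, history))

frames = np.array([[0.5, 0.5, 0.5], [2.0, 2.0, 2.0]])
feats = extract_features(frames)
print(search(feats[1], "turn"))  # the high-energy frame decodes as "on"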
Speech recognition provides input for automatic translation, generates print-ready dictation, and allows hands-free operation of various devices and equipment, all of which are especially helpful to many disabled people. Medical dictation software and automated telephone systems were some of the first speech recognition applications [2].
Speech recognizers are made up of a few components: the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to produce the output word sequence.
Benefits:
 It can help to increase productivity in many businesses, such as healthcare.
 It can capture speech much faster than you can type.
 You can use speech-to-text in real time.
 The software can spell as well as any other writing tool.
 It helps those who have problems with speech or sight.
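To make the decoder's use of a pronunciation dictionary concrete, here is a minimal sketch (the lexicon entries are invented examples): it maps a recognized phone sequence back to the dictionary word whose pronunciation matches.

```python
# Hypothetical pronunciation dictionary: word -> phone sequence.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "peach": ["p", "iy", "ch"],
    "text": ["t", "eh", "k", "s", "t"],
}

def decode_word(phones):
    """Return the dictionary word whose pronunciation exactly matches `phones`."""
    for word, pron in LEXICON.items():
        if pron == list(phones):
            return word
    return None  # out-of-vocabulary

print(decode_word(["s", "p", "iy", "ch"]))  # speech
```

Production decoders organize the lexicon as a prefix tree and combine it with acoustic and language-model scores rather than doing exact lookup, but the word-to-phones mapping plays the same role.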
3. LITERATURE REVIEW
Speech is the most crucial component of human communication. Although there are many ways to express what we think and feel, speaking is often regarded as the primary form of communication. The Google API can be used to convert recorded speech to text. Because the retrieved text does not contain periods, it is challenging to split the content produced by the Google API into sentences. In the suggested model, a period is added at the end of each phrase to distinguish the phrases from one another.
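The period-insertion step described above can be sketched in a few lines. This assumes each recognized phrase arrives as a separate punctuation-free string; the function itself is our illustrative stand-in, not the paper's code:

```python
def add_periods(phrases):
    """Append a period to each recognized phrase so sentences can be split later."""
    cleaned = (p.strip().rstrip(".") for p in phrases)
    return " ".join(p + "." for p in cleaned if p)

transcript = [
    "speech is the most effective means of communication",
    "every language has different characteristics",
]
print(add_periods(transcript))
```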
This study explained the theoretical algorithms used to construct voice recognition. The precise steps involved in voice recognition, such as biometrics acquisition, preprocessing, feature extraction, biometric pattern matching, and recognition output, are first described, followed by a detailed introduction to speech recognition based on biological features [3]. The primary procedures, recognition strategies,
and application scenarios for voice recognition are outlined in the paper. The key to ensuring recognition effectiveness is extracting feature information sensibly.
The voice-input voice-output communication aid (VIVOCA), a novel type of augmentative and alternative communication (AAC) technology for people with severe speech impairment, is described. The VIVOCA creates messages from the user's disordered speech and transforms them into synthetic speech. The findings demonstrate that the VIVOCA device performed better when phrase construction was the goal rather than spelling [4]. This is because there are normally 3–10 competing words in these trials, which keeps ambiguity relatively low in the phrase-building mode.
Computer models can also be applied to voice recognition in long recordings. Large audio or video files, often many minutes in length, contain numerous distinct audio segments, and the researchers needed to select the appropriate sound from a sizable file to listen to. Deep learning was employed in this study to categorize speech. The model was trained on the Google corpus and achieved an accuracy of 66.22% [5]. Here too, the key to ensuring recognition effectiveness is extracting feature information sensibly.
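Since the cited work stresses sensible feature extraction, a simplified MFCC-style pipeline (framing, power spectrum, mel filterbank, log, DCT) is sketched below with NumPy. The frame sizes and filter counts are arbitrary illustrative choices, and real front ends add windowing, frame overlap, pre-emphasis, and liftering:

```python
import numpy as np

def mel(f):          # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):      # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_like(signal, sr=8000, n_fft=256, n_mels=10, n_ceps=5):
    """Simplified MFCC pipeline: frame -> power spectrum -> mel filterbank -> log -> DCT."""
    # 1. Split into non-overlapping frames (real systems use overlapping windowed frames).
    n_frames = len(signal) // n_fft
    frames = signal[: n_frames * n_fft].reshape(n_frames, n_fft)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (n_frames, n_fft//2 + 1)
    # 3. Triangular mel filterbank, equally spaced on the mel scale.
    edges = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)                  # (n_frames, n_mels)
    # 4. DCT-II to decorrelate, keeping the first n_ceps coefficients.
    k = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_mels)))
    return logmel @ basis.T                                   # (n_frames, n_ceps)

x = np.random.default_rng(0).normal(size=2048)
feats = mfcc_like(x)
print(feats.shape)  # (8, 5): 8 frames, 5 cepstral coefficients each
```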
When audio is distorted by noise, the audio-visual speech recognition (AVSR) system is regarded as one of the most promising options for accurate speech recognition. To achieve good recognition performance, however, careful sensory feature selection is essential. In this study, the authors suggested an AVSR system based on MSHMMs for multimodal feature integration and isolated-word recognition, together with deep learning architectures for audio and visual feature extraction [6]. Their test findings showed that, compared with the original MFCCs, the deep denoising autoencoder can efficiently remove the effect of noise superimposed on originally clean audio inputs.
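The denoising idea can be illustrated with a linear stand-in for the autoencoder: learn a map from noisy feature vectors to clean ones by least squares. The data here is synthetic and the model deliberately trivial; the cited work trains a deep nonlinear network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" feature vectors and their noise-corrupted versions.
clean = rng.normal(size=(500, 8))
noisy = clean + 0.5 * rng.normal(size=clean.shape)

# Linear stand-in for a denoising autoencoder: W minimizing ||noisy @ W - clean||^2.
W, *_ = np.linalg.lstsq(noisy, clean, rcond=None)
denoised = noisy @ W

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_denoised < err_noisy)  # the learned map removes part of the noise
```

The learned map shrinks the noisy inputs toward the clean subspace, which is the same objective a denoising autoencoder optimizes with a much richer function class.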
The authors of [7] discuss their research on the usefulness of representation learning on sizable unlabeled speech corpora for speech emotion recognition (SER). Earlier work on representation learning for SER focused mainly on relatively small emotional speech datasets, and no further unlabeled speech data were utilized. They demonstrated that adding representations produced by an autoencoder trained on a sizable dataset consistently increases the recognition accuracy of the given SER model. Additionally, they provided t-SNE visualizations demonstrating the representations' ability to discriminate between low and high levels of arousal.
4. CHALLENGES AND ISSUES
Speech recognition has improved considerably over the last few years. This can mainly be attributed to the rise of graphics processing and cloud computing, which have made large data sets widely distributable.
Some challenges are:
 Audio/video conferencing with background noise.
 Speech recognition in voice assistant devices.
 Lack of trust and privacy issues.
 Touchless screens.
With recent developments, it’s going to be interesting to
see how the momentum of rapid growth can be maintained
and how the current challenges of speech recognition will
be dealt with [8].
For the past five to ten years, the goal of Automatic Speech Recognition (ASR) has been to decode voice inputs as accurately as possible. This made well-known systems such as Siri, Alexa, and Google Assistant feasible, and thanks to these voice assistants, voice recognition has entered our daily lives. Here we examine the speech recognition industry's existing difficulties and potential future advances. Reach and noisy settings are the two main sources of the current difficulties in voice detection, which necessitates even more accurate systems capable of handling the most challenging ASR use cases. Consider speech recognition during a boisterous family dinner, live interviews, or group meetings [9]. These are the upcoming difficulties for next-generation voice recognition.
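"Noisy settings" are usually quantified by signal-to-noise ratio (SNR). The toy computation below (a pure-tone stand-in for speech and Gaussian stand-in for noise, both invented) shows how the SNR in decibels falls as background noise grows:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8000)
speech = np.sin(2 * np.pi * 200 * t)  # stand-in speech signal

def snr_db(signal, noise):
    # SNR in dB: 10 * log10(signal power / noise power).
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(1)
quiet = rng.normal(scale=0.1, size=speech.shape)  # mild background noise
loud = rng.normal(scale=1.0, size=speech.shape)   # dinner-party level noise

print(round(snr_db(speech, quiet), 1))  # roughly +17 dB
print(round(snr_db(speech, loud), 1))   # roughly -3 dB
```

Recognizers that work comfortably at +17 dB often degrade sharply near 0 dB, which is why far-field and multi-speaker scenarios remain hard.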
Beyond this, voice recognition needs to support additional languages and a larger range of subjects. Much of the data that ASR needs to operate well has simply not been collected for certain languages and topics, and without it, ASR systems will remain quite constrained. Voice assistants and voice-powered user interfaces (VUIs) have a straightforward use case: they enable spoken commands from people to be translated into actions by machines. Even though the use case seems crystal clear, the ideal approach to human-machine interaction is still being developed [10]. Naturally, speech recognition will face difficulties as a result.
5. CONCLUSION
Here are some dictating tips to consider for improved results:
 Speak in an even tone and with clarity. Whispered words may not be interpreted correctly.
 Pause before and after a command, and avoid pausing in the middle of issuing one, so that the command is not interpreted as dictation.
 Prefer to speak in complete sentences, including punctuation, to give proper context.
Here are some ways to improve our speech-to-text technology:
1) Understand the types of errors
A speech-to-text tool produces a string of words based on what it has heard; this is what such tools are designed to do. However, deciding which word string it actually heard can be tricky, so errors occur that can throw users off. Guessing the wrong word is a classic problem in speech technology, because similar-sounding words present all kinds of potential mishearings, and a wrongly guessed word can still form a sentence that makes some sense.
Use a high-quality headset microphone
Using a high-quality headset microphone is one of the most important factors in improving voice recognition. A good headset not only captures the right words, it also holds the microphone directly in front of your mouth at a consistent position, and remaining consistently positioned helps you get more desirable speech recognition results.
2) Make corrections
Speech technology most commonly learns from the corrections you make, because most of these tools are based on artificial intelligence and deep learning. They will therefore learn your corrected words and use them the next time.
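A minimal sketch of "learning from corrections" is a plain substitution memory, as below; this is our invented illustration, since real systems adapt their acoustic and language models rather than storing word pairs:

```python
class CorrectionMemory:
    """Hypothetical sketch: remember user corrections and reapply them to new transcripts."""

    def __init__(self):
        self.fixes = {}

    def correct(self, heard, intended):
        # Record that `heard` should have been transcribed as `intended`.
        self.fixes[heard.lower()] = intended

    def apply(self, text):
        # Replace any previously corrected word in a new transcript.
        return " ".join(self.fixes.get(w.lower(), w) for w in text.split())

mem = CorrectionMemory()
mem.correct("recognise", "recognize")
print(mem.apply("we recognise speech"))  # we recognize speech
```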
3) Use automatic formatting
Some speech recognition tools offer automatic formatting, which can format various types of text automatically. It can also help your speech-to-text solution format specific phrases and words according to your preferences.
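Automatic formatting can be sketched as a small rule table mapping spoken tokens to punctuation; the two rules below are invented examples:

```python
import re

# Hypothetical formatting rules: spoken token -> written form.
RULES = [
    (re.compile(r"\bnew line\b", re.IGNORECASE), "\n"),
    (re.compile(r"\bcomma\b", re.IGNORECASE), ","),
]

def autoformat(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    text = re.sub(r"\s+,", ",", text)       # no space before a comma
    text = re.sub(r"\s*\n\s*", "\n", text)  # tidy spaces around line breaks
    return text

print(autoformat("hello comma world new line bye"))
```

Real products apply far richer rules (numbers, dates, capitalization), but they follow the same token-rewriting pattern.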
Speech-to-text technology is improving continuously, but speech recognition systems still have great difficulty attaining 99% accuracy. Considering these effective tips, however, can help you get better results.
As computer information technology develops, speech recognition technology will advance considerably. Several sectors, including public security, mobile Internet security, and automotive network security, are expected to employ this technology. Speech recognition research consequently has two main objectives: improving the information society and boosting living standards. Speech recognition is a key man-machine interface tool in information technology, with strong scientific importance and vast practical usefulness. In this study, biological features are introduced comprehensively for speech recognition, and the key steps, recognition tactics, and application scenarios for voice recognition are described. The ability to reasonably extract feature information is necessary for effective recognition.
REFERENCES
[1] Xinman Zhang, School of Electronics and Information Engineering, MOE Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, China (zhangxinman@xjtu.edu.cn), "An Overview of Speech Recognition Technology," 2019 4th International Conference on Control, Robotics and Cybernetics (CRC).
[2] Mark S. Hawley, Stuart P. Cunningham, Phil D. Green, Pam Enderby, Rebecca Palmer, Siddharth Sehgal, and Peter O'Neill, "A Voice-Input Voice-Output Communication Aid for People With Severe Speech Impairment," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 21, no. 1, January 2013.
[3] Phoemporn Lakkhanawannakun and Chaluemwut Noyunsan, Department of Computer Engineering, Faculty of Engineering, Rajamangala University of Technology Isan, Khon Kaen Campus, Khon Kaen 40000, Thailand, "Speech Recognition using Deep Learning."
[4] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata, "Audio-visual speech recognition using deep learning," published online 20 December 2014, Springer Science+Business Media New York, 2014.
[5] Michael Neumann and Ngoc Thang Vu, University of Stuttgart, Germany ({michael.neumann|thang.vu}@ims.uni-stuttgart.de).
[6] Y. H. Ghadage and S. D. Shelke, "Speech to text conversion for multilingual languages," 2016 International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, pp. 0236–0240, 2016.
[7] F. Zheng, L. T. Li, M. Z. Escale, and H. Zhang, "Voice print recognition technology and its application status," Research on Information Security, vol. 2, no. 1, Jan. 2016, pp. 44–57.
[8] Conference on Acoustics, vol. 24, no. 7, July 2012, pp. 1315–1329; C. H. Zhou, "Research on Speaker recognition system based on MFCC feature and GMM Model," Ph.D. dissertation, Dept. Electron. Eng., Lanzhou University of Technology, Lanzhou, China, 2013.
[9] Jose D. V., Alfateh Mustafa, and Sharan R., "A Novel Model for Speech to Text Conversion," International Refereed Journal of Engineering and Science (IRJES), vol. 3, no. 1, 2014.
[10] L. Liu, “Research on Fusion and Recognition Methods
on Multimode Biometrics,” Ph.D. dissertation, Dept.
Electron. Eng., University of Electronic Science and
Technology, Chengdu, China, 2010.

More Related Content

PPTX
Voice recognition
PPTX
AI for voice recognition.pptx
PDF
IRJET- Voice to Code Editor using Speech Recognition
PDF
A survey on Enhancements in Speech Recognition
PDF
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
PPTX
Speech to text conversion
PPTX
Speech to text conversion
PDF
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
Voice recognition
AI for voice recognition.pptx
IRJET- Voice to Code Editor using Speech Recognition
A survey on Enhancements in Speech Recognition
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
Speech to text conversion
Speech to text conversion
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATION

Similar to Review On Speech Recognition using Deep Learning (20)

PPTX
PDF
Recent advances in LVCSR : A benchmark comparison of performances
PPTX
SPEECH RECOGNIZATION-LOPAMUDRA.pptxFV hsdhfhshsuhishvs;hv;lsd bsdbgvsugvsidvs...
PPTX
SPEECH RECOGNIZATION-LOPAMUDRA.pptx jbjaegjvbleritglerlgeb reterltgfeltglgert...
PPTX
Research Developments and Directions in Speech Recognition and ...
PDF
Voice recognition
PDF
Speech Recognition: Transcription and transformation of human speech
PDF
Deep Learning For Speech Recognition
PPT
Machine Learning_ How to Do Speech Recognition with Deep Learning
PDF
Speech recognition using neural + fuzzy logic
PDF
Artificial Intelligence for Speech Recognition
PPT
Abstract of speech recognition
PDF
IRJET- Voice based Billing System
PDF
Deep Learning for Speech Recognition - Vikrant Singh Tomar
PPTX
sample PPT.pptx
PPTX
Speech recognition with Nlp (1).pptx DK.pptx
PPTX
SPEECH RECOGNITION USING NEURAL NETWORK
PDF
Develop Communication using Virtual Reality and Machine Learning
Recent advances in LVCSR : A benchmark comparison of performances
SPEECH RECOGNIZATION-LOPAMUDRA.pptxFV hsdhfhshsuhishvs;hv;lsd bsdbgvsugvsidvs...
SPEECH RECOGNIZATION-LOPAMUDRA.pptx jbjaegjvbleritglerlgeb reterltgfeltglgert...
Research Developments and Directions in Speech Recognition and ...
Voice recognition
Speech Recognition: Transcription and transformation of human speech
Deep Learning For Speech Recognition
Machine Learning_ How to Do Speech Recognition with Deep Learning
Speech recognition using neural + fuzzy logic
Artificial Intelligence for Speech Recognition
Abstract of speech recognition
IRJET- Voice based Billing System
Deep Learning for Speech Recognition - Vikrant Singh Tomar
sample PPT.pptx
Speech recognition with Nlp (1).pptx DK.pptx
SPEECH RECOGNITION USING NEURAL NETWORK
Develop Communication using Virtual Reality and Machine Learning
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Ad

Recently uploaded (20)

PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
composite construction of structures.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Construction Project Organization Group 2.pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
additive manufacturing of ss316l using mig welding
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Current and future trends in Computer Vision.pptx
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Model Code of Practice - Construction Work - 21102022 .pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
composite construction of structures.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Construction Project Organization Group 2.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Automation-in-Manufacturing-Chapter-Introduction.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
additive manufacturing of ss316l using mig welding
Embodied AI: Ushering in the Next Era of Intelligent Systems
Mechanical Engineering MATERIALS Selection
Safety Seminar civil to be ensured for safe working.
bas. eng. economics group 4 presentation 1.pptx
Lecture Notes Electrical Wiring System Components
Current and future trends in Computer Vision.pptx
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks

Review On Speech Recognition using Deep Learning

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 659 Review On Speech Recognition using Deep Learning Anushree Raj1, Sahir Abdulla2, Vishwas N3 1 Assistant Professor- IT Department, AIMIT, Mangaluru, anushreeraj@staloysius.ac.in 2 MCA Student, AIMIT, Mangaluru, 2117117vishwas@staloysius.ac.in 3 MCA Student, AIMIT, Mangaluru, 2117044sahir@staloysius.ac.in ---------------------------------------------------------------------***-------------------------------------------------------------------- Abstract: Speech is the most effective means forhumans to communicate their ideas and emotions across a variety of languages. Every language has a different set of speech characteristics. The tempo and dialect vary from person to person even when speaking the same language. For some folks, this makes it difficult to understand the messagebeing delivered. Long speeches can be challenging to follow at times because of things like inconsistent pronunciation, tempo, and other factors. The development of technology that enables the recognition and transcription of voice into text is aided by speech recognition, an interdisciplinaryarea of computational linguistics. The most crucial information is taken from a text source and adequately summarizedbytext summarization. Key words: Speech recognition, Deep learning, computational linguistics,featureextraction,featurevectors. 1. INTRODUCTION To select the proper output, some Voice is frequently used and regarded as important information while engagingwith others. Through comprehension and recognition, voice recognition technology enables machines to convert human vocal signals into equivalent commands. Speech is the most effective form of expression for thoughts and feelings when learning new languages. 
In the survey we conducted this is very useful when we want to communicate with others.This project will convert speech to text or text to speech using deep learning technique using CNN (conventional neural networking), Just like Google’sgoogleAssistant,Apple’sSIRI, Samsung’s Bixby. A combinationofspeechtotextconversion and text summarization is used in the suggested work. Applications that call for concise summaries of lengthy talks will benefit from this hybrid approach, whichisquitehelpful for documentation. Deep learning is a sort of AIandmachine learning that mimics how people learn specific types of information. Nowadays, numerous applications use human- machine interaction [1]. Speech is one of the interactional media. The primary difficulty in human-machineinteraction is identifying emotions in speech. 2. OBJECTIVES The objective of voice recognition is to use linguistic and phonetic data to convert the input speech feature vector series into a sequence of words. A full voice recognition system, according to the system's structure, consists of a feature extraction algorithm, acoustic model, language model, and search algorithm. A multidimensional pattern recognition system is essentially what the speech recognition system does. Speech recognition provides input forautomatictranslation, generates print-ready dictation, and allows hands-free operation of various devices and equipment—all of which are especially helpful to many disabled people. Medical dictation software and automated telephone systems were some of the first speech recognition applications [2]. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models Benefits:  It can help to increase productivity in many businesses, such as in healthcare industries.  It can capture speech much faster than you can type. 
 You can use text-to-speech in real-time.  The software can spell with the same ability as any other writing tool. Helps those who have problems with speech or sight. 3. LITERATURE REVIEW The most crucial component of human communication is speech. Although there are many ways to express what we think and feel, speaking is often regarded as the primary form of communication. The Google API can be used to convert recorded speech to text. Because the retrieved text does not contain a period, it is challenging to split the content into sentences that were created using the Google API. In the suggested model, a period is added at the end of each phrase to distinguish them from one another. The theoretical algorithms used to construct voice recognition were explained in this study. The precise steps involved in voice recognition, suchasbiometricsacquisition, preprocessing, feature extraction, biometrics pattern matching, and recognition outcomes, arefirstdescribed.The detailed introduction of speech recognition in biological features [3]. The primary procedures,recognitionstrategies,
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 660 and application situations for voice recognition are outlined in this paper. The secret to ensuring recognition effectiveness is learning how to extract feature information sensibly. The voice-inputvoice-output communicationaid(VIVOCA),a novel type of augmentative and alternative communication (AAC) technology forpeople withseverespeechimpairment, is described. The VIVOCA creates messages from the user's disordered speech and transforms them into synthetic speech. The findings demonstrate that the VIVOCA device performed better when phrase construction was the goal rather than spelling [4]. This is because there are normally 3–10 competing words in these trials, which makes the ambiguity relatively low in the phrase building mode. Computer models can be used to predict voice recognition. There are numerous files with a variety of audio and audio files in huge audio or video files with many minutes in length. This researcher selected the appropriate soundfrom a sizable file to listen to. Deep learning was employed in this study to categorize speech. The model was trained using the Google corpus. We had accuracy of 66.22%. The detailed introduction of speech recognition in biological features[5]. The primary procedures, recognition strategies, and application situations for voice recognition are outlined in this paper. The secret to ensuring recognition effectiveness is learning how to extract feature information sensibly. When the audio is distorted bynoise,theaudio-visual speech recognition (AVSR) system is regarded as one of the most promising options for accurate speech recognition. To achieve good recognition performance, however, careful sensory feature selection is essential. 
In this study, they suggested an AVSR system based on MSHMMs for multimodal feature integration and isolated word recognition in addition to deep learning architectures for audio and visual feature extraction [6]. Our test findings showed that the deep denoising auto encoder can efficiently remove the effect of noise superimposed on original clean audio inputs when compared to the original MFCCs. They discuss our research on the usefulness of using representation learning onsizableunlabeledspeechcorpora for speech emotion recognition (SER). The relatively small emotional speech datasets were the main focus of earlier work on representation learning for SER, and no further unlabeled speech data were utilized. They have demonstrated in this study that adding representations produced by an auto encoder that was trained on a sizable dataset consistently increases the provided SER model's recognition accuracy [7]. Additionally, we provided t-SNE visualizations that demonstrate the representations' ability to discriminate between low and high levels of arousal. 4. CHALLENGES AND ISSUES During the last few years, speech recognition has improved a lot. This can mainly be attributed to the rise of graphics processing and cloud computing, as these have made large data sets widely distributable. Some Challenges are:  Audio / Video Conferencing with Background Noise.  Speech Recognition and Voice Assistant Devices.  Lack of Trust and Privacy Issues.  Touch less Screens.  The Future of Voice Recognition Technology.  In Summary. With recent developments, it’s going to be interesting to see how the momentum of rapid growth can be maintained and how the current challenges of speech recognition will be dealt with [8]. The goal of Automatic Speech Recognition (ASR) for the past five to ten years has been to decode voice inputs as accurately as possible. Systems like Siri, Alexa, and Google Assistant, which are well-known, were made feasible by this. 
Well-known voice assistants such as Siri, Alexa, and Google Assistant have brought voice recognition into our daily lives. In this paper, we examine the speech recognition industry's existing difficulties and potential future advances. Reach and noisy settings are the two main causes of the current difficulties in voice detection. This necessitates even more accurate systems that are capable of handling the most challenging ASR use-cases: consider speech recognition during a boisterous family dinner, live interviews, or group meetings [9]. These are the upcoming difficulties for next-generation voice recognition. Beyond this, voice recognition needs to support additional languages and a larger range of subjects. For certain languages and topics, much of the data that ASR needs to operate well has simply not been collected; without it, ASR systems will remain quite constrained.

Voice assistants and Voice Powered User Interfaces (VUIs) have a straightforward use-case: they enable spoken commands from people to be translated into actions by machines. Even though the use-case seems crystal clear, the ideal approach to human-machine interaction is still being developed [10]. Naturally, speech recognition will face difficulties as a result.

5. CONCLUSION

Here are some dictating tips to consider for improved results:

• Speak in an even tone and with clarity. If you whisper, words may not be interpreted correctly.
• It is always better to pause before and after a command, while avoiding a pause in the midst of issuing a command, so that it is not interpreted as dictation.
• Prefer to speak in complete sentences, even including punctuation, to give proper context.

Here are ways to consider bringing improvement to our text to speech technology in the best possible way:

1) Understand the types of errors

A text to speech tool often comes up with an array of words based on what it has heard; this is what the tools have been designed to do. However, deciding which string of words it has heard can become tricky, so errors can occur that throw users off. Guessing the wrong word is one of the classic problems in speech technology, because many words sound similar and present all kinds of potential mishearings, even though the resulting sentence might still make some sense.

Use a high-quality headset microphone

Using a high-quality headset microphone is one of the most important factors in improving voice recognition. Such microphones are not only capable of catching the right words, but also remain positioned consistently, directly in front of your mouth, which helps you get more desirable speech recognition results.

2) Make corrections

Most commonly, speech technology learns from the corrections you make, because most of these tools are based on artificial intelligence and deep learning technology. They will therefore learn your corrected words and use them the next time.
3) Use automatic formatting

Some tools in speech recognition technology offer automatic formatting solutions. These can format various types of text automatically and can help your text to speech solution format specific phrases and words according to your preferences.

Text to speech technology is improving continuously, but speech recognition systems still have great difficulty attaining 99% accuracy. Considering these effective tips, however, can help you get better results.

As computer information technology develops, speech recognition technology will advance considerably. Several businesses, including public security, mobile Internet security, and automotive network security, are expected to employ this technology. Speech recognition research consequently has two main objectives: improving the information society and boosting living standards. Speech recognition is a key man-machine interface tool in information technology, with strong scientific importance and vast application usefulness. In this study, biological features are introduced in full for speech recognition. This article describes the key steps, recognition tactics, and application scenarios for voice recognition. The ability to reasonably extract feature information is necessary for effective recognition.

REFERENCES

[1] X. Zhang, "An Overview of Speech Recognition Technology," 2019 4th International Conference on Control, Robotics and Cybernetics (CRC), Xi'an Jiaotong University, Xi'an, China, 2019.

[2] M. S. Hawley, S. P. Cunningham, P. D. Green, P. Enderby, R. Palmer, S. Sehgal, and P. O'Neill, "A Voice-Input Voice-Output Communication Aid for People With Severe Speech Impairment," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 21, no. 1, Jan. 2013.

[3] P. Lakkhanawannakun and C. Noyunsan, "Speech Recognition using Deep Learning," Rajamangala University of Technology Isan, Khon Kaen Campus, Khon Kaen, Thailand.

[4] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Audio-visual speech recognition using deep learning," Springer Science+Business Media New York, published online 20 Dec. 2014.

[5] M. Neumann and N. T. Vu, University of Stuttgart, Germany.

[6] Y. H. Ghadage and S. D. Shelke, "Speech to text conversion for multilingual languages," 2016 International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, pp. 0236–0240, 2016.
[7] F. Zheng, L. T. Li, M. Z. Escale, and H. Zhang, "Voice print recognition technology and its application status," Research on Information Security, vol. 2, no. 1, pp. 44–57, Jan. 2016.

[8] Conference on Acoustics, vol. 24, no. 7, pp. 1315–1329, Jul. 2012.

[9] C. H. Zhou, "Research on Speaker recognition system based on MFCC feature and GMM Model," Ph.D. dissertation, Dept. Electron. Eng., Lanzhou University of Technology, Lanzhou, China, 2013.

[9] D. V. Jose, A. Mustafa, and R. Sharan, "A Novel Model for Speech to Text Conversion," International Refereed Journal of Engineering and Science (IRJES), vol. 3, no. 1, 2014.

[10] L. Liu, "Research on Fusion and Recognition Methods on Multimode Biometrics," Ph.D. dissertation, Dept. Electron. Eng., University of Electronic Science and Technology, Chengdu, China, 2010.