2. Table of Contents
• Introduction
• Problem Identification
• Literature Review
• Research Gap
• Research Objective
• Proposed Methodology
• Conclusion
• References
3. Introduction
• Automatic Speech Recognition (ASR) is a transformative technology
that converts spoken language into text, enabling applications in areas
like virtual assistants, real-time transcription, and accessibility tools.
• Traditional ASR systems rely heavily on separate components,
including acoustic modeling, language modeling, and decoding.
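The separate components above can be illustrated with a deliberately tiny sketch. Everything here is a toy stand-in, not a real toolkit: the "acoustic model" is a table of per-frame symbol probabilities, the "language model" is a unigram word table, and the decoder simply picks the word that maximizes the combined log-score.

```python
# Toy sketch of the traditional modular ASR pipeline: acoustic model,
# language model, and decoder as separate components (all hypothetical data).
import math

# "Acoustic model" output: per-frame probability of each symbol.
FRAMES = [
    {"h": 0.7, "x": 0.3},
    {"i": 0.8, "e": 0.2},
]

# "Language model": unigram log-probabilities over a tiny vocabulary.
LM = {"hi": math.log(0.6), "he": math.log(0.3), "xi": math.log(0.1)}

def acoustic_score(word):
    """Sum of per-frame log-probabilities for spelling `word`."""
    if len(word) != len(FRAMES):
        return float("-inf")
    return sum(math.log(frame.get(ch, 1e-9)) for ch, frame in zip(word, FRAMES))

def decode():
    """Decoder: pick the word maximizing acoustic + language model score."""
    return max(LM, key=lambda w: acoustic_score(w) + LM[w])

print(decode())  # → hi
```

In a real system each component is far richer (e.g. HMM/DNN acoustic models, n-gram or neural language models, beam-search decoding), but the division of labor is the same.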
4. Large Language Models (LLMs)
Large Language Models (LLMs), such as OpenAI’s GPT, Google’s T5,
and Meta’s LLaMA, are deep learning models trained on massive
textual datasets. These models excel in understanding, generating, and
contextualizing text. Their ability to comprehend nuances,
disambiguate meaning, and perform reasoning tasks makes them
valuable assets in ASR.
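One common way LLMs add value to ASR is n-best rescoring: the recognizer proposes several candidate transcripts, and an LLM re-ranks them by linguistic plausibility. The sketch below is a hedged illustration; `llm_log_prob` is a hypothetical stand-in for a real LLM scoring API, here reduced to a toy bigram table.

```python
# Toy sketch of LLM-based n-best rescoring for ASR.
# `llm_log_prob` is a hypothetical stand-in for a real LLM's scoring API.
LIKELY_BIGRAMS = {("recognize", "speech"), ("nice", "beach")}

def llm_log_prob(text):
    """Toy 'LLM' score: reward plausible bigrams, penalize length."""
    words = text.split()
    hits = sum(1 for pair in zip(words, words[1:]) if pair in LIKELY_BIGRAMS)
    return hits - len(words)

def rescore(nbest):
    """Combine the ASR score with the (stub) LLM score; return the best hypothesis."""
    return max(nbest, key=lambda h: h["asr_score"] + llm_log_prob(h["text"]))

# The ASR system slightly prefers the wrong hypothesis; the LLM flips it.
nbest = [
    {"text": "wreck a nice beach", "asr_score": -1.0},
    {"text": "recognize speech", "asr_score": -1.2},
]
print(rescore(nbest)["text"])  # → recognize speech
```

With a real LLM, `llm_log_prob` would be the model's log-likelihood of the candidate text, and the two scores are typically combined with a tuned interpolation weight.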
5. Literature Review
[1] Focus Area: Overview of LLMs and applications
• Techniques/Methods: Surveys LLM downstream tasks such as text generation, and applications in healthcare, education, etc. (secondary data).
• Limitations: Challenges in domain-specific adaptation, hallucinations, lack of interpretability, ethical concerns, and computational cost.
• Research Gaps: Need for domain-specific LLMs, strategies to reduce biases and hallucinations, and improved interpretability of predictions.

[2] Focus Area: Evaluation of LLMs
• Techniques/Methods: Explores evaluation frameworks, tasks, and benchmarks for LLMs across domains such as NLP, ethics, and reasoning (secondary data).
• Limitations: Limited focus on diverse tasks and underrepresented languages.
• Research Gaps: Development of standardized evaluation protocols addressing safety, reliability, and robustness.

[3] Focus Area: Deep Learning in Audio-Visual Speech Recognition (AVSR)
• Techniques/Methods: Multimodal fusion strategies, pre-processing techniques, and end-to-end AVSR architectures using deep learning.
• Limitations: Real-world noise; lack of large-scale datasets in diverse languages.
• Research Gaps: Need for large-scale multilingual AVSR datasets and robust methods for handling noise and variability in real-world scenarios.
6. Literature Review
[4] Focus Area: Deep Learning Techniques for Speech Emotion Recognition (SER)
• Techniques/Methods: Deep learning (LSTMs, CNNs, GANs, autoencoders), use of emotional speech datasets, and feature extraction methods.
• Limitations: Lack of real-world SER datasets; limitations in speaker-independent settings; challenges in noisy environments.
• Research Gaps: Need for robust SER systems in noisy and natural settings; exploration of multimodal data integration to enhance emotion recognition accuracy.

[5] Focus Area: Audio-Visual Speech Recognition (AVSR)
• Techniques/Methods: Deep learning for modality fusion, pre-processing, augmentation, and end-to-end AVSR systems.
• Limitations: Limited datasets for diverse languages and real-world noise; difficulties in managing variability in speaker characteristics.
• Research Gaps: Development of large-scale multilingual datasets; more robust fusion strategies to handle noise, accents, and diverse conditions in real-world AVSR systems.