R.M.K. COLLEGE OF
ENGINEERING AND TECHNOLOGY
22AI903 TEXT AND SPEECH ANALYTICS
(Professional Elective IV)
Dr. V. VIJAYARAJA
Professor
Artificial Intelligence and Data Science
Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
OBJECTIVES
To introduce the tools and techniques for performing text and
speech analytics in diverse contexts.
To understand the tools and technologies involved in
developing text and speech applications.
To demonstrate the use of computing for building
applications in text and speech processing.
To use Information Retrieval Techniques to build and evaluate
text processing systems.
To apply advanced speech recognition methodologies in
practical applications.
OUTCOMES
CO1: Apply the fundamental techniques in text processing for various NLP
tasks.
CO2: Implement advanced language models and improve text
classification accuracy.
CO3: Design text processing systems using state-of-the-art techniques.
CO4: Design, implement, and evaluate ASR and TTS systems.
CO5: Apply advanced speech recognition methodologies in practical
applications.
CO6: Use Information Retrieval Techniques to build and evaluate text
processing systems.
UNIT I TEXT PROCESSING
Speech and Language Processing
Regular Expression
Text normalization
Edit Distance
Lemmatization
Stemming
N-gram Language Models
Vector Semantics and Embeddings.
UNIT II TEXT CLASSIFICATION
Text Classification Tasks
Language Model
Neural Language Models
RNNs as Language Models
Transformers and Large Language Models.
UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS
Information Retrieval
Dense Vectors
Neural IR for Question Answering
Evaluating Retrieval based Question Answering
Frame-based Dialogue Systems
Dialogue Acts and Dialogue State
Chatbots – Dialogue System Design.
UNIT IV TEXT TO SPEECH SYNTHESIS
Automatic Speech Recognition Task
Feature Extraction for ASR: Log Mel Spectrum
Speech Recognition Architecture
CTC
ASR Evaluation: Word Error Rate
TTS
Speech Tasks.
UNIT V SPEECH RECOGNITION
LPC for speech recognition
Hidden Markov Model (HMM)
Training procedure for HMM
subword unit model based on HMM
Language models for large vocabulary speech recognition
Overall recognition system based on subword units
Context dependent subword units
Semantic post processor for speech recognition.
TEXT BOOKS
Jurafsky, D. and J. H. Martin, Speech and Language Processing: An
Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition, Pearson, Third Edition, 2022.
Lawrence Rabiner, Biing-Hwang Juang and
B.Yegnanarayana, “Fundamentals of Speech Recognition”,
Pearson Education, 2009.
REFERENCES
John Atkinson-Abutridy, Text Analytics: An Introduction to the
Science and Applications of Unstructured Information Analysis, CRC
Press, 2022.
Jim Schwoebel, NeuroLex, Introduction to Voice Computing in
Python, 2018
Lawrence R. Rabiner, Ronald W. Schafer, Theory and Applications of
Digital Speech Processing, First Edition, Pearson, 2010.
ARTIFICIAL INTELLIGENCE
Artificial intelligence is a specific branch of computer
science concerned with replicating the thought process
and decision-making ability of humans through computer
algorithms
Artificial intelligence makes it possible for machines to
learn from experience, adjust to new inputs and perform
human-like tasks
NATURAL LANGUAGE PROCESSING
NLP stands for Natural Language Processing, which deals
with the interaction between computers and humans in
natural language
SPEECH AND LANGUAGE PROCESSING
Involves the development of techniques that allow computers to
• understand,
• interpret, and
• generate human languages (both spoken and written)
It encompasses multiple domains of research and applications such as
• speech recognition,
• natural language processing (NLP) and
• text-to-speech synthesis
COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Speech Recognition (Automatic Speech Recognition, ASR): converts spoken words
into text. Used in voice assistants such as Siri, Google Assistant, and Alexa.
Natural Language Processing (NLP): handles the interaction between computers and
human language. Used in chatbots.
Text-to-Speech (TTS): converts written text into spoken words. Used in Google
Translate.
Speech Synthesis: generates human-like speech from text. Used in Google's
Text-to-Speech service available on smartphones.
COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Regular Expressions: searching and manipulating text data. Used in the search for
specific phrases or patterns in voice transcripts.
Text Normalization: converting raw text into a standard format
Edit Distance : measures the number of operations required to convert one string
into another
Stemming : reduce words to their root form (e.g., “running” → “run”).
COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Lemmatization: involves reducing words to their base form, considering its meaning
(e.g., “better” → “good”).
N-gram Language Models: used to predict the next word or sequence in a sentence
Vector Semantics and Embeddings: involves representing words or phrases as
vectors in a multi-dimensional space
REGULAR EXPRESSIONS (Regex)
Sequence of characters that forms a search pattern.
Used for pattern matching with strings or for searching and manipulating text
Essential in tasks such as text searching, text extraction, and data cleaning
Particularly in speech and language processing, where preprocessing text or
speech transcriptions is often required
CONCEPTS OF REGULAR EXPRESSIONS
Literal Characters:
Basic characters that match themselves in a string.
Example: The regex apple matches the string "apple".
Meta-characters:
These are special characters that have specific meanings. Commonly used meta-characters include:
. (dot): Matches any single character (except newline).
^: Anchors the match at the beginning of the string.
$: Anchors the match at the end of the string.
*: Matches zero or more of the preceding character.
+: Matches one or more of the preceding character.
?: Matches zero or one of the preceding character.
CONCEPTS OF REGULAR EXPRESSIONS
Character Classes:
A character class defines a set of characters that can match a position in the string.
Example:
[a-z]: Matches any lowercase letter.
[A-Z]: Matches any uppercase letter.
[0-9]: Matches any digit.
[aeiou]: Matches any vowel.
Predefined Character Classes:
\d: Matches any digit (equivalent to [0-9]).
\w: Matches any word character (alphanumeric + underscore).
\s: Matches any whitespace character (spaces, tabs, line breaks).
\b: Matches a word boundary.
CONCEPTS OF REGULAR EXPRESSIONS
Quantifiers:
Quantifiers specify the number of occurrences of a character or group to match.
Example:
a{3}: Matches exactly three 'a's in a row (e.g., "aaa").
a{2,4}: Matches between 2 and 4 'a's in a row (e.g., "aa", "aaa", "aaaa").
Grouping and Capturing:
Parentheses () are used to group parts of the regular expression, allowing you to apply operators to
entire sections of the pattern.
Example:
(abc)+: Matches one or more occurrences of "abc".
Capturing groups store the matched text, which can be referenced later.
CONCEPTS OF REGULAR EXPRESSIONS
Alternation:
The pipe symbol | represents an "OR" operation in regular expressions.
Example:
apple|banana: Matches either "apple" or "banana".
Escape Sequences:
Some characters are reserved in regex (e.g., . or *). To use these characters as literals, they must be
escaped with a backslash (\).
Example:
\. : Matches the literal dot character, not any character.
EXERCISES https://regex101.com/
• ^A - matches a string that starts with 'A'
• .A - matches any single character followed by 'A'
• done$ - matches a string that ends with 'done'
• hallo* - 'hall' followed by zero or more 'o'
• hallo+ - 'hall' followed by one or more 'o'
• hallo? - 'hall' followed by zero or one 'o'
• hallo{2} - 'hall' followed by exactly two 'o'
• hallo{2,} - 'hall' followed by two or more 'o'
• hal(lo)* - 'hal' followed by zero or more 'lo'
• hal(lo){2,5} - 'hal' followed by two to five 'lo'
• a(b|c) or a[bc] - 'a' followed by 'b' or 'c'
EXERCISES https://regex101.com/
• \d - matches a single character that is a digit
• \w - matches a word character
• \s - matches a whitespace character (includes tabs and line breaks)
• [abc] - matches a string that has either an a or a b or a c -> the same as a|b|c
• [a-c] - same as the previous
• [0-9]% - a string that has a character from 0 to 9 before a % sign
• \babc\b - matches "abc" only as a whole word (word boundaries on both sides)
• d(?=r) - matches a d only if it is followed by r
• (?<=r)d - matches a d only if it is preceded by r
• d(?!r) - matches a d only if it is not followed by r
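These lookaround and shorthand patterns can be tried directly with Python's re module. A minimal sketch (the sample strings are invented for illustration):

```python
import re

# Lookahead: 'd' only when the next character is 'r'
assert re.findall(r"d(?=r)", "drive door") == ["d"]

# Lookbehind: 'd' only when the previous character is 'r'
assert re.findall(r"(?<=r)d", "word wood") == ["d"]

# Negative lookahead: 'd' only when NOT followed by 'r'
assert re.findall(r"d(?!r)", "drive door") == ["d"]

# Character class with a literal: a digit immediately before '%'
assert re.findall(r"[0-9]%", "50% off, 7% fee") == ["0%", "7%"]

print("all patterns behave as described")
```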
APPLICATIONS OF REGULAR EXPRESSIONS
Text Normalization
Clean and preprocess raw text before feeding it into text processing models
Removing unwanted punctuation or special characters from text
Transforming all characters to lowercase for uniformity
Example Regex: To remove punctuation from text:
[^\w\s]
This regex matches any character that is not a word character or whitespace.
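As a quick sketch, this pattern can be applied with Python's re.sub (the function name here is illustrative, not a library API):

```python
import re

def remove_punctuation(text):
    # [^\w\s] matches any character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

print(remove_punctuation("Hello, world!"))  # Hello world
```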
APPLICATIONS OF REGULAR EXPRESSIONS
Tokenization
Breaking down sentences into tokens (words, punctuation, etc.) is a fundamental
step in text processing.
Regular expressions help segment the text into individual words and phrases
Example:
Splitting text into words based on spaces and punctuation.
Regex for splitting sentences into words: \w+
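A minimal word tokenizer built on this pattern might look like the following (a rough sketch; production tokenizers handle contractions and punctuation more carefully):

```python
import re

def tokenize(text):
    # \w+ grabs maximal runs of word characters, skipping spaces and punctuation
    return re.findall(r"\w+", text)

print(tokenize("I love NLP."))  # ['I', 'love', 'NLP']
```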
APPLICATIONS OF REGULAR EXPRESSIONS
Pattern Matching and Extraction
Regex is often used to search for specific patterns in text, such as email
addresses, dates, phone numbers, or specific keywords
Example:
Extracting phone numbers from a document:
\d{3}-\d{3}-\d{4}
This regex matches a phone number in the format xxx-xxx-xxxx.
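A sketch of this extraction in Python (the phone numbers below are made up):

```python
import re

def extract_phone_numbers(text):
    # \d{3}-\d{3}-\d{4} matches the xxx-xxx-xxxx format
    return re.findall(r"\d{3}-\d{3}-\d{4}", text)

print(extract_phone_numbers("Call 555-123-4567 or 555-987-6543."))
# ['555-123-4567', '555-987-6543']
```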
APPLICATIONS OF REGULAR EXPRESSIONS
Named Entity Recognition (NER)
 In text processing, regex is used to identify entities such as names, dates, and
places by matching predefined patterns
Example:
Matching dates: \d{2}/\d{2}/\d{4} (matches "12/05/2022").
APPLICATIONS OF REGULAR EXPRESSIONS
Speech-to-Text Transcription Cleanup
 After speech recognition transcribes audio into text, regular expressions can be
used to remove errors like extra spaces, incomplete words, or unwanted symbols
Example:
Removing extra spaces after transcription:
\s{2,}
This regex matches runs of two or more consecutive whitespace characters.
KEY STEPS IN TEXT NORMALIZATION
Lowercasing
Input: “I love the new Apple products.”
Output: “i love the new apple products.”
Removing Punctuation
Input: “Hello, world!”
Output: “Hello world”
Removing Special Characters
Input: “#DataScience is awesome!”
Output: “DataScience is awesome”.
Removing Stop Words (e.g., “the”, “a”, “and”, “in”)
Input: “The quick brown fox jumps over the lazy dog.”
Output: “quick brown fox jumps over lazy dog”
KEY STEPS IN TEXT NORMALIZATION
Expanding Contractions
Input: “I can’t believe it!”
Output: “I cannot believe it!”
Stemming and Lemmatization
"running" becomes "run"
“better” becomes “good”
KEY STEPS IN TEXT NORMALIZATION
Stemming and Lemmatization
Stemming involves reducing words to their root form by chopping off suffixes (e.g., "running"
becomes "run")
Lemmatization considers the meaning of the word and reduces it to its base form (e.g., “better”
becomes “good”)
Spelling Correction
Input: “I love progamming.”
Output: “I love programming”
Handling Numerals
Input: “I have 3 apples.”
Output: “I have three apples” (if converting numbers to words) or
“I have apples” (if removing numbers).
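The steps above can be combined into a small normalization pipeline. This is a toy sketch: the stop-word list and contraction map below are deliberately tiny illustrative samples, not full resources:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "in", "over"}       # tiny illustrative list
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}    # tiny illustrative map

def normalize(text):
    text = text.lower()                                      # lowercasing
    for short, full in CONTRACTIONS.items():                 # expanding contractions
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)                      # removing punctuation
    return [t for t in text.split() if t not in STOP_WORDS]  # removing stop words

print(normalize("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```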
EDIT DISTANCE
Edit Distance is a measure of the difference between two strings (e.g., words or
sequences of text).
It quantifies how many basic operations (insertions, deletions, substitutions) are
needed to transform one string into another.
Edit distance is a fundamental concept in text processing,
Especially in tasks like spell checking, text correction, machine translation, and
speech recognition
TYPES OF EDIT DISTANCE
1.Levenshtein Distance:
It computes the minimum number of single-character edits required to convert one string into
another, where each edit can be one of the following:
 Insertion: Adding a character to a string.
 Deletion: Removing a character from a string.
 Substitution: Replacing one character with another
Example: String 1: “kitten” String 2: “sitting”
The operations required are:
1. Substitute 'k' with 's': "kitten" → "sitten"
2. Substitute 'e' with 'i': "sitten" → "sittin"
3. Insert 'g' at the end: "sittin" → "sitting"
Total distance = 3 (3 operations)
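The Levenshtein distance is usually computed with dynamic programming. A compact sketch:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```

Note that plain Levenshtein gives "ab" → "ba" a distance of 2 (two substitutions); only the Damerau variant, which adds transpositions, brings it down to 1.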
TYPES OF EDIT DISTANCE
2. Damerau-Levenshtein Distance:
The Damerau-Levenshtein Distance is an extension of the Levenshtein distance that also considers
transpositions (swapping two adjacent characters) as a valid operation.
Example:
▪ String 1: “ab” ▪ String 2: “ba”
The Damerau-Levenshtein distance is 1, as only a transposition is required
3. Hamming Distance:
Hamming Distance is a special case of edit distance that only works on strings of the same length
and counts the number of positions at which the corresponding characters are different.
Example:
▪ String 1: “karolin” String 2: “kathrin”
The Hamming distance is 3 because the characters at positions 3, 4 and 5 differ.
COMPUTING EDIT DISTANCE
The distance between "kitten" and "sitting" is 3, as it requires 3 operations: substitute
'k' with 's', substitute 'e' with 'i', and insert 'g' at the end.
WORD ERROR RATE (WER)
WER is a metric used to evaluate the performance of speech-to-text systems.
It is calculated as the edit distance between the reference (correct transcription)
and the hypothesis (ASR output), divided by the total number of words in the
reference
WORD ERROR RATE (WER)
Example
I am now going to bed
The total number of words = 6
STT Model 1: I am now going to bed.
WER = 0% (Sum of Errors: 0)
STT Model 2: I am now to bed.
WER = 16.7% (Sum of Errors: 1, Deletion = 1: going)
STT Model 3: I am now to the bed.
WER = 33.3% (Sum of Errors: 2, Deletion = 1: going, Insertion = 1: the)
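The examples above follow from treating WER as a word-level edit distance divided by the reference length. A sketch reusing the same dynamic program on words:

```python
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance = substitutions + deletions + insertions
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n] / m   # divide by the number of reference words

ref = "I am now going to bed"
print(round(word_error_rate(ref, "I am now going to bed"), 3))  # 0.0
print(round(word_error_rate(ref, "I am now to bed"), 3))        # 0.167
print(round(word_error_rate(ref, "I am now to the bed"), 3))    # 0.333
```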
LEMMATIZATION
Process of reducing a word to its base or root form, known as the lemma, while
considering the context and meaning of the word
Lemmatization uses a vocabulary and morphological analysis of words to return
their base form
• The lemma of "running" is "run".
• The lemma of "better" is "good" (based on context and meaning)
KEY FEATURES OF LEMMATIZATION
Context Awareness:
Lemmatization considers the meaning and part of speech (POS) of the word.
For example, "flies" as a noun is reduced to "fly," while as a verb, it is also
reduced to "fly."
Dependency on POS Tagging:
The lemmatizer requires POS tags to determine the correct lemma.
For example, "saw" can be a noun (the tool) or a verb (past tense of "see"). The
lemma is determined based on context.
Dictionary-Based Approach:
Lemmatization relies on dictionaries or lexicons to determine the base form of a
word.
POS TAGS
PROCESS OF LEMMATIZATION
POS Tagging:
The word's part of speech is identified (e.g., noun, verb, adjective).
Example: Input: "The boys are playing in the park."
POS Tags: [The (DT), boys (NNS), are (VB), playing (VBG), in (IN), the (DT), park (NN)]
Morphological Analysis:
The morphological structure of the word is analyzed to determine its lemma.
Example: "Playing" → root: "play" (verb)
Lookup in Lemmatization Dictionary:
The lemma is looked up in the lexicon or dictionary based on the POS tag and
root form.
EXAMPLES OF LEMMATIZATION
Basic Examples:
Words like "running," "runs," and "ran" → Lemma: "run".
Words like "better" → Lemma: "good" (based on context)
Sentence-Level Example:
Input Sentence: "The children were playing in the gardens."
Lemmatized Output: "The child be play in the garden."
Ambiguity Example:
Word: "barked"
As a verb (past tense): Lemma → "bark."
As a noun (the sound of a dog): Lemma → "bark."
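The dictionary-based, POS-aware approach described above can be sketched with a toy lexicon. The (word, POS) table below is invented for illustration; real systems such as NLTK's WordNetLemmatizer use full morphological dictionaries:

```python
# Toy POS-keyed lemma lexicon (illustrative only)
LEXICON = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("saw", "NOUN"): "saw",    # the tool
    ("saw", "VERB"): "see",    # past tense of "see"
    ("flies", "NOUN"): "fly",
    ("flies", "VERB"): "fly",
}

def lemmatize(word, pos):
    # Fall back to the lowercased word itself when the (word, POS) pair is unknown
    return LEXICON.get((word.lower(), pos), word.lower())

print(lemmatize("saw", "VERB"))    # see
print(lemmatize("saw", "NOUN"))    # saw
print(lemmatize("better", "ADJ"))  # good
```

The POS tag resolves the ambiguity: the same surface form "saw" maps to different lemmas depending on its part of speech.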
APPLICATIONS OF LEMMATIZATION
Search Engines:
Lemmatization helps improve search results by matching queries to documents,
regardless of word variations.
Example: A user searches for "running," and the engine retrieves documents
containing "run," "runs," or "ran."
Text Classification:
Reducing words to their lemma helps create consistent input for machine learning
models.
Example: In sentiment analysis, words like "happiest" and "happier" are reduced to
"happy," ensuring consistent feature extraction
APPLICATIONS OF LEMMATIZATION
Speech-to-Text Systems:
Lemmatization ensures that speech transcriptions are converted into meaningful,
standardized text for further processing.
Example: Converting "talking" in a transcript to "talk" for language modeling
Machine Translation:
Lemmatization ensures consistency when translating between languages by
standardizing word forms.
Example: Translating "jumping" and "jumps" into a consistent word in the target
language
APPLICATIONS OF LEMMATIZATION
Question Answering Systems:
Lemmatization enables systems to understand user queries better by reducing
variations in word forms.
Example: A question about "children playing" can match documents containing
"child play."
STEMMING
Stemming is the process of reducing words to their base or root form by
removing affixes (prefixes or suffixes).
Stemming does not consider the context or meaning of the word
It applies a set of heuristic rules to trim words down to their "stem."
Stemming is widely used in text preprocessing tasks for natural language
processing (NLP) applications, such as
search engines,
text classification, and
information retrieval
KEY FEATURES OF STEMMING
Rule-Based Approach:
Stemming uses rules to remove common prefixes and suffixes.
Example: Words ending in "ing," "ed," or "ly" are reduced by stripping these
endings
Not Context-Aware:
Stemming does not consider the word’s meaning or part of speech (POS).
Example: The word "better" is stemmed to "bet," even though "good" is the actual
lemma
Produces Non-Words:
Stems are often not valid words in the language.
Example: "Studies" is stemmed to "studi."
EXAMPLES OF STEMMING
Basic Examples:
"Running" → "run" "Studies" → "studi" "Caring" → "car"
Sentence-Level Example:
Input: "The boys are running quickly."
Output: "The boy are run quick."
Different Word Forms:
Connection," "connections," "connected," and "connecting" are all reduced to
"connect."
COMMON STEMMING ALGORITHMS
Porter Stemmer:
One of the most widely used stemming algorithms. Applies a series of rules to
remove common suffixes. Example:
Input: "caresses," "flies," "dies"
Output: "caress," "fli," "die"
Lancaster Stemmer:
A more aggressive stemming algorithm that produces shorter stems.
Example:
Input: "running" Output: "run"
COMMON STEMMING ALGORITHMS
Snowball Stemmer:
An improved version of the Porter stemmer, also known as Porter2.
Supports multiple languages and is less aggressive than the Lancaster stemmer.
Regex-Based Stemmer:
Uses regular expressions to define simple rules for stemming.
Example: Removing "-ing," "-ed," or "-ly" endings
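A regex-based stemmer of this kind might be sketched as follows. This is a deliberately crude rule set (not the Porter algorithm): strip one common suffix, then undo consonant doubling so "runn" becomes "run":

```python
import re

def regex_stem(word):
    # Strip one common suffix ("-ing", "-ed", "-ly", "-es", "-s")
    stem = re.sub(r"(ing|ed|ly|es|s)$", "", word.lower())
    # Undo consonant doubling left behind by suffix removal ("runn" -> "run")
    stem = re.sub(r"(.)\1$", r"\1", stem)
    return stem

print(regex_stem("running"))  # run
print(regex_stem("caring"))   # car  (over-stemming, as noted earlier)
print(regex_stem("quickly"))  # quick
```

As with all stemmers, the output need not be a valid word: "studies" becomes "studi" under these rules.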
APPLICATIONS OF STEMMING
Search Engines:
Stemming helps search engines retrieve relevant documents by matching different
word forms.
Example: A search for "running" retrieves results containing "run," "runs," or "ran."
Text Classification:
Reducing words to their stems improves the efficiency of text classification models
by reducing dimensionality.
Example: In sentiment analysis, "happy" and "happiness" are treated as the
same feature
APPLICATIONS OF STEMMING
Information Retrieval:
Stemming enhances the matching of user queries to relevant documents by
normalizing word forms.
Example: Searching for "connections" in a database also retrieves documents
containing "connected."
Spam Detection:
Stemming reduces variations in word forms, making it easier to detect patterns in
spam messages.
Example: "offer," "offered," and "offering" are normalized to "offer."
COMPARISON: LEMMATIZATION VS. STEMMING
Aspect | Lemmatization | Stemming
Output | Produces meaningful words (e.g., "better" → "good") | May produce non-words (e.g., "better" → "bet")
Context awareness | Considers context and POS | Ignores context and POS
Accuracy | Higher accuracy in identifying root words | Lower accuracy, as it uses simple rules
Speed | Slower (requires dictionary lookup) | Faster (rule-based)
EXAMPLE: LEMMATIZATION VS. STEMMING
Input Word: "caring" → Lemmatization: "care"; Stemming: "car"
Input Word: "flying" → Lemmatization: "fly"; Stemming: "fli"
Input Word: "better" → Lemmatization: "good"; Stemming: "bet"
N-GRAM LANGUAGE MODEL
Statistical language model used to predict the likelihood of a sequence of words
or tokens.
It divides text into chunks of n words or tokens (N-grams) and estimates the
probability of a word based on its preceding n-1 words
Key Concepts of N-grams
An N-gram is a contiguous sequence of n items (words, characters, or phonemes)
from a given text or speech input.
Examples:
Unigram (n=1): ["I", "love", "NLP"]
Bigram (n=2): ["I love", "love NLP"]
Trigram (n=3): ["I love NLP"]
STEPS TO BUILD AN N-GRAM MODEL
1. Tokenization:
Split the text into words or tokens.
Example: "I love NLP" → ["I", "love", "NLP"]
2. Generate N-grams:
Extract sequences of n contiguous tokens.
Example for bigrams: ["I love", "love NLP"]
3. Calculate Frequencies:
Count occurrences of each N-gram
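Steps 1–3 can be sketched for bigrams in plain Python; the toy corpus below is invented, and the estimate is the simple maximum-likelihood ratio P(w2 | w1) = count(w1 w2) / count(w1):

```python
from collections import Counter

def bigram_probabilities(corpus):
    # 1. Tokenization
    tokens = corpus.split()
    # 2. Generate N-grams (here bigrams) and 3. count frequencies
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Estimate P(w2 | w1) = count(w1 w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = bigram_probabilities("I love NLP I love speech")
print(probs[("I", "love")])    # 1.0  (every "I" is followed by "love")
print(probs[("love", "NLP")])  # 0.5  (one of two "love" occurrences)
```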
APPLICATIONS OF N-GRAM MODELS
Speech Recognition:
Predicts the most likely next word to improve transcription accuracy.
Example: In "I want to", the trigram model predicts "go" or "eat" based on training
data
Autocomplete and Text Prediction:
Suggests the next word based on previous inputs.
Example: Typing "How are" suggests "you" in predictive text
Spelling Correction:
Identifies the most likely word in the context of surrounding words.
Example: "Ths is a tst" → "This is a test" using bigram probabilities
APPLICATIONS OF N-GRAM MODELS
Machine Translation:
Helps in aligning and translating phrases by considering word sequences.
Example: "Je t'aime" → "I love you," considering bigrams like "I love."
Sentiment Analysis:
Considers word combinations to determine sentiment.
Example: "Very happy" is more positive than "Very sad."
Language Modeling:
Predicts the next word in a sequence, commonly used in NLP tasks.
Example: In "The cat sat on the," the model predicts "mat."
VECTOR SEMANTICS AND EMBEDDINGS
Vector Semantics is a method of representing the meaning of words as
mathematical vectors in a continuous, high-dimensional space.
These vectors capture semantic relationships between words, enabling machines
to understand and analyze language more effectively.
Embeddings are the actual vector representations of words, phrases, or
sentences.
They map discrete linguistic units into a continuous vector space, where similar
words are closer to each other.
KEY CONCEPTS IN VECTOR SEMANTICS AND EMBEDDINGS
Word Vectors:
Words are represented as points in a multi-dimensional space.
The closer two words are in this space, the more similar their meanings.
Context-Based Representations:
Word embeddings are generated based on the contexts in which words appear,
capturing semantic and syntactic relationships.
Dimensionality Reduction:
Instead of representing words as high-dimensional sparse vectors (e.g., one-hot
encoding), embeddings represent them as dense vectors in a smaller dimensional
space
WORD EMBEDDING MODELS
Count-Based Models:
Use co-occurrence matrices to represent word relationships.
Example: Latent Semantic Analysis (LSA).
Predictive Models:
Predict word embeddings directly by training neural networks.
Examples: Word2Vec, GloVe.
Contextual Models:
Capture word meaning based on surrounding context.
Examples: BERT, ELMo.
POPULAR WORD EMBEDDING TECHNIQUES
Word2Vec:
Developed by Google, Word2Vec creates word embeddings using two methods:
Skip-Gram: Predicts the context words from a given word.
CBOW (Continuous Bag of Words): Predicts a target word from its context words.
Example:
Input: "The cat sat on the mat."
Output: Vectors for words like "cat," "sat," and "mat," where "cat" and "mat" are
closer in the vector space
POPULAR WORD EMBEDDING TECHNIQUES
GloVe (Global Vectors for Word Representation):
Combines the benefits of count-based and predictive models by factoring in co-
occurrence statistics.
Example:
Words like "king" and "queen" are similar but differ along the gender dimension
FastText:
Represents words as a combination of character n-grams, enabling the model to
understand rare or out-of-vocabulary words.
Example:
Words like "walking" and "walked" are represented similarly due to shared
subword components.
POPULAR WORD EMBEDDING TECHNIQUES
BERT (Bidirectional Encoder Representations from Transformers):
Generates contextual embeddings by understanding the meaning of a word in its
sentence.
Example:
The word "bank" in "river bank" and "financial bank" has different embeddings
based on context
EXAMPLES OF VECTOR SEMANTICS
Word Similarity:
Words with similar meanings have closer embeddings.
Example: "Happy" and "Joyful" will have high cosine similarity.
Synonyms and Analogies:
Word embeddings can identify synonyms and solve analogies.
Example: Analogy: "Man is to King as Woman is to ?" → Answer: "Queen"
Document Similarity:
Entire documents can be represented as vectors (e.g., sentence or paragraph
embeddings).
Example: Comparing the similarity of two documents for plagiarism detection
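Similarity and analogy both reduce to vector arithmetic plus cosine similarity. A toy demo with hand-crafted 3-dimensional vectors (illustrative values, not learned embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vec = {
    "king":  [0.9, 0.8, 0.1],   # dims (illustrative): royalty, maleness, femaleness
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# "Man is to King as Woman is to ?"  ->  king - man + woman
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max((w for w in vec if w != "king"), key=lambda w: cosine(vec[w], target))
print(best)  # queen
```

With real embeddings the same arithmetic works because learned dimensions encode relations like gender; here the dimensions are chosen by hand to make the effect visible.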
APPLICATIONS OF VECTOR SEMANTICS AND EMBEDDINGS
Search Engines:
Embeddings help search engines understand synonyms and improve query results.
Example: Searching for "laptop" retrieves results for "notebook."
Sentiment Analysis:
Embeddings capture the sentiment of words and sentences.
Example: Positive words ("great," "excellent") cluster together, distinct from
negative words.
Machine Translation:
Models like Word2Vec map words from different languages into a shared
embedding space for translation.
Example: "Bonjour" (French) and "Hello" (English) are close in vector space
Speech Recognition:
Embeddings improve recognition systems by linking phonemes to meaningful
words.
Example: Distinguishing the phrase "recognize speech" from the acoustically similar "wreck a nice beach."
Chatbots and Virtual Assistants:
Use embeddings to understand and respond to user queries.
Example: Recognizing "What's up?" as a casual greeting
ADVANTAGES OF VECTOR SEMANTICS
Efficient Representation:
Reduces dimensionality compared to sparse one-hot encodings.
Captures Semantic Relationships:
Words with similar meanings are close in the vector space.
Adaptable to Various Tasks:
Supports a wide range of NLP and speech tasks.
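The dimensionality point is easy to see concretely: a one-hot vector is as long as the vocabulary and every pair of distinct words is orthogonal, so one-hot encoding captures no similarity at all. A small sketch with a toy vocabulary:

```python
vocab = ["happy", "joyful", "sad", "cat", "mat"]  # real vocabularies run to 10^5+ words

def one_hot(word, vocab):
    """Sparse encoding: one dimension per vocabulary word, a single 1."""
    return [1 if w == word else 0 for w in vocab]

print(one_hot("happy", vocab))  # [1, 0, 0, 0, 0] -- length grows with the vocabulary

# Distinct one-hot vectors are always orthogonal (dot product 0), so "happy"
# looks exactly as unrelated to "joyful" as to "cat". Dense embeddings of a
# few hundred fixed dimensions fix both problems at once.
dot = sum(a * b for a, b in zip(one_hot("happy", vocab), one_hot("joyful", vocab)))
print(dot)  # 0
```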

Speech and Language Processing - Regular Expression

  • 1. R.M.K. COLLEGE OF ENGINEERING AND TECHNOLOGY 22AI903 TEXT AND SPEECH ANALYTICS (Professional Elective IV) Dr. V. VIJAYARAJA Professor Artificial Intelligence and Data Science Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 2. OBJECTIVES To introduce the tools and techniques for performing text and speech analytics in diverse contexts. To understand the tools and technologies involved in developing text and speech applications. To demonstrate the use of computing for building applications in text and speech processing. To use Information Retrieval Techniques to build and evaluate text processing systems. To apply advanced speech recognition methodologies in practical applications. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 3. OUTCOMES CO1: Apply the fundamental techniques in text processing for various NLP tasks. CO2: Implement advanced language models and improve text classification accuracy. CO3: Designing text processing systems using state-of-the-art techniques. CO4: Design, implement, and evaluate ASR and TTS systems. CO5: Apply advanced speech recognition methodologies in practical applications. CO6: Use Information Retrieval Techniques to build and evaluate text processing systems Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 4. UNIT I TEXT PROCESSING Speech and Language Processing Regular Expression Text normalization Edit Distance Lemmatization Stemming N-gram Language Models Vector Semantics and Embeddings. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 5. UNIT II TEXT CLASSIFICATION Text Classification Tasks Language Model Neural Language Models RNNs as Language Models Transformers and Large Language Models. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 6. UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS Information Retrieval Dense Vectors Neural IR for Question Answering Evaluating Retrieval based Question Answering Frame-based Dialogue Systems Dialogue Acts and Dialogue State Chatbots – Dialogue System Design. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 7. UNIT IV TEXT TO SPEECH SYNTHESIS Automatic Speech Recognition Task Feature Extraction for ASR: Log Mel Spectrum Speech Recognition Architecture CTC ASR Evaluation: Word Error Rate TTS Speech Tasks. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 8. UNIT V SPEECH RECOGNITION LPC for speech recognition Hidden Markov Model (HMM) Training procedure for HMM subword unit model based on HMM Language models for large vocabulary speech recognition Overall recognition system based on subword units Context dependent subword units Semantic post processor for speech recognition. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 9. TEXT BOOKS Jurafsky, D. and J. H. Martin, Speech and language processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition Pearson Publication, Third Edition, 2022. Lawrence Rabiner, Biing-Hwang Juang and B.Yegnanarayana, “Fundamentals of Speech Recognition”, Pearson Education, 2009. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 10. REFERENCES John Atkinson-Abutridy, Text Analytics: An Introduction to the Science and Applications of Unstructured Information Analysis, CRC Press, 2022. Jim Schwoebel, NeuroLex, Introduction to Voice Computing in Python, 2018 Lawrence R. Rabiner, Ronald W. Schafe, Theory and Applications of Digital Speech Processing, First Edition, Pearson, 2010.. Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 11. ARTIFICIAL INTELLIGENCE Artificial intelligence is a specific branch of computer science concerned with replicating the thought process and decision-making ability of humans through computer algorithms Artificial intelligence makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 12. ARTIFICIAL INTELLIGENCE Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 13. NATURAL LANGUAGE PROCESSING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology NLP stands for Natural Language Processing, which deals with the interaction between computers and humans in natural language
  • 14. SPEECH AND LANGUAGE PROCESSING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Involves the development of techniques that allow computers to • understand, • interpret, and • generate human languages (both spoken and written) It encompasses multiple domains of research and applications such as • speech recognition, • natural language processing (NLP) and • text-to-speech synthesis
  • 15. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Speech Recognition (Automatic Speech Recognition - ASR):converting spoken words into text. Used in voice assistants like Siri, Google Assistant, and Alexa Natural Language Processing (NLP):interaction between computers and human language. Used in chatbots Text-to-Speech (TTS):Converts written text into spoken words. Used in Google Translate Speech Synthesis:human-like speech from text. Used inGoogle’s Text-to-Speech service available in smartphone
  • 16. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Regular Expressions: searching and manipulating text data. Used in the search for specific phrases or patterns in voice transcripts. Text Normalization: converting raw text into a standard format Edit Distance : measures the number of operations required to convert one string into another Stemming : reduce words to their root form (e.g., “running” → “run”).
  • 17. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Lemmatization: involves reducing words to their base form, considering its meaning (e.g., “better” → “good”). N-gram Language Models: used to predict the next word or sequence in a sentence Vector Semantics and Embeddings: involves representing words or phrases as vectors in a multi-dimensional space
  • 18. REGULAR EXPRESSIONS (Regex) Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Sequence of characters that forms a search pattern. Used for pattern matching with strings or for searching and manipulating text Essential in tasks such as text searching, text extraction, and data cleaning Particularly in speech and language processing, where preprocessing text or speech transcriptions is often required
  • 19. CONCEPTS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Literal Characters: Basic characters that match themselves in a string. Example: The regex apple matches the string "apple". Meta-characters: These are special characters that have specific meanings. Commonly used meta-characters include: . (dot): Matches any single character (except newline). ^: Anchors the match at the beginning of the string. $: Anchors the match at the end of the string. *: Matches zero or more of the preceding character. +: Matches one or more of the preceding character. ?: Matches zero or one of the preceding character.
  • 20. CONCEPTS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Character Classes: A character class defines a set of characters that can match a position in the string. Example: [a-z]: Matches any lowercase letter. [A-Z]: Matches any uppercase letter. [0-9]: Matches any digit. [aeiou]: Matches any vowel. Predefined Character Classes: d: Matches any digit (equivalent to [0-9]). w: Matches any word character (alphanumeric + underscore). s: Matches any whitespace character (spaces, tabs, line breaks). b: Matches a word boundary.
  • 21. CONCEPTS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Quantifiers: Quantifiers specify the number of occurrences of a character or group to match. Example: a{3}: Matches exactly three 'a's in a row (e.g., "aaa"). a{2,4}: Matches between 2 and 4 'a's in a row (e.g., "aa", "aaa", "aaaa"). Grouping and Capturing: Parentheses () are used to group parts of the regular expression, allowing you to apply operators to entire sections of the pattern. Example: (abc)+: Matches one or more occurrences of "abc". Capturing groups store the matched text, which can be referenced later.
  • 22. CONCEPTS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Alternation: The pipe symbol | represents an "OR" operation in regular expressions. Example: apple|banana: Matches either "apple" or "banana". Escape Sequences: Some characters are reserved in regex (e.g., . or *). To use these characters as literals, they must be escaped with a backslash . Example: .: Matches the literal dot character, not any character
  • 23. EXCERCISES https://regex101.com/ Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology • ^A – It’s means starts with ‘A’ in paragraph • .A –Any character attached with ‘A’ in paragraph • done$ - End with ‘done’ in paragraph • hallo* - hall followed by zero or more ‘o’ • hallo+ - hall followed by one or more ‘o’ • hallo? – hall followed by zero or one ‘o’ • hallo{2} -hall followed by 2 ‘o’ • hallo{2,} - hall followed by 2 or more ‘o’ • hal(lo)* - hal followed by zero or more ‘lo’ • hal(lo){2,5} - hal followed by zero or more ‘lo’ • a(b|c) or a[bc] - a followed by b or c
  • 24. EXCERCISES https://regex101.com/ Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology • d - matches a single character that is a digit • w - matches a word character • s - matches a whitespace character (includes tabs and line breaks) • [abc] - matches a string that has either an a or a b or a c -> is the same as a|b|c • [a-c] - same as previous • [0-9]% - a string that has a character from 0 to 9 before a % sign • babcb – search whole value • d(?=r) - matches a d only if is followed by r • (?<=r)d - matches a d only if is before by r • d(?!r) - matches a d only if is not followed by r
  • 25. APPLICATIONS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Text Normalization Clean and preprocess raw text before feeding it into text processing models Removing unwanted punctuation or special characters from text Transforming all characters to lowercase for uniformity Example Regex: To remove punctuation from text: regex Copy code [^ws] This regex matches any character that is not a word character or whitespace
  • 26. APPLICATIONS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Tokenization Breaking down sentences into tokens (words, punctuation, etc.) is a fundamental step in text processing. Regular expressions help segment the text into individual words and phrases Example: Splitting text into words based on spaces and punctuation. Regex for splitting sentences into words: w+
  • 27. APPLICATIONS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Pattern Matching and Extraction Regex is often used to search for specific patterns in text, such as email addresses, dates, phone numbers, or specific keywords Example: Extracting phone numbers from a document: regex Copy code d{3}-d{3}-d{4} This regex matches a phone number in the format xxx-xxx-xxxx
  • 28. APPLICATIONS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Named Entity Recognition (NER)  In text processing, regex is used to identify entities such as names, dates, and places by matching predefined patterns Example: Matching dates: d{2}/d{2}/d{4} (matches "12/05/2022").
  • 29. APPLICATIONS OF REGULAR EXPRESSIONS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Speech-to-Text Transcription Cleanup  After speech recognition transcribes audio into text, regular expressions can be used to remove errors like extra spaces, incomplete words, or unwanted symbols Example: Removing extra spaces after transcription: regex Copy code s{2,} .
  • 30. KEY STEPS IN TEXT NORMALIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Lowercasing Input: “I love the new Apple products.” Output: “i love the new apple products.” Removing Punctuation Input: “Hello, world!” Output: “Hello world” Removing Special Characters Input: “#DataScience is awesome!” Output: “DataScience is awesome”. Removing Stop Words (e.g., “the”, “a”, “and”, “in”) Input: “The quick brown fox jumps over the lazy dog.” Output: “quick brown fox jumps over lazy dog”
  • 31. KEY STEPS IN TEXT NORMALIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Expanding Contractions Input: “I can’t believe it!” Output: “I cannot believe it!” Stemming and Lemmatization "running" becomes "run" “better” becomes “good” Removing Special Characters Input: “#DataScience is awesome!” Output: “DataScience is awesome”. Removing Stop Words (e.g., “the”, “a”, “and”, “in”) Input: “The quick brown fox jumps over the lazy dog.” Output: “quick brown fox jumps over lazy dog”
  • 32. KEY STEPS IN TEXT NORMALIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Stemming and Lemmatization Stemming involves reducing words to their root form by chopping off suffixes (e.g., "running" becomes "run") Lemmatization considers the meaning of the word and reduces it to its base form (e.g., “better” becomes “good”) Spelling Correction Input: “I love progamming.” Output: “I love programming” Handling Numerals Input: “I have 3 apples.” Output: “I have three apples” (if converting numbers to words) or “I have apples” (if removing numbers).
  • 33. EDIT DISTANCE Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Edit Distance is a measure of the difference between two strings (e.g., words or sequences of text). It quantifies how many basic operations (insertions, deletions, substitutions) are needed to transform one string into another. Edit distance is a fundamental concept in text processing, Especially in tasks like spell checking, text correction, machine translation, and speech recognition
  • 34. TYPES OF EDIT DISTANCE Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology 1.Levenshtein Distance: It computes the minimum number of single-character edits required to convert one string into another, where each edit can be one of the following:  Insertion: Adding a character to a string.  Deletion: Removing a character from a string.  Substitution: Replacing one character with another Example: String 1: “kitten” String 2: “sitting” The operations required are: 1. Substitute 'k' with 's': "kitten" → "sitten" 2. Substitute ‘e' with ‘i': "sitten" → "sittin" 3. Insert 'g' at the end: "sittin" → "sitting" Total distance = 3 (3 operations)
  • 35. TYPES OF EDIT DISTANCE Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology 2. Damerau-Levenshtein Distance: The Damerau-Levenshtein Distance is an extension of the Levenshtein distance that also considers transpositions (swapping two adjacent characters) as a valid operation. Example: ▪ String 1: “ab” ▪ String 2: “ba” The Damerau-Levenshtein distance is 1, as only a transposition is required 3. Hamming Distance: Hamming Distance is a special case of edit distance that only works on strings of the same length and counts the number of positions at which the corresponding characters are different. Example: ▪ String 1: “karolin” String 2: “kathrin” The Hamming distance is 3 because the characters at positions 3, 4 and 5 differ.
  • 36. COMPUTING EDIT DISTANCE Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology The distance between “kitten” and “sitting” is 3, as it requires 3 operations (Replace ‘s‘ by ‘k', Replace ‘E‘ by ‘I', and Remove 'g' at the end
  • 37. WORD ERROR RATE (WER) Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology WER is a metric used to evaluate the performance of speech-to-text systems. It is calculated as the edit distance between the reference (correct transcription) and the hypothesis (ASR output), divided by the total number of words in the reference
  • 38. WORD ERROR RATE (WER) Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Example I am now going to bed The total number of words = 6 STT Model 1: I am now going to bed. WER = 0% (Sum of Errors: 0) STT Model 2: I am now to bed. WER = 16.7% (Sum of Errors: 1, Deletion = 1: going) STT Model 2: I am now to the bed. WER = 33.3% (Sum of Errors: 2, Deletion = 1: going, Insertion = 1: the)
  • 39. LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Process of reducing a word to its base or root form, known as the lemma, while considering the context and meaning of the word Lemmatization uses a vocabulary and morphological analysis of words to return their base form • The lemma of "running" is "run". • The lemma of "better" is "good" (based on context and meaning)
  • 40. KEY FEATURES OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Context Awareness: Lemmatization considers the meaning and part of speech (POS) of the word. For example, "flies" as a noun is reduced to "fly," while as a verb, it is also reduced to "fly." Dependency on POS Tagging: The lemmatizer requires POS tags to determine the correct lemma. For example, "saw" can be a noun (the tool) or a verb (past tense of "see"). The lemma is determined based on context. Dictionary-Based Approach: Lemmatization relies on dictionaries or lexicons to determine the base form of a word.
  • 41. POS TAGS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
  • 42. PROCESS OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology POS Tagging: The word's part of speech is identified (e.g., noun, verb, adjective). Example: Input: "The boys are playing in the park." POS Tags: [The (DT), boys (NNS), are (VB), playing (VBG), in (IN), the (DT), park (NN)] Morphological Analysis: The morphological structure of the word is analyzed to determine its lemma. Example: "Playing" → root: "play" (verb) Lookup in Lemmatization Dictionary: The lemma is looked up in the lexicon or dictionary based on the POS tag and root form.
  • 43. EXAMPLES OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Basic Examples: Words like "running," "runs," and "ran" → Lemma: "run". Words like "better" → Lemma: "good" (based on context) Sentence-Level Example: Input Sentence: "The children were playing in the gardens." Lemmatized Output: "The child be play in the garden." Ambiguity Example: Word: "barked" As a verb (past tense): Lemma → "bark." As a noun (the sound of a dog): Lemma → "bark."
  • 44. APPLICATIONS OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Search Engines: Lemmatization helps improve search results by matching queries to documents, regardless of word variations. Example: A user searches for "running," and the engine retrieves documents containing "run," "runs," or "ran.“ Text Classification: Reducing words to their lemma helps create consistent input for machine learning models. Example: In sentiment analysis, words like "happiest" and "happier" are reduced to "happy," ensuring consistent feature extraction
  • 45. APPLICATIONS OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Speech-to-Text Systems: Lemmatization ensures that speech transcriptions are converted into meaningful, standardized text for further processing. Example: Converting "talking" in a transcript to "talk" for language modeling Machine Translation: Lemmatization ensures consistency when translating between languages by standardizing word forms. Example: Translating "jumping" and "jumps" into a consistent word in the target language
  • 46. APPLICATIONS OF LEMMATIZATION Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Question Answering Systems: Lemmatization enables systems to understand user queries better by reducing variations in word forms. Example: A question about "children playing" can match documents containing "child play."
  • 47. STEMMING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Stemming is the process of reducing words to their base or root form by removing affixes (prefixes or suffixes). Stemming does not consider the context or meaning of the word It applies a set of heuristic rules to trim words down to their "stem.“ Stemming is widely used in text preprocessing tasks for natural language processing (NLP) applications, such as search engines, text classification, and information retrieval
  • 48. KEY FEATURES OF STEMMING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Rule-Based Approach: Stemming uses rules to remove common prefixes and suffixes. Example: Words ending in "ing," "ed," or "ly" are reduced by stripping these endings Not Context-Aware: Stemming does not consider the word’s meaning or part of speech (POS). Example: The word "better" is stemmed to "bet," even though "good" is the actual lemma Produces Non-Words: Stems are often not valid words in the language. Example: "Studies" is stemmed to "studi."
  • 49. EXAMPLES OF STEMMING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Basic Examples: "Running" → "run" "Studies" → "studi" "Caring" → "car" Sentence-Level Example: Input: "The boys are running quickly." Output: "The boy are run quick.“ Different Word Forms: Connection," "connections," "connected," and "connecting" are all reduced to "connect."
  • 50. COMMON STEMMING ALGORITHMS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Porter Stemmer: One of the most widely used stemming algorithms. Applies a series of rules to remove common suffixes. Example: Input: "caresses," "flies," "dies" Output: "caress," "fli," "die“ Lancaster Stemmer: A more aggressive stemming algorithm that produces shorter stems. Example: Input: "running" Output: "run"
  • 51. COMMON STEMMING ALGORITHMS Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Snowball Stemmer: An improved version of the Porter stemmer, also known as the Porter stemmer. Supports multiple languages and is less aggressive than the Lancaster stemmer. Regex-Based Stemmer: Uses regular expressions to define simple rules for stemming. Example: Removing "-ing," "-ed," or "-ly" endings
  • 52. APPLICATIONS OF STEMMING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Search Engines: Stemming helps search engines retrieve relevant documents by matching different word forms. Example: A search for "running" retrieves results containing "run," "runs," or "ran." Text Classification: Reducing words to their stems improves the efficiency of text classification models by reducing dimensionality. Example: In sentiment analysis, "happy" and "happiness" are treated as the same feature
  • 53. APPLICATIONS OF STEMMING Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology Information Retrieval: Stemming enhances the matching of user queries to relevant documents by normalizing word forms. Example: Searching for "connections" in a database also retrieves documents containing "connected." Spam Detection: Stemming reduces variations in word forms, making it easier to detect patterns in spam messages. Example: "offer," "offered," and "offering" are normalized to "offer."
  • 54. COMPARISON: LEMMATIZATION VS. STEMMING Output: Lemmatization produces meaningful words (e.g., "better" → "good"); stemming may produce non-words (e.g., "better" → "bet"). Context Awareness: Lemmatization considers context and POS; stemming ignores them. Accuracy: Lemmatization is highly accurate in identifying root words; stemming is less accurate as it uses simple rules. Speed: Lemmatization is slower (requires dictionary lookup); stemming is faster (rule-based).
  • 55. EXAMPLE: LEMMATIZATION VS. STEMMING Input Word: "caring" Lemmatization: "care" Stemming: "car" Input Word: "flying" Lemmatization: "fly" Stemming: "fli" Input Word: "better" Lemmatization: "good" Stemming: "bet"
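The contrast on slide 55 can be made concrete with a toy sketch. The lemma dictionary and suffix rules below are made-up minimal examples (a real lemmatizer uses a full lexicon such as WordNet, and a real stemmer like Porter has many more rules), but they show why lemmatization returns real words while stemming may not.

```python
# Toy lemma dictionary: lookup returns a real word (the lemma).
LEMMAS = {"caring": "care", "flying": "fly", "better": "good"}

def lemmatize(word):
    return LEMMAS.get(word, word)

def stem(word):
    # Crude rule-based stemming: chop a common suffix; may yield non-words.
    for suffix in ("ing", "ed", "er"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant ("bett" -> "bet").
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            return word
    return word

for w in ("caring", "flying", "better"):
    print(w, "-> lemma:", lemmatize(w), "| stem:", stem(w))
```

The stemmer's output for "better" ("bet") is not an English word related to "better" at all, while the lemmatizer maps it to its true root "good" via the dictionary.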
  • 56. N-GRAM LANGUAGE MODEL Statistical language model used to predict the likelihood of a sequence of words or tokens. It divides text into chunks of n words or tokens (N-grams) and estimates the probability of a word based on its preceding n-1 words. Key Concepts of N-grams: An N-gram is a contiguous sequence of n items (words, characters, or phonemes) from a given text or speech input. Examples: Unigram (n=1): ["I", "love", "NLP"] Bigram (n=2): ["I love", "love NLP"] Trigram (n=3): ["I love NLP"]
  • 57. N-GRAM LANGUAGE MODEL
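The model described above is usually written with the Markov assumption: the probability of each word depends only on the previous n-1 words. A standard formulation (given here as the textbook form; the bigram case n=2 is shown on the right), with probabilities estimated from counts:

```latex
P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}
```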
  • 58. STEPS TO BUILD AN N-GRAM MODEL 1. Tokenization: Split the text into words or tokens. Example: "I love NLP" → ["I", "love", "NLP"] 2. Generate N-grams: Extract sequences of n contiguous tokens. Example for bigrams: ["I love", "love NLP"] 3. Calculate Frequencies: Count occurrences of each N-gram.
  • 59. STEPS TO BUILD AN N-GRAM MODEL
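The steps above can be sketched in a few lines of Python. The tiny corpus here is an illustrative assumption (a real model is trained on much more text and needs smoothing for unseen N-grams):

```python
from collections import Counter

text = "I love NLP and I love speech"  # toy training corpus

# 1. Tokenization
tokens = text.split()

# 2. Generate bigrams (n=2)
bigrams = list(zip(tokens, tokens[1:]))

# 3. Calculate frequencies
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

# 4. Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("I", "love"))    # every "I" is followed by "love"
print(bigram_prob("love", "NLP"))  # "love" is followed by "NLP" half the time
```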
  • 60. APPLICATIONS OF N-GRAM MODELS Speech Recognition: Predicts the most likely next word to improve transcription accuracy. Example: In "I want to", the trigram model predicts "go" or "eat" based on training data. Autocomplete and Text Prediction: Suggests the next word based on previous inputs. Example: Typing "How are" suggests "you" in predictive text. Spelling Correction: Identifies the most likely word in the context of surrounding words. Example: "Ths is a tst" → "This is a test" using bigram probabilities.
  • 61. APPLICATIONS OF N-GRAM MODELS Machine Translation: Helps in aligning and translating phrases by considering word sequences. Example: "Je t’aime" → "I love you," considering bigrams like "I love." Sentiment Analysis: Considers word combinations to determine sentiment. Example: "Very happy" is more positive than "Very sad." Language Modeling: Predicts the next word in a sequence, commonly used in NLP tasks. Example: In "The cat sat on the," the model predicts "mat."
  • 62. VECTOR SEMANTICS AND EMBEDDINGS Vector Semantics is a method of representing the meaning of words as mathematical vectors in a continuous, high-dimensional space. These vectors capture semantic relationships between words, enabling machines to understand and analyze language more effectively. Embeddings are the actual vector representations of words, phrases, or sentences. They map discrete linguistic units into a continuous vector space, where similar words are closer to each other.
  • 63. KEY CONCEPTS IN VECTOR SEMANTICS AND EMBEDDINGS Word Vectors: Words are represented as points in a multi-dimensional space. The closer two words are in this space, the more similar their meanings. Context-Based Representations: Word embeddings are generated based on the contexts in which words appear, capturing semantic and syntactic relationships. Dimensionality Reduction: Instead of representing words as high-dimensional sparse vectors (e.g., one-hot encoding), embeddings represent them as dense vectors in a lower-dimensional space.
  • 64. WORD EMBEDDING MODELS Count-Based Models: Use co-occurrence matrices to represent word relationships. Example: Latent Semantic Analysis (LSA). Predictive Models: Predict word embeddings directly by training neural networks. Examples: Word2Vec, GloVe. Contextual Models: Capture word meaning based on surrounding context. Examples: BERT, ELMo.
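The count-based idea can be sketched with a toy co-occurrence matrix. The two-sentence corpus and the ±1-token window are illustrative assumptions; a model like LSA would build such a matrix over a large corpus and then factorize it to get dense vectors:

```python
from collections import Counter, defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# Count, for each word, the words appearing within a +/-1 token window.
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                cooc[word][tokens[j]] += 1

# "cat" and "dog" occur in the same contexts ("the _ sat"), so their
# co-occurrence vectors overlap: that overlap is what encodes similarity.
print(dict(cooc["cat"]), dict(cooc["dog"]))
```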
  • 65. POPULAR WORD EMBEDDING TECHNIQUES Word2Vec: Developed by Google, Word2Vec creates word embeddings using two methods: Skip-Gram: Predicts the context words from a given word. CBOW (Continuous Bag of Words): Predicts a target word from its context words. Example: Input: "The cat sat on the mat." Output: Vectors for words like "cat," "sat," and "mat," where "cat" and "mat" are closer in the vector space.
  • 66. POPULAR WORD EMBEDDING TECHNIQUES GloVe (Global Vectors for Word Representation): Combines the benefits of count-based and predictive models by factoring in co-occurrence statistics. Example: Words like "king" and "queen" are similar but differ along the gender dimension. FastText: Represents words as a combination of character n-grams, enabling the model to understand rare or out-of-vocabulary words. Example: Words like "walking" and "walked" are represented similarly due to shared subword components.
  • 67. POPULAR WORD EMBEDDING TECHNIQUES BERT (Bidirectional Encoder Representations from Transformers): Generates contextual embeddings by understanding the meaning of a word in its sentence. Example: The word "bank" in "river bank" and "financial bank" has different embeddings based on context.
  • 68. EXAMPLES OF VECTOR SEMANTICS Word Similarity: Words with similar meanings have closer embeddings. Example: "Happy" and "Joyful" will have high cosine similarity. Synonyms and Analogies: Word embeddings can identify synonyms and solve analogies. Example: Analogy: "Man is to King as Woman is to ?" → Answer: "Queen" Document Similarity: Entire documents can be represented as vectors (e.g., sentence or paragraph embeddings). Example: Comparing the similarity of two documents for plagiarism detection.
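Cosine similarity, the distance measure used in the word-similarity example above, is easy to compute directly. The 3-dimensional vectors below are hand-made toy values, not trained embeddings (real embeddings have hundreds of dimensions), but they show how "closer in vector space" is quantified:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration only.
vec = {
    "happy":  [0.9, 0.8, 0.1],
    "joyful": [0.85, 0.75, 0.2],
    "sad":    [0.1, 0.2, 0.9],
}

print(cosine(vec["happy"], vec["joyful"]))  # high (near 1): similar meanings
print(cosine(vec["happy"], vec["sad"]))     # much lower: dissimilar meanings
```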
  • 69. APPLICATIONS OF VECTOR SEMANTICS AND EMBEDDINGS Search Engines: Embeddings help search engines understand synonyms and improve query results. Example: Searching for "laptop" retrieves results for "notebook." Sentiment Analysis: Embeddings capture the sentiment of words and sentences. Example: Positive words ("great," "excellent") cluster together, distinct from negative words. Machine Translation: Models like Word2Vec map words from different languages into a shared embedding space for translation. Example: "Bonjour" (French) and "Hello" (English) are close in vector space.
  • 70. APPLICATIONS OF VECTOR SEMANTICS AND EMBEDDINGS Speech Recognition: Embeddings improve recognition systems by linking phonemes to meaningful words. Example: The phrase "recognize speech" vs. "wreck a nice beach." Chatbots and Virtual Assistants: Use embeddings to understand and respond to user queries. Example: Recognizing "What's up?" as a casual greeting.
  • 71. ADVANTAGES OF VECTOR SEMANTICS Efficient Representation: Reduces dimensionality compared to sparse one-hot encodings. Captures Semantic Relationships: Words with similar meanings are close in the vector space. Adaptable to Various Tasks: Supports a wide range of NLP and speech tasks.