Speech and Language Processing - Regular Expression
1. R.M.K. COLLEGE OF
ENGINEERING AND TECHNOLOGY
22AI903 TEXT AND SPEECH ANALYTICS
(Professional Elective IV)
Dr. V. VIJAYARAJA
Professor
Artificial Intelligence and Data Science
Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
2. OBJECTIVES
To introduce the tools and techniques for performing text and
speech analytics in diverse contexts.
To understand the tools and technologies involved in
developing text and speech applications.
To demonstrate the use of computing for building
applications in text and speech processing.
To use Information Retrieval Techniques to build and evaluate
text processing systems.
To apply advanced speech recognition methodologies in
practical applications.
3. OUTCOMES
CO1: Apply the fundamental techniques in text processing for various NLP
tasks.
CO2: Implement advanced language models and improve text
classification accuracy.
CO3: Design text processing systems using state-of-the-art techniques.
CO4: Design, implement, and evaluate ASR and TTS systems.
CO5: Apply advanced speech recognition methodologies in practical
applications.
CO6: Use Information Retrieval Techniques to build and evaluate text
processing systems.
4. UNIT I TEXT PROCESSING
Speech and Language Processing
Regular Expression
Text normalization
Edit Distance
Lemmatization
Stemming
N-gram Language Models
Vector Semantics and Embeddings.
5. UNIT II TEXT CLASSIFICATION
Text Classification Tasks
Language Model
Neural Language Models
RNNs as Language Models
Transformers and Large Language Models.
6. UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS
Information Retrieval
Dense Vectors
Neural IR for Question Answering
Evaluating Retrieval based Question Answering
Frame-based Dialogue Systems
Dialogue Acts and Dialogue State
Chatbots – Dialogue System Design.
7. UNIT IV TEXT TO SPEECH SYNTHESIS
Automatic Speech Recognition Task
Feature Extraction for ASR: Log Mel Spectrum
Speech Recognition Architecture
CTC
ASR Evaluation: Word Error Rate
TTS
Speech Tasks.
8. UNIT V SPEECH RECOGNITION
LPC for speech recognition
Hidden Markov Model (HMM)
Training procedure for HMM
subword unit model based on HMM
Language models for large vocabulary speech recognition
Overall recognition system based on subword units
Context dependent subword units
Semantic post processor for speech recognition.
9. TEXT BOOKS
Jurafsky, D. and J. H. Martin, Speech and language
processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition Pearson Publication, Third Edition, 2022.
Lawrence Rabiner, Biing-Hwang Juang and
B.Yegnanarayana, “Fundamentals of Speech Recognition”,
Pearson Education, 2009.
10. REFERENCES
John Atkinson-Abutridy, Text Analytics: An Introduction to the
Science and Applications of Unstructured Information Analysis, CRC
Press, 2022.
Jim Schwoebel, NeuroLex, Introduction to Voice Computing in
Python, 2018
Lawrence R. Rabiner, Ronald W. Schafe, Theory and Applications of
Digital Speech Processing, First Edition, Pearson, 2010..
11. ARTIFICIAL INTELLIGENCE
Artificial intelligence is a specific branch of computer
science concerned with replicating the thought process
and decision-making ability of humans through computer
algorithms
Artificial intelligence makes it possible for machines to
learn from experience, adjust to new inputs and perform
human-like tasks
13. NATURAL LANGUAGE PROCESSING
NLP stands for Natural Language Processing, which deals
with the interaction between computers and humans in
natural language
14. SPEECH AND LANGUAGE PROCESSING
Involves the development of techniques that allow computers to
• understand,
• interpret, and
• generate human languages (both spoken and written)
It encompasses multiple domains of research and applications such as
• speech recognition,
• natural language processing (NLP) and
• text-to-speech synthesis
15. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Speech Recognition (Automatic Speech Recognition, ASR): converting spoken words
into text. Used in voice assistants like Siri, Google Assistant, and Alexa
Natural Language Processing (NLP): interaction between computers and human
language. Used in chatbots
Text-to-Speech (TTS): converts written text into spoken words. Used in Google
Translate
Speech Synthesis: generating human-like speech from text. Used in Google's Text-to-Speech
service available on smartphones
16. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Regular Expressions: searching and manipulating text data. Used in the search for
specific phrases or patterns in voice transcripts.
Text Normalization: converting raw text into a standard format
Edit Distance: measures the number of operations required to convert one string
into another
Stemming : reduce words to their root form (e.g., “running” → “run”).
17. COMPONENTS OF SPEECH AND LANGUAGE PROCESSING
Lemmatization: involves reducing words to their base form, considering their meaning
(e.g., “better” → “good”).
N-gram Language Models: used to predict the next word or sequence in a sentence
Vector Semantics and Embeddings: involves representing words or phrases as
vectors in a multi-dimensional space
18. REGULAR EXPRESSIONS (Regex)
Sequence of characters that forms a search pattern.
Used for pattern matching with strings or for searching and manipulating text
Essential in tasks such as text searching, text extraction, and data cleaning
Particularly in speech and language processing, where preprocessing text or
speech transcriptions is often required
19. CONCEPTS OF REGULAR EXPRESSIONS
Literal Characters:
Basic characters that match themselves in a string.
Example: The regex apple matches the string "apple".
Meta-characters:
These are special characters that have specific meanings. Commonly used meta-characters include:
. (dot): Matches any single character (except newline).
^: Anchors the match at the beginning of the string.
$: Anchors the match at the end of the string.
*: Matches zero or more of the preceding character.
+: Matches one or more of the preceding character.
?: Matches zero or one of the preceding character.
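These meta-characters can be checked quickly with Python's built-in re module (a minimal sketch; the sample strings are invented):

```python
import re

# Each assertion illustrates one meta-character from the list above.
assert re.search(r"^Hello", "Hello world")   # ^ anchors the match at the start
assert re.search(r"world$", "Hello world")   # $ anchors the match at the end
assert re.fullmatch(r"a.c", "abc")           # . matches any single character
assert re.fullmatch(r"ab*c", "ac")           # * allows zero or more 'b'
assert re.fullmatch(r"ab+c", "abbc")         # + requires one or more 'b'
assert re.fullmatch(r"ab?c", "ac")           # ? allows zero or one 'b'
print("all patterns behaved as described")
```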
20. CONCEPTS OF REGULAR EXPRESSIONS
Character Classes:
A character class defines a set of characters that can match a position in the string.
Example:
[a-z]: Matches any lowercase letter.
[A-Z]: Matches any uppercase letter.
[0-9]: Matches any digit.
[aeiou]: Matches any vowel.
Predefined Character Classes:
\d: Matches any digit (equivalent to [0-9]).
\w: Matches any word character (alphanumeric + underscore).
\s: Matches any whitespace character (spaces, tabs, line breaks).
\b: Matches a word boundary.
21. CONCEPTS OF REGULAR EXPRESSIONS
Quantifiers:
Quantifiers specify the number of occurrences of a character or group to match.
Example:
a{3}: Matches exactly three 'a's in a row (e.g., "aaa").
a{2,4}: Matches between 2 and 4 'a's in a row (e.g., "aa", "aaa", "aaaa").
Grouping and Capturing:
Parentheses () are used to group parts of the regular expression, allowing you to apply operators to
entire sections of the pattern.
Example:
(abc)+: Matches one or more occurrences of "abc".
Capturing groups store the matched text, which can be referenced later.
22. CONCEPTS OF REGULAR EXPRESSIONS
Alternation:
The pipe symbol | represents an "OR" operation in regular expressions.
Example:
apple|banana: Matches either "apple" or "banana".
Escape Sequences:
Some characters are reserved in regex (e.g., . or *). To use these characters as literals, they must be
escaped with a backslash (\).
Example:
\.: Matches the literal dot character, not any character
23. EXERCISES https://regex101.com/
• ^A – matches strings that start with 'A'
• .A – matches any single character followed by 'A'
• done$ – matches strings that end with 'done'
• hallo* – 'hall' followed by zero or more 'o'
• hallo+ – 'hall' followed by one or more 'o'
• hallo? – 'hall' followed by zero or one 'o'
• hallo{2} – 'hall' followed by exactly two 'o'
• hallo{2,} – 'hall' followed by two or more 'o'
• hal(lo)* – 'hal' followed by zero or more 'lo'
• hal(lo){2,5} – 'hal' followed by two to five 'lo'
• a(b|c) or a[bc] – 'a' followed by 'b' or 'c'
24. EXERCISES https://regex101.com/
• \d - matches a single character that is a digit
• \w - matches a word character
• \s - matches a whitespace character (includes tabs and line breaks)
• [abc] - matches a string that has either an a or a b or a c -> is the same as a|b|c
• [a-c] - same as previous
• [0-9]% - a string that has a character from 0 to 9 before a % sign
• \babc\b - matches 'abc' only as a whole word
• \d(?=r) - matches a digit only if it is followed by r
• (?<=r)\d - matches a digit only if it is preceded by r
• \d(?!r) - matches a digit only if it is not followed by r
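The word boundaries and lookarounds above can be tried in Python as well (a minimal sketch; the sample strings are invented):

```python
import re

assert re.search(r"\bdoor\b", "the door is open")         # \b: whole-word match
assert re.findall(r"\d(?=r)", "3r 4x 5r") == ["3", "5"]   # digit followed by 'r'
assert re.findall(r"(?<=r)\d", "r7 x8 r9") == ["7", "9"]  # digit preceded by 'r'
assert re.findall(r"\d(?!r)", "3r 4x") == ["4"]           # digit not followed by 'r'
```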
25. APPLICATIONS OF REGULAR EXPRESSIONS
Text Normalization
Clean and preprocess raw text before feeding it into text processing models
Removing unwanted punctuation or special characters from text
Transforming all characters to lowercase for uniformity
Example Regex: To remove punctuation from text: [^\w\s]
This regex matches any character that is not a word character or whitespace
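A minimal Python sketch of this cleanup (the sample sentence is invented):

```python
import re

raw = "Hello, world! #NLP is great..."
# Remove every character that is neither a word character nor whitespace.
clean = re.sub(r"[^\w\s]", "", raw)
print(clean)  # Hello world NLP is great
```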
26. APPLICATIONS OF REGULAR EXPRESSIONS
Tokenization
Breaking down sentences into tokens (words, punctuation, etc.) is a fundamental
step in text processing.
Regular expressions help segment the text into individual words and phrases
Example:
Splitting text into words based on spaces and punctuation.
Regex for splitting sentences into words: \w+
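For example, in Python (a sketch; note that \w+ splits contractions such as "Don't" into two tokens):

```python
import re

sentence = "Don't panic: regex-based tokenization is simple!"
tokens = re.findall(r"\w+", sentence)  # one token per run of word characters
print(tokens)
# ['Don', 't', 'panic', 'regex', 'based', 'tokenization', 'is', 'simple']
```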
27. APPLICATIONS OF REGULAR EXPRESSIONS
Pattern Matching and Extraction
Regex is often used to search for specific patterns in text, such as email
addresses, dates, phone numbers, or specific keywords
Example:
Extracting phone numbers from a document:
\d{3}-\d{3}-\d{4}
This regex matches a phone number in the format xxx-xxx-xxxx
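A short Python sketch of this extraction (the document text is invented):

```python
import re

doc = "Call 555-123-4567 or 555-987-6543 before 5 pm."
phones = re.findall(r"\d{3}-\d{3}-\d{4}", doc)  # all xxx-xxx-xxxx matches
print(phones)  # ['555-123-4567', '555-987-6543']
```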
28. APPLICATIONS OF REGULAR EXPRESSIONS
Named Entity Recognition (NER)
In text processing, regex is used to identify entities such as names, dates, and
places by matching predefined patterns
Example:
Matching dates: \d{2}/\d{2}/\d{4} (matches "12/05/2022").
29. APPLICATIONS OF REGULAR EXPRESSIONS
Speech-to-Text Transcription Cleanup
After speech recognition transcribes audio into text, regular expressions can be
used to remove errors like extra spaces, incomplete words, or unwanted symbols
Example:
Removing extra spaces after transcription:
\s{2,}
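In Python, collapsing runs of whitespace looks like this (a minimal sketch; the transcript is invented):

```python
import re

transcript = "i  want   to\t\tgo home"
# Replace any run of two or more whitespace characters with a single space.
clean = re.sub(r"\s{2,}", " ", transcript)
print(clean)  # i want to go home
```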
30. KEY STEPS IN TEXT NORMALIZATION
Lowercasing
Input: “I love the new Apple products.”
Output: “i love the new apple products.”
Removing Punctuation
Input: “Hello, world!”
Output: “Hello world”
Removing Special Characters
Input: “#DataScience is awesome!”
Output: “DataScience is awesome”.
Removing Stop Words (e.g., “the”, “a”, “and”, “in”)
Input: “The quick brown fox jumps over the lazy dog.”
Output: “quick brown fox jumps over lazy dog”
31. KEY STEPS IN TEXT NORMALIZATION
Expanding Contractions
Input: “I can’t believe it!”
Output: “I cannot believe it!”
Stemming and Lemmatization
"running" becomes "run"
“better” becomes “good”
32. KEY STEPS IN TEXT NORMALIZATION
Stemming and Lemmatization
Stemming involves reducing words to their root form by chopping off suffixes (e.g., "running"
becomes "run")
Lemmatization considers the meaning of the word and reduces it to its base form (e.g., “better”
becomes “good”)
Spelling Correction
Input: “I love progamming.”
Output: “I love programming”
Handling Numerals
Input: “I have 3 apples.”
Output: “I have three apples” (if converting numbers to words) or
“I have apples” (if removing numbers).
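Several of the steps above can be chained into one small pipeline (a Python sketch; the stop-word list is a tiny illustrative subset, not a complete one):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "in"}  # illustrative subset only

def normalize(text: str) -> str:
    text = text.lower()                  # lowercasing
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation/special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

print(normalize("The quick brown fox jumps over the lazy dog."))
# quick brown fox jumps over lazy dog
```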
33. EDIT DISTANCE
Edit Distance is a measure of the difference between two strings (e.g., words or
sequences of text).
It quantifies how many basic operations (insertions, deletions, substitutions) are
needed to transform one string into another.
Edit distance is a fundamental concept in text processing,
Especially in tasks like spell checking, text correction, machine translation, and
speech recognition
34. TYPES OF EDIT DISTANCE
1.Levenshtein Distance:
It computes the minimum number of single-character edits required to convert one string into
another, where each edit can be one of the following:
Insertion: Adding a character to a string.
Deletion: Removing a character from a string.
Substitution: Replacing one character with another
Example: String 1: “kitten” String 2: “sitting”
The operations required are:
1. Substitute 'k' with 's': "kitten" → "sitten"
2. Substitute ‘e' with ‘i': "sitten" → "sittin"
3. Insert 'g' at the end: "sittin" → "sitting"
Total distance = 3 (3 operations)
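The Levenshtein distance is usually computed with dynamic programming; a compact Python sketch:

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```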
35. TYPES OF EDIT DISTANCE
2. Damerau-Levenshtein Distance:
The Damerau-Levenshtein Distance is an extension of the Levenshtein distance that also considers
transpositions (swapping two adjacent characters) as a valid operation.
Example:
▪ String 1: “ab” ▪ String 2: “ba”
The Damerau-Levenshtein distance is 1, as only a transposition is required
3. Hamming Distance:
Hamming Distance is a special case of edit distance that only works on strings of the same length
and counts the number of positions at which the corresponding characters are different.
Example:
▪ String 1: “karolin” String 2: “kathrin”
The Hamming distance is 3 because the characters at positions 3, 4 and 5 differ.
36. COMPUTING EDIT DISTANCE
The distance between "kitten" and "sitting" is 3, as it requires 3 operations (substitute
'k' with 's', substitute 'e' with 'i', and insert 'g' at the end)
37. WORD ERROR RATE (WER)
WER is a metric used to evaluate the performance of speech-to-text systems.
It is calculated as the edit distance between the reference (correct transcription)
and the hypothesis (ASR output), divided by the total number of words in the
reference
38. WORD ERROR RATE (WER)
Example
I am now going to bed
The total number of words = 6
STT Model 1: I am now going to bed.
WER = 0% (Sum of Errors: 0)
STT Model 2: I am now to bed.
WER = 16.7% (Sum of Errors: 1, Deletion = 1: going)
STT Model 3: I am now to the bed.
WER = 33.3% (Sum of Errors: 2, Deletion = 1: going, Insertion = 1: the)
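WER is the word-level edit distance divided by the number of reference words; a Python sketch reproducing the example above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)

print(round(wer("I am now going to bed", "I am now to bed") * 100, 1))      # 16.7
print(round(wer("I am now going to bed", "I am now to the bed") * 100, 1))  # 33.3
```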
39. LEMMATIZATION
Process of reducing a word to its base or root form, known as the lemma, while
considering the context and meaning of the word
Lemmatization uses a vocabulary and morphological analysis of words to return
their base form
• The lemma of "running" is "run".
• The lemma of "better" is "good" (based on context and meaning)
40. KEY FEATURES OF LEMMATIZATION
Context Awareness:
Lemmatization considers the meaning and part of speech (POS) of the word.
For example, "flies" as a noun is reduced to "fly," while as a verb, it is also
reduced to "fly."
Dependency on POS Tagging:
The lemmatizer requires POS tags to determine the correct lemma.
For example, "saw" can be a noun (the tool) or a verb (past tense of "see"). The
lemma is determined based on context.
Dictionary-Based Approach:
Lemmatization relies on dictionaries or lexicons to determine the base form of a
word.
41. POS TAGS
42. PROCESS OF LEMMATIZATION
POS Tagging:
The word's part of speech is identified (e.g., noun, verb, adjective).
Example: Input: "The boys are playing in the park."
POS Tags: [The (DT), boys (NNS), are (VB), playing (VBG), in (IN), the (DT), park (NN)]
Morphological Analysis:
The morphological structure of the word is analyzed to determine its lemma.
Example: "Playing" → root: "play" (verb)
Lookup in Lemmatization Dictionary:
The lemma is looked up in the lexicon or dictionary based on the POS tag and
root form.
43. EXAMPLES OF LEMMATIZATION
Basic Examples:
Words like "running," "runs," and "ran" → Lemma: "run".
Words like "better" → Lemma: "good" (based on context)
Sentence-Level Example:
Input Sentence: "The children were playing in the gardens."
Lemmatized Output: "The child be play in the garden."
Ambiguity Example:
Word: "barked"
As a verb (past tense): Lemma → "bark."
As a noun (the sound of a dog): Lemma → "bark."
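The dictionary-plus-POS lookup can be sketched as follows (the mini-lexicon below is invented for illustration; real lemmatizers use WordNet-scale resources):

```python
# Hypothetical mini-lexicon mapping (word, POS) pairs to lemmas.
LEXICON = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",    # past tense of "see"
    ("saw", "NOUN"): "saw",    # the tool
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
}

def lemmatize(word: str, pos: str) -> str:
    # Fall back to the surface form when the pair is not in the lexicon.
    return LEXICON.get((word.lower(), pos), word.lower())

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```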
44. APPLICATIONS OF LEMMATIZATION
Search Engines:
Lemmatization helps improve search results by matching queries to documents,
regardless of word variations.
Example: A user searches for "running," and the engine retrieves documents
containing "run," "runs," or "ran."
Text Classification:
Reducing words to their lemma helps create consistent input for machine learning
models.
Example: In sentiment analysis, words like "happiest" and "happier" are reduced to
"happy," ensuring consistent feature extraction
45. APPLICATIONS OF LEMMATIZATION
Speech-to-Text Systems:
Lemmatization ensures that speech transcriptions are converted into meaningful,
standardized text for further processing.
Example: Converting "talking" in a transcript to "talk" for language modeling
Machine Translation:
Lemmatization ensures consistency when translating between languages by
standardizing word forms.
Example: Translating "jumping" and "jumps" into a consistent word in the target
language
46. APPLICATIONS OF LEMMATIZATION
Question Answering Systems:
Lemmatization enables systems to understand user queries better by reducing
variations in word forms.
Example: A question about "children playing" can match documents containing
"child play."
47. STEMMING
Stemming is the process of reducing words to their base or root form by
removing affixes (prefixes or suffixes).
Stemming does not consider the context or meaning of the word
It applies a set of heuristic rules to trim words down to their "stem."
Stemming is widely used in text preprocessing tasks for natural language
processing (NLP) applications, such as
search engines,
text classification, and
information retrieval
48. KEY FEATURES OF STEMMING
Rule-Based Approach:
Stemming uses rules to remove common prefixes and suffixes.
Example: Words ending in "ing," "ed," or "ly" are reduced by stripping these
endings
Not Context-Aware:
Stemming does not consider the word’s meaning or part of speech (POS).
Example: The word "better" is stemmed to "bet," even though "good" is the actual
lemma
Produces Non-Words:
Stems are often not valid words in the language.
Example: "Studies" is stemmed to "studi."
49. EXAMPLES OF STEMMING
Basic Examples:
"Running" → "run" "Studies" → "studi" "Caring" → "car"
Sentence-Level Example:
Input: "The boys are running quickly."
Output: "The boy are run quick."
Different Word Forms:
Connection," "connections," "connected," and "connecting" are all reduced to
"connect."
50. COMMON STEMMING ALGORITHMS
Porter Stemmer:
One of the most widely used stemming algorithms. Applies a series of rules to
remove common suffixes. Example:
Input: "caresses," "flies," "dies"
Output: "caress," "fli," "die"
Lancaster Stemmer:
A more aggressive stemming algorithm that produces shorter stems.
Example:
Input: "running" Output: "run"
51. COMMON STEMMING ALGORITHMS
Snowball Stemmer:
An improved version of the Porter stemmer, also known as Porter2.
Supports multiple languages and is less aggressive than the Lancaster stemmer.
Regex-Based Stemmer:
Uses regular expressions to define simple rules for stemming.
Example: Removing "-ing," "-ed," or "-ly" endings
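A regex-based stemmer of this kind can be sketched in a few lines (a naive illustration; Porter and Snowball apply much richer ordered rule sets):

```python
import re

def regex_stem(word: str) -> str:
    # Strip one common suffix; note the output may not be a valid word.
    return re.sub(r"(ing|ed|ly|s)$", "", word.lower())

print([regex_stem(w) for w in ["running", "jumped", "quickly", "cats"]])
# ['runn', 'jump', 'quick', 'cat']
```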
52. APPLICATIONS OF STEMMING
Search Engines:
Stemming helps search engines retrieve relevant documents by matching different
word forms.
Example: A search for "running" retrieves results containing "run," "runs," or "ran."
Text Classification:
Reducing words to their stems improves the efficiency of text classification models
by reducing dimensionality.
Example: In sentiment analysis, "happy" and "happiness" are treated as the
same feature
53. APPLICATIONS OF STEMMING
Information Retrieval:
Stemming enhances the matching of user queries to relevant documents by
normalizing word forms.
Example: Searching for "connections" in a database also retrieves documents
containing "connected."
Spam Detection:
Stemming reduces variations in word forms, making it easier to detect patterns in
spam messages.
Example: "offer," "offered," and "offering" are normalized to "offer."
54. COMPARISON: LEMMATIZATION VS. STEMMING
Aspect            | Lemmatization                                        | Stemming
Output            | Produces meaningful words (e.g., "better" → "good"). | May produce non-words (e.g., "better" → "bet").
Context Awareness | Considers context and POS.                           | Ignores context and POS.
Accuracy          | High accuracy in identifying root words.             | Lower accuracy, as it uses simple rules.
Speed             | Slower (requires dictionary lookup).                 | Faster (rule-based).
55. EXAMPLE: LEMMATIZATION VS. STEMMING
Input Word: "caring"
Lemmatization: "care"
Stemming: "car"
Input Word: "flying"
Lemmatization: "fly"
Stemming: "fli"
Input Word: "better"
Stemming: "bet"
Lemmatization: "good"
56. N-GRAM LANGUAGE MODEL
Statistical language model used to predict the likelihood of a sequence of words
or tokens.
It divides text into chunks of n words or tokens (N-grams) and estimates the
probability of a word based on its preceding n-1 words
Key Concepts of N-grams
An N-gram is a contiguous sequence of n items (words, characters, or phonemes)
from a given text or speech input.
Examples:
Unigram (n=1): ["I", "love", "NLP"]
Bigram (n=2): ["I love", "love NLP"]
Trigram (n=3): ["I love NLP"]
58. STEPS TO BUILD AN N-GRAM MODEL
1. Tokenization:
Split the text into words or tokens.
Example: "I love NLP" → ["I", "love", "NLP"]
2. Generate N-grams:
Extract sequences of n contiguous tokens.
Example for bigrams: ["I love", "love NLP"]
3. Calculate Frequencies:
Count occurrences of each N-gram
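The three steps can be sketched with a toy corpus (invented sentences) and maximum-likelihood bigram probabilities:

```python
from collections import Counter

corpus = ["i love nlp", "i love speech", "i study nlp"]  # toy corpus

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()                # 1. tokenization
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))  # 2. generate bigrams; 3. count them

# P(next | prev) = count(prev, next) / count(prev)
def prob(prev: str, nxt: str) -> float:
    return bigrams[(prev, nxt)] / unigrams[prev]

print(prob("i", "love"))  # 2 of the 3 sentences follow "i" with "love"
```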
60. APPLICATIONS OF N-GRAM MODELS
Speech Recognition:
Predicts the most likely next word to improve transcription accuracy.
Example: In "I want to", the trigram model predicts "go" or "eat" based on training
data
Autocomplete and Text Prediction:
Suggests the next word based on previous inputs.
Example: Typing "How are" suggests "you" in predictive text
Spelling Correction:
Identifies the most likely word in the context of surrounding words.
Example: "Ths is a tst" → "This is a test" using bigram probabilities
61. APPLICATIONS OF N-GRAM MODELS
Machine Translation:
Helps in aligning and translating phrases by considering word sequences.
Example: "Je t'aime" → "I love you," considering bigrams like "I love."
Sentiment Analysis:
Considers word combinations to determine sentiment.
Example: "Very happy" is more positive than "Very sad."
Language Modeling:
Predicts the next word in a sequence, commonly used in NLP tasks.
Example: In "The cat sat on the," the model predicts "mat."
62. VECTOR SEMANTICS AND EMBEDDINGS
Vector Semantics is a method of representing the meaning of words as
mathematical vectors in a continuous, high-dimensional space.
These vectors capture semantic relationships between words, enabling machines
to understand and analyze language more effectively.
Embeddings are the actual vector representations of words, phrases, or
sentences.
They map discrete linguistic units into a continuous vector space, where similar
words are closer to each other.
63. KEY CONCEPTS IN VECTOR SEMANTICS AND EMBEDDINGS
Word Vectors:
Words are represented as points in a multi-dimensional space.
The closer two words are in this space, the more similar their meanings.
Context-Based Representations:
Word embeddings are generated based on the contexts in which words appear,
capturing semantic and syntactic relationships.
Dimensionality Reduction:
Instead of representing words as high-dimensional sparse vectors (e.g., one-hot
encoding), embeddings represent them as dense vectors in a smaller dimensional
space
64. WORD EMBEDDING MODELS
Count-Based Models:
Use co-occurrence matrices to represent word relationships.
Example: Latent Semantic Analysis (LSA).
Predictive Models:
Predict word embeddings directly by training neural networks.
Examples: Word2Vec, GloVe.
Contextual Models:
Capture word meaning based on surrounding context.
Examples: BERT, ELMo.
65. POPULAR WORD EMBEDDING TECHNIQUES
Word2Vec:
Developed by Google, Word2Vec creates word embeddings using two methods:
Skip-Gram: Predicts the context words from a given word.
CBOW (Continuous Bag of Words): Predicts a target word from its context words.
Example:
Input: "The cat sat on the mat."
Output: Vectors for words like "cat," "sat," and "mat," where "cat" and "mat" are
closer in the vector space
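The Skip-Gram objective is trained on (target, context) pairs taken from a sliding window over the text; a sketch of pair generation with a window of 1:

```python
tokens = "the cat sat on the mat".split()
window = 1  # number of context words on each side of the target

pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))  # (target, context) training pair

print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```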
66. POPULAR WORD EMBEDDING TECHNIQUES
GloVe (Global Vectors for Word Representation):
Combines the benefits of count-based and predictive models by factoring in
co-occurrence statistics.
Example:
Words like "king" and "queen" are similar but differ along the gender dimension
FastText:
Represents words as a combination of character n-grams, enabling the model to
understand rare or out-of-vocabulary words.
Example:
Words like "walking" and "walked" are represented similarly due to shared
subword components.
67. POPULAR WORD EMBEDDING TECHNIQUES
BERT (Bidirectional Encoder Representations from Transformers):
Generates contextual embeddings by understanding the meaning of a word in its
sentence.
Example:
The word "bank" in "river bank" and "financial bank" has different embeddings
based on context
68. EXAMPLES OF VECTOR SEMANTICS
Word Similarity:
Words with similar meanings have closer embeddings.
Example: "Happy" and "Joyful" will have high cosine similarity.
Synonyms and Analogies:
Word embeddings can identify synonyms and solve analogies.
Example: Analogy: "Man is to King as Woman is to ?" → Answer: "Queen"
Document Similarity:
Entire documents can be represented as vectors (e.g., sentence or paragraph
embeddings).
Example: Comparing the similarity of two documents for plagiarism detection
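Similarity between embeddings is typically measured with cosine similarity; a sketch with invented 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned from corpora):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

happy  = [0.9, 0.8, 0.1]    # toy vectors, invented for illustration
joyful = [0.85, 0.75, 0.2]
sad    = [-0.8, -0.7, 0.1]

print(cosine(happy, joyful) > cosine(happy, sad))  # True
```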
69. APPLICATIONS OF VECTOR SEMANTICS AND EMBEDDINGS
Search Engines:
Embeddings help search engines understand synonyms and improve query results.
Example: Searching for "laptop" retrieves results for "notebook."
Sentiment Analysis:
Embeddings capture the sentiment of words and sentences.
Example: Positive words ("great," "excellent") cluster together, distinct from
negative words.
Machine Translation:
Models like Word2Vec map words from different languages into a shared
embedding space for translation.
Example: "Bonjour" (French) and "Hello" (English) are close in vector space
70. APPLICATIONS OF VECTOR SEMANTICS AND EMBEDDINGS
Speech Recognition:
Embeddings improve recognition systems by linking phonemes to meaningful
words.
Example: The phrase "recognize speech" vs. "wreck a nice beach."
Chatbots and Virtual Assistants:
Use embeddings to understand and respond to user queries.
Example: Recognizing "What's up?" as a casual greeting
71. ADVANTAGES OF VECTOR SEMANTICS
Efficient Representation:
Reduces dimensionality compared to sparse one-hot encodings.
Captures Semantic Relationships:
Words with similar meanings are close in the vector space.
Adaptable to Various Tasks:
Supports a wide range of NLP and speech tasks.