UNIT-II - NLP
Syllabus
• Introduction to word types: Word2Vec, Word Embedding, POS Tagging, Count Vectorizer,
Multiword Expressions, the role of language models. Simple N-gram models. Bag of Words,
estimating parameters and smoothing. Evaluating language models.
Word Embedding
• Word embeddings are a way of representing words in a vector space
(a list of numbers).
• They are numerical representations of words that capture their
meanings, relationships, and contexts.
• Word embeddings allow words with similar meanings to have similar
representations in the vector space.
• In simpler terms, instead of representing words as simple strings like
"apple" or "dog," we represent them as vectors of numbers, such
that similar words (like "cat" and "dog") will have similar vectors.
Why Do We Need Word Embeddings?
• Computers cannot understand human language directly, so we
need to convert words into a form they can process.
• The traditional approach, called one-hot encoding, represents
words as sparse vectors (like a long list of zeros and a single 1 at
the index representing the word).
• However, this doesn't capture any relationships between words.
• Word embeddings solve this problem by placing semantically
similar words closer together in the vector space.
Explanation of Word Embeddings:
Step 1: Traditional Representation vs. Word Embeddings
• Traditional Representation (One-Hot Encoding):
– Example: Let's take three words: "cat", "dog", and "fish".
– In one-hot encoding, each word would be represented as a vector with as many
dimensions as the size of the vocabulary, with a 1 at the index of the word and 0s
elsewhere.
• For a vocabulary of size 3, it might look like this:
– cat → [1, 0, 0]
– dog → [0, 1, 0]
– fish → [0, 0, 1]
• This representation has a couple of drawbacks:
• High dimensionality: For large vocabularies, these vectors become very
long.
• No relationship captured: The vector [1, 0, 0] for "cat" is no more related to
[0, 1, 0] for "dog" than it is to [0, 0, 1] for "fish". There's no concept of
meaning or similarity between words.
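• A minimal Python sketch of one-hot encoding for this three-word vocabulary (NumPy assumed; the vocabulary order ["cat", "dog", "fish"] is just for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "fish"]                 # toy vocabulary from the slide
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))                          # [1 0 0]
print(one_hot("dog"))                          # [0 1 0]

# Every pair of distinct one-hot vectors has dot product 0,
# so "cat" looks exactly as unrelated to "dog" as to "fish".
print(one_hot("cat") @ one_hot("dog"))         # 0
print(one_hot("cat") @ one_hot("fish"))        # 0
```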
Word Embedding Representation:
• Word embeddings solve this by mapping each word into a
continuous vector space, where the distance between vectors
represents semantic similarity.
• For example, the vector for "cat" would be close to the vector for
"dog", but far away from "fish", because "cat" and "dog" are more
semantically related.
• Example embeddings:
– cat → [0.2, 0.4, 0.6]
– dog → [0.1, 0.5, 0.7]
– fish → [0.9, 0.3, 0.2]
• In this example, "cat" and "dog" have similar vectors, while "fish"
has a different vector because it's less related to "cat" and "dog".
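• A small NumPy sketch using the example vectors above; cosine similarity is one common way to measure closeness in the embedding space (the numbers are only illustrative):

```python
import numpy as np

embeddings = {
    "cat":  np.array([0.2, 0.4, 0.6]),   # example vectors from the slide
    "dog":  np.array([0.1, 0.5, 0.7]),
    "fish": np.array([0.9, 0.3, 0.2]),
}

def cosine_similarity(a, b):
    """Ranges from -1 to 1; higher means the vectors point in similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity(embeddings["cat"], embeddings["dog"]), 2))   # 0.99 -> similar
print(round(cosine_similarity(embeddings["cat"], embeddings["fish"]), 2))  # 0.58 -> less similar
```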
Cont..
• Step 2: How Are Word Embeddings Learned?
• Word embeddings are typically learned using machine learning models
trained on large text datasets. The goal is to find a representation of each
word in such a way that similar words are close together in the vector space.
• Popular Algorithms for Learning Word Embeddings:
• Word2Vec (Continuous Bag of Words - CBOW, and Skip-Gram):
– CBOW predicts a target word based on context words around it.
– Skip-Gram does the reverse: it predicts the context words based on a target word.
– Both models use a neural network and learn to adjust the word vectors
(embeddings) to make accurate predictions.
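• A minimal sketch with the gensim library (assumed installed; gensim 4.x API), showing how the CBOW and Skip-Gram variants are selected via the sg flag; the tiny corpus and hyperparameters are placeholders:

```python
# pip install gensim   (assumed)
from gensim.models import Word2Vec

# Tiny tokenized corpus, just for illustration.
sentences = [
    ["i", "love", "apples"],
    ["i", "love", "oranges"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW: predict the target word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> Skip-Gram: predict the context words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["apples"][:5])            # first 5 dimensions of the learned vector
print(cbow.wv.most_similar("apples"))   # nearest words in the embedding space
```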
Word2Vec
• Word2Vec is a popular technique in Natural Language
Processing (NLP) used to represent words as vectors
(numbers).
• These vectors are created in such a way that words with
similar meanings are close together in the vector space,
making it easier for computers to understand
relationships between words.
• It was developed by Google in 2013 and is widely used to
understand the relationships between words in a text.
Real-World Example: A Restaurant Menu
• Imagine you're reading a menu at a restaurant. Here's a simplified list of items:
• Burger, Fries, Coke
• Pizza, Garlic Bread, Pepsi
• Sushi, Miso Soup, Green Tea
• From these combinations, Word2Vec would learn that:
• "Burger" is similar to "Pizza" because both are paired with other fast-food
items.
• "Coke" is similar to "Pepsi" because they are both drinks that appear with
similar foods.
• "Sushi" is different from "Burger" because it appears in a different context,
with Japanese food items like "Miso Soup."
Scenario: Cricket Player and Match Recommendation System
• Imagine you're using a cricket app that helps you
discover cricket players, matches, or even teams
you might like based on your preferences and past
activity. How does the app know which players or
matches to recommend next? The answer is
word2vec, which helps the app understand
relationships between players, teams, and match
types based on their context.
Understanding word2vec in Cricket
• Let’s say the app treats each cricket player, match, and team as a
"word" in a massive "sentence" representing all the cricket data.
Just like how words in a sentence have relationships based on their
meanings, players and matches also have relationships based on
their context. For example:
• Player 1: Virat Kohli (Batsman, India)
• Player 2: Rohit Sharma (Batsman, India)
• Player 3: Joe Root (Batsman, England)
• Player 4: Ben Stokes (All-rounder, England)
• Player 5: Jofra Archer (Bowler, England)
• Player 6: Steve Smith (Batsman, Australia)
How word2vec Works in This Case
• Now, the app uses a word2vec-like model to
convert these players into vectors (numbers). These
vectors represent how similar players are to each
other based on things like:
• Role (batsman, bowler, all-rounder)
• Country (India, England, Australia)
• Playing style (aggressive, defensive, all-rounder)
Cont..
• The app analyzes these contexts and places similar players closer together in the
vector space. Here's how it might look:
• Virat Kohli and Rohit Sharma will be represented by vectors that are close
together because both are Indian batsmen who play in a similar aggressive style.
• Joe Root and Ben Stokes will have vectors that are relatively close because they
both play for England, though Root is more of a traditional batsman, and Stokes
is an all-rounder.
• Jofra Archer will be further from players like Kohli and Root because he is a
bowler with a completely different role.
• Steve Smith will be close to Kohli and Sharma in the vector space since he is
also a batsman, but his unique style may position him slightly apart.
Why Do We Need Word2Vec?
• Words, by themselves, are just text. Computers can't
directly understand text, so we need to convert words
into numbers. But a simple number (like assigning
"apple" = 1 and "orange" = 2) doesn't capture the
meaning of the words or how they're related.
• Word2Vec solves this problem by creating vectors (lists of
numbers) that represent words in a way that captures
their meanings and relationships.
How Does Word2Vec Work?
• Contextual Representation:
• Word2Vec is based on the idea that "words
that appear in similar contexts have similar
meanings."
• For example, in the sentences "I love apples"
and "I love oranges," the words "apples" and
"oranges" appear in similar contexts ("I love").
Cont..
• Training with Context:
• Word2Vec uses a neural network to learn word relationships from large
text data.
• It trains in two main ways:
– CBOW (Continuous Bag of Words): Predicts a word based on its surrounding
words.
– Skip-Gram: Predicts surrounding words based on a single word.
• Word Vectors:
• Each word is represented as a vector of numbers (e.g., [1.2, -0.8,
0.5, ...]).
• Words with similar meanings or used in similar ways have similar
vectors.
CBOW (Continuous Bag of Words):
• Continuous Bag of Words (CBOW) model is one
of the two main architectures used in Word2Vec
(the other being Skip-Gram). It’s a simple but
effective method for learning word embeddings.
• CBOW predicts a word based on its context (the
surrounding words).
Example
1. Start with a Simple Sentence
• Write a sentence on the board for the students to understand. Sentence: "The cat sat on the mat."
2. Explain the Goal of CBOW
• What is CBOW? CBOW predicts a target word (like "sat")
using its context words (like "The", "cat", "on", "the",
"mat").
• Write on the board:
Context words → Predict Target Word
Cont..
3. Vocabulary Setup
• Explain that CBOW uses a vocabulary, which is a list of all unique words
in the text.
• Example Vocabulary:
– "The", "cat", "sat", "on", "mat"
• Assign an index to each word:
• The: 0, cat: 1, sat: 2, on: 3, mat: 4
4. Represent Words as Vectors (One-Hot Encoding)
• Explain that each word is represented as a one-hot vector:
• A one-hot vector is a list of 0s with a single 1 at the index of the word.
• The → [1, 0, 0, 0, 0] Cat → [0, 1, 0, 0, 0] Sat → [0, 0, 1, 0, 0]
Cont..
5. Training Example
• Write an example of how CBOW uses context
words to predict a target word:
• Context Words: ["The", "cat", "on", "the",
"mat"]
• Target Word: "sat"
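• A short pure-Python sketch of how (context, target) training pairs could be generated from this sentence; the window size of 2 is an assumption for illustration:

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2   # assumed context window size

pairs = []
for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'the'] -> sat
```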
6. Step-by-Step CBOW Process
• Step 1: Initialize word vectors Each word is
represented by a random vector at the
beginning of the training. Suppose we have
the following random vector for each word
(simplified for clarity):
[Table of initial random word vectors shown on the original slide.]
Step 2: Calculate the context vector Now, we average the vectors of the context
words ("The", "cat", "sits", "on"):
Cont..
• Step 3: Calculate the prediction Now, we use the context
vector to predict the target word.
• This is done by multiplying the context vector with a
weight matrix and passing it through a softmax function
to calculate the probability of each word being the target.
• The model will adjust its weights during training to make
the predicted target word more likely.
Cont..
• 3. Compute the Prediction Scores (Dot Product)
• Now, we take the context vector and multiply it by the weight
matrix to get prediction scores. This is done by taking the dot
product of the context vector with each word's vector in the
vocabulary.
• For example, let’s compute the dot product between the context
vector and the vector of each word in the vocabulary:
• Context vector: [0.25, 0.275, 0.3]
• The resulting dot products with each word's vector are:
• "The": 0.17
• "cat": 0.1975
• "sits": 0.3275
• "on": 0.2175
• "mat": 0.2125
• These values are called scores, and they
indicate how well each word in the vocabulary
fits the context.
• Step 1: Compute the exponential of each score: e^0.17 ≈ 1.1852, e^0.1975 ≈ 1.2184, e^0.3275 ≈ 1.3875, e^0.2175 ≈ 1.2430, e^0.2125 ≈ 1.2370.
Step 2: Compute the Sum of Exponentials
Add up all the exponential values:
Sum=1.1852+1.2184+1.3875+1.2430+1.2370=6.2711
• Step 3: Divide each exponential by the sum to get the softmax probabilities: 1.1852/6.2711 ≈ 0.1890, 1.2184/6.2711 ≈ 0.1943, 1.3875/6.2711 ≈ 0.2213, 1.2430/6.2711 ≈ 0.1982, 1.2370/6.2711 ≈ 0.1972.
Cont..
• Step 4: Verify the Results
• The softmax values are:
• [0.1890,0.1943,0.2213,0.1982,0.1972]
• Finally, check that they sum to 1.0:
• 0.1890+0.1943+0.2213+0.1982+0.1972=1.0
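• The softmax step above can be reproduced directly in NumPy from the five scores listed earlier (the initial word vectors themselves come from a slide table not reproduced here):

```python
import numpy as np

# Prediction scores for ["The", "cat", "sits", "on", "mat"] from the worked example.
scores = np.array([0.17, 0.1975, 0.3275, 0.2175, 0.2125])

exp_scores = np.exp(scores)              # ~[1.1852, 1.2184, 1.3875, 1.2430, 1.2370]
softmax = exp_scores / exp_scores.sum()  # divide each exponential by the sum

print(np.round(softmax, 4))              # [0.189  0.1943 0.2213 0.1982 0.1972]
print(round(float(softmax.sum()), 4))    # 1.0 -- probabilities always sum to one
```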
Skip-Gram Model
• Skip-gram is a model used to predict context
words based on a target word. Instead of
predicting the target word from its surrounding
context, the Skip-gram model works the other
way around: it tries to predict the surrounding
context words from a given target word.
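• By contrast, a Skip-Gram sketch for the same sentence produces (target, context-word) pairs; the window size of 2 is again an assumption:

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))   # (input word, word to predict)

print(pairs[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```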
Applications
• Search Engines: Understanding synonyms and
related words.
• Recommendation Systems: Finding similar
items or concepts.
• Text Classification: Helping classify text into
categories.
POS (Part of Speech)
• POS (Part of Speech) tagging is a process in natural
language processing where words in a sentence are
labeled with their corresponding parts of speech,
such as nouns, verbs, adjectives, adverbs, etc.
• The goal is to understand the role of each word in a
sentence to help computers analyze the meaning.
• Here’s an easy explanation:
• Nouns (N): Words that name people, places, things, or ideas.
Example: dog, city, happiness
• Verbs (V): Words that show actions, occurrences, or states of being.
Example: run, is, seem
• Adjectives (ADJ): Words that describe or modify nouns.
Example: happy, blue, tall
• Adverbs (ADV): Words that describe or modify verbs, adjectives, or other adverbs.
Example: quickly, very, gently
• Pronouns (PRON): Words that take the place of nouns.
Example: he, she, they
• Prepositions (PREP): Words that show the relationship between a noun (or pronoun) and
other parts of the sentence.
Example: in, on, under
• Conjunctions (CONJ): Words that connect words, phrases, or clauses.
Example: and, but, or
• Interjections (INTJ): Words that show strong emotions or feelings.
Example: wow, ouch, hey
• For example, in the sentence "The quick brown fox jumps over the lazy
dog," the POS tags would be:
• "The" – Determiner (often classified as a type of noun modifier)
• "quick" – Adjective
• "brown" – Adjective
• "fox" – Noun
• "jumps" – Verb
• "over" – Preposition
• "the" – Determiner
• "lazy" – Adjective
• "dog" – Noun
• By tagging these parts of speech, computers can better understand the
structure of a sentence and its meaning.
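• A minimal sketch with NLTK (assumed installed); NLTK returns Penn Treebank tags such as DT (determiner), JJ (adjective), NN (noun), VBZ (verb) and IN (preposition), which map onto the categories above:

```python
# pip install nltk   (assumed; resource names can vary slightly by NLTK version)
import nltk
nltk.download("averaged_perceptron_tagger")   # pre-trained POS tagger model

from nltk import pos_tag

tokens = "The quick brown fox jumps over the lazy dog".split()
print(pos_tag(tokens))
# Roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#           ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```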
Methods of POS Tagging
1. Rule-Based: Used for specific, structured text where rules are
predefined (e.g., legal or scientific documents).
2. Statistical: Works well for large datasets and can predict POS
tags based on observed word patterns (e.g., news articles,
blogs).
3. Machine Learning: Suitable for dynamic contexts where
patterns emerge from examples, used in translation and
grammar-checking tools.
4. Deep Learning: Best for handling complex, ambiguous
sentences and long-term context, used in voice assistants,
search engines, and advanced chatbots.
Count Vectorizer
• A Count Vectorizer is a tool used in natural language processing (NLP) to
transform a collection of text into a matrix of token counts. Here's a
simple breakdown of how it works:
1.Text Input: Imagine you have a few documents (or sentences). For example:
– "I love programming."
– "Programming is fun."
– "I love learning new things."
2.Tokenization: The Count Vectorizer splits each sentence into individual
words (tokens). For the above sentences, the tokens would be:
– "I", "love", "programming"
– "Programming", "is", "fun"
– "I", "love", "learning", "new", "things"
Cont..
3.Vocabulary Creation: It then creates a list of all unique words across the
documents (vocabulary):
• CountVectorizer assigns an index to each unique word (scikit-learn sorts its vocabulary alphabetically; the list below is shown in order of appearance):
• "I", "love", "programming", "is", "fun", "learning", "new", "things"
4.Count Matrix: For each sentence, it counts how many times each word (from
the vocabulary) appears. This is represented as a matrix (or table) where:
• Each row represents a sentence.
• Each column represents a word from the vocabulary.
• The cell values show the count of each word in that sentence.
• Count matrix for the three example sentences (rows = sentences; columns = I, love, programming, is, fun, learning, new, things):
– "I love programming." → [1, 1, 1, 0, 0, 0, 0, 0]
– "Programming is fun." → [0, 0, 1, 1, 1, 0, 0, 0]
– "I love learning new things." → [1, 1, 0, 0, 0, 1, 1, 1]
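• As a cross-check, a minimal scikit-learn sketch over the same three sentences (scikit-learn ≥ 1.0 assumed); note that its default tokenizer lowercases the text and drops single-character tokens such as "I", so its vocabulary differs slightly from the hand-built one above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love programming.",
    "Programming is fun.",
    "I love learning new things.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'learning' 'love' 'new' 'programming' 'things']
print(counts.toarray())
# each row = one sentence, each column = the count of the vocabulary word above
```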
Simple N-gram models
• N-gram models are a basic type of statistical
language model used in natural language
processing (NLP) to predict the likelihood of a
sequence of words.
What is an N-gram?
• An N-gram is a contiguous sequence of N items
(typically words or characters) from a given sample
of text or speech.
– Unigram: Single word (e.g., "I", "like", "cats").
– Bigram: Two consecutive words (e.g., "I like", "like
cats").
– Trigram: Three consecutive words (e.g., "I like cats").
– N-gram: A general term for N consecutive words (e.g., "I really like cats" is a 4-gram or "quadgram").
How do N-gram models work?
• Count Frequencies: An N-gram model starts by counting how
often specific N-grams appear in a large corpus of text.
• Example: For bigrams, calculate how often "I like", "like cats",
etc., occur.
• Probability Estimation: The model estimates the probability
of the next word based on the previous N−1 words.
– For a bigram model: P(cats | like) = Count(like cats) / Count(like)
– For a trigram model: P(cats | I like) = Count(I like cats) / Count(I like)
• Predict Next Word: Using these probabilities, the model
predicts the most likely next word in a sequence.
• Example 1: Bigram Model
• Let's say we have the following text:
• "I like cats. I like dogs. I like ice cream."
• We count the bigrams (two-word sequences):
Bigram counts:
– "I like": 3
– "like cats": 1
– "like dogs": 1
– "like ice": 1
Cont..
• Now, we calculate the probability of "cats" given "like":
• P("cats" "like")=Count("like cats")/Count("like")
∣
• P("cats" "like")=1/3=0.33
∣
• Similarly:
• P("dogs" "like")=1/3=0.33
∣
• P("ice" "like")=1/3=0.33
∣
• So, if we see "I like", the model predicts "cats", "dogs",
or "ice" with equal probability.
Predicting Next Word
• Input: "I like"
• Bigram Probabilities:
– "cats" → 33%
– "dogs" → 33%
– "ice" → 33%
• Predicted Next Word: The model randomly
selects from cats, dogs, or ice.
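• A small pure-Python sketch that reproduces the bigram counts and probabilities above from the same toy text (punctuation and sentence boundaries are handled naively, just for illustration):

```python
from collections import Counter

text = "I like cats. I like dogs. I like ice cream."
tokens = text.replace(".", "").split()          # ['I', 'like', 'cats', 'I', ...]

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_counts[("I", "like")])             # 3
print(round(bigram_prob("like", "cats"), 2))    # 0.33
print(round(bigram_prob("like", "dogs"), 2))    # 0.33
print(round(bigram_prob("like", "ice"), 2))     # 0.33
```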
Bag of words
• What is Bag of Words (BoW)?
• Bag of Words is a simple way to convert text
into numbers (vectorization) so that a
computer can understand and process it. It
does this by counting how often each word
appears in a document.
Example to Understand BoW Easily
Step 1: Sample Sentences
• Imagine we have two simple sentences:
• 1. "I love apples and oranges."
• 2. "Apples and bananas are tasty."
Step 2: Create a Vocabulary
• We list all unique words from both sentences:
• ["I", "love", "apples", "and", "oranges", "bananas", "are",
"tasty"]
Step 3: Count Word Occurrences
• Now, we create a table to count how many
times each word appears in each sentence.
Cont..
• Each row represents a word, and each column shows how
many times that word appears in a sentence.
• Step 4: Represent as a Vector
• Each sentence is now converted into a numerical vector:
• Sentence 1 → [1, 1, 1, 1, 1, 0, 0, 0]
Sentence 2 → [0, 0, 1, 1, 0, 1, 1, 1]
• Now, the computer can process these numbers instead of
raw text!
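• A pure-Python sketch that reproduces the two vectors above; the vocabulary order follows the slide, and tokenization is a simple case-insensitive split:

```python
from collections import Counter

vocabulary = ["i", "love", "apples", "and", "oranges", "bananas", "are", "tasty"]

sentences = [
    "I love apples and oranges",
    "Apples and bananas are tasty",
]

def bow_vector(sentence):
    """Count each vocabulary word in the sentence (case-insensitive)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

print(bow_vector(sentences[0]))   # [1, 1, 1, 1, 1, 0, 0, 0]
print(bow_vector(sentences[1]))   # [0, 0, 1, 1, 0, 1, 1, 1]
```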
Cont..
3. Advantages & Disadvantages
• ✔ Advantages:
• Works well for simple text classification tasks.
• Easy to implement and interpret.
• ❌ Disadvantages:
• Doesn’t consider the order or meaning of words (e.g., "not good" and "good"
are treated similarly).
• Large vocabulary can make computation expensive.
• Can lead to sparse matrices, meaning lots of zeros in the vector
representation.
Cont..
• 4. How to Improve BoW?
• To make BoW more effective, we can use:
– TF-IDF (Term Frequency-Inverse Document Frequency) –
Gives importance to rare words instead of common ones.
– N-grams – Considers sequences of words instead of single
words (e.g., "Machine Learning" as one unit).
– Word Embeddings (like Word2Vec, GloVe) – Captures the
meaning and context of words.
TF-IDF (Term Frequency-Inverse Document Frequency)
• TF-IDF (Term Frequency-Inverse Document
Frequency) is a statistical measure used in
Natural Language Processing (NLP) and
information retrieval to evaluate the importance
of a word within a document relative to a
corpus of documents.
Components of TF-IDF
1. Term Frequency (TF): how often a word appears in a document, divided by the total number of words in that document.
2. Inverse Document Frequency (IDF): log(total number of documents / number of documents containing the word); it down-weights words that appear in almost every document.
TF-IDF Score
• What it is: It combines TF and IDF to give a score to each word that
reflects its importance in a document relative to the entire corpus.
• Formula: TF-IDF(w)=TF(w)×IDF(w)
• Why it matters: Words that are frequent in a document but rare
across all documents will have a high TF-IDF score, indicating they
are important.
• Example: Continuing the previous example:
• TF("apple") = 0.05
• IDF("apple") = 2
• TF-IDF(apple)=0.05×2=0.1
Why is TF-IDF useful?
• Words that are very common across many
documents (like "the," "is," "and") will have a low
IDF value and will not contribute much to the TF-IDF
score.
• Words that are frequent in a specific document but
rare in others will have a high TF-IDF score, meaning
they are important keywords for that document.
New Documents:
• Document 1: "The sun is bright."
• Document 2: "The moon is bright at night."
• Document 3: "The stars are bright in the sky."
Step 1: Calculate Term Frequency (TF)
• We'll calculate the TF for the words "sun," "moon," "bright," "night,"
"stars," "sky," and "the" in each document.
• Document 1: "The sun is bright."
• Total words = 4 ("The", "sun", "is", "bright")
• The appears 1 time, Sun appears 1 time, Bright appears 1 time.
– TF(the) = 1/4 = 0.25
– TF(sun) = 1/4 = 0.25
– TF(bright) = 1/4 = 0.25
• Document 2: "The moon is bright at night."
• Total words = 6 ("The", "moon", "is", "bright", "at", "night")
• The appears 1 time, Moon appears 1 time, Bright appears 1 time, Night appears 1
time.
– TF(the)=1/6≈0.1667
– TF(moon)=1/6≈0.1667
– TF(bright)=1/6≈0.1667
– TF(night)=1/6≈0.1667
• Document 3: "The stars are bright in the sky."
• Total words = 7 ("The", "stars", "are", "bright", "in", "the", "sky")
• The appears 2 times, Stars appears 1 time, Bright appears 1 time, Sky appears 1
time.
– TF(the)=2/7≈0.2857
– TF(stars)=1/7≈0.1429
– TF(bright)=1/7≈0.1429
– TF(sky)=1/7≈0.1429
• Step 2: Calculate Inverse Document Frequency (IDF): Now, we calculate the IDF for the words "sun," "moon," "bright," "night," "stars," "sky," and "the."
• Total number of documents = 3.
• Now, count how many documents contain each word:
• Sun appears in Document 1 (1 document).
• Moon appears in Document 2 (1 document).
• Bright appears in all 3 documents (3 documents).
• Night appears in Document 2 (1 document).
• Stars appears in Document 3 (1 document).
• Sky appears in Document 3 (1 document).
• The appears in all 3 documents (3 documents).
Cont..
• We can now calculate the IDF for each word (log here is base 10):
• IDF(sun) = log(3/1) = log(3) ≈ 0.4771
• IDF(moon) = log(3/1) = log(3) ≈ 0.4771
• IDF(bright) = log(3/3) = log(1) = 0
• IDF(night) = log(3/1) = log(3) ≈ 0.4771
• IDF(stars) = log(3/1) = log(3) ≈ 0.4771
• IDF(sky) = log(3/1) = log(3) ≈ 0.4771
• IDF(the) = log(3/3) = log(1) = 0
• Step 3: Calculate TF-IDF: Now, let's calculate the TF-IDF for each word in each document.
• Document 1: "The sun is bright."
• TF-IDF(the, Doc 1) = 0.25×0=0
• TF-IDF(sun, Doc 1) = 0.25×0.4771=0.1193
• TF-IDF(bright, Doc 1) = 0.25×0=0
• Document 2: "The moon is bright at night."
• TF-IDF(the, Doc 2) = 0.1667×0=0
• TF-IDF(moon, Doc 2) = 0.1667×0.4771=0.0795
• TF-IDF(bright, Doc 2) = 0.1667×0=0
• TF-IDF(night, Doc 2) = 0.1667×0.4771=0.0795
• Document 3: "The stars are bright in the sky."
• TF-IDF(the, Doc 3) = 0.2857×0=0
• TF-IDF(stars, Doc 3) = 0.1429×0.4771=0.0681
• TF-IDF(bright, Doc 3) = 0.1429×0=0
• TF-IDF(sky, Doc 3) = 0.1429×0.4771=0.0681
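• A short sketch that recomputes the worked example above using raw term frequency and IDF = log10(N / df); note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different (smoothed) formula, so their numbers would not match exactly:

```python
import math

docs = [
    "the sun is bright",
    "the moon is bright at night",
    "the stars are bright in the sky",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(word, doc_tokens):
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word):
    df = sum(1 for doc in tokenized if word in doc)   # documents containing the word
    return math.log10(N / df)

def tf_idf(word, doc_index):
    return tf(word, tokenized[doc_index]) * idf(word)

print(round(tf_idf("sun", 0), 4))      # 0.1193
print(round(tf_idf("bright", 0), 4))   # 0.0  (appears in every document)
print(round(tf_idf("moon", 1), 4))     # 0.0795
print(round(tf_idf("sky", 2), 4))      # 0.0682 (the slide's 0.0681 comes from rounding TF and IDF first)
```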
Applications
• Ranking documents in search engines
• Classifying documents into categories
• Identifying keywords
• Detecting plagiarism
Multiword Expressions (MWEs)
• Multiword Expressions (MWEs) refer to groups of words that together
form a single unit of meaning, but their meaning is different from the
individual words in the expression. These expressions often cannot be
understood by simply looking at the meaning of each word separately.
• Examples:
• "kick the bucket" – This means "to die," but if you take the words
literally, they have no connection to death.
• "break the ice" – This means to start a conversation or make people feel
more comfortable, not actually breaking ice.
• "by and large" – This means "generally speaking," not referring to size or
scale.
What Are MWEs?
• Fixed phrases: These are expressions that have a fixed structure and
meaning. They often don’t follow regular grammar rules. Example: "kick
the bucket" (which means "to die," not literally kicking a bucket).
• Collocations: These are word pairs or groups of words that often appear
together, and their meaning is understood by frequent use in the
language. Example: "fast food" (fast and food are commonly used
together, but "fast" doesn’t describe food directly).
• Idiomatic expressions: These are phrases whose meanings are not
derived from the meanings of the individual words. Example: "break a
leg" (meaning "good luck").
• Phrasal verbs: These are verbs combined with prepositions or adverbs
to create a new meaning. Example: "give up" (meaning "quit").
Why Are MWEs Important in NLP?
• Ambiguity resolution: In regular language
processing, breaking a sentence into individual
words can lead to confusion, but MWEs help
reduce this ambiguity.
• Improving understanding: MWEs help machines
better understand human language, as they
reflect the actual usage of words in context.
Examples:
• "New York": It’s not just the words “new” and “york,” but a location (a
proper noun).
• "Take a break": The meaning is not about physically taking a break, but
about resting.
Types of MWEs:
• Fixed Expressions: "once in a blue moon"
• Phrasal Verbs: "look up"
• Collocations: "strong tea"
• By recognizing MWEs, NLP systems can better understand language and
handle tasks like translation, sentiment analysis, and question answering
more effectively!
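• A minimal sketch with NLTK's MWETokenizer (assumed installed), which merges listed multiword expressions into single tokens so that later processing steps treat them as one unit:

```python
# pip install nltk   (assumed)
from nltk.tokenize import MWETokenizer

# MWEs we want treated as single units (hypothetical list for illustration).
tokenizer = MWETokenizer([("New", "York"), ("kick", "the", "bucket")], separator="_")

tokens = "He moved to New York and did not kick the bucket".split()
print(tokenizer.tokenize(tokens))
# ['He', 'moved', 'to', 'New_York', 'and', 'did', 'not', 'kick_the_bucket']
```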
Applications
• Machine Translation: Correct translation of phrases like “kick the bucket” (to die) by
treating MWEs as a whole unit.
• Speech Recognition: Recognize phrases like “good morning” without misinterpreting
individual words.
• Sentiment Analysis: Detect true sentiment in phrases like “break a leg” (good luck), not
just the words “break” and “leg.”
• Named Entity Recognition (NER): Identify multiword entities like “United Nations” as a
single entity.
• Question Answering (QA): Interpret phrases like “capital of France” correctly in a
question.
• Social Media Analysis: Understand hashtags or informal phrases (#BlackFriday) in trend
detection.
Role of Language Models
• What is a Language Model?
• A language model is a system designed to predict
and understand the structure and meaning of
human language.
• It uses patterns from large amounts of text to
guess what words or sentences are most likely to
come next in a given context.
How Does It Work?
• Language models are trained using tons of text, such as books,
articles, and websites. They learn patterns like:
• Word prediction: For example, if the sentence is "The cat is on
the ___," the model can predict the word "mat."
• Context understanding: It can also understand the meaning of
a word based on the words around it. For example, the model
knows that "bank" could mean a place to store money or the
side of a river, depending on the surrounding words.
Why Are Language Models Important in NLP?
• Language models are the backbone of many NLP tasks, such as:
• Text Generation: They help generate new text that sounds natural
(e.g., writing essays, creating chatbot responses).
• Translation: Translating text from one language to another (e.g.,
Google Translate).
• Sentiment Analysis: Determining if a piece of text is positive,
negative, or neutral (e.g., customer reviews).
• Speech Recognition: Converting spoken language into written text
(e.g., Siri, Google Assistant).
4.Types of Language Models:
• Statistical Models: Earlier models relied on counting word frequencies
and probabilities.
• Neural Networks: Modern models, like GPT (Generative Pre-trained
Transformer), use deep learning to understand and generate language
more effectively.
5. Real-Life Examples:
• Chatbots: Like the one you're interacting with, where the model
predicts the best response based on your input.
• Voice Assistants: Alexa, Siri, and Google Assistant use LMs to
understand commands and answer questions.
• Text Autocompletion: When typing a message, predictive text suggests
the next word or phrase based on your previous words.
Estimating Parameters and Smoothing
• Estimating Parameters (Finding Probabilities)
• When we try to predict the next word in a
sentence, we need to know how often words
appear together. This is called estimating
probabilities.
• For example, in a bigram model (where we predict
one word based on the previous one), we calculate:
Example
• Bigram estimate: P(word2 | word1) = Count(word1 word2) / Count(word1)
• For the earlier text "I like cats. I like dogs. I like ice cream.", Count("I like") = 3 and Count("I") = 3, so P(like | I) = 3/3 = 1.0.
What is Smoothing?
• Smoothing is a trick to handle words or phrases that we
haven't seen before in our data.
• Imagine you have a predictive text keyboard.
• You've seen the phrase "I love chocolate cake" many times.
• But you've never seen "I love chocolate pizza" in your data.
• Without smoothing, the keyboard will think "chocolate pizza" is impossible and never suggest it!
• Smoothing fixes this! It makes sure everything gets at least
a small probability, even if we haven't seen it before.
How Does Smoothing Work?
• Smoothing adds a small value to every count, so nothing is completely zero.
• Example: Add-One (Laplace) Smoothing
• We just add 1 to every word count! 🎉
• Without Smoothing:
• If your data has:
• "chocolate cake" appeared 5 times
• "chocolate pizza" appeared 0 times
• We calculate the probability:
• P("pizza" | "chocolate") = 0 / (total times "chocolate" appears) = 0
• So the model thinks "chocolate pizza" is impossible (which is not true).
• With Add-One (Laplace) Smoothing: P("pizza" | "chocolate") = (0 + 1) / (Count("chocolate") + V), where V is the vocabulary size, so "chocolate pizza" gets a small but non-zero probability.
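• A small sketch of add-one (Laplace) smoothing; the next-word counts after "chocolate" and the vocabulary size below are made-up values for illustration:

```python
# Assumed toy counts of words seen immediately after "chocolate".
counts_after_chocolate = {"cake": 5, "icecream": 3, "milk": 2}
vocab_size = 1000                                  # assumed vocabulary size V
total = sum(counts_after_chocolate.values())       # 10

def unsmoothed(word):
    return counts_after_chocolate.get(word, 0) / total

def laplace(word):
    """Add-one smoothing: P(word | "chocolate") = (count + 1) / (total + V)."""
    return (counts_after_chocolate.get(word, 0) + 1) / (total + vocab_size)

print(unsmoothed("pizza"))          # 0.0     -> "chocolate pizza" judged impossible
print(round(laplace("pizza"), 5))   # 0.00099 -> small but non-zero probability
print(round(laplace("cake"), 5))    # 0.00594 -> still the most likely continuation
```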
Why Does Smoothing Matter?
• Helps chat bots, voice assistants, and search engines handle
new words.
• Improves spell check, predictive text, and auto-correct.
• Makes translations and speech recognition more accurate.
• Ensures NLP models don’t break when they see something
new.
• Without smoothing, models would fail at handling rare or
unseen words.
• With smoothing, they stay smart and flexible!