UNIT-II - NLP
Syllabus
• Introduction to word types: Word2Vec, Word Embedding, POS Tagging, Count Vectorizer,
Multiword Expressions, the role of language models. Simple N-gram models. Bag of Words,
estimating parameters and smoothing. Evaluating language models.
Word Embedding
• Word embeddings are a way of representing words in a vector space
(a list of numbers).
• They are numerical representations of words that capture their
meanings, relationships, and contexts.
• Word embeddings allow words with similar meanings to have similar
representations in the vector space.
• In simpler terms, instead of representing words as simple strings like
"apple" or "dog," we represent them as vectors of numbers, such
that similar words (like "cat" and "dog") will have similar vectors.
Why Do We Need Word Embeddings?
• Computers cannot understand human language directly, so we
need to convert words into a form they can process.
• The traditional approach, called one-hot encoding, represents
words as sparse vectors (like a long list of zeros and a single 1 at
the index representing the word).
• However, this doesn't capture any relationships between words.
• Word embeddings solve this problem by placing semantically
similar words closer together in the vector space.
Explanation of Word Embeddings:
Step 1: Traditional Representation vs. Word Embeddings
• Traditional Representation (One-Hot Encoding):
– Example: Let's take three words: "cat", "dog", and "fish".
– In one-hot encoding, each word would be represented as a vector with as many
dimensions as the size of the vocabulary, with a 1 at the index of the word and 0s
elsewhere.
• For a vocabulary of size 3, it might look like this:
– cat → [1, 0, 0]
– dog → [0, 1, 0]
– fish → [0, 0, 1]
• This representation has a couple of drawbacks:
• High dimensionality: For large vocabularies, these vectors become very
long.
• No relationship captured: The vector [1, 0, 0] for "cat" is no more related to
[0, 1, 0] for "dog" than it is to [0, 0, 1] for "fish". There's no concept of
meaning or similarity between words.
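• A minimal Python sketch of one-hot encoding for this three-word vocabulary (NumPy assumed; the vocabulary order ["cat", "dog", "fish"] is just for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "fish"]                 # toy vocabulary from the slide
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))                          # [1 0 0]
print(one_hot("dog"))                          # [0 1 0]

# Every pair of distinct one-hot vectors has dot product 0,
# so "cat" looks exactly as unrelated to "dog" as to "fish".
print(one_hot("cat") @ one_hot("dog"))         # 0
print(one_hot("cat") @ one_hot("fish"))        # 0
```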
Word Embedding Representation:
• Word embeddings solve this by mapping each word into a
continuous vector space, where the distance between vectors
represents semantic similarity.
• For example, the vector for "cat" would be close to the vector for
"dog", but far away from "fish", because "cat" and "dog" are more
semantically related.
• Example embeddings:
– cat → [0.2, 0.4, 0.6]
– dog → [0.1, 0.5, 0.7]
– fish → [0.9, 0.3, 0.2]
• In this example, "cat" and "dog" have similar vectors, while "fish"
has a different vector because it's less related to "cat" and "dog".
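• A small NumPy sketch using the example vectors above; cosine similarity is one common way to measure closeness in the embedding space (the numbers are only illustrative):

```python
import numpy as np

embeddings = {
    "cat":  np.array([0.2, 0.4, 0.6]),   # example vectors from the slide
    "dog":  np.array([0.1, 0.5, 0.7]),
    "fish": np.array([0.9, 0.3, 0.2]),
}

def cosine_similarity(a, b):
    """Ranges from -1 to 1; higher means the vectors point in similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity(embeddings["cat"], embeddings["dog"]), 2))   # 0.99 -> similar
print(round(cosine_similarity(embeddings["cat"], embeddings["fish"]), 2))  # 0.58 -> less similar
```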
Cont..
• Step 2: How Are Word Embeddings Learned?
• Word embeddings are typically learned using machine learning models
trained on large text datasets. The goal is to find a representation of each
word in such a way that similar words are close together in the vector space.
• Popular Algorithms for Learning Word Embeddings:
• Word2Vec (Continuous Bag of Words - CBOW, and Skip-Gram):
– CBOW predicts a target word based on context words around it.
– Skip-Gram does the reverse: it predicts the context words based on a target word.
– Both models use a neural network and learn to adjust the word vectors
(embeddings) to make accurate predictions.
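• A minimal sketch with the gensim library (assumed installed; gensim 4.x API), showing how the CBOW and Skip-Gram variants are selected via the sg flag; the tiny corpus and hyperparameters are placeholders:

```python
# pip install gensim   (assumed)
from gensim.models import Word2Vec

# Tiny tokenized corpus, just for illustration.
sentences = [
    ["i", "love", "apples"],
    ["i", "love", "oranges"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW: predict the target word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> Skip-Gram: predict the context words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["apples"][:5])            # first 5 dimensions of the learned vector
print(cbow.wv.most_similar("apples"))   # nearest words in the embedding space
```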
Word2Vec
• Word2Vec is a popular technique in Natural Language
Processing (NLP) used to represent words as vectors
(numbers).
• These vectors are created in such a way that words with
similar meanings are close together in the vector space,
making it easier for computers to understand
relationships between words.
• It was developed by Google in 2013 and is widely used to
understand the relationships between words in a text.
Real-World Example: A Restaurant Menu
• Imagine you're reading a menu at a restaurant. Here's a simplified list of items:
• Burger, Fries, Coke
• Pizza, Garlic Bread, Pepsi
• Sushi, Miso Soup, Green Tea
• From these combinations, Word2Vec would learn that:
• "Burger" is similar to "Pizza" because both are paired with other fast-food
items.
• "Coke" is similar to "Pepsi" because they are both drinks that appear with
similar foods.
• "Sushi" is different from "Burger" because it appears in a different context,
with Japanese food items like "Miso Soup."
Scenario: Cricket Player and Match Recommendation System
• Imagine you're using a cricket app that helps you
discover cricket players, matches, or even teams
you might like based on your preferences and past
activity. How does the app know which players or
matches to recommend next? The answer is
word2vec, which helps the app understand
relationships between players, teams, and match
types based on their context.
Understanding word2vec in Cricket
• Let’s say the app treats each cricket player, match, and team as a
"word" in a massive "sentence" representing all the cricket data.
Just like how words in a sentence have relationships based on their
meanings, players and matches also have relationships based on
their context. For example:
• Player 1: Virat Kohli (Batsman, India)
• Player 2: Rohit Sharma (Batsman, India)
• Player 3: Joe Root (Batsman, England)
• Player 4: Ben Stokes (All-rounder, England)
• Player 5: Jofra Archer (Bowler, England)
• Player 6: Steve Smith (Batsman, Australia)
How word2vec Works in This Case
• Now, the app uses a word2vec-like model to
convert these players into vectors (numbers). These
vectors represent how similar players are to each
other based on things like:
• Role (batsman, bowler, all-rounder)
• Country (India, England, Australia)
• Playing style (aggressive, defensive, all-rounder)
Cont..
• The app analyzes these contexts and places similar players closer together in the
vector space. Here's how it might look:
• Virat Kohli and Rohit Sharma will be represented by vectors that are close
together because both are Indian batsmen who play in a similar aggressive style.
• Joe Root and Ben Stokes will have vectors that are relatively close because they
both play for England, though Root is more of a traditional batsman, and Stokes
is an all-rounder.
• Jofra Archer will be further from players like Kohli and Root because he is a
bowler with a completely different role.
• Steve Smith will be close to Kohli and Sharma in the vector space since he is
also a batsman, but his unique style may position him slightly apart.
Why Do We Need Word2Vec?
• Words, by themselves, are just text. Computers can't
directly understand text, so we need to convert words
into numbers. But a simple number (like assigning
"apple" = 1 and "orange" = 2) doesn't capture the
meaning of the words or how they're related.
• Word2Vec solves this problem by creating vectors (lists of
numbers) that represent words in a way that captures
their meanings and relationships.
How Does Word2Vec Work?
• Contextual Representation:
• Word2Vec is based on the idea that "words
that appear in similar contexts have similar
meanings."
• For example, in the sentences "I love apples"
and "I love oranges," the words "apples" and
"oranges" appear in similar contexts ("I love").
Cont..
• Training with Context:
• Word2Vec uses a neural network to learn word relationships from large
text data.
• It trains in two main ways:
– CBOW (Continuous Bag of Words): Predicts a word based on its surrounding
words.
– Skip-Gram: Predicts surrounding words based on a single word.
• Word Vectors:
• Each word is represented as a vector of numbers (e.g., [1.2, -0.8,
0.5, ...]).
• Words with similar meanings or used in similar ways have similar
vectors.
CBOW (Continuous Bag of Words):
• Continuous Bag of Words (CBOW) model is one
of the two main architectures used in Word2Vec
(the other being Skip-Gram). It’s a simple but
effective method for learning word embeddings.
• CBOW predicts a word based on its context (the
surrounding words).
Example
1. Start with a Simple Sentence
• Write a sentence on the board for the students to understand. Sentence: "The cat sat on the mat."
2. Explain the Goal of CBOW
• What is CBOW? CBOW predicts a target word (like "sat")
using its context words (like "The", "cat", "on", "the",
"mat").
• Write on the board:
Context words → Predict Target Word
Cont..
3. Vocabulary Setup
• Explain that CBOW uses a vocabulary, which is a list of all unique words
in the text.
• Example Vocabulary:
– "The", "cat", "sat", "on", "mat"
• Assign an index to each word:
• The: 0, cat: 1, sat: 2, on: 3, mat: 4
4. Represent Words as Vectors (One-Hot Encoding)
• Explain that each word is represented as a one-hot vector:
• A one-hot vector is a list of 0s with a single 1 at the index of the word.
• The → [1, 0, 0, 0, 0] Cat → [0, 1, 0, 0, 0] Sat → [0, 0, 1, 0, 0]
Cont..
5. Training Example
• Write an example of how CBOW uses context
words to predict a target word:
• Context Words: ["The", "cat", "on", "the",
"mat"]
• Target Word: "sat"
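• A short pure-Python sketch of how (context, target) training pairs could be generated from this sentence; the window size of 2 is an assumption for illustration:

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2   # assumed context window size

pairs = []
for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'the'] -> sat
```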
6. Step-by-Step CBOW Process
• Step 1: Initialize word vectors Each word is
represented by a random vector at the
beginning of the training. Suppose we have
the following random vector for each word
(simplified for clarity):
[Table of initial random word vectors shown on the original slide.]
Step 2: Calculate the context vector Now, we average the vectors of the context
words ("The", "cat", "sits", "on"):
Cont..
• Step 3: Calculate the prediction Now, we use the context
vector to predict the target word.
• This is done by multiplying the context vector with a
weight matrix and passing it through a softmax function
to calculate the probability of each word being the target.
• The model will adjust its weights during training to make
the predicted target word more likely.
Cont..
• 3. Compute the Prediction Scores (Dot Product)
• Now, we take the context vector and multiply it by the weight
matrix to get prediction scores. This is done by taking the dot
product of the context vector with each word's vector in the
vocabulary.
• For example, let’s compute the dot product between the context
vector and the vector of each word in the vocabulary:
• Context vector: [0.25, 0.275, 0.3]
• The resulting dot products with each word's vector are:
• "The": 0.17
• "cat": 0.1975
• "sits": 0.3275
• "on": 0.2175
• "mat": 0.2125
• These values are called scores, and they
indicate how well each word in the vocabulary
fits the context.
• Step 1: Compute the exponential of each score: e^0.17 ≈ 1.1852, e^0.1975 ≈ 1.2184, e^0.3275 ≈ 1.3875, e^0.2175 ≈ 1.2430, e^0.2125 ≈ 1.2370.
Step 2: Compute the Sum of Exponentials
Add up all the exponential values:
Sum=1.1852+1.2184+1.3875+1.2430+1.2370=6.2711
• Step 3: Divide each exponential by the sum to get the softmax probabilities: 1.1852/6.2711 ≈ 0.1890, 1.2184/6.2711 ≈ 0.1943, 1.3875/6.2711 ≈ 0.2213, 1.2430/6.2711 ≈ 0.1982, 1.2370/6.2711 ≈ 0.1972.
Cont..
• Step 4: Verify the Results
• The softmax values are:
• [0.1890,0.1943,0.2213,0.1982,0.1972]
• Finally, check that they sum to 1.0:
• 0.1890+0.1943+0.2213+0.1982+0.1972=1.0
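• The softmax step above can be reproduced directly in NumPy from the five scores listed earlier (the initial word vectors themselves come from a slide table not reproduced here):

```python
import numpy as np

# Prediction scores for ["The", "cat", "sits", "on", "mat"] from the worked example.
scores = np.array([0.17, 0.1975, 0.3275, 0.2175, 0.2125])

exp_scores = np.exp(scores)              # ~[1.1852, 1.2184, 1.3875, 1.2430, 1.2370]
softmax = exp_scores / exp_scores.sum()  # divide each exponential by the sum

print(np.round(softmax, 4))              # [0.189  0.1943 0.2213 0.1982 0.1972]
print(round(float(softmax.sum()), 4))    # 1.0 -- probabilities always sum to one
```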
Skip-Gram Model
• Skip-gram is a model used to predict context
words based on a target word. Instead of
predicting the target word from its surrounding
context, the Skip-gram model works the other
way around: it tries to predict the surrounding
context words from a given target word.
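• By contrast, a Skip-Gram sketch for the same sentence produces (target, context-word) pairs; the window size of 2 is again an assumption:

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))   # (input word, word to predict)

print(pairs[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```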
Applications
• Search Engines: Understanding synonyms and
related words.
• Recommendation Systems: Finding similar
items or concepts.
• Text Classification: Helping classify text into
categories.
POS (Part of Speech)
• POS (Part of Speech) tagging is a process in natural
language processing where words in a sentence are
labeled with their corresponding parts of speech,
such as nouns, verbs, adjectives, adverbs, etc.
• The goal is to understand the role of each word in a
sentence to help computers analyze the meaning.
• Here’s an easy explanation:
• Nouns (N): Words that name people, places, things, or ideas.
Example: dog, city, happiness
• Verbs (V): Words that show actions, occurrences, or states of being.
Example: run, is, seem
• Adjectives (ADJ): Words that describe or modify nouns.
Example: happy, blue, tall
• Adverbs (ADV): Words that describe or modify verbs, adjectives, or other adverbs.
Example: quickly, very, gently
• Pronouns (PRON): Words that take the place of nouns.
Example: he, she, they
• Prepositions (PREP): Words that show the relationship between a noun (or pronoun) and
other parts of the sentence.
Example: in, on, under
• Conjunctions (CONJ): Words that connect words, phrases, or clauses.
Example: and, but, or
• Interjections (INTJ): Words that show strong emotions or feelings.
Example: wow, ouch, hey
• For example, in the sentence "The quick brown fox jumps over the lazy
dog," the POS tags would be:
• "The" – Determiner (often classified as a type of noun modifier)
• "quick" – Adjective
• "brown" – Adjective
• "fox" – Noun
• "jumps" – Verb
• "over" – Preposition
• "the" – Determiner
• "lazy" – Adjective
• "dog" – Noun
• By tagging these parts of speech, computers can better understand the
structure of a sentence and its meaning.
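• A minimal sketch with NLTK (assumed installed); NLTK returns Penn Treebank tags such as DT (determiner), JJ (adjective), NN (noun), VBZ (verb) and IN (preposition), which map onto the categories above:

```python
# pip install nltk   (assumed; resource names can vary slightly by NLTK version)
import nltk
nltk.download("averaged_perceptron_tagger")   # pre-trained POS tagger model

from nltk import pos_tag

tokens = "The quick brown fox jumps over the lazy dog".split()
print(pos_tag(tokens))
# Roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#           ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```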
Methods of POS Tagging
1. Rule-Based: Used for specific, structured text where rules are
predefined (e.g., legal or scientific documents).
2. Statistical: Works well for large datasets and can predict POS
tags based on observed word patterns (e.g., news articles,
blogs).
3. Machine Learning: Suitable for dynamic contexts where
patterns emerge from examples, used in translation and
grammar-checking tools.
4. Deep Learning: Best for handling complex, ambiguous
sentences and long-term context, used in voice assistants,
search engines, and advanced chatbots.
Count Vectorizer
• A Count Vectorizer is a tool used in natural language processing (NLP) to
transform a collection of text into a matrix of token counts. Here's a
simple breakdown of how it works:
1.Text Input: Imagine you have a few documents (or sentences). For example:
– "I love programming."
– "Programming is fun."
– "I love learning new things."
2.Tokenization: The Count Vectorizer splits each sentence into individual
words (tokens). For the above sentences, the tokens would be:
– "I", "love", "programming"
– "Programming", "is", "fun"
– "I", "love", "learning", "new", "things"
Cont..
3.Vocabulary Creation: It then creates a list of all unique words across the
documents (vocabulary):
• CountVectorizer assigns an index to each unique word (scikit-learn sorts its vocabulary alphabetically; the list below is shown in order of appearance):
• "I", "love", "programming", "is", "fun", "learning", "new", "things"
4.Count Matrix: For each sentence, it counts how many times each word (from
the vocabulary) appears. This is represented as a matrix (or table) where:
• Each row represents a sentence.
• Each column represents a word from the vocabulary.
• The cell values show the count of each word in that sentence.
• Count matrix for the three example sentences (rows = sentences; columns = I, love, programming, is, fun, learning, new, things):
– "I love programming." → [1, 1, 1, 0, 0, 0, 0, 0]
– "Programming is fun." → [0, 0, 1, 1, 1, 0, 0, 0]
– "I love learning new things." → [1, 1, 0, 0, 0, 1, 1, 1]
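• As a cross-check, a minimal scikit-learn sketch over the same three sentences (scikit-learn ≥ 1.0 assumed); note that its default tokenizer lowercases the text and drops single-character tokens such as "I", so its vocabulary differs slightly from the hand-built one above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love programming.",
    "Programming is fun.",
    "I love learning new things.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'learning' 'love' 'new' 'programming' 'things']
print(counts.toarray())
# each row = one sentence, each column = the count of the vocabulary word above
```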
Simple N-gram models
• N-gram models are a basic type of statistical
language model used in natural language
processing (NLP) to predict the likelihood of a
sequence of words.
What is an N-gram?
• An N-gram is a contiguous sequence of N items
(typically words or characters) from a given sample
of text or speech.
– Unigram: Single word (e.g., "I", "like", "cats").
– Bigram: Two consecutive words (e.g., "I like", "like
cats").
– Trigram: Three consecutive words (e.g., "I like cats").
– N-gram: A general term for N consecutive words (e.g., "I really like cats" is a 4-gram or "quadgram").
How do N-gram models work?
• Count Frequencies: An N-gram model starts by counting how
often specific N-grams appear in a large corpus of text.
• Example: For bigrams, calculate how often "I like", "like cats",
etc., occur.
• Probability Estimation: The model estimates the probability
of the next word based on the previous N−1 words.
– For a bigram model: P(cats | like) = Count(like cats) / Count(like)
– For a trigram model: P(cats | I like) = Count(I like cats) / Count(I like)
• Predict Next Word: Using these probabilities, the model
predicts the most likely next word in a sequence.
• Example 1: Bigram Model
• Let's say we have the following text:
• "I like cats. I like dogs. I like ice cream."
• We count the bigrams (two-word sequences):
Bigram counts:
– "I like": 3
– "like cats": 1
– "like dogs": 1
– "like ice": 1
Cont..
• Now, we calculate the probability of "cats" given "like":
• P("cats" "like")=Count("like cats")/Count("like")
∣
• P("cats" "like")=1/3=0.33
∣
• Similarly:
• P("dogs" "like")=1/3=0.33
∣
• P("ice" "like")=1/3=0.33
∣
• So, if we see "I like", the model predicts "cats", "dogs",
or "ice" with equal probability.
Predicting Next Word
• Input: "I like"
• Bigram Probabilities:
– "cats" → 33%
– "dogs" → 33%
– "ice" → 33%
• Predicted Next Word: The model randomly
selects from cats, dogs, or ice.
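• A small pure-Python sketch that reproduces the bigram counts and probabilities above from the same toy text (punctuation and sentence boundaries are handled naively, just for illustration):

```python
from collections import Counter

text = "I like cats. I like dogs. I like ice cream."
tokens = text.replace(".", "").split()          # ['I', 'like', 'cats', 'I', ...]

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_counts[("I", "like")])             # 3
print(round(bigram_prob("like", "cats"), 2))    # 0.33
print(round(bigram_prob("like", "dogs"), 2))    # 0.33
print(round(bigram_prob("like", "ice"), 2))     # 0.33
```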
Bag of words
• What is Bag of Words (BoW)?
• Bag of Words is a simple way to convert text
into numbers (vectorization) so that a
computer can understand and process it. It
does this by counting how often each word
appears in a document.
Example to Understand BoW Easily
Step 1: Sample Sentences
• Imagine we have two simple sentences:
• 1. "I love apples and oranges."
• 2. "Apples and bananas are tasty."
Step 2: Create a Vocabulary
• We list all unique words from both sentences:
• ["I", "love", "apples", "and", "oranges", "bananas", "are",
"tasty"]
Step 3: Count Word Occurrences
• Now, we create a table to count how many
times each word appears in each sentence.
Cont..
• Each row represents a word, and each column shows how
many times that word appears in a sentence.
• Step 4: Represent as a Vector
• Each sentence is now converted into a numerical vector:
• Sentence 1 → [1, 1, 1, 1, 1, 0, 0, 0]
Sentence 2 → [0, 0, 1, 1, 0, 1, 1, 1]
• Now, the computer can process these numbers instead of
raw text!
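• A pure-Python sketch that reproduces the two vectors above; the vocabulary order follows the slide, and tokenization is a simple case-insensitive split:

```python
from collections import Counter

vocabulary = ["i", "love", "apples", "and", "oranges", "bananas", "are", "tasty"]

sentences = [
    "I love apples and oranges",
    "Apples and bananas are tasty",
]

def bow_vector(sentence):
    """Count each vocabulary word in the sentence (case-insensitive)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

print(bow_vector(sentences[0]))   # [1, 1, 1, 1, 1, 0, 0, 0]
print(bow_vector(sentences[1]))   # [0, 0, 1, 1, 0, 1, 1, 1]
```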
Cont..
3. Advantages & Disadvantages
• ✔ Advantages:
• Works well for simple text classification tasks.
• Easy to implement and interpret.
• ❌ Disadvantages:
• Doesn’t consider the order or meaning of words (e.g., "not good" and "good"
are treated similarly).
• Large vocabulary can make computation expensive.
• Can lead to sparse matrices, meaning lots of zeros in the vector
representation.
Cont..
• 4. How to Improve BoW?
• To make BoW more effective, we can use:
– TF-IDF (Term Frequency-Inverse Document Frequency) –
Gives importance to rare words instead of common ones.
– N-grams – Considers sequences of words instead of single
words (e.g., "Machine Learning" as one unit).
– Word Embeddings (like Word2Vec, GloVe) – Captures the
meaning and context of words.
TF-IDF (Term Frequency-Inverse Document Frequency)
• TF-IDF (Term Frequency-Inverse Document
Frequency) is a statistical measure used in
Natural Language Processing (NLP) and
information retrieval to evaluate the importance
of a word within a document relative to a
corpus of documents.
Components of TF-IDF
1. Term Frequency (TF): how often a word appears in a document, divided by the total number of words in that document.
2. Inverse Document Frequency (IDF): log(total number of documents / number of documents containing the word); it down-weights words that appear in almost every document.
TF-IDF Score
• What it is: It combines TF and IDF to give a score to each word that
reflects its importance in a document relative to the entire corpus.
• Formula: TF-IDF(w)=TF(w)×IDF(w)
• Why it matters: Words that are frequent in a document but rare
across all documents will have a high TF-IDF score, indicating they
are important.
• Example: Continuing the previous example:
• TF("apple") = 0.05
• IDF("apple") = 2
• TF-IDF(apple)=0.05×2=0.1
Why is TF-IDF useful?
• Words that are very common across many
documents (like "the," "is," "and") will have a low
IDF value and will not contribute much to the TF-IDF
score.
• Words that are frequent in a specific document but
rare in others will have a high TF-IDF score, meaning
they are important keywords for that document.
New Documents:
• Document 1: "The sun is bright."
• Document 2: "The moon is bright at night."
• Document 3: "The stars are bright in the sky."
Step 1: Calculate Term Frequency (TF)
• We'll calculate the TF for the words "sun," "moon," "bright," "night,"
"stars," "sky," and "the" in each document.
• Document 1: "The sun is bright."
• Total words = 4 ("The", "sun", "is", "bright")
• The appears 1 time, Sun appears 1 time, Bright appears 1 time.
– TF(the) = 1/4 = 0.25
– TF(sun) = 1/4 = 0.25
– TF(bright) = 1/4 = 0.25
• Document 2: "The moon is bright at night."
• Total words = 6 ("The", "moon", "is", "bright", "at", "night")
• The appears 1 time, Moon appears 1 time, Bright appears 1 time, Night appears 1
time.
– TF(the)=1/6≈0.1667
– TF(moon)=1/6≈0.1667
– TF(bright)=1/6≈0.1667
– TF(night)=1/6≈0.1667
• Document 3: "The stars are bright in the sky."
• Total words = 7 ("The", "stars", "are", "bright", "in", "the", "sky")
• The appears 2 times, Stars appears 1 time, Bright appears 1 time, Sky appears 1
time.
– TF(the)=2/7≈0.2857
– TF(stars)=1/7≈0.1429
– TF(bright)=1/7≈0.1429
– TF(sky)=1/7≈0.1429
• Step 2: Calculate Inverse Document Frequency (IDF): Now, we calculate the IDF for the words "sun," "moon," "bright," "night," "stars," "sky," and "the."
• Total number of documents = 3.
• Now, count how many documents contain each word:
• Sun appears in Document 1 (1 document).
• Moon appears in Document 2 (1 document).
• Bright appears in all 3 documents (3 documents).
• Night appears in Document 2 (1 document).
• Stars appears in Document 3 (1 document).
• Sky appears in Document 3 (1 document).
• The appears in all 3 documents (3 documents).
Cont..
• We can now calculate the IDF for each word (log here is base 10):
• IDF(sun) = log(3/1) = log(3) ≈ 0.4771
• IDF(moon) = log(3/1) = log(3) ≈ 0.4771
• IDF(bright) = log(3/3) = log(1) = 0
• IDF(night) = log(3/1) = log(3) ≈ 0.4771
• IDF(stars) = log(3/1) = log(3) ≈ 0.4771
• IDF(sky) = log(3/1) = log(3) ≈ 0.4771
• IDF(the) = log(3/3) = log(1) = 0
• Step 3: Calculate TF-IDF: Now, let's calculate the TF-IDF for each word in each document.
• Document 1: "The sun is bright."
• TF-IDF(the, Doc 1) = 0.25×0=0
• TF-IDF(sun, Doc 1) = 0.25×0.4771=0.1193
• TF-IDF(bright, Doc 1) = 0.25×0=0
• Document 2: "The moon is bright at night."
• TF-IDF(the, Doc 2) = 0.1667×0=0
• TF-IDF(moon, Doc 2) = 0.1667×0.4771=0.0795
• TF-IDF(bright, Doc 2) = 0.1667×0=0
• TF-IDF(night, Doc 2) = 0.1667×0.4771=0.0795
• Document 3: "The stars are bright in the sky."
• TF-IDF(the, Doc 3) = 0.2857×0=0
• TF-IDF(stars, Doc 3) = 0.1429×0.4771=0.0681
• TF-IDF(bright, Doc 3) = 0.1429×0=0
• TF-IDF(sky, Doc 3) = 0.1429×0.4771=0.0681
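• A short sketch that recomputes the worked example above using raw term frequency and IDF = log10(N / df); note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different (smoothed) formula, so their numbers would not match exactly:

```python
import math

docs = [
    "the sun is bright",
    "the moon is bright at night",
    "the stars are bright in the sky",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(word, doc_tokens):
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word):
    df = sum(1 for doc in tokenized if word in doc)   # documents containing the word
    return math.log10(N / df)

def tf_idf(word, doc_index):
    return tf(word, tokenized[doc_index]) * idf(word)

print(round(tf_idf("sun", 0), 4))      # 0.1193
print(round(tf_idf("bright", 0), 4))   # 0.0  (appears in every document)
print(round(tf_idf("moon", 1), 4))     # 0.0795
print(round(tf_idf("sky", 2), 4))      # 0.0682 (the slide's 0.0681 comes from rounding TF and IDF first)
```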
Applications
• Ranking documents in search engines
• Classifying documents into categories
• Identifying keywords
• Detecting plagiarism
Multiword Expressions (MWEs)
• Multiword Expressions (MWEs) refer to groups of words that together
form a single unit of meaning, but their meaning is different from the
individual words in the expression. These expressions often cannot be
understood by simply looking at the meaning of each word separately.
• Examples:
• "kick the bucket" – This means "to die," but if you take the words
literally, they have no connection to death.
• "break the ice" – This means to start a conversation or make people feel
more comfortable, not actually breaking ice.
• "by and large" – This means "generally speaking," not referring to size or
scale.
What Are MWEs?
• Fixed phrases: These are expressions that have a fixed structure and
meaning. They often don’t follow regular grammar rules. Example: "kick
the bucket" (which means "to die," not literally kicking a bucket).
• Collocations: These are word pairs or groups of words that often appear
together, and their meaning is understood by frequent use in the
language. Example: "fast food" (fast and food are commonly used
together, but "fast" doesn’t describe food directly).
• Idiomatic expressions: These are phrases whose meanings are not
derived from the meanings of the individual words. Example: "break a
leg" (meaning "good luck").
• Phrasal verbs: These are verbs combined with prepositions or adverbs
to create a new meaning. Example: "give up" (meaning "quit").
Why Are MWEs Important in NLP?
• Ambiguity resolution: In regular language
processing, breaking a sentence into individual
words can lead to confusion, but MWEs help
reduce this ambiguity.
• Improving understanding: MWEs help machines
better understand human language, as they
reflect the actual usage of words in context.
Examples:
• "New York": It’s not just the words “new” and “york,” but a location (a
proper noun).
• "Take a break": The meaning is not about physically taking a break, but
about resting.
Types of MWEs:
• Fixed Expressions: "once in a blue moon"
• Phrasal Verbs: "look up"
• Collocations: "strong tea"
• By recognizing MWEs, NLP systems can better understand language and
handle tasks like translation, sentiment analysis, and question answering
more effectively!
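• A minimal sketch with NLTK's MWETokenizer (assumed installed), which merges listed multiword expressions into single tokens so that later processing steps treat them as one unit:

```python
# pip install nltk   (assumed)
from nltk.tokenize import MWETokenizer

# MWEs we want treated as single units (hypothetical list for illustration).
tokenizer = MWETokenizer([("New", "York"), ("kick", "the", "bucket")], separator="_")

tokens = "He moved to New York and did not kick the bucket".split()
print(tokenizer.tokenize(tokens))
# ['He', 'moved', 'to', 'New_York', 'and', 'did', 'not', 'kick_the_bucket']
```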
Applications
• Machine Translation: Correct translation of phrases like “kick the bucket” (to die) by
treating MWEs as a whole unit.
• Speech Recognition: Recognize phrases like “good morning” without misinterpreting
individual words.
• Sentiment Analysis: Detect true sentiment in phrases like “break a leg” (good luck), not
just the words “break” and “leg.”
• Named Entity Recognition (NER): Identify multiword entities like “United Nations” as a
single entity.
• Question Answering (QA): Interpret phrases like “capital of France” correctly in a
question.
• Social Media Analysis: Understand hashtags or informal phrases (#BlackFriday) in trend
detection.
Role of Language Models
• What is a Language Model?
• A language model is a system designed to predict
and understand the structure and meaning of
human language.
• It uses patterns from large amounts of text to
guess what words or sentences are most likely to
come next in a given context.
How Does It Work?
• Language models are trained using tons of text, such as books,
articles, and websites. They learn patterns like:
• Word prediction: For example, if the sentence is "The cat is on
the ___," the model can predict the word "mat."
• Context understanding: It can also understand the meaning of
a word based on the words around it. For example, the model
knows that "bank" could mean a place to store money or the
side of a river, depending on the surrounding words.
Why Are Language Models Important in NLP?
• Language models are the backbone of many NLP tasks, such as:
• Text Generation: They help generate new text that sounds natural
(e.g., writing essays, creating chatbot responses).
• Translation: Translating text from one language to another (e.g.,
Google Translate).
• Sentiment Analysis: Determining if a piece of text is positive,
negative, or neutral (e.g., customer reviews).
• Speech Recognition: Converting spoken language into written text
(e.g., Siri, Google Assistant).
4.Types of Language Models:
• Statistical Models: Earlier models relied on counting word frequencies
and probabilities.
• Neural Networks: Modern models, like GPT (Generative Pre-trained
Transformer), use deep learning to understand and generate language
more effectively.
5. Real-Life Examples:
• Chatbots: Like the one you're interacting with, where the model
predicts the best response based on your input.
• Voice Assistants: Alexa, Siri, and Google Assistant use LMs to
understand commands and answer questions.
• Text Autocompletion: When typing a message, predictive text suggests
the next word or phrase based on your previous words.
Estimating Parameters and Smoothing
• Estimating Parameters (Finding Probabilities)
• When we try to predict the next word in a
sentence, we need to know how often words
appear together. This is called estimating
probabilities.
• For example, in a bigram model (where we predict
one word based on the previous one), we calculate:
Example
• Bigram estimate: P(word2 | word1) = Count(word1 word2) / Count(word1)
• For the earlier text "I like cats. I like dogs. I like ice cream.", Count("I like") = 3 and Count("I") = 3, so P(like | I) = 3/3 = 1.0.
What is Smoothing?
• Smoothing is a trick to handle words or phrases that we
haven't seen before in our data.
• Imagine you have a predictive text keyboard.
• You've seen the phrase "I love chocolate cake" many times.
• But you've never seen "I love chocolate pizza" in your data.
• Without smoothing, the keyboard will think "chocolate pizza" is impossible and never suggest it!
• Smoothing fixes this! It makes sure everything gets at least
a small probability, even if we haven't seen it before.
How Does Smoothing Work?
• Smoothing adds a small value to every count, so nothing is completely zero.
• Example: Add-One (Laplace) Smoothing
• We just add 1 to every word count! 🎉
• Without Smoothing:
• If your data has:
• "chocolate cake" appeared 5 times
• "chocolate pizza" appeared 0 times
• We calculate the probability:
• P("pizza" | "chocolate") = 0 / (total times "chocolate" appears) = 0
• So the model thinks "chocolate pizza" is impossible (which is not true).
• With Add-One (Laplace) Smoothing: P("pizza" | "chocolate") = (0 + 1) / (Count("chocolate") + V), where V is the vocabulary size, so "chocolate pizza" gets a small but non-zero probability.
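• A small sketch of add-one (Laplace) smoothing; the next-word counts after "chocolate" and the vocabulary size below are made-up values for illustration:

```python
# Assumed toy counts of words seen immediately after "chocolate".
counts_after_chocolate = {"cake": 5, "icecream": 3, "milk": 2}
vocab_size = 1000                                  # assumed vocabulary size V
total = sum(counts_after_chocolate.values())       # 10

def unsmoothed(word):
    return counts_after_chocolate.get(word, 0) / total

def laplace(word):
    """Add-one smoothing: P(word | "chocolate") = (count + 1) / (total + V)."""
    return (counts_after_chocolate.get(word, 0) + 1) / (total + vocab_size)

print(unsmoothed("pizza"))          # 0.0     -> "chocolate pizza" judged impossible
print(round(laplace("pizza"), 5))   # 0.00099 -> small but non-zero probability
print(round(laplace("cake"), 5))    # 0.00594 -> still the most likely continuation
```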
Why Does Smoothing Matter?
• Helps chat bots, voice assistants, and search engines handle
new words.
• Improves spell check, predictive text, and auto-correct.
• Makes translations and speech recognition more accurate.
• Ensures NLP models don’t break when they see something
new.
• Without smoothing, models would fail at handling rare or
unseen words.
• With smoothing, they stay smart and flexible!