Language Modeling with
N-grams
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Introduction
• Models that assign probabilities to sequences of words are called
language models or LMs.
• In this lesson we introduce the simplest model that assigns
probabilities to sentences and sequences of words, the N-gram.
• An N-gram is a sequence of words: a 2-gram (or bigram) is a two-
word sequence of words like “please turn”, “turn your”, or ”your
homework”, and a 3-gram (or trigram) is a three-word sequence of
words like “please turn your”, or “turn your homework”.
• We’ll see how to use N-gram models to estimate the probability of
the last word of an N-gram given the previous words, and also to
assign probabilities to entire sequences.
• Whether estimating probabilities of next words or of whole
sequences, the N-gram model is one of the most important tools in
speech and language processing.
N-grams
• Let’s begin with the task of computing P(w|h),
the probability of a word w given some history
h.
• Suppose the history h is “its water is so
transparent that” and we want to know the
probability that the next word is the:
– P(the | its water is so transparent that)
N-grams (Cont…)
• One way to estimate this probability is from relative frequency counts:
take a very large corpus, count the number of times we see its water is so
transparent that, and count the number of times this is followed by the.
• This would be answering the question “Out of the times we saw the
history h, how many times was it followed by the word w?”, as follows:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
• With a large enough corpus, such as the web, we can compute these
counts and estimate the probability using the equation above.
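• A minimal Python sketch of this counting idea, purely for illustration: the tiny corpus string and the function name relative_frequency are invented here and are not part of the original slides.
```python
# Estimate P(w | h) by relative frequency: count how often the history h occurs
# and how often it is followed by w. The toy corpus below is made up.

def relative_frequency(tokens, history, word):
    h = history.split()
    n = len(h)
    history_count = 0
    followed_count = 0
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == h:
            history_count += 1
            if tokens[i + n] == word:
                followed_count += 1
    if history_count == 0:
        return 0.0  # history never observed; the estimate is undefined
    return followed_count / history_count

corpus = ("its water is so transparent that the fish are visible "
          "its water is so transparent that we can count the stones")
print(relative_frequency(corpus.split(),
                         "its water is so transparent that", "the"))  # 0.5
```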
N-grams (Cont…)
• If we wanted to know the joint probability of
an entire sequence of words like its water is so
transparent, we could do it by asking “out of
all possible sequences of five words, how
many of them are its water is so transparent?”
• We would have to get the count of its water is
so transparent and divide by the sum of the
counts of all possible five word sequences.
• That seems rather a lot to estimate!
N-grams (Cont…)
• For this reason, we’ll need to introduce cleverer ways of
estimating the probability of a word w given a history h, or
the probability of an entire word sequence W.
• Let’s start with a little formalizing of notation.
• To represent the probability of a particular random variable
X_i taking on the value “the”, or P(X_i = “the”), we will use the
simplification P(the).
• We’ll represent a sequence of N words either as w_1, ..., w_n or w_1^n.
• For the joint probability of each word in a sequence having
a particular value, P(X = w_1, Y = w_2, Z = w_3, ..., W = w_n),
we’ll use the shorthand P(w_1, w_2, ..., w_n).
N-grams (Cont…)
• Now how can we compute probabilities of entire
sequences like P(w_1, w_2, ..., w_n)?
• One thing we can do is decompose this
probability using the chain rule of probability:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})
N-grams (Cont…)
• The chain rule shows the link between computing the joint probability of a
sequence and computing the conditional probability of a word given
previous words.
• Previous equation suggests that we could estimate the joint probability of
an entire sequence of words by multiplying together a number of
conditional probabilities.
• But using the chain rule doesn’t really seem to help us!
• We don’t know any way to compute the exact probability of a word given
a long sequence of preceding words, P(w_n | w_1^{n-1}).
• As we said above, we can’t just estimate by counting the number of times
every word occurs following every long string, because language is
creative and any particular context might have never occurred before!
• The intuition of the N-gram model is that instead of computing the
probability of a word given its entire history, we can approximate the
history by just the last few words.
Bi-Gram
• The bigram model, for example, approximates the
probability of a word given all the previous words
P(w_n | w_1^{n-1}) by using only the conditional probability
of the preceding word P(w_n | w_{n-1}).
• In other words, instead of computing the probability
P(the | Walden Pond’s water is so transparent that), we
approximate it with the probability P(the | that).
• When we use a bigram model to predict the
conditional probability of the next word, we are thus
making the following approximation:
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
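• A small sketch of this approximation in Python; the probability value stored in bigram_prob is a made-up placeholder, not a real estimate from any corpus.
```python
# Bigram approximation: condition only on the single preceding word.
# The probability value below is a made-up placeholder.

bigram_prob = {("that", "the"): 0.30}

def p_next(word, history, probs):
    prev = history.split()[-1]          # keep only the last word of the history
    return probs.get((prev, word), 0.0)

# P(the | Walden Pond's water is so transparent that) is approximated by P(the | that)
print(p_next("the", "Walden Pond's water is so transparent that", bigram_prob))  # 0.3
```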
Markov Assumption
• The assumption that the probability of a
word depends only on the previous word is
called a Markov assumption.
• Markov models are the class of probabilistic
models that assume we can predict the
probability of some future unit without
looking too far into the past.
Generalize Bi-grams to N-grams
• We can generalize the bigram (which looks
one word into the past) to the trigram (which
looks two words into the past) and thus to the
N-gram (which looks N -1 words into the past).
• Thus, the general equation for this N-gram
approximation to the conditional probability
of the next word in a sequence is
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
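• A sketch of the general case: only the last N-1 words of the history are kept as the conditioning context. The trigram probability used here is a made-up placeholder.
```python
# General N-gram approximation: condition on the last N-1 words, w_{n-N+1}^{n-1}.
# The trigram probability below is a made-up placeholder.

def ngram_context(history_tokens, n):
    return tuple(history_tokens[-(n - 1):]) if n > 1 else ()

trigram_prob = {(("transparent", "that"), "the"): 0.4}

history = "its water is so transparent that".split()
context = ngram_context(history, 3)
print(context)                                  # ('transparent', 'that')
print(trigram_prob.get((context, "the"), 0.0))  # 0.4
```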
Generalize Bi-grams to N-grams
(Cont…)
• Given the bigram assumption for the probability of an individual word, we can
compute the probability of a complete word sequence by substituting
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
• into the chain rule, which gives:
P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
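• A direct translation of this product into Python; the probability values are made-up placeholders, and the <s> start symbol (introduced on a later slide) supplies the context for the first word.
```python
# P(w_1^n) ~= product over k of P(w_k | w_{k-1}); <s> supplies the context
# for the first word. The probability values are made-up placeholders.

def sequence_probability(words, probs, start_symbol="<s>"):
    prob = 1.0
    prev = start_symbol
    for w in words:
        prob *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 under MLE
        prev = w
    return prob

probs = {("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "tea"): 0.02}
print(sequence_probability(["i", "want", "tea"], probs))  # 0.25 * 0.33 * 0.02 = 0.00165
```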
Maximum Likelihood Estimation (MLE)
• How do we estimate these bigram or N-gram
probabilities?
• An intuitive way to estimate probabilities is
called maximum likelihood estimation or MLE.
• We get the MLE estimate for the parameters
of an N-gram model by getting counts from a
corpus, and normalizing the counts so that
they lie between 0 and 1.
Maximum Likelihood Estimation (MLE)
(Cont…)
• For example, to compute a particular bigram probability of
a word y given a previous word x, we’ll compute the count
of the bigram C(x,y) and normalize by the sum of all the
bigrams that share the same first word x:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w)
• We can simplify this equation, since the sum of all bigram
counts that start with a given word w_{n-1} must be equal to
the unigram count for that word w_{n-1}:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
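• A short sketch of this MLE computation from raw counts; the token stream below is made up for illustration.
```python
from collections import Counter

# MLE bigram estimate: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
# The toy token stream below is made up for illustration.

tokens = "its water is so transparent that the water is so clear".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_bigram(prev, word):
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("water", "is"))        # C(water is) / C(water) = 2/2 = 1.0
print(mle_bigram("so", "transparent"))  # C(so transparent) / C(so) = 1/2 = 0.5
```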
Maximum Likelihood Estimation (MLE)
(Cont…)
• Let’s work through an example using a mini-corpus of
three sentences.
• We’ll first need to augment each sentence with a
special symbol <s> at the beginning of the sentence, to
give us the bigram context of the first word.
• We’ll also need a special end-symbol </s>.
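• A sketch of this augmentation step; the three sentences below are a stand-in mini-corpus (the slide’s own example did not carry over into this text), so the numbers are only illustrative.
```python
from collections import Counter

# Augment each sentence with <s> and </s>, then compute bigram MLE estimates.
# The three sentences are a stand-in mini-corpus, not the slide's original one.

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # add boundary symbols
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<s>"))     # 2/3
print(p("am", "I"))      # 2/3
print(p("Sam", "am"))    # 1/2
print(p("</s>", "Sam"))  # 1/2
```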
Example 2
• Here are some text-normalized sample user queries extracted from the
Berkeley Restaurant Project (a small counting sketch follows this list):
– can you tell me about any good cantonese restaurants
close by
– mid priced thai food is what i’m looking for tell me
about chez panisse
– can you give me a listing of the kinds of food that are
available
– i’m looking for a good place to eat breakfast
– when is caffe venezia open during the day
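• Purely as an illustration, the sketch below builds bigram counts from just these five queries; the real Berkeley Restaurant Project corpus is far larger, so these toy counts are not the ones in the tables on the following slides.
```python
from collections import Counter

# Build bigram counts from the five sample queries above. The real corpus is
# much larger, so these toy counts differ from the slide's table.

queries = [
    "can you tell me about any good cantonese restaurants close by",
    "mid priced thai food is what i'm looking for tell me about chez panisse",
    "can you give me a listing of the kinds of food that are available",
    "i'm looking for a good place to eat breakfast",
    "when is caffe venezia open during the day",
]

bigram_counts = Counter()
for q in queries:
    tokens = ["<s>"] + q.split() + ["</s>"]
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts[("can", "you")])      # 2
print(bigram_counts[("tell", "me")])      # 2
print(bigram_counts[("i'm", "looking")])  # 2
```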
Example 2 (Cont…)
• The table shows the bigram counts from a piece of a bigram grammar
from the Berkeley Restaurant Project.
• Note that the majority of the values are zero.
• In fact, we have chosen the sample words to cohere with each
other; a matrix selected from a random set of seven words would
be even more sparse.
Example 2 (Cont…)
• The following table shows the bigram probabilities after normalization
(dividing each cell in the previous table by the appropriate unigram count
for its row, taken from the following set of unigram probabilities):
• Here are a few other useful probabilities:
Example 2 (Cont…)
• Now we can compute the probability of
sentences like I want English food by simply
multiplying the appropriate bigram
probabilities together, as follows:
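• A worked sketch of this multiplication; the bigram probabilities below are placeholder values standing in for the slide’s tables, and the product is accumulated in log space to avoid underflow.
```python
import math

# P(<s> i want english food </s>) under the bigram model, multiplying the
# component probabilities. The values below are placeholders for the slide's tables.

bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

tokens = ["<s>", "i", "want", "english", "food", "</s>"]
log_p = sum(math.log(bigram_prob[(a, b)]) for a, b in zip(tokens, tokens[1:]))
print(math.exp(log_p))  # ~3.1e-05
```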