Smoothing N-gram Models
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, MPhil, SEDA(UK)
Smoothing
• What do we do with words that are in our vocabulary (they are not
unknown words) but appear in a test set in an unseen context (for
example they appear after a word they never appeared after in
training)?
• To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass from
some more frequent events and give it to the events we’ve never
seen.
• This modification is called smoothing or discounting (the sketch
after this list illustrates the zero-probability problem).
• There are a variety of ways to do smoothing:
– Add-1 smoothing
– Add-k smoothing
– Good-Turing Discounting
– Stupid backoff
– Kneser-Ney smoothing and many more
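A minimal Python sketch (toy corpus of my own, not from the slides) of the zero-probability problem an unsmoothed bigram model runs into:

```python
# Toy illustration: unsmoothed MLE bigram estimates assign zero
# probability to a known word appearing in an unseen context.
from collections import Counter

corpus = "i want food i want lunch i spend money".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_mle(w_prev, w):
    """Unsmoothed maximum likelihood bigram probability."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_mle("i", "want"))   # 2/3: 'want' followed 'i' twice in training
print(p_mle("i", "lunch"))  # 0.0: 'lunch' is known, but never followed 'i'
```

Any test sentence containing that unseen bigram would be assigned probability zero, which is exactly what smoothing prevents.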
Laplace Smoothing / Add 1 Smoothing
• The simplest way to do smoothing is to add one to all
the bigram counts, before we normalize them into
probabilities.
• All the counts that used to be zero will now have a
count of 1, the counts of 1 will be 2, and so on.
• This algorithm is called Laplace smoothing.
• Laplace smoothing does not perform well enough to be
used in modern N-gram models, but it usefully
introduces many of the concepts that we see in other
smoothing algorithms, gives a useful baseline, and is
also a practical smoothing algorithm for other tasks like
text classification.
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Let’s start with the application of Laplace smoothing to
unigram probabilities.
• Recall that the unsmoothed maximum likelihood estimate
of the unigram probability of the word $w_i$ is its count $c_i$
normalized by the total number of word tokens N:

$$P(w_i) = \frac{c_i}{N}$$
• Laplace smoothing merely adds one to each count.
• Since there are V words in the vocabulary and each one
was incremented, we also need to adjust the denominator
to take into account the extra V observations.
$$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$
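A minimal sketch of both estimates side by side, again with a toy corpus of my own:

```python
# Unsmoothed MLE vs. Laplace-smoothed unigram probabilities.
from collections import Counter

counts = Counter("the cat sat on the mat the end".split())
N = sum(counts.values())  # total word tokens (8)
V = len(counts)           # vocabulary size in word types (6)

def p_mle(w):
    return counts[w] / N

def p_laplace(w):
    # 'dog' below is treated as a vocabulary word with count 0; in
    # practice V must include every word we might score.
    return (counts[w] + 1) / (N + V)

print(p_mle("the"), p_laplace("the"))  # 0.375 vs. 4/14 ~ 0.286
print(p_mle("dog"), p_laplace("dog"))  # 0.0   vs. 1/14 ~ 0.071 (unseen)
```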
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Instead of changing both the numerator and
denominator, it is convenient to describe how a
smoothing algorithm affects the numerator, by defining
an adjusted count C*.
• This adjusted count is easier to compare directly with
the MLE counts and can be turned into a probability
like an MLE count by normalizing by N.
• To define this count, since we are only changing the
numerator, in addition to adding 1 we’ll also need to
multiply by a normalization factor $\frac{N}{N+V}$:

$$c_i^* = (c_i + 1)\frac{N}{N + V}$$
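For example (hypothetical numbers, not from the slides): with $N = 100$ tokens and $V = 50$ types, a word observed $c_i = 3$ times gets the adjusted count

$$c_i^* = (3 + 1) \times \frac{100}{100 + 50} \approx 2.67,$$

and normalizing by N gives $2.67 / 100 \approx 0.0267$, the same value as $P_{\text{Laplace}} = 4/150$.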
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• A related way to view smoothing is as discounting
(lowering) some non-zero counts in order to get
the probability mass that will be assigned to the
zero counts.
• Thus, instead of referring to the discounted
counts $c^*$, we might describe a smoothing
algorithm in terms of a relative discount $d_c$, the
ratio of the discounted counts to the original
counts:

$$d_c = \frac{c^*}{c}$$
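A small sketch (hypothetical numbers) that tabulates $c^*$ and $d_c$ for add-one smoothing:

```python
# Adjusted counts c* = (c+1) * N/(N+V) and relative discounts d_c = c*/c.
counts = {"food": 9, "lunch": 4, "money": 1}  # assumed toy unigram counts
N = sum(counts.values())  # 14 observed tokens
V = 20                    # assumed vocabulary size, including unseen types

for w, c in counts.items():
    c_star = (c + 1) * N / (N + V)
    print(f"{w}: c={c}, c*={c_star:.2f}, d_c={c_star / c:.2f}")
# food: c*=4.12, d_c=0.46 -- the high count is discounted sharply,
# freeing probability mass for the 17 vocabulary words never seen.
```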
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Let’s smooth our Berkeley Restaurant Project
bigrams; the original bigram counts appear on
the next slide.
Laplace Smoothing / Add 1 Smoothing
(Cont…)
[Table: Berkeley Restaurant Project bigram counts]
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Recall that normal bigram probabilities are computed by
normalizing each row of counts by the unigram count:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$
• For add-one smoothed bigram counts, we need to augment the
unigram count by the number of total word types in the vocabulary
V:
$$P^*_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$$
• Thus, each of the unigram counts given in the previous table will
need to be augmented by V = 1446.
• Ex: $P^*_{\text{Laplace}}(\text{to} \mid \text{I}) = \frac{C(\text{I, to}) + 1}{C(\text{I}) + V} = \frac{0 + 1}{2500 + 1446} = \frac{1}{3946} \approx 0.000253$
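The same computation as a function; a sketch that plugs in the counts quoted above (C(I, to) = 0, C(I) = 2500, and V = 1446 are taken from the slide):

```python
def p_laplace_bigram(c_bigram, c_prev, vocab_size):
    """Add-one smoothed bigram probability P*(w_n | w_{n-1})."""
    return (c_bigram + 1) / (c_prev + vocab_size)

# 'to' never followed 'I' in training, yet gets non-zero probability:
print(p_laplace_bigram(c_bigram=0, c_prev=2500, vocab_size=1446))
# 1/3946 ~ 0.000253
```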
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• The following table shows the add-one smoothed
probabilities for the bigrams:
[Table: add-one smoothed bigram probabilities]
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• It is often convenient to reconstruct the count
matrix so we can see how much a smoothing
algorithm has changed the original counts.
• These adjusted counts can be computed using
the following equation:

$$C^*(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}$$
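As a consistency check (assuming the unigram count C(want) = 927 from the Berkeley Restaurant corpus, which these slides do not show):

$$C^*(\text{want to}) = \frac{(608 + 1) \times 927}{927 + 1446} = \frac{564{,}543}{2373} \approx 238,$$

which matches the change from 608 to 238 noted two slides below.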
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• The following table shows the reconstructed
counts:
[Table: add-one reconstructed bigram counts]
Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Note that add-one smoothing has made a very big change to the
counts.
• C(want to) changed from 608 to 238!
• We can see this in probability space as well: P(to|want) decreases
from .66 in the unsmoothed case to .26 in the smoothed case.
• Looking at the discount d (the ratio between new and old counts)
shows us how strikingly the counts for each prefix word have been
reduced; the discount for the bigram want to is .39, while the
discount for Chinese food is .10, a factor of 10!
• The sharp change in counts and probabilities occurs because too
much probability mass is moved to all the zeros.
Add-K Smoothing
• One alternative to add-one smoothing is to move a bit
less of the probability mass from the seen to the
unseen events.
• Instead of adding 1 to each count, we add a fractional
count k (.5? .05? .01?).
• This algorithm is therefore called add-k smoothing.
$$P^*_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV}$$
• Add-k smoothing requires that we have a method for
choosing k; this can be done, for example, by
optimizing on a development set (devset).
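A sketch of the devset idea (toy training data and held-out bigrams of my own): evaluate a few candidate k values and keep the one with the lowest held-out perplexity.

```python
# Choosing k for add-k smoothing by minimizing perplexity on a devset.
import math
from collections import Counter

train = "i want food i want lunch i spend money".split()
dev_bigrams = [("i", "want"), ("want", "food"), ("i", "spend")]

bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)
V = len(unigrams)

def p_add_k(w_prev, w, k):
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def dev_perplexity(k):
    log_p = sum(math.log(p_add_k(wp, w, k)) for wp, w in dev_bigrams)
    return math.exp(-log_p / len(dev_bigrams))

best_k = min([0.01, 0.05, 0.5, 1.0], key=dev_perplexity)
print(best_k)  # the k with the lowest devset perplexity
```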
Good-Turing Discounting
• The basic insight of Good-Turing smoothing is to re-estimate the amount of
probability mass to assign to N-grams with zero or low counts by looking at the
number of N-grams with higher counts.
• In other words, we examine $N_c$, the number of N-grams that occur c times.
• We refer to the number of N-grams that occur c times as the frequency of
frequency c.
• So, applying the idea to smoothing the joint probability of bigrams, $N_0$ is the
number of bigrams b with count 0, $N_1$ the number of bigrams with count 1, and so
on:

$$N_c = \sum_{b \,:\, c(b) = c} 1$$
• The Good-Turing estimate gives a smoothed count $c^*$ based on the set of $N_c$ for all
c, as follows:

$$c^* = (c + 1)\frac{N_{c+1}}{N_c}$$
https://www.youtube.com/watch?v=z1bq4C8hFEk
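A minimal sketch (toy bigram counts of my own) of the estimate:

```python
# Good-Turing smoothed counts from the frequency-of-frequencies N_c.
from collections import Counter

bigram_counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
N_c = Counter(bigram_counts.values())  # N_1 = 3, N_2 = 2, N_3 = 1

def c_star(c):
    # Undefined when N_{c+1} = 0; practical versions (e.g. Simple
    # Good-Turing) first smooth the N_c curve to avoid such holes.
    return (c + 1) * N_c[c + 1] / N_c[c]

print(c_star(1))  # 2 * N_2/N_1 = 4/3 ~ 1.33
print(c_star(2))  # 3 * N_3/N_2 = 3/2 = 1.50
```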
Backoff and Interpolation
• In the backoff model, like the deleted interpolation model,
we build an N-gram model based on an (N-1)-gram model.
• The difference is that in backoff, if we have non-zero
trigram counts, we rely solely on the trigram counts and
don’t interpolate the bigram and unigram counts at all.
• We only ‘back off’ to a lower-order N-gram if we have zero
evidence for a higher-order N-gram.
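A minimal sketch of the backoff idea in the style of ‘stupid backoff’ (mentioned earlier): a fixed weight of 0.4 rather than the properly discounted weights of Katz backoff, so the outputs are scores, not true probabilities.

```python
# Back off from trigram to bigram to unigram only on zero evidence.
from collections import Counter

tokens = "i want food i want lunch i spend money".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def score(w1, w2, w3, alpha=0.4):
    if tri[(w1, w2, w3)] > 0:              # non-zero trigram: use it alone
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:                   # else back off to the bigram
        return alpha * bi[(w2, w3)] / uni[w2]
    return alpha * alpha * uni[w3] / N     # else back off to the unigram

print(score("i", "want", "food"))   # trigram seen: 1/2
print(score("i", "spend", "food"))  # backs off all the way to the unigram
```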
Reference
• https://www.youtube.com/watch?v=z1bq4C8hFEk