Evaluating Language Models
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Extrinsic Evaluation
• The best way to evaluate the performance of a language
model is to embed it in an application and measure how
much the application improves.
• Such end-to-end evaluation is called extrinsic evaluation.
• Extrinsic evaluation is the only way to know whether a
particular improvement in a component is really going to
help the task at hand.
• Thus, for speech recognition, we can compare the
performance of two language models by running the
speech recognizer twice, once with each language model,
and seeing which gives the more accurate transcription.
Intrinsic Evaluation
• Unfortunately, running big NLP systems end-
to-end is often very expensive.
• Instead, it would be nice to have a metric that
can be used to quickly evaluate potential
improvements in a language model.
• An intrinsic evaluation metric is one that
measures the quality of a model independent
of any application.
Intrinsic Evaluation (Cont…)
• For an intrinsic evaluation of a language model we need a test set.
• The probabilities of an N-gram model come from the corpus it is
trained on, called the training set or training corpus.
• We can then measure the quality of an N-gram model by its
performance on some unseen data, called the test set or
test corpus.
• We will also sometimes call test sets and other datasets that are
not in our training sets held out corpora because we hold them out
from the training data.
• So if we are given a corpus of text and want to compare two
different N-gram models, we divide the data into training and test
sets, train the parameters of both models on the training set, and
then compare how well the two trained models fit the test set.
Intrinsic Evaluation (Cont…)
• But what does it mean to “fit the test set”?
– Whichever model assigns a higher probability to
the test set—meaning it more accurately predicts
the test set—is a better model.
• Given two probabilistic models, the better
model is the one that has a tighter fit to the
test data or that better predicts the details of
the test data, and hence will assign a higher
probability to the test data.
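• As a minimal Python sketch of this comparison, assuming each model exposes a hypothetical sentence_logprob(sentence) method that returns the log probability it assigns to a sentence:

```python
def better_model(model_a, model_b, test_sentences):
    """Return whichever of the two models assigns the higher total log
    probability (equivalently, the higher probability) to the test set."""
    def test_set_logprob(model):
        return sum(model.sentence_logprob(s) for s in test_sentences)
    return model_a if test_set_logprob(model_a) >= test_set_logprob(model_b) else model_b
```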
Intrinsic Evaluation (Cont…)
• Since our evaluation metric is based on test set probability,
it’s important not to let the test sentences into the training
set.
• Suppose we are trying to compute the probability of a
particular “test” sentence.
• If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it
occurs in the test set.
• We call this situation training on the test set.
• Training on the test set introduces a bias that makes the
probabilities all look too high, and causes huge inaccuracies
in perplexity (the probability-based evaluation metric
introduced below).
Development Test
• Sometimes we use a particular test set so often that we implicitly tune to its
characteristics.
• We then need a fresh test set that is truly unseen.
• In such cases, we call the initial test set the development test set, or devset.
• How do we divide our data into training, development, and test sets?
• We want our test set to be as large as possible, since a small test set may be
accidentally unrepresentative, but we also want as much training data as possible.
• At the minimum, we would want to pick the smallest test set that gives us enough
statistical power to measure a statistically significant difference between two
potential models.
• In practice, we often just divide our data into 80% training, 10% development, and
10% test.
• Given a large corpus that we want to divide into training and test, test data can
either be taken from some continuous sequence of text inside the corpus, or we
can remove smaller “stripes” of text from randomly selected parts of our corpus
and combine them into a test set.
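• A minimal Python sketch of such a split is shown below; the 80/10/10 proportions come from the slide, while shuffling at the sentence level (rather than taking one continuous block) is just one of the two options mentioned above.

```python
import random

def train_dev_test_split(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Split a corpus (a list of sentences) into training, development, and test sets."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)  # random "stripes" rather than one continuous block
    n = len(sentences)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (sentences[:n_train],                 # 80% training
            sentences[n_train:n_train + n_dev],  # 10% development
            sentences[n_train + n_dev:])         # 10% test
```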
Perplexity
• In practice we don’t use raw probability as our
metric for evaluating language models, but a
variant called perplexity.
• The perplexity (sometimes called PP for short) of
a language model on a test set is the inverse
probability of the test set, normalized by the
number of words. For a test set W = w1, w2, …, wN:
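$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$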
Perplexity (Cont…)
• We can use the chain rule to expand the
probability of W:
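$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \ldots w_{i-1})}}$$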
• Thus, if we are computing the perplexity of W
with a bigram language model, we get:
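$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$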
Perplexity (Cont…)
• Note that because of the inverse in previous equations, the
higher the conditional probability of the word sequence,
the lower the perplexity.
• Thus, minimizing perplexity is equivalent to maximizing the
test set probability according to the language model.
• What we generally use for the word sequence in these
equations is the entire sequence of words in some test set.
• Since this sequence will cross many sentence boundaries,
we need to include the begin- and end-sentence markers
<s> and </s> in the probability computation.
• We also need to include the end-of-sentence marker </s>
(but not the beginning-of-sentence marker <s>) in the total
count of word tokens N.
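• A minimal Python sketch of this computation for a bigram model is shown below; the bigram_prob(prev, word) function is a hypothetical stand-in for whatever estimate the model provides, and the sum is kept in log space to avoid numerical underflow.

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """Compute the perplexity of a bigram model over a test set of tokenized sentences.

    bigram_prob(prev, word) is assumed to return P(word | prev).
    """
    log_prob_sum = 0.0
    n_tokens = 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob_sum += math.log(bigram_prob(prev, word))
        n_tokens += len(sentence) + 1  # count </s> but not <s> in N
    return math.exp(-log_prob_sum / n_tokens)
```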
Perplexity (Cont…)
• There is another way to think about perplexity: as the weighted
average branching factor of a language.
• The branching factor of a language is the number of possible next
words that can follow any word.
• Consider the task of recognizing the digits in English (zero, one,
two,..., nine), given that each of the 10 digits occurs with equal
probability P = 1/10.
• The perplexity of this mini-language is in fact 10.
• To see that, imagine a string of digits of length N.
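• Each such string has probability (1/10)^N, so by the definition above:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10$$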
Perplexity for Comparing Different N-gram Models
• We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the
Wall Street Journal, using a 19,979 word vocabulary.
• We then computed the perplexity of each of these models
on a test set of 1.5 million words, using the perplexity
equation above.
• The table below shows the perplexity of a 1.5 million word
WSJ test set according to each of these grammars.
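• The perplexities reported for this experiment in Jurafsky and Martin's Speech and Language Processing, the source these slides follow, are:
– Unigram: 962
– Bigram: 170
– Trigram: 109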
Perplexity for Comparing Different N-gram Models (Cont…)
• As we see above, the more information the N-
gram gives us about the word sequence, the
lower the perplexity.
• Note that in computing perplexities, the N-gram
model P must be constructed without any
knowledge of the test set or any prior knowledge
of the vocabulary of the test set.
• Any kind of knowledge of the test set can cause
the perplexity to be artificially low.
• The perplexity of two language models is only
comparable if they use identical vocabularies.
Generalization and Zeros
• Statistical models are likely to be pretty useless as
predictors if the training set and the test set are very
different from each other.
• How should we deal with this problem when we
build N-gram models?
• One way is to be sure to use a training corpus
that has a similar genre to whatever task we are
trying to accomplish.
• To build a language model for translating legal
documents, we need a training corpus of legal
documents.
Generalization and Zeros (Cont…)
• Matching genres is still not sufficient.
• Our models may still be subject to the problem of sparsity.
• For any N-gram that occurred a sufficient number of times,
we might have a good estimate of its probability.
• But because any corpus is limited, some perfectly
acceptable English word sequences are bound to be
missing from it.
• That is, we’ll have many cases of putative “zero probability
N-grams” that should really have some non-zero
probability.
• Consider the words that follow the bigram denied the in
the WSJ Treebank3 corpus, together with their counts:
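• In the textbook example these slides follow, the observed continuations and counts are:
– denied the allegations: 5
– denied the speculation: 2
– denied the rumors: 1
– denied the report: 1
• A test set, however, may well contain phrases such as denied the offer or denied the loan, which never occur in the training data and so would be assigned probability zero.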
Generalization and Zeros (Cont…)
• To build a language model for a question-
answering system, we need a training corpus
of questions.
The Zeros
• Zeros— things that don’t ever occur in the training set
but do occur in the test set—are a problem for two
reasons.
– First, their presence means we are underestimating the
probability of all sorts of words that might occur, which will
hurt the performance of any application we want to run on
this data.
– Second, if the probability of any word in the test set is 0,
the entire probability of the test set is 0.
• By definition, perplexity is based on the inverse probability of
the test set.
• Thus if some words have zero probability, we can’t compute
perplexity at all, since we can’t divide by 0!
Unknown Words
• The previous section discussed the problem of words
whose bigram probability is zero.
• But what about words we simply have never seen before?
• Closed Vocabulary
– Sometimes we have a language task in which this can’t happen
because we know all the words that can occur.
– In such a closed vocabulary system the test set can only contain
words from this lexicon, and there will be no unknown words.
– This is a reasonable assumption in some domains, such as
speech recognition or machine translation, where we have a
pronunciation dictionary or a phrase table that are fixed in
advance, and so the language model can only use the words in
that dictionary or phrase table.
Unknown Words (Cont…)
• Open Vocabulary
– In other cases we have to deal with words we
haven’t seen before, which we’ll call unknown
words, or out of vocabulary (OOV) words.
– The percentage of OOV words that appear in the
test set is called the OOV rate.
– An open vocabulary system is one in which we
model these potential unknown words in the test
set by adding a pseudo-word called <UNK>.
Train the Probabilities of Unknown
Words
• There are two common ways to train the
probabilities of the unknown word model <UNK>.
• 1st Method:
– Turn the problem back into a closed vocabulary one
by choosing a fixed vocabulary in advance:
– In the training set, convert any word that is not in this
vocabulary (i.e., any OOV word) to the unknown word token
<UNK> in a text normalization step.
– Estimate the probabilities for <UNK> from its counts
just like any other regular word in the training set.
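• A minimal sketch of this first method, assuming the fixed vocabulary is given as a Python set of word strings:

```python
def replace_oov(tokenized_sentences, vocabulary, unk="<UNK>"):
    """Text-normalization step: map every out-of-vocabulary token to <UNK>."""
    return [[w if w in vocabulary else unk for w in sentence]
            for sentence in tokenized_sentences]
```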
Train the Probabilities of Unknown
Words (Cont…)
• 2nd Method
– The second alternative, in situations where we don’t
have a prior vocabulary in advance, is to create such a
vocabulary implicitly, replacing words in the training
data by <UNK> based on their frequency.
– For example, we can replace with <UNK> all words that
occur fewer than n times in the training set, where n is
some small number, or, equivalently, select a
vocabulary size V in advance (say 50,000), keep the
top V words by frequency, and replace the rest with
<UNK>.
– In either case we then proceed to train the language
model as before, treating <UNK> like a regular word.
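• A sketch of this second method, reusing the replace_oov() helper above; the min_count and max_size parameters correspond to the threshold n and the vocabulary size V from the slide.

```python
from collections import Counter

def implicit_vocabulary(tokenized_sentences, min_count=None, max_size=None):
    """Build a vocabulary from training-set frequencies, either by a minimum
    count or by keeping the top-V most frequent words."""
    counts = Counter(w for sentence in tokenized_sentences for w in sentence)
    if min_count is not None:
        return {w for w, c in counts.items() if c >= min_count}
    if max_size is not None:
        return {w for w, _ in counts.most_common(max_size)}
    return set(counts)

# Example: keep words seen at least twice, map the rest to <UNK>,
# then train the language model on the result as usual.
# vocab = implicit_vocabulary(train_sentences, min_count=2)
# train_sentences = replace_oov(train_sentences, vocab)
```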
Train the Probabilities of Unknown
Words (Cont…)
• The exact choice of <UNK> model does
have an effect on metrics like perplexity.
• A language model can achieve low perplexity
by choosing a small vocabulary and assigning
the unknown word a high probability.
• For this reason, perplexities should only be
compared across language models with the
same vocabularies.