Evaluating Language Models
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Extrinsic Evaluation
• The best way to evaluate the performance of a language
model is to embed it in an application and measure how
much the application improves.
• Such end-to-end evaluation is called extrinsic evaluation.
• Extrinsic evaluation is the only way to know whether a
particular improvement in a component is really going to
help the task at hand.
• Thus, for speech recognition, we can compare the
performance of two language models by running the
speech recognizer twice, once with each language model,
and seeing which gives the more accurate transcription.
Intrinsic Evaluation
• Unfortunately, running big NLP systems end-
to-end is often very expensive.
• Instead, it would be nice to have a metric that
can be used to quickly evaluate potential
improvements in a language model.
• An intrinsic evaluation metric is one that
measures the quality of a model independent
of any application.
Intrinsic Evaluation (Cont…)
• For an intrinsic evaluation of a language model we need a test set.
• The probabilities of an N-gram model come from the corpus it is
trained on, called the training set or training corpus.
• We can then measure the quality of an N-gram model by its
performance on some unseen data, called the test set or
test corpus.
• We will also sometimes call test sets and other datasets that are
not in our training sets held out corpora because we hold them out
from the training data.
• So if we are given a corpus of text and want to compare two
different N-gram models, we divide the data into training and test
sets, train the parameters of both models on the training set, and
then compare how well the two trained models fit the test set.
Intrinsic Evaluation (Cont…)
• But what does it mean to “fit the test set”?
– Whichever model assigns a higher probability to
the test set—meaning it more accurately predicts
the test set—is a better model.
• Given two probabilistic models, the better
model is the one that has a tighter fit to the
test data or that better predicts the details of
the test data, and hence will assign a higher
probability to the test data.
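• As a minimal Python sketch of this comparison, assuming each model exposes a hypothetical sentence_logprob(sentence) method that returns the log probability it assigns to a sentence:

```python
def better_model(model_a, model_b, test_sentences):
    """Return whichever of the two models assigns the higher total log
    probability (equivalently, the higher probability) to the test set."""
    def test_set_logprob(model):
        return sum(model.sentence_logprob(s) for s in test_sentences)
    return model_a if test_set_logprob(model_a) >= test_set_logprob(model_b) else model_b
```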
Intrinsic Evaluation (Cont…)
• Since our evaluation metric is based on test set probability,
it’s important not to let the test sentences into the training
set.
• Suppose we are trying to compute the probability of a
particular “test” sentence.
• If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it
occurs in the test set.
• We call this situation training on the test set.
• Training on the test set introduces a bias that makes the
probabilities all look too high, and causes huge inaccuracies
in perplexity (the probability-based evaluation metric
introduced below).
Development Test
• Sometimes we use a particular test set so often that we implicitly tune to its
characteristics.
• We then need a fresh test set that is truly unseen.
• In such cases, we call the initial test set the development test set, or devset.
• How do we divide our data into training, development, and test sets?
• We want our test set to be as large as possible, since a small test set may be
accidentally unrepresentative, but we also want as much training data as possible.
• At the minimum, we would want to pick the smallest test set that gives us enough
statistical power to measure a statistically significant difference between two
potential models.
• In practice, we often just divide our data into 80% training, 10% development, and
10% test.
• Given a large corpus that we want to divide into training and test, test data can
either be taken from some continuous sequence of text inside the corpus, or we
can remove smaller “stripes” of text from randomly selected parts of our corpus
and combine them into a test set.
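• A minimal Python sketch of such a split is shown below; the 80/10/10 proportions come from the slide, while shuffling at the sentence level (rather than taking one continuous block) is just one of the two options mentioned above.

```python
import random

def train_dev_test_split(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Split a corpus (a list of sentences) into training, development, and test sets."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)  # random "stripes" rather than one continuous block
    n = len(sentences)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (sentences[:n_train],                 # 80% training
            sentences[n_train:n_train + n_dev],  # 10% development
            sentences[n_train + n_dev:])         # 10% test
```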
Perplexity
• In practice we don’t use raw probability as our
metric for evaluating language models, but a
variant called perplexity.
• The perplexity (sometimes called PP for short) of
a language model on a test set is the inverse
probability of the test set, normalized by the
number of words. For a test set W = w1, w2, …, wN:
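$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$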
Perplexity (Cont…)
• We can use the chain rule to expand the
probability of W:
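$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \ldots w_{i-1})}}$$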
• Thus, if we are computing the perplexity of W
with a bigram language model, we get:
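$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$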
Perplexity (Cont…)
• Note that because of the inverse in previous equations, the
higher the conditional probability of the word sequence,
the lower the perplexity.
• Thus, minimizing perplexity is equivalent to maximizing the
test set probability according to the language model.
• What we generally use for the word sequence in these
equations is the entire sequence of words in some test set.
• Since this sequence will cross many sentence boundaries,
we need to include the begin- and end-sentence markers
<s> and </s> in the probability computation.
• We also need to include the end-of-sentence marker </s>
(but not the beginning-of-sentence marker <s>) in the total
count of word tokens N.
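• A minimal Python sketch of this computation for a bigram model is shown below; the bigram_prob(prev, word) function is a hypothetical stand-in for whatever estimate the model provides, and the sum is kept in log space to avoid numerical underflow.

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """Compute the perplexity of a bigram model over a test set of tokenized sentences.

    bigram_prob(prev, word) is assumed to return P(word | prev).
    """
    log_prob_sum = 0.0
    n_tokens = 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob_sum += math.log(bigram_prob(prev, word))
        n_tokens += len(sentence) + 1  # count </s> but not <s> in N
    return math.exp(-log_prob_sum / n_tokens)
```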
Perplexity (Cont…)
• There is another way to think about perplexity: as the weighted
average branching factor of a language.
• The branching factor of a language is the number of possible next
words that can follow any word.
• Consider the task of recognizing the digits in English (zero, one,
two,..., nine), given that each of the 10 digits occurs with equal
probability P = 1/10.
• The perplexity of this mini-language is in fact 10.
• To see that, imagine a string of digits of length N.
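• Each such string has probability (1/10)^N, so by the definition above:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10$$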
Perplexity for Comparing Different N-gram Models
• We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the
Wall Street Journal, using a 19,979 word vocabulary.
• We then computed the perplexity of each of these models
on a test set of 1.5 million words, using the perplexity
equation above.
• The table below shows the perplexity of a 1.5 million word
WSJ test set according to each of these grammars.
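• The perplexities reported for this experiment in Jurafsky and Martin's Speech and Language Processing, the source these slides follow, are:
– Unigram: 962
– Bigram: 170
– Trigram: 109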
Perplexity for Comparing Different N-gram Models (Cont…)
• As we see above, the more information the N-
gram gives us about the word sequence, the
lower the perplexity.
• Note that in computing perplexities, the N-gram
model P must be constructed without any
knowledge of the test set or any prior knowledge
of the vocabulary of the test set.
• Any kind of knowledge of the test set can cause
the perplexity to be artificially low.
• The perplexity of two language models is only
comparable if they use identical vocabularies.
Generalization and Zeros
• Statistical models are likely to be pretty useless as
predictors if the training set and the test set are very
different from each other.
• How should we deal with this problem when we
build N-gram models?
• One way is to be sure to use a training corpus
that has a similar genre to whatever task we are
trying to accomplish.
• To build a language model for translating legal
documents, we need a training corpus of legal
documents.
Generalization and Zeros (Cont…)
• Matching genres is still not sufficient.
• Our models may still be subject to the problem of sparsity.
• For any N-gram that occurred a sufficient number of times,
we might have a good estimate of its probability.
• But because any corpus is limited, some perfectly
acceptable English word sequences are bound to be
missing from it.
• That is, we’ll have many cases of putative “zero probability
N-grams” that should really have some non-zero
probability.
• Consider the words that follow the bigram denied the in
the WSJ Treebank3 corpus, together with their counts:
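• In the textbook example these slides follow, the observed continuations and counts are:
– denied the allegations: 5
– denied the speculation: 2
– denied the rumors: 1
– denied the report: 1
• A test set, however, may well contain phrases such as denied the offer or denied the loan, which never occur in the training data and so would be assigned probability zero.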
Generalization and Zeros (Cont…)
• To build a language model for a question-
answering system, we need a training corpus
of questions.
The Zeros
• Zeros— things that don’t ever occur in the training set
but do occur in the test set—are a problem for two
reasons.
– First, their presence means we are underestimating the
probability of all sorts of words that might occur, which will
hurt the performance of any application we want to run on
this data.
– Second, if the probability of any word in the test set is 0,
the entire probability of the test set is 0.
• By definition, perplexity is based on the inverse probability of
the test set.
• Thus if some words have zero probability, we can’t compute
perplexity at all, since we can’t divide by 0!
Unknown Words
• The previous section discussed the problem of words
whose bigram probability is zero.
• But what about words we simply have never seen before?
• Closed Vocabulary
– Sometimes we have a language task in which this can’t happen
because we know all the words that can occur.
– In such a closed vocabulary system the test set can only contain
words from this lexicon, and there will be no unknown words.
– This is a reasonable assumption in some domains, such as
speech recognition or machine translation, where we have a
pronunciation dictionary or a phrase table that are fixed in
advance, and so the language model can only use the words in
that dictionary or phrase table.
Unknown Words (Cont…)
• Open Vocabulary
– In other cases we have to deal with words we
haven’t seen before, which we’ll call unknown
words, or out of vocabulary (OOV) words.
– The percentage of OOV words that appear in the
test set is called the OOV rate.
– An open vocabulary system is one in which we
model these potential unknown words in the test
set by adding a pseudo-word called <UNK>.
Train the Probabilities of Unknown
Words
• There are two common ways to train the
probabilities of the unknown word model <UNK>.
• 1st Method:
– Turn the problem back into a closed vocabulary one
by choosing a fixed vocabulary in advance:
– In the training set, convert any word that is not in this
vocabulary (i.e., any OOV word) to the unknown word token
<UNK> in a text normalization step.
– Estimate the probabilities for <UNK> from its counts
just like any other regular word in the training set.
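• A minimal sketch of this first method, assuming the fixed vocabulary is given as a Python set of word strings:

```python
def replace_oov(tokenized_sentences, vocabulary, unk="<UNK>"):
    """Text-normalization step: map every out-of-vocabulary token to <UNK>."""
    return [[w if w in vocabulary else unk for w in sentence]
            for sentence in tokenized_sentences]
```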
Train the Probabilities of Unknown
Words (Cont…)
• 2nd Method
– The second alternative, in situations where we don’t
have a prior vocabulary in advance, is to create such a
vocabulary implicitly, replacing words in the training
data by <UNK> based on their frequency.
– For example, we can replace with <UNK> all words that
occur fewer than n times in the training set, where n is
some small number, or, equivalently, select a
vocabulary size V in advance (say 50,000), keep the
top V words by frequency, and replace the rest with
<UNK>.
– In either case we then proceed to train the language
model as before, treating <UNK> like a regular word.
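• A sketch of this second method, reusing the replace_oov() helper above; the min_count and max_size parameters correspond to the threshold n and the vocabulary size V from the slide.

```python
from collections import Counter

def implicit_vocabulary(tokenized_sentences, min_count=None, max_size=None):
    """Build a vocabulary from training-set frequencies, either by a minimum
    count or by keeping the top-V most frequent words."""
    counts = Counter(w for sentence in tokenized_sentences for w in sentence)
    if min_count is not None:
        return {w for w, c in counts.items() if c >= min_count}
    if max_size is not None:
        return {w for w, _ in counts.most_common(max_size)}
    return set(counts)

# Example: keep words seen at least twice, map the rest to <UNK>,
# then train the language model on the result as usual.
# vocab = implicit_vocabulary(train_sentences, min_count=2)
# train_sentences = replace_oov(train_sentences, vocab)
```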
Train the Probabilities of Unknown
Words (Cont…)
• The exact choice of <UNK> model does
have an effect on metrics like perplexity.
• A language model can achieve low perplexity
by choosing a small vocabulary and assigning
the unknown word a high probability.
• For this reason, perplexities should only be
compared across language models with the
same vocabularies.