Artificial Intelligence
Language Models

Maryam Khordad
The University of Western Ontario
1
Natural Language Processing
• The idea behind NLP: To give computers
the ability to process human language.
• To get computers to perform useful tasks
involving human language:
– Dialogue systems
• Cleverbot

– Machine translation
– Question answering (Search engines)

2
Natural Language Processing
• Knowledge of language.
• An important piece of knowledge about a language: is a string a valid member of the language or not?
  – Word
  – Sentence
  – Phrase
  – …
3
Language Models
• Formal grammars (e.g. regular, context free)
give a hard “binary” model of the legal
sentences in a language.
• For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
• To specify a correct probability distribution,
the probability of all sentences in a
language must sum to 1.
4
Uses of Language Models
• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”

• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.

• Machine translation
– More likely sentences are probably better translations.

• Generation
– More likely sentences are probably better NL generations.

• Context-sensitive spelling correction
– “Their are problems wit this sentence.”
5
Completion Prediction
• A language model also supports predicting
the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______

• Predictive text input systems can guess what
you are typing and give choices on how to
complete it.

6
N-Gram Word Models
• We can have n-gram models over sequences of words,
characters, syllables or other units.
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)

• An N-gram model uses only N-1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)

• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N-1)-order Markov model.
7
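As a quick illustration of the "N-1 words of prior context" idea (not part of the original slides), here is a minimal Python sketch; the example sentence and the helper name `ngram_context` are assumptions for illustration only.

```python
# Sketch: the context an N-gram model conditions on is just the last N-1 words.
# The example sentence and helper name are illustrative, not from the slides.

def ngram_context(tokens, n):
    """Return the N-1 words of prior context used to predict the next word."""
    return tokens[-(n - 1):] if n > 1 else []

history = ["Please", "turn", "off", "your", "cell"]

print(ngram_context(history, 1))  # []                -> unigram: P(phone)
print(ngram_context(history, 2))  # ['cell']          -> bigram:  P(phone | cell)
print(ngram_context(history, 3))  # ['your', 'cell']  -> trigram: P(phone | your cell)
```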
N-Gram Model Formulas
• Word sequences:
  $w_1^n = w_1 \ldots w_n$

• Chain rule of probability:
  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
8
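The bigram approximation above can be turned into a few lines of code. The sketch below is illustrative only: the probability table `bigram_prob` and its numbers are made up, and the product is taken in log space to avoid numeric underflow.

```python
import math

# Sketch: scoring a sentence with the bigram approximation
#   P(w_1..w_n) ~= prod_k P(w_k | w_{k-1})
# using a small hand-made probability table (the numbers are made up).

bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "ate"): 0.10,
    ("ate", "a"): 0.30, ("a", "cherry"): 0.02, ("cherry", "</s>"): 0.40,
}

def bigram_sentence_logprob(words):
    tokens = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # tiny floor for unseen pairs; smoothing proper is covered later in the slides
        logp += math.log(bigram_prob.get((prev, cur), 1e-12))
    return logp

print(bigram_sentence_logprob(["i", "ate", "a", "cherry"]))
```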
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$

N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$

• To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
9
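A minimal sketch of relative-frequency (maximum-likelihood) bigram estimation on a toy corpus, with <s> and </s> appended to every sentence as described above. The corpus and function names are assumptions for illustration.

```python
from collections import Counter

# Sketch: MLE bigram estimates P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
# from a tiny made-up corpus.

corpus = [
    "i ate a cherry",
    "i ate a sandwich",
    "you ate a cherry",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("ate", "a"))     # 3/3 = 1.0
print(bigram_mle("a", "cherry"))  # 2/3
```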
N-gram character models
• One of the simplest language models: $P(c_1^N)$
• Language identification: given a text, determine which language it is written in.
• Build a trigram character model of each candidate language: $P(c_i \mid c_{i-2:i-1}, \ell)$
• We want to find the most probable language given the text:

  $\ell^* = \operatorname*{argmax}_{\ell} P(\ell \mid c_1^N) = \operatorname*{argmax}_{\ell} P(\ell)\, P(c_1^N \mid \ell) = \operatorname*{argmax}_{\ell} P(\ell) \prod_{i=1}^{N} P(c_i \mid c_{i-2:i-1}, \ell)$
10
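A rough sketch of language identification with character trigram models. The two "training corpora" are tiny made-up snippets, and for brevity the score uses smoothed trigram frequencies rather than the exact conditional P(c_i | c_{i-2:i-1}, l); it is not the slide's exact formulation.

```python
import math
from collections import Counter

# Sketch: pick the language whose character-trigram model gives the text
# the highest (smoothed) probability. Training samples are made up.

samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und dann schlaeft der hund",
}

def char_trigram_counts(text):
    padded = "  " + text  # two pad characters stand in for the start context
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

models = {lang: char_trigram_counts(text) for lang, text in samples.items()}

def score(text, lang):
    """Approximate log P(c_1..c_N | lang) with crude add-one smoothing over trigrams."""
    counts = models[lang]
    total = sum(counts.values())
    padded = "  " + text
    logp = 0.0
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        logp += math.log((counts[tri] + 1) / (total + 27 ** 3))  # ~27-character alphabet assumed
    return logp

text = "the lazy dog jumps"
print(max(models, key=lambda lang: score(text, lang)))  # expected: "en"
```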
Train and Test Corpora
• We call a body of text a corpus (plural corpora).
• A language model must be trained on a large
corpus of text to estimate good parameter values.
• Model can be evaluated based on its ability to
predict a high probability for a disjoint (held-out)
test corpus (testing on the training corpus would
give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be
representative of the actual application data.

11
Unknown Words
• How to handle words in the test corpus that
did not occur in the training data, i.e. out of
vocabulary (OOV) words?
• Train a model that includes an explicit
symbol for an unknown word (<UNK>).
– Choose a vocabulary in advance and replace
other words in the training corpus with
<UNK>.
– Replace the first occurrence of each word in the
training data with <UNK>.
12
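A small sketch of the first strategy above: fix a vocabulary in advance and map every out-of-vocabulary word to <UNK>. The toy corpus and the "top 2 words" vocabulary cutoff are illustrative assumptions only.

```python
from collections import Counter

# Sketch: choose a vocabulary in advance, replace everything else with <UNK>.

train = "the cat sat on the mat the cat slept".split()

# Keep only the 2 most frequent words (cutoff chosen only for illustration).
vocab = {w for w, _ in Counter(train).most_common(2)} | {"<UNK>"}

def map_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(map_oov(train, vocab))
print(map_oov("the dog sat".split(), vocab))  # "dog" and "sat" fall outside the vocabulary -> <UNK>
```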
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the
test corpus.
• Normalizes for the number of words in the test
corpus and takes the inverse.
  $\text{Perplexity}(W_1^N) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor
in predicting the next word (lower is better).

13
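A minimal sketch of the perplexity computation. The probability value below is made up purely to show the arithmetic; the log-space form is what is typically used in practice to avoid underflow.

```python
import math

# Sketch: Perplexity(W) = P(w_1..w_N) ** (-1/N), with a made-up probability.

def perplexity(sentence_prob, num_words):
    return sentence_prob ** (-1.0 / num_words)

print(perplexity(1e-8, 5))  # ~39.8

# Equivalent log-space form:
def perplexity_from_logprob(logprob, num_words):
    return math.exp(-logprob / num_words)

print(perplexity_from_logprob(math.log(1e-8), 5))  # same value
```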
Sample Perplexity Evaluation
• Models trained on 38 million words from
the Wall Street Journal (WSJ) using a
19,979 word vocabulary.
• Evaluate on a disjoint set of 1.5 million
WSJ words.
  Model:       Unigram   Bigram   Trigram
  Perplexity:      962      170       109
14
Smoothing
• Since there are a combinatorial number of possible
strings, many rare (but not impossible)
combinations never occur in training, so the
system incorrectly assigns zero to many
parameters.
• If a new combination occurs during testing, it is
given a probability of zero and the entire sequence
gets a probability of zero (i.e. infinite perplexity).
• Example:
– “--th”: very common
– “--ht”: uncommon, but what if we see this sentence: “This program issues an http request.”?
15
Smoothing
• In practice, parameters are smoothed to reassign
some probability mass to unseen events.
– Adding probability mass to unseen events requires
removing it from seen ones (discounting) in order to
maintain a joint distribution that sums to 1.

16
Laplace (Add-One) Smoothing
• The simplest type of smoothing.
• In the absence of further information, if a random Boolean variable X has been false in all n observations so far, then the estimate for P(X = true) should be 1/(n + 2).
• Laplace assumes that with two more trials, one might be true and one might be false.
• Performs relatively poorly.

17
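The slide states Laplace's rule for a Boolean variable; applied to n-gram models, add-one smoothing adds 1 to every count and the vocabulary size V to the denominator, of which 1/(n+2) is the V = 2 special case. A minimal sketch with made-up counts:

```python
# Sketch: add-one (Laplace) smoothed bigram estimate,
#   P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V),
# where V is the vocabulary size. All numbers below are illustrative.
# (The slide's 1/(n+2) rule is the V = 2 case with zero observed "true" outcomes.)

def laplace_bigram(bigram_count, context_count, vocab_size):
    return (bigram_count + 1) / (context_count + vocab_size)

V = 10_000
print(laplace_bigram(0, 500, V))   # unseen bigram: small but non-zero
print(laplace_bigram(50, 500, V))  # seen bigram: discounted relative to 50/500
```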
Advanced Smoothing
• Many advanced techniques have been
developed to improve smoothing for
language models.
– Interpolation
– Backoff

18
Model Combination
• As N increases, the power (expressiveness)
of an N-gram model increases, but the
ability to estimate accurate parameters from
sparse data decreases (i.e. the smoothing
problem gets worse).
• A general approach is to combine the results
of multiple N-gram models of increasing
complexity (i.e. increasing N).

19
Interpolation
• Linearly combine estimates of N-gram models of
increasing order.
Interpolated Trigram Model:

  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n)$

Where: $\sum_i \lambda_i = 1$

• The $\lambda_i$ can be fixed or can be trained.
• The $\lambda_i$ can depend on the counts: if the trigram count is high, weigh the trigram estimate relatively more; otherwise put more weight on the bigram and unigram estimates.
20
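A minimal sketch of the interpolated trigram estimate; the component probabilities and the lambda weights below are made-up illustrative values.

```python
# Sketch: P_hat(w_n | w_{n-2}, w_{n-1}) =
#   l3 * P(w_n | w_{n-2}, w_{n-1}) + l2 * P(w_n | w_{n-1}) + l1 * P(w_n),
# with the lambdas summing to 1. Weights and probabilities are made up.

def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# An unseen trigram (p_tri = 0) still gets mass from the bigram and unigram estimates:
print(interpolated_trigram(0.0, 0.02, 0.001))  # 0.0061
```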
Backoff
• Only use a lower-order model when data for the higher-order model is unavailable (i.e. its count is zero).
• Recursively back-off to weaker models until data
is available.

21
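A rough sketch in the spirit of back-off: use the trigram estimate when its count is non-zero, otherwise fall back to the bigram, then the unigram. Proper Katz back-off also redistributes discounted probability mass; this simplified version (closer to "stupid backoff") skips that, and all counts shown are made up.

```python
# Sketch of a simplified back-off scheme (not exact Katz back-off).

def backoff_prob(trigram, bigram_counts, trigram_counts, unigram_counts, total_words, alpha=0.4):
    w1, w2, w3 = trigram
    if trigram_counts.get((w1, w2, w3), 0) > 0:          # trigram seen: use it
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:               # back off to bigram
        return alpha * bigram_counts[(w2, w3)] / unigram_counts[w2]
    return alpha * alpha * unigram_counts.get(w3, 0) / total_words  # back off to unigram

trigram_counts = {("turn", "off", "your"): 2}
bigram_counts = {("turn", "off"): 2, ("off", "your"): 2, ("your", "cell"): 1}
unigram_counts = {"turn": 2, "off": 2, "your": 2, "cell": 1, "phone": 1}
total = sum(unigram_counts.values())

print(backoff_prob(("turn", "off", "your"), bigram_counts, trigram_counts, unigram_counts, total))  # 1.0
print(backoff_prob(("off", "your", "cell"), bigram_counts, trigram_counts, unigram_counts, total))  # 0.2 (bigram)
```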
Summary
• Language models assign a probability that a
sentence is a legal string in a language.
• They are useful as a component of many NLP
systems, such as ASR, OCR, and MT.
• Simple N-gram models are easy to train on
corpora and can provide useful estimates of
sentence likelihood.
• N-gram models give inaccurate parameter estimates when trained on sparse data.
• Smoothing techniques adjust parameter estimates
to account for unseen (but not impossible) events.
22
Questions

23
About this slide presentation
• These slides were developed using Raymond J. Mooney's slides from the University of Texas at Austin (http://www.cs.utexas.edu/~mooney/cs388/).

24


Editor's Notes

  • #4: What distinguishes language processing applications from other data processing systems is their use of knowledge of language.