DT2118
Speech and Speaker Recognition
Language Modelling
Giampiero Salvi
KTH/CSC/TMH giampi@kth.se
VT 2015
1 / 56
Outline
Introduction
Formal Language Theory
Stochastic Language Models (SLM)
N-gram Language Models
N-gram Smoothing
Class N-grams
Adaptive Language Models
Language Model Evaluation
2 / 56
Outline
Introduction
Formal Language Theory
Stochastic Language Models (SLM)
N-gram Language Models
N-gram Smoothing
Class N-grams
Adaptive Language Models
Language Model Evaluation
3 / 56
Components of ASR System
[Block diagram] Speech Signal → Spectral Analysis → Feature Extraction (Representation) → Search and Match (Decoder) → Recognised Words
Knowledge sources (Constraints / Knowledge): Acoustic Models, Lexical Models, Language Models (the focus of this lecture)
4 / 56
Why do we need language models?
Bayes’ rule:
P(words|sounds) = P(sounds|words) P(words) / P(sounds)
where
P(words): a priori probability of the words (Language Model)
We could use non-informative priors (P(words) = 1/N), but. . .
5 / 56
Branching Factor
• if we have N words in the dictionary
• at every word boundary we have to consider N equally likely alternatives
• N can be in the order of millions
[diagram: word → word1, word2, . . . , wordN]
6 / 56
Ambiguity
“ice cream” vs “I scream”
/aI s k ɹ i: m/
7 / 56
Language Models in ASR
We want to:
1. limit the branching factor in the recognition network
2. augment and complete the acoustic probabilities
• we are only interested in whether the sequence of words is grammatically plausible or not
• this kind of grammar is integrated into the recognition network prior to decoding
8 / 56
Language Models in Dialogue Systems
• we want to assign a class to each word (noun, verb, attribute. . . parts of speech)
• parsing is usually performed on the output of a speech recogniser
The grammar is used twice in a Dialogue System!!
9 / 56
Language Models in ASR
• small vocabulary: often a formal grammar specified by hand
  – example: loop of digits as in the HTK exercise
• large vocabulary: often a stochastic grammar estimated from data
10 / 56
Outline
Introduction
Formal Language Theory
Stochastic Language Models (SLM)
N-gram Language Models
N-gram Smoothing
Class N-grams
Adaptive Language Models
Language Model Evaluation
11 / 56
Formal Language Theory
grammar: formal specification of permissible structures for the language
parser: algorithm that can analyse a sentence and determine if its structure is compliant with the grammar
12 / 56
Chomsky’s formal grammar
Noam Chomsky: linguist, philosopher, . . .
G = (V , T, P, S)
where
V : set of non-terminal constituents
T: set of terminals (lexical items)
P: set of production rules
S: start symbol
13 / 56
Example
S = sentence
V = {NP (noun phrase), NP1, VP (verb phrase), NAME, ADJ, V (verb), N (noun)}
T = {Mary, person, loves, that, . . . }
P = {S → NP VP
     NP → NAME
     NP → ADJ NP1
     NP1 → N
     VP → V NP
     NAME → Mary
     V → loves
     N → person
     ADJ → that}
Parse tree for “Mary loves that person”:
[S [NP [NAME Mary]] [VP [V loves] [NP [ADJ that] [NP1 [N person]]]]]
14 / 56
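A concrete illustration (added to these notes, not part of the original slides): the grammar G = (V, T, P, S) above can be written down directly as plain Python data structures.

```python
# The example grammar as plain Python data structures (illustrative sketch).
V = {"S", "NP", "NP1", "VP", "NAME", "ADJ", "V", "N"}   # non-terminal constituents
T = {"Mary", "person", "loves", "that"}                 # terminals (lexical items)
P = [                                                   # production rules
    ("S",    ["NP", "VP"]),
    ("NP",   ["NAME"]),
    ("NP",   ["ADJ", "NP1"]),
    ("NP1",  ["N"]),
    ("VP",   ["V", "NP"]),
    ("NAME", ["Mary"]),
    ("V",    ["loves"]),
    ("N",    ["person"]),
    ("ADJ",  ["that"]),
]
S = "S"                                                 # start symbol
```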
Chomsky’s hierarchy
Greek letters: sequence of terminals or non-terminals
Upper-case Latin letters: single non-terminal
Lower-case Latin letters: single terminal

Type                        Constraints                              Automaton
Phrase structure grammar    α → β (the most general grammar)         Turing machine
Context-sensitive grammar   length of α ≤ length of β                Linear bounded
Context-free grammar        A → β (equivalent to A → w, A → BC)      Push-down
Regular grammar             A → w, A → wB                            Finite-state

Context-free and regular grammars are used in practice
15 / 56
Are languages context-free?
Mostly true, with exceptions
Swiss German:
“. . . das mer d’chind em Hans es huus lönd häfte
aastriiche”
Word-by-word:
“. . . that we the children Hans the house let help
paint”
Translation:
“. . . that we let the children help Hans paint the
house”
16 / 56
Parsers
• assign each word in a sentence to a part of speech
• originally developed for programming languages (no ambiguities)
• only available for context-free and regular grammars
• top-down: start with S and generate rules until you reach the words (terminal symbols)
• bottom-up: start with the words and work your way up until you reach S
17 / 56
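A small sketch contrasting the two strategies, assuming NLTK's textbook parsers (RecursiveDescentParser works top-down from S, ShiftReduceParser works bottom-up from the words); the toolkit choice is an illustration, not something prescribed by the course.

```python
# Top-down vs bottom-up parsing of the slide grammar (sketch using NLTK).
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> NAME | ADJ NP1
    NP1 -> N
    VP -> V NP
    NAME -> 'Mary'
    V -> 'loves'
    N -> 'person'
    ADJ -> 'that'
""")

sentence = "Mary loves that person".split()

top_down = nltk.RecursiveDescentParser(grammar)   # expands rules starting from S
bottom_up = nltk.ShiftReduceParser(grammar)       # shifts words, reduces towards S

print(next(top_down.parse(sentence)))             # (S (NP (NAME Mary)) (VP ...))
print(next(bottom_up.parse(sentence)))            # same tree, built in the other direction
```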
Example: Top-down parser
Parts of speech           Rules
S
NP VP                     S → NP VP
NAME VP                   NP → NAME
Mary VP                   NAME → Mary
Mary V NP                 VP → V NP
Mary loves NP             V → loves
Mary loves ADJ NP1        NP → ADJ NP1
Mary loves that NP1       ADJ → that
Mary loves that N         NP1 → N
Mary loves that person    N → person
18 / 56
Example: Bottom-up parser
Parts of speech           Rules
Mary loves that person
NAME loves that person    NAME → Mary
NAME V that person        V → loves
NAME V ADJ person         ADJ → that
NAME V ADJ N              N → person
NP V ADJ N                NP → NAME
NP V ADJ NP1              NP1 → N
NP V NP                   NP → ADJ NP1
NP VP                     VP → V NP
S                         S → NP VP
19 / 56
Top-down vs bottom-up parsers
• Top-down characteristics:
  + very predictive
  + only consider grammatical combinations
  – predict constituents that do not have a match in the text
• Bottom-up characteristics:
  + check input text only once
  + suitable for robust language processing
  – may build trees that do not lead to a full parse
• All in all, similar performance
20 / 56
Chart parsing (dynamic programming)
Chart for “Mary loves that person” (° marks how far a rule has been matched):
Completed edges: NAME → Mary, NP → NAME, V → loves, ADJ → that, N → person, NP1 → N, NP → ADJ NP1, VP → V NP, S → NP VP
Active edges: S → NP ° VP, VP → V ° NP, NP → ADJ ° NP1
21 / 56
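The chart above can be filled in with a few lines of dynamic programming. The sketch below (written for these notes, not the course implementation) is a CKY-style recogniser for the example grammar; unit rules such as NP → NAME are handled with a small closure step.

```python
# CKY-style bottom-up chart parsing (recogniser only) for the slide grammar.
from itertools import product

LEXICON = {"Mary": {"NAME"}, "loves": {"V"}, "that": {"ADJ"}, "person": {"N"}}
UNARY = {"NAME": {"NP"}, "N": {"NP1"}}                    # A -> B rules
BINARY = {("NP", "VP"): "S", ("ADJ", "NP1"): "NP", ("V", "NP"): "VP"}

def unit_closure(symbols):
    """Add left-hand sides reachable through the unary rules."""
    closed, frontier = set(symbols), list(symbols)
    while frontier:
        for parent in UNARY.get(frontier.pop(), ()):
            if parent not in closed:
                closed.add(parent)
                frontier.append(parent)
    return closed

def cky(words):
    n = len(words)
    # chart[i][j] holds the non-terminals that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = unit_closure(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    if (b, c) in BINARY:
                        chart[i][j] |= unit_closure({BINARY[(b, c)]})
    return "S" in chart[0][n]

print(cky("Mary loves that person".split()))   # True
```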
Outline
Introduction
Formal Language Theory
Stochastic Language Models (SLM)
N-gram Language Models
N-gram Smoothing
Class N-grams
Adaptive Language Models
Language Model Evaluation
22 / 56
Stochastic Language Models (SLM)
1. formal grammars lack coverage (for general domains)
2. spoken language does not strictly follow the grammar
Model sequences of words statistically:
P(W) = P(w1 w2 . . . wn)
23 / 56
Probabilistic Context-free grammars
(PCFGs)
Assign probabilities to generative rules:
P(A → α|G)
Then calculate probability of generating a word
sequence w1w2 . . . wn as probability of the rules
necessary to go from S to w1w2 . . . wn:
P(S ⇒ w1w2 . . . wn|G)
24 / 56
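A sketch of the same idea with NLTK's probabilistic CFG support; the rule probabilities below are invented for illustration.

```python
# PCFG version of the example grammar, parsed with the Viterbi (most
# probable parse) parser. Probabilities are made up for this sketch.
import nltk

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP      [1.0]
    NP -> NAME      [0.5]
    NP -> ADJ NP1   [0.5]
    NP1 -> N        [1.0]
    VP -> V NP      [1.0]
    NAME -> 'Mary'  [1.0]
    V -> 'loves'    [1.0]
    N -> 'person'   [1.0]
    ADJ -> 'that'   [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("Mary loves that person".split()):
    print(tree, tree.prob())   # product of the rule probabilities = 0.25
```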
Training PCFGs
If the corpus is annotated, use the Maximum Likelihood estimate:
P(A → αj) = C(A → αj) / Σ_{i=1..m} C(A → αi)
If the corpus is not annotated: inside-outside algorithm (similar to HMM training with forward-backward)
25 / 56
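For the annotated case, the Maximum Likelihood estimate is just counting rule occurrences and normalising per left-hand side. A minimal sketch with invented counts:

```python
# ML estimation of rule probabilities from (invented) treebank rule counts:
# P(A -> alpha_j) = C(A -> alpha_j) / sum_i C(A -> alpha_i)
from collections import Counter, defaultdict

rule_counts = Counter({
    ("NP", ("NAME",)): 120,
    ("NP", ("ADJ", "NP1")): 80,
    ("VP", ("V", "NP")): 200,
})

lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}
print(rule_probs[("NP", ("NAME",))])   # 120 / (120 + 80) = 0.6
```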
Independence assumption
[Figure: the parse tree of “Mary loves that person” from slide 14, illustrating that each production rule is applied independently of the rest of the tree]
26 / 56
Inside-outside probabilities
Chomsky normal form: Ai → Am An or Ai → wl
inside(s, Ai, t) = P(Ai ⇒ ws ws+1 . . . wt)
outside(s, Ai, t) = P(S ⇒ w1 . . . ws−1 Ai wt+1 . . . wT)
[Figure: Ai spans the words ws . . . wt inside the derivation of the full string w1 . . . wT from S]
27 / 56
Probabilistic Context-free grammars: limitations
• probabilities help rank alternative explanations, but
• there is still a coverage problem: the production rules are hand-made
P(A → α|G)
28 / 56
N-gram Language Models
Flat model: no hierarchical structure
P(W) = P(w1, w2, . . . , wn)
     = P(w1) P(w2|w1) P(w3|w1, w2) · · · P(wn|w1, w2, . . . , wn−1)
     = ∏_{i=1..n} P(wi | w1, w2, . . . , wi−1)
Approximations:
P(wi | w1, w2, . . . , wi−1) ≈ P(wi)                          (Unigram)
P(wi | w1, w2, . . . , wi−1) ≈ P(wi | wi−1)                   (Bigram)
P(wi | w1, w2, . . . , wi−1) ≈ P(wi | wi−2, wi−1)             (Trigram)
P(wi | w1, w2, . . . , wi−1) ≈ P(wi | wi−N+1, . . . , wi−1)   (N-gram)
29 / 56
Example (Bigram)
P(Mary, loves, that, person) =
P(Mary|<s>)P(loves|Mary)P(that|loves)
P(person|that)P(</s>|person)
30 / 56
N-gram estimation (Maximum Likelihood)
P(wi | wi−N+1, . . . , wi−1) = C(wi−N+1, . . . , wi−1, wi) / C(wi−N+1, . . . , wi−1)
                             = C(wi−N+1, . . . , wi−1, wi) / Σ_{wi} C(wi−N+1, . . . , wi−1, wi)
(the numerator is an N-gram count, the denominator an (N−1)-gram count)
Problem: data sparseness
31 / 56
N-gram estimation example
Corpus:
1: John read her book
2: I read a different book
3: John read a book by Mulan
P(John|<s>) = C(<s>, John) / C(<s>) = 2/3
P(read|John) = C(John, read) / C(John) = 2/2
P(a|read) = C(read, a) / C(read) = 2/3
P(book|a) = C(a, book) / C(a) = 1/2
P(</s>|book) = C(book, </s>) / C(book) = 2/3
P(John, read, a, book) = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) = 0.148
P(Mulan, read, a, book) = P(Mulan|<s>) · · · = 0
32 / 56
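The counts and the final product on this slide can be checked with a few lines of Python (a sketch written for these notes):

```python
# Bigram ML estimation on the three-sentence toy corpus from the slide.
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])                    # history counts C(w_{i-1})
    bigrams.update(zip(words[:-1], words[1:]))     # bigram counts C(w_{i-1}, w_i)

def p_ml(word, history):
    return bigrams[(history, word)] / unigrams[history]

sent = ["<s>", "John", "read", "a", "book", "</s>"]
p = 1.0
for history, word in zip(sent[:-1], sent[1:]):
    p *= p_ml(word, history)
print(round(p, 3))   # 0.148, as on the slide; a sentence starting with Mulan gets 0
```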
N-gram Smoothing
Problem:
• many perfectly possible word sequences are observed rarely or not at all in the training data
• this leads to zero or extremely low probabilities, effectively ruling out those word sequences no matter how strong the acoustic evidence is
Solution: smoothing
• produce more robust probabilities for unseen data, at the cost of modelling the training data slightly worse
33 / 56
Simplest Smoothing technique
Instead of the ML estimate
P(wi | wi−N+1, . . . , wi−1) = C(wi−N+1, . . . , wi−1, wi) / Σ_{wi} C(wi−N+1, . . . , wi−1, wi)
use
P(wi | wi−N+1, . . . , wi−1) = (1 + C(wi−N+1, . . . , wi−1, wi)) / Σ_{wi} (1 + C(wi−N+1, . . . , wi−1, wi))
• prevents zero probabilities
• but still gives very low probabilities
34 / 56
N-gram simple smoothing example
Corpus:
1: John read her book
2: I read a different book
3: John read a book by Mulan
P(John|<s>) = (1 + C(<s>, John)) / (11 + C(<s>)) = 3/14
P(read|John) = (1 + C(John, read)) / (11 + C(John)) = 3/13
. . .
P(Mulan|<s>) = (1 + C(<s>, Mulan)) / (11 + C(<s>)) = 1/14
P(John, read, a, book) = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) = 0.00035 (was 0.148 unsmoothed)
P(Mulan, read, a, book) = P(Mulan|<s>) P(read|Mulan) P(a|read) P(book|a) P(</s>|book) = 0.000084 (was 0 unsmoothed)
35 / 56
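The same computation with add-one smoothing; the vocabulary size 11 matches the denominators used on the slide.

```python
# Add-one (Laplace) smoothed bigram probabilities on the toy corpus.
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

V = 11   # vocabulary size used on the slide

def p_add_one(word, history):
    return (1 + bigrams[(history, word)]) / (V + unigrams[history])

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for history, word in zip(words[:-1], words[1:]):
        p *= p_add_one(word, history)
    return p

print(sentence_prob("John read a book"))    # ~0.00035, as on the slide
print(sentence_prob("Mulan read a book"))   # small, but no longer zero
```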
Interpolation vs Backoff smoothing
Interpolation models:
• linear combination with lower-order n-grams
• modifies the probabilities of both nonzero- and zero-count n-grams
Backoff models:
• use lower-order n-grams when the requested n-gram has a zero or very low count in the training data
• nonzero-count n-grams are unchanged
• discounting: reduce the probability of seen n-grams and distribute the freed mass among unseen ones
36 / 56
Interpolation vs Backoff smoothing
Interpolation models (order N mixed with order N−1):
Psmooth(wi | wi−N+1, . . . , wi−1) = λ PML(wi | wi−N+1, . . . , wi−1) + (1 − λ) Psmooth(wi | wi−N+2, . . . , wi−1)
Backoff models:
Psmooth(wi | wi−N+1, . . . , wi−1) =
   α P(wi | wi−N+1, . . . , wi−1)          if C(wi−N+1, . . . , wi−1, wi) > 0
   γ Psmooth(wi | wi−N+2, . . . , wi−1)    if C(wi−N+1, . . . , wi−1, wi) = 0
37 / 56
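A minimal sketch of the two formulas for a bigram model. The weights lam, alpha and gamma are placeholders: in a real model lam is estimated (e.g. by deleted interpolation, next slide) and alpha, gamma must be chosen so that each conditional distribution sums to one.

```python
# Interpolation vs backoff for a bigram model (illustrative sketch).
# `bigrams` and `unigrams` are count dictionaries, `total` the token count.
def p_interpolated(w, h, bigrams, unigrams, total, lam=0.7):
    p_bigram = bigrams.get((h, w), 0) / unigrams[h] if unigrams.get(h) else 0.0
    p_unigram = unigrams.get(w, 0) / total
    return lam * p_bigram + (1 - lam) * p_unigram   # changes every probability

def p_backoff(w, h, bigrams, unigrams, total, alpha=1.0, gamma=0.4):
    if bigrams.get((h, w), 0) > 0:                  # seen: keep the (scaled) ML estimate
        return alpha * bigrams[(h, w)] / unigrams[h]
    return gamma * unigrams.get(w, 0) / total       # unseen: back off to the unigram
```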
Deleted interpolation smoothing
Recursively interpolate with n-grams of lower order. With historyn = wi−n+1, . . . , wi−1:
PI(wi | historyn) = λ_historyn P(wi | historyn) + (1 − λ_historyn) PI(wi | historyn−1)
• hard to estimate λ_historyn for every history
• instead, cluster the histories into a moderate number of weights
38 / 56
Backoff smoothing
Use P(wi|historyn−1) only if you lack data for
P(wi|historyn)
39 / 56
Good-Turing estimate
• partition n-grams into groups depending on their frequency in the training data
• change the number of occurrences of an n-gram according to
  r∗ = (r + 1) n_{r+1} / n_r
  where r is the occurrence count and n_r is the number of n-grams that occur r times
40 / 56
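A sketch of the adjusted-count formula on a toy table of bigram counts (the numbers are invented for illustration):

```python
# Good-Turing adjusted counts: r* = (r + 1) * n_{r+1} / n_r
from collections import Counter

bigram_counts = Counter({("John", "read"): 2, ("read", "a"): 1,
                         ("a", "book"): 1, ("read", "her"): 1,
                         ("book", "by"): 1})

n = Counter(bigram_counts.values())   # n[r] = number of bigrams seen exactly r times

def good_turing(r):
    # undefined when n[r] or n[r+1] is zero; real implementations smooth the n[r] curve
    return (r + 1) * n[r + 1] / n[r]

print(good_turing(1))   # bigrams seen once are discounted to 2 * n2 / n1 = 2 * 1 / 4 = 0.5
```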
Katz smoothing
Based on Good-Turing: combine higher- and lower-order n-grams.
For every N-gram:
1. if the count r is large (> 5 or 8), do not change it
2. if the count r is small but non-zero, discount it to ≈ r∗
3. if the count r = 0, reassign the discounted counts using the lower-order N-gram:
   C∗(wi−1, wi) = α(wi−1) P(wi)
41 / 56
Kneser-Ney smoothing: motivation
Background
• Lower-order n-grams are often used as a backoff model if the count of a higher-order n-gram is too low (e.g. unigram instead of bigram)
Problem
• Some words with relatively high unigram probability only occur in a few bigrams, e.g. Francisco, which is mainly found in San Francisco. However, infrequent word pairs, such as New Francisco, will be given too high a probability if the unigram probabilities of New and Francisco are used. Maybe instead the Francisco unigram should have a lower value, to prevent it from occurring in other contexts.
“I can’t see without my reading. . . ”
42 / 56
Kneser-Ney intuition
If a word has been seen in many contexts, it is more likely to be seen in new contexts as well.
• instead of backing off to the lower-order n-gram, use the continuation probability
Example: instead of the unigram P(wi), use
PCONTINUATION(wi) = |{wi−1 : C(wi−1 wi) > 0}| / Σ_{wi} |{wi−1 : C(wi−1 wi) > 0}|
“I can’t see without my reading. . . glasses”
43 / 56
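A sketch of the continuation count on a toy corpus (the corpus is invented; the normalisation runs over all word types, as in the formula above):

```python
# Kneser-Ney continuation probability: count distinct left contexts per word.
from collections import defaultdict

corpus = ["san francisco is in california",
          "new york is on the east coast",
          "i left my heart in san francisco"]

left_contexts = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for prev, word in zip(words[:-1], words[1:]):
        left_contexts[word].add(prev)

total = sum(len(ctx) for ctx in left_contexts.values())

def p_continuation(word):
    return len(left_contexts[word]) / total

# "francisco" and "is" occur equally often, but "francisco" only ever follows
# "san", so its continuation probability is half that of "is".
print(p_continuation("francisco"), p_continuation("is"))
```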
Class N-grams
1. group words into semantic or grammatical classes
2. build n-grams for class sequences:
   P(wi | ci−N+1 . . . ci−1) = P(wi | ci) P(ci | ci−N+1 . . . ci−1)
• rapid adaptation, small training sets, small models
• works on limited domains
• classes can be rule-based or data-driven
44 / 56
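A sketch of the class-bigram factorisation with hand-assigned classes and made-up probabilities:

```python
# Class bigram: P(w_i | c_{i-1}) = P(w_i | c_i) * P(c_i | c_{i-1})
word_class = {"monday": "DAY", "tuesday": "DAY", "three": "NUM", "four": "NUM"}

p_word_given_class = {("monday", "DAY"): 0.5, ("tuesday", "DAY"): 0.5,
                      ("three", "NUM"): 0.5, ("four", "NUM"): 0.5}
p_class_given_class = {("MEETING", "DAY"): 0.6, ("MEETING", "NUM"): 0.4}

def p_class_bigram(word, previous_class):
    c = word_class[word]
    return p_word_given_class[(word, c)] * p_class_given_class[(previous_class, c)]

print(p_class_bigram("monday", "MEETING"))   # 0.5 * 0.6 = 0.3
```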
Combining PCFGs and N-grams
Only N-grams:
Meeting at three with Zhou Li
Meeting at four PM with Derek
P(Zhou|three, with) and P(Derek|PM, with)
N-grams + CFGs:
Meeting {at three: TIME} with {Zhou Li: NAME}
Meeting {at four PM: TIME} with {Derek: NAME}
P(NAME|TIME, with)
45 / 56
Adaptive Language Models
• the conversational topic is not stationary
• but the topic is roughly stationary over some period of time
• build more specialised models that can adapt over time
Techniques:
• Cache Language Models
• Topic-Adaptive Models
• Maximum Entropy Models
46 / 56
Cache Language Models
1. build a full static n-gram model
2. during the conversation, accumulate low-order n-grams
3. interpolate between 1 and 2
47 / 56
Topic-Adaptive Models
1. cluster documents into topics (manually or
data-driven)
2. use information retrieval techniques with
current recognition output to select the right
cluster
3. if running off-line, run the recognition in several passes
48 / 56
Maximum Entropy Models
Instead of a linear combination:
1. reformulate the information sources into constraints
2. choose the maximum entropy distribution that satisfies the constraints
General form of the constraints:
Σ_X P(X) fi(X) = Ei
Example: unigram feature
f_wi(w) = 1 if w = wi, 0 otherwise
49 / 56
Outline
Introduction
Formal Language Theory
Stochastic Language Models (SLM)
N-gram Language Models
N-gram Smoothing
Class N-grams
Adaptive Language Models
Language Model Evaluation
50 / 56
Language Model Evaluation
• evaluation in combination with a Speech Recogniser
  – hard to separate the contributions of the two
• evaluation based on the probabilities assigned to text in the training and test sets
51 / 56
Information, Entropy, Perplexity
Information:
I(xi) = log2 (1 / P(xi))
Entropy:
H(X) = E[I(X)] = − Σ_i P(xi) log2 P(xi)
Perplexity:
PP(X) = 2^H(X)
52 / 56
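The three quantities in a few lines of Python, for a toy distribution:

```python
# Information, entropy and perplexity of a small discrete distribution.
import math

P = {"a": 0.5, "b": 0.25, "c": 0.25}

def information(x):                  # I(x) = log2 1/P(x)
    return math.log2(1.0 / P[x])

def entropy(dist):                   # H(X) = -sum_x P(x) log2 P(x)
    return -sum(p * math.log2(p) for p in dist.values())

H = entropy(P)
print(information("c"))              # 2.0 bits
print(H, 2 ** H)                     # entropy 1.5 bits, perplexity about 2.83
```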
Perplexity of a model
We do not know the “true” distribution p(w1, . . . , wn), but we have a model m(w1, . . . , wn). The cross-entropy is:
H(p, m) = − Σ_{w1,...,wn} p(w1, . . . , wn) log2 m(w1, . . . , wn)
The cross-entropy is an upper bound on the entropy: H ≤ H(p, m)
The better the model, the lower the cross-entropy and the lower the perplexity (on the same data)
53 / 56
Test-set Perplexity
Estimate the distribution p(w1, . . . , wn) on the training data.
Evaluate it on the test data of NW words:
H = − (1/NW) log2 p(w1, . . . , wNW),    w1, . . . , wNW ∈ test set
PP = 2^H
54 / 56
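A sketch of the per-word test-set perplexity for a bigram model; p_bigram stands for any smoothed model estimated on the training data (smoothing is needed so that no test bigram gets probability zero):

```python
# Test-set perplexity of a bigram model: PP = 2^H with
# H = -(1/N) * sum over the test words of log2 P(w_i | w_{i-1}).
import math

def test_set_perplexity(test_sentences, p_bigram):
    """p_bigram(word, history) must return a non-zero smoothed probability."""
    log_prob, n_words = 0.0, 0
    for sentence in test_sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for history, word in zip(words[:-1], words[1:]):
            log_prob += math.log2(p_bigram(word, history))
            n_words += 1
    return 2 ** (-log_prob / n_words)

# e.g. test_set_perplexity(["Mulan read a book"], p_add_one) with the
# smoothed toy model sketched earlier.
```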
Perplexity and branching factor
Perplexity is roughly the geometric mean of the branching factor
[diagram: word → word1, word2, . . . , wordN]
Shannon: 2.39 for English letters and 130 for English words
Digit strings: 10
N-gram English: 50–1000
Wall Street Journal test set: 180 (bigram), 91 (trigram)
55 / 56
Performance of N-grams
Model                  Perplexity   Word Error Rate
Unigram Katz           1196.45      14.85%
Unigram Kneser-Ney     1199.59      14.86%
Bigram Katz            176.31       11.38%
Bigram Kneser-Ney      176.11       11.34%
Trigram Katz           95.19        9.69%
Trigram Kneser-Ney     91.47        9.60%
Wall Street Journal database; dictionary: 60 000 words; training set: 260 000 000 words
56 / 56
