600.465 - Intro to NLP - J. Eisner 1
The Expectation
Maximization (EM) Algorithm
… continued!
600.465 - Intro to NLP - J. Eisner 2
Repeat until convergence!
General Idea
 Start by devising a noisy channel
 Any model that predicts the corpus observations via
some hidden structure (tags, parses, …)
 Initially guess the parameters of the model!
 Educated guess is best, but random can work
 Expectation step: Use current parameters (and
observations) to reconstruct hidden structure
 Maximization step: Use that hidden structure
(and observations) to reestimate parameters
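A minimal sketch of this loop in Python, assuming hypothetical e_step and m_step callables supplied by whatever model is being trained; the convergence test and argument names are illustrative, not something defined in these slides:

def em(observations, init_params, e_step, m_step, tol=1e-6, max_iters=100):
    # Alternate E and M steps until the corpus log-likelihood stops improving.
    params = init_params                           # initial guess (educated or random)
    old_ll = float("-inf")
    for _ in range(max_iters):
        counts, ll = e_step(observations, params)  # expected counts of the hidden structure
        params = m_step(counts)                    # reestimate probabilities from those counts
        if ll - old_ll < tol:                      # p(training corpus) never decreases under true EM
            break
        old_ll = ll
    return params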
600.465 - Intro to NLP - J. Eisner 3
General Idea
[diagram: an initial guess seeds the "Guess of unknown parameters (probabilities)"; the E step combines that guess with the Observed structure (words, ice cream) to produce a "Guess of unknown hidden structure (tags, parses, weather)"; the M step uses that hidden structure (and the observations) to update the parameter guess]
600.465 - Intro to NLP - J. Eisner 4
For Hidden Markov Models
[the same E step / M step loop diagram: initial guess → parameters (probabilities); E step: parameters + observed words/ice cream → hidden tags/parses/weather; M step: hidden structure + observations → reestimated parameters]
600.465 - Intro to NLP - J. Eisner 5
For Hidden Markov Models
[the same E step / M step loop diagram as above]
600.465 - Intro to NLP - J. Eisner 6
For Hidden Markov Models
[the same E step / M step loop diagram as above]
600.465 - Intro to NLP - J. Eisner 7
Grammar Reestimation
[diagram: a LEARNER turns training trees (expensive and/or wrong sublanguage) into a Grammar; a PARSER uses that grammar to parse test sentences (cheap, plentiful and appropriate); a scorer compares the parser's output against correct test trees to measure accuracy. In the EM loop, parsing the cheap raw sentences is the E step and retraining the grammar on the resulting trees is the M step.]
600.465 - Intro to NLP - J. Eisner 8
EM by Dynamic Programming:
Two Versions
 The Viterbi approximation
 Expectation: pick the best parse of each sentence
 Maximization: retrain on this best-parsed corpus
 Advantage: Speed!
 Real EM
 Expectation: find all parses of each sentence
 Maximization: retrain on all parses in proportion to
their probability (as if we observed fractional count)
 Advantage: p(training corpus) guaranteed to increase
 Exponentially many parses, so don’t extract them
from chart – need some kind of clever counting
600.465 - Intro to NLP - J. Eisner 9
Examples of EM
 Finite-State case: Hidden Markov Models
 “forward-backward” or “Baum-Welch” algorithm
 Applications:
 explain ice cream in terms of underlying weather sequence
 explain words in terms of underlying tag sequence
 explain phoneme sequence in terms of underlying word
 explain sound sequence in terms of underlying phoneme
 Context-Free case: Probabilistic CFGs
 “inside-outside” algorithm: unsupervised grammar learning!
 Explain raw text in terms of underlying context-free parse
 In practice, local maximum problem gets in the way
 But can improve a good starting grammar via raw text
 (pretraining!)
 Clustering case: explain points via clusters
compose
these?
600.465 - Intro to NLP - J. Eisner 10
Our old friend PCFG
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
 Start with a “pretty good” grammar
 E.g., it was trained on supervised data (a treebank) that is
small, imperfectly annotated, or has sentences in a different
style from what you want to parse.
 Parse a corpus of unparsed sentences:
 Reestimate:
 Collect counts: …; c(S → NP VP) += 12; c(S) += 2*12; …
 Divide: p(S → NP VP | S) = c(S → NP VP) / c(S)
 May be wise to smooth
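A small sketch of this count-and-divide reestimation in Python; the input format (a list of (rules, copies) pairs taken from each sentence's single best parse) is an assumption made for illustration:

from collections import Counter

def reestimate(parsed_corpus):
    # Viterbi-EM style "count and divide" M step (a sketch): parsed_corpus is a list of
    # (rules, copies) pairs, where rules lists the (lhs, rhs) rule tokens of the best parse.
    rule_count, lhs_count = Counter(), Counter()
    for rules, copies in parsed_corpus:
        for lhs, rhs in rules:
            rule_count[(lhs, rhs)] += copies   # e.g. c(S -> NP VP) += 12
            lhs_count[lhs] += copies           # e.g. c(S) += 2*12 if the tree has two S rules
    # divide (smoothing these counts first may be wise)
    return {(lhs, rhs): c / lhs_count[lhs] for (lhs, rhs), c in rule_count.items()}

# example: the tree (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))), 12 copies
corpus = [([("S", ("AdvP", "S")), ("S", ("NP", "VP")), ("AdvP", ("Today",)),
            ("NP", ("stocks",)), ("VP", ("V", "PRT")), ("V", ("were",)), ("PRT", ("up",))], 12)]
print(reestimate(corpus)[("S", ("NP", "VP"))])   # 0.5, since c(S -> NP VP) = 12 and c(S) = 24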
600.465 - Intro to NLP - J. Eisner 11
Viterbi reestimation for parsing
…
12 Today stocks were up
…
[the best parse of this sentence, (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))), is counted 12 times, the # of copies of this sentence in the corpus]
 Similar, but now we consider all parses of each sentence
 Parse our corpus of unparsed sentences:
 Collect counts fractionally:
 …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
 …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
600.465 - Intro to NLP - J. Eisner
True EM for parsing
…
12 Today stocks were up
…
[the 12 copies of this sentence in the corpus are shared among its parses: (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) gets expected count 10.8 and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) gets expected count 1.2]
Where are the constituents?
13
coal energy expert witness
p=0.5
Where are the constituents?
14
coal energy expert witness
p=0.1
Where are the constituents?
15
coal energy expert witness
p=0.1
Where are the constituents?
16
coal energy expert witness
p=0.1
Where are the constituents?
17
coal energy expert witness
p=0.2
Where are the constituents?
18
coal energy expert witness
0.5+0.1+0.1+0.1+0.2 = 1
Where are NPs, VPs, … ?
19
[two panels over "Time flies like an arrow", marking NP locations and VP locations; an example parse contributes S, NP, VP, NP, PP constituents over the tags V P Det N]
Where are NPs, VPs, … ?
20
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Where are NPs, VPs, … ?
21
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.3
(S (NP Time flies) (VP like (NP an arrow)))
Where are NPs, VPs, … ?
22
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Where are NPs, VPs, … ?
23
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Where are NPs, VPs, … ?
24
"Time flies like an arrow" (NP locations / VP locations panels)
0.5+0.3+0.1+0.1 = 1
How many NPs, VPs, … in all?
25
"Time flies like an arrow" (NP locations / VP locations panels)
0.5+0.3+0.1+0.1 = 1
How many NPs, VPs, … in all?
26
"Time flies like an arrow" (NP locations / VP locations panels)
2.1 NPs (expected)
1.1 VPs (expected)
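A tiny sketch of how those expected counts arise, assuming we could enumerate each sentence's parses with their probabilities (the chart-based version that avoids enumeration comes later); the span indices below are one illustrative reading of the bracketings above:

from collections import Counter

def expected_constituent_counts(parses):
    # parses: list of (probability, spans), where spans lists the (label, start, end)
    # constituents of that parse; the probabilities of all parses of the sentence sum to 1.
    expected = Counter()
    for prob, spans in parses:
        for label, start, end in spans:
            expected[label] += prob        # each parse contributes fractionally
    return expected

# the four parses of "Time flies like an arrow" from the slides, listing only NP and VP spans
parses = [
    (0.5, [("NP", 0, 1), ("NP", 3, 5), ("VP", 1, 5)]),
    (0.3, [("NP", 0, 2), ("NP", 3, 5), ("VP", 2, 5)]),
    (0.1, [("NP", 1, 5), ("NP", 1, 2), ("NP", 3, 5), ("VP", 0, 5)]),
    (0.1, [("NP", 1, 2), ("NP", 3, 5), ("VP", 0, 2), ("VP", 0, 5)]),
]
print(expected_constituent_counts(parses))   # roughly 2.1 expected NPs, 1.1 expected VPs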
Where did the rules apply?
27
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
Where did the rules apply?
28
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Where is S → NP VP substructure?
29
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.3
(S (NP Time flies) (VP like (NP an arrow)))
Where is S → NP VP substructure?
30
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Where is S → NP VP substructure?
31
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Why do we want this info?
 Grammar reestimation by EM method
 E step collects those expected counts
 M step sets p(S → NP VP | S) = c(S → NP VP) / c(S), as on the earlier slide
 Or M step fits a log-linear model to the counts
 Minimum Bayes Risk decoding
 Find a tree that maximizes expected reward,
e.g., expected total # of correct constituents
 Run CKY again, in different semiring (see later slide)
 The input specifies the probability of correctness
for each possible constituent (e.g., VP from 1 to 5)
32
Why do we want this info?
 Soft features of a sentence for other tasks
 NER system asks: “Is there an NP from 0 to 2?”
 True answer is 1 (true) or 0 (false)
 But we return 0.3, averaging over all parses
 That’s a perfectly good feature value – it can be fed
to a CRF or a neural network as an input feature
 Writing tutor system asks: “How many times did
the student use S → NP[singular] VP[plural]?”
 True answer is in {0, 1, 2, …}
 But we return 1.8, averaging over all parses
33
 Similar, but now we consider all parses of each sentence
 Parse our corpus of unparsed sentences:
 Collect counts fractionally:
 …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
 …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
 But there may be exponentially
many parses of a length-n sentence!
 How can we stay fast? Similar to taggings…
600.465 - Intro to NLP - J. Eisner
True EM for parsing
…
12 Today stocks were up
…
[the 12 copies of this sentence in the corpus are shared among its parses: (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) gets expected count 10.8 and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) gets expected count 1.2]
600.465 - Intro to NLP - J. Eisner 35
Analogies to α, β in PCFG?
[trellis: Start, then states C and H on each day]
Day 1: 2 cones   αC(1) = 0.1   αH(1) = 0.1
Day 2: 3 cones   αC(2) = 0.1*0.08 + 0.1*0.01 = 0.009   αH(2) = 0.1*0.07 + 0.1*0.56 = 0.063
   (arc weights shown include p(C|C)*p(3|C) = 0.8*0.1 = 0.08 and p(H|H)*p(3|H) = 0.8*0.7 = 0.56)
Day 3: 3 cones   αC(3) = 0.009*0.08 + 0.063*0.01 = 0.00135   αH(3) = 0.009*0.07 + 0.063*0.56 = 0.03591
The dynamic programming computation of α. (β is similar but works back from Stop.)
Call these αH(2) and βH(2), αH(3) and βH(3).
600.465 - Intro to NLP - J. Eisner 36
“Inside Probabilities”
 Sum over all VP parses of “flies like an arrow”:
βVP(1,5) = p(flies like an arrow | VP)
 Sum over all S parses of “time flies like an arrow”:
βS(0,5) = p(time flies like an arrow | S)
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
600.465 - Intro to NLP - J. Eisner 37
Compute β Bottom-Up by CKY
βVP(1,5) = p(flies like an arrow | VP) = …
βS(0,5) = p(time flies like an arrow | S)
= βNP(0,1) * βVP(1,5) * p(S → NP VP | S) + …
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
Compute β Bottom-Up by CKY
[CKY chart for "time 1 flies 2 like 3 an 4 arrow 5", with each constituent's weight (its negative log2 probability) from the earlier CKY lecture: (0,1) NP 3, Vst 3; (0,2) NP 10, S 8, S 13; (0,5) NP 24, NP 24, S 22, S 27, S 27; (1,2) NP 4, VP 4; (1,5) NP 18, S 21, VP 18; (2,3) P 2, V 5; (2,5) PP 12, VP 16; (3,4) Det 1; (3,5) NP 10; (4,5) N 8]
Grammar weights:
1 S → NP VP
6 S → Vst NP
2 S → S PP
1 VP → V NP
2 VP → VP PP
1 NP → Det N
2 NP → NP PP
3 NP → NP NP
0 PP → P NP
Compute β Bottom-Up by CKY
[the same CKY chart for "time 1 flies 2 like 3 an 4 arrow 5", with every weight w rewritten as the probability 2^-w: (0,1) NP 2^-3, Vst 2^-3; (0,2) NP 2^-10, S 2^-8, S 2^-13; (0,5) NP 2^-24, NP 2^-24, S 2^-22, S 2^-27, S 2^-27; (1,2) NP 2^-4, VP 2^-4; (1,5) NP 2^-18, S 2^-21, VP 2^-18; (2,3) P 2^-2, V 2^-5; (2,5) PP 2^-12, VP 2^-16; (3,4) Det 2^-1; (3,5) NP 2^-10; (4,5) N 2^-8]
Grammar probabilities:
2^-1 S → NP VP
2^-6 S → Vst NP
2^-2 S → S PP
2^-1 VP → V NP
2^-2 VP → VP PP
2^-1 NP → Det N
2^-2 NP → NP PP
2^-3 NP → NP NP
2^0 PP → P NP
[the new entry S 2^-22 being built in cell (0,5) is highlighted]
Compute β Bottom-Up by CKY
[the same chart and grammar probabilities as above; the entry S 2^-22 now sits in cell (0,5), and a further new entry S 2^-27 is highlighted]
The Efficient Version: Add as we go
[the same CKY chart and grammar probabilities as above]
The Efficient Version: Add as we go
[the same chart, but now each cell keeps one running sum per nonterminal: cell (0,2) holds S 2^-8 + 2^-13, and cell (0,5) holds NP 2^-24 + 2^-24 and S 2^-22 + 2^-27 + 2^-27; the increments +2^-22 and +2^-27 to βS(0,5) are highlighted]
βS(0,5) accumulates βS(0,2) * βPP(2,5) * p(S → S PP | S)
600.465 - Intro to NLP - J. Eisner 43
Compute β probs bottom-up (CKY)
(need some initialization for the width-1 case)
for width := 2 to n   (* build smallest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        βA(i,k) += p(A → B C | A) * βB(i,j) * βC(j,k)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
What if you changed + to max?
What if you replaced all rule probabilities by 1?
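A runnable sketch of this inside pass in Python, for a grammar in Chomsky normal form; the lexicon and rule formats are assumptions made for illustration:

from collections import defaultdict

def inside(words, lex_rules, bin_rules):
    # Inside probabilities for a PCFG in CNF (a sketch):
    #   lex_rules: dict word -> list of (A, p(A -> word | A))
    #   bin_rules: list of (A, B, C, p(A -> B C | A))
    n = len(words)
    beta = defaultdict(float)                    # beta[(A, i, k)]
    for i, w in enumerate(words):                # width-1 initialization
        for A, p in lex_rules.get(w, []):
            beta[(A, i, i + 1)] += p
    for width in range(2, n + 1):                # build smallest spans first
        for i in range(n - width + 1):           # start
            k = i + width                        # end
            for j in range(i + 1, k):            # middle
                for A, B, C, p in bin_rules:
                    left, right = beta.get((B, i, j), 0.0), beta.get((C, j, k), 0.0)
                    if left and right:
                        beta[(A, i, k)] += p * left * right
    return beta                                  # beta[("S", 0, n)] = p(words | S)

Changing += to a max would give Viterbi inside (best-parse) probabilities, and setting every rule probability to 1 would make βS(0,n) count the parses, matching the two side questions above.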
600.465 - Intro to NLP - J. Eisner 44
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
αVP(1,5) * βVP(1,5)
= p(time [VP flies like an arrow] today | S)
600.465 - Intro to NLP - J. Eisner 45
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
αVP(1,5) * βVP(1,5)
= p(time flies like an arrow today & VP(1,5) | S)
Divide by βS(0,6) = p(time flies like an arrow today | S) to get
p(VP(1,5) | time flies like an arrow today, S)
600.465 - Intro to NLP - J. Eisner 46
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * βVP(1,5) / βS(0,6)
is probability that there is a VP here,
given all of the observed data (words)
strictly analogous
to forward-backward
in the finite-state case!
[HMM trellis: Start, then states C and H at each time step]
600.465 - Intro to NLP - J. Eisner 47
βV(1,2) = p(flies | V)
αVP(1,5) = p(time VP today | S)
βPP(2,5) = p(like an arrow | PP)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * βV(1,2) * βPP(2,5) / βS(0,6)
is probability that there is VP → V PP here,
given all of the observed data (words)
… or is it?
600.465 - Intro to NLP - J. Eisner 48
Sum prob over all position triples like (1,2,5) to
get expected c(VP → V PP); reestimate PCFG!
βV(1,2) = p(flies | V)
αVP(1,5) = p(time VP today | S)
βPP(2,5) = p(like an arrow | PP)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * p(VP → V PP) * βV(1,2) * βPP(2,5) / βS(0,6)
is probability that there is VP → V PP here (at 1-2-5),
given all of the observed data (words)
strictly analogous
to forward-backward
in the finite-state case!
[HMM trellis: Start, then states C and H at each time step]
600.465 - Intro to NLP - J. Eisner 49
Compute β probs bottom-up
(gradually build up larger blue “inside” regions)
[diagram: VP(1,5) built from V(1,2) over "flies" and PP(2,5) over "like an arrow"]
p(flies like an arrow | VP) += p(V PP | VP) * p(flies | V) * p(like an arrow | PP)
Summary: βVP(1,5) += p(V PP | VP) * βV(1,2) * βPP(2,5)
(inside 1,5; inside 1,2; inside 2,5)
600.465 - Intro to NLP - J. Eisner 50
Compute α probs top-down
(uses β probs as well; gradually build up larger pink “outside” regions)
[diagram: the outside context of V(1,2) inside (S (NP time) (VP (VP V PP) (NP today))), with PP(2,5) over "like an arrow"]
p(time V like an arrow today | S) += p(time VP today | S) * p(V PP | VP) * p(like an arrow | PP)
Summary: αV(1,2) += p(V PP | VP) * αVP(1,5) * βPP(2,5)
(outside 1,2; outside 1,5; inside 2,5)
600.465 - Intro to NLP - J. Eisner 51
Compute α probs top-down
(uses β probs as well)
[diagram: the outside context of PP(2,5) inside (S (NP time) (VP (VP V PP) (NP today))), with V(1,2) over "flies"]
p(time flies PP today | S) += p(time VP today | S) * p(V PP | VP) * p(flies | V)
Summary: αPP(2,5) += p(V PP | VP) * αVP(1,5) * βV(1,2)
(outside 2,5; outside 1,5; inside 1,2)
600.465 - Intro to NLP - J. Eisner 52
Details:
Compute β probs bottom-up
When you build VP(1,5)
from VP(1,2) and PP(2,5)
during CKY,
increment βVP(1,5) by
p(VP → VP PP) *
βVP(1,2) * βPP(2,5)
Why? βVP(1,5) is the total
probability of all derivations of
p(flies like an arrow | VP),
and we just found another.
(See earlier slide of CKY chart.)
[diagram: VP(1,5) built from VP(1,2) over "flies" and PP(2,5) over "like an arrow"]
600.465 - Intro to NLP - J. Eisner 53
Details:
Compute β probs bottom-up (CKY)
for width := 2 to n   (* build smallest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        βA(i,k) += p(A → B C) * βB(i,j) * βC(j,k)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
600.465 - Intro to NLP - J. Eisner 54
Details:
Compute α probs top-down (reverse CKY)
for width := n downto 2   (* “unbuild” biggest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        αB(i,j) += ???
        αC(j,k) += ???
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
600.465 - Intro to NLP - J. Eisner 55
Details:
Compute α probs top-down (reverse CKY)
After computing β during CKY,
revisit constits in reverse order (i.e.,
bigger constituents first).
When you “unbuild” VP(1,5)
from VP(1,2) and PP(2,5),
increment αVP(1,2) by
αVP(1,5) * p(VP → VP PP) * βPP(2,5)
and increment αPP(2,5) by
αVP(1,5) * p(VP → VP PP) * βVP(1,2)
[diagram: (S (NP time) (VP (VP VP PP) (NP today))), with αVP(1,5) outside and βVP(1,2), βPP(2,5) already computed on the bottom-up pass]
αVP(1,2) is total prob of all ways to gen VP(1,2) and all outside words.
600.465 - Intro to NLP - J. Eisner 56
Details:
Compute α probs top-down (reverse CKY)
for width := n downto 2   (* “unbuild” biggest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        αB(i,j) += αA(i,k) * p(A → B C) * βC(j,k)
        αC(j,k) += αA(i,k) * p(A → B C) * βB(i,j)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
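A matching sketch of the outside pass in Python, reusing the β table and the assumed rule format from the inside sketch above:

from collections import defaultdict

def outside(n, bin_rules, beta, root="S"):
    # Outside probabilities, computed after the inside pass (a sketch): revisit spans
    # from biggest to smallest, "unbuilding" each span the way CKY built it.
    alpha = defaultdict(float)
    alpha[(root, 0, n)] = 1.0                    # the whole sentence under the root
    for width in range(n, 1, -1):                # n downto 2: unbuild biggest first
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for A, B, C, p in bin_rules:
                    a = alpha.get((A, i, k), 0.0)
                    if a:
                        alpha[(B, i, j)] += a * p * beta.get((C, j, k), 0.0)
                        alpha[(C, j, k)] += a * p * beta.get((B, i, j), 0.0)
    return alpha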
600.465 - Intro to NLP - J. Eisner 57
Related to backprop?
Inside computation (a product):
βA(i,k) += p(A → B C) * βB(i,j) * βC(j,k)
Outside computation (looks like backprop on a product?!):
αB(i,j) += αA(i,k) * p(A → B C) * βC(j,k)
αC(j,k) += αA(i,k) * p(A → B C) * βB(i,j)
Deep reason: the inside algorithm computes the denominator Z,
and ∇ log Z is the vector of expected counts.
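In symbols, with this deck's notation (a sketch of the standard identity behind that remark):

$$Z \;=\; \beta_S(0,n) \;=\; \sum_{\text{trees }T} \prod_{r \in T} p(r), \qquad
\frac{\partial \log Z}{\partial \log p(A\to B\,C)} \;=\; \frac{1}{Z}\sum_{T} p(T)\, c_T(A\to B\,C) \;=\; \mathbb{E}\big[c(A\to B\,C)\big],$$

so differentiating log Z with respect to the log rule probabilities, which is exactly what backprop through the inside computation does, yields the expected rule counts that the E step needs.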
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
4. As a subroutine within non-context-free models
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
 That’s why we just did it
…
12 Today stocks were up
…
[its parses (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) get expected counts 10.8 and 1.2]
c(S) = Σi,j αS(i,j) βS(i,j) / Z
c(S → NP VP) = Σi,j,k αS(i,k) p(S → NP VP) βNP(i,j) βVP(j,k) / Z
where
Z = total prob of all parses = βS(0,n)
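A sketch of this accumulation in Python, reusing the α and β tables and the assumed rule format from the earlier sketches (Z is assumed nonzero, i.e., the grammar parses the sentence):

from collections import defaultdict

def expected_rule_counts(n, bin_rules, alpha, beta, root="S"):
    # E-step accumulation (a sketch):
    #   c(A -> B C) = sum over i<j<k of alpha_A(i,k) * p(A -> B C) * beta_B(i,j) * beta_C(j,k) / Z
    #   c(A)        = sum over i<k   of alpha_A(i,k) * beta_A(i,k) / Z
    Z = beta.get((root, 0, n), 0.0)              # total probability of all parses (assumed > 0)
    rule_counts, nonterm_counts = defaultdict(float), defaultdict(float)
    for (A, i, k), b in list(beta.items()):
        nonterm_counts[A] += alpha.get((A, i, k), 0.0) * b / Z
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for A, B, C, p in bin_rules:
                    num = (alpha.get((A, i, k), 0.0) * p *
                           beta.get((B, i, j), 0.0) * beta.get((C, j, k), 0.0))
                    if num:
                        rule_counts[(A, B, C)] += num / Z
    return rule_counts, nonterm_counts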
Does Unsupervised
Learning Work?
 Merialdo (1994)
 “The paper that freaked me out …”
- Kevin Knight
 Catastrophic forgetting after smart initialization
 EM always improves likelihood
 But it sometimes hurts accuracy
 Why?#@!?
600.465 - Intro to NLP - J. Eisner 60
Does Unsupervised Learning Work?
600.465 - Intro to NLP - J. Eisner 61
Does Unsupervised Learning Work?
600.465 - Intro to NLP - J. Eisner 62
You mean a unique and
precious context-
dependent high-
dimensional vector?
… a vector that’s
trained to help with
some supervised or self-
supervised task?
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
 Posterior decoding of a single sentence
 Like using ab/Z to pick the most probable tag for each word
 But can’t just pick most probable nonterminal for each span …
 Wouldn’t get a tree! (Not all spans are constituents.)
 So, find the tree that maximizes expected # correct constits.
 Or you could decide you want to max expected # correct rules.
 For each nonterminal (or rule), at each position:
 ab/Z tells you the probability that it’s correct.
 For a given tree, sum these probabilities over all positions to get
that tree’s expected # of correct nonterminals (or rules).
 How can we find the tree that maximizes this sum?
 Dynamic programming – just weighted CKY all over again.
 But now the weights come from αβ (run inside-outside first).
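A sketch of that dynamic program for the simplest case, unlabeled spans; scoring labeled constituents or rules instead only changes what goes into the table. The span_posterior values are assumed to be αβ/Z numbers from a prior inside-outside run:

def max_expected_constituents(n, span_posterior):
    # Posterior ("minimum Bayes risk") decoding sketch: among unlabeled binary
    # bracketings of n words, find the one that maximizes the summed posterior
    # probability of its spans, where span_posterior[(i, k)] = alpha(i,k)*beta(i,k)/Z.
    best, split = {}, {}
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            k = i + width
            score = span_posterior.get((i, k), 0.0)
            if width > 1:
                j = max(range(i + 1, k), key=lambda m: best[(i, m)] + best[(m, k)])
                score += best[(i, j)] + best[(j, k)]
                split[(i, k)] = j
            best[(i, k)] = score

    def brackets(i, k):                          # recover the chosen spans
        if k - i == 1:
            return [(i, k)]
        j = split[(i, k)]
        return [(i, k)] + brackets(i, j) + brackets(j, k)

    return brackets(0, n), best[(0, n)]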
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
 Posterior decoding of a single sentence
 As soft features in a predictive classifier
 You want to predict whether the substring from i to j is a name
 Feature 17 asks whether your parser thinks it’s an NP
 If you’re sure it’s an NP, the feature fires
 add 1 × (the weight of feature 17) to the log-probability
 If you’re sure it’s not an NP, the feature doesn’t fire
 add 0 × (the weight of feature 17) to the log-probability
 But you’re not sure!
 The chance there’s an NP there is p = αNP(i,j)βNP(i,j)/Z
 So add p × (the weight of feature 17) to the log-probability
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
600.465 - Intro to NLP - J. Eisner 66
Outside Estimates
for better Pruning and Prioritization
 Iterative deepening: Throw x away if p(x)*q(x) < 10^-200
(lower this threshold if we don’t get a parse)
 Heuristic pruning: Throw x away if p(x)*q(x) < 0.01*p(y)*q(y)
for some y that spans the same set of words
 Prioritized agenda: Priority of x on agenda is p(x)*q(x); stop at first parse
 In general, the “inside prob” p(x) will be higher for smaller constituents
 Not many rule probabilities inside them
 The “outside prob” q(x) is intended to correct for this
 Estimates the prob of all the rest of the rules needed to build x into full parse
 So p(x)*q(x) estimates prob of the best parse that contains x
 If we take q(x) to be the best estimate we can get
 Methods may no longer be safe (but may be fast!)
 Prioritized agenda is then called a “best-first algorithm”
 But if we take q(x)=1, that’s just the methods from previous slides
 And iterative deepening and prioritization were safe there
 If we take q(x) to be an “optimistic estimate” (always ≥ true prob)
 Still safe! Prioritized agenda is then an example of an “A* algorithm”
slide from earlier lecture
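A sketch of the prioritized-agenda idea in Python; every argument here (the starting items, expand, inside_prob, outside_estimate, is_full_parse) is a hypothetical placeholder rather than an interface from these slides, and agenda items are assumed hashable:

import heapq, itertools

def best_first_parse(start_items, expand, inside_prob, outside_estimate, is_full_parse):
    # Pop items in order of p(x) * q(x): inside probability times an outside estimate.
    # q = 1 gives the safe strategy from the earlier slides; an optimistic q (always >=
    # the true outside probability) makes this an A* search; an arbitrary estimate makes
    # it merely "best-first" and possibly unsafe.
    tick = itertools.count()                     # tie-breaker so the heap never compares items
    agenda, done = [], set()
    for x in start_items:
        heapq.heappush(agenda, (-inside_prob(x) * outside_estimate(x), next(tick), x))
    while agenda:
        _, _, x = heapq.heappop(agenda)
        if is_full_parse(x):
            return x                             # stop at the first full parse popped
        if x in done:
            continue
        done.add(x)
        for y in expand(x, done):                # combine x with already-finished items
            heapq.heappush(agenda, (-inside_prob(y) * outside_estimate(y), next(tick), y))
    return None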
600.465 - Intro to NLP - J. Eisner 67
Outside Estimates
for better Pruning and Prioritization
 Iterative deepening: Throw x away if p(x)*q(x) < 10^-200
(lower this threshold if we don’t get a parse)
 Heuristic pruning: Throw x away if p(x)*q(x) < 0.01*p(y)*q(y)
for some y that spans the same set of words
 Prioritized agenda: Priority of x on agenda is p(x)*q(x); stop at first parse
 In general, the “inside prob” p(x) will be higher for smaller constituents
 Not many rule probabilities inside them
 The “outside prob” q(x) is intended to correct for this
 Estimates the prob of all the rest of the rules needed to build x into full parse
 So p(x)*q(x) estimates prob of the best parse that contains x
slide from earlier lecture
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
Here we want “Viterbi inside” and “Viterbi outside” probabilities, which
max over possible partial parses (blue and pink) instead of summing.
But the prob of a partial parse is still the product of its rule probs.
What Inside-Outside is Good For
What Inside-Outside is Good For
What Inside-Outside is Good For
0.6 S → NP[sing] VP[sing]
0.3 S → NP[plur] VP[plur]
0 S → NP[sing] VP[plur]
0 S → NP[plur] VP[sing]
0.1 S → VP[stem]
…
max of these gives the coarse rule:  0.6 S → NP[?] VP[?]
This “coarse” grammar ignores features and
makes optimistic assumptions about how they
will turn out. Few nonterminals, so fast.
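A sketch of how such a coarse grammar could be projected from the fine one, taking the optimistic max over the fine rules that collapse onto each coarse rule; the rule encoding and the strip helper are illustrative assumptions, not anything defined in the slides:

def coarsen(fine_rules, strip):
    # Collapse feature-bearing nonterminals (NP[sing], NP[plur], ...) to bare ones and
    # give each coarse rule the max, i.e. optimistic, probability of the fine rules
    # that map onto it. fine_rules maps (lhs, rhs-tuple) to a probability.
    coarse = {}
    for (lhs, rhs), p in fine_rules.items():
        key = (strip(lhs), tuple(strip(x) for x in rhs))
        coarse[key] = max(coarse.get(key, 0.0), p)
    return coarse

# example matching the slide:
fine = {("S", ("NP[sing]", "VP[sing]")): 0.6, ("S", ("NP[plur]", "VP[plur]")): 0.3,
        ("S", ("NP[sing]", "VP[plur]")): 0.0, ("S", ("NP[plur]", "VP[sing]")): 0.0}
strip = lambda sym: sym.split("[")[0] + ("[?]" if "[" in sym else "")
print(coarsen(fine, strip))   # {("S", ("NP[?]", "VP[?]")): 0.6}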
What Inside-Outside is Good For

What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
4. As a subroutine within non-context-free models
 We’ve always defined the weight of a parse tree as the sum of its
rules’ weights.
 Advanced topic: Can do better by considering additional “non-
local” features of the tree, e.g., in a log-linear or neural model.
 CKY no longer works for finding the best parse.
 Approximate “reranking” algorithm: Using a simplified model that uses
only local features, use CKY to find a parse forest. Extract the best
1000 parses. Then re-score these 1000 parses using the full model.
 Better approximate and exact algorithms: Beyond scope of this
course. But they usually call inside-outside or Viterbi inside-outside as
a subroutine, often several times (on multiple variants of the
grammar, where again each variant can only use local features).
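A sketch of the reranking step itself, assuming the k-best list has already been extracted from the forest; full_score is a hypothetical scorer over whole trees that may use non-local features:

def rerank(kbest, full_score):
    # kbest: candidate trees from the simplified, local-features-only model (say the 1000 best).
    # Return the candidate the full model likes most.
    return max(kbest, key=full_score)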