600.465 - Intro to NLP - J. Eisner 1
The Expectation
Maximization (EM) Algorithm
… continued!
600.465 - Intro to NLP - J. Eisner 2
Repeat until convergence!
General Idea
 Start by devising a noisy channel
 Any model that predicts the corpus observations via
some hidden structure (tags, parses, …)
 Initially guess the parameters of the model!
 Educated guess is best, but random can work
 Expectation step: Use current parameters (and
observations) to reconstruct hidden structure
 Maximization step: Use that hidden structure
(and observations) to reestimate parameters
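A minimal sketch of this loop in Python, assuming hypothetical e_step and m_step callables supplied by whatever model is being trained; the convergence test and argument names are illustrative, not something defined in these slides:

def em(observations, init_params, e_step, m_step, tol=1e-6, max_iters=100):
    # Alternate E and M steps until the corpus log-likelihood stops improving.
    params = init_params                           # initial guess (educated or random)
    old_ll = float("-inf")
    for _ in range(max_iters):
        counts, ll = e_step(observations, params)  # expected counts of the hidden structure
        params = m_step(counts)                    # reestimate probabilities from those counts
        if ll - old_ll < tol:                      # p(training corpus) never decreases under true EM
            break
        old_ll = ll
    return params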
600.465 - Intro to NLP - J. Eisner 3
General Idea
[diagram: an initial guess seeds the "Guess of unknown parameters (probabilities)"; the E step combines that guess with the Observed structure (words, ice cream) to produce a "Guess of unknown hidden structure (tags, parses, weather)"; the M step uses that hidden structure (and the observations) to update the parameter guess]
600.465 - Intro to NLP - J. Eisner 4
For Hidden Markov Models
[the same E step / M step loop diagram: initial guess → parameters (probabilities); E step: parameters + observed words/ice cream → hidden tags/parses/weather; M step: hidden structure + observations → reestimated parameters]
600.465 - Intro to NLP - J. Eisner 5
For Hidden Markov Models
[the same E step / M step loop diagram as above]
600.465 - Intro to NLP - J. Eisner 6
For Hidden Markov Models
[the same E step / M step loop diagram as above]
600.465 - Intro to NLP - J. Eisner 7
Grammar Reestimation
[diagram: a LEARNER turns training trees (expensive and/or wrong sublanguage) into a Grammar; a PARSER uses that grammar to parse test sentences (cheap, plentiful and appropriate); a scorer compares the parser's output against correct test trees to measure accuracy. In the EM loop, parsing the cheap raw sentences is the E step and retraining the grammar on the resulting trees is the M step.]
600.465 - Intro to NLP - J. Eisner 8
EM by Dynamic Programming:
Two Versions
 The Viterbi approximation
 Expectation: pick the best parse of each sentence
 Maximization: retrain on this best-parsed corpus
 Advantage: Speed!
 Real EM
 Expectation: find all parses of each sentence
 Maximization: retrain on all parses in proportion to
their probability (as if we observed fractional count)
 Advantage: p(training corpus) guaranteed to increase
 Exponentially many parses, so don’t extract them
from chart – need some kind of clever counting
600.465 - Intro to NLP - J. Eisner 9
Examples of EM
 Finite-State case: Hidden Markov Models
 “forward-backward” or “Baum-Welch” algorithm
 Applications:
 explain ice cream in terms of underlying weather sequence
 explain words in terms of underlying tag sequence
 explain phoneme sequence in terms of underlying word
 explain sound sequence in terms of underlying phoneme
 Context-Free case: Probabilistic CFGs
 “inside-outside” algorithm: unsupervised grammar learning!
 Explain raw text in terms of underlying context-free parse
 In practice, local maximum problem gets in the way
 But can improve a good starting grammar via raw text
 (pretraining!)
 Clustering case: explain points via clusters
compose
these?
600.465 - Intro to NLP - J. Eisner 10
Our old friend PCFG
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
 Start with a “pretty good” grammar
 E.g., it was trained on supervised data (a treebank) that is
small, imperfectly annotated, or has sentences in a different
style from what you want to parse.
 Parse a corpus of unparsed sentences:
 Reestimate:
 Collect counts: …; c(S → NP VP) += 12; c(S) += 2*12; …
 Divide: p(S → NP VP | S) = c(S → NP VP) / c(S)
 May be wise to smooth
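A small sketch of this count-and-divide reestimation in Python; the input format (a list of (rules, copies) pairs taken from each sentence's single best parse) is an assumption made for illustration:

from collections import Counter

def reestimate(parsed_corpus):
    # Viterbi-EM style "count and divide" M step (a sketch): parsed_corpus is a list of
    # (rules, copies) pairs, where rules lists the (lhs, rhs) rule tokens of the best parse.
    rule_count, lhs_count = Counter(), Counter()
    for rules, copies in parsed_corpus:
        for lhs, rhs in rules:
            rule_count[(lhs, rhs)] += copies   # e.g. c(S -> NP VP) += 12
            lhs_count[lhs] += copies           # e.g. c(S) += 2*12 if the tree has two S rules
    # divide (smoothing these counts first may be wise)
    return {(lhs, rhs): c / lhs_count[lhs] for (lhs, rhs), c in rule_count.items()}

# example: the tree (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))), 12 copies
corpus = [([("S", ("AdvP", "S")), ("S", ("NP", "VP")), ("AdvP", ("Today",)),
            ("NP", ("stocks",)), ("VP", ("V", "PRT")), ("V", ("were",)), ("PRT", ("up",))], 12)]
print(reestimate(corpus)[("S", ("NP", "VP"))])   # 0.5, since c(S -> NP VP) = 12 and c(S) = 24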
600.465 - Intro to NLP - J. Eisner 11
Viterbi reestimation for parsing
…
12 Today stocks were up
…
[the best parse of this sentence, (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))), is counted 12 times, the # of copies of this sentence in the corpus]
 Similar, but now we consider all parses of each sentence
 Parse our corpus of unparsed sentences:
 Collect counts fractionally:
 …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
 …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
600.465 - Intro to NLP - J. Eisner
True EM for parsing
…
12 Today stocks were up
…
[the 12 copies of this sentence in the corpus are shared among its parses: (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) gets expected count 10.8 and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) gets expected count 1.2]
Where are the constituents?
13
coal energy expert witness
p=0.5
Where are the constituents?
14
coal energy expert witness
p=0.1
Where are the constituents?
15
coal energy expert witness
p=0.1
Where are the constituents?
16
coal energy expert witness
p=0.1
Where are the constituents?
17
coal energy expert witness
p=0.2
Where are the constituents?
18
coal energy expert witness
0.5+0.1+0.1+0.1+0.2 = 1
Where are NPs, VPs, … ?
19
[two panels over "Time flies like an arrow", marking NP locations and VP locations; an example parse contributes S, NP, VP, NP, PP constituents over the tags V P Det N]
Where are NPs, VPs, … ?
20
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Where are NPs, VPs, … ?
21
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.3
(S (NP Time flies) (VP like (NP an arrow)))
Where are NPs, VPs, … ?
22
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Where are NPs, VPs, … ?
23
"Time flies like an arrow" (NP locations / VP locations panels)
p=0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Where are NPs, VPs, … ?
24
"Time flies like an arrow" (NP locations / VP locations panels)
0.5+0.3+0.1+0.1 = 1
How many NPs, VPs, … in all?
25
"Time flies like an arrow" (NP locations / VP locations panels)
0.5+0.3+0.1+0.1 = 1
How many NPs, VPs, … in all?
26
"Time flies like an arrow" (NP locations / VP locations panels)
2.1 NPs (expected)
1.1 VPs (expected)
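A tiny sketch of how those expected counts arise, assuming we could enumerate each sentence's parses with their probabilities (the chart-based version that avoids enumeration comes later); the span indices below are one illustrative reading of the bracketings above:

from collections import Counter

def expected_constituent_counts(parses):
    # parses: list of (probability, spans), where spans lists the (label, start, end)
    # constituents of that parse; the probabilities of all parses of the sentence sum to 1.
    expected = Counter()
    for prob, spans in parses:
        for label, start, end in spans:
            expected[label] += prob        # each parse contributes fractionally
    return expected

# the four parses of "Time flies like an arrow" from the slides, listing only NP and VP spans
parses = [
    (0.5, [("NP", 0, 1), ("NP", 3, 5), ("VP", 1, 5)]),
    (0.3, [("NP", 0, 2), ("NP", 3, 5), ("VP", 2, 5)]),
    (0.1, [("NP", 1, 5), ("NP", 1, 2), ("NP", 3, 5), ("VP", 0, 5)]),
    (0.1, [("NP", 1, 2), ("NP", 3, 5), ("VP", 0, 2), ("VP", 0, 5)]),
]
print(expected_constituent_counts(parses))   # roughly 2.1 expected NPs, 1.1 expected VPs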
Where did the rules apply?
27
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
Where did the rules apply?
28
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Where is S → NP VP substructure?
29
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.3
(S (NP Time flies) (VP like (NP an arrow)))
Where is S → NP VP substructure?
30
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Where is S → NP VP substructure?
31
"Time flies like an arrow" (S → NP VP locations / NP → Det N locations panels)
p=0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Why do we want this info?
 Grammar reestimation by EM method
 E step collects those expected counts
 M step sets p(S → NP VP | S) = c(S → NP VP) / c(S), as on the earlier slide
 Or M step fits a log-linear model to the counts
 Minimum Bayes Risk decoding
 Find a tree that maximizes expected reward,
e.g., expected total # of correct constituents
 Run CKY again, in different semiring (see later slide)
 The input specifies the probability of correctness
for each possible constituent (e.g., VP from 1 to 5)
32
Why do we want this info?
 Soft features of a sentence for other tasks
 NER system asks: “Is there an NP from 0 to 2?”
 True answer is 1 (true) or 0 (false)
 But we return 0.3, averaging over all parses
 That’s a perfectly good feature value – it can be fed
to a CRF or a neural network as an input feature
 Writing tutor system asks: “How many times did
the student use S → NP[singular] VP[plural]?”
 True answer is in {0, 1, 2, …}
 But we return 1.8, averaging over all parses
33
 Similar, but now we consider all parses of each sentence
 Parse our corpus of unparsed sentences:
 Collect counts fractionally:
 …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
 …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
 But there may be exponentially
many parses of a length-n sentence!
 How can we stay fast? Similar to taggings…
600.465 - Intro to NLP - J. Eisner
True EM for parsing
…
12 Today stocks were up
…
[the 12 copies of this sentence in the corpus are shared among its parses: (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) gets expected count 10.8 and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) gets expected count 1.2]
600.465 - Intro to NLP - J. Eisner 35
Analogies to α, β in PCFG?
[trellis: Start, then states C and H on each day]
Day 1: 2 cones   αC(1) = 0.1   αH(1) = 0.1
Day 2: 3 cones   αC(2) = 0.1*0.08 + 0.1*0.01 = 0.009   αH(2) = 0.1*0.07 + 0.1*0.56 = 0.063
   (arc weights shown include p(C|C)*p(3|C) = 0.8*0.1 = 0.08 and p(H|H)*p(3|H) = 0.8*0.7 = 0.56)
Day 3: 3 cones   αC(3) = 0.009*0.08 + 0.063*0.01 = 0.00135   αH(3) = 0.009*0.07 + 0.063*0.56 = 0.03591
The dynamic programming computation of α. (β is similar but works back from Stop.)
Call these αH(2) and βH(2), αH(3) and βH(3).
600.465 - Intro to NLP - J. Eisner 36
“Inside Probabilities”
 Sum over all VP parses of “flies like an arrow”:
βVP(1,5) = p(flies like an arrow | VP)
 Sum over all S parses of “time flies like an arrow”:
βS(0,5) = p(time flies like an arrow | S)
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
600.465 - Intro to NLP - J. Eisner 37
Compute β Bottom-Up by CKY
βVP(1,5) = p(flies like an arrow | VP) = …
βS(0,5) = p(time flies like an arrow | S)
= βNP(0,1) * βVP(1,5) * p(S → NP VP | S) + …
[parse tree: (S (NP time) (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))))]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP)
* p(VP → V PP | VP)
* p(V → flies | V) * …
Compute β Bottom-Up by CKY
[CKY chart for "time 1 flies 2 like 3 an 4 arrow 5", with each constituent's weight (its negative log2 probability) from the earlier CKY lecture: (0,1) NP 3, Vst 3; (0,2) NP 10, S 8, S 13; (0,5) NP 24, NP 24, S 22, S 27, S 27; (1,2) NP 4, VP 4; (1,5) NP 18, S 21, VP 18; (2,3) P 2, V 5; (2,5) PP 12, VP 16; (3,4) Det 1; (3,5) NP 10; (4,5) N 8]
Grammar weights:
1 S → NP VP
6 S → Vst NP
2 S → S PP
1 VP → V NP
2 VP → VP PP
1 NP → Det N
2 NP → NP PP
3 NP → NP NP
0 PP → P NP
Compute β Bottom-Up by CKY
[the same CKY chart for "time 1 flies 2 like 3 an 4 arrow 5", with every weight w rewritten as the probability 2^-w: (0,1) NP 2^-3, Vst 2^-3; (0,2) NP 2^-10, S 2^-8, S 2^-13; (0,5) NP 2^-24, NP 2^-24, S 2^-22, S 2^-27, S 2^-27; (1,2) NP 2^-4, VP 2^-4; (1,5) NP 2^-18, S 2^-21, VP 2^-18; (2,3) P 2^-2, V 2^-5; (2,5) PP 2^-12, VP 2^-16; (3,4) Det 2^-1; (3,5) NP 2^-10; (4,5) N 2^-8]
Grammar probabilities:
2^-1 S → NP VP
2^-6 S → Vst NP
2^-2 S → S PP
2^-1 VP → V NP
2^-2 VP → VP PP
2^-1 NP → Det N
2^-2 NP → NP PP
2^-3 NP → NP NP
2^0 PP → P NP
[the new entry S 2^-22 being built in cell (0,5) is highlighted]
Compute β Bottom-Up by CKY
[the same chart and grammar probabilities as above; the entry S 2^-22 now sits in cell (0,5), and a further new entry S 2^-27 is highlighted]
The Efficient Version: Add as we go
[the same CKY chart and grammar probabilities as above]
The Efficient Version: Add as we go
[the same chart, but now each cell keeps one running sum per nonterminal: cell (0,2) holds S 2^-8 + 2^-13, and cell (0,5) holds NP 2^-24 + 2^-24 and S 2^-22 + 2^-27 + 2^-27; the increments +2^-22 and +2^-27 to βS(0,5) are highlighted]
βS(0,5) accumulates βS(0,2) * βPP(2,5) * p(S → S PP | S)
600.465 - Intro to NLP - J. Eisner 43
Compute β probs bottom-up (CKY)
(need some initialization for the width-1 case)
for width := 2 to n   (* build smallest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        βA(i,k) += p(A → B C | A) * βB(i,j) * βC(j,k)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
What if you changed + to max?
What if you replaced all rule probabilities by 1?
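A runnable sketch of this inside pass in Python, for a grammar in Chomsky normal form; the lexicon and rule formats are assumptions made for illustration:

from collections import defaultdict

def inside(words, lex_rules, bin_rules):
    # Inside probabilities for a PCFG in CNF (a sketch):
    #   lex_rules: dict word -> list of (A, p(A -> word | A))
    #   bin_rules: list of (A, B, C, p(A -> B C | A))
    n = len(words)
    beta = defaultdict(float)                    # beta[(A, i, k)]
    for i, w in enumerate(words):                # width-1 initialization
        for A, p in lex_rules.get(w, []):
            beta[(A, i, i + 1)] += p
    for width in range(2, n + 1):                # build smallest spans first
        for i in range(n - width + 1):           # start
            k = i + width                        # end
            for j in range(i + 1, k):            # middle
                for A, B, C, p in bin_rules:
                    left, right = beta.get((B, i, j), 0.0), beta.get((C, j, k), 0.0)
                    if left and right:
                        beta[(A, i, k)] += p * left * right
    return beta                                  # beta[("S", 0, n)] = p(words | S)

Changing += to a max would give Viterbi inside (best-parse) probabilities, and setting every rule probability to 1 would make βS(0,n) count the parses, matching the two side questions above.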
600.465 - Intro to NLP - J. Eisner 44
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
αVP(1,5) * βVP(1,5)
= p(time [VP flies like an arrow] today | S)
600.465 - Intro to NLP - J. Eisner 45
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
αVP(1,5) * βVP(1,5)
= p(time flies like an arrow today & VP(1,5) | S)
Divide by βS(0,6) = p(time flies like an arrow today | S) to get
p(VP(1,5) | time flies like an arrow today, S)
600.465 - Intro to NLP - J. Eisner 46
βVP(1,5) = p(flies like an arrow | VP)
αVP(1,5) = p(time VP today | S)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * βVP(1,5) / βS(0,6)
is probability that there is a VP here,
given all of the observed data (words)
strictly analogous
to forward-backward
in the finite-state case!
[HMM trellis: Start, then states C and H at each time step]
600.465 - Intro to NLP - J. Eisner 47
βV(1,2) = p(flies | V)
αVP(1,5) = p(time VP today | S)
βPP(2,5) = p(like an arrow | PP)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * βV(1,2) * βPP(2,5) / βS(0,6)
is probability that there is VP → V PP here,
given all of the observed data (words)
… or is it?
600.465 - Intro to NLP - J. Eisner 48
Sum prob over all position triples like (1,2,5) to
get expected c(VP → V PP); reestimate PCFG!
βV(1,2) = p(flies | V)
αVP(1,5) = p(time VP today | S)
βPP(2,5) = p(like an arrow | PP)
Inside & Outside Probabilities
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
So αVP(1,5) * p(VP → V PP) * βV(1,2) * βPP(2,5) / βS(0,6)
is probability that there is VP → V PP here (at 1-2-5),
given all of the observed data (words)
strictly analogous
to forward-backward
in the finite-state case!
[HMM trellis: Start, then states C and H at each time step]
600.465 - Intro to NLP - J. Eisner 49
Compute β probs bottom-up
(gradually build up larger blue “inside” regions)
[diagram: VP(1,5) built from V(1,2) over "flies" and PP(2,5) over "like an arrow"]
p(flies like an arrow | VP) += p(V PP | VP) * p(flies | V) * p(like an arrow | PP)
Summary: βVP(1,5) += p(V PP | VP) * βV(1,2) * βPP(2,5)
(inside 1,5; inside 1,2; inside 2,5)
600.465 - Intro to NLP - J. Eisner 50
Compute α probs top-down
(uses β probs as well; gradually build up larger pink “outside” regions)
[diagram: the outside context of V(1,2) inside (S (NP time) (VP (VP V PP) (NP today))), with PP(2,5) over "like an arrow"]
p(time V like an arrow today | S) += p(time VP today | S) * p(V PP | VP) * p(like an arrow | PP)
Summary: αV(1,2) += p(V PP | VP) * αVP(1,5) * βPP(2,5)
(outside 1,2; outside 1,5; inside 2,5)
600.465 - Intro to NLP - J. Eisner 51
Compute α probs top-down
(uses β probs as well)
[diagram: the outside context of PP(2,5) inside (S (NP time) (VP (VP V PP) (NP today))), with V(1,2) over "flies"]
p(time flies PP today | S) += p(time VP today | S) * p(V PP | VP) * p(flies | V)
Summary: αPP(2,5) += p(V PP | VP) * αVP(1,5) * βV(1,2)
(outside 2,5; outside 1,5; inside 1,2)
600.465 - Intro to NLP - J. Eisner 52
Details:
Compute β probs bottom-up
When you build VP(1,5)
from VP(1,2) and PP(2,5)
during CKY,
increment βVP(1,5) by
p(VP → VP PP) *
βVP(1,2) * βPP(2,5)
Why? βVP(1,5) is the total
probability of all derivations of
p(flies like an arrow | VP),
and we just found another.
(See earlier slide of CKY chart.)
[diagram: VP(1,5) built from VP(1,2) over "flies" and PP(2,5) over "like an arrow"]
600.465 - Intro to NLP - J. Eisner 53
Details:
Compute β probs bottom-up (CKY)
for width := 2 to n   (* build smallest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        βA(i,k) += p(A → B C) * βB(i,j) * βC(j,k)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
600.465 - Intro to NLP - J. Eisner 54
Details:
Compute α probs top-down (reverse CKY)
for width := n downto 2   (* “unbuild” biggest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        αB(i,j) += ???
        αC(j,k) += ???
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
600.465 - Intro to NLP - J. Eisner 55
Details:
Compute α probs top-down (reverse CKY)
After computing β during CKY,
revisit constits in reverse order (i.e.,
bigger constituents first).
When you “unbuild” VP(1,5)
from VP(1,2) and PP(2,5),
increment αVP(1,2) by
αVP(1,5) * p(VP → VP PP) * βPP(2,5)
and increment αPP(2,5) by
αVP(1,5) * p(VP → VP PP) * βVP(1,2)
[diagram: (S (NP time) (VP (VP VP PP) (NP today))), with αVP(1,5) outside and βVP(1,2), βPP(2,5) already computed on the bottom-up pass]
αVP(1,2) is total prob of all ways to gen VP(1,2) and all outside words.
600.465 - Intro to NLP - J. Eisner 56
Details:
Compute α probs top-down (reverse CKY)
for width := n downto 2   (* “unbuild” biggest first *)
  for i := 0 to n-width   (* start *)
    let k := i + width   (* end *)
    for j := i+1 to k-1   (* middle *)
      for all grammar rules A → B C
        αB(i,j) += αA(i,k) * p(A → B C) * βC(j,k)
        αC(j,k) += αA(i,k) * p(A → B C) * βB(i,j)
[diagram: A over span (i,k) built from B over (i,j) and C over (j,k)]
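A matching sketch of the outside pass in Python, reusing the β table and the assumed rule format from the inside sketch above:

from collections import defaultdict

def outside(n, bin_rules, beta, root="S"):
    # Outside probabilities, computed after the inside pass (a sketch): revisit spans
    # from biggest to smallest, "unbuilding" each span the way CKY built it.
    alpha = defaultdict(float)
    alpha[(root, 0, n)] = 1.0                    # the whole sentence under the root
    for width in range(n, 1, -1):                # n downto 2: unbuild biggest first
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for A, B, C, p in bin_rules:
                    a = alpha.get((A, i, k), 0.0)
                    if a:
                        alpha[(B, i, j)] += a * p * beta.get((C, j, k), 0.0)
                        alpha[(C, j, k)] += a * p * beta.get((B, i, j), 0.0)
    return alpha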
600.465 - Intro to NLP - J. Eisner 57
Related to backprop?
Inside computation (a product):
βA(i,k) += p(A → B C) * βB(i,j) * βC(j,k)
Outside computation (looks like backprop on a product?!):
αB(i,j) += αA(i,k) * p(A → B C) * βC(j,k)
αC(j,k) += αA(i,k) * p(A → B C) * βB(i,j)
Deep reason: the inside algorithm computes the denominator Z,
and ∇ log Z is the vector of expected counts.
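In symbols, with this deck's notation (a sketch of the standard identity behind that remark):

$$Z \;=\; \beta_S(0,n) \;=\; \sum_{\text{trees }T} \prod_{r \in T} p(r), \qquad
\frac{\partial \log Z}{\partial \log p(A\to B\,C)} \;=\; \frac{1}{Z}\sum_{T} p(T)\, c_T(A\to B\,C) \;=\; \mathbb{E}\big[c(A\to B\,C)\big],$$

so differentiating log Z with respect to the log rule probabilities, which is exactly what backprop through the inside computation does, yields the expected rule counts that the E step needs.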
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
4. As a subroutine within non-context-free models
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
 That’s why we just did it
…
12 Today stocks were up
…
[its parses (S (AdvP Today) (S (NP stocks) (VP (V were) (PRT up)))) and (S (NP (NP Today) (NP stocks)) (VP (V were) (PRT up))) get expected counts 10.8 and 1.2]
c(S) = Σi,j αS(i,j) βS(i,j) / Z
c(S → NP VP) = Σi,j,k αS(i,k) p(S → NP VP) βNP(i,j) βVP(j,k) / Z
where
Z = total prob of all parses = βS(0,n)
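A sketch of this accumulation in Python, reusing the α and β tables and the assumed rule format from the earlier sketches (Z is assumed nonzero, i.e., the grammar parses the sentence):

from collections import defaultdict

def expected_rule_counts(n, bin_rules, alpha, beta, root="S"):
    # E-step accumulation (a sketch):
    #   c(A -> B C) = sum over i<j<k of alpha_A(i,k) * p(A -> B C) * beta_B(i,j) * beta_C(j,k) / Z
    #   c(A)        = sum over i<k   of alpha_A(i,k) * beta_A(i,k) / Z
    Z = beta.get((root, 0, n), 0.0)              # total probability of all parses (assumed > 0)
    rule_counts, nonterm_counts = defaultdict(float), defaultdict(float)
    for (A, i, k), b in list(beta.items()):
        nonterm_counts[A] += alpha.get((A, i, k), 0.0) * b / Z
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for A, B, C, p in bin_rules:
                    num = (alpha.get((A, i, k), 0.0) * p *
                           beta.get((B, i, j), 0.0) * beta.get((C, j, k), 0.0))
                    if num:
                        rule_counts[(A, B, C)] += num / Z
    return rule_counts, nonterm_counts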
Does Unsupervised
Learning Work?
 Merialdo (1994)
 “The paper that freaked me out …”
- Kevin Knight
 Catastrophic forgetting after smart initialization
 EM always improves likelihood
 But it sometimes hurts accuracy
 Why?#@!?
600.465 - Intro to NLP - J. Eisner 60
Does Unsupervised Learning Work?
600.465 - Intro to NLP - J. Eisner 61
Does Unsupervised Learning Work?
600.465 - Intro to NLP - J. Eisner 62
You mean a unique and
precious context-
dependent high-
dimensional vector?
… a vector that’s
trained to help with
some supervised or self-
supervised task?
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
 Posterior decoding of a single sentence
 Like using ab/Z to pick the most probable tag for each word
 But can’t just pick most probable nonterminal for each span …
 Wouldn’t get a tree! (Not all spans are constituents.)
 So, find the tree that maximizes expected # correct constits.
 Or you could decide you want to max expected # correct rules.
 For each nonterminal (or rule), at each position:
 ab/Z tells you the probability that it’s correct.
 For a given tree, sum these probabilities over all positions to get
that tree’s expected # of correct nonterminals (or rules).
 How can we find the tree that maximizes this sum?
 Dynamic programming – just weighted CKY all over again.
 But now the weights come from αβ (run inside-outside first).
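A sketch of that dynamic program for the simplest case, unlabeled spans; scoring labeled constituents or rules instead only changes what goes into the table. The span_posterior values are assumed to be αβ/Z numbers from a prior inside-outside run:

def max_expected_constituents(n, span_posterior):
    # Posterior ("minimum Bayes risk") decoding sketch: among unlabeled binary
    # bracketings of n words, find the one that maximizes the summed posterior
    # probability of its spans, where span_posterior[(i, k)] = alpha(i,k)*beta(i,k)/Z.
    best, split = {}, {}
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            k = i + width
            score = span_posterior.get((i, k), 0.0)
            if width > 1:
                j = max(range(i + 1, k), key=lambda m: best[(i, m)] + best[(m, k)])
                score += best[(i, j)] + best[(j, k)]
                split[(i, k)] = j
            best[(i, k)] = score

    def brackets(i, k):                          # recover the chosen spans
        if k - i == 1:
            return [(i, k)]
        j = split[(i, k)]
        return [(i, k)] + brackets(i, j) + brackets(j, k)

    return brackets(0, n), best[(0, n)]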
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
 Posterior decoding of a single sentence
 As soft features in a predictive classifier
 You want to predict whether the substring from i to j is a name
 Feature 17 asks whether your parser thinks it’s an NP
 If you’re sure it’s an NP, the feature fires
 add 1 × (the weight of feature 17) to the log-probability
 If you’re sure it’s not an NP, the feature doesn’t fire
 add 0 × (the weight of feature 17) to the log-probability
 But you’re not sure!
 The chance there’s an NP there is p = αNP(i,j)βNP(i,j)/Z
 So add p × (the weight of feature 17) to the log-probability
What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
600.465 - Intro to NLP - J. Eisner 66
Outside Estimates
for better Pruning and Prioritization
 Iterative deepening: Throw x away if p(x)*q(x) < 10^-200
(lower this threshold if we don’t get a parse)
 Heuristic pruning: Throw x away if p(x)*q(x) < 0.01*p(y)*q(y)
for some y that spans the same set of words
 Prioritized agenda: Priority of x on agenda is p(x)*q(x); stop at first parse
 In general, the “inside prob” p(x) will be higher for smaller constituents
 Not many rule probabilities inside them
 The “outside prob” q(x) is intended to correct for this
 Estimates the prob of all the rest of the rules needed to build x into full parse
 So p(x)*q(x) estimates prob of the best parse that contains x
 If we take q(x) to be the best estimate we can get
 Methods may no longer be safe (but may be fast!)
 Prioritized agenda is then called a “best-first algorithm”
 But if we take q(x)=1, that’s just the methods from previous slides
 And iterative deepening and prioritization were safe there
 If we take q(x) to be an “optimistic estimate” (always ≥ true prob)
 Still safe! Prioritized agenda is then an example of an “A* algorithm”
slide from earlier lecture
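A sketch of the prioritized-agenda idea in Python; every argument here (the starting items, expand, inside_prob, outside_estimate, is_full_parse) is a hypothetical placeholder rather than an interface from these slides, and agenda items are assumed hashable:

import heapq, itertools

def best_first_parse(start_items, expand, inside_prob, outside_estimate, is_full_parse):
    # Pop items in order of p(x) * q(x): inside probability times an outside estimate.
    # q = 1 gives the safe strategy from the earlier slides; an optimistic q (always >=
    # the true outside probability) makes this an A* search; an arbitrary estimate makes
    # it merely "best-first" and possibly unsafe.
    tick = itertools.count()                     # tie-breaker so the heap never compares items
    agenda, done = [], set()
    for x in start_items:
        heapq.heappush(agenda, (-inside_prob(x) * outside_estimate(x), next(tick), x))
    while agenda:
        _, _, x = heapq.heappop(agenda)
        if is_full_parse(x):
            return x                             # stop at the first full parse popped
        if x in done:
            continue
        done.add(x)
        for y in expand(x, done):                # combine x with already-finished items
            heapq.heappush(agenda, (-inside_prob(y) * outside_estimate(y), next(tick), y))
    return None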
600.465 - Intro to NLP - J. Eisner 67
Outside Estimates
for better Pruning and Prioritization
 Iterative deepening: Throw x away if p(x)*q(x) < 10^-200
(lower this threshold if we don’t get a parse)
 Heuristic pruning: Throw x away if p(x)*q(x) < 0.01*p(y)*q(y)
for some y that spans the same set of words
 Prioritized agenda: Priority of x on agenda is p(x)*q(x); stop at first parse
 In general, the “inside prob” p(x) will be higher for smaller constituents
 Not many rule probabilities inside them
 The “outside prob” q(x) is intended to correct for this
 Estimates the prob of all the rest of the rules needed to build x into full parse
 So p(x)*q(x) estimates prob of the best parse that contains x
slide from earlier lecture
[parse tree: (S (NP time) (VP (VP (V flies) (PP (P like) (NP (Det an) (N arrow)))) (NP today)))]
Here we want “Viterbi inside” and “Viterbi outside” probabilities, which
max over possible partial parses (blue and pink) instead of summing.
But the prob of a partial parse is still the product of its rule probs.
What Inside-Outside is Good For
What Inside-Outside is Good For
What Inside-Outside is Good For
0.6 S → NP[sing] VP[sing]
0.3 S → NP[plur] VP[plur]
0 S → NP[sing] VP[plur]
0 S → NP[plur] VP[sing]
0.1 S → VP[stem]
…
max of these gives the coarse rule:  0.6 S → NP[?] VP[?]
This “coarse” grammar ignores features and
makes optimistic assumptions about how they
will turn out. Few nonterminals, so fast.
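A sketch of how such a coarse grammar could be projected from the fine one, taking the optimistic max over the fine rules that collapse onto each coarse rule; the rule encoding and the strip helper are illustrative assumptions, not anything defined in the slides:

def coarsen(fine_rules, strip):
    # Collapse feature-bearing nonterminals (NP[sing], NP[plur], ...) to bare ones and
    # give each coarse rule the max, i.e. optimistic, probability of the fine rules
    # that map onto it. fine_rules maps (lhs, rhs-tuple) to a probability.
    coarse = {}
    for (lhs, rhs), p in fine_rules.items():
        key = (strip(lhs), tuple(strip(x) for x in rhs))
        coarse[key] = max(coarse.get(key, 0.0), p)
    return coarse

# example matching the slide:
fine = {("S", ("NP[sing]", "VP[sing]")): 0.6, ("S", ("NP[plur]", "VP[plur]")): 0.3,
        ("S", ("NP[sing]", "VP[plur]")): 0.0, ("S", ("NP[plur]", "VP[sing]")): 0.0}
strip = lambda sym: sym.split("[")[0] + ("[?]" if "[" in sym else "")
print(coarsen(fine, strip))   # {("S", ("NP[?]", "VP[?]")): 0.6}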
What Inside-Outside is Good For

What Inside-Outside is Good For
1. As the E step in the EM training algorithm
2. Predicting which nonterminals are probably where
3. Viterbi version as an A* or pruning heuristic
4. As a subroutine within non-context-free models
 We’ve always defined the weight of a parse tree as the sum of its
rules’ weights.
 Advanced topic: Can do better by considering additional “non-
local” features of the tree, e.g., in a log-linear or neural model.
 CKY no longer works for finding the best parse.
 Approximate “reranking” algorithm: Using a simplified model that uses
only local features, use CKY to find a parse forest. Extract the best
1000 parses. Then re-score these 1000 parses using the full model.
 Better approximate and exact algorithms: Beyond scope of this
course. But they usually call inside-outside or Viterbi inside-outside as
a subroutine, often several times (on multiple variants of the
grammar, where again each variant can only use local features).
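A sketch of the reranking step itself, assuming the k-best list has already been extracted from the forest; full_score is a hypothetical scorer over whole trees that may use non-local features:

def rerank(kbest, full_score):
    # kbest: candidate trees from the simplified, local-features-only model (say the 1000 best).
    # Return the candidate the full model likes most.
    return max(kbest, key=full_score)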