Sequential Supervised Learning – An Overview
Classical Supervised learning
• Given a set of N training samples {xi,yi}, where xi is a vector and yi is an integer (for classification) or a real number (for regression)
– xi could be anything in the real world, and yi is the quantity you are interested in about xi.
– E.g., xi is an image and yi is a label indicating what object the image contains.
• The goal is to learn a function h from an assumed function space H such that y = h(x) predicts new samples well.
Classical Supervised learning
• What does h(x) look like?
– Most of the time it is a linear function, e.g. y = sum_c{w_c f(x_c)}, where
– c – the index of a subset of the variables of x
– x_c – the subset of variables indexed by c
– f(x_c) – a “feature” function, indicating how well this subset x_c predicts y
– w_c – the strength (weight) of feature c
– In this way we exploit the structure of the input space X to predict a single number y for each sample in that space.
Classical Supervised learning
• You, as the engineer, have only two key tasks for a specific problem.
• 1. Design good feature functions f
– These should encode your reasoning mechanism as a human, i.e., assign a score to each piece of evidence. For example (see the sketch after this list):
– f(x) = K(xi, x)
– f(xc) = 1 if xc = “ing”, 0 otherwise
– f(xc) = “local descriptor for image patch c”
– f(xc) = “image patch 1 AND image patch 2”, with c = {1, 2}
– In principle, one can design an unlimited number of features based on subsets of variables xc.
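A minimal sketch (my own illustration, not from the slides) of what such feature functions and the weighted sum sum_c{w_c f(x_c)} could look like for a word-labeling task; the feature names, the example sentence, and the weights are hypothetical.

```python
# Hypothetical binary feature functions over subsets of the input
# (the word at position i and, for one feature, its left neighbor).

def f_suffix_ing(words, i):
    """1 if the current word ends in 'ing', else 0."""
    return 1.0 if words[i].endswith("ing") else 0.0

def f_capitalized(words, i):
    """1 if the current word is capitalized, else 0."""
    return 1.0 if words[i][:1].isupper() else 0.0

def f_new_york(words, i):
    """A feature over the subset c = {i-1, i}: previous and current word."""
    return 1.0 if i > 0 and (words[i - 1], words[i]) == ("New", "York") else 0.0

def score(words, i, features, weights):
    """The linear form from the slides: sum_c w_c * f_c(x_c)."""
    return sum(w * f(words, i) for f, w in zip(features, weights))

# Example: score position 1 of a tiny sentence with illustrative weights.
print(score(["New", "York", "is", "booming"], 1,
            [f_suffix_ing, f_capitalized, f_new_york],
            [0.5, 1.2, 2.0]))  # -> 1.2 + 2.0 = 3.2
```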
Classical Supervised learning
• 2. Find a way to estimate the best w_c (which could be a vector):
– Maximum likelihood, maximum a posteriori
• P(y|x_i, w) = exp(H(x_i))/Z
• = exp(sum_c{w_c f(x_c)})/Z
– Maximum entropy
– Maximum margin – SVM
Classical Supervised learning
– But the above criteria all encode the same thing – minimizing (training) errors.
– Different ways of calculating/penalizing errors (the function that does this is called the loss function) give different algorithms (see the sketch after this list).
• Hinge loss – the SVM has taken it as its private property
• Log loss – maximum entropy / logistic regression
• Exponential loss – AdaBoost
• Squared loss – least-squares regression
• Misclassification loss – perceptron
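A quick sketch of the listed losses (my own illustration) for a binary label y in {-1, +1} and a real-valued score s = h(x); each algorithm above can be read as "minimize the training average of one of these".

```python
import math

# y in {-1, +1}; s = h(x) is the model's real-valued score.

def hinge_loss(y, s):               # SVM
    return max(0.0, 1.0 - y * s)

def log_loss(y, s):                 # maximum entropy / logistic regression
    return math.log(1.0 + math.exp(-y * s))

def exp_loss(y, s):                 # AdaBoost
    return math.exp(-y * s)

def squared_loss(y, s):             # least-squares regression
    return (y - s) ** 2

def misclassification_loss(y, s):   # 0/1 error count (perceptron-style)
    return 0.0 if y * s > 0 else 1.0
```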
A Sequence of Labels?
• In many cases, there also exist correlations between the outputs.
Sequential supervised learning
• Given N training samples {xi,yi}, where each xi and yi is a sequence, the goal is to learn a classifier y = h(x) with sequence input and sequence output.
Sequential supervised learning
• Related to, but different from, time-series prediction
– we are given the whole sequence
– we need to predict all the labels, not only the next one, yt+1
• Related to, but different from, multi-task learning
– in multi-task learning, the input spaces of the tasks may be totally different
– it is more general than sequential learning
• Related to, but different from, multi-label learning
– in multi-label learning, the task is to assign multiple labels to one single instance (or to multiple instances taken as a whole)
Major issues
1. Loss function for SP
• That is, given a feature sequence, you have to compute a number indicating whether your predicted label sequence is good or not; in other words, you need to measure how well your model fits the data (see the sketch after this list).
– At one extreme, count the number of incorrect labels; this amounts to maximizing each p(yi|x), i.e., your model should label as many individual positions correctly as possible.
– At the other extreme, if any individual label is incorrect, the whole sequence is treated as incorrect, i.e., 0-1 loss in the sequence case.
– Between these extremes, make a global assessment of y but allow some yi to be incorrect.
– Which option to choose depends on your application!
• After defining your loss, what you do is tune your model parameters so that the average loss over the training samples is minimized; this procedure is called parameter estimation, or learning.
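A minimal sketch (my own illustration) of the two extreme sequence losses described above: the per-position count of incorrect labels and the all-or-nothing 0-1 loss.

```python
def hamming_loss(y_true, y_pred):
    """Number of incorrectly labeled positions (the 'label as many positions correctly as possible' view)."""
    return sum(t != p for t, p in zip(y_true, y_pred))

def zero_one_loss(y_true, y_pred):
    """1 if any position is wrong, i.e., the whole sequence counts as one error."""
    return 0 if list(y_true) == list(y_pred) else 1

print(hamming_loss("NNVN", "NNVV"))   # -> 1
print(zero_one_loss("NNVN", "NNVV"))  # -> 1
```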
Major issues
2. Features and long-distance interactions
• Central problem: identify the relevant subset of the input (xi) for accurately predicting each output (yi).
• How to do this?
– 1) Use a traditional feature-selection method and search for the best set according to your loss: forward, backward, ... Not good for SP – too many options.
– 2) Attach a weight to each feature and sparsify the weights, e.g., L1-SVM or ARD in neural networks. Not good for SP, for the same reason.
– 3) Feature scoring: e.g., mutual information as a similarity measure between a feature and a class label.
– 4) Fit a simple model first, use it as a feature selector, then remove irrelevant features.
• The usual way in SP: define a window around xi to predict yi (see the sketch below), but this may not be safe – long sequences are always a problem.
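The "window" idea from the last bullet, sketched in Python; the window size, padding token, and feature naming are my own choices.

```python
def window_features(xs, i, k=2, pad="<PAD>"):
    """Return the tokens in a size-(2k+1) window around position i,
    tagged with their relative offset, to use as features for predicting yi."""
    padded = [pad] * k + list(xs) + [pad] * k
    return [f"w[{d}]={padded[i + k + d]}" for d in range(-k, k + 1)]

print(window_features(["the", "cat", "sat", "on", "mats"], 2))
# -> ['w[-2]=the', 'w[-1]=cat', 'w[0]=sat', 'w[1]=on', 'w[2]=mats']
```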
Major issues
3. Representation: P(x,y) or P(y|x)
• Model the joint distribution P(x,y)
• Hidden Markov Model
• 1) Write down the joint distribution according to the graphical model:
• P(x,y) = p(y1) p(x1|y1) p(y2|y1) p(x2|y2) ...
• 2) Learning is easy; to get P(x,y) we just need
– P(yi+1|yi) – look at all pairs of adjacent ys
– P(xi|yi) – look at all pairs (xi, yi)
– Done! (A counting sketch follows.)
• 3) But how do we do prediction?
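Before turning to prediction, here is a minimal sketch of step 2): maximum-likelihood estimation of the HMM tables by counting (smoothing and the initial distribution p(y1) are omitted, and the data layout is my own assumption).

```python
from collections import Counter

def hmm_counts(sequences):
    """sequences: list of (xs, ys) pairs of equal-length observation/label lists.
    Returns unsmoothed transition and emission probability tables."""
    trans, emit = Counter(), Counter()
    for xs, ys in sequences:
        for x, y in zip(xs, ys):
            emit[(y, x)] += 1                 # counts for P(xi | yi)
        for y_prev, y_next in zip(ys, ys[1:]):
            trans[(y_prev, y_next)] += 1      # counts for P(yi+1 | yi)
    emit_tot, trans_tot = Counter(), Counter()
    for (y, _), c in emit.items():
        emit_tot[y] += c
    for (y, _), c in trans.items():
        trans_tot[y] += c
    p_emit = {k: c / emit_tot[k[0]] for k, c in emit.items()}
    p_trans = {k: c / trans_tot[k[0]] for k, c in trans.items()}
    return p_trans, p_emit
```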
Major issues
3. Representation: P(x,y) or P(y|x)
• 3) But how do we do prediction?
• Given an observation sequence x, we need to find an output sequence y such that the expected loss L(z,y) is minimized, that is
• y* = argmin_z E_{p(y|x)}[L(z,y)] = argmin_z sum_y L(z,y) p(y|x)
• Suppose the sequence length is |x| = L and there are |Y| = K possible labels; naive enumeration needs O(K^L).
• 1) If the loss L(z,y) depends on the entire sequence (0-1 loss), this reduces to
• y* = argmax_y p(y|x),
which dynamic programming solves in O(K^2 L): for each label u and each position i, find the best path for [0, yi(=u)], and repeat (a sketch follows).
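A sketch of that dynamic program (Viterbi decoding) for the 0-1 loss case, in log space; the table layout (dicts keyed by labels and label pairs) is my own choice. Cost is O(K^2 L) instead of the naive O(K^L).

```python
def viterbi(xs, labels, log_init, log_trans, log_emit):
    """argmax_y p(y|x) for an HMM via dynamic programming.
    log_init[u], log_trans[(u, v)], log_emit[(u, x)] are log-probabilities."""
    # best[i][u] = best log-score of any label path for x[0..i] ending in label u
    best = [{u: log_init[u] + log_emit[(u, xs[0])] for u in labels}]
    back = []
    for x in xs[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for v in labels:
            u_best = max(labels, key=lambda u: prev[u] + log_trans[(u, v)])
            scores[v] = prev[u_best] + log_trans[(u_best, v)] + log_emit[(v, x)]
            pointers[v] = u_best
        best.append(scores)
        back.append(pointers)
    # trace the best final label back through the stored pointers
    y = [max(labels, key=lambda u: best[-1][u])]
    for pointers in reversed(back):
        y.append(pointers[y[-1]])
    return list(reversed(y))
```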
Major issues
3. Representation: P(x,y) or P(y|x)
• 2) If the loss can be decomposed into separate decisions for each yi, we just need to compute p(yi=u | x1, x2, ...) – called the pseudo-likelihood:
• p(yi=u | x) = p(x | yi) p(yi) / p(x)
             = p(x1,...,xi | yi) p(xi+1,...,xL | yi) p(yi) / p(x)
             = p(x1,...,xi, yi) p(xi+1,...,xL | yi) / p(x),
using the conditional independence of x1,...,xi and xi+1,...,xL given yi.
The first factor can be computed by the forward procedure and the second by the backward procedure (a sketch follows).
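A sketch of those forward and backward recursions: alpha_i(u) = p(x1,...,xi, yi=u) and beta_i(u) = p(xi+1,...,xL | yi=u), so their product, renormalized over u, gives p(yi=u | x). The table layout mirrors the counting sketch above.

```python
def forward_backward(xs, labels, p_init, p_trans, p_emit):
    """Per-position marginals m[i][u] = p(yi=u | x) for an HMM with
    tables p_init[u], p_trans[(u, v)], p_emit[(u, x)]."""
    L = len(xs)
    # forward: alpha[i][u] = p(x1..xi, yi=u)
    alpha = [{u: p_init[u] * p_emit[(u, xs[0])] for u in labels}]
    for i in range(1, L):
        alpha.append({v: p_emit[(v, xs[i])] *
                         sum(alpha[i - 1][u] * p_trans[(u, v)] for u in labels)
                      for v in labels})
    # backward: beta[i][u] = p(xi+1..xL | yi=u)
    beta = [dict() for _ in range(L)]
    beta[L - 1] = {u: 1.0 for u in labels}
    for i in range(L - 2, -1, -1):
        beta[i] = {u: sum(p_trans[(u, v)] * p_emit[(v, xs[i + 1])] * beta[i + 1][v]
                          for v in labels)
                   for u in labels}
    # combine and renormalize: p(yi=u | x) = alpha_i(u) * beta_i(u) / p(x)
    marginals = []
    for i in range(L):
        z = sum(alpha[i][u] * beta[i][u] for u in labels)
        marginals.append({u: alpha[i][u] * beta[i][u] / z for u in labels})
    return marginals
```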
Major issues
3. Representation: P(x,y) or P(y|x)
• Another idea is to model the conditional distribution P(y|x) directly. But how?
• Graphically this is easy: reverse the direction of the HMM's observation edges and remove the direction of the edges between the label variables, then write down the conditional distribution according to the Hammersley-Clifford theorem. This leads to the Conditional Random Field (CRF).
Major issues
3. Representation: P(x,y) or P(y|x)
• Another way is to directly search for a conditional distribution within a function space under some criterion – e.g., the maximum-entropy criterion. This leads to the Maximum Entropy Markov Model (MEMM).
But maximum entropy means nothing if no constraints are imposed. So what are the constraints? – Consistency with the training data!
Major issues
3. Representation: P(x,y) or P(y|x)
• The constraints play a role somewhat like the usual loss function: they say that what you estimate should be consistent with what the data says.
• As before, first encode your evidence as features f_c(x, y).
• Then the training data tells you the empirical expectation of each feature (its sufficient statistic), E_data[f_c],
• and the model tells you its expected value under p(y|x), E_model[f_c].
Those two numbers should be consistent (in fact, the same)!
Major issues
3. Representation: P(x,y) or P(y|x)
• After collecting the constraints, we get the objective: maximize the entropy of p(y|x) subject to matching the feature expectations above.
• After some long (but simple) math, we get the log-linear solution p(y|x) = exp(sum_c{w_c f_c(x,y)}) / Z(x).
This looks exactly the same as logistic regression! (A sketch follows.)
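A small sketch of that log-linear (maxent) form, p(y|x) = exp(sum_c w_c f_c(x,y)) / Z(x); the feature/weight layout is my own illustration, and for a single output variable this is exactly multiclass logistic regression.

```python
import math

def maxent_prob(x, y, labels, features, w):
    """p(y|x) = exp(sum_c w_c * f_c(x, y)) / Z(x) -- the maxent / logistic form."""
    def score(label):
        return sum(w_c * f(x, label) for w_c, f in zip(w, features))
    z = sum(math.exp(score(label)) for label in labels)   # partition function Z(x)
    return math.exp(score(y)) / z
```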
Major issues
3. Representation: P(x,y) or P(y|x)
• Now put this in the context of a chain. Note that each label variable depends only on the previous label and the whole sequence x, so we have:
• p(y|x) = prod_{i=1...L} p(yi | yi-1, x), where
• p(yi | yi-1, x) = exp[ w^T f(x, i, yi, yi-1) ] / Z(x, i, yi-1, w)
What’s the problem of this per-state normalization scheme?
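To make the question concrete, here is a sketch (my own illustration) of the per-state normalization an MEMM uses: each step is its own softmax over the next label given (previous label, x, position), so every state hands out a total probability mass of exactly 1, no matter how poorly the observation fits.

```python
import math

def memm_step(prev_label, x, i, labels, features, w):
    """p(yi = v | yi-1, x): one locally normalized (per-state) distribution."""
    def score(v):
        return sum(w_c * f(x, i, v, prev_label) for w_c, f in zip(w, features))
    z = sum(math.exp(score(v)) for v in labels)  # Z(x, i, yi-1, w): local, per state
    return {v: math.exp(score(v)) / z for v in labels}
```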
Label-Bias Problem of MEMMs
• Transitions leaving a given state compete only with each other
– Transition scores are conditional probabilities of the next state given the current state and the observation.
– The total mass of P(yt+1 | yt, x) is sum_{yt+1} p(yt+1, yt, x1,...,xt+1).
– While p(yt+1, yt, x1,...,xt+1) = p(yt, x1,...,xt) p(yt+1 | yt, xt+1) p(xt+1),
– p(xt+1) is cancelled by the same term in the numerator, so the total mass is effectively p(yt, x1,...,xt).
– This means:
– 1) Observations do not affect the total mass handed to the next states!
– 2) But that total mass of a state is simply distributed among its next states.
– 3) So the probability of choosing some next state k is biased toward paths with fewer outgoing transitions, even when xt+1 is completely incompatible with state k!
• States with a single outgoing transition ignore their observations entirely
Label-Bias Problem: Example
• A model for distinguishing ‘rob’ from ‘rib’
• Suppose we get the input sequence ‘rib’
– First step: ‘r’ matches both possible next states, so they are equally likely.
– Next, ‘i’ is observed, but since both y1 and y4 have only one outgoing transition, each gives probability 1 to its next state.
– This does not happen in HMMs – why? Because the HMM factor p(yt+1|yt) p(xt+1|yt+1) lets an incompatible observation down-weight the path.
– Note: if one word is more frequent in training, it will win.
P(Y'|Y)   y0    y1    y2    y3    y4    y5
  y0      0     0.5   0.5   0     0     0
  y1      0     0     1     0     0     0
  y2      0     0     0     1     0     0
  y3      0     0     0     1     0     0
  y4      0     0     0     0     0     1

[State-transition diagram: states y0–y5, with edges labeled r, r, o, i, b, b]
Major issues
3. Representation: P(x,y) or P(y|x)
• This problem is corrected by the CRF, which normalizes globally over the whole sequence:
• p(y|x) = exp[ sum_{i=1...L} w^T f(x, i, yi, yi-1) ] / Z(x, w)
How do we compute w?
Maximum likelihood – but the gradient of the log partition function is hard to compute!
It can be computed by forward and backward procedures similar to those used for HMMs.
Note that this is possible thanks to the local Markov property of the CRF graph.
But the price we must pay is a summation over the whole lattice (see the sketch below)!
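A sketch of that summation over the lattice: the global partition function Z(x, w) of a chain CRF computed by a forward recursion in O(K^2 L) rather than by summing over all K^L label sequences. The potential/feature layout and the start symbol are my own simplifications; a numerically stable log-sum-exp would be used in practice.

```python
import math

def crf_log_Z(x, labels, log_potential, start="<S>"):
    """log Z(x, w) for a chain CRF, where
    log_potential(x, i, y_prev, y_curr) = w . f(x, i, y_curr, y_prev)."""
    L = len(x)
    # alpha[v] = log of the summed exp-scores of all label prefixes ending in v
    alpha = {v: log_potential(x, 0, start, v) for v in labels}
    for i in range(1, L):
        alpha = {v: math.log(sum(math.exp(alpha[u] + log_potential(x, i, u, v))
                                 for u in labels))
                 for v in labels}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```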
More about the conditional
likelihood function
• Let the true model be p0(x) and the estimated one p(x); then the negative log likelihood of a sample x is L(p, x) = -log p(x). We want to minimize this – why?
– What we really want to minimize is the KL divergence between the true, unknown model p0(x) and the one being fit, p(x). It is easy to see that min_p KL(p0 || p) = min_p -sum_x p0(x) log p(x) = min_p sum_x p0(x) L(p, x), so minimizing the likelihood loss makes sense.
– If we assume the samples come from p0(x), then by Monte Carlo approximation, min_p sum_x p0(x) L(p, x) = min_p (average of L(p, x) over the training set). But there is no guarantee of a consistent estimator – the space of models p is just too large!
More about the conditional
likelihood loss function
• In the case of conditional likelihood, L(p, y, x) = -log p(y|x), and minimizing it is equivalent to minimizing the expected KL divergence, that is
argmin_p̂ sum_x p0(x) KL( p0(y|x) || p̂(y|x) )
  = argmin_p̂ -sum_x sum_y p0(x) p0(y|x) log p̂(y|x)
  = argmin_p̂ -sum_{x,y} p0(x,y) log p̂(y|x)
  = argmin_p̂ sum_{x,y} p0(x,y) L(p̂, y, x)
This says that minimizing conditional loss makes sense!
Major issues
3. Representation: P(x,y) or P(y|x)
• But in a classification setting, estimating an accurate conditional distribution is not strictly necessary.
• One can instead directly maximize the likelihood ratio between a sample's true labeling and the most competing alternative labeling.
• This idea leads to the family of so-called “discriminatively trained Markov networks”, including the hidden Markov support vector machine, the max-margin Markov network, and the structured support vector machine (see the sketch below).
Here Yi is the label sequence of sample i, and Yi' is another (competing) label sequence.
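A sketch of the margin idea behind these models, e.g., the structured hinge loss of max-margin Markov networks / structured SVMs: only the score gap between the true sequence Yi and the most competing sequence Yi' matters, augmented by how different Yi' is. The feature map Phi, the loss Delta, and the explicit candidate set are placeholders; in practice the inner max is found by loss-augmented decoding, not enumeration.

```python
def structured_hinge(w, x, y_true, candidate_ys, Phi, Delta):
    """max over y' of [ Delta(y_true, y') + w.Phi(x, y') - w.Phi(x, y_true) ], floored at 0."""
    def score(y):
        return sum(w_j * f_j for w_j, f_j in zip(w, Phi(x, y)))
    true_score = score(y_true)
    return max(0.0,
               max(Delta(y_true, y) + score(y) - true_score for y in candidate_ys))
```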
Some examples
summary
• Two key problems in structured prediction:
• 1. Representation – features
• 2. The weight-learning method
• There are many variations of the CRF,
• e.g., latent CRF, discriminative CRF, semi-supervised CRF, sparse CRF, kernel CRF, and so on...
• It seems that an online CRF has not been done yet.