2. Classical Supervised learning
• Given a set of N training samples {x_i, y_i}, where x_i is a vector and y_i is an integer (for classification) or a real number (for regression)
– x_i could be anything in the real world, and y_i is the quantity you are interested in about x_i.
– E.g., x_i is an image and y_i is a label indicating what object the image contains.
• The goal is to learn a function h from an assumed functional space H, such that y = h(x) predicts well on new samples.
3. Classical Supervised learning
• What does h(x) look like?
– Most of the time it is a linear function, e.g. y = sum_c{w_c f(x_c)}, where
– c – index of a subset of the variables in x
– x_c – the subset of variables indexed by c
– f(x_c) – a “feature” function, indicating how informative the subset x_c is for predicting y
– w_c – the strength (weight) of feature c
– In this way we exploit the structure of the input space X to predict a single number y for each sample in that space.
4. Classical Supervised learning
• You, as the engineer, have only two key tasks for a specific problem
• 1. Design good feature functions f
– These should encode your reasoning mechanism as a human, i.e., give a score to your evidence.
– f(x) = K(x_i, x);
– f(x_c) = 1 if x_c = “ing”, 0 otherwise;
– f(x_c) = “local descriptor for image patch c”;
– f(x_c) = “image patch 1 AND image patch 2”, c = {1, 2}
– In principle, one can design an unlimited number of features based on a subset of variables x_c (a small sketch follows below).
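As an illustration only (not from the slides), here is a minimal sketch of indicator-style feature functions and the resulting linear score y = sum_c{w_c f(x_c)}; the feature definitions and weight values are made up for the example.

    # Hypothetical suffix/capitalization indicator features and a linear score.
    def f_ing(word):
        # 1 if the word ends with "ing", 0 otherwise
        return 1.0 if word.endswith("ing") else 0.0

    def f_capitalized(word):
        # 1 if the word starts with an uppercase letter, 0 otherwise
        return 1.0 if word[:1].isupper() else 0.0

    features = [f_ing, f_capitalized]   # the f_c's
    weights = [1.5, -0.3]               # the w_c's (assumed values)

    def score(word):
        # y = sum_c w_c * f_c(x_c)
        return sum(w * f(word) for w, f in zip(weights, features))

    print(score("running"))   # 1.5
    print(score("Paris"))     # -0.3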
5. Classical Supervised learning
• 2. Find a way to estimate your best w_c (which could be a vector):
– Maximum likelihood, maximum a posteriori
• P(y|x_i, w) = exp(H(x_i))/Z
• = exp(sum_c{w_c f(x_c)})/Z
– Maximum entropy
– Maximum margin – SVM
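As a hedged illustration of the log-linear form above (not from the slides), the conditional distribution over K classes can be computed as a softmax over linear scores; the feature vector and weight matrix below are placeholders.

    import numpy as np

    def conditional_prob(f_x, W):
        # f_x: feature vector f(x) of length D
        # W:   weight matrix of shape (K, D), one weight vector per class y
        scores = W @ f_x                  # H(x, y) = sum_c w_{y,c} f_c(x)
        scores = scores - scores.max()    # subtract the max for numerical stability
        exps = np.exp(scores)
        return exps / exps.sum()          # P(y | x, w) = exp(H) / Z

    W = np.array([[1.0, -0.5],            # assumed weights: K=2 classes, D=2 features
                  [0.2, 0.8]])
    print(conditional_prob(np.array([1.0, 2.0]), W))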
6. Classical Supervised learning
– But the above criteria all encode the same thing – minimizing (training) errors.
– Different ways of calculating/penalizing errors (the function that does this is called the loss function) give different algorithms.
• Hinge loss – SVM has claimed it as its private property
• Log loss – maximum entropy / logistic regression
• Exponential loss – AdaBoost
• Squared loss – least squares regression
• Misclassification loss – perceptron
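For reference (an illustrative sketch, not from the slides), these losses can all be written as functions of the margin m = y*h(x) for binary labels y in {-1, +1}; the squared loss is shown in its margin form, which is equivalent for such labels.

    import math

    def hinge_loss(m):                 # SVM
        return max(0.0, 1.0 - m)

    def log_loss(m):                   # logistic regression / maximum entropy
        return math.log(1.0 + math.exp(-m))

    def exp_loss(m):                   # AdaBoost
        return math.exp(-m)

    def squared_loss(m):               # least squares: (y - h(x))^2 = (1 - m)^2 when y is +/-1
        return (1.0 - m) ** 2

    def misclassification_loss(m):     # 0-1 loss (paired with the perceptron on the slide)
        return 1.0 if m <= 0 else 0.0

    for m in (-1.0, 0.5, 2.0):
        print(m, hinge_loss(m), log_loss(m), exp_loss(m),
              squared_loss(m), misclassification_loss(m))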
7. A Sequence of Labels?
• In many cases, there also exist correlations between the outputs
8. Sequential supervised learning
• Given N training samples {x_i, y_i}, where each x_i and each y_i is a sequence, the goal is to learn a classifier y = h(x) with sequence input and sequence output.
9. Sequential supervised learning
• Related to but different from time-series prediction
– We are given the whole sequence
– We need to predict all the labels instead of only the next one, y_{t+1}
• Related to but different from multi-task learning
– In multi-task learning, the input spaces of the tasks may be totally different
– It is more general than sequential learning
• Related to but different from multi-label learning
– In multi-label learning, the task is to assign multiple labels to one single instance (or to multiple instances taken as a whole)
10. Major issues
1. Loss function for SP
• That is, given a feature sequence, you have to compute a number indicating whether your predicted label sequence is good or not; in other words, you need to measure how well your model fits the data!
– On one hand, count the number of incorrect labels (Hamming loss); this amounts to maximizing each p(y_i|x), i.e., your model should label as many individual positions correctly as possible.
– On the other hand, treat the whole sequence as incorrect if any individual label is incorrect, i.e., 0-1 loss in the sequence case.
– Between these extremes, make a global analysis of y but allow some y_i to be incorrect.
– Which option to use depends on your application! (The two extreme losses are sketched below.)
• After defining your loss, you tune your model parameters so that the average loss over the training samples is minimized; this procedure is called parameter estimation, or learning.
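A minimal sketch (illustration only, with made-up sequences) of the two extreme sequence losses mentioned above:

    def hamming_loss(pred, gold):
        # number of positions whose label is wrong
        return sum(p != g for p, g in zip(pred, gold))

    def zero_one_loss(pred, gold):
        # the whole sequence counts as one error if any position is wrong
        return 0 if list(pred) == list(gold) else 1

    gold = ["B", "I", "O", "B"]
    pred = ["B", "O", "O", "B"]
    print(hamming_loss(pred, gold))   # 1
    print(zero_one_loss(pred, gold))  # 1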
11. Major issues
2. Features and long-distance interaction
• Central problem: identify the relevant subset of the input (x_i) for accurate prediction of the output (y).
• How to do this?
– 1) Use traditional feature selection methods and search for the best set according to your loss: forward selection, backward elimination, ... Not good for SP! Too many options.
– 2) Attach a weight to each feature and sparsify the weights, e.g., L1-SVM, ARD in neural networks. Not good for SP for the same reason.
– 3) Feature scoring: e.g., mutual information as a similarity measure between a feature and a class label.
– 4) Fit a simple model first, use it as a feature selector, then remove irrelevant features.
• The usual way in SP: define a window around x_i to predict y_i (as sketched below), but this may not be safe; long-distance interactions remain a problem.
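A minimal sketch (illustration only) of window-based features around position i of a token sequence; the window radius, padding symbol, and feature names are assumptions.

    def window_features(tokens, i, radius=2):
        # Collect the tokens in a window of +/- radius around position i,
        # padded with a boundary symbol, as simple string-valued indicator features.
        feats = {}
        for d in range(-radius, radius + 1):
            j = i + d
            tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
            feats["w[%+d]=%s" % (d, tok)] = 1.0
        return feats

    print(window_features(["He", "is", "running", "fast"], 2))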
12. Major issues
3. Representation: P(x,y) or P(y|x)
• Model the joint distribution P(X,Y)
• Hidden Markov Model
• 1) Write down the joint distribution according to the graphical model:
• P(x,y) = p(y_1) p(x_1|y_1) p(y_2|y_1) p(x_2|y_2) ...
• 2) Learning is easy; to get P(x,y) you just need
– P(y_{i+1}|y_i) – obtained by looking at all pairs of adjacent labels
– P(x_i|y_i) – obtained by looking at all (x_i, y_i) pairs
– Done! (A counting sketch follows below.)
• 3) But how to do prediction?
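Before turning to prediction, here is a minimal counting sketch (illustration only) of the learning step above, using plain maximum-likelihood counts:

    from collections import Counter

    def estimate_hmm(sequences):
        # sequences: list of (xs, ys) pairs, xs = observations, ys = labels
        trans, emit, prior = Counter(), Counter(), Counter()
        for xs, ys in sequences:
            prior[ys[0]] += 1
            for x, y in zip(xs, ys):
                emit[(y, x)] += 1                # counts for P(x_i | y_i)
            for y_prev, y_next in zip(ys, ys[1:]):
                trans[(y_prev, y_next)] += 1     # counts for P(y_{i+1} | y_i)

        def normalize(counts):
            # turn counts indexed by (condition, value) into conditional probabilities
            totals = Counter()
            for (a, _), c in counts.items():
                totals[a] += c
            return {(a, b): c / totals[a] for (a, b), c in counts.items()}

        n = sum(prior.values())
        return normalize(trans), normalize(emit), {y: c / n for y, c in prior.items()}

    data = [(["r", "i", "b"], ["C", "V", "C"])]
    print(estimate_hmm(data))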
13. Major issues
3. Representation: P(x,y) or P(y|x)
• 3) But how to do prediction?
• Given an observation sequence x, we need to find an output sequence y such that the expected loss L(z,y) is minimized, that is:
    \hat{y} = \arg\min_z E_{p(y|x)}[ L(z,y) ] = \arg\min_z \sum_y L(z,y) p(y|x)

• Suppose |X| = L and |Y| = K; evaluating this naively over all label sequences needs O(K^L) operations.
• 1) If the loss L(z,y) depends on the entire sequence (0-1 loss), this reduces to

    \hat{y} = \arg\max_y p(y|x),

which costs O(K^2 L) with dynamic programming – for each label u and each position i, find the best path ending with y_i = u, then repeat.
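A minimal Viterbi sketch (illustration, not from the slides) of this dynamic program for an HMM, using the same dictionary-style parameters as the counting sketch above:

    def viterbi(xs, labels, prior, trans, emit):
        # delta[u] = score of the best path ending in label u (plain probabilities here)
        delta = {u: prior.get(u, 0.0) * emit.get((u, xs[0]), 0.0) for u in labels}
        back = []
        for x in xs[1:]:
            new_delta, pointers = {}, {}
            for u in labels:
                # best previous label v for current label u
                best_v = max(labels, key=lambda v: delta[v] * trans.get((v, u), 0.0))
                new_delta[u] = (delta[best_v] * trans.get((best_v, u), 0.0)
                                * emit.get((u, x), 0.0))
                pointers[u] = best_v
            delta = new_delta
            back.append(pointers)
        # trace back the best path
        y = max(labels, key=lambda u: delta[u])
        path = [y]
        for pointers in reversed(back):
            y = pointers[y]
            path.append(y)
        return list(reversed(path))

    labels = ["C", "V"]
    prior = {"C": 1.0}
    trans = {("C", "V"): 1.0, ("V", "C"): 1.0}
    emit = {("C", "r"): 0.5, ("C", "b"): 0.5, ("V", "i"): 1.0}
    print(viterbi(["r", "i", "b"], labels, prior, trans, emit))   # ['C', 'V', 'C']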
14. Major issues
3. Representation: P(x,y) or P(y|x)
• 2) If the loss can be decomposed into separate decisions for each y_i, we just need to compute p(y_i = u | x_1, x_2, ...) – called the pseudo-likelihood:
    p(y_i = u | x) = p(x | y_i) p(y_i) / p(x)
                   = p(x_1, ..., x_i | y_i) p(x_{i+1}, ..., x_L | y_i) p(y_i) / p(x)
                   = p(x_1, ..., x_i, y_i) p(x_{i+1}, ..., x_L | y_i) / p(x)

This can be computed with a forward procedure (for p(x_1, ..., x_i, y_i)) and a backward procedure (for p(x_{i+1}, ..., x_L | y_i)).
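A minimal forward-backward sketch (illustration only) that computes these marginals p(y_i = u | x) for an HMM with the same dictionary-style parameters as before:

    def forward_backward(xs, labels, prior, trans, emit):
        L = len(xs)
        # alpha[i][u] = p(x_1..x_{i+1}, y_{i+1} = u)
        alpha = [{u: prior.get(u, 0.0) * emit.get((u, xs[0]), 0.0) for u in labels}]
        for x in xs[1:]:
            alpha.append({u: sum(alpha[-1][v] * trans.get((v, u), 0.0) for v in labels)
                             * emit.get((u, x), 0.0) for u in labels})
        # beta[i][u] = p(x_{i+2}..x_L | y_{i+1} = u)
        beta = [{u: 1.0 for u in labels}]
        for x in reversed(xs[1:]):
            beta.insert(0, {u: sum(trans.get((u, v), 0.0) * emit.get((v, x), 0.0) * beta[0][v]
                                   for v in labels) for u in labels})
        z = sum(alpha[-1][u] for u in labels)      # p(x)
        return [{u: alpha[i][u] * beta[i][u] / z for u in labels} for i in range(L)]

    labels = ["C", "V"]
    prior = {"C": 1.0}
    trans = {("C", "V"): 1.0, ("V", "C"): 1.0}
    emit = {("C", "r"): 0.5, ("C", "b"): 0.5, ("V", "i"): 1.0}
    print(forward_backward(["r", "i", "b"], labels, prior, trans, emit))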
15. Major issues
3. Representation: P(x,y) or P(y|x)
• Another idea is to model the conditional distribution P(y|x) directly. But how?
• Graphically this is easy: reverse the direction of the observation edges of the HMM, remove the direction on the edges between label variables, and then write down the conditional distribution according to the Hammersley-Clifford theorem. This leads to the Conditional Random Field (CRF).
16. Major issues
3. Representation: P(x,y) or P(y|x)
• Another way is to directly search for a conditional distribution within a functional space under some criterion – e.g., the maximum entropy criterion. This leads to the
• Maximum Entropy Markov Model (MEMM)
But this means nothing if no constraints are imposed. So what are they? – consistency with the training data!
17. Major issues
3. Representation: P(x,y) or P(y|x)
• The constraints play a role somewhat like the usual loss function: they say that what you estimate should be consistent with what the data says.
• As before, first encode your evidence as features;
• then the training data gives you the sufficient statistics – the empirical expectation of each feature –
• and your estimated model tells you what that expectation should be. Those two numbers should be consistent (the same!).
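In standard maximum-entropy notation (a reconstruction of the formulas missing from the extracted slide, using the slide's feature notation), the constraint for every feature f_k is that the empirical expectation equals the model expectation:

    \tilde{E}[f_k] = \frac{1}{N} \sum_i f_k(x_i, y_i)
    E_p[f_k] = \frac{1}{N} \sum_i \sum_y p(y | x_i) f_k(x_i, y)
    \tilde{E}[f_k] = E_p[f_k] \quad \text{for all } k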
18. Major issues
3. Representation: P(x,y) or P(y|x)
• After adding the constraints, we get the objective function: maximize the entropy of p(y|x) subject to the feature constraints above.
• After some long (but simple) math, we get the solution below.
This looks exactly the same as in the logistic regression situation!
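A reconstruction of the missing formula (the standard maximum-entropy solution, consistent with the slide's notation) is the log-linear conditional model:

    p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_k w_k f_k(x, y) \Big),
    \qquad Z(x) = \sum_{y'} \exp\Big( \sum_k w_k f_k(x, y') \Big)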
19. Major issues
3. Representation: P(x,y) or P(y|x)
• Now put this in the context of a chain. Noting that each label variable depends only on the previous label and on the whole observation sequence x, we have:
    p(y | x) = \prod_{i=1}^{L} p(y_i | y_{i-1}, x)
             = \prod_{i=1}^{L} \exp[ w^T f(x, i, y_i, y_{i-1}) ] / Z(x, i, y_{i-1}, w)
What’s the problem of this per-state normalization scheme?
20. Label-Bias Problem of MEMMs
• Transitions from a state compete only with each other
– Transition scores are conditional probabilities of the next state given the current state and the observation.
– The total mass of P(y_{t+1} | y_t, x) is sum over y_{t+1} of p(y_{t+1}, y_t, x_1, ..., x_{t+1}),
– while p(y_{t+1}, y_t, x_1, ..., x_{t+1}) = p(y_t, x_1, ..., x_t) p(y_{t+1} | y_t, x_{t+1}) p(x_{t+1}).
– But p(x_{t+1}) is cancelled out by the same term in the numerator, so the total mass is actually p(y_t, x_1, ..., x_t).
– This means:
– 1) Observations do not affect the total mass passed on to the next states!
– 2) But this total mass of a state will be distributed among its next states.
– 3) So the probability of choosing some next state k is biased towards paths through states with fewer outgoing transitions, even when x_{t+1} is completely incompatible with state k!
• States with a single outgoing transition ignore their observations entirely.
21. Label-Bias Problem: Example
• A model for distinguishing ‘rob’ from ‘rib’
• Suppose we get the input sequence ‘rib’
– First step: ‘r’ matches both possible states, so they are equally likely
– Next, ‘i’ is observed, but since both y1 and y4 have only one outgoing transition, they both give probability 1 to their next state
– This does not happen in HMMs – why? Because of p(y_{t+1}|y_t) p(x_{t+1}|y_{t+1})
– Note: if one word is more frequent in training, it will always win
[Figure: state-transition diagram and transition matrix P(Y'|Y) over states y0...y5. From the start state y0, ‘r’ leads with probability 0.5 to each of the two branches; after that, every state has exactly one outgoing transition with probability 1, spelling out r-i-b along one branch and r-o-b along the other.]
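A tiny numeric sketch (illustration only, with made-up scores and state names) of why per-state normalization ignores the observation once a state has a single outgoing transition:

    import math

    def memm_transition_probs(scores):
        # scores: dict next_state -> unnormalized compatibility w^T f(x, y_next, y_cur)
        z = sum(math.exp(s) for s in scores.values())
        return {y: math.exp(s) / z for y, s in scores.items()}

    # A state with a single successor: even if the observation is wildly
    # incompatible with it (score -10), per-state normalization still assigns
    # that transition probability 1.
    print(memm_transition_probs({"y2": -10.0}))                 # {'y2': 1.0}

    # A state with several successors does let the observation matter:
    print(memm_transition_probs({"y2": -10.0, "y5": 3.0}))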
22. Major issues
3. Representation: P(x,y) or P(y|x)
• This problem is corrected by the CRF:

    p(y | x) = \frac{1}{Z(x, w)} \exp\Big[ \sum_{i=1}^{L} w^T f(x, i, y_i, y_{i-1}) \Big]
How to compute w?
Maximum likelihood – but it is hard to compute the gradient of the log partition function!
This can be solved by procedures similar to the forward and backward procedures adopted for HMMs.
Note that this is possible thanks to the local Markov property of the CRF graph.
But the price we must pay is summing over the whole lattice!
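For completeness (a standard result, not printed on the extracted slide), the gradient of the conditional log-likelihood for one training pair (x, y) has the “empirical minus expected” form, and the expectation is exactly what the forward-backward recursions provide:

    \frac{\partial \log p(y|x)}{\partial w}
      = \sum_{i} f(x, i, y_i, y_{i-1})
        - \sum_{i} \sum_{u,v} p(y'_{i-1} = v, y'_i = u \mid x) \, f(x, i, u, v)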
23. More about the conditional
likelihood function
• Let the true model be p0(x) and the estimated one p(x); the negative log likelihood is L(p, x) = -log p(x), where x is a sample. We want to minimize this. Why?
– What we really want to minimize is the KL divergence between the true, unknown model p0(x) and the one being fit, p(x). It is easy to see that min_p KL(p0 || p) = min_p -sum_x p0(x) log p(x) = min_p sum_x p0(x) L(p, x), so minimizing the likelihood loss makes sense (the derivation is written out below).
– If we assume the samples come from p0(x), then by Monte Carlo approximation, min_p sum_x p0(x) L(p, x) = min_p (the average of L(p, x) over the training set). But there is no guarantee of a consistent estimator – the space of p is just too large!
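Written out step by step (a restatement of the argument above in the slide's notation):

    \min_p KL(p_0 \| p) = \min_p \sum_x p_0(x) \log \frac{p_0(x)}{p(x)}
                        = \min_p \Big[ -\sum_x p_0(x) \log p(x) \Big]
                        = \min_p \sum_x p_0(x) L(p, x)
                        \approx \min_p \frac{1}{N} \sum_{i=1}^{N} L(p, x_i)

Here the second line drops the entropy of p_0, which is constant with respect to p, and the last line is the Monte Carlo approximation over the training samples.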
24. More about the conditional
likelihood loss function
• In the case of conditional likelihood, L(p, y, x) = -log p(y|x), and minimizing it is equivalent to minimizing the expected KL divergence, that is:
    \arg\min_{\hat{p}} \sum_x p_0(x) KL( p_0(y|x) \,\|\, \hat{p}(y|x) )
      = \arg\min_{\hat{p}} -\sum_x p_0(x) \sum_y p_0(y|x) \log \hat{p}(y|x)
      = \arg\min_{\hat{p}} -\sum_{x,y} p_0(x, y) \log \hat{p}(y|x)
      = \arg\min_{\hat{p}} \sum_{x,y} p_0(x, y) L(\hat{p}, y, x)
This says that minimizing conditional loss makes sense!
25. Major issues
3. Representation: P(x,y) or P(y|x)
• But in a classification setting, estimating an accurate conditional distribution is not necessary.
• One can instead directly maximize the likelihood ratio between a sample's true labeling and its most competitive alternative labeling.
• This idea leads to the family of so-called “discriminatively trained Markov networks”, including the hidden Markov support vector machine, the max-margin Markov network, and the structured support vector machine (see the margin constraints below).
Here y_i is the label sequence of sample i, and y_i' is any other label sequence.
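The margin constraints behind these models (a reconstruction of the formula missing from the extracted slide, written in a common structured-SVM form; the margin term Delta(y_i, y_i') and slack variables are standard choices) require, for every training sample i and every competing labeling y_i':

    w^T f(x_i, y_i) - w^T f(x_i, y_i') \ge \Delta(y_i, y_i') - \xi_i, \qquad \xi_i \ge 0,

    \min_w \ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i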
27. Summary
• Two key problems in structured prediction:
• 1. Representation – features
• 2. The weight-learning method
• There are many variations of the CRF
• E.g., latent CRF, discriminative CRF, semi-supervised CRF, sparse CRF, kernel CRF, and so on...
• It seems that an online CRF has not been done yet.