Machine Learning for Language Technology
Uppsala University
Department of Linguistics and Philology
Slides borrowed from previous courses.
Thanks to Ryan McDonald (Google Research)
and Prof. Joakim Nivre
Introduction
Linear Classifiers
Classifiers covered so far:
Decision trees
Nearest neighbors
Next two lectures: Linear classifiers
Statistics from Google Scholar (Sept 2013):
“Maximum Entropy” & “NLP”: 11,400 hits (2013), 2,660 (2009), 141 before 2000
“SVM” & “NLP”: 10,900 hits (2013), 2,210 (2009), 16 before 2000
“Perceptron” & “NLP”: 3,160 hits (2013), 947 (2009), 118 before 2000
All are linear classifiers that have become important tools in
Language Technology in the past 10 years or so.
Introduction
Outline
Today:
Preliminaries: input/output, features, etc.
Linear classifiers
Perceptron
Large-margin classifiers (SVMs, MIRA)
Logistic regression
Next time:
Structured prediction with linear classifiers
Structured perceptron
Structured large-margin classifiers (SVMs, MIRA)
Conditional random fields
Case study: Dependency parsing
Preliminaries
Inputs and Outputs
Input: x ∈ X
e.g., document or sentence with some words x = w1 . . . wn, or
a series of previous actions
Output: y ∈ Y
e.g., parse tree, document class, part-of-speech tags,
word-sense
Input/output pair: (x, y) ∈ X × Y
e.g., a document x and its label y
Sometimes x is explicit in y, e.g., a parse tree y will contain
the sentence x
Preliminaries
Feature Representations
We assume a mapping from input-output pairs (x, y) to a
high dimensional feature vector
f(x, y) : X × Y → Rm
In some cases, e.g., binary classification Y = {−1, +1}, we
can map only from the input to the feature space
f(x) : X → Rm
However, most problems in NLP require more than two
classes, so we focus on the multi-class case
For any vector v ∈ Rm, let vj be the jth value
Preliminaries
Features and Classes
All features must be numerical
Numerical features are represented directly as fi (x, y) ∈ R
Binary (boolean) features are represented as fi (x, y) ∈ {0, 1}
Multinomial (categorical) features must be binarized
Instead of: fi (x, y) ∈ {v0, . . . , vp}
We have: fi+0(x, y) ∈ {0, 1}, . . . , fi+p(x, y) ∈ {0, 1}
Such that: fi+j (x, y) = 1 iff fi (x, y) = vj
We need distinct features for distinct output classes
Instead of: fi (x) (1 ≤ i ≤ m)
We have: fi+0m(x, y), . . . , fi+Nm(x, y) for Y = {0, . . . , N}
Such that: fi+jm(x, y) = fi (x) iff y = yj
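As a concrete illustration (not from the slides), here is a minimal Python sketch of binarizing a categorical feature into indicator features; the value set and function name are hypothetical:

```python
# Minimal sketch: binarize a categorical feature into 0/1 indicator features.
# The value set below is hypothetical and only for illustration.
POS_VALUES = ["Noun", "Verb", "Adj"]          # possible values v_0 ... v_p

def binarize(value):
    """Map one categorical value to a list of p+1 indicator features."""
    return [1 if value == v else 0 for v in POS_VALUES]

print(binarize("Verb"))   # [0, 1, 0] -> only the "Verb" indicator fires
```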
Preliminaries
Examples
x is a document and y is a label
fj(x, y) = 1 if x contains the word “interest” and y = “financial”, 0 otherwise
fj(x, y) = % of words in x with punctuation and y = “scientific”
x is a word and y is a part-of-speech tag
fj(x, y) = 1 if x = “bank” and y = Verb, 0 otherwise
Preliminaries
Examples
x is a name, y is a label classifying the name
f0(x, y) = 1 if x contains “George” and y = “Person”, 0 otherwise
f1(x, y) = 1 if x contains “Washington” and y = “Person”, 0 otherwise
f2(x, y) = 1 if x contains “Bridge” and y = “Person”, 0 otherwise
f3(x, y) = 1 if x contains “General” and y = “Person”, 0 otherwise
f4(x, y) = 1 if x contains “George” and y = “Object”, 0 otherwise
f5(x, y) = 1 if x contains “Washington” and y = “Object”, 0 otherwise
f6(x, y) = 1 if x contains “Bridge” and y = “Object”, 0 otherwise
f7(x, y) = 1 if x contains “General” and y = “Object”, 0 otherwise
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]
Preliminaries
Block Feature Vectors
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]
One equal-size block of the feature vector for each label
Input features duplicated in each block
Non-zero values allowed only in one block
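A minimal Python sketch of building such block feature vectors for the name example above; the labels and cue words come from the slide, while the helper names are our own:

```python
# Minimal sketch of block feature vectors for the name example above.
LABELS = ["Person", "Object"]
CUES = ["George", "Washington", "Bridge", "General"]

def input_features(x):
    """Label-independent indicator features over the cue words."""
    return [1 if cue in x.split() else 0 for cue in CUES]

def joint_features(x, y):
    """f(x, y): one block per label, non-zero only in the block of y."""
    phi = input_features(x)
    vec = [0] * (len(CUES) * len(LABELS))
    offset = LABELS.index(y) * len(CUES)
    vec[offset:offset + len(CUES)] = phi
    return vec

print(joint_features("General George Washington", "Person"))  # [1, 1, 0, 1, 0, 0, 0, 0]
print(joint_features("George Washington Bridge", "Object"))   # [0, 0, 0, 0, 1, 1, 1, 0]
```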
Linear Classifiers
Linear Classifiers
Linear classifier: score (or probability) of a particular
classification is based on a linear combination of features and
their weights
Let w ∈ Rm be a high dimensional weight vector
If we assume that w is known, then we define our classifier as
Multiclass Classification: Y = {0, 1, . . . , N}
y = arg max_y w · f(x, y) = arg max_y Σ_{j=0..m} wj × fj(x, y)
Binary Classification just a special case of multiclass
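A multiclass linear classifier is then just an arg max over labels; the sketch below assumes a joint feature function like the hypothetical joint_features above:

```python
# Minimal sketch: multiclass linear classification y = arg max_y w . f(x, y).
def score(w, f, x, y):
    """Dot product between weight vector w and joint feature vector f(x, y)."""
    return sum(w_j * f_j for w_j, f_j in zip(w, f(x, y)))

def predict(w, f, x, labels):
    """Return the label with the highest linear score."""
    return max(labels, key=lambda y: score(w, f, x, y))

# e.g. predict(w, joint_features, "George Washington Bridge", ["Person", "Object"])
# (joint_features is the hypothetical helper from the earlier sketch)
```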
Linear Classifiers
Linear Classifiers - Bias Terms
Often linear classifiers are presented as
y = arg max_y Σ_{j=0..m} wj × fj(x, y) + by
where by is a bias or offset term
But this can be folded into f
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
x=General George Washington, y=Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]
f4(x, y) = 1 if y = “Person”, 0 otherwise
f9(x, y) = 1 if y = “Object”, 0 otherwise
w4 and w9 are now the bias terms for the labels
Linear Classifiers
Binary Linear Classifier
Divides all points:
Linear Classifiers
Multiclass Linear Classifier
Defines regions of space:
i.e., + are all points (x, y) where + = arg max_y w · f(x, y)
Linear Classifiers
Separability
A set of points is separable if there exists a w such that
classification is perfect
(Figure: a separable point set vs. a non-separable point set)
This can also be defined mathematically (and we will shortly)
Linear Classifiers
Supervised Learning – how to find w
Input: training examples T = {(xt, yt)}, t = 1, . . . , |T|
Input: feature representation f
Output: w that maximizes/minimizes some important
function on the training set
minimize error (Perceptron, SVMs, Boosting)
maximize likelihood of data (Logistic Regression)
Assumption: The training data is separable
Not necessary, just makes life easier
There is a lot of good work in machine learning to tackle the
non-separable case
Linear Classifiers
Perceptron
... The resulting hyperplane is called “perceptron” ...
(Witten and Frank, 2005:126)
Linear Classifiers
Perceptron
Choose a w that minimizes error
w = arg min_w Σ_t (1 − 1[yt = arg max_y w · f(xt, y)])
where 1[p] = 1 if p is true, 0 otherwise
This is a 0-1 loss function
Aside: when minimizing error people tend to use hinge-loss or
other smoother loss functions
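As a small illustration with made-up scores, the snippet below contrasts the 0-1 loss with the standard multiclass hinge loss (the hinge loss itself is not defined on this slide):

```python
# Hypothetical scores w . f(x, y) for one instance with correct label "A".
scores = {"A": 1.2, "B": 0.9, "C": -0.3}
correct = "A"

zero_one = 0 if max(scores, key=scores.get) == correct else 1
# Multiclass hinge: max(0, 1 + best wrong score - correct score)
hinge = max(0.0, 1.0 + max(v for y, v in scores.items() if y != correct) - scores[correct])

print(zero_one)          # 0: the prediction is correct, so 0-1 loss is zero
print(round(hinge, 2))   # 0.7: still penalized, the margin over "B" is only 0.3 < 1
```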
Linear Classifiers
Perceptron Learning Algorithm
Training data: T = {(xt, yt)}, t = 1, . . . , |T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T| (random order: shuffle the training set beforehand!)
4. Let y' = arg max_y w(i) · f(xt, y)
5. if y' ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y')
7. i = i + 1
8. return w(i)
See also Daumé (2012: 38-41).
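A direct Python transcription of this pseudo-code, assuming a joint feature function f(x, y) as in the earlier sketches (the helper names and plain-list data structures are our own choices):

```python
import random

def perceptron(train, f, labels, n_epochs=10, seed=0):
    """Multiclass perceptron. train: list of (x, y) pairs; f(x, y): joint
    feature vector (list of numbers); labels: the label set Y."""
    dim = len(f(train[0][0], train[0][1]))
    w = [0.0] * dim
    data = list(train)
    rng = random.Random(seed)
    for _ in range(n_epochs):
        rng.shuffle(data)                        # random order, as on the slide
        for x, y_true in data:
            # y' = arg max_y w . f(x, y)
            y_pred = max(labels, key=lambda y: sum(a * b for a, b in zip(w, f(x, y))))
            if y_pred != y_true:                 # mistake-driven update
                for j, (ft, fp) in enumerate(zip(f(x, y_true), f(x, y_pred))):
                    w[j] += ft - fp
    return w

# e.g. w = perceptron(examples, joint_features, ["Person", "Object"])
# (joint_features is the hypothetical helper from the earlier sketch)
```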
Linear Classifiers
Perceptron: Separability and Margin
Given a training instance (xt, yt), define:
Ȳt = Y − {yt}
i.e., Ȳt is the set of incorrect labels for xt
A training set T is separable with margin γ > 0 if there exists
a vector w with ||w|| = 1 such that:
w · f(xt, yt) − w · f(xt, y') ≥ γ
for all y' ∈ Ȳt, where ||w|| = √(Σj wj²) (Euclidean or L2 norm)
Assumption: the training set is separable with margin γ
Linear Classifiers
Perceptron: Main Theorem
Theorem: For any training set separable with a margin of γ,
the following holds for the perceptron algorithm:
mistakes made during training ≤ R² / γ²
where R ≥ ||f(xt, yt) − f(xt, y')|| for all (xt, yt) ∈ T and y' ∈ Ȳt
Thus, after a finite number of training iterations, the error on
the training set will converge to zero
For proof, see the Appendix to these slides
Linear Classifiers
Perceptron Summary
Learns a linear classifier that minimizes error
Guaranteed to find a w in a finite amount of time
Perceptron is an example of an online learning algorithm
w is updated based on a single training instance in isolation
w(i+1) = w(i) + f(xt, yt) − f(xt, y')
Compare decision trees, which perform batch learning:
all training instances are used to find the best split
Linear Classifiers
Margin
Linear Classifiers
Margin
(Figure: Training vs. Testing)
Denote the value of the margin by γ
Linear Classifiers
Maximizing Margin (i)
For a training set T , the margin of a weight vector w is the
smallest γ such that
w · f(xt, yt) − w · f(xt, y') ≥ γ
for every training instance (xt, yt) ∈ T and y' ∈ Ȳt
Linear Classifiers
Maximizing Margin (ii)
Intuitively maximizing margin makes sense
More importantly, the generalization error on unseen test data is
proportional to the inverse of the margin (for the proof, see
Daumé, 2012: 45-46)
∝ R² / (γ² × |T|)
Perceptron: we have shown that:
If a training set is separable by some margin, the perceptron
will find a w that separates the data
However, the perceptron does not pick w to maximize the
margin!
Linear Classifiers
Maximizing Margin (iii)
Let γ > 0
max_{||w|| ≤ 1} γ
such that:
w · f(xt, yt) − w · f(xt, y') ≥ γ
∀(xt, yt) ∈ T and y' ∈ Ȳt
Note: the algorithm still minimizes error
||w|| is bounded, since scaling w trivially produces a larger margin:
β(w · f(xt, yt) − w · f(xt, y')) ≥ βγ, for some β ≥ 1
Linear Classifiers
Max Margin = Min Norm
Let γ > 0
Max Margin:
max_{||w|| ≤ 1} γ
such that:
w · f(xt, yt) − w · f(xt, y') ≥ γ
∀(xt, yt) ∈ T and y' ∈ Ȳt
=
Min Norm:
min_w (1/2) ||w||²
such that:
w · f(xt, yt) − w · f(xt, y') ≥ 1
∀(xt, yt) ∈ T and y' ∈ Ȳt
Instead of fixing ||w|| we fix the margin γ = 1
Technically γ ∝ 1/||w||
Linear Classifiers
Support Vector Machines
a.k.a. SVM(s)
Linear Classifiers
Support Vector Machines (i)
min (1/2) ||w||²
such that:
w · f(xt, yt) − w · f(xt, y') ≥ 1
∀(xt, yt) ∈ T and y' ∈ Ȳt
Quadratic programming problem – a well known convex
optimization problem
Can be solved with out-of-the-box algorithms
Batch learning algorithm – w set w.r.t. all training points
Linear Classifiers
Support Vector Machines (ii)
Problem: Sometimes |T | is far too large
Thus the number of constraints might make solving the
quadratic programming problem very difficult
Common technique: Sequential Minimal Optimization (SMO)
Sparse: solution depends only on features in support vectors
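For reference, a multiclass linear SVM of this kind can be trained with off-the-shelf tools; the sketch below uses scikit-learn's LinearSVC (a soft-margin, one-vs-rest formulation rather than the exact constrained problem above) on hypothetical data:

```python
# Sketch: training a linear SVM with an off-the-shelf solver (scikit-learn).
# X holds input feature vectors f(x), y holds integer class labels;
# the library handles the per-class weight blocks internally.
from sklearn.svm import LinearSVC

X = [[1, 1, 0, 1],    # "General George Washington"
     [1, 1, 1, 0],    # "George Washington Bridge"
     [1, 1, 0, 0]]    # "George Washington George"
y = [0, 1, 1]          # 0 = Person, 1 = Object

clf = LinearSVC(C=1.0)   # C controls the slack penalty (see slack variables later)
clf.fit(X, y)
print(clf.predict([[0, 1, 1, 0]]))   # predicted label index for a new input
```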
Linear Classifiers
MIRA
Margin Infused Relaxed Algorithm
Linear Classifiers
Margin Infused Relaxed Algorithm (MIRA)
Another option – maximize margin using an online algorithm
Batch vs. Online
Batch – update parameters based on entire training set (SVM)
Online – update parameters based on a single training instance
at a time (Perceptron)
MIRA can be thought of as a max-margin perceptron or an
online SVM
Linear Classifiers
MIRA
Batch (SVMs):
min (1/2) ||w||²
such that:
w · f(xt, yt) − w · f(xt, y') ≥ 1
∀(xt, yt) ∈ T and y' ∈ Ȳt
Online (MIRA):
Training data: T = {(xt, yt)}, t = 1, . . . , |T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. w(i+1) = arg min_{w*} ||w* − w(i)||
   such that:
   w* · f(xt, yt) − w* · f(xt, y') ≥ 1 ∀y' ∈ Ȳt
5. i = i + 1
6. return w(i)
MIRA solves much smaller optimization problems, with only |Ȳt| constraints
Cost: sub-optimal optimization
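A common simplification is single-constraint (“1-best”) MIRA, which has a closed-form update: change w as little as possible so that the correct label beats the current best wrong label by a margin of 1. The sketch below implements that simplified variant, not the full |Ȳt|-constraint optimization:

```python
def mira_update(w, f_correct, f_wrong):
    """1-best MIRA step: smallest change to w such that the correct label
    outscores the best wrong label by a margin of 1 (simplified variant)."""
    delta = [a - b for a, b in zip(f_correct, f_wrong)]    # f(x, y_t) - f(x, y')
    margin = sum(wj * dj for wj, dj in zip(w, delta))      # current score difference
    norm_sq = sum(d * d for d in delta)
    if norm_sq == 0:
        return w
    tau = max(0.0, 1.0 - margin) / norm_sq                 # step size (Lagrange multiplier)
    return [wj + tau * dj for wj, dj in zip(w, delta)]
```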
Linear Classifiers
Interim Summary
What we have covered
Linear classifiers:
Perceptron
SVMs
MIRA
All are trained to minimize error
With or without maximizing margin
Online or batch
What is next
Logistic Regression
Train linear classifiers to maximize likelihood
Linear Classifiers
Logistic Regression
Linear Classifiers
Logistic Regression (i)
Define a conditional probability:
P(y|x) = e^(w·f(x,y)) / Zx, where Zx = Σ_{y'∈Y} e^(w·f(x,y'))
Note: still a linear classifier
arg max_y P(y|x) = arg max_y e^(w·f(x,y)) / Zx
                 = arg max_y e^(w·f(x,y))
                 = arg max_y w · f(x, y)
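A minimal sketch of computing this distribution, using the log-sum-exp trick for numerical stability (the helper names are ours):

```python
import math

def log_probs(w, f, x, labels):
    """Return {y: log P(y|x)} for the log-linear model P(y|x) = exp(w.f(x,y)) / Zx."""
    scores = {y: sum(wj * fj for wj, fj in zip(w, f(x, y))) for y in labels}
    m = max(scores.values())                                   # log-sum-exp for stability
    log_Z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {y: s - log_Z for y, s in scores.items()}
```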
Linear Classifiers
Logistic Regression (ii)
P(y|x) = e^(w·f(x,y)) / Zx
Q: How do we learn the weights w?
A: Set weights to maximize the log-likelihood of the training data:
w = arg max_w Π_t P(yt|xt) = arg max_w Σ_t log P(yt|xt)
In a nutshell, we set the weights w so that we assign as much
probability as possible to the correct label yt for each xt in the training set
Linear Classifiers
Logistic Regression
P(y|x) = e^(w·f(x,y)) / Zx, where Zx = Σ_{y'∈Y} e^(w·f(x,y'))
w = arg max_w Σ_t log P(yt|xt)   (*)
The objective function (*) is concave
Therefore there is a global maximum
No closed form solution, but lots of numerical techniques
Gradient methods (gradient ascent, iterative scaling)
Newton methods (limited-memory quasi-newton)
Linear Classifiers
Logistic Regression Summary
Define conditional probability
P(y|x) = e^(w·f(x,y)) / Zx
Set weights to maximize log-likelihood of training data:
w = arg max_w Σ_t log P(yt|xt)
Can find the gradient and run gradient ascent (or any
gradient-based optimization algorithm)
∇F(w) = (∂F(w)/∂w0, ∂F(w)/∂w1, . . . , ∂F(w)/∂wm)
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
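A minimal batch gradient-ascent sketch for this objective, with an arbitrary learning rate and epoch count and no regularization (both are our choices, not the slides'):

```python
import math

def train_logreg(train, f, labels, lr=0.1, n_epochs=100):
    """Maximize sum_t log P(y_t|x_t) by batch gradient ascent."""
    dim = len(f(train[0][0], train[0][1]))
    w = [0.0] * dim
    for _ in range(n_epochs):
        grad = [0.0] * dim
        for x, y_true in train:
            scores = {y: sum(wj * fj for wj, fj in zip(w, f(x, y))) for y in labels}
            m = max(scores.values())
            Z = sum(math.exp(s - m) for s in scores.values())
            probs = {y: math.exp(scores[y] - m) / Z for y in labels}   # P(y|x)
            for j, fj in enumerate(f(x, y_true)):
                grad[j] += fj                                          # empirical counts
            for y in labels:
                for j, fj in enumerate(f(x, y)):
                    grad[j] -= probs[y] * fj                           # expected counts
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```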
Linear Classifiers
Linear Classification: Summary
Basic form of (multiclass) classifier:
y = arg max_y w · f(x, y)
Different learning methods:
Perceptron – separate data (0-1 loss, online)
Support vector machine – maximize margin (hinge loss, batch)
Logistic regression – maximize likelihood (log loss, batch)
All three methods are widely used in NLP
Linear Classifiers
Aside: Min error versus max log-likelihood (i)
Highly related but not identical
Example: consider a training set T with 1001 points
1000 × (xi , y = 0) = [−1, 1, 0, 0] for i = 1 . . . 1000
1 × (x1001, y = 1) = [0, 0, 3, 1]
Now consider w = [−1, 0, 1, 0]
Error in this case is 0 – so w minimizes error
[−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
[−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3
However, log-likelihood = −126.9 (omit calculation)
Linear Classifiers
Aside: Min error versus max log-likelihood (ii)
Highly related but not identical
Example: consider a training set T with 1001 points
1000 × (xi , y = 0) = [−1, 1, 0, 0] for i = 1 . . . 1000
1 × (x1001, y = 1) = [0, 0, 3, 1]
Now consider w = [−1, 7, 1, 0]
Error in this case is 1 – so w does not minimize error
[−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
[−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4
However, log-likelihood = -1.4
Better log-likelihood and worse error
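The omitted log-likelihood calculations are easy to reproduce; the check below assumes, as the dot products above do, that f(xi, 1) = [0, 0, −1, 1] for the first 1000 points and f(x1001, 0) = [3, 1, 0, 0]:

```python
import math

def loglik(w, examples):
    """Sum of log P(y_t|x_t) for a two-label log-linear model."""
    total = 0.0
    for feats_correct, feats_wrong, count in examples:
        s_c = sum(a * b for a, b in zip(w, feats_correct))
        s_w = sum(a * b for a, b in zip(w, feats_wrong))
        total += count * (s_c - math.log(math.exp(s_c) + math.exp(s_w)))
    return total

data = [([-1, 1, 0, 0], [0, 0, -1, 1], 1000),   # the 1000 points with y = 0
        ([0, 0, 3, 1], [3, 1, 0, 0], 1)]        # the single point with y = 1
print(round(loglik([-1, 0, 1, 0], data), 1))    # -126.9
print(round(loglik([-1, 7, 1, 0], data), 1))    # -1.4
```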
Linear Classifiers
Aside: Min error versus max log-likelihood (iii)
Max likelihood ≠ min error
Max likelihood pushes as much probability as possible onto the
correct labeling of the training instances
Even at the cost of mislabeling a few examples
Min error forces all training instances to be correctly classified
SVMs with slack variables – allow some examples to be classified
wrong if the resulting margin is improved on other examples
Linear Classifiers
Aside: Max margin versus max log-likelihood
Let’s re-write the max likelihood objective function
w = arg max_w Σ_t log P(yt|xt)
  = arg max_w Σ_t log [ e^(w·f(xt,yt)) / Σ_{y'∈Y} e^(w·f(xt,y')) ]
  = arg max_w Σ_t [ w · f(xt, yt) − log Σ_{y'∈Y} e^(w·f(xt,y')) ]
Pick w to maximize score difference between correct labeling
and every possible labeling
Margin: maximize difference between correct and all incorrect
The above formulation is often referred to as the soft-margin
Linear Classifiers
Aside: Logistic Regression = Maximum Entropy
Well known equivalence
Max Ent: maximize entropy subject to constraints on features
Empirical feature counts must equal expected counts
Quick intuition
Partial derivative in logistic regression
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
The first term is the empirical feature counts and the second term is
the expected counts
Setting the derivative to zero maximizes the function
Therefore, when the two counts are equal, we optimize the
logistic regression objective!
Appendix
Proofs and Derivations
Convergence Proof for Perceptron
Perceptron Learning Algorithm
Training data: T = {(xt, yt)}, t = 1, . . . , |T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. Let y' = arg max_y w(i) · f(xt, y)
5. if y' ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y')
7. i = i + 1
8. return w(i)
Let u be a weight vector with ||u|| = 1 that separates the data with margin γ
w(k−1) are the weights before the kth mistake
Suppose the kth mistake is made at the tth example, (xt, yt)
y' = arg max_y w(k−1) · f(xt, y), y' ≠ yt
w(k) = w(k−1) + f(xt, yt) − f(xt, y')
Now: u · w(k) = u · w(k−1) + u · (f(xt, yt) − f(xt, y')) ≥ u · w(k−1) + γ
Now: w(0) = 0 and u · w(0) = 0, so by induction on k, u · w(k) ≥ kγ
Now: since u · w(k) ≤ ||u|| × ||w(k)|| and ||u|| = 1, then ||w(k)|| ≥ kγ
Now: ||w(k)||² = ||w(k−1)||² + ||f(xt, yt) − f(xt, y')||² + 2 w(k−1) · (f(xt, yt) − f(xt, y'))
so ||w(k)||² ≤ ||w(k−1)||² + R²
(since R ≥ ||f(xt, yt) − f(xt, y')|| and w(k−1) · f(xt, yt) − w(k−1) · f(xt, y') ≤ 0)
Convergence Proof for Perceptron
Perceptron Learning Algorithm
We have just shown that ||w(k)|| ≥ kγ and ||w(k)||² ≤ ||w(k−1)||² + R²
By induction on k, and since w(0) = 0 and ||w(0)||² = 0,
||w(k)||² ≤ kR²
Therefore,
k²γ² ≤ ||w(k)||² ≤ kR²
and solving for k:
k ≤ R² / γ²
Therefore the number of errors is bounded!
Gradient Ascent for Logistic Regression
Gradient Ascent
Let F(w) = Σ_t log [ e^(w·f(xt,yt)) / Zxt ]
Want to find arg max_w F(w)
Set w(0) = 0
Iterate until convergence:
w(i) = w(i−1) + α ∇F(w(i−1))
α > 0 and set so that F(w(i)) > F(w(i−1))
∇F(w) is the gradient of F w.r.t. w
A gradient is the vector of all partial derivatives over the variables wi
i.e., ∇F(w) = (∂F(w)/∂w0, ∂F(w)/∂w1, . . . , ∂F(w)/∂wm)
Gradient ascent will always find w to maximize F
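Since hand-derived gradients are easy to get wrong, a finite-difference check is a useful sanity test before running gradient ascent; F and grad_F below are placeholders for whatever objective and analytic gradient you implement:

```python
def gradient_check(F, grad_F, w, eps=1e-6, tol=1e-4):
    """Compare an analytic gradient with central finite differences."""
    g = grad_F(w)
    for j in range(len(w)):
        w_plus = list(w);  w_plus[j] += eps
        w_minus = list(w); w_minus[j] -= eps
        numeric = (F(w_plus) - F(w_minus)) / (2 * eps)
        assert abs(numeric - g[j]) < tol, f"gradient mismatch at coordinate {j}"
    return True
```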
Gradient Ascent for Logistic Regression
The partial derivatives
Need to find all partial derivatives ∂F(w)/∂wi
F(w) = Σ_t log P(yt|xt)
     = Σ_t log [ e^(w·f(xt,yt)) / Σ_{y'∈Y} e^(w·f(xt,y')) ]
     = Σ_t log [ e^(Σ_j wj×fj(xt,yt)) / Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) ]
Gradient Ascent for Logistic Regression
Partial derivatives - some reminders
1. ∂/∂x log F = (1/F) ∂F/∂x
We always assume log is the natural logarithm loge
2. ∂/∂x e^F = e^F ∂F/∂x
3. ∂/∂x Σ_t Ft = Σ_t ∂Ft/∂x
4. ∂/∂x (F/G) = (G ∂F/∂x − F ∂G/∂x) / G²
Gradient Ascent for Logistic Regression
The partial derivatives
∂F(w)/∂wi = ∂/∂wi Σ_t log [ e^(Σ_j wj×fj(xt,yt)) / Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) ]
          = Σ_t ∂/∂wi log [ e^(Σ_j wj×fj(xt,yt)) / Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) ]
          = Σ_t [ Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) / e^(Σ_j wj×fj(xt,yt)) ] × ∂/∂wi [ e^(Σ_j wj×fj(xt,yt)) / Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) ]
          = Σ_t [ Zxt / e^(Σ_j wj×fj(xt,yt)) ] × ∂/∂wi [ e^(Σ_j wj×fj(xt,yt)) / Zxt ]
Gradient Ascent for Logistic Regression
The partial derivatives
Now,
∂/∂wi [ e^(Σ_j wj×fj(xt,yt)) / Zxt ]
  = [ Zxt ∂/∂wi e^(Σ_j wj×fj(xt,yt)) − e^(Σ_j wj×fj(xt,yt)) ∂Zxt/∂wi ] / Zxt²
  = [ Zxt e^(Σ_j wj×fj(xt,yt)) fi(xt, yt) − e^(Σ_j wj×fj(xt,yt)) ∂Zxt/∂wi ] / Zxt²
  = [ e^(Σ_j wj×fj(xt,yt)) / Zxt² ] ( Zxt fi(xt, yt) − ∂Zxt/∂wi )
  = [ e^(Σ_j wj×fj(xt,yt)) / Zxt² ] ( Zxt fi(xt, yt) − Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) fi(xt, y') )
because
∂Zxt/∂wi = ∂/∂wi Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) = Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) fi(xt, y')
Gradient Ascent for Logistic Regression
The partial derivatives
From before,
∂/∂wi [ e^(Σ_j wj×fj(xt,yt)) / Zxt ] = [ e^(Σ_j wj×fj(xt,yt)) / Zxt² ] ( Zxt fi(xt, yt) − Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) fi(xt, y') )
Sub this in,
∂F(w)/∂wi = Σ_t [ Zxt / e^(Σ_j wj×fj(xt,yt)) ] × ∂/∂wi [ e^(Σ_j wj×fj(xt,yt)) / Zxt ]
          = Σ_t (1/Zxt) ( Zxt fi(xt, yt) − Σ_{y'∈Y} e^(Σ_j wj×fj(xt,y')) fi(xt, y') )
          = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} [ e^(Σ_j wj×fj(xt,y')) / Zxt ] fi(xt, y')
          = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
Gradient Ascent for Logistic Regression
FINALLY!!!
After all that,
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
And the gradient is:
∇F(w) = (∂F(w)/∂w0, ∂F(w)/∂w1, . . . , ∂F(w)/∂wm)
So we can now use gradient ascent to find w!!