CS571: Gradient Descent

Gradient Descent
Natural Language Processing
Emory University
Jinho D. Choi

Supervised Learning
Training data: $(X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
Input: $x$; output: $y = \pm 1$ (binomial distribution)
Prediction: $\hat{y} = f(x)$ predicts the output of $x$
Expected risk (unknown!): $E(f) = \int \ell(\hat{y}; y) \, dP(x, y)$, with loss function $\ell$ and joint distribution $P(x, y)$
Empirical risk (minimize!): $\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$
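A minimal sketch of the empirical risk as an average loss over the training pairs. The zero-one loss, the toy data, and the names `empirical_risk`, `zero_one_loss`, and `f` are assumptions made for illustration; the slides do not fix a loss at this point.

```python
def empirical_risk(f, loss, data):
    """Average loss of predictor f over a list of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def zero_one_loss(y_hat, y):
    """Assumed loss for this sketch: 1 for a wrong prediction, 0 otherwise."""
    return 0 if y_hat == y else 1


# Hypothetical toy data: scalar inputs with labels in {-1, +1}.
data = [(2.0, 1), (-1.0, -1), (0.5, -1)]


def f(x):
    """Hypothetical predictor: the sign of the input."""
    return 1 if x >= 0 else -1


risk = empirical_risk(f, zero_one_loss, data)
```

Only the third instance is misclassified, so the risk is one third. The expected risk cannot be computed this way because $P(x, y)$ is unknown.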

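Linear Prediction
Linear function: $\hat{y} = f(x) = w^\top \Phi(x) = w^\top x$, where $\Phi(x)$ is the feature vector
Least squares: $\ell(\hat{y}; y) = \frac{1}{2} (\hat{y} - y)^2$, i.e. $\ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2$
Empirical risk: $\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$
Find a weight vector that minimizes the loss.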
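A small sketch of the linear prediction and the least squares loss. The weights, the feature vector, and the helper names `dot` and `least_squares_loss` are assumed for illustration only.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def least_squares_loss(w, x, y):
    """0.5 * (w^T x - y)^2 for a single instance."""
    return 0.5 * (dot(w, x) - y) ** 2


# Hypothetical weights and features; the last feature acts as a bias term.
w = [0.5, -1.0, 0.0]
x = [1.0, 0.0, 1.0]
y = 1

y_hat = dot(w, x)                  # linear prediction
loss = least_squares_loss(w, x, y)
```

Here the prediction is 0.5, so the loss is 0.5 * (0.5 - 1)^2 = 0.125.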

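Gradient Descent
$w_{t+1} \leftarrow w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$
$\eta_t$: learning rate; $\frac{\partial}{\partial w} \ell$: derivative of the loss
Minimize loss
Derivative → 0
Global optimum?
Convex optimization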

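Gradient Descent
How often is the weight vector updated?
Update rule: $w_{t+1} \leftarrow w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$
Least squares loss: $\ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2$
Derivative: $\frac{\partial}{\partial w} \ell(w, x; y) = \frac{\partial}{\partial w} \frac{1}{2} (w^\top x - y)^2 = (w^\top x - y) x$
Substituting: $w_{t+1} \leftarrow w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i) x_i$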
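The batch update for least squares can be sketched as follows. The toy data, the fixed learning rate, and the function name `batch_gradient_step` are assumptions for illustration. Note that the weight vector changes only once per pass over all n instances.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_gradient_step(w, data, eta):
    """One update: w <- w - eta * (1/n) * sum_i (w^T x_i - y_i) x_i."""
    n = len(data)
    grad = [0.0] * len(w)
    for x, y in data:
        residual = dot(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj / n
    return [wj - eta * gj for wj, gj in zip(w, grad)]


# Hypothetical data: [feature, bias] with labels in {-1, +1}.
data = [([1.0, 1.0], 1.0), ([-1.0, 1.0], -1.0)]

w = [0.0, 0.0]
for _ in range(200):
    w = batch_gradient_step(w, data, eta=0.5)
```

On this data the loss is minimized at w = [1, 0], and the iterates approach that point.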

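Stochastic Gradient Descent
Batch: $w_{t+1} \leftarrow w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i) x_i$
Stochastic: $w_{t+1} \leftarrow w_t - \eta_t (w_t^\top x_i - y_i) x_i$, updated for every instance
[Figure: plot with labels 0, +, -]
$w_0 \leftarrow 0$
$w_0^\top x_1 > 0$: $w_1 \leftarrow w_0 - \eta (+1) x_1$
$w_1^\top x_2 < 0$: $w_2 \leftarrow w_1 - \eta (+1) x_2$
$w_2^\top x_3 < 0$: $w_3 \leftarrow w_2 - \eta (-1) x_3$
$w_3^\top x_4 > 0$: $w_4 \leftarrow w_3 - \eta (-1) x_4$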
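A sketch of the stochastic variant for least squares, where the weights change after every instance rather than once per pass. The data, the learning rate, and the name `sgd_epoch` are assumed for illustration.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_epoch(w, data, eta):
    """One pass over the data, updating w after each instance."""
    for x, y in data:
        residual = dot(w, x) - y
        w = [wj - eta * residual * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical data: [feature, bias] with labels in {-1, +1}.
data = [([1.0, 1.0], 1.0), ([-1.0, 1.0], -1.0)]

w = [0.0, 0.0]
for _ in range(100):
    w = sgd_epoch(w, data, eta=0.1)
```

With two instances, each epoch performs two updates, compared with one update per pass in the batch version.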

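Perceptron
Stochastic gradient descent: $w_{t+1} \leftarrow w_t - \eta_t \nabla \ell$
Least squares: $\ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2$, $\nabla \ell = (w^\top x - y) x$
Perceptron: $\ell(w, x; y) = \max\{0, -w^\top x \cdot y\}$, $\nabla \ell = \begin{cases} -x \cdot y & \text{if } w^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$
Perceptron update: $w_{t+1} \leftarrow w_t + \eta_t \begin{cases} x \cdot y & \text{if } w_t^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$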
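The perceptron update can be sketched as below, using the strict inequality from the slide. One consequence of the strict test is that an all-zero weight vector never triggers an update, so this example starts from a non-zero vector. The starting weights, the instance, and the name `perceptron_update` are assumptions for illustration.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_update(w, x, y, eta=1.0):
    """Return w + eta * (x * y) if the instance is misclassified, else w."""
    if dot(w, x) * y < 0:
        return [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical starting point and instance with label +1.
w0 = [-1.0, 0.0]
x, y = [1.0, 1.0], 1

w1 = perceptron_update(w0, x, y)  # w0^T x * y = -1 < 0, so w is updated
w2 = perceptron_update(w1, x, y)  # w1^T x * y = 1, so w is unchanged
```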

Averaged Perceptron
The final hyperplane may be overfitted to later instances.
Take the average of all hyperplanes, including ones that are not updated.

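Averaged Perceptron
Average of all weight vectors: $w \leftarrow \frac{1}{c} \sum_{t=0}^{c-1} w_t$, where $w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$ if $w_t^\top x \cdot y < 0$
Sparse vector?
Initialization: $c \leftarrow 1$
Update rule, for every instance:
If $w_t^\top x \cdot y < 0$: $w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$ and $v_{t+1} \leftarrow v_t + \eta_t \cdot c (x \cdot y)$
$c \leftarrow c + 1$
After the last instance: $w \leftarrow w - \frac{1}{c} \cdot v$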
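A sketch comparing the counter-based update with an explicit average of every weight vector, to check that both give the same result. It assumes that `w` and `v` change only on misclassified instances, that `v` starts at zero, and that the starting vector `w0` is non-zero so the strict test can fire; the data and function names are made up for illustration.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def averaged_perceptron(data, w0, eta=1.0):
    """Averaging via the counter c and the accumulator v."""
    w, v, c = list(w0), [0.0] * len(w0), 1
    for x, y in data:
        if dot(w, x) * y < 0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
            v = [vj + eta * c * y * xj for vj, xj in zip(v, x)]
        c += 1
    return [wj - vj / c for wj, vj in zip(w, v)]


def naive_average(data, w0, eta=1.0):
    """Explicit average of w_0, ..., w_n, including unchanged vectors."""
    w = list(w0)
    snapshots = [list(w)]
    for x, y in data:
        if dot(w, x) * y < 0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
        snapshots.append(list(w))
    return [sum(col) / len(snapshots) for col in zip(*snapshots)]


# Hypothetical data: [feature, bias] with labels in {-1, +1}.
data = [([1.0, 1.0], 1), ([-1.0, 1.0], -1), ([2.0, 1.0], 1), ([-2.0, 1.0], -1)]
w0 = [-1.0, 0.0]

fast = averaged_perceptron(data, w0)
slow = naive_average(data, w0)
```

The counter-based version touches only the features of misclassified instances, whereas the explicit average adds a full vector for every instance.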

Multinomial Perceptron
Binomial distribution requires 1 hyperplane to separate 2 classes.
How many for m classes?
Multinomial distribution requires m hyperplanes to separate m classes.

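Multinomial Perceptron
5 features (including bias)
Binomial: $y \in \{-1, 1\}$
$w = [a \; b \; c \; d \; e]$, $x = [1 \; 0 \; 0 \; 1 \; 0]$
$w^\top x = a + d$, $\hat{y} = \begin{cases} 1 & \text{if } w^\top x \geq 0 \\ -1 & \text{otherwise} \end{cases}$
Multinomial: $y \in \{0, 1, 2, 3\}$
$w = [a_0 \; a_1 \; a_2 \; a_3 \; b_0 \; b_1 \; b_2 \; b_3 \; c_0 \; c_1 \; c_2 \; c_3 \; d_0 \; d_1 \; d_2 \; d_3 \; e_0 \; e_1 \; e_2 \; e_3]$
$w_y^\top x = a_y + d_y$, $\hat{y} = \arg\max_y w_y^\top x$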
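A sketch of the multinomial prediction over one flat weight vector, assuming the layout shown above in which the weight of feature j for class y sits at index j * m + y. The weight values and the names `class_score` and `predict` are hypothetical.

```python
def class_score(w, x, y, m):
    """w_y^T x, reading the weights of class y out of the flat vector w."""
    return sum(w[j * m + y] * xj for j, xj in enumerate(x))


def predict(w, x, m):
    """Class with the highest score; ties go to the lowest class index."""
    return max(range(m), key=lambda y: class_score(w, x, y, m))


m = 4                         # classes 0..3
x = [1, 0, 0, 1, 0]           # 5 features (including bias)
w = [0.0] * (len(x) * m)
w[0 * m + 2] = 1.5            # a_2 (made-up value)
w[3 * m + 2] = 0.5            # d_2 (made-up value)
w[3 * m + 1] = 1.0            # d_1 (made-up value)

scores = [class_score(w, x, y, m) for y in range(m)]
y_hat = predict(w, x, m)
```

Only the first and fourth features are active, so each class score reduces to a_y + d_y, and class 2 wins with 1.5 + 0.5 = 2.0.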

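Binomial vs. Multinomial Perceptron
Binomial: if $w_t^\top x \cdot y < 0$: $w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$
Multinomial: if $y \neq \hat{y}$: $w_{y,t+1} \leftarrow w_{y,t} + \eta_t \cdot x$ and $w_{\hat{y},t+1} \leftarrow w_{\hat{y},t} - \eta_t \cdot x$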

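Hinge Loss
Perceptron: $\ell(w, x; y) = \max\{0, -w^\top x \cdot y\}$, $\nabla \ell = \begin{cases} -x \cdot y & \text{if } w^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$
Hinge loss: $\ell(w, x; y) = \max\{0, 1 - w^\top x \cdot y\}$, $\nabla \ell = \begin{cases} -x \cdot y & \text{if } w^\top x \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}$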
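A short sketch contrasting the two losses on an instance that is classified correctly but with a margin below 1. The weights, the instance, and the function names are made up for illustration.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_loss(w, x, y):
    """max{0, -w^T x * y}"""
    return max(0.0, -dot(w, x) * y)


def hinge_loss(w, x, y):
    """max{0, 1 - w^T x * y}"""
    return max(0.0, 1.0 - dot(w, x) * y)


def hinge_gradient(w, x, y):
    """-x * y when the margin is below 1, otherwise the zero vector."""
    if dot(w, x) * y < 1:
        return [-y * xj for xj in x]
    return [0.0] * len(x)


# Hypothetical instance: correct side of the hyperplane, margin 0.5.
w = [0.5, 0.0]
x, y = [1.0, 1.0], 1

p_loss = perceptron_loss(w, x, y)
h_loss = hinge_loss(w, x, y)
h_grad = hinge_gradient(w, x, y)
```

The perceptron loss is already zero here, while the hinge loss still penalizes the instance and yields a non-zero gradient.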

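Adaptive Gradient Descent
Perceptron: if $w_t^\top x \cdot y < 0$
Hinge loss: if $w_t^\top x \cdot y < 1$
Fixed learning rate: $w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$
Adaptive learning rate: $g_{t+1} \leftarrow g_t + x \circ x$ (element-wise product)
$w_{t+1} \leftarrow w_t + \frac{\eta}{\rho + \sqrt{g_{t+1}}} \cdot (x \cdot y)$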
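A sketch of the adaptive update with a per-feature learning rate. It assumes the hinge-loss condition (margin below 1), an element-wise product for the accumulator, and small illustrative values for eta and rho; the function name `adaptive_update` is hypothetical.

```python
import math


def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def adaptive_update(w, g, x, y, eta=0.1, rho=1e-8):
    """Accumulate squared features in g and scale each step by 1/(rho + sqrt(g))."""
    if dot(w, x) * y < 1:  # hinge-loss condition (assumed for this sketch)
        g = [gj + xj * xj for gj, xj in zip(g, x)]
        w = [wj + eta / (rho + math.sqrt(gj)) * (xj * y)
             for wj, gj, xj in zip(w, g, x)]
    return w, g


# Hypothetical instance: only the first feature is active.
w, g = [0.0, 0.0], [0.0, 0.0]
x, y = [2.0, 0.0], 1

w, g = adaptive_update(w, g, x, y)
```

Because the labels are ±1, the squared gradient of an updated instance equals the element-wise square of x. Features that have been seen often accumulate a large g and take smaller steps, and rho keeps the denominator non-zero for features that have never been active.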
