Jinho D. Choi
jinho.choi@emory.edu
Dynamic Feature Induction
The Last Gist to the SOTA
North American Chapter of the Association for Computational Linguistics
June 13th, 2016
Feature Engineering
2
Discovering a good set of features
Finding good combinations of features
[Figure: individual templates w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}, t_{i-2}, t_{i-1} map to a sparse one-hot vector of low dimensional features; joined templates such as w_{i-1}+w_i, w_{i-1}+w_i+w_{i+1}, and t_{i-1}+w_i map to a much larger sparse one-hot vector of high dimensional features.]
Nonlinearity in NLP
3
Linear classifiers generally work well for NLP
Are NLP feature spaces linearly separable?
[Figure: low dimensional features → high dimensional features. Deep Learning?]
Dynamic Feature Induction
4
[Diagram: a sparse feature vector x1 is fed to the classifier, which predicts ŷ1. If ŷ1 ≠ y1, Feature Induction adds strong feature combinations to 𝑭. Feature Expansion then extends the feature vector with the combinations in 𝑭, giving x2, and the classifier predicts ŷ2 = y2.]
Feature Induction
5
Algorithm 1: Feature Induction
Input:  D: training set, η: learning rate.
Output: w: weight vector, F: induced feature set.
 1: w ← g ← 0
 2: F ← ∅
 3: until max epoch is reached do
 4:   foreach (x, y) ∈ D do
 5:     ŷ ← argmax_{y'∈Y} (w · Φ(x, y', F) − I_y(y'))
 6:     if y ≠ ŷ then
 7:       ∂ ← Φ(x, y, F) − Φ(x, ŷ, F)
 8:       g ← g + ∂ ⊙ ∂
 9:       w ← w + (η / (ρ + √g)) · ∂
10:       v ← [w ⊙ Φ(x, y, ∅)]_y − [w ⊙ Φ(x, ŷ, ∅)]_ŷ
11:       L ← arg-k-max_{∀i} v_i
12:       for i = 2 to |L| do
13:         F ← F ∪ {(L_1, L_i)}
14: return w, F

The feature map Φ(x, y, F) returns a (d × l)-dimensional vector, where d and l are the numbers of features and labels; each dimension contains the value for a particular feature and label. If certain combinations between features in x exist in F, they are appended to the feature vector along with the low dimensional features. The indicator function I_y(y') = 1 if y = y', 0 otherwise, allows the algorithm to be optimized for the hinge loss for multiclass classification (Crammer and Singer, 2002):

ℓ_h = max[0, 1 + w · (Φ(x, ŷ, F) − Φ(x, y, F))]

High dimensional features in F are thus incrementally induced and learned along with the low dimensional features during training. During decoding, each feature set is extended by the induced features in F, and the prediction is made using the extended feature set. The size of F can grow up to |X|², where |X| is the number of low dimensional features, but in practice |F| is closer to |X|/4.

Hinge loss (Crammer and Singer, 2002)
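Below is a minimal Python sketch of the cost-augmented prediction and AdaGrad update in lines 5-9 of Algorithm 1, assuming sparse binary features indexed by integers. The names train_step, eta, and rho are illustrative; this is not the NLP4J implementation.

```python
import math
from collections import defaultdict

def train_step(x, y, labels, w, g, F, eta=0.01, rho=0.1):
    """One cost-augmented prediction and AdaGrad update (lines 5-9 of Algorithm 1).

    x      : list of active low dimensional feature indices (binary features)
    y      : gold label; labels: list of all labels
    w, g   : defaultdict(float) mapping (feature, label) -> weight / squared-gradient sum
    F      : set of induced feature index pairs (i, j)
    """
    # Active dimensions of Phi(x, ., F): low dimensional features plus induced
    # pairs whose members are both present in x.
    active = list(x) + [p for p in F if p[0] in x and p[1] in x]

    # Line 5: argmax of w . Phi(x, y', F) - I_y(y'); handicapping the gold label by 1
    # optimizes the multiclass hinge loss (Crammer and Singer, 2002).
    def score(label):
        return sum(w[(f, label)] for f in active) - (1.0 if label == y else 0.0)
    y_hat = max(labels, key=score)

    if y_hat != y:
        # Lines 7-9: the partial vector has +1 on (f, y) and -1 on (f, y_hat);
        # g accumulates its square and scales the learning rate per dimension (AdaGrad).
        for f, label, d in [(f, y, +1.0) for f in active] + [(f, y_hat, -1.0) for f in active]:
            g[(f, label)] += d * d
            w[(f, label)] += (eta / (rho + math.sqrt(g[(f, label)]))) * d
    return y_hat

w, g = defaultdict(float), defaultdict(float)
print(train_step(x=[10, 2, 3, 4], y="NN", labels=["NN", "VB"], w=w, g=g, F=set()))
```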
Feature Induction
6
[Algorithm 1 (as on slide 5), highlighting lines 8-9: the AdaGrad update g ← g + ∂ ⊙ ∂ and w ← w + (η / (ρ + √g)) · ∂.]

AdaGrad Optimization (Duchi et al., 2011)
Feature Induction
7
[Algorithm 1 (as on slide 5), highlighting line 10: the strength vector v ← [w ⊙ Φ(x, y, ∅)]_y − [w ⊙ Φ(x, ŷ, ∅)]_ŷ.]

Strength Vector
Strength Vector
8
[Figure 2: overview of dynamic feature induction during training. For a gold label y0 and a predicted label y2, the strength vector is computed as
v = [w ⊙ Φ(x, y0, ∅)]_y0 − [w ⊙ Φ(x, y2, ∅)]_y2,
i.e., the feature strength for y0 against y2. In most cases the values of Φ(x) are either 0 or 1. 'arg-k-max' returns the ordered list of indices whose values in v are top-k and greater than 0.]
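A minimal sketch of lines 10-13 of Algorithm 1, assuming binary features so that the Hadamard product with Φ reduces to reading off the weights of the active features; induce_pairs and the example weights below are hypothetical.

```python
def induce_pairs(w, x, y_gold, y_pred, k=3):
    """Compute the strength vector v and induce feature pairs (lines 10-13 of Algorithm 1).

    w : dict mapping (feature index, label) -> weight
    x : active low dimensional feature indices (binary features, so Phi_i is 0 or 1)
    Returns the set of induced pairs {(L1, L2), (L1, L3), ...}.
    """
    # Line 10: v_i = w[i, y_gold] - w[i, y_pred] for each active low dimensional feature.
    v = {i: w.get((i, y_gold), 0.0) - w.get((i, y_pred), 0.0) for i in x}

    # Line 11: 'arg-k-max' keeps the ordered indices whose strengths are top-k and > 0.
    L = sorted((i for i in x if v[i] > 0), key=lambda i: -v[i])[:k]

    # Lines 12-13: pair the strongest feature L[0] with each of the remaining ones.
    return {(L[0], i) for i in L[1:]} if len(L) > 1 else set()

# Example: feature 10 is the strongest for the gold label, so (10, 2) and (10, 3) are induced.
w = {(10, "NN"): 0.9, (2, "NN"): 0.5, (3, "NN"): 0.4, (4, "NN"): -0.2, (10, "VB"): 0.1}
print(induce_pairs(w, x=[10, 2, 3, 4], y_gold="NN", y_pred="VB", k=3))
```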
Feature Induction
9
[Algorithm 1 (as on slide 5), highlighting lines 11-13: feature induction with k = 3. If L = [i, j, k], the two pairs (i, j) and (i, k) are added to F; increasing k beyond this cutoff did not show further improvement. All induced features are created by joining only low dimensional features; the algorithm never joins a high dimensional feature with either a low dimensional or another high dimensional feature, which keeps the feature space from blowing up.]

Other combinations?
Dynamic Feature Induction
10
[Diagram repeated from slide 4: x1 → Classifier → ŷ1; if ŷ1 ≠ y1, Feature Induction updates 𝑭; Feature Expansion extends the feature vector with 𝑭 → x2 → Classifier → ŷ2 = y2.]
Feature Induction Revisited
11
[Algorithm 1 (as on slide 5).]

Locally Optimal Learning to Search
Features in most NLP tasks are extracted from structures (e.g., sequence, tree). For structured learning, we adapt "locally optimal learning to search" (Chang et al., 2015b), a member of imitation learning similar to DAgger (Ross et al., 2011). LOLS not only performs well relative to the reference policy but can also improve upon it, showing very good results for tasks such as part-of-speech tagging and dependency parsing. We adapt LOLS by defining the reference policy as follows: the reference policy π determines how often the gold label y is picked over the predicted label ŷ to build a structure. For all our experiments, π is initialized to 0.95, so during the first epoch y is picked about 95% of the time.
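A tiny sketch of the adapted reference policy, assuming the probability of picking the gold label decays as π^epoch, as the slide's 0.95 and 0.95² figures suggest; rollin_label is an illustrative name, not the NLP4J API.

```python
import random

def rollin_label(gold, predicted, epoch, pi=0.95):
    """LOLS-style reference policy sketch: when building a structure during training,
    use the gold label with probability pi**epoch (about 0.95 in epoch 1, 0.90 in
    epoch 2, ...); otherwise follow the model's own prediction.
    """
    return gold if random.random() < pi ** epoch else predicted

# Early epochs mostly follow the gold labels, later epochs follow the model.
print(sum(rollin_label("NN", "VB", epoch=1) == "NN" for _ in range(10000)) / 10000)
```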
Given a feature index pair (i, j) representing strong features for y against ŷ (Section 3.1), the index of the induced feature is computed as
k ← h_int→int(i · |X| + j) mod |F|.
For efficiency, feature hashing is adapted to our system such that the induced feature set F is actually not a set but a fixed-size boolean array, where each dimension represents the validity of the corresponding induced feature. Thus, line 13 in Algorithm 1 is changed to:
k ← h_int→int(L_1 · |X| + L_i) mod |F|;  F_k ← True.
For the choice of h, xxHash is used, a fast non-cryptographic hash algorithm achieving a perfect score on the Q.Score (https://github.com/Cyan4973/xxHash).
Feature Hashing
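A sketch of the hashed representation of F, under the assumptions that zlib.crc32 stands in for the xxHash int-to-int hash and that PSI and NUM_LOW_DIM are placeholder sizes; the real system uses xxHash and a tuned array size.

```python
import struct
import zlib

PSI = 1 << 20          # assumed size of the boolean array F (tuned per task)
NUM_LOW_DIM = 400_000  # assumed |X|: number of low dimensional features

def pair_index(i, j, num_low_dim=NUM_LOW_DIM, psi=PSI):
    """Hash a pair of low dimensional feature indices (i, j) into [0, psi).
    The paper uses xxHash as the int-to-int hash h; zlib.crc32 is a stand-in here."""
    key = i * num_low_dim + j
    return zlib.crc32(struct.pack("<q", key)) % psi

# F is a fixed-size boolean array instead of a set of pairs (line 13 of Algorithm 1).
F = bytearray(PSI)

def induce(strong_pairs):
    """Mark induced feature combinations as valid."""
    for i, j in strong_pairs:
        F[pair_index(i, j)] = 1

induce([(10, 2), (10, 3)])
print(F[pair_index(10, 2)], F[pair_index(7, 9)])  # 1 (valid) and, almost surely, 0
```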
Feature Expansion
12
Algorithm 2 is used during both training and decoding. It takes the sparse vector x^l containing only low dimensional features and returns a new sparse vector x^{l+h} containing both low and high dimensional features.

Algorithm 2: Feature Expansion
Input:  x^l: sparse feature vector containing only low dimensional features.
Output: x^{l+h}: sparse feature vector containing both low and high dimensional features.
 1: x^{l+h} ← copy(x^l)
 2: for i ← 1 to |x^l| do
 3:   for j ← i + 1 to |x^l| do
 4:     k ← h_int→int(i · |X| + j) mod |F|
 5:     if F_k then x^{l+h}.append(k)
 6: return x^{l+h}

The algorithm begins by copying x^l to x^{l+h} (line 1). For every combination (i, j) ∈ x^l × x^l, where i and j represent the corresponding feature indices (lines 2-3), it first measures the index k of the feature combination (line 4), then checks whether this combination is valid (Section 3.4). If the combination is valid (F_k = True), k is added to x^{l+h} (line 5).

Check if any combination exists in 𝑭.
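A self-contained sketch of Algorithm 2, again using crc32 as a stand-in for xxHash; offsetting the induced indices by num_low_dim is an assumption made here so they do not collide with the low dimensional indices.

```python
import struct
import zlib

def expand(x_low, F, num_low_dim, psi):
    """Sketch of Algorithm 2: extend a sparse low dimensional feature vector with the
    induced high dimensional features that are marked valid in the boolean array F.
    (crc32 stands in for the xxHash int-to-int hash; induced indices are offset by
    num_low_dim so they do not collide with low dimensional indices.)
    """
    x_both = list(x_low)                                   # line 1: copy x^l into x^{l+h}
    for a in range(len(x_low)):                            # lines 2-3: every pair (i, j), i < j
        for b in range(a + 1, len(x_low)):
            i, j = x_low[a], x_low[b]
            key = i * num_low_dim + j
            k = zlib.crc32(struct.pack("<q", key)) % psi   # line 4: hashed combination index
            if F[k]:                                       # line 5: keep valid combinations only
                x_both.append(num_low_dim + k)
    return x_both

# Usage: mark the combination (2, 3) as valid, then expand a feature vector.
psi, num_low_dim = 1 << 20, 400_000
F = bytearray(psi)
F[zlib.crc32(struct.pack("<q", 2 * num_low_dim + 3)) % psi] = 1
print(expand([2, 3, 4, 10], F, num_low_dim, psi))
```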
Dynamic Feature Induction
13
[Diagram repeated from slide 4, now annotated with Regularized Dual Averaging (Xiao, 2010).]
Regularized Dual Averaging
14
[Algorithm 1 (as on slide 5), with the following modifications for regularized dual averaging.]

Each high dimensional feature in F is induced for making a classification between two labels, y and ŷ, but it may or may not be helpful for distinguishing labels other than those two. The algorithm can be modified to learn the weights of the induced features only for their relevant labels by adding the label information to F, which would change line 13 in Algorithm 1 to:

F ← F ∪ {(L_1, L_i, y, ŷ)}

However, introducing features targeting specific label pairs potentially confuses the classifier, especially when they are trained with the low dimensional features targeting all labels. Instead, it is better to apply a feature selection technique such as ℓ1 regularization. Regularized dual averaging (Xiao, 2010) performs ℓ1 regularization during optimization and works most effectively with sparse feature vectors. To apply it, line 1 in Algorithm 1 is changed to:

w ← g ← c ← 0;  t ← 1

c is a (d × l)-dimensional vector consisting of accumulative penalties, and t is the number of weight vectors generated during training. Although w is technically not updated when y = ŷ, it is still considered a new vector, so t ← t + 1 is inserted after line 5. c is updated by adding the partial vector ∂ (inserted after line 7):

c ← c + ∂

Thus, each dimension in c represents the accumulative penalty (or reward) for a particular feature and label. Finally, line 9 is changed to:

w ← (η / (ρ + √g)) · ℓ1(c, t, λ)

ℓ1(c, t, λ)_i = c_i − sgn(c_i) · λ · t  if |c_i| > λ · t,  and 0 otherwise.

The function ℓ1 takes c, t, and the regularizer parameter λ tuned during development: if the absolute value of the accumulative penalty c_i exceeds λ · t, the weight w_i is updated; otherwise it is set to 0.

Note: [w ⊙ Φ(x, y, ∅)]_y takes the Hadamard product between w and Φ(x, y, ∅), then truncates the resulting vector with respect to the label y. Joining intermediate features together (e.g., (j, k) in the earlier example) was not found useful, nor was joining strong and weak features (the i'th and j'th features where v_i > 0 and v_j < 0); weighting such combinations differently is left for future work.
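A sketch of the per-dimension regularized dual averaging update, assuming lam is the ℓ1 regularization parameter (its symbol is not readable on the slide) and eta/rho are the AdaGrad constants from Algorithm 1; the values below are illustrative.

```python
import math

def rda_weight(c_i, g_i, t, eta=0.01, rho=0.1, lam=1e-5):
    """Sketch of the regularized dual averaging update for one dimension.

    c_i: accumulated penalty (or reward) for this feature-label dimension,
    g_i: accumulated squared gradient (AdaGrad diagonal),
    t  : number of weight vectors generated so far,
    lam: assumed l1 regularization parameter, tuned on development data.
    Dimensions whose accumulated penalty does not exceed lam * t are truncated to
    exactly 0, which keeps the weight vector sparse.
    """
    if abs(c_i) <= lam * t:
        return 0.0
    truncated = c_i - math.copysign(lam * t, c_i)        # c_i - sgn(c_i) * lam * t
    return (eta / (rho + math.sqrt(g_i))) * truncated

# A dimension with a small accumulated penalty is zeroed out; a large one survives.
print(rda_weight(c_i=0.00004, g_i=9.0, t=10))   # 0.0
print(rda_weight(c_i=2.5,     g_i=9.0, t=10))   # small positive weight
```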
Dynamic Feature Induction
15
[Diagram repeated from slide 4, annotated with Regularized Dual Averaging (Xiao, 2010) and Locally Optimal Learning to Search (Chang et al., 2015).]
Part-of-Speech Tagging
16
Approach ALL OOV EXT Features
M0: baseline 97.18 86.35 365,400
M1: M0 + ext. ambiguity classes 97.37 91.34 ✔ 365,409
M2: M1 + brown clusters 97.46 91.23 ✔ 372,181
M3: M1 + dynamic feature induction 97.52 91.53 ✔ 468,378
M4: M2 + dynamic feature induction 97.64 92.03 ✔ 473,134
Manning (2011) 97.32 90.79 ✔
Shen et al. (2007) 97.33 89.61
Sun (2014) 97.36 -
Moore (2015) 97.36 91.09 ✔
Spoustová et al. (2009) 97.44 ✔
Søgaard (2011) 97.50 ✔
Tsuboi (2014) 97.51 91.64 ✔
Wall Street Journal
From Wikipedia + NYTimes

Named Entity Recognition
17
Approach TST Features
M0: baseline 84.44 164,440
M1: M0 + gazetteers 86.85 164,720
M2: M1 + brown clusters 89.64 169,232
M3: M2 + word embeddings 90.57 169,682
M4: M3 + dynamic feature induction 91.00 208,860
CoNLL’03 Shared Task
Turian et al. (2010) 89.41
Suzuki and Isozaki (2008) 89.92
Ratinov and Roth (2009) 90.57
Lin and Wu (2009) 90.90
Passos et al. (2014) 90.90
Not used for DFI
Conclusion
• Dynamic feature induction finds high dimensional features
by combining low dimensional features during training.
• During decoding, it searches for useful feature combinations
and augments them as high dimensional features.
• It gives the last gist to the state-of-the-art for part-of-
speech tagging and named entity recognition.
• Dynamic feature induction is rather empirically inspired;
theoretical justification of this approach would be intriguing.
• All our approaches are implemented in NLP4J, previously
known as ClearNLP: http://github.com/emorynlp/nlp4j/.
18
More Related Content

PDF
Tensor Completion for PDEs with uncertain coefficients and Bayesian Update te...
PDF
Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved
PDF
Low-rank methods for analysis of high-dimensional data (SIAM CSE talk 2017)
PPT
Submodularity slides
PDF
Low-rank tensor methods for stochastic forward and inverse problems
PDF
26 Machine Learning Unsupervised Fuzzy C-Means
PDF
Kernelization algorithms for graph and other structure modification problems
PDF
Lecture 5: Structured Prediction
Tensor Completion for PDEs with uncertain coefficients and Bayesian Update te...
Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved
Low-rank methods for analysis of high-dimensional data (SIAM CSE talk 2017)
Submodularity slides
Low-rank tensor methods for stochastic forward and inverse problems
26 Machine Learning Unsupervised Fuzzy C-Means
Kernelization algorithms for graph and other structure modification problems
Lecture 5: Structured Prediction

What's hot (20)

PDF
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
PPTX
Aaex7 group2(中英夾雜)
PDF
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
PDF
The Kernel Trick
PDF
22 01 2014_03_23_31_eee_formula_sheet_final
PDF
NIPS2017 Few-shot Learning and Graph Convolution
PDF
sublabel accurate convex relaxation of vectorial multilabel energies
PDF
Kernels and Support Vector Machines
PPT
Admmission in India
PDF
11 Machine Learning Important Issues in Machine Learning
PDF
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
PPT
Admission in india
PDF
Levitan Centenary Conference Talk, June 27 2014
PDF
Markov chain monte_carlo_methods_for_machine_learning
PPT
5.3 dynamic programming
PDF
Cheatsheet deep-learning
PDF
WMC2016-230pm-presented
PDF
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
PDF
Refresher algebra-calculus
PDF
Gaussian Processes: Applications in Machine Learning
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Aaex7 group2(中英夾雜)
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
The Kernel Trick
22 01 2014_03_23_31_eee_formula_sheet_final
NIPS2017 Few-shot Learning and Graph Convolution
sublabel accurate convex relaxation of vectorial multilabel energies
Kernels and Support Vector Machines
Admmission in India
11 Machine Learning Important Issues in Machine Learning
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Admission in india
Levitan Centenary Conference Talk, June 27 2014
Markov chain monte_carlo_methods_for_machine_learning
5.3 dynamic programming
Cheatsheet deep-learning
WMC2016-230pm-presented
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
Refresher algebra-calculus
Gaussian Processes: Applications in Machine Learning
Ad

Similar to Dynamic Feature Induction: The Last Gist to the State-of-the-Art (20)

PPTX
general mathematics discussion for week 2.pptx
PPTX
Functions Defined on General Sets w Arrow Diagrams.pptx
PPTX
power point presentation on genmath_lesson1_2_.pptx
PDF
Calculus 1 Lecture Notes (Functions and Their Graphs)
PDF
E10
PDF
lec4_annotated.pdf ml csci 567 vatsal sharan
PDF
01. Functions-Theory & Solved Examples Module-4.pdf
PPT
L1 functions, domain &amp; range
PPTX
1_Representation_of_Functions.pptxshjsjsj
DOCX
MAT-121 COLLEGE ALGEBRAWritten Assignment 32 points eac.docx
PDF
lemh201 (1).pdfvjsbdkkdjfkfjfkffkrnfkfvfkrjof
PPTX
Vector space, subspace, linear span .pptx
PDF
A Simple Review on SVM
PPTX
R lecture co4_math 21-1
PPT
PDF
CostFunctions.pdf
PDF
2018-G12-Math-E.pdf
PDF
2nd-year-Math-full-Book-PB.pdf
DOCX
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
general mathematics discussion for week 2.pptx
Functions Defined on General Sets w Arrow Diagrams.pptx
power point presentation on genmath_lesson1_2_.pptx
Calculus 1 Lecture Notes (Functions and Their Graphs)
E10
lec4_annotated.pdf ml csci 567 vatsal sharan
01. Functions-Theory & Solved Examples Module-4.pdf
L1 functions, domain &amp; range
1_Representation_of_Functions.pptxshjsjsj
MAT-121 COLLEGE ALGEBRAWritten Assignment 32 points eac.docx
lemh201 (1).pdfvjsbdkkdjfkfjfkffkrnfkfvfkrjof
Vector space, subspace, linear span .pptx
A Simple Review on SVM
R lecture co4_math 21-1
CostFunctions.pdf
2018-G12-Math-E.pdf
2nd-year-Math-full-Book-PB.pdf
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
Ad

More from Jinho Choi (20)

PDF
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
PDF
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
PDF
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
PDF
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
PDF
The Myth of Higher-Order Inference in Coreference Resolution
PDF
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
PDF
Abstract Meaning Representation
PDF
Semantic Role Labeling
PDF
CKY Parsing
PDF
CS329 - WordNet Similarities
PDF
CS329 - Lexical Relations
PDF
Automatic Knowledge Base Expansion for Dialogue Management
PDF
Attention is All You Need for AMR Parsing
PDF
Graph-to-Text Generation and its Applications to Dialogue
PDF
Real-time Coreference Resolution for Dialogue Understanding
PDF
Topological Sort
PDF
Tries - Put
PDF
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
PDF
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
PDF
How to make Emora talk about Sports Intelligently
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
The Myth of Higher-Order Inference in Coreference Resolution
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Abstract Meaning Representation
Semantic Role Labeling
CKY Parsing
CS329 - WordNet Similarities
CS329 - Lexical Relations
Automatic Knowledge Base Expansion for Dialogue Management
Attention is All You Need for AMR Parsing
Graph-to-Text Generation and its Applications to Dialogue
Real-time Coreference Resolution for Dialogue Understanding
Topological Sort
Tries - Put
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
How to make Emora talk about Sports Intelligently

Recently uploaded (20)

PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Empathic Computing: Creating Shared Understanding
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
KodekX | Application Modernization Development
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Unlocking AI with Model Context Protocol (MCP)
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Encapsulation theory and applications.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Network Security Unit 5.pdf for BCA BBA.
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
MIND Revenue Release Quarter 2 2025 Press Release
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Empathic Computing: Creating Shared Understanding
Spectral efficient network and resource selection model in 5G networks
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
KodekX | Application Modernization Development
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Per capita expenditure prediction using model stacking based on satellite ima...
Unlocking AI with Model Context Protocol (MCP)
“AI and Expert System Decision Support & Business Intelligence Systems”
Chapter 3 Spatial Domain Image Processing.pdf
Encapsulation theory and applications.pdf
Review of recent advances in non-invasive hemoglobin estimation
Network Security Unit 5.pdf for BCA BBA.

Dynamic Feature Induction: The Last Gist to the State-of-the-Art

  • 1. Jinho D. Choi jinho.choi@emory.edu Dynamic Feature Induction The Last Gist to the SOTA North American Chapter of the
 Association for Computational Linguistics June 13th, 2016
  • 2. Feature Engineering 2 Discovering a good set of features Finding a good combinations of features wiwi-1wi-2 wi+1 wi+2 ti-1ti-2 wi-1 + wiwi-1 + wi + wi+1 ti-1 + wi 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 low dimensional features high dimensional features
  • 3. Nonlinearity in NLP 3 Linear classifiers generally work well for NLP Are NLP feature spaces linearly separable? low dimensional features high dimensional features Deep Learning?
  • 4. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 4 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier1_3 Feature Expansion
  • 5. Feature Induction 5 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: 4. The extended feature set x2 is fed into the classi- fier. If ˆy2 is equal to y2, no feature combination is induced from x2. Thus, high dimensional features in F are incremen- tally induced and learned along with low dimensional features during training. During decoding, each fea- ture set is extended by the induced features in F, and the prediction is made using the extended feature set. The size of F can grow up to |X|2, where |X| is the size of low dimensional features. However, we found that |F| is more like 1/4 · |X| in practice. The following sections explain our approach in de- tails. Sections 3.1, 3.2, and 3.3 describe how features are induced and learned during training. Sections 3.4 and 3.5 describe how the induced features are stored Iy(y0 ) ( 1, if y = y0. 0, otherwise. The feature map takes (x, y, F), and returns dimensional vector, where d and l are the s features and labels, respectively; each dimens tains the value for a particular feature and a If certain combinations between features in in F, they are appended to the feature vecto with the low dimensional features (see Sect for more details). The indicator function I allo algorithm to be optimized for the hinge loss f ticlass classification (Crammer and Singer, 2 `h = max[0, 1 + w · ( (x, ˆy, F) (x, y, Hinge loss Crammer
 and Singer, 2002
  • 6. Feature Induction 6 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: AdaGrad Optimization Duchi et al.,
 2011
  • 7. Feature Induction 7 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Strength Vector
  • 8. Strength Vector 8 w 20 3x 4 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 Figure 2: Overview of dynamic feature induction during train- ing. (x, y2, ?) (x, y0, ?) Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. This can introduce arbitr L1 regularization for fea 4.3 Feature hashing ˆy arg maxy2Y(w · ⌘( (x, yg) (x, ˆy)) C tion is too high, randoml hash. xxHash5 4.4 Randomized featu ˆy arg maxy2Y(w · ⌘( (x, yg) (x, ˆy)) C tion is too high, randoml 4 ‘K arg max’ returns the o in v are top-k and greater than 5 https://github.co 4 Figure 2: Overview of dynamic feature induction during train- ing. (x, y2, ?) (x, y0, ?) Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. L1 regularization for feature reduct 4.3 Feature hashing ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combination tion is too high, randomly drop with hash. xxHash5 4.4 Randomized feature selectio ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combination tion is too high, randomly drop with 4 ‘K arg max’ returns the ordered list of in v are top-k and greater than 0. 5 https://github.com/Cyan497 4 Figure 2: Overview of dynamic feature induction during train- ing. [w (x, y0, ?)]y0 [w (x, y1, ?)]y1 Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. 4.2 Regularized dual averag This can introduce arbitrarily m L1 regularization for feature red 4.3 Feature hashing ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combina tion is too high, randomly drop hash. xxHash5 4.4 Randomized feature sele ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combina tion is too high, randomly drop 4 ‘K arg max’ returns the ordered li in v are top-k and greater than 0. 5 https://github.com/Cyan y0 y1 y2 v = 384 385 386 387 388 389 390 391 392 393 394 395 396 397 NAACL 2 for y, it is necessary to intro nations can be thought of: weak), and (weak, weak). I the first combination, (stron high with high/low. [w (x, y2, ?)]y2 4.2 Regularized dual ave This can introduce arbitraril L1 regularization for featur 4.3 Feature hashing ˆy arg maxy2Y(w · ( ⌘( (x, yg) (x, ˆy)) Com gold predicted The feature strength for y0 against y2.
  • 9. Feature Induction 9 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Feature Induction k = 3 Other combinations?
  • 10. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 10 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier1_3 Feature Expansion
  • 11. Feature Induction Revisited 11 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Locally Optimal Learning to Search tures in most NLP tasks are extracted from struc- s (e.g., sequence, tree). For structured learning, adapt “locally optimal learning to search” (Chang l., 2015b), that is a member of imitation learning ilar to DAGGER (Ross et al., 2011). LOLS not y performs well relative to the reference policy, also can improve upon the reference policy, show- very good results for tasks such as part-of-speech ging and dependency parsing. We adapt LOLS by ing the reference policy as follows: The reference policy ⇡ determines how often the gold label y is picked over the predicted label ˆy to build a structure. For all our experiments, ⇡ is initialized to 0.95. For the first epoch, since ⇡ is 0.95, y is randomly indices. Given a feature index pair (i, j) representing strong features for y against ˆy (Section 3.1), the index of the induced feature can be measured as follows: k hint!int(i · |X| + j) mod For efficiency, feature hashing is adapted to our sys- tem such that the induced feature set F is actually not a set but a -dimensional boolean array, where each dimension represents the validity of the correspond- ing induced feature. Thus, the line 13 in Algorithm 1 is changed to: k hint!int(L1 · |X| + Li) mod Fk True For the choice of h, xxHash is used, that is a fast non-cryptographic hash algorithm showing the per- fect score on the Q.Score.5 Feature Hashing
  • 12. Feature Expansion 12 becomes 0.952 = y is picked about 5%). only marginal im- asks we evaluated, entity recognition, However, we still ause we wanted to sks such as depen- search algorithms g and Nivre, 2012; g et al., 2015a). converting string redze, 2008; Wein- ng feature f and a the vector space is r of the hash code: mod training and decoding. It takes the sparse vector x containing only low dimensional features and returns a new sparse vector xl+h containing both low and high dimensional features. Algorithm 2 Feature Expansion Input: xl : sparse feature vector containing only low dimensional features. Output: xl+h : sparse feature vector containing both low and high dimensional features. 1: xl+h copy(xl ) 2: for i 1 to |xl | do 3: for j i + 1 to |xl | do 4: k hint!int(i · |X| + j) mod 5: if Fk then xl+h .append(k) 6: return xl+h The algorithm begins by copying xl to xl+h (line 1). For every combination (i, j) 2 xl ⇥ xl, where i and j represent the corresponding feature indices (lines 2- 3), it first measures the index k of the feature com- bination (line 4), then checks if this combination is valid (Section 3.4). If the combination is valid, mean- ing that (F = True), k is added to xl+h (line 5). Check if any combination exists in 𝑭.
  • 13. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 13 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier 1_3 Feature Expansion Regularized
 Dual
 Decomposition
 (Xiao, 2010)
  • 14. Regularized Dual Averaging 14

Algorithm 1 (Feature Induction) takes the set of training instances D and the learning rate η, and returns the weight vector w and the set of induced features F.

    Algorithm 1: Feature Induction
    Input:  D: training set, η: learning rate.
    Output: w: weight vector, F: induced feature set.
     1: w ← g ← 0
     2: F ← ∅
     3: until max epoch is reached do
     4:   foreach (x, y) ∈ D do
     5:     ŷ ← argmax_{y′ ∈ Y} (w · Φ(x, y′, F) − I_y(y′))
     6:     if y ≠ ŷ then
     7:       ∂ ← Φ(x, y, F) − Φ(x, ŷ, F)
     8:       g ← g + ∂ ∘ ∂
     9:       w ← w + (η / (ρ + √g)) · ∂
    10:       v ← [w ∘ Φ(x, y, ∅)]_y − [w ∘ Φ(x, ŷ, ∅)]_ŷ
    11:       L ← arg k-max_i v_i
    12:       for i = 2 to |L| do
    13:         F ← F ∪ {(L_1, L_i)}
    14: return w, F

The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set F. Given the weight vector w and the feature map Φ, [w ∘ Φ(x, y, ∅)]_y takes the Hadamard product between w and Φ(x, y, ∅) and then truncates the resulting vector with respect to the label y; in other words, [·]_y returns only the values relevant to y. When an instance is misclassified, v is computed by subtracting [w ∘ Φ(x, ŷ, ∅)]_ŷ from [w ∘ Φ(x, y, ∅)]_y (line 10). The i'th element in v represents how strongly the i'th feature supports y against ŷ; the larger v_i is, the stronger the i'th feature is. Next, the k largest entries in v are collected in the ordered list L, representing the strongest features. Finally, the pairs of the first index (the strongest feature) and the other indices in L are added to the induced feature set F (lines 12-13). For example, if L = [i, j, k] such that v_i ≥ v_j ≥ v_k, two pairs, (i, j) and (i, k), are added to F. For all our experiments, k was set to a fixed cutoff; increasing k beyond this cutoff did not show further improvement. Notice that all induced features are created by joining only low dimensional features; the algorithm does not join a high dimensional feature with either a low dimensional feature or another high dimensional feature, which prevents the feature space from growing too quickly.

It is worth mentioning that we did not find it useful to join intermediate features together (e.g., (j, k) in the above example). It is possible to utilize these combinations by weighting them differently, which we will explore in the future. Additionally, we experimented with combinations between strong and weak features (joining the i'th and j'th features, where v_i > 0 and v_j < 0), which again was not so useful. We are planning to evaluate our approach on more tasks and data, which will give us a better understanding of what combinations are the most effective.

3.2 Regularized Dual Averaging

Each high dimensional feature in F is induced for making a classification between two labels, y and ŷ, but it may or may not be helpful for distinguishing labels other than those two. Our algorithm can be modified to learn the weights of the induced features only for their relevant labels by adding the label information to F, which would change the line 13 in Algorithm 1 as follows:

    F ← F ∪ {(L_1, L_i, y, ŷ)}

However, introducing features targeting specific label pairs potentially confuses the classifier, especially when they are trained with the low dimensional features targeting all labels. Instead, it is better to apply a feature selection technique such as ℓ1 regularization; regularized dual averaging (Xiao, 2010) applies ℓ1 regularization during online optimization, and works most effectively with sparse feature vectors. To apply regularized dual averaging, the line 1 in Algorithm 1 is changed to:

    w ← g ← c ← 0;  t ← 1

c is a (d × l)-dimensional vector consisting of accumulative penalties, and t is the number of weight vectors generated during training. Although w is technically not updated when y = ŷ, it is still considered a new vector; thus, t is incremented for every training instance, so t ← t + 1 is inserted after the line 5. c is updated by adding the partial vector ∂ as follows (to be inserted after the line 7):

    c ← c + ∂

Thus, each dimension in c represents the accumulative penalty (or reward) for a particular feature and a label. At last, the line 9 is changed to:

    w ← (η / (ρ + √g)) · ℓ1(c, t, λ)

    ℓ1(c, t, λ)_i = c_i − sgn(c_i) · λ · t,   if |c_i| > λ · t
                  = 0,                        otherwise

The function ℓ1 takes c, t, and the regularization parameter λ, which is tuned during development. If the absolute value of the accumulative penalty c_i is greater than λ · t, the weight w_i is updated using λ and t; otherwise, it is assigned 0.
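To make lines 10-13 concrete, here is a minimal sketch in Python of the induction step for sparse binary features, assuming the low dimensional weights are stored as a (num_labels × num_features) matrix; the names W, x_feats, and induce_pairs are illustrative and not taken from NLP4J.

```python
import numpy as np

def induce_pairs(W, x_feats, y, y_hat, k):
    """Sketch of lines 10-13 of Algorithm 1: pick the k strongest active
    features supporting the gold label y over the prediction y_hat, then
    pair the strongest one with each of the others."""
    idx = np.asarray(x_feats)          # active low dimensional feature indices
    v = W[y, idx] - W[y_hat, idx]      # line 10: support for y against y_hat
    order = np.argsort(-v)[:k]         # line 11: positions of the k largest v_i
    L = idx[order].tolist()            # strongest features, in decreasing order
    return {(L[0], L[i]) for i in range(1, len(L))}   # lines 12-13: (L_1, L_i)
```

With L = [i, j, k], this yields {(i, j), (i, k)}, matching the example above; the cutoff k is a hyperparameter.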
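A correspondingly minimal sketch of the modified update (the new line 9) under regularized dual averaging, assuming AdaGrad-style per-dimension learning rates; eta, rho, and lam are illustrative names for η, ρ, and λ.

```python
import numpy as np

def rda_weights(c, g, t, eta, rho, lam):
    """Sketch of the modified line 9: rebuild w from the accumulated
    penalties c, applying l1 truncation before scaling by the
    per-dimension learning rate eta / (rho + sqrt(g))."""
    truncated = np.where(np.abs(c) > lam * t,
                         c - np.sign(c) * lam * t,   # shrink strong dimensions
                         0.0)                        # zero out weak dimensions
    return (eta / (rho + np.sqrt(g))) * truncated
```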
  • 15. Dynamic Feature Induction 15 — [diagram, as in slide 4: when the classifier mispredicts x1 (ŷ1 ≠ y1), Feature Induction adds combinations to 𝑭, and Feature Expansion extends x2 so the classifier predicts correctly (ŷ2 = y2)] — Regularized Dual Averaging (Xiao, 2010); Locally Optimal Learning to Search (Chang et al., 2015)
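As a rough illustration of the Feature Expansion box on this slide, the sketch below extends an instance's low dimensional feature set with the high dimensional features induced in 𝑭, assuming each induced pair is mapped to its own feature index; F and pair_index are illustrative names.

```python
from itertools import combinations

def expand_features(x_feats, F, pair_index):
    """Append a high dimensional feature for every pair of active low
    dimensional features that was induced during training (i.e., is in F)."""
    expanded = list(x_feats)
    for i, j in combinations(sorted(set(x_feats)), 2):
        # induced pairs are stored as (strongest, other), so check both orders
        key = (i, j) if (i, j) in F else (j, i) if (j, i) in F else None
        if key is not None:
            expanded.append(pair_index[key])
    return expanded
```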
  • 16. Part-of-Speech Tagging 16 (Wall Street Journal)

    Approach                              ALL    OOV    EXT  Features
    M0: baseline                          97.18  86.35       365,400
    M1: M0 + ext. ambiguity classes       97.37  91.34   ✔   365,409
    M2: M1 + brown clusters               97.46  91.23   ✔   372,181
    M3: M1 + dynamic feature induction    97.52  91.53   ✔   468,378
    M4: M2 + dynamic feature induction    97.64  92.03   ✔   473,134
    Manning (2011)                        97.32  90.79   ✔
    Shen et al. (2007)                    97.33  89.61
    Sun (2014)                            97.36  -
    Moore (2015)                          97.36  91.09   ✔
    Spoustová et al. (2009)               97.44          ✔
    Søgaard (2011)                        97.50          ✔
    Tsuboi (2014)                         97.51  91.64   ✔

    EXT: external resources (from Wikipedia + NYTimes).
  • 17. Named Entity Recognition 17 (CoNLL'03 Shared Task)

    Approach                              TST    Features
    M0: baseline                          84.44  164,440
    M1: M0 + gazetteers                   86.85  164,720
    M2: M1 + brown clusters               89.64  169,232
    M3: M2 + word embeddings              90.57  169,682   (not used for DFI)
    M4: M3 + dynamic feature induction    91.00  208,860

    Previous results:
    Turian et al. (2010)        89.41
    Suzuki and Isozaki (2008)   89.92
    Ratinov and Roth (2009)     90.57
    Lin and Wu (2009)           90.90
    Passos et al. (2014)        90.90
  • 18. Conclusion 18
    • Dynamic feature induction finds high dimensional features by combining low dimensional features during training.
    • During decoding, it searches for useful feature combinations and augments the high dimensional features.
    • It gives the last gist to the state-of-the-art for part-of-speech tagging and named entity recognition.
    • Dynamic feature induction is rather empirically inspired; a theoretical justification of this approach would be intriguing.
    • All our approaches are implemented in NLP4J, previously known as ClearNLP: http://github.com/emorynlp/nlp4j/.