Jinho D. Choi
jinho.choi@emory.edu
Dynamic Feature Induction
The Last Gist to the SOTA
North American Chapter of the Association for Computational Linguistics
June 13th, 2016
Feature Engineering
2
Discovering a good set of features
Finding good combinations of features
[Figure: individual templates w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}, t_{i-2}, t_{i-1} map to a sparse one-hot vector of low dimensional features; joined templates such as w_{i-1}+w_i, w_{i-1}+w_i+w_{i+1}, and t_{i-1}+w_i map to a much larger sparse one-hot vector of high dimensional features.]
Nonlinearity in NLP
3
Linear classifiers generally work well for NLP
Are NLP feature spaces linearly separable?
[Figure: low dimensional features → high dimensional features. Deep Learning?]
Dynamic Feature Induction
4
[Diagram: a sparse feature vector x1 is fed to the classifier, which predicts ŷ1. If ŷ1 ≠ y1, Feature Induction adds strong feature combinations to 𝑭. Feature Expansion then extends the feature vector with the combinations in 𝑭, giving x2, and the classifier predicts ŷ2 = y2.]
Feature Induction
5
Algorithm 1: Feature Induction
Input:  D: training set, η: learning rate.
Output: w: weight vector, F: induced feature set.
 1: w ← g ← 0
 2: F ← ∅
 3: until max epoch is reached do
 4:   foreach (x, y) ∈ D do
 5:     ŷ ← argmax_{y'∈Y} (w · Φ(x, y', F) − I_y(y'))
 6:     if y ≠ ŷ then
 7:       ∂ ← Φ(x, y, F) − Φ(x, ŷ, F)
 8:       g ← g + ∂ ⊙ ∂
 9:       w ← w + (η / (ρ + √g)) · ∂
10:       v ← [w ⊙ Φ(x, y, ∅)]_y − [w ⊙ Φ(x, ŷ, ∅)]_ŷ
11:       L ← arg-k-max_{∀i} v_i
12:       for i = 2 to |L| do
13:         F ← F ∪ {(L_1, L_i)}
14: return w, F

The feature map Φ(x, y, F) returns a (d × l)-dimensional vector, where d and l are the numbers of features and labels; each dimension contains the value for a particular feature and label. If certain combinations between features in x exist in F, they are appended to the feature vector along with the low dimensional features. The indicator function I_y(y') = 1 if y = y', 0 otherwise, allows the algorithm to be optimized for the hinge loss for multiclass classification (Crammer and Singer, 2002):

ℓ_h = max[0, 1 + w · (Φ(x, ŷ, F) − Φ(x, y, F))]

High dimensional features in F are thus incrementally induced and learned along with the low dimensional features during training. During decoding, each feature set is extended by the induced features in F, and the prediction is made using the extended feature set. The size of F can grow up to |X|², where |X| is the number of low dimensional features, but in practice |F| is closer to |X|/4.

Hinge loss (Crammer and Singer, 2002)
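Below is a minimal Python sketch of the cost-augmented prediction and AdaGrad update in lines 5-9 of Algorithm 1, assuming sparse binary features indexed by integers. The names train_step, eta, and rho are illustrative; this is not the NLP4J implementation.

```python
import math
from collections import defaultdict

def train_step(x, y, labels, w, g, F, eta=0.01, rho=0.1):
    """One cost-augmented prediction and AdaGrad update (lines 5-9 of Algorithm 1).

    x      : list of active low dimensional feature indices (binary features)
    y      : gold label; labels: list of all labels
    w, g   : defaultdict(float) mapping (feature, label) -> weight / squared-gradient sum
    F      : set of induced feature index pairs (i, j)
    """
    # Active dimensions of Phi(x, ., F): low dimensional features plus induced
    # pairs whose members are both present in x.
    active = list(x) + [p for p in F if p[0] in x and p[1] in x]

    # Line 5: argmax of w . Phi(x, y', F) - I_y(y'); handicapping the gold label by 1
    # optimizes the multiclass hinge loss (Crammer and Singer, 2002).
    def score(label):
        return sum(w[(f, label)] for f in active) - (1.0 if label == y else 0.0)
    y_hat = max(labels, key=score)

    if y_hat != y:
        # Lines 7-9: the partial vector has +1 on (f, y) and -1 on (f, y_hat);
        # g accumulates its square and scales the learning rate per dimension (AdaGrad).
        for f, label, d in [(f, y, +1.0) for f in active] + [(f, y_hat, -1.0) for f in active]:
            g[(f, label)] += d * d
            w[(f, label)] += (eta / (rho + math.sqrt(g[(f, label)]))) * d
    return y_hat

w, g = defaultdict(float), defaultdict(float)
print(train_step(x=[10, 2, 3, 4], y="NN", labels=["NN", "VB"], w=w, g=g, F=set()))
```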
Feature Induction
6
[Algorithm 1 (as on slide 5), highlighting lines 8-9: the AdaGrad update g ← g + ∂ ⊙ ∂ and w ← w + (η / (ρ + √g)) · ∂.]

AdaGrad Optimization (Duchi et al., 2011)
Feature Induction
7
[Algorithm 1 (as on slide 5), highlighting line 10: the strength vector v ← [w ⊙ Φ(x, y, ∅)]_y − [w ⊙ Φ(x, ŷ, ∅)]_ŷ.]

Strength Vector
Strength Vector
8
[Figure 2: overview of dynamic feature induction during training. For a gold label y0 and a predicted label y2, the strength vector is computed as
v = [w ⊙ Φ(x, y0, ∅)]_y0 − [w ⊙ Φ(x, y2, ∅)]_y2,
i.e., the feature strength for y0 against y2. In most cases the values of Φ(x) are either 0 or 1. 'arg-k-max' returns the ordered list of indices whose values in v are top-k and greater than 0.]
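A minimal sketch of lines 10-13 of Algorithm 1, assuming binary features so that the Hadamard product with Φ reduces to reading off the weights of the active features; induce_pairs and the example weights below are hypothetical.

```python
def induce_pairs(w, x, y_gold, y_pred, k=3):
    """Compute the strength vector v and induce feature pairs (lines 10-13 of Algorithm 1).

    w : dict mapping (feature index, label) -> weight
    x : active low dimensional feature indices (binary features, so Phi_i is 0 or 1)
    Returns the set of induced pairs {(L1, L2), (L1, L3), ...}.
    """
    # Line 10: v_i = w[i, y_gold] - w[i, y_pred] for each active low dimensional feature.
    v = {i: w.get((i, y_gold), 0.0) - w.get((i, y_pred), 0.0) for i in x}

    # Line 11: 'arg-k-max' keeps the ordered indices whose strengths are top-k and > 0.
    L = sorted((i for i in x if v[i] > 0), key=lambda i: -v[i])[:k]

    # Lines 12-13: pair the strongest feature L[0] with each of the remaining ones.
    return {(L[0], i) for i in L[1:]} if len(L) > 1 else set()

# Example: feature 10 is the strongest for the gold label, so (10, 2) and (10, 3) are induced.
w = {(10, "NN"): 0.9, (2, "NN"): 0.5, (3, "NN"): 0.4, (4, "NN"): -0.2, (10, "VB"): 0.1}
print(induce_pairs(w, x=[10, 2, 3, 4], y_gold="NN", y_pred="VB", k=3))
```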
Feature Induction
9
[Algorithm 1 (as on slide 5), highlighting lines 11-13: feature induction with k = 3. If L = [i, j, k], the two pairs (i, j) and (i, k) are added to F; increasing k beyond this cutoff did not show further improvement. All induced features are created by joining only low dimensional features; the algorithm never joins a high dimensional feature with either a low dimensional or another high dimensional feature, which keeps the feature space from blowing up.]

Other combinations?
Dynamic Feature Induction
10
[Diagram repeated from slide 4: x1 → Classifier → ŷ1; if ŷ1 ≠ y1, Feature Induction updates 𝑭; Feature Expansion extends the feature vector with 𝑭 → x2 → Classifier → ŷ2 = y2.]
Feature Induction Revisited
11
[Algorithm 1 (as on slide 5).]

Locally Optimal Learning to Search
Features in most NLP tasks are extracted from structures (e.g., sequence, tree). For structured learning, we adapt "locally optimal learning to search" (Chang et al., 2015b), a member of imitation learning similar to DAgger (Ross et al., 2011). LOLS not only performs well relative to the reference policy but can also improve upon it, showing very good results for tasks such as part-of-speech tagging and dependency parsing. We adapt LOLS by defining the reference policy as follows: the reference policy π determines how often the gold label y is picked over the predicted label ŷ to build a structure. For all our experiments, π is initialized to 0.95, so during the first epoch y is picked about 95% of the time.
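A tiny sketch of the adapted reference policy, assuming the probability of picking the gold label decays as π^epoch, as the slide's 0.95 and 0.95² figures suggest; rollin_label is an illustrative name, not the NLP4J API.

```python
import random

def rollin_label(gold, predicted, epoch, pi=0.95):
    """LOLS-style reference policy sketch: when building a structure during training,
    use the gold label with probability pi**epoch (about 0.95 in epoch 1, 0.90 in
    epoch 2, ...); otherwise follow the model's own prediction.
    """
    return gold if random.random() < pi ** epoch else predicted

# Early epochs mostly follow the gold labels, later epochs follow the model.
print(sum(rollin_label("NN", "VB", epoch=1) == "NN" for _ in range(10000)) / 10000)
```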
Given a feature index pair (i, j) representing strong features for y against ŷ (Section 3.1), the index of the induced feature is computed as
k ← h_int→int(i · |X| + j) mod |F|.
For efficiency, feature hashing is adapted to our system such that the induced feature set F is actually not a set but a fixed-size boolean array, where each dimension represents the validity of the corresponding induced feature. Thus, line 13 in Algorithm 1 is changed to:
k ← h_int→int(L_1 · |X| + L_i) mod |F|;  F_k ← True.
For the choice of h, xxHash is used, a fast non-cryptographic hash algorithm achieving a perfect score on the Q.Score (https://github.com/Cyan4973/xxHash).
Feature Hashing
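A sketch of the hashed representation of F, under the assumptions that zlib.crc32 stands in for the xxHash int-to-int hash and that PSI and NUM_LOW_DIM are placeholder sizes; the real system uses xxHash and a tuned array size.

```python
import struct
import zlib

PSI = 1 << 20          # assumed size of the boolean array F (tuned per task)
NUM_LOW_DIM = 400_000  # assumed |X|: number of low dimensional features

def pair_index(i, j, num_low_dim=NUM_LOW_DIM, psi=PSI):
    """Hash a pair of low dimensional feature indices (i, j) into [0, psi).
    The paper uses xxHash as the int-to-int hash h; zlib.crc32 is a stand-in here."""
    key = i * num_low_dim + j
    return zlib.crc32(struct.pack("<q", key)) % psi

# F is a fixed-size boolean array instead of a set of pairs (line 13 of Algorithm 1).
F = bytearray(PSI)

def induce(strong_pairs):
    """Mark induced feature combinations as valid."""
    for i, j in strong_pairs:
        F[pair_index(i, j)] = 1

induce([(10, 2), (10, 3)])
print(F[pair_index(10, 2)], F[pair_index(7, 9)])  # 1 (valid) and, almost surely, 0
```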
Feature Expansion
12
Algorithm 2 is used during both training and decoding. It takes the sparse vector x^l containing only low dimensional features and returns a new sparse vector x^{l+h} containing both low and high dimensional features.

Algorithm 2: Feature Expansion
Input:  x^l: sparse feature vector containing only low dimensional features.
Output: x^{l+h}: sparse feature vector containing both low and high dimensional features.
 1: x^{l+h} ← copy(x^l)
 2: for i ← 1 to |x^l| do
 3:   for j ← i + 1 to |x^l| do
 4:     k ← h_int→int(i · |X| + j) mod |F|
 5:     if F_k then x^{l+h}.append(k)
 6: return x^{l+h}

The algorithm begins by copying x^l to x^{l+h} (line 1). For every combination (i, j) ∈ x^l × x^l, where i and j represent the corresponding feature indices (lines 2-3), it first measures the index k of the feature combination (line 4), then checks whether this combination is valid (Section 3.4). If the combination is valid (F_k = True), k is added to x^{l+h} (line 5).

Check if any combination exists in 𝑭.
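A self-contained sketch of Algorithm 2, again using crc32 as a stand-in for xxHash; offsetting the induced indices by num_low_dim is an assumption made here so they do not collide with the low dimensional indices.

```python
import struct
import zlib

def expand(x_low, F, num_low_dim, psi):
    """Sketch of Algorithm 2: extend a sparse low dimensional feature vector with the
    induced high dimensional features that are marked valid in the boolean array F.
    (crc32 stands in for the xxHash int-to-int hash; induced indices are offset by
    num_low_dim so they do not collide with low dimensional indices.)
    """
    x_both = list(x_low)                                   # line 1: copy x^l into x^{l+h}
    for a in range(len(x_low)):                            # lines 2-3: every pair (i, j), i < j
        for b in range(a + 1, len(x_low)):
            i, j = x_low[a], x_low[b]
            key = i * num_low_dim + j
            k = zlib.crc32(struct.pack("<q", key)) % psi   # line 4: hashed combination index
            if F[k]:                                       # line 5: keep valid combinations only
                x_both.append(num_low_dim + k)
    return x_both

# Usage: mark the combination (2, 3) as valid, then expand a feature vector.
psi, num_low_dim = 1 << 20, 400_000
F = bytearray(psi)
F[zlib.crc32(struct.pack("<q", 2 * num_low_dim + 3)) % psi] = 1
print(expand([2, 3, 4, 10], F, num_low_dim, psi))
```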
Dynamic Feature Induction
13
[Diagram repeated from slide 4, now annotated with Regularized Dual Averaging (Xiao, 2010).]
Regularized Dual Averaging
14
[Algorithm 1 (as on slide 5), with the following modifications for regularized dual averaging.]

Each high dimensional feature in F is induced for making a classification between two labels, y and ŷ, but it may or may not be helpful for distinguishing labels other than those two. The algorithm can be modified to learn the weights of the induced features only for their relevant labels by adding the label information to F, which would change line 13 in Algorithm 1 to:

F ← F ∪ {(L_1, L_i, y, ŷ)}

However, introducing features targeting specific label pairs potentially confuses the classifier, especially when they are trained with the low dimensional features targeting all labels. Instead, it is better to apply a feature selection technique such as ℓ1 regularization. Regularized dual averaging (Xiao, 2010) performs ℓ1 regularization during optimization and works most effectively with sparse feature vectors. To apply it, line 1 in Algorithm 1 is changed to:

w ← g ← c ← 0;  t ← 1

c is a (d × l)-dimensional vector consisting of accumulative penalties, and t is the number of weight vectors generated during training. Although w is technically not updated when y = ŷ, it is still considered a new vector, so t ← t + 1 is inserted after line 5. c is updated by adding the partial vector ∂ (inserted after line 7):

c ← c + ∂

Thus, each dimension in c represents the accumulative penalty (or reward) for a particular feature and label. Finally, line 9 is changed to:

w ← (η / (ρ + √g)) · ℓ1(c, t, λ)

ℓ1(c, t, λ)_i = c_i − sgn(c_i) · λ · t  if |c_i| > λ · t,  and 0 otherwise.

The function ℓ1 takes c, t, and the regularizer parameter λ tuned during development: if the absolute value of the accumulative penalty c_i exceeds λ · t, the weight w_i is updated; otherwise it is set to 0.

Note: [w ⊙ Φ(x, y, ∅)]_y takes the Hadamard product between w and Φ(x, y, ∅), then truncates the resulting vector with respect to the label y. Joining intermediate features together (e.g., (j, k) in the earlier example) was not found useful, nor was joining strong and weak features (the i'th and j'th features where v_i > 0 and v_j < 0); weighting such combinations differently is left for future work.
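A sketch of the per-dimension regularized dual averaging update, assuming lam is the ℓ1 regularization parameter (its symbol is not readable on the slide) and eta/rho are the AdaGrad constants from Algorithm 1; the values below are illustrative.

```python
import math

def rda_weight(c_i, g_i, t, eta=0.01, rho=0.1, lam=1e-5):
    """Sketch of the regularized dual averaging update for one dimension.

    c_i: accumulated penalty (or reward) for this feature-label dimension,
    g_i: accumulated squared gradient (AdaGrad diagonal),
    t  : number of weight vectors generated so far,
    lam: assumed l1 regularization parameter, tuned on development data.
    Dimensions whose accumulated penalty does not exceed lam * t are truncated to
    exactly 0, which keeps the weight vector sparse.
    """
    if abs(c_i) <= lam * t:
        return 0.0
    truncated = c_i - math.copysign(lam * t, c_i)        # c_i - sgn(c_i) * lam * t
    return (eta / (rho + math.sqrt(g_i))) * truncated

# A dimension with a small accumulated penalty is zeroed out; a large one survives.
print(rda_weight(c_i=0.00004, g_i=9.0, t=10))   # 0.0
print(rda_weight(c_i=2.5,     g_i=9.0, t=10))   # small positive weight
```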
Dynamic Feature Induction
15
[Diagram repeated from slide 4, annotated with Regularized Dual Averaging (Xiao, 2010) and Locally Optimal Learning to Search (Chang et al., 2015).]
Part-of-Speech Tagging
16
Approach ALL OOV EXT Features
M0: baseline 97.18 86.35 365,400
M1: M0 + ext. ambiguity classes 97.37 91.34 ✔ 365,409
M2: M1 + brown clusters 97.46 91.23 ✔ 372,181
M3: M1 + dynamic feature induction 97.52 91.53 ✔ 468,378
M4: M2 + dynamic feature induction 97.64 92.03 ✔ 473,134
Manning (2011) 97.32 90.79 ✔
Shen et al. (2007) 97.33 89.61
Sun (2014) 97.36 -
Moore (2015) 97.36 91.09 ✔
Spoustová et al. (2009) 97.44 ✔
Søgaard (2011) 97.50 ✔
Tsuboi (2014) 97.51 91.64 ✔
Wall Street Journal
From Wikipedia + NYTimes

Named Entity Recognition
17
Approach TST Features
M0: baseline 84.44 164,440
M1: M0 + gazetteers 86.85 164,720
M2: M1 + brown clusters 89.64 169,232
M3: M2 + word embeddings 90.57 169,682
M4: M3 + dynamic feature induction 91.00 208,860
CoNLL’03 Shared Task
Turian et al. (2010) 89.41
Suzuki and Isozaki (2008) 89.92
Ratinov and Roth (2009) 90.57
Lin and Wu (2009) 90.90
Passos et al. (2014) 90.90
Not used for DFI
Conclusion
• Dynamic feature induction finds high dimensional features
by combining low dimensional features during training.
• During decoding, it searches for useful feature combinations
and augments them as high dimensional features.
• It gives the last gist to the state-of-the-art for part-of-
speech tagging and named entity recognition.
• Dynamic feature induction is rather empirically inspired;
theoretical justification of this approach would be intriguing.
• All our approaches are implemented in NLP4J, previously
known as ClearNLP: http://github.com/emorynlp/nlp4j/.
18
More Related Content

PDF
Tensor Completion for PDEs with uncertain coefficients and Bayesian Update te...
PDF
Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved
PDF
Low-rank methods for analysis of high-dimensional data (SIAM CSE talk 2017)
PPT
Submodularity slides
PDF
Low-rank tensor methods for stochastic forward and inverse problems
PDF
26 Machine Learning Unsupervised Fuzzy C-Means
PDF
Kernelization algorithms for graph and other structure modification problems
PDF
Lecture 5: Structured Prediction
Tensor Completion for PDEs with uncertain coefficients and Bayesian Update te...
Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved
Low-rank methods for analysis of high-dimensional data (SIAM CSE talk 2017)
Submodularity slides
Low-rank tensor methods for stochastic forward and inverse problems
26 Machine Learning Unsupervised Fuzzy C-Means
Kernelization algorithms for graph and other structure modification problems
Lecture 5: Structured Prediction

What's hot (20)

PDF
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
PPTX
Aaex7 group2(中英夾雜)
PDF
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
PDF
The Kernel Trick
PDF
22 01 2014_03_23_31_eee_formula_sheet_final
PDF
NIPS2017 Few-shot Learning and Graph Convolution
PDF
sublabel accurate convex relaxation of vectorial multilabel energies
PDF
Kernels and Support Vector Machines
PPT
Admmission in India
PDF
11 Machine Learning Important Issues in Machine Learning
PDF
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
PPT
Admission in india
PDF
Levitan Centenary Conference Talk, June 27 2014
PDF
Markov chain monte_carlo_methods_for_machine_learning
PPT
5.3 dynamic programming
PDF
Cheatsheet deep-learning
PDF
WMC2016-230pm-presented
PDF
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
PDF
Refresher algebra-calculus
PDF
Gaussian Processes: Applications in Machine Learning
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Aaex7 group2(中英夾雜)
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
The Kernel Trick
22 01 2014_03_23_31_eee_formula_sheet_final
NIPS2017 Few-shot Learning and Graph Convolution
sublabel accurate convex relaxation of vectorial multilabel energies
Kernels and Support Vector Machines
Admmission in India
11 Machine Learning Important Issues in Machine Learning
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Admission in india
Levitan Centenary Conference Talk, June 27 2014
Markov chain monte_carlo_methods_for_machine_learning
5.3 dynamic programming
Cheatsheet deep-learning
WMC2016-230pm-presented
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
Refresher algebra-calculus
Gaussian Processes: Applications in Machine Learning
Ad

Similar to Dynamic Feature Induction: The Last Gist to the State-of-the-Art (20)

PPTX
general mathematics discussion for week 2.pptx
PPTX
Functions Defined on General Sets w Arrow Diagrams.pptx
PPTX
power point presentation on genmath_lesson1_2_.pptx
PDF
Calculus 1 Lecture Notes (Functions and Their Graphs)
PDF
E10
PDF
lec4_annotated.pdf ml csci 567 vatsal sharan
PDF
01. Functions-Theory & Solved Examples Module-4.pdf
PPT
L1 functions, domain &amp; range
PPTX
1_Representation_of_Functions.pptxshjsjsj
DOCX
MAT-121 COLLEGE ALGEBRAWritten Assignment 32 points eac.docx
PDF
lemh201 (1).pdfvjsbdkkdjfkfjfkffkrnfkfvfkrjof
PPTX
Vector space, subspace, linear span .pptx
PDF
A Simple Review on SVM
PPTX
R lecture co4_math 21-1
PPT
PDF
CostFunctions.pdf
PDF
2018-G12-Math-E.pdf
PDF
2nd-year-Math-full-Book-PB.pdf
DOCX
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
general mathematics discussion for week 2.pptx
Functions Defined on General Sets w Arrow Diagrams.pptx
power point presentation on genmath_lesson1_2_.pptx
Calculus 1 Lecture Notes (Functions and Their Graphs)
E10
lec4_annotated.pdf ml csci 567 vatsal sharan
01. Functions-Theory & Solved Examples Module-4.pdf
L1 functions, domain &amp; range
1_Representation_of_Functions.pptxshjsjsj
MAT-121 COLLEGE ALGEBRAWritten Assignment 32 points eac.docx
lemh201 (1).pdfvjsbdkkdjfkfjfkffkrnfkfvfkrjof
Vector space, subspace, linear span .pptx
A Simple Review on SVM
R lecture co4_math 21-1
CostFunctions.pdf
2018-G12-Math-E.pdf
2nd-year-Math-full-Book-PB.pdf
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
Ad

More from Jinho Choi (20)

PDF
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
PDF
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
PDF
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
PDF
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
PDF
The Myth of Higher-Order Inference in Coreference Resolution
PDF
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
PDF
Abstract Meaning Representation
PDF
Semantic Role Labeling
PDF
CKY Parsing
PDF
CS329 - WordNet Similarities
PDF
CS329 - Lexical Relations
PDF
Automatic Knowledge Base Expansion for Dialogue Management
PDF
Attention is All You Need for AMR Parsing
PDF
Graph-to-Text Generation and its Applications to Dialogue
PDF
Real-time Coreference Resolution for Dialogue Understanding
PDF
Topological Sort
PDF
Tries - Put
PDF
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
PDF
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
PDF
How to make Emora talk about Sports Intelligently
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
The Myth of Higher-Order Inference in Coreference Resolution
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Abstract Meaning Representation
Semantic Role Labeling
CKY Parsing
CS329 - WordNet Similarities
CS329 - Lexical Relations
Automatic Knowledge Base Expansion for Dialogue Management
Attention is All You Need for AMR Parsing
Graph-to-Text Generation and its Applications to Dialogue
Real-time Coreference Resolution for Dialogue Understanding
Topological Sort
Tries - Put
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
How to make Emora talk about Sports Intelligently

Recently uploaded (20)

PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Empathic Computing: Creating Shared Understanding
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
KodekX | Application Modernization Development
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Unlocking AI with Model Context Protocol (MCP)
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Encapsulation theory and applications.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Network Security Unit 5.pdf for BCA BBA.
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
MIND Revenue Release Quarter 2 2025 Press Release
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Empathic Computing: Creating Shared Understanding
Spectral efficient network and resource selection model in 5G networks
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
KodekX | Application Modernization Development
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Per capita expenditure prediction using model stacking based on satellite ima...
Unlocking AI with Model Context Protocol (MCP)
“AI and Expert System Decision Support & Business Intelligence Systems”
Chapter 3 Spatial Domain Image Processing.pdf
Encapsulation theory and applications.pdf
Review of recent advances in non-invasive hemoglobin estimation
Network Security Unit 5.pdf for BCA BBA.

Dynamic Feature Induction: The Last Gist to the State-of-the-Art

  • 1. Jinho D. Choi jinho.choi@emory.edu Dynamic Feature Induction The Last Gist to the SOTA North American Chapter of the
 Association for Computational Linguistics June 13th, 2016
  • 2. Feature Engineering 2 Discovering a good set of features Finding a good combinations of features wiwi-1wi-2 wi+1 wi+2 ti-1ti-2 wi-1 + wiwi-1 + wi + wi+1 ti-1 + wi 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 low dimensional features high dimensional features
  • 3. Nonlinearity in NLP 3 Linear classifiers generally work well for NLP Are NLP feature spaces linearly separable? low dimensional features high dimensional features Deep Learning?
  • 4. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 4 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier1_3 Feature Expansion
  • 5. Feature Induction 5 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: 4. The extended feature set x2 is fed into the classi- fier. If ˆy2 is equal to y2, no feature combination is induced from x2. Thus, high dimensional features in F are incremen- tally induced and learned along with low dimensional features during training. During decoding, each fea- ture set is extended by the induced features in F, and the prediction is made using the extended feature set. The size of F can grow up to |X|2, where |X| is the size of low dimensional features. However, we found that |F| is more like 1/4 · |X| in practice. The following sections explain our approach in de- tails. Sections 3.1, 3.2, and 3.3 describe how features are induced and learned during training. Sections 3.4 and 3.5 describe how the induced features are stored Iy(y0 ) ( 1, if y = y0. 0, otherwise. The feature map takes (x, y, F), and returns dimensional vector, where d and l are the s features and labels, respectively; each dimens tains the value for a particular feature and a If certain combinations between features in in F, they are appended to the feature vecto with the low dimensional features (see Sect for more details). The indicator function I allo algorithm to be optimized for the hinge loss f ticlass classification (Crammer and Singer, 2 `h = max[0, 1 + w · ( (x, ˆy, F) (x, y, Hinge loss Crammer
 and Singer, 2002
  • 6. Feature Induction 6 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: AdaGrad Optimization Duchi et al.,
 2011
  • 7. Feature Induction 7 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Strength Vector
  • 8. Strength Vector 8 w 20 3x 4 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 Figure 2: Overview of dynamic feature induction during train- ing. (x, y2, ?) (x, y0, ?) Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. This can introduce arbitr L1 regularization for fea 4.3 Feature hashing ˆy arg maxy2Y(w · ⌘( (x, yg) (x, ˆy)) C tion is too high, randoml hash. xxHash5 4.4 Randomized featu ˆy arg maxy2Y(w · ⌘( (x, yg) (x, ˆy)) C tion is too high, randoml 4 ‘K arg max’ returns the o in v are top-k and greater than 5 https://github.co 4 Figure 2: Overview of dynamic feature induction during train- ing. (x, y2, ?) (x, y0, ?) Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. L1 regularization for feature reduct 4.3 Feature hashing ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combination tion is too high, randomly drop with hash. xxHash5 4.4 Randomized feature selectio ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combination tion is too high, randomly drop with 4 ‘K arg max’ returns the ordered list of in v are top-k and greater than 0. 5 https://github.com/Cyan497 4 Figure 2: Overview of dynamic feature induction during train- ing. [w (x, y0, ?)]y0 [w (x, y1, ?)]y1 Once w¨y and wˆy are updated, we take the entry- wise product between (w¨y wˆy) and (x) (line 9), and select top-k entries in v (line 10), representing 3 In most cases, these values are either 0 or 1. 4.2 Regularized dual averag This can introduce arbitrarily m L1 regularization for feature red 4.3 Feature hashing ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combina tion is too high, randomly drop hash. xxHash5 4.4 Randomized feature sele ˆy arg maxy2Y(w · (x, y) ⌘( (x, yg) (x, ˆy)) Combina tion is too high, randomly drop 4 ‘K arg max’ returns the ordered li in v are top-k and greater than 0. 5 https://github.com/Cyan y0 y1 y2 v = 384 385 386 387 388 389 390 391 392 393 394 395 396 397 NAACL 2 for y, it is necessary to intro nations can be thought of: weak), and (weak, weak). I the first combination, (stron high with high/low. [w (x, y2, ?)]y2 4.2 Regularized dual ave This can introduce arbitraril L1 regularization for featur 4.3 Feature hashing ˆy arg maxy2Y(w · ( ⌘( (x, yg) (x, ˆy)) Com gold predicted The feature strength for y0 against y2.
  • 9. Feature Induction 9 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Feature Induction k = 3 Other combinations?
  • 10. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 10 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier1_3 Feature Expansion
  • 11. Feature Induction Revisited 11 training. It takes the set of training instances D and the learning rate ⌘, and returns the weight vector w and the set of induced features F. Algorithm 1 Feature Induction Input: D: training set, ⌘: learning rate. Output: w: weight vector, F: induced feature set. 1: w g 0 2: F ? 3: until max epoch is reached do 4: foreach (x, y) 2 D do 5: ˆy arg maxy02Y (w · (x, y0 , F) Iy(y0 )) 6: if y 6= ˆy then 7: @ (x, y, F) (x, ˆy, F) 8: g g + @ @ 9: w w + (⌘/(⇢+ p g)) · @ 10: v [w (x, y, ?)]y [w (x, ˆy, ?)]ˆy 11: L arg k max8i vi 12: for i = 2 to |L| do 13: F F [ {(L1, Li)} 14: return w, F The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set by subtracting [w (x, ˆy, ?)]ˆy f (line 10), where [. . .]y returns o values relevant to y (Figure 2). The i’th element in v repres the i’th feature for y against ˆy; stronger the i’th feature is. Next entries in v are collected in the o representing the strongest featu Finally, the pairs of the first ind the strongest feature, and the o added to the induced feature set example, if L = [i, j, k] such th two pairs, (i, j) and (i, k), are a For all our experiments, k = k beyond this cutoff did not show Notice that all induced features joining only low dimensional fe algorithm does not join a high with either a low dimensional fe dimensional feature. This was prevent from the feature space b features can be induced by repla line 10 as follows: Locally Optimal Learning to Search tures in most NLP tasks are extracted from struc- s (e.g., sequence, tree). For structured learning, adapt “locally optimal learning to search” (Chang l., 2015b), that is a member of imitation learning ilar to DAGGER (Ross et al., 2011). LOLS not y performs well relative to the reference policy, also can improve upon the reference policy, show- very good results for tasks such as part-of-speech ging and dependency parsing. We adapt LOLS by ing the reference policy as follows: The reference policy ⇡ determines how often the gold label y is picked over the predicted label ˆy to build a structure. For all our experiments, ⇡ is initialized to 0.95. For the first epoch, since ⇡ is 0.95, y is randomly indices. Given a feature index pair (i, j) representing strong features for y against ˆy (Section 3.1), the index of the induced feature can be measured as follows: k hint!int(i · |X| + j) mod For efficiency, feature hashing is adapted to our sys- tem such that the induced feature set F is actually not a set but a -dimensional boolean array, where each dimension represents the validity of the correspond- ing induced feature. Thus, the line 13 in Algorithm 1 is changed to: k hint!int(L1 · |X| + Li) mod Fk True For the choice of h, xxHash is used, that is a fast non-cryptographic hash algorithm showing the per- fect score on the Q.Score.5 Feature Hashing
  • 12. Feature Expansion 12 becomes 0.952 = y is picked about 5%). only marginal im- asks we evaluated, entity recognition, However, we still ause we wanted to sks such as depen- search algorithms g and Nivre, 2012; g et al., 2015a). converting string redze, 2008; Wein- ng feature f and a the vector space is r of the hash code: mod training and decoding. It takes the sparse vector x containing only low dimensional features and returns a new sparse vector xl+h containing both low and high dimensional features. Algorithm 2 Feature Expansion Input: xl : sparse feature vector containing only low dimensional features. Output: xl+h : sparse feature vector containing both low and high dimensional features. 1: xl+h copy(xl ) 2: for i 1 to |xl | do 3: for j i + 1 to |xl | do 4: k hint!int(i · |X| + j) mod 5: if Fk then xl+h .append(k) 6: return xl+h The algorithm begins by copying xl to xl+h (line 1). For every combination (i, j) 2 xl ⇥ xl, where i and j represent the corresponding feature indices (lines 2- 3), it first measures the index k of the feature com- bination (line 4), then checks if this combination is valid (Section 3.4). If the combination is valid, mean- ing that (F = True), k is added to xl+h (line 5). Check if any combination exists in 𝑭.
  • 13. 510 3 7x2 10 2 3 4x1 Dynamic Feature Induction 13 Classifier 1_3 ˆy1 ˆy2 y1≠ 𝑭 Feature Induction y2=Classifier 1_3 Feature Expansion Regularized
 Dual
 Decomposition
 (Xiao, 2010)
  • 14. Regularized Dual Averaging 14

Algorithm 1 (Feature Induction) takes the set of training instances D and the learning rate η, and returns the weight vector w and the set of induced features F.

    Algorithm 1: Feature Induction
    Input:  D: training set, η: learning rate.
    Output: w: weight vector, F: induced feature set.
     1: w ← g ← 0
     2: F ← ∅
     3: until max epoch is reached do
     4:   foreach (x, y) ∈ D do
     5:     ŷ ← argmax_{y′ ∈ Y} (w · Φ(x, y′, F) − I_y(y′))
     6:     if y ≠ ŷ then
     7:       ∂ ← Φ(x, y, F) − Φ(x, ŷ, F)
     8:       g ← g + ∂ ∘ ∂
     9:       w ← w + (η / (ρ + √g)) · ∂
    10:       v ← [w ∘ Φ(x, y, ∅)]_y − [w ∘ Φ(x, ŷ, ∅)]_ŷ
    11:       L ← arg k-max_i v_i
    12:       for i = 2 to |L| do
    13:         F ← F ∪ {(L_1, L_i)}
    14: return w, F

The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set F. Given the weight vector w and the feature map Φ, [w ∘ Φ(x, y, ∅)]_y takes the Hadamard product between w and Φ(x, y, ∅) and then truncates the resulting vector with respect to the label y; in other words, [·]_y returns only the values relevant to y. When an instance is misclassified, v is computed by subtracting [w ∘ Φ(x, ŷ, ∅)]_ŷ from [w ∘ Φ(x, y, ∅)]_y (line 10). The i'th element in v represents how strongly the i'th feature supports y against ŷ; the larger v_i is, the stronger the i'th feature is. Next, the k largest entries in v are collected in the ordered list L, representing the strongest features. Finally, the pairs of the first index (the strongest feature) and the other indices in L are added to the induced feature set F (lines 12-13). For example, if L = [i, j, k] such that v_i ≥ v_j ≥ v_k, two pairs, (i, j) and (i, k), are added to F. For all our experiments, k was set to a fixed cutoff; increasing k beyond this cutoff did not show further improvement. Notice that all induced features are created by joining only low dimensional features; the algorithm does not join a high dimensional feature with either a low dimensional feature or another high dimensional feature, which prevents the feature space from growing too quickly.

It is worth mentioning that we did not find it useful to join intermediate features together (e.g., (j, k) in the above example). It is possible to utilize these combinations by weighting them differently, which we will explore in the future. Additionally, we experimented with combinations between strong and weak features (joining the i'th and j'th features, where v_i > 0 and v_j < 0), which again was not so useful. We are planning to evaluate our approach on more tasks and data, which will give us a better understanding of what combinations are the most effective.

3.2 Regularized Dual Averaging

Each high dimensional feature in F is induced for making a classification between two labels, y and ŷ, but it may or may not be helpful for distinguishing labels other than those two. Our algorithm can be modified to learn the weights of the induced features only for their relevant labels by adding the label information to F, which would change the line 13 in Algorithm 1 as follows:

    F ← F ∪ {(L_1, L_i, y, ŷ)}

However, introducing features targeting specific label pairs potentially confuses the classifier, especially when they are trained with the low dimensional features targeting all labels. Instead, it is better to apply a feature selection technique such as ℓ1 regularization; regularized dual averaging (Xiao, 2010) applies ℓ1 regularization during online optimization, and works most effectively with sparse feature vectors. To apply regularized dual averaging, the line 1 in Algorithm 1 is changed to:

    w ← g ← c ← 0;  t ← 1

c is a (d × l)-dimensional vector consisting of accumulative penalties, and t is the number of weight vectors generated during training. Although w is technically not updated when y = ŷ, it is still considered a new vector; thus, t is incremented for every training instance, so t ← t + 1 is inserted after the line 5. c is updated by adding the partial vector ∂ as follows (to be inserted after the line 7):

    c ← c + ∂

Thus, each dimension in c represents the accumulative penalty (or reward) for a particular feature and a label. At last, the line 9 is changed to:

    w ← (η / (ρ + √g)) · ℓ1(c, t, λ)

    ℓ1(c, t, λ)_i = c_i − sgn(c_i) · λ · t,   if |c_i| > λ · t
                  = 0,                        otherwise

The function ℓ1 takes c, t, and the regularization parameter λ, which is tuned during development. If the absolute value of the accumulative penalty c_i is greater than λ · t, the weight w_i is updated using λ and t; otherwise, it is assigned 0.
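To make lines 10-13 concrete, here is a minimal sketch in Python of the induction step for sparse binary features, assuming the low dimensional weights are stored as a (num_labels × num_features) matrix; the names W, x_feats, and induce_pairs are illustrative and not taken from NLP4J.

```python
import numpy as np

def induce_pairs(W, x_feats, y, y_hat, k):
    """Sketch of lines 10-13 of Algorithm 1: pick the k strongest active
    features supporting the gold label y over the prediction y_hat, then
    pair the strongest one with each of the others."""
    idx = np.asarray(x_feats)          # active low dimensional feature indices
    v = W[y, idx] - W[y_hat, idx]      # line 10: support for y against y_hat
    order = np.argsort(-v)[:k]         # line 11: positions of the k largest v_i
    L = idx[order].tolist()            # strongest features, in decreasing order
    return {(L[0], L[i]) for i in range(1, len(L))}   # lines 12-13: (L_1, L_i)
```

With L = [i, j, k], this yields {(i, j), (i, k)}, matching the example above; the cutoff k is a hyperparameter.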
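A correspondingly minimal sketch of the modified update (the new line 9) under regularized dual averaging, assuming AdaGrad-style per-dimension learning rates; eta, rho, and lam are illustrative names for η, ρ, and λ.

```python
import numpy as np

def rda_weights(c, g, t, eta, rho, lam):
    """Sketch of the modified line 9: rebuild w from the accumulated
    penalties c, applying l1 truncation before scaling by the
    per-dimension learning rate eta / (rho + sqrt(g))."""
    truncated = np.where(np.abs(c) > lam * t,
                         c - np.sign(c) * lam * t,   # shrink strong dimensions
                         0.0)                        # zero out weak dimensions
    return (eta / (rho + np.sqrt(g))) * truncated
```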
  • 15. Dynamic Feature Induction 15 — [diagram, as in slide 4: when the classifier mispredicts x1 (ŷ1 ≠ y1), Feature Induction adds combinations to 𝑭, and Feature Expansion extends x2 so the classifier predicts correctly (ŷ2 = y2)] — Regularized Dual Averaging (Xiao, 2010); Locally Optimal Learning to Search (Chang et al., 2015)
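As a rough illustration of the Feature Expansion box on this slide, the sketch below extends an instance's low dimensional feature set with the high dimensional features induced in 𝑭, assuming each induced pair is mapped to its own feature index; F and pair_index are illustrative names.

```python
from itertools import combinations

def expand_features(x_feats, F, pair_index):
    """Append a high dimensional feature for every pair of active low
    dimensional features that was induced during training (i.e., is in F)."""
    expanded = list(x_feats)
    for i, j in combinations(sorted(set(x_feats)), 2):
        # induced pairs are stored as (strongest, other), so check both orders
        key = (i, j) if (i, j) in F else (j, i) if (j, i) in F else None
        if key is not None:
            expanded.append(pair_index[key])
    return expanded
```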
  • 16. Part-of-Speech Tagging 16 (Wall Street Journal)

    Approach                              ALL    OOV    EXT  Features
    M0: baseline                          97.18  86.35       365,400
    M1: M0 + ext. ambiguity classes       97.37  91.34   ✔   365,409
    M2: M1 + brown clusters               97.46  91.23   ✔   372,181
    M3: M1 + dynamic feature induction    97.52  91.53   ✔   468,378
    M4: M2 + dynamic feature induction    97.64  92.03   ✔   473,134
    Manning (2011)                        97.32  90.79   ✔
    Shen et al. (2007)                    97.33  89.61
    Sun (2014)                            97.36  -
    Moore (2015)                          97.36  91.09   ✔
    Spoustová et al. (2009)               97.44          ✔
    Søgaard (2011)                        97.50          ✔
    Tsuboi (2014)                         97.51  91.64   ✔

    EXT: external resources (from Wikipedia + NYTimes).
  • 17. Named Entity Recognition 17 (CoNLL'03 Shared Task)

    Approach                              TST    Features
    M0: baseline                          84.44  164,440
    M1: M0 + gazetteers                   86.85  164,720
    M2: M1 + brown clusters               89.64  169,232
    M3: M2 + word embeddings              90.57  169,682   (not used for DFI)
    M4: M3 + dynamic feature induction    91.00  208,860

    Previous results:
    Turian et al. (2010)        89.41
    Suzuki and Isozaki (2008)   89.92
    Ratinov and Roth (2009)     90.57
    Lin and Wu (2009)           90.90
    Passos et al. (2014)        90.90
  • 18. Conclusion 18
    • Dynamic feature induction finds high dimensional features by combining low dimensional features during training.
    • During decoding, it searches for useful feature combinations and augments the high dimensional features.
    • It gives the last gist to the state-of-the-art for part-of-speech tagging and named entity recognition.
    • Dynamic feature induction is rather empirically inspired; a theoretical justification of this approach would be intriguing.
    • All our approaches are implemented in NLP4J, previously known as ClearNLP: http://github.com/emorynlp/nlp4j/.