Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification
Bikash Joshi†, Massih R. Amini†, Ioannis Partalas, Franck Iutzeler†, Yury Maximov‡
† University Grenoble Alps, Expedia, ‡ Los Alamos NL & Skolkovo IST
Large-scale Multi-class Classification
• Classification problems with an extremely large number of classes, as they appear in text repositories such as Wikipedia, Yahoo! Directory (www.dir.yahoo.com), or Directory Mozilla DMOZ (www.dmoz.org), are more and more common.
• In this setting, classical approaches suffer from class imbalance and computational burden.
• Recent studies try to cope with these limitations:
– Tree-based methods rely on binary tree structures and have logarithmic time complexity, with the drawback that finding a balanced tree structure that partitions the class labels well is a challenging task. These methods also suffer from an error-propagation phenomenon that decreases accuracy.
– Label-embedding approaches first project the label matrix into a low-dimensional linear subspace and then use an OVA classifier. However, the low-rank assumption on the label matrix is generally violated in the extreme multi-class classification setting.
• In this work, we propose a scalable multi-class classification method based on an aggressive double
sampling of the dyadic output prediction problem.
Formal Setup and Notations
• Let $x^y = (x, y)$ be an observation in $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{1, \ldots, K\}$, $K \gg 1$, generated i.i.d. with respect to a distribution $\mathcal{D}$. We assume that the training set $S = (x_i^{y_i})_{i=1}^m \stackrel{\text{i.i.d.}}{\sim} \mathcal{D}^m$, and we consider a class of functions $\mathcal{G} = \{g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$ of the form $g = f \circ \phi$, where $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^p$ is an application-dependent projection (which can be learned, or defined using some heuristics), and $f \in \mathcal{F} = \{f : \mathbb{R}^p \to \mathbb{R}\}$ is a function that measures the adequacy between an observation $x$ and a class $y$ using their joint representation $\phi(x^y)$.

The objective is to find a function $g \in \mathcal{G}$ with a small expected risk:
$$R(g) = \mathbb{E}_{x^y \sim \mathcal{D}}\left[e(g, x^y)\right], \quad \text{where} \quad (1)$$
$$e(g, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{g(x^y) \le g(x^{y'})} \quad (2)$$
is the instantaneous loss of predictor $g$ on example $x^y$; it estimates the average number of classes that, for a given input, receive a higher score from $g$ than the correct class.
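A minimal sketch of Eq. (2), assuming a scoring callable `score(x, k)` that stands in for $g(x^k) = f(\phi(x^k))$ (the name `score` and the 1-based class indexing are illustrative, not from the poster):

```python
def instantaneous_loss(score, x, y, K):
    """Eq. (2): fraction of the K-1 incorrect classes that receive
    a score at least as high as the correct class y."""
    s_y = score(x, y)
    ranked_above = sum(1 for k in range(1, K + 1)
                       if k != y and s_y <= score(x, k))
    return ranked_above / (K - 1)
```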
For the dyadic transformation
$$T(S) = \left\{ \begin{array}{ll} z_j = \left(\phi(x_i^{k}),\, \phi(x_i^{y_i})\right),\ \tilde{y}_j = -1 & \text{if } k < y_i \\ z_j = \left(\phi(x_i^{y_i}),\, \phi(x_i^{k})\right),\ \tilde{y}_j = +1 & \text{otherwise} \end{array} \right\}_j$$
where $j = (i-1)(K-1) + k$, $\forall i \in [m]$, $\forall k \in [K-1]$; this expands a $K$-class set $S$ of size $m$ into a binary set $T(S)$ of size $N = m(K-1)$.
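The full reduction can be written as below, a sketch assuming $S$ is a list of (x, y) pairs with labels in 1..K and `phi(x, k)` computes the joint representation (names are illustrative, not the authors' code):

```python
def dyadic_transform(S, phi, K):
    """Expand a K-class sample S of size m into the binary set T(S)
    of size N = m * (K - 1), following the dyadic transformation."""
    T = []
    for x, y in S:
        for k in range(1, K + 1):
            if k == y:
                continue
            if k < y:
                T.append(((phi(x, k), phi(x, y)), -1))  # tilde-y_j = -1
            else:
                T.append(((phi(x, y), phi(x, k)), +1))  # tilde-y_j = +1
    return T
```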
Classification with interdependent data
With the class of functions
$$\mathcal{H} = \left\{h : \left(\phi(x^y), \phi(x^{y'})\right) \mapsto f(\phi(x^y)) - f(\phi(x^{y'})),\ f \in \mathcal{F}\right\},$$
the empirical loss associated with Eq. (1) becomes:
$$\tilde{R}_{T(S)}(h) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\tilde{y}_j h(z_j) \le 0}. \quad (3)$$
Definition 1 Let $G = (\mathcal{V}, \mathcal{E})$ be a graph. $\mathcal{C} = \{(\mathcal{C}_k, \omega_k)\}_k$, with $\mathcal{C}_k \subseteq \mathcal{V}$ and $\omega_k \in [0, 1]$, is a proper exact fractional cover of $G$ if: i) it is proper: $\forall k$, $\mathcal{C}_k$ is an independent set, i.e., there are no connections between vertices in $\mathcal{C}_k$; ii) it is an exact fractional cover of $G$: $\forall v \in \mathcal{V}$, $\sum_{k : v \in \mathcal{C}_k} \omega_k = 1$.
Given the classes of functions $\mathcal{G}$ and $\mathcal{H}$ introduced previously, consider the parameterized family $\mathcal{H}_r$ which, for $r > 0$, is defined as:
$$\mathcal{H}_r = \left\{h : h \in \mathcal{H},\ \mathbb{V}[h] \doteq \mathbb{V}_{z, \tilde{y}}\left[\mathbb{1}_{\tilde{y} h(z) \le 0}\right] \le r\right\},$$
where $\mathbb{V}$ denotes the variance. The fractional Rademacher complexity that underlies our analysis is:
$$\mathfrak{R}_{T(S)}(\mathcal{H}) \doteq \frac{2}{N}\, \mathbb{E}_{\xi} \sum_{k \in [K-1]} \omega_k\, \mathbb{E}_{\mathcal{C}_k} \sup_{h \in \mathcal{H}} \sum_{\substack{\alpha \in \mathcal{C}_k \\ z_\alpha \in T(S)}} \xi_\alpha h(z_\alpha).$$
For the dyadic transformation, the sets $\mathcal{C}_k = \{z_{(i-1)(K-1)+k} : i \in [m]\}$, $k \in [K-1]$, with $\omega_k = 1$, form such a cover: each $\mathcal{C}_k$ collects one pair per original example, and pairs built from distinct i.i.d. examples are independent.
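As a rough illustration (not part of the poster), this quantity can be estimated by Monte Carlo when the supremum is restricted to a finite pool of candidate functions; `T_blocks` plays the role of the cover sets $\mathcal{C}_k$ with $\omega_k = 1$:

```python
import numpy as np

def fractional_rademacher_mc(T_blocks, H, n_draws=100, seed=0):
    """Monte Carlo sketch of the fractional Rademacher complexity:
    (2/N) * E_xi sum_k sup_{h in H} sum_{alpha in C_k} xi_alpha h(z_alpha).
    T_blocks[k] lists the pairs z_alpha of cover set C_k; H is a finite
    list of callables (a simplification: the sup over an infinite
    function class is not computable this way)."""
    rng = np.random.default_rng(seed)
    N = sum(len(block) for block in T_blocks)
    total = 0.0
    for _ in range(n_draws):
        for block in T_blocks:                          # sum over k
            xi = rng.choice([-1.0, 1.0], size=len(block))
            total += max(sum(e * h(z) for e, z in zip(xi, block))
                         for h in H)                    # sup over h
    return 2.0 / N * total / n_draws
```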
Rademacher complexity bounds for interdependent data
Theorem 1 Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a dataset of $m$ examples drawn i.i.d. according to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((z_i, \tilde{y}_i))_{i=1}^N$ the transformed set obtained through the dyadic transformation $T$ above. Then for any $1 > \delta > 0$ and the 0/1 loss $\ell : \{-1, +1\} \times \mathbb{R} \to [0, 1]$, with probability at least $(1 - \delta)$ the following generalization bound holds for all $h \in \mathcal{H}_r$:
$$R(h) \le \tilde{R}_{T(S)}(h) + \mathfrak{R}_{T(S)}(\ell \circ \mathcal{H}_r) + \frac{5}{2} \sqrt{\left(\mathfrak{R}_{T(S)}(\ell \circ \mathcal{H}_r) + \frac{r}{2}\right) \frac{\log \frac{1}{\delta}}{m}} + \frac{25}{48} \frac{\log \frac{1}{\delta}}{m}.$$
This result gives insight into the consistency of the ERM principle when learning with interdependent data; however, for $K \gg 1$ and $m \gg 1$ the constitution of $T(S)$ may be intractable (e.g., on WIKI-100K, $N = m(K-1) \approx 2.2 \times 10^{11}$ dyadic pairs).
The (π, κ)-DS algorithm and a new generalization bound
The proposed aggressive double sampling procedure^a, referred to as (π, κ)-DS, is composed of two main steps (a code sketch follows the list):
1. For each class $k \in \{1, \ldots, K\}$, randomly draw a set $S_{\pi_k}$ of examples of that class from $S$ with probability $\pi_k$, and let $S_\pi = \bigcup_{k=1}^{K} S_{\pi_k}$;
2. For each example $x^y$ in $S_\pi$, draw uniformly $\kappa$ adversarial classes in $\mathcal{Y} \setminus \{y\}$.
Theorem 2 Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set of size $m$ drawn i.i.d. according to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((z_i, \tilde{y}_i))_{i=1}^N$ the transformed set obtained with the transformation function $T$. Let $S_\pi \subseteq S$, $|S_\pi| = M$, be a training set output by the algorithm (π, κ)-DS and $T_\kappa(S_\pi) \subseteq T(S)$ its corresponding transformation. Then for any $1 > \delta > 0$, with probability at least $(1 - \delta)$ the following risk bound holds:
$$\forall h \in \mathcal{H},\quad R(h) \le \alpha \tilde{R}_{T_\kappa(S_\pi)}(h) + \alpha \mathfrak{R}_{T_\kappa(S_\pi)}(\ell \circ \mathcal{H}) + \alpha \sqrt{\frac{(K-1) \log \frac{2}{\delta}}{2 M \kappa}} + \sqrt{\frac{2 \alpha \log \frac{4K}{\delta}}{\beta (m-1)}} + \frac{7 \beta \log \frac{4K}{\delta}}{3(m-1)},$$
where $\tilde{R}_{T_\kappa(S_\pi)}(h) = \frac{1}{\kappa M} \sum_{x^y \in S_\pi} \sum_{y' \in \mathcal{Y}_{x^y}} \mathbb{1}_{g(x^y) - g(x^{y'}) \le 0}$, with $\mathcal{Y}_{x^y}$ the $\kappa$ classes drawn for $x^y$, $\alpha = \max_{y\,:\,1 \le y \le K} \eta_y / \pi_y$, $\beta = \max_{y\,:\,1 \le y \le K} 1 / \pi_y$, and $\eta_y > 0$ is the proportion of class $y$ in $S$.
^a https://github.com/bikash617/Aggressive-Sampling-for-Multi-class-to-BinaryReduction
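The constants of Theorem 2 are directly computable; a sketch assuming every class appears in $S$ (so that $\eta_y > 0$) and $\pi_y > 0$ for all $y$:

```python
from collections import Counter

def alpha_beta(S, pi, K):
    """alpha = max_y eta_y / pi_y and beta = max_y 1 / pi_y, where
    eta_y is the empirical proportion of class y in S (Theorem 2)."""
    m = len(S)
    counts = Counter(y for _, y in S)
    alpha = max((counts[y] / m) / pi[y] for y in range(1, K + 1))
    beta = max(1.0 / pi[y] for y in range(1, K + 1))
    return alpha, beta
```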
Dataset Properties and φ(.)
Datasets K #Train #Test d
LSHTC1 12294 126871 31718 409774
DMOZ 27875 381149 95288 594158
WIKI-Small 36504 796617 199155 380078
WIKI-50K 50000 1102754 276939 951558
WIKI-100K 100000 2195530 550133 1271710
Features in the joint example/class representation $\phi(x^y)$:

1. $\sum_{t \in y \cap x} \log(1 + y_t)$
2. $\sum_{t \in y \cap x} \log\left(1 + \frac{l_S}{F_t}\right)$
3. $\sum_{t \in y \cap x} I_t$
4. $\sum_{t \in y \cap x} \frac{y_t}{|y|} \cdot I_t$
5. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|}\right)$
6. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|} \cdot I_t\right)$
7. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|} \cdot \frac{l_S}{F_t}\right)$
8. $\sum_{t \in y \cap x} 1$
9. $d(x^y, \mathrm{centroid}(y))$
10. $\mathrm{BM25} = \sum_{t \in y \cap x} I_t \cdot \frac{2 \times y_t}{y_t + (0.25 + 0.75 \cdot \mathrm{len}(y)/\mathrm{avg}(\mathrm{len}(y)))}$
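A sketch of a few of these components; the input conventions are assumptions (the poster does not define them): `x_terms` and `y_terms` map terms to their frequencies in document x and class y, and `It`, `Ft`, `lS` hold the quantities $I_t$, $F_t$, $l_S$:

```python
import math

def joint_features(x_terms, y_terms, It, Ft, lS):
    """Features 1, 2, 3, 5, and 8 of phi(x^y) from the table above."""
    common = set(x_terms) & set(y_terms)                  # t in y ∩ x
    size_y = sum(y_terms.values())                        # |y|
    f1 = sum(math.log1p(y_terms[t]) for t in common)
    f2 = sum(math.log1p(lS / Ft[t]) for t in common)
    f3 = sum(It[t] for t in common)
    f5 = sum(math.log1p(y_terms[t] / size_y) for t in common)
    f8 = len(common)
    return [f1, f2, f3, f5, f8]
```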
Experimental Results
[Figure: per-dataset comparison of training time (min.), total memory (GB), and macro F-measure MaF (%) on LSHTC1, DMOZ, WIKI-Small, WIKI-50K, and WIKI-100K, for RecallTree, FastXML, PfastReXML, PD-Sparse, and the proposed DS method.]
• RecallTree: tree-based multi-class classifier implemented in Vowpal Wabbit.
• FastXML: partitions the feature space for faster prediction.
• PfastReXML: tree-ensemble-based extreme classifier.
• PD-Sparse: ℓ1-regularized multi-class loss.