Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification
Bikash Joshi†, Massih R. Amini†, Ioannis Partalas, Franck Iutzeler†, Yury Maximov‡
† University Grenoble Alps, Expedia, ‡ Los Alamos NL & Skolkovo IST
Large-scale Multi-class Classification
• Classification problems with an extremely large number of classes, as they appear in text repositories such as Wikipedia, Yahoo! Directory (www.dir.yahoo.com), or Directory Mozilla DMOZ (www.dmoz.org), are more and more common.
• In this setting, classical approaches suffer from class imbalance and computational burden.
• Recent studies try to cope with these limitations:
– Tree-based methods rely on binary tree structures and have logarithmic time complexity, with the drawback that finding a balanced tree structure that partitions the class labels well is a challenging task. These methods also suffer from an error-propagation phenomenon that decreases accuracy.
– Label-embedding approaches first project the label matrix into a low-dimensional linear subspace and then use an OVA classifier. However, the low-rank assumption on the label matrix is generally violated in the extreme multi-class classification setting.
• In this work, we propose a scalable multi-class classification method based on an aggressive double
sampling of the dyadic output prediction problem.
Formal Setup and Notations
• Let $x^y = (x, y)$ be an observation in $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{1, \ldots, K\}$, $K \gg 1$, generated i.i.d. with respect to a distribution $\mathcal{D}$. We assume that the training set $S = (x_i^{y_i})_{i=1}^m \stackrel{\text{i.i.d.}}{\sim} \mathcal{D}^m$, and we consider a class of functions $\mathcal{G} = \{g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$ of the form $g = f \circ \phi$, where $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^p$ is an application-dependent projection (which can be learned, or defined using some heuristics), and $f \in \mathcal{F} = \{f : \mathbb{R}^p \to \mathbb{R}\}$ is a function that measures the adequacy between an observation $x$ and a class $y$ using their joint representation $\phi(x^y)$.

The objective is to find a function $g \in \mathcal{G}$ with a small expected risk:
$$R(g) = \mathbb{E}_{x^y \sim \mathcal{D}}\left[e(g, x^y)\right], \quad \text{where} \quad (1)$$
$$e(g, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{g(x^y) \le g(x^{y'})} \quad (2)$$
is the instantaneous loss of predictor $g$ on example $x^y$; it estimates the average number of classes that, for a given input, receive a higher score from $g$ than the correct class.
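A minimal sketch of Eq. (2), assuming a scoring callable `score(x, k)` that stands in for $g(x^k) = f(\phi(x^k))$ (the name `score` and the 1-based class indexing are illustrative, not from the poster):

```python
def instantaneous_loss(score, x, y, K):
    """Eq. (2): fraction of the K-1 incorrect classes that receive
    a score at least as high as the correct class y."""
    s_y = score(x, y)
    ranked_above = sum(1 for k in range(1, K + 1)
                       if k != y and s_y <= score(x, k))
    return ranked_above / (K - 1)
```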
For the dyadic transformation
$$T(S) = \left\{ \begin{array}{ll} z_j = \left(\phi(x_i^{k}),\, \phi(x_i^{y_i})\right),\ \tilde{y}_j = -1 & \text{if } k < y_i \\ z_j = \left(\phi(x_i^{y_i}),\, \phi(x_i^{k})\right),\ \tilde{y}_j = +1 & \text{otherwise} \end{array} \right\}_j$$
where $j = (i-1)(K-1) + k$, $\forall i \in [m]$, $\forall k \in [K-1]$; this expands a $K$-class set $S$ of size $m$ into a binary set $T(S)$ of size $N = m(K-1)$.
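The full reduction can be written as below, a sketch assuming $S$ is a list of (x, y) pairs with labels in 1..K and `phi(x, k)` computes the joint representation (names are illustrative, not the authors' code):

```python
def dyadic_transform(S, phi, K):
    """Expand a K-class sample S of size m into the binary set T(S)
    of size N = m * (K - 1), following the dyadic transformation."""
    T = []
    for x, y in S:
        for k in range(1, K + 1):
            if k == y:
                continue
            if k < y:
                T.append(((phi(x, k), phi(x, y)), -1))  # tilde-y_j = -1
            else:
                T.append(((phi(x, y), phi(x, k)), +1))  # tilde-y_j = +1
    return T
```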
Classification with interdependent data
With the class of functions
$$\mathcal{H} = \left\{h : \left(\phi(x^y), \phi(x^{y'})\right) \mapsto f(\phi(x^y)) - f(\phi(x^{y'})),\ f \in \mathcal{F}\right\},$$
the empirical loss associated with Eq. (1) becomes:
$$\tilde{R}_{T(S)}(h) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\tilde{y}_j h(z_j) \le 0}. \quad (3)$$
Definition 1 Let $G = (\mathcal{V}, \mathcal{E})$ be a graph. $\mathcal{C} = \{(\mathcal{C}_k, \omega_k)\}_k$, with $\mathcal{C}_k \subseteq \mathcal{V}$ and $\omega_k \in [0, 1]$, is a proper exact fractional cover of $G$ if: i) it is proper: $\forall k$, $\mathcal{C}_k$ is an independent set, i.e., there are no connections between vertices in $\mathcal{C}_k$; ii) it is an exact fractional cover of $G$: $\forall v \in \mathcal{V}$, $\sum_{k : v \in \mathcal{C}_k} \omega_k = 1$.
Given the classes of functions $\mathcal{G}$ and $\mathcal{H}$ introduced previously, consider the parameterized family $\mathcal{H}_r$ which, for $r > 0$, is defined as:
$$\mathcal{H}_r = \left\{h : h \in \mathcal{H},\ \mathbb{V}[h] \doteq \mathbb{V}_{z, \tilde{y}}\left[\mathbb{1}_{\tilde{y} h(z) \le 0}\right] \le r\right\},$$
where $\mathbb{V}$ denotes the variance. The fractional Rademacher complexity that underlies our analysis is:
$$\mathfrak{R}_{T(S)}(\mathcal{H}) \doteq \frac{2}{N}\, \mathbb{E}_{\xi} \sum_{k \in [K-1]} \omega_k\, \mathbb{E}_{\mathcal{C}_k} \sup_{h \in \mathcal{H}} \sum_{\substack{\alpha \in \mathcal{C}_k \\ z_\alpha \in T(S)}} \xi_\alpha h(z_\alpha).$$
For the dyadic transformation, the sets $\mathcal{C}_k = \{z_{(i-1)(K-1)+k} : i \in [m]\}$, $k \in [K-1]$, with $\omega_k = 1$, form such a cover: each $\mathcal{C}_k$ collects one pair per original example, and pairs built from distinct i.i.d. examples are independent.
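As a rough illustration (not part of the poster), this quantity can be estimated by Monte Carlo when the supremum is restricted to a finite pool of candidate functions; `T_blocks` plays the role of the cover sets $\mathcal{C}_k$ with $\omega_k = 1$:

```python
import numpy as np

def fractional_rademacher_mc(T_blocks, H, n_draws=100, seed=0):
    """Monte Carlo sketch of the fractional Rademacher complexity:
    (2/N) * E_xi sum_k sup_{h in H} sum_{alpha in C_k} xi_alpha h(z_alpha).
    T_blocks[k] lists the pairs z_alpha of cover set C_k; H is a finite
    list of callables (a simplification: the sup over an infinite
    function class is not computable this way)."""
    rng = np.random.default_rng(seed)
    N = sum(len(block) for block in T_blocks)
    total = 0.0
    for _ in range(n_draws):
        for block in T_blocks:                          # sum over k
            xi = rng.choice([-1.0, 1.0], size=len(block))
            total += max(sum(e * h(z) for e, z in zip(xi, block))
                         for h in H)                    # sup over h
    return 2.0 / N * total / n_draws
```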
Rademacher complexity bounds for interdependent data
Theorem 1 Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a dataset of $m$ examples drawn i.i.d. according to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((z_i, \tilde{y}_i))_{i=1}^N$ the transformed set obtained through the dyadic transformation $T$ above. Then for any $1 > \delta > 0$ and the 0/1 loss $\ell : \{-1, +1\} \times \mathbb{R} \to [0, 1]$, with probability at least $(1 - \delta)$ the following generalization bound holds for all $h \in \mathcal{H}_r$:
$$R(h) \le \tilde{R}_{T(S)}(h) + \mathfrak{R}_{T(S)}(\ell \circ \mathcal{H}_r) + \frac{5}{2} \sqrt{\left(\mathfrak{R}_{T(S)}(\ell \circ \mathcal{H}_r) + \frac{r}{2}\right) \frac{\log \frac{1}{\delta}}{m}} + \frac{25}{48} \frac{\log \frac{1}{\delta}}{m}.$$
This result gives insight into the consistency of the ERM principle when learning with interdependent data; however, for $K \gg 1$ and $m \gg 1$ the constitution of $T(S)$ may be intractable (e.g., on WIKI-100K, $N = m(K-1) \approx 2.2 \times 10^{11}$ dyadic pairs).
The (π, κ)-DS algorithm and a new generalization bound
The proposed aggressive double sampling procedure^a, referred to as (π, κ)-DS, is composed of two main steps (a code sketch follows the list):
1. For each class $k \in \{1, \ldots, K\}$, randomly draw a set $S_{\pi_k}$ of examples of that class from $S$ with probability $\pi_k$, and let $S_\pi = \bigcup_{k=1}^{K} S_{\pi_k}$;
2. For each example $x^y$ in $S_\pi$, draw uniformly $\kappa$ adversarial classes in $\mathcal{Y} \setminus \{y\}$.
Theorem 2 Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set of size $m$ drawn i.i.d. according to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((z_i, \tilde{y}_i))_{i=1}^N$ the transformed set obtained with the transformation function $T$. Let $S_\pi \subseteq S$, $|S_\pi| = M$, be a training set output by the algorithm (π, κ)-DS and $T_\kappa(S_\pi) \subseteq T(S)$ its corresponding transformation. Then for any $1 > \delta > 0$, with probability at least $(1 - \delta)$ the following risk bound holds:
$$\forall h \in \mathcal{H},\quad R(h) \le \alpha \tilde{R}_{T_\kappa(S_\pi)}(h) + \alpha \mathfrak{R}_{T_\kappa(S_\pi)}(\ell \circ \mathcal{H}) + \alpha \sqrt{\frac{(K-1) \log \frac{2}{\delta}}{2 M \kappa}} + \sqrt{\frac{2 \alpha \log \frac{4K}{\delta}}{\beta (m-1)}} + \frac{7 \beta \log \frac{4K}{\delta}}{3(m-1)},$$
where $\tilde{R}_{T_\kappa(S_\pi)}(h) = \frac{1}{\kappa M} \sum_{x^y \in S_\pi} \sum_{y' \in \mathcal{Y}_{x^y}} \mathbb{1}_{g(x^y) - g(x^{y'}) \le 0}$, with $\mathcal{Y}_{x^y}$ the $\kappa$ classes drawn for $x^y$, $\alpha = \max_{y\,:\,1 \le y \le K} \eta_y / \pi_y$, $\beta = \max_{y\,:\,1 \le y \le K} 1 / \pi_y$, and $\eta_y > 0$ is the proportion of class $y$ in $S$.
^a https://github.com/bikash617/Aggressive-Sampling-for-Multi-class-to-BinaryReduction
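The constants of Theorem 2 are directly computable; a sketch assuming every class appears in $S$ (so that $\eta_y > 0$) and $\pi_y > 0$ for all $y$:

```python
from collections import Counter

def alpha_beta(S, pi, K):
    """alpha = max_y eta_y / pi_y and beta = max_y 1 / pi_y, where
    eta_y is the empirical proportion of class y in S (Theorem 2)."""
    m = len(S)
    counts = Counter(y for _, y in S)
    alpha = max((counts[y] / m) / pi[y] for y in range(1, K + 1))
    beta = max(1.0 / pi[y] for y in range(1, K + 1))
    return alpha, beta
```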
Dataset Properties and φ(.)
Datasets K #Train #Test d
LSHTC1 12294 126871 31718 409774
DMOZ 27875 381149 95288 594158
WIKI-Small 36504 796617 199155 380078
WIKI-50K 50000 1102754 276939 951558
WIKI-100K 100000 2195530 550133 1271710
Features in the joint example/class representation $\phi(x^y)$:

1. $\sum_{t \in y \cap x} \log(1 + y_t)$
2. $\sum_{t \in y \cap x} \log\left(1 + \frac{l_S}{F_t}\right)$
3. $\sum_{t \in y \cap x} I_t$
4. $\sum_{t \in y \cap x} \frac{y_t}{|y|} \cdot I_t$
5. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|}\right)$
6. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|} \cdot I_t\right)$
7. $\sum_{t \in y \cap x} \log\left(1 + \frac{y_t}{|y|} \cdot \frac{l_S}{F_t}\right)$
8. $\sum_{t \in y \cap x} 1$
9. $d(x^y, \mathrm{centroid}(y))$
10. $\mathrm{BM25} = \sum_{t \in y \cap x} I_t \cdot \frac{2 \times y_t}{y_t + (0.25 + 0.75 \cdot \mathrm{len}(y)/\mathrm{avg}(\mathrm{len}(y)))}$
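A sketch of a few of these components; the input conventions are assumptions (the poster does not define them): `x_terms` and `y_terms` map terms to their frequencies in document x and class y, and `It`, `Ft`, `lS` hold the quantities $I_t$, $F_t$, $l_S$:

```python
import math

def joint_features(x_terms, y_terms, It, Ft, lS):
    """Features 1, 2, 3, 5, and 8 of phi(x^y) from the table above."""
    common = set(x_terms) & set(y_terms)                  # t in y ∩ x
    size_y = sum(y_terms.values())                        # |y|
    f1 = sum(math.log1p(y_terms[t]) for t in common)
    f2 = sum(math.log1p(lS / Ft[t]) for t in common)
    f3 = sum(It[t] for t in common)
    f5 = sum(math.log1p(y_terms[t] / size_y) for t in common)
    f8 = len(common)
    return [f1, f2, f3, f5, f8]
```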
Experimental Results
[Figure: per-dataset comparison of training time (min.), total memory (GB), and macro F-measure MaF (%) on LSHTC1, DMOZ, WIKI-Small, WIKI-50K, and WIKI-100K, for RecallTree, FastXML, PfastReXML, PD-Sparse, and the proposed DS method.]
• RecallTree: tree-based multi-class classifier implemented in Vowpal Wabbit.
• FastXML: partitions the feature space for faster prediction.
• PfastReXML: tree-ensemble-based extreme classifier.
• PD-Sparse: ℓ1-regularized multi-class loss.