IDA 2015: Efficient model selection for regularized classification by exploiting unlabeled data

Efficient Model Selection for Regularized
Classification by Exploiting Unlabeled Data
Georgios Balikas1, Ioannis Partalas2, Eric Gaussier1, Rohit Babbar3, Massih-Reza Amini1
1University Grenoble, Alpes
2Viseo R&D
3Max Planck Institute for Intelligent Systems
Intelligent Data Analysis 2015, Saint-Étienne

Outline
1 Introduction
2 Quantification
3 The proposed approach
4 Experiment Framework
5 Conclusion

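Model selection for text classification
Pipeline: documents Doc_1, ..., Doc_N → feature extraction → vectors $d_1, \dots, d_N \in \mathbb{R}^d$ → learning: select $h_\theta \in \mathcal{H}$, where $\theta$ denotes the hyper-parameters, giving $\hat{R}(\theta) \in \mathbb{R}$. Which $\theta$?
The task
Efficiently select the hyper-parameter value which minimizes the
generalization error (using the empirical error as a proxy).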

Traditional Model Selection Methods
Figure: 5-fold cross-validation. Each fold serves once as the validation set while the remaining four folds are used for training.
Figure: Hold-out. A single split into a training part and a validation part.
Extensions of the above such as Leave-one-out, etc.
M. Mohri et al.
Foundations of Machine Learning, MIT press 2012
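
As a rough illustration of the two splitting schemes above, the following sketch generates the index sets for k-fold cross-validation and for a hold-out split. The function names `k_fold_indices` and `hold_out_indices` are hypothetical, and the sketch assumes the examples are already shuffled; it is not the setup used in the experiments.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def hold_out_indices(n_samples, train_fraction=0.7):
    """Return one (train_indices, valid_indices) split."""
    cut = round(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))
```

With k folds, every candidate hyper-parameter value needs k training runs, whereas hold-out needs one run on a reduced training set.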

The issues
In large scale problems:
Resource intensive: ∼ $10^6$ to $10^8$ free parameters. Optimized
k-CV can take up to several days.
Power-law distribution of examples: small classes have only a few instances, so splitting them results in loss of information.
Figure: labeled documents per class.
R. Babbar, I. Partalas, E. Gaussier, M-R. Amini
Re-ranking approach to classification in large-scale power-law distributed
category systems, SIGIR 2014

Our contribution
We propose a bound that motivates efficient model selection. The resulting method:
Leverages unlabeled data for model selection
Performs on par with (if not better than) traditional methods
Is k times faster than k-fold cross-validation

Quantification
Definition
In many classification scenarios, the real goal is determining the
prevalence of each class in the test set, a task called quantification.
Given a dataset:
How many people liked the new iPhone?
How many instances belong to class $y_i$?
A. Esuli and F. Sebastiani
Optimizing text quantifiers for multivariate loss functions, arXiv preprint
arXiv:1502.05491

Quantification using general purpose learners
Classify and Count
Aggregative method
– Classify each instance first
– Count instances/class
Probabilistic Classify and Count
Non-aggregative method
– Get scores/probabilities for each instance
– Sum over probabilities/class
G. Forman
Counting positives accurately despite inaccurate classification, ECML 2005
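
The two quantifiers above can be sketched in a few lines. This is a minimal illustration, assuming the classifier's hard predictions (for Classify and Count) or its per-class probabilities (for Probabilistic Classify and Count) are already available; the function names and input shapes are assumptions made for the example.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Estimate class prevalence as the fraction of instances assigned to each class."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts[c] / n for c in classes}


def probabilistic_classify_and_count(probabilities, classes):
    """Estimate class prevalence as the mean predicted probability of each class.

    `probabilities` is a list with one dict per instance, mapping class -> probability.
    """
    n = len(probabilities)
    return {c: sum(p.get(c, 0.0) for p in probabilities) / n for c in classes}
```

Both functions return a dictionary of estimated prevalences over the given classes, which is the quantity written $p_y^{C(S)}$ in the rest of the talk.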

Our setting
Mono-label, multi-class classification
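Observations $x \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y \in \mathcal{Y}$, $|\mathcal{Y}| > 2$
$(x, y)$ i.i.d. according to a fixed, unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$
$S_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $S = \{x^{(i)}\}_{i=N+1}^{M}$
Regularized classification: $\hat{w} = \arg\min_{w} R_{emp}(w) + \lambda \, Reg(w)$
$h_\theta \in \mathcal{H}$, e.g., for SVMs $\theta = \lambda$, chosen from a set $\lambda_{values}$
$\hat{p}_y$: prior estimated on $S_{train}$; $p_y^{C(S)}$: prior estimated using quantification on $S$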

Accuracy bound
Theorem
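Let $S = \{x^{(j)}\}_{j=1}^{M}$ be a set generated i.i.d. with respect to $\mathcal{D}_{\mathcal{X}}$, $p_y$ the true prior probability for category $y \in \mathcal{Y}$ and $\hat{p}_y \triangleq \frac{N_y}{N}$ its empirical estimate obtained on $S_{train}$.
We consider here a classifier $C$ trained on $S_{train}$ and we assume that the quantification method used is accurate in the sense that:
$$\exists \epsilon, \; \epsilon \ll \min\{p_y, \hat{p}_y, p_y^{C(S)}\}, \; \forall y \in \mathcal{Y}: \; \left| p_y^{C(S)} - \frac{M_y^{C(S)}}{|S|} \right| \le \epsilon$$
Let $B_A^{C(S)}$ be defined as:
$$B_A^{C(S)} \triangleq \sum_{y \in \mathcal{Y}} \frac{\min\{\hat{p}_y \times |S|, \; p_y^{C(S)} \times |S|\}}{|S|}$$
Then for any $\delta \in ]0, 1]$, with probability at least $(1 - \delta)$:
$$A^{C(S)} \le B_A^{C(S)} + |\mathcal{Y}| \left( \sqrt{\frac{\log|\mathcal{Y}| + \log\frac{1}{\delta}}{2N}} + \epsilon \right)$$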
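
To get a feel for the second term of the inequality, the sketch below evaluates it numerically. It assumes the term reads $|\mathcal{Y}| \left( \sqrt{(\log|\mathcal{Y}| + \log\frac{1}{\delta}) / (2N)} + \epsilon \right)$, with $N$ the number of training examples; the function name `confidence_term` and the default `epsilon=0.0` are choices made for the example only.

```python
import math


def confidence_term(n_train, n_classes, delta, epsilon=0.0):
    """Evaluate |Y| * (sqrt((log|Y| + log(1/delta)) / (2N)) + epsilon)."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in ]0, 1]")
    if n_train <= 0 or n_classes <= 0:
        raise ValueError("n_train and n_classes must be positive")
    root = math.sqrt(
        (math.log(n_classes) + math.log(1.0 / delta)) / (2.0 * n_train))
    return n_classes * (root + epsilon)
```

Under this reading the term depends only on $N$, $|\mathcal{Y}|$, $\delta$ and $\epsilon$, not on the hyper-parameter $\lambda$, so with a common $\epsilon$ candidate values of $\lambda$ can be compared through $B_A^{C(S)}$ alone.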

Intuition
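$$B_A^{C(S)} \triangleq \sum_{y \in \mathcal{Y}} \frac{\min\{\hat{p}_y \times |S|, \; p_y^{C(S)} \times |S|\}}{|S|}$$
where $\hat{p}_y$ is the prior probability of $y$ and $p_y^{C(S)}$ is the estimated probability of $y$ on $S$.
In a power-law distributed category system this is an upper bound:
– $\hat{p}_y$ will be used for large classes due to false positives, and
– $p_y^{C(S)}$ will be used for small classes due to false negatives.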
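
A small numeric sketch of the quantity $B_A^{C(S)}$ may help. Since $|S|$ appears in both the numerator and the denominator, it cancels, and the sum reduces to $\sum_y \min\{\hat{p}_y, p_y^{C(S)}\}$. The function name `accuracy_bound` and the toy priors in the usage note are made up for illustration.

```python
def accuracy_bound(train_priors, quantified_priors):
    """Compute B_A as the sum over classes of min(p_hat_y, p_C_y).

    Both arguments map class -> prevalence; a class missing from one of
    them is treated as having prevalence 0 there.
    """
    classes = set(train_priors) | set(quantified_priors)
    return sum(
        min(train_priors.get(y, 0.0), quantified_priors.get(y, 0.0))
        for y in classes)
```

For instance, with training priors `{"big": 0.7, "mid": 0.2, "small": 0.1}` and quantified priors `{"big": 0.8, "mid": 0.15, "small": 0.05}`, the minimum picks the training prior for the over-predicted large class and the quantified prior for the under-predicted smaller ones.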

Model selection using the bound
Training Data
Estimate class priors
Quantification on unseen data:
for $\lambda$ in $\lambda_{values}$ do
    Train on $S_{train}$
    Estimate $p_y^{C(S)}$ on $S$
end for
Calculate the Bound
Select hyper-parameter value
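
The procedure above can be sketched as a short loop. This is a minimal illustration under two assumptions: the candidate with the largest value of the bound is the one selected, and training plus quantification are hidden behind a caller-supplied function. The names `select_hyperparameter` and `quantify_with` are hypothetical.

```python
def select_hyperparameter(lambda_values, train_priors, quantify_with):
    """Return (best_lambda, bounds) for the candidate values.

    train_priors: dict class -> prior estimated on the training set.
    quantify_with: callable taking a lambda value; it is expected to train
        a classifier on the training set with that value and return a dict
        class -> prevalence estimated by quantification on the unlabeled set.
    """
    bounds = {}
    for lam in lambda_values:
        quantified = quantify_with(lam)  # one training run per candidate
        bounds[lam] = sum(
            min(prior, quantified.get(y, 0.0))
            for y, prior in train_priors.items())
    # Assumption: pick the candidate whose bound value is largest.
    best_lambda = max(bounds, key=bounds.get)
    return best_lambda, bounds
```

The loop trains once per candidate value, compared with k training runs per candidate for k-fold cross-validation, which is consistent with the k-fold speed-up claimed in the talk.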

Datasets
Dataset   #Training  #Quantification  #Test  #Features  #Parameters
dmoz250   1,542      2,401            1,023  55,610     13.9M
dmoz500   2,137      3,042            1,356  77,274     38.6M
dmoz1000  6,806      10,785           4,510  138,879    138.8M
dmoz1500  9,039      14,002           5,958  170,828    256.2M
dmoz2500  12,832     19,188           8,342  212,073    530.1M
– Similar experimental settings on Wikipedia data
– SVMs and logistic regression, $\lambda \in \{10^{-4}, \dots, 10^{4}\}$
– 5-CV, Hold-out (70%-30%), BoundUn, BoundTest

Results (1/2)
Figure: MaF measure optimization for wiki1500 for SVM. x-axis: λ values ($10^{-4}$ to $10^{3}$); y-axis: Accuracy (0.0 to 0.8); curves: 5-CV, H out, MaF, CC, PCC.

Results (2/2)
          BoundUn         BoundTest       Hold-out                          5-CV
Dataset   Acc     MaF     Acc     MaF     Acc              MaF              Acc     MaF
dmoz250   .8260   .6242   .8270   .6243   .8260 (±.0000)   .6242 (±.0000)   .8260   .6242
dmoz500   .7227   .5584   .7227   .5584   .7221 (±.0005)   .5558 (±.0022)   .7220   .5562
dmoz1000  .7302   .4883   .7302   .4892   .7301 (±.0001)   .4835 (±.0155)   .7299   .4883
dmoz1500  .7132   .4715   .7132   .4715   .6958 (±.0457)   .4065 (±.0998)   .7132   .4715
dmoz2500  .6352   .4301   .6350   .4306   .6350 (±.0001)   .3949 (±.0686)   .6352   .4301
wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).

Conclusions
The proposed method performs equally well or better than traditional
model selection methods.
It is k times faster than k-CV.
It requires unlabeled data from the same distribution as the
training data.

Thank you
Georgios Balikas
georgios.balikas@imag.fr
Ioannis Partalas
ioannis.partalas@viseo.com
Eric Gaussier
eric.gaussier@imag.fr
Rohit Babbar
rohit.babbar@gmail.com
Massih-Reza Amini
massih-reza.amini@imag.fr
This work is partially supported by the CIFRE N 28/2015 and by
the LabEx PERSYVAL Lab ANR-11-LABX-0025.
