When Classifier Selection meets
Information Theory: A Unifying View
Mohamed Abdel Hady, Friedhelm Schwenker,
Günther Palm
Institute of Neural Information Processing
University of Ulm, Germany
December 8, 2010
Outline
1 Ensemble Learning
2 Ensemble Pruning
3 Information Theory
4 Ensemble Pruning meets Information Theory
5 Experimental Results
6 Conclusion and Future Work
Ensemble Learning
An ensemble is a set of accurate and diverse classifiers. The objective is that the
ensemble outperforms its member classifiers.
[Figure: ensemble architecture. An input x is fed to the classifiers h1, ..., hi, ..., hN (classifier layer); their outputs h1(x), ..., hi(x), ..., hN(x) are fused by the combiner g (combination layer) into the ensemble decision g(x).]
Ensemble learning has become a hot topic in recent years.
Ensemble methods consist of two phases: the construction of multiple individual
classifiers and their combination.
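As a minimal illustration of the combination layer (not from the slides), the Python sketch below fuses the crisp predictions h_i(x) of already-trained classifiers by majority vote; the scikit-learn-style predict() interface and non-negative integer class labels are assumptions of this sketch.

import numpy as np

def majority_vote(classifiers, X):
    # Classifier layer: collect h_i(x) for every member (N x n_samples)
    votes = np.array([clf.predict(X) for clf in classifiers])
    # Combination layer g: the most frequent label per sample
    # (labels assumed to be non-negative integers; ties go to the smaller label)
    return np.array([np.bincount(col).argmax() for col in votes.T])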
Ensemble Learning
How to construct individual classifiers?
Ensemble Pruning
Recent work has considered an additional intermediate phase that deals with the
reduction of the ensemble size before combination.
This phase has several names in the literature such as ensemble pruning,
selective ensemble, ensemble thinning and classifier selection.
Classifier selection is important for two reasons: classification accuracy and
efficiency.
An ensemble may contain not only accurate classifiers but also classifiers with
lower accuracy. The key to an effective ensemble is to remove the poorly
performing classifiers while maintaining good diversity among the remaining
ensemble members.
The second reason, efficiency, is equally important. Having a very large number
of classifiers in an ensemble adds considerable computational overhead. For instance,
decision trees may have large memory requirements, and lazy learning methods
have a considerable computational cost during the classification phase.
Information Theory
Entropy
H(X) = - \sum_{x_j \in \mathcal{X}} p(X = x_j) \log_2 p(X = x_j)    (1)
Conditional Entropy
H(X \mid Y) = - \sum_{y \in \mathcal{Y}} p(Y = y) \sum_{x \in \mathcal{X}} p(X = x \mid Y = y) \log_2 p(X = x \mid Y = y)    (2)
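A minimal plug-in (maximum-likelihood) estimator of Eqs. (1) and (2) from samples of discrete values, e.g. classifier outputs on a validation set; this Python sketch is an illustration, not part of the paper, and ignores estimation bias.

from collections import Counter
import numpy as np

def entropy(x):
    # Empirical entropy H(X) in bits, Eq. (1)
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return float(-np.sum(probs * np.log2(probs)))

def conditional_entropy(x, y):
    # Empirical conditional entropy H(X|Y) in bits, Eq. (2)
    n = len(y)
    h = 0.0
    for value, count in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == value]
        h += (count / n) * entropy(x_given_y)
    return h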
Information Theory
Shannon Mutual Information
I(X; Y) = H(X) - H(X \mid Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}    (3)
Shannon Conditional Mutual Information
I(X_1; X_2 \mid Y) = H(X_1 \mid Y) - H(X_1 \mid X_2, Y) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} p(x_1, x_2 \mid y) \log_2 \frac{p(x_1, x_2 \mid y)}{p(x_1 \mid y)\, p(x_2 \mid y)}    (4)
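Eqs. (3) and (4) follow from the identities I(X;Y) = H(X) - H(X|Y) and I(X1;X2|Y) = H(X1|Y) - H(X1|X2,Y), so they can be estimated with the entropy() and conditional_entropy() helpers from the previous sketch (an assumption of this illustration); the joint conditioning variable (X2, Y) is encoded as tuples.

def mutual_information(x, y):
    # I(X; Y) = H(X) - H(X|Y), Eq. (3)
    return entropy(x) - conditional_entropy(x, y)

def conditional_mutual_information(x1, x2, y):
    # I(X1; X2 | Y) = H(X1|Y) - H(X1 | X2, Y), Eq. (4)
    x2y = list(zip(x2, y))   # joint conditioning variable (X2, Y)
    return conditional_entropy(x1, y) - conditional_entropy(x1, x2y)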
Ensemble Pruning meets Information Theory
Information theory provides a bound on the error probability p(g(X_{1:N}) \ne Y) for any
combiner g. The error of predicting the target variable Y from the input X_{1:N} is bounded
from both sides as follows,

\frac{H(Y) - I(X_{1:N}; Y) - 1}{\log_2 |\mathcal{Y}|} \le p(g(X_{1:N}) \ne Y) \le \frac{1}{2} H(Y \mid X_{1:N}).    (5)

I(X_{1:N}; Y) involves the high-dimensional probability distribution p(x_1, x_2, \ldots, x_N, y),
which is hard to estimate in practice. However, it can be decomposed into simpler
terms.
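A hedged numeric illustration of Eq. (5): treating the tuple of all classifier outputs as one discrete variable, the plug-in estimates of H(Y), I(X_{1:N}; Y) and H(Y|X_{1:N}) give both bounds. It reuses the estimators sketched above and is only meant to show how the quantities fit together, not how the paper computes them.

import math

def error_bounds(predictions, y):
    # predictions: list of N per-classifier prediction arrays; y: true labels
    joint = list(zip(*predictions))            # X_{1:N} as one tuple-valued variable
    h_y = entropy(y)
    h_y_given_x = conditional_entropy(y, joint)
    mi = h_y - h_y_given_x                     # I(X_{1:N}; Y)
    lower = (h_y - mi - 1.0) / math.log2(len(set(y)))   # Fano-style lower bound
    upper = 0.5 * h_y_given_x                  # upper bound of Eq. (5)
    return max(0.0, lower), upper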
Interaction Information
Shannon’s Mutual Information I(X_1; X_2) is a function of two variables; it cannot
measure properties of multiple (N) variables.
McGill introduced Interaction Information as a multivariate generalization of
Shannon’s Mutual Information.
For instance, the Interaction Information among three random variables is

I(\{X_1, X_2, X_3\}) = I(X_1; X_2 \mid X_3) - I(X_1; X_2)    (6)

The general form for a set S of arbitrary size is defined recursively,

I(S \cup \{X\}) = I(S \mid X) - I(S)    (7)
W. McGill, "Multivariate information transmission," IEEE Trans. on Information Theory,
vol. 4, no. 4, pp. 93–111, 1954.
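The three-variable case of Eq. (6) in code, assuming the mutual_information() and conditional_mutual_information() helpers from the earlier sketch; under this sign convention, negative values indicate redundancy among the variables and positive values indicate synergy.

def interaction_information_3(x1, x2, x3):
    # I({X1, X2, X3}) = I(X1; X2 | X3) - I(X1; X2), Eq. (6)
    return conditional_mutual_information(x1, x2, x3) - mutual_information(x1, x2)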
Mutual Information Decomposition
Theorem
Given a set of classifiers S = {X1, . . . , XN } and a target class label Y, the Shannon
mutual information between X1:N and Y can be decomposed into a sum of Interaction
Information terms,
I(X_{1:N}; Y) = \sum_{T \subseteq S,\, |T| \ge 1} I(T \cup \{Y\}).    (8)
For a set of classifiers S = {X_1, X_2, X_3}, the mutual information between the
joint variable X_{1:3} and a target Y can be decomposed as

I(X_{1:3}; Y) = I(X_1; Y) + I(X_2; Y) + I(X_3; Y)
             + I(\{X_1, X_2, Y\}) + I(\{X_1, X_3, Y\}) + I(\{X_2, X_3, Y\})
             + I(\{X_1, X_2, X_3, Y\})
Each term can then be decomposed into class unconditional I(X) and
conditional I(X|Y) terms according to Eq. (6).

I(X_{1:3}; Y) = \sum_{i=1}^{3} I(X_i; Y) - \sum_{X \subseteq S,\, |X| = 2, 3} I(X) + \sum_{X \subseteq S,\, |X| = 2, 3} I(X \mid Y)
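For example, applying Eq. (6) with Y as the third variable shows how each size-two term splits into the two parts used above (a worked step, not shown on the slide):

I(\{X_1, X_2, Y\}) = I(X_1; X_2 \mid Y) - I(X_1; X_2)

so every pair of classifiers contributes a class-conditional term I(X_1; X_2 | Y) and subtracts an unconditional redundancy term I(X_1; X_2).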
Mutual Information Decomposition (cont’d)
For an ensemble S of size N and according to Eq. (7),

I(X_{1:N}; Y) = \sum_{i=1}^{N} I(X_i; Y) - \sum_{X \subseteq S,\, 2 \le |X| \le N} I(X) + \sum_{X \subseteq S,\, 2 \le |X| \le N} I(X \mid Y)    (9)
We assume that there exist only pairwise unconditional and conditional
interactions and omit higher-order terms.

I(X_{1:N}; Y) \approx \sum_{i=1}^{N} I(X_i; Y) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j \mid Y)    (10)
G. Brown, A new perspective for information theoretic feature selection, in Proc. of the
12th Int. Conf. on Artificial Intelligence and Statistics (AI-STATS 2009), 2009.
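Eq. (10) can be estimated directly from classifier outputs on a validation set. The sketch below uses the mutual_information() and conditional_mutual_information() helpers assumed earlier; it illustrates the approximation and is not the authors' implementation.

from itertools import combinations

def ensemble_information_pairwise(predictions, y):
    # Pairwise approximation of I(X_{1:N}; Y), Eq. (10):
    # relevance - unconditional redundancy + class-conditional redundancy
    relevance = sum(mutual_information(p, y) for p in predictions)
    redundancy = sum(mutual_information(pi, pj)
                     for pi, pj in combinations(predictions, 2))
    cond_redundancy = sum(conditional_mutual_information(pi, pj, y)
                          for pi, pj in combinations(predictions, 2))
    return relevance - redundancy + cond_redundancy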
Classifier Selection Criterion
The objective of an information-theoretic classifier selection method is to select
a subset S of K classifiers from a pool Ω of N classifiers, constructed by any
ensemble learning algorithm, such that S carries as much information as possible
about the target class, using a predefined selection criterion,

J(X_{u(j)}) = I(X_{1:k+1}; Y) - I(X_{1:k}; Y)
            = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} \mid Y)    (11)
That is, J(X_{u(j)}) is the difference in information after and before the addition of X_{u(j)} to S.
This tells us that the best classifier to add is a trade-off between three components: the
relevance of the classifier, the unconditional correlations, and the
class-conditional correlations. To trade off these components,
Eq. (11) [Brown, AI-STATS 2009] can be parameterized to define the root
criterion,

J(X_{u(j)}) = I(X_{u(j)}; Y) - \beta \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \gamma \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} \mid Y).    (12)
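The root criterion of Eq. (12) as a score function; `candidate` is the output vector of X_{u(j)} on a validation set, `selected` holds the outputs of the classifiers already in S, and β, γ are the trade-off parameters. A sketch reusing the estimators assumed above.

def root_criterion(candidate, selected, y, beta=1.0, gamma=1.0):
    # J(X_u(j)) = I(X_u(j); Y) - beta * sum_i I(X_u(j); X_v(i))
    #                          + gamma * sum_i I(X_u(j); X_v(i) | Y), Eq. (12)
    relevance = mutual_information(candidate, y)
    redundancy = sum(mutual_information(candidate, s) for s in selected)
    cond_redundancy = sum(conditional_mutual_information(candidate, s, y)
                          for s in selected)
    return relevance - beta * redundancy + gamma * cond_redundancy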
Classifier Selection Algorithm
1: Select the most relevant classifier, v(1) = arg max_{1 ≤ j ≤ N} I(X_j; Y)
2: S = {X_{v(1)}}
3: for k = 1 : K − 1 do
4:   for j = 1 : |Ω \ S| do
5:     Calculate J(X_{u(j)}) as defined in Eq. (12)
6:   end for
7:   v(k + 1) = arg max_{1 ≤ j ≤ |Ω \ S|} J(X_{u(j)})
8:   S = S ∪ {X_{v(k+1)}}
9: end for
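A direct, illustrative transcription of the greedy loop above; `pool` is a list of per-classifier prediction arrays on a validation set and the score is the root_criterion() sketch from the previous slide. It returns the indices of the K selected classifiers; this is one reading of the pseudocode, not the authors' code.

def select_classifiers(pool, y, K, beta=1.0, gamma=1.0):
    remaining = list(range(len(pool)))
    # Step 1: the most relevant classifier
    first = max(remaining, key=lambda j: mutual_information(pool[j], y))
    selected = [first]
    remaining.remove(first)
    # Steps 3-9: repeatedly add the candidate with the highest J, Eq. (12)
    for _ in range(K - 1):
        best = max(remaining,
                   key=lambda j: root_criterion(pool[j],
                                                [pool[i] for i in selected],
                                                y, beta, gamma))
        selected.append(best)
        remaining.remove(best)
    return selected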
Classifier Selection Heuristics
Maximal Relevance (MR)

J(X_{u(j)}) = I(X_{u(j)}; Y)    (13)

Mutual Information Feature Selection (MIFS) [Battiti, 1994]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (14)

Minimal Redundancy Maximal Relevance (mRMR) [Peng et al., 2005]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \frac{1}{|S|} \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (15)
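Read against the root criterion of Eq. (12), MR corresponds to β = γ = 0, MIFS to β = 1, γ = 0, and mRMR replaces the fixed β by 1/|S|. The sketch below expresses Eqs. (13)-(15) through the root_criterion() helper assumed earlier; the mapping is read off the equations, not quoted from the paper.

def score_mr(candidate, selected, y):
    # Maximal Relevance, Eq. (13): beta = gamma = 0
    return root_criterion(candidate, selected, y, beta=0.0, gamma=0.0)

def score_mifs(candidate, selected, y):
    # MIFS, Eq. (14): beta = 1, gamma = 0
    return root_criterion(candidate, selected, y, beta=1.0, gamma=0.0)

def score_mrmr(candidate, selected, y):
    # mRMR, Eq. (15): beta = 1/|S|, gamma = 0
    beta = 1.0 / len(selected) if selected else 0.0
    return root_criterion(candidate, selected, y, beta=beta, gamma=0.0)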
Classifier Selection Heuristics (cont’d)
Joint Mutual Information (JMI) [Yang and Moody, 1999]

J(X_{u(j)}) = \sum_{i=1}^{k} I(X_{u(j)}, X_{v(i)}; Y)    (16)

Conditional Infomax Feature Extraction (CIFE) [Lin and Tang, 2006]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} \mid Y) \right]    (17)

Conditional Mutual Information Maximization (CMIM) [Fleuret, 2004]

J(X_{u(j)}) = \min_{1 \le i \le k} I(X_{u(j)}; Y \mid X_{v(i)})
            = I(X_{u(j)}; Y) - \max_{1 \le i \le k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} \mid Y) \right]    (18)
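CIFE corresponds to β = γ = 1 in Eq. (12), while JMI and CMIM do not reduce to a single (β, γ) setting: JMI sums joint informations and CMIM takes a maximum penalty rather than a sum. A sketch of these two, reusing the estimators assumed earlier; encoding the joint variable (X_{u(j)}, X_{v(i)}) as tuples is an implementation choice of this illustration.

def score_jmi(candidate, selected, y):
    # JMI, Eq. (16): sum of I((X_u(j), X_v(i)); Y) over the selected classifiers
    return sum(mutual_information(list(zip(candidate, s)), y) for s in selected)

def score_cmim(candidate, selected, y):
    # CMIM, Eq. (18): relevance minus the largest pairwise penalty
    relevance = mutual_information(candidate, y)
    if not selected:
        return relevance
    penalty = max(mutual_information(candidate, s)
                  - conditional_mutual_information(candidate, s, y)
                  for s in selected)
    return relevance - penalty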
Experimental Results
Bagging and Random Forest were used to construct a pool of 50 decision trees (N = 50).
Each selection criterion is evaluated with K = 40 (20% pruned), 30 (40%), 20 (60%), and
10 (80%).
11 data sets from the UCI machine learning repository.
Results are the average over 5 runs of 10-fold cross-validation.

normalized_test_acc = (pruned_ens_acc − single_tree_acc) / (unpruned_ens_acc − single_tree_acc)
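The normalization above as a one-line helper (trivial, shown only to make the convention explicit; it assumes the unpruned ensemble improves on a single tree, otherwise the denominator is not positive):

def normalized_test_acc(pruned_ens_acc, unpruned_ens_acc, single_tree_acc):
    # 1.0 means the pruned ensemble matches the unpruned one; > 1.0 means it is better
    return (pruned_ens_acc - single_tree_acc) / (unpruned_ens_acc - single_tree_acc)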
id   name                   Classes  Examples  Discrete features  Continuous features
d1   anneal                 6        898       32                 6
d2   autos                  7        205       10                 16
d3   wisconsin-breast       2        699       0                  9
d4   bupa liver disorders   2        345       0                  6
d5   german-credit          2        1000      13                 7
d6   pima-diabetes          2        768       0                  8
d7   glass                  7        214       0                  9
d8   cleveland-heart        2        303       7                  6
d9   hepatitis              2        155       13                 6
d10  ionosphere             2        351       0                  34
d11  vehicle                4        846       0                  18
Results
Figure: Comparison of the normalized test accuracy of the ensemble of C4.5 decision trees constructed by Bagging
Results (cont’d)
Figure: Comparison of the normalized test accuracy of the ensemble of random trees constructed by Random Forest
Conclusion
This paper examined the issue of classifier selection from an information-theoretic
viewpoint. The main advantage of information-theoretic criteria is that
they capture higher-order statistics of the data.
The ensemble mutual information is decomposed into accuracy and diversity
components.
Although diversity is represented by both low-order and higher-order terms, we keep only
the first-order (pairwise) interaction terms in this paper. In future work, we will study the
influence of including the higher-order terms on pruning performance.
We selected some points within the continuous space of possible selection criteria
that correspond to well-known feature selection criteria, such as mRMR, CIFE, JMI
and CMIM, and used them for classifier selection. In future work, we will explore
other points in this space that may lead to more effective pruning.
We also plan to extend the algorithm to prune ensembles of regression estimators.
Thanks for your attention
Questions?