When Classifier Selection meets
Information Theory: A Unifying View
Mohamed Abdel Hady, Friedhelm Schwenker,
Günther Palm
Institute of Neural Information Processing
University of Ulm, Germany
December 8, 2010
Outline
1 Ensemble Learning
2 Ensemble Pruning
3 Information Theory
4 Ensemble Pruning meets Information Theory
5 Experimental Results
6 Conclusion and Future Work
Ensemble Learning
An ensemble is a set of accurate and diverse classifiers. The objective is that the
ensemble outperforms its member classifiers.
[Figure: ensemble architecture. An input x is fed to the classifiers h1, ..., hi, ..., hN (classifier layer); their outputs h1(x), ..., hi(x), ..., hN(x) are fused by the combiner g (combination layer) into the ensemble decision g(x).]
Ensemble learning has become a hot topic in recent years.
Ensemble methods consist of two phases: the construction of multiple individual
classifiers and their combination.
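As a minimal illustration of the combination layer (not from the slides), the Python sketch below fuses the crisp predictions h_i(x) of already-trained classifiers by majority vote; the scikit-learn-style predict() interface and non-negative integer class labels are assumptions of this sketch.

import numpy as np

def majority_vote(classifiers, X):
    # Classifier layer: collect h_i(x) for every member (N x n_samples)
    votes = np.array([clf.predict(X) for clf in classifiers])
    # Combination layer g: the most frequent label per sample
    # (labels assumed to be non-negative integers; ties go to the smaller label)
    return np.array([np.bincount(col).argmax() for col in votes.T])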
Ensemble Learning
How to construct individual classifiers?
Ensemble Pruning
Recent work has considered an additional intermediate phase that deals with the
reduction of the ensemble size before combination.
This phase has several names in the literature such as ensemble pruning,
selective ensemble, ensemble thinning and classifier selection.
Classifier selection is important for two reasons: classification accuracy and
efficiency.
An ensemble may contain not only accurate classifiers but also classifiers with
lower accuracy. The key to an effective ensemble is to remove the poorly
performing classifiers while maintaining good diversity among the remaining
ensemble members.
The second reason, efficiency, is equally important. Having a very large number
of classifiers in an ensemble adds considerable computational overhead. For instance,
decision trees may have large memory requirements, and lazy learning methods
have a considerable computational cost during the classification phase.
Information Theory
Entropy
H(X) = - \sum_{x_j \in \mathcal{X}} p(X = x_j) \log_2 p(X = x_j)    (1)
Conditional Entropy
H(X \mid Y) = - \sum_{y \in \mathcal{Y}} p(Y = y) \sum_{x \in \mathcal{X}} p(X = x \mid Y = y) \log_2 p(X = x \mid Y = y)    (2)
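A minimal plug-in (maximum-likelihood) estimator of Eqs. (1) and (2) from samples of discrete values, e.g. classifier outputs on a validation set; this Python sketch is an illustration, not part of the paper, and ignores estimation bias.

from collections import Counter
import numpy as np

def entropy(x):
    # Empirical entropy H(X) in bits, Eq. (1)
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return float(-np.sum(probs * np.log2(probs)))

def conditional_entropy(x, y):
    # Empirical conditional entropy H(X|Y) in bits, Eq. (2)
    n = len(y)
    h = 0.0
    for value, count in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == value]
        h += (count / n) * entropy(x_given_y)
    return h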
Information Theory
Shannon Mutual Information
I(X; Y) = H(X) - H(X \mid Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}    (3)
Shannon Conditional Mutual Information
I(X_1; X_2 \mid Y) = H(X_1 \mid Y) - H(X_1 \mid X_2, Y) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} p(x_1, x_2 \mid y) \log_2 \frac{p(x_1, x_2 \mid y)}{p(x_1 \mid y)\, p(x_2 \mid y)}    (4)
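Eqs. (3) and (4) follow from the identities I(X;Y) = H(X) - H(X|Y) and I(X1;X2|Y) = H(X1|Y) - H(X1|X2,Y), so they can be estimated with the entropy() and conditional_entropy() helpers from the previous sketch (an assumption of this illustration); the joint conditioning variable (X2, Y) is encoded as tuples.

def mutual_information(x, y):
    # I(X; Y) = H(X) - H(X|Y), Eq. (3)
    return entropy(x) - conditional_entropy(x, y)

def conditional_mutual_information(x1, x2, y):
    # I(X1; X2 | Y) = H(X1|Y) - H(X1 | X2, Y), Eq. (4)
    x2y = list(zip(x2, y))   # joint conditioning variable (X2, Y)
    return conditional_entropy(x1, y) - conditional_entropy(x1, x2y)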
Ensemble Pruning meets Information Theory
Information theory provides a bound on the error probability p(g(X_{1:N}) \ne Y) for any
combiner g. The error of predicting the target variable Y from the input X_{1:N} is bounded
from both sides as follows,

\frac{H(Y) - I(X_{1:N}; Y) - 1}{\log_2 |\mathcal{Y}|} \le p(g(X_{1:N}) \ne Y) \le \frac{1}{2} H(Y \mid X_{1:N}).    (5)

I(X_{1:N}; Y) involves the high-dimensional probability distribution p(x_1, x_2, \ldots, x_N, y),
which is hard to estimate in practice. However, it can be decomposed into simpler
terms.
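A hedged numeric illustration of Eq. (5): treating the tuple of all classifier outputs as one discrete variable, the plug-in estimates of H(Y), I(X_{1:N}; Y) and H(Y|X_{1:N}) give both bounds. It reuses the estimators sketched above and is only meant to show how the quantities fit together, not how the paper computes them.

import math

def error_bounds(predictions, y):
    # predictions: list of N per-classifier prediction arrays; y: true labels
    joint = list(zip(*predictions))            # X_{1:N} as one tuple-valued variable
    h_y = entropy(y)
    h_y_given_x = conditional_entropy(y, joint)
    mi = h_y - h_y_given_x                     # I(X_{1:N}; Y)
    lower = (h_y - mi - 1.0) / math.log2(len(set(y)))   # Fano-style lower bound
    upper = 0.5 * h_y_given_x                  # upper bound of Eq. (5)
    return max(0.0, lower), upper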
Interaction Information
Shannon’s Mutual Information I(X_1; X_2) is a function of two variables; it cannot
measure properties of multiple (N) variables.
McGill introduced Interaction Information as a multivariate generalization of
Shannon’s Mutual Information.
For instance, the Interaction Information among three random variables is

I(\{X_1, X_2, X_3\}) = I(X_1; X_2 \mid X_3) - I(X_1; X_2)    (6)

The general form for a set S of arbitrary size is defined recursively,

I(S \cup \{X\}) = I(S \mid X) - I(S)    (7)
W. McGill, "Multivariate information transmission," IEEE Trans. on Information Theory,
vol. 4, no. 4, pp. 93–111, 1954.
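The three-variable case of Eq. (6) in code, assuming the mutual_information() and conditional_mutual_information() helpers from the earlier sketch; under this sign convention, negative values indicate redundancy among the variables and positive values indicate synergy.

def interaction_information_3(x1, x2, x3):
    # I({X1, X2, X3}) = I(X1; X2 | X3) - I(X1; X2), Eq. (6)
    return conditional_mutual_information(x1, x2, x3) - mutual_information(x1, x2)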
Mutual Information Decomposition
Theorem
Given a set of classifiers S = {X1, . . . , XN } and a target class label Y, the Shannon
mutual information between X1:N and Y can be decomposed into a sum of Interaction
Information terms,
I(X_{1:N}; Y) = \sum_{T \subseteq S,\, |T| \ge 1} I(T \cup \{Y\}).    (8)
For a set of classifiers S = {X_1, X_2, X_3}, the mutual information between the
joint variable X_{1:3} and a target Y can be decomposed as

I(X_{1:3}; Y) = I(X_1; Y) + I(X_2; Y) + I(X_3; Y)
             + I(\{X_1, X_2, Y\}) + I(\{X_1, X_3, Y\}) + I(\{X_2, X_3, Y\})
             + I(\{X_1, X_2, X_3, Y\})
Each term can then be decomposed into class unconditional I(X) and
conditional I(X|Y) terms according to Eq. (6).

I(X_{1:3}; Y) = \sum_{i=1}^{3} I(X_i; Y) - \sum_{X \subseteq S,\, |X| = 2, 3} I(X) + \sum_{X \subseteq S,\, |X| = 2, 3} I(X \mid Y)
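For example, applying Eq. (6) with Y as the third variable shows how each size-two term splits into the two parts used above (a worked step, not shown on the slide):

I(\{X_1, X_2, Y\}) = I(X_1; X_2 \mid Y) - I(X_1; X_2)

so every pair of classifiers contributes a class-conditional term I(X_1; X_2 | Y) and subtracts an unconditional redundancy term I(X_1; X_2).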
Mutual Information Decomposition (cont’d)
For an ensemble S of size N and according to Eq. (7),

I(X_{1:N}; Y) = \sum_{i=1}^{N} I(X_i; Y) - \sum_{X \subseteq S,\, 2 \le |X| \le N} I(X) + \sum_{X \subseteq S,\, 2 \le |X| \le N} I(X \mid Y)    (9)
We assume that there exist only pairwise unconditional and conditional
interactions and omit higher-order terms.

I(X_{1:N}; Y) \approx \sum_{i=1}^{N} I(X_i; Y) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j \mid Y)    (10)
G. Brown, A new perspective for information theoretic feature selection, in Proc. of the
12th Int. Conf. on Artificial Intelligence and Statistics (AI-STATS 2009), 2009.
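Eq. (10) can be estimated directly from classifier outputs on a validation set. The sketch below uses the mutual_information() and conditional_mutual_information() helpers assumed earlier; it illustrates the approximation and is not the authors' implementation.

from itertools import combinations

def ensemble_information_pairwise(predictions, y):
    # Pairwise approximation of I(X_{1:N}; Y), Eq. (10):
    # relevance - unconditional redundancy + class-conditional redundancy
    relevance = sum(mutual_information(p, y) for p in predictions)
    redundancy = sum(mutual_information(pi, pj)
                     for pi, pj in combinations(predictions, 2))
    cond_redundancy = sum(conditional_mutual_information(pi, pj, y)
                          for pi, pj in combinations(predictions, 2))
    return relevance - redundancy + cond_redundancy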
Classifier Selection Criterion
The objective of an information-theoretic classifier selection method is to select
a subset S of K classifiers from a pool Ω of N classifiers, constructed by any
ensemble learning algorithm, such that S carries as much information as possible
about the target class, using a predefined selection criterion,

J(X_{u(j)}) = I(X_{1:k+1}; Y) - I(X_{1:k}; Y)
            = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} \mid Y)    (11)
That is, J(X_{u(j)}) is the difference in information after and before the addition of X_{u(j)} to S.
This tells us that the best classifier to add is a trade-off between three components: the
relevance of the classifier, the unconditional correlations, and the
class-conditional correlations. To trade off these components,
Eq. (11) [Brown, AI-STATS 2009] can be parameterized to define the root
criterion,

J(X_{u(j)}) = I(X_{u(j)}; Y) - \beta \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \gamma \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} \mid Y).    (12)
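The root criterion of Eq. (12) as a score function; `candidate` is the output vector of X_{u(j)} on a validation set, `selected` holds the outputs of the classifiers already in S, and β, γ are the trade-off parameters. A sketch reusing the estimators assumed above.

def root_criterion(candidate, selected, y, beta=1.0, gamma=1.0):
    # J(X_u(j)) = I(X_u(j); Y) - beta * sum_i I(X_u(j); X_v(i))
    #                          + gamma * sum_i I(X_u(j); X_v(i) | Y), Eq. (12)
    relevance = mutual_information(candidate, y)
    redundancy = sum(mutual_information(candidate, s) for s in selected)
    cond_redundancy = sum(conditional_mutual_information(candidate, s, y)
                          for s in selected)
    return relevance - beta * redundancy + gamma * cond_redundancy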
Classifier Selection Algorithm
1: Select the most relevant classifier, v(1) = arg max_{1 ≤ j ≤ N} I(X_j; Y)
2: S = {X_{v(1)}}
3: for k = 1 : K − 1 do
4:   for j = 1 : |Ω \ S| do
5:     Calculate J(X_{u(j)}) as defined in Eq. (12)
6:   end for
7:   v(k + 1) = arg max_{1 ≤ j ≤ |Ω \ S|} J(X_{u(j)})
8:   S = S ∪ {X_{v(k+1)}}
9: end for
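A direct, illustrative transcription of the greedy loop above; `pool` is a list of per-classifier prediction arrays on a validation set and the score is the root_criterion() sketch from the previous slide. It returns the indices of the K selected classifiers; this is one reading of the pseudocode, not the authors' code.

def select_classifiers(pool, y, K, beta=1.0, gamma=1.0):
    remaining = list(range(len(pool)))
    # Step 1: the most relevant classifier
    first = max(remaining, key=lambda j: mutual_information(pool[j], y))
    selected = [first]
    remaining.remove(first)
    # Steps 3-9: repeatedly add the candidate with the highest J, Eq. (12)
    for _ in range(K - 1):
        best = max(remaining,
                   key=lambda j: root_criterion(pool[j],
                                                [pool[i] for i in selected],
                                                y, beta, gamma))
        selected.append(best)
        remaining.remove(best)
    return selected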
Classifier Selection Heuristics
Maximal Relevance (MR)

J(X_{u(j)}) = I(X_{u(j)}; Y)    (13)

Mutual Information Feature Selection (MIFS) [Battiti, 1994]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (14)

Minimal Redundancy Maximal Relevance (mRMR) [Peng et al., 2005]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \frac{1}{|S|} \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (15)
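Read against the root criterion of Eq. (12), MR corresponds to β = γ = 0, MIFS to β = 1, γ = 0, and mRMR replaces the fixed β by 1/|S|. The sketch below expresses Eqs. (13)-(15) through the root_criterion() helper assumed earlier; the mapping is read off the equations, not quoted from the paper.

def score_mr(candidate, selected, y):
    # Maximal Relevance, Eq. (13): beta = gamma = 0
    return root_criterion(candidate, selected, y, beta=0.0, gamma=0.0)

def score_mifs(candidate, selected, y):
    # MIFS, Eq. (14): beta = 1, gamma = 0
    return root_criterion(candidate, selected, y, beta=1.0, gamma=0.0)

def score_mrmr(candidate, selected, y):
    # mRMR, Eq. (15): beta = 1/|S|, gamma = 0
    beta = 1.0 / len(selected) if selected else 0.0
    return root_criterion(candidate, selected, y, beta=beta, gamma=0.0)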
Classifier Selection Heuristics (cont’d)
Joint Mutual Information (JMI) [Yang and Moody, 1999]

J(X_{u(j)}) = \sum_{i=1}^{k} I(X_{u(j)}, X_{v(i)}; Y)    (16)

Conditional Infomax Feature Extraction (CIFE) [Lin and Tang, 2006]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} \mid Y) \right]    (17)

Conditional Mutual Information Maximization (CMIM) [Fleuret, 2004]

J(X_{u(j)}) = \min_{1 \le i \le k} I(X_{u(j)}; Y \mid X_{v(i)})
            = I(X_{u(j)}; Y) - \max_{1 \le i \le k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} \mid Y) \right]    (18)
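CIFE corresponds to β = γ = 1 in Eq. (12), while JMI and CMIM do not reduce to a single (β, γ) setting: JMI sums joint informations and CMIM takes a maximum penalty rather than a sum. A sketch of these two, reusing the estimators assumed earlier; encoding the joint variable (X_{u(j)}, X_{v(i)}) as tuples is an implementation choice of this illustration.

def score_jmi(candidate, selected, y):
    # JMI, Eq. (16): sum of I((X_u(j), X_v(i)); Y) over the selected classifiers
    return sum(mutual_information(list(zip(candidate, s)), y) for s in selected)

def score_cmim(candidate, selected, y):
    # CMIM, Eq. (18): relevance minus the largest pairwise penalty
    relevance = mutual_information(candidate, y)
    if not selected:
        return relevance
    penalty = max(mutual_information(candidate, s)
                  - conditional_mutual_information(candidate, s, y)
                  for s in selected)
    return relevance - penalty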
Experimental Results
Bagging and Random Forest were used to construct a pool of 50 decision trees (N = 50).
Each selection criterion is evaluated with K = 40 (20% pruned), 30 (40%), 20 (60%), and
10 (80%).
11 data sets from the UCI machine learning repository.
Results are the average over 5 runs of 10-fold cross-validation.

normalized_test_acc = (pruned_ens_acc − single_tree_acc) / (unpruned_ens_acc − single_tree_acc)
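The normalization above as a one-line helper (trivial, shown only to make the convention explicit; it assumes the unpruned ensemble improves on a single tree, otherwise the denominator is not positive):

def normalized_test_acc(pruned_ens_acc, unpruned_ens_acc, single_tree_acc):
    # 1.0 means the pruned ensemble matches the unpruned one; > 1.0 means it is better
    return (pruned_ens_acc - single_tree_acc) / (unpruned_ens_acc - single_tree_acc)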
id   name                   Classes  Examples  Discrete features  Continuous features
d1   anneal                 6        898       32                 6
d2   autos                  7        205       10                 16
d3   wisconsin-breast       2        699       0                  9
d4   bupa liver disorders   2        345       0                  6
d5   german-credit          2        1000      13                 7
d6   pima-diabetes          2        768       0                  8
d7   glass                  7        214       0                  9
d8   cleveland-heart        2        303       7                  6
d9   hepatitis              2        155       13                 6
d10  ionosphere             2        351       0                  34
d11  vehicle                4        846       0                  18
Results
Figure: Comparison of the normalized test accuracy of the ensemble of C4.5 decision trees constructed by Bagging
Results (cont’d)
Figure: Comparison of the normalized test accuracy of the ensemble of random trees constructed by Random Forest
Conclusion
This paper examined the issue of classifier selection from an information-theoretic
viewpoint. The main advantage of information-theoretic criteria is that
they capture higher-order statistics of the data.
The ensemble mutual information is decomposed into accuracy and diversity
components.
Although diversity is represented by both low-order and higher-order terms, we keep only
the first-order (pairwise) interaction terms in this paper. In future work, we will study the
influence of including the higher-order terms on pruning performance.
We selected some points within the continuous space of possible selection criteria
that correspond to well-known feature selection criteria, such as mRMR, CIFE, JMI
and CMIM, and used them for classifier selection. In future work, we will explore
other points in this space that may lead to more effective pruning.
We also plan to extend the algorithm to prune ensembles of regression estimators.
Thanks for your attention
Questions?