An Experimental Study about Simple Decision Trees for
Bagging Ensemble on Datasets with Classification Noise
Joaquín Abellán and Andrés R. Masegosa
Department of Computer Science and Artificial Intelligence
University of Granada
Verona, July 2009
10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty
Part I
Introduction
Introduction
Ensembles of Decision Trees (DT)
Features
They usually build a different DT for each sample drawn from the training dataset.
The final prediction is a combination of the individual predictions of the trees.
They take advantage of the inherent instability of DT.
Bagging, AdaBoost and Randomization are the best-known approaches.
Introduction
Classification Noise (CN) in the class values
Definition
The class values of the samples given to the learning algorithm have some
errors.
Random classification noise: the label of each example is flipped randomly
and independently with some fixed probability called the noise rate.
Causes
It is mainly due to errors in the data capture process.
Very common in real world applications: surveys, biological or medical
information...
Effects on ensembles of decision trees
The presence of classification noise degrades the performance of any classification inducer.
AdaBoost is known to be strongly affected by classification noise.
Bagging is the ensemble approach with the best response to classification noise.
Introduction
Motivation of this study
Description
Decision trees built with different split criteria are considered in a Bagging scheme.
Common split criteria (InfoGain, InfoGain Ratio and Gini Index) and a new split criterion based on imprecise probabilities are analyzed.
The aim is to determine which split criterion is most robust to the presence of classification noise.
Outline
Description of the different split criteria.
Bagging Decision Trees.
Experimental Results.
Conclusions and Future Work.
Part II
Split Criteria
Split Criteria
Decision Trees
Description
Attributes are placed at the internal nodes.
Class values are placed at the leaves.
Each leaf corresponds to a decision rule.
Learning
The split criterion selects the attribute to place at each branching node.
The stop criterion decides when to fix a leaf and stop branching.
Split Criteria
Classic Split Criteria
Description
A real-valued function which measures the goodness of an attribute X as a split node in the decision tree.
A local measure that allows a recursive building of the decision tree.
Information Gain (IG)
Introduced by Quinlan as the basis of his ID3 model [18].
It is based on Shannon’s entropy.
IG(X, C) = H(C) - H(C|X) = \sum_{i,j} p(c_j, x_i) \log \frac{p(c_j, x_i)}{p(c_j)\, p(x_i)}
Tendency to select attributes with a high number of states.
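To make the formula concrete, here is a minimal numeric sketch (not the authors' code) that computes IG from a |X| x |C| contingency table of counts; the helper name info_gain and the toy table are ours.

```python
import numpy as np

def info_gain(counts):
    """IG(X, C) = H(C) - H(C|X) from a |X| x |C| contingency table of counts."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_c = counts.sum(axis=0) / N          # class marginal p(c_j)
    p_x = counts.sum(axis=1) / N          # attribute marginal p(x_i)

    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # H(C|X) = sum_i p(x_i) H(C | X = x_i)
    h_c_given_x = sum(
        p_x[i] * H(counts[i] / counts[i].sum())
        for i in range(counts.shape[0]) if counts[i].sum() > 0)
    return H(p_c) - h_c_given_x

# Toy example: attribute with 2 states, class with 2 states
print(info_gain([[30, 10], [5, 55]]))   # a clearly informative split
```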
Split Criteria
Classic Split Criteria
Information Gain Ratio (IGR)
Improved version of IG (Quinlan's C4.5 tree inducer [19]).
Normalizes the information gain by dividing by the entropy of the split attribute.
IGR(X, C) = \frac{IG(X, C)}{H(X)}
Penalizes attributes with many states.
Gini Index (GIx)
Measures the impurity degree of a partition.
Introduced by Breiman as the basis of the CART tree inducer [8].
GIx(X, C) = gini(C) - gini(C|X)
gini(C|X) = \sum_i p(x_i)\, gini(C \mid X = x_i), \qquad gini(C) = 1 - \sum_j p(c_j)^2
Tendency to select attributes with a high number of states.
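Analogous illustrative sketches for IGR and GIx from the same kind of |X| x |C| contingency table; again these are our own self-contained helpers, not the authors' implementation.

```python
import numpy as np

def _H(p):
    """Shannon entropy of a probability vector (zero entries ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain_ratio(counts):
    """IGR(X, C) = IG(X, C) / H(X); assumes X has at least two observed states."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_x = counts.sum(axis=1) / N
    p_c = counts.sum(axis=0) / N
    h_c_given_x = sum(p_x[i] * _H(counts[i] / counts[i].sum())
                      for i in range(len(p_x)) if counts[i].sum() > 0)
    return (_H(p_c) - h_c_given_x) / _H(p_x)

def gini_index(counts):
    """GIx(X, C) = gini(C) - gini(C|X), with gini(C) = 1 - sum_j p(c_j)^2."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_x = counts.sum(axis=1) / N
    p_c = counts.sum(axis=0) / N
    gini_c = 1.0 - (p_c ** 2).sum()
    gini_c_given_x = sum(
        p_x[i] * (1.0 - ((counts[i] / counts[i].sum()) ** 2).sum())
        for i in range(len(p_x)) if counts[i].sum() > 0)
    return gini_c - gini_c_given_x

table = [[30, 10], [5, 55]]
print(info_gain_ratio(table), gini_index(table))
```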
Split Criteria
Split Criteria based on Imprecise Probabilities
Imprecise Information Gain (IIG) [3]
It is based on an uncertainty measure (maximum entropy) for convex sets of probability distributions.
Probability intervals for each state of the class variable are computed from the
dataset using Walley’s Imprecise Dirichlet Model (IDM) [24].
p(c_j) \in \left[ \frac{n_{c_j}}{N + s}, \frac{n_{c_j} + s}{N + s} \right] \equiv I_{c_j}, \qquad p(c_j \mid x_i) \in \left[ \frac{n_{c_j, x_i}}{N_{x_i} + s}, \frac{n_{c_j, x_i} + s}{N_{x_i} + s} \right] \equiv I_{c_j, x_i}
If we denote by K(C) and K(C|X = x_i) the following sets of probability distributions q on \Omega_C:
K(C) = \{ q \mid q(c_j) \in I_{c_j} \}, \qquad K(C \mid X = x_i) = \{ q \mid q(c_j) \in I_{c_j, x_i} \},
the Imprecise Info-Gain for each variable X is defined as:
IIG(X, C) = S(K(C)) - \sum_i p(x_i)\, S(K(C \mid X = x_i))
where S(\cdot) is the maximum entropy of a convex set of probability distributions.
It can be efficiently computed for s=1 [1].
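The sketch below illustrates one way to compute IIG under the IDM with s = 1, assuming the shortcut of sharing the extra mass s/(N+s) equally among the least frequent class values to obtain the maximum-entropy distribution (the s = 1 procedure of [1], as we recall it); the function names and the toy table are ours, not the authors' implementation.

```python
import numpy as np

def max_entropy_idm(counts, s=1.0):
    """Maximum entropy S(K(C)) of the IDM credal set built from class counts.

    Assumption: for s = 1 the maximising distribution is obtained by sharing
    the imprecision mass s/(N+s) equally among the least frequent classes.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    mass = counts.copy()
    minima = np.isclose(counts, counts.min())
    mass[minima] += s / minima.sum()          # share the imprecision mass
    p = mass / (N + s)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def imprecise_info_gain(table, s=1.0):
    """IIG(X, C) = S(K(C)) - sum_i p(x_i) S(K(C|X=x_i)) from a |X| x |C| table."""
    table = np.asarray(table, dtype=float)
    class_counts = table.sum(axis=0)
    p_x = table.sum(axis=1) / table.sum()
    cond = sum(p_x[i] * max_entropy_idm(table[i], s)
               for i in range(table.shape[0]) if table[i].sum() > 0)
    return max_entropy_idm(class_counts, s) - cond

print(imprecise_info_gain([[30, 10], [5, 55]]))
```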
Part III
Bagging Decision Trees
Bagging Decision Trees
Procedure
T_i samples are generated by random sampling with replacement from the initial training dataset.
From each sample T_i, a simple decision tree is built using a given split criterion.
The final prediction is made by majority voting.
Description
As Breiman [9] said about Bagging: "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then Bagging can improve accuracy."
The combination of multiple models reduces the overfitting of the individual decision trees to the dataset (a minimal sketch of the procedure is given below).
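A minimal sketch of this bagging procedure, using scikit-learn decision trees as the base inducer (only its built-in 'entropy' and 'gini' criteria are available there, not IIG) and majority voting; the function names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=100, criterion="entropy", seed=0):
    """Fit n_trees trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)          # sampling with replacement
        tree = DecisionTreeClassifier(criterion=criterion)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagging_predict(trees, X):
    """Majority vote over the individual tree predictions.

    Assumes non-negative integer class labels.
    """
    votes = np.array([t.predict(X) for t in trees])   # (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```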
Part IV
Experiments
Experiments
Experimental Set-up
Datasets Benchmark
25 UCI datasets with very different features.
Missing values were replaced with the mean (continuous attributes) or the mode (discrete attributes).
Continuous attributes were discretized with Fayyad & Irani's method [13].
Preprocessing was carried out using only information from the training datasets.
Evaluated Algorithms
Bagging ensembles of 100 trees.
Different split criteria: IG, IGR, GIx and IIG.
Evaluation Method
Different noise rates were applied to the training datasets (not to the test datasets): 0%, 5%, 10%, 20% and 30%.
10-fold cross-validation repeated 10 times was used to estimate the classification accuracy.
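An illustrative sketch of this protocol: random class noise is injected into the training folds only, and accuracy is estimated with 10 repetitions of 10-fold cross-validation. It assumes scikit-learn (the `estimator` parameter of BaggingClassifier is called `base_estimator` before version 1.2) and integer-encoded labels in numpy arrays; a standard bagging classifier stands in for the authors' ensembles, and the function names are ours.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def add_class_noise(y, rate, rng):
    """Flip each label with probability `rate` to a different class, chosen uniformly."""
    y = y.copy()
    classes = np.unique(y)
    flip = rng.random(len(y)) < rate
    for i in np.where(flip)[0]:
        y[i] = rng.choice(classes[classes != y[i]])
    return y

def noisy_cv_accuracy(X, y, rate, seed=0):
    """10 x 10-fold CV; noise is added to the training folds only."""
    rng = np.random.default_rng(seed)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    scores = []
    for train, test in cv.split(X, y):
        y_train = add_class_noise(y[train], rate, rng)
        clf = BaggingClassifier(
            estimator=DecisionTreeClassifier(criterion="entropy"),
            n_estimators=100, random_state=seed)
        clf.fit(X[train], y_train)
        scores.append(accuracy_score(y[test], clf.predict(X[test])))
    return np.mean(scores)
```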
Experiments
Statistical Tests
Two classifiers on a single dataset
Corrected Paired T-test [26]: A corrected version of the paired T-test
implemented in Weka.
Two classifiers on multiple datasets
Wilcoxon Signed-Ranks Test [25]: A non-parametric test which ranks the differences on each dataset.
Sign Test [20,22]: A binomial test that counts the number of wins, losses and ties across the datasets.
Multiple classifiers on multiple datasets
Friedman Test [15,16]: A non-parametric test that ranks the algorithms on each dataset: the best one gets rank 1, the second one rank 2, and so on. The null hypothesis is that all algorithms perform equally well.
Nemenyi Test [17]: A post-hoc test employed to compare the algorithms among themselves when the Friedman null hypothesis is rejected.
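A sketch of how these tests can be run with scipy (version >= 1.7 is assumed for binomtest); the corrected resampled t-test follows the Nadeau-Bengio variance correction that, to our understanding, underlies Weka's corrected paired t-test [26]. The Nemenyi post-hoc test is omitted, and all function names are ours.

```python
import numpy as np
from scipy import stats

def wilcoxon_and_sign(acc_a, acc_b):
    """Compare two methods over multiple datasets (arrays of per-dataset accuracy)."""
    w_p = stats.wilcoxon(acc_a, acc_b).pvalue                    # Wilcoxon signed-ranks test
    wins = int(np.sum(acc_a > acc_b))
    losses = int(np.sum(acc_a < acc_b))
    sign_p = stats.binomtest(wins, wins + losses, p=0.5).pvalue  # sign test, ties dropped
    return w_p, sign_p

def friedman(*acc_by_method):
    """Friedman test; each argument is one method's per-dataset accuracy array."""
    return stats.friedmanchisquare(*acc_by_method).pvalue

def corrected_paired_ttest(diffs, n_train, n_test):
    """Corrected resampled t-test (Nadeau & Bengio), as we understand Weka's version [26].

    diffs: the k per-run accuracy differences from 10 x 10-fold CV (k = 100);
    the variance is inflated by (1/k + n_test/n_train) to account for the
    overlap between training sets.
    """
    diffs = np.asarray(diffs, dtype=float)
    k = len(diffs)
    var = np.var(diffs, ddof=1)
    t = diffs.mean() / np.sqrt((1.0 / k + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p
```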
Experiments
Average Performance
Analysis
The average accuracy of the four criteria is similar when no noise is introduced.
The introduction of noise deteriorates the performance of all the classifiers.
However, IIG is more robust to noise: its average performance is the highest at every noise level.
Experiments
Corrected Paired T-Test at the 0.05 level
Number of accumulated Wins, Ties and Defeats (W/T/D) of IIG with respect to IG, IGR and GIx on the 25 datasets.
Noise IG IGR GIx
0% 2/22/1 1/23/1 2/22/1
5% 11/14/0 10/15/0 11/14/0
10% 13/12/0 10/15/0 13/12/0
20% 16/9/0 11/14/0 18/7/0
30% 17/8/0 11/14/0 17/8/0
Analysis
Without noise, there is a tie on almost all datasets.
The more noise is added, the higher the number of wins.
IIG wins on a large number of datasets and is not defeated on any of them.
Experiments
Wilcoxon and Sign Test at the 0.05 level
Comparison of IIG with respect to the rest of the split criteria.
'-' indicates no statistically significant difference.
         Wilcoxon Test        Sign Test
Noise    IG    IGR   GIx      IG    IGR   GIx
0%       IIG   -     IIG      IIG   -     IIG
5%       IIG   IIG   IIG      IIG   IIG   IIG
10%      IIG   IIG   IIG      IIG   IIG   IIG
20%      IIG   IIG   IIG      IIG   IIG   IIG
30%      IIG   IIG   IIG      IIG   IIG   IIG
Analysis
Without noise, IIG outperforms IG and GIx, but not IGR.
At every noise level, IIG outperforms the rest of the split criteria.
IGR also outperforms IG and GIx once some noise is present.
Experiments
Friedman Test at the 0.05 level
The ranks assigned by the Friedman test are shown.
The lower the rank, the better the performance.
Ranks in bold face indicate that IIG statistically outperforms that criterion according to the Nemenyi test.
Noise IIG IG IGR GIx
0% 1.86 2.92 2.52 2.70
5% 1.18 3.18 2.54 3.12
10% 1.12 3.26 2.36 3.26
20% 1.12 3.20 2.16 3.52
30% 1.12 3.36 2.26 3.26
Analysis
Without noise, IIG has the best ranking and outperforms IG.
With a noise level higher than 10%, IIG outperforms all the other criteria.
IGR also outperforms IG and GIx when the noise level is higher than 20%.
Experiments
Computational Time
Analysis
Without noise, all split criteria have a similar average running time.
The introduction of noise increases the computational cost of the classifiers.
IIG and GIx consume less time than the other split criteria; IGR is the most time-consuming.
Part V
Conclusions and Future Work
Conclusions and Future Work
Conclusions
An experimental study of the performance of different split criteria in a bagging scheme under classification noise.
Three classic split criteria (IG, IGR and GIx) and a new one based on imprecise probabilities (IIG) were compared.
Bagging with IIG is clearly more robust than the other criteria as the noise level increases.
IGR also performs well under noise, although worse than IIG.
Future Work
Extend the methods to handle continuous attributes and missing values directly.
Further investigate the computational cost of the models, as well as other factors such as the number of trees, pruning...
Introduce new imprecise models.
Thanks for your attention!!
Questions?