Ensemble Learning:
About Ensemble Learning
AAA-Python Edition
Plan
● 1- Ensemble Learning
● 2- Bagging & Pasting
● 3- Features sampling
● 4- Boosting: Adaptive Boosting
● 5- Boosting: Gradient boosting
● 6- Stacking or Blending
1- Ensemble Learning
Concept
● In Ensemble Learning, we combine several models to build a better model. The algorithm used in Ensemble Learning is called an Ensemble Method.
● We can combine classifiers or regressors.
● The models can all be of the same type, or of different types.
[Diagram: Model 1, Model 2, ..., Model n learn from the data and make predictions for new data; the final prediction is then selected.]
Voting
● It's about how to select the final prediction.
● In Classification:
  ➢ Hard Voting: for each sample, each classifier makes a prediction: a class for that sample.
    ➢ Select the class predicted most often by the classifiers, for that sample.
  ➢ Soft Voting: available when the classifiers can predict class probabilities.
    ● Select the class with the highest averaged probability.
● In Regression:
  ● The final prediction is the average of the predicted values.
Ensemble Methods
● Ensemble methods can vary by:
  ● Whether they combine models of the same type or of different types.
  ● Whether a given sample can be selected only once or several times for the same model.
  ● Whether each model uses all the features or only a subset of the features.
  ● Whether the models learn in parallel or sequentially.
  ● The type of mechanism used to make the final prediction.
Example
[Code screenshot: three models are used in this ensemble method; a voting classifier combines them, and each model is given a string label (e.g. 'dt' for a decision tree). Hard voting was used to determine the predicted classes. A sketch of such a setup follows below.]
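The slide's notebook code is only a screenshot, so here is a minimal sketch of what a hard-voting ensemble could look like with scikit-learn's VotingClassifier. The dataset and the two companion models (logistic regression and SVC) are assumptions for illustration; only the 'dt' decision tree label and the hard-voting choice come from the slide.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each model gets a string label, e.g. 'dt' for the decision tree.
voting_clf = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier()),
                ('lr', LogisticRegression(max_iter=1000)),   # assumed model
                ('svc', SVC())],                             # assumed model
    voting='hard')   # hard voting: majority class among the predictions

voting_clf.fit(X_train, y_train)
print(voting_clf.predict(X_test[:5]))   # classes determined by hard voting
```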
2- Bagging & Pasting
Definition
● Bagging (or Bootstrap aggregating) and pasting are both ensemble methods that combine models of the same type. They both train the models on different random subsets of the data. All the models run in parallel, and they use the voting mechanism for the final prediction.
● Both Bagging and Pasting apply random sampling:
  ➢ The training set for each model is a randomly selected subset of the original data.
  ➢ The same sample can be found in different models (different subsets).
● The difference is:
  ➢ Bagging: random sampling with replacement <==>
    ➔ One sample can be found several times in the same model (same subset).
  ➢ Pasting: random sampling without replacement <==>
    ➔ One sample can be found only once in the same model (same subset).
Bagging example using scikit-learn
● The data and the BaggingClassifier call are shown as a code screenshot; a sketch follows below. The annotations read:
  ➢ bootstrap=True ==> ensemble method: bagging
  ➢ TheModel2 = SVC ==> all the models are support vector machine classifiers
  ➢ max_samples=90 ==> the size of the subsets (bags) = 90 samples
  ➢ n_jobs=-1 ==> use all the available cores (to compute in parallel)
  ➢ n_estimators=300 ==> use 300 SVCs
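A minimal sketch of that setup, assuming a toy dataset (the slide's data loading is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Toy data standing in for the slide's dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

bag_clf = BaggingClassifier(
    SVC(),             # TheModel2: every estimator is a support vector classifier
    n_estimators=300,  # 300 SVCs
    max_samples=90,    # each bag contains 90 samples
    bootstrap=True,    # sampling with replacement ==> bagging
    n_jobs=-1,         # use all available cores
    random_state=0)
bag_clf.fit(X, y)
```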
Pasting example using scikit-learn
● bootstrap=False ==> ensemble method: pasting (see the sketch below).
● The remaining parameters are the same as the previous ones.
● The model used is a LogisticRegression classifier.
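A sketch of the pasting variant under the same assumptions as the bagging sketch above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

paste_clf = BaggingClassifier(
    LogisticRegression(max_iter=1000),  # the base model is a LogisticRegression
    n_estimators=300,
    max_samples=90,
    bootstrap=False,   # sampling without replacement ==> pasting
    n_jobs=-1,
    random_state=0)
paste_clf.fit(X, y)
```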
3- Features sampling
Definition
● All the following methods use feature sampling: each model will be trained on a random subset of the features.
● Sampling features can be done with or without replacement.
● Random Patches method:
  ➢ Sampling both training instances and features.
● Random Subspaces method:
  ➢ Keeping all training instances but sampling the features.
Random Patches method
● The data: 75 samples (instances), with 100 features.
● Instances: bagging (sampling with replacement), with number of selected samples = 50 < 75 ==> instance sampling.
● Features: sampling with replacement, with number of selected features = 80 < 100 ==> feature sampling.
(A sketch of this setup follows below.)
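A minimal sketch of the Random Patches setup. The data shape (75 x 100) and the sampling parameters come from the slide; the base estimator and n_estimators are assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(75, 100)               # 75 samples, 100 features
y = rng.randint(0, 2, size=75)

patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),        # assumed base estimator
    n_estimators=100,                # assumed number of models
    max_samples=50,                  # 50 < 75  ==> instance sampling
    bootstrap=True,                  # instances sampled with replacement (bagging)
    max_features=80,                 # 80 < 100 ==> feature sampling
    bootstrap_features=True,         # features sampled with replacement
    random_state=0)
patches_clf.fit(X, y)
```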
Random Subspaces method
● Features: sampling 80 < 100 features, without replacement.
● Since max_samples = 1.0 (a float value) ==> max_samples = 100% of the training data (100% * 75 = 75).
● Since all samples are used without replacement (bootstrap=False, as in pasting) ==> the instances are not sampled ==> all the training data is used (75 samples).
(A sketch of this setup follows below.)
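A sketch of the Random Subspaces setup under the same assumptions as the Random Patches sketch:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(75, 100)               # 75 samples, 100 features
y = rng.randint(0, 2, size=75)

subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),        # assumed base estimator
    n_estimators=100,                # assumed number of models
    max_samples=1.0,                 # float 1.0 ==> 100% of the 75 instances
    bootstrap=False,                 # instances are not resampled
    max_features=80,                 # 80 < 100 ==> feature sampling
    bootstrap_features=False,        # features sampled without replacement
    random_state=0)
subspaces_clf.fit(X, y)
```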
4- Boosting: Adaptive Boosting
Definition
● Boosting: an ensemble method that combines several weak learners into a stronger learner.
● This is done by training the models sequentially ==> each model corrects (boosts) its predecessor.
● The same type of model is used on the same data each time.
● The best known boosting methods are Adaptive Boosting and Gradient Boosting.
  ➢ Adaptive Boosting: each new predictor focuses on the training samples that its predecessor underfitted (for example, misclassified in a classification problem), by modifying the instance weights.
  ➢ Gradient Boosting: the new predictor tries to fit the residual errors made by the previous predictor.
AdaBoost: training
● Weighting samples ==> each sample's contribution will be multiplied by its weight.
● AdaBoost is applicable to binary classification.
● The steps of the algorithm are as follows (a sketch is given after this list):
  ➢ Initialize the sample weights w_i (for the first predictor) to 1/m, where m is the number of training samples.
  ➢ For each predictor j, compute:
    ➔ the weighted error rate: r_j = (∑ w_i where the prediction is wrong) / (∑ w_i over all samples)
    ➔ the j-th predictor's weight: α_j = η log((1 − r_j) / r_j), where η is the learning rate parameter.
    ➔ the new weights (to be used by the following predictor j+1):
        w_i = w_i             if y_true(i) = y_pred(i)
        w_i = w_i exp(α_j)    if y_true(i) ≠ y_pred(i)
    ➔ Normalize the new weights: w_i = w_i / ∑ w_i.
  ➢ The process is repeated until a perfect predictor is found, or the maximum number of predictors is reached.
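A minimal NumPy sketch of this training loop, using decision stumps as the weak learners. It illustrates the formulas above; it is not scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_estimators=5, eta=1.0):
    m = len(X)
    w = np.full(m, 1.0 / m)               # initialize w_i = 1/m
    stumps, alphas = [], []
    for j in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)  # train on the weighted samples
        wrong = stump.predict(X) != y
        r_j = w[wrong].sum() / w.sum()    # weighted error rate
        if r_j == 0:                      # perfect predictor found
            stumps.append(stump)
            alphas.append(1.0)
            break
        alpha_j = eta * np.log((1 - r_j) / r_j)       # predictor's weight
        w = np.where(wrong, w * np.exp(alpha_j), w)   # boost misclassified samples
        w = w / w.sum()                   # normalize the new weights
        stumps.append(stump)
        alphas.append(alpha_j)
    return stumps, alphas
```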
AdaBoost: Predicting
➢ To make a prediction (a short sketch follows):
  ➔ make a prediction with each predictor j among the resulting N predictors.
  ➔ weight each prediction by the predictor's weight α_j.
  ➔ for each sample x, select the class k that receives the majority of weighted votes: for each predicted class k, sum up the corresponding α_j weights, then select the class k with the biggest sum.
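A continuation of the training sketch above, implementing the weighted vote (binary labels 0/1 are assumed):

```python
def adaboost_predict(X, stumps, alphas, classes=(0, 1)):
    votes = np.zeros((len(X), len(classes)))
    for stump, alpha in zip(stumps, alphas):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += alpha               # add alpha_j to the voted class
    return np.asarray(classes)[votes.argmax(axis=1)]   # class with the biggest sum
```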
AdaBoost: SAMME
● SAMME: Stagewise Additive Modeling using a Multi-class Exponential loss function.
● An enhanced version of AdaBoost, applicable to multiclass classification.
● Same steps as AdaBoost; only the α weight is computed differently:
  α_j = η (log((1 − r_j) / r_j) + log(K − 1)), where K is the number of classes.
Example
➢ A weak classifier: a decision tree with one level (two leaves, i.e. the split of the root node). This tree is called a Decision Stump. (A sketch follows below.)
➢ In scikit-learn, to apply adaptive boosting to regression, the weights are adjusted according to the error of the predictions.
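A minimal scikit-learn sketch of AdaBoost with decision stumps; the dataset and the hyper-parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump: a single split
    n_estimators=200,                     # assumed value
    learning_rate=0.5,                    # the eta parameter in the formulas above
    random_state=0)
ada_clf.fit(X, y)
```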
5- Boosting: Gradient boosting
The concept
1. The predictor is first trained on a set of data: x, y.
2. The residual errors are computed from its predictions: r = y − y_pred.
3. A new predictor is trained on the new set of data: x, r.
4. The residual errors are computed again: r2 = r − r_pred.
5. Steps 3 and 4 are repeated until: you have used the whole predefined number of predictors; or you determine the optimal sequence of consecutive predictors (the one with the least generated error) and select those predictors as your final model; or you keep adding predictors until the errors no longer diminish.
6. The final prediction is the sum of all the predictions.
● Scikit-learn implements gradient tree boosting: the models used are decision trees. (A manual sketch of the residual-fitting idea follows.)
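A manual sketch of steps 1-6 with three decision tree regressors, on an assumed toy 1-D regression problem:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 6
y = np.sin(X).ravel() + rng.randn(100) * 0.1

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
r = y - tree1.predict(X)                    # first residual errors
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, r)
r2 = r - tree2.predict(X)                   # second residual errors
tree3 = DecisionTreeRegressor(max_depth=2).fit(X, r2)

# The final prediction is the sum of all the predictions.
X_new = np.array([[2.5]])
y_hat = sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))
print(y_hat)
```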
Example: the data
● Boston House Prices dataset.
● The chosen feature represents the % lower status of the population.
Example: using a GBRT
● GBRT stands for Gradient Boosted Regression Trees. (A sketch follows below.)
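A minimal GradientBoostingRegressor sketch. The slides use the Boston House Prices data; since load_boston has been removed from recent scikit-learn releases, a synthetic single-feature dataset stands in here, and the hyper-parameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1) * 30            # stand-in for the "% lower status" feature
y = 40 - X.ravel() + rng.randn(200) * 2

gbrt = GradientBoostingRegressor(
    n_estimators=100,    # trees added sequentially, each fitting the residuals
    learning_rate=0.1,
    max_depth=2,
    random_state=0)
gbrt.fit(X, y)
print(gbrt.predict([[10.0]]))
```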
6- Stacking or blending
Concept
● The idea here is to train a model that learns how to aggregate the ensemble models' predictions.
● The method is composed of:
  ➢ Learner models: they fit to the data and make the predictions.
  ➢ Blender: the final model, or meta learner, that makes the final prediction.
● There are different methods to train the blender:
  ● Hold-out set: Blending
  ● Out-of-fold: Stacking
Hold-out set: principle
● An example using 3 predictors and 1 blender (2 layers).
[Diagram: Layer 1 (the three predictors) is trained, then predicts on the held-out set; the predicted values (sets 1, 2 and 3) are used to train the blender in Layer 2.]
● To make a prediction, the new instance first goes through the first layer.
● The resulting predictions serve as input for the second layer.
● The prediction made by this latter one is the final result.
Hold-out set: Training
[Code screenshot: the training set is split into two subsets, xr1 for training and xr2 for predicting. The first level is trained on xr1 and then predicts on xr2; those predictions generate the new features used to train the blender. A sketch follows below.]
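A minimal sketch of this hold-out (blending) training procedure. The subset names xr1/xr2 follow the slide; the dataset and the choice of first-level models and blender are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Split the training set: xr1 to train the first level, xr2 held out.
xr1, xr2, yr1, yr2 = train_test_split(X_train, y_train, test_size=0.5,
                                      random_state=0)

# Train the first level (three predictors) on xr1.
level1 = [DecisionTreeClassifier(random_state=0),
          SVC(random_state=0),
          LogisticRegression(max_iter=1000)]
for model in level1:
    model.fit(xr1, yr1)

# Predict on the held-out subset xr2: these predictions are the new features.
blend_features = np.column_stack([m.predict(xr2) for m in level1])

# Train the blender (meta learner) on the new features.
blender = LogisticRegression()
blender.fit(blend_features, yr2)
```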
Hold-out set: Testing
● The test data is used by the first level to predict values that are then used as features by the blender.
● The final prediction is made by the blender. (The sketch continues below.)
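Continuation of the sketch above for the test phase:

```python
# The first level predicts on the test data; its predictions become the
# blender's input features for the final prediction.
test_features = np.column_stack([m.predict(X_test) for m in level1])
y_final = blender.predict(test_features)
print("blended accuracy:", (y_final == y_test).mean())
```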
Hold-out set: generalization
● It's possible to train several layers of blenders, and each one can be a set of models.
● The idea is to divide the original training set into several subsets: n subsets ==> n layers (n−1 blending phases).
● The first set of predictors trains on the first subset and makes predictions on the second one.
● The second set of predictors trains on those predictions, and then makes new predictions using the third subset.
● The process is repeated until the last set of predictors: it trains on the last predictions made by the previous predictors using the last subset of data.
[Diagram: Layer 1, Layer 2, Layer 3, ..., Layer n, built from the n subsets.]
References
● Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
● Scikit-learn.org. scikit-learn: machine learning in Python. Online at https://scikit-learn.org/stable/. Accessed on 03-11-2018.
Thank you!
FOR ALL YOUR TIME

PPTX
Ensemble methods in machine learning
PPTX
Ensemble hybrid learning technique
PPTX
(Machine Learning) Ensemble learning
PPTX
Machine Learning - Ensemble Methods
PPTX
Ensemble methods
PPTX
Ensemble learning
PDF
Understanding Bagging and Boosting
PPTX
Ensemble learning Techniques
Ensemble methods in machine learning
Ensemble hybrid learning technique
(Machine Learning) Ensemble learning
Machine Learning - Ensemble Methods
Ensemble methods
Ensemble learning
Understanding Bagging and Boosting
Ensemble learning Techniques

What's hot (20)

PPTX
Ensemble methods
PPTX
Ensemble learning
PPTX
Bag the model with bagging
PPTX
boosting algorithm
PPTX
Machine learning with ADA Boost
PDF
Ensemble modeling and Machine Learning
PPT
Learning On The Border:Active Learning in Imbalanced classification Data
PDF
Boosting Algorithms Omar Odibat
PPTX
Supervised Machine Learning in R
PPTX
Boosting Approach to Solving Machine Learning Problems
PDF
Classification
PPSX
ADABoost classifier
PDF
Introduction to Some Tree based Learning Method
PPTX
Lecture 6: Ensemble Methods
PDF
Machine Learning and Data Mining: 16 Classifiers Ensembles
PPTX
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
PPTX
Binary Class and Multi Class Strategies for Machine Learning
PPTX
Borderline Smote
PDF
L4. Ensembles of Decision Trees
PPTX
Presentation on supervised learning
Ensemble methods
Ensemble learning
Bag the model with bagging
boosting algorithm
Machine learning with ADA Boost
Ensemble modeling and Machine Learning
Learning On The Border:Active Learning in Imbalanced classification Data
Boosting Algorithms Omar Odibat
Supervised Machine Learning in R
Boosting Approach to Solving Machine Learning Problems
Classification
ADABoost classifier
Introduction to Some Tree based Learning Method
Lecture 6: Ensemble Methods
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Binary Class and Multi Class Strategies for Machine Learning
Borderline Smote
L4. Ensembles of Decision Trees
Presentation on supervised learning
Ad

Similar to Aaa ped-14-Ensemble Learning: About Ensemble Learning (20)

PPTX
AIML UNIT 4.pptx. IT contains syllabus and full subject
PDF
Supervised Learning Ensemble Techniques Machine Learning
PPTX
Unit V -Multiple Learners.pptx for artificial intelligence
PPTX
Unit V -Multiple Learners in artificial intelligence and machine learning
PPT
Lecture -8 Classification(AdaBoost) .ppt
PPT
Ensemble Learning in Machine Learning.ppt
PPT
INTRODUCTION TO BOOSTING.ppt
PDF
DMTM 2015 - 15 Classification Ensembles
PDF
BaggingBoosting.pdf
PDF
Complete picture of Ensemble-Learning, boosting, bagging
PPTX
Learn from Example and Learn Probabilistic Model
PDF
Ensemble Learning Notes for students of CS
PPT
Ensemble Learning bagging, boosting and stacking
PPT
Ensemble_Learning.ppt
PPT
Ensemble Learning bagging and boosting in ML
PPT
Ensemble_Learning_AND_ITS_TECHNIQUES.ppt
PPTX
Ensemble Learning.pptx machine learning1
PPTX
Gradient Boosted trees
PDF
dm1.pdf
PPT
Ensemble Learning Featuring the Netflix Prize Competition and ...
AIML UNIT 4.pptx. IT contains syllabus and full subject
Supervised Learning Ensemble Techniques Machine Learning
Unit V -Multiple Learners.pptx for artificial intelligence
Unit V -Multiple Learners in artificial intelligence and machine learning
Lecture -8 Classification(AdaBoost) .ppt
Ensemble Learning in Machine Learning.ppt
INTRODUCTION TO BOOSTING.ppt
DMTM 2015 - 15 Classification Ensembles
BaggingBoosting.pdf
Complete picture of Ensemble-Learning, boosting, bagging
Learn from Example and Learn Probabilistic Model
Ensemble Learning Notes for students of CS
Ensemble Learning bagging, boosting and stacking
Ensemble_Learning.ppt
Ensemble Learning bagging and boosting in ML
Ensemble_Learning_AND_ITS_TECHNIQUES.ppt
Ensemble Learning.pptx machine learning1
Gradient Boosted trees
dm1.pdf
Ensemble Learning Featuring the Netflix Prize Competition and ...
Ad

More from AminaRepo (20)

PDF
Aaa ped-23-Artificial Neural Network: Keras and Tensorfow
PDF
Aaa ped-22-Artificial Neural Network: Introduction to ANN
PDF
Aaa ped-21-Recommender Systems: Content-based Filtering
PDF
Aaa ped-20-Recommender Systems: Model-based collaborative filtering
PDF
Aaa ped-19-Recommender Systems: Neighborhood-based Filtering
PDF
Aaa ped-18-Unsupervised Learning: Association Rule Learning
PDF
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
PDF
Aaa ped-16-Unsupervised Learning: clustering
PDF
Aaa ped-15-Ensemble Learning: Random Forests
PDF
Aaa ped-12-Supervised Learning: Support Vector Machines & Naive Bayes Classifer
PDF
Aaa ped-11-Supervised Learning: Multivariable Regressor & Classifers
PDF
Aaa ped-10-Supervised Learning: Introduction to Supervised Learning
PDF
Aaa ped-9-Data manipulation: Time Series & Geographical visualization
PDF
Aaa ped-Data-8- manipulation: Plotting and Visualization
PDF
Aaa ped-8- Data manipulation: Data wrangling, aggregation, and group operations
PDF
Aaa ped-6-Data manipulation: Data Files, and Data Cleaning & Preparation
PDF
Aaa ped-5-Data manipulation: Pandas
PDF
Aaa ped-4- Data manipulation: Numpy
PDF
Aaa ped-3. Pythond: advanced concepts
PDF
Aaa ped-2- Python: Basics
Aaa ped-23-Artificial Neural Network: Keras and Tensorfow
Aaa ped-22-Artificial Neural Network: Introduction to ANN
Aaa ped-21-Recommender Systems: Content-based Filtering
Aaa ped-20-Recommender Systems: Model-based collaborative filtering
Aaa ped-19-Recommender Systems: Neighborhood-based Filtering
Aaa ped-18-Unsupervised Learning: Association Rule Learning
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-16-Unsupervised Learning: clustering
Aaa ped-15-Ensemble Learning: Random Forests
Aaa ped-12-Supervised Learning: Support Vector Machines & Naive Bayes Classifer
Aaa ped-11-Supervised Learning: Multivariable Regressor & Classifers
Aaa ped-10-Supervised Learning: Introduction to Supervised Learning
Aaa ped-9-Data manipulation: Time Series & Geographical visualization
Aaa ped-Data-8- manipulation: Plotting and Visualization
Aaa ped-8- Data manipulation: Data wrangling, aggregation, and group operations
Aaa ped-6-Data manipulation: Data Files, and Data Cleaning & Preparation
Aaa ped-5-Data manipulation: Pandas
Aaa ped-4- Data manipulation: Numpy
Aaa ped-3. Pythond: advanced concepts
Aaa ped-2- Python: Basics

Recently uploaded (20)

PDF
AlphaEarth Foundations and the Satellite Embedding dataset
PPT
protein biochemistry.ppt for university classes
PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPT
POSITIONING IN OPERATION THEATRE ROOM.ppt
PPTX
Taita Taveta Laboratory Technician Workshop Presentation.pptx
PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
PPTX
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
PPTX
2. Earth - The Living Planet Module 2ELS
PDF
Sciences of Europe No 170 (2025)
PPTX
2. Earth - The Living Planet earth and life
PDF
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PPTX
BIOMOLECULES PPT........................
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
PDF
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PDF
. Radiology Case Scenariosssssssssssssss
AlphaEarth Foundations and the Satellite Embedding dataset
protein biochemistry.ppt for university classes
Introduction to Fisheries Biotechnology_Lesson 1.pptx
POSITIONING IN OPERATION THEATRE ROOM.ppt
Taita Taveta Laboratory Technician Workshop Presentation.pptx
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
2. Earth - The Living Planet Module 2ELS
Sciences of Europe No 170 (2025)
2. Earth - The Living Planet earth and life
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
BIOMOLECULES PPT........................
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
Classification Systems_TAXONOMY_SCIENCE8.pptx
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
. Radiology Case Scenariosssssssssssssss

Aaa ped-14-Ensemble Learning: About Ensemble Learning

  • 1. Ensemble Learning: About Ensemble Learning AAA-Python Edition
  • 2. Plan ● 1- Ensemble Learning ● 2- Bagging & Pasting ● 3- Features sampling ● 4- Boosting : Adaptive Boosting ● 5- Boosting: Gradient boosting ● 6- Stacking or Blending
  • 3. 3 1-EnsembleLearning [By Amina Delali] ConceptConcept ● In Ensemble Learning, we combine several models to build a better model. The algorithm used in Ensemble learning is called: an Ensemble Method. ● We can combine classifers or regressors. ● The models can be all the same type, or diferent. Model 1 Model 2 ... Model n Learning from data Making predictions for new data Select the final prediction
  • 4. 4 1-EnsembleLearning [By Amina Delali] VotingVoting ● Its about: how to select the fnal prediction. ● In Classifcation ➢ Hard Voting: For each sample, a classifer will make a prediction : a class for that sample ➢ Select the most predicted class by all the classifers, for that sample . ➢ Soft Voting: available when the classifers can predict class probabilities. ● Select the class with the highest averaged probability ● In Regression ● The average of the predicted values.
  • 5. 5 1-EnsembleLearning [By Amina Delali] Ensemble MethodsEnsemble Methods ● The ensemble methods can vary by: ● Varying or not the types of the models: use the same or different models. ● Select the same sample only once or several times in the same model. ● Whether or not the model use all the features or only a subset of features. ● The models learn in parallel or sequentially. ● The type of mechanism used to make a prediction.
  • 6. 6 1-EnsembleLearning [By Amina Delali] ExampleExample The 3 models used in this ensemble method The will combine the models A string representing the type of the model: ‘dt’ for : decision tree Hard voting was used to determine these classes
  • 7. 7 2-Bagging&Pasting [By Amina Delali] DefnitionDefnition ● Bagging (or Bootstrap aggregating) and pasting are both ensemble methods that combine same type of models. They both train the models on diferent random sub sets. All the models run in parallel. They use the voting mechanism for the fnal prediction ● Both Bagging and Basting apply random sampling : ➢ The training set for each model is a subset of the original data randomly selected. ➢ The same sample can be found in different models (different subsets). ➢ ● The difference is: ➢ Bagging : random sampling with replacement <==> ➔ One sample can be found several times in the same model (same subset). ➢ Pasting: random sampling without replacement <==> ➔ One sample can be found only once in the same model (same subset).
  • 8. 8 2-Bagging&Pasting [By Amina Delali] Bagging example using scikit-learnBagging example using scikit-learn ● The data Bootstrap =True ==> Ensemble method : bagging TheModel2 = SVC == > all the models are support vector machine classifiers max_samples=90 ==> the size of the subsets (bags) == 90 sample n_jobs=-1 ==> use all the available cores (to compute in parallel) n_estimators= 300 ==> use 300 SVC
  • 9. 9 2-Bagging&Pasting [By Amina Delali] Pasting example using sckit-learnPasting example using sckit-learn ● Bootstrap =False ==> Ensemble method : pasting The remaining parameter are the same as the previous ones The model used is a LogisticRegression classifier
  • 10. 10 3-Featuressampling [By Amina Delali] DefnitionDefnition ● All the following methods use features sampling: each model will be trained in a random subset of features. ● Sampling features can be with or without replacement. ● Random patches method ➢ Sampling both “training instances” and “features” ● Random subspaces method ➢ Keeping all “training” instances but “sampling” features
  • 11. 11 3-Featuressampling [By Amina Delali] Random Patches methodRandom Patches method ● The data: 75 samples (instance), with 100 features Sampling features with replacement Bagging Number of selected features = 80 <100 features==> features sampling Number of selected samples = 50 < 75 ==> instances sampling
  • 12. 12 3-Featuressampling [By Amina Delali] Random Subspaces methodRandom Subspaces method ● Sampling features : 80 < 100 Pasting Features sampling without replacement Since max_samples = 1.0 (a float value)==> max_samples =100% of the training data (100% *75 = 75) Since all samples are used without replacement ==> the instances are not sampled ==> all the training data is used (75 samples)
  • 13. 13 4-Boosting: AdaptiveBoosting [By Amina Delali] DefnitionDefnition ● Boosting : Ensemble method, that combines several weak learners into a stronger learner. ● This is done by training the models sequentially ==> each model correct (boost) its predecessor. ● Uses the same models on the same data each time. ● The most known boosting methods are: Adaptive Boosting and Gradient Boosting. ➢ Adaptive Boosting: each new predictor focus on the training samples that its predecessor underftted ( for example: misclassifed in a classifcation problem) by modifying the instances weight . ➢ Gradient Boosting: the new predictor tries to ft to the residual errors made by the previous predictor.
  • 14. 14 4-Boosting: AdaptiveBoosting [By Amina Delali] AdaBoost: trainingAdaBoost: training ● Weighting samples ==> each sample value will be multiplied by its weight. ● AdaBoost is Applicable in binary classifcation. ● The steps of the algorithm are as follow: ➢ initialize the samples weight wi (for the frst predictor ) by 1/m. m is the number of the training samples. ➢ for each predictor j compute: ➔ the weighted error rate : rj =∑wi (whre the prediction is wrong) / ∑wi ➔ compute the j predictor's weight: αj =ηlog(1−rj) / rj ). η is the learning rate parameter. ➔ Compute the new weights (to be used by the following new predictor j+1) : wi =wi ,if,ytrue (i)=ypred (i) wi =wi exp(αj ),if,ytrue (i)≠ypred (i) ➔ Normalize the new weights wi by: wi =1/∑wi ➔ The process is repeated until the perfect predictor is found, or the maximum number of predictors is reached. ●
  • 15. 15 4-Boosting: AdaptiveBoosting [By Amina Delali] AdaBoost: PredictingAdaBoost: Predicting ➢ To make a prediction: ➔ make a prediction with each predictor j from the resulting N predictors. ➔ attribute a weight to each prediction by the predictor's j weight αj ➔ for each sample x select the class k that receives the majority of weighted votes: for each predicted class k sum up the corresponding αj weights, then select the class k with the biggest sum. AdaBoost: SAMMEAdaBoost: SAMME ● SAMME : Stagewise Additive Modeling using a Multi-class Exponential loss function ● Enhanced version of AdaBoost, applicable in multiclass classification. ● Same steps as AdaBoost, just the α weight is computed differently: αj =η (log[(1−∗ rj) /rj] + log(K−1)). K is the number of classes.
  • 16. 16 4-Boosting: AdaptiveBoosting [By Amina Delali] ExampleExample ➢ A weak classifier : a decision tree with 1 level: the 2 leafs, the split of the root node. This Tree is called a : Decision Stump In sckit-learn to apply the adaptive boosting to regression, the weights are adjusted according to the error of the predictions
  • 17. 17 5-Boosting: Gradientboosting [By Amina Delali] The conceptThe concept 1.The predictor will be frst trained on a set of data: x,y 2.The residual errors are computed from its prediction: r = y - ypred 3.A new predictor will be trained with the new set of data: x,r 4.The residual errors are computed again as follow: r2 = r - rpred 5.The steps 3 and 4 are repeated until: you predict using all the predefned number of predictors, or you determine the optimal consecutive predictors (the least generated error) and you select those predictors as your fnal model. Or, you continue adding predictors until the errors will not diminish 6. The fnal prediction will be the sum of all the predictions. ● Scikit learn implements gradient tree boosting: the models used are decision trees.
  • 18. 18 5-Boosting: Gradientboosting [By Amina Delali] Example: the dataExample: the data ● ● Boston House Prices dataset. ● The chosen feature represents the: % lower status of the population
  • 19. 19 5-Boosting: Gradientboosting [By Amina Delali] Example: using a GBRTExample: using a GBRT ● GBRT for Gradient Boosted Regression Trees
  • 20. 20 6-Stackingorblending [By Amina Delali] ConceptConcept ● The idea here, is to train a model to learn how to aggregate the ensemble models predictions. ● The method is composed of: ➢ Learner models : that will ft to the data, and make the predictions. ➢ Blender: the fnal model or meta learner, that will make the fnal prediction. ● There are different methods to train the blender: ● Hold-out set: Blending ● Out-of-fold: Stacking
  • 21. 21 6-Stackingorblending [By Amina Delali] Hold-out set: principleHold-out set: principle ● An example using 3 Predictors and 1 blender (2 layers) predicting Held-out set Predicted Values: Set 1 Blender Training Predicted Values: Set 3 Training Predicted Values: Set 2 New values Final prediction ● To make a prediction, the new instance will go through the first layer. ● The resulting predictions will serve as input for the second layer. ● The prediction made by this later one is the final result. Layer 1 Layer 2
  • 22. 22 6-Stackingorblending [By Amina Delali] Hold-out set: TrainingHold-out set: Training Subset xr1 for training Subset xr2 for predicting Training the first level Predicting with the first level Generate the new features Training the blender
  • 23. 23 6-Stackingorblending [By Amina Delali] Hold-out set: TestingHold-out set: Testing The test data will be used by the first level to predict the values == features used by later by the blender The final prediciton, will be done by the blender.
  • 24. 24 6-Stackingorblending [By Amina Delali] Hold-out set: generalizationHold-out set: generalization ● Its possible to train several type of blenders. And , each one can be a set of models. ● The idea is to divide the original training set into several subsets: n subsets ==> n layers (n-1 blending phase) ● The frst set of predictors will train from the frst subset, and make prediction with the second one. ● The second set of predictors will train from the previous predictions. And then make new predictions using the third subset. ● The process is repeated until the last subset of predictors: it will train from the last predictions made by the previous predictors using the last subset of data. Layer 1 …. Layer 2 Layer 3 ... Layer n n subsets
  • 25. References ● Aurélien Géron. Hands-on machine learning with Scikit-Learn and Tensor-Flow: concepts, tools, and techniques to build intelligent systems. O’Reilly Media, Inc, 2017. ● Scikit-learn.org. scikit-learn, machine learning in python. On-line at https://scikit-learn.org/stable/. Accessed on 03-11-2018.