Boosting Algorithms
Omar Odibat, Data Scientist
Nov 15th 2017
• Fraud and Fallout rates:
Ensemble Learning can make you
rich & famous!!
Netflix- KDD-Cup 2007
Goal: Improve movie recommendations by 10%.
Winners used Boosting Trees!!!
http://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
AT&T
Research
Boosting algorithms are widely used in data science competitions.
"Our single XGBoost model can get to the top three!
Our final model just averaged XGBoost models with
different random seeds.”
- Dmitrii Tsybulevskii & Stanislav Semenov, winners of
Avito Duplicate Ads Detection Kaggle competition.
- https://www.kdnuggets.com/2017/10/xgboost-concise-technical-overview.html
Main Idea
Weak classifier + weak classifier + weak classifier → strong classifier
• Weak classifiers: only slightly better than a random guess; very easy to find.
• Strong classifier: very accurate; less overfitting.
Outline
❖ Introduction
❖ Ensemble Learning
❖ Boosting Algorithms
❖ Demo in Python
Part 1:
Overview of Decision Trees
Decision Trees
https://www.kdnuggets.com/2016/09/decision-trees-disastrous-overview.html
Stopping criteria:
• Stop when data points at the
leaf are all of the same class
• Stop when the leaf contains
fewer than K data points
• Stop when further branching
does not improve homogeneity
beyond a minimum threshold
Would you survive a disaster?
Decision Tree Overview
Source: http://www.slideshare.net/DataRobot/gradient-boosted-regression-trees-in-scikitlearn
Predictions for California housing market
Decision Trees: Practical Use
Strengths:
• Non-linear
• Robust to correlated features
• Robust to feature distributions
• Robust to missing values
• Simple to comprehend
• Fast to train
• Fast to score
Weaknesses:
• Poor accuracy
• Cannot project (extrapolate) beyond the training data
• Inefficiently fits linear relationships
https://github.com/mlandry22/boosting-austin-sigkdd-talk-20160803
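As a rough illustration of a single tree, here is a minimal scikit-learn sketch; the dataset (iris) and the parameter values are illustrative assumptions, not from the slides, and the stopping criteria listed earlier map onto constructor arguments such as max_depth and min_samples_leaf.

```python
# A minimal decision-tree sketch (dataset and parameter values are assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stopping criteria from the earlier slide map onto constructor arguments:
# max_depth limits further branching, min_samples_leaf is the "K data points" rule.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```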
Part 2:
Overview of Ensemble Learning
Ensemble Learning
A mixture of experts, or "base learners", combined to solve the same learning problem: construct a set of hypotheses and combine them for use.
• Generating the base learners
• Combining the base learners: bagging, boosting, stacking, voting, averaging
• The ensemble model: classification or regression
The generalization ability of the ensemble is much stronger than that of the base learners.
Why Are Ensembles Superior to Single Learners?
• Statistical: there is not enough training data to choose a single best learner.
• Computational: imperfect search processes result in sub-optimal hypotheses.
• Representational: the true function cannot be represented by any of the hypotheses in H.
Dietterich, T. Ensemble methods in machine learning. Multiple classifier systems, Vol. 1857 of Lecture Notes in Computer Science. 2000
Illustration of the representational issue: a diagonal decision boundary approximated by decision-tree base learners, shown as three staircase approximations and the voted decision boundary.
Dietterich, T. Ensemble methods in machine learning. Multiple classifier systems, Vol. 1857 of Lecture Notes in Computer Science. 2000
Boosting is One Type of Ensemble Learning
Ensemble learning includes Bagging, Boosting (AdaBoost, GBM, XGBoost) and Stacking.
• Bagging & Boosting are considered homogeneous: they use a single base learning algorithm.
• Stacking is heterogeneous: it uses multiple learning algorithms, i.e., different kinds of base learners.
Ensemble Learning in scikit-learn (0.16.1)
• Averaging methods
  • Bagging methods: BaggingClassifier, BaggingRegressor
  • Forests of randomized trees
    • Random Forest: RandomForestClassifier, RandomForestRegressor
    • Extra-Trees: ExtraTreesClassifier, ExtraTreesRegressor
• Boosting methods
  • AdaBoost: AdaBoostClassifier, AdaBoostRegressor
  • Gradient Tree Boosting: GradientBoostingClassifier, GradientBoostingRegressor
Bagging
• Trains a number of base learners each from a different bootstrap sample
(Bootstrap AGGregatING)
• Each dataset is generated by sampling from the total N data examples,
choosing N items uniformly at random with replacement.
• In a bootstrap sample, some training examples may appear several times while others may not appear at all.
• The outputs of the models are combined by:
• Averaging (in the case of regression)
• Voting (in the case of classification)
• Example: Random forests
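A minimal, hedged sketch of the bagging procedure above in scikit-learn; the dataset and parameter values are illustrative assumptions, and the base estimator is passed positionally because its keyword name differs across scikit-learn versions.

```python
# Bagging sketch: bootstrap samples of the training data, one tree per sample,
# predictions combined by voting/averaging. Dataset and values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # the base learner
    n_estimators=50,           # number of bootstrap samples / base learners
    max_samples=1.0,           # each bootstrap sample draws N items
    bootstrap=True,            # sample uniformly at random with replacement
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```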
Tree Ensemble Model (figure): predict whether a given user likes computer games or not.
https://www.kdnuggets.com/2017/10/xgboost-concise-technical-overview.html
Decision Trees & Random Forests in sklearn
Decision tree input parameters: criterion, max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes, …
Random forest input parameters: n_estimators + the same tree parameters (criterion, max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes, …)
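A hedged random-forest sketch using the parameters listed above; the dataset and the specific values are illustrative assumptions.

```python
# Random-forest sketch with the parameters listed above (values are assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,        # the extra parameter compared with a single tree
    criterion="gini",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",     # each split considers a random subset of features
    max_leaf_nodes=None,
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```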
Variants of Bagging
• Bagging: random samples of the dataset are drawn with replacement.
  L. Breiman, "Bagging predictors", Machine Learning, 24(2), 123-140, 1996.
• Pasting: random subsets of the dataset are drawn as random subsets of the samples (without replacement).
  L. Breiman, "Pasting small votes for classification in large databases and on-line", Machine Learning, 36(1), 85-103, 1999.
• Random Subspaces: random subsets of the dataset are drawn as random subsets of the features.
  T. Ho, "The random subspace method for constructing decision forests", Pattern Analysis and Machine Intelligence, 20(8), 1998.
• Random Patches: base estimators are built on subsets of both samples and features.
  G. Louppe and P. Geurts, "Ensembles on Random Patches", Machine Learning and Knowledge Discovery in Databases, 2012.
• Random Forests: a hybrid of Bagging and the Random Subspace Method; uses decision trees as the base classifier with random splits.
  L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
Random Forests
• The scikit-learn implementation combines trees by averaging their probabilistic predictions instead of letting each tree vote for a single class.
• The split that is picked is the best split among a random subset of the features.
Extremely Randomized Trees
• Randomness goes one step further in the way splits are computed.
• As in random forests, a random subset of candidate features is used.
• Thresholds are drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule.
• Main parameter: n_estimators, the number of trees in the forest.
• Feature importance: a feature is important if it 1) splits near the top of a tree and 2) is used in many trees; the scores can be used for feature selection (see the sketch below).
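A small, hedged sketch of ranking features by importance with Extra-Trees; the dataset and parameter values are illustrative assumptions.

```python
# Feature-importance sketch: features that split near the top of many trees get
# higher scores and can be used for feature selection. Dataset is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

data = load_breast_cancer()
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Rank features by the importance scores the fitted forest exposes.
ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```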
Stacking
Stacked generalization (or stacking) is used to combine models of different types.
• Level-0 models (e.g., SVM, neural networks, decision trees) are trained on the training data; each produces its own predictions (Predictions 1, 2, 3) for the training and testing data.
• A level-1 model (e.g., logistic regression) is trained on those predictions and produces the final prediction.
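A hedged sketch of the stacking scheme in the diagram, implemented manually with out-of-fold predictions (newer scikit-learn releases also ship a built-in StackingClassifier). The dataset, model choices and parameters are illustrative assumptions.

```python
# Stacking sketch: SVM, neural network and decision tree as level-0 models,
# logistic regression as the level-1 combiner. Out-of-fold predictions are used
# so the level-1 model never sees leaked training labels. Values are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

level0 = [
    make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    DecisionTreeClassifier(random_state=0),
]

# Level-1 training features: out-of-fold P(class=1) from each level-0 model.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in level0
])

# Refit level-0 models on all training data to score the test set.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in level0
])

level1 = LogisticRegression().fit(train_meta, y_train)
print("stacked test accuracy:", level1.score(test_meta, y_test))
```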
Part 3:
Boosting Algorithms
AdaBoost
• Creates a 'weak' classifier whose accuracy is only slightly better than random guessing.
• A succession of models is built iteratively.
• Records that were misclassified by the previous model are given more weight.
• Finally, all of the successive models are weighted according to their success.
• Uses decision stumps (one-level decision trees) as the base learners.
Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2) (1990)
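A minimal AdaBoost sketch in scikit-learn; the dataset and parameter values are illustrative assumptions, and the stump is passed positionally because its keyword name differs across scikit-learn versions.

```python
# AdaBoost sketch: decision stumps (depth-1 trees) reweighted over many rounds.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # the decision stump base learner
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```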
Boosting illustration (figure): data points A, B, C and D from Class 1 and Class 2 are shown over successive boosting rounds; points misclassified in one round are given more weight, so the decision boundary shifts in the next round.
Bagging vs. Boosting
• Bagging: the training data is resampled into random samples (uniform distribution); models f1, f2, …, fn are trained in parallel style and combined into F.
• Boosting: the training data is reweighted into weighted samples (non-uniform distribution); models f1, f2, …, fn are trained in sequential style and combined into F.
Gradient Boosted Models (GBMs)
• Gradient boosting trains many models sequentially; each new model gradually reduces the loss function using the gradient descent method.
• The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.
• In AdaBoost, by contrast, the weights are derived from the misclassifications of the previous model, with increased weights assigned to the misclassified records.
• The result of gradient boosting is an altogether different function from the initial one, because the result is the addition of multiple functions.
https://www.kdnuggets.com/2017/10/xgboost-concise-technical-overview.html
Gradient Boosting Trees
At each iteration:
• Draws a subsample of the training
data (without replacement)
• Constructs a regression tree from
the sample
• All trees are added together to get
the final model
Jerome H. Friedman, “Stochastic gradient boosting”, Computational
Statistics & Data Analysis 2002
GBM_Score = response1 + response2 + response3 + … + responseM
Prediction with GBM (figure): the GBM score is mapped to Class A or Class B through a monotonic function.
AdaBoost & GBM in sklearn
AdaBoost input parameters: base_estimator, n_estimators, learning_rate, …
Gradient Boosting input parameters: loss, n_estimators, learning_rate, max_depth, max_features, …
A hedged sketch follows.
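A GradientBoostingClassifier sketch using the parameters listed above; the subsample argument gives the stochastic variant described on the previous slide. The dataset and all values are illustrative assumptions.

```python
# Gradient boosting sketch; subsample=0.8 draws 80% of rows (without replacement)
# for each tree, as in stochastic gradient boosting. Values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    max_features="sqrt",
    random_state=0,
)
gbm.fit(X_train, y_train)

# The final GBM score is the sum of the per-tree responses plus the initial guess;
# staged_decision_function exposes this running sum stage by stage.
print("test accuracy:", gbm.score(X_test, y_test))
```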
XGBoost
• XGBoost is an implementation of GBM with major improvements.
• GBMs build trees sequentially; XGBoost parallelizes the work within each tree (such as split finding), which makes it faster.
• XGBoost is an open-source machine learning library available in Python, R, Julia, Java, C++ and Scala.
XGBoost: A Scalable Tree Boosting System, Tianqi Chen, Carlos Guestrin. 2016
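A minimal XGBoost sketch via its scikit-learn-style wrapper; the xgboost package is assumed to be installed, and the dataset and parameter values are illustrative assumptions.

```python
# XGBoost sketch through the sklearn-style API; all values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,   # column subsampling per tree
    n_jobs=-1,              # parallel split finding within each tree
    random_state=0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```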
XGBoost features
1. Split finding algorithms: approximate algorithm
• Candidate split points are proposed based on the percentiles of the feature distribution.
• The continuous features are binned into buckets that are split based on the candidate
split points.
• The best solution for candidate split points is chosen from the aggregated statistics
on the buckets.
2. Column block for parallel learning
• To reduce sorting costs, data is stored in in-memory units called ‘blocks’.
• Each block has data columns sorted by the corresponding feature value.
• This computation needs to be done only once before training and can be reused later.
• Sorting of blocks can be done independently and divided between parallel threads.
• The split finding can be parallelized as the collection of statistics for each column is
done in parallel.
https://www.kdnuggets.com/2017/10/xgboost-concise-technical-overview.html
XGBoost features (cont.)
3. Sparsity-aware algorithm:
• XGBoost visits only the default direction (non-missing entries) in each node.
4. Cache-aware access:
• Optimizes the number of examples stored per block to make better use of the CPU cache.
5. Out-of-core computation:
• Blocks that do not fit into memory are compressed on disk.
• The blocks are decompressed on the fly, in parallel.
https://www.kdnuggets.com/2017/10/xgboost-concise-technical-overview.html
XGBoost vs. Other Tools
http://datascience.la/benchmarking-random-forest-implementations/
LightGBM: the latest of the boosting algorithms
• A fast, distributed, high-performance gradient boosting framework.
• Used for ranking, classification and many other machine learning tasks.
• Part of Microsoft's DMTK project (http://github.com/microsoft/dmtk).
• Similar to XGBoost, but faster; a minimal sketch follows.
https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment
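A hedged LightGBM sketch via its scikit-learn-style wrapper; the lightgbm package is assumed to be installed, and the dataset and parameter values are illustrative assumptions.

```python
# LightGBM sketch through the sklearn-style API; all values are illustrative.
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

model = LGBMClassifier(
    n_estimators=300,
    num_leaves=31,        # growth is controlled per leaf rather than per depth level
    learning_rate=0.1,
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```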
Advantages of Ensemble Learning
• Accuracy
• Less overfitting
• Less variance: in ensemble learning, the large variance of unstable learners is "averaged out" across multiple learners.
• Diversity: different classifiers work on different random subsets of the full feature space or different subsets of the training data.

Imagine we have an ensemble of 5 independent classifiers, each with 70% accuracy. What is the accuracy of the majority vote?
10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + (0.7^5) ≈ 83.7% majority-vote accuracy.
With 101 such classifiers, the majority-vote accuracy rises to about 99.9% (checked in the sketch below).
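A short worked check of the majority-vote arithmetic above, assuming independent classifiers that are each correct with probability p.

```python
# Majority-vote accuracy for n independent classifiers, each correct with prob p.
from math import comb

def majority_vote_accuracy(n, p):
    """P(more than half of n independent classifiers are correct)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(5, 0.7))    # about 0.837
print(majority_vote_accuracy(101, 0.7))  # above 0.999
```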
Are ensembles easy to understand?
• Decision trees are easy to understand.
• Ensemble models are considered complex models, and interpreting their predictions is a challenge.
Two tools that help:
• SHAP (SHapley Additive exPlanations): "A unified approach to interpreting model predictions", Scott Lundberg, Su-In Lee, 2017.
• LIME (Local Interpretable Model-Agnostic Explanations): approximates the complex model near a given prediction. Ribeiro et al., 2016.
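A minimal SHAP sketch for a tree ensemble; the shap package is assumed to be installed, and the model and dataset are illustrative assumptions rather than the talk's demo (LIME would be used analogously through its own API).

```python
# Explaining tree-ensemble predictions with SHAP's TreeExplainer (illustrative).
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

explainer = shap.TreeExplainer(model)            # fast, tree-ensemble-specific explainer
shap_values = explainer.shap_values(data.data)   # per-feature contribution for every row
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```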
DEMO
References
• Zhou, Zhi-Hua, “Ensemble Learning”, Encyclopedia of Biometrics, 2009
• Dietterich, T. Ensemble methods in machine learning. Multiple classifier systems, Vol. 1857 of Lecture Notes in
Computer Science. 2000
• Ho, Tin Kam, The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1998.
• VALENTINI, Giorgio, and Francesco MASULLI, 2002. Ensembles of Learning Machines. 2002. Revised Papers, Volume 2486
of Lecture Notes in Computer Science. Berlin: Springer, pp. 3–19.
• FREUND, Yoav, and Robert E. SCHAPIRE, 1996. Experiments with a New Boosting Algorithm. In: Lorenza SAITTA, ed.
(ICML ’96). San Francisco, CA: Morgan Kaufmann, pp. 148–156.
• SCHAPIRE, Robert E., 1990. The Strength of Weak Learnability. Machine Learning, 5(2), 197–227. Oxford University
Press.
• BREIMAN, Leo, 1996. Bagging Predictors. Machine Learning, 24(2), 123–140.Sewell M (2011) Ensemble learning,
Technical Report RN/11/02. Department of Computer Science, UCL, London
• G Brown, “Ensemble Learning”, Encyclopedia of Machine Learning, 2010
• http://scikit-learn.org/0.16/modules/ensemble.html
• http://scikit-learn.org/stable/modules/multiclass.html
• http://xgboost.readthedocs.io/en/latest/
Boosting Algorithms
Thank you!!!
Omar Odibat, Data Scientist