Ensemble Learning with
Apache Spark MLlib 1.5
leoricklin@gmail.com
What is Ensemble Learning (集成学习)?
● Combines different learners (individual models) to improve a model's stability and predictive power
● Four main factors make models differ, and combinations of these factors can also produce different models:
○ different algorithm types
○ different assumptions
○ different modeling techniques
○ different initialization parameters
● Ensemble learning is a typical practice-driven research field: it was first shown to work in practice, and only later did researchers analyze it theoretically
A pinch of math
● Suppose there are 3 independent binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote among the 3 members there are 4 possible outcomes:
○ All three are correct: 0.7 * 0.7 * 0.7 = 0.343
○ Two are correct: 0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441
○ Two are wrong: 0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189
○ All three are wrong: 0.3 * 0.3 * 0.3 = 0.027
● So the majority vote is correct with probability 0.343 + 0.441 = 0.784 > 0.7
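The same arithmetic can be checked by enumerating every correct/wrong pattern of the voters (a plain-Python sketch; the function name is mine):

```python
from itertools import product

def majority_vote_accuracy(p, n=3):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, gives the correct answer."""
    total = 0.0
    for outcome in product([True, False], repeat=n):  # every correct/wrong pattern
        if sum(outcome) > n / 2:                      # majority is correct
            prob = 1.0
            for correct in outcome:
                prob *= p if correct else (1 - p)
            total += prob
    return total

print(round(majority_vote_accuracy(0.7), 4))  # 0.784, matching the slide
```

Raising `n` shows why ensembles of many weak-but-better-than-chance voters help: with n=5 the same 0.7 voters reach about 0.837.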
Model Error
● The error of any model can be mathematically decomposed into three components:
○ Bias error measures how far the predictions are from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ (the third component is the irreducible error: noise inherent in the data that no model can remove)
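This decomposition can be checked numerically. Below is a toy sketch (not Spark code; the shrinkage estimator is invented for illustration): a deliberately biased estimator is run many times, and its mean squared error splits exactly into bias² plus variance.

```python
import random

random.seed(0)
TRUE_MEAN, NOISE_SD, N, TRIALS = 5.0, 2.0, 10, 20000

# A deliberately biased estimator: shrink the sample mean toward 0.
def estimate(sample):
    return 0.8 * sum(sample) / len(sample)

estimates = []
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N)]
    estimates.append(estimate(sample))

mean_est = sum(estimates) / TRIALS
bias = mean_est - TRUE_MEAN                                    # systematic offset
variance = sum((e - mean_est) ** 2 for e in estimates) / TRIALS
mse = sum((e - TRUE_MEAN) ** 2 for e in estimates) / TRIALS

# Empirically, MSE = bias^2 + variance (plus irreducible noise in real problems)
print(bias ** 2 + variance, mse)
```

Here the shrinkage factor 0.8 injects a bias of about -1.0, while averaging only 10 noisy points contributes the variance term.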
Trade-off management of bias-variance errors
● As model complexity increases, the model eventually overfits and its variance starts to grow
● A good model keeps these two kinds of error in balance
● Ensemble learning is one way of managing this trade-off:
○ How should each learner be trained?
○ How should the learners be combined?
EL techniques (1): Bagging
● Fits similar learners on small (bootstrap) samples of the data, then averages their predictions
● Helps reduce variance
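A minimal plain-Python sketch of the idea (not MLlib code; the stump learner and all names are mine): bootstrap-resample the training set, fit one weak learner per sample, and average their predictions.

```python
import random

random.seed(1)

def fit_stump(points):
    """Tiny base learner: a one-split regression stump on 1-D data."""
    best = None
    xs = sorted({x for x, _ in points})
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2                  # candidate threshold
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in points)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged(points, n_models=25):
    models = []
    for _ in range(n_models):
        boot = [random.choice(points) for _ in points]     # bootstrap sample
        models.append(fit_stump(boot))
    return lambda x: sum(m(x) for m in models) / len(models)  # average

data = [(x / 10, (x / 10) ** 2 + random.gauss(0, 0.1)) for x in range(20)]
model = bagged(data)
print(model(0.5))
```

Each stump alone is a crude, high-variance fit; averaging 25 of them trained on different bootstrap samples smooths the prediction, which is exactly the variance reduction the slide describes.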
EL techniques (2): Boosting
● An iterative technique
● Adjusts the weight of each observation based on the previous round's classification: if an observation was misclassified, its weight is increased
● Reduces bias error, but can sometimes overfit the training data
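The reweighting step can be made concrete with one AdaBoost-style round (a hand-made sketch; the tiny label/prediction arrays are invented for illustration):

```python
import math

# One AdaBoost-style round: misclassified points get heavier weights.
labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, +1, +1]        # a weak learner's guesses
weights     = [1 / len(labels)] * len(labels)

err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)   # this learner's vote strength

weights = [w * math.exp(-alpha * y * p)   # up-weight mistakes, down-weight hits
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]    # renormalise to sum to 1

print(weights)
```

After the update, the two misclassified points (indices 1 and 3) each carry weight 0.25, so the next weak learner is forced to concentrate on them.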
EL techniques (3): Stacking
● Uses one learner to combine the outputs of several different learners
● Can reduce both bias error and variance
● Choosing the right combination of models is as much an art as a pure research problem
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
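The stacking idea can be sketched end to end in a few lines (plain Python, not MLlib; the two hand-made tier-1 "models" and all names are mine): tier-1 predictions become the features on which a tier-2 learner, here a perceptron, is trained.

```python
# Two tier-1 "models", each looking at a different raw feature.
def base_a(x): return 1 if x[0] > 0.5 else -1
def base_b(x): return 1 if x[1] > 0.5 else -1

train = [([0.9, 0.1], 1), ([0.8, 0.9], 1), ([0.2, 0.8], -1), ([0.1, 0.2], -1)]

# Meta-features: the tier-1 predictions for each training point.
meta = [([base_a(x), base_b(x)], y) for x, y in train]

# Tier-2 model: a perceptron trained on the meta-features.
w, b = [0.0, 0.0], 0.0
for _ in range(10):
    for feats, y in meta:
        if y * (w[0] * feats[0] + w[1] * feats[1] + b) <= 0:  # misclassified
            w = [w[0] + y * feats[0], w[1] + y * feats[1]]
            b += y

def stacked(x):
    f = [base_a(x), base_b(x)]
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else -1

print([stacked(x) for x, _ in train])
```

The tier-2 learner discovers how much to trust each base model: here it learns that base_a tracks the label and base_b does not.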
Stacking with Apache MLlib (1)
● Dataset: UCI Covtype (Ch. 4 of Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter combinations, chosen by 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
[Diagram: RF(θ1) fits Training set X, then predicts on Training set Y, producing h1(Y,θ1)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
Stacking with Apache MLlib (2)
● Using meta-features
[Diagram: tier-1 models RF(θ1), RF(θ2), RF(θ3) are fit with 3-fold C.V. on Training set X; their out-of-fold predictions h1(X,θ1), h2(X,θ2), h3(X,θ3), together with the Label, are the meta-features on which the tier-2 model RF(θ1) fits; the tier-1 models then predict on Training set Y, and the tier-2 model predicts from h1(Y,θ1), h2(Y,θ2), h3(Y,θ3)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.951056
recall      0.956144   0.951056
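The 3-fold C.V. step that produces the meta-features can be sketched in plain Python (not MLlib code; the toy majority-vote base learner and all names are mine). The point is that each row's meta-feature comes from a model that never saw that row:

```python
def kfold_meta_feature(points, labels, fit, k=3):
    """Out-of-fold predictions: fold i is predicted by a model
    trained on the other k-1 folds."""
    n = len(points)
    meta = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train_ix = [i for i in range(n) if i % k != fold]
        model = fit([points[i] for i in train_ix], [labels[i] for i in train_ix])
        for i in held_out:
            meta[i] = model(points[i])
    return meta

# Toy base learner: always predicts the majority label of its training data.
def majority_fit(xs, ys):
    vote = 1 if sum(ys) >= 0 else -1
    return lambda x: vote

labels = [1, 1, -1, -1, -1, -1]
meta = kfold_meta_feature(list(range(6)), labels, majority_fit)
print(meta)
```

Training the tier-2 model on these out-of-fold values, rather than on in-sample predictions, is what keeps the stacked model from simply memorising the tier-1 models' training-set fit.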
Stacking with Apache MLlib (3)
● Using original features & meta-features
[Diagram: same pipeline as (2), except the tier-2 model RF(θ1) fits the original features f1 ... fn together with the meta-features h1(X,θ1), h2(X,θ2), h3(X,θ3); at prediction time it likewise receives f1 ... fn plus h1(Y,θ1), h2(Y,θ2), h3(Y,θ3)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.951094
recall      0.956144   0.951094
Stacking with Apache MLlib (4)
● Retrain the tier-1 models and stack with all features
[Diagram: tier-1 models RF(θ1), RF(θ2), RF(θ3) are re-fit on the full Training set X (no C.V. split); their predictions h1(X,θ1), h2(X,θ2), h3(X,θ3), together with the original features f1 ... fn and the Label, are fit by the tier-2 model RF(θ1); all models then predict on Training set Y]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.956836
recall      0.956144   0.956836
