Ensemble Learning with
Apache Spark MLlib 1.5
leoricklin@gmail.com
What is Ensemble Learning (集成学习)?
● Combines different learners (individual models) to improve a model's stability and predictive power
● Four main factors make models differ, and combinations of these factors can also produce different models:
○ different algorithm types
○ different assumptions
○ different modeling techniques
○ different initialization parameters
● Ensemble learning is a typical practice-driven research field: it was first shown to work in practice, and only later did researchers analyze it theoretically
A pinch of math
● Suppose there are 3 independent binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote among the 3 members there are 4 possible outcomes:
○ All three are correct: 0.7 * 0.7 * 0.7 = 0.343
○ Two are correct: 0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441
○ Two are wrong: 0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189
○ All three are wrong: 0.3 * 0.3 * 0.3 = 0.027
● So the majority vote is correct with probability 0.343 + 0.441 = 0.784 > 0.7
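The same arithmetic can be checked by enumerating every correct/wrong pattern of the voters (a plain-Python sketch; the function name is mine):

```python
from itertools import product

def majority_vote_accuracy(p, n=3):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, gives the correct answer."""
    total = 0.0
    for outcome in product([True, False], repeat=n):  # every correct/wrong pattern
        if sum(outcome) > n / 2:                      # majority is correct
            prob = 1.0
            for correct in outcome:
                prob *= p if correct else (1 - p)
            total += prob
    return total

print(round(majority_vote_accuracy(0.7), 4))  # 0.784, matching the slide
```

Raising `n` shows why ensembles of many weak-but-better-than-chance voters help: with n=5 the same 0.7 voters reach about 0.837.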
Model Error
● The error of any model can be mathematically decomposed into three components:
○ Bias error measures how far the predictions are from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ (the third component is the irreducible error: noise inherent in the data that no model can remove)
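This decomposition can be checked numerically. Below is a toy sketch (not Spark code; the shrinkage estimator is invented for illustration): a deliberately biased estimator is run many times, and its mean squared error splits exactly into bias² plus variance.

```python
import random

random.seed(0)
TRUE_MEAN, NOISE_SD, N, TRIALS = 5.0, 2.0, 10, 20000

# A deliberately biased estimator: shrink the sample mean toward 0.
def estimate(sample):
    return 0.8 * sum(sample) / len(sample)

estimates = []
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N)]
    estimates.append(estimate(sample))

mean_est = sum(estimates) / TRIALS
bias = mean_est - TRUE_MEAN                                    # systematic offset
variance = sum((e - mean_est) ** 2 for e in estimates) / TRIALS
mse = sum((e - TRUE_MEAN) ** 2 for e in estimates) / TRIALS

# Empirically, MSE = bias^2 + variance (plus irreducible noise in real problems)
print(bias ** 2 + variance, mse)
```

Here the shrinkage factor 0.8 injects a bias of about -1.0, while averaging only 10 noisy points contributes the variance term.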
Trade-off management of bias-variance errors
● As model complexity increases, the model eventually overfits and its variance starts to grow
● A good model keeps these two kinds of error in balance
● Ensemble learning is one way of managing this trade-off:
○ How should each learner be trained?
○ How should the learners be combined?
EL techniques (1): Bagging
● Fits similar learners on small (bootstrap) samples of the data, then averages their predictions
● Helps reduce variance
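A minimal plain-Python sketch of the idea (not MLlib code; the stump learner and all names are mine): bootstrap-resample the training set, fit one weak learner per sample, and average their predictions.

```python
import random

random.seed(1)

def fit_stump(points):
    """Tiny base learner: a one-split regression stump on 1-D data."""
    best = None
    xs = sorted({x for x, _ in points})
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2                  # candidate threshold
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in points)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged(points, n_models=25):
    models = []
    for _ in range(n_models):
        boot = [random.choice(points) for _ in points]     # bootstrap sample
        models.append(fit_stump(boot))
    return lambda x: sum(m(x) for m in models) / len(models)  # average

data = [(x / 10, (x / 10) ** 2 + random.gauss(0, 0.1)) for x in range(20)]
model = bagged(data)
print(model(0.5))
```

Each stump alone is a crude, high-variance fit; averaging 25 of them trained on different bootstrap samples smooths the prediction, which is exactly the variance reduction the slide describes.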
EL techniques (2): Boosting
● An iterative technique
● Adjusts the weight of each observation based on the previous round's classification: if an observation was misclassified, its weight is increased
● Reduces bias error, but can sometimes overfit the training data
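The reweighting step can be made concrete with one AdaBoost-style round (a hand-made sketch; the tiny label/prediction arrays are invented for illustration):

```python
import math

# One AdaBoost-style round: misclassified points get heavier weights.
labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, +1, +1]        # a weak learner's guesses
weights     = [1 / len(labels)] * len(labels)

err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)   # this learner's vote strength

weights = [w * math.exp(-alpha * y * p)   # up-weight mistakes, down-weight hits
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]    # renormalise to sum to 1

print(weights)
```

After the update, the two misclassified points (indices 1 and 3) each carry weight 0.25, so the next weak learner is forced to concentrate on them.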
EL techniques (3): Stacking
● Uses one learner to combine the outputs of several different learners
● Can reduce both bias error and variance
● Choosing the right combination of models is as much an art as a pure research problem
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
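The stacking idea can be sketched end to end in a few lines (plain Python, not MLlib; the two hand-made tier-1 "models" and all names are mine): tier-1 predictions become the features on which a tier-2 learner, here a perceptron, is trained.

```python
# Two tier-1 "models", each looking at a different raw feature.
def base_a(x): return 1 if x[0] > 0.5 else -1
def base_b(x): return 1 if x[1] > 0.5 else -1

train = [([0.9, 0.1], 1), ([0.8, 0.9], 1), ([0.2, 0.8], -1), ([0.1, 0.2], -1)]

# Meta-features: the tier-1 predictions for each training point.
meta = [([base_a(x), base_b(x)], y) for x, y in train]

# Tier-2 model: a perceptron trained on the meta-features.
w, b = [0.0, 0.0], 0.0
for _ in range(10):
    for feats, y in meta:
        if y * (w[0] * feats[0] + w[1] * feats[1] + b) <= 0:  # misclassified
            w = [w[0] + y * feats[0], w[1] + y * feats[1]]
            b += y

def stacked(x):
    f = [base_a(x), base_b(x)]
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else -1

print([stacked(x) for x, _ in train])
```

The tier-2 learner discovers how much to trust each base model: here it learns that base_a tracks the label and base_b does not.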
Stacking with Apache MLlib (1)
● Dataset: UCI Covtype (Ch. 4 of Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter combinations, chosen by 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
[Diagram: RF(θ1) fits Training set X, then predicts on Training set Y, producing h1(Y,θ1)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
Stacking with Apache MLlib (2)
● Using meta-features
[Diagram: tier-1 models RF(θ1), RF(θ2), RF(θ3) are fit with 3-fold C.V. on Training set X; their out-of-fold predictions h1(X,θ1), h2(X,θ2), h3(X,θ3), together with the Label, are the meta-features on which the tier-2 model RF(θ1) fits; the tier-1 models then predict on Training set Y, and the tier-2 model predicts from h1(Y,θ1), h2(Y,θ2), h3(Y,θ3)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.951056
recall      0.956144   0.951056
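The 3-fold C.V. step that produces the meta-features can be sketched in plain Python (not MLlib code; the toy majority-vote base learner and all names are mine). The point is that each row's meta-feature comes from a model that never saw that row:

```python
def kfold_meta_feature(points, labels, fit, k=3):
    """Out-of-fold predictions: fold i is predicted by a model
    trained on the other k-1 folds."""
    n = len(points)
    meta = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train_ix = [i for i in range(n) if i % k != fold]
        model = fit([points[i] for i in train_ix], [labels[i] for i in train_ix])
        for i in held_out:
            meta[i] = model(points[i])
    return meta

# Toy base learner: always predicts the majority label of its training data.
def majority_fit(xs, ys):
    vote = 1 if sum(ys) >= 0 else -1
    return lambda x: vote

labels = [1, 1, -1, -1, -1, -1]
meta = kfold_meta_feature(list(range(6)), labels, majority_fit)
print(meta)
```

Training the tier-2 model on these out-of-fold values, rather than on in-sample predictions, is what keeps the stacked model from simply memorising the tier-1 models' training-set fit.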
Stacking with Apache MLlib (3)
● Using original features & meta-features
[Diagram: same pipeline as (2), except the tier-2 model RF(θ1) fits the original features f1 ... fn together with the meta-features h1(X,θ1), h2(X,θ2), h3(X,θ3); at prediction time it likewise receives f1 ... fn plus h1(Y,θ1), h2(Y,θ2), h3(Y,θ3)]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.951094
recall      0.956144   0.951094
Stacking with Apache MLlib (4)
● Retrain the tier-1 models and stack with all features
[Diagram: tier-1 models RF(θ1), RF(θ2), RF(θ3) are re-fit on the full Training set X (no C.V. split); their predictions h1(X,θ1), h2(X,θ2), h3(X,θ3), together with the original features f1 ... fn and the Label, are fit by the tier-2 model RF(θ1); all models then predict on Training set Y]
#trees = 32
θ1: #bins=300, #depth=30, entropy
θ2: #bins=40, #depth=30, entropy
θ3: #bins=300, #depth=30, gini

Results (sorted by precision):
            Baseline   Current
precision   0.956144   0.956836
recall      0.956144   0.956836
