Poggi analytics - ensemble - 1b

Buenos Aires, March 2016
Eduardo Poggi

Topics
 Ensembles
 Bagging
 Boosting
 Random Forest

Ensembles
 Ensemble:
 A set of models used together as a “meta-model”.
 A familiar underlying idea:
 Use knowledge from different sources when making decisions.

Ensembles
 Committee of experts:
 many members
 all highly knowledgeable
 all on the same topic
 they vote
 Cabinet of advisors:
 experts in different areas
 highly knowledgeable
 a head decides who knows about the topic at hand
Flat ensembles:
-Fusion
-Bagging
-Boosting
-Random Forest
Divisive ensembles:
-Mixture of experts
-Stacking
Crowd decision-making?

Ensembles
 Two basic components:
 A method for selecting or building the members
 Same area or different areas?
 Different datasets x different models x different configurations
 A method for combining the decisions
 Simple voting, weighted voting, averaging, a specific function,
selectivity …
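The first three combination methods listed above can be sketched in a few lines. This is only an illustration: the function names and the toy inputs are hypothetical, and the "specific function" and "selectivity" options are not covered.

```python
from collections import Counter

def simple_vote(labels):
    """Simple voting: return the label predicted by the most members."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Weighted voting: each member's vote counts as much as its weight."""
    totals = Counter()
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def average(scores):
    """Averaging: combine numeric outputs (e.g. probabilities) by their mean."""
    return sum(scores) / len(scores)

print(simple_vote(["a", "b", "a"]))
print(weighted_vote(["a", "b", "a"], [0.2, 0.7, 0.2]))
print(average([0.2, 0.4, 0.9]))
```

Note that the same three predictions can yield different decisions depending on the combiner: the simple vote follows the two members that agree, while the weighted vote follows the single heavily weighted member.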

Ensembles
 Flat:
 Many experts, all of them good:
 Each one needs to be as good as possible individually.
 Otherwise they are usually of no use.
 But they need to disagree in some cases.
 If they all always agree, one of them is enough!

Ensembles
 Divisive:
 Split the problem into a series of subproblems with minimal
overlap.
 “Divide & conquer” strategy.
 Useful for tackling large problems.
 A function is needed to decide which classifier should
act.

Ensembles
 If a “learner” is good and produces a good classifier,
could many “learners” produce something better?
 Why not learn { h1, h2, h3 }, and then:
 h*(x) = majority { h1(x), h2(x), h3(x) }
 If the hi’s have independent errors
 h* is more accurate.
 Error(hi) = ε, then Error(h*) ≈ 3ε^2
 (0.01 → 0.0003)

Ensembles
 1. Subsample Training Sample
 Bagging
 Boosting
 2. Manipulate Input Features
 3. Manipulate Output Targets
 ECOC
 4. Injecting Randomness
 Data Algorithm
 5. Algorithm Specific methods
 Other combinations
 Why do Ensembles work?

Ensembles
 Manipulate Input Features

Ensembles
 Manipulate Output Targets

Ensembles
 A learner is said to be unstable if the classifier it
produces changes substantially under small
variations in the training data
 Unstable: decision trees, neural networks, …
 Stable: linear regression, nearest neighbor, ...
 Subsampling works best with unstable learners

Ensembles
 Voting Algorithms
 Take an inducer and a training set,
 Run the inducer multiple times by changing the distribution of the
training set instances,
 The generated classifiers are combined,
 … and then used to classify the test set.

Ensembles
 Voting algorithms can be divided into two types:
 those that adaptively change the distribution of the training set
based on the performance of previous classifiers (as in boosting
methods) and
 those that do not (as in Bagging).

Bagging Algorithm
 Bootstrap aggregating (Breiman 96)
 Votes classifiers generated by different bootstrap samples
(replicates)
 Each bootstrap sample is generated by uniformly sampling
m instances from the training set with replacement.
 T bootstrap samples B1, B2, … , BT are generated and a
classifier Ci is built from each bootstrap sample Bi
 A final classifier C* is built from C1, C2, … , CT whose
output is the class predicted most often by its
sub-classifiers, with ties broken arbitrarily
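The steps above can be sketched as follows. This is a minimal illustration, not Breiman's implementation: the inducer is a hypothetical one-dimensional decision stump, the toy dataset is made up, and all function names are assumptions.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical inducer: best single-threshold rule on 1-D data.
    sample is a list of (x, label) pairs with labels 0/1."""
    best = None
    for thr, _ in sample:
        for left in (0, 1):          # label predicted when x <= thr
            errors = sum((left if x <= thr else 1 - left) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left)
    _, thr, left = best
    return lambda x: left if x <= thr else 1 - left

def bagging(train, T, inducer, rng):
    """Build T classifiers, each from a bootstrap sample of size m = len(train)."""
    m = len(train)
    return [inducer([rng.choice(train) for _ in range(m)]) for _ in range(T)]

def predict(classifiers, x):
    """C*: the class predicted most often; Counter breaks ties arbitrarily."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
train = [(x, int(x > 5)) for x in range(11)]   # toy data: label is 1 when x > 5
ensemble = bagging(train, T=25, inducer=train_stump, rng=rng)
print(predict(ensemble, 0), predict(ensemble, 10))
```

Because each classifier depends only on its own bootstrap sample, the T calls to the inducer could run in parallel.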

Bagging Algorithm
 An instance in the training set has probability
1−(1−1/m)^m of being selected at least once in the m
times instances are randomly selected
 For large m, this is about 1 − 1/e = 63.2%, which means
that each bootstrap sample contains only about 63.2%
unique instances from the training set.
 If the inducer is unstable (ANN, DT), the performance can
improve.
 If the inducer is stable (k-nearest neighbor), bagging may slightly
degrade the performance.
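The 63.2% figure can be checked both from the formula and by simulation. The sketch below evaluates 1−(1−1/m)^m for a few sample sizes and counts the distinct instances in one simulated bootstrap sample. The function names and the chosen values of m are illustrative assumptions.

```python
import math
import random

def prob_selected(m):
    """Probability that a given instance appears at least once in m draws
    with replacement from a training set of size m."""
    return 1 - (1 - 1 / m) ** m

def unique_fraction(m, rng):
    """Fraction of distinct instances in one simulated bootstrap sample of size m."""
    return len({rng.randrange(m) for _ in range(m)}) / m

limit = 1 - 1 / math.e
for m in (10, 100, 10_000):
    print(m, round(prob_selected(m), 4))
print("limit", round(limit, 4))
print("simulated", unique_fraction(10_000, random.Random(0)))
```

The instances left out of a bootstrap sample (roughly 36.8%) are the ones used for out-of-bag estimates.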

Adaboost Algorithm
 Boosting (Schapire 90), AdaBoost.M1 (Freund & Schapire 96)
 Generates the classifiers sequentially, while Bagging can
generate them in parallel.
 AdaBoost also changes the weights of the training
instances provided as input to each inducer based on
classifiers that were previously built.
 The goal is to force the inducer to minimize expected
error over different input distributions.
 C* = weighted voting. The weight of each classifier
depends on its performance on the training set used to
build it.

Adaboost Algorithm
 The incorrect instances are weighted by a factor inversely
proportional to the error on the training set, i.e., 1/(2Ei).
Small training set errors, such as 0.1%, will cause
weights to grow by several orders of magnitude.
 The AdaBoost algorithm requires a weak learning
algorithm whose error is bounded by a constant strictly
less than 1/2. In practice, the inducers we use provide no
such guarantee.
 The original algorithm aborted when the error bound was
breached
 Resampling + reweighting
 Success (???) distribution of the “margins”
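The reweighting rule above can be sketched as a single step. The 1/(2Ei) factor for misclassified instances comes from the slide. The 1/(2(1−Ei)) factor for correctly classified instances is not stated there: it is assumed here from the usual AdaBoost.M1 formulation, where it keeps the weights summing to 1. The function name and toy weights are hypothetical.

```python
def reweight(weights, correct):
    """One AdaBoost.M1-style reweighting step.
    weights: current instance weights, summing to 1.
    correct: booleans, True where the current classifier was right."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error Ei
    if not 0 < err < 0.5:
        # mirrors the original algorithm, which aborts when the bound is breached
        raise ValueError("need 0 < error < 1/2")
    return [w / (2 * (1 - err)) if ok else w / (2 * err)
            for w, ok in zip(weights, correct)]

weights = [0.1] * 10
correct = [True] * 9 + [False]          # Ei = 0.1
new_weights = reweight(weights, correct)
print(new_weights[-1], sum(new_weights))
```

After the step, the misclassified instances carry half of the total weight, so the next classifier cannot do well by repeating the same mistakes.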

Adaboost : How Will Test Error Behave? (Guess!)
 Expect…
 training error to continue to drop (or reach 0)
 test error to increase when h* becomes “too complex”
 “Occam’s razor”
 overfitting

Adaboost : How Will Test Error Behave? (Real!)
 But…
 test error does not increase, even after 1000 rounds
 test error continues to drop, even after training error is 0!
 Occam’s razor: “simpler rule is better”... appears to not apply!

Adaboost : Margins
 key idea:
 training error only measures whether classifications are right or wrong
 should also consider confidence of classifications
 measure confidence by margin = strength of the vote
 (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
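The margin definition above can be computed directly for one example. This is a sketch assuming binary labels in {-1, +1} and non-negative classifier weights. The function name and the example weights are hypothetical.

```python
def margin(alphas, predictions, true_label):
    """Voting margin of one example:
    (weighted fraction voting correctly) - (weighted fraction voting incorrectly)."""
    total = sum(alphas)
    right = sum(a for a, p in zip(alphas, predictions) if p == true_label)
    wrong = total - right
    return (right - wrong) / total

# Three weak classifiers with weights 0.5, 0.3, 0.2; the last one votes wrongly.
m = margin([0.5, 0.3, 0.2], [+1, +1, -1], true_label=+1)
print(m)
```

The margin lies in [-1, 1]: it is positive exactly when the weighted vote classifies the example correctly, and values near 1 indicate a confident correct vote.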

Adaboost : Application detecting Faces [Viola & Jones]
 problem: find faces in photograph or movie
 weak classifiers: detect light/dark rectangles in image
 many clever tricks to make extremely fast and accurate

Adaboost : practical advantages
 Fast
 simple and easy to program
 no parameters to tune (except T, sometimes)
 flexible — can combine with any learning algorithm
 no prior knowledge needed about weak learner
 provably effective, given weak classifier
 shift in mind set: goal now is merely to find classifiers barely
better than random guessing
 Versatile
 can use with data that is textual, numeric, discrete, etc.
 has been extended to learning problems well beyond binary
classification

Adaboost : warnings
 Performance of AdaBoost depends on data and weak
learner.
 Consistent with theory, AdaBoost can fail if...
 weak classifiers too complex
 overfitting
 weak classifiers too weak (γt → 0 too quickly)
 underfitting
 low margins
 overfitting
 Empirically, AdaBoost seems especially susceptible to
uniform noise.

Adaboost : Conclusions
 Boosting is a practical tool for classification and other
learning problems
 grounded in rich theory
 performs well experimentally
 often (but not always!) resistant to overfitting
 many applications and extensions

Recognizing Handwritten Numbers
 “Obvious” approach: learn F: Scribble → {0,1,2,...,9}
 ...doesn’t work very well (too hard!)
 Or... “decompose” the learning task into 6 “subproblems”
 learn 6 classifiers, one for each “sub-problem”
 To classify a new scribble:
 Run each classifier
 Predict the class whose code-word is closest (Hamming distance)
to the predicted code
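The decoding step can be sketched as below. The 6-bit code words are hypothetical, made up for this illustration, and are not the code table from the original slides. Each bit position stands for the output of one of the 6 binary classifiers.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 6-bit code words, one per digit (an assumption, not the slide's table).
CODE_WORDS = {
    0: "000000", 1: "000111", 2: "011001", 3: "011110", 4: "101010",
    5: "101101", 6: "110011", 7: "110100", 8: "111000", 9: "010101",
}

def decode(predicted_bits):
    """Return the class whose code word is closest to the predicted code.
    Ties go to the class listed first in CODE_WORDS."""
    return min(CODE_WORDS, key=lambda c: hamming(CODE_WORDS[c], predicted_bits))

print(decode("011001"))   # exact match with the code word for 2
print(decode("000001"))   # one bit away from the code word for 0
```

How many classifier mistakes the decoder can absorb depends on the minimum Hamming distance between code words, so real ECOC designs choose codes with well-separated rows.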

Random Forest: Bagging + trees
 Using bootstraps generates diversity, but the trees
remain highly correlated
 The same variables tend to take the first splits every time.
Example: two trees generated with rpart from bootstrap samples of the Pima.tr dataset. The same variable is at the root.

Random Forest
 Add a little randomness to tree growth
 At each node, select a small group of variables at
random and evaluate only those variables.
 Adds no bias: in the long run every variable comes into play
 Adds variance: but that is easily fixed by averaging
models
 Effective at decorrelating the trees

Random Forest
 Grows each tree until every example is separated. No pruning.
No stopping criterion.
 The value of m (mtry in R) matters. The default is
sqrt(p), which is usually a good choice.
 With m=p we recover bagging
 The number of trees is not important, as long as there are
many: 500, 1000, 2000.
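The per-node variable selection can be sketched as follows. This shows only the random choice of m candidate variables out of p, not the tree growing itself. The function name and p = 16 are illustrative assumptions.

```python
import math
import random

def candidate_features(p, m, rng):
    """Indices of the m variables (out of p) evaluated at one node."""
    return sorted(rng.sample(range(p), m))

p = 16
m_default = int(math.sqrt(p))            # the sqrt(p) default mentioned above
rng = random.Random(0)
for node in range(3):
    # a fresh random subset is drawn at every node
    print(node, candidate_features(p, m_default, rng))
print(candidate_features(p, p, rng))     # m = p: every variable, as in bagging
```

With m = p every node sees all the variables, so the only remaining source of diversity is the bootstrap sample, which is plain bagging.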

Random Forest
 Summary
 An improvement on bagging, for trees only
 Better predictions than Bagging.
 Widely used. Almost automatic.
 Results comparable to the best current methods.
 Useful by-products, especially the OOB estimate and variable
importance.

Bagging or Boosting: the bias-variance dilemma
 Unbiased predictors have high variance (and vice versa)
 There are two ways to resolve the dilemma:
 Reduce the variance of unbiased predictors
 Build many predictors and average them: Bagging and Random
Forest
 Reduce the bias of stable predictors
 Build a sequence whose combination has less bias:
Boosting
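The variance-reduction route can be illustrated with a small simulation. It rests on a strong assumption: each predictor is unbiased and its errors are independent Gaussian noise, which bagged trees only approximate, since they stay correlated. Under that assumption, averaging T predictors divides the variance by about T. All names and constants below are hypothetical.

```python
import random
import statistics

def averaged_prediction(T, rng, truth=1.0, sigma=1.0):
    """Average of T unbiased, independent noisy predictions of the same target."""
    return statistics.fmean([rng.gauss(truth, sigma) for _ in range(T)])

rng = random.Random(0)
runs = 5000
var_single = statistics.pvariance([averaged_prediction(1, rng) for _ in range(runs)])
var_avg = statistics.pvariance([averaged_prediction(25, rng) for _ in range(runs)])
print(round(var_single, 3), round(var_avg, 3))
```

With correlated predictors the reduction levels off well before 1/T, which is why decorrelating the trees matters in Random Forest.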

Bias and Variance
 Which functions should we use?
 Rigid functions:
 Good estimation of the optimal parameters, but little flexibility.
 Flexible functions:
 Good fit, but poor estimation of the optimal parameters.
Bias error
Variance error

What now?
 Ensemble methods have been shown to improve on the
performance of the individual techniques they are built from.
 There are theorems showing that AdaBoost improves
performance as long as the boosted model has certain
weakness characteristics (it is limited, not
complex).

What now?
 Corollary:
 The voters do not need to be intelligent, well educated,
expert, etc.; it is enough that they are diverse and true to their
limited abilities.
 “A committee of fools works better than one expert …”
 What would a parliament of boosted legislators look like?

eduardopoggi@yahoo.com.ar
eduardo-poggi
http://ar.linkedin.com/in/eduardoapoggi
https://www.facebook.com/eduardo.poggi
@eduardoapoggi

Bibliography
 https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
