An Introduction to Boosting Yoav Freund Banter Inc.
Plan of talk Generative vs. non-generative modeling Boosting Alternating decision trees Boosting and over-fitting Applications
Toy Example: a computer receives a telephone call, measures the pitch of the caller's voice, and decides the gender of the caller. (Figure: human voice, classified as male or female.)
Generative modeling. (Figure: probability density as a function of voice pitch, one Gaussian per class with parameters mean1/var1 and mean2/var2.)
Discriminative approach. (Figure: a decision threshold on voice pitch, chosen to minimize the number of mistakes.)
Ill-behaved data. (Figure: probability as a function of voice pitch for data where the fitted means mean1 and mean2 mislead the generative model, while the threshold minimizing the number of mistakes remains sensible.)
Traditional Statistics vs. Machine Learning. (Diagram: Statistics maps Data to an estimated world state, Decision Theory maps that state to Predictions and Actions, while Machine Learning maps Data to Predictions and Actions directly.)
Comparison of methodologies

| Model               | Generative            | Discriminative         |
|---------------------|-----------------------|------------------------|
| Goal                | Probability estimates | Classification rule    |
| Performance measure | Likelihood            | Misclassification rate |
| Mismatch problems   | Outliers              | Misclassifications     |
Boosting
A weak learner: given a weighted training set (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), with instances x1, x2, …, xn (feature vectors), binary labels y1, y2, …, yn, and non-negative weights that sum to 1, the weak learner outputs a weak rule h. The weak requirement: h must do slightly better than random guessing on the weighted set, i.e., have weighted error at most 1/2 − γ for some advantage γ > 0.
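As a concrete illustration (not part of the original slides), the classic minimal weak learner is a decision stump: threshold a single feature and predict ±1. A sketch in Python, with my own names train_stump and stump_predict:

```python
import numpy as np

def train_stump(X, y, w):
    """Fit a decision stump to a weighted sample.

    X: (n, d) feature matrix; y: labels in {-1, +1};
    w: non-negative weights summing to 1.
    Returns ((feature, threshold, polarity), weighted error).
    """
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                          # exhaustive search over rules
        for thresh in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] > thresh, polarity, -polarity)
                err = w[pred != y].sum()        # weighted training error
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    return best, best_err

def stump_predict(stump, X):
    j, thresh, polarity = stump
    return np.where(X[:, j] > thresh, polarity, -polarity)
```

The weak requirement is then simply that best_err is at most 1/2 − γ for some γ > 0.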
The boosting process: train the weak learner on the uniformly weighted set (x1, y1, 1/n), …, (xn, yn, 1/n) to get h1; reweight the examples and train again on (x1, y1, w1), …, (xn, yn, wn) to get h2; repeat for h3, h4, …, hT. Final rule: Sign[ α1 h1 + α2 h2 + … + αT hT ].
Adaboost. Binary labels $y \in \{-1, +1\}$; $\mathrm{margin}(x,y) = y \sum_t \alpha_t h_t(x)$; $P(x,y) = \frac{1}{Z} \exp(-\mathrm{margin}(x,y))$. Given $h_t$, we choose $\alpha_t$ to minimize $\sum_{(x,y)} \exp(-\mathrm{margin}(x,y))$.
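Putting the pieces together, a minimal AdaBoost loop might look like the sketch below (it reuses the hypothetical stump helpers above; the closed-form choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is the standard minimizer of the exponential loss for a fixed $h_t$):

```python
def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # initial weights: uniform
    ensemble = []                             # list of (alpha_t, rule_t)
    for _ in range(T):
        stump, eps = train_stump(X, y, w)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)  # guard against a perfect rule
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)        # shrink correct, grow mistakes
        w /= w.sum()                          # the 1/Z normalization
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)                     # final rule: sign of the vote
```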
Main property of Adaboost: if the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$ (rule $t$ has weighted error $1/2 - \gamma_t$), then the in-sample error of the final rule is at most $\prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \le \exp\!\left(-2\sum_{t=1}^{T}\gamma_t^2\right)$ (w.r.t. the initial weights).
A demo www.cs.huji.ac.il/~yoavf/adabooost
Adaboost as gradient descent. Discriminator class: a linear discriminator in the space of “weak hypotheses”. Original goal: find the hyperplane with the smallest number of mistakes, known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space). Computational method: use the exponential loss as a surrogate and perform gradient descent.
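One explicit step makes the connection concrete (standard, though not spelled out on the slide): differentiating the exponential loss with respect to the ensemble score $F(x_i)$ on example $i$ gives

$$\frac{\partial}{\partial F(x_i)} \sum_{k=1}^{n} e^{-y_k F(x_k)} \;=\; -\,y_i\, e^{-y_i F(x_i)} \;=\; -\,y_i \, Z \, w_i ,$$

so the gradient at each example is proportional to its AdaBoost weight $w_i$; handing the weights to the weak learner is how the gradient direction is communicated.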
Margins view. (Figure: examples projected onto a weight vector w; the prediction is the sign of the projection and the margin is the label times the projection; a cumulative plot of the number of examples against margin puts mistakes at negative margins and correct predictions at positive margins.)
Adaboost et al. (Figure: loss as a function of margin for Adaboost, Logitboost, and Brownboost, compared with the 0-1 loss; mistakes sit at negative margins, correct predictions at positive margins.)
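For reference, the convex losses on this slide written as functions of the margin m = y·F(x); a sketch (Brownboost's loss also depends on a remaining-time parameter and is omitted here):

```python
import numpy as np

def zero_one_loss(m):    # the loss we actually care about
    return (m <= 0).astype(float)

def adaboost_loss(m):    # exponential loss: upper-bounds 0-1 loss everywhere
    return np.exp(-m)

def logitboost_loss(m):  # logistic loss: grows only linearly for very negative m
    return np.log2(1.0 + np.exp(-m))
```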
One coordinate at a time Adaboost performs  gradient descent  on exponential loss Adds one coordinate ( “weak learner” ) at each iteration. Weak learning in  binary classification  =  slightly better than random guessing.  Weak learning in regression – unclear. Uses  example-weights  to communicate the gradient direction to the weak learner Solves a  computational  problem
What is a good weak learner? The set of weak rules (features) should be  flexible enough to be (weakly) correlated  with most conceivable relations between feature vector and label. Small enough to allow exhaustive search  for the minimal weighted training error. Small enough to avoid over-fitting . Should be able to  calculate predicted label very efficiently . Rules can be  “specialists”  – predict only on a small subset of the input space and  abstain from predicting  on the rest (output 0).
Alternating Trees Joint work with Llew Mason
Decision Trees. (Figure: a tree that tests X>3 at the root; on no it predicts −1, on yes it tests Y>5 and predicts +1 on yes, −1 on no; shown next to the corresponding axis-parallel partition of the X-Y plane at X = 3 and Y = 5.)
Decision tree as a sum: the same tree rewritten so that each node contributes a real-valued score (the figure uses values such as −0.2, ±0.1, +0.2, −0.3), and the predicted label is the sign of the sum of the scores along the instance's path.
An alternating decision tree: real-valued prediction nodes (such as −0.2, +0.1, −0.1, +0.2, −0.3, +0.7) alternate with splitter nodes (Y>5, X>3, Y<1). A new splitter, such as Y<1, can be attached under any prediction node; an instance follows every splitter it reaches, and the predicted label is the sign of the sum of all prediction nodes it passes through.
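A sketch of how such a tree is evaluated: an instance descends through every splitter it reaches, and the label is the sign of the summed prediction values. The data structure below, and the example tree wired up from the slide's numbers, are my own plausible reading, not code from the paper:

```python
class Splitter:
    """A test node with a real-valued prediction on each branch."""
    def __init__(self, test, yes_value, no_value,
                 yes_children=(), no_children=()):
        self.test = test                        # function: instance -> bool
        self.yes_value, self.no_value = yes_value, no_value
        self.yes_children, self.no_children = yes_children, no_children

def adt_score(x, root_value, splitters):
    """Sum the contributions of every prediction node the instance reaches."""
    total, stack = root_value, list(splitters)
    while stack:
        node = stack.pop()
        if node.test(x):
            total += node.yes_value
            stack.extend(node.yes_children)     # only the taken branch is explored
        else:
            total += node.no_value
            stack.extend(node.no_children)
    return total                                # classify with sign(total)

# One plausible wiring of the slide's tree, for instances x = (X, Y):
splitters = [
    Splitter(lambda x: x[1] > 5, +0.2, -0.3,
             no_children=[Splitter(lambda x: x[0] > 3, +0.1, -0.1)]),
    Splitter(lambda x: x[1] < 1, +0.7, 0.0),
]
print(adt_score((4, 0), -0.2, splitters))       # -0.2 - 0.3 + 0.1 + 0.7 = 0.3 -> +1
```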
Example: Medical Diagnostics. The Cleve dataset from the UC Irvine database: heart disease diagnostics (+1 = healthy, −1 = sick), 13 features from tests (real-valued and discrete), 303 instances.
ADtree for the Cleveland heart-disease diagnostics problem. (Figure.)
Cross-validated accuracy

| Learning algorithm | Number of splits | Average test error | Test error variance |
|--------------------|------------------|--------------------|---------------------|
| ADtree             | 6                | 17.0%              | 0.6%                |
| C5.0               | 27               | 27.2%              | 0.5%                |
| C5.0 + boosting    | 446              | 20.2%              | 0.5%                |
| Boost Stumps       | 16               | 16.5%              | 0.8%                |
Boosting and over-fitting
Curious phenomenon: when boosting decision trees, we fit more than 2,000,000 parameters using fewer than 10,000 training examples, yet the combined classifier does not over-fit.
Explanation using margins. (Figure: the 0-1 loss as a function of margin.)
Explanation using margins. (Figure: the same plot annotated with the empirical margin distribution: no examples with small margins!)
Experimental Evidence
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics '98). For any convex combination $f(x) = \sum_t \alpha_t h_t(x)$ with $\sum_t \alpha_t = 1$, and any threshold $\theta > 0$:

$$\underbrace{P_{(x,y)\sim D}\big[y f(x) \le 0\big]}_{\text{probability of mistake}} \;\le\; \underbrace{\hat P_{S}\big[y f(x) \le \theta\big]}_{\text{fraction of training examples with small margin}} \;+\; \tilde O\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$

where $m$ is the size of the training sample $S$ and $d$ is the VC dimension of the weak rules. No dependence on the number of weak rules that are combined!
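The first term on the right-hand side is directly measurable. A small helper (my own code, reusing the hypothetical ensemble representation from the earlier sketches) computes the fraction of training examples with normalized margin at most θ:

```python
def small_margin_fraction(ensemble, X, y, theta):
    """Fraction of examples with normalized margin y*f(x) <= theta."""
    alphas = np.array([a for a, _ in ensemble])
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    margins = y * score / alphas.sum()   # normalize so margins lie in [-1, +1]
    return np.mean(margins <= theta)
```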
Suggested optimization problem. (Figure: choose the margin threshold that optimizes the trade-off in the bound.)
Idea of Proof
Applications
Applications of Boosting Academic research Applied research Commercial deployment
Academic research: % test error rates

| Database  | Other            | Boosting | Error reduction |
|-----------|------------------|----------|-----------------|
| Cleveland | 27.2 (DT)        | 16.5     | 39%             |
| Promoters | 22.0 (DT)        | 11.8     | 46%             |
| Letter    | 13.8 (DT)        | 3.5      | 74%             |
| Reuters 4 | 5.8, 6.0, 9.8    | 2.95     | ~60%            |
| Reuters 8 | 11.3, 12.1, 13.4 | 7.4      | ~40%            |
Applied research: “AT&T, How may I help you?” Classify voice requests: voice -> text -> category. Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time. Schapire, Singer, Gorin 98.
Examples:
“Yes I’d like to place a collect call long distance please” -> collect
“Operator I need to make a call but I need to bill it to my office” -> third party
“Yes I’d like to place a call on my master card please” -> calling card
“I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” -> billing credit
Weak rules generated by “boostexter”. (Table: for each of the categories collect call, calling card, and third party, a weak rule scores the category up or down depending on whether a particular word occurs or does not occur in the utterance.)
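In spirit, each BoosTexter weak rule is a one-word decision stump whose vote depends only on whether the word occurs in the utterance. A hedged sketch; the actual words and scores in the slide's table are not recoverable here, so the word "card" and the score values below are hypothetical:

```python
def word_rule(word, score_if_present, score_if_absent):
    """A BoosTexter-style weak rule voting on one word's presence."""
    def h(utterance):
        present = word in utterance.lower().split()
        return score_if_present if present else score_if_absent
    return h

# Hypothetical rule: "card" votes for the "calling card" category.
rule = word_rule("card", +1.0, -0.5)
print(rule("yes i'd like to place a call on my master card please"))  # 1.0
```

(The real system produces a vector of scores, one per category, from each rule; a single score is shown here to keep the sketch small.)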
Results: 7844 training examples (hand transcribed); 1000 test examples (hand / machine transcribed). Accuracy with 20% rejected: machine transcribed 75%, hand transcribed 90%.
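“Accuracy with 20% rejected” presumably means the classifier abstains on the 20% of test utterances where its top score is least confident and is judged on the rest. A minimal sketch of that evaluation (my own code):

```python
def accuracy_with_rejection(scores, correct, reject_frac=0.20):
    """Score only the most confident (1 - reject_frac) of the examples.

    scores: confidence of the predicted category per example;
    correct: boolean array, True where the prediction is right.
    """
    order = np.argsort(scores)                   # least confident first
    keep = order[int(len(order) * reject_frac):]
    return np.mean(correct[keep])
```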
Commercial deployment: distinguish business/residence customers using statistics from call-detail records. Alternating decision trees: similar to boosting decision trees but more flexible, combines very simple rules, can over-fit, so cross-validation is used to decide when to stop. Freund, Mason, Rogers, Pregibon, Cortes 2000.
Massive datasets: 260M calls / day, 230M telephone numbers, label unknown for ~30%. Hancock: software for computing statistical signatures. 100K randomly selected training examples (~10K is enough); training takes about 2 hours. The generated classifier has to be both accurate and efficient.
Alternating tree for “buizocity”
Alternating Tree (Detail)
Precision/recall graphs. (Figure: accuracy as a function of classifier score.)
Business impact: increased coverage from 44% to 56% at ~94% accuracy. Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
Summary: Boosting is a computational method for learning accurate classifiers. Resistance to over-fitting is explained by margins. The underlying explanation: large “neighborhoods” of good classifiers. Boosting has been applied successfully to a variety of classification problems.
Come talk with me! [email_address] http://www.cs.huji.ac.il/~yoavf
Editor's Notes

  • #38: Cleveland: heart disease; features: blood pressure, etc. Promoters: DNA, indicates that an intron is coming. Letter: standard b&w OCR benchmark (Frey & Slate 1991), Roman alphabet, 20 distorted fonts, 16 given attributes (# pixels, moments, etc.). Reuters-21450: 12,902 documents, standard benchmark. Reuters 4: first table A9, A10. Naive Bayes (CMU), Rocchio.
  • #39: AT&T service: sign up or drop
  • #43: 3 years for fraud
  • #44: Call detail: originating, called, and terminating numbers (e.g., for 800 or forwarding), start and end time, quality, termination code (e.g., for wireless), no rate information. Statistics: # calls, distribution within day and week, # of numbers called, type (800, toll, etc.), incoming vs. outgoing.