An Introduction to Boosting Yoav Freund Banter Inc.
Plan of talk Generative vs. non-generative modeling Boosting Alternating decision trees Boosting and over-fitting Applications
Toy Example: a computer receives a telephone call, measures the pitch of the caller's voice, and decides the gender of the caller. (Figure: human voice, classified as male or female.)
Generative modeling. (Figure: probability density as a function of voice pitch, one Gaussian per class with parameters mean1/var1 and mean2/var2.)
Discriminative approach. (Figure: a decision threshold on voice pitch, chosen to minimize the number of mistakes.)
Ill-behaved data. (Figure: probability as a function of voice pitch for data where the fitted means mean1 and mean2 mislead the generative model, while the threshold minimizing the number of mistakes remains sensible.)
Traditional Statistics vs. Machine Learning. (Diagram: Statistics maps Data to an estimated world state, Decision Theory maps that state to Predictions and Actions, while Machine Learning maps Data to Predictions and Actions directly.)
Comparison of methodologies

| Model               | Generative            | Discriminative         |
|---------------------|-----------------------|------------------------|
| Goal                | Probability estimates | Classification rule    |
| Performance measure | Likelihood            | Misclassification rate |
| Mismatch problems   | Outliers              | Misclassifications     |
Boosting
A weak learner: given a weighted training set (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), with instances x1, x2, …, xn (feature vectors), binary labels y1, y2, …, yn, and non-negative weights that sum to 1, the weak learner outputs a weak rule h. The weak requirement: h must do slightly better than random guessing on the weighted set, i.e., have weighted error at most 1/2 − γ for some advantage γ > 0.
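As a concrete illustration (not part of the original slides), the classic minimal weak learner is a decision stump: threshold a single feature and predict ±1. A sketch in Python, with my own names train_stump and stump_predict:

```python
import numpy as np

def train_stump(X, y, w):
    """Fit a decision stump to a weighted sample.

    X: (n, d) feature matrix; y: labels in {-1, +1};
    w: non-negative weights summing to 1.
    Returns ((feature, threshold, polarity), weighted error).
    """
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                          # exhaustive search over rules
        for thresh in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] > thresh, polarity, -polarity)
                err = w[pred != y].sum()        # weighted training error
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    return best, best_err

def stump_predict(stump, X):
    j, thresh, polarity = stump
    return np.where(X[:, j] > thresh, polarity, -polarity)
```

The weak requirement is then simply that best_err is at most 1/2 − γ for some γ > 0.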
The boosting process: train the weak learner on the uniformly weighted set (x1, y1, 1/n), …, (xn, yn, 1/n) to get h1; reweight the examples and train again on (x1, y1, w1), …, (xn, yn, wn) to get h2; repeat for h3, h4, …, hT. Final rule: Sign[ α1 h1 + α2 h2 + … + αT hT ].
Adaboost. Binary labels $y \in \{-1, +1\}$; $\mathrm{margin}(x,y) = y \sum_t \alpha_t h_t(x)$; $P(x,y) = \frac{1}{Z} \exp(-\mathrm{margin}(x,y))$. Given $h_t$, we choose $\alpha_t$ to minimize $\sum_{(x,y)} \exp(-\mathrm{margin}(x,y))$.
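Putting the pieces together, a minimal AdaBoost loop might look like the sketch below (it reuses the hypothetical stump helpers above; the closed-form choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is the standard minimizer of the exponential loss for a fixed $h_t$):

```python
def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # initial weights: uniform
    ensemble = []                             # list of (alpha_t, rule_t)
    for _ in range(T):
        stump, eps = train_stump(X, y, w)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)  # guard against a perfect rule
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)        # shrink correct, grow mistakes
        w /= w.sum()                          # the 1/Z normalization
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)                     # final rule: sign of the vote
```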
Main property of Adaboost: if the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$ (rule $t$ has weighted error $1/2 - \gamma_t$), then the in-sample error of the final rule is at most $\prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \le \exp\!\left(-2\sum_{t=1}^{T}\gamma_t^2\right)$ (w.r.t. the initial weights).
A demo www.cs.huji.ac.il/~yoavf/adabooost
Adaboost as gradient descent. Discriminator class: a linear discriminator in the space of “weak hypotheses”. Original goal: find the hyperplane with the smallest number of mistakes, known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space). Computational method: use the exponential loss as a surrogate and perform gradient descent.
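One explicit step makes the connection concrete (standard, though not spelled out on the slide): differentiating the exponential loss with respect to the ensemble score $F(x_i)$ on example $i$ gives

$$\frac{\partial}{\partial F(x_i)} \sum_{k=1}^{n} e^{-y_k F(x_k)} \;=\; -\,y_i\, e^{-y_i F(x_i)} \;=\; -\,y_i \, Z \, w_i ,$$

so the gradient at each example is proportional to its AdaBoost weight $w_i$; handing the weights to the weak learner is how the gradient direction is communicated.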
Margins view. (Figure: examples projected onto a weight vector w; the prediction is the sign of the projection and the margin is the label times the projection; a cumulative plot of the number of examples against margin puts mistakes at negative margins and correct predictions at positive margins.)
Adaboost et al. (Figure: loss as a function of margin for Adaboost, Logitboost, and Brownboost, compared with the 0-1 loss; mistakes sit at negative margins, correct predictions at positive margins.)
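For reference, the convex losses on this slide written as functions of the margin m = y·F(x); a sketch (Brownboost's loss also depends on a remaining-time parameter and is omitted here):

```python
import numpy as np

def zero_one_loss(m):    # the loss we actually care about
    return (m <= 0).astype(float)

def adaboost_loss(m):    # exponential loss: upper-bounds 0-1 loss everywhere
    return np.exp(-m)

def logitboost_loss(m):  # logistic loss: grows only linearly for very negative m
    return np.log2(1.0 + np.exp(-m))
```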
One coordinate at a time Adaboost performs  gradient descent  on exponential loss Adds one coordinate ( “weak learner” ) at each iteration. Weak learning in  binary classification  =  slightly better than random guessing.  Weak learning in regression – unclear. Uses  example-weights  to communicate the gradient direction to the weak learner Solves a  computational  problem
What is a good weak learner? The set of weak rules (features) should be  flexible enough to be (weakly) correlated  with most conceivable relations between feature vector and label. Small enough to allow exhaustive search  for the minimal weighted training error. Small enough to avoid over-fitting . Should be able to  calculate predicted label very efficiently . Rules can be  “specialists”  – predict only on a small subset of the input space and  abstain from predicting  on the rest (output 0).
Alternating Trees Joint work with Llew Mason
Decision Trees. (Figure: a tree that tests X>3 at the root; on no it predicts −1, on yes it tests Y>5 and predicts +1 on yes, −1 on no; shown next to the corresponding axis-parallel partition of the X-Y plane at X = 3 and Y = 5.)
Decision tree as a sum: the same tree rewritten so that each node contributes a real-valued score (the figure uses values such as −0.2, ±0.1, +0.2, −0.3), and the predicted label is the sign of the sum of the scores along the instance's path.
An alternating decision tree: real-valued prediction nodes (such as −0.2, +0.1, −0.1, +0.2, −0.3, +0.7) alternate with splitter nodes (Y>5, X>3, Y<1). A new splitter, such as Y<1, can be attached under any prediction node; an instance follows every splitter it reaches, and the predicted label is the sign of the sum of all prediction nodes it passes through.
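A sketch of how such a tree is evaluated: an instance descends through every splitter it reaches, and the label is the sign of the summed prediction values. The data structure below, and the example tree wired up from the slide's numbers, are my own plausible reading, not code from the paper:

```python
class Splitter:
    """A test node with a real-valued prediction on each branch."""
    def __init__(self, test, yes_value, no_value,
                 yes_children=(), no_children=()):
        self.test = test                        # function: instance -> bool
        self.yes_value, self.no_value = yes_value, no_value
        self.yes_children, self.no_children = yes_children, no_children

def adt_score(x, root_value, splitters):
    """Sum the contributions of every prediction node the instance reaches."""
    total, stack = root_value, list(splitters)
    while stack:
        node = stack.pop()
        if node.test(x):
            total += node.yes_value
            stack.extend(node.yes_children)     # only the taken branch is explored
        else:
            total += node.no_value
            stack.extend(node.no_children)
    return total                                # classify with sign(total)

# One plausible wiring of the slide's tree, for instances x = (X, Y):
splitters = [
    Splitter(lambda x: x[1] > 5, +0.2, -0.3,
             no_children=[Splitter(lambda x: x[0] > 3, +0.1, -0.1)]),
    Splitter(lambda x: x[1] < 1, +0.7, 0.0),
]
print(adt_score((4, 0), -0.2, splitters))       # -0.2 - 0.3 + 0.1 + 0.7 = 0.3 -> +1
```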
Example: Medical Diagnostics. The Cleve dataset from the UC Irvine database: heart disease diagnostics (+1 = healthy, −1 = sick), 13 features from tests (real-valued and discrete), 303 instances.
ADtree for the Cleveland heart-disease diagnostics problem. (Figure.)
Cross-validated accuracy

| Learning algorithm | Number of splits | Average test error | Test error variance |
|--------------------|------------------|--------------------|---------------------|
| ADtree             | 6                | 17.0%              | 0.6%                |
| C5.0               | 27               | 27.2%              | 0.5%                |
| C5.0 + boosting    | 446              | 20.2%              | 0.5%                |
| Boost Stumps       | 16               | 16.5%              | 0.8%                |
Boosting and over-fitting
Curious phenomenon: when boosting decision trees, we fit more than 2,000,000 parameters using fewer than 10,000 training examples, yet the combined classifier does not over-fit.
Explanation using margins. (Figure: the 0-1 loss as a function of margin.)
Explanation using margins. (Figure: the same plot annotated with the empirical margin distribution: no examples with small margins!)
Experimental Evidence
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics '98). For any convex combination $f(x) = \sum_t \alpha_t h_t(x)$ with $\sum_t \alpha_t = 1$, and any threshold $\theta > 0$:

$$\underbrace{P_{(x,y)\sim D}\big[y f(x) \le 0\big]}_{\text{probability of mistake}} \;\le\; \underbrace{\hat P_{S}\big[y f(x) \le \theta\big]}_{\text{fraction of training examples with small margin}} \;+\; \tilde O\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$

where $m$ is the size of the training sample $S$ and $d$ is the VC dimension of the weak rules. No dependence on the number of weak rules that are combined!
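The first term on the right-hand side is directly measurable. A small helper (my own code, reusing the hypothetical ensemble representation from the earlier sketches) computes the fraction of training examples with normalized margin at most θ:

```python
def small_margin_fraction(ensemble, X, y, theta):
    """Fraction of examples with normalized margin y*f(x) <= theta."""
    alphas = np.array([a for a, _ in ensemble])
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    margins = y * score / alphas.sum()   # normalize so margins lie in [-1, +1]
    return np.mean(margins <= theta)
```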
Suggested optimization problem. (Figure: choose the margin threshold that optimizes the trade-off in the bound.)
Idea of Proof
Applications
Applications of Boosting Academic research Applied research Commercial deployment
Academic research: % test error rates

| Database  | Other            | Boosting | Error reduction |
|-----------|------------------|----------|-----------------|
| Cleveland | 27.2 (DT)        | 16.5     | 39%             |
| Promoters | 22.0 (DT)        | 11.8     | 46%             |
| Letter    | 13.8 (DT)        | 3.5      | 74%             |
| Reuters 4 | 5.8, 6.0, 9.8    | 2.95     | ~60%            |
| Reuters 8 | 11.3, 12.1, 13.4 | 7.4      | ~40%            |
Applied research: “AT&T, How may I help you?” Classify voice requests: voice -> text -> category. Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time. Schapire, Singer, Gorin 98.
Examples:
“Yes I’d like to place a collect call long distance please” -> collect
“Operator I need to make a call but I need to bill it to my office” -> third party
“Yes I’d like to place a call on my master card please” -> calling card
“I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” -> billing credit
Weak rules generated by “boostexter”. (Table: for each of the categories collect call, calling card, and third party, a weak rule scores the category up or down depending on whether a particular word occurs or does not occur in the utterance.)
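In spirit, each BoosTexter weak rule is a one-word decision stump whose vote depends only on whether the word occurs in the utterance. A hedged sketch; the actual words and scores in the slide's table are not recoverable here, so the word "card" and the score values below are hypothetical:

```python
def word_rule(word, score_if_present, score_if_absent):
    """A BoosTexter-style weak rule voting on one word's presence."""
    def h(utterance):
        present = word in utterance.lower().split()
        return score_if_present if present else score_if_absent
    return h

# Hypothetical rule: "card" votes for the "calling card" category.
rule = word_rule("card", +1.0, -0.5)
print(rule("yes i'd like to place a call on my master card please"))  # 1.0
```

(The real system produces a vector of scores, one per category, from each rule; a single score is shown here to keep the sketch small.)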
Results: 7844 training examples (hand transcribed); 1000 test examples (hand / machine transcribed). Accuracy with 20% rejected: machine transcribed 75%, hand transcribed 90%.
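“Accuracy with 20% rejected” presumably means the classifier abstains on the 20% of test utterances where its top score is least confident and is judged on the rest. A minimal sketch of that evaluation (my own code):

```python
def accuracy_with_rejection(scores, correct, reject_frac=0.20):
    """Score only the most confident (1 - reject_frac) of the examples.

    scores: confidence of the predicted category per example;
    correct: boolean array, True where the prediction is right.
    """
    order = np.argsort(scores)                   # least confident first
    keep = order[int(len(order) * reject_frac):]
    return np.mean(correct[keep])
```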
Commercial deployment: distinguish business/residence customers using statistics from call-detail records. Alternating decision trees: similar to boosting decision trees but more flexible, combines very simple rules, can over-fit, so cross-validation is used to decide when to stop. Freund, Mason, Rogers, Pregibon, Cortes 2000.
Massive datasets: 260M calls / day, 230M telephone numbers, label unknown for ~30%. Hancock: software for computing statistical signatures. 100K randomly selected training examples (~10K is enough); training takes about 2 hours. The generated classifier has to be both accurate and efficient.
Alternating tree for “buizocity”
Alternating Tree (Detail)
Precision/recall graphs. (Figure: accuracy as a function of classifier score.)
Business impact: increased coverage from 44% to 56% at ~94% accuracy. Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
Summary: Boosting is a computational method for learning accurate classifiers. Resistance to over-fitting is explained by margins. The underlying explanation: large “neighborhoods” of good classifiers. Boosting has been applied successfully to a variety of classification problems.
Come talk with me! [email_address] http://www.cs.huji.ac.il/~yoavf
Editor's Notes

  • #38: Cleveland: heart disease; features: blood pressure, etc. Promoters: DNA, indicates that an intron is coming. Letter: standard b&w OCR benchmark (Frey & Slate 1991), Roman alphabet, 20 distorted fonts, 16 given attributes (# pixels, moments, etc.). Reuters-21450: 12,902 documents, standard benchmark. Reuters 4: first table A9, A10. Naive Bayes (CMU), Rocchio.
  • #39: AT&T service: sign up or drop
  • #43: 3 years for fraud
  • #44: Call detail: originating, called, and terminating numbers (e.g., for 800 or forwarding), start and end time, quality, termination code (e.g., for wireless), no rate information. Statistics: # calls, distribution within day and week, # of numbers called, type (800, toll, etc.), incoming vs. outgoing.