Advances in Boosted Tree Technology:
TreeNet Model Compression and Optimal Rule Extraction




Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
                    May 2012
                Salford Systems
http://www.salford-systems.com
Beyond TreeNet

• TreeNet has set a high bar for automatic off-the-shelf model
  performance
   – TreeNet was used to win all four 1st place awards in the
     Duke/Teradata churn modeling competition of 2002
   – Awards in 2010, 2009, 2008, 2007, 2004 all based on TreeNet


• TreeNet was first developed (MART) in 1999 and essentially
  perfected in 2000
   – Many improvements since then but the fundamentals are largely
     those of the 2000 technology
• In subsequent work Friedman has introduced major
  extensions that go beyond the framework of boosted trees


Importance Sampled Learning Ensembles (ISLE)

• Friedman’s work in 2003 is somewhat more complex than
  what we describe here
   – Presented his paper at our first data mining conference in San
     Francisco in March of 2004
• We focus on the concept of model compression
• A TreeNet model is grown myopically, one added tree at a time
   –   From the current model, attempt to improve it by predicting residuals
   –   Each tree represents incremental learning and error correction
   –   Slow learning, small steps
   –   During model development we do not know where we are going to
       end up
• Once the TreeNet model is complete, can we review it and “clean
  it up”?
Post-Processing With Regularized Regression

• Friedman’s ISLE takes a TreeNet model as its raw material
  and considers how we can refine it using regression
• Consider: every tree takes our raw data as input and
  generates outputs at the terminal nodes
• Each tree can be thought of as a new variable constructed
  out of the original data
   – No missing values in tree outputs even if there were missing values
     in the raw data
   – Outliers among such predictors are expected to be rare as each
     terminal is doing averaging and the trees are typically small
• Might create many more generated variables than original
  raw variables
   – Boston data set has 13 predictors, TN might generate 1,000 trees
     (see the sketch below)

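A minimal sketch of the “each tree is a new variable” idea, using scikit-learn’s GradientBoostingRegressor and its bundled diabetes data as stand-ins for TreeNet and the Boston data (TreeNet itself is not scriptable this way, so every name below is an assumption):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1,000 small trees, least-squares loss, slow learning
gbm = GradientBoostingRegressor(n_estimators=1000, max_leaf_nodes=6,
                                learning_rate=0.01, random_state=0)
gbm.fit(X_train, y_train)

def tree_feature_matrix(gbm, X):
    """One constructed variable per tree: that tree's raw output per record."""
    # estimators_ has shape (n_estimators, 1) for least-squares regression
    return np.column_stack([tree.predict(X) for tree in gbm.estimators_[:, 0]])

T_train = tree_feature_matrix(gbm, X_train)  # shape (n_samples, 1000)
```

Each column of T_train is complete and well-behaved, which is what makes the second-stage regression safe.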
Regularized Regression

• Modern regression techniques: first Ridge regression, then the
  Lasso, and finally hybrid models
• These methods have advantages over classical regression
   –   Can handle highly correlated variables (Ridge)
   –   Can work with data sets with more columns than rows
   –   Can do variable selection (Lasso, Ridge-Lasso hybrids)
   –   Much more effective and reliable than old-fashioned stepwise
       regression (see the sketch below)
• Regularized regression is still regression and thus suffers
  from all the primary limitations of classical regression
   – No missing value handling
   – Linear additive model (no interactions)
   – Sensitive to functional form of predictors


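A small illustration of the more-columns-than-rows and variable-selection points on synthetic data (the settings here are assumptions for the demo, not Salford's implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 200 columns, only 100 rows: OLS cannot even be estimated here,
# but the Lasso fits and zeroes most coefficients
Xp, yp = make_regression(n_samples=100, n_features=200, n_informative=5,
                         noise=1.0, random_state=0)
lasso = LassoCV(cv=5).fit(Xp, yp)      # penalty chosen by cross-validation
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```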
Regularized Regression Applied to Trees

• Applying regularized regression to trees is not vulnerable
  to these traditional problems
   –   Missing values already handled and transformed to non-missing
   –   Interactions incorporated into the tree structure
   –   Trees are invariant with respect to typical univariate transformations
   –   Any order-preserving transform will not affect the tree
• What will a regularized regression on trees accomplish?
   –   Combine all identical trees into one
   –   Combine several similar trees into a compromise tree
   –   Bypass any meandering while TreeNet searched for the optimum
   –   Reweight the trees (in TN all trees have equal weight)




Regularized Regression of TreeNet

• In this mode of ISLE we develop the best TreeNet model we
  can
• Post-process results allowing for different degrees of
  compression
• By default we run four models on the TreeNet (sketched below)
   –   Ridge (no compression, just reweighting)
   –   Lasso (compression possible)
   –   Ridged Lasso (hybrid of Lasso and Ridge but mostly Lasso)
   –   Compact (maximum compression)
• Goal usually is to find a substantial degree of compression
  while giving up little or nothing on test sample performance
• Could focus only on beating TN performance

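Continuing the earlier sketch (gbm, tree_feature_matrix, and the train/test split defined there), the four default fits might look like the following; approximating the Compact option by a more heavily penalized Lasso is our assumption, not Salford's exact rule:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, Lasso

T_test = tree_feature_matrix(gbm, X_test)

ridge = RidgeCV().fit(T_train, y_train)                         # reweight only
lasso = LassoCV(cv=5).fit(T_train, y_train)                     # select + reweight
enet  = ElasticNetCV(l1_ratio=0.9, cv=5).fit(T_train, y_train)  # mostly-Lasso hybrid
compact = Lasso(alpha=10 * lasso.alpha_).fit(T_train, y_train)  # force compression

for name, m in [("Ridge", ridge), ("Lasso", lasso),
                ("Ridged Lasso", enet), ("Compact", compact)]:
    kept = int(np.sum(m.coef_ != 0))
    print(f"{name:12s} trees kept = {kept:4d}  test R2 = {m.score(T_test, y_test):.3f}")
```

Ridge keeps all 1,000 trees (reweighting only); the Lasso-style fits zero most coefficients, and compression is simply the count of dropped trees.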
Model Compression: Early Days

• TreeNet has always offered model truncation
• Instead of using the fully articulated model, stop the process
  early (see the sketch below)
• In 2005 this method was being used by a major web portal
   – TreeNet model used to predict likely response to an item presented
     to a visitor on a web page (ad, link, photo, story)
   – To implement real-time response, the TN model was limited to its
     first 30 trees
   – Sacrificed considerable predictive accuracy to have a model that
     could score fast enough in real time
   – The TreeNet truncated at 30 trees still beat the other alternatives
   – Consider that the model might have been rebuilt every hour




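With the stand-in gbm from the earlier sketch, truncation is just reading off an early stage of the staged predictions:

```python
from itertools import islice
from sklearn.metrics import r2_score

# staged_predict yields the model's prediction after each added tree,
# so "first 30 trees" is simply the 30th stage
pred_30 = next(islice(gbm.staged_predict(X_test), 29, None))
print("30-tree model test R2:", round(r2_score(y_test, pred_30), 3))
```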
Illustrative Example: Boston Housing Data Set
Set Up Model




TreeNet Controls




1000 trees, Least Squares, AUTO Learnrate
Post Processor Controls:
What Type of Post Processing




Post Processor Details:
Use all defaults




• Standardizing the “trees” gives them all equal weight in the regularized regression
• Worth experimenting with unstandardized trees; larger-variance trees will then dominate
Two Stage Modeling Process

• The first stage here is a TreeNet, but in SPM it could also be
   –   A single CART tree (focus would be on nodes, e.g., from the maximal tree)
   –   An ensemble of CART trees (bagger)
   –   A MARS model (basis functions from the maximal model)
   –   Random Forests
• In ISLE mode we need to operate on a collection of
  variables created by a learning machine
   – These can come from any of our tree engines or MARS

• We will get first-stage results: a model
• Then the second stage: model refinement
   – Model compression or model selection (e.g., tree pruning)

TreeNet Results




Test Set R2 = 0.87875, MSE = 7.407
TreeNet Results: Residual Stats




One substantial outlier, more than 5 IQRs outside the central data range
TreeNet and Compressed TreeNet
Both Models Reported Below




•   The dashed lines show the evolution of the compressed model
•   Because the compressed model can start with any of our 1,000 trees, it starts
    off much better than the original TreeNet, and its first tree carries a fitted coefficient
ISLE Reweighted TreeNet:
Test Data Results




TreeNet vs ISLE Residuals
ISLE is wider in the center but narrower top to bottom

TreeNet Residuals (left); ISLE Compressed TreeNet (right)




Comment on the First Tree

• It is interesting to observe that in this example the
  compressed model with just one tree in it outperforms the
  TreeNet model with just one tree
• Trees are built without look-ahead, but having a menu of
  1,000 trees to choose from allows the 2nd-stage model to do
  better
• The worst case is that the 2nd stage chooses the same first
  tree
• A fitted coefficient can spread out the predictions




TreeNet Model Compression

• TreeNet has set a high bar for predictive accuracy in the
  data mining field
• We now offer several ways in which a TreeNet can be
  further improved by post-processing
• Consider that a TreeNet model is built one step at a time
  without knowledge of where we will end up
   – Some trees are exact or almost exact copies of other trees
   – Some trees may exhibit some “wandering” before the right direction
     is found
   – Trees are each built on a different random subset of the data and
     some trees may just be “unlucky”
   – Post processing can combine multiple copies of essentially the same
     tree and skip any unnecessary wandering

How Much Compression is Possible?

• Our experience derives from working with data from several
  industries (retail sales, online web advertising, credit risk,
  direct marketing)
• Compression of 80% is not uncommon for the best model
  generated by the post-processing
• However, the user is free to truncate the compressed model, as
  it too is built up sequentially (we add one tree at a time to
  the model)
• The user can thus choose from a possibly broad range of
  tradeoffs, opting for even greater compression at the price of
  a less accurate model
• In the BOSTON example 90% compression also performs
  quite well (about 40 trees instead of the optimal 91 trees)
A Comment on the Theory behind ISLE

• In Friedman’s paper on ISLE he provides a rationale for this
  approach quite different from ours
• Consider that our goal is to learn a model from data where it
  is clear that a linear regression is not adequate
• How can we automatically manufacture basis functions that
  capture more complex structure than the raw variables?
   – Imagine offering high order polynomials
   – Some have suggested adding Xi*Xj interactions and also 1/Xi as new
     predictors plus log(Xi) for all strictly positive regressors
   – Friedman proposes TreeNet as a vehicle for generating such new
     variables in the search for a more faithful model (to the truth)
   – Think of TreeNet as a search engine for features (constructed
     predictors)

From Trees to Nodes

• In a second round of work on the idea of post-processing a
  tree ensemble Friedman suggested working with nodes
• Every node in a decision tree (other than the root) defines a
  potentially interesting subset of data
• Analysts have long thought about the terminal nodes of a
  CART tree in this way
   – Each terminal node is a segment or can be thought of as an
     interesting rule
   – Cardell and Steinberg proposed blending CART and logistic
     regression in this way (each terminal node is a dummy variable)
• Now we extend this thinking to all nodes below the root
   – Tibshirani proposed using all the nodes of a maximal tree in a Lasso model
     to “prune” the tree

Nodes in a Single TreeNet Tree
Tree grown to have T=6 terminal nodes


• A typical TreeNet tree has T=6 terminal nodes
• One level down from the root has two nodes
• The next level has 4 nodes (3 terminal)
• The next 2 levels have 2 nodes each
• Total is 10 non-root nodes
• This will always be T + (T-1) - 1 = 2(T-1): T terminal nodes plus
  T-1 internal nodes, minus the root

                                     • Represent each node as a 0/1 indicator
                                     • Record passes through this node (1) or
                                       does not pass through this node (0)



• With 10 node indicators per 6-terminal tree, a 1,000-tree TreeNet will
  generate 10,000 node indicators (see the sketch below)
• Now we want to post-process this node representation of the TreeNet
• The methodology can generate an immense number of predictors

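A sketch of the node-indicator construction on the stand-in gbm from the earlier blocks: decision_path marks every node a record passes through, and dropping the root column leaves the 2(T-1) non-root indicators counted above.

```python
import numpy as np
from scipy.sparse import hstack

def node_indicator_matrix(gbm, X):
    """One 0/1 column per non-root node of every tree in the ensemble."""
    blocks = []
    for tree in gbm.estimators_[:, 0]:
        path = tree.decision_path(X)   # sparse (n_samples, n_nodes), root included
        blocks.append(path[:, 1:])     # drop the root column -- it is always 1
    return hstack(blocks).tocsr()

N_train = node_indicator_matrix(gbm, X_train)
print(N_train.shape)   # roughly 10 indicators per 6-leaf tree, ~10,000 in all
```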
Use Regularized Regression to Post Process

• Essential because even if we start with a small data set
  (rows and columns) we might generate thousands of trees
• The regularized regression is used to
   – SELECT trees (only a subset of the original trees will be used)
   – REWEIGHT trees (originally all had equal weight)
• The new model is still an ensemble of regression trees but
  now recombined differently
   – Some trees might get a negative weight
• New model could have two advantages
   – Could be MUCH smaller than original model (good for deployment)
   – Could be more accurate on holdout data
• No guarantees but results often attractive

Variations on Node Post Processing

•   Pure: nodes (only node dummies in 2nd stage model)
•   Hybrid: nodes + trees (mix of ISLE and nodes)
•   Hybrid: raw predictors + nodes (Friedman’s preferred)
•   Hybrid: raw predictors + ISLE variables
•   Hybrid: raw predictors + ISLE trees + nodes

• In addition we could add the original TreeNet prediction to
  any of these sets of predictors (see the sketch below)
• Ideal interaction detection: include TreeNet prediction from a
  pure additive model and node indicators as regressors



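These variants are just different design matrices for the 2nd-stage regression. Continuing with X_train (raw predictors), T_train (ISLE tree outputs), and N_train (node dummies) from the earlier sketches, assembling them is one concatenation per variant:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

tn_pred = gbm.predict(X_train).reshape(-1, 1)           # original ensemble prediction

pure_nodes  = N_train                                   # Pure: nodes only
nodes_trees = hstack([N_train, csr_matrix(T_train)])    # nodes + ISLE trees
raw_nodes   = hstack([csr_matrix(X_train), N_train])    # Friedman's preferred
raw_isle    = np.hstack([X_train, T_train])             # raw + ISLE variables
with_tn     = hstack([raw_nodes, csr_matrix(tn_pred)])  # any set + TN prediction
```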
Raw Predictor Problems

• Much of our empirical work involves incomplete data
  (missing values), and the 2nd-stage model requires complete
  data (listwise deletion)
• While the hybrid models involving raw variables can capture
  nonlinearity and interactions, the raw predictors act as
  everyday regressors
   – Issue of functional form
   – Issue of outliers
• Using ISLE variables may be far better for working with data
  for which careful cleaning and repair is not an option




Same Data Post-Processing Nodes




• In this example running only on nodes does not do well
   – See the upper dotted performance curves
• Still, we will examine the outputs generated
• Which method works best will vary with the specifics of the data

Pure RuleSeeker




•   Each variable in the model is a node, i.e., a RULE
•   Worthwhile to examine mean target, lift, support, and agreement with test data
•   All shown above
Rule Table:
Display is Sortable




•   The number of terms in a rule is determined by the location of its node in the tree
•   Deep nodes can involve more variables (minimum is one; the maximum equals the
    depth of the tree); see the rule-extraction sketch below
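A sketch of reading the rule off one node of a stand-in scikit-learn tree from the earlier gbm: walking from the root down to the node collects one "feature <= threshold" or "feature > threshold" term per split, so the term count equals the node's depth. The helper name and the choice of node 3 are illustrative only.

```python
from sklearn.datasets import load_diabetes

def rule_for_node(dtree, node, feature_names):
    """Conjunction of split conditions on the path from the root to `node`."""
    t = dtree.tree_
    # parent links: which internal node each child hangs from, and on which side
    parents = {}
    for i in range(t.node_count):
        if t.children_left[i] != -1:                 # -1 marks a leaf
            parents[t.children_left[i]] = (i, "<=")
            parents[t.children_right[i]] = (i, ">")
    terms, child = [], node
    while child != 0:                                # node 0 is the root
        parent, op = parents[child]
        terms.append(f"{feature_names[t.feature[parent]]} {op} "
                     f"{t.threshold[parent]:.3f}")
        child = parent
    return " AND ".join(reversed(terms))

names = load_diabetes().feature_names
print(rule_for_node(gbm.estimators_[0, 0], node=3, feature_names=names))
```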
Rule Statistics




More columns from the Rule Table Display

Lift Report:
High Lifts Represent Interesting Segments




One dot per rule (here displaying test data results); the lift and support computations are sketched below

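Support and lift for a rule follow directly from its node indicator; a small sketch continuing with the same stand-in objects (tree 0, node 3 is again an arbitrary example):

```python
import numpy as np

path = gbm.estimators_[0, 0].decision_path(X_test).toarray()
in_segment = path[:, 3].astype(bool)              # records satisfying the rule

support = in_segment.mean()                       # fraction of records covered
lift = y_test[in_segment].mean() / y_test.mean()  # segment mean vs overall mean
print(f"support = {support:.1%}, lift = {lift:.2f}")
```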
Parametric Bootstrap For Interaction Statistics




Final Details

• We have described RuleSeeker as a way to post-process a
  TreeNet model, and this is a fundamental use of the method
• When our goal from the start is to extract rules, we are
  advised to modify the TreeNet controls in two ways (sketched below)
   – Allow the sizes of the trees to vary at random
   – Use very small subsets of the data when growing each tree
• Friedman recommends an average tree size of 4 terminal
  nodes, with a Poisson distribution generating the varying
  tree sizes (this will often yield a few trees with 10-16 nodes)
• Friedman describes experiments in which each tree in the
  TreeNet is grown on just 5% of the available data
   – The first-stage TreeNet is inferior to a standard TreeNet, but the 2nd
     stage could actually outperform the standard TreeNet
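Standard gradient boosting libraries fix the tree size, so varying it per tree takes a hand-rolled loop. A least-squares sketch under the stated settings (Poisson sizes around 4 leaves, 5% subsamples, about 200 trees); every constant here is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
nu = 0.1                                    # slow learning rate
f0 = y_train.mean()                         # initial model: the target mean
resid = y_train - f0
trees = []
for _ in range(200):                        # Friedman grows only ~200 trees
    leaves = max(2, rng.poisson(4))         # random size; occasionally 10+ leaves
    idx = rng.choice(len(X_train), size=max(2, len(X_train) // 20),
                     replace=False)         # each tree sees just 5% of the data
    t = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(X_train[idx], resid[idx])
    resid = resid - nu * t.predict(X_train) # residual update on the full sample
    trees.append(t)

pred_test = f0 + nu * sum(t.predict(X_test) for t in trees)
```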
RuleSeeker and Huge Data

• If the RuleSeeker approach can in fact outperform standard
  TreeNet, this suggests a sampling approach to massive data
  sets
• Extract rather small (possibly stratified) samples from each
  of many data repositories
• Grow a TreeNet tree
• Repeat random draws to grow subsequent trees
• Friedman’s approach does not grow very many trees (about 200)
• The 2nd-stage regression must be run on a much larger
  sample, but regression is much easier to distribute than trees



RuleSeeker Summary

• A RuleSeeker model has several interesting dimensions
   –   It is a post-processed version of a TreeNet
   –   RuleSeeker model could offer better performance than original TN
   –   RuleSeeker model might also be more compact
   –   Rules extracted could be seen as important INTERACTIONS
   –   Rules could be studied as rules
        • Compare train vs test Lift (want good agreement)
        • Consider tradeoff of Lift versus Support
            – Rules can guide targeting but only worthwhile if support is sufficient




Big Data

• Currently we support 64-bit single-server operation
• A typical modern server means 32 cores and 512GB of RAM
   – Shortly we expect to see 200 cores and 2TB RAM at modest prices
   – Our training data can reach about 1/3 of RAM without disk thrashing
   – 200GB training data (50 million rows by 1,000 predictors)


• MapReduce/Hadoop appears to be the emerging standard
  for massively parallel data stores and computation
• Our approach will be bagging models that extract random
  samples from each of the data stores
• Each mapper and reducer is expected to have 4GB RAM
• We will require reducers to be equipped with 16GB
