Advances in Boosted Tree Technology:
TreeNet Model Compression and Optimal Rule Extraction




Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
                    May 2012
                Salford Systems
http://www.salford-systems.com
Beyond TreeNet

• TreeNet has set a high bar for automatic off-the-shelf model
  performance
   – TreeNet was used to win all four 1st place awards in the
     Duke/Teradata churn modeling competition of 2002
   – Awards in 2010, 2009, 2008, 2007, 2004 all based on TreeNet


• TreeNet was first developed (MART) in 1999 and essentially
  perfected in 2000
   – Many improvements since then but the fundamentals are largely
     those of the 2000 technology
• In subsequent work Friedman has introduced major
  extensions that go beyond the framework of boosted trees


Importance Sampled Learning Ensembles (ISLE)

• Friedman’s work in 2003 is somewhat more complex than
  what we describe here
   – Presented his paper at our first data mining conference in San
     Francisco in March of 2004
• We focus on the concept of model compression
• A TreeNet model is grown myopically, one added tree at a time
   –   From the current model, attempt to improve it by predicting residuals
   –   Each tree represents incremental learning and error correction
   –   Slow learning, small steps
   –   During model development we do not know where we are going to
       end up
• Once the TreeNet model is complete, can we review it and “clean
  it up”?
Post-Processing With Regularized Regression

• Friedman’s ISLE takes a TreeNet model as its raw material
  and considers how we can refine it using regression
• Consider: every tree takes our raw data as input and
  generates outputs at the terminal nodes
• Each tree can be thought of as a new variable constructed
  out of the original data
   – No missing values in tree outputs even if there were missing values
     in the raw data
   – Outliers among such predictors are expected to be rare as each
     terminal is doing averaging and the trees are typically small
• Might create many more generated variables than original
  raw variables
   – Boston data set has 13 predictors, TN might generate 1,000 trees
     (see the sketch below)

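A minimal sketch of the “each tree is a new variable” idea, using scikit-learn’s GradientBoostingRegressor and its bundled diabetes data as stand-ins for TreeNet and the Boston data (TreeNet itself is not scriptable this way, so every name below is an assumption):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1,000 small trees, least-squares loss, slow learning
gbm = GradientBoostingRegressor(n_estimators=1000, max_leaf_nodes=6,
                                learning_rate=0.01, random_state=0)
gbm.fit(X_train, y_train)

def tree_feature_matrix(gbm, X):
    """One constructed variable per tree: that tree's raw output per record."""
    # estimators_ has shape (n_estimators, 1) for least-squares regression
    return np.column_stack([tree.predict(X) for tree in gbm.estimators_[:, 0]])

T_train = tree_feature_matrix(gbm, X_train)  # shape (n_samples, 1000)
```

Each column of T_train is complete and well-behaved, which is what makes the second-stage regression safe.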
Regularized Regression

• Modern regression techniques: first Ridge regression, then the
  Lasso, and finally hybrid models
• These methods have advantages over classical regression
   –   Can handle highly correlated variables (Ridge)
   –   Can work with data sets with more columns than rows
   –   Can do variable selection (Lasso, Ridge-Lasso hybrids)
   –   Much more effective and reliable than old-fashioned stepwise
       regression (see the sketch below)
• Regularized regression is still regression and thus suffers
  from all the primary limitations of classical regression
   – No missing value handling
   – Linear additive model (no interactions)
   – Sensitive to functional form of predictors


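A small illustration of the more-columns-than-rows and variable-selection points on synthetic data (the settings here are assumptions for the demo, not Salford's implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 200 columns, only 100 rows: OLS cannot even be estimated here,
# but the Lasso fits and zeroes most coefficients
Xp, yp = make_regression(n_samples=100, n_features=200, n_informative=5,
                         noise=1.0, random_state=0)
lasso = LassoCV(cv=5).fit(Xp, yp)      # penalty chosen by cross-validation
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```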
Regularized Regression Applied to Trees

• Applying regularized regression to trees is not vulnerable
  to these traditional problems
   –   Missing values already handled and transformed to non-missing
   –   Interactions incorporated into the tree structure
   –   Trees are invariant with respect to typical univariate transformations
   –   Any order-preserving transform will not affect the tree
• What will a regularized regression on trees accomplish?
   –   Combine all identical trees into one
   –   Combine several similar trees into a compromise tree
   –   Bypass any meandering while TreeNet searched for the optimum
   –   Reweight the trees (in TN all trees have equal weight)




Regularized Regression of TreeNet

• In this mode of ISLE we develop the best TreeNet model we
  can
• Post-process results allowing for different degrees of
  compression
• By default we run four models on the TreeNet (sketched below)
   –   Ridge (no compression, just reweighting)
   –   Lasso (compression possible)
   –   Ridged Lasso (hybrid of Lasso and Ridge but mostly Lasso)
   –   Compact (maximum compression)
• Goal usually is to find a substantial degree of compression
  while giving up little or nothing on test sample performance
• Could focus only on beating TN performance

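Continuing the earlier sketch (gbm, tree_feature_matrix, and the train/test split defined there), the four default fits might look like the following; approximating the Compact option by a more heavily penalized Lasso is our assumption, not Salford's exact rule:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, Lasso

T_test = tree_feature_matrix(gbm, X_test)

ridge = RidgeCV().fit(T_train, y_train)                         # reweight only
lasso = LassoCV(cv=5).fit(T_train, y_train)                     # select + reweight
enet  = ElasticNetCV(l1_ratio=0.9, cv=5).fit(T_train, y_train)  # mostly-Lasso hybrid
compact = Lasso(alpha=10 * lasso.alpha_).fit(T_train, y_train)  # force compression

for name, m in [("Ridge", ridge), ("Lasso", lasso),
                ("Ridged Lasso", enet), ("Compact", compact)]:
    kept = int(np.sum(m.coef_ != 0))
    print(f"{name:12s} trees kept = {kept:4d}  test R2 = {m.score(T_test, y_test):.3f}")
```

Ridge keeps all 1,000 trees (reweighting only); the Lasso-style fits zero most coefficients, and compression is simply the count of dropped trees.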
Model Compression: Early Days

• TreeNet has always offered model truncation
• Instead of using the fully articulated model, stop the process
  early (see the sketch below)
• In 2005 this method was being used by a major web portal
   – TreeNet model used to predict likely response to an item presented
     to a visitor on a web page (ad, link, photo, story)
   – To implement real-time response, the TN model was limited to its
     first 30 trees
   – Sacrificed considerable predictive accuracy to have a model that
     could score fast enough in real time
   – The TreeNet truncated at 30 trees still beat the other alternatives
   – Consider that the model might have been rebuilt every hour




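With the stand-in gbm from the earlier sketch, truncation is just reading off an early stage of the staged predictions:

```python
from itertools import islice
from sklearn.metrics import r2_score

# staged_predict yields the model's prediction after each added tree,
# so "first 30 trees" is simply the 30th stage
pred_30 = next(islice(gbm.staged_predict(X_test), 29, None))
print("30-tree model test R2:", round(r2_score(y_test, pred_30), 3))
```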
Illustrative Example: Boston Housing Data Set
Set Up Model




TreeNet Controls




1000 trees, Least Squares, AUTO Learnrate
Post Processor Controls:
What Type of Post Processing




Post Processor Details:
Use all defaults




• Standardizing the “trees” gives them all equal weight in the regularized regression
• Worth experimenting with unstandardized trees; larger-variance trees will then dominate
Two Stage Modeling Process

• The first stage here is a TreeNet, but in SPM it could also be
   –   A single CART tree (focus would be on nodes, e.g., from the maximal tree)
   –   An ensemble of CART trees (bagger)
   –   A MARS model (basis functions from the maximal model)
   –   Random Forests
• In ISLE mode we need to operate on a collection of
  variables created by a learning machine
   – These can come from any of our tree engines or MARS

• We will get first-stage results: a model
• Then the second stage: model refinement
   – Model compression or model selection (e.g., tree pruning)

TreeNet Results




Test Set R2 = 0.87875, MSE = 7.407
TreeNet Results: Residual Stats




One substantial outlier, more than 5 IQRs outside the central data range
TreeNet and Compressed TreeNet
Both Models Reported Below




•   The dashed lines show the evolution of the compressed model
•   Because the compressed model can start with any of our 1,000 trees, it starts
    off much better than the original TreeNet, and its first tree carries a fitted coefficient
ISLE Reweighted TreeNet:
Test Data Results




TreeNet vs ISLE Residuals
ISLE is wider in the center but narrower top to bottom

TreeNet Residuals (left); ISLE Compressed TreeNet (right)




Comment on the First Tree

• It is interesting to observe that in this example the
  compressed model with just one tree in it outperforms the
  TreeNet model with just one tree
• Trees are built without look-ahead, but having a menu of
  1,000 trees to choose from allows the 2nd-stage model to do
  better
• The worst case is that the 2nd stage chooses the same first
  tree
• A fitted coefficient can spread out the predictions




TreeNet Model Compression

• TreeNet has set a high bar for predictive accuracy in the
  data mining field
• We now offer several ways in which a TreeNet can be
  further improved by post-processing
• Consider that a TreeNet model is built one step at a time
  without knowledge of where we will end up
   – Some trees are exact or almost exact copies of other trees
   – Some trees may exhibit some “wandering” before the right direction
     is found
   – Trees are each built on a different random subset of the data and
     some trees may just be “unlucky”
   – Post processing can combine multiple copies of essentially the same
     tree and skip any unnecessary wandering

How Much Compression is Possible?

• Our experience derives from working with data from several
  industries (retail sales, online web advertising, credit risk,
  direct marketing)
• Compression of 80% is not uncommon for the best model
  generated by the post-processing
• However, the user is free to truncate the compressed model, as
  it too is built up sequentially (we add one tree at a time to
  the model)
• The user can thus choose from a possibly broad range of
  tradeoffs, opting for even greater compression at the price of
  a less accurate model
• In the BOSTON example 90% compression also performs
  quite well (about 40 trees instead of the optimal 91 trees)
A Comment on the Theory behind ISLE

• In Friedman’s paper on ISLE he provides a rationale for this
  approach quite different from ours
• Consider that our goal is to learn a model from data where it
  is clear that a linear regression is not adequate
• How can we automatically manufacture basis functions that
  capture more complex structure than the raw variables?
   – Imagine offering high order polynomials
   – Some have suggested adding Xi*Xj interactions and also 1/Xi as new
     predictors plus log(Xi) for all strictly positive regressors
   – Friedman proposes TreeNet as a vehicle for generating such new
     variables in the search for a more faithful model (to the truth)
   – Think of TreeNet as a search engine for features (constructed
     predictors)

From Trees to Nodes

• In a second round of work on the idea of post-processing a
  tree ensemble Friedman suggested working with nodes
• Every node in a decision tree (other than the root) defines a
  potentially interesting subset of data
• Analysts have long thought about the terminal nodes of a
  CART tree in this way
   – Each terminal node is a segment or can be thought of as an
     interesting rule
   – Cardell and Steinberg proposed blending CART and logistic
     regression in this way (each terminal node is a dummy variable)
• Now we extend this thinking to all nodes below the root
   – Tibshirani proposed using all the nodes of a maximal tree in a Lasso model
     to “prune” the tree

Nodes in a Single TreeNet Tree
Tree grown to have T=6 terminal nodes


• A typical TreeNet tree has T=6 terminal nodes
• One level down from the root has two nodes
• The next level has 4 nodes (3 terminal)
• The next 2 levels have 2 nodes each
• Total is 10 non-root nodes
• This will always be T + (T-1) - 1 = 2(T-1): T terminal nodes plus
  T-1 internal nodes, minus the root

                                     • Represent each node as a 0/1 indicator
                                     • Record passes through this node (1) or
                                       does not pass through this node (0)



• With 10 node indicators per 6-terminal tree, a 1,000-tree TreeNet will
  generate 10,000 node indicators (see the sketch below)
• Now we want to post-process this node representation of the TreeNet
• The methodology can generate an immense number of predictors

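A sketch of the node-indicator construction on the stand-in gbm from the earlier blocks: decision_path marks every node a record passes through, and dropping the root column leaves the 2(T-1) non-root indicators counted above.

```python
import numpy as np
from scipy.sparse import hstack

def node_indicator_matrix(gbm, X):
    """One 0/1 column per non-root node of every tree in the ensemble."""
    blocks = []
    for tree in gbm.estimators_[:, 0]:
        path = tree.decision_path(X)   # sparse (n_samples, n_nodes), root included
        blocks.append(path[:, 1:])     # drop the root column -- it is always 1
    return hstack(blocks).tocsr()

N_train = node_indicator_matrix(gbm, X_train)
print(N_train.shape)   # roughly 10 indicators per 6-leaf tree, ~10,000 in all
```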
Use Regularized Regression to Post Process

• Essential because even if we start with a small data set
  (rows and columns) we might generate thousands of trees
• The regularized regression is used to
   – SELECT trees (only a subset of the original trees will be used)
   – REWEIGHT trees (originally all had equal weight)
• The new model is still an ensemble of regression trees but
  now recombined differently
   – Some trees might get a negative weight
• New model could have two advantages
   – Could be MUCH smaller than original model (good for deployment)
   – Could be more accurate on holdout data
• No guarantees but results often attractive

Variations on Node Post Processing

•   Pure: nodes (only node dummies in 2nd stage model)
•   Hybrid: nodes + trees (mix of ISLE and nodes)
•   Hybrid: raw predictors + nodes (Friedman’s preferred)
•   Hybrid: raw predictors + ISLE variables
•   Hybrid: raw predictors + ISLE trees + nodes

• In addition we could add the original TreeNet prediction to
  any of these sets of predictors (see the sketch below)
• Ideal interaction detection: include TreeNet prediction from a
  pure additive model and node indicators as regressors



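These variants are just different design matrices for the 2nd-stage regression. Continuing with X_train (raw predictors), T_train (ISLE tree outputs), and N_train (node dummies) from the earlier sketches, assembling them is one concatenation per variant:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

tn_pred = gbm.predict(X_train).reshape(-1, 1)           # original ensemble prediction

pure_nodes  = N_train                                   # Pure: nodes only
nodes_trees = hstack([N_train, csr_matrix(T_train)])    # nodes + ISLE trees
raw_nodes   = hstack([csr_matrix(X_train), N_train])    # Friedman's preferred
raw_isle    = np.hstack([X_train, T_train])             # raw + ISLE variables
with_tn     = hstack([raw_nodes, csr_matrix(tn_pred)])  # any set + TN prediction
```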
Raw Predictor Problems

• Much of our empirical work involves incomplete data
  (missing values), and the 2nd-stage model requires complete
  data (listwise deletion)
• While the hybrid models involving raw variables can capture
  nonlinearity and interactions, the raw predictors act as
  everyday regressors
   – Issue of functional form
   – Issue of outliers
• Using ISLE variables may be far better for working with data
  for which careful cleaning and repair is not an option




Same Data Post-Processing Nodes




• In this example running only on nodes does not do well
   – See the upper dotted performance curves
• Still, we will examine the outputs generated
• Which method works best will vary with the specifics of the data

Pure RuleSeeker




•   Each variable in the model is a node, i.e., a RULE
•   Worthwhile to examine mean target, lift, support, and agreement with test data
•   All shown above
Rule Table:
Display is Sortable




•   The number of terms in a rule is determined by the location of its node in the tree
•   Deep nodes can involve more variables (minimum is one; the maximum equals the
    depth of the tree); see the rule-extraction sketch below
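A sketch of reading the rule off one node of a stand-in scikit-learn tree from the earlier gbm: walking from the root down to the node collects one "feature <= threshold" or "feature > threshold" term per split, so the term count equals the node's depth. The helper name and the choice of node 3 are illustrative only.

```python
from sklearn.datasets import load_diabetes

def rule_for_node(dtree, node, feature_names):
    """Conjunction of split conditions on the path from the root to `node`."""
    t = dtree.tree_
    # parent links: which internal node each child hangs from, and on which side
    parents = {}
    for i in range(t.node_count):
        if t.children_left[i] != -1:                 # -1 marks a leaf
            parents[t.children_left[i]] = (i, "<=")
            parents[t.children_right[i]] = (i, ">")
    terms, child = [], node
    while child != 0:                                # node 0 is the root
        parent, op = parents[child]
        terms.append(f"{feature_names[t.feature[parent]]} {op} "
                     f"{t.threshold[parent]:.3f}")
        child = parent
    return " AND ".join(reversed(terms))

names = load_diabetes().feature_names
print(rule_for_node(gbm.estimators_[0, 0], node=3, feature_names=names))
```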
Rule Statistics




More columns from the Rule Table Display

Lift Report:
High Lifts Represent Interesting Segments




One dot per rule (here displaying test data results); the lift and support computations are sketched below

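Support and lift for a rule follow directly from its node indicator; a small sketch continuing with the same stand-in objects (tree 0, node 3 is again an arbitrary example):

```python
import numpy as np

path = gbm.estimators_[0, 0].decision_path(X_test).toarray()
in_segment = path[:, 3].astype(bool)              # records satisfying the rule

support = in_segment.mean()                       # fraction of records covered
lift = y_test[in_segment].mean() / y_test.mean()  # segment mean vs overall mean
print(f"support = {support:.1%}, lift = {lift:.2f}")
```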
Parametric Bootstrap For Interaction Statistics




Final Details

• We have described RuleSeeker as a way to post-process a
  TreeNet model, and this is a fundamental use of the method
• When our goal from the start is to extract rules, we are
  advised to modify the TreeNet controls in two ways (sketched below)
   – Allow the sizes of the trees to vary at random
   – Use very small subsets of the data when growing each tree
• Friedman recommends an average tree size of 4 terminal
  nodes, with a Poisson distribution generating the varying
  tree sizes (this will often yield a few trees with 10-16 nodes)
• Friedman describes experiments in which each tree in the
  TreeNet is grown on just 5% of the available data
   – The first-stage TreeNet is inferior to a standard TreeNet, but the 2nd
     stage could actually outperform the standard TreeNet
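Standard gradient boosting libraries fix the tree size, so varying it per tree takes a hand-rolled loop. A least-squares sketch under the stated settings (Poisson sizes around 4 leaves, 5% subsamples, about 200 trees); every constant here is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
nu = 0.1                                    # slow learning rate
f0 = y_train.mean()                         # initial model: the target mean
resid = y_train - f0
trees = []
for _ in range(200):                        # Friedman grows only ~200 trees
    leaves = max(2, rng.poisson(4))         # random size; occasionally 10+ leaves
    idx = rng.choice(len(X_train), size=max(2, len(X_train) // 20),
                     replace=False)         # each tree sees just 5% of the data
    t = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(X_train[idx], resid[idx])
    resid = resid - nu * t.predict(X_train) # residual update on the full sample
    trees.append(t)

pred_test = f0 + nu * sum(t.predict(X_test) for t in trees)
```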
RuleSeeker and Huge Data

• If the RuleSeeker approach can in fact outperform standard
  TreeNet, this suggests a sampling approach to massive data
  sets
• Extract rather small (possibly stratified) samples from each
  of many data repositories
• Grow a TreeNet tree
• Repeat random draws to grow subsequent trees
• Friedman’s approach does not grow very many trees (about 200)
• The 2nd-stage regression must be run on a much larger
  sample, but regression is much easier to distribute than trees



RuleSeeker Summary

• A RuleSeeker model has several interesting dimensions
   –   It is a post-processed version of a TreeNet
   –   RuleSeeker model could offer better performance than original TN
   –   RuleSeeker model might also be more compact
   –   Rules extracted could be seen as important INTERACTIONS
   –   Rules could be studied as rules
        • Compare train vs test Lift (want good agreement)
        • Consider tradeoff of Lift versus Support
            – Rules can guide targeting but only worthwhile if support is sufficient




Big Data

• Currently we support 64-bit single-server operation
• A typical modern server means 32 cores and 512GB of RAM
   – Shortly we expect to see 200 cores and 2TB RAM at modest prices
   – Our training data can reach about 1/3 of RAM without disk thrashing
   – 200GB training data (50 million rows by 1,000 predictors)


• MapReduce/Hadoop appears to be the emerging standard
  for massively parallel data stores and computation
• Our approach will be bagging models that extract random
  samples from each of the data stores
• Each mapper and reducer is expected to have 4GB RAM
• We will require reducers to be equipped with 16GB
