The caret Package: A Unified Interface for Predictive Models

Max Kuhn

Pfizer Global R&D
Nonclinical Statistics
Groton, CT
max.kuhn@pfizer.com

May 12, 2011

Shameless Plug # 1: Courses

I’ll be teaching 2 R classes here for Predictive Analytics World.

R Bootcamp (October 16)
R for Predictive Modeling: A Hands-On Introduction (October 17)

http://www.predictiveanalyticsworld.com/newyork/2011/

Motivation

Theorem (No Free Lunch)
In the absence of any knowledge about the prediction problem, no model
can be said to be uniformly better than any other

Given this, it makes sense to use a variety of different models to find one
that best fits the data

R has many packages for predictive modeling (aka machine learning, aka
pattern recognition) . . .


Model Function Consistency
Since there are many modeling packages written by different people, there
are some inconsistencies in how models are specified and predictions are
made.

For example, many models have only one method of specifying the model
(e.g. formula method only)

The table below shows the syntax to get probability estimates from several
classification models:
obj Class    Package   predict Function Syntax
lda          MASS      predict(obj) (no options needed)
glm          stats     predict(obj, type = "response")
gbm          gbm       predict(obj, type = "response", n.trees)
mda          mda       predict(obj, type = "posterior")
rpart        rpart     predict(obj, type = "prob")
Weka         RWeka     predict(obj, type = "probability")
LogitBoost   caTools   predict(obj, type = "raw", nIter)
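
To make the inconsistency concrete, here is a minimal sketch of the kind of wrapper a unified interface has to provide. The helper name class_prob is hypothetical (it is not a caret function), and the sketch covers only three of the model classes in the table, fitted to a two-class subset of iris.

```r
library(MASS)   # lda
library(rpart)  # rpart

# Hypothetical helper (not part of caret): return P(second class level)
# regardless of which package produced the fitted object.
class_prob <- function(obj, newdata) {
  if (inherits(obj, "lda")) {
    predict(obj, newdata)$posterior[, 2]
  } else if (inherits(obj, "glm")) {
    unname(predict(obj, newdata, type = "response"))
  } else if (inherits(obj, "rpart")) {
    predict(obj, newdata, type = "prob")[, 2]
  } else {
    stop("unsupported model class")
  }
}

# Two-class example data: versicolor vs. virginica
dat <- droplevels(subset(iris, Species != "setosa"))

fits <- list(
  lda   = lda(Species ~ Sepal.Length + Sepal.Width, data = dat),
  glm   = glm(Species ~ Sepal.Length + Sepal.Width, data = dat,
              family = binomial),
  rpart = rpart(Species ~ Sepal.Length + Sepal.Width, data = dat)
)

# One call signature for every model; one column of probabilities per model
probs <- sapply(fits, class_prob, newdata = dat)
head(probs)
```

Each branch hides a different predict convention, which is exactly the bookkeeping a common interface is meant to remove.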


The caret Package

The caret package was developed to:
- create a unified interface for modeling and prediction
- streamline model tuning using resampling
- provide a variety of “helper” functions and classes for day–to–day
  model building tasks
- increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005
First version on CRAN: 10/2007
Website: http://caret.r-forge.r-project.org
JSS Paper: www.jstatsoft.org/v28/i05/paper
4 package vignettes (82 pages total)


Example Data: TunedIT Music Challenge

http://tunedit.org/challenge/music-retrieval/genres
Using 191 descriptors, classify 12495 musical segments into one of 6
genres: Blues, Classical, Jazz, Metal, Pop, Rock.
Use these data to predict a large test set of music segments.


Example Data: TunedIT Music Challenge

The predictors and class variables are contained in a data frame called
music.

> head(music[,1:5])
TC SC SC_V ASE1 ASE2
1 2.5788 481.45 76989.0 -0.12334 -0.11578
2 2.7195 1405.30 825380.0 -0.17655 -0.18323
3 2.5351 601.09 686240.0 -0.13940 -0.13251
4 2.4465 637.73 122580.0 -0.14995 -0.14802
5 2.5657 776.86 124010.0 -0.16863 -0.16112
6 2.7737 447.09 8531.9 -0.16128 -0.15742
> head(music$GENRE)
[1] Pop Blues Pop Jazz Jazz Classical
Levels: Blues Classical Jazz Metal Pop Rock


Data Splitting

createDataPartition conducts stratified random splits
> ## Create a test set with 25% of the data
> set.seed(1)
> inTrain <- createDataPartition(music$GENRE, p = .75)[[1]]
> length(inTrain)
[1] 9373
> head(inTrain)
[1] 2 7 14 20 22 47

createDataPartition returns a list with one element per resample (here a
single split, extracted with [[1]]). Each element is an integer vector of
the row indices selected for that resample.
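
The idea behind a stratified split can be sketched in base R. This is an illustration of the concept only, not caret's implementation; the helper name stratified_split is hypothetical and iris stands in for the music data.

```r
# Hypothetical helper, not caret's implementation: sample a fraction p of the
# row indices within each class so that class proportions are preserved.
stratified_split <- function(y, p = 0.75) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    i[sample.int(length(i), ceiling(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
in_train <- stratified_split(iris$Species, p = 0.75)

# Every class contributes the same share of its rows to the training set
table(iris$Species[in_train])
```

Sampling within each class is what keeps the class proportions of the training set close to those of the full data.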


Data Splitting

> trainDescr <- music[ inTrain, -ncol(music)]
> testDescr <- music[-inTrain, -ncol(music)]
> trainClass <- music$GENRE[ inTrain]
> testClass <- music$GENRE[-inTrain]
> prop.table(table(music$GENRE))
     Blues  Classical       Jazz      Metal        Pop       Rock
0.12773109 0.27563025 0.24033613 0.07394958 0.12605042 0.15630252
> prop.table(table(trainClass))
trainClass
     Blues  Classical       Jazz      Metal        Pop       Rock
0.12770724 0.27557879 0.24037128 0.07393577 0.12610690 0.15630001

Other functions: createFolds, createMultiFolds, createResamples


Data Pre–Processing Methods

preProcess calculates values from one data set that can then be applied to
any data set (e.g. training set, test set, unknowns).
Current methods: centering, scaling, spatial sign transformation, PCA or
ICA “signal extraction”, imputation (via bagging or k–nearest neighbors),
Box–Cox transformations

> ## Determine means and sd's
> procValues <- preProcess(trainDescr, method = c("center", "scale"))
> procValues
> ## Use the predict methods to do the adjustments
> trainScaled <- predict(procValues, trainDescr)
> testScaled <- predict(procValues, testDescr)

preProcess can also be called within other functions, such as train, for
each resampling iteration.
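
The key point is that the statistics are estimated on the training set only and then reused everywhere else. A base R sketch of that idea for centering and scaling, using iris as a stand-in data set (this is not how preProcess is implemented internally):

```r
set.seed(1)
in_train <- sample(nrow(iris), 100)
train_x  <- as.matrix(iris[in_train, 1:4])
test_x   <- as.matrix(iris[-in_train, 1:4])

# Estimate the means and standard deviations from the training set only
mu  <- colMeans(train_x)
sdv <- apply(train_x, 2, sd)

# Apply the same training-set statistics to both data sets
train_scaled <- scale(train_x, center = mu, scale = sdv)
test_scaled  <- scale(test_x,  center = mu, scale = sdv)

# Training columns have mean 0 and sd 1; test columns generally do not
round(colMeans(train_scaled), 10)
round(colMeans(test_scaled), 3)
```

Reusing the training-set values avoids leaking information from the test set into the pre-processing step.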


Model Tuning Using Resampling

Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions;
end
Determine the optimal parameter set;

Model Tuning

train uses resampling to tune and/or evaluate candidate models.

> set.seed(1)
> rbfSVM <- train(x = trainDescr, y = trainClass,
+ method = "svmRadial",
+ ## center and scale
+ preProc = c("center", "scale"),
+ ## Length of default tuning parameter grid
+ tuneLength = 8,
+ ## Repeated cross-validation resampling
+ trControl = trainControl(method = "repeatedcv",
+ repeats = 5),
+ ## Pick the best model using resampled Kappa
+ metric = "Kappa",
+ ## Pass arguments to ksvm
+ fit = FALSE)


Model Tuning
> print(rbfSVM, printCall = FALSE)
9373 samples
191 predictors
6 classes: 'Blues', 'Classical', 'Jazz', 'Metal', 'Pop', 'Rock'

Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold, repeated 5 times)

Summary of sample sizes: 8437, 8435, 8434, 8435, 8437, 8436, ...

Resampling results across tuning parameters:

C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.916 0.895 0.00953 0.0119
0.5 0.938 0.923 0.00824 0.0103
1 0.956 0.945 0.00641 0.008
2 0.964 0.955 0.00614 0.00766
4 0.968 0.961 0.0061 0.00761
8 0.969 0.962 0.00623 0.00777
16 0.969 0.962 0.00633 0.0079
32 0.969 0.962 0.0063 0.00786

Tuning parameter 'sigma' was held constant at a value of 0.00518
Kappa was used to select the optimal model using the largest value.
The final values used for the model were C = 16 and sigma = 0.00518.

Model Tuning

> class(rbfSVM)
[1] "train"
> class(rbfSVM$finalModel)
[1] "ksvm"
attr(,"package")
[1] "kernlab"


Model Tuning

- train uses as many “tricks” as possible to reduce the number of
  model fits (e.g. using sub–models). Here, it uses the kernlab
  function sigest to analytically estimate the RBF scale parameter.
- Currently, there are options for 110 models (see ?train for a list)
- Allows user–defined search grid, performance metrics and selection
  rules
- Easily integrates with any parallel processing framework that can
  emulate lapply
- Formula and non–formula interfaces
- Methods: predict, print, plot, varImp, resamples, xyplot,
  densityplot, histogram, stripplot, . . .


Plots
plot(rbfSVM, xTrans = function(x) log2(x))

[Figure: resampled accuracy (repeated cross-validation, about 0.92 to 0.97) plotted against log2 of the cost parameter.]


Plots
densityplot(rbfSVM, metric = "Kappa", pch = "|")

[Figure: density plot of the resampled Kappa values (x-axis: Kappa, about 0.94 to 0.98; y-axis: Density).]


Prediction and Performance Assessment

The predict method can be used to get results for other data sets:
> svmPred <- predict(rbfSVM, testDescr)
> str(svmPred)
Factor w/ 6 levels "Blues","Classical",..: 3 2 6 3 5 6 5 1 2 6 ...
> svmProbs <- predict(rbfSVM, testDescr, type = "prob")
> head(svmProbs)
       Blues  Classical       Jazz      Metal        Pop       Rock
1 0.03109657 0.51176742 0.31534778 0.05645315 0.02457019 0.06076489
2 0.00000000 0.98948148 0.01051852 0.00000000 0.00000000 0.00000000
3 0.05631158 0.03418600 0.07429845 0.14161385 0.20161666 0.49197345
4 0.09363752 0.15474426 0.32233519 0.14328794 0.14338776 0.14260733
5 0.07702743 0.09083003 0.16349012 0.17140600 0.23710395 0.26014248
6 0.06928080 0.03574326 0.08477684 0.13890564 0.18578909 0.48550437


> confusionMatrix(svmPred, testClass)
Confusion Matrix and Statistics

Reference
Prediction Blues Classical Jazz Metal Pop Rock
Blues 395 0 0 3 1 1
Classical 0 841 21 0 1 2
Jazz 4 20 724 9 4 8
Metal 0 0 0 214 2 0
Pop 0 0 0 3 378 6
Rock 0 0 5 2 7 471

Overall Statistics

Accuracy : 0.9683
95% CI : (0.9615, 0.9742)
No Information Rate : 0.2758
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9605
Mcnemar's Test P-Value : NA



Statistics by Class:

Class: Blues Class: Classical Class: Jazz Class: Metal Class: P
Sensitivity 0.9900 0.9768 0.9653 0.92641 0.96
Specificity 0.9982 0.9894 0.9810 0.99931 0.99
Pos Pred Value 0.9875 0.9723 0.9415 0.99074 0.97
Neg Pred Value 0.9985 0.9911 0.9890 0.99415 0.99
Prevalence 0.1278 0.2758 0.2402 0.07399 0.12
Detection Rate 0.1265 0.2694 0.2319 0.06855 0.12
Detection Prevalence 0.1281 0.2771 0.2463 0.06919 0.12
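
The overall statistics can be recomputed directly from the confusion matrix printed above. The sketch below copies the table into a matrix and derives accuracy, Kappa and the no-information rate by hand; the variable names are illustrative.

```r
classes <- c("Blues", "Classical", "Jazz", "Metal", "Pop", "Rock")

# Rows are predictions, columns are the reference classes (as printed above)
cm <- matrix(c(395,   0,   0,   3,   1,   1,
                 0, 841,  21,   0,   1,   2,
                 4,  20, 724,   9,   4,   8,
                 0,   0,   0, 214,   2,   0,
                 0,   0,   0,   3, 378,   6,
                 0,   0,   5,   2,   7, 471),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n         <- sum(cm)
accuracy  <- sum(diag(cm)) / n
# Agreement expected by chance, from the row and column totals
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)
# No-information rate: proportion of the largest reference class
nir       <- max(colSums(cm)) / n

round(c(accuracy = accuracy, kappa = kappa_hat, nir = nir), 4)
# accuracy 0.9683, kappa 0.9605, nir 0.2758, matching the output above
```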


Comparing Models

We can use the resampling results to make formal and informal
comparisons between models.
Based on the work of:
Hothorn et al. “The design and analysis of benchmark experiments”.
Journal of Computational and Graphical Statistics (2005) vol. 14 (3)
pp. 675-699
Eugster et al. “Exploratory and inferential analysis of benchmark
experiments”. Ludwig-Maximilians-Universität München, Department
of Statistics, Tech. Rep. (2008) vol. 30


Comparing Models

> set.seed(1)
> rfFit <- train(x = trainDescr, y = trainClass,
+ method = "rf", tuneLength = 5,
+ ## Same resampling scheme as the SVM model
+ trControl = trainControl(method = "repeatedcv",
+ repeats = 5,
+ verboseIter = FALSE),
+ metric = "Kappa")
> set.seed(1)
> plsFit <- train(x = trainDescr, y = trainClass,
+ method = "pls", tuneLength = 20,
+ preProc = c("center", "scale", "BoxCox"),
+ trControl = trainControl(method = "repeatedcv",
+ repeats = 5,
+ verboseIter = FALSE),
+ metric = "Kappa")


Comparing Models

> resamps <- resamples(list(rf = rfFit, pls = plsFit, svm = rbfSVM))
> print(summary(resamps))
Call:
summary.resamples(object = resamps)

Models: rf, pls, svm
Number of resamples: 50

Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max.
rf 0.9200 0.9328 0.9370 0.9370 0.9424 0.9499
pls 0.8348 0.8488 0.8554 0.8554 0.8631 0.8806
svm 0.9478 0.9648 0.9691 0.9694 0.9752 0.9819

Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max.
rf 0.9003 0.9162 0.9215 0.9215 0.9282 0.9376
pls 0.7932 0.8106 0.8192 0.8190 0.8286 0.8507
svm 0.9350 0.9561 0.9615 0.9619 0.9691 0.9774


Comparing Models

> diffs <- diff(resamps, metric = "Kappa")
> print(summary(diffs))
Call:
summary.diff.resamples(object = diffs)

p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Kappa
    rf        pls       svm
rf            0.10245   -0.04043
pls < 2.2e-16           -0.14288
svm < 2.2e-16 < 2.2e-16
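
Because every model is evaluated on the same 50 resamples, the comparison is a paired one. A minimal base R sketch of that idea follows: a t-test on the per-resample differences with a Bonferroni adjustment. The Kappa values here are simulated purely for illustration (they are not the results above), and the sketch is an assumption about the general approach rather than a copy of caret's code.

```r
set.seed(1)

# Simulated per-resample Kappa values, made up for illustration only
kap <- data.frame(rf  = rnorm(50, 0.92, 0.010),
                  pls = rnorm(50, 0.82, 0.012),
                  svm = rnorm(50, 0.96, 0.010))

# All pairwise comparisons: rf - pls, rf - svm, pls - svm
model_pairs <- combn(names(kap), 2, simplify = FALSE)

res <- do.call(rbind, lapply(model_pairs, function(p) {
  d <- kap[[p[1]]] - kap[[p[2]]]          # differences within each resample
  data.frame(comparison = paste(p, collapse = " - "),
             estimate   = mean(d),
             p_value    = t.test(d)$p.value)
}))
res$p_adjusted <- p.adjust(res$p_value, method = "bonferroni")
res

# Bonferroni-adjusted confidence level for 3 comparisons at an overall 0.95
conf_level <- 1 - 0.05 / nrow(res)
round(conf_level, 3)
```

With three comparisons the adjusted confidence level is 1 - 0.05/3, which rounds to the 0.983 reported for the dot plot of differences.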


Parallel Coordinate Plots
parallel(resamps, metric = "Kappa")

[Figure: parallel coordinate plot of the resampled Kappa values (about 0.8 to 0.95) for svm, rf and pls.]


Box Plots
bwplot(resamps, metric = "Kappa")

[Figure: box plots of the resampled Kappa values (about 0.80 to 0.95) for svm, rf and pls.]


Dot Plots of Average Differences
dotplot(diffs)

[Figure: dot plot of the pairwise differences in Kappa (rf − svm, rf − pls, pls − svm), x-axis from −0.15 to 0.10. Confidence Level 0.983 (multiplicity adjusted).]


Feature Selection

There are many predictive models with built–in feature selection (e.g.
trees, the lasso, MARS, etc).
caret contains a few functions for supervised feature selection via
“wrappers”.
Two wrapper techniques in caret are:
- recursive feature elimination (RFE)
- filtering using simple, univariate statistics

This can be tricky and can be fraught with bias.
See: Ambroise and McLachlan (2002) for an example


Recursive Feature Selection

This is basically backwards selection.
We rank the predictors by importance, then cull the least important.
We create a performance profile across the subset size and pick the best
The final model is refit using only the subset.
The feature selection step must be cross–validated!


Recursive Feature Elimination

for each resampling iteration do
    Partition original data into training and hold–back sets via resampling;
    Train the model on the training set using all predictors;
    Predict the held–back samples;
    Calculate variable importance or rankings;
    for each subset size S_i, i = 1 . . . S do
        Keep the S_i most important variables;
        Train the model on the training set using S_i predictors;
        Predict the held–back samples;
    end
end
Calculate the performance profile over the S_i using the held–back samples;
Determine the appropriate number of predictors;
Estimate the final list of predictors to keep in the final model;
Fit the final model based on the optimal S_i using the original data set;
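
The algorithm above can be sketched in base R for a small case. The example uses linear regression on mtcars with 5-fold cross-validation and ranks predictors by the absolute t-statistic; the data set, ranking rule and subset sizes are assumptions made for illustration, and this is a simplified stand-in for what rfe does.

```r
set.seed(1)
dat   <- mtcars                       # response: mpg, 10 candidate predictors
sizes <- c(1, 2, 4, 6, 8)             # subset sizes S_i to evaluate
folds <- sample(rep(1:5, length.out = nrow(dat)))

# Rank predictors by |t| from a linear model fit on the given data
rank_predictors <- function(d) {
  tstat <- abs(coef(summary(lm(mpg ~ ., data = d)))[-1, "t value"])
  names(sort(tstat, decreasing = TRUE))
}

# One column per resample, one row per subset size
rmse <- sapply(1:5, function(f) {
  train  <- dat[folds != f, ]
  hold   <- dat[folds == f, ]
  ranked <- rank_predictors(train)    # ranking is done inside the resample
  sapply(sizes, function(s) {
    fit <- lm(reformulate(ranked[1:s], response = "mpg"), data = train)
    sqrt(mean((hold$mpg - predict(fit, hold))^2))
  })
})

profile   <- rowMeans(rmse)           # performance profile over subset sizes
best_size <- sizes[which.min(profile)]

# Final model: re-rank on all data, keep the best_size top predictors
final_fit <- lm(reformulate(rank_predictors(dat)[1:best_size],
                            response = "mpg"), data = dat)
names(coef(final_fit))
```

Ranking the predictors inside each resample, rather than once on all the data, is what keeps the selection step itself cross-validated.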


Recursive Feature Selection
The rfe function is a framework for doing this. There are several
pre–defined functions for certain models (and a wrapper for train).
For each subset, let’s run a few regression models for illustration.

> data(BloodBrain)
> varSizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 80, 80)
> x <- bbbDescr[,-nearZeroVar(bbbDescr)]
> x <- x[, -findCorrelation(cor(x), .9)]
> set.seed(1)
> lmProfile <- rfe(x, logBBB,
+ sizes = varSizes,
+ rfeControl = rfeControl(functions = lmFuncs,
+ number = 200,
+ verbose = FALSE))
> rfProfile <- rfe(x, logBBB,
+ sizes = varSizes,
+ rfeControl = rfeControl(functions = rfFuncs,
+ number = 200,
+ verbose = FALSE))


Backwards Selection Results

[Figure: resampled RMSE (about 0.6 to 1.1) versus number of variables (0 to 80) for linear regression and random forests.]


Opportunities for Parallel Processing

Recall the algorithm for selecting models via resampling:

Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions;
end
Determine the optimal parameter set;


Opportunities for Parallel Processing

In this process, M models are fit to B resampled data sets.

There is (usually*) no connection between these models, so they could be
run within different processes on the same computer or over separate
computers.

Can we get any benefit from parallel processing?

* There are some exceptions where sub–models are evaluated without
further re–fitting. For example, if we can fit a PLS model with 10
components, we can get the results from models with 1–9 components for
free. We’ll call this the “sub–model” trick.
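
Because the fits are independent, the resampling loop can be handed to anything that behaves like lapply. The sketch below shows the pattern with the parallel package; the function names fit_one and run_resamples are hypothetical, and a small linear model on mtcars stands in for a real tuning problem.

```r
library(parallel)

set.seed(1)
B <- 20
# Draw the bootstrap samples up front so every backend sees the same ones
boot_idx <- lapply(seq_len(B), function(b) sample(nrow(mtcars), replace = TRUE))

# Fit on one bootstrap sample and score on the rows left out of it
fit_one <- function(idx) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  hold <- mtcars[-unique(idx), ]
  sqrt(mean((hold$mpg - predict(fit, hold))^2))
}

# The only thing that changes between backends is the lapply-like function
run_resamples <- function(apply_fun = lapply) {
  unlist(apply_fun(boot_idx, fit_one))
}

rmse_seq <- run_resamples(lapply)

# Forked workers are not available on Windows, so fall back to one core there
cores    <- if (.Platform$OS.type == "windows") 1L else 2L
rmse_par <- run_resamples(function(X, FUN) mclapply(X, FUN, mc.cores = cores))

all.equal(rmse_seq, rmse_par)
```

Drawing the resamples before the parallel step keeps the sequential and parallel runs directly comparable.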


An Example – Boosted Trees

We used a medium-sized data set (n = 4,500) to tune a gradient
boosting machine (GBM) model sequentially and in parallel.

We fit models with four different values of the interaction depth and 10
different values for the number of boosting iterations.

It turns out that, for each value of the interaction depth, we can fit one
model with the largest number of iterations and get the predictions from
smaller models at no cost.

This means we need to fit four models (with different interaction depths)
for 50 bootstrap samples. We’ll partition these 200 model fits onto
different processes in a few ways to see if parallelization helps.
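
The saving from the sub-model trick is simple arithmetic on the numbers above; the variable names below are illustrative.

```r
n_depth  <- 4    # values of the interaction depth
n_trees  <- 10   # values of the number of boosting iterations
n_boot   <- 50   # bootstrap samples

# Without the trick, every grid point is fit on every resample
fits_naive    <- n_depth * n_trees * n_boot

# With the trick, only the largest number of iterations is fit per depth
fits_submodel <- n_depth * n_boot

c(naive = fits_naive, submodel = fits_submodel)
# naive = 2000, submodel = 200
```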


Execution Times – An Example – Support Vector Machines

SVM regression models with 5 candidate values of the cost parameter with
50 bootstrap iterations can be tested on the same data.

caret uses sub–models wherever possible to be efficient but, unlike
boosted trees, support vector machines cannot be exploited in this way.

The GBM and SVM computations were run sequentially and in parallel
with 1 to 16 “worker nodes”.


Execution Time Results

[Figure: training time versus number of processors (1 to 16), in separate GBM and SVM panels, comparing parallel and sequential runs.]


Speedups
speedup = Sequential Time / Parallel Time

[Figure: speedup (about 1 to 5) versus number of processors (1 to 16) for GBM and SVM.]


Results

There is a benefit to adding more workers for these calculations.

The optimal speedup would be W, where W is the number of workers. We
are not optimal, but we can cut the execution time down by 4–5 fold.

The SVM model benefited more than the GBM model, perhaps since GBM
was fitting fewer models (using sub–models).
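
The speedup definition used here is easy to apply to any set of timings. In the sketch below the timings are hypothetical placeholders, not the measurements from this talk; dividing the speedup by W shows how far a run is from the optimal value.

```r
# speedup = sequential time / parallel time
speedup <- function(sequential_time, parallel_time) {
  sequential_time / parallel_time
}

workers         <- c(1, 2, 4, 8, 16)
sequential_time <- 80                       # hypothetical timing
parallel_time   <- c(80, 42, 24, 18, 17)    # hypothetical timings

s          <- speedup(sequential_time, parallel_time)
efficiency <- s / workers                   # 1 would mean optimal speedup W

data.frame(workers, speedup = round(s, 2), efficiency = round(efficiency, 2))
```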


Other Functions and Classes

- nearZeroVar: a function to remove predictors that are sparse and
  highly unbalanced
- findCorrelation: a function to remove the optimal set of
  predictors to achieve low pair–wise correlations
- predictors: class for determining which predictors are included in
  the prediction equations (e.g. rpart, earth, lars models) (currently
  57 methods)
- confusionMatrix, sensitivity, specificity, posPredValue,
  negPredValue: classes for assessing classifier performance
- varImp: classes for assessing the aggregate effect of a predictor on
  the model equations (currently 20 methods)


Other Functions and Classes

- knnreg: nearest–neighbor regression
- plsda, splsda: PLS discriminant analysis
- icr: independent component regression
- pcaNNet: nnet:::nnet with automatic PCA pre–processing step
- bagEarth, bagFDA: bagging with MARS and FDA models
- normalize2Reference: RMA–like processing of Affy arrays using a
  training set
- spatialSign: class for transforming numeric data (x* = x / ||x||)
- maxDissim: a function for maximum dissimilarity sampling
- featurePlot: a wrapper for several lattice functions
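
As an illustration of the spatial sign formula x* = x / ||x||, here is a base R sketch that projects each sample onto the unit sphere. The helper name spatial_sign is hypothetical and is not caret's spatialSign; centering and scaling first is an assumption about typical usage.

```r
# Hypothetical helper: divide every row by its Euclidean norm
spatial_sign <- function(x) {
  x <- as.matrix(x)
  x / sqrt(rowSums(x^2))   # the norm vector is recycled down the rows
}

x  <- scale(iris[, 1:4])   # center and scale the predictors first
xs <- spatial_sign(x)

# Every transformed sample now has unit length
range(sqrt(rowSums(xs^2)))
```

Because every sample ends up at the same distance from the origin, extreme values have less influence on models fit afterwards.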


Shameless Plug # 2: Other Packages

A few others that I’m working on...
sparseLDA: Lasso–type regularization for LDA
Cubist: Quinlan’s model trees
C5.0: Quinlan’s decision trees (I could use some C help here)
FuseBox: a framework for combining ensembles of models


Thanks

Kirk Mettler, Bruno and the NYC Predictive Analytics Organizers
R Core
Pfizer’s Statistics leadership for providing the time and support to create R
packages
caret contributors: Jed Wing, Steve Weston, Andre Williams, Chris
Keefer and Allan Engelhardt


Session Info

R version 2.11.1 (2010-05-31), x86_64-apple-darwin9.8.0
Base packages: base, datasets, graphics, grDevices, methods, splines, stats,
tools, utils
Other packages: caret 4.87, class 7.3-2, cluster 1.12.3, codetools 0.2-2,
digest 0.4.2, e1071 1.5-24, gbm 1.6-3.1, kernlab 0.9-12, lattice 0.18-8,
plyr 1.2.1, reshape 0.8.3, survival 2.35-8, weaver 1.16.0
Loaded via a namespace (and not attached): grid 2.11.1

This presentation was created with a MacPro using LaTeX and R’s Sweave
function at 16:02 on Wednesday, May 11, 2011.

