The caret Package: A Unified Interface for Predictive Models

                       Max Kuhn

                    Pfizer Global R&D
                   Nonclinical Statistics
                       Groton, CT
                   max.kuhn@pfizer.com


                    May 12, 2011
Shameless Plug # 1: Courses




I’ll be teaching 2 R classes here for Predictive Analytics World.

      R Bootcamp (October 16)
      R for Predictive Modeling: A Hands-On Introduction (October 17)

http://www.predictiveanalyticsworld.com/newyork/2011/




 Max Kuhn (Pfizer Global R&D)         caret                    May 12, 2011   2 / 44
Motivation



Theorem (No Free Lunch)
In the absence of any knowledge about the prediction problem, no model
can be said to be uniformly better than any other


Given this, it makes sense to use a variety of different models to find one
that best fits the data

R has many packages for predictive modeling (aka machine learning, aka
pattern recognition) ...




Model Function Consistency
Since there are many modeling packages written by different people, there
are some inconsistencies in how models are specified and predictions are
made.

For example, many models have only one method of specifying the model
(e.g. formula method only)

The table below shows the syntax to get probability estimates from several
classification models:
   obj Class     Package   predict Function Syntax
   lda           MASS      predict(obj) (no options needed)
   glm           stats     predict(obj, type = "response")
   gbm           gbm       predict(obj, type = "response", n.trees)
   mda           mda       predict(obj, type = "posterior")
   rpart         rpart     predict(obj, type = "prob")
   Weka          RWeka     predict(obj, type = "probability")
   LogitBoost    caTools   predict(obj, type = "raw", nIter)
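
One way to picture what a unified interface buys: a single generic that hides the per-package type argument. The sketch below is not caret's implementation. classProbs is a hypothetical name, only a glm method is filled in, and it uses base R and the built-in mtcars data.

```r
## Hypothetical generic: one call signature for class probabilities,
## whatever the underlying model class is.
classProbs <- function(obj, newdata, ...) UseMethod("classProbs")

## glm (binomial): predict() needs type = "response" and returns the
## probability of the second outcome level only.
classProbs.glm <- function(obj, newdata, ...) {
  p <- predict(obj, newdata = newdata, type = "response")
  cbind(no = 1 - p, yes = unname(p))
}

## Fallback so unsupported classes fail with a clear message.
classProbs.default <- function(obj, newdata, ...) {
  stop("no classProbs method for class ", class(obj)[1])
}

fit <- glm(am ~ wt, data = mtcars, family = binomial)
probs <- classProbs(fit, mtcars[1:5, ])
probs
```

Adding another model type would mean adding one more method that translates to that package's own predict syntax, while callers keep using the same call.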


The caret Package

The caret package was developed to:
      create a unified interface for modeling and prediction
      streamline model tuning using resampling
      provide a variety of “helper” functions and classes for day–to–day
      model building tasks
      increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005
First version on CRAN: 10/2007
Website: http://caret.r-forge.r-project.org
JSS Paper: www.jstatsoft.org/v28/i05/paper
4 package vignettes (82 pages total)


Example Data: TunedIT Music Challenge




http://tunedit.org/challenge/music-retrieval/genres
Using 191 descriptors, classify 12495 musical segments into one of 6
genres: Blues, Classical, Jazz, Metal, Pop, Rock.
Use these data to predict a large test set of music segments.




Example Data: TunedIT Music Challenge

The predictors and class variables are contained in a data frame called
music.

> head(music[,1:5])
         TC      SC     SC_V     ASE1     ASE2
1    2.5788 481.45 76989.0 -0.12334 -0.11578
2    2.7195 1405.30 825380.0 -0.17655 -0.18323
3    2.5351 601.09 686240.0 -0.13940 -0.13251
4    2.4465 637.73 122580.0 -0.14995 -0.14802
5    2.5657 776.86 124010.0 -0.16863 -0.16112
6    2.7737 447.09    8531.9 -0.16128 -0.15742
> head(music$GENRE)
[1] Pop       Blues     Pop       Jazz      Jazz   Classical
Levels: Blues Classical Jazz Metal Pop Rock




Data Splitting


createDataPartition conducts stratified random splits
>    ## Create a test set with 25% of the data
>    set.seed(1)
>    inTrain <- createDataPartition(music$GENRE, p = .75)[[1]]
>    length(inTrain)
[1] 9373
> head(inTrain)
[1]     2   7 14 20 22 47

createDataPartition returns a list with one element per resample (here just
one, extracted with [[1]]). Each element is an integer vector of row indices
for that resample's training set.
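
To make "stratified" concrete, here is a minimal base R sketch that samples a fraction p of the rows within each class. stratifiedSplit is a hypothetical helper written for this illustration, not caret's code, and it runs on the built-in iris data rather than the music data.

```r
## Sample a fraction p of the row indices separately within each class.
stratifiedSplit <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_keep <- ceiling(p * length(idx))
    ## index into idx so a class with a single row is handled correctly
    idx[sample.int(length(idx), n_keep)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
inTrainSketch <- stratifiedSplit(iris$Species, p = 0.75)

## Class proportions in the training rows mirror those of the full data
prop.table(table(iris$Species[inTrainSketch]))
```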




Data Splitting


>    trainDescr   <-   music[ inTrain, -ncol(music)]
>    testDescr    <-   music[-inTrain, -ncol(music)]
>    trainClass   <-   music$GENRE[ inTrain]
>    testClass    <-   music$GENRE[-inTrain]
> prop.table(table(music$GENRE))
     Blues Classical        Jazz      Metal        Pop       Rock
0.12773109 0.27563025 0.24033613 0.07394958 0.12605042 0.15630252
> prop.table(table(trainClass))
trainClass
     Blues Classical        Jazz      Metal        Pop       Rock
0.12770724 0.27557879 0.24037128 0.07393577 0.12610690 0.15630001

Other functions: createFolds, createMultiFolds, createResamples




Data Pre–Processing Methods

preProcess calculates values that can then be applied to any data set
(e.g. training set, test set, unknowns).
Current methods: centering, scaling, spatial sign transformation, PCA or
ICA “signal extraction”, imputation (via bagging or k-nearest neighbors),
Box–Cox transformations

> ## Determine means and sd's
> procValues <- preProcess(trainDescr, method = c("center", "scale"))
> procValues
> ## Use the predict methods to do the adjustments
> trainScaled <- predict(procValues, trainDescr)
> testScaled <- predict(procValues, testDescr)


preProcess can also be called within other functions, such as train, for
each resampling iteration.
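
The key idea is that the centering and scaling values are estimated once, on the training set, and then reused unchanged on every other data set. A rough base R equivalent of that idea, on the built-in iris data (the variable names are made up for this sketch):

```r
## Estimate centering/scaling values on the training rows only
set.seed(1)
train_rows <- sample(nrow(iris), 100)
trainX <- iris[train_rows, 1:4]
testX  <- iris[-train_rows, 1:4]

mu  <- colMeans(trainX)
sdv <- apply(trainX, 2, sd)

## Apply the same training-set values to both sets
trainScaledSketch <- scale(trainX, center = mu, scale = sdv)
testScaledSketch  <- scale(testX,  center = mu, scale = sdv)

round(colMeans(trainScaledSketch), 10)  # zero, up to floating-point error
round(colMeans(testScaledSketch), 3)    # near zero, but not exactly
```

Re-estimating the means and standard deviations on the test set would leak information about it into the evaluation, which is why the training values are reused.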


Model Tuning Using Resampling


Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples ;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions
end
Determine the optimal parameter set;
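
The loop above can be written out directly in base R. This is only a sketch of the idea, not what train does internally: it tunes the polynomial degree of a linear model on the built-in mtcars data with 5-fold cross-validation, and the model, grid and fold count are all assumptions chosen for illustration.

```r
set.seed(1)
degrees <- 1:4                                   # candidate parameter values
folds   <- sample(rep(1:5, length.out = nrow(mtcars)))

cv_rmse <- sapply(degrees, function(d) {         # for each parameter set
  fold_rmse <- sapply(1:5, function(k) {         # for each resampling iteration
    held_out <- mtcars[folds == k, ]             # hold out specific samples
    fit  <- lm(mpg ~ poly(hp, degree = d),       # fit on the remainder
               data = mtcars[folds != k, ])
    pred <- predict(fit, newdata = held_out)     # predict the hold-out samples
    sqrt(mean((held_out$mpg - pred)^2))
  })
  mean(fold_rmse)                                # average hold-out performance
})

results <- data.frame(degree = degrees, RMSE = cv_rmse)
best_degree <- results$degree[which.min(results$RMSE)]  # optimal parameter set
results
```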




Model Tuning


train uses resampling to tune and/or evaluate candidate models.

> set.seed(1)
> rbfSVM <- train(x = trainDescr, y = trainClass,
+                 method = "svmRadial",
+                 ## center and scale
+                 preProc = c("center", "scale"),
+                 ## Length of default tuning parameter grid
+                 tuneLength = 8,
+                 ## Repeated cross-validation resampling
+                 trControl = trainControl(method = "repeatedcv",
+                                          repeats = 5),
+                 ## Pick the best model using resampled Kappa
+                 metric = "Kappa",
+                 ## Pass arguments to ksvm
+                 fit = FALSE)




Model Tuning
> print(rbfSVM, printCall = FALSE)
9373 samples
 191 predictors
   6 classes: 'Blues', 'Classical', 'Jazz', 'Metal', 'Pop', 'Rock'

Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold, repeated 5 times)

Summary of sample sizes: 8437, 8435, 8434, 8435, 8437, 8436, ...

Resampling results across tuning parameters:

  C      Accuracy     Kappa    Accuracy SD   Kappa SD
  0.25   0.916        0.895    0.00953       0.0119
  0.5    0.938        0.923    0.00824       0.0103
  1      0.956        0.945    0.00641       0.008
  2      0.964        0.955    0.00614       0.00766
  4      0.968        0.961    0.0061        0.00761
  8      0.969        0.962    0.00623       0.00777
  16     0.969        0.962    0.00633       0.0079
  32     0.969        0.962    0.0063        0.00786

Tuning parameter 'sigma' was held constant at a value of 0.00518
Kappa was used to select the optimal model using the largest value.
The final values used for the model were C = 16 and sigma = 0.00518.
Model Tuning




> class(rbfSVM)
[1] "train"
> class(rbfSVM$finalModel)
[1] "ksvm"
attr(,"package")
[1] "kernlab"




Model Tuning


      train uses as many “tricks” as possible to reduce the number of
      model fits (e.g. using sub–models). Here, it uses the kernlab
      function sigest to analytically estimate the RBF scale parameter.
      Currently, there are options for 110 models (see ?train for a list)
      Allows user–defined search grid, performance metrics and selection
      rules
      Easily integrates with any parallel processing framework that can
      emulate lapply
      Formula and non–formula interfaces
      Methods: predict, print, plot, varImp, resamples, xyplot,
      densityplot, histogram, stripplot, . . .



Plots
plot(rbfSVM, xTrans = function(x) log2(x))
[Figure: Accuracy (Repeated Cross-Validation), about 0.92 to 0.97, plotted against Cost on a log2 scale from -2 to 4]

Plots
densityplot(rbfSVM, metric = "Kappa", pch = "|")




[Figure: density of the resampled Kappa values with a rug of individual resamples; x-axis: Kappa (about 0.94 to 0.98), y-axis: Density (0 to 40)]

Prediction and Performance Assessment


The predict method can be used to get results for other data sets:
> svmPred <- predict(rbfSVM, testDescr)
> str(svmPred)
 Factor w/ 6 levels "Blues","Classical",..: 3 2 6 3 5 6 5 1 2 6 ...
> svmProbs <- predict(rbfSVM, testDescr, type = "prob")
> head(svmProbs)
          Blues     Classical           Jazz        Metal          Pop         Rock
1    0.03109657    0.51176742     0.31534778   0.05645315   0.02457019   0.06076489
2    0.00000000    0.98948148     0.01051852   0.00000000   0.00000000   0.00000000
3    0.05631158    0.03418600     0.07429845   0.14161385   0.20161666   0.49197345
4    0.09363752    0.15474426     0.32233519   0.14328794   0.14338776   0.14260733
5    0.07702743    0.09083003     0.16349012   0.17140600   0.23710395   0.26014248
6    0.06928080    0.03574326     0.08477684   0.13890564   0.18578909   0.48550437




Prediction and Performance Assessment
> confusionMatrix(svmPred, testClass)
Confusion Matrix and Statistics

           Reference
Prediction Blues Classical Jazz Metal Pop Rock
  Blues       395        0    0     3   1    1
  Classical     0      841   21     0   1    2
  Jazz          4       20 724      9   4    8
  Metal         0        0    0   214   2    0
  Pop           0        0    0     3 378    6
  Rock          0        0    5     2   7 471

Overall Statistics

               Accuracy        :   0.9683
                 95% CI        :   (0.9615, 0.9742)
    No Information Rate        :   0.2758
    P-Value [Acc > NIR]        :   < 2.2e-16

                  Kappa : 0.9605
 Mcnemar's Test P-Value : NA
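
The overall statistics can be recomputed by hand from the table, which shows what Kappa adjusts for. The sketch below copies the confusion matrix from the output above and uses the standard definition of Cohen's Kappa; the variable names are made up for this illustration.

```r
## Confusion matrix from the output above (rows = prediction, columns = reference)
genres <- c("Blues", "Classical", "Jazz", "Metal", "Pop", "Rock")
cm <- matrix(c(395,   0,   0,   3,   1,   1,
                 0, 841,  21,   0,   1,   2,
                 4,  20, 724,   9,   4,   8,
                 0,   0,   0, 214,   2,   0,
                 0,   0,   0,   3, 378,   6,
                 0,   0,   5,   2,   7, 471),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = genres, Reference = genres))

n         <- sum(cm)
accuracy  <- sum(diag(cm)) / n                      # observed agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)
nir       <- max(colSums(cm)) / n                   # no-information rate

round(c(Accuracy = accuracy, Kappa = kappa_hat, NIR = nir), 4)
```

To four decimals these agree with the Accuracy, Kappa and No Information Rate printed by confusionMatrix.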


Prediction and Performance Assessment



Statistics by Class:

                          Class: Blues Class: Classical Class: Jazz Class: Metal Class: P
Sensitivity                     0.9900           0.9768      0.9653      0.92641     0.96
Specificity                     0.9982           0.9894      0.9810      0.99931     0.99
Pos Pred Value                  0.9875           0.9723      0.9415      0.99074     0.97
Neg Pred Value                  0.9985           0.9911      0.9890      0.99415     0.99
Prevalence                      0.1278           0.2758      0.2402      0.07399     0.12
Detection Rate                  0.1265           0.2694      0.2319      0.06855     0.12
Detection Prevalence            0.1281           0.2771      0.2463      0.06919     0.12
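
Each by-class statistic treats one genre against all the others. As a check on the definitions, the Blues column can be rebuilt from four counts read off the confusion matrix shown earlier (3122 test samples in total). The formulas are the usual count-based ones, and the names are made up for this sketch.

```r
## One-vs-rest counts for the Blues class
n_test <- 3122
tp <- 395                      # predicted Blues, actually Blues
fp <- 3 + 1 + 1                # predicted Blues, actually another genre
fn <- 4                        # predicted another genre, actually Blues
tn <- n_test - tp - fp - fn    # everything else

blues <- c(Sensitivity         = tp / (tp + fn),
           Specificity         = tn / (tn + fp),
           PosPredValue        = tp / (tp + fp),
           NegPredValue        = tn / (tn + fn),
           Prevalence          = (tp + fn) / n_test,
           DetectionRate       = tp / n_test,
           DetectionPrevalence = (tp + fp) / n_test)
round(blues, 4)
```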




Comparing Models


We can use the resampling results to make formal and informal
comparisons between models.
Based on the work of
      Hothorn et al. “The design and analysis of benchmark experiments”.
      Journal of Computational and Graphical Statistics (2005) vol. 14 (3)
      pp. 675-699
      Eugster et al. “Exploratory and inferential analysis of benchmark
      experiments”. Ludwig-Maximilians-Universität München, Department
      of Statistics, Tech. Rep (2008) vol. 30




Comparing Models


>    set.seed(1)
>    rfFit <- train(x = trainDescr, y = trainClass,
+                   method = "rf", tuneLength = 5,
+                   trControl = trainControl(method = "repeatedcv",
+                                            repeats = 5,
+                                            verboseIter = FALSE),
+                   metric = "Kappa")
>    set.seed(1)
>    plsFit <- train(x = trainDescr, y = trainClass,
+                   method = "pls", tuneLength = 20,
+                    preProc = c("center", "scale", "BoxCox"),
+                   trControl = trainControl(method = "repeatedcv",
+                                            repeats = 5,
+                                            verboseIter = FALSE),
+                   metric = "Kappa")




Comparing Models

> resamps <- resamples(list(rf = rfFit, pls = plsFit, svm = rbfSVM))
> print(summary(resamps))
Call:
summary.resamples(object = resamps)

Models: rf, pls, svm
Number of resamples: 50

Accuracy
      Min. 1st Qu.     Median   Mean 3rd Qu.  Max.
rf 0.9200 0.9328       0.9370 0.9370 0.9424 0.9499
pls 0.8348 0.8488      0.8554 0.8554 0.8631 0.8806
svm 0.9478 0.9648      0.9691 0.9694 0.9752 0.9819

Kappa
      Min. 1st Qu.     Median   Mean 3rd Qu.  Max.
rf 0.9003 0.9162       0.9215 0.9215 0.9282 0.9376
pls 0.7932 0.8106      0.8192 0.8190 0.8286 0.8507
svm 0.9350 0.9561      0.9615 0.9619 0.9691 0.9774



Comparing Models


> diffs <- diff(resamps, metric = "Kappa")
> print(summary(diffs))
Call:
summary.diff.resamples(object = diffs)

p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Kappa
    rf        pls       svm
rf             0.10245 -0.04043
pls < 2.2e-16           -0.14288
svm < 2.2e-16 < 2.2e-16
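
Because every model was resampled on the same splits, the comparison is a paired one. The sketch below shows the general idea in base R, assuming a paired t-test per model pair followed by a Bonferroni adjustment. The Kappa values are simulated purely for illustration and are not the results shown above.

```r
## Simulated resampled Kappa values for three models over the same 50 resamples
## (illustrative numbers only)
set.seed(1)
shared <- rnorm(50, sd = 0.01)   # resample-to-resample variation common to all models
kappas <- data.frame(rf  = 0.92 + shared + rnorm(50, sd = 0.005),
                     pls = 0.82 + shared + rnorm(50, sd = 0.005),
                     svm = 0.96 + shared + rnorm(50, sd = 0.005))

model_pairs <- combn(names(kappas), 2, simplify = FALSE)

## Paired test: both models in a pair saw the same resamples
raw_p <- sapply(model_pairs, function(m) {
  t.test(kappas[[m[1]]], kappas[[m[2]]], paired = TRUE)$p.value
})
est <- sapply(model_pairs, function(m) mean(kappas[[m[1]]] - kappas[[m[2]]]))

comparisons <- data.frame(comparison = sapply(model_pairs, paste, collapse = " - "),
                          estimate   = est,
                          p_adjusted = p.adjust(raw_p, method = "bonferroni"))
comparisons

## Bonferroni-adjusted confidence level for 3 comparisons at an overall 95%
adj_level <- 1 - 0.05 / 3
adj_level
```

With three pairwise comparisons, the adjusted level works out to about 0.983.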




Parallel Coordinate Plots
parallel(resamps, metric = "Kappa")


[Figure: parallel coordinate plot of the resampled Kappa values for svm, rf and pls; x-axis: Kappa (about 0.8 to 0.95)]

Box Plots
bwplot(resamps, metric = "Kappa")


[Figure: box plots of the resampled Kappa values for svm, rf and pls; x-axis: Kappa (0.80 to 0.95)]

Dot Plots of Average Differences
dotplot(diffs)




[Figure: dot plot of the pairwise differences rf - svm, rf - pls and pls - svm; x-axis: Difference in Kappa (-0.15 to 0.10); Confidence Level 0.983 (multiplicity adjusted)]

Feature Selection


There are many predictive models with built–in feature selection (e.g.
trees, the lasso, MARS, etc).
caret contains a few functions for supervised feature selection via
“wrappers”.
Two wrapper techniques in caret are:
      recursive feature elimination (RFE)
      filtering using simple, univariate statistics

This can be tricky and can be fraught with bias.
See: Ambroise and McLachlan (2002) for an example




Recursive Feature Selection



This is basically backwards selection.
We rank the predictors by importance, then cull the least important.
We create a performance profile across the subset sizes and pick the best.
The final model is refit using only the subset.
The feature selection step must be cross–validated!




Recursive Feature Elimination

for each resampling iteration do
     Partition original data into training and hold–back sets via resampling;
     Train the model on the training set using all predictors;
     Predict the held–back samples;
     Calculate variable importance or rankings;
     for each subset size S_i, i = 1, ..., S do
         Keep the S_i most important variables;
         Train the model on the training set using S_i predictors;
         Predict the held–back samples;
     end
end
Calculate the performance profile over the S_i using the held–back samples;
Determine the appropriate number of predictors;
Estimate the final list of predictors to keep in the final model;
Fit the final model based on the optimal S_i using the original data set;
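
The algorithm above can be sketched in base R as follows. This is a simplified illustration, not the rfe implementation: it uses linear regression on the built-in mtcars data, ranks predictors by the absolute t-statistic of the full model, and uses 5-fold cross-validation. All function names and subset sizes are assumptions made for the example.

```r
set.seed(1)
y_name  <- "mpg"
x_names <- setdiff(names(mtcars), y_name)
sizes   <- c(2, 4, 6, 8)                        # candidate subset sizes S_i
folds   <- sample(rep(1:5, length.out = nrow(mtcars)))

## Rank predictors by |t| from a model using all predictors
rankPredictors <- function(d) {
  cs <- coef(summary(lm(reformulate(x_names, response = y_name), data = d)))
  tv <- abs(cs[-1, "t value"])
  names(tv) <- rownames(cs)[-1]
  names(sort(tv, decreasing = TRUE))            # most important first
}

## Hold-back RMSE of a linear model restricted to `vars`
rmseFor <- function(vars, train, test) {
  fit <- lm(reformulate(vars, response = y_name), data = train)
  sqrt(mean((test[[y_name]] - predict(fit, newdata = test))^2))
}

## Rows = resamples, columns = subset sizes
perf <- t(sapply(1:5, function(k) {
  train  <- mtcars[folds != k, ]
  test   <- mtcars[folds == k, ]
  ranked <- rankPredictors(train)               # ranking uses training rows only
  sapply(sizes, function(s) rmseFor(head(ranked, s), train, test))
}))

perf_profile <- setNames(colMeans(perf), sizes) # performance profile over S_i
best_size    <- sizes[which.min(perf_profile)]

## Final predictor list and final model use the original data set
final_vars <- head(rankPredictors(mtcars), best_size)
final_fit  <- lm(reformulate(final_vars, response = y_name), data = mtcars)
perf_profile
```

The ranking is recomputed inside every resample, so the held-back rows never influence which predictors are kept. That is the point of cross-validating the selection step.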



Recursive Feature Selection
The rfe function is a framework for doing this. There are several
pre–defined functions for certain models (and a wrapper for train)
For each subset, let’s run a few regression models for illustration

>    data(BloodBrain)
>    varSizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 80, 80)
>    x <- bbbDescr[,-nearZeroVar(bbbDescr)]
>    x <- x[, -findCorrelation(cor(x), .9)]
>    set.seed(1)
>    lmProfile <- rfe(x, logBBB,
+                     sizes = varSizes,
+                     rfeControl = rfeControl(functions = lmFuncs,
+                                             number = 200,
+                                             verbose = FALSE))
>    rfProfile <- rfe(x, logBBB,
+                     sizes = varSizes,
+                     rfeControl = rfeControl(functions = rfFuncs,
+                                             number = 200,
+                                             verbose = FALSE))



Backwards Selection Results

[Figure: resampled RMSE (y-axis, about 0.6 to 1.1) versus number of Variables (x-axis, 0 to 80) for Linear Reg and Random Forests]

Opportunities for Parallel Processing

Recall the algorithm for selecting models via resampling:

Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples ;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions
end
Determine the optimal parameter set;




Opportunities for Parallel Processing


In this process, M models are fit to B resampled data sets.

There is (usually∗ ) no connection between these models, so they could be
run within different processes on the same computer or over separate
computers.

Can we get any benefit from parallel processing?

∗ There are some exceptions where sub–models are evaluated without
further re–fitting. For example, if we can fit a PLS model with 10
components, we can get the results from models with 1–9 components for
free. We’ll call this the “sub–model” trick.
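
Since the fits do not depend on one another, the resampling loop can be handed to anything with the same interface as lapply. A minimal sketch of that design in base R, on the built-in mtcars data (fitOne and runResamples are hypothetical names, and the linear model is a stand-in for a real learner):

```r
set.seed(1)
## 20 bootstrap samples of row indices
boot_idx <- replicate(20, sample(nrow(mtcars), replace = TRUE), simplify = FALSE)

## One independent unit of work: fit on a bootstrap sample, score out-of-bag rows
fitOne <- function(idx) {
  train <- mtcars[idx, ]
  test  <- mtcars[-unique(idx), ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
}

## The apply function is a parameter, so the loop body never changes
runResamples <- function(indices, applyFun = lapply) {
  unlist(applyFun(indices, fitOne))
}

seq_rmse <- runResamples(boot_idx)   # sequential
## Parallel drop-in on Unix-alikes (forking is not available on Windows):
## par_rmse <- runResamples(boot_idx,
##                          function(X, FUN) parallel::mclapply(X, FUN, mc.cores = 2))
mean(seq_rmse)
```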




An Example – Boosted Trees

We used a medium-sized data set (n = 4,500) to tune a gradient boosting
machine (GBM) model, running the computations both sequentially and in parallel.

We fit models with four different values of the interaction depth and 10
different values for the number of boosting iterations.

It turns out that, for each value of the interaction depth, we can fit one
model with the largest number of iterations and get the predictions from
smaller models at no cost.

This means we need to fit four models (with different interaction depths)
for 50 bootstrap samples. We’ll partition these 200 model fits onto
different processes in a few ways to see if parallelization helps.
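
The saving from the sub-model trick is easy to count. In the sketch below only the counts (4 depths, 10 iteration values, 50 resamples) come from the text; the actual grid values are placeholders.

```r
## Tuning grid: 4 interaction depths x 10 numbers of boosting iterations
grid <- expand.grid(interaction.depth = 1:4,                 # placeholder values
                    n.trees = seq(100, 1000, by = 100))      # placeholder values
n_resamples <- 50

## Naive approach: one fit per grid row per resample
naive_fits <- nrow(grid) * n_resamples

## Sub-model trick: one fit per depth, at the largest number of iterations
fits_needed <- length(unique(grid$interaction.depth)) * n_resamples

c(naive = naive_fits, with_submodels = fits_needed)
```

So the trick reduces 2000 potential fits to the 200 mentioned above.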



Execution Times – An Example – Support Vector Machines




SVM regression models with 5 candidate values of the cost parameter with
50 bootstrap iterations can be tested on the same data.

caret uses sub–models wherever possible to be efficient but, unlike
boosted trees, support vector machines cannot be exploited in this way.

The GBM and SVM computations were performed using sequential
processing and parallel processing with 1 to 16 “worker nodes”.




Execution Time Results
[Figure: Training Time versus #Processors for parallel and sequential runs, in two panels: GBM (training time axis about 20 to 80) and SVM (about 10 to 35)]

Speedups
speedup = Sequential Time / Parallel Time

[Figure: Speedup (y-axis, roughly 1 to 5) against #Processors (x-axis, 1 to 16), with one series each for GBM and SVM.]
Results




There is a benefit to adding more workers for these calculations.

The optimal speedup will be W, where W is the number of workers. We
are not optimal, but we can cut the execution time down by 4–5 fold.

The SVM model benefited more than the GBM model, perhaps since GBM
was fitting fewer models (using sub–models).
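
A minimal sketch of how these numbers can be computed. The timings below are hypothetical placeholders, not the measurements from the benchmark; only the speedup definition and the model counts (4 interaction depths and 5 cost values, each over 50 bootstrap samples) come from the earlier slides. The efficiency column (speedup / W) is an added convenience, not something shown in the talk.

```r
## Hypothetical wall-clock times in seconds (assumed values for illustration)
workers <- c(1, 2, 4, 8, 16)
seqTime <- 60
parTime <- c(62, 33, 18, 13, 12)

speedup    <- seqTime / parTime   # Sequential Time / Parallel Time
efficiency <- speedup / workers   # 1 would correspond to the optimal speedup W

timings <- data.frame(workers,
                      parTime,
                      speedup    = round(speedup, 2),
                      efficiency = round(efficiency, 2))
print(timings)

## Number of model fits available to spread across the workers
gbmFits <- 4 * 50   # one fit per interaction depth (sub-model trick)
svmFits <- 5 * 50   # one fit per cost value, no sub-models
c(GBM = gbmFits, SVM = svmFits)
```

With these made-up timings the speedup levels off at about 5 while the efficiency keeps falling, which is the same qualitative pattern as in the plot.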




Other Functions and Classes


      nearZeroVar: a function to remove predictors that are sparse and
      highly unbalanced
      findCorrelation: a function to remove the optimal set of
      predictors to achieve low pair–wise correlations
      predictors: class for determining which predictors are included in
      the prediction equations (e.g. rpart, earth, lars models) (currently
      57 methods)
      confusionMatrix, sensitivity, specificity, posPredValue,
      negPredValue: classes for assessing classifier performance
      varImp: classes for assessing the aggregate effect of a predictor on
      the model equations (currently 20 methods)
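
A short sketch of some of these helpers on simulated data. The data frame, the column names and the 0.9 cutoff are assumptions made for illustration; the calls use the default arguments of nearZeroVar and findCorrelation.

```r
library(caret)

set.seed(1)
n <- 200
a <- rnorm(n)
x <- data.frame(
  a      = a,
  b      = a + rnorm(n, sd = 0.01),   # nearly a copy of a
  c      = rnorm(n),
  sparse = c(rep(0, n - 2), 1, 1)     # almost constant
)

## nearZeroVar returns column indices. Guard against an empty result,
## because x[, -integer(0)] would drop every column.
nzv <- nearZeroVar(x)
removedNzv <- names(x)[nzv]
if (length(nzv) > 0) x <- x[, -nzv, drop = FALSE]

## findCorrelation returns the columns to remove so that no pair-wise
## correlation exceeds the cutoff
highCor <- findCorrelation(cor(x), cutoff = 0.9)
if (length(highCor) > 0) x <- x[, -highCor, drop = FALSE]
names(x)

## Classifier performance from predicted and observed factors
obs  <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
               levels = c("yes", "no"))
pred <- factor(c("yes", "yes", "no", "no", "no", "yes", "no", "yes"),
               levels = c("yes", "no"))
cm <- confusionMatrix(pred, obs)
cm$table
sensitivity(pred, obs, positive = "yes")
specificity(pred, obs, negative = "no")
```

Here the sparse column is flagged by nearZeroVar, and one of the two nearly identical columns (a or b) is dropped by findCorrelation.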




Other Functions and Classes


      knnreg: nearest–neighbor regression
      plsda, splsda: PLS discriminant analysis
      icr: independent component regression
      pcaNNet: nnet:::nnet with automatic PCA pre–processing step
      bagEarth, bagFDA: bagging with MARS and FDA models
      normalize2Reference: RMA–like processing of Affy arrays using a
      training set
      spatialSign: class for transforming numeric data (x = x / ||x||)
      maxDissim: a function for maximum dissimilarity sampling
      featurePlot: a wrapper for several lattice functions
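
As an illustration of the spatial sign transformation, the sketch below applies spatialSign to a small simulated matrix and compares it with the formula written out by hand. The matrix and the object names are assumptions for the example only.

```r
library(caret)

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)    # 5 samples, 4 numeric predictors

ss <- spatialSign(x)                # each row divided by its Euclidean norm

## The same projection by hand: x / ||x|| applied row by row
manual <- x / sqrt(rowSums(x^2))

all.equal(as.vector(ss), as.vector(manual))
sqrt(rowSums(ss^2))                 # every sample now has unit length
```

Because every sample is projected onto the unit sphere, predictors are usually centered and scaled first so that no single predictor dominates the norm.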




Shameless Plug # 2: Other Packages




A few others that I’m working on...
      sparseLDA: Lasso–type regularization for LDA
      Cubist: Quinlan’s model trees
      C5.0: Quinlan’s decision trees (I could use some C help here)
      FuseBox: a framework for combining ensembles of models




Thanks



Kirk Mettler, Bruno and the NYC Predictive Analytics Organizers
R Core
Pfizer’s Statistics leadership for providing the time and support to create R
packages
caret contributors: Jed Wing, Steve Weston, Andre Williams, Chris
Keefer and Allan Engelhardt




Session Info


      R version 2.11.1 (2010-05-31), x86_64-apple-darwin9.8.0
      Base packages: base, datasets, graphics, grDevices, methods, splines, stats,
      tools, utils
      Other packages: caret 4.87, class 7.3-2, cluster 1.12.3, codetools 0.2-2,
      digest 0.4.2, e1071 1.5-24, gbm 1.6-3.1, kernlab 0.9-12, lattice 0.18-8,
      plyr 1.2.1, reshape 0.8.3, survival 2.35-8, weaver 1.16.0
      Loaded via a namespace (and not attached): grid 2.11.1


This presentation was created with a MacPro using LaTeX and R’s Sweave
function at 16:02 on Wednesday, May 11, 2011.




