Data science with R
Brigitte Mueller
https://github.com/mbbrigitte/Ruby_Talk_Material
How to compare evaporation datasets?
[Figure: four panels labelled Dataset Number 1 to Dataset Number 4. Source: Mueller et al., GRL, 2011]
How to compare 30 datasets?
Hierarchical clustering
How to compare datasets?
Create some data
x <- rnorm(30)
y <- rnorm(30)
plot(x,y)
datamatrix <- cbind(x,y)
Calculate the distances and the clusters
distmatrix <- dist(datamatrix)
fit <- hclust(distmatrix, method="ward.D")
plot(fit)
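The dendrogram drawn by plot(fit) shows how the points merge, but it does not by itself assign each point to a group. A minimal sketch of one way to get group labels, assuming that three groups is a sensible cut (k = 3 and the seed are illustrative choices, not from the talk):

```r
# Recreate the toy data from the slides (seed added only for reproducibility).
set.seed(42)
x <- rnorm(30)
y <- rnorm(30)
datamatrix <- cbind(x, y)

# Pairwise Euclidean distances, then Ward hierarchical clustering.
distmatrix <- dist(datamatrix)
fit <- hclust(distmatrix, method = "ward.D")

# Cut the tree into k groups; k = 3 is an arbitrary illustrative choice.
groups <- cutree(fit, k = 3)
table(groups)                        # number of points in each group

# Colour the scatter plot by group membership.
plot(x, y, col = groups, pch = 19)
```

With real datasets in place of the random points, the same labels would say which datasets fall into the same group.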
Data science with R - Clustering and Classification
>> require "rinruby"
- Reads definition of RinRuby class into Ruby interpreter
- Creates instance of RinRuby class named R
- eval instance method passes R commands contained in the supplied string
>> sample_size = 10
>> R.eval "x <- rnorm(#{sample_size})"
>> R.eval "summary(x)"
produces the following:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.88900 -0.84930 -0.45220 -0.49290 -0.06069  0.78160
More info: https://sites.google.com/a/ddahl.org/rinruby-users/documentation
Delayed or not?
Supervised learning
Target
Binary prediction: Delayed 0/1
Arriving late (≥ 15 minutes)
Coin toss: 50% accuracy
Goal: 70% accuracy
Prepare data
Variable
Observation
Clean, explore, tidy
Split into training and testing data
Training data Testing data
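The split step can be sketched in base R as follows. The data frame flights, its size, and the 80/20 ratio are made up for illustration; the talk's train.csv and test.csv were already split by the author, and the method used there is not stated.

```r
set.seed(100)

# Hypothetical tidy data: one row per observation, one column per variable.
flights <- data.frame(
  ARR_DEL15   = rbinom(1000, size = 1, prob = 0.25),
  DAY_OF_WEEK = sample(1:7, 1000, replace = TRUE)
)

# Draw 80% of the row indices at random for training; the rest is for testing.
n <- nrow(flights)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
trainData <- flights[train_idx, ]
testData  <- flights[-train_idx, ]

nrow(trainData)   # 800
nrow(testData)    # 200
```

Keeping the test rows out of training is what makes the later accuracy figure an honest estimate for new data.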
Prepare data
- Clean, explore, tidy
- Split into training and testing data
Train your model
Test your model
Use your model with new data
Data
Prepared for you, tidied and saved as
train.csv test.csv
Download at
https://github.com/mbbrigitte/Ruby_Talk_Material
Data
predict_flightdelays.md
mbbrigitte/Ruby_Talk_Material
Data
What variables are in the files?
Check with
data <- read.csv(filename)
names(data)
ARR_DEL15, DAY_OF_WEEK, CARRIER, DEST, ORIGIN,
DEP_TIME_BLK
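A quick way to look beyond the column names is sketched below. To stay self-contained it first writes a tiny made-up CSV to a temporary file; the three rows are invented and only the column names come from the slide.

```r
# Invented example rows; only the column names are taken from the talk.
toy <- data.frame(
  ARR_DEL15    = c(0, 1, 0),
  DAY_OF_WEEK  = c(1, 5, 7),
  CARRIER      = c("AA", "DL", "UA"),
  DEST         = c("JFK", "ATL", "ORD"),
  ORIGIN       = c("LAX", "SFO", "DEN"),
  DEP_TIME_BLK = c("0600-0659", "1700-1759", "0900-0959")
)
filename <- tempfile(fileext = ".csv")
write.csv(toy, filename, row.names = FALSE)

data <- read.csv(filename)
names(data)                          # column names
str(data)                            # type of each column
prop.table(table(data$ARR_DEL15))    # share of on-time (0) and delayed (1) rows
```

The last line is worth running on the real train.csv: the share of the larger class is the accuracy a model reaches by always predicting that class.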
Code: Set-up
set.seed(100)
install.packages('caret')
library(caret)
Code: Read data
trainData <- read.csv('train.csv',sep=',', header=TRUE)
testData <- read.csv('test.csv',sep=',', header=TRUE)
Select algorithm
• Classification algorithm
• Start simple
• If performance is not good enough, improve it, e.g. with:
– Ensemble algorithms
– Select more important variables from the data
– Include additional predictor variables
– Feature-engineering
Logistic regression
• Regression that predicts a categorical value
Predictor
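To make the idea concrete, here is a small sketch on simulated data using base R's glm. The variable names, the coefficients, and the 0.5 cut-off are assumptions for illustration, not part of the talk. The model estimates a probability between 0 and 1, which is then turned into a 0/1 class.

```r
set.seed(100)

# Simulated data: the chance of y = 1 rises with the predictor x.
x <- rnorm(200)
p <- plogis(-0.5 + 2 * x)          # logistic function: 1 / (1 + exp(-(a + b*x)))
y <- rbinom(200, size = 1, prob = p)

# Logistic regression: a generalized linear model with binomial family.
model <- glm(y ~ x, family = binomial)

# Predicted probabilities for new x values, then a hard 0/1 decision.
new_x <- data.frame(x = c(-2, 0, 2))
prob <- predict(model, newdata = new_x, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)
```

The cut-off decides how probabilities become classes, so moving it trades one kind of error for the other.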
Train
library(caret)
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData,
                          method = 'glm', family = 'binomial')
The dot means 'all available variables', i.e. all other columns. glm stands for generalized linear model; family 'binomial' makes it a logistic regression.
Predict and test
Use your model and the test data to check how well we predict flight arrival
delays.
logRegPrediction <- predict(logisticRegModel, testData)
logRegConfMat <- confusionMatrix(logRegPrediction,
testData[,"ARR_DEL15"])
logRegConfMat
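What a confusion matrix tabulates can be sketched with base R's table(). The two vectors below are invented, not the talk's data:

```r
# Invented labels: 0 = not delayed, 1 = delayed.
reference  <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
prediction <- factor(c(0, 0, 0, 1, 0, 1, 0, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are the true (reference) values.
conf <- table(Prediction = prediction, Reference = reference)
conf

# Accuracy: share of cases on the diagonal (prediction equals reference).
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

caret's confusionMatrix reports the same table and adds the summary statistics on top of it.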
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 7465 2273
##          1   65   94
##
## Accuracy : 0.7638
## 95% CI : (0.7553, 0.7721)
## No Information Rate : 0.7608
## P-Value [Acc > NIR] : 0.2513
##
## Kappa : 0.0457
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99137
## Specificity : 0.03971
## Pos Pred Value : 0.76658
## Neg Pred Value : 0.59119
## Prevalence : 0.76084
## Detection Rate : 0.75427
## Detection Prevalence : 0.98393
## Balanced Accuracy : 0.51554
##
## 'Positive' Class : 0
Specificity
Proportion of negatives that are correctly identified as such. The 'Positive' class here is 0 (not delayed), so the negatives are the delayed flights.
Specificity = 94/(2273+94) ≈ 0.04
                 Reference
Prediction       0 not delayed   1 delayed
0 not delayed    7465            2273
1 delayed        65              94
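The summary statistics in the caret output follow directly from these four counts. A sketch of the arithmetic, with the positive class set to 0 (not delayed) as in that output; the variable names are my own:

```r
# Counts from the confusion matrix in the talk (rows: prediction, columns: reference).
conf <- matrix(c(7465, 65, 2273, 94), nrow = 2,
               dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- (conf["0", "0"] + conf["1", "1"]) / sum(conf)
sensitivity <- conf["0", "0"] / sum(conf[, "0"])   # on-time flights predicted on time
specificity <- conf["1", "1"] / sum(conf[, "1"])   # delayed flights predicted delayed
no_info     <- max(colSums(conf)) / sum(conf)      # always guessing the larger class
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_info = no_info, balanced = balanced), 4)
```

Accuracy (0.7638) is barely above the no-information rate (0.7608): the model labels almost every flight as not delayed, which is why the low specificity matters more than the headline accuracy.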
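Specificity is low: improve the model
names(getModelInfo())   # lists the model types that caret can train
# pick another method from this list in place of 'glm'
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData,
                          method = 'glm', family = 'binomial')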
Next steps
• Try basics yourself
– Improve model used with data in this talk
– Titanic dataset: http://amunategui.github.io/binary-outcome-modeling/
– https://www.datacamp.com/courses/kaggle-tutorial-on-machine-learing-the-sinking-of-the-titanic
• Try advanced methods
– Kaggle
• Find your own dataset
• Learn more about machine learning and R:
Further reading
Elements of Statistical Learning, Hastie et al. 2009, Springer.
Available for free:
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Thank you
Questions & feedback
brigitte.mueller@yahoo.ca
Picture sources
• http://www.tronviggroup.com/open-source-evolution/ (world with people)
• http://twit88.com/blog/2011/03/01/open-source-ide-for-r/ (R IDE)
• http://www.dailymail.co.uk (coin toss)
• http://www.theanalysisfactor.com/r-glm-plotting/ (logistic regression figure)
Do it yourself
• Download and install R (https://www.r-project.org/) and, if you want to, RStudio (https://www.rstudio.com/), which is convenient
• Download the train.csv and test.csv files from GitHub: https://github.com/mbbrigitte/Ruby_Talk_Material
• Use the ….Rmd files in R or just browse the code with the ….md file in your browser
R: Packages and functions
• Lots of statistical packages (libraries)
install.packages('caret')
library(caret)
• Run line by line or write programs in files ending in .R
source("foo.R")
• Function
myfun <- function(arg1, arg2, ...) {
  w <- arg1^2
  return(arg2 + w)
}
myfun(arg1 = 3, arg2 = 5) # output is 14
R: Subsetting
• Matrix
mat <- matrix(data=c(9,2,3,4,5,6),ncol=3)
mat[1,2] #output is 3
mat[2,] #output is 2,4,6
• Lists:
L = list(one=1, two=c(1,2), five=seq(0, 1,length=5))
L$five #output 0.00 0.25 0.50 0.75 1.00
Original data source
http://1.usa.gov/1KEd08B
Mueller et al. GRL, 2011
Data groups
Results from evaporation dataset clustering
Example with gbm instead of the glm method, i.e. a boosted tree model; see
http://topepo.github.io/caret/training.html
fitControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
gbmFit1 <- train(ARR_DEL15 ~ ., data = trainData, method = 'gbm',
                 trControl = fitControl, verbose = FALSE)

