Data science with R - Clustering and Classification

Data science with R
Brigitte Mueller
https://github.com/mbbrigitte/Ruby_Talk_Material

How to compare evaporation datasets?
[Figure: four evaporation datasets, Dataset Number 1 to Dataset Number 4. Mueller et al., GRL, 2011]

How to compare 30 datasets?
Hierarchical clustering

Create some data
x <- rnorm(30)
y <- rnorm(30)
plot(x,y)
datamatrix <- cbind(x,y)

Calculate the distances and the clusters
distmatrix <- dist(datamatrix)
fit <- hclust(distmatrix, method="ward.D")
plot(fit)
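The dendrogram only shows the merge tree; to assign each point to a group, the tree has to be cut. A minimal sketch using base R's cutree, where the seed and the number of clusters k = 3 are arbitrary assumptions for illustration and not values from the talk:

```r
# Sketch: cut the dendrogram into a fixed number of groups.
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(30)
y <- rnorm(30)
datamatrix <- cbind(x, y)

distmatrix <- dist(datamatrix)       # Euclidean distances by default
fit <- hclust(distmatrix, method = "ward.D")

k <- 3                               # assumed number of clusters
groups <- cutree(fit, k = k)         # one cluster label per observation
table(groups)                        # cluster sizes

plot(x, y, col = groups, pch = 19)   # colour each point by its cluster
```

Since the data here are pure noise, the groups have no real meaning; with real datasets the choice of k is usually guided by the dendrogram.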


>> require "rinruby"
- Reads definition of RinRuby class into Ruby interpreter
- Creates instance of RinRuby class named R
- eval instance method passes R commands contained in the supplied string
>> sample_size = 10
>> R.eval "x <- rnorm(#{sample_size})"
>> R.eval "summary(x)"
produces the following:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.88900 -0.84930 -0.45220 -0.49290 -0.06069 0.78160
More info: https://sites.google.com/a/ddahl.org/rinruby-users/documentation

Delayed or not?
Supervised learning

Target
Binary prediction: Delayed 0/1
Arriving late (≥ 15 minutes)
50% accuracy
Goal
70% accuracy

Prepare data
Clean, explore, tidy


Prepare data
Variable
Observation
Split into training and testing data
Training data Testing data

Prepare data
- Clean, explore, tidy
- Split into training and testing data
Train your model
Test your model
Use your model with new data
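The split into training and testing data could be sketched in base R as follows. The toy data frame, its values, and the 70/30 ratio are assumptions for illustration; the talk's train.csv and test.csv are already split.

```r
# Sketch: random split of observations (rows) into training and testing sets.
set.seed(100)

# Hypothetical toy data: 10 observations (rows), 2 variables (columns)
flights <- data.frame(
  ARR_DEL15   = rep(c(0, 1), times = 5),
  DAY_OF_WEEK = rep(1:5, times = 2)
)

n <- nrow(flights)
trainIndex <- sample(n, size = round(0.7 * n))   # assumed 70% for training

trainToy <- flights[trainIndex, ]    # rows used to fit the model
testToy  <- flights[-trainIndex, ]   # held-out rows used only for testing

nrow(trainToy)
nrow(testToy)
```

Keeping the test rows out of training is what makes the later accuracy estimate honest.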

Data
Prepared for you, tidied and saved as
train.csv test.csv
Download at
https://github.com/mbbrigitte/Ruby_Talk_Material


Data
predict_flightdelays.md
mbbrigitte/Ruby_Talk_Material

Data
What variables are in the files?
Check with
data <- read.csv(filename)
names(data)
ARR_DEL15, DAY_OF_WEEK, CARRIER, DEST, ORIGIN, DEP_TIME_BLK

Code: Set-up
set.seed(100)
install.packages('caret')
library(caret)

Code: Read data
trainData <- read.csv('train.csv',sep=',', header=TRUE)
testData <- read.csv('test.csv',sep=',', header=TRUE)

Select algorithm
• Classification algorithm
• Start simple
• If performance is not good enough, improve
– Ensemble algorithms
– Select more important variables from the data
– Include additional predictor variables
– Feature-engineering

Logistic regression
• Regression that predicts a categorical value
[Figure: logistic regression curve plotted against a predictor]
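To make the idea concrete, here is a minimal sketch with base R's glm on simulated data. The variable names and the simulated relationship are hypothetical and unrelated to the flight data.

```r
# Sketch: logistic regression on simulated data with one predictor.
set.seed(1)

# Hypothetical toy data: the chance of outcome 1 rises with the predictor
predictor <- seq(-3, 3, length.out = 200)
outcome   <- rbinom(200, size = 1, prob = plogis(2 * predictor))

# family = binomial makes glm fit a logistic regression
toyModel <- glm(outcome ~ predictor, family = binomial)

# Predicted probabilities always lie between 0 and 1
probs <- predict(toyModel, type = "response")

# Turn probabilities into a categorical prediction (0.5 cutoff is an assumption)
predictedClass <- ifelse(probs > 0.5, 1, 0)
mean(predictedClass == outcome)      # share of correct predictions
```

The model predicts a probability; the categorical value comes from applying a cutoff to it.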

Train
library(caret)
# Assumption: ARR_DEL15 is stored as a factor, so caret treats this as classification
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData, method = 'glm', family = 'binomial')
Dot: 'all available variables', i.e. all other columns. glm: generalized linear model. Family binomial for logistic regression.

Predict and test
Use your model and the test data to check how well we predict flight arrival delays.
logRegPrediction <- predict(logisticRegModel, testData)
logRegConfMat <- confusionMatrix(logRegPrediction, testData[,"ARR_DEL15"])
logRegConfMat

## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 7465 2273
##          1   65   94
##
## Accuracy : 0.7638
## 95% CI : (0.7553, 0.7721)
## No Information Rate : 0.7608
## P-Value [Acc > NIR] : 0.2513
##
## Kappa : 0.0457
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99137
## Specificity : 0.03971
## Pos Pred Value : 0.76658
## Neg Pred Value : 0.59119
## Prevalence : 0.76084
## Detection Rate : 0.75427
## Detection Prevalence : 0.98393
## Balanced Accuracy : 0.51554
##
## 'Positive' Class : 0
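The accuracy looks better than the 70% goal, but it is worth comparing it with the No Information Rate, i.e. the accuracy of always predicting the most frequent class. A rough sketch that recomputes both from the four counts in the output; the one-sided binomial test is my understanding of how the reported P-Value [Acc > NIR] is obtained and should be read as an assumption:

```r
# Sketch: accuracy versus the no-information rate, from the printed counts.
confMat <- matrix(c(7465, 65, 2273, 94), nrow = 2,
                  dimnames = list(Prediction = c("0", "1"),
                                  Reference  = c("0", "1")))

total    <- sum(confMat)
accuracy <- sum(diag(confMat)) / total      # correct predictions / all predictions
nir      <- max(colSums(confMat)) / total   # share of the most frequent true class

# Assumed test: is the accuracy significantly above the no-information rate?
pValue <- binom.test(sum(diag(confMat)), total, p = nir,
                     alternative = "greater")$p.value

round(c(accuracy = accuracy, nir = nir), 4)
pValue
```

Because about 76% of the test flights are not delayed, a model that always predicts "not delayed" would already reach almost the same accuracy.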

Specificity
Proportion of negatives that are correctly identified as such.
The 'positive' class here is 0 (not delayed), so the negatives are the delayed flights.
Specificity = 94/(2273+94) = 0.0397
                  Reference
Prediction        0 not delayed   1 delayed
0 not delayed     7465            2273
1 delayed         65              94
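The per-class statistics can be recomputed by hand from the four counts in the confusionMatrix output. A small sketch, with class 0 (not delayed) as the positive class as in that output; the variable names are my own:

```r
# Sketch: sensitivity, specificity and balanced accuracy from the four counts.
# Positive class is 0 (not delayed), negative class is 1 (delayed).
truePositive  <- 7465   # not delayed, predicted not delayed
falseNegative <- 65     # not delayed, predicted delayed
falsePositive <- 2273   # delayed, predicted not delayed
trueNegative  <- 94     # delayed, predicted delayed

sensitivity <- truePositive / (truePositive + falseNegative)
specificity <- trueNegative / (trueNegative + falsePositive)
balancedAccuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity,
        specificity = specificity,
        balancedAccuracy = balancedAccuracy), 5)
```

The model finds almost all on-time flights but only about 4% of the delayed ones, which is why the balanced accuracy is close to 0.5.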

Specificity is low - Improve model
# List the methods available in caret, then try another value for method
names(getModelInfo())
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData, method = 'glm', family = 'binomial')

Next steps
• Try basics yourself
– Improve model used with data in this talk
– Titanic dataset: http://amunategui.github.io/binary-outcome-modeling/
– https://www.datacamp.com/courses/kaggle-tutorial-on-machine-learing-the-sinking-of-the-titanic
• Try advanced methods
– Kaggle
• Find your own dataset
• Learn more about machine learning and R:

Further reading
Elements of Statistical Learning, Hastie et al. 2009, Springer:
Available for free
http://statweb.stanford.edu/~tibs/ElemStatLearn/

Thank you
Questions & feedback
brigitte.mueller@yahoo.ca

Picture sources
• http://www.tronviggroup.com/open-source-evolution/ (world with people)
• http://twit88.com/blog/2011/03/01/open-source-ide-for-r/ (R IDE)
• http://www.dailymail.co.uk (coin toss)
• http://www.theanalysisfactor.com/r-glm-plotting/ (logistic regression figure)

Do it yourself
• Download and install R https://www.r-project.org/ and, if you want to, RStudio https://www.rstudio.com/ (it is convenient)
• Download the train.csv and test.csv files from GitHub https://github.com/mbbrigitte/Ruby_Talk_Material
• Use the ….Rmd files in R or just browse the code with the ….md file in your browser

R: Packages and functions
• Lots of statistical packages (libraries)
install.packages('caret')
library(caret)
• Run line by line or write programs in files ending in .R
source("foo.R")
• Function
myfun <- function(arg1, arg2, ...) {
  w <- arg1^2
  return(arg2 + w)
}
myfun(arg1=3, arg2=5) #output is 14

R: Subsetting
• Matrix
mat <- matrix(data=c(9,2,3,4,5,6),ncol=3)
mat[1,2] #output is 3
mat[2,] #output is 2,4,6
• Lists:
L = list(one=1, two=c(1,2), five=seq(0, 1,length=5))
L$five #output 0.00 0.25 0.50 0.75 1.00

Original data source
http://1.usa.gov/1KEd08B

Results from evaporation dataset clustering
[Figure: data groups. Mueller et al., GRL, 2011]

Example with gbm instead of glm method, i.e. boosted tree model: see http://topepo.github.io/caret/training.html
fitControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
gbmFit1 <- train(ARR_DEL15 ~ ., data = trainData, method = 'gbm', trControl = fitControl, verbose = FALSE)
