Data Manipulation on R 
Factor Manipulations, Subset, Sorting and Reshape
Abhik Seal
Indiana University School of Informatics and Computing (dsdht.wikispaces.com)
Basic Manipulating Data 
So far, we've covered how to read in data from various sources, such as files, the internet and databases, and in various file formats. In this session we look at how to manipulate the data after reading it in, to make data processing easier.
2/35
Sorting and Ordering data 
sort(x,decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.'
order(...,decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.'
x <- c(1,5,7,8,3,12,34,2) 
sort(x) 
## [1] 1 2 3 5 7 8 12 34 
order(x) 
## [1] 1 8 5 2 3 4 6 7 
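The relationship between the two functions can be sketched as follows: indexing a vector with the permutation returned by order() gives the same result as sort(). This is a minimal base R illustration; the variable names are chosen only for this example.

```r
x <- c(1, 5, 7, 8, 3, 12, 34, 2)

# order() returns positions, so indexing x with them sorts x
idx <- order(x)
sorted_via_order <- x[idx]

# decreasing = TRUE reverses the direction for both functions
desc_sorted    <- sort(x, decreasing = TRUE)
desc_via_order <- x[order(x, decreasing = TRUE)]

identical(sorted_via_order, sort(x))    # TRUE
identical(desc_via_order, desc_sorted)  # TRUE
```

The permutation is what makes order() useful for data frames: the same index can reorder every column at once.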
3/35
Some examples of sorting and ordering 
# sort by mpg (columns must be referenced through the data frame unless it is attached)
newdata <- mtcars[order(mtcars$mpg),]
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
# sort by mpg and cyl
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
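Sorting on several keys with mixed directions could look like the sketch below. It assumes the built-in mtcars data and uses the common trick of negating a numeric column to sort it in descending order; the object name by_cyl_mpg is made up for this example.

```r
# Sort by cyl ascending, then by mpg descending within each cyl group.
# Negating a numeric column reverses its direction inside order().
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

# Show only the two sort keys for the first rows
head(by_cyl_mpg[, c("mpg", "cyl")], 3)
```

Negation only works for numeric columns; for character columns, the decreasing argument of order() is the usual alternative.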
4/35
Ordering with plyr 
library(plyr) 
head(arrange(mtcars,mpg),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
head(arrange(mtcars,desc(mpg)),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 
5/35
Subsetting data 
set.seed(12345) 
#create a dataframe 
X<-data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30)) 
# Add NA VALUES 
X <- X[sample(1:10),]
X$B[c(1,6,10)] <- NA
head(X) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 10 5 NA 26 
6/35
Basic data subsetting 
# Accessing only first row 
X[1,] 
## A B C 
## 8 4 NA 27 
# accessing only first column 
X[,1] 
## [1] 4 8 10 3 7 5 9 1 2 6 
# accessing first row and first column 
X[1,1] 
## [1] 4 
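A few related indexing forms, sketched on a small made-up data frame (df and its columns are hypothetical, not the X used above): selecting a single column drops to a vector unless drop = FALSE is given, and columns can also be selected by name.

```r
df <- data.frame(A = 1:3, B = c("x", "y", "z"), stringsAsFactors = FALSE)

col_vec     <- df[, 1]                # a plain integer vector
col_df      <- df[, 1, drop = FALSE]  # still a data frame with one column
col_by_name <- df[["B"]]              # select a column by name
rows_cols   <- df[df$A >= 2, c("A", "B")]  # filter rows, pick columns by name
```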
7/35
AND/OR conditions
head(X[(X$A <=6 & X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 10 5 NA 26 
## 7 2 19 29 
head(X[(X$A <=6 | X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 5 3 13 24 
8/35
Selecting non-NA values from a data frame
# select the rows without NA values in column B
head(X[!is.na(X$B),],4)
## A B C 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
# select those which have values > 11
head(X[which(X$B>11),],4) 
## A B C 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 4 9 20 30 
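Why which() appears in these calls can be sketched on a small made-up data frame (df here is hypothetical): a comparison with NA yields NA, and a logical index containing NA produces all-NA rows, whereas which() keeps only the TRUE positions.

```r
df <- data.frame(A = c(4, 8, 10, 3), B = c(NA, 11, 12, NA))

# A comparison involving NA yields NA, not TRUE or FALSE
cmp <- df$B > 11             # NA FALSE TRUE NA

# Indexing with a logical vector that contains NA returns all-NA rows;
# which() drops the NA positions and keeps only the TRUE ones
rows_raw   <- df[cmp, ]
rows_which <- df[which(cmp), ]

# is.na() and complete.cases() test for missingness directly
no_na_B  <- df[!is.na(df$B), ]
complete <- df[complete.cases(df), ]
```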
9/35
# creating a data frame with 2 variables 
data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1)) 
list_data<-list(dat=data,vec.obj=c(1,2,3)) 
list_data 
## $dat 
## x1 x2 
## 1 2 5 
## 2 3 6 
## 3 4 7 
## 4 5 8 
## 5 6 1 
## 
## $vec.obj 
## [1] 1 2 3 
# accessing the second element of list_data
list_data[[2]] 
## [1] 1 2 3 
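The different ways of reaching into a list can be sketched as follows. The list lst is a shortened, made-up version of list_data; the point is the difference between double brackets, the $ operator and single brackets.

```r
lst <- list(dat = data.frame(x1 = c(2, 3), x2 = c(5, 6)),
            vec.obj = c(1, 2, 3))

by_pos  <- lst[[2]]       # the element itself: a numeric vector
by_name <- lst$vec.obj    # the same element, selected by name
sub_lst <- lst[2]         # single brackets return a list of length one
nested  <- lst$dat$x2[1]  # chained access into the data frame
```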
10/35
Factors 
Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an intrinsic ordering). Note that functions like read.table() read character strings in as factors by default in R versions before 4.0.0; since R 4.0.0 the default is stringsAsFactors = FALSE.
'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'
is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes.
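An ordered factor, as described above, might be created like this. The sizes vector is invented for illustration; the levels argument fixes the ordering and ordered = TRUE makes comparisons meaningful.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

is.factor(sizes)           # TRUE
is.ordered(sizes)          # TRUE
below_large <- sizes < "large"   # comparisons respect the level order
largest     <- max(sizes)        # max() is defined for ordered factors
```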
11/35
Factors 
Suppose we have a vector of case-control status 
cc=factor(c("case","case","case","control","control","control")) 
cc 
## [1] case case case control control control 
## Levels: case control 
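# Caution: assigning to levels() renames the existing levels; it does not reorder them,
# so every "case" below becomes "control" and vice versa
levels(cc)=c("control","case")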
cc 
## [1] control control control case case case 
## Levels: control case 
12/35
Factors 
Factors can be converted to numeric or character very easily
x=factor(c("case","case","case","control","control","control"),levels=c("control","case")) 
as.character(x) 
## [1] "case" "case" "case" "control" "control" "control" 
as.numeric(x) 
## [1] 2 2 2 1 1 1 
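One consequence of the conversion above is worth a small sketch: as.numeric() returns the internal level codes, so a factor whose labels look like numbers has to go through as.character() first. The factor f is a made-up example.

```r
f <- factor(c("10", "20", "10", "5"))

# Levels are sorted as strings: "10", "20", "5"
codes  <- as.numeric(f)                # 1 2 1 3  (level codes, not the values)
values <- as.numeric(as.character(f))  # 10 20 10 5 (the original numbers)
```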
13/35
Cut 
Now that we know more about factors, cut() will make more sense:
x=1:100 
cx=cut(x,breaks=c(0,10,25,50,100)) 
head(cx) 
## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] 
## Levels: (0,10] (10,25] (25,50] (50,100] 
table(cx) 
## cx 
## (0,10] (10,25] (25,50] (50,100] 
## 10 15 25 50 
14/35
Cut 
We can also leave off the labels; cut() then returns integer codes instead of a factor
cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE) 
head(cx) 
## [1] 1 1 1 1 1 1 
table(cx) 
## cx 
## 1 2 3 4 
## 10 15 25 50 
15/35
Cut 
# values that fall outside the breaks become NA
cx=cut(x,breaks=c(10,25,50),labels=FALSE)
head(cx) 
## [1] NA NA NA NA NA NA 
table(cx) 
## cx 
## 1 2 
## 15 25 
table(cx,useNA="ifany") 
## cx 
## 1 2 <NA> 
## 15 25 60 
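A few other cut() arguments control how interval boundaries and labels are handled. The sketch below uses the same x = 1:100; the break points and the label names are chosen only for illustration.

```r
x <- 1:100

# include.lowest = TRUE closes the first interval on the left, so a value
# equal to the smallest break is kept instead of becoming NA
cx_closed <- cut(x, breaks = c(1, 10, 25, 50, 100), include.lowest = TRUE)

# right = FALSE makes intervals of the form [a, b)
cx_left <- cut(x, breaks = c(1, 10, 25, 50, 101), right = FALSE)

# custom labels, one per interval
cx_lab <- cut(x, breaks = c(0, 10, 25, 50, 100),
              labels = c("low", "mid", "high", "top"))
table(cx_lab)
```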
16/35
Adding to data frames 
m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE) 
m1 
## [,1] [,2] [,3] 
## [1,] 1 4 7 
## [2,] 2 5 8 
## [3,] 3 6 9 
m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE) 
m2 
## [,1] [,2] [,3] 
## [1,] 1 2 3 
## [2,] 4 5 6 
## [3,] 7 8 9 
17/35
Adding using cbind 
You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). You can also add rows (or another matrix/data frame) using rbind() ('row bind'). Note that a vector you add has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()).
cbind(m1,m2) 
## [,1] [,2] [,3] [,4] [,5] [,6] 
## [1,] 1 4 7 1 2 3 
## [2,] 2 5 8 4 5 6 
## [3,] 3 6 9 7 8 9 
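The rbind() case and the data frame case described above could look like this. It is a sketch with made-up objects (df, score, passed); for data frames, the column names of the pieces passed to rbind() have to match.

```r
m1 <- matrix(1:9, nrow = 3, ncol = 3)

stacked <- rbind(m1, 10:12)                # add one row; length equals ncol(m1)
widened <- cbind(m1, total = rowSums(m1))  # add a named column; length equals nrow(m1)

df  <- data.frame(id = 1:3, score = c(90, 85, 70))
df2 <- cbind(df, passed = df$score >= 80)         # new column on a data frame
df3 <- rbind(df, data.frame(id = 4, score = 60))  # new row; column names must match
```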
18/35
Reshape data 
A dataset can be laid out in long or wide format. In the long layout, multiple rows represent a single subject's record, whereas in the wide layout a single row represents a single subject's record. Some statistical analyses require wide data and others require long data, so we often need to reshape a dataset to meet the requirements of the analysis. Reshaping only rearranges the form of the data; it does not change the content of the dataset. This section focuses on the melt and cast paradigm of reshaping datasets, which was first implemented in the contributed package reshape. The package was later reimplemented under a new name, reshape2, which is much more time and memory efficient (see the paper Reshaping Data with the reshape Package, by Wickham, available at http://www.jstatsoft.org/v21/i12/paper).
19/35
Wide data has a column for each variable. For example, this is wide-format data: 
# ozone wind temp 
# 1 23.62 11.623 65.55 
# 2 29.44 10.267 79.10 
# 3 59.12 8.942 83.90 
# 4 59.96 8.794 83.97 
Data in long format 
# variable value 
# 1 ozone 23.615 
# 2 ozone 29.444 
# 3 ozone 59.115 
# 4 ozone 59.962 
# 5 wind 11.623 
# 6 wind 10.267 
# 7 wind 8.942 
# 8 wind 8.794 
# 9 temp 65.548 
# 10 temp 79.100 
# 11 temp 83.903 
# 12 temp 83.968 
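The two layouts can be reproduced with base R alone. The sketch below rebuilds the wide table shown above (using its rounded values) and converts it with stack() and unstack(); this is only an illustration of the layouts, not the melt/cast approach of the following slides.

```r
wide <- data.frame(ozone = c(23.62, 29.44, 59.12, 59.96),
                   wind  = c(11.623, 10.267, 8.942, 8.794),
                   temp  = c(65.55, 79.10, 83.90, 83.97))

# stack() returns a long data frame with columns "values" and "ind"
long <- stack(wide)
names(long) <- c("value", "variable")
long <- long[, c("variable", "value")]

# unstack() goes back to one column per variable
back <- unstack(long, value ~ variable)
```

Each of the 4 rows and 3 columns of the wide table becomes one of the 12 rows of the long table.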
20/35
reshape2 Package
"In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format."
reshape2 is based around two key functions, melt and cast: melt takes wide-format data and melts it into long-format data; cast takes long-format data and casts it into wide-format data.
21/35
Melt 
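library(reshape2)
# the built-in airquality columns are capitalised (Ozone, Solar.R, ...);
# lower-case them so they match the names used in the rest of this section
names(airquality) <- tolower(names(airquality))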
head(airquality,2) 
## ozone solar.r wind temp month day 
## 1 41 190 7.4 67 5 1 
## 2 36 118 8.0 72 5 2 
aql <- melt(airquality) # [a]ir [q]uality [l]ong format 
head(aql,5) 
## variable value 
## 1 ozone 41 
## 2 ozone 36 
## 3 ozone 12 
## 4 ozone 18 
## 5 ozone NA 
22/35
By default, melt assumes that all columns with numeric values are variables with values. Here we may instead want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that by telling melt that month and day are “ID variables”, that is, the variables that identify individual rows of data.
m <- melt(airquality, id.vars = c("month", "day")) 
head(m,4) 
## month day variable value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
23/35
melt also allows us to control the column names in the long-format data
m <- melt(airquality, id.vars = c("month", "day"), 
variable.name = "climate_variable", 
value.name = "climate_value") 
head(m) 
## month day climate_variable climate_value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
## 5 5 5 ozone NA 
## 6 5 6 ozone 28 
24/35
Long- to wide-format data: the cast functions 
In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, 
we’ll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula 
to describe the shape of the data. 
m <- melt(airquality, id.vars = c("month", "day")) 
aqw <- dcast(m, month + day ~ variable) 
head(aqw) 
## month day ozone solar.r wind temp 
## 1 5 1 41 190 7.4 67 
## 2 5 2 36 118 8.0 72 
## 3 5 3 12 149 12.6 74 
## 4 5 4 18 313 11.5 62 
## 5 5 5 NA NA 14.3 56 
## 6 5 6 28 NA 14.9 66 
Here, we need to tell dcast that month and day are the ID variables. 
Besides re-arranging the columns, we’ve recovered our original data. 
25/35
Data Manipulation Using plyr 
For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output. Doing this with base R alone is not very efficient, and to overcome this limitation Wickham developed, in 2011, an R package called plyr that efficiently implements the split-apply-combine strategy. The strategy can be compared to the map-reduce strategy for processing large amounts of data.
In the coming slides I will give examples of the split-apply-combine strategy:
· Without loops
· With loops
· Using the plyr package
26/35
Without loops 
I am using the iris dataset here 
1. Split the iris dataset into three parts. 
2. Remove the species name variable from the data. 
3. Calculate the mean of each variable for the three different parts separately. 
4. Combine the output into a single data frame. 
iris.set <- iris[iris$Species=="setosa",-5] 
iris.versi <- iris[iris$Species=="versicolor",-5] 
iris.virg <- iris[iris$Species=="virginica",-5] 
# calculating mean for each piece (The apply step) 
mean.set <- colMeans(iris.set) 
mean.versi <- colMeans(iris.versi) 
mean.virg <- colMeans(iris.virg) 
# combining the output (The combine step) 
mean.iris <- rbind(mean.set,mean.versi,mean.virg) 
# giving row names so that the output could be easily understood 
rownames(mean.iris) <- c("setosa","versicolor","virginica") 
27/35
With Loops 
mean.iris.loop <- NULL 
for(species in unique(iris$Species)) 
{ 
iris_sub <- iris[iris$Species==species,] 
column_means <- colMeans(iris_sub[,-5]) 
mean.iris.loop <- rbind(mean.iris.loop,column_means) 
} 
# giving row names so that the output could be easily understood 
rownames(mean.iris.loop) <- unique(iris$Species) 
NB: The split-apply-combine strategy requires each piece to be independent of the others. The strategy won't work if one piece depends on another.
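The same three steps can also be written compactly in base R. The sketch below is one possible alternative to the manual and loop versions, using split(), lapply() and do.call(); the object names pieces, means, mean.iris.base and mean.iris.agg are made up here.

```r
pieces <- split(iris[, -5], iris$Species)  # split: one data frame per species
means  <- lapply(pieces, colMeans)         # apply: column means of each piece
mean.iris.base <- do.call(rbind, means)    # combine: one row per species

# aggregate() performs all three steps in a single call
mean.iris.agg <- aggregate(. ~ Species, data = iris, FUN = mean)
```

Because split() names the pieces after the factor levels, the combined matrix gets its row names without a separate rownames() step.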
28/35
Using plyr 
library(plyr)
ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width 
## 1 setosa 5.006 3.428 1.462 0.246 
## 2 versicolor 5.936 2.770 4.260 1.326 
## 3 virginica 6.588 2.974 5.552 2.026 
mean.iris.loop 
## Sepal.Length Sepal.Width Petal.Length Petal.Width 
## setosa 5.006 3.428 1.462 0.246 
## versicolor 5.936 2.770 4.260 1.326 
## virginica 6.588 2.974 5.552 2.026 
29/35
Merging data frames 
# Make a data frame mapping story numbers to titles 
stories <- read.table(header=TRUE, text='
storyid title 
1 lions 
2 tigers 
3 bears 
') 
# Make another data frame with the data and story numbers (no titles) 
data <- read.table(header=TRUE, text='
subject storyid rating 
1 1 6.7 
1 2 4.5 
1 3 3.7 
2 2 3.3 
2 3 4.1 
2 1 5.2 
') 
30/35
Merge the two data frames 
merge(stories, data, "storyid") 
## storyid title subject rating 
## 1 1 lions 1 6.7 
## 2 1 lions 2 5.2 
## 3 2 tigers 1 4.5 
## 4 2 tigers 2 3.3 
## 5 3 bears 1 3.7 
## 6 3 bears 2 4.1 
If the two data frames have different names for the columns you want to match on, the names can be 
specified: 
# In this case, the column is named 'id' instead of storyid 
stories2 <- read.table(header=TRUE, text='
id title 
1 lions 
2 tigers 
3 bears ') 
merge(x=stories2, y=data, by.x="id", by.y="storyid") 
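By default merge() keeps only the rows whose key appears in both data frames. The sketch below shows how the all, all.x and all.y arguments of merge() change that; the ratings data frame is a made-up variant of data in which one story has no rating and one rating refers to an unknown story.

```r
stories <- data.frame(storyid = 1:3,
                      title = c("lions", "tigers", "bears"))
ratings <- data.frame(subject = c(1, 1, 2),
                      storyid = c(1, 2, 4),
                      rating  = c(6.7, 4.5, 3.3))

inner <- merge(stories, ratings, by = "storyid")                # matching ids only
left  <- merge(stories, ratings, by = "storyid", all.x = TRUE)  # keep every story
full  <- merge(stories, ratings, by = "storyid", all = TRUE)    # keep every row of both
```

Rows without a partner are filled with NA, so "bears" has an NA rating in left, and storyid 4 has an NA title in full.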
31/35
Resources and Materials used
· Data Manipulation with R by Phil Spector
· Getting and Cleaning Data Coursera Course
· plyr by Hadley Wickham
· Andrew Jaffe Notes
· R Cookbook
32/35
