R basic
@anchu
April 2017
Essential data wrangling tasks
Import
Explore
Index/subset
Reshape
Merge
Aggregate
Repeat
Importing
Plain text files: the workhorse function read.table()
read.table("path_to_file",
           header = TRUE,            # first row as column names
           sep = ",",                # column separator
           stringsAsFactors = FALSE) # do not convert text to factors
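A minimal sketch of the call above on a real file. The file contents and the name dtf_in are made up for illustration; the file is written to a temporary location first so the example runs anywhere.

```r
# Write a small comma-separated file to a temporary location (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,90", "bob,85"), path)

dtf_in <- read.table(path,
                     header = TRUE,            # first row holds the column names
                     sep = ",",                # fields are separated by commas
                     stringsAsFactors = FALSE) # keep text columns as character
str(dtf_in)
```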
Importing
Customized read.table() variants:
read.csv(sep = ",")
read.csv2(sep = ";")
read.delim(sep = "\t")
Others: read_csv() (readr) or fread() (data.table)
Importing
Excel spreadsheets:
library(readxl)
dtf <- read_excel("path_to_file",
                  sheet = 1, # sheet to read (name or position)
                  skip = 0)  # number of rows to skip
Others: read.xls() (gdata) or read.xlsx() (xlsx)
Exploring
Structure and type of columns:
str(cars)
> 'data.frame': 50 obs. of 2 variables:
> $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
Exploring
The first and the last six rows of the data set:
head(cars) # tail(cars)
> speed dist
> 1 4 2
> 2 4 10
> 3 7 4
> 4 7 22
> 5 8 16
> 6 9 10
Exploring
Summary statistics:
summary(cars)
> speed dist
> Min. : 4.0 Min. : 2.00
> 1st Qu.:12.0 1st Qu.: 26.00
> Median :15.0 Median : 36.00
> Mean :15.4 Mean : 42.98
> 3rd Qu.:19.0 3rd Qu.: 56.00
> Max. :25.0 Max. :120.00
Exploring
Counting:
table(mtcars$cyl) # frequency
>
> 4 6 8
> 11 7 14
prop.table(table(mtcars$cyl)) # proportion
>
> 4 6 8
> 0.34375 0.21875 0.43750
Indexing/Subsetting
Question: What are the differences among the following data structures in R?
Array
Atomic vector
Data frame
List
Matrix
Indexing/Subsetting
Answer:
      Homogeneous      Heterogeneous
1d    Atomic vector    List
2d    Matrix           Data frame
nd    Array
Homogeneous: all contents must be the same type.
Heterogeneous: the contents can be of different types.
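A short sketch of what the two definitions mean in practice. The object names are arbitrary; the point is that an atomic vector coerces its contents to one type while a list keeps each element as it is.

```r
v <- c(1, "a", TRUE)     # atomic vector: everything is coerced to character
l <- list(1, "a", TRUE)  # list: each element keeps its own type

typeof(v)          # "character"
sapply(l, typeof)  # "double" "character" "logical"

m <- matrix(1:6, nrow = 2)                 # 2d, homogeneous
a <- array(1:24, dim = c(2, 3, 4))         # nd, homogeneous
d <- data.frame(n = 1:2, s = c("a", "b"))  # 2d, columns may differ in type
sapply(d, class)
```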
Indexing/Subsetting
Atomic vector:
x <- c(2, 4, 3, 5)
## positive integers (note: duplicated indices yield duplicated values)
x[c(3, 1)]
> [1] 3 2
## negative integers (note: can't mix positive and negative integers)
x[-c(3, 1)]
> [1] 4 5
Indexing/Subsetting
Atomic vector:
## logical vector (note: conditional expr is OK: x[x %% 2 == 0])
x[c(TRUE, TRUE, FALSE, FALSE)]
> [1] 2 4
## nothing
x[] # returns original vector
> [1] 2 4 3 5
Indexing/Subsetting
Atomic vector:
## zero
x[0] # returns zero-length vector
> numeric(0)
## character vector (subsetting using names)
y <- setNames(x, letters[1:4])
y[c("c", "a", "d")]
> c a d
> 3 2 5
Indexing/Subsetting
List:
Subsetting a list works in the same way as subsetting an atomic vector.
Using [ will always return a list; [[ and $ pull out the components of the list.
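A small sketch of the three operators on a made-up list (lst and its contents are illustrative only):

```r
lst <- list(a = 1:3, b = "text", c = TRUE)

sub <- lst["a"]    # [ keeps the container: a list of length 1
el  <- lst[["a"]]  # [[ pulls out the component: the integer vector itself
el2 <- lst$a       # $ is shorthand for [[ with a name

str(sub)
str(el)
```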
Indexing/Subsetting
Matrix:
General form of matrix subsets: x[i, j]
(m <- matrix(1:12, nrow = 3, ncol = 4))
>      [,1] [,2] [,3] [,4]
> [1,]    1    4    7   10
> [2,]    2    5    8   11
> [3,]    3    6    9   12
m[1:2, c(2, 4)]
>      [,1] [,2]
> [1,]    4   10
> [2,]    5   11
Indexing/Subsetting
Data frames:
Data frames possess the characteristics of both lists and matrices: if you
subset with a single vector, they behave like lists; if you subset with two
vectors, they behave like matrices.
dtf <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
dtf
> x y z
> 1 1 3 a
> 2 2 2 b
> 3 3 1 c
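The list-like and matrix-like behaviour can be seen side by side in a short sketch. It rebuilds the same dtf so that it runs on its own; the result names are arbitrary.

```r
dtf <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

as_list   <- dtf["x"]    # one index: list-style, returns a one-column data frame
as_matrix <- dtf[, "x"]  # two indices: matrix-style, simplifies to a vector
column    <- dtf[["x"]]  # list-style extraction of the column itself

class(as_list)
class(as_matrix)
```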
Indexing/Subsetting
Data frames:
dtf[2, ] # slicing
> x y z
> 2 2 2 b
dtf[dtf$x == 2, ] # conditional subsetting
> x y z
> 2 2 2 b
Indexing/Subsetting
Data frames:
dtf[, c(1, 3)]
> x z
> 1 1 a
> 2 2 b
> 3 3 c
dtf[, c("x", "z")]
> x z
> 1 1 a
> 2 2 b
> 3 3 c
Indexing/Subsetting
Data frames:
If the result is a single column, [ returns a vector instead of a data frame unless drop = FALSE.
str(dtf[, "x"]) # simplifying
> int [1:3] 1 2 3
str(dtf[, "x", drop = FALSE]) # preserving
> 'data.frame': 3 obs. of 1 variable:
> $ x: int 1 2 3
Indexing/Subsetting
Data frames: alternative facility subset()
subset(cars, speed > 20)
> speed dist
> 44 22 66
> 45 23 54
> 46 24 70
> 47 24 92
> 48 24 93
> 49 24 120
> 50 25 85
Indexing/Subsetting
Exercises:
Point out subsetting errors in the following expressions:
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
Indexing/Subsetting
Solutions:
mtcars[mtcars$cyl == 4, ]
mtcars[-c(1:4), ]
mtcars[mtcars$cyl <= 5, ]
mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]
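A sketch of why two of the original expressions go wrong, using the built-in mtcars data. The variable names are illustrative, and %in% is shown as one possible shorter spelling of the last solution.

```r
# -1:4 is read as (-1):4, so it mixes a negative index with positive ones.
idx_wrong <- -1:4    # -1 0 1 2 3 4
idx_right <- -(1:4)  # -1 -2 -3 -4, same as -c(1:4)

# == binds tighter than |, so cyl == 4 | 6 means (cyl == 4) | 6.
# A non-zero number counts as TRUE, so every row is selected.
all(mtcars$cyl == 4 | 6)

# Comparing against several values at once with %in%.
four_or_six <- mtcars[mtcars$cyl %in% c(4, 6), ]
nrow(four_or_six)
```

In the other two exercises, = is not the comparison operator (equality is tested with ==), and leaving out the comma makes the logical vector index columns rather than rows.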
Reshaping
Tidy data
Reshaping
reshape2, written by Hadley Wickham, makes it easy to transform
data between wide and long formats.
Wide-format data:
day         storeA  storeB  storeC
2017/03/22      12       2      34
2017/03/23       1      11       5
Long-format data:
day         stores  sales
2017/03/22  storeA     12
2017/03/22  storeB      2
2017/03/22  storeC     34
2017/03/23  storeA      1
2017/03/23  storeB     11
2017/03/23  storeC      5
Reshaping
melt() takes wide-format data and melts it into long-format data.
library(reshape2)
## dtf holds the wide-format data shown earlier
long <- melt(dtf, id.vars = "day",
             variable.name = "stores", value.name = "sales")
long
> day stores sales
> 1 2017/03/22 storeA 12
> 2 2017/03/23 storeA 1
> 3 2017/03/22 storeB 2
> 4 2017/03/23 storeB 11
> 5 2017/03/22 storeC 34
> 6 2017/03/23 storeC 5
Reshaping
dcast() takes long-format data and casts it into wide-format data.
wide <- dcast(long, day ~ stores, value.var = "sales")
wide
> day storeA storeB storeC
> 1 2017/03/22 12 2 34
> 2 2017/03/23 1 11 5
Reshaping
Other solutions:
spread() and gather() (tidyr)
reshape() (stats)
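For readers without reshape2 installed, here is a rough sketch of the same wide-to-long conversion with base R's reshape(). The data frame is retyped from the wide-format example; the object names and argument choices are just one way to set it up.

```r
wide <- data.frame(day = c("2017/03/22", "2017/03/23"),
                   storeA = c(12, 1), storeB = c(2, 11), storeC = c(34, 5))
stores <- c("storeA", "storeB", "storeC")

long <- reshape(wide,
                direction = "long",
                idvar = "day",       # column identifying each record
                varying = stores,    # columns to stack
                v.names = "sales",   # name of the stacked value column
                timevar = "stores",  # name of the column holding the old column names
                times = stores)      # values written into that column
rownames(long) <- NULL
long

# Back to wide format; the value columns come back prefixed with "sales.".
wide_again <- reshape(long, direction = "wide",
                      idvar = "day", timevar = "stores", v.names = "sales")
```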
Merging
Binding 2 data frames vertically:
## The two data frames must have the same variables,
## but they do not have to be in the same order.
total <- rbind(dtf_A, dtf_B)
Binding 2 data frames horizontally:
## The two data frames must have the same number of rows; rows are paired by position.
total <- cbind(dtf_A, dtf_B)
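A small sketch with made-up data frames (dtf_A and dtf_B reuse the names above, dtf_C and the column names are hypothetical) showing that rbind() matches columns by name while cbind() simply places columns side by side.

```r
dtf_A <- data.frame(id = 1:2, name = c("a", "b"))
dtf_B <- data.frame(name = c("c", "d"), id = 3:4)  # same variables, different order

stacked <- rbind(dtf_A, dtf_B)  # columns are matched by name, not position
stacked

dtf_C <- data.frame(score = c(10, 20))  # same number of rows as dtf_A
side <- cbind(dtf_A, dtf_C)             # row 1 goes with row 1, row 2 with row 2
side
```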
Merging
Question: Given
a <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(T, F, T))
Which expression is used to get the following result?
> x1 x2 x3
> 1 A 1 TRUE
> 2 B 2 FALSE
> 3 C 3 NA
a. merge(a, b, by = "x1", all = T)
b. merge(a, b, by = "x1", all.x = T)
c. merge(a, b, by = "x1", all = F)
d. merge(a, b, by = "x1", all.y = T)
Merging
Answer: b
merge(a, b, by = "x1", all.x = T)
> x1 x2 x3
> 1 A 1 TRUE
> 2 B 2 FALSE
> 3 C 3 NA
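The other three options from the question can be checked the same way. This sketch rebuilds a and b (with TRUE/FALSE spelled out) and records which keys survive each merge; the result names are arbitrary.

```r
a <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

inner <- merge(a, b, by = "x1")                # all = FALSE is the default: A, B
left  <- merge(a, b, by = "x1", all.x = TRUE)  # every row of a: A, B, C
right <- merge(a, b, by = "x1", all.y = TRUE)  # every row of b: A, B, D
full  <- merge(a, b, by = "x1", all = TRUE)    # every row of both: A, B, C, D

sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```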
Merging
Joining two data frames by key (similar to a JOIN of two tables in SQL)
Figure 1: Two data frames (dtf1 and dtf2) with shared columns for merging
Merging
Left join:
merge(x = dtf1, y = dtf2, all.x = TRUE)
Figure 2: dtf1 and dtf2 merged with all.x = TRUE
Merging
Right join:
merge(x = dtf1, y = dtf2, all.y = TRUE)
Figure 3: dtf1 and dtf2 merged with all.y = TRUE
Merging
Full join:
merge(x = dtf1, y = dtf2, all = TRUE)
Figure 4: dtf1 and dtf2 merged with all = TRUE
Merging
Inner join:
merge(x = dtf1, y = dtf2, all = FALSE)
Figure 5: dtf1 and dtf2 merged with all = FALSE
Merging
Other solutions:
dplyr:
left_join(dtf1, dtf2)
right_join(dtf1, dtf2)
full_join(dtf1, dtf2)
inner_join(dtf1, dtf2)
## and more:
semi_join(dtf1, dtf2)
anti_join(dtf1, dtf2)
data.table (its enhanced merge() is extremely fast)
Aggregating
Repeating/Looping
Generating sequences:
1:10
> [1] 1 2 3 4 5 6 7 8 9 10
10:1
> [1] 10 9 8 7 6 5 4 3 2 1
Repeating/Looping
More general sequences:
The step in sequences created by : is always 1 (or -1 when counting down).
seq() makes it possible to generate more general sequences:
seq(from,
    to,
    by,         # step size
    length.out) # length of final vector
Repeating/Looping
Sequence examples:
seq(0, 1, by = 0.1)
> [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 11)
> [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, by = 0.1, length.out = 11)
> [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Repeating/Looping
Repeating values with rep():
rep(1:4, times = 3)
> [1] 1 2 3 4 1 2 3 4 1 2 3 4
rep(1:4, each = 3)
> [1] 1 1 1 2 2 2 3 3 3 4 4 4
Repeating/Looping
Quiz:
Use paste0() and rep() to generate the following sequence of dates:
> [1] "1/2016" "2/2016" "3/2016" "4/2016" "5/2016" "6/2016" "7/2016"
> [8] "8/2016" "9/2016" "10/2016" "11/2016" "12/2016" "1/2017" "2/2017"
> [15] "3/2017" "4/2017" "5/2017" "6/2017" "7/2017" "8/2017" "9/2017"
> [22] "10/2017" "11/2017" "12/2017"
Repeating/Looping
Answer:
paste0(rep(1:12, times = 2), "/", rep(2016:2017, each = 12))
> [1] "1/2016" "2/2016" "3/2016" "4/2016" "5/2016" "6/2016" "7/2016"
> [8] "8/2016" "9/2016" "10/2016" "11/2016" "12/2016" "1/2017" "2/2017"
> [15] "3/2017" "4/2017" "5/2017" "6/2017" "7/2017" "8/2017" "9/2017"
> [22] "10/2017" "11/2017" "12/2017"
Repeating/Looping
if-then-else statements:
if-then-else helps to choose between two expressions depending on the value of
a (logical) condition.
Form:
if (condition) expr1 else expr2
Note:
Only the first element of condition is checked. Since R 4.2.0, a condition of length greater than one is an error rather than a warning.
Repeating/Looping
if-then-else example:
x <- 9
if (x > 0) y <- sqrt(x) else y <- x^2
print(y)
> [1] 3
Repeating/Looping
if-then-else example:
x <- c(-4, 9)
if (x > 0) y <- sqrt(x) else y <- x^2
> Warning in if (x > 0) y <- sqrt(x) else y <- x^2: the condition has length
> > 1 and only the first element will be used
print(y)
> [1] 16 81
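When the intent is to choose element by element, the vectorised ifelse() is the usual alternative. A minimal sketch: the abs() call is an addition here, because ifelse() evaluates both branches for the whole vector and sqrt(-4) would otherwise produce a NaN warning.

```r
x <- c(-4, 9)

# For each element: square root if positive, square otherwise.
y <- ifelse(x > 0, sqrt(abs(x)), x^2)
print(y)  # 16 for -4 (squared), 3 for 9 (square root)
```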
Repeating/Looping
for loop:
A for loop repeatedly carries out a task for each element of a vector.
Form:
for (variable in vector) expression
Repeating/Looping
for-loop examples:
(x <- sample(letters, size = 13))
> [1] "t" "r" "j" "i" "s" "v" "x" "g" "a" "l" "p" "u" "e"
for (i in seq_along(x)) {
  if (x[i] %in% c("a", "e", "i", "o", "u", "y")) {
    print(i) # position of vowels
  }
}
> [1] 4
> [1] 9
> [1] 12
> [1] 13
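The same positions can be found without a loop. In this sketch the sampled letters are typed in by hand so the result is reproducible; vowels and pos are illustrative names.

```r
x <- c("t", "r", "j", "i", "s", "v", "x", "g", "a", "l", "p", "u", "e")
vowels <- c("a", "e", "i", "o", "u", "y")

# %in% tests every element at once; which() returns the positions that are TRUE.
pos <- which(x %in% vowels)
pos
```

For this x the positions are 4, 9, 12 and 13.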
Repeating/Looping
for-loop examples:
for (i in 1:21) {
  plot(..., pch = i)
}
Figure: point symbols drawn for pch = 1 to pch = 21
Repeating/Looping
for-loop examples:
for (i in 1:6) {
  plot(..., lty = i)
}
Figure: line types drawn for lty = 1 to lty = 6

Basic R Data Manipulation