1 
Scalable Analytics 
with 
R, Hadoop and RHadoop 
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
2
3
4
#include warning.h 
5
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
6
Get Started with RStudio 
7
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
8
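A quick illustrative example of these basic types and the assignment operator (the values are arbitrary, not from the deck): 

name  <- "Gwen"      # string (character) 
price <- 6.99        # number (numeric) 
cheap <- price < 10  # boolean (logical) 
class(name)   # "character" 
class(price)  # "numeric" 
class(cheap)  # "logical" 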
R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
9
Complex Data Types 
• Vector 
• c, seq, rep, [] 
• List 
• Data Frame 
• A list of vectors of the same length 
• Not a matrix 
10
Creating vectors 
> v1 <- c(1,2,3,4) 
[1] 1 2 3 4 
> v1 * 4 
[1] 4 8 12 16 
> v4 <- c(1:5) 
[1] 1 2 3 4 5 
> v2 <- seq(2,12,by=3) 
[1] 2 5 8 11 
> v1 * v2 
[1] 2 10 24 44 
> v3 <- rep(3,4) 
[1] 3 3 3 3 
11
Accessing and filtering vectors 
> v1 <- c(2,4,6,8) 
[1] 2 4 6 8 
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8 
12
Lists 
> lst <- list (1,"x",FALSE) 
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1 
13
Data Frames 
books <- read.csv("~/books.csv")            # load a CSV into a data frame 
books[1,]                                   # first row 
books[,1]                                   # first column 
books[3:4]                                  # columns 3 and 4 
books$price                                 # a column by name 
books[books$price==6.99,]                   # rows matching a condition 
martin_price <- books[books$author_t=="George R.R. Martin",]$price 
mean(martin_price) 
subset(books,select=-c(id,cat,sequence_i))  # drop columns 
14
15
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
16 
Note: 
R is a functional programming language. 
Functions are first-class objects 
and can be passed to other functions.
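A small sketch of what this means in practice: a function can be stored in a variable and handed to another function such as sapply. 

sq <- function(x) { x * x } 
sapply(1:4, sq)                  # apply sq to each element: 1 4 9 16 
sapply(1:4, function(x) x + 1)   # or pass an anonymous function: 2 3 4 5 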
packages 
17
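A package is installed once and then loaded in each session. For example, installing the two data-manipulation packages used later in this deck: 

install.packages("reshape2")   # install from CRAN (once) 
install.packages("plyr") 
library(reshape2)              # load into the current session 
library(plyr) 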
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
18
“In pioneer days they used oxen for heavy 
pulling, and when one ox couldn’t budge a log, 
we didn’t try to grow a larger ox” 
— Grace Hopper, early advocate of distributed computing
20 
Hadoop in a Nutshell
Map-Reduce is the interesting bit 
• Map – Apply a function to each input record 
• Shuffle & Sort – Partition the map output and sort 
each partition 
• Reduce – Apply aggregation function to all values in 
each partition 
• Map reads input from disk 
• Reduce writes output to disk 
21
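A minimal word-count sketch of the three phases in plain R (illustrative only; this is not how Hadoop itself is invoked): map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group. 

records <- c("hello world", "hello hadoop") 
# Map: emit one (word, 1) pair per word in each record 
words <- unlist(lapply(records, function(r) strsplit(r, " ")[[1]])) 
# Shuffle & sort: group the pairs by key (the word) 
groups <- split(rep(1, length(words)), words) 
# Reduce: aggregate the values in each group 
sapply(groups, sum) 
# hadoop  hello  world 
#      1      2      1 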
Example – Sessionize clickstream 
22
Sessionize 
Identify unique “sessions” of interaction with our website 
Session – for each user (IP), the set of clicks that happened within 30 minutes of each other 
23
Input – Apache Access Log Records 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 
24
Output – Add Session ID 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 15 
25
Overview 
26 
[Diagram: individual log lines flow into several Map tasks, which emit (IP, log lines) to the Reduce tasks; each Reduce task outputs (log line, session ID)]
Map 
parsedRecord = re.search('(\d+\.\d+…', record) 
IP = parsedRecord.group(1) 
timestamp = parsedRecord.group(2) 
print((IP, timestamp), record) 
27
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
28
Reduce 
sessionID = 1 
curr_record = records[0] 
curr_timestamp = getTimestamp(curr_record) 
foreach record in records: 
    if (getTimestamp(record) - curr_timestamp > 30): 
        sessionID += 1 
    curr_timestamp = getTimestamp(record) 
    print(record + " " + sessionID) 
29
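The same reducer logic can be written in vectorized R. A hedged sketch (the function name and data are illustrative, and timestamps are assumed to already be numeric minutes): a new session starts whenever the gap to the previous click exceeds 30 minutes. 

sessionize <- function(records, timestamps) { 
  gaps <- c(0, diff(timestamps))       # minutes between consecutive clicks 
  session_id <- cumsum(gaps > 30) + 1  # new session when the gap is > 30 
  paste(records, session_id) 
} 

sessionize(c("click1", "click2", "click3"), c(0, 10, 50)) 
# [1] "click1 1" "click2 1" "click3 2" 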
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
30
Reshape2 
• Two functions: 
• Melt – wide format to long format 
• Cast – long format to wide format 
• Columns: identifiers or measured variables 
• Molten data: 
• Unique identifiers 
• New column – variable name 
• New column – value 
• Default – all numbers are values 
31
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
32
Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips,sex+time~variable,mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636 
33
*Apply 
• apply – apply a function to the rows or columns of a matrix 
• lapply – apply a function to each item of a list 
• Returns a list 
• sapply – like lapply, but returns a vector 
• tapply – apply a function to subsets of a vector, split by a grouping factor 
34
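A few small examples of the apply family on toy data (values are arbitrary): 

m <- matrix(1:6, nrow = 2)         # 2x3 matrix 
apply(m, 1, sum)                   # row sums: 9 12 
apply(m, 2, sum)                   # column sums: 3 7 11 

lst <- list(a = 1:3, b = 4:6) 
lapply(lst, mean)                  # returns a list: $a 2, $b 5 
sapply(lst, mean)                  # simplifies to a vector: a 2, b 5 

tip <- c(1, 2.5, 1, 3) 
person <- c("Gwen", "Jeff", "Gwen", "Leon") 
tapply(tip, person, mean)          # mean per person: Gwen 1.0, Jeff 2.5, Leon 3.0 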
plyr 
• Split – apply – combine 
• ddply – data frame to data frame 
ddply(.data, .variables, .fun = NULL, ...) 
• Summarize – aggregate data into new data frame 
• Transform – modify data frame 
35
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826 
36
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
37
Rhadoop Projects 
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
38
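A minimal sketch of loading them (this assumes the packages are already installed from the RevolutionAnalytics repositories and that HADOOP_CMD / HADOOP_STREAMING are set): 

library(rmr2)    # write MapReduce jobs in R 
library(rhdfs)   # read and write HDFS from R 
hdfs.init()      # rhdfs: initialize the connection to HDFS 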
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it. 
39
What does that mean? 
• Use RMR if you can break your problem down into small pieces and apply the algorithm there 
• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm 
• Good fit: Fit piecewise regression model for each 
county in the US 
• Bad fit: Fit piecewise regression model for the entire 
US population 
• Bad fit: Logistic regression 
40
Use-case examples – Good or Bad? 
1. Model power consumption per household to 
determine if incentive programs work 
2. Aggregate corn yield per 10x10 portion of field to 
determine best seeds to use 
3. Create churn models for service subscribers and 
determine who is most likely to cancel 
4. Determine correlation between device restarts and 
support calls 
41
Second Most Important: 
RMR requires R, RMR and every library you’ll 
use to be installed on all nodes and 
accessible to the Hadoop user 
42
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations 
43
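A small illustrative mapper (assuming rmr2’s keyval(); the function name is made up for this example): v is a whole vector of records, so one vectorized call handles the entire chunk with no per-record loop. 

len.map <- function(k, v) { 
  keyval(v, nchar(v))   # key: the record, value: its length 
} 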
How to RMRify a Problem 
44
In more detail… 
• Mappers get a list of values 
• You need to process each one independently 
• But do it for all lines at once 
• Reducers work normally 
45
Demo 6 
> library(rmr2) 
t <- list("hello world","don't worry be happy") 
unlist(sapply(t, function(x) { strsplit(x, " ") })) 

wc.map <- function(k, v) { 
  ret_k <- unlist(sapply(v, function(x) { strsplit(x, " ") })) 
  keyval(ret_k, 1) 
} 

wc.reduce <- function(k, v) { 
  keyval(k, sum(v)) 
} 

mapreduce(input = "~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt", 
          output = "~/wc.json", 
          input.format = "text", output.format = "json", 
          map = wc.map, reduce = wc.reduce); 
46
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
47
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
48
Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) {keyval(k,mean(v))} 
mapreduce(input="~/hadoop-recipes/data/tip1.txt", 
output="~/avg.txt", 
input.format=make.input.format("csv"), 
output.format="text", 
map=avg.map,reduce=avg.reduce); 
49
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
50
Avg Tips per Person - Optimized 
avg2.map <- function(k, v) { 
  v1 <- sapply(v$V2, function(x) { strsplit(as.character(x), " ") }) 
  keyval(v$V1, sapply(v1, function(x) { mean(as.numeric(x)) })) 
} 
mapreduce(input = "~/hadoop-recipes/data/tip2.txt", 
          output = "~/avg2.txt", 
          input.format = make.input.format("csv", sep = ","), 
          output.format = "text", map = avg2.map); 
51
Few Final RMR Tips 
• Backend = “local” has files as input and output 
• Backend = “hadoop” uses HDFS directories 
• In “hadoop” mode, print(X) inside the mapper will fail 
the job. 
• Use: cat("ERROR!", file = stderr()) 
52
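For example, switching backends while developing (rmr.options() is part of rmr2; the workflow shown is just one common pattern): 

rmr.options(backend = "local")    # input/output are ordinary local files 
# ... test the mapreduce() job on a small sample ... 
rmr.options(backend = "hadoop")   # input/output are HDFS directories 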
Recommended Reading 
• http://cran.r-project.org/doc/manuals/R-intro.html 
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html 
• http://had.co.nz/reshape/paper-dsc2005.pdf 
• http://seananderson.ca/2013/12/01/plyr.html 
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md 
• http://cran.r-project.org/web/packages/data.table/index.html 
53
54

Editor's Notes

  • #16: Modern CPUs are optimized with vector instructions, so many vector operations run on an entire vector in a single instruction. Loops obviously take many instructions, both for the operations and for running through the loop.
  • #20: This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
  • #22: Example to illustrate MR
  • #40: RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
  • #43: Just saved you hours of debugging. You can thank me later 