1 
Scalable Analytics 
with 
R, Hadoop and RHadoop 
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
2
3
4
#include warning.h 
5
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
6
Get Started with RStudio 
7
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
8
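A quick illustrative example of these basic types and the assignment operator (the values are arbitrary, not from the deck): 

name  <- "Gwen"      # string (character) 
price <- 6.99        # number (numeric) 
cheap <- price < 10  # boolean (logical) 
class(name)   # "character" 
class(price)  # "numeric" 
class(cheap)  # "logical" 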
R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
9
Complex Data Types 
• Vector 
• c, seq, rep, [] 
• List 
• Data Frame 
• A list of vectors of the same length 
• Not a matrix 
10
Creating vectors 
> v1 <- c(1,2,3,4) 
[1] 1 2 3 4 
> v1 * 4 
[1] 4 8 12 16 
> v4 <- c(1:5) 
[1] 1 2 3 4 5 
> v2 <- seq(2,12,by=3) 
[1] 2 5 8 11 
> v1 * v2 
[1] 2 10 24 44 
> v3 <- rep(3,4) 
[1] 3 3 3 3 
11
Accessing and filtering vectors 
> v1 <- c(2,4,6,8) 
[1] 2 4 6 8 
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8 
12
Lists 
> lst <- list (1,"x",FALSE) 
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1 
13
Data Frames 
books <- read.csv("~/books.csv")            # load a CSV into a data frame 
books[1,]                                   # first row 
books[,1]                                   # first column 
books[3:4]                                  # columns 3 and 4 
books$price                                 # a column by name 
books[books$price==6.99,]                   # rows matching a condition 
martin_price <- books[books$author_t=="George R.R. Martin",]$price 
mean(martin_price) 
subset(books,select=-c(id,cat,sequence_i))  # drop columns 
14
15
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
16 
Note: 
R is a functional programming language. 
Functions are first-class objects 
and can be passed to other functions.
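A small sketch of what this means in practice: a function can be stored in a variable and handed to another function such as sapply. 

sq <- function(x) { x * x } 
sapply(1:4, sq)                  # apply sq to each element: 1 4 9 16 
sapply(1:4, function(x) x + 1)   # or pass an anonymous function: 2 3 4 5 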
packages 
17
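A package is installed once and then loaded in each session. For example, installing the two data-manipulation packages used later in this deck: 

install.packages("reshape2")   # install from CRAN (once) 
install.packages("plyr") 
library(reshape2)              # load into the current session 
library(plyr) 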
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
18
“In pioneer days they used oxen for heavy 
pulling, and when one ox couldn’t budge a log, 
we didn’t try to grow a larger ox” 
— Grace Hopper, early advocate of distributed computing
20 
Hadoop in a Nutshell
Map-Reduce is the interesting bit 
• Map – Apply a function to each input record 
• Shuffle & Sort – Partition the map output and sort 
each partition 
• Reduce – Apply aggregation function to all values in 
each partition 
• Map reads input from disk 
• Reduce writes output to disk 
21
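A minimal word-count sketch of the three phases in plain R (illustrative only; this is not how Hadoop itself is invoked): map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group. 

records <- c("hello world", "hello hadoop") 
# Map: emit one (word, 1) pair per word in each record 
words <- unlist(lapply(records, function(r) strsplit(r, " ")[[1]])) 
# Shuffle & sort: group the pairs by key (the word) 
groups <- split(rep(1, length(words)), words) 
# Reduce: aggregate the values in each group 
sapply(groups, sum) 
# hadoop  hello  world 
#      1      2      1 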
Example – Sessionize clickstream 
22
Sessionize 
Identify unique “sessions” of interaction with our website 
Session – for each user (IP), the set of clicks that happened within 30 minutes of each other 
23
Input – Apache Access Log Records 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 
24
Output – Add Session ID 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 15 
25
Overview 
26 
[Diagram: individual log lines flow into several Map tasks, which emit (IP, log lines) to the Reduce tasks; each Reduce task outputs (log line, session ID)]
Map 
parsedRecord = re.search('(\d+\.\d+…', record) 
IP = parsedRecord.group(1) 
timestamp = parsedRecord.group(2) 
print((IP, timestamp), record) 
27
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
28
Reduce 
sessionID = 1 
curr_record = records[0] 
curr_timestamp = getTimestamp(curr_record) 
foreach record in records: 
    if (getTimestamp(record) - curr_timestamp > 30): 
        sessionID += 1 
    curr_timestamp = getTimestamp(record) 
    print(record + " " + sessionID) 
29
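The same reducer logic can be written in vectorized R. A hedged sketch (the function name and data are illustrative, and timestamps are assumed to already be numeric minutes): a new session starts whenever the gap to the previous click exceeds 30 minutes. 

sessionize <- function(records, timestamps) { 
  gaps <- c(0, diff(timestamps))       # minutes between consecutive clicks 
  session_id <- cumsum(gaps > 30) + 1  # new session when the gap is > 30 
  paste(records, session_id) 
} 

sessionize(c("click1", "click2", "click3"), c(0, 10, 50)) 
# [1] "click1 1" "click2 1" "click3 2" 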
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
30
Reshape2 
• Two functions: 
• Melt – wide format to long format 
• Cast – long format to wide format 
• Columns: identifiers or measured variables 
• Molten data: 
• Unique identifiers 
• New column – variable name 
• New column – value 
• Default – all numbers are values 
31
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
32
Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips,sex+time~variable,mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636 
33
*Apply 
• apply – apply a function to the rows or columns of a matrix 
• lapply – apply a function to each item of a list 
• Returns a list 
• sapply – like lapply, but returns a vector 
• tapply – apply a function to subsets of a vector, split by a grouping factor 
34
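A few small examples of the apply family on toy data (values are arbitrary): 

m <- matrix(1:6, nrow = 2)         # 2x3 matrix 
apply(m, 1, sum)                   # row sums: 9 12 
apply(m, 2, sum)                   # column sums: 3 7 11 

lst <- list(a = 1:3, b = 4:6) 
lapply(lst, mean)                  # returns a list: $a 2, $b 5 
sapply(lst, mean)                  # simplifies to a vector: a 2, b 5 

tip <- c(1, 2.5, 1, 3) 
person <- c("Gwen", "Jeff", "Gwen", "Leon") 
tapply(tip, person, mean)          # mean per person: Gwen 1.0, Jeff 2.5, Leon 3.0 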
plyr 
• Split – apply – combine 
• ddply – data frame to data frame 
ddply(.data, .variables, .fun = NULL, ...) 
• Summarize – aggregate data into new data frame 
• Transform – modify data frame 
35
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826 
36
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
37
Rhadoop Projects 
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
38
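A minimal sketch of loading them (this assumes the packages are already installed from the RevolutionAnalytics repositories and that HADOOP_CMD / HADOOP_STREAMING are set): 

library(rmr2)    # write MapReduce jobs in R 
library(rhdfs)   # read and write HDFS from R 
hdfs.init()      # rhdfs: initialize the connection to HDFS 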
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it. 
39
What does that mean? 
• Use RMR if you can break your problem down into small pieces and apply the algorithm there 
• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm 
• Good fit: Fit piecewise regression model for each 
county in the US 
• Bad fit: Fit piecewise regression model for the entire 
US population 
• Bad fit: Logistic regression 
40
Use-case examples – Good or Bad? 
1. Model power consumption per household to 
determine if incentive programs work 
2. Aggregate corn yield per 10x10 portion of field to 
determine best seeds to use 
3. Create churn models for service subscribers and 
determine who is most likely to cancel 
4. Determine correlation between device restarts and 
support calls 
41
Second Most Important: 
RMR requires R, RMR and every library you’ll 
use to be installed on all nodes and 
accessible to the Hadoop user 
42
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations 
43
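A small illustrative mapper (assuming rmr2’s keyval(); the function name is made up for this example): v is a whole vector of records, so one vectorized call handles the entire chunk with no per-record loop. 

len.map <- function(k, v) { 
  keyval(v, nchar(v))   # key: the record, value: its length 
} 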
How to RMRify a Problem 
44
In more detail… 
• Mappers get a list of values 
• You need to process each one independently 
• But do it for all lines at once 
• Reducers work normally 
45
Demo 6 
> library(rmr2) 
t <- list("hello world","don't worry be happy") 
unlist(sapply(t, function(x) { strsplit(x, " ") })) 

wc.map <- function(k, v) { 
  ret_k <- unlist(sapply(v, function(x) { strsplit(x, " ") })) 
  keyval(ret_k, 1) 
} 

wc.reduce <- function(k, v) { 
  keyval(k, sum(v)) 
} 

mapreduce(input = "~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt", 
          output = "~/wc.json", 
          input.format = "text", output.format = "json", 
          map = wc.map, reduce = wc.reduce); 
46
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
47
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
48
Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) {keyval(k,mean(v))} 
mapreduce(input="~/hadoop-recipes/data/tip1.txt", 
output="~/avg.txt", 
input.format=make.input.format("csv"), 
output.format="text", 
map=avg.map,reduce=avg.reduce); 
49
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
50
Avg Tips per Person - Optimized 
avg2.map <- function(k, v) { 
  v1 <- sapply(v$V2, function(x) { strsplit(as.character(x), " ") }) 
  keyval(v$V1, sapply(v1, function(x) { mean(as.numeric(x)) })) 
} 
mapreduce(input = "~/hadoop-recipes/data/tip2.txt", 
          output = "~/avg2.txt", 
          input.format = make.input.format("csv", sep = ","), 
          output.format = "text", map = avg2.map); 
51
Few Final RMR Tips 
• Backend = “local” has files as input and output 
• Backend = “hadoop” uses HDFS directories 
• In “hadoop” mode, print(X) inside the mapper will fail 
the job. 
• Use: cat("ERROR!", file = stderr()) 
52
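For example, switching backends while developing (rmr.options() is part of rmr2; the workflow shown is just one common pattern): 

rmr.options(backend = "local")    # input/output are ordinary local files 
# ... test the mapreduce() job on a small sample ... 
rmr.options(backend = "hadoop")   # input/output are HDFS directories 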
Recommended Reading 
• http://cran.r-project.org/doc/manuals/R-intro.html 
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html 
• http://had.co.nz/reshape/paper-dsc2005.pdf 
• http://seananderson.ca/2013/12/01/plyr.html 
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md 
• http://cran.r-project.org/web/packages/data.table/index.html 
53
54

Editor's Notes

  • #16: Modern CPUs are optimized with vector instructions, so many vector operations run on an entire vector in a single instruction. Loops obviously take many instructions, both for the operations and for running through the loop.
  • #20: This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
  • #22: Example to illustrate MR
  • #40: RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
  • #43: Just saved you hours of debugging. You can thank me later 