R + 15 minutes = Hadoop cluster

useR Vignette: R + 15 minutes = Hadoop cluster

Greater Boston useR Group
February 2011

by

Jeffrey Breen
jbreen@cambridge.aero

Agenda

● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?

useR Vignette: R + 15 minutes = Hadoop Cluster Greater Boston useR Meeting, February 2011 Slide 2

MapReduce, Hadoop and Big Data

● Hadoop is an open source implementation of
Google's MapReduce-based data processing
infrastructure
● Designed to process huge data sets
– “huge” = “all of Facebook's web logs”
– Yahoo! sorted 1 TB in 62 seconds in May 2009
– The HDFS distributed file system makes replication decisions
based on knowledge of network topology
● Amazon Elastic MapReduce is a full Hadoop stack
on EC2


MapReduce = Map + shuffle + Reduce

Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
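
To make the three phases concrete, here is a minimal sketch of a word count run entirely in a local R session. It is only an illustration of the map, shuffle and reduce steps, not how Hadoop implements them; the input strings and the names `docs`, `mapper`, `pairs`, `grouped` and `counts` are made up for this example.

```r
# Toy word count: map, shuffle, and reduce run locally in one R session.
docs <- c("big data big cluster", "small data")  # hypothetical input records

# Map: emit one (key, value) pair per word, with value 1.
mapper <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1L))
}
pairs <- unlist(lapply(docs, mapper), recursive = FALSE)

# Shuffle: group all values that share the same key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, integer(1))
grouped <- split(values, keys)

# Reduce: collapse each key's values into a single count.
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data between them; here `split()` stands in for that step.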


But I don't have Big Data

● Agricultural economist J.D. Long doesn't either, but
he does have a bunch of simulations to run
● Had a key insight: the input could be a small amount
of data (like 1:1000) to serve as random seeds for
simulation code in the “mapper” function
● Enjoy Hadoop's infrastructure for job scheduling,
fault tolerance, inter-node communication, etc.
● Use Amazon's cloud to scale up quickly as needed
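
The seeds-as-input idea can be tried locally before involving Hadoop at all. The sketch below is an assumption-laden toy: `simulateOnce` is a hypothetical stand-in for real simulation code, and plain `lapply()` plays the role of the distributed "map" step.

```r
# Hypothetical toy simulation: the mean of 100 standard normal draws.
simulateOnce <- function(seed) {
  set.seed(seed)    # the input value only seeds the random number generator
  mean(rnorm(100))
}

seeds   <- as.list(1:10)                # tiny input: just ten integers
results <- lapply(seeds, simulateOnce)  # local stand-in for the "map" step

# Re-running with the same seed reproduces the same result, so a failed
# task can be retried on another node without changing the answer.
print(identical(simulateOnce(3), results[[3]]))
```

Because each call depends only on its seed, the calls are independent and can be spread across as many nodes as there are seeds.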


Load the segue library
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run
the setCredentials() function.

> setCredentials('YOUR_ACCESS_KEY_ID',
'YOUR_SECRET_ACCESS_KEY')


Start the cluster
> myCluster <- createCluster(numInstances=5)
STARTING - 2011-01-04 15:07:53
[…]
BOOTSTRAPPING - 2011-01-04 15:11:28
[…]
WAITING - 2011-01-04 15:15:35
Your Amazon EMR Hadoop Cluster is ready for action.
Remember to terminate your cluster with
stopCluster().
Amazon is billing you!


Estimate π stochastically
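> estimatePi <- function(seed){
    set.seed(seed)   # make each run reproducible
    numDraws <- 1e6  # random points per seed

    r <- .5  # radius
    # draw points uniformly from the square enclosing the circle
    x <- runif(numDraws, min=-r, max=r)
    y <- runif(numDraws, min=-r, max=r)
    # 1 if the point falls inside the circle, 0 otherwise
    inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)

    # fraction inside the circle approximates pi/4
    return(sum(inCircle) / length(inCircle) * 4)
  }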
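
Why the factor of 4: the points are uniform over a square of side 2r, so the share that lands inside the inscribed circle of radius r approaches the ratio of the two areas. In the notation of the function above:

```latex
\[
\frac{A_{\text{circle}}}{A_{\text{square}}}
  = \frac{\pi r^{2}}{(2r)^{2}}
  = \frac{\pi}{4}
\qquad\Longrightarrow\qquad
\pi \approx 4 \cdot
  \frac{\#\{\, i : \sqrt{x_i^{2} + y_i^{2}} < r \,\}}{\texttt{numDraws}}
\]
```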


Run the simulation
> seedList <- as.list(1:1e3)
> myEstimates <- emrlapply( myCluster, seedList,
estimatePi )
RUNNING - 2011-01-04 15:22:28
[…]
WAITING - 2011-01-04 15:32:18
> myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits=10)
[1] "3.141586544"
> format(pi, digits=10)
[1] "3.141592654"
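
How close should the estimate be? A rough check, assuming the draws behave as independent Bernoulli trials across all seeds (an assumption of this sketch, not something the run itself verifies): with N total draws and hit probability p = π/4, the standard error of the pooled estimate is 4·sqrt(p(1−p)/N). The variable names below are illustrative.

```r
numDraws <- 1e6   # draws per seed, as in estimatePi
numSeeds <- 1e3   # length of seedList
p <- pi / 4       # chance a uniform point in the square lands in the circle

# Each draw is a Bernoulli(p) trial scaled by 4, so the pooled estimate
# has standard error 4 * sqrt(p * (1 - p) / N) with N total draws.
totalDraws <- numDraws * numSeeds
stdError   <- 4 * sqrt(p * (1 - p) / totalDraws)

# Difference between the value reported in the run and R's built-in pi.
observedError <- abs(3.141586544 - pi)

print(c(stdError = stdError, observedError = observedError))
```

The standard error comes out at roughly 5e-5 and the observed error at about 6e-6, so the reported estimate is well within what a billion draws should deliver.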


Won't break the bank

● Total cost: $0.15
Standard On-Demand   Amazon EC2 price per hour   Amazon Elastic MapReduce
Instances            (On-Demand Instances)       price per hour

Small (Default)      $0.085                      $0.015
Large                $0.34                       $0.06
Extra Large          $0.68                       $0.12


Want to know more?

● JD Long's segue package
● http://code.google.com/p/segue/
● Hadoop
● http://hadoop.apache.org/
● Book: http://oreilly.com/catalog/0636920010388
● My blog
● http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a

