R, Hadoop and Amazon Web Services
    Portland R Users Group
     December 20th, 2011
A general disclaimer
• Good programmers learn fast and develop expertise in
  technologies and methodologies in a rather intrepid,
  exploratory manner.
• I am by no means an expert in the paradigm we are discussing this
  evening, but I’d like to share what I have learned in the last year
  while developing MapReduce applications in R on AWS.
  Translation: ask anything and everything but reserve
  the right to say “I don’t know, yet.”
• Also, this is a meetup.com meeting – it seems only
  appropriate to keep this short, sweet, high-level and
  full of open discussion points.
The whole point of this presentation
• I am selfish (and you should be too!)
    – I like collaborators
    – I like collaborators interested in things I am interested in
    – I believe that dissemination of information related to sophisticated,
      numerical decision making processes generally makes the world a
      better place
    – I believe that the more people use Open Source technology, the more
      people contribute to Open Source technology and the better Open
      Source technology gets in general. Hence, my life gets easier and
      cheaper which is presumably analogous to “better” in some respect.
    – There is beer at this meetup. Cue short intermission.
• Otherweiser® (prompted by the aforementioned speaking point), I’d
  really be very happy if people said to themselves at the end of this
  presentation “Hadoop seems easy! I’m going to give it a try.”
Why are we talking about this anyhow?
“Every two days now we create as much information as we did from the dawn of
   civilization up until 2003.” – Eric Schmidt, August 2010

•   We aggregate a lot of data (and have been)
     – Particularly businesses like Google, Amazon, Apple etc…
     – Presumably the government is doing awful things with data too
•   But aggregation isn’t understanding
     – Lawnmower Man aside
     – We need to UNDERSTAND the data – that is, take raw data and make it interpretable.
     – Hence the need for a marriage of Statistics and Programming directed at understanding
       phenomena expressed in these large data sets
     – Can’t recommend this book enough:
          •   The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert
              Tibshirani and Jerome Friedman
          •   http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1
•   So everybody is going crazy about this in general.
Also, who is this “self” I speak of?
• ’tis I, Timothy Dalbey
     • I work for the Emerging Technologies Group of News
       Corporation
     • I live in North East Portland and keep an office on 53rd
       and 5th in New York City
     • Studied Mathematics and Economics as an
       undergraduate student and Statistics as a graduate
       student at the University of Virginia
     • 2 awesome kids and an awesome partner at home: Liam,
       Juniper and Lindsay
     • Enthusiastic about technology, science and futuristic
       endeavors in general
Elastic MapReduce
• Elastic MapReduce is
  – A service of Amazon Web Services
  – Composed of Amazon Machine Images
     • ssh capability
     • Debian Linux
     • Preloaded with ancient versions of R
  – A complementary set of Ruby client tools
  – A web interface
  – Preconfigured to run Hadoop
Hadoop
• Popular framework for controlling distributed cluster computations
     – Popularity is important – cue story about MPI at Levy Laboratory
       and Beowulf clusters…
• Hadoop is an Apache project
     – http://hadoop.apache.org/
•   Open Source
•   Java
•   Configurable (mostly uses XML config files)
•   Fault Tolerant
•   Lots of ways to interact with Hadoop
     –   Pig
     –   Hive
     –   Streaming
     –   Custom .jar
Hadoop is MapReduce
• What is a MapReduce?
   – Originally coined by Google Labs in 2004
   – A super simplified single-node version of the paradigm is as follows:
       cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt
• That is, MapReduce follows a general process:
   –   Read input (cat input)
   –   Map (mapper.R)
   –   Partition
   –   Comparison (sort)
   –   Reduce (reducer.R)
   –   Output (output.txt)
• You can use most popular scripting languages
   – Perl, PHP, Python etc…
   – R
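On an actual cluster, the same scripts run under Hadoop Streaming. A hedged sketch of the invocation (the streaming jar path varies by Hadoop version, and the input/output paths are hypothetical HDFS paths):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/hadoop/input.txt \
    -output /user/hadoop/wordcount-out \
    -mapper mapper.R -reducer reducer.R \
    -file mapper.R -file reducer.R    # ship both scripts to the nodes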
But – that sort of misses the point
• MapReduce is a computational paradigm intended for
   – Large Datasets
   – Multi-Node Computation
   – Truly Parallel Processing
• Master/Slave architecture
   – Nodes are agnostic of one another, only the master
     node(s) have any idea about the greater scheme of things.
      • The importance of truly parallel processing
• A good first question before engaging in creating a
  Hadoop job is:
   – Is this process a good candidate for Hadoop processing in
     the first place?
Benefits to using AWS for Hadoop Jobs
• Preconfigured to run Hadoop
   – This in itself is something of a miracle
• Virtual servers
   – Use the servers only as long as you need them
   – Configurability
• Handy command line tools
• S3 is sitting in the same cloud
   – Your data is sitting in the same space
• Servers come at $0.06 per hour of compute time
  – dirt cheap
Specifics
• Bootstrapping
   – Bootstrapping is the process by which you can customize the nodes via bash shell scripts
      • Acquiring data
      • Updating R
      • Installing packages
      • For example:

#!/bin/bash
# Bootstrap action: upgrade R on each (Debian lenny) node from the
# lenny-cran backports repository.
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 | sudo apt-key add -   # trust the repository key
echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev
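For context, a hedged sketch of attaching that script as a bootstrap action with the elastic-mapreduce Ruby CLI (the bucket and script names are made up for illustration):

# Upload the script to S3 first, then reference it at job flow creation.
elastic-mapreduce --create --alive \
    --name "R upgrade test" \
    --num-instances 4 \
    --bootstrap-action s3://my-bucket/bootstrap/upgrade-r.sh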



•   Input file
     – Mapper specific
        • Classic example in WordCounter.py
            – Example: “It was the best of times, it was the worst of times…”
            – Note: Big data set!
        • An example from a recent application of mine:
            – "25621"\r"23803"\r"31712"\r…
            – Note: Not such a big data set


•       Mapper & Reducer
           –       Both typically draw from STDIN and write to STDOUT
           –       Please see the following examples
The typical “Hello World” MapReduce Mapper
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")  # emit "word<TAB>1" pairs
}

close(con)
The typical “Hello World” MapReduce Reducer
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       split <- splitLine(line)
       word <- split$word
       count <- split$count
       if (exists(word, envir = env, inherits = FALSE)) {
           oldcount <- get(word, envir = env)
           assign(word, oldcount + count, envir = env)
       }else{
           assign(word, count, envir = env)
       }
}

close(con)
for (w in ls(env, all = TRUE)){
       cat(w, "\t", get(w, envir = env), "\n", sep = "")
}
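Both scripts can be smoke-tested locally, without a cluster, via the single-node pipeline shown earlier (assuming both files are executable):

chmod +x mapper.R reducer.R
echo "it was the best of times it was the worst of times" | ./mapper.R | sort | ./reducer.R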
MapReduce and R: Forecasting data for News Corporation
• 50k+ products with historical unit sales data of roughly
  2.5MM rows
• Some of the titles require heavy computational processing
   – Titles with insufficient data require augmented or surrogate
     data in order to make “good” predictions – thus identifying good
     candidate data was also necessary in addition to prediction
     methods
   – Took lots of time (particularly in R)
      • But R had the analysis tools I needed!
• Key observation: The predictions were independent of one
  another which made the process truly parallel.
• Thus, Hadoop and Elastic MapReduce were merited
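As a rough illustration only – the file layout, model and column names below are hypothetical, not the production code – a mapper for this kind of job can read one product ID per line and forecast it independently:

#! /usr/bin/env Rscript
# Hypothetical sketch: one product ID per input line; each forecast
# uses only that product's own history, so mappers never need to talk.
con <- file("stdin", open = "r")
while (length(id <- readLines(con, n = 1, warn = FALSE)) > 0) {
    history <- read.csv(sprintf("data/%s.csv", id))   # data already moved to the node
    fit <- predict(arima(history$units, order = c(1, 0, 0)), n.ahead = 12)
    cat(id, "\t", paste(round(fit$pred, 2), collapse = ","), "\n", sep = "")
}
close(con)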
My Experience Learning and Using Hadoop with AWS
•   Debugging is something of a nightmare.
     –   SSH onto nodes to figure out what’s really going on
     –   STDERR is your enemy – it will cause your job to fail rather completely
     –   STDERR is your best friend. No errors and failed jobs are rather frustrating
•   Most of the work is in the transactional details with AWS Elastic MapReduce
•   I followed conventional advice which is “move data to the nodes.”
     –   This meant moving data into CSVs in S3 and importing the data into R via standard read methods
     –   This also meant that my processes were database agnostic
     –   JSON is a great way of structuring input and output between phases of the MapReduce Process
           •   To that effect, check out RJSON – great package.
•   In general, the following rule seems to apply (see the sketch after this list):
     –   Data frame bad.
     –   Data table good.
           •   http://cran.r-project.org/web/packages/data.table/index.html
•   Packages to simplify R make my skin crawl
     –   Ever see Jurassic Park?
     –   Just a stubborn programmer – of course the logical extension leads me to a contradiction. Never mind that I
         said that.
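A minimal sketch of the data.frame vs. data.table point (the column names are invented and the data simulated; assumes the data.table package is installed):

library(data.table)

sales <- data.table(product_id = sample(1:50000, 2.5e6, replace = TRUE),
                    units = rpois(2.5e6, 5))
setkey(sales, product_id)      # index once...
sales[J(25621), sum(units)]    # ...then keyed lookups and grouped sums are
                               # far faster than subset() on a data.frame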
R Package to Utilize Map Reduce
• Segue – written by J.D. Long
  – http://www.cerebralmastication.com
     • P.s. We all realize that www is a subdomain, right?
       World Wide Web… is that really necessary?
  – Handles much of the transactional detail and
    allows the use of Elastic MapReduce through
    apply() and lapply() wrappers (see the sketch below)
• Seems like this is a good tutorial too:
  – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
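A hedged sketch of the Segue workflow (the cluster size and toy function are mine; the credentials are placeholders):

library(segue)
setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")

cluster <- createCluster(numInstances = 4)              # spin up an EMR job flow
results <- emrlapply(cluster, 1:100, function(i) i^2)   # lapply(), but on EMR
stopCluster(cluster)                                    # shut the nodes down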
Other stuff
• Distributed Cache
  – Load your data the smart way!
• Ruby Command Tools
  – Interact with AWS the smart way! (A monitoring sketch follows this list.)
• Web interface
  – Simple.
  – Helpful for checking jobs when you wake up
    at 3:30AM and wonder “is my script still running?”
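For that 3:30AM question, a couple of hedged elastic-mapreduce one-liners (the job flow ID is a placeholder):

elastic-mapreduce --list --active                        # which job flows are still running?
elastic-mapreduce --describe --jobflow j-XXXXXXXXXXXXX   # details on one job flow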
