R, Hadoop and Amazon Web Services
    Portland R Users Group
     December 20th, 2011
A general disclaimer
• Good programmers learn fast and develop expertise in
  technologies and methodologies in a rather intrepid,
  exploratory manner.
• I am by no means an expert in the paradigm we are discussing this
  evening, but I’d like to share what I have learned in the last year
  while developing MapReduce applications in R on AWS.
  Translation: ask anything and everything but reserve
  the right to say “I don’t know, yet.”
• Also, this is a meetup.com meeting – it seems only
  appropriate to keep this short, sweet, high-level and
  full of open discussion points.
The whole point of this presentation
• I am selfish (and you should be too!)
    – I like collaborators
    – I like collaborators interested in things I am interested in
    – I believe that dissemination of information related to sophisticated,
      numerical decision making processes generally makes the world a
      better place
    – I believe that the more people use Open Source technology, the more
      people contribute to Open Source technology and the better Open
      Source technology gets in general. Hence, my life gets easier and
      cheaper which is presumably analogous to “better” in some respect.
    – There is beer at this meetup. Cue short intermission.
• Otherweiser® (prompted by the aforementioned speaking point), I’d
  really be very happy if people said to themselves at the end of this
  presentation “Hadoop seems easy! I’m going to give it a try.”
Why are we talking about this anyhow?
“Every two days now we create as much information as we did from the dawn of
   civilization up until 2003.” – Eric Schmidt, August 2010

•   We aggregate a lot of data (and have been)
     – Particularly businesses like Google, Amazon, Apple etc…
     – Presumably the government is doing awful things with data too
•   But aggregation isn’t understanding
     – Lawnmower Man aside
     – We need to UNDERSTAND the data – that is, take raw data and make it interpretable.
     – Hence the need for a marriage of Statistics and Programming directed at understanding
       phenomena expressed in these large data sets
     – Can’t recommend this book enough:
          •   The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert
              Tibshirani and Jerome Friedman
          •   http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1
•   So everybody is going crazy about this in general.
Also, who is this “self” I speak of?
• ’tis I, Timothy Dalbey
     • I work for the Emerging Technologies Group of News
       Corporation
     • I live in North East Portland and keep an office on 53rd
       and 5th in New York City
     • Studied Mathematics and Economics as an
       undergraduate student and Statistics as a graduate
       student at the University of Virginia
     • 2 awesome kids and an awesome partner at home: Liam,
       Juniper and Lindsay
     • Enthusiastic about technology, science and futuristic
       endeavors in general
Elastic MapReduce
• Elastic MapReduce is
  – A service of Amazon Web Services
  – Composed of Amazon Machine Images
     • ssh capability
     • Debian Linux
     • Preloaded with ancient versions of R
  – A complementary set of Ruby client tools
  – A web interface
  – Preconfigured to run Hadoop
Hadoop
• Popular framework for controlling distributed cluster computations
     – Popularity is important – cue story about MPI at Levy Laboratory
       and Beowulf clusters…
• Hadoop is an Apache project
     – http://hadoop.apache.org/
•   Open Source
•   Java
•   Configurable (mostly uses XML config files)
•   Fault Tolerant
•   Lots of ways to interact with Hadoop
     –   Pig
     –   Hive
     –   Streaming
     –   Custom .jar
Hadoop is MapReduce
• What is a MapReduce?
   – Originally coined by Google Labs in 2004
   – A super simplified single-node version of the paradigm is as follows:
       cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt
• That is, MapReduce follows a general process:
   –   Read input (cat input)
   –   Map (mapper.R)
   –   Partition
   –   Comparison (sort)
   –   Reduce (reducer.R)
   –   Output (output.txt)
• You can use most popular scripting languages
   – Perl, PHP, Python etc…
   – R
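On an actual cluster, the same scripts run under Hadoop Streaming. A hedged sketch of the invocation (the streaming jar path varies by Hadoop version, and the input/output paths are hypothetical HDFS paths):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/hadoop/input.txt \
    -output /user/hadoop/wordcount-out \
    -mapper mapper.R -reducer reducer.R \
    -file mapper.R -file reducer.R    # ship both scripts to the nodes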
But – that sort of misses the point
• MapReduce is a computational paradigm intended for
   – Large Datasets
   – Multi-Node Computation
   – Truly Parallel Processing
• Master/Slave architecture
   – Nodes are agnostic of one another, only the master
     node(s) have any idea about the greater scheme of things.
      • The importance of truly parallel processing
• A good first question before engaging in creating a
  Hadoop job is:
   – Is this process a good candidate for Hadoop processing in
     the first place?
Benefits to using AWS for Hadoop Jobs
• Preconfigured to run Hadoop
   – This in itself is something of a miracle
• Virtual servers
   – Use the servers only as long as you need them
   – Configurability
• Handy command line tools
• S3 is sitting in the same cloud
   – Your data is sitting in the same space
• Servers come at $0.06 per hour of compute time
  – dirt cheap
Specifics
• Bootstrapping
   – Bootstrapping is the process by which you can customize the nodes via bash shell scripts
      • Acquiring data
      • Updating R
      • Installing packages
      • For example:

#!/bin/bash
# Bootstrap action: upgrade R on each (Debian lenny) node from the
# lenny-cran backports repository.
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 | sudo apt-key add -   # trust the repository key
echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev
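For context, a hedged sketch of attaching that script as a bootstrap action with the elastic-mapreduce Ruby CLI (the bucket and script names are made up for illustration):

# Upload the script to S3 first, then reference it at job flow creation.
elastic-mapreduce --create --alive \
    --name "R upgrade test" \
    --num-instances 4 \
    --bootstrap-action s3://my-bucket/bootstrap/upgrade-r.sh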



•   Input file
     – Mapper specific
        • Classic example in WordCounter.py
            – Example: “It was the best of times, it was the worst of times…”
            – Note: Big data set!
        • An example from a recent application of mine:
            – "25621"\r"23803"\r"31712"\r…
            – Note: Not such a big data set


•       Mapper & Reducer
           –       Both typically draw from STDIN and write to STDOUT
           –       Please see the following examples
The typical “Hello World” MapReduce Mapper
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")  # emit "word<TAB>1" pairs
}

close(con)
The typical “Hello World” MapReduce Reducer
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       split <- splitLine(line)
       word <- split$word
       count <- split$count
       if (exists(word, envir = env, inherits = FALSE)) {
           oldcount <- get(word, envir = env)
           assign(word, oldcount + count, envir = env)
       }else{
           assign(word, count, envir = env)
       }
}

close(con)
for (w in ls(env, all = TRUE)){
       cat(w, "\t", get(w, envir = env), "\n", sep = "")
}
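Both scripts can be smoke-tested locally, without a cluster, via the single-node pipeline shown earlier (assuming both files are executable):

chmod +x mapper.R reducer.R
echo "it was the best of times it was the worst of times" | ./mapper.R | sort | ./reducer.R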
MapReduce and R: Forecasting data for News Corporation
• 50k+ products with historical unit sales data of roughly
  2.5MM rows
• Some of the titles require heavy computational processing
   – Titles with insufficient data require augmented or surrogate
     data in order to make “good” predictions – thus identifying good
     candidate data was also necessary in addition to prediction
     methods
   – Took lots of time (particularly in R)
      • But R had the analysis tools I needed!
• Key observation: The predictions were independent of one
  another which made the process truly parallel.
• Thus, Hadoop and Elastic MapReduce were merited
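As a rough illustration only – the file layout, model and column names below are hypothetical, not the production code – a mapper for this kind of job can read one product ID per line and forecast it independently:

#! /usr/bin/env Rscript
# Hypothetical sketch: one product ID per input line; each forecast
# uses only that product's own history, so mappers never need to talk.
con <- file("stdin", open = "r")
while (length(id <- readLines(con, n = 1, warn = FALSE)) > 0) {
    history <- read.csv(sprintf("data/%s.csv", id))   # data already moved to the node
    fit <- predict(arima(history$units, order = c(1, 0, 0)), n.ahead = 12)
    cat(id, "\t", paste(round(fit$pred, 2), collapse = ","), "\n", sep = "")
}
close(con)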
My Experience Learning and Using Hadoop with AWS
•   Debugging is something of a nightmare.
     –   SSH onto nodes to figure out what’s really going on
     –   STDERR is your enemy – it will cause your job to fail rather completely
     –   STDERR is your best friend. No errors and failed jobs are rather frustrating
•   Most of the work is in the transactional details with AWS Elastic MapReduce
•   I followed conventional advice which is “move data to the nodes.”
     –   This meant moving data into CSVs in S3 and importing the data into R via standard read methods
     –   This also meant that my processes were database agnostic
     –   JSON is a great way of structuring input and output between phases of the MapReduce Process
           •   To that effect, check out RJSON – great package.
•   In general, the following rule seems to apply (see the sketch after this list):
     –   Data frame bad.
     –   Data table good.
           •   http://cran.r-project.org/web/packages/data.table/index.html
•   Packages to simplify R make my skin crawl
     –   Ever see Jurassic Park?
     –   Just a stubborn programmer – of course the logical extension leads me to a contradiction. Never mind that I
         said that.
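A minimal sketch of the data.frame vs. data.table point (the column names are invented and the data simulated; assumes the data.table package is installed):

library(data.table)

sales <- data.table(product_id = sample(1:50000, 2.5e6, replace = TRUE),
                    units = rpois(2.5e6, 5))
setkey(sales, product_id)      # index once...
sales[J(25621), sum(units)]    # ...then keyed lookups and grouped sums are
                               # far faster than subset() on a data.frame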
R Package to Utilize Map Reduce
• Segue – written by J.D. Long
  – http://www.cerebralmastication.com
     • P.s. We all realize that www is a subdomain, right?
       World Wide Web… is that really necessary?
  – Handles much of the transactional detail and
    allows the use of Elastic MapReduce through
    apply() and lapply() wrappers (see the sketch below)
• Seems like this is a good tutorial too:
  – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
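A hedged sketch of the Segue workflow (the cluster size and toy function are mine; the credentials are placeholders):

library(segue)
setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")

cluster <- createCluster(numInstances = 4)              # spin up an EMR job flow
results <- emrlapply(cluster, 1:100, function(i) i^2)   # lapply(), but on EMR
stopCluster(cluster)                                    # shut the nodes down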
Other stuff
• Distributed Cache
  – Load your data the smart way!
• Ruby Command Tools
  – Interact with AWS the smart way! (A monitoring sketch follows this list.)
• Web interface
  – Simple.
  – Helpful for checking jobs when you wake up
    at 3:30AM and wonder “is my script still running?”
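For that 3:30AM question, a couple of hedged elastic-mapreduce one-liners (the job flow ID is a placeholder):

elastic-mapreduce --list --active                        # which job flows are still running?
elastic-mapreduce --describe --jobflow j-XXXXXXXXXXXXX   # details on one job flow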
