useR Vignette:
R + 15 minutes = Hadoop cluster

Greater Boston useR Group
February 2011

by Jeffrey Breen
jbreen@cambridge.aero

Agenda

 ●      What's Hadoop?
          ●      But I don't have Big Data
 ●      Building the cluster
 ●      Estimating π stochastically
 ●      Want to know more?




useR Vignette: R + 15 minutes = Hadoop Cluster   Greater Boston useR Meeting, February 2011   Slide 2
MapReduce, Hadoop and Big Data

 ●      Hadoop is an open source implementation of Google's
        MapReduce-based data processing infrastructure
          ●      Designed to process huge data sets
                    –     “huge” = “all of Facebook's web logs”
                    –     Yahoo! sorted 1TB in 62 seconds in May 2009
                    –     HDFS distributed file system makes replication decisions
                          based on knowledge of network topology
 ●      Amazon Elastic MapReduce is a full Hadoop stack
        on EC2

MapReduce = Map + shuffle + Reduce




Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
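The three phases named in the slide title can be tried on a single machine. The following is a minimal sketch in base R of a toy word count, not Hadoop itself; the input strings and the variable names (docs, mapped, shuffled, wordCounts) are made up for illustration.

```r
# Toy word count run locally in base R (no Hadoop involved).
docs <- c("big data big cluster", "small data")

# Map: emit one (key, value) pair per word, with value 1.
mapped <- do.call(rbind, lapply(docs, function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}))

# Shuffle: group all values that share a key.
shuffled <- split(mapped$value, mapped$key)

# Reduce: collapse each key's values to a single result.
wordCounts <- vapply(shuffled, sum, integer(1))
print(wordCounts)
```

On a real cluster the map and reduce steps run on different nodes and the shuffle moves data between them; here all three happen in one R session.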

But I don't have Big Data

 ●      Agricultural economist J.D. Long doesn't either, but
        he does have a bunch of simulations to run
 ●      Had a key insight: the input could be a small amount
        of data (like 1:1000) to serve as random seeds for
        simulation code in the “mapper” function
 ●      Enjoy Hadoop's infrastructure for job scheduling,
        fault tolerance, inter-node communication, etc.
 ●      Use Amazon's cloud to scale up quickly as needed
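The seeds-as-input idea can be sketched locally with plain lapply() before any cluster is involved. In this illustration simulateOnce is a hypothetical stand-in for real simulation code, and the ten seeds are an arbitrary choice.

```r
# Hypothetical stand-in for a simulation: mean of 100 standard normal draws.
simulateOnce <- function(seed) {
  set.seed(seed)        # the seed is the only input the "mapper" needs
  mean(rnorm(100))
}

seeds <- as.list(1:10)                  # tiny input, like 1:1000 on the slide
firstRun  <- lapply(seeds, simulateOnce)
secondRun <- lapply(seeds, simulateOnce)

# The same seed gives the same result, so a rerun task reproduces its output.
identical(firstRun, secondRun)
```

Because each call depends only on its seed, the calls can run in any order and on any node, which is what makes this pattern a fit for a mapper.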



Load the segue library
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run
the setCredentials() function.


> setCredentials('YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')




Start the cluster
> myCluster <- createCluster(numInstances=5)
STARTING - 2011-01-04 15:07:53
[…]
BOOTSTRAPPING - 2011-01-04 15:11:28
[…]
WAITING - 2011-01-04 15:15:35
Your Amazon EMR Hadoop Cluster is ready for action.
Remember to terminate your cluster with
stopCluster().
Amazon is billing you!


Estimate π stochastically
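> estimatePi <- function(seed){
        set.seed(seed)       # one seed per task keeps each run reproducible
        numDraws <- 1e6

        r <- .5 # radius
        # draw points uniformly in the square [-r, r] x [-r, r]
        x <- runif(numDraws, min=-r, max=r)
        y <- runif(numDraws, min=-r, max=r)
        # 1 if the point lies inside the circle of radius r, else 0
        inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)

        # fraction inside the circle is about pi/4, so scale by 4
        return(sum(inCircle) / length(inCircle) * 4)
}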
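The factor of 4 comes from an area ratio: the square has area (2r)^2 = 1 and the inscribed circle has area π r^2 = π/4. The sketch below checks this locally; estimatePiLocal and its numDraws argument are additions for quick testing, not part of the original slide, and the 0.05 tolerance is an arbitrary loose bound.

```r
# Variant of estimatePi with the number of draws exposed as an argument
# (the numDraws parameter is an addition for quick local testing).
estimatePiLocal <- function(seed, numDraws = 1e5) {
  set.seed(seed)
  r <- 0.5
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  # Points fall in a square of area (2r)^2 = 1; the inscribed circle has
  # area pi * r^2 = pi / 4, so the hit fraction times 4 estimates pi.
  inCircle <- (x^2 + y^2) < r^2
  mean(inCircle) * 4
}

localEstimate <- estimatePiLocal(seed = 1)
abs(localEstimate - pi) < 0.05
```

Comparing squared distances avoids the square root, and mean() of a logical vector gives the hit fraction directly; the estimate is otherwise the same calculation as on the slide.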


Run the simulation
> seedList <- as.list(1:1e3)
> myEstimates <- emrlapply( myCluster, seedList, estimatePi )
RUNNING - 2011-01-04 15:22:28
[…]
WAITING - 2011-01-04 15:32:18
> myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits=10)
[1] "3.141586544"
> format(pi, digits=10)
[1] "3.141592654"
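As a rough plausibility check on the result above, the expected sampling error can be worked out from the binomial variance of the hit fraction. This sketch assumes the 1,000 per-seed estimates are independent; the variable names are illustrative.

```r
numDraws <- 1e6          # draws per seed, from estimatePi
numSeeds <- 1e3          # length of seedList
p <- pi / 4              # probability that a point lands in the circle

# Each estimate is 4 times a binomial proportion.
sdPerEstimate <- 4 * sqrt(p * (1 - p) / numDraws)
# Averaging independent estimates shrinks the spread by sqrt(numSeeds).
seOfMean <- sdPerEstimate / sqrt(numSeeds)

observedError <- abs(3.141586544 - pi)   # myPi from the slide vs. R's pi
c(seOfMean = seOfMean, observedError = observedError)
```

The standard error of the averaged estimate comes out near 5e-5, and the error observed on the slide is smaller than that, so the reported value is consistent with what this many draws should deliver.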


Won't break the bank

 ●      Total cost: $0.15
                Standard On-Demand     Amazon EC2 price per hour     Amazon Elastic MapReduce
                Instances              (On-Demand Instances)         price per hour

                Small (Default)        $0.085                        $0.015
                Large                  $0.34                         $0.06
                Extra Large            $0.68                         $0.12




Want to know more?

 ●      JD Long's segue package
          ●      http://code.google.com/p/segue/
 ●      Hadoop
          ●      http://hadoop.apache.org/
          ●      Book: http://oreilly.com/catalog/0636920010388
 ●      My blog
          ●      http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a




