Big Data Step-by-Step
Boston Predictive Analytics Big Data Workshop
Microsoft New England Research & Development Center, Cambridge, MA
Saturday, March 10, 2012

by Jeffrey Breen
President and Co-Founder, Atmosphere Research Group
email: jeffrey@atmosgrp.com
Twitter: @JeffreyBreen
http://atms.gr/bigdata0310

Big Data Infrastructure
Part 3: Taking it to the cloud... easily... with Whirr

Code & more on github:
https://github.com/jeffreybreen/tutorial-201203-big-data
Overview
                     • Download and install Apache Whirr on our local Cloudera VM
                     • Use Whirr to launch a Hadoop cluster on Amazon EC2
                     • Tell our local Hadoop tools to use the cluster instead of the local installation
                     • Run some tests
                     • Use Hadoop’s “distcp” to load data into HDFS from Amazon’s S3 storage service
                     • Extra credit: save money with Amazon’s spot instances

Heavy lifting by jclouds and Whirr
                    jclouds - http://www.jclouds.org/
                           “jclouds is an open source library that helps you get started
                           in the cloud and reuse your java and clojure development
                           skills. Our api allows you freedom to use portable
                           abstractions or cloud-specific features. We test support of 30
                           cloud providers and cloud software stacks, including
                           Amazon, GoGrid, Ninefold, vCloud, OpenStack, and Azure.”
                    Apache Whirr - http://whirr.apache.org/
                           “Apache Whirr is a set of libraries for running cloud
                           services.
                           Whirr provides:
                                • A cloud-neutral way to run services. You don't have
                                 to worry about the idiosyncrasies of each provider.
                                • A common service API. The details of provisioning
                                 are particular to the service.
                                • Smart defaults for services. You can get a properly
                                 configured system running quickly, while still being
                                 able to override settings as needed.
                           You can also use Whirr as a command line tool for deploying
                           clusters.”                                                      Just what we want!




Whirr makes it look easy
                • All you need is a simple config file
                      whirr.cluster-name=hadoop-ec2
                      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,15 hadoop-datanode+hadoop-tasktracker
                      whirr.provider=aws-ec2
                      whirr.identity=${env:AWS_ACCESS_KEY_ID}
                      whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
                      whirr.hardware-id=m1.large
                      whirr.location-id=us-east-1
                      whirr.image-id=us-east-1/ami-49e32320
                      whirr.java.install-function=install_oab_java
                      whirr.hadoop.install-function=install_cdh_hadoop
                      whirr.hadoop.configure-function=configure_cdh_hadoop



                • And one line to launch your cluster
                      $ ./whirr launch-cluster --config hadoop-ec2.properties
                      Bootstrapping cluster
                      Configuring template
                      Configuring template
                      Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
                      Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]




One line?!? That’s too easy! What didn’t you show us?
                    • Download and install Whirr (≥ 0.7.1!)
                    • Specify your AWS security credentials
                    • Create a key pair to access the nodes
                    • Install R and add-on packages onto each node
                    • Configure VM to use cluster’s Hadoop instance
                           & run a proxy
                    • Copy data onto the cluster & run a test
                    • So... let’s walk through those steps next...
Download & Install Whirr (≥ 0.7.1)
                • Find an Apache mirror
                      http://www.apache.org/dyn/closer.cgi/whirr/

                • From your VM’s shell, download it with wget
                      $ wget http://apache.mirrors.pair.com/whirr/stable/whirr-0.7.1.tar.gz


                • Installing is as simple as expanding the tarball
                      $ tar xf whirr-0.7.1.tar.gz


                • Modify your path so this new version runs (use $HOME rather than ~, which won’t expand inside double quotes)
                      $ export PATH="$HOME/whirr-0.7.1/bin:$PATH"
                      $ whirr version
                      Apache Whirr 0.7.1




Amazon Login Info
   •     From AWS Management Console, look up your Access Keys
           •     “Access Key ID” ➔ whirr.identity

           •     “Secret Access Key” ➔ whirr.credential
   •     You could enter them into Whirr’s config file, but please don’t
           •     instead, just pick up environment variables in config file:
         whirr.identity=${env:AWS_ACCESS_KEY_ID}
         whirr.credential=${env:AWS_SECRET_ACCESS_KEY}


           •     and set them for your session
          $ export AWS_ACCESS_KEY_ID="your access key id here"
          $ export AWS_SECRET_ACCESS_KEY="your secret access key here"


   •     While we’re at it, create a key pair
         $ ssh-keygen -t rsa -P ""
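
          A small sketch of my own (not from the original deck) for keeping those variables
          around between sessions, assuming a bash login shell on the Cloudera VM; note that
          it leaves your keys sitting in a plain-text file:

          $ echo 'export AWS_ACCESS_KEY_ID="your access key id here"' >> ~/.bashrc
          $ echo 'export AWS_SECRET_ACCESS_KEY="your secret access key here"' >> ~/.bashrc
          $ source ~/.bashrc
          $ env | grep AWS_     # sanity check: both variables should be listed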




Configuration file highlights
        Specify how many nodes of each type
              whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker


         Select instance size & type (m1.large, c1.xlarge, m2.xlarge, etc., as described at
         http://aws.amazon.com/ec2/instance-types/)
              whirr.hardware-id=m1.large

        Use a RightScale-published CentOS image (with transitory “instance” storage)
              whirr.image-id=us-east-1/ami-49e32320




Launch the Cluster
                Yes, just one line... but then pages of output
                $ whirr launch-cluster --config hadoop-ec2.properties
                Bootstrapping cluster
                Configuring template
                Configuring template
                Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
                Starting 10 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
                [...]
                Running configure phase script on: us-east-1/i-e301ab87
                configure phase script run completed on: us-east-1/i-e301ab87
                [...]
                You can log into instances using the following ssh commands:
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@107.22.25.82'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@204.236.222.162'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@23.20.97.157'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@75.101.192.112'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@50.16.43.91'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@107.22.84.246'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@23.20.134.238'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@107.22.61.144'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@23.20.6.74'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@174.129.137.89'
                'ssh -i /home/cloudera/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no cloudera@107.21.77.224'
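
                 Before going further, a quick sanity check of what Whirr actually started never hurts.
                 This is my own sketch (not on the original slides), using Whirr’s list-cluster command;
                 Whirr also leaves the per-cluster files used on the next slides (hadoop-site.xml,
                 hadoop-proxy.sh) under ~/.whirr/hadoop-ec2/:

                 $ whirr list-cluster --config hadoop-ec2.properties
                 # lists the instances Whirr knows about for this cluster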




Install R and Packages
                •     install-r+packages.sh contains code to download and install R, plyr, rmr and their
                      prerequisites
                 •     Whirr will run scripts on each node for us
                      $ whirr run-script --script install-r+packages.sh --config hadoop-ec2.properties

                •     And then you get to see pages and pages of output for each and every node!
                      ** Node us-east-1/i-eb01ab8f: [10.124.18.198, 107.21.77.224]
                      rightscale-epel                                          | 951 B      00:00
                      Setting up Install Process
                      Resolving Dependencies
                      --> Running transaction check
                      ---> Package R.x86_64 0:2.14.1-1.el5 set to be updated
                      --> Processing Dependency: libRmath-devel = 2.14.1-1.el5 for package: R
                      ---> Package R-devel.i386 0:2.14.1-1.el5 set to be updated
                      --> Processing Dependency: R-core = 2.14.1-1.el5 for package: R-devel
                      [...]

                •     Hopefully it ends with something positive like
                      * DONE (rmr)
                      Making packages.html        ... done




install-r+packages.sh
      sudo yum -y --enablerepo=epel install R R-devel

      sudo R --no-save << EOF
      install.packages(c('RJSONIO', 'itertools', 'digest', 'plyr'),
                       repos="http://cran.revolutionanalytics.com",
                       INSTALL_opts=c('--byte-compile') )
      EOF

      # install latest version of the rmr package from RHadoop's github repository:
      branch=master
      wget --no-check-certificate https://github.com/RevolutionAnalytics/RHadoop/tarball/$branch -O - | tar zx
      mv RevolutionAnalytics-RHadoop* RHadoop
      sudo R CMD INSTALL --byte-compile RHadoop/rmr/pkg/

      sudo su << EOF1
      cat >> /etc/profile <<EOF

      export HADOOP_HOME=/usr/lib/hadoop

      EOF
      EOF1
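
      Before throwing a real job at the cluster, a tiny smoke test can be pushed out the same
      way. This is my own sketch, not part of the workshop materials: the file name test-rmr.sh
      is made up, and the R code assumes the rmr 1.x API current in early 2012 (to.dfs,
      mapreduce, from.dfs, keyval).

      # test-rmr.sh (hypothetical): square the integers 1..10 with a trivial rmr job
      R --no-save -e 'library(rmr)' \
                  -e 'out <- mapreduce(input = to.dfs(1:10), map = function(k, v) keyval(v, v^2))' \
                  -e 'print(from.dfs(out))'

      Run it the same way as the install script:
      $ whirr run-script --script test-rmr.sh --config hadoop-ec2.properties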




Switch from local to cluster Hadoop
               •     CDH uses linux’s alternatives facility to specify the location of the current configuration
                     files
                     $ sudo /usr/sbin/alternatives --display hadoop-0.20-conf
                     hadoop-0.20-conf - status is manual.
                      link currently points to /etc/hadoop-0.20/conf.pseudo
                     /etc/hadoop-0.20/conf.empty - priority 10
                     /etc/hadoop-0.20/conf.pseudo - priority 30
                     Current `best' version is /etc/hadoop-0.20/conf.pseudo.

               •     Whirr generates the config file we need to create a “conf.ec2” alternative
                     $ sudo mkdir /etc/hadoop-0.20/conf.ec2
                     $ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.ec2
                     $ sudo rm -f /etc/hadoop-0.20/conf.ec2/*-site.xml
                     $ sudo cp ~/.whirr/hadoop-ec2/hadoop-site.xml /etc/hadoop-0.20/conf.ec2/
                     $ sudo /usr/sbin/alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.ec2 30
                     $ sudo /usr/sbin/alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.ec2
                     $ sudo /usr/sbin/alternatives --display hadoop-0.20-conf
                     hadoop-0.20-conf - status is manual.
                      link currently points to /etc/hadoop-0.20/conf.ec2
                     /etc/hadoop-0.20/conf.empty - priority 10
                     /etc/hadoop-0.20/conf.pseudo - priority 30
                     /etc/hadoop-0.20/conf.ec2 - priority 30
                     Current `best' version is /etc/hadoop-0.20/conf.pseudo.
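
                      Don’t be alarmed that the “best” version still reads conf.pseudo: both alternatives
                      carry priority 30, and --set put the group in manual mode, so the link (which now
                      points to conf.ec2) is what Hadoop actually uses. Since you’ll switch back to
                      conf.pseudo at the end of the session, a small helper saves retyping the incantation.
                      My own sketch; the script name is made up:

                      # switch-hadoop-conf.sh (hypothetical): usage ./switch-hadoop-conf.sh ec2|pseudo
                      conf=${1:?usage: $0 ec2|pseudo}
                      sudo /usr/sbin/alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.$conf
                      sudo /usr/sbin/alternatives --display hadoop-0.20-conf | grep 'points to'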




Fire up a proxy connection
                •     Whirr generates a proxy to connect your VM to the cluster
                      $ ~/.whirr/hadoop-ec2/hadoop-proxy.sh
                      Running proxy to Hadoop cluster at ec2-107-21-77-224.compute-1.amazonaws.com.
                      Use Ctrl-c to quit.
                      Warning: Permanently added 'ec2-107-21-77-224.compute-1.amazonaws.com,
                      107.21.77.224' (RSA) to the list of known hosts.


                •     Any hadoop commands executed on your VM should go to the
                      cluster instead
                      $ hadoop dfsadmin -report
                      Configured Capacity: 4427851038720 (4.03 TB)
                      Present Capacity: 4144534683648 (3.77 TB)
                      DFS Remaining: 4139510718464 (3.76 TB)
                      DFS Used: 5023965184 (4.68 GB)
                       DFS Used%: 0.12%
                       Under replicated blocks: 0
                       Blocks with corrupt replicas: 0
                       Missing blocks: 0
                       [...]

                       Definitely not in Kansas anymore
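
                 The proxy has to stay up for as long as you talk to the cluster, which ties up a
                 terminal. One way to push it into the background instead (my own sketch, not from
                 the deck):

                       $ ~/.whirr/hadoop-ec2/hadoop-proxy.sh > ~/hadoop-proxy.log 2>&1 &
                       $ tail ~/hadoop-proxy.log    # confirm it connected

                 Bring it back with fg and Ctrl-C it when you are finished with the cluster.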




Test Hadoop with a small job
             Download my fork of Jonathan Seidman’s sample R code from github
                    $ mkdir hadoop-r
                    $ cd hadoop-r
                    $ git init
                    $ git pull git://github.com/jeffreybreen/hadoop-R.git

              Grab the first 1,000 lines from ASA’s 2004 airline data
                    $ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat \
                       | head -1000 > 2004-1000.csv

             Make some directories in HDFS and load the data file
                    $ hadoop fs -mkdir /user/cloudera
                    $ hadoop fs -mkdir asa-airline
                    $ hadoop fs -mkdir asa-airline/data
                    $ hadoop fs -mkdir asa-airline/out
                    $ hadoop fs -put 2004-1000.csv asa-airline/data/

             Run Jonathan’s sample streaming job
                   $ cd airline/src/deptdelay_by_month/R/streaming
                    $ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
                      -input asa-airline/data -output asa-airline/out/dept-delay-month \
                      -mapper map.R -reducer reduce.R -file map.R -file reduce.R
                   [...]
                   $ hadoop fs -cat asa-airline/out/dept-delay-month/part-00000
                   2004       1        973       UA       11.55293
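
              One gotcha worth knowing before you experiment: Hadoop refuses to run a job whose
              output directory already exists, so clear it out before re-running (0.20-era syntax):
                    $ hadoop fs -rmr asa-airline/out/dept-delay-month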




distcp: using Hadoop to load its own data
 $ hadoop distcp -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
    -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
    s3n://asa-airline/data asa-airline

 12/03/08         21:42:21   INFO   tools.DistCp: srcPaths=[s3n://asa-airline/data]
 12/03/08         21:42:21   INFO   tools.DistCp: destPath=asa-airline
 12/03/08         21:42:27   INFO   tools.DistCp: sourcePathsCount=23
 12/03/08         21:42:27   INFO   tools.DistCp: filesToCopyCount=22
 12/03/08         21:42:27   INFO   tools.DistCp: bytesToCopyCount=1.5g
 12/03/08         21:42:31   INFO   mapred.JobClient: Running job: job_201203082122_0002
 12/03/08         21:42:32   INFO   mapred.JobClient: map 0% reduce 0%
 12/03/08         21:42:41   INFO   mapred.JobClient: map 14% reduce 0%
 12/03/08         21:42:45   INFO   mapred.JobClient: map 46% reduce 0%
 12/03/08         21:42:46   INFO   mapred.JobClient: map 61% reduce 0%
 12/03/08         21:42:47   INFO   mapred.JobClient: map 63% reduce 0%
 12/03/08         21:42:48   INFO   mapred.JobClient: map 70% reduce 0%
 12/03/08         21:42:50   INFO   mapred.JobClient: map 72% reduce 0%
 12/03/08         21:42:51   INFO   mapred.JobClient: map 80% reduce 0%
 12/03/08         21:42:53   INFO   mapred.JobClient: map 83% reduce 0%
 12/03/08         21:42:54   INFO   mapred.JobClient: map 89% reduce 0%
 12/03/08         21:42:56   INFO   mapred.JobClient: map 92% reduce 0%
 12/03/08         21:42:58   INFO   mapred.JobClient: map 99% reduce 0%
 12/03/08         21:43:04   INFO   mapred.JobClient: map 100% reduce 0%
 12/03/08         21:43:05   INFO   mapred.JobClient: Job complete: job_201203082122_0002
 [...]
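
 Before launching a job against the full data set, a quick look confirms the copy landed
 where you expect (my own sketch; the exact layout depends on how distcp arranged the
 destination directory):
    $ hadoop fs -lsr asa-airline     # recursive listing of everything under asa-airline
    $ hadoop fs -dus asa-airline     # total size, which should line up with bytesToCopyCount above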



Are you sure you want to shut down?

                •     Unlike the EBS-backed instance we created in Part 2, when the nodes
                      are gone, they’re gone (including their data), so you need to copy your
                      results out of the cluster’s HDFS before you throw the switch
                •     You could use hadoop fs -get to copy to your local file system
                      $ hadoop fs -get asa-airline/out/dept-delay-month .
                      $ ls -lh dept-delay-month
                      total 1.0K
                       drwxr-xr-x 1 1120 games 102 Mar  8 23:06 _logs
                       -rw-r--r-- 1 1120 games  33 Mar  8 23:06 part-00000
                       -rw-r--r-- 1 1120 games   0 Mar  8 23:06 _SUCCESS
                      $ cat dept-delay-month/part-00000
                      2004         1      973          UA         11.55293


                •     Or you could have your programming language of choice save the
                      results locally for you
                       save( dept.delay.month.df, file='out/dept.delay.month.RData' )
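
                 •     Or, if the results are too big to pull down to the VM comfortably, the distcp
                       trick from earlier works in reverse: push the HDFS output to S3 before tearing
                       the cluster down. A sketch of my own; the bucket name is hypothetical
                       $ hadoop distcp -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
                          -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
                          asa-airline/out s3n://your-results-bucket/asa-airline-out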




Say goodnight, Gracie
                •     control-c to close the proxy connection
                      $ ~/.whirr/hadoop-ec2/hadoop-proxy.sh
                      Running proxy to Hadoop cluster at ec2-107-21-77-224.compute-1.amazonaws.com. Use Ctrl-c to
                      quit.
                      Warning: Permanently added 'ec2-107-21-77-224.compute-1.amazonaws.com,107.21.77.224' (RSA) to
                      the list of known hosts.
                      ^C
                      Killed by signal 2.

                •     Shut down the cluster
                      $ whirr destroy-cluster --config hadoop-ec2.properties

                       Starting to run scripts on cluster for phase destroyinstances: us-east-1/i-c901abad,
                       us-east-1/i-ad01abc9, us-east-1/i-f901ab9d, us-east-1/i-e301ab87, us-east-1/i-d901abbd,
                       us-east-1/i-c301aba7, us-east-1/i-dd01abb9, us-east-1/i-d101abb5, us-east-1/i-f101ab95,
                       us-east-1/i-d501abb1
                      Running destroy phase script on: us-east-1/i-c901abad
                      [...]
                      Finished running destroy phase scripts on all cluster instances
                      Destroying hadoop-ec2 cluster
                      Cluster hadoop-ec2 destroyed

                •     Switch back to your local Hadoop
                      $ sudo /usr/sbin/alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.pseudo
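
                 •     A quick sanity check that you really are back on the VM’s own pseudo-distributed
                       Hadoop: the report that showed terabytes a few slides ago should now show only
                       your VM’s single-node capacity
                       $ hadoop dfsadmin -report | head -3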




Extra Credit: Use Spot Instances
                Through the “whirr.aws-ec2-spot-price” parameter, Whirr even
                lets you bid for excess capacity
                           http://aws.amazon.com/ec2/spot-instances/
                           http://aws.amazon.com/pricing/ec2/
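
                For example, appending a maximum bid (in US dollars per instance-hour) to the same
                properties file. The number below is purely illustrative; check current spot prices
                at the second link before picking your own:
                           $ echo 'whirr.aws-ec2-spot-price=0.10' >> hadoop-ec2.properties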




Whirr bids, waits, and launches




Hey, big spender
                10+1 m1.large nodes for 3 hours = $3.56
                (that’s 11 instances × 3 hours = 33 instance-hours, or roughly $0.11 per instance-hour at spot prices)




Obligatory iPhone p0rn




</infrastructure>




