Elastic MapReduce
   Outsourcing BigData

        Nathan McCourtney
            @beaknit
What is MapReduce?
From Wikipedia:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
The Map
Mapping involves taking raw data and converting it into a
series of symbols.

For example, DNA sequencing:
ddATP   ->   A
ddGTP   ->   G
ddCTP   ->   C
ddTTP   ->   T

Results in representations like "GATTACA"
Practical Mapping
Inputs are generally flat files containing lines of text.
   clever_critters.txt:
       foxes are clever
       cats are clever




Files are read in and fed to a mapper one line at a time via
STDIN.
   cat clever_critters.txt | mapper.rb
Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps
   foxes 1
   are 1
   clever 1
   cats 1
   are 1
   clever 1
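A mapper producing this kind of output fits in a few lines of Ruby. This is a sketch of what mapper.rb might look like (not the original deck's code); note that Hadoop streaming treats the text before the first tab on each output line as the key:

```ruby
# Sketch of mapper.rb: emit "word\t1" for every word on a line.
# Hadoop streaming splits key from value at the first tab.
def map_line(line)
  line.chomp.split(/\s+/).map { |word| "#{word}\t1" }
end

# In a real job, lines arrive one at a time on STDIN:
#   ARGF.each_line { |line| puts map_line(line) }
puts map_line("foxes are clever")
```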
Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes

   foxes -> node 1
   are -> node 2
   clever -> node 3
   cats -> node 4
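Hadoop's default partitioner is just a hash of the key modulo the number of reducers, which guarantees that identical keys always land on the same node. A minimal Ruby sketch of the idea:

```ruby
# Default-style partitioning: hash the key, take it modulo the reducer
# count, so equal keys always map to the same reducer. (Ruby seeds
# String#hash per process, but it is stable within a single run.)
def partition(key, num_reducers)
  key.hash % num_reducers
end

partition("clever", 4)  # some node in 0..3, same answer every call this run
```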
Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.

Typically the work is received as a stream of
key/value pairs via STDIN:
 "foxes 1" -> node 1
 "are 1|are 1" -> node 2
 "clever 1|clever 1" -> node 3
 "cats 1|cats 1" -> node 4
Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.

See Hadoop's Built-In Reducers

e.g., "Aggregate" - give me a total for each key:
  foxes - 1
  are - 2
  clever - 2
  cats - 1
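An "aggregate"-style reducer is easy to sketch in Ruby: sum the counts for every key in the stream of tab-separated pairs assigned to this node (a hypothetical helper, not Hadoop's actual implementation):

```ruby
# Sketch of an aggregate reducer: total the values for each key in a
# stream of "key\tvalue" pairs. In a real job the pairs arrive sorted
# by key on STDIN.
def sum_counts(pairs)
  totals = Hash.new(0)
  pairs.each do |pair|
    key, value = pair.split("\t")
    totals[key] += value.to_i
  end
  totals
end

sum_counts(["are\t1", "are\t1", "clever\t1", "clever\t1"])
# => {"are"=>2, "clever"=>2}
```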
What is Hadoop?
From Wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license.[1] It enables applications to work with thousands of computationally independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.


Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
Hadoop's Guts

(architecture diagram omitted)

source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
Fun to build?



    No
Solution?
Amazon's Elastic MapReduce
Look complex? It's not
1.   Sign up for the service
2.   Download the tools (requires Ruby 1.8)
3.   mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4.   Create your credentials.json file
      {
      "access_id": "<key>",
      "private_key": "<secret key>",
      "keypair": "<name of keypair>",
      "key-pair-file": "~/.ssh/<key>.pem",
      "log_uri": "s3://<unique s3 bucket>/",
      "region": "us-east-1"
      }

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
Run it

  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --create --alive
  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --terminate <JobFlowID>

  Note you can also view it in the Amazon EMR web interface

  Logs can be viewed by looking into the s3 bucket you specified in your
  credentials.json file. Just drill down via the s3 web interface and
  double-click the file.
Creating a minimal job
1. Set up a dedicated s3 bucket

2. Create a folder called "input" in that bucket

3. Upload your inputs into s3://bucket/input
     s3cmd put *log s3://bucket/input
Minimal Job Cont'd
4. Write a mapper
     eg:
     ARGF.each do |line|

        # remove any newline
        line = line.chomp

        if /ERROR/.match(line)
           puts "ERROR\t1"
        end
        if /INFO/.match(line)
           puts "INFO\t1"
        end
        if /DEBUG/.match(line)
           puts "DEBUG\t1"
        end
     end


See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
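Before uploading, the mapper's matching logic can be smoke-tested locally. Here is a sketch that runs a few made-up log lines through the same per-level checks (using include? in place of the regex matches, which is equivalent for plain words):

```ruby
# Local sanity check for the log-level mapper: feed sample lines
# through the same matching rules. The sample lines are invented for
# illustration; a real run pipes a log file through mapper.rb.
LEVELS = %w[ERROR INFO DEBUG]

def map_log_line(line)
  LEVELS.select { |lvl| line.include?(lvl) }.map { |lvl| "#{lvl}\t1" }
end

sample = [
  "12:00:01 INFO starting up",
  "12:00:02 ERROR disk full",
  "12:00:03 INFO shutting down",
]
puts sample.flat_map { |l| map_log_line(l) }
```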
Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
     s3cmd put mapper.rb s3://bucket


6. Run it
     elastic-mapreduce --create --stream \
          --mapper s3://bucket/mapper.rb \
          --input s3://bucket/input \
          --output s3://bucket/output \
          --reducer aggregate


      NOTE: This job uses the built-in aggregator.
      NOTE: The output directory must NOT exist at the time of the run

      Amazon will scale EC2 instances to consume the load dynamically.

7. Pick up your results in the output folder
AWS Demo App
AWS has a very cool publicly-available app to
run:

elastic-mapreduce --create --stream \
     --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input s3://elasticmapreduce/samples/wordcount/input \
     --output s3://bucket/output \
     --reducer aggregate



See Amazon Example Doc
Possibilities
EMR is a fully-functional Hadoop
implementation.

Mappers and reducers can be written in Python,
Ruby, PHP, or Java.

Go crazy.
Further Reading
Tom White's O'Reilly on Hadoop

AWS EMR Getting Started Guide

Hadoop Wiki
