Elastic MapReduce
   Outsourcing BigData

        Nathan McCourtney
            @beaknit
What is MapReduce?
From Wikipedia:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
The Map
Mapping involves taking raw data and converting it into a
series of symbols.

For example, DNA sequencing:
ddATP   ->   A
ddGTP   ->   G
ddCTP   ->   C
ddTTP   ->   T

Results in representations like "GATTACA"
Practical Mapping
Inputs are generally flat files containing lines of text.
   clever_critters.txt:
       foxes are clever
       cats are clever




Files are read in and fed to a mapper one line at a time via
STDIN.
   cat clever_critters.txt | mapper.rb
Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps
   foxes 1
   are 1
   clever 1
   cats 1
   are 1
   clever 1
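A mapper producing this kind of output fits in a few lines of Ruby. This is a sketch of what mapper.rb might look like (not the original deck's code); note that Hadoop streaming treats the text before the first tab on each output line as the key:

```ruby
# Sketch of mapper.rb: emit "word\t1" for every word on a line.
# Hadoop streaming splits key from value at the first tab.
def map_line(line)
  line.chomp.split(/\s+/).map { |word| "#{word}\t1" }
end

# In a real job, lines arrive one at a time on STDIN:
#   ARGF.each_line { |line| puts map_line(line) }
puts map_line("foxes are clever")
```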
Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes

   foxes -> node 1
   are -> node 2
   clever -> node 3
   cats -> node 4
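Hadoop's default partitioner is just a hash of the key modulo the number of reducers, which guarantees that identical keys always land on the same node. A minimal Ruby sketch of the idea:

```ruby
# Default-style partitioning: hash the key, take it modulo the reducer
# count, so equal keys always map to the same reducer. (Ruby seeds
# String#hash per process, but it is stable within a single run.)
def partition(key, num_reducers)
  key.hash % num_reducers
end

partition("clever", 4)  # some node in 0..3, same answer every call this run
```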
Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.

Typically the work is received as a stream of
key/value pairs via STDIN:
 "foxes 1" -> node 1
 "are 1|are 1" -> node 2
 "clever 1|clever 1" -> node 3
 "cats 1|cats 1" -> node 4
Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.

See Hadoop's Built-In Reducers

e.g., "Aggregate" - give me a total for each key:
  foxes - 1
  are - 2
  clever - 2
  cats - 1
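An "aggregate"-style reducer is easy to sketch in Ruby: sum the counts for every key in the stream of tab-separated pairs assigned to this node (a hypothetical helper, not Hadoop's actual implementation):

```ruby
# Sketch of an aggregate reducer: total the values for each key in a
# stream of "key\tvalue" pairs. In a real job the pairs arrive sorted
# by key on STDIN.
def sum_counts(pairs)
  totals = Hash.new(0)
  pairs.each do |pair|
    key, value = pair.split("\t")
    totals[key] += value.to_i
  end
  totals
end

sum_counts(["are\t1", "are\t1", "clever\t1", "clever\t1"])
# => {"are"=>2, "clever"=>2}
```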
What is Hadoop?
From Wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license.[1] It enables applications to work with thousands of computationally independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.


Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
Hadoop's Guts

(architecture diagram omitted)

source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
Fun to build?



    No
Solution?
Amazon's Elastic MapReduce
Look complex? It's not
1.   Sign up for the service
2.   Download the tools (requires Ruby 1.8)
3.   mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4.   Create your credentials.json file
      {
      "access_id": "<key>",
      "private_key": "<secret key>",
      "keypair": "<name of keypair>",
      "key-pair-file": "~/.ssh/<key>.pem",
      "log_uri": "s3://<unique s3 bucket>/",
      "region": "us-east-1"
      }

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
Run it

  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --create --alive
  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --terminate <JobFlowID>

  Note you can also view it in the Amazon EMR web interface

  Logs can be viewed by looking into the s3 bucket you specified in your
  credentials.json file. Just drill down via the s3 web interface and
  double-click the file.
Creating a minimal job
1. Set up a dedicated s3 bucket

2. Create a folder called "input" in that bucket

3. Upload your inputs into s3://bucket/input
     s3cmd put *log s3://bucket/input
Minimal Job Cont'd
4. Write a mapper
     eg:
     ARGF.each do |line|

        # remove any newline
        line = line.chomp

        if /ERROR/.match(line)
           puts "ERROR\t1"
        end
        if /INFO/.match(line)
           puts "INFO\t1"
        end
        if /DEBUG/.match(line)
           puts "DEBUG\t1"
        end
     end


See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
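Before uploading, the mapper's matching logic can be smoke-tested locally. Here is a sketch that runs a few made-up log lines through the same per-level checks (using include? in place of the regex matches, which is equivalent for plain words):

```ruby
# Local sanity check for the log-level mapper: feed sample lines
# through the same matching rules. The sample lines are invented for
# illustration; a real run pipes a log file through mapper.rb.
LEVELS = %w[ERROR INFO DEBUG]

def map_log_line(line)
  LEVELS.select { |lvl| line.include?(lvl) }.map { |lvl| "#{lvl}\t1" }
end

sample = [
  "12:00:01 INFO starting up",
  "12:00:02 ERROR disk full",
  "12:00:03 INFO shutting down",
]
puts sample.flat_map { |l| map_log_line(l) }
```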
Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
     s3cmd put mapper.rb s3://bucket


6. Run it
     elastic-mapreduce --create --stream \
          --mapper s3://bucket/mapper.rb \
          --input s3://bucket/input \
          --output s3://bucket/output \
          --reducer aggregate


      NOTE: This job uses the built-in aggregator.
      NOTE: The output directory must NOT exist at the time of the run

      Amazon will scale EC2 instances to consume the load dynamically.

7. Pick up your results in the output folder
AWS Demo App
AWS has a very cool publicly-available app to
run:

elastic-mapreduce --create --stream \
     --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input s3://elasticmapreduce/samples/wordcount/input \
     --output s3://bucket/output \
     --reducer aggregate



See Amazon Example Doc
Possibilities
EMR is a fully-functional Hadoop
implementation.

Mappers and reducers can be written in Python,
Ruby, PHP, or Java.

Go crazy.
Further Reading
Tom White's O'Reilly on Hadoop

AWS EMR Getting Started Guide

Hadoop Wiki
