MapReduce with Hadoop and Ruby

Ohai Hadoop!
Build your first MapReduce with
Hadoop & Ruby

Who am I?
Tweet @_swanand
GitHub @swanandp
StackOverflow @18678
Work @Kaverisoft
Make { DispatchTrack }
mailto:swanand@pagnis.in
Ruby, CoffeeScript, Java, Rails, Sinatra, Android, TextMate, Emacs,
Minitest, MySQL, Cassandra, Hadoop, Mountain Lion, Curl, Zsh, GMail,
Solarized, Oscar Wilde, Robert Jordan, Quentin Tarantino, Charlize Theron

Tell 'em what you're going to tell 'em
● MapReduce! Wait, what?
● Enter the Hadoop. *gong*
● Convention over Configuration? You wish.
● Instant Gratification. Now you're talkin'
● Further Reading. Go forth and read!

MapReduce! Wait, what?
● Map: Given a set of values (or key-values),
output another set of values (or key-values)
● [K1, V1] -> map -> [K2, V2]
● Map each value into a new value
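A minimal sketch of the map step in plain Ruby, with no Hadoop involved. The map_pair function and the sample pairs are hypothetical, made up only to show one [K1, V1] pair turning into one [K2, V2] pair.

```ruby
# Hypothetical mapper: takes [line_number, text] and emits
# [downcased_text, text_length]. It keeps no state between calls.
def map_pair(_line_number, text)
  [text.downcase, text.length]
end

input  = [[1, "Ohai"], [2, "Hadoop"]]
output = input.map { |line_number, text| map_pair(line_number, text) }

p output # prints [["ohai", 4], ["hadoop", 6]]
```

Because map_pair only looks at the pair it is given, the pairs could be processed in any order, or on different machines, with the same result.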

MapReduce! Wait, what?
● Reduce: Given a set of values for a key,
come up with a summarized version
● K1[V1, V2 ... Vn] -> reduce -> K1[Y]
● Reduce given values into 1 value
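A minimal sketch of the reduce step in plain Ruby. The reduce_pair function is hypothetical, and summing is just one possible way to summarize; any operation that folds the values of one key into a single value would fit.

```ruby
# Hypothetical reducer: takes a key and all of its values,
# returns the key with one summarized value (here, the sum).
def reduce_pair(key, values)
  [key, values.reduce(0) { |sum, value| sum + value }]
end

p reduce_pair("e", [1, 1, 1]) # prints ["e", 3]
```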

MapReduce! Um.. hmm..
Q: What is the single biggest takeaway from
mapping?
A: The map operation is stateless, i.e. one iteration
doesn't depend on the previous iteration.
Q: What is the single biggest takeaway from
reducing?
A: A reduce operation works on all the values of one
particular key.

Enter the Hadoop. *gong*
"The really interesting thing I want you to notice,
here, is that as soon as you think of map and
reduce as functions that everybody can use, and
they use them, you only have to get one
supergenius to write the hard code to run map and
reduce on a global massively parallel array of
computers, and all the old code that used to work
fine when you just ran a loop still works only it's a
zillion times faster which means it can be used to
tackle huge problems in an instant."
- Joel Spolsky

MapReduce! Oh, yeah!
1. Convert raw data into readable format
2. Iterate over data chunks, convert each
chunk into meaningful key, value pairs
3. Do this for all your data using massive
parallelization
4. Group all the keys and their respective
values
5. Take values for a key and convert into
desired meaningful format
Step 2 is called the mapper.
Step 5 is called the reducer.
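The steps above can be sketched on a single machine in plain Ruby. This is a toy word count under stated assumptions: the data is already readable (step 1), the parallelization of step 3 is left out, and mapper, group, reducer and run_job are hypothetical names.

```ruby
# Step 2: turn one chunk of data into [key, value] pairs.
def mapper(chunk)
  chunk.split.map { |word| [word, 1] }
end

# Step 4: collect all the values that share a key.
def group(pairs)
  pairs.group_by(&:first).transform_values { |ps| ps.map(&:last) }
end

# Step 5: summarize the values of one key.
def reducer(key, values)
  [key, values.sum]
end

# Steps 2 to 5 chained together, sequentially instead of in parallel.
def run_job(chunks)
  pairs = chunks.flat_map { |chunk| mapper(chunk) }
  group(pairs).map { |key, values| reducer(key, values) }.to_h
end

p run_job(["ohai hadoop", "ohai ruby"])
# prints {"ohai"=>2, "hadoop"=>1, "ruby"=>1} (Ruby 3.4+ prints it with spaces around =>)
```

Only mapper and reducer are specific to the problem; the grouping and the chaining are the generic part that a framework can take over.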

Enter the Hadoop. *gong*
Same process has now become:
1. Put data into Hadoop
2. Define your mapper
3. Define your reducer
4. Run your jobs
5. Read processed data from Hadoop
Other advantages:
● Encapsulations over common problems
like large files, process management, disk
/ node failure
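As a rough sketch, steps 1, 4 and 5 above map onto three hadoop commands. The Ruby below only builds and prints them; it does not run anything. The jar location, the HDFS directory names and the script names are all placeholders, and the exact streaming jar path depends on the Hadoop version installed. It also assumes the mapper and reducer scripts are executable.

```ruby
# Placeholder: the real location of the streaming jar varies by install.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"

# Builds the commands for: put data in, run the job, read results out.
def job_commands(local_input:, hdfs_input:, hdfs_output:, mapper:, reducer:,
                 streaming_jar: STREAMING_JAR)
  [
    # Step 1: put data into Hadoop (HDFS).
    %W[hadoop fs -put #{local_input} #{hdfs_input}],
    # Step 4: run the job, shipping the mapper and reducer scripts with it.
    %W[hadoop jar #{streaming_jar}
       -input #{hdfs_input} -output #{hdfs_output}
       -mapper #{mapper} -reducer #{reducer}
       -file #{mapper} -file #{reducer}],
    # Step 5: read the processed data back out of HDFS.
    %W[hadoop fs -cat #{hdfs_output}/part-*]
  ]
end

commands = job_commands(local_input: "books", hdfs_input: "books-in",
                        hdfs_output: "books-out",
                        mapper: "mapper.rb", reducer: "reducer.rb")
commands.each { |command| puts command.join(" ") }
```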

Convention over Configuration? You wish.
Component    Role                        Configured in
Job          Top Level Descriptor        -
Task         job has_many tasks          -
NameNode     HDFS Boss                   core-site.xml
DataNode     HDFS Slaves                 slaves
JobTracker   MapReduce Boss              mapred-site.xml
TaskTracker  MapReduce Slave             mapred-site.xml
Client       User's window into Hadoop,  -
             through the command hadoop

● Configuration in XML & Shell scripts. Yuck!
● Respite:
○ Option for specifying a configuration directory
○ Shell script configuration is mostly ENV variables
● Which means:
○ Configuration can be written in YML or JSON or
Ruby and exported in XML
○ ENV variables can be set using rake, thor or just
plain Ruby
● Caveats:
○ No standard wrapper to do this (Go write one!)
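One way such a wrapper could look, as a sketch only: keep the settings in a plain Ruby Hash and export Hadoop's property XML from it. The helper names are hypothetical, and the two property names and their values are just examples of Hadoop 1.x style settings, not a recommended configuration.

```ruby
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escapes the characters that are not allowed verbatim in XML text.
def escape_xml(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Turns { name => value } into a Hadoop-style <configuration> document.
def to_hadoop_xml(settings)
  properties = settings.map do |name, value|
    "  <property>\n" \
    "    <name>#{escape_xml(name)}</name>\n" \
    "    <value>#{escape_xml(value)}</value>\n" \
    "  </property>\n"
  end
  "<?xml version=\"1.0\"?>\n<configuration>\n#{properties.join}</configuration>\n"
end

# Example values only; adjust hosts and ports to the actual cluster.
core_site   = { "fs.default.name" => "hdfs://localhost:9000" }
mapred_site = { "mapred.job.tracker" => "localhost:9001" }

puts to_hadoop_xml(core_site)
puts to_hadoop_xml(mapred_site)
```

The returned strings could then be written to core-site.xml and mapred-site.xml inside whichever configuration directory Hadoop is pointed at.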

● Default mappers and reducers are defined in
Java
● Other languages supported using Streaming
API
● Streaming API reads input from STDIN and
writes output to STDOUT, and runs any
executable as the mapper or reducer
● Caveats
○ No dependency management, we are on our own
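A sketch of what the STDIN/STDOUT contract means for a reducer written in Ruby. With Streaming, the reducer receives plain lines of the form key, tab, value, sorted by key, rather than one ready-made group per key, so the script has to notice where one key ends and the next begins. The function name sum_by_key is hypothetical, and StringIO stands in for the real streams.

```ruby
require "stringio"

# Reads sorted "key\tvalue" lines and writes one "key\ttotal" line per key.
def sum_by_key(input, output)
  current_key = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current_key
      # A new key started: flush the total of the previous one.
      output.puts "#{current_key}\t#{total}" unless current_key.nil?
      current_key = key
      total = 0
    end
    total += value.to_i
  end
  # Flush the last key once the input is exhausted.
  output.puts "#{current_key}\t#{total}" unless current_key.nil?
end

result = StringIO.new
sum_by_key(StringIO.new("a\t1\na\t2\nb\t5\n"), result)
print result.string # prints "a\t3" and "b\t5" on separate lines
```

In an actual streaming script the call would be sum_by_key($stdin, $stdout).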

Instant Gratification. Now you're talkin'
GOAL:
1. Take a couple of books in txt format
2. Find out the total usage of each letter of
the English alphabet.
3. Establish that e is the most used.
4. Why this example?
a. Perfect use case for MapReduce.
b. Algorithm is simple.
c. Results are simple to analyze.
d. Txt formatted books are easily available in Project
Gutenberg.
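A sketch of how the mapper and reducer for this goal might look. The function names are hypothetical, the sample sentence is made up, and StringIO stands in for STDIN and STDOUT so the example runs without Hadoop. Only the letters a to z are counted, after downcasing.

```ruby
require "stringio"

# Mapper: emit "letter\t1" for every letter a-z found in the input.
def map_letters(input, output)
  input.each_line do |line|
    line.downcase.scan(/[a-z]/) { |letter| output.puts "#{letter}\t1" }
  end
end

# Reducer: add up the counts per letter and write "letter\ttotal" lines.
# Returns the totals Hash so the caller can inspect it.
def reduce_letters(input, output)
  totals = Hash.new(0)
  input.each_line do |line|
    letter, count = line.chomp.split("\t", 2)
    totals[letter] += count.to_i
  end
  totals.sort.each { |letter, total| output.puts "#{letter}\t#{total}" }
  totals
end

# Local dry run on a made-up sentence instead of a whole book.
mapped = StringIO.new
map_letters(StringIO.new("Eee, Hadoop!\n"), mapped)
mapped.rewind
totals = reduce_letters(mapped, StringIO.new)
p totals.max_by { |_letter, total| total } # prints ["e", 3]
```

Split into two scripts that call map_letters($stdin, $stdout) and reduce_letters($stdin, $stdout), the same code could serve as the mapper and reducer of a streaming job over the books.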

Further Reading and Watching
● Official Documentation
● Wiki: http://wiki.apache.org/hadoop/
● Hadoop examples that ship with Hadoop
● http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop
● http://www.youtube.com/watch?v=d2xeNpfzsYI
