Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

MapReduce
History of MapReduce



        • First implemented by Google
        • Used in CouchDB, Hadoop, etc.
        • Helps to “distill” data into a concentrated result set




What is MapReduce?




   input = ["deer", "bear",
            "river", "car", "car", "river",
            "deer", "car", "bear"]

   input.map! { |x| [x, 1] }

   sum = 0
   input.each do |x|
     sum += x[1]
   end
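
   The snippet above collapses the reduce step into a single running sum over every pair. To show the full
   map, shuffle, and reduce flow in one place, here is a minimal plain-Ruby sketch of the same word count
   (an illustration only, not Hadoop code; the grouping step stands in for Hadoop's shuffle/sort):

   input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

   # Map: emit a [word, 1] pair for every word
   mapped = input.map { |word| [word, 1] }

   # Shuffle: group the pairs by word (Hadoop does this between map and reduce)
   grouped = mapped.group_by { |word, _count| word }

   # Reduce: sum the counts for each word
   counts = grouped.map { |word, pairs| [word, pairs.map(&:last).inject(0, :+)] }

   counts.each { |word, count| puts "#{word}\t#{count}" }
   # deer 2, bear 2, river 2, car 3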




Hadoop Breakdown
History of Hadoop



        •Created by Doug Cutting @ Yahoo!
        •Named after a toy elephant
        •A framework for distributed computing
        •Includes a distributed filesystem (HDFS)




Network Topology


Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes
                         •Uses “heartbeat” messages to determine a node’s health
                         •The node where the most resources should be spent



                          [Cluster diagram repeated from above; a ♥ between the NameNode and a DataNode marks the heartbeat check]
Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled by spinning nodes up or down
                         •Replicates blocks according to a set replication factor



                          [Cluster diagram repeated from above]
Hadoop Cluster

                         JobTracker
                         •Delegates which TaskTrackers should handle a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode where the source exists


                          [Cluster diagram repeated from above; a ♥ between the JobTracker and the NameNode marks their communication]
Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



                          [Cluster diagram repeated from above]
HDFS

                                           hadoop fs -put localfile /user/hadoop/hadoopfile




                         [Cluster diagram repeated from above]
Hadoop Streaming
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                          -input "/user/me/samples/cachefile/input.txt" \
                          -mapper "xargs cat" \
                          -reducer "cat" \
                          -output "/user/me/samples/cachefile/out" \
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                          -jobconf mapred.map.tasks=3 \
                          -jobconf mapred.reduce.tasks=3 \
                          -jobconf mapred.job.name="Experiment"
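
        The mapper and reducer here are just shell commands; with streaming they can be any executables
        that read lines from STDIN and write tab-separated key/value lines to STDOUT, which is what makes
        plain Ruby scripts usable. A minimal word-count sketch (file names are illustrative; they would be
        passed via -mapper/-reducer and shipped to the cluster with -file):

        #!/usr/bin/env ruby
        # mapper.rb: emit "word<TAB>1" for every word read from STDIN
        STDIN.each_line do |line|
          line.split.each { |word| puts "#{word}\t1" }
        end

        #!/usr/bin/env ruby
        # reducer.rb: streaming sorts mapper output by key, so identical words arrive
        # together; sum the 1s for each run of identical words
        current_word, count = nil, 0
        STDIN.each_line do |line|
          word, n = line.chomp.split("\t")
          if word == current_word
            count += n.to_i
          else
            puts "#{current_word}\t#{count}" if current_word
            current_word, count = word, n.to_i
          end
        end
        puts "#{current_word}\t#{count}" if current_word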




                         [Cluster diagram repeated from above]
Hadoop Streaming




           Hadoop Ecosystem
                          Pig:    Pig Latin
                          Hive:   SQL-ish
                          Wukong: Ruby!
Wukong


        •Infochimps
        •Currently under heavy development
        •Use the 3.0.0.pre3 gem (Gemfile sketch below)
            https://github.com/infochimps-labs/wukong/tree/3.0.0
        •Model your jobs with wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop
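
        A minimal Gemfile for following along might look like the sketch below; the wukong version is the
        pre-release named above, and the wukong-hadoop gem name is an assumption (if it is not published
        on RubyGems, point Bundler at the GitHub repo instead):

        # Gemfile (sketch)
        source 'https://rubygems.org'

        gem 'wukong', '3.0.0.pre3'
        # assumed gem name; provides the wu-hadoop CLI used later in this deck
        gem 'wukong-hadoop'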




Wukong



            Wukong
            •Write mappers and reducers using Ruby
            •As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks

            wukong-hadoop
            •A CLI to use with Hadoop
            •Created around building tasks with Wukong
            •Better than piping in the shell (you can see this with --dry_run)

Wukong Processors

        •Fields are accessible through switches in the shell
        •Local hand-off is made from STDOUT to STDIN

        Wukong.processor(:mapper) do

          field :min_length, Integer,  :default => 1
          field :max_length, Integer,  :default => 256
          field :split_on,   Regexp,   :default => /\s+/
          field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
          field :fold_case,  :boolean, :default => false

          def process string
            tokenize(string).each do |token|
              yield token if acceptable?(token)
            end
          end

          private

          def tokenize string
            string.split(split_on).map do |token|
              stripped = token.gsub(remove, '')
              fold_case ? stripped.downcase : stripped
            end
          end

          def acceptable? token
            (min_length..max_length).include?(token.length)
          end
        end




Wukong Processors



                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count
                           
                           def start record
                             self.count = 0
                           end
                           
                           def accumulate record
                             self.count += 1
                           end

                           def finalize
                             yield [key, count].join("\t")
                           end
                         end
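
                         Reading the accumulator this way (an interpretation, not Wukong's documented
                         contract): the reducer sees the mapper's tokens already sorted, calls start on the
                         first record of each run of identical tokens, accumulate once per record, and
                         finalize when the token changes. A plain-Ruby sketch of that lifecycle:

                         # Local illustration of the accumulate-per-key lifecycle (not Wukong code)
                         sorted_tokens = ["bear", "bear", "car", "car", "car", "deer", "deer"]
                         sorted_tokens.chunk { |t| t }.each do |key, group|
                           count = 0                      # start
                           group.each { count += 1 }      # accumulate, once per record
                           puts [key, count].join("\t")   # finalize
                         end
                         # bear 2 / car 3 / deer 2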




Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
                            --mode=local \
                            --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"




                                      Simpsons - Ep 8
                                      do 7
                                      Doctor     1
                                      Does 2
                                      doesn't    1
                                      dog 2
                                      D'oh 1
                                      doif 1
                                      doing      2
                                      done 1
                                      doneYou    1
                                      don't 10
                                      Don't 1




The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com



