Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

MapReduce
History of MapReduce



        • First implemented by Google
        • Used in CouchDB, Hadoop, etc.
        • Helps to “distill” data into a concentrated result set




What is MapReduce?




   input = ["deer", "bear",
            "river", "car", "car", "river",
            "deer", "car", "bear"]

   input.map! { |x| [x, 1] }

   sum = 0
   input.each do |x|
     sum += x[1]
   end
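
   The snippet above collapses the reduce step into a single running sum over every pair. To show the full
   map, shuffle, and reduce flow in one place, here is a minimal plain-Ruby sketch of the same word count
   (an illustration only, not Hadoop code; the grouping step stands in for Hadoop's shuffle/sort):

   input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

   # Map: emit a [word, 1] pair for every word
   mapped = input.map { |word| [word, 1] }

   # Shuffle: group the pairs by word (Hadoop does this between map and reduce)
   grouped = mapped.group_by { |word, _count| word }

   # Reduce: sum the counts for each word
   counts = grouped.map { |word, pairs| [word, pairs.map(&:last).inject(0, :+)] }

   counts.each { |word, count| puts "#{word}\t#{count}" }
   # deer 2, bear 2, river 2, car 3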




Hadoop Breakdown
History of Hadoop



        •Created by Doug Cutting @ Yahoo!
        •Named after a toy elephant
        •A framework for distributed computing
        •Includes a distributed filesystem (HDFS)




Network Topology


Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes
                         •Uses “heartbeat” messages to determine a node’s health
                         •The node where the most resources should be spent



                          [Cluster diagram repeated from above; a ♥ between the NameNode and a DataNode marks the heartbeat check]
Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled by spinning nodes up or down
                         •Replicates blocks according to a set replication factor



                          [Cluster diagram repeated from above]
Hadoop Cluster

                         JobTracker
                         •Delegates which TaskTrackers should handle a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode where the source exists


                          [Cluster diagram repeated from above; a ♥ between the JobTracker and the NameNode marks their communication]
Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



                          [Cluster diagram repeated from above]
HDFS

                                           hadoop fs -put localfile /user/hadoop/hadoopfile




                         [Cluster diagram repeated from above]
Hadoop Streaming
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                          -input "/user/me/samples/cachefile/input.txt" \
                          -mapper "xargs cat" \
                          -reducer "cat" \
                          -output "/user/me/samples/cachefile/out" \
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                          -jobconf mapred.map.tasks=3 \
                          -jobconf mapred.reduce.tasks=3 \
                          -jobconf mapred.job.name="Experiment"
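
        The mapper and reducer here are just shell commands; with streaming they can be any executables
        that read lines from STDIN and write tab-separated key/value lines to STDOUT, which is what makes
        plain Ruby scripts usable. A minimal word-count sketch (file names are illustrative; they would be
        passed via -mapper/-reducer and shipped to the cluster with -file):

        #!/usr/bin/env ruby
        # mapper.rb: emit "word<TAB>1" for every word read from STDIN
        STDIN.each_line do |line|
          line.split.each { |word| puts "#{word}\t1" }
        end

        #!/usr/bin/env ruby
        # reducer.rb: streaming sorts mapper output by key, so identical words arrive
        # together; sum the 1s for each run of identical words
        current_word, count = nil, 0
        STDIN.each_line do |line|
          word, n = line.chomp.split("\t")
          if word == current_word
            count += n.to_i
          else
            puts "#{current_word}\t#{count}" if current_word
            current_word, count = word, n.to_i
          end
        end
        puts "#{current_word}\t#{count}" if current_word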




                         [Cluster diagram repeated from above]
Hadoop Streaming




           Hadoop Ecosystem
                          Pig:    Pig Latin
                          Hive:   SQL-ish
                          Wukong: Ruby!
Wukong


        •Infochimps
        •Currently under heavy development
        •Use the 3.0.0.pre3 gem (Gemfile sketch below)
            https://github.com/infochimps-labs/wukong/tree/3.0.0
        •Model your jobs with wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop
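
        A minimal Gemfile for following along might look like the sketch below; the wukong version is the
        pre-release named above, and the wukong-hadoop gem name is an assumption (if it is not published
        on RubyGems, point Bundler at the GitHub repo instead):

        # Gemfile (sketch)
        source 'https://rubygems.org'

        gem 'wukong', '3.0.0.pre3'
        # assumed gem name; provides the wu-hadoop CLI used later in this deck
        gem 'wukong-hadoop'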




Wukong



            Wukong
            •Write mappers and reducers using Ruby
            •As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks

            wukong-hadoop
            •A CLI to use with Hadoop
            •Created around building tasks with Wukong
            •Better than piping in the shell (you can see this with --dry_run)

Wukong Processors

        •Fields are accessible through switches in the shell
        •Local hand-off is made from STDOUT to STDIN

        Wukong.processor(:mapper) do

          field :min_length, Integer,  :default => 1
          field :max_length, Integer,  :default => 256
          field :split_on,   Regexp,   :default => /\s+/
          field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
          field :fold_case,  :boolean, :default => false

          def process string
            tokenize(string).each do |token|
              yield token if acceptable?(token)
            end
          end

          private

          def tokenize string
            string.split(split_on).map do |token|
              stripped = token.gsub(remove, '')
              fold_case ? stripped.downcase : stripped
            end
          end

          def acceptable? token
            (min_length..max_length).include?(token.length)
          end
        end




Wukong Processors



                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count
                           
                           def start record
                             self.count = 0
                           end
                           
                           def accumulate record
                             self.count += 1
                           end

                           def finalize
                             yield [key, count].join("\t")
                           end
                         end
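
                         Reading the accumulator this way (an interpretation, not Wukong's documented
                         contract): the reducer sees the mapper's tokens already sorted, calls start on the
                         first record of each run of identical tokens, accumulate once per record, and
                         finalize when the token changes. A plain-Ruby sketch of that lifecycle:

                         # Local illustration of the accumulate-per-key lifecycle (not Wukong code)
                         sorted_tokens = ["bear", "bear", "car", "car", "car", "deer", "deer"]
                         sorted_tokens.chunk { |t| t }.each do |key, group|
                           count = 0                      # start
                           group.each { count += 1 }      # accumulate, once per record
                           puts [key, count].join("\t")   # finalize
                         end
                         # bear 2 / car 3 / deer 2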




Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
                            --mode=local \
                            --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"




                                      Simpsons - Ep 8
                                      do 7
                                      Doctor     1
                                      Does 2
                                      doesn't    1
                                      dog 2
                                      D'oh 1
                                      doif 1
                                      doing      2
                                      done 1
                                      doneYou    1
                                      don't 10
                                      Don't 1




The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com



