MapReduce by JRuby and DSL
     Hadoop Papyrus

        2010/8/28
      JRubyKaigi 2010
 藤川幸一 FUJIKAWA Koichi @fujibee
What’s Hadoop?
• A framework for parallel, distributed processing of big data
• An OSS clone of Google MapReduce
• Built for data processing beyond the terabyte scale
  – Reading 400 TB (web-scale data) from a standard HDD at 50 MB/s would take over 2,000 hours
  – So we need a distributed file system and a parallel processing framework!
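
A quick back-of-the-envelope check of that figure (a minimal sketch in Ruby; the 400 TB and 50 MB/s numbers are the ones from the bullet above):

total_bytes   = 400 * 10**12      # 400 TB of web-scale data
bytes_per_sec = 50 * 10**6        # one standard HDD reading at 50 MB/s
hours = total_bytes / bytes_per_sec / 3600.0
puts hours                        # => roughly 2,222 hours on a single disk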
Hadoop Papyrus
• My own OSS project
  – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus
• A framework for running Hadoop jobs described in a (J)Ruby DSL
  – Hadoop jobs are normally written in Java
  – The same procedure that is very complex in Java takes just a few lines of Ruby!
• Supported by the IPA MITOH 2009 project (government funding)
• Can be run via a Hudson (CI tool) plug-in
Step.1
We don't have to use Java; we can write it in Ruby!
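
For example, a word count might look roughly like this in Ruby (an illustrative sketch only; the output.emit helper and the method signatures are assumptions for this slide, not the exact Hadoop Papyrus API):

# Illustrative sketch, not the actual Hadoop Papyrus API:
# plain Ruby map/reduce callbacks for a word count.
def map(key, value, output)
  # split each input line into words and emit (word, 1) per token
  value.to_s.split(/\s+/).each { |word| output.emit(word, 1) }
end

def reduce(key, values, output)
  # sum the counts collected for each word
  output.emit(key, values.inject(0) { |sum, v| sum + v.to_i })
end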
Step.2
Simple job descriptions via a Ruby DSL

[Diagram: Map, Reduce, and Job Description pieces expressed through a Log Analysis DSL]
Step.3
Set up the Hadoop server environment easily with Hudson
In Java, 70 lines are needed:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args)
        .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop Papyrus needs only 10 lines:

dsl 'LogAnalysis'

from 'test/in'
to 'test/out'

pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
column_name :link

topic "link num", :label => 'n' do
  count_uniq column[:link]
end
Hadoop Papyrus Details
• Invokes the Ruby script via JRuby inside the Java processes that run
  Map/Reduce
Hadoop Papyrus Details (cont'd)
• In addition, you write a DSL script for the processing you want (log analysis,
  etc.). Papyrus chooses the right behavior for each phase (Map, Reduce, or job
  initialization), so only one script is needed (see the conceptual sketch below).
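
Conceptually, the phase selection could work like the sketch below (a hedged illustration, not the actual Papyrus internals; the PAPYRUS_PHASE variable and the on helper are hypothetical names for this slide):

# Conceptual sketch: the same script is evaluated in every phase,
# and only the block matching the current phase actually runs.
PHASE = ENV['PAPYRUS_PHASE']   # hypothetical: 'setup', 'map', or 'reduce'

def on(phase)
  yield if phase.to_s == PHASE
end

on(:setup)  { puts 'configure the job: input/output paths, pattern, columns' }
on(:map)    { puts 'run the map logic for each input record' }
on(:reduce) { puts 'run the reduce logic for each key' }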
ありがとうございました! Thank you!




      Twitter ID: @fujibee
