MapReduce by JRuby and DSL
     Hadoop Papyrus

        2010/8/28
      JRubyKaigi 2010
 藤川幸一 FUJIKAWA Koichi @fujibee
What’s Hadoop?
• A framework for parallel, distributed processing of big data
• An OSS clone of Google MapReduce
• Built for data processing beyond the terabyte scale
  – Reading 400 TB (web-scale data) from a standard HDD at 50 MB/s would take over 2,000 hours
  – So we need a distributed file system and a parallel processing framework!
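
A quick back-of-the-envelope check of that figure (a minimal sketch in Ruby; the 400 TB and 50 MB/s numbers are the ones from the bullet above):

total_bytes   = 400 * 10**12      # 400 TB of web-scale data
bytes_per_sec = 50 * 10**6        # one standard HDD reading at 50 MB/s
hours = total_bytes / bytes_per_sec / 3600.0
puts hours                        # => roughly 2,222 hours on a single disk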
Hadoop Papyrus
• My own OSS project
  – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus
• A framework for running Hadoop jobs described in a (J)Ruby DSL
  – Hadoop jobs are normally written in Java
  – The same procedure that is very complex in Java takes just a few lines of Ruby!
• Supported by the IPA MITOH 2009 project (government funding)
• Can be run via a Hudson (CI tool) plug-in
Step.1
We don't have to use Java; we can write it in Ruby!
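
For example, a word count might look roughly like this in Ruby (an illustrative sketch only; the output.emit helper and the method signatures are assumptions for this slide, not the exact Hadoop Papyrus API):

# Illustrative sketch, not the actual Hadoop Papyrus API:
# plain Ruby map/reduce callbacks for a word count.
def map(key, value, output)
  # split each input line into words and emit (word, 1) per token
  value.to_s.split(/\s+/).each { |word| output.emit(word, 1) }
end

def reduce(key, values, output)
  # sum the counts collected for each word
  output.emit(key, values.inject(0) { |sum, v| sum + v.to_i })
end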
Step.2
Simple job descriptions via a Ruby DSL

[Diagram: Map, Reduce, and Job Description pieces expressed through a Log Analysis DSL]
Step.3
Set up the Hadoop server environment easily with Hudson
In Java, 70 lines are needed:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args)
        .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop Papyrus needs only 10 lines:

dsl 'LogAnalysis'

from 'test/in'
to 'test/out'

pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
column_name :link

topic "link num", :label => 'n' do
  count_uniq column[:link]
end
Hadoop Papyrus Details
• Invokes the Ruby script via JRuby inside the Java processes that run
  Map/Reduce
Hadoop Papyrus Details (cont'd)
• In addition, you write a DSL script for the processing you want (log analysis,
  etc.). Papyrus chooses the right behavior for each phase (Map, Reduce, or job
  initialization), so only one script is needed (see the conceptual sketch below).
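
Conceptually, the phase selection could work like the sketch below (a hedged illustration, not the actual Papyrus internals; the PAPYRUS_PHASE variable and the on helper are hypothetical names for this slide):

# Conceptual sketch: the same script is evaluated in every phase,
# and only the block matching the current phase actually runs.
PHASE = ENV['PAPYRUS_PHASE']   # hypothetical: 'setup', 'map', or 'reduce'

def on(phase)
  yield if phase.to_s == PHASE
end

on(:setup)  { puts 'configure the job: input/output paths, pattern, columns' }
on(:map)    { puts 'run the map logic for each input record' }
on(:reduce) { puts 'run the reduce logic for each key' }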
ありがとうございました! Thank you!




      Twitter ID: @fujibee
