Cascading
through Hadoop
Simpler MapReduce through data flows
by Matthew McCullough, Ambient Ideas, LLC
Matthew
McCullough
Using Hadoop?
Work with Big Data?
Familiar with MapReduce?
http://delicious.com/matthew.mccullough/cascading
http://delicious.com/matthew.mccullough/hadoop
http://github.com/matthewmccullough/cascading-course
MapReduce
a quick review...
classical Map & Reduce
now MapReduce®
Raw Data → Split → Map → Shuffle → Reduce → Processed Data
Hadoop Java API implementation...
// The WordCount Mapper
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
// The WordCount Reducer
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
but wait...
// The WordCount main()
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs =
      new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
and how about multiple files?
package org.apache.hadoop.examples;
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
//set the InputFormat of the job to our InputFormat
job.setInputFormat(MyInputFormat.class);
// the keys are words (strings)
job.setOutputKeyClass(Text.class);
// the values are counts (ints)
job.setOutputValueClass(IntWritable.class);
//use the defined mapper
job.setMapperClass(MapClass.class);
//use the WordCount Reducer
job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(LongSumReducer.class);
FileInputFormat.addInputPaths(job, args[0]);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new MultiFileWordCount(), args);
System.exit(ret);
}
}
// The WordCount main()
public static void main(String[] arg
Coding a Java Flow
public class SimplestPipe1Flip {
  public static void main(String[] args) {
    String inputPath = "data/babynamedefinitions.csv";
    String outputPath = "output/simplestpipe1";

    // Source: read comma-delimited (name, definition) records
    Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
    Tap source = new Hfs( sourceScheme, inputPath );

    // Sink: write the fields in flipped order, delimited by " ++ "
    Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

    // A bare pipe; the field reordering is done by the sink scheme
    Pipe assembly = new Pipe( "flip" );

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
    FlowConnector flowConnector = new FlowConnector( properties );

    // Bind source and sink taps to the assembly and run the flow
    Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
    flow.complete();
  }
}
The Author
Ignoring that Hadoop is as much about analytics as it is about integration leads to a fair number of compromises, including, but not exclusive to a loss in quality of life (in trade for a false sense of accomplishment)
-Chris Wensel, Cascading Inventor
http://cascading.org
http://concurrentinc.com
citizen of the big data domain
proper level of abstraction
for Hadoop
Hadoop: 2011
who's using Hadoop?
-Meetup.com
-AOL
-Bing
-Facebook
-Netflix
-Yahoo
-Twitter
Hadoop is as much about analytics as it is about integration. Ignoring that leads to crazy complex tool chains that typically involve XML
-Chris Wensel, Cascading Inventor
✓ Two humans
✓ IBM cluster
✓ Hadoop
✓ Java
Go!
Hadoop DSLs
Pig approximates ETL
#Pig Script
Person = LOAD 'people.csv' using PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames
GENERATE group, COUNT(OrderedNames);
store NameCount into 'names.out';
Hive approximates SQL
#Hive Script
LOAD DATA INPATH 'shakespeare_freq'
INTO TABLE shakespeare;
SELECT * FROM shakespeare
WHERE freq > 100
SORT BY freq ASC
LIMIT 10;
Cascading Groovy
approximates MapReduce
//Cascading Groovy Script
def cascading = new Cascading()
def builder = cascading.builder();
Flow flow = builder.flow("wordcount")
{
source(input, scheme: text())
tokenize(/[.,]*\s+/)
group()
count()
group(["count"], reverse: true)
sink(output, delete: true)
}
Cascalog approximates Datalog
#Cascalog Script
(?<- (stdout) [?person] (age ?person 25))
Here's another faux DSL for you!
Don't worry, Martin. Cascading isn't a DSL. Really.
The Metaphor
Divide & Conquer with a different metaphor: water
Pipes
Taps
Source
Sink
Flows
Planner, to optimize parallelism
Tuples
A tuple is an ordered list of elements:
["Matthew", 2, true]
A tuple stream is a sequence of tuples:
["Matthew", 2, true], ["Jay", 2, true], ["Peter", 0, false]
A co-group joins two tuple streams on a common field:
["Matthew", "Red"], ["Jay", "Grey"], ["Peter", "Brown"]
["Matthew", 2, true], ["Jay", 2, true], ["Peter", 0, false]
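For illustration, here is a minimal sketch (not from the deck) of constructing a tuple with Cascading's Fields, Tuple, and TupleEntry classes; the field names and values are made up for the example:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class TupleSketch {
  public static void main(String[] args) {
    // Field names describe each position in the tuple
    Fields fields = new Fields( "name", "count", "active" );
    // An ordered list of values, analogous to ["Matthew", 2, true]
    Tuple tuple = new Tuple( "Matthew", 2, true );
    // A TupleEntry pairs the field names with the values
    TupleEntry entry = new TupleEntry( fields, tuple );
    System.out.println( entry.getString( "name" ) ); // prints: Matthew
  }
}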
The Process
A pipe assembly has a head and a tail. A source tap feeds tuples into the head; a sink tap receives them from the tail. The pipe assembly plus its bound source and sink taps make up a flow.
Late binding to taps
public class SimplestPipe1Flip {
  public static void main(String[] args) {
    String inputPath = "data/babynamedefinitions.csv";
    String outputPath = "output/simplestpipe1";
    Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "flip" );
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
    FlowConnector flowConnector = new FlowConnector( properties );
    // The assembly is defined independently; the taps are bound only here, at connect time
    Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
    flow.complete();
  }
}
Pipe Types
Each
Every
GroupBy
CoGroup
Sub-Assembly
A pipe assembly built from these pipe types and bound to taps is a Flow; a flow is internally a DAG, and multiple flows chain together into a Cascade.
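To show how these pipe types compose, here is a hedged sketch (not from the deck) of the classic Cascading word count, written in the same style as the SimplestPipe examples; the class name, input path, and field names are assumptions:

public class SimplestPipeWordCount {
  public static void main(String[] args) {
    String inputPath = "data/sometext.txt";   // hypothetical input file
    String outputPath = "output/wordcount";

    // Source: each input line arrives as a single "line" field
    Scheme sourceScheme = new TextLine( new Fields( "line" ) );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ), "\t" );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

    Pipe assembly = new Pipe( "wordcount" );
    // Each: apply a function to every tuple -- split each line into "word" tuples
    assembly = new Each( assembly, new Fields( "line" ),
        new RegexSplitGenerator( new Fields( "word" ), "\\s+" ) );
    // GroupBy: group the tuple stream on the "word" field
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    // Every: apply an aggregator to each group -- count the tuples per word
    assembly = new Every( assembly, new Count( new Fields( "count" ) ) );

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipeWordCount.class );
    FlowConnector flowConnector = new FlowConnector( properties );
    Flow flow = flowConnector.connect( "wordcountflow", source, sink, assembly );
    flow.complete();
  }
}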
public class SimplestPipe3CoGroup {
  public static void main(String[] args) {
    String inputPathDefinitions = "data/babynamedefinitions.csv";
    String inputPathCounts = "data/babynamecounts.csv";
    String outputPath = "output/simplestpipe3";
    Scheme sourceSchemeDefinitions = new TextDelimited( new Fields( "name", "definition" ), "," );
    Scheme sourceSchemeCounts = new TextDelimited( new Fields( "name", "count" ), "," );
    Tap sourceDefinitions = new Hfs( sourceSchemeDefinitions, inputPathDefinitions );
    Tap sourceCounts = new Hfs( sourceSchemeCounts, inputPathCounts );
    Scheme sinkScheme = new TextDelimited( new Fields( "dname", "count", "definition" ), " ^^^ " );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe definitionspipe = new Pipe( "definitionspipe" );
    Pipe countpipe = new Pipe( "countpipe" );
    //Join the tuple streams
    Fields commonfields = new Fields( "name" );
    Fields newfields = new Fields( "dname", "definition", "cname", "count" );
    Pipe joinpipe = new CoGroup( definitionspipe, commonfields, countpipe, commonfields,
        newfields, new InnerJoin() );
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipe3CoGroup.class );
    FlowConnector flowConnector = new FlowConnector( properties );
    Map<String, Tap> sources = new HashMap<String, Tap>();
    sources.put( "definitionspipe", sourceDefinitions );
    sources.put( "countpipe", sourceCounts );
    Flow flow = flowConnector.connect( sources, sink, joinpipe );
    flow.complete();
  }
}
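When one flow's output feeds another flow's input, the flows can be chained into a Cascade. The fragment below is only an illustration: the flipFlow and sortFlow variables are hypothetical stand-ins for flows like the ones built in these examples, and CascadeConnector runs them in dependency order.

// Chain two flows; the cascade planner orders them by their shared taps
Cascade cascade = new CascadeConnector().connect( flipFlow, sortFlow );
cascade.complete();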
Motivations
Big Data is a growing field
MapReduce is the primary technique
Hadoop is becoming the MapReduce standard
Why a new MR toolkit?
㊌ Simpler coding
㊌ More logical processing abstractions
㊌ Run MapReduce locally (see the sketch after this list)
㊌ Debug jobs with ease
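A minimal sketch of the "run MapReduce locally" point, assuming standard Hadoop 0.20 configuration keys rather than any Cascading-specific switch: the properties handed to the FlowConnector are copied into the job configuration, so pointing them at Hadoop's local runner keeps the whole flow on one machine.

Properties properties = new Properties();
// Hadoop standalone (local) mode: in-process job runner, local file system
properties.put( "mapred.job.tracker", "local" );
properties.put( "fs.default.name", "file:///" );
FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
FlowConnector flowConnector = new FlowConnector( properties );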
easy debugging...
public class SimplestPipe1Flip {
  public static void main(String[] args) {
    String inputPath = "data/babynamedefinitions.csv";
    String outputPath = "output/simplestpipe1";
    Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "flip" );
    //OPTIONAL: Debug the tuple
    //assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug() );
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
    FlowConnector flowConnector = new FlowConnector( properties );
    //OPTIONAL: Have the planner use or filter out the debugging statements
    //FlowConnector.setDebugLevel( properties, DebugLevel.VERBOSE );
    Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
    flow.complete();
  }
}
Cascading User Roles
㊌ Application executor
㊌ Process assembler
㊌ Operation developer
Hadoop is never used alone. The dirty secret is that it is really a huge ETL tool.
-Chris Wensel, Cascading Inventor
50gal Hot Water Heater
Tankless Hot Water Heater
Building
Let's prep the build
Why?
When in doubt, look at the Cascading source code. If something is not documented in this User Guide, the source code will give you clear instructions on what to do or expect.
-Chris Wensel, Cascading Inventor
https://github.com/cwensel
https://github.com/cwensel/cascading
Ant 1.8.x
Ivy 2.2.x
# Verified Ant > 1.8.x
# Verified Ivy > 2.2.x
$ ant retrieve
Let's build it...
$ ls -al
drwxr-xr-x 15 mccm06 staff 510B Feb 21 14:31 ./
drwxr-xr-x 20 mccm06 staff 680B Feb 17 15:39 ../
drwxr-xr-x 10 mccm06 staff 340B Feb 19 01:40 cascading.groovy_git/
drwxr-xr-x 7 mccm06 staff 238B Feb 19 01:40 cascading.hbase_git/
drwxr-xr-x 8 mccm06 staff 272B Feb 19 01:40 cascading.jdbc_git/
drwxr-xr-x 8 mccm06 staff 272B Feb 19 01:39 cascading.load_git/
drwxr-xr-x 9 mccm06 staff 306B Feb 19 01:39 cascading.memcached_git/
drwxr-xr-x 9 mccm06 staff 306B Feb 19 01:39 cascading.multitool_git/
drwxr-xr-x 10 mccm06 staff 340B Feb 19 01:39 cascading.samples_git/
drwxr-xr-x 8 mccm06 staff 272B Feb 19 01:39 cascading.work_git/
drwxr-xr-x 14 mccm06 staff 476B Feb 21 14:26 cascading_git/
drwxr-xr-x 11 mccm06 staff 374B Dec 31 16:16 cascalog_git/
lrwxr-xr-x 1 mccm06 staff 45B Feb 21 14:31 hadoop -> /Applications/Dev/hadoop-family/hadoop-0.20.1
# Trying Hadoop == 0.21.0
# Verified 'hadoop' is neighbor to cascading
$ ant compile
[javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:52: cannot find symbol
[javac] symbol : class JobConf
[javac] location: class cascading.tap.hadoop.TapIterator
[javac] private final JobConf conf;
[javac] ^
[javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:54: cannot find symbol
[javac] symbol : class InputSplit
[javac] location: class cascading.tap.hadoop.TapIterator
[javac] private InputSplit[] splits;
[javac] ^
[javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:56: cannot find symbol
[javac] symbol : class RecordReader
[javac] location: class cascading.tap.hadoop.TapIterator
[javac] private RecordReader reader;
[javac] ^
[javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:75: cannot find symbol
[javac] symbol : class JobConf
[javac] location: class cascading.tap.hadoop.TapIterator
[javac] public TapIterator( Tap tap, JobConf conf ) throws IOException
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 100 errors
Hadoop 0.21.0
Argh!
Hadoop 0.20.1
# Verified Hadoop == 0.20.1
# Verified 'hadoop' is neighbor to cascading
$ ant compile
Buildfile: cascading_git/build.xml
init:
[echo] initializing cascading environment...
[mkdir] Created dir: cascading_git/build/core
[mkdir] Created dir: cascading_git/build/xml
[mkdir] Created dir: cascading_git/build/test
[mkdir] Created dir: cascading_git/build/testresults
echo-compile-buildnum:
compile:
[echo] building cascading...
[javac] Compiling 238 source files to cascading_git/build/core
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[copy] Copying 1 file to cascading_git/build/core/cascading
[javac] Compiling 5 source files to cascading_git/build/xml
[javac] Compiling 85 source files to cascading_git/build/test
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[copy] Copying 24 files to cascading_git/build/test
BUILD SUCCESSFUL
Total time: 7 seconds
Planner
planner diagrams
public class SimplestPipe2Sort {
  public static void main(String[] args) {
    String inputPath = "data/babynamedefinitions.csv";
    String outputPath = "output/simplestpipe2";
    Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ^^^ " );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "sortreverse" );
    Fields groupFields = new Fields( "name" );
    //OPTIONAL: Set the comparator
    //groupFields.setComparator("name", Collections.reverseOrder());
    assembly = new GroupBy( assembly, groupFields );
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, SimplestPipe2Sort.class );
    FlowConnector flowConnector = new FlowConnector( properties );
    Flow flow = flowConnector.connect( "sortflow", source, sink, assembly );
    flow.complete();
    //OPTIONAL: Output a debugging diagram
    //flow.writeDOT(outputPath + "/flowdiagram.dot");
  }
}
Abstraction Levels
A unique Java API, similar to the command abstractions layered in the core JVM:
JVM stack: CPU Instruction → Assembly Language → Class File → Java → Groovy → DSL
Hadoop stack: Hadoop → Cascading → Cascalog / Cascading Groovy
Builders
a unique Java API
but enhanced via...
Jython
JRuby
Clojure
Groovy
Coding a Groovy Flow
Groovy
setup...
$ cd cascading.groovy
$ ant dist
$ cd dist
$ groovy setup.groovy
coding...
def cascading = new Cascading()
def builder = cascading.builder();
Flow flow = builder.flow("wordcount")
{
source(input, scheme: text())
// output new tuple for each split,
//result replaces stream by default
tokenize(/[.,]*\s+/)
group() // group on stream
// count values in group
// creates 'count' field by default
count()
// group/sort on 'count', reverse the sort order
group(["count"], reverse: true)
sink(output, delete: true)
}
execution...
$ groovy wordcount
INFO - Concurrent, Inc - Cascading 1.2.1 [hadoop-0.19.2+]
INFO - [wordcount] starting
INFO - [wordcount] source: Hfs["TextLine[['line']->[ALL]]"]["output/fetched/fetch.txt"]"]
INFO - [wordcount] sink: Hfs["TextLine[['line']->[ALL]]"]["output/counted"]"]
INFO - [wordcount] parallel execution is enabled: false
INFO - [wordcount] starting jobs: 2
INFO - [wordcount] allocating threads: 1
INFO - [wordcount] starting step: (1/2) TempHfs["SequenceFile[[0, 'count']]"][wordcount/18750/]
INFO - [wordcount] starting step: (2/2) Hfs["TextLine[['line']->[ALL]]"]["output/counted"]"]
INFO - deleting temp path output/counted/_temporary
Cascalog
Clojure
functional MR programming
㊌ Simple: Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
㊌ Expressive: Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
㊌ Interactive: Run queries from the Clojure REPL.
㊌ Scalable: Cascalog queries run as a series of MapReduce jobs.
㊌ Query Anything: Query HDFS data, database data, and/or local data by making use of Cascading’s “Tap” abstraction.
influenced by Datalog
http://www.ccs.neu.edu/home/ramsdell/tools/datalog/datalog.html
query planner
alternative to Pig, Hive
read or write any data source
higher density of code
(?<- (stdout) [?word ?count] (sentence ?s)
     (split ?s :> ?word) (c/count ?count))
Tap: (stdout); Outputs: [?word ?count]; Source: the generator and operation predicates
Is It Fully Baked?
Java is 16 years old
Hadoop is ~6 years old
Cascading is 4 years old
Cascading is
MapReduce done right
Cascading
through Hadoop
Simpler MapReduce through data flows
by Matthew McCullough, Ambient Ideas, LLC