Indic threads pune12-apache-crunch

Apache Crunch
Rahul Sharma
Apache

Agenda :


Issues with MapReduce pipelines

Solving with Apache Crunch

Data Model & Operations

System Workflow

Examples

Question & Answers


Issues with MapReduce Pipelines

Unit testing a pipeline?? You must be joking!!

Can someone tell me where is the business logic??

Chain performance??

Learn Latin (Pig) first!!


Apache Crunch


Is a Java library

Contains collections that can execute parallel operations

Lazy evaluation of collections at runtime

Operations are merged at runtime into efficient chains

Available @ http://incubator.apache.org/crunch/

Based on Google FlumeJava paper
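
The lazy evaluation and merging of operations listed above can be illustrated in plain Java. This is only a sketch under stated assumptions: LazyCollection, of, map and materialize are hypothetical names invented for the illustration and are not part of the Crunch API. Each call to map only records work by composing it onto a pending function, and the data is read once when the result is requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical stand-in for a lazily evaluated collection.
 * This is NOT the Crunch API; it only illustrates deferred, fused execution.
 */
final class LazyCollection<T> {
    private final List<?> source;             // untouched input data
    private final Function<Object, T> chain;  // all pending operations, fused into one function

    private LazyCollection(List<?> source, Function<Object, T> chain) {
        this.source = source;
        this.chain = chain;
    }

    /** Wraps a list without copying or reading it. */
    @SuppressWarnings("unchecked")
    static <T> LazyCollection<T> of(List<T> source) {
        return new LazyCollection<T>(source, x -> (T) x);
    }

    /** Records an operation by composing it onto the chain; nothing runs here. */
    <R> LazyCollection<R> map(Function<? super T, ? extends R> fn) {
        return new LazyCollection<R>(source, x -> fn.apply(chain.apply(x)));
    }

    /** Runs the fused chain in a single pass over the source. */
    List<T> materialize() {
        List<T> out = new ArrayList<>();
        for (Object element : source) {
            out.add(chain.apply(element));
        }
        return out;
    }
}
```

Chaining two map calls and then calling materialize() visits each input element once, applying both functions back to back, which is the same idea as merging several operations into one efficient chain.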


Apache Crunch


Supports Hadoop version 1 and 2-alpha

Supports HBase, JDBC, etc.

Works with Writables, Avro, Thrift and Protocol Buffers

A Scala variant also exists

Integration with R and Clojure is in progress

A Maven archetype exists for creating a sample project


Apache Crunch : Data Model

Pipeline

MRPipeline

MemPipeline

PCollection<T>

PTable<K,V>

PGroupedTable<K,V>

Source<T>

Target<T>

Emitter<T>

PType<T>

Apache Crunch : Operations


DoFn<S,T>

CombineFn<S,T>

FilterFn<T>

Joins

Cartesian

Sort

SecondarySort

PObject<T>

BloomFilters
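
How DoFn, FilterFn and Emitter fit together can be sketched without a cluster. The interfaces below are simplified stand-ins named after the types on these slides, and the class OperationsSketch with its helpers parallelDo, filter, count and wordCount is hypothetical. The real Crunch classes carry more methods and run on PCollections rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-ins named after the slide's types; NOT the real Crunch classes. */
final class OperationsSketch {
    interface Emitter<T> { void emit(T value); }
    interface DoFn<S, T> { void process(S input, Emitter<T> emitter); }
    interface FilterFn<T> { boolean accept(T input); }

    /** Applies fn to every element; fn may emit zero, one or many outputs per input. */
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    /** A filter is a DoFn that emits its input only when the predicate accepts it. */
    static <T> List<T> filter(List<T> input, FilterFn<T> fn) {
        return parallelDo(input, (element, emitter) -> {
            if (fn.accept(element)) {
                emitter.emit(element);
            }
        });
    }

    /** Counts occurrences of each distinct value, sorted by key. */
    static Map<String, Long> count(List<String> input) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : input) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    /** Word count built from the three operations above. */
    static Map<String, Long> wordCount(List<String> lines) {
        List<String> words = parallelDo(lines, (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                emitter.emit(word);
            }
        });
        List<String> nonEmpty = filter(words, word -> !word.isEmpty());
        return count(nonEmpty);
    }
}
```

The point of the sketch is that the business logic lives in small functions (split a line, drop empty tokens), while the plumbing that moves data between them is generic.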

Apache Crunch : System Workflow
Construct a pipeline

Pipeline.done()

Map Map Map

GBK GBK

Reduce Reduce

Output
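
The workflow above, where the recorded operations are turned into Map, GBK (group by key) and Reduce stages, can be approximated with a small planner. This is a deliberately simplified sketch: JobPlanSketch and plan are hypothetical names, the input is a linear list of operation labels rather than the graph a real planner works on, and the only rule applied is that a MapReduce job contains a single shuffle, so a second GBK starts a new job.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified planner; the real Crunch planner is more involved. */
final class JobPlanSketch {
    static final String GBK = "GBK";

    /**
     * Splits a linear list of operation labels into jobs.
     * Each inner list is one job: operations before its GBK run map-side,
     * operations after it run reduce-side.
     */
    static List<List<String>> plan(List<String> ops) {
        List<List<String>> jobs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean hasShuffle = false;
        for (String op : ops) {
            boolean isGbk = GBK.equals(op);
            if (isGbk && hasShuffle) {
                // The current job already has its shuffle, so close it and start another.
                jobs.add(current);
                current = new ArrayList<>();
                hasShuffle = false;
            }
            hasShuffle = hasShuffle || isGbk;
            current.add(op);
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }
}
```

Under this rule a pipeline with two GBK steps becomes two jobs, and a pipeline with no GBK stays a single map-only job.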

Apache Crunch : Examples


WordCount example

Avro example

Sorting example

SecondarySort

Join Example

BloomFilters


Write to me : rsharma@apache.org
Example src : http://github.com/rahul0208
Blog : devlearnings.wordpress.com
