The Cascading (big) data application framework - André Kelpe, Sr. Engineer, Concurrent

The Cascading
(big) data
application framework
André Kelpe | HUG France | Paris | 25. November 2014

Who am I?
André Kelpe
Senior Software Engineer at Concurrent
company behind Cascading, Lingual and
Driven
http://concurrentinc.com / @concurrent
andre@concurrentinc.com / @fs111

http://cascading.org
Apache licensed Java framework for writing data
oriented applications
production ready, stable and battle proven
(SoundCloud, Twitter, Etsy, Climate Corp + many
more)

Cascading goals
developer productivity
focus on business problems, not distributed
systems knowledge
useful abstractions over underlying „fabrics“

Cascading goals
Testability & robustness
production quality applications rather than a
collection of scripts
(hooks into the core for experts)

https://www.flickr.com/photos/theilr/4283377543/sizes/l

Cascading terminology
Taps are sources and sinks for data
Schemes represent the format of the data
Pipes connect Taps

Cascading terminology
● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in
TupleStreams
● FlowConnector uses QueryPlanner to
translate FlowDef into Flow to run on
computational fabric
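
The terms above can be illustrated with a toy model in plain Java. This is only a sketch: the class and method names (TerminologySketch, parse, upperCase) are hypothetical and are not part of the Cascading API. A list of lines stands in for a source Tap, the parser plays the role of a Scheme, a map keyed by field name stands in for a Tuple, and the upper-casing step is one Operation applied to each Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the terminology; every name here is hypothetical, not a Cascading class.
class TerminologySketch {
    // "Scheme": turns tab-delimited lines with a header row into tuples keyed by field name.
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> tuples = new ArrayList<>();
        if (lines.isEmpty()) {
            return tuples;
        }
        String[] fields = lines.get(0).split("\t", -1); // the "Fields" describing each tuple
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> tuple = new LinkedHashMap<>();
            for (int i = 0; i < fields.length; i++) {
                tuple.put(fields[i], i < values.length ? values[i] : null);
            }
            tuples.add(tuple);
        }
        return tuples;
    }

    // "Operation": applied to every tuple in the stream; here it upper-cases one field.
    static List<Map<String, String>> upperCase(List<Map<String, String>> stream, String field) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> tuple : stream) {
            Map<String, String> copy = new LinkedHashMap<>(tuple);
            String value = copy.get(field);
            if (value != null) {
                copy.put(field, value.toUpperCase(Locale.ROOT));
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // The list stands in for a source Tap; printing stands in for a sink Tap.
        List<String> source = Arrays.asList("doc_id\ttext", "doc01\ta rain shadow");
        System.out.println(upperCase(parse(source), "text"));
    }
}
```

In Cascading itself the planner decides where and how such steps run; the toy model runs everything in one loop on one machine.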

[Diagram: the QueryPlanner compared to a compiler. Compiler: user code, translation, optimization, assembly, CPU architecture. QueryPlanner: FlowDef, translated for a fabric such as Hadoop, Tez or Spark.]

User-APIs
● Fluid - A Fluent API for Cascading
– Targeted at application writers
– https://github.com/Cascading/fluid
● „Raw“ Cascading API
– Targeted at library writers, code generators,
integration layers
– https://github.com/Cascading/cascading

Counting words
// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...

Counting words (cont.)
// specify a regex operation to split the "document" text lines into a
// token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
...

Counting words (cont.)
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code
}
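
To see what the word-count flow computes without a cluster, the same logic can be sketched in plain Java. This is not Cascading code: WordCountSketch and countWords are hypothetical names, and the sketch runs in memory on one machine. It reuses the delimiter character class passed to RegexSplitGenerator, then tallies tokens the way GroupBy followed by Count would. One assumption is added: empty tokens are dropped, a step the flow above does not perform explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java sketch of the result of the word-count flow; not Cascading API.
class WordCountSketch {
    // Same delimiter character class as the RegexSplitGenerator in the slides.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split each line into tokens (the Each step), then count per token (GroupBy + Count).
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // TreeMap keeps the output order deterministic
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: empty tokens are not counted
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (rain)", "shadow, rain.");
        System.out.println(countWords(lines)); // prints {rain=3, shadow=2}
    }
}
```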

https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA

Impatient
Cascading for the Impatient
http://docs.cascading.org/impatient/index.html

A full toolbox
● Operations
– Function
– Filter
– Regex/Scripts
– Boolean operators
– Count/Limit/Last/First
– Scripts
– Unique
– Asserts
– Min/Max
– …
● Splices
– GroupBy
– CoGroup
– HashJoin
– Merge
● Joins
– Left, right, outer, inner, mixed...
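
The difference between an inner and a left join can be sketched with the general hash-join technique in plain Java. JoinSketch and hashJoin are hypothetical names, and this makes no claim about how Cascading's HashJoin or CoGroup are implemented. One side is loaded into an in-memory hash table and the other side streams past it; a flag decides whether unmatched left rows are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// General hash-join sketch; hypothetical names, not Cascading's implementation.
class JoinSketch {
    // Rows are {key, value} pairs. The right side is held in a hash table,
    // and the left side is streamed against it.
    static List<String> hashJoin(List<String[]> left, List<String[]> right, boolean keepUnmatchedLeft) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : right) {
            table.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = table.get(row[0]);
            if (matches != null) {
                for (String match : matches) {
                    joined.add(row[0] + ":" + row[1] + ":" + match);
                }
            } else if (keepUnmatchedLeft) {
                joined.add(row[0] + ":" + row[1] + ":null"); // left join keeps the row
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[] {"1", "ann"}, new String[] {"2", "bob"});
        List<String[]> orders = Arrays.asList(new String[] {"1", "book"}, new String[] {"1", "pen"});
        System.out.println(hashJoin(users, orders, false)); // inner: [1:ann:book, 1:ann:pen]
        System.out.println(hashJoin(users, orders, true));  // left:  [1:ann:book, 1:ann:pen, 2:bob:null]
    }
}
```

Holding one side in memory avoids sorting and grouping both inputs, which is why this kind of join suits a small lookup side.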

A full toolbox
data access: JDBC, HBase, Elasticsearch,
Redshift, HDFS, S3, Cassandra...
data formats: Avro, Thrift, Protobuf, CSV, TSV...
integration points: Cascading Lingual (SQL),
Apache Hive, classic MapReduce apps...
not Java?: Scalding (Scala), Cascalog (Clojure)

Status quo
● Cascading 2.6
  – Production release
    ● Hadoop 2.x
    ● Hadoop 1.x
    ● Local mode
● Cascading 3.0
  – Public WIP builds
    ● Tez
    ● Hadoop 2.x
    ● Hadoop 1.x
    ● Local mode
    ● Others (Spark...)

Questions?
andre@concurrentinc.com

Link Collection
http://www.cascading.org/
https://github.com/Cascading/
http://concurrentinc.com
http://cascading.io/driven/
https://groups.google.com/forum/#!forum/cascading-user
http://docs.cascading.org/impatient/
http://docs.cascading.org/cascading/2.6/userguide/html/
