The Cascading (big) data application framework
André Kelpe | HUG France | Paris | 25 November 2014
Who am I?
André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
andre@concurrentinc.com / @fs111
http://cascading.org
Apache-licensed Java framework for writing data-oriented applications
Production-ready, stable and battle-proven (SoundCloud, Twitter, Etsy, Climate Corp and many more)
Cascading goals
Developer productivity
Focus on business problems, not distributed systems knowledge
Useful abstractions over the underlying "fabrics"
Cascading goals
Testability & robustness
Production-quality applications rather than a collection of scripts
(Hooks into the core for experts)
https://www.flickr.com/photos/theilr/4283377543/sizes/l
Cascading terminology
Taps are sources and sinks for data
Schemes represent the format of the data
Pipes connect Taps
Cascading terminology
● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in TupleStreams
● A FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational fabric
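The terms above can be pictured with a small plain-Java sketch. It does not use the Cascading API: the class TerminologySketch and its methods are hypothetical stand-ins, with a tuple modelled as a map from field name to value, a source "tap" as a method that yields tuples, and a pipe as a method that applies one operation to every tuple.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; none of these names belong to Cascading.
final class TerminologySketch {
    // Stand-in for a source tap: yields tuples with a single field, "text".
    static List<Map<String, String>> source() {
        List<Map<String, String>> tuples = new ArrayList<>();
        for (String line : new String[] { "hello world", "hello cascading" }) {
            Map<String, String> tuple = new LinkedHashMap<>();
            tuple.put("text", line);
            tuples.add(tuple);
        }
        return tuples;
    }

    // Stand-in for a pipe with an operation: applies op to every tuple in the stream.
    static List<Map<String, String>> each(List<Map<String, String>> in,
                                          Function<Map<String, String>, Map<String, String>> op) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> tuple : in) {
            out.add(op.apply(tuple));
        }
        return out;
    }

    // Example operation: copies the tuple and adds a "length" field.
    static Map<String, String> withLength(Map<String, String> tuple) {
        Map<String, String> result = new LinkedHashMap<>(tuple);
        result.put("length", String.valueOf(tuple.get("text").length()));
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a sink tap: standard output.
        for (Map<String, String> tuple : each(source(), TerminologySketch::withLength)) {
            System.out.println(tuple);
        }
    }
}
```

In Cascading itself the same roles are played by Tap, Scheme, Pipe, Fields and Tuple, and the work is planned onto a fabric rather than run in a local loop.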
[Diagram: the QueryPlanner compared to a compiler]
Compiler: User Code → Translation / Optimization → Assembly → CPU Architecture
QueryPlanner: FlowDef → Hadoop | Tez | Spark
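The compiler analogy can be made concrete with a toy sketch: one logical definition, two different translations. This is not how the Cascading QueryPlanner is implemented. PlannerSketch, planMapReduce and planDag are hypothetical names, and the rule "everything before the grouping step runs map-side, everything after it reduce-side" is a deliberate simplification.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Cascading planner.
final class PlannerSketch {
    // Stand-in for a FlowDef: an ordered list of logical steps.
    static final List<String> LOGICAL_STEPS = Arrays.asList("split", "group", "count");

    // Toy MapReduce-style plan: the "group" step becomes the shuffle boundary.
    static String planMapReduce(List<String> steps) {
        int boundary = steps.indexOf("group");
        if (boundary < 0) {
            return "map" + steps; // no grouping: a map-only job
        }
        return "map" + steps.subList(0, boundary)
            + " -> shuffle -> reduce" + steps.subList(boundary + 1, steps.size());
    }

    // Toy DAG-style plan: every step becomes one vertex in a chain.
    static String planDag(List<String> steps) {
        return String.join(" -> ", steps);
    }

    public static void main(String[] args) {
        System.out.println(planMapReduce(LOGICAL_STEPS));
        System.out.println(planDag(LOGICAL_STEPS));
    }
}
```

The point of the sketch is only that the user-facing definition stays the same while the translation depends on the target.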
User-APIs
● Fluid - A Fluent API for Cascading
– Targeted at application writers
– https://github.com/Cascading/fluid
● "Raw" Cascading API
– Targeted at library writers, code generators, integration layers
– https://github.com/Cascading/cascading
Counting words 
// configuration 
String docPath = args[ 0 ]; 
String wcPath = args[ 1 ]; 
Properties properties = new Properties(); 
AppProps.setApplicationJarClass( properties, Main.class ); 
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties ); 
// create source and sink taps 
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...
Counting words (cont.) 
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
  new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token" 
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); 
// determine the word counts 
Pipe wcPipe = new Pipe( "wc", docPipe ); 
wcPipe = new GroupBy( wcPipe, token ); 
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); 
...
Counting words (cont.) 
// connect the taps, pipes, etc., into a flow 
FlowDef flowDef = FlowDef.flowDef() 
.setName( "wc" ) 
.addSource( docPipe, docTap ) 
.addTailSink( wcPipe, wcTap ); 
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code 
}
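To see what the flow computes without a cluster, here is a plain-Java sketch of the same logic: split each line on the delimiter set, group by token, count. It does not use Cascading. WordCountSketch is a hypothetical name, and skipping empty tokens is an assumption made here for readability, not something the slides state.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the real flow runs through Cascading pipes.
final class WordCountSketch {
    // Same delimiter set as the regex on the slide: space, [ ] ( ) , .
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Splits every line into tokens and counts how often each token occurs.
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // assumption: adjacent delimiters produce empty tokens, dropped here
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {away=1, go=1, rain=3}
        System.out.println(count(Arrays.asList("rain, rain (go away)", "rain")));
    }
}
```

The split step corresponds to the Each pipe with the RegexSplitGenerator, and the merge into the map corresponds to GroupBy followed by Every with Count.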
https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA
Impatient 
Cascading for the Impatient 
http://docs.cascading.org/impatient/index.html
A full toolbox
● Operations
– Function
– Filter
– Regex/Scripts
– Boolean operators
– Count/Limit/Last/First
– Unique
– Asserts
– Min/Max
– …
● Splices
– GroupBy
– CoGroup
– HashJoin
– Merge
● Joins
– Left, right, outer, inner, mixed...
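The difference between an inner and a left join can be sketched in plain Java. This is not the Cascading join API: JoinSketch and its join method are hypothetical, and the sketch assumes each key occurs at most once per side.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Cascading expresses joins through its own pipes.
final class JoinSketch {
    // Joins two key -> value maps on their keys.
    // keepUnmatchedLeft = false gives an inner join;
    // keepUnmatchedLeft = true gives a left join, with null for missing right values.
    static List<String> join(Map<String, String> left,
                             Map<String, String> right,
                             boolean keepUnmatchedLeft) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null || keepUnmatchedLeft) {
                rows.add(l.getKey() + "," + l.getValue() + "," + r);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> users = new LinkedHashMap<>();
        users.put("1", "ann");
        users.put("2", "bob");
        Map<String, String> orders = new LinkedHashMap<>();
        orders.put("1", "book");

        System.out.println(join(users, orders, false)); // prints [1,ann,book]
        System.out.println(join(users, orders, true));  // prints [1,ann,book, 2,bob,null]
    }
}
```

Looking up one side in an in-memory map is the simplest way to show the idea; a grouping-based join would sort and co-locate both sides by key first.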
A full toolbox
Data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra...
Data formats: Avro, Thrift, Protobuf, CSV, TSV...
Integration points: Cascading Lingual (SQL), Apache Hive, classical M/R apps...
Not Java? Scalding (Scala), Cascalog (Clojure)
Status quo
● Cascading 2.6
– Production release: Hadoop 2.x, Hadoop 1.x, local mode
● Cascading 3.0
– Public WIP builds: Tez, Hadoop 2.x, Hadoop 1.x, local mode, others (Spark...)
Questions? 
andre@concurrentinc.com
Link Collection
http://www.cascading.org/
https://github.com/Cascading/
http://concurrentinc.com
http://cascading.io/driven/
https://groups.google.com/forum/#!forum/cascading-user
http://docs.cascading.org/impatient/
http://docs.cascading.org/cascading/2.6/userguide/html/
fin.
