A QUICK INTRODUCTION TO
THE CASCADING ECOSYSTEM
Chris K Wensel | Hadoop Summit EU 2014
• Lead developer of the Cascading open-source project
• Founder of Concurrent, Inc.
• Involved with Apache Hadoop since it was called Apache Nutch
• Systems Architect, not a Data Scientist
WHO AM I?
For creating data-oriented applications, frameworks,
and languages [on Apache Hadoop]
Originally designed to hide the complexity of Hadoop and
prevent thinking in MapReduce
cascading.org
• Started in 2007
• 2.0 released June 2012
• 2.5 out now
• 3.0 WIP (if you look for it)
• Apache 2.0 Licensed
• Supports all Hadoop distros
SOME STATS
What’s it used for?
• Cascading Java API
• Data normalization and cleansing of search and click-through
logs for use by analytics tools
• Easy to operationalize heavy lifting of data
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
• Scalding (Scala)
• Machine learning (linear algebra) to improve
‣ User experience
‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
• IPO
TWITTER
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
KEY PROJECTS
[Diagram: Lingual, Pattern, Scalding, and Cascalog layered on Cascading, which runs on Apache Hadoop]
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
[Diagram: Cascading architecture. Labels: Process Planner; Processing API; Integration API; Scheduler API; Scheduler; Apache Hadoop; Cascading; Data Stores; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Enterprise Java]
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc
‣ Rolling windows
SOME COMMON PATTERNS
[Diagram: a Pipeline chaining filter and function operations over data, and a Topology in which data passes through Split, Join, and Merge]
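To make the join vocabulary above concrete, here is a small plain-Java sketch contrasting an inner join with a left outer join on keyed records. It deliberately does not use the Cascading API; the class name `JoinSketch`, the record layout, and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java collections, not the Cascading API.
final class JoinSketch {

  // Inner join: emit a row only when the key exists on both sides.
  static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
    List<String> rows = new ArrayList<>();
    for (Map.Entry<String, String> l : left.entrySet()) {
      String r = right.get(l.getKey());
      if (r != null) {
        rows.add(l.getKey() + "," + l.getValue() + "," + r);
      }
    }
    return rows;
  }

  // Left outer join: keep every left row; an unmatched right side is rendered as "null".
  static List<String> leftOuterJoin(Map<String, String> left, Map<String, String> right) {
    List<String> rows = new ArrayList<>();
    for (Map.Entry<String, String> l : left.entrySet()) {
      rows.add(l.getKey() + "," + l.getValue() + "," + right.get(l.getKey()));
    }
    return rows;
  }

  public static void main(String[] args) {
    Map<String, String> employees = new LinkedHashMap<>();
    employees.put("1", "Ada");
    employees.put("2", "Grace");
    Map<String, String> sales = new LinkedHashMap<>();
    sales.put("1", "100");
    System.out.println(innerJoin(employees, sales));
    System.out.println(leftOuterJoin(employees, sales));
  }
}
```

The only difference between the two methods is what happens to a left-hand record with no partner, which is the distinction the Inner / Outer bullet refers to.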
word count – Cascading Java API

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (tab-delimited text with a header row)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
wcFlow.complete(); // <<-- Runs jobs on Cluster
[Slide callouts label the code regions: configuration, integration, processing, scheduling]
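For readers without a cluster at hand, the same split, group, and count logic can be tried in memory. The sketch below is plain Java rather than Cascading. It assumes the delimiter class is meant to cover space, square brackets, parentheses, comma, and period, and the class name `LocalWordCount` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// In-memory analogue of the word count flow; not the Cascading API.
final class LocalWordCount {

  // Assumed delimiter set: space, [ ] ( ) , .
  private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

  static Map<String, Long> count(Iterable<String> lines) {
    Map<String, Long> counts = new TreeMap<>();
    for (String line : lines) {
      // Comparable to "Each": split one text line into a stream of tokens.
      for (String token : DELIMITERS.split(line)) {
        if (token.isEmpty()) {
          continue; // adjacent delimiters yield empty tokens; skipped here as a simplification
        }
        // Comparable to "GroupBy" plus "Every(Count)": group on the token and count the group.
        counts.merge(token, 1L, Long::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(count(java.util.List.of("rain (shadow), rain.", "dry rain")));
  }
}
```

Skipping empty tokens is a choice made for this sketch only; it is not something the slide's flow is shown to do.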
wc.dot (flow graph written by writeDOT; the whole flow is a single mapreduce step)
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
A REAL WORLD APP
[Figure: step graph numbered 1/75 through 75/75; most steps are map+reduce, and steps 55, 57 through 62, 71, and 72 are map only]
1 App, 1 Flow, 75 Steps/MRJobs
green = map + reduce
purple = map
blue = join/merge
orange = map split
A graph of jobs, not operations!
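Because the unit being scheduled is a graph of jobs, a step can start only once every step it depends on has finished. The plain-Java sketch below shows one way such an ordering could be derived, using Kahn's topological sort. It is not a description of how Cascading's scheduler is implemented, and the class name `JobGraphSketch` and the step names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Kahn's algorithm over a small dependency graph.
final class JobGraphSketch {

  // dependsOn maps each step to the steps that must complete before it.
  // Every step, including those with no dependencies, must appear as a key.
  static List<String> runOrder(Map<String, List<String>> dependsOn) {
    Map<String, Integer> pending = new LinkedHashMap<>();
    Map<String, List<String>> dependents = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
      pending.put(e.getKey(), e.getValue().size());
      for (String upstream : e.getValue()) {
        dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
      }
    }

    // Steps with no unfinished dependencies are ready to run.
    Deque<String> ready = new ArrayDeque<>();
    pending.forEach((step, count) -> {
      if (count == 0) {
        ready.add(step);
      }
    });

    List<String> order = new ArrayList<>();
    while (!ready.isEmpty()) {
      String step = ready.remove();
      order.add(step);
      // Finishing a step releases any dependent whose last dependency it was.
      for (String next : dependents.getOrDefault(step, List.of())) {
        if (pending.merge(next, -1, Integer::sum) == 0) {
          ready.add(next);
        }
      }
    }
    if (order.size() != pending.size()) {
      throw new IllegalStateException("cycle or undeclared step in job graph");
    }
    return order;
  }

  public static void main(String[] args) {
    Map<String, List<String>> graph = new LinkedHashMap<>();
    graph.put("clean", List.of());
    graph.put("sessions", List.of("clean"));
    graph.put("users", List.of("clean"));
    graph.put("join", List.of("sessions", "users"));
    System.out.println(runOrder(graph));
  }
}
```

In this toy graph, `sessions` and `users` become ready at the same time, which is the kind of parallelism a graph of 75 jobs exposes.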
It’s not just for Java
word count – Scalding (Scala)

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      // split each line on runs of whitespace
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
word count – Cascalog (Clojure)

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
• Step-by-step tutorials on Cascading on GitHub
• Community has ported them to Scalding and Cascalog
• http://docs.cascading.org/impatient/
“FOR THE IMPATIENT” SERIES
• Foundation of patterns and best practices for building
Languages, Frameworks, and Applications
• Designed to abstract Hadoop away from the business logic
• Other models than MapReduce on the way!
WHY CASCADING?
• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
[Diagram: Lingual architecture. Labels: Query Planner; JDBC API; Lingual API; Provider API; Cascading; Apache Hadoop; Lingual; Data Stores; CLI / Shell; Enterprise Java; Catalog]
Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  // identifiers are double-quoted, so the quotes are escaped inside the Java string
  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
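Double-quoted identifiers are easy to get wrong inside a Java string literal. As a hedged convenience, a small helper can do the quoting. `SqlIdent` and its method names are hypothetical and not part of Lingual or JDBC; doubling an embedded quote follows the standard SQL rule for delimited identifiers.

```java
// Hypothetical helper, not part of Lingual or JDBC.
final class SqlIdent {

  // Wrap an identifier in double quotes, doubling any embedded double quote.
  static String quote(String identifier) {
    return "\"" + identifier.replace("\"", "\"\"") + "\"";
  }

  // Build a schema-qualified, quoted table name such as "EXAMPLE"."EMPLOYEE".
  static String table(String schema, String name) {
    return quote(schema) + "." + quote(name);
  }

  public static void main(String[] args) {
    // Assemble a join over two quoted tables, using the names from the slide.
    String sql = "select *\n"
        + "from " + table("EXAMPLE", "SALES_FACT_1997") + " as s\n"
        + "join " + table("EXAMPLE", "EMPLOYEE") + " as e\n"
        + "on e." + quote("EMPID") + " = s." + quote("CUST_ID");
    System.out.println(sql);
  }
}
```

The resulting string could be passed to `Statement.executeQuery` in place of a hand-escaped literal.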
SHELL - !TABLES
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
“But we use a custom data format”
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster
DATA PROVIDER API
[Diagram: Amazon Elastic MapReduce running four jobs for the query below, with file1 and file2 as inputs and results as output; Amazon S3 and Amazon Redshift are the data stores shown]
SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...
• Quickly migrate existing workloads from RDBMS to Hadoop
• Quickly extract data from Hadoop into applications
WHY LINGUAL
• Predictive model scoring
• Java API and PMML parser
• Supports:
‣ (General) Regression
‣ Clustering
‣ Decision Trees
‣ Random Forest
‣ and ensembles of models
PATTERN
[Diagram: Pattern architecture. Labels: PMML Parser; Pattern API; Cascading; Apache Hadoop; Pattern; Data Stores; Enterprise Java]
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
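Scoring here means applying an already-trained model to each incoming record. As a rough illustration of what a regression model does per record, the sketch below evaluates a linear model in plain Java. It does not use Pattern or parse PMML, and the class name, field names, and coefficients are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: per-record scoring of a linear regression model.
final class LinearScorerSketch {

  private final double intercept;
  private final Map<String, Double> coefficients;

  LinearScorerSketch(double intercept, Map<String, Double> coefficients) {
    this.intercept = intercept;
    this.coefficients = coefficients;
  }

  // score = intercept + sum(coefficient * field value).
  // Fields in the record that the model does not use are ignored.
  double score(Map<String, Double> record) {
    double result = intercept;
    for (Map.Entry<String, Double> c : coefficients.entrySet()) {
      Double value = record.get(c.getKey());
      if (value == null) {
        throw new IllegalArgumentException("missing field: " + c.getKey());
      }
      result += c.getValue() * value;
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> coefficients = new LinkedHashMap<>();
    coefficients.put("sepal_length", 0.5);
    coefficients.put("petal_width", 2.0);
    LinearScorerSketch model = new LinearScorerSketch(1.0, coefficients);
    System.out.println(model.score(Map.of("sepal_length", 4.0, "petal_width", 1.5, "unused", 9.0)));
  }
}
```

The model is data handed to the scorer rather than logic baked into it, which loosely mirrors the idea that models stay independent of data and integration.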
• Standards compliance provides integration with many tools
• Models are independent of data and integration
• Only debugging Cascading, not an ensemble of applications
WHY PATTERN
CLOSING THE LOOP
[Diagram: a loop between Desktop and Cluster. Steps: import data, create models, export models, execute models, import results, test results. Labels: JDBC, PMML, DATA, Pattern, Flow, Job]
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated on a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
http://cascading.io/driven/
MANAGED WITH DRIVEN
• New query planner
‣ User-definable Assertion and Transformation rules
‣ Sub-Graph Isomorphism Pattern Matching
‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75
• Hadoop Tez support
• And likely other platforms
CASCADING 3.0
THERE’S A BOOK!
Enterprise Data Workflows with Cascading
- Paco Nathan
O’Reilly, 2013
amazon.com/dp/1449358721
CONTACT
@cwensel | @cascading
chris@wensel.net
www.cascading.org
www.concurrentinc.com
