“The Workflow Abstraction”

                     Strata SC
                     2013-02-28

                     Paco Nathan
                     Concurrent, Inc.
                     San Francisco, CA
                     @pacoid




                   Copyright © 2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           1
Background: a dual background in quantitative methods and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
The Workflow Abstraction
                       [Diagram: Cascading word-count flow – Document
                       Collection → Tokenize → Scrub token → HashJoin
                       (Left; RHS: Stop Word List → Regex token) →
                       GroupBy token → Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines


This talk is about the workflow abstraction:
 * the business process of structuring data
 * the practices of building robust apps at scale
 * the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, and trendlines --
what are the drivers that brought us here, and where is this work heading?

Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
Marketing Funnel – overview

             In reference to Making Data Work…

             Almost every business uses a model similar to this –
             give or take a few steps.

             Customer leads go in at the top, those get refined
             through several stages, then results flow out the bottom.

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]
Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
Marketing Funnel – clickstream

              Different funnel stages get represented in ecommerce
              by events captured in log files, as a class of machine
              data called clickstream:

                •   ad impressions
                •   URL clicks
                •   landing page views
                •   new user registrations
                •   session cookies
                •   online purchases
                •   social network activity
                •   etc.

              [Diagram: clickstream events mapped to funnel stages –
              Impression → Awareness, Click → Interest, Sign Up →
              Evaluation, Purchase → Conversion, "Like" → Referral]
Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
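To make the idea concrete, a minimal sketch of bucketing clickstream events into funnel stages; the event names and the stage mapping here are illustrative assumptions, not taken from any specific ad platform:

```python
# Sketch: bucket raw clickstream events into marketing-funnel stages.
# Event names and the stage mapping are illustrative assumptions.

# map each clickstream event type to the funnel stage it signals
EVENT_STAGE = {
    "impression": "Awareness",
    "click": "Interest",
    "signup": "Evaluation",
    "purchase": "Conversion",
    "like": "Referral",
}

def count_stages(events):
    """Tally how many log events landed in each funnel stage."""
    counts = {}
    for e in events:
        stage = EVENT_STAGE.get(e)
        if stage is not None:          # skip unknown event types
            counts[stage] = counts.get(stage, 0) + 1
    return counts

log = ["impression", "impression", "click", "impression", "purchase"]
print(count_stages(log))   # {'Awareness': 3, 'Interest': 1, 'Conversion': 1}
```

At production scale this tally runs over billions of log lines, which is exactly where MapReduce-style workflows enter the picture.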
Marketing Funnel – metrics

             A variety of clickstream metrics can be used as
             performance indicators at different stages of the funnel:

              •    CPM: cost per thousand impressions
              •    CTR: click-through rate
              •    CPA: cost per action
              •    etc.

             [Diagram: metrics by funnel stage – Awareness: CPM;
             Interest: CTR; Evaluation: behaviors; Conversion: CPA;
             Referral: NPS, social graph, etc.; Repeat: loyalty,
             win back, etc.]
The many different highly-nuanced metrics which apply are mind-boggling :)
Marketing Funnel – example calculations

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]

                             metric      cost     events    formula                  rate

                             CPM       $4,000     10^6      $4,000 ÷ (10^6 ÷ 10^3)   $4.00
                             CTR            -    3∙10^3     3∙10^3 ÷ 10^6            0.3%
                             CPA            -       20      $4,000 ÷ 20              $200
Here are examples of the kinds of calculations performed...
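The example calculations above can be sketched in a few lines; the function names are mine, not part of any library:

```python
# Sketch of the example funnel-metric calculations.
# Function names are illustrative, not from any library.

def cpm(cost, impressions):
    """Cost per thousand impressions (cost per mille)."""
    return cost / (impressions / 1_000)

def ctr(clicks, impressions):
    """Click-through rate, as a fraction."""
    return clicks / impressions

def cpa(cost, actions):
    """Cost per action."""
    return cost / actions

print(cpm(4_000, 10**6))          # 4.0   -> $4.00 CPM
print(ctr(3 * 10**3, 10**6))      # 0.003 -> 0.3% CTR
print(cpa(4_000, 20))             # 200.0 -> $200 CPA
```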
Marketing Funnel – predictive model

             Given these metrics, we can go further to estimate
             cost per paying user (CPP), customer lifetime value
             (LTV), etc.

             Then we can build a predictive model for return on
             investment (ROI) per customer, summarizing the funnel
             performance:

                     ROI = (LTV − CPP) ∕ CPP

             As an example, after crunching lots of logs,
             suppose that…

                     CPP = $200
                     LTV = $2000
                     ROI = ($2000 − $200) ∕ $200

             for a 9x multiple

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]
For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
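The ROI formula is easy to check directly; a minimal sketch using the slide's example numbers:

```python
# Sketch of the per-customer ROI model: ROI = (LTV - CPP) / CPP

def roi(ltv, cpp):
    """Return on investment per customer, as a multiple of spend."""
    return (ltv - cpp) / cpp

# the example: CPP = $200, LTV = $2000
print(roi(2_000, 200))   # 9.0 -> the "9x multiple"
```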
Marketing Funnel – example architecture

             Let’s consider an example architecture for calculating,
             reporting, and taking action on funnel metrics, based on
             large-scale clickstream data…

             [Diagram: example architecture – Customers → Web App
             (Cache, Customer Prefs); logs → source tap → Data
             Workflow on a Hadoop Cluster, with a trap tap for bad
             data; sink taps → customer profile DBs; downstream:
             Support, Modeling (PMML), Analytics Cubes, Reporting]
Here’s an example architecture of using clickstream metrics within an online business.
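To make the tap terminology concrete, here is a toy Python model of the source tap / trap tap / sink tap pattern; this is an illustrative sketch, NOT the Cascading API (in Cascading, taps bind a workflow to storage such as HDFS, and traps capture records that fail during processing):

```python
# Toy model of the source/trap/sink tap pattern.
# Illustrative Python only -- not the Cascading API.

def run_workflow(source_tap, process, sink_tap, trap_tap):
    """Read from source, apply the workflow step, route failures to the trap."""
    for record in source_tap:
        try:
            sink_tap.append(process(record))
        except ValueError:
            trap_tap.append(record)    # bad data is trapped, not fatal

# example step: parse log lines into (user, event) tuples
def parse(line):
    user, _, event = line.partition(",")
    if not event:
        raise ValueError("malformed log line")
    return (user, event)

logs = ["u1,impression", "u2,click", "corrupt-line"]
sink, trap = [], []
run_workflow(logs, parse, sink, trap)
print(sink)   # [('u1', 'impression'), ('u2', 'click')]
print(trap)   # ['corrupt-line']
```

The point of the trap is operational: one corrupt log line should not abort a batch job over billions of events.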
Marketing Funnel – complexities

             Multiple ad partners, different contract terms,
             reporting different metrics at different times,
             click scrubs, etc.

             Campaigns target specific geo/demo, test alternate
             landing pages, probably need to segment the customer
             base…

             These issues make clickstream data large and yet sparse.

             Other issues:

             • seasonal variation
             • fluctuating currency exchange rates
             • distortions due to credit card fraud
             • diminishing returns
             • forecasting requirements

             [Diagram: metrics by funnel stage, with complexities (×)
             marked at several stages]
However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.

scrubs
many vendors, data sources, different metrics to be aligned
lots of roll-ups
Bayesian point estimates
forecasts and dashboards

the social dimension makes this convoluted, not simple
Marketing Funnel – very large scale

             Even a small start-up may need to make decisions about
             billions of events, many millions of users, and millions
             of dollars in annual ad spend.

             Ad networks attempt to simplify and optimize parts of
             the funnel process as a value-add.

             The need for these insights has been a driver for
             Hadoop-related technologies.

             [Diagram: metrics by funnel stage – Awareness: CPM;
             Interest: CTR; Evaluation: behaviors; Conversion: CPA;
             Referral: NPS, social graph, etc.; Repeat: loyalty,
             win back, etc.]
The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
Marketing Funnel – very large scale

            Even a small start-up may need to make decisions about
            billions of events, many millions of users, and millions
            of dollars in annual ad spend.

            Ad networks attempt to simplify and optimize parts of
            the funnel process as a value-add.

            The need for these insights has been a driver for
            Hadoop-related technologies.

                      funnel modeling and optimization
                      requires complex data workflows
                      to obtain the required insights

            [Diagram: metrics by funnel stage, as before]
These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
The Workflow Abstraction
                      [Diagram: Cascading word-count flow – Document
                      Collection → Tokenize → Scrub token → HashJoin
                      (Left; RHS: Stop Word List → Regex token) →
                      GroupBy token → Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
Circa 2008 – Hadoop at scale
             Scenario: Analytics team at a large ad network…

             Company had invested $MM capex in a large data
             warehouse across LOBs

             Mission-critical app had been written as a large SQL
             workflow in the DW

             Marketing funnel metrics were estimated for many
             advertisers, many campaigns, many publishers, many
             customers – billions of calculations daily

             Predictive models matched publisher ~ advertiser and
             campaign ~ user, to optimize marketing funnel
             performance

             [Diagram: funnel stages; DW workflow – query/load
             clickstream → RDBMS → roll-ups → filter → collab,
             per-user recommends]
Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network…

Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
Circa 2008 – Hadoop at scale

             [diagram: marketing funnel stages – Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat – feeding the workflow: clickstream query/load → RDBMS → roll-ups, collab filter, per-user recommends – marked (×) as failing at scale]

             Issues:

              • critical app had hit hard limits for scalability

              • several Tb data, 100’s of servers

              • batch window length vs. failure rate vs. SLA
                in the context of business growth posed
                an existential risk

             We built out a team to address these issues
             as rapidly as possible…
             Needed to re-create the data workflows
             based on Enterprise requirements.
Friday, 01 March 13                                                                                                                                                    14
Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
Circa 2008 – Hadoop at scale

             [diagram: revised workflow – clickstream query/load → msg queue → HDFS → roll-ups, collab filter, per-user recommends → RDBMS]

             Approach:

              • reverse-engineered business process from
                ~1500 lines of undocumented SQL

              • created a large, multi-step Apache Hadoop
                app on AWS

              • leveraged cloud strategy to trade $MM
                capex for lower, scalable opex

              • Amazon identified our app as one of the
                largest Hadoop deployments on EC2

              • our app became a case study for AWS
                prior to Elastic MapReduce launch
Friday, 01 March 13                                                                                       15
Our solution involved dependencies among more than a dozen Hadoop job steps.
Circa 2008 – Hadoop at scale

             [diagram: same workflow, with failure points (×) marked around the msg queue and HDFS]

             Unresolved:

              • ETL was still a separate app

              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow

              • data scientists wore beepers since Ops
                lacked visibility into business process

              • coding directly in MapReduce created
                a staffing bottleneck
Friday, 01 March 13                                                                                                                                                16
This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:
 * staffing bottleneck unless there’s a good abstraction layer
 * operational complexity, mostly due to lack of transparency
 * system integration problems *are* the main problem to solve
Circa 2008 – Hadoop at scale

             [diagram: workflow – clickstream query/load → msg queue → HDFS → roll-ups, collab filter, per-user recommends → RDBMS]

             Unresolved:

              • ETL was still a separate app

              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow

              • data scientists wore beepers since Ops
                lacked visibility into the app’s business logic

              • coding directly in MapReduce created
                a staffing bottleneck

                             a good solution for a large, commercial
                             Apache Hadoop deployment, but
                             workflow management lacked crucial
                             features…

                             which led to a search for a better
                             workflow abstraction
Friday, 01 March 13                                                                                                                                                                           17
While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
The Workflow Abstraction

                       [diagram: Word Count flow – Document Collection → Tokenize → Scrub token (M) → HashJoin Left against Regex token over the Stop Word List (RHS) → GroupBy token → Count (R) → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines
Friday, 01 March 13                                                                                                                                                             18
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.




Friday, 01 March 13                                                                                                                                                            19
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – back to LISP in 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows:

                • leverages JVM and Java-based tools without a need
                    to create an entirely new language
               •    allows many programmers who have J2EE expertise
                    to build apps that leverage the economics of Hadoop
                    clusters




Friday, 01 March 13                                                                                                                           20
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
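The claim that data pipelines are "functional in nature" can be illustrated with a short sketch – plain Java streams, not Cascading itself, and the class name `PipelineSketch` is invented for illustration. Each stage is a pure transformation composed into a pipeline, with no mutable shared state:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelineSketch {
    // Word Count as a functional pipeline: tokenize -> scrub -> group -> count,
    // each stage a pure transformation over a stream of values.
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+"))) // tokenize (the "map" side)
            .filter(token -> !token.isEmpty())                                // scrub empty tokens
            .collect(Collectors.groupingBy(Function.identity(),              // group by token
                                           Collectors.counting()));          // count (the "reduce" side)
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Stream.of("rain shadow", "rain rain"));
        System.out.println(counts.get("rain")); // 3
    }
}
```

The same stage-by-stage composition is what Cascading's pipes express, except that Cascading plans the stages onto MapReduce jobs instead of an in-memory stream.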
quotes…

                       “Cascading gives Java developers the ability to build
                        Big Data applications on Hadoop using their existing
                        skillset … Management can really go out and build a
                        team around folks that are already very experienced
                        with Java. Switching over to this is really a very short
                        exercise.”
                            CIO, Thor Olavsrud
                            2012-06-06
                            cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

                       “Masks the complexity of MapReduce, simplifies the
                        programming, and speeds you on your journey toward
                        actionable analytics … A vast improvement over native
                        MapReduce functions or Pig UDFs.”
                            2012 BOSSIE Awards, James Borck
                            2012-09-18
                            infoworld.com/slideshow/65089




Friday, 01 March 13                                                                           21
Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”

The issues:
 * staffing bottleneck
 * operational complexity
 * system integration
Cascading – deployments

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera
              • 5+ year history of Enterprise production deployments,
                   ASL 2 license, GitHub src, conjars.org
              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.




Friday, 01 March 13                                                                  22
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
examples…

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                           deployments
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


                     github.com/nathanmarz/cascalog/wiki
                     github.com/twitter/scalding/wiki




Friday, 01 March 13                                                                    23
Many case studies, many Enterprise production deployments now for 5+ years.
examples…

                        • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                            in functional programming open source projects atop
                            Cascading – used for their large-scale production
                            deployments
                        •   new case studies for Cascading apps are mostly
                            based on domain-specific languages (DSLs) in JVM
                            languages which emphasize functional programming:

                            Cascalog in Clojure (2010)
                            Scalding in Scala (2012)

                                          Cascading as the basis for workflow
                                          abstractions atop Hadoop and more,
                                          with a 5+ year history of production
                                          deployments across multiple verticals

                       github.com/nathanmarz/cascalog/wiki
                       github.com/twitter/scalding/wiki




Friday, 01 March 13                                                                    24
Cascading as a basis for workflow abstraction, for Enterprise data workflows
The Workflow Abstraction

                       [diagram: Word Count flow – Document Collection → Tokenize → Scrub token (M) → HashJoin Left against Regex token over the Stop Word List (RHS) → GroupBy token → Count (R) → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines


Friday, 01 March 13                                                                                                                                      25
Code samples in Cascading / Cascalog / Scalding, based on Word Count
The Ubiquitous Word Count

             [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

             Definition:

                count how often each word appears
                in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it:

              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

Friday, 01 March 13                                                                                                                                                    26
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
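The map/reduce pseudocode on the slide can be simulated in plain Java with no Hadoop dependency – a sketch only, with the class name `WordCountSim` and the in-memory "shuffle" invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    // map: emit a (word, "1") pair for each word in the document text
    public static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (String w : text.split("\\s+"))
            emitted.add(Map.entry(w, "1"));           // emit(w, "1")
        return emitted;
    }

    // shuffle + reduce: group partial counts by word, then sum each group
    public static Map<String, String> reduceAll(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : pairs)     // the framework's shuffle phase
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        Map<String, String> out = new HashMap<>();
        groups.forEach((word, group) -> {
            int count = 0;
            for (String pc : group)                   // sum the partial counts
                count += Integer.parseInt(pc);
            out.put(word, String.valueOf(count));     // emit(word, String(count))
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(reduceAll(map("doc1", "a b a")).get("a")); // 2
    }
}
```

In Hadoop, the shuffle between `map` and `reduceAll` happens across the cluster; the per-word grouping shown here is the part the framework guarantees.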
word count – conceptual flow diagram

                 [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

                 1 map
                 1 reduce
                18 lines code

                 cascading.org/category/impatient
                 gist.github.com/3900702

Friday, 01 March 13                                                                                                                                                                                                      27
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java

              [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

              String docPath = args[ 0 ];
              String wcPath = args[ 1 ];

              Properties properties = new Properties();
              AppProps.setApplicationJarClass( properties, Main.class );
              HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

              // create source and sink taps
              Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
              Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

              // specify a regex to split "document" text lines into a token stream
              Fields token = new Fields( "token" );
              Fields text = new Fields( "text" );
              RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
              // only returns "token"
              Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
              // determine the word counts
              Pipe wcPipe = new Pipe( "wc", docPipe );
              wcPipe = new GroupBy( wcPipe, token );
              wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

              // connect the taps, pipes, etc., into a flow
              FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
               .addSource( docPipe, docTap )
               .addTailSink( wcPipe, wcTap );
              // write a DOT file and run the flow
              Flow wcFlow = flowConnector.connect( flowDef );
              wcFlow.writeDOT( "dot/wc.dot" );
              wcFlow.complete();



Friday, 01 March 13                                                                                                                                    28
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
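A DOT file is plain Graphviz text, so the `writeDOT` call above costs one line. As a rough sketch of what generating such a file involves – not Cascading's actual output, which also annotates each edge with its tuple fields, and with the class name `DotSketch` invented for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DotSketch {
    // Emit a Graphviz DOT digraph for a linear flow of named stages.
    public static String toDot(String name, String... stages) {
        StringBuilder sb = new StringBuilder("digraph \"" + name + "\" {\n");
        for (int i = 0; i + 1 < stages.length; i++)
            sb.append("  \"" + stages[i] + "\" -> \"" + stages[i + 1] + "\";\n");
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) throws IOException {
        String dot = toDot("wc", "head", "Tokenize", "GroupBy", "Count", "tail");
        Files.writeString(Path.of("wc.dot"), dot); // render with Graphviz: dot -Tpng wc.dot
        System.out.print(dot);
    }
}
```

Because the flow planner holds the whole dependency graph, this kind of diagram comes for free – one reason the slide's notes call this a literate programming methodology.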
word count – generated flow diagram

       [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

       [head]
       Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
         [{2}:'doc_id', 'text']
         [{2}:'doc_id', 'text']
       map
       Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
         [{1}:'token']
         [{1}:'token']
       GroupBy('wc')[by:['token']]
         wc[{1}:'token']
         [{1}:'token']
       reduce
       Every('wc')[Count[decl:'count']]
         [{2}:'token', 'count']
         [{1}:'token']
       Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
         [{2}:'token', 'count']
         [{2}:'token', 'count']
       [tail]


Friday, 01 March 13                                                                                                                                                                                     29
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
word count – Cascalog / Clojure
[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

             (ns impatient.core
               (:use [cascalog.api]
                     [cascalog.more-taps :only (hfs-delimited)])
               (:require [clojure.string :as s]
                         [cascalog.ops :as c])
               (:gen-class))

             (defmapcatop split [line]
               "reads in a line of string and splits it by regex"
               (s/split line #"[\[\](),.)\s]+"))

             (defn -main [in out & args]
               (?<- (hfs-delimited out)
                    [?word ?count]
                    ((hfs-delimited in :skip-header? true) _ ?line)
                    (split ?line :> ?word)
                    (c/count ?count)))

             ; Paul Lam
             ; github.com/Quantisan/Impatient




Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure
             github.com/nathanmarz/cascalog/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]
               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn




Judging from language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala
[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

           import com.twitter.scalding._

           class WordCount(args : Args) extends Job(args) {
             Tsv(args("doc"),
                  ('doc_id, 'text),
                  skipHeader = true)
               .read
               .flatMap('text -> 'token) {
                  text : String => text.split("[ \\[\\](),.]")
                }
               .groupBy('token) { _.size('count) }
               .write(Tsv(args("wc"), writeHeader = true))
           }




Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
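The "collections-style" feel of the Scalding pipeline can be sketched with plain Java streams on a single JVM (a local analogy only, not Scalding or the Cascading API; the class and method names here are hypothetical):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountLocal {
    // split a line into tokens, mirroring the .flatMap('text -> 'token) step
    public static Stream<String> tokenize(String text) {
        return Arrays.stream(text.split("[ \\[\\](),.]"))
                     .filter(t -> !t.isEmpty());
    }

    // mirror .groupBy('token) { _.size('count) } with a local collector
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines.flatMap(WordCountLocal::tokenize)
                    .collect(Collectors.groupingBy(Function.identity(),
                                                   Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints one count per token, e.g. for "a b a" and "b (c)"
        System.out.println(wordCount(Stream.of("a b a", "b (c)")));
    }
}
```

Each call in the chain lines up with a box in the conceptual flow diagram, which is the point the Scalding authors make about "nearly 1:1" correspondence.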
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]
                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language




If you wanted to see what a data services architecture for machine learning at, say, Google scale would look like as an open source project -- that's Scalding. That's what they're doing.
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                  (imagine SOA infra @ Google as an open source project)
                • less learning curve than Cascalog,
                  not as much of a high-level language

              Cascalog and Scalding DSLs leverage the functional aspects
              of MapReduce, helping to limit complexity in process



Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
[flow diagram: Document Collection → M: Tokenize → Scrub token → HashJoin Left / Regex token ← Stop Word List (RHS) → GroupBy token → R: Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
Enterprise Data Workflows
             Back to our marketing funnel, let’s consider
             an example app… at the front end

             LOB use cases drive demand for apps

[architecture diagram: Customers → Web App → Cache → Logs → source tap into a Data Workflow on the Hadoop Cluster, with sink and trap taps; Support, Modeling (PMML), Analytics Cubes, and Reporting on one side; customer profile DBs and Customer Prefs on the other]
LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
              An example… in the back office

              Organizations have substantial investments
              in people, infrastructure, process

[architecture diagram: same front-end and back-office components around the Data Workflow on the Hadoop Cluster]

Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
               An example… for the heavy lifting!

               “Main Street” firms are migrating
               workflows to Hadoop, for cost
               savings and scale-out

[architecture diagram: same front-end and back-office components around the Data Workflow on the Hadoop Cluster]
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Cascading workflows – taps

                •   taps integrate other data frameworks, as tuple streams
                •   these are “plumbing” endpoints in the pattern language
                •   sources (inputs), sinks (outputs), traps (exceptions)
                •   text delimited, JDBC, Memcached,
                    HBase, Cassandra, MongoDB, etc.
                •   data serialization: Avro, Thrift,
                    Kryo, JSON, etc.
                •   extend a new kind of tap in just
                    a few lines of Java

              schema and provenance get
              derived from analysis of the taps

[architecture diagram: source, sink, and trap taps connecting Logs, the Data Workflow on the Hadoop Cluster, and customer profile DBs]
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
Cascading workflows – taps

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];
             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into a token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();

             (callout: source and sink taps for TSV data in HDFS)



Here are the taps in the WordCount source
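The source/sink/trap contract can be sketched in a few lines of stdlib-only Java. This is an illustration with hypothetical names, not the Cascading `Tap` API: a flow reads tuples from a source, writes results to a sink, and diverts failing tuples to a trap instead of killing the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TapSketch {
    // a minimal, hypothetical tap: a named stream endpoint
    public interface Tap<T> {
        List<T> read();          // source behavior
        void write(List<T> t);   // sink behavior
    }

    // an in-memory tap, standing in for Hfs, JDBC, HBase, etc.
    public static class MemoryTap<T> implements Tap<T> {
        private final List<T> data = new ArrayList<>();
        public List<T> read() { return data; }
        public void write(List<T> t) { data.addAll(t); }
    }

    // apply fn to each tuple; failures are diverted to the trap tap
    public static <A, B> void flow(Tap<A> source, Tap<B> sink, Tap<A> trap,
                                   Function<A, B> fn) {
        List<B> out = new ArrayList<>();
        List<A> bad = new ArrayList<>();
        for (A tuple : source.read()) {
            try { out.add(fn.apply(tuple)); }
            catch (RuntimeException e) { bad.add(tuple); }
        }
        sink.write(out);
        trap.write(bad);
    }
}
```

Swapping the `MemoryTap` for a JDBC- or HBase-backed implementation changes the integration without touching the flow logic, which is the design point of taps.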
Cascading workflows – topologies

                •   topologies execute workflows on clusters
                •   flow planner is like a compiler for queries
                    - Hadoop (MapReduce jobs)
                    - local mode (dev/test or special config)
                    - in-memory data grids (real-time)
                •   flow planner can be extended
                    to support other topologies

              blend flows in different topologies
              into the same app – for example,
              batch (Hadoop) + transactions (IMDG)

[architecture diagram: source, sink, and trap taps connecting Logs, the Data Workflow on the Hadoop Cluster, and customer profile DBs]
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
Cascading workflows – topologies

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();



Friday, 01 March 13                                                                                                   42
Here is the flow planner for Hadoop in the WordCount source example.
topologies…




Friday, 01 March 13                                                                             43
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
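For reference, switching a Cascading 2.x app between topologies comes down to choosing a different flow planner. A hedged sketch (a fragment, not a runnable program, since it requires the Cascading library on the classpath; the pipe assembly and `flowDef` are assumed to exist already):

```java
// same pipe assembly, two different flow planners (Cascading 2.x API)
// -- the Hadoop planner compiles the flow into MapReduce jobs:
FlowConnector hadoopConnector = new HadoopFlowConnector( properties );

// -- the local planner runs the identical assembly in-memory,
//    handy for tests on a laptop:
FlowConnector localConnector = new LocalFlowConnector( properties );

Flow flow = hadoopConnector.connect( flowDef ); // or localConnector
flow.complete();
```

The point of the abstraction is that the assembly itself does not change; only the connector (and the taps) vary per topology.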
Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base
               • ANSI SQL parser/optimizer atop the Cascading flow planner
               • JDBC driver to integrate into existing tools and app servers
               • relational catalog over a collection of unstructured data
               • SQL shell prompt to run queries
               • enable analysts without retraining on Hadoop, etc.
               • transparency for Support, Ops, Finance, et al.
               • a language for queries – not a database, but ANSI SQL
                 as a DSL for workflows

               [diagram: Enterprise data workflow – Customers, Web App, Logs,
                Cache, Support traps/taps, Modeling (PMML), Data Workflow,
                Analytics Cubes, Reporting, Customer profile DBs, Hadoop Cluster]

Friday, 01 March 13                                                                                                                                                                        44
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.

BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
ANSI SQL – CSV data in local file system




               cascading.org/lingual


Friday, 01 March 13                                                                              45
The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
ANSI SQL – shell prompt, catalog




                cascading.org/lingual


Friday, 01 March 13                                                                       46
Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries




              cascading.org/lingual


Friday, 01 March 13                                                        47
Here’s an example SQL query on that “employee” test database from MySQL.
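A query of the kind shown on this slide might look like the following – an illustrative sketch only, using table and column names from the MySQL employees test database; the actual schema name depends on how the catalog was registered with Lingual:

```sql
-- illustrative ANSI SQL, run from the Lingual shell;
-- table/column names follow the MySQL "employees" test database
SELECT emp_no, last_name, first_name, hire_date
FROM employees.employees
WHERE hire_date > DATE '1999-01-01'
ORDER BY hire_date DESC;
```

Under the hood, Lingual plans such a query as a Cascading flow rather than executing it against a database engine.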
Cascading workflows – machine learning

              • migrate workloads: SAS, Teradata, etc., exporting
                predictive models as PMML
              • Cascading creates parallelized models to run at scale
                on Hadoop clusters
              • Random Forest, Logistic Regression, GLM, Decision Trees,
                K-Means, Hierarchical Clustering, etc.
              • integrate with other libraries (Matrix API, etc.) and great
                open source tools (R, Weka, KNIME, RapidMiner, etc.)
              • 2 lines of code or pre-built JAR

             Run multiple variants of models as
             customer experiments

             [diagram: same Enterprise data workflow – Customers, Web App,
              Logs, Cache, Support, Modeling (PMML), Data Workflow, Analytics
              Cubes, Reporting, Customer profile DBs, Hadoop Cluster]


Friday, 01 March 13                                                                                                                                         48
PMML has been around for a while, and export is supported by nearly every commercial analytics platform,
covering a wide variety of predictive modeling algorithms.

Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.

Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)

Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
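To give a sense of the interchange format: a PMML document is plain XML, roughly of the following shape – a heavily abbreviated sketch with illustrative field names, not the exact output of any particular exporter:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="random forest model (illustrative sketch)"/>
  <DataDictionary numberOfFields="3">
    <DataField name="label" optype="categorical" dataType="string"/>
    <DataField name="var0"  optype="continuous"  dataType="double"/>
    <DataField name="var1"  optype="continuous"  dataType="double"/>
  </DataDictionary>
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0"/>
      <MiningField name="var1"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <!-- one <Segment> holding a <TreeModel> per tree in the ensemble -->
    </Segmentation>
  </MiningModel>
</PMML>
```

Because the model is declarative data rather than code, a consumer such as Cascading can translate it into a parallel workflow without ever touching the tool that trained it.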
model creation in R
                       ## libraries required for model build + PMML export
                       library(randomForest)
                       library(pmml)
                       library(XML)

                       ## train a RandomForest model

                       f <- as.formula("as.factor(label) ~ .")
                       fit <- randomForest(f, data_train, ntree=50)

                       ## test the model on the holdout test set

                       print(fit$importance)
                       print(fit)

                       predicted <- predict(fit, data)
                       data$predicted <- predicted
                       confuse <- table(pred = predicted, true = data[,1])
                       print(confuse)

                       ## export predicted labels to TSV

                       write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                         quote=FALSE, sep="\t", row.names=FALSE)

                       ## export RF model to PMML

                       saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




               cascading.org/pattern


Friday, 01 March 13                                                                                                              49
Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
model run at scale as a Cascading app




                [conceptual flow diagram: Customer Orders → (M) Classify, using
                 the PMML Model → Scored Orders → Assert → (R) GroupBy token →
                 Count → Confusion Matrix; data exceptions routed to Failure Traps]

               cascading.org/pattern


Friday, 01 March 13                                                                                                                                                                                             50
Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
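The final tally can be sketched in plain Java, independent of Cascading – a minimal illustration of what counting a confusion matrix means here, with hypothetical label values:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConfusionMatrix {
    // count (actual, predicted) label pairs from scored records;
    // each row is { actualLabel, predictedLabel }
    public static Map<String, Integer> tally(List<String[]> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows)
            counts.merge(row[0] + "/" + row[1], 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> scored = Arrays.asList(
            new String[]{ "1", "1" },   // true positive
            new String[]{ "1", "0" },   // false negative
            new String[]{ "0", "0" } ); // true negative
        System.out.println(tally(scored));
    }
}
```

In the Cascading app this tally happens on the Reduce side after the GroupBy, but the bookkeeping is exactly this simple.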
model run at scale as a Cascading app
                      public class Main {
                        public static void main( String[] args ) {
                          String pmmlPath = args[ 0 ];
                          String ordersPath = args[ 1 ];
                          String classifyPath = args[ 2 ];
                          String trapPath = args[ 3 ];

                            Properties properties = new Properties();
                            AppProps.setApplicationJarClass( properties, Main.class );
                            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

                            // create source and sink taps
                             Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
                             Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
                             Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

                            // define a "Classifier" model from PMML to evaluate the orders
                            ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
                            Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

                            // connect the taps, pipes, etc., into a flow
                            FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                             .addSource( classifyPipe, ordersTap )
                             .addTrap( classifyPipe, trapTap )
                             .addSink( classifyPipe, classifyTap );

                            // write a DOT file and run the flow
                            Flow classifyFlow = flowConnector.connect( flowDef );
                            classifyFlow.writeDOT( "dot/classify.dot" );
                            classifyFlow.complete();
                          }
                      }

Friday, 01 March 13                                                                                                                      51
Source code for a simple Cascading app that runs PMML models in general.
PMML support…




Friday, 01 March 13                                                   52
Popular tools which can create predictive models for export as PMML
Cascading workflows – test-driven development

               • assert patterns (regex) on the tuple streams
               • adjust assert levels, like log4j levels
               • trap edge cases as “data exceptions”
               • TDD at scale:
                 1. start from raw inputs in the flow graph
                 2. define stream assertions for each stage of transforms
                 3. verify exceptions, code to remove them
                 4. when impl is complete, app has full test coverage
               • TDD follows from Cascalog’s composable subqueries
               • redirect traps in production to Ops, QA, Support, Audit, etc.

               [diagram: same Enterprise data workflow as before]


Friday, 01 March 13                                                                                                                                                                                                            53
TDD is not usually high on the list when people start discussing Big Data apps.

The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application.

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology: ad-hoc queries are expressed as logic predicates, then those predicates are composed into large-scale apps.
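Concretely, a stream assertion in Cascading 2.x looks roughly like this – a fragment rather than a runnable program, since it requires the Cascading library; AssertMatches and AssertionLevel are classes in that API, while the pipe and the pattern here are illustrative:

```java
// fail the flow if a tuple does not match the expected pattern;
// the STRICT level can be planned out for production, like a log4j level
AssertMatches assertMatches = new AssertMatches( "\\d{6,16}" );
pipe = new Each( pipe, AssertionLevel.STRICT, assertMatches );
```

Tuples that violate an assertion (or throw during an operation) can be routed to a trap tap instead of killing the job, which is what turns edge cases into inspectable “data exceptions.”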
Cascading workflows – TDD meets API principles

               • specify what is required, not how it must be achieved
               • plan far ahead, before consuming cluster resources –
                 fail fast prior to submit
               • fail the same way twice – deterministic flow planners help
                 reduce engineering costs for debugging at scale
               • same JAR, any scale – app does not require a recompile
                 to change data taps or cluster topologies

               [diagram: same Enterprise data workflow as before]


Friday, 01 March 13                                                                                         54
Some of the design principles for the pattern language
Two Avenues…

              Enterprise: must contend with
              complexity at scale every day…
              incumbents extend current practices and
              infrastructure investments – using J2EE,
              ANSI SQL, SAS, etc. – to migrate
              workflows onto Apache Hadoop while
              leveraging existing staff

              Start-ups: crave complexity and
              scale to become viable…
              new ventures move into Enterprise space
              to compete using relatively lean staff,
              while leveraging sophisticated engineering
              practices, e.g., Cascalog and Scalding

              [chart: complexity ➞ vs. scale ➞]

Friday, 01 March 13                                                                                                                           55
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

               Enterprise: must contend with
               complexity at scale every day…
               incumbents extend current practices and
               infrastructure investments – using J2EE,
               ANSI SQL, SAS, etc. – to migrate
               workflows onto Apache Hadoop while
               leveraging existing staff

               Start-ups: crave complexity and
               scale to become viable…
               new ventures move into Enterprise space
               to compete using relatively lean staff,
               while leveraging sophisticated engineering
               practices, e.g., Cascalog and Scalding

               Hadoop almost never gets used
               in isolation; data workflows define
               the “glue” required for system
               integration of Enterprise apps

               [chart: complexity ➞ vs. scale ➞]

Friday, 01 March 13                                                                  56
Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple different ways to arrive at the party.
The Workflow Abstraction
                [diagram: WordCount flow – Document Collection → (M) Tokenize →
                 Scrub token → HashJoin Left with Stop Word List (RHS) →
                 (R) GroupBy token → Count → Word Count]

                        1. Funnel
                        2. Circa 2008
                        3. Cascading
                        4. Sample Code
                        5. Workflows
                        6. Abstraction
                        7. Trendlines


Friday, 01 March 13                                                                                                                                                             57
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
             In formal terms, this provides a pattern language.


Friday, 01 March 13                                                                                                                      58
A pattern language, based on the metaphor of “plumbing”
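The "plumbing" elements above can be modeled in a few lines. This is a hypothetical, self-contained Python sketch — not the Cascading Java API — where a tuple flow is an iterator of dicts and each pipe stage is a function over that flow, loosely mirroring the Each, HashJoin, GroupBy, and Every elements in the word-count diagram:

```python
# Illustrative model only, not Cascading itself: a tuple flow is an
# iterator of dicts; each "pipe" consumes one flow and yields another.
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of"}  # stands in for the Stop Word List tap

def tokenize(flow):
    # like Each('token')[RegexSplitGenerator]: one tuple in, many out
    for tup in flow:
        for token in re.split(r"\W+", tup["text"].lower()):
            if token:
                yield {"token": token}

def scrub(flow):
    # filter stage standing in for the HashJoin against the stop-word RHS
    for tup in flow:
        if tup["token"] not in STOP_WORDS:
            yield tup

def word_count(flow):
    # GroupBy('token') + Every(Count) collapsed into one reduce step
    counts = Counter(tup["token"] for tup in flow)
    return [{"token": t, "count": n} for t, n in sorted(counts.items())]

docs = [{"doc_id": "doc01", "text": "A rain shadow is a dry area"}]
print(word_count(scrub(tokenize(iter(docs)))))
```

Composing the stages as nested calls reads much like wiring pipes end to end — the same "plumbing" intuition the Java API builds on.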
references…

                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices.

                      amazon.com/dp/0195019199



                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by “Gang of Four”.
                      amazon.com/dp/0201633612




Friday, 01 March 13                                                                                                 59
Christopher Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             design principles of the pattern language ensure best
             practices for robust, parallel data workflows at scale

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
             In formal terms, this provides a pattern language.


Friday, 01 March 13                                                                                                                       60
The pattern language provides a structured method for solving large,
complex design problems where the syntax of the language promotes
use of best practices – which also addresses staffing issues
Cascading workflows – literate programming

             Cascading workflows generate their own visual
             documentation: flow diagrams

             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             In formal terms, flow diagrams leverage a methodology
             called literate programming.
             Provides intuitive, visual representations for apps, great
             for cross-team collaboration.


Friday, 01 March 13                                                                                                                                                                                                                          61
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
references…

                       by Don Knuth
                       Literate Programming
                       Univ of Chicago Press, 1992
                       literateprogramming.com/

                       “Instead of imagining that our main task is
                        to instruct a computer what to do, let us
                        concentrate rather on explaining to human
                        beings what we want a computer to do.”




Friday, 01 March 13                                                                                       62
Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
examples…

                      • Scalding apps have nearly 1:1 correspondence
                        between function calls and the elements in their
                        flow diagrams – excellent elision and literate
                        representation
                      • noticed on the cascading-users email list:
                        when troubleshooting issues, Cascading experts ask
                        novices to provide an app’s flow diagram (generated
                        as a DOT file), sometimes in lieu of showing code

                      In formal terms, a flow diagram is a directed, acyclic
                      graph (DAG) on which lots of interesting math applies
                      for query optimization, predictive models about app
                      execution, parallel efficiency metrics, etc.

                      [flow diagram, rendered from the DOT file:
                       [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
                       → Each('token')[RegexSplitGenerator[decl:'token'][args:1]]  (map)
                       → GroupBy('wc')[by:['token']]
                       → Every('wc')[Count[decl:'count']]  (reduce)
                       → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
                       → [tail]]
Friday, 01 March 13                                                                                                                                                                                 63
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
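Because a flow diagram is a DAG, standard graph algorithms apply to it directly. A hypothetical Python sketch (not Cascading’s actual planner): the word-count flow from the DOT output above, reduced to an adjacency structure and topologically sorted — the same structure a planner walks when optimizing or scheduling a workflow:

```python
# Sketch only: a flow diagram as a DAG, with node names taken from the
# word-count DOT listing. TopologicalSorter maps node -> predecessors.
from graphlib import TopologicalSorter  # Python 3.9+

flow = {
    "Each(token)":  {"Hfs(rain.txt)"},
    "GroupBy(wc)":  {"Each(token)"},
    "Every(count)": {"GroupBy(wc)"},
    "Hfs(output)":  {"Every(count)"},
}

# a valid execution order, from source tap to sink tap
order = list(TopologicalSorter(flow).static_order())
print(order)
```

For a simple chain like this the order is forced; for diamond-shaped flows (e.g., the HashJoin with a stop-word RHS) the planner has real freedom, and that freedom is exactly where query optimization and parallel-efficiency math comes in.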
Cascading workflows – business process

            Following the essence of literate programming, Cascading
            workflows provide statements of business process
            This recalls a sense of business process management
            for Enterprise apps (think BPM/BPEL for Big Data)
            As a separation of concerns between business process
            and implementation details (Hadoop, etc.)
            This is especially apparent in large-scale Cascalog apps:
                “Specify what you require, not how to achieve it.”
            By virtue of the pattern language, the flow planner used
            in a Cascading app determines how to translate business
            process into efficient, parallel jobs at scale.




Friday, 01 March 13                                                      64
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
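The “what, not how” separation can be illustrated with a toy planner. This is a hypothetical Python sketch — the RULES table and field names are invented for illustration, and Cascalog’s real planner is far more sophisticated — showing an app that declares which fields it requires, while the planner derives the ordered operations:

```python
# Toy declarative planner: the app states required fields; the planner
# resolves them against a rule table to produce an operation order.
RULES = {
    # field produced -> (fields required, operation name)
    "token": (("text",),  "tokenize"),
    "count": (("token",), "group+count"),
}

def plan(required, available):
    """Resolve required fields into an ordered list of operations."""
    steps = []
    def resolve(field):
        if field in available:
            return
        needs, op = RULES[field]
        for n in needs:        # satisfy prerequisites first
            resolve(n)
        if op not in steps:
            steps.append(op)
        available.add(field)
    for f in required:
        resolve(f)
    return steps

print(plan(["count"], {"doc_id", "text"}))
```

The caller never says *how* to compute “count” — only that it is required — and the ordering falls out of the dependency resolution, which is the essence of the separation between business process and implementation details.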
references…

                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      dl.acm.org/citation.cfm?id=362685
                      Rather than arguing SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on:
                            the process of structuring data
                      That’s what apps do – Making Data Work




Friday, 01 March 13                                                                                                                                             65
Focus on *the process of structuring data*
which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                    Moseley & Marks, 2006
                    “Out of the Tar Pit”
                    goo.gl/SKspn




Friday, 01 March 13                                                             66
A more contemporary statement along similar lines...
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                       Moseley & Marks, 2006
                       “Out of the Tar Pit”
                       goo.gl/SKspn

             several theoretical aspects converge into software engineering
             practices which mitigate the complexity of building and
             maintaining Enterprise data workflows



Friday, 01 March 13                                                                67
The Workflow Abstraction
                      [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


Friday, 01 March 13                                                                                                                                                                                                          68
Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data.
Where did Big Data come from, and where is this kind of work headed?
Q3 1997: inflection point

             Four independent teams were working toward horizontal
             scale-out of workflows based on commodity hardware.
             This effort prepared the way for huge Internet successes
             in the 1997 holiday season… AMZN, EBAY, Inktomi
             (YHOO Search), then GOOG

             MapReduce and the Apache Hadoop open source stack
             emerged from this.




Friday, 01 March 13                                                                                                        69
Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
Circa 1996: pre- inflection point

              [architecture diagram: Stakeholder → strategy → Product → requirements → Engineering → optimized code → Web App ↔ Customers; Web App → transactions → RDBMS → SQL Query result sets → BI Analysts → Excel pivot tables, PowerPoint slide decks → Stakeholder]
Friday, 01 March 13                                                                                                           70
Ah, teh olde days - Perl and C++ for CGI :)

Feedback loops shown in red represent data innovations at the time…

Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
Circa 2001: post- big ecommerce successes

              [architecture diagram: Stakeholder ← dashboards ← Algorithmic Modeling (models, recommenders + classifiers, aggregation, SQL Query result sets); Product → Engineering → servlets → Web Apps (UX) ↔ Customers; Middleware → event history → Logs → ETL; customer transactions → RDBMS; DW]
Friday, 01 March 13                                                                                                                                                                                                                    71
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the
marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
Circa 2013: clusters everywhere

              [architecture diagram: Domain Expert, Data Scientist, App Dev, Ops on the “introduced capability” side; Prod, s/w dev, Eng, Ops on the “existing SDLC” side; Workflow at center — business process, dashboard metrics, data science discovery + modeling, taps; App History → Planner → optimized capacity; Use Cases Across Topologies: Hadoop etc. (batch), Log Events (near time), In-Memory Data Grid, over a Cluster Scheduler; Data Products → services → Web Apps, Mobile, etc. ↔ Customers (social interactions, transactions, content); DW, RDBMS]
Friday, 01 March 13                                                                                                                                                                             72
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Mesos, etc.
Asymptotically…

               • long-term trends toward more instrumentation
                 of Enterprise data workflows:
                 - workflow abstraction enables business cases
                 - more machine data collected about apps
                 - flow diagram (DAG) as unit of work
                   (abstract type for machine data)
                 - evolving feedback loops convert machine data
                   into actionable insights and optimizations

               • industry moves beyond common needs of ad-hoc
                 queries on logs and basic reporting, as a new class
                 of complex data workflows emerges to provide
                 the insights required by Enterprise

               • end game is less about “bigness” of data, more about
                 managing complexity in the process of structuring data

               [sidebar stack: DSL → Planner/Optimizer → Workflow ↔ App History → Cluster → Cluster Scheduler]
Friday, 01 March 13                                                                              73
In summary…
references…

                       by Leo Breiman
                       Statistical Modeling: The Two Cultures
                       Statistical Science, 2001
                       bit.ly/eUTh9L




Friday, 01 March 13                                                                                                                                                                                                         74
Leo Breiman wrote an excellent paper in 2001, “Statistical Modeling: The Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
references…

                      Amazon
                      “Early Amazon: Splitting the website” – Greg Linden
                      glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

                      eBay
                      “The eBay Architecture” – Randy Shoup, Dan Pritchett
                      addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
                      addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

                      Inktomi (YHOO Search)
                      “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
                      youtube.com/watch?v=E91oEn1bnXM

                      Google
                      “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
                      youtube.com/watch?v=qsan-GQaeyk
                      perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
                      “The Birth of Google” – John Battelle
                      wired.com/wired/archive/13.08/battelle.html




Friday, 01 March 13                                                                              75
In their own words…
references…


                        by Paco Nathan
                        Enterprise Data Workflows
                        with Cascading
                        O’Reilly, 2013
                        amazon.com/dp/1449358721




Friday, 01 March 13                                           76
Some of this material comes from an upcoming O’Reilly book:
“Enterprise Data Workflows with Cascading”

This should be in Rough Cuts soon -
scheduled to be out in print this June.

Many thanks to my wonderful editor, Courtney Nash.
drill-down…


                      blog, dev community, code/wiki/gists, maven repo,
                      commercial products, career opportunities:
                            cascading.org
                            zest.to/group11
                            github.com/Cascading
                            conjars.org
                            goo.gl/KQtUL
                            concurrentinc.com

                      join us for very interesting work!                  Copyright @2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           77
Links to our open source projects, developer community, etc…

contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)

The Workflow Abstraction

  • 1. “The Workflow Abstraction” Strata SC 2013-02-28 Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Friday, 01 March 13 1 Background: dual in quantitative and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
  • 2. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 2 This talk is about the workflow abstraction: * the business process of structuring data * the practices of building robust apps at scale * the open source projects for Enterprise Data Workflows We’ll consider some theory, examples, best practices, trendlines -- what are the drivers that brought us, and where is this work heading toward? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
  • 3. Marketing Funnel – overview In reference to Making Data Work… Customers Almost every business uses a model similar to this – give or take a few steps. Campaigns Customer leads go in at the top, Awareness those get refined through several stages, then results flow out the bottom. Interest Evalutation Conversion Referral Repeat Friday, 01 March 13 3 Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
  • 4. Marketing Funnel – clickstream Different funnel stages get represented in ecommerce by events captured in Customers log files, as a class of machine data called clickstream Campaigns Impression • ad impressions Awareness • URL clicks Click • landing page views Interest • new user registrations Sign Up Evalutation • session cookies Purchase • online purchases Conversion • social network activity "Like" • etc. Referral Repeat Friday, 01 March 13 4 Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
  • 5. Marketing Funnel – metrics A variety of clickstream metrics can be used as performance indicators Customers at different stages of the funnel: Campaigns • CPM: cost per thousand Impression • CTR: click-through rate Awareness CPM • CPA: cost per action Click • etc. Interest CTR Sign Up Evalutation behaviors Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 5 The many different highly-nuanced metrics which apply are mind-boggling :)
  • 6. Marketing Funnel – example calculations Customers Campaigns Awareness Interest metric cost events formula rate Evalutation Conversion Referral Repeat $4,000 CPM $4,000 10^6 ÷ $4.00 (10^6 ÷ 10^3) 3∙10^3 CTR - 3∙10^3 ÷ 10^6 0.3% $4,000 CPA - 20 ÷ $200 20 Friday, 01 March 13 6 Here are examples of the kinds of calculations performed...
  • 7. Marketing Funnel – predictive model Given these metrics, we can go further to estimate cost per paying user (CPP) Customers customer lifetime value (LTV), etc. Campaigns Then we can build a predictive model for return on investment (ROI) per customer, Awareness summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP Interest As an example, after crunching lots of logs, Evalutation suppose that… Conversion CPP = $200 LTV = $2000 Referral ROI = ($2000 − $200) ∕ $200 Repeat for a 9x multiple Friday, 01 March 13 7 For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
  • 8. Marketing Funnel – example architecture Customers Campaigns Customers Awareness Let’s consider an example architecture Interest Evalutation for calculating, reporting, and taking action Web Conversion on funnel metrics, based on large-scale App Referral Repeat clickstream data… logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Friday, 01 March 13 8 Here’s an example architecture of using clickstream metrics within an online business.
  • 9. Marketing Funnel – complexities Multiple ad partners, different contracts terms, reporting different metrics at Customers × × different times, click scrubs, etc. Campaigns Campaigns target specific geo/demo, Impression × × test alternate landing pages, probably Awareness CPM need to segment customer base… Click These issues make clickstream data Interest CTR large and yet sparse. Sign Up Evalutation behaviors Other issues: × Purchase • seasonal variation Conversion CPA • fluctuating currency exchange rates "Like" Referral NPS, social graph, etc. • distortions due to credit card fraud • diminishing returns Repeat loyalty, win back, etc. • forecasting requirements Friday, 01 March 13 9 However, real life intercedes. In many businesses, this is a complicated model to calculate correctly. scrubs many vendors, data sources, different metrics to be aligned lots of roll-ups Bayesian point estimates forecasts and dashboards social dimension makes this convoluted not simple
  • 10. Marketing Funnel – very large scale Even a small start-up may need to make decisions about billions of Customers events, many millions of users, and millions of dollars in annual ad spend. Campaigns Impression Ad networks attempt to simplify and Awareness CPM optimize parts of the funnel process Click as a value-add. Interest CTR The need for these insights has been a Sign Up driver for Hadoop-related technologies. Evalutation behaviors Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 10 The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
  • 11. Marketing Funnel – very large scale Even a small start-up may need to make decisions about billions of Customers events, many millions of users, and millions of dollars in annual ad spend. Campaigns Impression Ad networks attempt to simplify and Awareness CPM optimize parts of the funnel process Click as a value-add. funnel modeling and optimization Interest CTR The need for these insights has been a Sign Up driver for Hadoop-relatedrequires complex data workflows technologies. Evalutation behaviors to obtain the required insights Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 11 These needs imply complex data workflows. It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.
  • 12. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 12 A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
  • 13. Circa 2008 – Hadoop at scale Customers Scenario: Analytics team at a large ad network… Campaigns Awareness Company had invested $MM capex in a Interest large data warehouse across LOBs Evalutation Conversion Mission-critical app had been written as Referral collab Repeat a large SQL workflow in the DW roll-ups filter Marketing funnel metrics were estimated for many advertisers, many campaigns, per-user recommends many publishers, many customers – billions of calculations daily query/load Predictive models matched publisher ~ advertiser clickstream RDBMS and campaign ~ user, to optimize marketing funnel performance Friday, 01 March 13 13 Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.. Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
  • 14. Circa 2008 – Hadoop at scale Customers Issues: Campaigns Awareness • critical app had hit hard limits for scalability Interest • several Tb data, 100’s of servers Evalutation Conversion • batch window length vs. failure rate vs. SLA collab Referral Repeat in the context of business growth posed roll-ups filter an existential risk × We built out a team to address these issues per-user recommends as rapidly as possible… Needed to re-create that data workflows query/load based on Enterprise requirements. clickstream RDBMS Friday, 01 March 13 14 Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale-out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
  • 15. Circa 2008 – Hadoop at scale Approach: roll-ups collab filter • reverse-engineered business process from ~1500 lines of undocumented SQL per-user • created a large, multi-step Apache Hadoop recommends app on AWS HDFS • leveraged cloud strategy to trade $MM capex for lower, scalable opex • Amazon identified our app as one of the msg queue largest Hadoop deployments on EC2 • our app became a case study for AWS query/load RDBMS prior to Elastic MapReduce launch clickstream Friday, 01 March 13 15 Our solution involved dependencies among more than a dozen Hadoop job steps.
  • 16. Circa 2008 – Hadoop at scale × Unresolved: roll-ups collab filter • ETL was still a separate app • difficult to handle exceptions, notifications, per-user debugging, etc., across the entire workflow recommends HDFS • data scientists wore beepers since Ops × × lacked visibility into business process • coding directly in MapReduce created a staffing bottleneck msg queue query/load clickstream RDBMS Friday, 01 March 13 16 This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM -- for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows: * staffing bottleneck unless there’s a good abstraction layer * operational complexity, mostly due to lack of transparency * system integration problems *are* the main problem to solve
  • 17. Circa 2008 – Hadoop at scale Unresolved: roll-ups collab filter • ETL was still a separate app • difficult to handle exceptions, notifications, per-user debugging, etc., across the entire workflow recommends • data scientists worea good since Ops for a large, commercial beepers solution HDFS lacked visibility into Apachebusiness logic deployment, but the app’s Hadoop • coding directly in MapReduce created a staffing bottleneck workflow management lacked crucial msg queue features… query/load which led to a search for a better clickstream RDBMS workflow abstraction Friday, 01 March 13 17 While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
  • 18. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 18 Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
  • 19. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products. Wensel was following the Nutch open source project – before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology. Friday, 01 March 13 19 Cascading initially grew from interaction with the Nutch project, before Hadoop had a name API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context, with any abstraction layer.
  • 20. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without an need to create an entirely new language • allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters Friday, 01 March 13 20 Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
  • 21. quotes… “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud 2012-06-06 cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck 2012-09-18 infoworld.com/slideshow/65089 Friday, 01 March 13 21 Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch” The issues: * staffing bottleneck * operational complexity * system integration
  • 22. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc. • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera • 5+ history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc. Friday, 01 March 13 22 Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 23. examples… • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Friday, 01 March 13 23 Many case studies, many Enterprise production deployments now for 5+ years.
  • 24. examples… • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascading as the basis for workflow abstractions atop Hadoop and more, Cascalog in Clojure (2010) Scalding in Scala (2012) with a 5+ year history of production deployments across multiple verticals github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Friday, 01 March 13 24 Cascading as a basis for workflow abstraction, for Enterprise data workflows
  • 25. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 25 Code samples in Cascading / Cascalog / Scalding, based on Word Count
  • 26. The Ubiquitous Word Count Document Collection Definition: M Tokenize GroupBy token Count count how often each word appears count how often each word appears R Word Count inin a collection of text documents a collection of text documents This simple program provides an excellent test case for parallel processing, since it illustrates: void map (String doc_id, String text): for each word w in segment(text): • requires a minimal amount of code emit(w, "1"); • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction void reduce (String word, Iterator group): • is not many steps away from useful search indexing int count = 0; • serves as a “Hello World” for Hadoop apps for each pc in group: count += Int(pc); Any distributed computing framework which can run Word emit(word, String(count)); Count efficiently in parallel at scale can handle much larger and more interesting compute problems. Friday, 01 March 13 26 Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
  • 27. word count – conceptual flow diagram Document Collection Tokenize GroupBy M token Count R Word Count 1 map cascading.org/category/impatient 1 reduce 18 lines code gist.github.com/3900702 Friday, 01 March 13 27 Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
  • 28. word count – Cascading app in Java Document Collection String docPath = args[ 0 ]; Tokenize GroupBy token String wcPath = args[ 1 ]; M Count Properties properties = new Properties(); R Word Count AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Friday, 01 March 13 28 Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop 2nd to last line: generates a DOT file for the flow diagram
  • 29. word count – generated flow diagram Document Collection Tokenize [head] M GroupBy token Count R Word Count Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] map Each('token')[RegexSplitGenerator[decl:'token'][args:1]] [{1}:'token'] [{1}:'token'] GroupBy('wc')[by:['token']] wc[{1}:'token'] [{1}:'token'] reduce Every('wc')[Count[decl:'count']] [{2}:'token', 'count'] [{1}:'token'] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] [{2}:'token', 'count'] [{2}:'token', 'count'] [tail] Friday, 01 March 13 29 As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
  • 30. word count – Cascalog / Clojure Document Collection (ns impatient.core M Tokenize GroupBy token Count   (:use [cascalog.api] R Word Count         [cascalog.more-taps :only (hfs-delimited)])   (:require [clojure.string :as s]             [cascalog.ops :as c])   (:gen-class)) (defmapcatop split [line]   "reads in a line of string and splits it by regex"   (s/split line #"[[](),.)s]+")) (defn -main [in out & args]   (?<- (hfs-delimited out)        [?word ?count]        ((hfs-delimited in :skip-header? true) _ ?line)        (split ?line :> ?word)        (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient Friday, 01 March 13 30 Here is the same Word Count app written in Clojure, using Cascalog.
  • 31. word count – Cascalog / Clojure Document Collection github.com/nathanmarz/cascalog/wiki Tokenize GroupBy M token Count R Word Count • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn Friday, 01 March 13 31 From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
• 32. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

Friday, 01 March 13 32
Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
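For comparison, the same conceptual flow (tokenize → group by token → count) collapses to a few lines of single-machine Python, which is handy for checking expected results on a small sample before running the distributed version. The split regex below is an assumption modeled on the pattern used in the slides:

```python
import re
from collections import Counter

def word_count(lines):
    """Tokenize each line, then group by token and count --
    the same Tokenize -> GroupBy -> Count shape as the flow diagram."""
    counts = Counter()
    for line in lines:
        # split on spaces, brackets, parens, commas, periods
        for token in re.split(r"[ \[\](),.]+", line):
            if token:
                counts[token] += 1
    return counts

print(word_count(["rain shadow", "rain dry"]))
```

The distributed versions differ mainly in where the group-by happens: on a cluster, the `groupBy('token)` step becomes the shuffle between map and reduce.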
• 33. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog, not as much of a high-level language
Friday, 01 March 13 33
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
• 34. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less learning curve than Cascalog, not as much of a high-level language
Callout: the Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
Friday, 01 March 13 34
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
• 35. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 35
Tracking back to the Marketing Funnel as an example workflow… Let’s consider how Cascading apps incorporate other components beyond Hadoop
• 36. Enterprise Data Workflows
Back to our marketing funnel, let’s consider an example app… at the front end. LOB use cases drive demand for apps.
[architecture diagram: Customers, Web App, logs/Logs/Cache, Support, source/sink/trap taps, Data Modeling, PMML, Workflow, Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting]
Friday, 01 March 13 36
LOB use cases drive the demand for Big Data apps
• 37. Enterprise Data Workflows
An example… in the back office. Organizations have substantial investments in people, infrastructure, process.
[architecture diagram, as before]
Friday, 01 March 13 37
Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes
• 38. Enterprise Data Workflows
An example… for the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out.
[architecture diagram, as before]
Friday, 01 March 13 38
“Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
• 39. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
• schema and provenance get derived from analysis of the taps
[architecture diagram, as before]
Friday, 01 March 13 39
Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
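As a rough single-machine analogue of what a tap does, here is a hedged Python sketch: a “source” that turns a TSV file with a header row into a stream of named tuples, and a “sink” that writes one back out. This is only the shape of what a Cascading `Hfs`/`TextDelimited` tap provides, minus the distributed plumbing, schema analysis, and serialization options:

```python
import csv

def source_tap(path):
    """Source: read a TSV file with a header row,
    yielding each record as a dict 'tuple'."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row

def sink_tap(path, fieldnames, rows):
    """Sink: write dict 'tuples' back out as TSV with a header row."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        w.writeheader()
        w.writerows(rows)
```

The point of the abstraction is that the pipe assembly in between never needs to know whether its endpoints are flat files, JDBC, HBase, etc.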
• 40. Cascading workflows – taps

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (for TSV data in HDFS)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Friday, 01 March 13 40
Here are the taps in the WordCount source
• 41. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries:
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
• blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
[architecture diagram, as before]
Friday, 01 March 13 41
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
• 42. Cascading workflows – topologies

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );

// flow planner for the Apache Hadoop topology
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// … remainder of the Word Count source as on the previous slide:
// taps, pipes, and FlowDef, then connect and run
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Friday, 01 March 13 42
Here is the flow planner for Hadoop in the WordCount source
  • 43. example topologies… Friday, 01 March 13 43 Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
• 44. Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
• a language for queries – not a database, but ANSI SQL as a DSL for workflows
[architecture diagram, as before]
Friday, 01 March 13 44
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer. BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
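Lingual's specifics aside, the “SQL as a query language, not a database” idea can be previewed locally. Here is a hedged sketch using Python's `sqlite3` module: flat records get loaded into an ad-hoc catalog, then queried with plain SQL. The table and column names are invented for illustration (loosely echoing the MySQL `employees` test database mentioned on the next slides):

```python
import sqlite3

# Load flat records into an ad-hoc in-memory catalog, then query with SQL.
# Table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_no INTEGER, last_name TEXT, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(10001, "Facello", "1986-06-26"), (10002, "Simmel", "1985-11-21")],
)
rows = conn.execute(
    "SELECT last_name FROM employees"
    " WHERE hire_date < '1986-01-01' ORDER BY last_name"
).fetchall()
print(rows)  # -> [('Simmel',)]
```

The analogy only goes so far: in Lingual the same query text would be planned into Cascading flows and executed as parallel jobs rather than against a local database engine.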
• 45. ANSI SQL – CSV data in local file system cascading.org/lingual Friday, 01 March 13 45 The test database for MySQL is available for download from https://launchpad.net/test-db/ Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 46. ANSI SQL – shell prompt, catalog cascading.org/lingual Friday, 01 March 13 46 Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
  • 47. ANSI SQL – queries cascading.org/lingual Friday, 01 March 13 47 Here’s an example SQL query on that “employee” test database from MySQL.
• 48. Cascading workflows – machine learning
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• Cascading creates parallelized models to run at scale on Hadoop clusters
• Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.
• integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)
• 2 lines of code or pre-built JAR
• run multiple variants of models as customer experiments
[architecture diagram, as before]
Friday, 01 March 13 48
PMML has been around for a while, and export is supported by nearly every commercial analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
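At its simplest level, “reading PMML” means parsing XML: a PMML document carries a data dictionary plus the model parameters. As a hedged sketch, the snippet below fabricates a tiny in-line PMML fragment (real exports from R, SAS, etc. are far richer and carry an XML namespace) and lists its fields with the standard library:

```python
import xml.etree.ElementTree as ET

# A tiny illustrative PMML fragment; real exports are far richer
# and declare an XML namespace, which this sketch omits.
PMML = """<PMML version="4.1">
  <DataDictionary numberOfFields="2">
    <DataField name="label" optype="categorical" dataType="string"/>
    <DataField name="var0" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

root = ET.fromstring(PMML)
fields = [f.get("name") for f in root.iter("DataField")]
print(fields)  # -> ['label', 'var0']
```

Pattern does the equivalent of this, then builds a parallel scoring workflow from the model parameters it finds.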
• 49. model creation in R

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

cascading.org/pattern
Friday, 01 March 13 49
Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
• 50. model run at scale as a Cascading app
[flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count (M/R) → Scored Orders; Failure Traps → Confusion Matrix]
cascading.org/pattern
Friday, 01 March 13 50
Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
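The confusion-matrix tally at the end of that flow is itself simple. Here is a hypothetical Python version over (actual, predicted) label pairs, standing in for the GroupBy/Count stages of the diagram:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) label pairs -- the same group-and-count
    shape as the GroupBy/Count stages in the flow diagram."""
    return Counter(zip(actual, predicted))

cm = confusion_matrix(["ok", "fraud", "ok", "ok"],
                      ["ok", "fraud", "fraud", "ok"])
print(cm[("ok", "fraud")])  # one false positive -> 1
```

In the distributed version, the pairs arrive as tuples from the classify pipe, and records that fail assertions never reach the tally; they land in the failure traps instead.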
• 51. model run at scale as a Cascading app

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}

Friday, 01 March 13 51
Source code for a simple Cascading app that runs PMML models in general.
  • 52. PMML support… Friday, 01 March 13 52 Popular tools which can create predictive models for export as PMML
• 53. Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
• TDD follows from Cascalog’s composable subqueries
• redirect traps in production to Ops, QA, Support, Audit, etc.
[architecture diagram, as before]
Friday, 01 March 13 53
TDD is not usually high on the list when people start discussing Big Data apps. The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
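The assert-and-trap pattern above can be sketched in a few lines. This is an illustrative Python version, not the Cascading API: tuples failing a regex assertion get diverted to a trap as “data exceptions” instead of killing the run:

```python
import re

def assert_stream(tuples, field, pattern, trap):
    """Pass through tuples whose `field` fully matches `pattern`;
    divert the rest to `trap` as 'data exceptions' rather than
    failing the whole flow."""
    rx = re.compile(pattern)
    for t in tuples:
        if rx.fullmatch(str(t[field])):
            yield t
        else:
            trap.append(t)

trap = []
good = list(assert_stream(
    [{"zip": "94103"}, {"zip": "n/a"}, {"zip": "10001"}],
    "zip", r"\d{5}", trap))
print(len(good), len(trap))  # -> 2 1
```

In production the trap would be a sink tap routed to Ops, QA, Support, Audit, etc., so edge cases become data to inspect rather than stack traces.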
• 54. Cascading workflows – TDD meets API principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources – fail fast prior to submit
• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
[architecture diagram, as before]
Friday, 01 March 13 54
Some of the design principles for the pattern language
• 55. Two Avenues…
Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
[chart axes: complexity ➞, scale ➞]
Friday, 01 March 13 55
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
• 56. Two Avenues…
Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps
[chart axes: complexity ➞, scale ➞]
Friday, 01 March 13 56
Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple different ways to arrive at the party.
• 57. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 57
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
• 58. Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.
Friday, 01 March 13 58
A pattern language, based on the metaphor of “plumbing”
  • 59. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. amazon.com/dp/0195019199 design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”. amazon.com/dp/0201633612 Friday, 01 March 13 59 Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
• 60. Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram, as before]
Callout: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.
Friday, 01 March 13 60
The pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which also addresses staffing issues
• 61. Cascading workflows – literate programming
Cascading workflows generate their own visual documentation: flow diagrams
[flow diagram, as before]
In formal terms, flow diagrams leverage a methodology called literate programming
Provides intuitive, visual representations for apps, great for cross-team collaboration.
Friday, 01 March 13 61
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
  • 62. references… by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” Friday, 01 March 13 62 Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
• 63. examples…
• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation
• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code
In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.
[DOT flow plan for Word Count, as on slide 29]
Friday, 01 March 13 63
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
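One small example of the “interesting math” that applies to a flow DAG is a topological sort: the ordering a planner needs before it can schedule stages. The graph below is the conceptual word-count flow; the code is an illustrative sketch, not a Cascading internal:

```python
from collections import defaultdict, deque

def topo_sort(edges):
    """Kahn's algorithm: order DAG nodes so every edge points forward;
    raise if the graph contains a cycle (i.e., is not a DAG)."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a DAG")
    return order

edges = [("head", "Tokenize"), ("Tokenize", "GroupBy"),
         ("GroupBy", "Count"), ("Count", "tail")]
print(topo_sort(edges))  # -> ['head', 'Tokenize', 'GroupBy', 'Count', 'tail']
```

The same DAG structure is what makes query optimization and predictive models about app execution tractable: the graph is data that can be analyzed before any cluster resources get consumed.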
• 64. Cascading workflows – business process
Following the essence of literate programming, Cascading workflows provide statements of business process
This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
As a separation of concerns between business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
Friday, 01 March 13 64
Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
  • 65. references… by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on: the process of structuring data That’s what apps do – Making Data Work Friday, 01 March 13 65 Focus on *the process of structuring data* which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
  • 66. Cascading workflows – functional relational programming The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn Friday, 01 March 13 66 A more contemporary statement along similar lines...
• 67. Cascading workflows – functional relational programming
The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn
Callout: several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows
Friday, 01 March 13 67
• 68. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 68
Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
  • 69. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this. Friday, 01 March 13 69 Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale-out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
• 70. Circa 1996: pre- inflection point
[diagram: Stakeholder, Customers, Excel pivot tables, PowerPoint slide decks, strategy, BI, Product, Analysts, requirements, SQL Query, optimized code, Engineering, Web App, result sets, transactions, RDBMS]
Friday, 01 March 13 70
Ah, teh olde days - Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos
• 71. Circa 2001: post- big ecommerce successes
[diagram: Stakeholder, Product, Customers, dashboards, UX, Engineering, models, servlets, recommenders + classifiers, Algorithmic Modeling, Web Apps, Middleware, aggregation, event history, SQL Query, result sets, customer transactions, Logs, DW, ETL, RDBMS]
Friday, 01 March 13 71
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
• 72. Circa 2013: clusters everywhere
[diagram: Data Products, Customers, business process, Domain Expert, Workflow, dashboard metrics, Web Apps / services / Mobile, s/w dev, data science, Data Scientist, Planner, social interactions + transactions, discovery, optimized taps / capacity, modeling, content, App Dev, Use Cases Across Topologies, Hadoop etc., In-Memory Data Grid, Log Events, Ops, DW, batch / near time, Cluster Scheduler, SDLC, RDBMS; legend: introduced capability vs. existing capability]
Friday, 01 March 13 72
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Mesos, etc.
• 73. Asymptotically…
• long-term trends toward more instrumentation of Enterprise data workflows:
  - workflow abstraction enables business cases
  - more machine data collected about apps
  - flow diagram (DAG) as unit of work (abstract type for machine data)
  - evolving feedback loops convert machine data into actionable insights and optimizations
• industry moves beyond common needs of ad-hoc queries on logs and basic reporting, as a new class of complex data workflows emerges to provide the insights required by Enterprise
• end game is less about “bigness” of data, more about managing complexity in the process of structuring data
[sidebar stack: DSL → Planner/Optimizer → Workflow → App History → Cluster → Cluster Scheduler]
Friday, 01 March 13 73
In summary…
  • 74. references… by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L Friday, 01 March 13 74 Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)
• 75. references… Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) youtube.com/watch?v=E91oEn1bnXM Google “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtube.com/watch?v=qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx “The Birth of Google” – John Battelle wired.com/wired/archive/13.08/battelle.html Friday, 01 March 13 75 In their own words…
  • 76. references… by Paco Nathan Enterprise Data Workflows with Cascading O’Reilly, 2013 amazon.com/dp/1449358721 Friday, 01 March 13 76 Some of this material comes from an upcoming O’Reilly book: “Enterprise Data Workflows with Cascading” This should be in Rough Cuts soon - scheduled to be out in print this June. Many thanks to my wonderful editor, Courtney Nash.
• 77. drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com join us for very interesting work! Copyright @2013, Concurrent, Inc. Friday, 01 March 13 77 Links to our open source projects, developer community, etc… contact me @pacoid http://concurrentinc.com/ (we're hiring too!)