“The Workflow Abstraction”

                     Strata SC
                     2013-02-28

                     Paco Nathan
                     Concurrent, Inc.
                     San Francisco, CA
                     @pacoid




                   Copyright © 2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           1
Background: a dual background in quantitative methods and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
The Workflow Abstraction
                       [Diagram: Cascading word-count flow – Document
                       Collection → Tokenize → Scrub token → HashJoin
                       (Left; RHS: Stop Word List → Regex token) →
                       GroupBy token → Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines


This talk is about the workflow abstraction:
 * the business process of structuring data
 * the practices of building robust apps at scale
 * the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, and trendlines --
what are the drivers that brought us here, and where is this work heading?

Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
Marketing Funnel – overview

             In reference to Making Data Work…

             Almost every business uses a model similar to this –
             give or take a few steps.

             Customer leads go in at the top, those get refined
             through several stages, then results flow out the bottom.

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]
Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
Marketing Funnel – clickstream

              Different funnel stages get represented in ecommerce
              by events captured in log files, as a class of machine
              data called clickstream:

                •   ad impressions
                •   URL clicks
                •   landing page views
                •   new user registrations
                •   session cookies
                •   online purchases
                •   social network activity
                •   etc.

              [Diagram: clickstream events mapped to funnel stages –
              Impression → Awareness, Click → Interest, Sign Up →
              Evaluation, Purchase → Conversion, "Like" → Referral]
Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
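To make the idea concrete, a minimal sketch of bucketing clickstream events into funnel stages; the event names and the stage mapping here are illustrative assumptions, not taken from any specific ad platform:

```python
# Sketch: bucket raw clickstream events into marketing-funnel stages.
# Event names and the stage mapping are illustrative assumptions.

# map each clickstream event type to the funnel stage it signals
EVENT_STAGE = {
    "impression": "Awareness",
    "click": "Interest",
    "signup": "Evaluation",
    "purchase": "Conversion",
    "like": "Referral",
}

def count_stages(events):
    """Tally how many log events landed in each funnel stage."""
    counts = {}
    for e in events:
        stage = EVENT_STAGE.get(e)
        if stage is not None:          # skip unknown event types
            counts[stage] = counts.get(stage, 0) + 1
    return counts

log = ["impression", "impression", "click", "impression", "purchase"]
print(count_stages(log))   # {'Awareness': 3, 'Interest': 1, 'Conversion': 1}
```

At production scale this tally runs over billions of log lines, which is exactly where MapReduce-style workflows enter the picture.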
Marketing Funnel – metrics

             A variety of clickstream metrics can be used as
             performance indicators at different stages of the funnel:

              •    CPM: cost per thousand impressions
              •    CTR: click-through rate
              •    CPA: cost per action
              •    etc.

             [Diagram: metrics by funnel stage – Awareness: CPM;
             Interest: CTR; Evaluation: behaviors; Conversion: CPA;
             Referral: NPS, social graph, etc.; Repeat: loyalty,
             win back, etc.]
The many different highly-nuanced metrics which apply are mind-boggling :)
Marketing Funnel – example calculations

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]

                             metric      cost     events    formula                  rate

                             CPM       $4,000     10^6      $4,000 ÷ (10^6 ÷ 10^3)   $4.00
                             CTR            -    3∙10^3     3∙10^3 ÷ 10^6            0.3%
                             CPA            -       20      $4,000 ÷ 20              $200
Here are examples of the kinds of calculations performed...
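The example calculations above can be sketched in a few lines; the function names are mine, not part of any library:

```python
# Sketch of the example funnel-metric calculations.
# Function names are illustrative, not from any library.

def cpm(cost, impressions):
    """Cost per thousand impressions (cost per mille)."""
    return cost / (impressions / 1_000)

def ctr(clicks, impressions):
    """Click-through rate, as a fraction."""
    return clicks / impressions

def cpa(cost, actions):
    """Cost per action."""
    return cost / actions

print(cpm(4_000, 10**6))          # 4.0   -> $4.00 CPM
print(ctr(3 * 10**3, 10**6))      # 0.003 -> 0.3% CTR
print(cpa(4_000, 20))             # 200.0 -> $200 CPA
```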
Marketing Funnel – predictive model

             Given these metrics, we can go further to estimate
             cost per paying user (CPP), customer lifetime value
             (LTV), etc.

             Then we can build a predictive model for return on
             investment (ROI) per customer, summarizing the funnel
             performance:

                     ROI = (LTV − CPP) ∕ CPP

             As an example, after crunching lots of logs,
             suppose that…

                     CPP = $200
                     LTV = $2000
                     ROI = ($2000 − $200) ∕ $200

             for a 9x multiple

             [Diagram: funnel stages – Customers, Campaigns, Awareness,
             Interest, Evaluation, Conversion, Referral, Repeat]
For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
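The ROI formula is easy to check directly; a minimal sketch using the slide's example numbers:

```python
# Sketch of the per-customer ROI model: ROI = (LTV - CPP) / CPP

def roi(ltv, cpp):
    """Return on investment per customer, as a multiple of spend."""
    return (ltv - cpp) / cpp

# the example: CPP = $200, LTV = $2000
print(roi(2_000, 200))   # 9.0 -> the "9x multiple"
```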
Marketing Funnel – example architecture

             Let’s consider an example architecture for calculating,
             reporting, and taking action on funnel metrics, based on
             large-scale clickstream data…

             [Diagram: example architecture – Customers → Web App
             (Cache, Customer Prefs); logs → source tap → Data
             Workflow on a Hadoop Cluster, with a trap tap for bad
             data; sink taps → customer profile DBs; downstream:
             Support, Modeling (PMML), Analytics Cubes, Reporting]
Here’s an example architecture of using clickstream metrics within an online business.
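To make the tap terminology concrete, here is a toy Python model of the source tap / trap tap / sink tap pattern; this is an illustrative sketch, NOT the Cascading API (in Cascading, taps bind a workflow to storage such as HDFS, and traps capture records that fail during processing):

```python
# Toy model of the source/trap/sink tap pattern.
# Illustrative Python only -- not the Cascading API.

def run_workflow(source_tap, process, sink_tap, trap_tap):
    """Read from source, apply the workflow step, route failures to the trap."""
    for record in source_tap:
        try:
            sink_tap.append(process(record))
        except ValueError:
            trap_tap.append(record)    # bad data is trapped, not fatal

# example step: parse log lines into (user, event) tuples
def parse(line):
    user, _, event = line.partition(",")
    if not event:
        raise ValueError("malformed log line")
    return (user, event)

logs = ["u1,impression", "u2,click", "corrupt-line"]
sink, trap = [], []
run_workflow(logs, parse, sink, trap)
print(sink)   # [('u1', 'impression'), ('u2', 'click')]
print(trap)   # ['corrupt-line']
```

The point of the trap is operational: one corrupt log line should not abort a batch job over billions of events.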
Marketing Funnel – complexities

             Multiple ad partners, different contract terms,
             reporting different metrics at different times,
             click scrubs, etc.

             Campaigns target specific geo/demo, test alternate
             landing pages, probably need to segment the customer
             base…

             These issues make clickstream data large and yet sparse.

             Other issues:

             • seasonal variation
             • fluctuating currency exchange rates
             • distortions due to credit card fraud
             • diminishing returns
             • forecasting requirements

             [Diagram: metrics by funnel stage, with complexities (×)
             marked at several stages]
However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.

scrubs
many vendors, data sources, different metrics to be aligned
lots of roll-ups
Bayesian point estimates
forecasts and dashboards

the social dimension makes this convoluted, not simple
Marketing Funnel – very large scale

             Even a small start-up may need to make decisions about
             billions of events, many millions of users, and millions
             of dollars in annual ad spend.

             Ad networks attempt to simplify and optimize parts of
             the funnel process as a value-add.

             The need for these insights has been a driver for
             Hadoop-related technologies.

             [Diagram: metrics by funnel stage – Awareness: CPM;
             Interest: CTR; Evaluation: behaviors; Conversion: CPA;
             Referral: NPS, social graph, etc.; Repeat: loyalty,
             win back, etc.]
The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
Marketing Funnel – very large scale

            Even a small start-up may need to make decisions about
            billions of events, many millions of users, and millions
            of dollars in annual ad spend.

            Ad networks attempt to simplify and optimize parts of
            the funnel process as a value-add.

            The need for these insights has been a driver for
            Hadoop-related technologies.

                      funnel modeling and optimization
                      requires complex data workflows
                      to obtain the required insights

            [Diagram: metrics by funnel stage, as before]
These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
The Workflow Abstraction
                      [Diagram: Cascading word-count flow – Document
                      Collection → Tokenize → Scrub token → HashJoin
                      (Left; RHS: Stop Word List → Regex token) →
                      GroupBy token → Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
Circa 2008 – Hadoop at scale
             Scenario: Analytics team at a large ad network…

             Company had invested $MM capex in a large data
             warehouse across LOBs

             Mission-critical app had been written as a large SQL
             workflow in the DW

             Marketing funnel metrics were estimated for many
             advertisers, many campaigns, many publishers, many
             customers – billions of calculations daily

             Predictive models matched publisher ~ advertiser and
             campaign ~ user, to optimize marketing funnel
             performance

             [Diagram: funnel stages; DW workflow – query/load
             clickstream → RDBMS → roll-ups → filter → collab,
             per-user recommends]
Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network…

Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
Circa 2008 – Hadoop at scale

             [diagram: marketing funnel stages – Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat – feeding the workflow: clickstream query/load → RDBMS → roll-ups, collab filter, per-user recommends – marked (×) as failing at scale]

             Issues:

              • critical app had hit hard limits for scalability

              • several Tb data, 100’s of servers

              • batch window length vs. failure rate vs. SLA
                in the context of business growth posed
                an existential risk

             We built out a team to address these issues
             as rapidly as possible…
             Needed to re-create the data workflows
             based on Enterprise requirements.
Friday, 01 March 13                                                                                                                                                    14
Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
Circa 2008 – Hadoop at scale

             [diagram: revised workflow – clickstream query/load → msg queue → HDFS → roll-ups, collab filter, per-user recommends → RDBMS]

             Approach:

              • reverse-engineered business process from
                ~1500 lines of undocumented SQL

              • created a large, multi-step Apache Hadoop
                app on AWS

              • leveraged cloud strategy to trade $MM
                capex for lower, scalable opex

              • Amazon identified our app as one of the
                largest Hadoop deployments on EC2

              • our app became a case study for AWS
                prior to Elastic MapReduce launch
Friday, 01 March 13                                                                                       15
Our solution involved dependencies among more than a dozen Hadoop job steps.
Circa 2008 – Hadoop at scale

             [diagram: same workflow, with failure points (×) marked around the msg queue and HDFS]

             Unresolved:

              • ETL was still a separate app

              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow

              • data scientists wore beepers since Ops
                lacked visibility into business process

              • coding directly in MapReduce created
                a staffing bottleneck
Friday, 01 March 13                                                                                                                                                16
This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:
 * staffing bottleneck unless there’s a good abstraction layer
 * operational complexity, mostly due to lack of transparency
 * system integration problems *are* the main problem to solve
Circa 2008 – Hadoop at scale

             [diagram: workflow – clickstream query/load → msg queue → HDFS → roll-ups, collab filter, per-user recommends → RDBMS]

             Unresolved:

              • ETL was still a separate app

              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow

              • data scientists wore beepers since Ops
                lacked visibility into the app’s business logic

              • coding directly in MapReduce created
                a staffing bottleneck

                             a good solution for a large, commercial
                             Apache Hadoop deployment, but
                             workflow management lacked crucial
                             features…

                             which led to a search for a better
                             workflow abstraction
Friday, 01 March 13                                                                                                                                                                           17
While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
The Workflow Abstraction

                       [diagram: Word Count flow – Document Collection → Tokenize → Scrub token (M) → HashJoin Left against Regex token over the Stop Word List (RHS) → GroupBy token → Count (R) → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines
Friday, 01 March 13                                                                                                                                                             18
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.




Friday, 01 March 13                                                                                                                                                            19
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – back to LISP in 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows:

                • leverages JVM and Java-based tools without a need
                    to create an entirely new language
               •    allows many programmers who have J2EE expertise
                    to build apps that leverage the economics of Hadoop
                    clusters




Friday, 01 March 13                                                                                                                           20
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
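The claim that data pipelines are "functional in nature" can be illustrated with a short sketch – plain Java streams, not Cascading itself, and the class name `PipelineSketch` is invented for illustration. Each stage is a pure transformation composed into a pipeline, with no mutable shared state:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelineSketch {
    // Word Count as a functional pipeline: tokenize -> scrub -> group -> count,
    // each stage a pure transformation over a stream of values.
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+"))) // tokenize (the "map" side)
            .filter(token -> !token.isEmpty())                                // scrub empty tokens
            .collect(Collectors.groupingBy(Function.identity(),              // group by token
                                           Collectors.counting()));          // count (the "reduce" side)
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Stream.of("rain shadow", "rain rain"));
        System.out.println(counts.get("rain")); // 3
    }
}
```

The same stage-by-stage composition is what Cascading's pipes express, except that Cascading plans the stages onto MapReduce jobs instead of an in-memory stream.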
quotes…

                       “Cascading gives Java developers the ability to build
                        Big Data applications on Hadoop using their existing
                        skillset … Management can really go out and build a
                        team around folks that are already very experienced
                        with Java. Switching over to this is really a very short
                        exercise.”
                            CIO, Thor Olavsrud
                            2012-06-06
                            cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

                       “Masks the complexity of MapReduce, simplifies the
                        programming, and speeds you on your journey toward
                        actionable analytics … A vast improvement over native
                        MapReduce functions or Pig UDFs.”
                            2012 BOSSIE Awards, James Borck
                            2012-09-18
                            infoworld.com/slideshow/65089




Friday, 01 March 13                                                                           21
Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”

The issues:
 * staffing bottleneck
 * operational complexity
 * system integration
Cascading – deployments

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera
              • 5+ year history of Enterprise production deployments,
                   ASL 2 license, GitHub src, conjars.org
              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.




Friday, 01 March 13                                                                  22
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
examples…

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                           deployments
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


                     github.com/nathanmarz/cascalog/wiki
                     github.com/twitter/scalding/wiki




Friday, 01 March 13                                                                    23
Many case studies, many Enterprise production deployments now for 5+ years.
examples…

                        • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                            in functional programming open source projects atop
                            Cascading – used for their large-scale production
                            deployments
                        •   new case studies for Cascading apps are mostly
                            based on domain-specific languages (DSLs) in JVM
                            languages which emphasize functional programming:

                            Cascalog in Clojure (2010)
                            Scalding in Scala (2012)

                                          Cascading as the basis for workflow
                                          abstractions atop Hadoop and more,
                                          with a 5+ year history of production
                                          deployments across multiple verticals

                       github.com/nathanmarz/cascalog/wiki
                       github.com/twitter/scalding/wiki




Friday, 01 March 13                                                                    24
Cascading as a basis for workflow abstraction, for Enterprise data workflows
The Workflow Abstraction

                       [diagram: Word Count flow – Document Collection → Tokenize → Scrub token (M) → HashJoin Left against Regex token over the Stop Word List (RHS) → GroupBy token → Count (R) → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines


Friday, 01 March 13                                                                                                                                      25
Code samples in Cascading / Cascalog / Scalding, based on Word Count
The Ubiquitous Word Count

             [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

             Definition:

                count how often each word appears
                in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it:

              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

Friday, 01 March 13                                                                                                                                                    26
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
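The map/reduce pseudocode on the slide can be simulated in plain Java with no Hadoop dependency – a sketch only, with the class name `WordCountSim` and the in-memory "shuffle" invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    // map: emit a (word, "1") pair for each word in the document text
    public static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (String w : text.split("\\s+"))
            emitted.add(Map.entry(w, "1"));           // emit(w, "1")
        return emitted;
    }

    // shuffle + reduce: group partial counts by word, then sum each group
    public static Map<String, String> reduceAll(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : pairs)     // the framework's shuffle phase
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        Map<String, String> out = new HashMap<>();
        groups.forEach((word, group) -> {
            int count = 0;
            for (String pc : group)                   // sum the partial counts
                count += Integer.parseInt(pc);
            out.put(word, String.valueOf(count));     // emit(word, String(count))
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(reduceAll(map("doc1", "a b a")).get("a")); // 2
    }
}
```

In Hadoop, the shuffle between `map` and `reduceAll` happens across the cluster; the per-word grouping shown here is the part the framework guarantees.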
word count – conceptual flow diagram

                 [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

                 1 map
                 1 reduce
                18 lines code

                 cascading.org/category/impatient
                 gist.github.com/3900702

Friday, 01 March 13                                                                                                                                                                                                      27
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java

              [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

              String docPath = args[ 0 ];
              String wcPath = args[ 1 ];

              Properties properties = new Properties();
              AppProps.setApplicationJarClass( properties, Main.class );
              HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

              // create source and sink taps
              Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
              Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

              // specify a regex to split "document" text lines into a token stream
              Fields token = new Fields( "token" );
              Fields text = new Fields( "text" );
              RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
              // only returns "token"
              Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
              // determine the word counts
              Pipe wcPipe = new Pipe( "wc", docPipe );
              wcPipe = new GroupBy( wcPipe, token );
              wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

              // connect the taps, pipes, etc., into a flow
              FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
               .addSource( docPipe, docTap )
               .addTailSink( wcPipe, wcTap );
              // write a DOT file and run the flow
              Flow wcFlow = flowConnector.connect( flowDef );
              wcFlow.writeDOT( "dot/wc.dot" );
              wcFlow.complete();



Friday, 01 March 13                                                                                                                                    28
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
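A DOT file is plain Graphviz text, so the `writeDOT` call above costs one line. As a rough sketch of what generating such a file involves – not Cascading's actual output, which also annotates each edge with its tuple fields, and with the class name `DotSketch` invented for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DotSketch {
    // Emit a Graphviz DOT digraph for a linear flow of named stages.
    public static String toDot(String name, String... stages) {
        StringBuilder sb = new StringBuilder("digraph \"" + name + "\" {\n");
        for (int i = 0; i + 1 < stages.length; i++)
            sb.append("  \"" + stages[i] + "\" -> \"" + stages[i + 1] + "\";\n");
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) throws IOException {
        String dot = toDot("wc", "head", "Tokenize", "GroupBy", "Count", "tail");
        Files.writeString(Path.of("wc.dot"), dot); // render with Graphviz: dot -Tpng wc.dot
        System.out.print(dot);
    }
}
```

Because the flow planner holds the whole dependency graph, this kind of diagram comes for free – one reason the slide's notes call this a literate programming methodology.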
word count – generated flow diagram

       [diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

       [head]
       Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
         [{2}:'doc_id', 'text']
         [{2}:'doc_id', 'text']
       map
       Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
         [{1}:'token']
         [{1}:'token']
       GroupBy('wc')[by:['token']]
         wc[{1}:'token']
         [{1}:'token']
       reduce
       Every('wc')[Count[decl:'count']]
         [{2}:'token', 'count']
         [{1}:'token']
       Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
         [{2}:'token', 'count']
         [{2}:'token', 'count']
       [tail]


Friday, 01 March 13                                                                                                                                                                                     29
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
word count – Cascalog / Clojure
[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

             (ns impatient.core
               (:use [cascalog.api]
                     [cascalog.more-taps :only (hfs-delimited)])
               (:require [clojure.string :as s]
                         [cascalog.ops :as c])
               (:gen-class))

             (defmapcatop split [line]
               "reads in a line of string and splits it by regex"
               (s/split line #"[\[\](),.)\s]+"))

             (defn -main [in out & args]
               (?<- (hfs-delimited out)
                    [?word ?count]
                    ((hfs-delimited in :skip-header? true) _ ?line)
                    (split ?line :> ?word)
                    (c/count ?count)))

             ; Paul Lam
             ; github.com/Quantisan/Impatient




Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure
             github.com/nathanmarz/cascalog/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]
               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn




Judging from language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala
[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

           import com.twitter.scalding._

           class WordCount(args : Args) extends Job(args) {
             Tsv(args("doc"),
                  ('doc_id, 'text),
                  skipHeader = true)
               .read
               .flatMap('text -> 'token) {
                  text : String => text.split("[ \\[\\](),.]")
                }
               .groupBy('token) { _.size('count) }
               .write(Tsv(args("wc"), writeHeader = true))
           }




Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
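The "collections-style" feel of the Scalding pipeline can be sketched with plain Java streams on a single JVM (a local analogy only, not Scalding or the Cascading API; the class and method names here are hypothetical):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountLocal {
    // split a line into tokens, mirroring the .flatMap('text -> 'token) step
    public static Stream<String> tokenize(String text) {
        return Arrays.stream(text.split("[ \\[\\](),.]"))
                     .filter(t -> !t.isEmpty());
    }

    // mirror .groupBy('token) { _.size('count) } with a local collector
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines.flatMap(WordCountLocal::tokenize)
                    .collect(Collectors.groupingBy(Function.identity(),
                                                   Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints one count per token, e.g. for "a b a" and "b (c)"
        System.out.println(wordCount(Stream.of("a b a", "b (c)")));
    }
}
```

Each call in the chain lines up with a box in the conceptual flow diagram, which is the point the Scalding authors make about "nearly 1:1" correspondence.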
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]
                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language




If you wanted to see what a data services architecture for machine learning at, say, Google scale would look like as an open source project -- that's Scalding. That's what they're doing.
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

[flow diagram: Document Collection → M: Tokenize → GroupBy token → Count → R → Word Count]

                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                  (imagine SOA infra @ Google as an open source project)
                • less learning curve than Cascalog,
                  not as much of a high-level language

              Cascalog and Scalding DSLs leverage the functional aspects
              of MapReduce, helping to limit complexity in process



Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
[flow diagram: Document Collection → M: Tokenize → Scrub token → HashJoin Left / Regex token ← Stop Word List (RHS) → GroupBy token → R: Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
Enterprise Data Workflows
             Back to our marketing funnel, let’s consider
             an example app… at the front end

             LOB use cases drive demand for apps

[architecture diagram: Customers → Web App → Cache → Logs → source tap into a Data Workflow on the Hadoop Cluster, with sink and trap taps; Support, Modeling (PMML), Analytics Cubes, and Reporting on one side; customer profile DBs and Customer Prefs on the other]
LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
              An example… in the back office

              Organizations have substantial investments
              in people, infrastructure, process

[architecture diagram: same front-end and back-office components around the Data Workflow on the Hadoop Cluster]

Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
               An example… for the heavy lifting!

               “Main Street” firms are migrating
               workflows to Hadoop, for cost
               savings and scale-out

[architecture diagram: same front-end and back-office components around the Data Workflow on the Hadoop Cluster]
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Cascading workflows – taps

                •   taps integrate other data frameworks, as tuple streams
                •   these are “plumbing” endpoints in the pattern language
                •   sources (inputs), sinks (outputs), traps (exceptions)
                •   text delimited, JDBC, Memcached,
                    HBase, Cassandra, MongoDB, etc.
                •   data serialization: Avro, Thrift,
                    Kryo, JSON, etc.
                •   extend a new kind of tap in just
                    a few lines of Java

              schema and provenance get
              derived from analysis of the taps

[architecture diagram: source, sink, and trap taps connecting Logs, the Data Workflow on the Hadoop Cluster, and customer profile DBs]
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
Cascading workflows – taps

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];
             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into a token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();

             (callout: source and sink taps for TSV data in HDFS)



Here are the taps in the WordCount source
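The source/sink/trap contract can be sketched in a few lines of stdlib-only Java. This is an illustration with hypothetical names, not the Cascading `Tap` API: a flow reads tuples from a source, writes results to a sink, and diverts failing tuples to a trap instead of killing the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TapSketch {
    // a minimal, hypothetical tap: a named stream endpoint
    public interface Tap<T> {
        List<T> read();          // source behavior
        void write(List<T> t);   // sink behavior
    }

    // an in-memory tap, standing in for Hfs, JDBC, HBase, etc.
    public static class MemoryTap<T> implements Tap<T> {
        private final List<T> data = new ArrayList<>();
        public List<T> read() { return data; }
        public void write(List<T> t) { data.addAll(t); }
    }

    // apply fn to each tuple; failures are diverted to the trap tap
    public static <A, B> void flow(Tap<A> source, Tap<B> sink, Tap<A> trap,
                                   Function<A, B> fn) {
        List<B> out = new ArrayList<>();
        List<A> bad = new ArrayList<>();
        for (A tuple : source.read()) {
            try { out.add(fn.apply(tuple)); }
            catch (RuntimeException e) { bad.add(tuple); }
        }
        sink.write(out);
        trap.write(bad);
    }
}
```

Swapping the `MemoryTap` for a JDBC- or HBase-backed implementation changes the integration without touching the flow logic, which is the design point of taps.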
Cascading workflows – topologies

                •   topologies execute workflows on clusters
                •   flow planner is like a compiler for queries
                    - Hadoop (MapReduce jobs)
                    - local mode (dev/test or special config)
                    - in-memory data grids (real-time)
                •   flow planner can be extended
                    to support other topologies

              blend flows in different topologies
              into the same app – for example,
              batch (Hadoop) + transactions (IMDG)

[architecture diagram: source, sink, and trap taps connecting Logs, the Data Workflow on the Hadoop Cluster, and customer profile DBs]
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
Cascading workflows – topologies

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();



Friday, 01 March 13                                                                                                   42
Here is the flow planner for Hadoop in the WordCount source example.
topologies…




Friday, 01 March 13                                                                             43
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
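For reference, switching a Cascading 2.x app between topologies comes down to choosing a different flow planner. A hedged sketch (a fragment, not a runnable program, since it requires the Cascading library on the classpath; the pipe assembly and `flowDef` are assumed to exist already):

```java
// same pipe assembly, two different flow planners (Cascading 2.x API)
// -- the Hadoop planner compiles the flow into MapReduce jobs:
FlowConnector hadoopConnector = new HadoopFlowConnector( properties );

// -- the local planner runs the identical assembly in-memory,
//    handy for tests on a laptop:
FlowConnector localConnector = new LocalFlowConnector( properties );

Flow flow = hadoopConnector.connect( flowDef ); // or localConnector
flow.complete();
```

The point of the abstraction is that the assembly itself does not change; only the connector (and the taps) vary per topology.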
Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base
               • ANSI SQL parser/optimizer atop the Cascading flow planner
               • JDBC driver to integrate into existing tools and app servers
               • relational catalog over a collection of unstructured data
               • SQL shell prompt to run queries
               • enable analysts without retraining on Hadoop, etc.
               • transparency for Support, Ops, Finance, et al.
               • a language for queries – not a database, but ANSI SQL
                 as a DSL for workflows

               [diagram: Enterprise data workflow – Customers, Web App, Logs,
                Cache, Support traps/taps, Modeling (PMML), Data Workflow,
                Analytics Cubes, Reporting, Customer profile DBs, Hadoop Cluster]

Friday, 01 March 13                                                                                                                                                                        44
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.

BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
ANSI SQL – CSV data in local file system




               cascading.org/lingual


Friday, 01 March 13                                                                              45
The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
ANSI SQL – shell prompt, catalog




                cascading.org/lingual


Friday, 01 March 13                                                                       46
Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries




              cascading.org/lingual


Friday, 01 March 13                                                        47
Here’s an example SQL query on that “employee” test database from MySQL.
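A query of the kind shown on this slide might look like the following – an illustrative sketch only, using table and column names from the MySQL employees test database; the actual schema name depends on how the catalog was registered with Lingual:

```sql
-- illustrative ANSI SQL, run from the Lingual shell;
-- table/column names follow the MySQL "employees" test database
SELECT emp_no, last_name, first_name, hire_date
FROM employees.employees
WHERE hire_date > DATE '1999-01-01'
ORDER BY hire_date DESC;
```

Under the hood, Lingual plans such a query as a Cascading flow rather than executing it against a database engine.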
Cascading workflows – machine learning

              • migrate workloads: SAS, Teradata, etc., exporting
                predictive models as PMML
              • Cascading creates parallelized models to run at scale
                on Hadoop clusters
              • Random Forest, Logistic Regression, GLM, Decision Trees,
                K-Means, Hierarchical Clustering, etc.
              • integrate with other libraries (Matrix API, etc.) and great
                open source tools (R, Weka, KNIME, RapidMiner, etc.)
              • 2 lines of code or pre-built JAR

             Run multiple variants of models as
             customer experiments

             [diagram: same Enterprise data workflow – Customers, Web App,
              Logs, Cache, Support, Modeling (PMML), Data Workflow, Analytics
              Cubes, Reporting, Customer profile DBs, Hadoop Cluster]


Friday, 01 March 13                                                                                                                                         48
PMML has been around for a while, and export is supported by nearly every commercial analytics platform,
covering a wide variety of predictive modeling algorithms.

Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.

Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)

Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
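To give a sense of the interchange format: a PMML document is plain XML, roughly of the following shape – a heavily abbreviated sketch with illustrative field names, not the exact output of any particular exporter:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="random forest model (illustrative sketch)"/>
  <DataDictionary numberOfFields="3">
    <DataField name="label" optype="categorical" dataType="string"/>
    <DataField name="var0"  optype="continuous"  dataType="double"/>
    <DataField name="var1"  optype="continuous"  dataType="double"/>
  </DataDictionary>
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0"/>
      <MiningField name="var1"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <!-- one <Segment> holding a <TreeModel> per tree in the ensemble -->
    </Segmentation>
  </MiningModel>
</PMML>
```

Because the model is declarative data rather than code, a consumer such as Cascading can translate it into a parallel workflow without ever touching the tool that trained it.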
model creation in R
                       ## libraries required for model build + PMML export
                       library(randomForest)
                       library(pmml)
                       library(XML)

                       ## train a RandomForest model

                       f <- as.formula("as.factor(label) ~ .")
                       fit <- randomForest(f, data_train, ntree=50)

                       ## test the model on the holdout test set

                       print(fit$importance)
                       print(fit)

                       predicted <- predict(fit, data)
                       data$predicted <- predicted
                       confuse <- table(pred = predicted, true = data[,1])
                       print(confuse)

                       ## export predicted labels to TSV

                       write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                         quote=FALSE, sep="\t", row.names=FALSE)

                       ## export RF model to PMML

                       saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




               cascading.org/pattern


Friday, 01 March 13                                                                                                              49
Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
model run at scale as a Cascading app




                [conceptual flow diagram: Customer Orders → (M) Classify, using
                 the PMML Model → Scored Orders → Assert → (R) GroupBy token →
                 Count → Confusion Matrix; data exceptions routed to Failure Traps]

               cascading.org/pattern


Friday, 01 March 13                                                                                                                                                                                             50
Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
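The final tally can be sketched in plain Java, independent of Cascading – a minimal illustration of what counting a confusion matrix means here, with hypothetical label values:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConfusionMatrix {
    // count (actual, predicted) label pairs from scored records;
    // each row is { actualLabel, predictedLabel }
    public static Map<String, Integer> tally(List<String[]> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows)
            counts.merge(row[0] + "/" + row[1], 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> scored = Arrays.asList(
            new String[]{ "1", "1" },   // true positive
            new String[]{ "1", "0" },   // false negative
            new String[]{ "0", "0" } ); // true negative
        System.out.println(tally(scored));
    }
}
```

In the Cascading app this tally happens on the Reduce side after the GroupBy, but the bookkeeping is exactly this simple.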
model run at scale as a Cascading app
                      public class Main {
                        public static void main( String[] args ) {
                          String pmmlPath = args[ 0 ];
                          String ordersPath = args[ 1 ];
                          String classifyPath = args[ 2 ];
                          String trapPath = args[ 3 ];

                            Properties properties = new Properties();
                            AppProps.setApplicationJarClass( properties, Main.class );
                            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

                            // create source and sink taps
                             Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
                             Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
                             Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

                            // define a "Classifier" model from PMML to evaluate the orders
                            ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
                            Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

                            // connect the taps, pipes, etc., into a flow
                            FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                             .addSource( classifyPipe, ordersTap )
                             .addTrap( classifyPipe, trapTap )
                             .addSink( classifyPipe, classifyTap );

                            // write a DOT file and run the flow
                            Flow classifyFlow = flowConnector.connect( flowDef );
                            classifyFlow.writeDOT( "dot/classify.dot" );
                            classifyFlow.complete();
                          }
                      }

Friday, 01 March 13                                                                                                                      51
Source code for a simple Cascading app that runs PMML models in general.
PMML support…




Friday, 01 March 13                                                   52
Popular tools which can create predictive models for export as PMML
Cascading workflows – test-driven development

               • assert patterns (regex) on the tuple streams
               • adjust assert levels, like log4j levels
               • trap edge cases as “data exceptions”
               • TDD at scale:
                 1. start from raw inputs in the flow graph
                 2. define stream assertions for each stage of transforms
                 3. verify exceptions, code to remove them
                 4. when impl is complete, app has full test coverage
               • TDD follows from Cascalog’s composable subqueries
               • redirect traps in production to Ops, QA, Support, Audit, etc.

               [diagram: same Enterprise data workflow as before]


Friday, 01 March 13                                                                                                                                                                                                            53
TDD is not usually high on the list when people start discussing Big Data apps.

The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application.

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology: ad-hoc queries are expressed as logic predicates, then those predicates are composed into large-scale apps.
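Concretely, a stream assertion in Cascading 2.x looks roughly like this – a fragment rather than a runnable program, since it requires the Cascading library; AssertMatches and AssertionLevel are classes in that API, while the pipe and the pattern here are illustrative:

```java
// fail the flow if a tuple does not match the expected pattern;
// the STRICT level can be planned out for production, like a log4j level
AssertMatches assertMatches = new AssertMatches( "\\d{6,16}" );
pipe = new Each( pipe, AssertionLevel.STRICT, assertMatches );
```

Tuples that violate an assertion (or throw during an operation) can be routed to a trap tap instead of killing the job, which is what turns edge cases into inspectable “data exceptions.”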
Cascading workflows – TDD meets API principles

               • specify what is required, not how it must be achieved
               • plan far ahead, before consuming cluster resources –
                 fail fast prior to submit
               • fail the same way twice – deterministic flow planners help
                 reduce engineering costs for debugging at scale
               • same JAR, any scale – app does not require a recompile
                 to change data taps or cluster topologies

               [diagram: same Enterprise data workflow as before]


Friday, 01 March 13                                                                                         54
Some of the design principles for the pattern language
Two Avenues…

              Enterprise: must contend with
              complexity at scale every day…
              incumbents extend current practices and
              infrastructure investments – using J2EE,
              ANSI SQL, SAS, etc. – to migrate
              workflows onto Apache Hadoop while
              leveraging existing staff

              Start-ups: crave complexity and
              scale to become viable…
              new ventures move into Enterprise space
              to compete using relatively lean staff,
              while leveraging sophisticated engineering
              practices, e.g., Cascalog and Scalding

              [chart: complexity ➞ vs. scale ➞]

Friday, 01 March 13                                                                                                                           55
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

               Enterprise: must contend with
               complexity at scale every day…
               incumbents extend current practices and
               infrastructure investments – using J2EE,
               ANSI SQL, SAS, etc. – to migrate
               workflows onto Apache Hadoop while
               leveraging existing staff

               Start-ups: crave complexity and
               scale to become viable…
               new ventures move into Enterprise space
               to compete using relatively lean staff,
               while leveraging sophisticated engineering
               practices, e.g., Cascalog and Scalding

               Hadoop almost never gets used
               in isolation; data workflows define
               the “glue” required for system
               integration of Enterprise apps

               [chart: complexity ➞ vs. scale ➞]

Friday, 01 March 13                                                                  56
Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple different ways to arrive at the party.
The Workflow Abstraction
                [diagram: WordCount flow – Document Collection → (M) Tokenize →
                 Scrub token → HashJoin Left with Stop Word List (RHS) →
                 (R) GroupBy token → Count → Word Count]

                        1. Funnel
                        2. Circa 2008
                        3. Cascading
                        4. Sample Code
                        5. Workflows
                        6. Abstraction
                        7. Trendlines


Friday, 01 March 13                                                                                                                                                             57
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
             In formal terms, this provides a pattern language.


Friday, 01 March 13                                                                                                                      58
A pattern language, based on the metaphor of “plumbing”
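The "plumbing" elements above can be modeled in a few lines. This is a hypothetical, self-contained Python sketch — not the Cascading Java API — where a tuple flow is an iterator of dicts and each pipe stage is a function over that flow, loosely mirroring the Each, HashJoin, GroupBy, and Every elements in the word-count diagram:

```python
# Illustrative model only, not Cascading itself: a tuple flow is an
# iterator of dicts; each "pipe" consumes one flow and yields another.
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of"}  # stands in for the Stop Word List tap

def tokenize(flow):
    # like Each('token')[RegexSplitGenerator]: one tuple in, many out
    for tup in flow:
        for token in re.split(r"\W+", tup["text"].lower()):
            if token:
                yield {"token": token}

def scrub(flow):
    # filter stage standing in for the HashJoin against the stop-word RHS
    for tup in flow:
        if tup["token"] not in STOP_WORDS:
            yield tup

def word_count(flow):
    # GroupBy('token') + Every(Count) collapsed into one reduce step
    counts = Counter(tup["token"] for tup in flow)
    return [{"token": t, "count": n} for t, n in sorted(counts.items())]

docs = [{"doc_id": "doc01", "text": "A rain shadow is a dry area"}]
print(word_count(scrub(tokenize(iter(docs)))))
```

Composing the stages as nested calls reads much like wiring pipes end to end — the same "plumbing" intuition the Java API builds on.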
references…

                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices.

                      amazon.com/dp/0195019199



                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by “Gang of Four”.
                      amazon.com/dp/0201633612




Friday, 01 March 13                                                                                                 59
Christopher Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             design principles of the pattern language ensure best
             practices for robust, parallel data workflows at scale

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
             In formal terms, this provides a pattern language.


Friday, 01 March 13                                                                                                                       60
The pattern language provides a structured method for solving large,
complex design problems where the syntax of the language promotes
use of best practices – which also addresses staffing issues
Cascading workflows – literate programming

             Cascading workflows generate their own visual
             documentation: flow diagrams

             [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

             In formal terms, flow diagrams leverage a methodology
             called literate programming.
             Provides intuitive, visual representations for apps, great
             for cross-team collaboration.


Friday, 01 March 13                                                                                                                                                                                                                          61
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
references…

                       by Don Knuth
                       Literate Programming
                       Univ of Chicago Press, 1992
                       literateprogramming.com/

                       “Instead of imagining that our main task is
                        to instruct a computer what to do, let us
                        concentrate rather on explaining to human
                        beings what we want a computer to do.”




Friday, 01 March 13                                                                                       62
Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
examples…

                      • Scalding apps have nearly 1:1 correspondence
                        between function calls and the elements in their
                        flow diagrams – excellent elision and literate
                        representation
                      • noticed on the cascading-users email list:
                        when troubleshooting issues, Cascading experts ask
                        novices to provide an app’s flow diagram (generated
                        as a DOT file), sometimes in lieu of showing code

                      In formal terms, a flow diagram is a directed, acyclic
                      graph (DAG) on which lots of interesting math applies
                      for query optimization, predictive models about app
                      execution, parallel efficiency metrics, etc.

                      [flow diagram, rendered from the DOT file:
                       [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
                       → Each('token')[RegexSplitGenerator[decl:'token'][args:1]]  (map)
                       → GroupBy('wc')[by:['token']]
                       → Every('wc')[Count[decl:'count']]  (reduce)
                       → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
                       → [tail]]
Friday, 01 March 13                                                                                                                                                                                 63
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
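Because a flow diagram is a DAG, standard graph algorithms apply to it directly. A hypothetical Python sketch (not Cascading’s actual planner): the word-count flow from the DOT output above, reduced to an adjacency structure and topologically sorted — the same structure a planner walks when optimizing or scheduling a workflow:

```python
# Sketch only: a flow diagram as a DAG, with node names taken from the
# word-count DOT listing. TopologicalSorter maps node -> predecessors.
from graphlib import TopologicalSorter  # Python 3.9+

flow = {
    "Each(token)":  {"Hfs(rain.txt)"},
    "GroupBy(wc)":  {"Each(token)"},
    "Every(count)": {"GroupBy(wc)"},
    "Hfs(output)":  {"Every(count)"},
}

# a valid execution order, from source tap to sink tap
order = list(TopologicalSorter(flow).static_order())
print(order)
```

For a simple chain like this the order is forced; for diamond-shaped flows (e.g., the HashJoin with a stop-word RHS) the planner has real freedom, and that freedom is exactly where query optimization and parallel-efficiency math comes in.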
Cascading workflows – business process

            Following the essence of literate programming, Cascading
            workflows provide statements of business process
            This recalls a sense of business process management
            for Enterprise apps (think BPM/BPEL for Big Data)
            As a separation of concerns between business process
            and implementation details (Hadoop, etc.)
            This is especially apparent in large-scale Cascalog apps:
                “Specify what you require, not how to achieve it.”
            By virtue of the pattern language, the flow planner used
            in a Cascading app determines how to translate business
            process into efficient, parallel jobs at scale.




Friday, 01 March 13                                                      64
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
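The “what, not how” separation can be illustrated with a toy planner. This is a hypothetical Python sketch — the RULES table and field names are invented for illustration, and Cascalog’s real planner is far more sophisticated — showing an app that declares which fields it requires, while the planner derives the ordered operations:

```python
# Toy declarative planner: the app states required fields; the planner
# resolves them against a rule table to produce an operation order.
RULES = {
    # field produced -> (fields required, operation name)
    "token": (("text",),  "tokenize"),
    "count": (("token",), "group+count"),
}

def plan(required, available):
    """Resolve required fields into an ordered list of operations."""
    steps = []
    def resolve(field):
        if field in available:
            return
        needs, op = RULES[field]
        for n in needs:        # satisfy prerequisites first
            resolve(n)
        if op not in steps:
            steps.append(op)
        available.add(field)
    for f in required:
        resolve(f)
    return steps

print(plan(["count"], {"doc_id", "text"}))
```

The caller never says *how* to compute “count” — only that it is required — and the ordering falls out of the dependency resolution, which is the essence of the separation between business process and implementation details.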
references…

                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      dl.acm.org/citation.cfm?id=362685
                      Rather than arguing SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on:
                            the process of structuring data
                      That’s what apps do – Making Data Work




Friday, 01 March 13                                                                                                                                             65
Focus on *the process of structuring data*
which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                    Moseley & Marks, 2006
                    “Out of the Tar Pit”
                    goo.gl/SKspn




Friday, 01 March 13                                                             66
A more contemporary statement along similar lines...
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                       Moseley & Marks, 2006
                       “Out of the Tar Pit”
                       goo.gl/SKspn

             several theoretical aspects converge into software engineering
             practices which mitigate the complexity of building and
             maintaining Enterprise data workflows



Friday, 01 March 13                                                                67
The Workflow Abstraction
                      [flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin (Left) with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


Friday, 01 March 13                                                                                                                                                                                                          68
Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data.
Where did Big Data come from, and where is this kind of work headed?
Q3 1997: inflection point

             Four independent teams were working toward horizontal
             scale-out of workflows based on commodity hardware.
             This effort prepared the way for huge Internet successes
             in the 1997 holiday season… AMZN, EBAY, Inktomi
             (YHOO Search), then GOOG

             MapReduce and the Apache Hadoop open source stack
             emerged from this.




Friday, 01 March 13                                                                                                        69
Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
Circa 1996: pre- inflection point

              [architecture diagram: Stakeholder → strategy → Product → requirements → Engineering → optimized code → Web App ↔ Customers; Web App → transactions → RDBMS → SQL Query result sets → BI Analysts → Excel pivot tables, PowerPoint slide decks → Stakeholder]
Friday, 01 March 13                                                                                                           70
Ah, teh olde days - Perl and C++ for CGI :)

Feedback loops shown in red represent data innovations at the time…

Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
Circa 2001: post- big ecommerce successes

              [architecture diagram: Stakeholder ← dashboards ← Algorithmic Modeling (models, recommenders + classifiers, aggregation, SQL Query result sets); Product → Engineering → servlets → Web Apps (UX) ↔ Customers; Middleware → event history → Logs → ETL; customer transactions → RDBMS; DW]
Friday, 01 March 13                                                                                                                                                                                                                    71
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the
marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
Circa 2013: clusters everywhere

              [architecture diagram: Domain Expert, Data Scientist, App Dev, Ops on the “introduced capability” side; Prod, s/w dev, Eng, Ops on the “existing SDLC” side; Workflow at center — business process, dashboard metrics, data science discovery + modeling, taps; App History → Planner → optimized capacity; Use Cases Across Topologies: Hadoop etc. (batch), Log Events (near time), In-Memory Data Grid, over a Cluster Scheduler; Data Products → services → Web Apps, Mobile, etc. ↔ Customers (social interactions, transactions, content); DW, RDBMS]
Friday, 01 March 13                                                                                                                                                                             72
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Mesos, etc.
Asymptotically…

               • long-term trends toward more instrumentation
                 of Enterprise data workflows:
                 - workflow abstraction enables business cases
                 - more machine data collected about apps
                 - flow diagram (DAG) as unit of work
                   (abstract type for machine data)
                 - evolving feedback loops convert machine data
                   into actionable insights and optimizations

               • industry moves beyond common needs of ad-hoc
                 queries on logs and basic reporting, as a new class
                 of complex data workflows emerges to provide
                 the insights required by Enterprise

               • end game is less about “bigness” of data, more about
                 managing complexity in the process of structuring data

               [sidebar stack: DSL → Planner/Optimizer → Workflow ↔ App History → Cluster → Cluster Scheduler]
Friday, 01 March 13                                                                              73
In summary…
references…

                       by Leo Breiman
                       Statistical Modeling: The Two Cultures
                       Statistical Science, 2001
                       bit.ly/eUTh9L




Friday, 01 March 13                                                                                                                                                                                                         74
Leo Breiman wrote an excellent paper in 2001, “Statistical Modeling: The Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
references…

                      Amazon
                      “Early Amazon: Splitting the website” – Greg Linden
                      glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

                      eBay
                      “The eBay Architecture” – Randy Shoup, Dan Pritchett
                      addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
                      addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

                      Inktomi (YHOO Search)
                      “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
                      youtube.com/watch?v=E91oEn1bnXM

                      Google
                      “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
                      youtube.com/watch?v=qsan-GQaeyk
                      perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
                      “The Birth of Google” – John Battelle
                      wired.com/wired/archive/13.08/battelle.html




Friday, 01 March 13                                                                              75
In their own words…
references…


                        by Paco Nathan
                        Enterprise Data Workflows
                        with Cascading
                        O’Reilly, 2013
                        amazon.com/dp/1449358721




Friday, 01 March 13                                           76
Some of this material comes from an upcoming O’Reilly book:
“Enterprise Data Workflows with Cascading”

This should be in Rough Cuts soon -
scheduled to be out in print this June.

Many thanks to my wonderful editor, Courtney Nash.
drill-down…


                      blog, dev community, code/wiki/gists, maven repo,
                      commercial products, career opportunities:
                            cascading.org
                            zest.to/group11
                            github.com/Cascading
                            conjars.org
                            goo.gl/KQtUL
                            concurrentinc.com

                      join us for very interesting work!                  Copyright @2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           77
Links to our open source projects, developer community, etc…

contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)

The Workflow Abstraction

  • 1. “The Workflow Abstraction” Strata SC 2013-02-28 Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Friday, 01 March 13 1 Background: dual in quantitative and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
  • 2. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 2 This talk is about the workflow abstraction: * the business process of structuring data * the practices of building robust apps at scale * the open source projects for Enterprise Data Workflows We’ll consider some theory, examples, best practices, trendlines -- what are the drivers that brought us, and where is this work heading toward? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
  • 3. Marketing Funnel – overview In reference to Making Data Work… Customers Almost every business uses a model similar to this – give or take a few steps. Campaigns Customer leads go in at the top, Awareness those get refined through several stages, then results flow out the bottom. Interest Evalutation Conversion Referral Repeat Friday, 01 March 13 3 Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
  • 4. Marketing Funnel – clickstream Different funnel stages get represented in ecommerce by events captured in Customers log files, as a class of machine data called clickstream Campaigns Impression • ad impressions Awareness • URL clicks Click • landing page views Interest • new user registrations Sign Up Evalutation • session cookies Purchase • online purchases Conversion • social network activity "Like" • etc. Referral Repeat Friday, 01 March 13 4 Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
  • 5. Marketing Funnel – metrics A variety of clickstream metrics can be used as performance indicators Customers at different stages of the funnel: Campaigns • CPM: cost per thousand Impression • CTR: click-through rate Awareness CPM • CPA: cost per action Click • etc. Interest CTR Sign Up Evalutation behaviors Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 5 The many different highly-nuanced metrics which apply are mind-boggling :)
  • 6. Marketing Funnel – example calculations Customers Campaigns Awareness Interest metric cost events formula rate Evalutation Conversion Referral Repeat $4,000 CPM $4,000 10^6 ÷ $4.00 (10^6 ÷ 10^3) 3∙10^3 CTR - 3∙10^3 ÷ 10^6 0.3% $4,000 CPA - 20 ÷ $200 20 Friday, 01 March 13 6 Here are examples of the kinds of calculations performed...
  • 7. Marketing Funnel – predictive model Given these metrics, we can go further to estimate cost per paying user (CPP) Customers customer lifetime value (LTV), etc. Campaigns Then we can build a predictive model for return on investment (ROI) per customer, Awareness summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP Interest As an example, after crunching lots of logs, Evalutation suppose that… Conversion CPP = $200 LTV = $2000 Referral ROI = ($2000 − $200) ∕ $200 Repeat for a 9x multiple Friday, 01 March 13 7 For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
  • 8. Marketing Funnel – example architecture Customers Campaigns Customers Awareness Let’s consider an example architecture Interest Evalutation for calculating, reporting, and taking action Web Conversion on funnel metrics, based on large-scale App Referral Repeat clickstream data… logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Friday, 01 March 13 8 Here’s an example architecture of using clickstream metrics within an online business.
  • 9. Marketing Funnel – complexities Multiple ad partners, different contracts terms, reporting different metrics at Customers × × different times, click scrubs, etc. Campaigns Campaigns target specific geo/demo, Impression × × test alternate landing pages, probably Awareness CPM need to segment customer base… Click These issues make clickstream data Interest CTR large and yet sparse. Sign Up Evalutation behaviors Other issues: × Purchase • seasonal variation Conversion CPA • fluctuating currency exchange rates "Like" Referral NPS, social graph, etc. • distortions due to credit card fraud • diminishing returns Repeat loyalty, win back, etc. • forecasting requirements Friday, 01 March 13 9 However, real life intercedes. In many businesses, this is a complicated model to calculate correctly. scrubs many vendors, data sources, different metrics to be aligned lots of roll-ups Bayesian point estimates forecasts and dashboards social dimension makes this convoluted not simple
  • 10. Marketing Funnel – very large scale Even a small start-up may need to make decisions about billions of Customers events, many millions of users, and millions of dollars in annual ad spend. Campaigns Impression Ad networks attempt to simplify and Awareness CPM optimize parts of the funnel process Click as a value-add. Interest CTR The need for these insights has been a Sign Up driver for Hadoop-related technologies. Evalutation behaviors Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 10 The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
  • 11. Marketing Funnel – very large scale Even a small start-up may need to make decisions about billions of Customers events, many millions of users, and millions of dollars in annual ad spend. Campaigns Impression Ad networks attempt to simplify and Awareness CPM optimize parts of the funnel process Click as a value-add. funnel modeling and optimization Interest CTR The need for these insights has been a Sign Up driver for Hadoop-relatedrequires complex data workflows technologies. Evalutation behaviors to obtain the required insights Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 11 These needs imply complex data workflows. It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.
  • 12. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 12 A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
  • 13. Circa 2008 – Hadoop at scale Customers Scenario: Analytics team at a large ad network… Campaigns Awareness Company had invested $MM capex in a Interest large data warehouse across LOBs Evalutation Conversion Mission-critical app had been written as Referral collab Repeat a large SQL workflow in the DW roll-ups filter Marketing funnel metrics were estimated for many advertisers, many campaigns, per-user recommends many publishers, many customers – billions of calculations daily query/load Predictive models matched publisher ~ advertiser clickstream RDBMS and campaign ~ user, to optimize marketing funnel performance Friday, 01 March 13 13 Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.. Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
  • 14. Circa 2008 – Hadoop at scale Customers Issues: Campaigns Awareness • critical app had hit hard limits for scalability Interest • several Tb data, 100’s of servers Evalutation Conversion • batch window length vs. failure rate vs. SLA collab Referral Repeat in the context of business growth posed roll-ups filter an existential risk × We built out a team to address these issues per-user recommends as rapidly as possible… Needed to re-create that data workflows query/load based on Enterprise requirements. clickstream RDBMS Friday, 01 March 13 14 Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale-out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
  • 15. Circa 2008 – Hadoop at scale Approach: roll-ups collab filter • reverse-engineered business process from ~1500 lines of undocumented SQL per-user • created a large, multi-step Apache Hadoop recommends app on AWS HDFS • leveraged cloud strategy to trade $MM capex for lower, scalable opex • Amazon identified our app as one of the msg queue largest Hadoop deployments on EC2 • our app became a case study for AWS query/load RDBMS prior to Elastic MapReduce launch clickstream Friday, 01 March 13 15 Our solution involved dependencies among more than a dozen Hadoop job steps.
  • 16. Circa 2008 – Hadoop at scale × Unresolved: roll-ups collab filter • ETL was still a separate app • difficult to handle exceptions, notifications, per-user debugging, etc., across the entire workflow recommends HDFS • data scientists wore beepers since Ops × × lacked visibility into business process • coding directly in MapReduce created a staffing bottleneck msg queue query/load clickstream RDBMS Friday, 01 March 13 16 This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM -- for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows: * staffing bottleneck unless there’s a good abstraction layer * operational complexity, mostly due to lack of transparency * system integration problems *are* the main problem to solve
  • 17. Circa 2008 – Hadoop at scale Unresolved: roll-ups collab filter • ETL was still a separate app • difficult to handle exceptions, notifications, per-user debugging, etc., across the entire workflow recommends • data scientists worea good since Ops for a large, commercial beepers solution HDFS lacked visibility into Apachebusiness logic deployment, but the app’s Hadoop • coding directly in MapReduce created a staffing bottleneck workflow management lacked crucial msg queue features… query/load which led to a search for a better clickstream RDBMS workflow abstraction Friday, 01 March 13 17 While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
  • 18. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 18 Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
  • 19. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products. Wensel was following the Nutch open source project – before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology. Friday, 01 March 13 19 Cascading initially grew from interaction with the Nutch project, before Hadoop had a name API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context, with any abstraction layer.
  • 20. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without an need to create an entirely new language • allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters Friday, 01 March 13 20 Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
  • 21. quotes… “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud 2012-06-06 cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck 2012-09-18 infoworld.com/slideshow/65089 Friday, 01 March 13 21 Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch” The issues: * staffing bottleneck * operational complexity * system integration
  • 22. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc. • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera • 5+ history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc. Friday, 01 March 13 22 Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 23. examples… • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Friday, 01 March 13 23 Many case studies, many Enterprise production deployments now for 5+ years.
  • 24. examples… • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascading as the basis for workflow abstractions atop Hadoop and more, Cascalog in Clojure (2010) Scalding in Scala (2012) with a 5+ year history of production deployments across multiple verticals github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Friday, 01 March 13 24 Cascading as a basis for workflow abstraction, for Enterprise data workflows
  • 25. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 25 Code samples in Cascading / Cascalog / Scalding, based on Word Count
  • 26. The Ubiquitous Word Count Document Collection Definition: M Tokenize GroupBy token Count count how often each word appears count how often each word appears R Word Count inin a collection of text documents a collection of text documents This simple program provides an excellent test case for parallel processing, since it illustrates: void map (String doc_id, String text): for each word w in segment(text): • requires a minimal amount of code emit(w, "1"); • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction void reduce (String word, Iterator group): • is not many steps away from useful search indexing int count = 0; • serves as a “Hello World” for Hadoop apps for each pc in group: count += Int(pc); Any distributed computing framework which can run Word emit(word, String(count)); Count efficiently in parallel at scale can handle much larger and more interesting compute problems. Friday, 01 March 13 26 Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
  • 27. word count – conceptual flow diagram Document Collection Tokenize GroupBy M token Count R Word Count 1 map cascading.org/category/impatient 1 reduce 18 lines code gist.github.com/3900702 Friday, 01 March 13 27 Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
  • 28. word count – Cascading app in Java Document Collection String docPath = args[ 0 ]; Tokenize GroupBy token String wcPath = args[ 1 ]; M Count Properties properties = new Properties(); R Word Count AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Friday, 01 March 13 28 Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop 2nd to last line: generates a DOT file for the flow diagram
  • 29. word count – generated flow diagram Document Collection Tokenize [head] M GroupBy token Count R Word Count Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] map Each('token')[RegexSplitGenerator[decl:'token'][args:1]] [{1}:'token'] [{1}:'token'] GroupBy('wc')[by:['token']] wc[{1}:'token'] [{1}:'token'] reduce Every('wc')[Count[decl:'count']] [{2}:'token', 'count'] [{1}:'token'] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] [{2}:'token', 'count'] [{2}:'token', 'count'] [tail] Friday, 01 March 13 29 As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
  • 30. word count – Cascalog / Clojure Document Collection (ns impatient.core M Tokenize GroupBy token Count   (:use [cascalog.api] R Word Count         [cascalog.more-taps :only (hfs-delimited)])   (:require [clojure.string :as s]             [cascalog.ops :as c])   (:gen-class)) (defmapcatop split [line]   "reads in a line of string and splits it by regex"   (s/split line #"[[](),.)s]+")) (defn -main [in out & args]   (?<- (hfs-delimited out)        [?word ?count]        ((hfs-delimited in :skip-header? true) _ ?line)        (split ?line :> ?word)        (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient Friday, 01 March 13 30 Here is the same Word Count app written in Clojure, using Cascalog.
  • 31. word count – Cascalog / Clojure Document Collection github.com/nathanmarz/cascalog/wiki Tokenize GroupBy M token Count R Word Count • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn Friday, 01 March 13 31 From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
• 32. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

Friday, 01 March 13 32
Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
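For comparison, the same conceptual flow (tokenize → group by token → count) collapses to a few lines of single-machine Python, which is handy for checking expected results on a small sample before running the distributed version. The split regex below is an assumption modeled on the pattern used in the slides:

```python
import re
from collections import Counter

def word_count(lines):
    """Tokenize each line, then group by token and count --
    the same Tokenize -> GroupBy -> Count shape as the flow diagram."""
    counts = Counter()
    for line in lines:
        # split on spaces, brackets, parens, commas, periods
        for token in re.split(r"[ \[\](),.]+", line):
            if token:
                counts[token] += 1
    return counts

print(word_count(["rain shadow", "rain dry"]))
```

The distributed versions differ mainly in where the group-by happens: on a cluster, the `groupBy('token)` step becomes the shuffle between map and reduce.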
• 33. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog, not as much of a high-level language
Friday, 01 March 13 33
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
• 34. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less learning curve than Cascalog, not as much of a high-level language
Callout: the Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
Friday, 01 March 13 34
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
• 35. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 35
Tracking back to the Marketing Funnel as an example workflow… Let’s consider how Cascading apps incorporate other components beyond Hadoop
• 36. Enterprise Data Workflows
Back to our marketing funnel, let’s consider an example app… at the front end. LOB use cases drive demand for apps.
[architecture diagram: Customers, Web App, logs/Logs/Cache, Support, source/sink/trap taps, Data Modeling, PMML, Workflow, Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting]
Friday, 01 March 13 36
LOB use cases drive the demand for Big Data apps
• 37. Enterprise Data Workflows
An example… in the back office. Organizations have substantial investments in people, infrastructure, process.
[architecture diagram, as before]
Friday, 01 March 13 37
Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes
• 38. Enterprise Data Workflows
An example… for the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out.
[architecture diagram, as before]
Friday, 01 March 13 38
“Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
• 39. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
• schema and provenance get derived from analysis of the taps
[architecture diagram, as before]
Friday, 01 March 13 39
Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
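As a rough single-machine analogue of what a tap does, here is a hedged Python sketch: a “source” that turns a TSV file with a header row into a stream of named tuples, and a “sink” that writes one back out. This is only the shape of what a Cascading `Hfs`/`TextDelimited` tap provides, minus the distributed plumbing, schema analysis, and serialization options:

```python
import csv

def source_tap(path):
    """Source: read a TSV file with a header row,
    yielding each record as a dict 'tuple'."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row

def sink_tap(path, fieldnames, rows):
    """Sink: write dict 'tuples' back out as TSV with a header row."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        w.writeheader()
        w.writerows(rows)
```

The point of the abstraction is that the pipe assembly in between never needs to know whether its endpoints are flat files, JDBC, HBase, etc.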
• 40. Cascading workflows – taps

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (for TSV data in HDFS)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Friday, 01 March 13 40
Here are the taps in the WordCount source
• 41. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries:
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
• blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
[architecture diagram, as before]
Friday, 01 March 13 41
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
• 42. Cascading workflows – topologies

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );

// flow planner for the Apache Hadoop topology
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// … remainder of the Word Count source as on the previous slide:
// taps, pipes, and FlowDef, then connect and run
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Friday, 01 March 13 42
Here is the flow planner for Hadoop in the WordCount source
  • 43. example topologies… Friday, 01 March 13 43 Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
• 44. Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
• a language for queries – not a database, but ANSI SQL as a DSL for workflows
[architecture diagram, as before]
Friday, 01 March 13 44
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer. BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
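Lingual's specifics aside, the “SQL as a query language, not a database” idea can be previewed locally. Here is a hedged sketch using Python's `sqlite3` module: flat records get loaded into an ad-hoc catalog, then queried with plain SQL. The table and column names are invented for illustration (loosely echoing the MySQL `employees` test database mentioned on the next slides):

```python
import sqlite3

# Load flat records into an ad-hoc in-memory catalog, then query with SQL.
# Table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_no INTEGER, last_name TEXT, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(10001, "Facello", "1986-06-26"), (10002, "Simmel", "1985-11-21")],
)
rows = conn.execute(
    "SELECT last_name FROM employees"
    " WHERE hire_date < '1986-01-01' ORDER BY last_name"
).fetchall()
print(rows)  # -> [('Simmel',)]
```

The analogy only goes so far: in Lingual the same query text would be planned into Cascading flows and executed as parallel jobs rather than against a local database engine.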
• 45. ANSI SQL – CSV data in local file system cascading.org/lingual Friday, 01 March 13 45 The test database for MySQL is available for download from https://launchpad.net/test-db/ Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 46. ANSI SQL – shell prompt, catalog cascading.org/lingual Friday, 01 March 13 46 Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
  • 47. ANSI SQL – queries cascading.org/lingual Friday, 01 March 13 47 Here’s an example SQL query on that “employee” test database from MySQL.
• 48. Cascading workflows – machine learning
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• Cascading creates parallelized models to run at scale on Hadoop clusters
• Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.
• integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)
• 2 lines of code or pre-built JAR
• run multiple variants of models as customer experiments
[architecture diagram, as before]
Friday, 01 March 13 48
PMML has been around for a while, and export is supported by nearly every commercial analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
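At its simplest level, “reading PMML” means parsing XML: a PMML document carries a data dictionary plus the model parameters. As a hedged sketch, the snippet below fabricates a tiny in-line PMML fragment (real exports from R, SAS, etc. are far richer and carry an XML namespace) and lists its fields with the standard library:

```python
import xml.etree.ElementTree as ET

# A tiny illustrative PMML fragment; real exports are far richer
# and declare an XML namespace, which this sketch omits.
PMML = """<PMML version="4.1">
  <DataDictionary numberOfFields="2">
    <DataField name="label" optype="categorical" dataType="string"/>
    <DataField name="var0" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

root = ET.fromstring(PMML)
fields = [f.get("name") for f in root.iter("DataField")]
print(fields)  # -> ['label', 'var0']
```

Pattern does the equivalent of this, then builds a parallel scoring workflow from the model parameters it finds.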
• 49. model creation in R

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

cascading.org/pattern
Friday, 01 March 13 49
Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
• 50. model run at scale as a Cascading app
[flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count (M/R) → Scored Orders; Failure Traps → Confusion Matrix]
cascading.org/pattern
Friday, 01 March 13 50
Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
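The confusion-matrix tally at the end of that flow is itself simple. Here is a hypothetical Python version over (actual, predicted) label pairs, standing in for the GroupBy/Count stages of the diagram:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) label pairs -- the same group-and-count
    shape as the GroupBy/Count stages in the flow diagram."""
    return Counter(zip(actual, predicted))

cm = confusion_matrix(["ok", "fraud", "ok", "ok"],
                      ["ok", "fraud", "fraud", "ok"])
print(cm[("ok", "fraud")])  # one false positive -> 1
```

In the distributed version, the pairs arrive as tuples from the classify pipe, and records that fail assertions never reach the tally; they land in the failure traps instead.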
• 51. model run at scale as a Cascading app

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}

Friday, 01 March 13 51
Source code for a simple Cascading app that runs PMML models in general.
  • 52. PMML support… Friday, 01 March 13 52 Popular tools which can create predictive models for export as PMML
• 53. Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
• TDD follows from Cascalog’s composable subqueries
• redirect traps in production to Ops, QA, Support, Audit, etc.
[architecture diagram, as before]
Friday, 01 March 13 53
TDD is not usually high on the list when people start discussing Big Data apps. The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
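The assert-and-trap pattern above can be sketched in a few lines. This is an illustrative Python version, not the Cascading API: tuples failing a regex assertion get diverted to a trap as “data exceptions” instead of killing the run:

```python
import re

def assert_stream(tuples, field, pattern, trap):
    """Pass through tuples whose `field` fully matches `pattern`;
    divert the rest to `trap` as 'data exceptions' rather than
    failing the whole flow."""
    rx = re.compile(pattern)
    for t in tuples:
        if rx.fullmatch(str(t[field])):
            yield t
        else:
            trap.append(t)

trap = []
good = list(assert_stream(
    [{"zip": "94103"}, {"zip": "n/a"}, {"zip": "10001"}],
    "zip", r"\d{5}", trap))
print(len(good), len(trap))  # -> 2 1
```

In production the trap would be a sink tap routed to Ops, QA, Support, Audit, etc., so edge cases become data to inspect rather than stack traces.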
• 54. Cascading workflows – TDD meets API principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources – fail fast prior to submit
• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
[architecture diagram, as before]
Friday, 01 March 13 54
Some of the design principles for the pattern language
• 55. Two Avenues…
Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
[chart axes: complexity ➞, scale ➞]
Friday, 01 March 13 55
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
• 56. Two Avenues…
Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps
[chart axes: complexity ➞, scale ➞]
Friday, 01 March 13 56
Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple different ways to arrive at the party.
• 57. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 57
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
• 58. Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.
Friday, 01 March 13 58
A pattern language, based on the metaphor of “plumbing”
  • 59. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. amazon.com/dp/0195019199 design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”. amazon.com/dp/0201633612 Friday, 01 March 13 59 Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
• 60. Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram, as before]
Callout: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.
Friday, 01 March 13 60
The pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which also addresses staffing issues
• 61. Cascading workflows – literate programming
Cascading workflows generate their own visual documentation: flow diagrams
[flow diagram, as before]
In formal terms, flow diagrams leverage a methodology called literate programming
Provides intuitive, visual representations for apps, great for cross-team collaboration.
Friday, 01 March 13 61
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
  • 62. references… by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” Friday, 01 March 13 62 Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
• 63. examples…
• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation
• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code
In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.
[DOT flow plan for Word Count, as on slide 29]
Friday, 01 March 13 63
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
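One small example of the “interesting math” that applies to a flow DAG is a topological sort: the ordering a planner needs before it can schedule stages. The graph below is the conceptual word-count flow; the code is an illustrative sketch, not a Cascading internal:

```python
from collections import defaultdict, deque

def topo_sort(edges):
    """Kahn's algorithm: order DAG nodes so every edge points forward;
    raise if the graph contains a cycle (i.e., is not a DAG)."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a DAG")
    return order

edges = [("head", "Tokenize"), ("Tokenize", "GroupBy"),
         ("GroupBy", "Count"), ("Count", "tail")]
print(topo_sort(edges))  # -> ['head', 'Tokenize', 'GroupBy', 'Count', 'tail']
```

The same DAG structure is what makes query optimization and predictive models about app execution tractable: the graph is data that can be analyzed before any cluster resources get consumed.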
• 64. Cascading workflows – business process
Following the essence of literate programming, Cascading workflows provide statements of business process
This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
As a separation of concerns between business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
Friday, 01 March 13 64
Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
  • 65. references… by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on: the process of structuring data That’s what apps do – Making Data Work Friday, 01 March 13 65 Focus on *the process of structuring data* which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
  • 66. Cascading workflows – functional relational programming The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn Friday, 01 March 13 66 A more contemporary statement along similar lines...
• 67. Cascading workflows – functional relational programming
The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn
Callout: several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows
Friday, 01 March 13 67
• 68. The Workflow Abstraction
[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left (Stop Word List, RHS) → GroupBy token → Count (R) → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 68
Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
  • 69. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this. Friday, 01 March 13 69 Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale-out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
• 70. Circa 1996: pre- inflection point
[diagram: Stakeholder, Customers, Excel pivot tables, PowerPoint slide decks, strategy, BI, Product, Analysts, requirements, SQL Query, optimized code, Engineering, Web App, result sets, transactions, RDBMS]
Friday, 01 March 13 70
Ah, teh olde days - Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos
• 71. Circa 2001: post- big ecommerce successes
[diagram: Stakeholder, Product, Customers, dashboards, UX, Engineering, models, servlets, recommenders + classifiers, Algorithmic Modeling, Web Apps, Middleware, aggregation, event history, SQL Query, result sets, customer transactions, Logs, DW, ETL, RDBMS]
Friday, 01 March 13 71
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
• 72. Circa 2013: clusters everywhere
[diagram: Data Products, Customers, business process, Domain Expert, Workflow, dashboard metrics, Web Apps / services / Mobile, s/w dev, data science, Data Scientist, Planner, social interactions + transactions, discovery, optimized taps / capacity, modeling, content, App Dev, Use Cases Across Topologies, Hadoop etc., In-Memory Data Grid, Log Events, Ops, DW, batch / near time, Cluster Scheduler, SDLC, RDBMS; legend: introduced capability vs. existing capability]
Friday, 01 March 13 72
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Mesos, etc.
• 73. Asymptotically…
• long-term trends toward more instrumentation of Enterprise data workflows:
  - workflow abstraction enables business cases
  - more machine data collected about apps
  - flow diagram (DAG) as unit of work (abstract type for machine data)
  - evolving feedback loops convert machine data into actionable insights and optimizations
• industry moves beyond common needs of ad-hoc queries on logs and basic reporting, as a new class of complex data workflows emerges to provide the insights required by Enterprise
• end game is less about “bigness” of data, more about managing complexity in the process of structuring data
[sidebar stack: DSL → Planner/Optimizer → Workflow → App History → Cluster → Cluster Scheduler]
Friday, 01 March 13 73
In summary…
  • 74. references… by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L Friday, 01 March 13 74 Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)
• 75. references… Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) youtube.com/watch?v=E91oEn1bnXM Google “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtube.com/watch?v=qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx “The Birth of Google” – John Battelle wired.com/wired/archive/13.08/battelle.html Friday, 01 March 13 75 In their own words…
  • 76. references… by Paco Nathan Enterprise Data Workflows with Cascading O’Reilly, 2013 amazon.com/dp/1449358721 Friday, 01 March 13 76 Some of this material comes from an upcoming O’Reilly book: “Enterprise Data Workflows with Cascading” This should be in Rough Cuts soon - scheduled to be out in print this June. Many thanks to my wonderful editor, Courtney Nash.
• 77. drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com join us for very interesting work! Copyright @2013, Concurrent, Inc. Friday, 01 March 13 77 Links to our open source projects, developer community, etc… contact me @pacoid http://concurrentinc.com/ (we're hiring too!)