Cascading Meetup #4

                       BlueKai
                       Cupertino, CA
                       2013-03-05




                                       Copyright @2013, Concurrent, Inc.




Tuesday, 05 March 13                                                       1
Cascading Meetup

[Figure: word-count flow diagram with nodes Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin (Left, RHS), Regex token, GroupBy token, Count, Word Count; phases marked M and R]

              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development
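
The node labels in the title-slide diagram (Tokenize, Scrub, a join against a Stop Word List, GroupBy, Count) describe a word-count flow. As a rough illustration of that data flow only, here is a minimal sketch in plain Java streams. It does not use the Cascading API; the class name `WordCountSketch`, the splitting regex, and the sample sentences are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the word-count flow; NOT the Cascading API.
class WordCountSketch {

    // Tokenize -> Scrub -> drop stop words -> GroupBy token -> Count.
    static Map<String, Long> wordCount(List<String> documents, Set<String> stopWords) {
        return documents.stream()
            // Tokenize: split on runs of whitespace and simple punctuation (assumed regex)
            .flatMap(doc -> Arrays.stream(doc.split("[\\s\\[\\](),.]+")))
            // Scrub: trim and lower-case each token, then drop empties
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .filter(token -> !token.isEmpty())
            // Stop-word step: the diagram shows a join against a stop word list;
            // this sketch uses a set lookup and keeps only tokens with no match
            .filter(token -> !stopWords.contains(token))
            // GroupBy token + Count, sorted by token for stable output
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample input
        List<String> docs = List.of("A rain shadow is a dry area.", "Rain falls on the windward side.");
        Set<String> stop = Set.of("a", "is", "on", "the");
        System.out.println(wordCount(docs, stop));
    }
}
```

In a real Cascading app each of these steps would be a pipe operation planned onto the cluster; the stream version only shows the order of the transformations.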




Enterprise Data Workflows

            Let’s consider an example app… at the front end
            LOB use cases drive demand for apps

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows

             An example… in the back office
             Organizations have substantial investments
             in people, infrastructure, process

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows

              An example… for the heavy lifting!
              “Main Street” firms are migrating
              workflows to Hadoop, for cost
              savings and scale-out

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

“Main Street” firms have invested in Hadoop to address Big Data needs,
offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Two Avenues…

             Enterprise: must contend with
             complexity at scale every day…
             incumbents extend current practices and
             infrastructure investments – using J2EE,
             ANSI SQL, SAS, etc. – to migrate
             workflows onto Apache Hadoop while
             leveraging existing staff

             Start-ups: crave complexity and
             scale to become viable…
             new ventures move into Enterprise space
             to compete using relatively lean staff,
             while leveraging sophisticated engineering
             practices, e.g., Cascalog and Scalding

[Figure: chart with axes labeled “complexity ➞” and “scale ➞”]

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

              Enterprise: must contend with
              complexity at scale every day…
              incumbents extend current practices and
              infrastructure investments – using J2EE,
              ANSI SQL, SAS, etc. – to migrate
              workflows onto Apache Hadoop while
              leveraging existing staff

              Start-ups: crave complexity and
              scale to become viable…
              new ventures move into Enterprise space
              to compete using relatively lean staff,
              while leveraging sophisticated engineering
              practices, e.g., Cascalog and Scalding

              Hadoop almost never gets used
              in isolation; data workflows define
              the “glue” required for system
              integration of Enterprise apps

[Figure: chart with axes labeled “complexity ➞” and “scale ➞”]

Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple of different ways to arrive at the party.
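
The architecture diagram on the preceding slides labels source, sink, and trap taps around the data workflow. As a rough illustration of that idea, here is a minimal plain-Java sketch in which records are read from a source, results are written to a sink, and records that fail a processing step are diverted to a trap instead of stopping the run. It does not use the Cascading API; `TapSketch`, `run`, and the sample records are hypothetical names and data chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration of source / sink / trap routing; NOT the Cascading API.
class TapSketch {

    // Apply `step` to every record from `source`.
    // Successful results go to `sink`; records whose step throws go to `trap`.
    static <I, O> void run(Iterable<I> source, Function<I, O> step, List<O> sink, List<I> trap) {
        for (I record : source) {
            try {
                sink.add(step.apply(record));
            } catch (RuntimeException e) {
                // Keep the bad record for later inspection rather than failing the whole flow
                trap.add(record);
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = List.of("200", "404", "oops", "500"); // made-up log fields
        List<Integer> sink = new ArrayList<>();
        List<String> trap = new ArrayList<>();

        run(source, s -> Integer.parseInt(s), sink, trap);

        System.out.println("sink = " + sink);
        System.out.println("trap = " + trap);
    }
}
```

Keeping the trap as a separate output means bad input can be reviewed by Support or Ops without rerunning the flow, which fits the system-integration point made in the notes above.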
Cascading Meetup

[Figure: word-count flow diagram, Document Collection through Word Count]

              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development




Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base
               • ANSI SQL parser/optimizer atop Cascading flow planner
               • JDBC driver to integrate into existing tools and app servers
               • relational catalog over a collection of unstructured data
               • SQL shell prompt to run queries

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base
               • ANSI SQL parser/optimizer atop Cascading flow planner
               • JDBC driver to integrate into existing tools and app servers
               • relational catalog over a collection of unstructured data
               • SQL shell prompt to run queries

               Premise: most SQL in the world gets written by machines…
               This isn’t a database; this is about making
               machine-to-machine communications
               simpler and more robust at scale.

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

Cascading workflows – ANSI SQL

               • enable analysts without retraining on Hadoop, etc.
               • transparency for Support, Ops, Finance, et al.

             a language for queries – not a database,
             but ANSI SQL as a DSL for workflows

[Figure: architecture diagram. Labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Data Workflow; source, sink, and trap taps; Analytics Cubes; Reporting; Customer Prefs; customer profile DBs; Hadoop Cluster]

ANSI SQL – reviews
            Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
            Thor Olavsrud, 2013-02-22
            cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop


            Hadoop Apps Without MapReduce Mindsets
            Adrian Bridgwater, 2013-02-28
            drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708


            Concurrent gives old SQL users new Hadoop tricks
            Jack Clark, 2013-02-20
            theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/


            Concurrent Open Source Project Ties SQL to Hadoop
            Michael Vizard, 2013-02-21
            itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html


            Concurrent Releases Lingual, a SQL DSL for Hadoop
            Boris Lublinsky, 2013-02-28
            infoq.com/news/2013/02/Lingual

ANSI SQL – CSV data in local file system




               cascading.org/lingual


The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
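
To make the idea of overlaying a table schema on CSV flat files concrete, here is a minimal plain-Java sketch: the file holds only raw text fields, and a separately declared list of columns supplies the names and types. This is an illustration of the concept only, not how Lingual works internally and not its command line syntax; `CsvSchemaSketch`, `Column`, the column names, and the sample rows are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT how Lingual is implemented.
class CsvSchemaSketch {

    // Hypothetical column declaration: a name plus a simple type tag.
    static final class Column {
        final String name;
        final String type; // "int" or "string" in this sketch

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Convert one raw CSV field to the type its column declares.
    static Object coerce(String raw, Column column) {
        String value = raw.trim();
        switch (column.type) {
            case "int":
                return Integer.parseInt(value);
            case "string":
                return value;
            default:
                throw new IllegalArgumentException("unsupported type: " + column.type);
        }
    }

    // Overlay the schema on headerless CSV lines: one named, typed row per line.
    static List<Map<String, Object>> read(List<String> lines, List<Column> schema) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1); // naive split: no quoted commas
            if (fields.length != schema.size()) {
                throw new IllegalArgumentException("expected " + schema.size() + " fields: " + line);
            }
            Map<String, Object> row = new LinkedHashMap<>();
            for (int i = 0; i < fields.length; i++) {
                row.put(schema.get(i).name, coerce(fields[i], schema.get(i)));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed column names and made-up rows, for illustration
        List<Column> schema = List.of(new Column("EMP_NO", "int"), new Column("FIRST_NAME", "string"));
        List<String> lines = List.of("1,Ada", "2,Grace");
        System.out.println(read(lines, schema));
    }
}
```

The point of the separation is that the files stay untouched: the schema is metadata kept alongside them, so the same directory of CSV files can be described once and then queried by name.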
ANSI SQL – shell prompt, catalog




                cascading.org/lingual


Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries




              cascading.org/lingual


Here’s an example SQL query on that “employee” test database from MySQL.
ANSI SQL – layers

             abstraction      RDBMS                                     JVM Cluster
             parser           ANSI SQL compliant parser                 ANSI SQL compliant parser
             optimizer        logical plan, optimized based on stats    logical plan, optimized based on stats
             planner          physical plan                             API “plumbing”
             machine data     query history, table stats                app history, tuple stats
             topology         b-trees, etc.                             heterogenous, distributed: Hadoop, IMDG, etc.
             visualization    ERD                                       flow diagram
             schema           table schema                              tuple schema
             catalog          relational catalog                        tap usage DB
             provenance       (manual audit)                            data set producers/consumers

When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
ANSI SQL – JDBC driver
             import java.sql.Connection;
             import java.sql.DriverManager;
             import java.sql.ResultSet;
             import java.sql.SQLException;
             import java.sql.Statement;

             public void run() throws ClassNotFoundException, SQLException {
                 // load the Lingual JDBC driver and point it at a directory of flat files
                 Class.forName( "cascading.lingual.jdbc.Driver" );
                 Connection connection =
                   DriverManager.getConnection( "jdbc:lingual:local;schemas=src/main/resources/data/example" );
                 Statement statement = connection.createStatement();

                 // identifiers are double-quoted, so the quotes must be escaped in Java
                 ResultSet resultSet = statement.executeQuery(
                     "select *\n"
                       + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
                       + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
                       + "on e.\"EMPID\" = s.\"CUST_ID\"" );

                 // print each row as "LABEL=value; LABEL=value; ..."
                 while( resultSet.next() ) {
                   int n = resultSet.getMetaData().getColumnCount();
                   StringBuilder builder = new StringBuilder();

                   for( int i = 1; i <= n; i++ ) {
                     builder.append( ( i > 1 ? "; " : "" )
                         + resultSet.getMetaData().getColumnLabel( i ) + "=" + resultSet.getObject( i ) );
                   }

                   System.out.println( builder );
                 }

                 resultSet.close();
                 statement.close();
                 connection.close();
             }



Note that in this example the schema for the DDL has been derived directly from the CSV files.

In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
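
To make the idea of deriving a schema from flat files concrete, here is a minimal plain-Java sketch. It is not how Lingual does it: the class CsvSchemaSketch, its methods, and the rule "all digits means INTEGER, anything else means VARCHAR" are assumptions made up for illustration. Only the column names EMPID and NAME and the sample values come from the example output in this deck.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is NOT Lingual's schema derivation.
class CsvSchemaSketch {

  // Infer a column-name -> SQL-type mapping from a header line and one sample row.
  // Assumption: a value made only of digits is an INTEGER, anything else a VARCHAR.
  static Map<String, String> inferSchema(String headerLine, String sampleRow) {
    String[] names = headerLine.split(",");
    String[] values = sampleRow.split(",");
    Map<String, String> schema = new LinkedHashMap<>();
    for (int i = 0; i < names.length; i++) {
      String value = i < values.length ? values[i].trim() : "";
      schema.put(names[i].trim(), value.matches("-?\\d+") ? "INTEGER" : "VARCHAR");
    }
    return schema;
  }

  // Render the inferred schema as a CREATE TABLE statement with quoted identifiers.
  static String toDdl(String table, Map<String, String> schema) {
    List<String> columns = new ArrayList<>();
    for (Map.Entry<String, String> entry : schema.entrySet()) {
      columns.add("\"" + entry.getKey() + "\" " + entry.getValue());
    }
    return "CREATE TABLE \"" + table + "\" (" + String.join(", ", columns) + ")";
  }

  public static void main(String[] args) {
    Map<String, String> schema = inferSchema("EMPID,NAME", "100,Bill");
    System.out.println(toDdl("EMPLOYEE", schema));
  }
}
```

The point of the sketch is only that a header row plus sample data is enough to produce DDL, which is why a directory of CSV files can be queried without a separate load step.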
ANSI SQL – JDBC driver
            $ gradle clean jar
            $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
             
            CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
            CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian




                                Caveat: if you absolutely positively must have sub-second
                                SQL query response for PB-scale data on a 1000+ node
                                cluster… Good luck with that! (call the MPP vendors)
                                This ANSI SQL library is primarily intended for batch
                                workflows – high throughput, not low-latency –
                                for many under-represented use cases in Enterprise IT.
                                It’s essentially ANSI SQL as a DSL.




success
Cascading Meetup
              [Figure: word count flow diagram with Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin, Regex token, GroupBy token, Count, Word Count]




              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development




Test-Driven Development (TDD)




                                source: Wikipedia

A general view of the TDD process.
Test-Driven Development (TDD)




                                                                    In terms of Big Data apps, TDD is not
                                                                    generally part of the conversation




TDD is not usually high on the list when people start discussing Big Data apps.
Traps – Cascading “exceptional data”

               •   assert patterns (regex) on the tuple streams
               •   adjust assert levels, like log4j levels
               •   define traps on branches
               •   tuples which fail asserts get trapped

               [Figure: enterprise data workflow diagram with Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, Reporting, customer profile DBs (Customer Prefs), and a Data Workflow on a Hadoop Cluster with source, sink, and trap taps]




An innovation in Cascading was to introduce the notion of a “data exception”,
based on setting stream assertion levels as part of the business logic of an app.
Traps – example code
            // set up...

            Pipe etlPipe = new Pipe( "etlPipe" );

            // some processing...

            // assert that each tuple matches the expected pattern
            AssertMatches assertMatches = new AssertMatches( ".*true" );
            etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

            // some processing...

            // tuples which fail on this branch are diverted to trapTap
            FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
              .addSource( etlPipe, jsonTap )
              .addTrap( etlPipe, trapTap )
              .addTailSink( etlPipe, cacheTap );

            // enable the assertions only when requested on the command line
            if( options.has( "assert" ) )
              flowDef.setAssertionLevel( AssertionLevel.STRICT );
            else
              flowDef.setAssertionLevel( AssertionLevel.NONE );


Example use in Cascading code
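
For readers without a Cascading build at hand, the routing behaviour can be sketched in plain Java. None of the names below (TrapSketch, Level, sink, trap) are Cascading API; they are assumptions that only mimic the idea of a stream assertion whose failures land in a trap, and whose level can be switched off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java sketch of the trap idea; none of these names are Cascading API.
class TrapSketch {

  // Hypothetical stand-in for adjustable assertion levels.
  enum Level { NONE, STRICT }

  final List<String> sink = new ArrayList<>();  // tuples that continue downstream
  final List<String> trap = new ArrayList<>();  // "exceptional data" set aside

  // Route each tuple: when assertions are on, tuples that do not match the
  // expected pattern go to the trap; with Level.NONE nothing is checked.
  void run(List<String> tuples, Pattern expected, Level level) {
    for (String tuple : tuples) {
      boolean ok = level == Level.NONE || expected.matcher(tuple).matches();
      if (ok) {
        sink.add(tuple);
      } else {
        trap.add(tuple);
      }
    }
  }

  public static void main(String[] args) {
    TrapSketch flow = new TrapSketch();
    List<String> tuples = Arrays.asList("a,true", "b,false", "c,true");
    flow.run(tuples, Pattern.compile(".*true"), Level.STRICT);
    System.out.println("sink=" + flow.sink);
    System.out.println("trap=" + flow.trap);
  }
}
```

The flow keeps running when a tuple fails: the bad record is set aside for later inspection instead of failing the whole job.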
Traps – redirect exceptions in production
            shunt the trapped exceptional data to other
            parts of the organization:

             •   Ops: notifications
             •   QA: investigate data anomalies
             •   Support: review customer records
             •   Finance: audit

             [Figure: enterprise data workflow diagram with Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, Reporting, customer profile DBs (Customer Prefs), and a Data Workflow on a Hadoop Cluster with source, sink, and trap taps]




TDD – practice at scale
             1. assert expected patterns in raw input
             2. run just that, to find edge cases
             3. handle the edge cases for input data
             4. assert expected patterns after first chunk of processing
             5. run just that, to verify failure
             6. code until test passes
             7. repeat #4 for each chunk

             [Figure: flow diagram of a large app: GIS export parsed by regex into tree and road branches with failure traps, Tree Metadata and Road Metadata joins, Estimate height, Estimate Albedo, Estimate traffic, Geohash, distance and sum_moment filters, shade, gps logs with gps_count and recent_visit, and a final join producing reco]




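
The loop of asserting, running just that stage, and fixing edge cases can be sketched in plain Java as follows. This is a hypothetical illustration, not Cascading code: the class StagedAssertSketch, its helper methods, and the sample "species,height" records are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of the staged assert loop; not Cascading API.
class StagedAssertSketch {

  // Return the records that violate the expected pattern at this stage.
  static <T> List<T> failures(List<T> records, Predicate<T> expected) {
    List<T> failed = new ArrayList<>();
    for (T record : records) {
      if (!expected.test(record)) {
        failed.add(record);
      }
    }
    return failed;
  }

  // Apply one chunk of processing to every record.
  static <A, B> List<B> apply(List<A> records, Function<A, B> chunk) {
    List<B> out = new ArrayList<>();
    for (A record : records) {
      out.add(chunk.apply(record));
    }
    return out;
  }

  public static void main(String[] args) {
    // Invented sample input: "species,height" lines, one with a missing height.
    List<String> raw = Arrays.asList("oak,12", "elm,", "ash,7");

    // Steps 1-2: assert the expected pattern on raw input and run only that.
    List<String> edgeCases = failures(raw, s -> s.matches("\\w+,\\d+"));
    System.out.println("edge cases in raw input: " + edgeCases);

    // Step 3: handle the edge cases (here, simply set them aside).
    List<String> clean = new ArrayList<>(raw);
    clean.removeAll(edgeCases);

    // Steps 4-6: process the first chunk, then assert on its output.
    List<Integer> heights = apply(clean, s -> Integer.parseInt(s.split(",")[1]));
    System.out.println("failures after first chunk: " + failures(heights, h -> h > 0));
  }
}
```

Each later chunk of processing gets the same treatment: add an assertion on its output, run up to that point, and fix whatever the failures reveal before moving on.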
TDD – Cascalog features
             consider that TDD is about asserting and negating logical
             predicates…
               •   Cascalog is based on logical predicates
               •   function definitions as composable subqueries
               •   functions are not particularly far from being unit tests
               •   Midje: facts, mocks

               sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
               sritchie.github.com/2012/01/22/cascalog-testing-20.html




Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., comes close to using TDD as its methodology:
you start with ad-hoc queries written as logic predicates, then compose those predicates into large-scale apps.
Cascading Meetup
              [Figure: word count flow diagram with Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin, Regex token, GroupBy token, Count, Word Count]




              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development
              …plus, a proposal




ANSI SQL – multiple flows



              [Figure: flow diagram of a large app: GIS export parsed by regex into tree and road branches with failure traps, Tree Metadata and Road Metadata joins, Estimate height, Estimate Albedo, Estimate traffic, Geohash, distance and sum_moment filters, shade, gps logs with gps_count and recent_visit, and a final join producing reco]




              Suppose your organization is responsible
              for a large-scale app…
              Multiple teams develop reusable libraries…
Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
ANSI SQL – multiple flows



              [Figure: the same large-app flow diagram as on the previous slide]
                                                                                                                                  Geohash                                                                                                            Join



                                                                         M
                                                                                                                       R
                                                                                      Road
                                                                                     Metadata                                                                                     gps                                                                       R
                                                                                                                                                                                                                                              gps               reco
                                                                                                                                                                                  logs




                                                                                                                                                                                                            Count
                                                                                                                                                                                          Geohash                             Max
                                                                                                                                                                                                          gps_count
                                                                                                                                                                                                                           recent_visit




                                                                                                                                                                              M                           R




              Data Analysts: ANSI SQL queries
              for data prep
              (displaces Hive, etc.)
Analysts generally work with ANSI SQL queries in a data warehouse (DW), e.g., for ETL, data prep, or pulling data cubes.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

[Diagram: the same Cascading workflow as on the previous slide: GIS export, tree and road parsing, failure traps, shade and gps branches, reco output.]

Server-side Engineering: HBase tap for customer profiles (integrating other components)
Engineering provides integration with customer profiles, e.g., transactional data objects in HBase.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

[Diagram: the same Cascading workflow as on the previous slides: GIS export, tree and road parsing, failure traps, shade and gps branches, reco output.]

Ops + Support: Traps get routed to customer review (ties into notifications, etc.)
Support needs to review exceptional data, via reports/notifications.
These can migrate into a Cascading app to run on Hadoop.
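
The idea behind a trap can be sketched in plain Java. This is only an illustration, not Cascading's API: the class `TrapSketch`, the method `parseHeights`, and the sample input are all hypothetical. Records that fail to parse are set aside for review instead of aborting the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: route unparseable records to a trap instead of failing the run. */
public class TrapSketch {

    /** Holds the two outputs of one pass: clean values and trapped raw records. */
    public static final class Result {
        public final List<Double> parsed = new ArrayList<>();
        public final List<String> trapped = new ArrayList<>();
    }

    /** Parses each line as a numeric height; lines that fail are kept for review. */
    public static Result parseHeights(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            try {
                result.parsed.add(Double.parseDouble(line.trim()));
            } catch (NumberFormatException e) {
                // Exceptional data goes to the trap rather than stopping the job.
                result.trapped.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input: two valid heights, two records that cannot be parsed.
        Result r = parseHeights(Arrays.asList("12.5", "n/a", " 7 ", ""));
        System.out.println("parsed=" + r.parsed + " trapped=" + r.trapped);
    }
}
```

The trapped list is what would feed a report or notification for Support, while the parsed values continue through the rest of the pipeline.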
ANSI SQL – multiple flows

[Diagram: the same Cascading workflow as on the previous slides: GIS export, tree and road parsing, failure traps, shade and gps branches, reco output.]

Data Scientists: R => PMML for predictive models (displaces SAS, etc.)
Scientists do their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export models as PMML.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

[Diagram: the same Cascading workflow as on the previous slides: GIS export, tree and road parsing, failure traps, shade and gps branches, reco output.]

App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.)
Generally the revenue apps require some custom business logic, representing business processes for the line of business (LOB).
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

[Figure: workflow diagram with map (M) and reduce (R) stages. Node labels: GIS export, Regex parse-gis, src, Regex parse-tree, Scrub species, Tree Metadata, Join, Estimate height, Geohash, Filter tree height, Failure Traps, Regex parse-road, Road Metadata, Estimate Albedo, Road Segments, Estimate traffic, Calculate distance, Filter distance, Sum moment, Filter sum_moment, shade, gps logs, Count gps_count, Max recent_visit, reco]

              Front-end Engineering: Memcached
              tap for pushing updates to API
              (integrating other components)
Engineering provides integration with caching layer, for API updates.
These can migrate into a Cascading app to run on Hadoop.
Cascading workflows – API principles

               • specify what is required, not how it must be achieved
               • plan far ahead, before consuming cluster resources – fail fast prior to submit
               • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
               • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
               • no surprises

[Figure: enterprise data workflow diagram. Labels: Customers, Web App, logs, Cache, Support, source/sink/trap taps, Data Workflow, Modeling (PMML), Analytics Cubes, Customer profile DBs, Customer Prefs, Hadoop Cluster, Reporting]

Some of the design principles for the pattern language
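
The "plan far ahead, fail fast prior to submit" principle can be illustrated with a small plain-Java sketch. This is not the Cascading API: the class `FlowSketch` and its methods (`requireSource`, `bindSink`, `plan`, `submit`) are hypothetical names invented for illustration, as are the paths. The sketch only shows the idea that a flow definition is validated in full, and deterministically, before any cluster resources would be consumed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the Cascading API: a flow definition that is
// checked ("planned") in full before any work is submitted.
final class FlowSketch {
    // logical pipe name -> physical location, bound at run time (no recompile)
    private final Map<String, String> sources = new LinkedHashMap<>();
    private final Map<String, String> sinks = new LinkedHashMap<>();
    private final List<String> requiredSources = new ArrayList<>();
    private final List<String> requiredSinks = new ArrayList<>();

    FlowSketch requireSource(String pipe) { requiredSources.add(pipe); return this; }
    FlowSketch requireSink(String pipe) { requiredSinks.add(pipe); return this; }
    FlowSketch bindSource(String pipe, String path) { sources.put(pipe, path); return this; }
    FlowSketch bindSink(String pipe, String path) { sinks.put(pipe, path); return this; }

    // Deterministic check: the same definition always yields the same
    // errors in the same order, so a failure can be reproduced exactly.
    List<String> plan() {
        List<String> errors = new ArrayList<>();
        for (String pipe : requiredSources)
            if (!sources.containsKey(pipe)) errors.add("unbound source: " + pipe);
        for (String pipe : requiredSinks)
            if (!sinks.containsKey(pipe)) errors.add("unbound sink: " + pipe);
        return errors;
    }

    // Fail fast: refuse to submit if the plan has any errors.
    void submit() {
        List<String> errors = plan();
        if (!errors.isEmpty())
            throw new IllegalStateException("plan failed: " + errors);
        // a real app would hand the validated plan to the cluster here
    }

    public static void main(String[] args) {
        FlowSketch flow = new FlowSketch()
            .requireSource("logs")
            .requireSink("cache")
            .bindSource("logs", args.length > 0 ? args[0] : "data/logs");
        System.out.println(flow.plan()); // prints [unbound sink: cache]
    }
}
```

Because the source path is taken from the command line in this sketch, the same compiled class can be pointed at different data without a rebuild, which loosely mirrors the "same JAR, any scale" bullet.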
book…


                      by Paco Nathan
                      Enterprise Data Workflows
                      with Cascading
                      O’Reilly, 2013
                      amazon.com/dp/1449358721




Our upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”
Should be in Rough Cuts soon -- scheduled to be out in print this June
drill-down…


                     blog, dev community, code/wiki/gists, maven repo,
                     commercial products, career opportunities:
                           cascading.org
                           zest.to/group11
                           github.com/Cascading
                           conjars.org
                           goo.gl/KQtUL
                           concurrentinc.com

                      join us for very interesting work!




Links to our open source projects, developer community, etc…

Cascading meetup #4 @ BlueKai

  • 1. Cascading Meetup #4, BlueKai, Cupertino, CA, 2013-03-05
  • 2. Cascading Meetup. Agenda: 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development. [Figure: word count flow diagram. Labels: Document Collection, Tokenize, Scrub token, HashJoin Left, Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count]
  • 3. Enterprise Data Workflows. Let’s consider an example app… at the front end: LOB use cases drive demand for apps. [Figure: enterprise data workflow diagram] Notes: LOB use cases drive the demand for Big Data apps.
  • 4. Enterprise Data Workflows. An example… in the back office: organizations have substantial investments in people, infrastructure, process. Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
  • 5. Enterprise Data Workflows. An example… for the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out. Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 6. Two Avenues…
      Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
      Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
      [Figure: axes labeled complexity ➞ and scale ➞]
      Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
  • 7. Two Avenues…
      Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
      Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
      Callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps.
      [Figure: axes labeled complexity ➞ and scale ➞]
      Notes: Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.
  • 8. Cascading Meetup. Agenda: 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development
  • 9. Cascading workflows – ANSI SQL
      • collab with Optiq – industry-proven code base
      • ANSI SQL parser/optimizer atop Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • relational catalog over a collection of unstructured data
      • SQL shell prompt to run queries
      Notes: ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 10. Cascading workflows – ANSI SQL
      • collab with Optiq – industry-proven code base
      • ANSI SQL parser/optimizer atop Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • relational catalog over a collection of unstructured data
      • SQL shell prompt to run queries
      Callout: Premise: most SQL in the world gets written by machines… This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.
      Notes: ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 11. Cascading workflows – ANSI SQL
      • enable analysts without retraining on Hadoop, etc.
      • transparency for Support, Ops, Finance, et al.
      Callout: a language for queries – not a database, but ANSI SQL as a DSL for workflows
      Notes: ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 12. ANSI SQL – reviews
      • Open Source 'Lingual' Helps SQL Devs Unlock Hadoop. Thor Olavsrud, 2013-02-22. cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
      • Hadoop Apps Without MapReduce Mindsets. Adrian Bridgwater, 2013-02-28. drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
      • Concurrent gives old SQL users new Hadoop tricks. Jack Clark, 2013-02-20. theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
      • Concurrent Open Source Project Ties SQL to Hadoop. Michael Vizard, 2013-02-21. itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html
      • Concurrent Releases Lingual, a SQL DSL for Hadoop. Boris Lublinsky, 2013-02-28. infoq.com/news/2013/02/Lingual
  • 13. ANSI SQL – CSV data in local file system. cascading.org/lingual. Notes: The test database for MySQL is available for download from https://launchpad.net/test-db/. Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 14. ANSI SQL – shell prompt, catalog. cascading.org/lingual. Notes: Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
  • 15. ANSI SQL – queries. cascading.org/lingual. Notes: Here’s an example SQL query on that “employee” test database from MySQL.
  • 16. ANSI SQL – layers

      abstraction     RDBMS                                     JVM Cluster
      parser          ANSI SQL compliant parser                 ANSI SQL compliant parser
      optimizer       logical plan, optimized based on stats    logical plan, optimized based on stats
      planner         physical plan                             API “plumbing”
      machine data    query history, table stats                app history, tuple stats
      topology        b-trees, etc.                             heterogeneous, distributed: Hadoop, IMDG, etc.
      visualization   ERD                                       flow diagram
      schema          table schema                              tuple schema
      catalog         relational catalog                        tap usage DB
      provenance      (manual audit)                            data set producers/consumers

      Notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
  • 17. ANSI SQL – JDBC driver

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.sql.Statement;

      public void run() throws ClassNotFoundException, SQLException {
        // load the Lingual JDBC driver
        Class.forName( "cascading.lingual.jdbc.Driver" );
        // point the connection at a directory of CSV flat files
        Connection connection = DriverManager.getConnection(
          "jdbc:lingual:local;schemas=src/main/resources/data/example" );
        Statement statement = connection.createStatement();

        // identifiers are quoted, so the inner double quotes must be escaped
        ResultSet resultSet = statement.executeQuery(
          "select *\n"
            + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
            + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
            + "on e.\"EMPID\" = s.\"CUST_ID\"" );

        // print each row as "LABEL=value; LABEL=value; ..."
        while( resultSet.next() ) {
          int n = resultSet.getMetaData().getColumnCount();
          StringBuilder builder = new StringBuilder();

          for( int i = 1; i <= n; i++ ) {
            builder.append( ( i > 1 ? "; " : "" )
              + resultSet.getMetaData().getColumnLabel( i )
              + "="
              + resultSet.getObject( i ) );
          }
          System.out.println( builder );
        }

        resultSet.close();
        statement.close();
        connection.close();
      }

      Notes: Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
  • 18. ANSI SQL – JDBC driver

      $ gradle clean jar
      $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar

      CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
      CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

      Caveat: if you absolutely positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors)
      This ANSI SQL library is primarily intended for batch workflows – high throughput, not low-latency – for many under-represented use cases in Enterprise IT. It’s essentially ANSI SQL as a DSL.
      Notes: success
  • 19. Cascading Meetup. Agenda: 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development
  • 20. Test-Driven Development (TDD). [Figure: TDD process diagram; source: Wikipedia] Notes: A general view of the TDD process.
  • 21. Test-Driven Development (TDD). In terms of Big Data apps, TDD is not generally part of the conversation. Notes: TDD is not usually high on the list when people start discussing Big Data apps.
  • 22. Traps – Cascading “exceptional data”
      • assert patterns (regex) on the tuple streams
      • adjust assert levels, like log4j levels
      • define traps on branches
      • tuples which fail asserts get trapped
      Notes: An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.
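  • 23. Traps – example code

      import cascading.flow.FlowDef;
      import cascading.operation.AssertionLevel;
      import cascading.operation.assertion.AssertMatches;
      import cascading.pipe.Each;
      import cascading.pipe.Pipe;

      // set up...
      Pipe etlPipe = new Pipe( "etlPipe" );

      // some processing...
      // assert that each tuple matches the expected pattern
      AssertMatches assertMatches = new AssertMatches( ".*true" );
      etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

      // some processing...
      // tuples that fail the assertion are diverted to trapTap
      FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
        .addSource( etlPipe, jsonTap )
        .addTrap( etlPipe, trapTap )
        .addTailSink( etlPipe, cacheTap );

      // enable or disable assertions from a command line option
      if( options.has( "assert" ) )
        flowDef.setAssertionLevel( AssertionLevel.STRICT );
      else
        flowDef.setAssertionLevel( AssertionLevel.NONE );

      Notes: Example use in Cascading code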
  • 24. Traps – redirect exceptions in production
    Shunt the trapped exceptional data to other parts of the organization:
    • Ops: notifications
    • QA: investigate data anomalies
    • Support: review customer records
    • Finance: audit
    [diagram: enterprise data workflow on a Hadoop cluster, with source, trap, and sink taps]
  • 25. TDD – practice at scale
    1. assert expected patterns in raw input
    2. run just that, to find edge cases
    3. handle the edge cases for input data
    4. assert expected patterns after first chunk of processing
    5. run just that, to verify failure
    6. code until test passes
    7. repeat #4 for each chunk
    [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
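The steps above can be sketched in plain Java on a handful of records. This is an illustrative sketch only, not Cascading code: the class `StagedAssertSketch`, its helpers `failures` and `apply`, the tab-separated sample records, and the lowercase "scrub" step are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

public class StagedAssertSketch {
    // Returns the records that do NOT match the expected pattern (the edge cases).
    public static List<String> failures(List<String> records, String expectedRegex) {
        Pattern expected = Pattern.compile(expectedRegex);
        List<String> failed = new ArrayList<>();
        for (String record : records) {
            if (!expected.matcher(record).matches()) {
                failed.add(record);
            }
        }
        return failed;
    }

    // Applies one chunk of processing to every record.
    public static List<String> apply(List<String> records, Function<String, String> step) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(step.apply(record));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input: species name, a tab, then a height; the last record is malformed.
        List<String> raw = Arrays.asList("oak\t12", "Elm\t7", "pine");
        String rawShape = "[A-Za-z]+\\t\\d+";

        // Steps 1-2: assert the raw input shape and run only that, to find edge cases.
        List<String> edgeCases = failures(raw, rawShape);
        System.out.println("raw edge cases: " + edgeCases);

        // Step 3: handle the edge cases (here, simply drop them).
        List<String> cleaned = new ArrayList<>(raw);
        cleaned.removeAll(edgeCases);

        // Steps 4-5: assert the shape expected after the first chunk; it fails before the chunk exists.
        String afterScrub = "[a-z]+\\t\\d+";
        System.out.println("failing before scrub: " + failures(cleaned, afterScrub));

        // Step 6: write the chunk (lowercasing) and rerun until no record fails.
        List<String> scrubbed = apply(cleaned, String::toLowerCase);
        System.out.println("failing after scrub: " + failures(scrubbed, afterScrub));
    }
}
```

Step 7 would repeat the assert, fail, fix cycle with a new pattern for each later chunk of the flow.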
  • 26. TDD – Cascalog features
    Consider that TDD is about asserting and negating logical predicates…
    • Cascalog is based on logical predicates
    • function definitions as composable subqueries
    • functions are not particularly far from being unit tests
    • Midje: facts, mocks
    sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
    sritchie.github.com/2012/01/22/cascalog-testing-20.html
    Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology: one starts with ad-hoc queries as logic predicates, then composes those predicates into large-scale apps.
  • 27. Cascading Meetup [diagram: Word Count workflow]
    1. Enterprise Data Workflows
    2. ANSI SQL Support
    3. Test-Driven Development
    …plus, a proposal
  • 28. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Suppose your organization is responsible for a large-scale app… Multiple teams develop reusable libraries…
    Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
  • 29. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Data Analysts: ANSI SQL queries for data prep (displaces Hive, etc.)
    Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.
  • 30. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Server-side Engineering: HBase tap for customer profiles (integrating other components)
    Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.
  • 31. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Ops + Support: Traps get routed to customer review (ties into notifications, etc.)
    Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.
  • 32. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Data Scientists: R => PMML for predictive models (displaces SAS, etc.)
    Scientists perform their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.
  • 33. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.)
    Generally the revenue apps require some custom business logic, representing business process for a line of business (LOB). These can migrate into a Cascading app to run on Hadoop.
  • 34. ANSI SQL – multiple flows [diagram: multi-branch flow over GIS export, road, and GPS log data, with failure traps]
    Front-end Engineering: Memcached tap for pushing updates to API (integrating other components)
    Engineering provides integration with the caching layer, for API updates. These can migrate into a Cascading app to run on Hadoop.
  • 35. Cascading workflows – API principles
    • specify what is required, not how it must be achieved
    • plan far ahead, before consuming cluster resources – fail fast prior to submit
    • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
    • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
    • no surprises
    [diagram: enterprise data workflow on a Hadoop cluster, with source, trap, and sink taps]
    Some of the design principles for the pattern language.
  • 36. book… by Paco Nathan
    Enterprise Data Workflows with Cascading, O’Reilly, 2013
    amazon.com/dp/1449358721
    Our upcoming O’Reilly book, “Enterprise Data Workflows with Cascading”, should be in Rough Cuts soon; it is scheduled to be out in print this June.
  • 37. drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
    cascading.org
    zest.to/group11
    github.com/Cascading
    conjars.org
    goo.gl/KQtUL
    concurrentinc.com
    join us for very interesting work!
    Links to our open source projects, developer community, etc…