Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

Copyright @2012, Concurrent, Inc.
why?



 Unstructured Data
   meets
  Enterprise Scale
how?


 Cascading.org/
 [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]
who?
 • Business Stakeholder POV:
   business process management for workflow orchestration (think BPM/BPEL)

 • Systems Integrator POV:
   system integration of heterogeneous data sources and compute platforms

 • Data Scientist POV:
   a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

 • Data Architect POV:
   a physical plan for large-scale data flow management

 • Software Architect POV:
   a pattern language, similar to plumbing or circuit design

 • App Developer POV:
   API bindings for Scala, Clojure, Python, Ruby, Java

 • Systems Engineer POV:
   a JAR file, has passed CI, available in a Maven repo

 [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]
where?
  business process         Domain expertise, business trade-offs, operating parameters, etc.

  API language             Scala, Clojure, Python, Ruby, Java, etc.
                           …envision whatever else runs in a JVM

  logical plan / optimize  (raw human intellect, unless…)

  physical plan            [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

  compute framework        Apache Hadoop, in-memory local mode
                           …envision GPUs, other frameworks, etc.
                           (slide annotation beside this layer: “assembler” code)

  monitors, notification   Nagios, etc.
1: copy
[Diagram: Source → M → Sink]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    // input and output paths come from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
     .addSource( copyPipe, inTap )
     .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper
0 reducers
10 lines code
wait!



  ten lines of code
  for a file copy …
  seems like a lot.
same JAR, any scale…
 MegaCorp Enterprise IT:
 PBs of data
 1000+ node cluster
 EVP calls you when app fails
 runtime: days+

 Production Cluster:
 TBs of data
 EMR + 50 HPC Instances
 Ops monitors results
 runtime: hours – days

 Staging Cluster:
 GBs of data
 EMR + 4 Spot Instances
 CI shows red or green lights
 runtime: minutes – hours

 Your Mom’s Laptop:
 MBs of data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count


[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
18 lines code
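
For readers without a cluster at hand, the logic of this flow (tokenize, group by token, count) can be sketched in plain Java. This is not Cascading code and uses none of its API; the class name `WordCountSketch`, the method `countWords`, and the split pattern are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for the word-count flow; plain Java, not the Cascading API.
class WordCountSketch
  {
  // Tokenize each line, group identical tokens, and count each group.
  static Map<String, Long> countWords( List<String> lines )
    {
    return lines.stream()
      // "Tokenize": split on spaces and basic punctuation (assumed pattern)
      .flatMap( line -> Arrays.stream( line.split( "[ \\[\\](),.]+" ) ) )
      // a line that starts with a delimiter yields one empty token; drop it
      .filter( token -> !token.isEmpty() )
      // "GroupBy token" + "Count": sorted map from token to number of occurrences
      .collect( Collectors.groupingBy( Function.identity(), TreeMap::new, Collectors.counting() ) );
    }

  public static void main( String[] args )
    {
    List<String> lines = Arrays.asList( "A rain shadow is a dry area", "rain and less rain" );
    countWords( lines ).forEach( ( token, count ) -> System.out.println( token + "\t" + count ) );
    }
  }
```

Note that "A" and "a" are counted separately here, which is the gap the scrub step in the next part addresses.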
3: wc + scrub


[Diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
22+10 lines code
4: wc + scrub + stop words


[Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
28+10 lines code
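
What scrubbing and stop-word removal add to the word count can be sketched in plain Java as below. This does not use the Cascading API. The names `StopWordSketch` and `scrub`, the reading of "scrub" as trim plus lowercase, and the tiny stop-word set are all assumptions for illustration. The set-membership filter stands in for the HashJoin Left and Regex steps shown in the diagram.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of scrub + stop-word filtering; plain Java, not the Cascading API.
class StopWordSketch
  {
  // "Scrub": normalize a raw token (assumed here to mean trim + lowercase).
  static String scrub( String token )
    {
    return token.trim().toLowerCase( Locale.ROOT );
    }

  // Count scrubbed tokens, dropping any that appear in the stop-word set.
  static Map<String, Long> countWords( List<String> lines, Set<String> stopWords )
    {
    return lines.stream()
      .flatMap( line -> Arrays.stream( line.split( "[ \\[\\](),.]+" ) ) )
      .map( StopWordSketch::scrub )
      .filter( token -> !token.isEmpty() )
      // keep only tokens with no match in the stop-word list
      .filter( token -> !stopWords.contains( token ) )
      .collect( Collectors.groupingBy( Function.identity(), TreeMap::new, Collectors.counting() ) );
    }

  public static void main( String[] args )
    {
    Set<String> stopWords = new HashSet<>( Arrays.asList( "a", "is", "the", "of" ) );
    List<String> lines = Arrays.asList( "A rain shadow is a dry area", "The lee of a mountain" );
    countWords( lines, stopWords ).forEach( ( token, count ) -> System.out.println( token + "\t" + count ) );
    }
  }
```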
5: tf-idf


[Diagram: TF-IDF flow. Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token, feeding three branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), and TF (GroupBy doc_id, token; Count). The branches are combined through a HashJoin and a CoGroup, then ExprFunc tf-idf writes the TF-IDF output. A separate GroupBy token and Count writes Word Count. M and R mark the map and reduce phases.]

  11 mappers
  9 reducers
  65+10 lines code
6: tf-idf + tdd


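[Diagram: TF-IDF flow with test-driven development additions. Document Collection → Assert → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token, feeding three branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), and TF (GroupBy doc_id, token; Count). A HashJoin is followed by a Checkpoint, then a CoGroup and ExprFunc tf-idf write the TF-IDF output. A separate GroupBy token and Count writes Word Count, and Failure Traps is an additional output. M and R mark the map and reduce phases.]
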
  12 mappers
  9 reducers
  76+14 lines code
deployed…


 elastic-mapreduce --create --name "TF-IDF" \
   --jar s3n://temp.cascading.org/impatient/part6.jar \
   --arg s3n://temp.cascading.org/impatient/rain.txt \
   --arg s3n://temp.cascading.org/impatient/out/wc \
   --arg s3n://temp.cascading.org/impatient/en.stop \
   --arg s3n://temp.cascading.org/impatient/out/tfidf \
   --arg s3n://temp.cascading.org/impatient/out/trap \
   --arg s3n://temp.cascading.org/impatient/out/check
results?

Input:
doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

Output:
doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
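
The slides do not show the scoring formula, but the numbers in the table are consistent with tf × ln( N / ( 1 + df ) ), where tf is the token's count within a document, df is the number of documents containing the token, and N = 5 documents. The sketch below is an inference from the table, not code from the deck; `TfIdfSketch` and `tfidf` are hypothetical names. For example, "air" occurs once in one document, giving ln( 5 / 2 ) ≈ 0.9163, and "area" occurs twice in doc01 and in three documents overall, giving 2 × ln( 5 / 4 ) ≈ 0.4463.

```java
// Hypothetical helper; the formula is inferred from the results table, not taken from the slides.
class TfIdfSketch
  {
  // tf:    occurrences of the token within one document
  // df:    number of documents that contain the token
  // nDocs: total number of documents
  static double tfidf( long tf, long df, long nDocs )
    {
    return tf * Math.log( nDocs / ( 1.0 + df ) );
    }

  public static void main( String[] args )
    {
    long nDocs = 5; // doc01 through doc05

    // one row per distinct score band in the table
    System.out.printf( "doc02 air  %.4f%n", tfidf( 1, 1, nDocs ) );
    System.out.printf( "doc03 land %.4f%n", tfidf( 1, 2, nDocs ) );
    System.out.printf( "doc01 area %.4f%n", tfidf( 2, 3, nDocs ) );
    System.out.printf( "doc02 area %.4f%n", tfidf( 1, 3, nDocs ) );
    System.out.printf( "doc01 rain %.4f%n", tfidf( 1, 4, nDocs ) );
    }
  }
```

A token such as "rain" that appears in four of the five documents scores zero under this formula, which is why it sinks to the bottom of the table despite being the most frequent word.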
comparisons?


 compare similar code in Scalding and Cascalog:

 sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
 based on: github.com/twitter/scalding/wiki


 github.com/Quantisan/Impatient
 based on: github.com/nathanmarz/cascalog/wiki
drill-down?


  blog, code, wiki, gists, jars, list, DevOps products:

  cascading.org/category/impatient/
  github.org/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/

Cascading for the Impatient

  • 1. Cascading for the Impatient Paco Nathan Document Collection Concurrent, Inc. Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS pnathan@concurrentinc.com Count Word Count @pacoid Copyright @2012, Concurrent, Inc.
  • 2. why? Unstructured Data meets Enterprise Scale
  • 3. how? Cascading.org/ Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count
  • 4. who? • Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) • Systems Integrator POV: data sources and compute platforms system integration of heterogenous • Data Scientist graph (DAG) on which we can apply Amdahl's Law a directed, acyclic POV: • Data Architect large-scale data flow management a physical plan for POV: • Software Architect POV:plumbing or circuit design a pattern language, similar to • API bindings for Scala, Clojure, Python, Ruby, Java App Developer POV: Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS • Systemshas passed CI, available in a Maven repo a JAR file, Engineer POV: Count Word Count
  • 5. where?
      • business process: domain expertise, business trade-offs, operating parameters, etc.
      • API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
      • logical plan / optimize: (raw human intellect, unless…)
      • physical plan: [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]
      • compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.; “assembler” code
      • notification: monitors, Nagios, etc.
  • 6. 1: copy (Source → M → Sink; 1 mapper, 0 reducers, 10 lines of code)

      import java.util.Properties;

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];

          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

          // create the source tap (tab-delimited text with a header line)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );

          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 7. wait! ten lines of code for a file copy … seems like a lot.
  • 8. same JAR, any scale…
      • MegaCorp Enterprise IT: PB of data; 1000+ node cluster; EVP calls you when the app fails; runtime: days+
      • Production Cluster: TB of data; EMR + 50 HPC Instances; Ops monitors results; runtime: hours – days
      • Staging Cluster: GB of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
      • Your Mom’s Laptop: MB of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 9. 2: word count. [Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 18 lines of code
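The word-count flow on this slide (Tokenize, then GroupBy token, then Count) can be sketched in plain Java to show what the pipeline computes. This is a minimal illustration using only the JDK, not the Cascading API; the class name `WordCountSketch` and the delimiter pattern are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical plain-Java stand-in for the Tokenize -> GroupBy -> Count flow.
final class WordCountSketch {

    // Assumed delimiter set for illustration: spaces, brackets, parentheses, commas, periods.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> wordCount(String... lines) {
        return Arrays.stream(lines)
            // Tokenize: split every line into tokens
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // drop the empty strings left between adjacent delimiters
            .filter(token -> !token.isEmpty())
            // GroupBy token + Count, sorted by token for stable output
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        wordCount("A rain shadow is a dry area", "This is known as the rain shadow effect")
            .forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Note that "A" and "a" are counted separately here, which is the gap the scrub step on the next slide closes.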
  • 10. 3: wc + scrub. [Diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 22+10 lines of code
  • 11. 4: wc + scrub + stop words. [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 28+10 lines of code
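The scrub and stop-word steps can be sketched the same way in plain Java. The sketch assumes that scrubbing means trimming and lower-casing, and it models the left join against the stop word list as a set lookup followed by a filter that keeps only unmatched tokens. The names `StopWordSketch`, `scrub`, and `keepTokens` are hypothetical and not part of Cascading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical plain-Java stand-in for Scrub -> HashJoin Left -> Regex filter.
final class StopWordSketch {

    // Scrub: normalize a token (assumed here to be trim + lower-case).
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    static List<String> keepTokens(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : tokens) {
            String token = scrub(raw);
            if (token.isEmpty()) {
                continue; // nothing left after scrubbing
            }
            // Left join: every token survives the join; the right-hand side
            // is null when the stop word list has no match.
            String stop = stopWords.contains(token) ? token : null;
            // Filter: keep only the tokens that did not match a stop word.
            if (stop == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the"));
        List<String> tokens = Arrays.asList("A", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(keepTokens(tokens, stopWords));
    }
}
```

Keeping the stop word list on the right-hand side of the join reflects the diagram's RHS label: the list is small enough to hold in memory, so the join does not need its own reduce step.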
  • 12. 5: tf-idf. [Diagram: the stop-word flow (Document Collection → Tokenize → Scrub token → HashJoin Left with Stop Word List → Regex token) feeds several branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), and TF (GroupBy doc_id, token, Count); the branches are combined with HashJoin and CoGroup, then ExprFunc tf-idf writes the TF-IDF output; a separate GroupBy token → Count branch writes Word Count] 11 mappers, 9 reducers, 65+10 lines of code
  • 13. 6: tf-idf + tdd. [Diagram: the tf-idf flow from the previous slide, with an Assert on the input stream, a Checkpoint, and Failure Traps added] 12 mappers, 9 reducers, 76+14 lines of code
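The idea behind the assertion and failure trap can be sketched in plain Java: rows that fail a check are diverted to a trap instead of failing the whole run. The validity rule below (a doc_id must look like "doc" followed by digits) is an assumption chosen for illustration, and `TrapSketch` is a hypothetical class, not the Cascading API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical plain-Java illustration of an assertion with a failure trap.
final class TrapSketch {

    // Assumed validity rule for illustration only.
    private static final Pattern VALID_DOC_ID = Pattern.compile("doc\\d+");

    final List<String[]> passed = new ArrayList<>();
    final List<String[]> trapped = new ArrayList<>();

    // Route one (doc_id, text) row: valid rows continue, bad rows go to the trap.
    void accept(String docId, String text) {
        String[] row = {docId, text};
        if (docId != null && VALID_DOC_ID.matcher(docId).matches()) {
            passed.add(row);
        } else {
            trapped.add(row);
        }
    }

    public static void main(String[] args) {
        TrapSketch sketch = new TrapSketch();
        sketch.accept("doc05", "Two Women. Secrets. A Broken Land. [DVD Australia]");
        sketch.accept("zoink", "null");
        System.out.println("passed=" + sketch.passed.size() + " trapped=" + sketch.trapped.size());
    }
}
```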
  • 14. deployed…

      elastic-mapreduce --create --name "TF-IDF" \
        --jar s3n://temp.cascading.org/impatient/part6.jar \
        --arg s3n://temp.cascading.org/impatient/rain.txt \
        --arg s3n://temp.cascading.org/impatient/out/wc \
        --arg s3n://temp.cascading.org/impatient/en.stop \
        --arg s3n://temp.cascading.org/impatient/out/tfidf \
        --arg s3n://temp.cascading.org/impatient/out/trap \
        --arg s3n://temp.cascading.org/impatient/out/check
  • 15. results?

      Input:

      doc_id  text
      doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
      doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
      doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
      doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
      doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
      zoink   null

      Output:

      doc_id  tf-idf  token
      doc02   0.9163  air
      doc05   0.9163  australia
      doc05   0.9163  broken
      doc04   0.9163  california's
      doc04   0.9163  cause
      doc02   0.9163  cloudcover
      doc04   0.9163  death
      doc04   0.9163  deserts
      doc03   0.9163  downwind
      …
      doc02   0.9163  sinking
      doc04   0.9163  such
      doc04   0.9163  valley
      doc05   0.9163  women
      doc03   0.5108  land
      doc05   0.5108  land
      doc01   0.5108  lee
      doc02   0.5108  lee
      doc03   0.5108  leeward
      doc04   0.5108  leeward
      doc01   0.4463  area
      doc02   0.2231  area
      doc03   0.2231  area
      doc01   0.2231  dry
      doc02   0.2231  dry
      doc03   0.2231  dry
      doc02   0.2231  mountain
      doc03   0.2231  mountain
      doc04   0.2231  mountain
      doc01   0.0000  rain
      doc02   0.0000  rain
      doc03   0.0000  rain
      doc04   0.0000  rain
      doc01   0.0000  shadow
      doc02   0.0000  shadow
      doc03   0.0000  shadow
      doc04   0.0000  shadow
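The scores on the results slide are consistent with tf-idf = tf × ln(N / (1 + df)), where tf is the count of a token in a document, df is the number of documents containing the token, and N = 5 valid documents (doc01 to doc05). The deck does not spell out this formula, so treat it as an inference from the listed numbers. The class and method names below are hypothetical.

```java
import java.util.Locale;

// Hypothetical check of the listed scores, assuming tf * ln(N / (1 + df)).
final class TfIdfSketch {

    // tf: occurrences of the token in one document
    // df: number of documents that contain the token
    // nDocs: total number of documents
    static double tfIdf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    static String format(double score) {
        return String.format(Locale.ROOT, "%.4f", score);
    }

    public static void main(String[] args) {
        long nDocs = 5; // doc01 .. doc05; the "zoink" row is not a valid document
        System.out.println(format(tfIdf(1, 1, nDocs))); // "air" in doc02: 0.9163
        System.out.println(format(tfIdf(1, 2, nDocs))); // "lee" in doc01: 0.5108
        System.out.println(format(tfIdf(2, 3, nDocs))); // "area" twice in doc01: 0.4463
        System.out.println(format(tfIdf(1, 4, nDocs))); // "rain" in doc01: 0.0000
    }
}
```

Under this formula, a token that appears in four of the five documents, such as "rain" or "shadow", scores zero, which matches the bottom of the results table.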
  • 16. comparisons? compare similar code in Scalding and Cascalog:
      • Scalding: sujitpal.blogspot.com/2012/08/scalding-for-impatient.html, based on: github.com/twitter/scalding/wiki
      • Cascalog: github.com/Quantisan/Impatient, based on: github.com/nathanmarz/cascalog/wiki
  • 17. drill-down? blog, code, wiki, gists, jars, list, DevOps products: cascading.org/category/impatient/, github.org/Cascading/, conjars.org/, goo.gl/KQtUL, concurrentinc.com/

Editor's Notes

  • #2: responsible for net lift, or we work on something else