Cascading
 Nathan Marz
  BackType
What is Cascading?

Cascading is a Java library that makes development of
   complex Hadoop MapReduce workflows easy
Why Hadoop?


• Process large amounts of data in a scalable,
  fault-tolerant way
Why Cascading?
    Tool                How you feel
    Hadoop MapReduce    [image not preserved]
    Cascading           [image not preserved]
Tuples
Cascading represents all data as “Tuples”

       (“the man sat”, 25)
       (“hello dolly”, 42)
       (“say hello”, 1)
       (“the woman sat”, 10)
Tuples
Tuples are named, ordered fields

     [“sentence”, “value”]
     (“the man sat”, 25)
     (“hello dolly”, 42)
     (“say hello”, 1)
     (“the woman sat”, 10)
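One way to picture "named, ordered fields" is a shared list of field names plus positional tuples. The sketch below is a plain-Java stand-in written for illustration; the class `NamedTuples` and its `get` helper are hypothetical and are not Cascading's own API.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a tuple stream; NOT the Cascading API.
class NamedTuples {
    // Ordered field names shared by every tuple in the stream.
    static final List<String> FIELDS = Arrays.asList("sentence", "value");

    // Resolve a field name to its position, then read that position from the tuple.
    static Object get(List<Object> tuple, String field) {
        int index = FIELDS.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return tuple.get(index);
    }

    public static void main(String[] args) {
        List<List<Object>> tuples = Arrays.asList(
            Arrays.<Object>asList("the man sat", 25),
            Arrays.<Object>asList("hello dolly", 42));
        for (List<Object> tuple : tuples) {
            System.out.println(get(tuple, "sentence") + " -> " + get(tuple, "value"));
        }
    }
}
```

Because the names live beside the data rather than inside each tuple, every tuple in a stream has the same format.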
Flow
A flow is a sequence of manipulations on pipes of tuple streams
• A flow compiles to one or more MapReduce jobs
• Inputs and outputs are called “Taps”
• Each Tap produces or receives a pipe of tuples with the same format
• A flow can have multiple inputs and multiple outputs
Example

[“sentence”, “value”]  ->  [“word”, “sum”]

Get the sum of the values for each word
Example
[“sentence”, “value”]
      Split(“sentence”) -> “word”
[“word”, “value”]
      GroupBy(“word”)
[“word”, list<[“value”]>]
      Sum(“value”) -> “sum”
[“word”, “sum”]
Example
Split(“sentence”) -> “word”

[“sentence”, “value”]      [“word”, “value”]
(“the man sat”, 25)        (“the”, 25)
                           (“man”, 25)
                           (“sat”, 25)
(“hello dolly”, 42)        (“hello”, 42)
                           (“dolly”, 42)
(“say hello”, 1)           (“say”, 1)
                           (“hello”, 1)
(“the woman sat”, 10)      (“the”, 10)
                           (“woman”, 10)
                           (“sat”, 10)
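The Split step can be modeled in plain Java as a function that emits one output tuple per word and copies the value along. This is a minimal sketch, assuming tuples are stored as positional lists and words are separated by whitespace; `SplitStep` is a hypothetical name and not part of Cascading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the Split step; NOT the Cascading API.
class SplitStep {
    // Turn each ("sentence", "value") tuple into one ("word", "value") tuple per word.
    static List<List<Object>> split(List<List<Object>> input) {
        List<List<Object>> output = new ArrayList<>();
        for (List<Object> tuple : input) {
            String sentence = (String) tuple.get(0);
            Object value = tuple.get(1);
            for (String word : sentence.trim().split("\\s+")) {
                if (!word.isEmpty()) { // an empty sentence yields no tuples
                    output.add(Arrays.<Object>asList(word, value));
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Object>> input = Arrays.asList(
            Arrays.<Object>asList("the man sat", 25),
            Arrays.<Object>asList("say hello", 1));
        for (List<Object> tuple : split(input)) {
            System.out.println(tuple);
        }
    }
}
```

One input tuple can produce many output tuples, which is why the right-hand column has more rows than the left.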
Example
GroupBy(“word”)

[“word”, “value”]      [“word”, list<[“value”]>]
(“the”, 25)            (“the”, [25, 10])
(“man”, 25)            (“man”, [25])
(“sat”, 25)            (“sat”, [25, 10])
(“hello”, 42)          (“hello”, [42, 1])
(“dolly”, 42)          (“dolly”, [42])
(“say”, 1)             (“say”, [1])
(“hello”, 1)           (“woman”, [10])
(“the”, 10)
(“woman”, 10)
(“sat”, 10)
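The grouping above can be sketched in plain Java as collecting values into a list per word. This is an in-memory illustration only, assuming the whole stream fits in one map; `GroupByStep` is a hypothetical name, and on Hadoop the equivalent grouping is done by the shuffle between map and reduce rather than by a single map like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the GroupBy step; NOT the Cascading API.
class GroupByStep {
    // Collect the values of all ("word", "value") tuples that share a word.
    // LinkedHashMap keeps words in first-seen order, matching the slide.
    static Map<String, List<Integer>> groupByWord(List<List<Object>> input) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (List<Object> tuple : input) {
            String word = (String) tuple.get(0);
            Integer value = (Integer) tuple.get(1);
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Object>> input = Arrays.asList(
            Arrays.<Object>asList("the", 25),
            Arrays.<Object>asList("man", 25),
            Arrays.<Object>asList("the", 10));
        System.out.println(groupByWord(input));
    }
}
```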
Example
Sum(“value”) -> “sum”

[“word”, list<[“value”]>]      [“word”, “sum”]
(“the”, [25, 10])              (“the”, 35)
(“man”, [25])                  (“man”, 25)
(“sat”, [25, 10])              (“sat”, 35)
(“hello”, [42, 1])             (“hello”, 43)
(“dolly”, [42])                (“dolly”, 42)
(“say”, [1])                   (“say”, 1)
(“woman”, [10])                (“woman”, 10)
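The final aggregation reduces each group's list of values to a single number. A minimal plain-Java sketch, assuming the groups arrive as a map from word to list of integers; `SumStep` is a hypothetical name and not Cascading's own aggregator.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the Sum step; NOT the Cascading API.
class SumStep {
    // Replace each group's list of values with their total, giving ("word", "sum").
    static Map<String, Integer> sum(Map<String, List<Integer>> groups) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int total = 0;
            for (int value : group.getValue()) {
                total += value;
            }
            sums.put(group.getKey(), total);
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        groups.put("the", Arrays.asList(25, 10));
        groups.put("hello", Arrays.asList(42, 1));
        System.out.println(sum(groups));
    }
}
```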
More functionality

• Inner and outer joins natively supported
• Seamlessly branch and merge pipes of
  tuples
• Integrate diverse data sources
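To make the join bullet concrete, here is a plain-Java sketch of what inner and left outer joins mean for two tuple streams keyed on their first field. It is illustrative only: `JoinSketch`, the `keepUnmatchedLeft` flag, and the second stream's “category” field are hypothetical, and this is not how Cascading implements or exposes joins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of joining two tuple streams; NOT the Cascading API.
class JoinSketch {
    // Join two-field tuples on their first field.
    // keepUnmatchedLeft = false gives an inner join; true gives a left outer join,
    // where a left tuple without a match gets null for the right-hand field.
    static List<List<Object>> join(List<List<Object>> left,
                                   List<List<Object>> right,
                                   boolean keepUnmatchedLeft) {
        // Index the right stream by key so each left tuple is matched by lookup.
        Map<Object, List<Object>> rightByKey = new HashMap<>();
        for (List<Object> r : right) {
            rightByKey.computeIfAbsent(r.get(0), k -> new ArrayList<>()).add(r.get(1));
        }
        List<List<Object>> output = new ArrayList<>();
        for (List<Object> l : left) {
            List<Object> matches = rightByKey.get(l.get(0));
            if (matches != null) {
                for (Object match : matches) {
                    output.add(Arrays.<Object>asList(l.get(0), l.get(1), match));
                }
            } else if (keepUnmatchedLeft) {
                output.add(Arrays.<Object>asList(l.get(0), l.get(1), null));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Left stream: ["word", "sum"]; right stream: hypothetical ["word", "category"].
        List<List<Object>> sums = Arrays.asList(
            Arrays.<Object>asList("the", 35),
            Arrays.<Object>asList("man", 25));
        List<List<Object>> categories = Arrays.asList(
            Arrays.<Object>asList("the", "article"));
        System.out.println(join(sums, categories, false));
        System.out.println(join(sums, categories, true));
    }
}
```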
Why not Pig?

• Pig is a custom language for writing
  MapReduce workflows
• Because it’s a custom language, intermixing
  “plain logic” in between flows is painful
• Not nearly as flexible as Cascading for
  custom needs
Learn more


• Tutorial: http://blog.rapleaf.com/dev/?p=33
• Website: http://www.cascading.org
Questions?
