Cascalog
Data processing on Hadoop without the hassle


                                    Nathan Marz
                                     BackType
                                    @nathanmarz
What is Cascalog?

Layers of abstraction, highest first:

  Cascalog    Variables and logic
  Cascading   Tuples, data workflows
  MapReduce   Key/value pairs, aggregation
Cascalog’s components

Cascading   (the job execution engine)
    +
 Datalog    (basis of the API design)
    +
 Clojure    (the host programming language)
Clojure

• General purpose programming language
• Dialect of Lisp that compiles to Java bytecode
Clojure
• “Programmable programming language”:
  Easy to build Domain Specific Languages
  (DSL) in Clojure
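
One way Clojure supports DSLs is through macros, which rewrite code before it is evaluated. A minimal sketch, using a hypothetical `unless` macro (not part of clojure.core) chosen only for illustration:

```clojure
;; Hypothetical macro for illustration; `unless` is not in clojure.core.
;; It expands into an `if` whose body runs only when the test is false.
(defmacro unless [test & body]
  `(if ~test nil (do ~@body)))

(unless (> 1 2) :ran)   ; the test is false, so the body runs and returns :ran
(unless true :ran)      ; the test is true, so the result is nil
```

Because the macro receives its arguments as unevaluated code, a library can define new syntax in the same way; Cascalog's query operators are built on this mechanism.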
Clojure examples

Clojure code                 Result
(+ 1 2 3)                    6
(> 20 18)                    true
(defn incr [x] (+ 1 x))
(incr 3)                     4
Cascalog basics

The “age” dataset

Define and execute a query
  Where to emit results
  Output variables
  “Predicates”: constrain the output variables
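
A minimal sketch of a query with these three parts. It assumes Cascalog is on the classpath and that the release in use accepts a plain vector of tuples as a data source; the contents of the `age` dataset are made up for illustration.

```clojure
;; Assumes the Cascalog library is available on the classpath.
(use 'cascalog.api)

;; Hypothetical "age" dataset: [person age] tuples held in memory.
(def age [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

;; ?<- defines and executes a query in one step.
(?<- (stdout)            ; where to emit results
     [?person]           ; output variables
     (age ?person 25))   ; predicate: the constant 25 constrains the age field
```

The query prints every person whose age is 25. The constant in the predicate acts as a filter on that field of the dataset.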
Predicates

Input fields   Output fields

Fields can be constants or variables
Variables are prefixed with ? or !
Predicates
• Functions
• Filters
• Aggregators
• Generators: finite sources of tuples
Example #1

Generator   Filter
Example #2

Generator   Function
Example #3

Generator   Aggregator   Filter
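
The three combinations above can be sketched as follows. The datasets are hypothetical, and the `cascalog.ops` namespace is an assumption that holds for Cascalog 1.x (later releases use `cascalog.logic.ops`).

```clojure
;; Assumes Cascalog 1.x on the classpath; in later releases the ops
;; namespace is cascalog.logic.ops rather than cascalog.ops.
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical in-memory datasets used as generators.
(def age [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])
(def follows [["alice" "bob"] ["alice" "chris"] ["alice" "david"]
              ["bob" "david"]])

;; Example #1: generator + filter (people younger than 30).
(?<- (stdout) [?person]
     (age ?person ?age)
     (< ?age 30))

;; Example #2: generator + function; :> separates input fields from
;; output fields.
(?<- (stdout) [?person ?double-age]
     (age ?person ?age)
     (* 2 ?age :> ?double-age))

;; Example #3: generator + aggregator + filter (people who follow more
;; than two others). _ ignores a field.
(?<- (stdout) [?person ?count]
     (follows ?person _)
     (c/count ?count)
     (> ?count 2))
```

In each query the order of the predicates does not matter; Cascalog works out the execution order from the variables each predicate consumes and produces.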
Join example

Triggers a join
Joins are an implementation detail
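
A minimal sketch of a query that joins two datasets, assuming Cascalog is on the classpath. The `age` and `gender` datasets are hypothetical.

```clojure
;; Assumes the Cascalog library is available on the classpath.
(use 'cascalog.api)

;; Hypothetical in-memory datasets.
(def age    [["alice" 28] ["bob" 33] ["chris" 40]])
(def gender [["alice" "f"] ["bob" "m"] ["emily" "f"]])

;; ?person appears in both generators, which triggers a join on that field.
(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))
```

No join keyword appears in the query. Only people present in both datasets are emitted, because the shared variable must take the same value in each predicate.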
Demo time!
Why another query language for Hadoop?

Existing tools cause too much Accidental Complexity
Accidental complexity

  Complexity caused by the tool used to solve a problem
  rather than by the problem itself

• Query languages that are distinct from the host programming
  language cause accidental complexity
• Example: SQL injection
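
A minimal sketch of how SQL injection arises when a query in a separate language is assembled as a string. The function `naive-query` and the `users` table are hypothetical and stand in for any code that builds SQL by concatenation.

```clojure
;; Hypothetical helper: builds a SQL query by string concatenation.
(defn naive-query [username]
  (str "SELECT * FROM users WHERE name = '" username "'"))

(naive-query "alice")
;; => "SELECT * FROM users WHERE name = 'alice'"

;; The caller's input is spliced into the query text, so a crafted value
;; changes the structure of the query itself.
(naive-query "x' OR '1'='1")
;; => "SELECT * FROM users WHERE name = 'x' OR '1'='1'"
```

The problem comes from the boundary between the two languages rather than from the task of looking up a user. When the query is expressed in the host language itself, values are passed as values and are never parsed as query text.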
Query language

• We want:
 • Ability to abstract
 • Ability to compose
Abstraction




Clojure function that returns a subquery
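
A minimal sketch of this kind of abstraction, assuming Cascalog is on the classpath. The function `younger-than` and the `age` dataset are hypothetical.

```clojure
;; Assumes the Cascalog library is available on the classpath.
(use 'cascalog.api)

;; Hypothetical in-memory dataset of [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

;; Hypothetical helper: an ordinary Clojure function that returns a
;; subquery. <- defines a query without executing it.
(defn younger-than [source max-age]
  (<- [?person]
      (source ?person ?age)
      (< ?age max-age)))

;; ?- executes a query that has already been defined.
(?- (stdout) (younger-than age 30))
```

Because the subquery is an ordinary value, it can be passed to other functions or used as a generator inside a larger query.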
Abstraction

Defining and using a custom operation
Composability




Dynamic query with parameterized operation
Composability




 “Predicate macro”
Composability

       expands to




Using a predicate macro
Contrast to Pig




“Average” is 300 lines of code in Pig
Optimized aggregators in Cascalog

Implementation of count and sum
Composability




Value normalization example #1
Composability




Value normalization example #2
Composability


For each id:
 select value with the biggest timestamp




   Value normalization algorithm
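
The rule above can be sketched in plain Clojure, outside of Cascalog, to show what it computes. The record shape `[id value timestamp]` and the function name `latest-values` are assumptions made for illustration.

```clojure
;; Hypothetical helper: for each id, keep the value whose timestamp is
;; largest. Records are assumed to be [id value timestamp] vectors.
(defn latest-values [records]
  (->> records
       (group-by first)
       (map (fn [[id rows]]
              (let [[_ value _] (apply max-key #(nth % 2) rows)]
                [id value])))
       (into {})))

(latest-values [["a" "x" 1] ["a" "y" 5] ["b" "z" 2]])
;; => {"a" "y", "b" "z"}
```

In Cascalog the same rule would be expressed as an aggregation grouped by id rather than as an in-memory `group-by`.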
Composability




Implementing value normalization
Composability




Using value normalization
Try Cascalog yourself!

Project Page
http://www.github.com/nathanmarz/cascalog

Introductory Tutorial
http://nathanmarz.com/blog/introducing-cascalog/

5 minutes to install Clojure, Hadoop, and Cascalog locally! See project README
BackType is hiring

Think Cascalog’s cool?
Come build amazing software at BackType.

http://www.backtype.com/jobs
Questions?


Follow me on Twitter at @nathanmarz
      nathan.marz@gmail.com

Cascalog at Strange Loop