Apache Crunch
Rahul Sharma
Apache
Agenda :


    Issues with MapReduce pipelines

    Solving with Apache Crunch

    Data Model & Operations

    System Workflow

    Examples

    Questions & Answers




Issues with MapReduce Pipelines



    Unit testing a pipeline? You must be joking!

    Can someone tell me where the business logic is?

    Chain performance?

    Learn Latin (Pig) first!


Apache Crunch


    Is a Java library

    Contains collections that can execute parallel operations

    Lazy evaluation of Collections at runtime

    Operations merged at runtime to have efficient chains.

    Available @ http://incubator.apache.org/crunch/

    Based on Google FlumeJava paper
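
The lazy evaluation and merging of operations listed above can be illustrated with a toy model. The sketch below is plain Java and does not use the Crunch API: `LazyCollection`, `map`, and `materialize` are hypothetical names chosen for illustration. Each call to `map` only records a step by composing it onto the previous ones, so nothing runs until `materialize` is called, and the recorded steps then execute as one fused function in a single pass over the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model only (not a Crunch class): records operations and runs them on demand.
final class LazyCollection<T> {
    private final List<?> source;             // raw input elements
    private final Function<Object, T> fused;  // every recorded step, composed into one function

    private LazyCollection(List<?> source, Function<Object, T> fused) {
        this.source = source;
        this.fused = fused;
    }

    @SuppressWarnings("unchecked")
    static <T> LazyCollection<T> of(List<T> source) {
        // The identity cast is safe because the source holds elements of type T.
        return new LazyCollection<T>(source, x -> (T) x);
    }

    // Records the step by composing it onto the chain; nothing is executed here.
    <R> LazyCollection<R> map(Function<? super T, ? extends R> step) {
        return new LazyCollection<R>(source, x -> step.apply(fused.apply(x)));
    }

    // Runs the fused chain in a single pass over the source.
    List<T> materialize() {
        List<T> out = new ArrayList<>();
        for (Object element : source) {
            out.add(fused.apply(element));
        }
        return out;
    }
}
```

Chaining `LazyCollection.of(lines).map(String::trim).map(String::length)` builds the plan without touching the data; only `materialize()` walks the list, applying both steps to each element in turn.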




Apache Crunch


    Supports Hadoop version 1 and 2-alpha

    Supports HBase, JDBC, etc.

    Works with Writables, Avro, Thrift and Protocol Buffers

    A Scala variant also exists

    Integration with R and Clojure in progress

    A Maven archetype exists for creating a sample project




Apache Crunch : Data Model
   
       Pipeline
   
       MRPipeline
   
       MemPipeline
   
       PCollection<T>
   
       PTable<K,V>
   
       PGroupedTable<K,V>
   
       Source<T>
   
       Target<T>
   
       Emitter<T>
   
       PType<T>
Apache Crunch : Operations

  
      DoFn<S,T>
  
      CombineFn<S,T>
  
      FilterFn<T>
  
      Joins
  
      Cartesian
  
      Sort
  
      SecondarySort
  
      PObject<T>
  
      BloomFilters
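
The relationship between the function types listed above can be sketched in memory. The interfaces below are hypothetical stand-ins modelled on the `DoFn<S,T>`, `FilterFn<T>`, and `Emitter<T>` names from the slides; they are not the Crunch classes, and `ToyOps.applyDoFn` is an invented helper. The idea shown is that a do-function may emit zero, one, or many outputs per input through an emitter, and that a filter is the special case that emits its input or nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not the Crunch classes.
interface ToyEmitter<T> {
    void emit(T value);
}

interface ToyDoFn<S, T> {
    // May emit zero, one, or many outputs for each input.
    void process(S input, ToyEmitter<T> emitter);
}

interface ToyFilterFn<T> {
    boolean accept(T input);
}

final class ToyOps {
    private ToyOps() {}

    // Applies fn to every element and collects whatever it emits.
    static <S, T> List<T> applyDoFn(List<S> input, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    // A filter expressed as a do-function that emits only accepted inputs.
    static <T> List<T> filter(List<T> input, ToyFilterFn<T> fn) {
        return ToyOps.<T, T>applyDoFn(input, (element, emitter) -> {
            if (fn.accept(element)) {
                emitter.emit(element);
            }
        });
    }

    // Splits lines into words, then keeps words longer than three characters.
    static List<String> demo() {
        List<String> lines = Arrays.asList("the quick fox", "a lazy dog");
        List<String> words = ToyOps.<String, String>applyDoFn(lines, (line, emitter) -> {
            for (String word : line.split(" ")) {
                emitter.emit(word);
            }
        });
        return filter(words, word -> word.length() > 3);
    }
}
```

Passing the output through an emitter rather than a return value is what lets one function cover mapping, flattening, and filtering.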
Apache Crunch : System Workflow
    Construct a pipeline

    Pipeline.done()

    Map      Map      Map

    GBK      GBK

    Reduce   Reduce

    Output
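
The Map, GBK, and Reduce stages in the workflow above can be traced by hand on a small input. The sketch below is plain Java running in memory, not Crunch or Hadoop code, and it assumes that "GBK" stands for group-by-key. `WorkflowSketch` and its method names are hypothetical and exist only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the stage order; not Crunch or Hadoop code.
final class WorkflowSketch {
    private WorkflowSketch() {}

    // Map: turn each line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<String, Integer>(word, 1));
                }
            }
        }
        return pairs;
    }

    // GBK: collect all values that share a key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Output: the full chain applied to the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(groupByKey(map(lines)));
    }
}
```

For the input lines "to be or" and "not to be", the chain yields counts of 2 for "to" and "be" and 1 for "or" and "not".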
Apache Crunch : Examples

  
      WordCount example
  
      Avro example
  
      Sorting example
  
      SecondarySort
  
      Join Example
  
      BloomFilters
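
For readers unfamiliar with the last item, a Bloom filter is a compact probabilistic set: a lookup can return a false positive but never a false negative, which makes it useful for discarding records that cannot match before a more expensive step such as a join. The sketch below is a minimal stand-alone version in plain Java. `ToyBloomFilter` is a hypothetical class, not the implementation used by Crunch or Hadoop, and the double-hashing scheme is an assumption made for brevity.

```java
import java.util.BitSet;

// Minimal illustrative Bloom filter; not the Crunch or Hadoop implementation.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;    // number of bits in the filter
    private final int hashes;  // number of bit positions set per element

    ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit index from two base hashes (double hashing).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1;  // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String value) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(value, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Every added value is always reported as possibly present; how often an absent value is wrongly reported depends on the bit count, the number of hashes, and how many values were added.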




Write to me : rsharma@apache.org
Example src : http://github.com/rahul0208
Blog        : devlearnings.wordpress.com

Indic threads pune12-apache-crunch
