© Cloudera, Inc. All rights reserved.
Building Effective
Near-Real-Time Analytics with
Spark Streaming and Kudu
Jeremy Beard | Senior Solutions Architect, Cloudera
May 2016
About me
• Jeremy Beard
• Senior Solutions Architect at Cloudera
• 3.5 years at Cloudera
• 6 years data warehousing before that
• jeremy@cloudera.com
Agenda
• What do we mean by near-real-time analytics?
• Which components can we use from the Cloudera stack?
• How do these components fit together?
• How do we implement the Spark Streaming to Kudu path?
• What if I don’t want to write all that code?
Defining near-real-time analytics (for this talk)
• Ability to analyze events happening right now in the real world
• And in the context of all the history that has gone before it
• By “near” we mean this is human scale (seconds), not machine scale (ns/µs)
• Closer to real time is possible in CDH, but is more custom development
• SQL is the lingua franca of analytics
• Millions of people know it or the tools that run on it
• Say what you want to get, not how you want to get it
Components from the Cloudera stack
• Four components come together to make this possible
• Apache Kafka
• Apache Spark
• Apache Kudu (incubating)
• Apache Impala (incubating)
• First we’ll discuss what they are, then how they fit together
Apache Kafka
• Publish-subscribe system
• Publish messages into topics
• Subscribe to messages arriving in topics
• Very high throughput
• Very low latency
• Distributed for fault tolerance and scale
• Supported by Cloudera
Apache Spark
• Modern distributed data processing engine
• Heavy utilizer of memory for speed
• Rich and intuitive API
• Spark Streaming
• Module for running a continuous loop of Spark transformations
• Each iteration is a micro-batch, usually in the single-digit seconds
• Supported by Cloudera (with some exceptions for experimental features)
Apache Kudu (incubating)
• New open-source columnar storage layer
• Data model of tables with a finite number of typed columns
• Very fast random I/O
• Very fast scans
• Developed from scratch in C++
• Client APIs for C++, Java, Python
• First developed in Cloudera, now at Apache Software Foundation
• Currently in beta, not yet supported by Cloudera, not production ready
Apache Impala (incubating)
• Open-source SQL query engine
• Built for one purpose: really fast analytics SQL
• High concurrency
• Queries data stored in HDFS, HBase, and now Kudu
• Standard JDBC/ODBC interface for SQL editors and BI tools
• Uses JIT query compilation and modern CPU instructions
• First developed in Cloudera, now at Apache Software Foundation
• Fully supported by Cloudera and in production at many of our customers
Near-real-time analytics on the Cloudera stack
Implementing Spark Streaming to Kudu
• We define what we want Spark to do each micro-batch
• Spark then takes care of running the micro-batches for us
• We have limited time to process a micro-batch
• Storage lookups must be key lookups or very short scans
• A lot of repetitive boilerplate code to get up and running
Typical stages of a Spark Streaming to Kudu pipeline
• Sourcing from a queue of data
• Translating into a structured format
• Deriving the storage records
• Planning how to update the storage layer
• Applying the planned mutations to the storage layer
Queue sourcing
• Each micro-batch we first have to bring in data to process
• This is near-real-time, so we expect a queue of messages waiting to be processed
• Kafka fits this requirement very well
• Native no-data-loss integration with Spark Streaming
• Partitioned topics automatically parallelize across Spark executors
• Fault recovery is simple because Kafka does not drop consumed messages
• In Spark Streaming this is the creation of a DStream object
• For Kafka use KafkaUtils.createDirectStream()
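The DStream model above can be sketched without Spark or Kafka: each micro-batch drains whatever is waiting on the queue and hands it to a batch function. A minimal Python sketch, with an in-memory deque standing in for a Kafka topic (all names are illustrative):

```python
from collections import deque

def run_micro_batches(queue, process_batch, num_batches):
    """Drain the queue once per micro-batch, like one DStream iteration."""
    results = []
    for _ in range(num_batches):
        batch = []
        while queue:  # take everything that has arrived so far
            batch.append(queue.popleft())
        results.append(process_batch(batch))
    return results

# Three messages "arrive" before the first micro-batch fires
topic = deque(["msg1", "msg2", "msg3"])
print(run_micro_batches(topic, len, num_batches=2))  # [3, 0]
```

In the real pipeline, `KafkaUtils.createDirectStream()` produces this stream of micro-batches for you, with offsets tracked so no data is lost on failure.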
Translation
• Arriving messages could be in any format (XML, CSV, binary, proprietary, etc.)
• We need them in a common structured record format to effectively transform them
• When messages arrive, translate them first
• Avro’s GenericRecord is a widely adopted in-memory record format
• In Spark Streaming job use DStream.map() to define translation
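A translator is typically just a parsing function applied per message. A minimal sketch for delimited text, using a plain dict in place of an Avro GenericRecord (field names are illustrative):

```python
def translate_delimited(message, field_names, delimiter=","):
    """Translate one raw delimited message into a structured record."""
    values = message.split(delimiter)
    if len(values) != len(field_names):
        raise ValueError("field count mismatch: %r" % message)
    return dict(zip(field_names, values))

record = translate_delimited("42,2016-05-01,NYC", ["id", "date", "city"])
print(record["city"])  # NYC
```

In the Spark Streaming job this function would be the argument to `DStream.map()`, so every arriving message is translated before any further transformation.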
Derivation
• We need to create the records that we want to write to the storage layer
• Often not identical to the arriving records
• Derive the storage records from the arriving records
• Spark SQL can define the transformation, but much more plumbing code is required
• May also require deriving from existing records in the storage layer
• Enrichment using reference data is a common example
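Enrichment is usually a keyed lookup against reference data, merged into the arriving record. A minimal Spark-free sketch (the `sensor_id`/`location` fields are hypothetical):

```python
def derive_storage_records(arriving, reference_by_key):
    """Derive storage records by enriching arriving records with reference data."""
    derived = []
    for rec in arriving:
        ref = reference_by_key.get(rec["sensor_id"], {})
        storage = dict(rec)  # copy, so the arriving record is untouched
        storage["location"] = ref.get("location", "unknown")
        derived.append(storage)
    return derived

arriving = [{"sensor_id": "s1", "reading": 20.5}]
reference = {"s1": {"location": "Gate 4"}}
print(derive_storage_records(arriving, reference))
```

The key point is that the reference lookup must itself be a fast key lookup (e.g. against Kudu), or the micro-batch will not finish in time.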
Planning
• With derived storage records in hand we need to plan the storage mutations
• When existing records are never updated it is straightforward
• Just plan inserts
• When updates for a key can occur it is a bit harder
• Plan insert if key does not exist, plan update if key does exist
• When all versions of a key are kept it can be a lot more complicated
• Insert arriving record, update metadata on existing records (e.g. end date)
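The upsert and history-tracking cases above can be sketched in plain Python (Spark-free; the `key` and `end_date` field names are illustrative):

```python
def plan_upserts(derived, existing_keys):
    """Plan INSERT when the key is new, UPDATE when it already exists."""
    return [("UPDATE" if rec["key"] in existing_keys else "INSERT", rec)
            for rec in derived]

def plan_with_history(rec, current_version, event_time):
    """All versions kept: end-date the current version, insert the new one."""
    plans = []
    if current_version is not None:
        plans.append(("UPDATE", dict(current_version, end_date=event_time)))
    plans.append(("INSERT", dict(rec, end_date=None)))
    return plans

print(plan_upserts([{"key": 1}, {"key": 2}], existing_keys={1}))
```

Note the planner only decides what mutations to make; actually applying them is the next stage, so the two concerns stay testable in isolation.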
Storing
• With the planned mutations for the micro-batch, we apply them to the storage
• For Kudu this requires using the Kudu client Java API
• Applied mutations are immediately visible to Impala users
• Use RDD.foreachPartition() so that you can open one Kudu connection per JVM
• Alternatively, write to Solr; a good option where SQL is not required
• Alternatively, write to HBase, but its scans are too slow for analytics queries
• Alternatively, write to HDFS, but it does not support updates or deletes
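The storing stage just replays the planned mutations against the storage layer. A minimal sketch where a dict stands in for a Kudu table (the real pipeline would apply operations through the Kudu Java client instead):

```python
def apply_mutations(table, plans):
    """Apply planned (operation, record) mutations to the storage layer."""
    for op, rec in plans:
        if op == "INSERT":
            table[rec["key"]] = rec
        elif op == "UPDATE":
            table[rec["key"]].update(rec)
        else:
            raise ValueError("unknown operation: %s" % op)
    return table

table = {}
apply_mutations(table, [("INSERT", {"key": 1, "v": 1})])
apply_mutations(table, [("UPDATE", {"key": 1, "v": 2})])
print(table[1]["v"])  # 2
```

With Kudu, each mutation becomes an operation applied through a session object, and once flushed the rows are immediately visible to Impala queries.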
Performance considerations
• Repartition the arriving records across all the cores of the Spark job
• If using Spark SQL, lower the number of shuffle partitions from default 200
• Use Spark Streaming backpressure to optimize micro-batch size
• If using Kafka, also use spark.streaming.kafka.maxRatePerPartition
• Experiment with micro-batch lengths to balance latency vs. throughput
• Ensure storage lookup predicates are at least by key, or face full table scans
• Avoid connecting and disconnecting from storage every micro-batch
• Singleton pattern can help to keep a connection per JVM
• Avoid instantiating objects for each record where they could be reused
• Batch mutations for higher throughput
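The singleton-connection point is worth a concrete sketch: open the storage connection lazily, once per JVM (here, per Python process), and reuse it across micro-batches rather than reconnecting each time:

```python
_connection = None  # one shared connection per process

def get_connection(factory):
    """Lazily open the storage connection once and reuse it thereafter."""
    global _connection
    if _connection is None:
        _connection = factory()
    return _connection

opens = []
c1 = get_connection(lambda: opens.append("open") or "conn")
c2 = get_connection(lambda: opens.append("open") or "conn")
print(c1 is c2, opens)  # True ['open']
```

In Scala/Java the same idea is usually a lazily initialized static field inside the closure passed to `foreachPartition`, so each executor JVM connects exactly once.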
New on Cloudera Labs: Envelope
• A pre-developed Spark Streaming application that implements these stages
• Pipelines are defined as simple configuration using a properties file
• Custom implementations of stages can be referenced in the configuration
• Available on Cloudera Labs (cloudera.com/labs)
• Not supported by Cloudera, not production ready
Envelope built-in functionality
• Queue source for Kafka
• Translators for delimited text, key-value pairs, and binary Avro
• Lookup of existing storage records
• Deriver for Spark SQL transformations
• Planners for appends, upserts, and history tracking
• Storage system for Kudu
• Support for many of the described performance considerations
• All stage implementations are also pluggable with user-provided classes
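A pipeline definition along these lines might look like the following properties file. The property names here are hypothetical, chosen only to illustrate the shape of the configuration; the real keys are documented with Envelope on Cloudera Labs.

```properties
# Hypothetical Envelope-style pipeline configuration (illustrative names only)
source = kafka
source.brokers = broker1:9092
translator = delimited
translator.delimiter = ,
deriver = sql
deriver.query = SELECT key, reading, location FROM stream JOIN reference USING (sensor_id)
planner = upsert
storage = kudu
storage.table = traffic_conditions
```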
Example pipeline: Traffic
Example pipeline: FIX
Thank you
jeremy@cloudera.com