1© Cloudera, Inc. All rights reserved.
13 June 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Pat Patterson | Community Champion @ StreamSets
Ingest and Stream Processing - What will you choose?
About Ted and Pat
Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
• ted.malaska@cloudera.com
• @TedMalaska
Pat Patterson
• Community Champion @ StreamSets
• Formerly Developer Evangelist at Salesforce
• Contact
• pat@streamsets.com
• @metadaddy
Streaming Patterns
• Ingestion
• Low Millisecond Actions
• Near Real Time Complex Actions
Parts Of Streaming
Producer Kafka Engine Destination
[Diagram labels along the pipeline: at least once; ordered; partitioned; at least once; depends; depends]
Destinations
• File Systems: example HDFS
• Batch is good
• Exactly once is possible only if a file is closed in a single ack
• Good for Scans
• Solr
• Everything is document based, which makes exactly once achievable
• Batch is still good
• Good for Search Queries
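The file-system bullet above (exactly once only if a file is closed in a single ack) can be sketched roughly as follows. This is a minimal illustration, not the behaviour of any particular HDFS sink: a local directory stands in for HDFS, and `commit_batch`, `read_all` and the offset-based file names are hypothetical. The batch is written to a temporary file and published with one atomic rename, so a replayed batch rewrites the same file instead of appending duplicates.

```python
import os
import tempfile


def commit_batch(out_dir, first_offset, last_offset, records):
    """Write one batch to a file named after its offset range.

    The data goes to a temporary file first and is published with a single
    atomic rename, which plays the role of the "single ack". Replaying the
    same offset range rewrites the same file, so no duplicates appear.
    """
    final_path = os.path.join(out_dir, f"batch-{first_offset}-{last_offset}.txt")
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for record in records:
            tmp.write(record + "\n")
    os.replace(tmp_path, final_path)  # atomic on the same file system
    return final_path


def read_all(out_dir):
    """Scan every committed batch file, sorted by file name."""
    lines = []
    for name in sorted(os.listdir(out_dir)):
        if name.startswith("batch-"):
            with open(os.path.join(out_dir, name)) as f:
                lines.extend(line.rstrip("\n") for line in f)
    return lines
```

Calling `commit_batch` twice with the same offsets and records leaves one file behind, which is why large closed batches suit this kind of destination.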
Destinations
• NoSQL: example HBase
• Everything has a row key, which makes writes exactly once
• Increments can be applied twice, so be careful
• Good for gets and puts
• Kudu
• Everything has a row key, which makes writes exactly once
• Good for gets, puts, and scans
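The difference between keyed writes and increments under at-least-once delivery can be shown with a toy example. `KeyedStore` below is a hypothetical in-memory stand-in, not the HBase or Kudu client API, and the redelivery is simulated by replaying the last message once.

```python
class KeyedStore:
    """Toy stand-in for a row-key store (hypothetical, not a real client API)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # Overwrites the row: applying the same put twice leaves one value.
        self.rows[row_key] = value

    def increment(self, row_key, amount):
        # Adds to the current value: applying it twice counts twice.
        self.rows[row_key] = self.rows.get(row_key, 0) + amount


def deliver_at_least_once(messages, apply):
    """Apply every message, then replay the last one to mimic a redelivery."""
    for message in messages:
        apply(message)
    if messages:
        apply(messages[-1])


events = [("user-1", 5), ("user-2", 7)]

put_store = KeyedStore()
deliver_at_least_once(events, lambda m: put_store.put(m[0], m[1]))

inc_store = KeyedStore()
deliver_at_least_once(events, lambda m: inc_store.increment(m[0], m[1]))

print(put_store.rows)  # {'user-1': 5, 'user-2': 7}
print(inc_store.rows)  # {'user-1': 5, 'user-2': 14}
```

The replayed put is harmless because it lands on the same row key; the replayed increment doubles the value for `user-2`.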
Ingestion Destinations
• File Systems: example HDFS
• Flume
• Kafka Connect
• Solr
• Flume
• Any Streaming Engine
Ingestion Destinations
• NoSQL: example HBase
• Flume
• Any Streaming Engine: Storm and Spark Streaming Tested
• Kudu
• Flume
• Kafka Connect
• Any Streaming Engine: Spark Streaming Tested
Tricks With Producers
• Send Source ID (requires Partitioning In Kafka)
• Sequence number (Seq)
• UUID
• UUID plus time
• Partition on SourceID
• Watch out for repartitions and partition failovers
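The producer tricks above can be sketched as follows, assuming each message carries a source ID and a per-source sequence number. The names `partition_for` and `Deduplicator` are hypothetical, and `crc32` is used only because it is stable across runs; it is not the hash Kafka's default partitioner uses. Deduplicating on the highest sequence number only works if all messages from one source arrive in order, which is why the source ID is also the partition key.

```python
import zlib


def partition_for(source_id, num_partitions):
    """Pick a partition from a stable hash of the source ID."""
    return zlib.crc32(source_id.encode("utf-8")) % num_partitions


class Deduplicator:
    """Drop replays by remembering the highest sequence number per source."""

    def __init__(self):
        self.last_seq = {}

    def accept(self, source_id, seq):
        if seq <= self.last_seq.get(source_id, -1):
            return False  # already seen: a retry or replay
        self.last_seq[source_id] = seq
        return True


dedup = Deduplicator()
incoming = [("sensor-a", 0), ("sensor-a", 1), ("sensor-a", 1), ("sensor-b", 0)]
accepted = [msg for msg in incoming if dedup.accept(*msg)]
print(accepted)  # [('sensor-a', 0), ('sensor-a', 1), ('sensor-b', 0)]
```

Because the partition is computed modulo `num_partitions`, changing the partition count moves sources to different partitions, which is one reason repartitions need care.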
Streaming Engines
• Consumer
• Flume, KafkaConnect, Streaming Engine
• Storm
• Spark Streaming
• Flink
• Kafka Streams
Consumer: Flume, KafkaConnect
• Simple and Works
• Low latency
• High throughput
• Interceptors
• Transformations
• Alerting
• Ingestions
Consumer: Streaming Engines
• Not so great at HDFS Ingestion
• But great for record storage systems
• HBase
• Cassandra
• Kudu
• Solr
• Elasticsearch
Storm
• Old Gen
• Low latency
• Low throughput
• At least once
• Around forever
• Topology Based
Spark Streaming
• The Juggernaut
• Higher Latency
• High Throughput
• Exactly Once
• SQL
• MLlib
• Highly used
• Easy to Debug/Unit Test
• Easy to transition from Batch
• Flow Language
• 600 commits in a month and about 100 meetups
[Diagram: Spark Streaming. A DStream is a sequence of RDDs, one per batch: Source -> Receiver -> RDD. For the first batch and the second batch, Filter -> Count -> Print run in a single pass.]
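The micro-batch flow in the diagram above can be imitated in plain Python. This is only a sketch of the idea, not the Spark Streaming API: `micro_batches` and `process_batch` are hypothetical helpers, events are (timestamp, value) pairs, and intervals with no events are simply skipped.

```python
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    return [batches[index] for index in sorted(batches)]


def process_batch(batch):
    """Filter then count in one pass over the batch."""
    return sum(1 for value in batch if "error" in value)


events = [
    (0.2, "ok"), (0.7, "error: disk"), (1.1, "error: net"),
    (1.5, "error: cpu"), (1.9, "ok"),
]
counts = [process_batch(batch) for batch in micro_batches(events, 1.0)]
for number, count in enumerate(counts, start=1):
    print(f"batch {number}: {count}")
```

Each batch is processed independently here, which mirrors the first-batch and second-batch columns of the diagram.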
[Diagram: Spark Streaming with state. Labels: DStream; Source -> Receiver -> RDD; RDD partitions; Single Pass; Filter, Count, Print; Pre-first Batch, First Batch, Second Batch; Stateful RDD 1, Stateful RDD 2.]
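The stateful variant of the diagram, where each batch builds on state left by the previous one, can be sketched in plain Python. This is loosely in the spirit of Spark Streaming's `updateStateByKey` but does not use Spark; `update_state` is a hypothetical helper and the running count per key is just an example of state.

```python
def update_state(state, batch):
    """Return a new state with this batch's counts folded into the old one."""
    new_state = dict(state)
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state


state = {}  # the "pre-first batch" state is empty
history = []
for batch in [["a", "b", "a"], ["b", "c"]]:
    state = update_state(state, batch)
    history.append(state)

print(history[0])  # {'a': 2, 'b': 1}
print(history[1])  # {'a': 2, 'b': 2, 'c': 1}
```

Returning a new dictionary per batch, rather than mutating the old one, loosely mirrors how each batch produces a new stateful RDD from the previous one.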
Flink
• I’m better than Spark, why doesn’t anyone use me?
• Very much like Spark but not as feature rich
• Lower Latency
• Micro Batch -> ABS
• Asynchronous Barrier Snapshotting
• Flow Language
• ~1/6th the commits and meetups
• But Slim loves it
Flink - ABS
[Diagram sequence: an operator with a buffer receives barriers from two inputs. (1) Barrier 1A hit, barrier 1B still behind. (2) Both barriers hit. (3) The barrier is combined and can move on; the buffer can be flushed out.]
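The barrier alignment shown in the diagrams above can be simulated with a small toy operator. This is a simplified sketch under stated assumptions, not Flink's implementation or API: `AligningOperator` is hypothetical, it has exactly the channels it is given, its only state is a record count, and a snapshot is just a copy of that count.

```python
BARRIER = "BARRIER"


class AligningOperator:
    """Toy multi-input operator that counts records and aligns barriers."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()   # channels whose barrier has already arrived
        self.buffer = []       # records held back from blocked channels
        self.count = 0         # operator state: records processed so far
        self.snapshots = []    # state captured at each completed barrier
        self.output = []       # what flows downstream

    def _process(self, record):
        self.count += 1
        self.output.append(record)

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:
                # All barriers hit: snapshot, forward one combined barrier,
                # then flush the buffered records.
                self.snapshots.append(self.count)
                self.output.append(BARRIER)
                self.blocked.clear()
                pending, self.buffer = self.buffer, []
                for record in pending:
                    self._process(record)
        elif channel in self.blocked:
            self.buffer.append(item)  # belongs after the snapshot
        else:
            self._process(item)


op = AligningOperator(["A", "B"])
stream = [
    ("A", "a1"), ("B", "b1"),
    ("A", BARRIER),  # barrier 1A hit, 1B still behind
    ("A", "a2"),     # buffered until 1B arrives
    ("B", "b2"),     # still ahead of barrier 1B, processed now
    ("B", BARRIER),  # both barriers hit
    ("A", "a3"),
]
for channel, item in stream:
    op.receive(channel, item)

print(op.snapshots)  # [3]
print(op.output)     # ['a1', 'b1', 'b2', 'BARRIER', 'a2', 'a3']
```

The snapshot value of 3 covers exactly the records that arrived ahead of both barriers; `a2` is held in the buffer so it is not counted in a snapshot it does not belong to.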
Kafka Streams
• The new Kid on the Block
• When you only have Kafka
• Low Latency
• High Throughput
• Not exactly once
• Very Young
• Flow Language
• Very different hardware profile than others
• Not widely supported
• Not widely used
• Worries about separation of concerns
Summary about Engines
• Ingestion
• Flume and KafkaConnect
• Super Real Time and Special
• Consumer
• Counting, MLlib, SQL
• Spark
• Maybe future and cool
• Flink and KafkaStreams
• Odd man out
• Storm
Abstractions
• Code Abstractions: Beam
• SQL Abstraction: SQL
• UI Abstraction: StreamSets
• Streaming Engines
StreamSets Data Collector
Building a Higher Level, Open Source Tool
StreamSets Company Background
Founders: Traditional and Big Data
Top tier Investors
Strategic Partners
Momentum to Date
• Founded 2014; exited stealth 9/15
• ~30 employees
• Double-digit enterprise customers
• 10,000 downloads
Thank you!


Editor's Notes

  • #28: StreamSets was founded in 2014 by Informatica and Cloudera veterans. Learnings from big data and data integration. Mission to accelerate time to analysis by bringing deep introspection to data in motion. Launched company and initial product in September 2015. Especially well received by Cloudera and Elastic ecosystems.