StreamSets and Spark
Hari Shreedharan
Software Engineer
@harisr1234
hshreedharan@streamsets.com
StreamSets Data Collector
Open source software for the rapid development and reliable operation of complex data flows.
➢ Efficiency
➢ Control
➢ Agility
Inside an SDC
Stages
● Origins read data into the pipeline
○ Kafka, Kinesis, S3, JDBC, CDC, Local FS, Tail file, HDFS, MapR FS
○ Automagically parse common data formats into Records - no coding required!
● Processors operate on Records: changing, adding, or removing records or fields
○ Field remover, renamer, flattener, masker, JSON/XML/Log parser, HTTP client...
○ Scripting: Jython, Groovy, JavaScript
○ Spark!
● Destinations write data out to external systems
○ HDFS, MapR FS, S3, JDBC, Kafka, Kinesis, Mongo, HBase, Redis...
○ Automagically convert Records into common data formats
● Executors run when a linked stage sends them an event
○ Executors can be used to trigger an external action, such as a Hive query (Impala refresh, etc.)
○ Any stage can send events, for example when a file is closed or a table read is completed
Spark Evaluator
● Long-running SparkContext, passed to user code during pipeline start
● Processor that runs each batch through a user-provided “application”: the SparkTransformer
● Each batch of records is passed to the transformer as an RDD
● Use MLlib and existing Spark-based algorithms
[Diagram: a Stage sends a Batch to the Spark Evaluator, which passes the Parallelized Batch to the Spark Transformer. The transformer returns Results + Errors; Results continue as a Batch to the next Stage, and Errors go to the Error Sink.]
Spark Evaluator
● Transformer returns:
○ Result records that need to go to the next stage
○ Error records that can’t be processed
● Results are passed through to the rest of the pipeline
● Already available for CDH Spark in SDC 2.2.0.0
● MapR Spark support coming in 2.5.0.0
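The result/error contract above can be sketched in a few lines. This is a plain-Python model, not the SDC API: `TransformResult`, `transform`, and the `amount` field are hypothetical names chosen for illustration, and a list stands in for the RDD that holds the batch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]  # stand-in for an SDC Record (assumption)


@dataclass
class TransformResult:
    """Hypothetical container for what a transformer hands back."""
    results: List[Record] = field(default_factory=list)
    errors: List[Tuple[Record, str]] = field(default_factory=list)


def transform(batch: List[Record]) -> TransformResult:
    """Split one batch into result records and (record, reason) error pairs."""
    out = TransformResult()
    for record in batch:
        try:
            amount = float(record["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            # Records that cannot be processed are returned as errors
            out.errors.append((record, f"bad amount: {exc!r}"))
            continue
        out.results.append({**record, "amount": amount})
    return out


batch = [{"id": 1, "amount": "12.5"}, {"id": 2, "amount": "n/a"}, {"id": 3}]
outcome = transform(batch)
```

In this model, `outcome.results` would continue to the next stage, while `outcome.errors` would be routed to error handling along with the reason each record failed.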
Demo!
Cluster Pipelines on Spark
● Container on Spark
● Leverage Direct Kafka DStream
● Spark used only for Kafka partitioning
[Diagram: Cluster Pipeline. At each batch interval (t1, t2), the Kafka DStream yields a set of Partitions; each Partition feeds a Queue that is read by a Pipeline.]
Spark Evaluator in a Cluster World
● Able to see all the data coming in from Kafka as a single unit per batch
● Complex processing per batch that can trigger shuffles
● Compare with Spark Streaming
● Update models in real time based on streaming data
● Maintain simple (non-distributed) state
● Can be used as a base for custom functions like count, windowing etc.
● Run Spark-processed data through our own processors and use our existing stages!
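One way to read the "simple (non-distributed) state" and "count, windowing" bullets is as state that lives in the transformer object between batches. The sketch below is a plain-Python model under that assumption; `WindowedCounter`, its `window` parameter, and the `user` field are hypothetical and not part of SDC or Spark.

```python
from collections import Counter, deque
from typing import Any, Deque, Dict, List

Record = Dict[str, Any]


class WindowedCounter:
    """Keeps per-key counts over the last `window` batches (hypothetical helper)."""

    def __init__(self, window: int = 3) -> None:
        self._batches: Deque[Counter] = deque(maxlen=window)

    def transform(self, batch: List[Record]) -> List[Record]:
        # Count keys in the current batch, then remember it; the deque
        # drops the oldest batch once the window is full.
        self._batches.append(Counter(r["user"] for r in batch))
        totals: Counter = Counter()
        for counts in self._batches:
            totals.update(counts)
        # Annotate each record with the count seen across the window
        return [{**r, "seen_in_window": totals[r["user"]]} for r in batch]


counter = WindowedCounter(window=2)
first = counter.transform([{"user": "a"}, {"user": "b"}])
second = counter.transform([{"user": "a"}])
third = counter.transform([{"user": "b"}])
```

Because the state is held in one object rather than distributed, this only models the case where a single transformer instance sees every batch.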
Cluster Mode Spark Evaluator
● Pipeline design will be exactly the same as standalone
● RDD passed to SparkTransformer points to data across the cluster
● Each RDD partition represents the data on one worker pipeline
[Diagram: Cluster Pipeline, every batch. The Kafka DStream produces a Kafka RDD; the Spark Processor in each worker pipeline contributes one Partition of an RDD<Record>; the SparkTransformer returns an RDD<Record> whose Partitions flow on to the next Stage in each pipeline.]
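The partition bullets above can be pictured with a small model: each worker pipeline contributes one partition, the transformer sees all partitions together, and each partition's output goes back to the pipeline it came from. The sketch is plain Python, not Spark; `transform_partitions` and the `value` field are hypothetical, and a list of lists stands in for an `RDD<Record>`.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]
Partitions = List[List[Record]]  # one inner list per worker pipeline (assumption)


def transform_partitions(partitions: Partitions) -> Partitions:
    """Tag every record with how far it is from the batch-wide mean."""
    values = [r["value"] for part in partitions for r in part]
    # The mean needs data from every partition, which is the kind of
    # cross-pipeline view a single local batch cannot provide.
    mean = sum(values) / len(values) if values else 0.0
    return [
        [{**r, "delta": r["value"] - mean} for r in part]
        for part in partitions
    ]


partitions = [[{"value": 1.0}, {"value": 3.0}], [{"value": 8.0}]]
result = transform_partitions(partitions)
```

The output keeps the same partition layout as the input, so each worker pipeline in this model receives only the records it contributed.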
SDC on Spark - Connectivity
Sources
● Kafka
Destinations
● HDFS
● HBase
● S3
● Kudu
● MapR DB
● Cassandra
● ElasticSearch
● Kafka
● MapR Streams
● Kinesis
● etc, etc, etc!
Spark Executor
● When an event is received, kick off a Spark application
● Ability to provide an application jar and specific configuration
● Supports YARN and Databricks Cloud
● YARN
○ Client and Cluster mode
○ Parameters can be based on event data, such as the file name
● Databricks Cloud
○ Define the job beforehand
○ Kick off the job on event
○ Parameters can be based on event data, such as the file name
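As a rough illustration of event-driven parameters on YARN, the sketch below builds a `spark-submit` argument list from an event. The flags `--master`, `--deploy-mode`, and `--class` are standard `spark-submit` options; everything else (the `build_submit_command` helper, the event's `filepath` field, the jar path, and the class name) is hypothetical, and this is not a description of how the Spark Executor is implemented internally.

```python
from typing import Dict, List


def build_submit_command(
    event: Dict[str, str],
    app_jar: str,
    main_class: str,
    deploy_mode: str = "cluster",
) -> List[str]:
    """Return a spark-submit argument list; the caller decides whether to run it."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
        # Application argument taken from the event data (here, a file name)
        event["filepath"],
    ]


event = {"filepath": "/data/out/part-0001.avro"}  # hypothetical file-closed event
command = build_submit_command(event, "/opt/jobs/report.jar", "com.example.Report")
```

Returning a list rather than a shell string keeps the file name from being re-parsed by a shell if the command is later passed to something like `subprocess.run`.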
Questions?
