StreamSets and Spark

Hari Shreedharan
Software Engineer
@harisr1234
hshreedharan@streamsets.com

StreamSets Data Collector
Open source software for
the rapid development and
reliable operation of complex
data flows.
➢ Efficiency
➢ Control
➢ Agility

Stages
● Origins read data into the pipeline
○ Kafka, Kinesis, S3, JDBC, CDC, Local FS, Tail file, HDFS, MapR FS
○ Automagically parse common data formats into Records - no coding required!
● Processors operate on Records: changing, adding, or removing records or fields
○ Field remover, renamer, flattener, masker, JSON/XML/Log parser, HTTP client...
○ Scripting: Jython, Groovy, JavaScript
○ Spark!
● Destinations write data out to external systems
○ HDFS, MapR FS, S3, JDBC, Kafka, Kinesis, Mongo, HBase, Redis...
○ Automagically convert Records into common data formats
● Executors run when events are sent to them by a linked stage
○ Executors can be used to trigger an external action, like a Hive query (Impala refresh, etc.)
○ Any stage can send events, like when a file is closed or a table read is completed
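As a rough mental model of how the four stage types fit together, here is a minimal Python sketch. It is not SDC code: records are plain dicts, and every name in it (`origin`, `mask_field`, `run_pipeline`, the `batch-done` event) is made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]  # stand-in for an SDC Record (assumption: flat string fields)


def origin(lines: Iterable[str]) -> List[Record]:
    """Origin: read raw 'name,card' lines and parse them into Records."""
    records = []
    for line in lines:
        name, card = line.split(",")
        records.append({"name": name, "card": card})
    return records


def mask_field(record: Record, field: str) -> Record:
    """Processor: mask all but the last four characters of one field."""
    value = record[field]
    masked = "x" * (len(value) - 4) + value[-4:]
    return {**record, field: masked}


def run_pipeline(lines: Iterable[str],
                 destination: List[Record],
                 executor: Callable[[str], None]) -> None:
    """Push one batch through origin -> processor -> destination, then emit an event."""
    batch = origin(lines)
    batch = [mask_field(r, "card") for r in batch]
    destination.extend(batch)   # Destination: write Records out
    executor("batch-done")      # Executor: runs when a stage sends it an event


written: List[Record] = []
events: List[str] = []
run_pipeline(["ann,1234567890", "bob,0987654321"], written, events.append)
```

Here the "destination" is just a list and the "executor" is a callback, which is enough to show the direction of data flow versus event flow.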

Spark Evaluator
● Long-running SparkContext, passed to user code during pipeline start
● Processor that runs each batch through a user-provided “application”: the SparkTransformer
● Each batch of records is passed in as an RDD to the transformer
● Use MLlib and existing Spark-based algorithms
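[Diagram: a Stage sends a Batch to the Spark Evaluator stage, which hands the Parallelized Batch to the SparkTransformer. Results + Errors come back; Results continue as a Batch to the next Stage, and Errors go to the Error Sink.]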

Spark Evaluator
● Transformer returns:
○ Result records that need to go to the next stage
○ Error records that can’t be processed
● Results are passed through to the rest of the pipeline
● Already available for CDH Spark in SDC 2.2.0.0
● MapR Spark support coming in 2.5.0.0
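The transformer contract (a batch goes in, result records and error records come out) can be sketched in plain Python. This only models the idea; it does not use Spark or the real SparkTransformer API, and the names `transform`, `BatchOutcome`, and the `amount` field are assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple

Record = Dict[str, str]


class BatchOutcome(NamedTuple):
    results: List[Record]              # go on to the next stage
    errors: List[Tuple[Record, str]]   # (record, reason) pairs for error handling


def transform(batch: List[Record]) -> BatchOutcome:
    """Convert 'amount' to cents; records that cannot be processed become errors."""
    results: List[Record] = []
    errors: List[Tuple[Record, str]] = []
    for record in batch:
        try:
            cents = round(float(record["amount"]) * 100)
        except (KeyError, ValueError) as exc:
            # Keep the failing record together with the reason it failed.
            errors.append((record, f"{type(exc).__name__}: {exc}"))
            continue
        results.append({**record, "cents": str(cents)})
    return BatchOutcome(results, errors)


out = transform([{"amount": "1.50"}, {"amount": "oops"}, {"id": "7"}])
```

Only `out.results` would continue down the pipeline; `out.errors` would be routed to whatever error handling the pipeline is configured with.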

Cluster Pipelines on Spark
● Container on Spark
● Leverage Direct Kafka DStream
● Spark used only for Kafka partitioning
[Diagram: Cluster Pipeline, showing the Kafka DStream, its partitions, and the queue and pipeline for each partition, at times t1 and t2.]

Spark Evaluator in a Cluster World
● Able to see all the data coming in from Kafka as a single unit per batch
● Complex processing per batch that can trigger shuffles
● Compare with Spark Streaming
● Update models in real time based on streaming data
● Maintain simple (non-distributed) state
● Can be used as a base for custom functions like count, windowing etc.
● Run Spark-processed data through our own processors and use our existing stages!
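To make "simple (non-distributed) state" and "count, windowing" concrete, here is a small Python sketch of state that lives across batches. It is a plain in-memory model, not Spark or SDC code; the class name `WindowedCounter` and the `user` field are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict, List

Record = Dict[str, str]


class WindowedCounter:
    """Keeps per-key counts over the most recent `window` batches."""

    def __init__(self, key: str, window: int) -> None:
        self.key = key
        # deque(maxlen=...) silently drops the oldest batch once the window is full.
        self.batches: Deque[Counter] = deque(maxlen=window)

    def on_batch(self, batch: List[Record]) -> Counter:
        """Record one batch and return the counts across the current window."""
        self.batches.append(Counter(r[self.key] for r in batch))
        total: Counter = Counter()
        for counts in self.batches:
            total.update(counts)
        return total


counter = WindowedCounter(key="user", window=2)
counter.on_batch([{"user": "ann"}, {"user": "bob"}])
counter.on_batch([{"user": "ann"}])
latest = counter.on_batch([{"user": "bob"}, {"user": "bob"}])
```

After the third call the first batch has fallen out of the two-batch window, so `latest` only reflects the last two batches.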

Cluster Mode Spark Evaluator
● Pipeline design will be exactly the same as standalone
● RDD passed to SparkTransformer points to data across the cluster
● Each RDD partition represents the data on one worker pipeline
[Diagram: Cluster Pipeline, every batch. The Kafka DStream yields a Kafka RDD; a Spark Processor on each worker contributes one Partition to an RDD<Record>, which is passed to the SparkTransformer; the Partitions of the resulting RDD<Record> feed the next Stage on each worker.]
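The point of handing the transformer one dataset that spans all workers is that it can compute something no single worker pipeline could compute alone. A minimal Python model of that idea follows, with each inner list standing in for one worker's partition. This is not Spark code, and `center_on_global_mean` and the field `t` are hypothetical names.

```python
from typing import Dict, List

Record = Dict[str, float]
Partitions = List[List[Record]]  # one inner list per worker pipeline


def center_on_global_mean(partitions: Partitions, field: str) -> Partitions:
    """Subtract the mean over ALL partitions from `field`, keeping partition layout."""
    values = [r[field] for part in partitions for r in part]
    if not values:
        return partitions  # nothing to center
    mean = sum(values) / len(values)
    # Each output partition lines up with its input partition, so every
    # worker pipeline gets back the records it contributed.
    return [[{**r, field: r[field] - mean} for r in part] for part in partitions]


workers = [[{"t": 1.0}, {"t": 3.0}], [{"t": 8.0}]]
centered = center_on_global_mean(workers, "t")
```

In real Spark the equivalent aggregate across partitions is what can trigger a shuffle or a reduce step; here it is just a flattened list.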

SDC on Spark - Connectivity
Sources
● Kafka
Destinations
● HDFS
● HBase
● S3
● Kudu
● MapR-DB
● Cassandra
● Elasticsearch
● Kafka
● MapR Streams
● Kinesis
● etc, etc, etc!

Spark Executor
● When an event is received, kick off a Spark application
● Ability to provide an application jar and specific configuration
● Supports YARN and Databricks Cloud
● YARN
○ Client and Cluster mode
○ Parameters can be based on the event data, like file name
● Databricks Cloud
○ Define job beforehand
○ Kick off the job on event
○ Parameters can be based on the event data, like file name
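To illustrate "parameters can be based on the event data", here is a sketch that turns an event into a `spark-submit` argument list for YARN. It shows the idea with the standard Spark CLI flags (`--master`, `--deploy-mode`, `--class`, `--conf`) and is not a description of how the Spark Executor launches jobs internally. The event field `filepath`, the jar path, and the class name are all assumptions.

```python
from typing import Dict, List, Optional


def build_spark_submit(event: Dict[str, str],
                       app_jar: str,
                       main_class: str,
                       deploy_mode: str = "cluster",
                       conf: Optional[Dict[str, str]] = None) -> List[str]:
    """Build a spark-submit argument list whose app argument comes from the event."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode,
           "--class", main_class]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # Application arguments follow the jar; here, the file name from the event.
    cmd += [app_jar, event["filepath"]]
    return cmd


# Hypothetical "file closed" event; the field name `filepath` is an assumption.
event = {"filepath": "/data/out/part-0001.avro"}
cmd = build_spark_submit(event, "/opt/jobs/compact.jar", "com.example.Compact",
                         conf={"spark.executor.memory": "2g"})
```

The function only builds the command; actually running it (for example with `subprocess.run(cmd)`) is left out so the sketch has no side effects.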
