REX Real-time Data Pipeline:
Kafka Streams / Kafka Connect versus Spark Streaming
EL ARIB Abdelhamide, Data Engineer at
OUTLINE
• The Challenge
• Brief presentation of Spark & Kafka
• Kafka vs Spark :
1. New File Detection
2. Processing
3. Handling Failure
4. Deployment
5. Scaling
6. Monitoring
• 10 points not to forget
THE CHALLENGE
HDFS is our source of truth
Need to stream very big data (Parquet
or Avro) into Kafka
Different schemas for each source
Need to scale the speed of
streaming up/down
Exactly-once copy (HDFS & Kafka)
And the monitoring is, as always, a must
HDFS
/client1/crm
/client1/tracking
/client1/transactions
/client2/mainCrm
/client2/payments
….
BRIEF PRESENTATION OF SPARK & KAFKA
APACHE SPARK
Distributed in-memory data-processing engine for batch & streaming workloads.
[Diagram]
Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Engine: Apache Spark Core
Cluster managers: Standalone Scheduler, YARN, Mesos, k8s, …
SPARK STREAMING
[Diagram] Receiver → records → micro-batches (RDDs) → Apache Spark Core → cluster manager (Standalone Scheduler, YARN, Mesos, k8s, …)
End-to-end latency ~100 ms
Exactly-once guarantee
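The receiver-based micro-batch model above can be sketched with the classic DStream API (a minimal sketch; the app name, directory, and batch interval are illustrative, and the Kafka write is elided):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Client1CrmStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("client1-crm-stream")
    // Receiver-based micro-batching: one RDD every 10 seconds
    val ssc = new StreamingContext(conf, Seconds(10))

    // Watches a single directory for new files; all files in it
    // must share the same schema
    val lines = ssc.textFileStream("hdfs:///client1/crm")

    lines.foreachRDD { rdd =>
      // push each micro-batch to Kafka here (at-least-once)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```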
SPARK STRUCTURED STREAMING
Whenever you can, use DataFrames instead of Spark’s RDD primitive:
DataFrames get the benefit of the Catalyst query optimizer.
Multiple types of triggers
[Diagram] Receiver → records → micro-batches (DataFrames) → Spark SQL (Catalyst) → Apache Spark Core → cluster manager (Standalone Scheduler, YARN, Mesos, k8s, …)
Continuous processing (experimental in Spark 2.3): end-to-end latency ~1 ms with an at-least-once guarantee
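The DataFrame path can be sketched with `readStream`/`writeStream` (a minimal sketch; the schema, paths, broker address, and topic name are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Client1CrmStructured {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("client1-crm-structured").getOrCreate()

    // File sources need the schema up front (hypothetical columns)
    val crmSchema = StructType(Seq(
      StructField("id", StringType),
      StructField("payload", StringType)))

    val df = spark.readStream
      .schema(crmSchema)
      .parquet("hdfs:///client1/crm")

    // The built-in Kafka sink expects key/value columns
    df.selectExpr("id AS key", "payload AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "client1_crm")
      .option("checkpointLocation", "hdfs:///checkpoints/client1_crm")
      .trigger(Trigger.ProcessingTime("10 seconds")) // one of several trigger types
      .start()
      .awaitTermination()
  }
}
```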
KAFKA/ KAFKA STREAMS/ KAFKA CONNECT
[Diagram] Data source → Kafka Connect → Kafka topic → Kafka Streams
KAFKA CONNECT
Kafka Connect is a framework to stream data into & out of Kafka using connectors.
Two types of connectors: Sinks (Export) & Sources (Import)
poll.interval.ms (connector-specific; commonly defaults to 5000 ms)
[Diagram] Kafka Connect cluster: one Elasticsearch sink connector instance spread over 3 workers
Worker 1: ES sink, Task 1 (partitions 1, 2)
Worker 2: ES sink, Task 2 (partitions 3, 4)
Worker 3: ES sink, Task 3 (partitions 5, 6)
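A connector like the one in the diagram is declared with configuration only; the framework spreads the tasks across workers. A sketch of such a config, assuming the Confluent Elasticsearch sink (the topic name and URL are hypothetical):

```json
{
  "name": "elastic-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "client1_crm",
    "tasks.max": "3",
    "connection.url": "http://elastic:9200"
  }
}
```

`tasks.max = 3` is what produces the three tasks, two partitions each, shown above.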
1. NEW FILE DETECTION
1. NEW FILE DETECTION - SPARK
Can detect new files within a directory
Files should share the same schema
Can’t create one stream for all sources of client X
N clients with M sources => N*M streams
At least 1 executor for each stream
HDFS
/client1/crm
/client1/tracking
/client1/transactions
…
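The N*M fan-out can be made concrete: with the DStream file API, each (client, source) directory needs its own stream (a sketch; `ssc` is an existing `StreamingContext` and the directory list is from the slide):

```scala
// One DStream per (client, source) directory: N clients * M sources streams,
// each holding at least one executor
val dirs = Seq("/client1/crm", "/client1/tracking", "/client1/transactions")
val streams = dirs.map(dir => ssc.textFileStream(s"hdfs://$dir"))
```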
1. NEW FILE DETECTION – KAFKA
1. NEW FILE DETECTION - HDFS-WATCHER (INOTIFY)
[Diagram] The HDFS-Watcher scans HDFS for new files (excluding dirs with _temporary) under /client1/crm, /client1/tracking, /client1/transactions, /client2/mainCrm, /client2/payments, …, and publishes (source, client) -> path records to a Kafka topic (Avro).
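HDFS exposes inotify-style events through `HdfsAdmin`; a watcher along these lines could be sketched as follows (the NameNode URI is hypothetical, and the Kafka publish is elided):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.client.HdfsAdmin
import org.apache.hadoop.hdfs.inotify.Event

object HdfsWatcher {
  def main(args: Array[String]): Unit = {
    val admin  = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration())
    val stream = admin.getInotifyEventStream()

    while (true) {
      val batch = stream.take() // blocks until events arrive
      batch.getEvents.foreach {
        // CloseEvent fires once a file is fully written
        case e: Event.CloseEvent if !e.getPath.contains("_temporary") =>
          // publish (source, client) -> path to the Kafka topic (Avro) here
          println(s"new file: ${e.getPath}")
        case _ => ()
      }
    }
  }
}
```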
2. PROCESSING
2. PROCESSING - SPARK
Read files as a stream
Read partition by partition
At-least-once when writing to Kafka
2. PROCESSING - KAFKA
[Diagram] Kafka Connect processing flow (steps 1-8):
1. The HDFS Watcher publishes new file paths to the FileDetection Kafka topic ((source, client) -> filePath).
2. The Kafka Connect task (Task 1) consumes new files to process & gets the partition key from the metadata in their schemas (via the Schema Registry).
3. On poll(), the task reads n (= batch.size) lines of the file from HDFS.
4. A GlobalKTable (path, lastLineCommitted), backed by the hdfs-source-offset topic, tracks the offset so we can handle failure.
5. The Kafka producer writes the lines to the Client_source topic.
6. On commitRecords(), the Kafka Connect framework commits (path, newLastLineCommitted) back to the hdfs-source-offset topic.
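A source connector of this shape boils down to a `SourceTask` whose records carry a source partition (the file path) and a source offset (the last line committed), which the framework persists for recovery. A minimal sketch (class name, config key, topic, and the hard-coded line number are all illustrative):

```scala
import java.util.{List => JList, Map => JMap}
import scala.jdk.CollectionConverters._
import org.apache.kafka.connect.source.{SourceRecord, SourceTask}

class HdfsSourceTask extends SourceTask {
  private var path: String = _

  override def version(): String = "0.1"

  override def start(props: JMap[String, String]): Unit = {
    path = props.get("file.path")
    // Ask the framework for the last committed offset for this file,
    // so we can resume at lastLineCommitted after a failure
    val offset = context.offsetStorageReader()
      .offset(Map("path" -> (path: AnyRef)).asJava)
    // seek to offset.get("line") if present ...
  }

  override def poll(): JList[SourceRecord] = {
    // read n (= batch.size) lines from HDFS and wrap each one:
    val sourcePartition = Map("path" -> (path: AnyRef)).asJava
    val sourceOffset    = Map("line" -> (42L: AnyRef)).asJava // newLastLineCommitted
    List(new SourceRecord(sourcePartition, sourceOffset,
      "client_source", null, "a line from the file")).asJava
  }

  override def stop(): Unit = ()
}
```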
3. HANDLING FAILURE
3. HANDLING FAILURE - SPARK
The HDFS receiver is not a reliable receiver
.option("checkpointLocation", "hdfs://checkpointPath") (expensive)
spark.streaming.receiver.writeAheadLog.enable
spark.task.maxFailures = 0
trait StreamingListener {
…
/** Called when a receiver has reported an error */
def onReceiverError(receiverError: StreamingListenerReceiverError) { }
…
}
3. HANDLING FAILURE - KAFKA
A GlobalKTable tracks the last line committed.
KIP-298: Error Handling in Connect
• errors.retry.timeout
• errors.retry.delay.max.ms
• errors.tolerance (in a task) = none
Kafka producer config:
• acks = 1 (or all)
• retries > 0
• max.in.flight.requests.per.connection > 1 (no need to preserve order)
• batch.size (producer config) < number of lines * size of a single line
• linger.ms = 0 (send records as they arrive)
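The producer settings above can be collected into a `Properties` object (a sketch; the broker address and the concrete retry/batch values are illustrative):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
props.put(ProducerConfig.ACKS_CONFIG, "1")                           // or "all"
props.put(ProducerConfig.RETRIES_CONFIG, "3")                        // retries > 0
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5") // order need not be preserved
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536")
props.put(ProducerConfig.LINGER_MS_CONFIG, "0")                      // send records as they arrive
```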
4. DEPLOYMENT
4. DEPLOYMENT - SPARK
Standalone
Cluster mode: using spark-submit, the driver runs inside the cluster
Client mode: start the app; the driver runs on the client machine
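The two modes differ only in the `--deploy-mode` flag (a sketch; the master, class, and jar names are hypothetical):

```shell
# Cluster mode: the driver runs inside the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.Client1CrmStream app.jar

# Client mode: the driver runs on the client machine
spark-submit --master yarn --deploy-mode client \
  --class com.example.Client1CrmStream app.jar
```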
4. DEPLOYMENT - KAFKA
Package the connector (beware of classpath issues)
Place it in /plugins
Start the Kafka Connect cluster
Run it via the REST API
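The last step is a single REST call to any worker (a sketch; the worker host/port and config file name are hypothetical):

```shell
# Submit the connector config to any worker in the Connect cluster
curl -X POST -H "Content-Type: application/json" \
  --data @elastic-sink.json \
  http://connect-worker:8083/connectors
```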
5. SCALING UP/DOWN
5. SCALING UP/DOWN- SPARK
Dynamic allocation
Min, max & initial resources
Can’t manually scale down executor resources without stopping the driver
Can’t manually scale the number of executors up/down
5. SCALING UP/DOWN- KAFKA
Easily scale the number of tasks up/down
Only static allocation for cluster resources
Can’t update cluster/worker resources without stopping them
6. MONITORING
6. MONITORING - SPARK
Spark UI
Spark UI REST API: http://localhost:4040/api/v1
JMX with Dropwizard
6. MONITORING - KAFKA
Kafka Connect REST API: to monitor tasks
JMX metrics:
Connector metrics
Task metrics
Worker metrics
Worker rebalance metrics
See KIP-196
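The REST side of the monitoring can be sketched with two calls (the worker host/port and connector name are hypothetical):

```shell
# List all connectors running in the cluster
curl http://connect-worker:8083/connectors

# Status of one connector and each of its tasks
curl http://connect-worker:8083/connectors/elastic-sink/status
```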
10 POINTS NOT TO FORGET (FROM OUR EXPERIENCE)
1. It depends on what kind of data source & data sink you’re using.
2. Always check the receiver (Spark Streaming) & the connector (Kafka Connect).
3. Monitoring is pretty good for both.
4. Integration testing is challenging with Kafka Connect.
5. If you want to add more tasks, go for Kafka Connect.
6. If you want dynamic allocation, go for Spark Streaming.
7. Deploying a Spark Streaming app is simple.
8. Naming conventions in the Kafka Connect framework may be confused with the naming in Kafka core.
9. If you prefer configuration over code, go for Kafka Connect.
10. If you want a distributed copy from or to Kafka without a cluster manager, go for Kafka Connect.
AS ALWAYS IT REALLY DEPENDS ON YOUR NEEDS ☺
THANK YOU