Building End to End Streaming
Application on Spark
Streaming application development journey
https://github.com/Shasidhar/sensoranalytics
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● www.shashidhare.com
Agenda
● Problem Statement
● Spark streaming
● Stage 1 : File Streams
● Stage 2 : Kafka as input source (Introduction to Kafka)
● Stage 3 : Cassandra as output store (Introduction to Cassandra)
● Stage 4 : Flume as data collection engine (Introduction to Flume)
● How to test streaming code?
● Next steps
Earlier System
Business model
● Providers of Wi-Fi hot spot devices in public spaces
● Ability to collect data from these devices and analyse it
Existing System
● Collect data and process in daily batches to generate the
required results
(Diagram: servers ship data to a central directory, which feeds Splunk and downstream systems.)
Need for real time engine
● Lots of failures in user logins
● Need to analyse why there is a drop in user logins
● Ability to analyse the data in real time rather than in daily batches
● As the company grew, Splunk was not scaling, as it is not designed for horizontal scaling
New system requirement
● Ability to collect and process large amounts of data
● Ability to store results in persistent storage
● A reporting mechanism to view the insights obtained from the analysis
● Need to see the results in real time
● In simple terms, a real-time monitoring system
Why Spark Streaming ?
● Easy to port a batch system to Spark's streaming engine
● Spark Streaming can handle large amounts of data and is very fast
● Best choice for near real-time systems
● Forward-looking features
○ Ability to ingest data from many sources
○ Good support for downstream stores like NoSQL
○ And lot more
Spark Streaming Architecture
(Diagram: servers write to a source directory; the Spark Streaming engine processes the data into an output directory, which is visualized in Zeppelin.)
Data format
Log data with the following fields:
● Timestamp
● Country
● State
● City
● SensorStatus
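For illustration, a single record might look like this (delimiter and values are hypothetical; the actual format is defined in the repository):

2016-03-12 10:15:32,India,Karnataka,Bangalore,ACTIVE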
Required Results
● Country Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured country-wise
● State Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured state-wise
● City Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured city-wise, with respect to sensor status
Data Analytics - Phase 1
● Receive data from servers
● Store the input data into files
● Use file as input and output
● Process the data and generate the required statistics
● Store results into output files
(Pipeline: input files (directory) → Spark Streaming engine → output files (directory))
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● Spark Streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1 second or as large as multiple hours
● Spark's job creation and execution overhead is low enough to do all of this in under a second
● The resulting sequence of batches is called a DStream
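A minimal sketch of the micro-batch setup for Phase 1, assuming a local run and a hypothetical input path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SensorAnalytics")

// Each micro batch covers 1 second of data; the resulting
// stream of batches is the DStream.
val ssc = new StreamingContext(conf, Seconds(1))

// Phase 1: watch a directory for newly arriving files (path is illustrative).
val lines = ssc.textFileStream("/data/sensor/input")
lines.print()

ssc.start()
ssc.awaitTermination()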
Apache Zeppelin
● Web based notebook that allows interactive data analysis
● It allows
○ Data ingestion
○ Data Discovery
○ Data Analytics
○ Data Visualization and collaboration
● Built-in Spark integration
Data Model
● 4 models
○ SensorRecord - To read input records
○ CountryWiseStats - Store country wise aggregations
○ StateWiseStats - Store state wise aggregations
○ CityWiseStats - Store city wise aggregations
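A sketch of what these models might look like; the field names are assumptions derived from the data-format slide, not copied from the repository:

case class SensorRecord(timestamp: String, country: String, state: String,
                        city: String, sensorStatus: String)

case class CountryWiseStats(country: String, count: Long)
case class StateWiseStats(country: String, state: String, count: Long)
case class CityWiseStats(country: String, state: String, city: String,
                         sensorStatus: String, count: Long)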
Phase 1 - Hands On
Git branch : Master
Problems with Phase 1
● Input and output are files
● Cannot detect new records / new data as and when they arrive
● Files introduce high latency into the system
Solution: Replace the input file source with Apache Kafka
Data Analytics - Phase 2
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into output files
(Pipeline: Kafka → Spark Streaming engine → output files (directory))
Apache Kafka
● High throughput publish subscribe based messaging
system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system as topics
● Uses Zookeeper for cluster management
● Written in Scala, but provides client APIs for many languages: Java,
Ruby, Python, etc.
● Developed by LinkedIn
High Level Architecture
Terminology
● Topics: where messages are maintained and partitioned
● Producers: processes that publish messages to a topic (see the producer sketch below)
● Consumers: processes that subscribe to topics and read messages
● Brokers: the servers that make up the Kafka cluster
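To make the producer role concrete, here is a minimal Scala producer using the standard Kafka client API; the broker address and topic name are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one sensor record to the (hypothetical) sensor-events topic.
producer.send(new ProducerRecord[String, String](
  "sensor-events", "2016-03-12 10:15:32,India,Karnataka,Bangalore,ACTIVE"))
producer.close()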
Anatomy of Kafka Topic
Spark Streaming - Kafka
● Two ways to fetch data from Kafka into Spark
○ Receiver approach
■ Data is stored in receivers
■ Kafka topic partitions do not correlate with RDD partitions
■ Enable the WAL for zero data loss
■ To increase input speed, create multiple receivers (sketch below)
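A receiver-based stream might be created like this, with the ZooKeeper address, group id and topic name as assumptions:

import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver; the Int in the map is the number of consumer threads for the topic.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",          // ZooKeeper quorum
  "sensor-analytics-group",  // consumer group id
  Map("sensor-events" -> 1)) // topic -> thread count

val lines = kafkaStream.map(_._2) // values; keys are ignored here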
Spark Streaming - Kafka (contd.)
○ Receiver-less (direct) approach
■ No data is stored in receivers
■ The exact same partitioning is maintained in the Spark RDDs as in the Kafka topics
■ No WAL is needed; since the data stays in Kafka, older data can be re-fetched after a crash
■ More Kafka partitions increase the data-fetching parallelism (sketch below)
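The direct (receiver-less) equivalent, again with assumed broker and topic names:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// Each generated RDD has exactly as many partitions as the Kafka topic.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sensor-events"))

val lines = directStream.map(_._2)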
Phase 2 - Hands On
Git branch : Kafka
Problems with Phase 2
● Output is still a file
● Retrieval always requires a full file scan; there are no lookups
● Querying results is cumbersome
● A NoSQL database is a better option
Solution: Replace the output file with Cassandra
Data Analytics - Phase 3
(Pipeline: Kafka → Spark Streaming engine → Cassandra)
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into Cassandra
What is Cassandra
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database”
“Daughter of Dynamo and Bigtable”
Key Components and Features
● Distributed
● System keyspace
● Peer to peer - No SPOF
● Read and write to any node
● Operational simplicity
● Gossip and Failure Detection
Overall Architecture
(Diagram: the Cassandra daemon, accessed through the cassandra CLI, language drivers and JDBC drivers; storage internals: commit log, memtable, SSTables.)
Spark Cassandra Connector
● Loads data from Cassandra into Spark and vice versa
● Handles type conversions
● Maps tables to Spark RDDs
● Supports all Cassandra data types, collections and UDTs
● Spark SQL support
● Supports Spark SQL predicate pushdown
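With the connector on the classpath and spark.cassandra.connection.host set on the SparkConf, writing a result stream is a single call. The keyspace, table and DStream below are assumptions, not the repository's actual names:

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

// countryStats: DStream[(String, Long)] of (country, count) pairs, written to a
// pre-created table such as sensor.country_stats(country text, count bigint).
countryStats.saveToCassandra("sensor", "country_stats", SomeColumns("country", "count"))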
Phase 3 - Hands On
Git branch : Cassandra
Problems with Phase 3
● Servers cannot push directly to Kafka
● Pushing data requires manual intervention
● Need an automated way to push data
Solution: Add Flume as a data collection agent
Data Analytics - Phase 4
● Receive data from servers
● Stream data into Kafka through Flume
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into Cassandra
(Pipeline: Flume → Kafka → Spark Streaming engine → Cassandra)
Apache Flume
● Distributed data collection service
● A solution for collecting data of all formats
● Initially designed to transfer log data into HDFS frequently
and reliably
● It is horizontally scalable
● Configurable routing
Flume Architecture
Components
○ Event
○ Source
○ Sink
○ Channel
○ Agent
Flume Configuration
● Define Source, Sink and Channel names
● Configure Source
● Configure Sink
● Configure Channel
● Bind Source and Sink to Channel
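A minimal sketch of such a configuration, assuming a spooling-directory source feeding the Kafka sink that ships with Flume 1.6+; all names and paths are illustrative:

agent.sources  = src1
agent.channels = ch1
agent.sinks    = snk1

agent.sources.src1.type     = spooldir
agent.sources.src1.spoolDir = /data/sensor/logs

agent.channels.ch1.type = memory

agent.sinks.snk1.type       = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.snk1.topic      = sensor-events
agent.sinks.snk1.brokerList = localhost:9092

agent.sources.src1.channels = ch1
agent.sinks.snk1.channel    = ch1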
Phase 4 - Hands On
Git branch : Flume
Data Analytics - Redesign
● Why do we want to redesign / restructure?
● What do we want to test?
● How to test streaming applications
● Hack a bit on Spark's ManualClock
● Use ScalaTest for unit testing
● Introduce abstractions to decouple the code
● Write some tests (example below)
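For example, once the aggregation logic is extracted into a pure function, it can be unit-tested without a StreamingContext. The aggregator below is hypothetical, reusing the SensorRecord model sketched earlier:

import org.scalatest.FunSuite

object StatsAggregator {
  // Pure function: count records per country for one batch of data.
  def countryCounts(records: Seq[SensorRecord]): Map[String, Long] =
    records.groupBy(_.country).mapValues(_.size.toLong).toMap
}

class StatsAggregatorSuite extends FunSuite {
  test("counts records per country") {
    val records = Seq(
      SensorRecord("2016-03-12 10:15:32", "India", "Karnataka", "Bangalore", "ACTIVE"),
      SensorRecord("2016-03-12 10:16:01", "India", "Karnataka", "Mysore", "FAILED"),
      SensorRecord("2016-03-12 10:16:05", "USA", "California", "San Jose", "ACTIVE"))

    assert(StatsAggregator.countryCounts(records) === Map("India" -> 2L, "USA" -> 1L))
  }
}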
Manual Clock
● A clock whose time can be set and modified manually
● Its reported time does not advance as wall-clock time elapses
● Only the caller has control over it
● Used specifically for testing
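A sketch of the hack, assuming Spark 1.x where the clock implementation is selected through the undocumented spark.streaming.clock setting; the exact class name has moved between Spark versions, and advancing the clock requires a small wrapper placed under an org.apache.spark package to reach the private scheduler API:

import org.apache.spark.SparkConf

val testConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("sensoranalytics-test")
  // Replace the system clock with Spark's ManualClock so the test controls
  // batch boundaries deterministically.
  .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")

// In a test: push data into the input stream, advance the manual clock by one
// batch interval through the wrapper, then assert on the produced output.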
Phase 5 - Hands On
Git branch : unittest
Next steps
● Use better serialization frameworks like Avro
● Enable checkpointing (sketch below)
● Integrate Kafka monitoring tools
● Add support for multiple Kafka topics
● Write more tests for all functionality
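Checkpointing, for instance, mostly needs a reliable directory and a context factory; the path below is hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SensorAnalytics")
val checkpointDir = "hdfs://namenode:8020/checkpoints/sensoranalytics"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  // ... build the Kafka -> aggregation -> Cassandra pipeline here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recover from the checkpoint on restart, or build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)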
