Building End to End Streaming
Application on Spark
Streaming application development journey
https://github.com/Shasidhar/sensoranalytics
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● www.shashidhare.com
Agenda
● Problem Statement
● Spark streaming
● Stage 1 : File Streams
● Stage 2 : Kafka as input source (Introduction to Kafka)
● Stage 3 : Cassandra as output store (Introduction to Cassandra)
● Stage 4 : Flume as data collection engine (Introduction to Flume)
● How to test streaming code?
● Next steps
Earlier System
Business model
● Providers of Wi-Fi hot spot devices in public spaces
● Ability to collect data from these devices and analyse it
Existing System
● Collect data and process in daily batches to generate the
required results
(Diagram: servers ship data to a central directory, which feeds Splunk and downstream systems.)
Need for real time engine
● Lots of failures in user logins
● Need to analyse why there is a drop in user logins
● Ability to analyse the data in real time rather than in daily batches
● As the company grew, Splunk was not scaling, as it is not designed for horizontal scaling
New system requirement
● Ability to collect and process large amounts of data
● Ability to store results in persistent storage
● A reporting mechanism to view the insights obtained from the analysis
● Need to see the results in real time
● In simple terms, a real-time monitoring system
Why Spark Streaming ?
● Easy to port a batch system to Spark's streaming engine
● Spark Streaming can handle large amounts of data and is very fast
● Best choice for near real-time systems
● Forward-looking features
○ Ability to ingest data from many sources
○ Good support for downstream stores like NoSQL
○ And lot more
Spark Streaming Architecture
(Diagram: servers write to a source directory; the Spark Streaming engine processes the data into an output directory, which is visualized in Zeppelin.)
Data format
Log data with the following fields:
● Timestamp
● Country
● State
● City
● SensorStatus
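For illustration, a single record might look like this (delimiter and values are hypothetical; the actual format is defined in the repository):

2016-03-12 10:15:32,India,Karnataka,Bangalore,ACTIVE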
Required Results
● Country Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured country-wise
● State Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured state-wise
● City Wise Stats
○ Hourly, weekly and monthly view of the total count of records captured city-wise, with respect to sensor status
Data Analytics - Phase 1
● Receive data from servers
● Store the input data into files
● Use file as input and output
● Process the data and generate the required statistics
● Store results into output files
(Pipeline: input files (directory) → Spark Streaming engine → output files (directory))
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● Spark Streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1 second or as large as multiple hours
● Spark's job creation and execution overhead is low enough to do all of this in under a second
● The resulting sequence of batches is called a DStream
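A minimal sketch of the micro-batch setup for Phase 1, assuming a local run and a hypothetical input path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SensorAnalytics")

// Each micro batch covers 1 second of data; the resulting
// stream of batches is the DStream.
val ssc = new StreamingContext(conf, Seconds(1))

// Phase 1: watch a directory for newly arriving files (path is illustrative).
val lines = ssc.textFileStream("/data/sensor/input")
lines.print()

ssc.start()
ssc.awaitTermination()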
Apache Zeppelin
● Web based notebook that allows interactive data analysis
● It allows
○ Data ingestion
○ Data Discovery
○ Data Analytics
○ Data Visualization and collaboration
● Built-in Spark integration
Data Model
● 4 models
○ SensorRecord - To read input records
○ CountryWiseStats - Store country wise aggregations
○ StateWiseStats - Store state wise aggregations
○ CityWiseStats - Store city wise aggregations
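A sketch of what these models might look like; the field names are assumptions derived from the data-format slide, not copied from the repository:

case class SensorRecord(timestamp: String, country: String, state: String,
                        city: String, sensorStatus: String)

case class CountryWiseStats(country: String, count: Long)
case class StateWiseStats(country: String, state: String, count: Long)
case class CityWiseStats(country: String, state: String, city: String,
                         sensorStatus: String, count: Long)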
Phase 1 - Hands On
Git branch : Master
Problems with Phase 1
● Input and output are files
● Cannot detect new records / new data as and when they arrive
● Files introduce high latency into the system
Solution: Replace the input file source with Apache Kafka
Data Analytics - Phase 2
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into output files
(Pipeline: Kafka → Spark Streaming engine → output files (directory))
Apache Kafka
● High throughput publish subscribe based messaging
system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system as topics
● Uses Zookeeper for cluster management
● Written in Scala, but provides client APIs for many languages: Java,
Ruby, Python, etc.
● Developed by LinkedIn
High Level Architecture
Terminology
● Topics: where messages are maintained and partitioned
● Producers: processes that publish messages to a topic (see the producer sketch below)
● Consumers: processes that subscribe to topics and read messages
● Brokers: the servers that make up the Kafka cluster
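To make the producer role concrete, here is a minimal Scala producer using the standard Kafka client API; the broker address and topic name are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one sensor record to the (hypothetical) sensor-events topic.
producer.send(new ProducerRecord[String, String](
  "sensor-events", "2016-03-12 10:15:32,India,Karnataka,Bangalore,ACTIVE"))
producer.close()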
Anatomy of Kafka Topic
Spark Streaming - Kafka
● Two ways to fetch data from Kafka into Spark
○ Receiver approach
■ Data is stored in receivers
■ Kafka topic partitions do not correlate with RDD partitions
■ Enable the WAL for zero data loss
■ To increase input speed, create multiple receivers (sketch below)
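A receiver-based stream might be created like this, with the ZooKeeper address, group id and topic name as assumptions:

import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver; the Int in the map is the number of consumer threads for the topic.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",          // ZooKeeper quorum
  "sensor-analytics-group",  // consumer group id
  Map("sensor-events" -> 1)) // topic -> thread count

val lines = kafkaStream.map(_._2) // values; keys are ignored here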
Spark Streaming - Kafka (contd.)
○ Receiver-less (direct) approach
■ No data is stored in receivers
■ The exact same partitioning is maintained in the Spark RDDs as in the Kafka topics
■ No WAL is needed; since the data stays in Kafka, older data can be re-fetched after a crash
■ More Kafka partitions increase the data-fetching parallelism (sketch below)
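The direct (receiver-less) equivalent, again with assumed broker and topic names:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// Each generated RDD has exactly as many partitions as the Kafka topic.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sensor-events"))

val lines = directStream.map(_._2)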
Phase 2 - Hands On
Git branch : Kafka
Problems with Phase 2
● Output is still a file
● Retrieval always requires a full file scan; there are no lookups
● Querying results is cumbersome
● A NoSQL database is a better option
Solution: Replace the output file with Cassandra
Data Analytics - Phase 3
(Pipeline: Kafka → Spark Streaming engine → Cassandra)
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into Cassandra
What is Cassandra
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database”
“Daughter of Dynamo and Bigtable”
Key Components and Features
● Distributed
● System keyspace
● Peer to peer - No SPOF
● Read and write to any node
● Operational simplicity
● Gossip and Failure Detection
Overall Architecture
(Diagram: the Cassandra daemon, accessed through the cassandra CLI, language drivers and JDBC drivers; storage internals: commit log, memtable, SSTables.)
Spark Cassandra Connector
● Loads data from Cassandra into Spark and vice versa
● Handles type conversions
● Maps tables to Spark RDDs
● Supports all Cassandra data types, collections and UDTs
● Spark SQL support
● Supports Spark SQL predicate pushdown
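With the connector on the classpath and spark.cassandra.connection.host set on the SparkConf, writing a result stream is a single call. The keyspace, table and DStream below are assumptions, not the repository's actual names:

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

// countryStats: DStream[(String, Long)] of (country, count) pairs, written to a
// pre-created table such as sensor.country_stats(country text, count bigint).
countryStats.saveToCassandra("sensor", "country_stats", SomeColumns("country", "count"))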
Phase 3 - Hands On
Git branch : Cassandra
Problems with Phase 3
● Servers cannot push directly to Kafka
● Pushing data requires manual intervention
● Need an automated way to push data
Solution: Add Flume as a data collection agent
Data Analytics - Phase 4
● Receive data from servers
● Stream data into Kafka through Flume
● Store the input data in Kafka
● Use Kafka as input
● Process the data and generate the required statistics
● Store results into Cassandra
(Pipeline: Flume → Kafka → Spark Streaming engine → Cassandra)
Apache Flume
● Distributed data collection service
● A solution for collecting data of all formats
● Initially designed to transfer log data into HDFS frequently
and reliably
● It is horizontally scalable
● Configurable routing
Flume Architecture
Components
○ Event
○ Source
○ Sink
○ Channel
○ Agent
Flume Configuration
● Define Source, Sink and Channel names
● Configure Source
● Configure Sink
● Configure Channel
● Bind Source and Sink to Channel
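A minimal sketch of such a configuration, assuming a spooling-directory source feeding the Kafka sink that ships with Flume 1.6+; all names and paths are illustrative:

agent.sources  = src1
agent.channels = ch1
agent.sinks    = snk1

agent.sources.src1.type     = spooldir
agent.sources.src1.spoolDir = /data/sensor/logs

agent.channels.ch1.type = memory

agent.sinks.snk1.type       = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.snk1.topic      = sensor-events
agent.sinks.snk1.brokerList = localhost:9092

agent.sources.src1.channels = ch1
agent.sinks.snk1.channel    = ch1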
Phase 4 - Hands On
Git branch : Flume
Data Analytics - Redesign
● Why do we want to redesign / restructure?
● What do we want to test?
● How to test streaming applications
● Hack a bit on Spark's ManualClock
● Use ScalaTest for unit testing
● Introduce abstractions to decouple the code
● Write some tests (example below)
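For example, once the aggregation logic is extracted into a pure function, it can be unit-tested without a StreamingContext. The aggregator below is hypothetical, reusing the SensorRecord model sketched earlier:

import org.scalatest.FunSuite

object StatsAggregator {
  // Pure function: count records per country for one batch of data.
  def countryCounts(records: Seq[SensorRecord]): Map[String, Long] =
    records.groupBy(_.country).mapValues(_.size.toLong).toMap
}

class StatsAggregatorSuite extends FunSuite {
  test("counts records per country") {
    val records = Seq(
      SensorRecord("2016-03-12 10:15:32", "India", "Karnataka", "Bangalore", "ACTIVE"),
      SensorRecord("2016-03-12 10:16:01", "India", "Karnataka", "Mysore", "FAILED"),
      SensorRecord("2016-03-12 10:16:05", "USA", "California", "San Jose", "ACTIVE"))

    assert(StatsAggregator.countryCounts(records) === Map("India" -> 2L, "USA" -> 1L))
  }
}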
Manual Clock
● A clock whose time can be set and modified manually
● Its reported time does not advance as wall-clock time elapses
● Only the caller has control over it
● Used specifically for testing
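A sketch of the hack, assuming Spark 1.x where the clock implementation is selected through the undocumented spark.streaming.clock setting; the exact class name has moved between Spark versions, and advancing the clock requires a small wrapper placed under an org.apache.spark package to reach the private scheduler API:

import org.apache.spark.SparkConf

val testConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("sensoranalytics-test")
  // Replace the system clock with Spark's ManualClock so the test controls
  // batch boundaries deterministically.
  .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")

// In a test: push data into the input stream, advance the manual clock by one
// batch interval through the wrapper, then assert on the produced output.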
Phase 5 - Hands On
Git branch : unittest
Next steps
● Use better serialization frameworks like Avro
● Enable checkpointing (sketch below)
● Integrate Kafka monitoring tools
● Add support for multiple Kafka topics
● Write more tests for all functionality
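Checkpointing, for instance, mostly needs a reliable directory and a context factory; the path below is hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SensorAnalytics")
val checkpointDir = "hdfs://namenode:8020/checkpoints/sensoranalytics"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  // ... build the Kafka -> aggregation -> Cassandra pipeline here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recover from the checkpoint on restart, or build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)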
