Stream Data Processing at Big Data Landscape by Oleksandr Fedirko

Confidential
Stream data processing at Big Data landscape

Intro
Meet your speaker today:
Oleksandr Fedirko - CEE Head of Big Data Practice

High level agenda
Streaming basics
Types of stream systems
Typical architectures and use cases
Main considerations on a project with stream processing
Stream processing tools overview
Case study
Q&A session

Streaming basics

Streaming basics
Types of streaming operations
- Stateful
  - Aggregation
  - Join
  - Sorting
- Stateless
  - Filter
  - Map
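
The difference between the two groups can be sketched in plain Python generators. This is only an illustration, not the API of any particular engine; the function names and the sample events are assumptions made for the example.

```python
from collections import defaultdict

def filter_stateless(events, predicate):
    """Stateless: each record is judged on its own, nothing is remembered."""
    for event in events:
        if predicate(event):
            yield event

def map_stateless(events, fn):
    """Stateless: each record is transformed independently of the others."""
    for event in events:
        yield fn(event)

def running_count_stateful(events):
    """Stateful aggregation: per-key counters must survive between records."""
    counts = defaultdict(int)  # this dictionary is the operator's state
    for key in events:
        counts[key] += 1
        yield key, counts[key]

# Hypothetical sample events.
events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 50},
    {"user": "a", "amount": 70},
    {"user": "a", "amount": 90},
]

large = filter_stateless(events, lambda e: e["amount"] >= 50)
users = map_stateless(large, lambda e: e["user"])
result = list(running_count_stateful(users))
print(result)
```

The filter and map steps could be restarted at any record without loss, while the counting step would have to restore its dictionary first. That is why stateful operations are the ones that need checkpointing in a real engine.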

Streaming basics
Types of streaming sources
- Bounded
  - Database
  - Flat file
  - Key-value storage
- Unbounded
  - Queue
  - Port
  - Socket
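
A rough way to picture the two kinds of sources in Python: a bounded source is an iterator that eventually ends, while an unbounded one never signals completion, so the consumer has to decide how much to read. The function names below are hypothetical and no real queue or socket is involved.

```python
import itertools

def bounded_source(rows):
    """Bounded: a finite data set (for example a flat file) that ends."""
    for row in rows:
        yield row

def unbounded_source():
    """Unbounded: a source (for example a queue) that never signals the end."""
    sequence = 0
    while True:
        yield {"seq": sequence}
        sequence += 1

# A bounded source can simply be read to completion.
file_rows = list(bounded_source(["line 1", "line 2", "line 3"]))

# An unbounded source must be cut off by the consumer; list() would never return.
first_events = list(itertools.islice(unbounded_source(), 3))

print(file_rows)
print(first_events)
```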

Types of stream systems

13
Conﬁdential
Micro-batches vs real-time streaming
Micro-batches
- Most of the tools/frameworks work under this paradigm
- Widely used, mature ecosystem
Real-time streaming
- Better performance with stateless operations
- Can fulfill particular use cases where low latency is a must
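
The two paradigms can be contrasted with a small Python sketch. One simplifying assumption: the batches here are cut by record count so the example is deterministic, whereas micro-batch engines usually cut them by time interval. All names are illustrative.

```python
def record_at_a_time(events, handle):
    """Real-time style: each record is handled as soon as it arrives."""
    for event in events:
        yield handle(event)

def micro_batches(events, batch_size):
    """Micro-batch style: arriving records are grouped into small batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, partially filled batch
        yield batch

readings = [1, 2, 3, 4, 5, 6, 7]

per_record = list(record_at_a_time(readings, lambda x: x * 2))
per_batch = [sum(batch) for batch in micro_batches(readings, batch_size=3)]

print(per_record)
print(per_batch)
```

In the micro-batch version no result is available until a batch fills up, which is where the extra latency comes from.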

Compositional vs Declarative engines
In a compositional stream processing engine, developers define the Directed
Acyclic Graph (DAG) in advance and then process the data. This may simplify the
code, but it also means developers need to plan their architecture carefully to
avoid inefficient processing.
Challenges: Compositional stream processing engines are considered the "first
generation" of stream processing and can be complex and difficult to manage.
Examples: Compositional engines include Samza, Apex, and Apache Storm.

Compositional vs Declarative engines
Developers use declarative engines to chain stream processing functions. The
engine calculates the DAG as it ingests the data. Developers can also specify the
DAG explicitly in their code, and the engine optimizes it on the fly.
Challenges: While declarative engines are easier to manage and have readily
available managed service options, they still require a major investment in
data engineering to set up the data pipeline, from source to eventual storage and
analysis.
Examples: Declarative engines include Apache Spark and Flink, both of which are
available as managed offerings.
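
The idea of chaining functions and letting the engine derive and optimize the plan can be sketched as follows. The `Stream` class is hypothetical and is not the API of Spark or Flink; its "optimizer" does only one thing, fusing consecutive map steps, to show that the executed plan may differ from the one the developer wrote.

```python
class Stream:
    """Hypothetical declarative pipeline: method calls only record a plan."""

    def __init__(self, plan=()):
        self.plan = tuple(plan)

    def map(self, fn):
        return Stream(self.plan + (("map", fn),))

    def filter(self, fn):
        return Stream(self.plan + (("filter", fn),))

    def optimized_plan(self):
        """Toy optimizer: fuse consecutive map steps into a single step."""
        fused = []
        for kind, fn in self.plan:
            if kind == "map" and fused and fused[-1][0] == "map":
                previous = fused.pop()[1]
                fused.append(("map", lambda x, f=previous, g=fn: g(f(x))))
            else:
                fused.append((kind, fn))
        return fused

    def run(self, records):
        """Execute the optimized plan record by record."""
        plan = self.optimized_plan()
        for record in records:
            keep = True
            for kind, fn in plan:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

pipeline = (
    Stream()
    .map(lambda x: x + 1)
    .map(lambda x: x * 10)
    .filter(lambda x: x > 20)
)
output = list(pipeline.run([0, 1, 2, 3]))
print(len(pipeline.plan), len(pipeline.optimized_plan()), output)
```

Nothing is computed while the chain is being built; the work starts only when `run` is called, which is what gives the engine room to rearrange the plan.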

Typical architectures and use cases

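[Architecture diagram 1, components: Source 1, Source 2, Source 3; Ingestion; Stream processing; Queue; Data Lake]

[Architecture diagram 2, components: Source 1, Source 2, Source 3; Stream processing; Queue; Data Lake]

[Architecture diagram 3, components: Source 1, Source 2, Source 3; Stream processing; Queue; Key-value/Columnar storage]

[Architecture diagram 4, components: Source 1, Source 2, Source 3; Stream processing; Queue]

[Architecture diagram 5, components: Source 1, Source 2, Source 3; Stream processing; Queue; DB/Cache/API call]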

Main considerations on a project with stream processing

Main considerations on a project with stream processing
Consider the following non-functional requirements (NFRs):
● Records per second, average
● Records per second, maximum (spike)
● Spike duration
● Record size, 95th percentile
● Record size, maximum (largest 1% of records)
● Latency
● Delivery semantics: exactly-once, at-least-once, or at-most-once
● Late arrivals
● Static/dynamic streams
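
Several of these figures can be estimated from a sample of production traffic before choosing a tool. The sketch below is one possible way to do it, assuming arrival timestamps in seconds and record sizes in bytes; the sample values are made up, and the nearest-rank percentile is just one common definition.

```python
import math
from collections import Counter

def throughput_profile(arrival_seconds):
    """Return (average, peak) records per second over the observed span."""
    per_second = Counter(int(t) for t in arrival_seconds)
    span = max(per_second) - min(per_second) + 1
    return len(arrival_seconds) / span, max(per_second.values())

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for the 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up sample: arrival times in seconds and record sizes in bytes.
arrivals = [0.1, 0.5, 1.2, 1.3, 1.4, 1.9, 3.0, 3.7]
sizes = [100, 120, 110, 130, 90, 105, 115, 125, 95, 4000]

avg_rps, peak_rps = throughput_profile(arrivals)
median_size = percentile(sizes, 50)
p95_size = percentile(sizes, 95)

print(avg_rps, peak_rps, median_size, p95_size)
```

In this sample a single oversized record dominates the upper percentile, which is why the typical and the maximum record size are worth tracking separately.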

Stream processing tools overview

Apache Spark
Spark is an open-source, distributed, general-purpose cluster computing
framework. Spark's in-memory data processing engine performs
analytics, ETL, machine learning, and graph processing on data in motion
or at rest. It offers high-level APIs for Python, Java, Scala, R, and SQL.
The Apache Spark architecture is founded on Resilient Distributed
Datasets (RDDs). These are immutable distributed collections of data, which
are split into partitions and allocated to workers. The executors on the
worker nodes process the data. Because an RDD is immutable, the worker
nodes cannot alter it; they process the information and output the results
as new data.
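
The immutability described above can be illustrated with a toy class. This is not Spark and does not use the Spark API; `ToyRDD` is a hypothetical stand-in whose method names merely echo the RDD vocabulary, and the "workers" are simulated by looping over partitions in one process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples make the partitions themselves immutable.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split the data round-robin into the requested number of partitions."""
        data = list(data)
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each "worker" processes one partition; the result is a new ToyRDD.
        return ToyRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def collect(self):
        """Gather all partitions back into one list."""
        return [x for p in self._partitions for x in p]

numbers = ToyRDD.parallelize(range(6), num_partitions=2)
doubled = numbers.map(lambda x: x * 2)

print(sorted(numbers.collect()))
print(sorted(doubled.collect()))
```

The original collection is unchanged after the transformation, so a lost partition could always be recomputed from its parent.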

Pros: Apache Spark is a mature product with a large community, proven
in production for many use cases, and readily supports SQL querying.
Cons:
● Spark can be complex to set up and implement
● It is not a true streaming engine (it performs very fast batch processing)
● Limited language support
● Latency of a few seconds, which rules out some real-time analytics use cases

Apache Storm
Apache Storm has very low latency and is suitable for near-real-time
processing workloads. It processes large quantities of data and provides
results with lower latency than most other solutions.
The Apache Storm architecture is founded on spouts and bolts. Spouts
are the sources of data and pass it to one or more bolts. Bolts process
the data and can pass it on to other bolts, and the entire topology forms a
DAG. Developers define how the spouts and bolts are connected.
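
The spout-and-bolt wiring can be pictured with a small single-process sketch. It does not use Storm's API; `run_topology` and the bolt functions are hypothetical, and the point is only that the developer spells out the graph edges by hand.

```python
from collections import defaultdict, deque

def run_topology(spout, bolts, edges):
    """Push every item from the spout through the explicitly wired bolts."""
    outputs = defaultdict(list)
    queue = deque(("spout", item) for item in spout())
    while queue:
        node, item = queue.popleft()  # item was emitted by this node
        targets = edges.get(node, [])
        if not targets:
            outputs[node].append(item)  # terminal bolt: collect its output
        for target in targets:
            for emitted in bolts[target](item):
                queue.append((target, emitted))
    return dict(outputs)

def sentence_spout():
    yield "to be or not to be"

def split_bolt(sentence):
    return sentence.split()

def upper_bolt(word):
    return [word.upper()]

def length_bolt(word):
    return [len(word)]

# The developer defines the DAG: spout -> split -> (upper, length).
edges = {"spout": ["split"], "split": ["upper", "length"]}
bolts = {"split": split_bolt, "upper": upper_bolt, "length": length_bolt}

results = run_topology(sentence_spout, bolts, edges)
print(results)
```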

Pros:
● Probably the best technical solution for true real-time processing
● Use of micro-batches provides flexibility in adapting the tool to different use cases
● Very wide language support
Cons:
● Does not guarantee message ordering, which may compromise reliability
● Highly complex to implement

Apache Flink
Flink is based on the concept of streams and transformations. Data
comes into the system via a source and leaves via a sink. Flink jobs are
typically built with Apache Maven: a Maven skeleton project comes with the
packaging requirements and dependencies already set up, so the developer
only needs to add custom code.
Apache Flink is a stream processing framework that also handles batch
tasks. Flink treats batches as data streams with finite boundaries.
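
The source, transformation, and sink model, and the view of a batch as a stream that happens to end, can be sketched in plain Python. This is not Flink's API; `run_job`, `sensor_stream`, and `scale` are hypothetical names used only for illustration.

```python
import itertools

def run_job(source, transformations, sink):
    """Pull records from the source one by one, transform them, hand to the sink."""
    for record in source:
        for transform in transformations:
            record = transform(record)
        sink(record)

def sensor_stream():
    """Hypothetical unbounded source of integer readings."""
    for reading in itertools.count(start=20):
        yield reading

def scale(reading):
    return reading * 10

batch_sink, stream_sink = [], []

# A batch is just a source with a finite boundary.
run_job([20, 21, 22], [scale], batch_sink.append)

# The same job runs on a stream; it is cut off here only so the example ends.
run_job(itertools.islice(sensor_stream(), 3), [scale], stream_sink.append)

print(batch_sink, stream_sink)
```

The job definition is identical in both runs; only the source differs.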

Pros:
● Stream-first approach offers low latency and high throughput
● True record-by-record processing
● Does not require manual optimization and tuning for the data it processes
● Dynamically analyzes and optimizes tasks
Cons:
● Some scaling limitations
● A relatively new project with fewer production deployments than other frameworks

Stream processing tools overview (cloud)
● AWS Kinesis
● GCP Dataflow
● Azure Stream Analytics
When do we use a Lambda-like application instead of the services above?
When the logic is simple and very lightweight.
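
As an illustration of such lightweight logic, here is a minimal sketch of a function handler that filters records. It assumes an AWS Lambda function triggered by a Kinesis stream, where each record's payload arrives base64-encoded under `Records[].kinesis.data`; the JSON payload and its `level` field are hypothetical.

```python
import base64
import json

def handler(event, context):
    """Decode each Kinesis record and keep only the error events."""
    errors = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("level") == "ERROR":  # "level" is an assumed field
            errors.append(payload)
    return {"error_count": len(errors), "errors": errors}

# Local usage example with a hand-built event of the same shape.
sample_payloads = (
    {"level": "INFO", "msg": "started"},
    {"level": "ERROR", "msg": "disk full"},
)
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(json.dumps(p).encode()).decode()}}
        for p in sample_payloads
    ]
}
result = handler(sample_event, None)
print(result)
```

There is no state, no windowing, and no join here, which is roughly the level of complexity where a function is enough and a full streaming service would be overkill.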

Case study

Case study (CEP for custom DSL)
[Architecture diagram with three columns]
Message Queues: Raw events, Parsed events, Canonically parsed events, Indicators, Incidents
Processing Engines: Archive job, Parse job, Index job, Index job, Rules job, Save incident job
Sink Storages: Archive storage, Primary storage, Secondary storage, Application storage

FAQ
I have a custom Java-based application that consumes messages from Kafka.
Is it a stream or not?
If I have one message per day in my Kafka topic, could it be considered a
stream?
I love my Kafka Streams API. Why didn't you cover it?
I have a … tool on my project. Why didn't you mention it today?
Did you cover everything stream-related today? Am I a stream master
after this event?

Q&A session
