Confidential
Stream data processing at
BigData landscape
Intro
Meet your speaker today:
Oleksandr Fedirko - CEE Head of Big Data Practice
High level agenda
Streaming basics
Types of stream systems
Typical architectures and use cases
Main consideration on a project with Stream processing
Stream processing tools overview
Case study
Q&A session
Streaming basics
Types of streaming operations
- Stateful
  - Aggregation
  - Join
  - Sorting
- Stateless
  - Filter
  - Map
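The difference between the two groups can be sketched in a few lines of plain Python. This is only an illustration, not the API of any streaming framework; the event fields (`user`, `amount`) and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical input events, invented for this illustration.
events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 12},
    {"user": "a", "amount": 7},
]

def stateless_pipeline(stream):
    """Filter + map: every record is handled on its own, nothing is remembered."""
    for event in stream:
        if event["amount"] > 6:                                      # filter
            yield {**event, "amount_cents": event["amount"] * 100}   # map

def stateful_sum(stream):
    """Aggregation: a running total per user has to survive between records."""
    totals = defaultdict(int)  # this dictionary is the operator's state
    for event in stream:
        totals[event["user"]] += event["amount"]
        yield event["user"], totals[event["user"]]

filtered = list(stateless_pipeline(events))
running = list(stateful_sum(events))
print(filtered)
print(running)
```

The stateless pipeline can be restarted or parallelised freely, while the stateful one has to keep (and, in a real system, checkpoint) its `totals` dictionary.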
Types of streaming sources
- Bounded
  - Database
  - Flat file
  - Key-value storage
- Unbounded
  - Queue
  - Port
  - Socket
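A rough way to picture the two kinds of sources is with Python generators: a bounded source finishes by itself, while an unbounded one never signals an end, so the consumer decides how much to read. The function names and the `seq` field are assumptions made for this sketch.

```python
import itertools

def bounded_source(rows):
    """A bounded source (for example a flat file or a table) ends by itself."""
    yield from rows

def unbounded_source():
    """An unbounded source (for example a queue or a socket) never ends."""
    for n in itertools.count():
        yield {"seq": n}

# A bounded source can be read to the end.
all_rows = list(bounded_source(["r1", "r2", "r3"]))

# An unbounded source can only be sampled; list(unbounded_source()) would never return.
first_three = list(itertools.islice(unbounded_source(), 3))

print(all_rows)
print(first_three)
```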
Types of stream systems
Micro-batches vs real-time streaming
Micro-batches
- Most of the tools/frameworks work under this paradigm
- Widely used, mature ecosystem
Real-time streaming
- Better performance with stateless operations
- Can fulfill particular use cases where low latency is a must
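The two paradigms can be contrasted with a small plain-Python sketch. It is a simplification: real micro-batch engines usually cut batches by time interval, whereas here, as an assumption to keep the example deterministic, batches are cut by record count.

```python
def micro_batches(stream, batch_size):
    """Group incoming records into small batches and hand over one batch at a time."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield batch

def per_record(stream, fn):
    """Record-at-a-time processing: each record is handled as soon as it arrives."""
    for record in stream:
        yield fn(record)

readings = [1, 2, 3, 4, 5]  # hypothetical sensor readings
batch_sums = [sum(batch) for batch in micro_batches(readings, 2)]
doubled = list(per_record(readings, lambda r: r * 2))
print(batch_sums)
print(doubled)
```

In the micro-batch variant a record waits until its batch is full, which is where the extra latency comes from.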
Compositional vs Declarative engines
In compositional stream processing engines, developers define the Directed
Acyclic Graph (DAG) in advance and then process the data. This may simplify code,
but it also means developers need to plan their architecture carefully to avoid
inefficient processing.
Challenges: Compositional engines are considered the “first generation”
of stream processing and can be complex and difficult to manage.
Examples: Compositional engines include Apache Samza, Apache Apex and Apache Storm.
Developers use declarative engines to chain stream processing functions. The
engine calculates the DAG as it ingests the data. Developers can specify the DAG
explicitly in their code, and the engine optimizes it on the fly.
Challenges: While declarative engines are easier to manage, and have
readily-available managed service options, they still require major investments in
data engineering to set up the data pipeline, from source to eventual storage and
analysis.
Examples: Declarative engines include Apache Spark and Apache Flink, both of which are
available as managed offerings.
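The difference in programming style can be illustrated with a toy sketch in plain Python. Neither class below is the API of Storm, Spark or Flink; `Node` and `Stream` are hypothetical names invented here. In the compositional style the developer wires the graph by hand; in the declarative style the developer chains functions and the engine derives the plan.

```python
class Node:
    """Compositional style: the developer creates nodes and connects them explicitly."""
    def __init__(self, fn):
        self.fn = fn
        self.downstream = []

    def connect(self, other):
        self.downstream.append(other)
        return other

    def emit(self, record, out):
        result = self.fn(record)
        if result is None:           # None means "record dropped"
            return
        if not self.downstream:      # last node in the graph collects the output
            out.append(result)
        for node in self.downstream:
            node.emit(result, out)


class Stream:
    """Declarative style: the developer chains functions, the engine builds the plan."""
    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def map(self, fn):
        return Stream(self.ops + (("map", fn),))

    def filter(self, predicate):
        return Stream(self.ops + (("filter", predicate),))

    def plan(self):
        return [kind for kind, _ in self.ops]

    def run(self, records):
        out = []
        for record in records:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


data = [1, 2, 3, 4]

# Compositional: build and wire the DAG by hand.
scale = Node(lambda x: x * 10)
keep_large = Node(lambda x: x if x > 15 else None)
scale.connect(keep_large)
compositional_out = []
for record in data:
    scale.emit(record, compositional_out)

# Declarative: describe the transformations; the plan is derived from the chain.
job = Stream().map(lambda x: x * 10).filter(lambda x: x > 15)
declarative_out = job.run(data)

print(compositional_out, declarative_out, job.plan())
```

A real declarative engine would also optimise the derived plan; this sketch only shows where the plan comes from.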
Typical architectures and use cases
[Diagram 1: Source 1, Source 2, Source 3; Ingestion; Stream processing; Queue; Data Lake]
[Diagram 2: Source 1, Source 2, Source 3; Stream processing; Queue; Data Lake]
[Diagram 3: Source 1, Source 2, Source 3; Stream processing; Queue; Key-value/Columnar storage]
[Diagram 4: Source 1, Source 2, Source 3; Stream processing; Queue]
[Diagram 5: Source 1, Source 2, Source 3; Stream processing; Queue; DB/Cache/API call]
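As a rough sketch of how the components named on these slides could fit together, the snippet below wires three sources through an ingestion step, a queue and a stream processing step into a data lake. The order of the components is an assumption (the slides list the boxes but the arrows did not survive), and an in-memory `queue.Queue` and a Python list stand in for a real message queue and data lake.

```python
import queue

def ingest(sources, q):
    """Ingestion: read every source and publish its records to the queue."""
    for source in sources:
        for record in source:
            q.put(record)
    q.put(None)  # sentinel so the consumer in this toy example knows when to stop

def stream_process(q, data_lake):
    """Stream processing: consume from the queue, transform, write to the sink."""
    while True:
        record = q.get()
        if record is None:
            break
        data_lake.append(record.upper())  # stand-in transformation

# Hypothetical records from three sources.
sources = [["a1", "a2"], ["b1"], ["c1"]]
q = queue.Queue()
data_lake = []

ingest(sources, q)
stream_process(q, data_lake)
print(data_lake)
```

The queue decouples producers from the processing step, which is the role it plays in each of the layouts above.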
Main consideration on a project
with Stream processing
Think about the following NFRs (non-functional requirements):
● Records per second, average
● Records per second, maximum (spike)
● Spike duration
● Record size, 95th percentile
● Record size, maximum (largest 1% of records)
● Latency
● Delivery semantics: exactly-once / at-least-once / at-most-once
● Late arrivals
● Static/dynamic streams
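Several of these numbers combine into simple back-of-the-envelope sizing. The sketch below shows one possible way to do the arithmetic; all input values (rates, sizes, capacity) are hypothetical and the function names are invented for the example.

```python
def required_bandwidth_bytes(records_per_sec, record_size_bytes):
    """Bytes per second the pipeline must sustain at a given record rate."""
    return records_per_sec * record_size_bytes

def spike_backlog(spike_rps, capacity_rps, spike_seconds):
    """Records that pile up in the queue while the spike exceeds capacity."""
    return max(0, spike_rps - capacity_rps) * spike_seconds

def drain_seconds(backlog, capacity_rps, avg_rps):
    """Time needed to work off the backlog once traffic is back to average."""
    spare = capacity_rps - avg_rps
    if spare <= 0:
        raise ValueError("capacity must exceed the average rate to ever catch up")
    return backlog / spare

# Hypothetical NFR values.
avg_rps = 2_000          # records per second, average
spike_rps = 10_000       # records per second, maximum (spike)
spike_seconds = 60       # spike duration
record_size_p95 = 1_024  # bytes, 95th percentile record size
capacity_rps = 5_000     # what the processing job is provisioned for

peak_bandwidth = required_bandwidth_bytes(spike_rps, record_size_p95)
backlog = spike_backlog(spike_rps, capacity_rps, spike_seconds)
catch_up = drain_seconds(backlog, capacity_rps, avg_rps)
print(peak_bandwidth, backlog, catch_up)
```

The catch-up time adds directly to end-to-end latency during and after a spike, so it should be checked against the latency requirement.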
Stream processing tools overview
Apache Spark
Spark is an open-source distributed general-purpose cluster computing
framework. Spark’s in-memory data processing engine conducts
analytics, ETL, machine learning and graph processing on data in motion
or at rest. It offers high-level APIs for the programming languages: Python,
Java, Scala, R, and SQL.
The Apache Spark architecture is founded on Resilient Distributed
Datasets (RDDs). These are distributed, immutable collections of data that
are split into partitions and allocated to workers. The executors on the
workers process the data. Because an RDD is immutable, the worker nodes
cannot alter it; they process the information and output results.
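The idea of an immutable, partitioned dataset can be imitated in plain Python. The class below is a toy stand-in, not Spark's RDD API; `MiniDataset` and its methods are hypothetical names chosen for this sketch.

```python
class MiniDataset:
    """Toy imitation of an immutable, partitioned dataset (not the Spark API)."""

    def __init__(self, partitions):
        # Tuples cannot be modified after creation, which mirrors immutability.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, items, num_partitions):
        # Round-robin split: partition i receives items i, i + n, i + 2n, ...
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each partition is transformed independently (as a worker would do)
        # and the result is a NEW dataset; the original is left untouched.
        return MiniDataset(tuple(fn(x) for x in part) for part in self._partitions)

    def collect(self):
        return [x for part in self._partitions for x in part]


numbers = MiniDataset.from_list([1, 2, 3, 4, 5], num_partitions=2)
squares = numbers.map(lambda x: x * x)
print(numbers.collect())
print(squares.collect())
```

Because a transformation never changes its input, a lost partition can be recomputed from its parent, which is the basis of the "resilient" part of the name.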
Pros: Apache Spark is a mature product with a large community, proven
in production for many use cases, and readily supports SQL querying.
Cons:
● Spark can be complex to set up and implement
● It is not a true streaming engine (it performs very fast batch
processing)
● Limited language support
● Latency of a few seconds, which eliminates some real-time analytics
use cases
Apache Storm
Apache Storm has very low latency and is suitable for near-real-time
processing workloads. It processes large quantities of data and provides
results with lower latency than most other solutions.
The Apache Storm architecture is founded on spouts and bolts. Spouts
are the origins of the data and pass it to one or more bolts. Bolts
process the data and can pass their output on to other bolts, and the
entire topology forms a DAG. Developers define how the spouts and bolts
are connected.
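The spout and bolt roles can be mimicked with Python generators. This is a conceptual sketch only, not Storm's actual API; the function names and sentences are invented for the illustration.

```python
def sentence_spout():
    """Spout: the origin of the data in the topology."""
    yield from ["storm is fast", "storm is a dag"]

def split_bolt(sentences):
    """Bolt: receives sentences and emits one word per tuple."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt: receives words from the previous bolt and keeps a count per word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The developer decides the wiring: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
print(word_counts)
```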
Pros:
● Probably the best technical solution for true real-time processing
● Use of micro-batches provides flexibility in adapting the tool for
different use cases
● Very wide language support
Cons:
● Does not guarantee ordering of messages, may compromise
reliability
● Highly complex to implement
Apache Flink
Flink is based on the concept of streams and transformations. Data
comes into the system via a source and leaves via a sink. A Flink job is
typically built with Apache Maven: a Maven skeleton project provides the
packaging requirements and dependencies, so the developer only has to
add custom code.
Apache Flink is a stream processing framework that also handles batch
tasks. Flink treats a batch as a data stream with finite boundaries.
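The "batch is a bounded stream" idea can be shown with a small plain-Python sketch: the same job, written once as source, transformations and sink, runs on a finite input and on a slice of an endless one. This is not Flink's API; `run_job` and the sample data are assumptions for the illustration.

```python
import itertools

def run_job(source, sink):
    """One job definition: source -> filter -> map -> sink."""
    for value in source:
        if value % 2 == 0:              # transformation: keep even values
            sink.append(value * value)  # transformation: square, then write to the sink

# Batch: the source is a stream with finite boundaries.
batch_sink = []
run_job(iter([1, 2, 3, 4]), batch_sink)

# Streaming: the source is endless; only its first six elements are taken here
# so that the example terminates.
stream_sink = []
run_job(itertools.islice(itertools.count(1), 6), stream_sink)

print(batch_sink)
print(stream_sink)
```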
Pros:
● Stream-first approach offers low latency, high throughput
● Real entry-by-entry processing
● Does not require manual optimization and adjustment to data it
processes
● Dynamically analyzes and optimizes tasks
Cons:
● Some scaling limitations
● A relatively new project with fewer production deployments than other
frameworks
Stream processing tools overview (cloud)
● AWS Kinesis
● GCP Dataflow
● Azure Stream Analytics
When do we use a Lambda-like application instead of the services above?
For very lightweight, simple logic.
Case study
Case study (CEP for custom DSL)
[Diagram with three columns]
Message Queues: Raw events, Parsed events, Canonically parsed events, Indicators, Incidents
Processing Engines: Archive job, Parse job, Index job, Index job, Rules job, Save incident job
Sink Storages: Archive storage, Primary storage, Secondary storage, Application storage
FAQ
I have a custom Java-based application that consumes messages
from Kafka. Is it streaming or not?
If I have 1 message per day in my Kafka topic, can it be considered a
stream?
I love my Kafka Streams API. Why didn’t you cover it?
I have a … tool on my project. Why didn’t you mention it today?
Did you cover everything stream-related today? Am I a stream master
after this event?
Q&A session
Thank you!

Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
