1
Tale of two stream processing frameworks
Apache Storm & Apache Flink
Karthik Deivasigamani
@WalmartLabs
2
Streaming
• Stream
– Continuous flow
• Streaming Data
– Streaming data is data that is continuously
generated by different sources.
– Unbounded data
• Stream Processing
– processing of data in motion, or in other
words, computing on data directly as it is
produced or received
– data processing engine that is designed with
infinite data sets in mind
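To make "computing on data directly as it is produced" concrete, here is a minimal sketch in plain Python. It does not use Storm or Flink; `sensor_readings` is a hypothetical source that never ends, and `running_mean` emits an updated result after every event instead of waiting for a batch to close.

```python
import itertools
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source: yields 1, 2, 3, ... forever."""
    return itertools.count(start=1)


def running_mean(stream: Iterable[int]) -> Iterator[float]:
    """Emit an updated mean after every event, without waiting for the end."""
    total = 0
    for count, value in enumerate(stream, start=1):
        total += value
        yield total / count


# The source is infinite, so we only look at the first few results.
first_five = list(itertools.islice(running_mean(sensor_readings()), 5))
print(first_five)  # [1.0, 1.5, 2.0, 2.5, 3.0]
```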
3
Retail Data
• Catalog Data
• Pricing Data
• Clickstream logs
• Payments
• Order Data
• Inventory
• Delivery Logistics
4
Not so long ago...
• Data submitted as feeds
• Periodic Data Collection
• Data Processed In Batches
• Runs offline
• Delay between actual time &
processing time
• Failures
5
Need For Speed – Fast Data
• Catalog Updates
• Price Updates
• Fraud Detection
• Out of stock
• Delivery alerts
• Personalization
6
7
Catalog Use Case
8
Catalog Functions
• Normalization
• Classification
• Product Matching
• Shelving
• Attribute Extraction
• Grouping
• Image
9
Characteristics of ingestion pipeline
• Zero message loss
• Fault Tolerance
• Source based priority queue
• Scale to millions of product updates/hour
• Near Real Time Updates
• Checkpoint at various stages
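One way to read "source based priority queue" is that updates from some sources should jump ahead of others. The sketch below is an assumption about what that could look like, not the pipeline's actual design; the source names and their priorities in `SOURCE_PRIORITY` are made up for illustration.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Hypothetical sources and priorities: a lower number is served first.
SOURCE_PRIORITY = {"manual_fix": 0, "seller_api": 1, "bulk_feed": 2}


class SourcePriorityQueue:
    """Serves updates by source priority, FIFO within the same priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def put(self, source: str, update: Any) -> None:
        priority = SOURCE_PRIORITY[source]
        heapq.heappush(self._heap, (priority, next(self._seq), source, update))

    def get(self) -> Tuple[str, Any]:
        _, _, source, update = heapq.heappop(self._heap)
        return source, update

    def __len__(self) -> int:
        return len(self._heap)


queue = SourcePriorityQueue()
queue.put("bulk_feed", "A")
queue.put("seller_api", "B")
queue.put("bulk_feed", "C")
queue.put("manual_fix", "D")
drained = [queue.get() for _ in range(len(queue))]
print(drained)
```

The manual fix is served first, then the seller update, then the two bulk-feed updates in the order they arrived.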
10
Apache Storm
• Created by Nathan Marz
• Stream Abstraction
• Spouts, Bolts, Topology
• Trident
• Kafka Integration
• Message processing
guarantees
11
Storm Cluster
• Nimbus
– distributing code
– assigning tasks to machines
– monitoring for failures
• Supervisor
– communicates with Nimbus
through Zookeeper
– starts and stops workers
according to signals from Nimbus
• Zookeeper
– Coordinates the storm cluster
12
Key Concepts
• Tuples
– Named list of values where each
value can be any type.
• Stream
– unbounded sequence of tuples
• Spout
– sources of streams in a
computation
• Bolts
– process input streams and
produce output streams
• Topology
– DAG - network of spouts and
bolts
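The concepts above can be sketched with plain Python generators. This is a toy model, not the Storm API: tuples are dictionaries (named values), the spout is a generator that produces a stream, each bolt consumes one stream and produces another, and the topology is simply how they are wired together. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Tuple_ = Dict[str, object]  # a "tuple" here is a named set of values


def sentence_spout() -> Iterator[Tuple_]:
    """Spout: the source of a stream (finite here so the example terminates)."""
    for sentence in ("storm and flink", "flink and kafka"):
        yield {"sentence": sentence}


def split_bolt(stream: Iterable[Tuple_]) -> Iterator[Tuple_]:
    """Bolt: consumes sentence tuples, emits one tuple per word."""
    for tup in stream:
        for word in str(tup["sentence"]).split():
            yield {"word": word}


def count_bolt(stream: Iterable[Tuple_]) -> Iterator[Tuple_]:
    """Bolt: keeps a running count per word and emits the updated count."""
    counts: Counter = Counter()
    for tup in stream:
        word = str(tup["word"])
        counts[word] += 1
        yield {"word": word, "count": counts[word]}


# Topology: spout -> split bolt -> count bolt (a linear DAG).
results = list(count_bolt(split_bolt(sentence_spout())))
for tup in results:
    print(tup)
```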
13
Stream Grouping
• Shuffle Grouping
• Fields Grouping
• All grouping
• Global Grouping
• Local or Shuffle grouping
• Direct Grouping
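A grouping decides which task of the downstream bolt receives a tuple. The sketch below models four of the groupings as plain functions returning task indexes; it is not Storm code. Round-robin stands in for Storm's random-but-even shuffle, and CRC32 is used only because it is a deterministic hash that is easy to reproduce.

```python
import itertools
import zlib
from typing import Callable, List

NUM_TASKS = 4  # assumed number of tasks of the downstream bolt


def make_shuffle_grouping(num_tasks: int = NUM_TASKS) -> Callable[[], int]:
    """Shuffle: spread tuples evenly (round-robin stand-in for random)."""
    cycle = itertools.cycle(range(num_tasks))
    return lambda: next(cycle)


def fields_grouping(key: str, num_tasks: int = NUM_TASKS) -> int:
    """Fields: the same field value always goes to the same task."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def all_grouping(num_tasks: int = NUM_TASKS) -> List[int]:
    """All: every task receives a copy of the tuple."""
    return list(range(num_tasks))


def global_grouping() -> int:
    """Global: the whole stream goes to one task (the lowest id)."""
    return 0


shuffle = make_shuffle_grouping()
print([shuffle() for _ in range(5)])  # [0, 1, 2, 3, 0]
print(fields_grouping("sku-1") == fields_grouping("sku-1"))  # True
```

Fields grouping is the one to reach for when a bolt keeps per-key state, since all updates for one product land on the same task.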
14
Parallelism of a Storm Topology
• Worker processes
– Executes a subset of a topology
• Executors (Threads)
– Is a thread that is spawned by a
worker process.
– It may run one or more tasks for
the same component (spout or
bolt).
• Tasks
– Perform the actual data processing;
each spout or bolt that you
implement in your code executes as
many tasks across the cluster
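The relationship between tasks and executors can be illustrated with a toy assignment function. This is not Storm's scheduler; it only shows the idea that a component's tasks are spread over its executors, so with 4 tasks and 2 executors each thread runs 2 tasks. `assign_tasks` is a hypothetical helper.

```python
from typing import List


def assign_tasks(num_tasks: int, num_executors: int) -> List[List[int]]:
    """Spread task ids over executors round-robin (illustrative only)."""
    if num_executors < 1 or num_executors > num_tasks:
        raise ValueError("need 1 <= num_executors <= num_tasks")
    executors: List[List[int]] = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors


print(assign_tasks(num_tasks=4, num_executors=2))  # [[0, 2], [1, 3]]
```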
15
Guaranteeing Message Processing
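Storm's documentation describes tracking each spout tuple's tree with a single running XOR of tuple ids: an id is XORed in when a tuple is emitted and again when it is acked, so the value returns to zero once every tuple in the tree has been acked. The sketch below models only that bookkeeping. The `AckTracker` class is illustrative, and the fixed ids stand in for the random 64-bit ids Storm uses.

```python
from typing import Dict


class AckTracker:
    """Tracks tuple trees with one XOR value per root (spout) tuple."""

    def __init__(self) -> None:
        self._ack_val: Dict[str, int] = {}

    def _xor(self, root_id: str, tuple_id: int) -> None:
        self._ack_val[root_id] = self._ack_val.get(root_id, 0) ^ tuple_id

    def emit(self, root_id: str, tuple_id: int) -> None:
        """A tuple anchored to root_id was emitted."""
        self._xor(root_id, tuple_id)

    def ack(self, root_id: str, tuple_id: int) -> None:
        """A tuple anchored to root_id was fully handled by a bolt."""
        self._xor(root_id, tuple_id)

    def is_complete(self, root_id: str) -> bool:
        """True once every emitted tuple in the tree has been acked."""
        return self._ack_val.get(root_id) == 0


tracker = AckTracker()
tracker.emit("msg-1", 0xA1)
tracker.emit("msg-1", 0xB2)
tracker.ack("msg-1", 0xA1)
print(tracker.is_complete("msg-1"))  # False: 0xB2 is still pending
tracker.ack("msg-1", 0xB2)
print(tracker.is_complete("msg-1"))  # True: the spout can ack msg-1
```

If the tree does not complete within a timeout, the spout tuple is failed and can be replayed, which is where the at-least-once guarantee comes from.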
16
Micro Service vs Bolt
• Choice of language
• Teams operate independently
• Platform with pluggable services
Bolt
17
Catalog Pipeline
18
Challenges
• Validations at various stages
• Async IO using RxJava, Hystrix
• Hystrix Circuit Breaker
• Failing Tuples
• Fetch-size, increase workers,
increase bolt parallelism
• Data Errors
• Services taking longer
• Service outage
• Fatal Errors
• Spike in traffic
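For the "services taking longer" and "service outage" cases, a circuit breaker lets a bolt fail fast instead of piling up calls against a service that is already down. The sketch below is a deliberately simplified breaker based on consecutive failures. It is not Hystrix, which works from error rates over a rolling window; the thresholds and the injectable `clock` are assumptions for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A bolt would wrap its service call in `breaker.call(...)` and fail the tuple on `CircuitOpenError` so it can be retried once the service recovers.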
19
Lessons Learnt
• Things will fail
• Monitor everything
• Automation
• Scale is not a feature
• Logs don’t lie
20
21
Pricing Use Case
• Competitive pricing (EDLP)
• Seller price updates
• Handle spike during holidays
• Promotions
• Anomaly Detection
• Accuracy
22
Characteristics of ingestion pipeline
• Exactly Once
• Order Guarantee
• Stateful
• Handle tens of millions of
updates/hour
• NRT price update on website
• Traceability
23
Apache Flink
• Originated as the Stratosphere research
project at universities around Berlin
• data Artisans founded in 2014
• Process Unbounded and
Bounded Data
• Exactly Once
• Stateful & Flexible API
• Alibaba was using it at scale
24
Apache Flink - Overview
• Data source: Incoming data that Flink processes
• Transformations: The processing step, when Flink modifies incoming data
• Data sink: Where Flink sends data after processing
25
Apache Flink - Runtime
26
Stateful Stream Processing
• "state" is shared between events.
• Past events can influence the way current
events are processed.
• Embedded database (RocksDB) for state.
• Local state needs to be protected against
failures to avoid data loss.
• Checkpointing to guarantee persistence of
state.
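"Past events can influence the way current events are processed" is easiest to see with keyed state. The sketch below is plain Python, not the Flink state API: it remembers the last price per item in a dictionary and flags an update when the price moves by more than a factor of `max_ratio`. The function name, event shape and threshold are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

PriceEvent = Tuple[str, float]          # (item id, new price)
Alert = Tuple[str, float, float]        # (item id, previous price, new price)


def detect_price_jumps(
    events: Iterable[PriceEvent], max_ratio: float = 2.0
) -> Iterator[Alert]:
    """Flag updates whose price differs from the last one by > max_ratio."""
    last_price: Dict[str, float] = {}   # keyed state: item id -> last price
    for item_id, price in events:
        previous = last_price.get(item_id)
        if previous is not None and (
            price > previous * max_ratio or price * max_ratio < previous
        ):
            yield (item_id, previous, price)
        last_price[item_id] = price     # state update for the next event


events = [("tv", 500.0), ("pen", 2.0), ("tv", 510.0), ("tv", 49.0), ("pen", 2.1)]
alerts = list(detect_price_jumps(events))
print(alerts)  # [('tv', 510.0, 49.0)]
```

If the process holding `last_price` dies, that dictionary is gone, which is why the state has to be checkpointed.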
27
Flink Checkpointing (Chandy-Lamport Algorithm)
28
Exactly Once - Explained
• The label “exactly-once” is misleading in
describing what is done exactly once.
• No stream processor can guarantee
exactly-once event processing.
• Flink guarantees exactly-once state
updates.
• Flink uses the Chandy-Lamport algorithm
to draw consistent snapshots of current
state to create a checkpoint.
• Flink restarts an application using the most
recently completed checkpoint as a starting
point.
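The difference between exactly-once event processing and exactly-once state updates can be shown with a single-process toy model. It does not implement the Chandy-Lamport algorithm or any Flink API; it only assumes a replayable input log and a checkpoint that stores the read offset together with the state. The `CountingJob` class and its `emitted` list (standing in for external side effects) are illustrative.

```python
import copy
from typing import Dict, List, Tuple

Snapshot = Tuple[int, Dict[str, int]]


class CountingJob:
    """Counts keys from a replayable log; checkpoints offset plus counts."""

    def __init__(self) -> None:
        self.offset = 0
        self.counts: Dict[str, int] = {}
        self.emitted: List[str] = []  # side effects outside the checkpoint

    def process_next(self, log: List[str]) -> None:
        key = log[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1
        self.emitted.append(key)

    def checkpoint(self) -> Snapshot:
        return self.offset, copy.deepcopy(self.counts)

    def restore(self, snapshot: Snapshot) -> None:
        self.offset, self.counts = snapshot[0], copy.deepcopy(snapshot[1])


log = ["a", "b", "a", "c"]
job = CountingJob()
job.process_next(log)            # a
job.process_next(log)            # b
snapshot = job.checkpoint()      # offset 2, counts {a: 1, b: 1}
job.process_next(log)            # a (this work is lost in the "crash")
job.restore(snapshot)            # restart from the last completed checkpoint
while job.offset < len(log):
    job.process_next(log)        # replays a, then processes c

print(job.counts)   # {'a': 2, 'b': 1, 'c': 1}
print(job.emitted)  # ['a', 'b', 'a', 'a', 'c']
```

The third event was processed twice, as the `emitted` list shows, yet its effect appears in the state exactly once. Downstream systems still have to cope with the duplicate.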
29
Duplicate Events
30
Pricing Pipeline
31
Challenges
• HTTP/DB lookup calls
• Huge payload choking network
• Isolation
• Buffer bloat
• Async I/O Operator
• Operator Chaining
• Mesos / YARN
• taskmanager.memory.segment-size
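The reason an async I/O operator helps with HTTP/DB lookups is that the operator no longer waits for one response before sending the next request. The sketch below shows the idea with Python's `asyncio`, not Flink's operator: several lookups are in flight at once, bounded by a `capacity` limit, and results come back in input order. `lookup_price` is a hypothetical stand-in for a remote call.

```python
import asyncio
from typing import List, Tuple


async def lookup_price(item_id: str) -> float:
    """Hypothetical remote lookup; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return float(len(item_id))


async def enrich(item_ids: List[str], capacity: int = 2) -> List[Tuple[str, float]]:
    """Look up every item with at most `capacity` requests in flight."""
    semaphore = asyncio.Semaphore(capacity)

    async def one(item_id: str) -> Tuple[str, float]:
        async with semaphore:
            return item_id, await lookup_price(item_id)

    # gather returns results in the order the inputs were given.
    return list(await asyncio.gather(*(one(i) for i in item_ids)))


enriched = asyncio.run(enrich(["tv", "pen", "laptop"]))
print(enriched)  # [('tv', 2.0), ('pen', 3.0), ('laptop', 6.0)]
```

The capacity bound matters: without it, a slow service turns into an unbounded pile of pending requests.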
32
What we learnt
• Flink is fast, APIs are super easy to use.
• Avoid network shuffle and use forward / operator
chaining.
• Use accumulators to monitor the progress of your
application.
• Checkpoint failures indicate that your application is
running slow.
• Monitor everything – lag, checkpoints, latency, etc.
• For applications that are inherently slow, configure
your buffers to accommodate buffer bloat, so that
checkpoints don’t fail.
• Join the Flink user mailing list and ask questions!
33
Apache Storm vs Apache Flink
• True streaming
– Storm: Yes
– Flink: Yes
– Winner: Tie
• Speed
– Storm: Fast
– Flink: Amazingly fast
• Overall maturity
– Storm: Very stable; we haven’t really encountered Storm bugs that hit us in production.
– Flink: A little behind; we ran into lots of Flink bugs, some of which are addressed now.
• API
– Storm: Used to be very primitive until 1.0.
– Flink: Rich API; you can achieve a lot by writing very few lines of code.
• Windowing, Join
– Storm: Support added in 1.2.
– Flink: Excellent out-of-the-box support for windowing and join.
– Winner: Tie
• Monitoring / Deployment
– Storm: Better isolation of jobs with the process model.
– Flink: You need YARN/Mesos to get better isolation.
– Winner: Tie (assumes you are running Flink on YARN)
• Stateful stream processing
– Storm: WIP (Apache Storm 2.0)
– Flink: Supported with RocksDB. You can also query the state from outside your stream processing system.
• Message processing guarantee
– Storm: At least once, at most once, exactly once (needs Trident)
– Flink: At least once, at most once, exactly once (state is touched exactly once)
– Winner: Tie
• Backpressure
– Storm: Max spout pending can be used to adjust.
– Flink: Handled automatically
• Async IO support
– Storm: No native support
– Flink: Out of the box
• Streaming SQL
– Storm: WIP (Apache Storm 2.0)
– Flink: Very early stage
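The backpressure row mentions max spout pending. The idea is that the spout stops emitting once a fixed number of tuples are in flight, and resumes as acks come back. The sketch below is a toy model of that throttle, not Storm's implementation; the `ThrottledSpout` class is illustrative.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple


class ThrottledSpout:
    """Emits from a source only while fewer than max_pending are un-acked."""

    def __init__(self, source: Iterable[Tuple[int, str]], max_pending: int) -> None:
        self._source: Iterator[Tuple[int, str]] = iter(source)
        self._max_pending = max_pending
        self._pending: Set[int] = set()

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if len(self._pending) >= self._max_pending:
            return None  # too many tuples in flight: hold back
        try:
            msg_id, value = next(self._source)
        except StopIteration:
            return None  # source exhausted
        self._pending.add(msg_id)
        return msg_id, value

    def ack(self, msg_id: int) -> None:
        self._pending.discard(msg_id)


spout = ThrottledSpout(enumerate(["a", "b", "c"]), max_pending=2)
print(spout.next_tuple())  # (0, 'a')
print(spout.next_tuple())  # (1, 'b')
print(spout.next_tuple())  # None: two tuples are still pending
spout.ack(0)
print(spout.next_tuple())  # (2, 'c')
```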
34
What should I pick
35
Future of streaming - Cloud
• Amazon Kinesis Streams
• Functions as stream processors
• Cloud Flow
• Confluent Cloud
• Event Hub – Kafka Compatible
36
Thank You!
Yes, we are hiring!
https://indiacareers.walmartlabs.com/

