Slide 1
© Copyright 2016 EMC Corporation. All rights reserved.
Improved Reliable Streaming Processing:
Apache Storm as example
Frank Zhao, EMC CTO Office,
Fenghao Zhang*, Microsoft Bing,
Yusong Lv*, Peking University
Special thanks to EMC's Ken Taylor, John Cardente and Lincourt Robert
*Zhang and Lv contributed to the research when they worked at EMC China COE
Slide 2
The technology concepts being discussed and demonstrated are
the result of research conducted by the Advanced Research &
Development (ARD) team from the EMC Office of the CTO. Any
demonstrated capability is only for research purposes and at a
prototype phase, therefore: THERE ARE NO IMMEDIATE PLANS
NOR INDICATION OF SUCH PLANS FOR PRODUCTIZATION OF
THESE CAPABILITIES AT THE TIME OF PRESENTATION. THINGS
MAY OR MAY NOT CHANGE IN THE FUTURE.
DISCLAIMER
Slide 3
• Distributed Streaming System
• Reliable Processing
• Apache Storm’s Solution, the Challenge
• New Proposed Approaches
– Fingerprint, and share-split
• Prototyping with Apache Storm and Benchmark
• Summary and Outlook
Agenda
Slide 4
• As a service, continuously process data (a.k.a. messages or tuples)
in a scalable, reliable and high-performance (msec-level) way
– Open-source: Storm, Flink, Spark-Streaming, Samza
Streaming processing
Slide 5
|                 | Streaming processing (Storm, Spark Streaming)  | Batch processing (Hadoop MR) |
|-----------------|------------------------------------------------|------------------------------|
| Type            | Continuous (never-stop), real-time (ms level)  | Batch/periodic               |
| Model           | DAG/graph                                      | MapReduce-like jobs          |
| Workload        | CPU/memory intensive                           | CPU/memory and IO intensive  |
| State           | Stateless, may checkpoint periodically         | Stateful                     |
| Cluster         | Master-slave w/ Zookeeper (Storm)              | Master-slave or job-task     |
| Fault-tolerance | Fault-tolerance/HA                             | Fault-tolerance/HA           |
Streaming vs. batch processing
Slide 6
|                 | Storm                                        | Flink         | Spark Streaming |
|-----------------|----------------------------------------------|---------------|-----------------|
| Built since     | 2011 (Apache, Trident), 2016 (Twitter Heron) | 2014 (Apache) | ~2013           |
| Streaming       | Native (micro-batch w/ Trident)              | Native        | Micro-batch     |
| Guarantee       | At least once (exactly-once w/ Trident)      | Exactly-once  | Exactly-once    |
| Fault-tolerance | Ack per message                              | Checkpoint    | Checkpoint      |
| Latency         | 5                                            | 4             | 3               |
| Throughput      | 4                                            | 5             | 5               |
| Ecosystem       | 5                                            | 3             | 3               |
Storm, Flink, Spark streaming*
*Personal observations for reference only
Slide 7
• Every message shall be guaranteed to be processed
– At-most once
– At-least once
– Exactly once
Reliable processing
[Figure: a topology (DAG). A data source feeds a spout, which emits root message R; bolts 0-9 (worker/task/op) process and exchange messages B-M; leaf bolts may save results.]
Slide 8
• Scalable
• Fault-tolerant
• Guaranteed message processing
– At least once (default)
• Fast: ms level
– Pure in-memory computing, no checkpoint
• Simple programming model (see the sketch below)
– Topology: Spouts and Bolts
– Clojure, Java, Ruby, Python …
Apache Storm
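To make the programming model concrete, here is a minimal word-splitting topology sketched against the Storm 1.x Java API. This is an illustrative sketch only: the class names, the hard-coded sentence and the field names are invented, not taken from the talk.

```java
import java.util.Map;
import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

  // Spout: the root of the DAG; emitting with a message ID opts the
  // tuple tree into Storm's reliable (at-least-once) tracking.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector out;
    @Override public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { out = c; }
    @Override public void nextTuple() {
      Utils.sleep(100);
      out.emit(new Values("to be or not to be"), UUID.randomUUID().toString());
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("sentence")); }
  }

  // Bolt: a BasicBolt anchors its emits and acks automatically per input tuple.
  public static class SplitBolt extends BaseBasicBolt {
    @Override public void execute(Tuple in, BasicOutputCollector out) {
      for (String w : in.getStringByField("sentence").split(" ")) out.emit(new Values(w));
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("word")); }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder b = new TopologyBuilder();
    b.setSpout("sentences", new SentenceSpout(), 1);
    b.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
    new LocalCluster().submitTopology("word-demo", new Config(), b.createTopology());
  }
}
```

Emitting from the spout with a message ID, as above, is what enables per-message tracking; omitting the ID gives the faster, non-reliable mode compared later in the deck.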
Slide 9
Storm: designs for fault-tolerance
Master (Nimbus):
• Deploys topologies
• Dispatches tasks
• Monitors the cluster
Zookeeper cluster:
• Coordination
• States of Nimbus and of the supervisors
• …
Workers:
• Supervisor → Executor → Tasks
These fault-tolerance designs cover threads/tasks/jobs and nodes, NOT individual messages.
Slide 10
• Critical: message granularity (NOT thread/task/job/node)
• Need an efficient method, considering:
– Every component may fail
– Large topologies, continuously flooding messages
– Network temporarily unavailable, out-of-order traffic, …
– Minimized resource usage (network, CPU, memory)
Track processing status in DAG
[Figure: the same topology DAG as on slide 7.]
Slide 11
History of Apache Storm and lessons learned
– Nathan Marz, creator of Storm
Tough problem and Storm’s answer!
Slide 12
Storm's reliability tracking algorithm
[Figure: a 5-bolt topology. Bolt 0 consumes root message R and emits A, B, C; bolts 1-3 turn A→D, B→E, C→F; leaf bolt 4 consumes D, E, F. The Acker records srcNodeID: R, R.]

Acks sent to the Acker:
bolt 0: R ⊕ A ⊕ B ⊕ C
bolt 1: A ⊕ D
bolt 2: B ⊕ E
bolt 3: C ⊕ F
bolt 4: D ⊕ E ⊕ F

Status = R ⊕ (R ⊕ A ⊕ B ⊕ C) ⊕ (A ⊕ D) ⊕ (B ⊕ E) ⊕ (C ⊕ F) ⊕ (D ⊕ E ⊕ F) = 0

1. Each msg has an ID (an 8B random number)
2. For each input msg, each bolt computes XOR(inMsgID, outMsgID[])
3. Each bolt sends that per-input-msg XOR result to the Acker
4. The Acker XORs everything into a single 8B status (regardless of topology size)
5. Finally, within the timeout, Acker.status == 0 means OK; otherwise something failed (may false-alarm, but never misses)
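To make the XOR bookkeeping concrete, here is a toy replay of the example above in plain Java; the seven random longs stand in for Storm's 8-byte message IDs, and each "bolt" is just one XOR term folded into the Acker's status.

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy replay of Storm's XOR tracking for the 5-bolt example above.
public class XorAckDemo {
  public static void main(String[] args) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long R = rnd.nextLong(), A = rnd.nextLong(), B = rnd.nextLong(),
         C = rnd.nextLong(), D = rnd.nextLong(), E = rnd.nextLong(), F = rnd.nextLong();

    long status = R;          // Acker initialized with the root msg ID
    status ^= R ^ A ^ B ^ C;  // bolt 0: consumed R, emitted A, B, C
    status ^= A ^ D;          // bolt 1: consumed A, emitted D
    status ^= B ^ E;          // bolt 2
    status ^= C ^ F;          // bolt 3
    status ^= D ^ E ^ F;      // bolt 4 (leaf): consumed D, E, F
    System.out.println(status == 0);  // true: every ID is XORed in an even number of times

    // If bolt 1's ack is lost, A and D never cancel and the status stays non-zero:
    long failed = R ^ (R ^ A ^ B ^ C) ^ (B ^ E) ^ (C ^ F) ^ (D ^ E ^ F);
    System.out.println(failed == (A ^ D));  // true (and A ^ D != 0 unless IDs collide)
  }
}
```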
Slide 13
• RandomNum + XOR is the key foundation of Storm; it has run for 5+ years
– Smart, simple and pretty good!
– Least memory footprint at the Acker, regardless of topology
– Reliable*, regardless of Ack traffic order
– The XOR op obeys the commutative and associative laws
• Easy to handle any out-of-order delivery
Ingenious!
*: in theory, random IDs may collide
Slide 14
• Network traffic and CPU overhead → latency & throughput impact
– Possibility of random-number collisions
Limitations
[Chart: ~25,000 msg/sec with non-reliable processing vs. ~9,300 msg/sec with reliable processing.]
*3rd-party benchmark from 2012; things may have changed since
Slide 15
IS IT POSSIBLE? Ack only at the leaf?
[Figure: the same topology DAG as on slide 7, with acks drawn only at the leaf bolts.]
Current algorithm is fantastic, however
Slide 16
• Same-level guaranteed reliable processing
• More scalable, efficient and fast
– Much less Ack traffic; usually only at leaf nodes
– Same memory footprint, less CPU usage
– Eventually better latency/throughput
2 new proposed approaches
Currently in research & quick validation phase
Slide 17
• An evolution based on Random Num + XOR
Approach-1: fingerprint based
Currently: XOR over one (send, recv) pair yields 0.
Further: XOR over multiple pairs (2, 4, 6, …) still yields 0.
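A tiny self-contained demonstration of this property; the constants are arbitrary:

```java
// XOR over pairs cancels to 0 regardless of how many pairs there are
// and of the order in which they arrive.
public class XorPairs {
  public static void main(String[] args) {
    long x = 0x1234L, y = 0xBEEFL, z = 0xCAFEL;
    System.out.println((x ^ x) == 0);                  // one (send, recv) pair
    System.out.println((y ^ x ^ z ^ y ^ x ^ z) == 0);  // three pairs, shuffled order
  }
}
```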
Slide 18
• Fingerprint (FP): a digest (e.g., 8B) of {in msgs, out msgs, parent.fp} that encodes the processing context and is recursively passed down, so each downstream message inherits "genes" from all of its ancestors
– Still uses XOR of IDs; redundancy scales with the topology
– 3 rules: embedded, recursively inherited, append-only update
Approach-1: fingerprint idea
[Figure: node Ni receives msg <Mj, FPj>, computes the pass-down FP as InMsgID XOR [outMsgIDs], and embeds it in each emitted msg <Mj+1, FPj:i>, <Mj+2, FPj:i>, … sent to downstream nodes Ni+1, Ni+2, Ni+3.]
• Embedded: carried as part of message metadata
• Recursive-inherit: passed down to all downstream messages
• Append-update: via XOR
Slide 19
Fingerprint example
[Figure: the same 5-node topology. The Acker is initialized with srcNodeID: RootMsgID, R. Each bolt calculates its FP and embeds it in its output messages; the leaf's acks may be batched.]

FP0 = R ⊕ A ⊕ B ⊕ C
FP1 = FP0 ⊕ A ⊕ D
FP2 = FP0 ⊕ B ⊕ E
FP3 = FP0 ⊕ C ⊕ F

The leaf produces the only 3 Ack messages:
FP4-D = FP1 ⊕ D
FP4-E = FP2 ⊕ E
FP4-F = FP3 ⊕ F

Acker.status = R ⊕
(FP0 ⊕ A ⊕ D) ⊕ D ⊕
(FP0 ⊕ B ⊕ E) ⊕ E ⊕
(FP0 ⊕ C ⊕ F) ⊕ F = 0
Slide 20
Approach-1: failure example
[Figure: the same topology; the Acker is initialized with srcNodeID: RootMsgID, R.]

If msg D failed, node 4 only acks FP4-E and FP4-F, so finally:

Acker.status = R ⊕ FP4-E ⊕ FP4-F
= R ⊕ FP2 ⊕ E ⊕ FP3 ⊕ F
= R ⊕ (FP0 ⊕ B ⊕ E ⊕ E) ⊕ (FP0 ⊕ C ⊕ F ⊕ F)
= R ⊕ B ⊕ C != 0

→ The information about the A/D path is missing, due to the failure!
Another example: if all messages failed, the status is R != 0.
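A toy replay of the two fingerprint examples above in plain Java, assuming the same 5-node topology; it checks that leaf-only acks collapse to 0 on success and leave R ⊕ B ⊕ C when message D is lost.

```java
import java.util.concurrent.ThreadLocalRandom;

// Replay of the fingerprint examples: only the leaf acks, and the Acker
// still detects both the success case and the loss of message D.
public class FingerprintDemo {
  public static void main(String[] args) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long R = rnd.nextLong(), A = rnd.nextLong(), B = rnd.nextLong(),
         C = rnd.nextLong(), D = rnd.nextLong(), E = rnd.nextLong(), F = rnd.nextLong();

    long FP0 = R ^ A ^ B ^ C;   // node 0: in R, out A, B, C
    long FP1 = FP0 ^ A ^ D;     // node 1 inherits FP0, appends A -> D
    long FP2 = FP0 ^ B ^ E;     // node 2
    long FP3 = FP0 ^ C ^ F;     // node 3

    long FP4D = FP1 ^ D, FP4E = FP2 ^ E, FP4F = FP3 ^ F;  // leaf acks only

    long ok = R ^ FP4D ^ FP4E ^ FP4F;
    System.out.println(ok == 0);               // success: status collapses to 0

    long lostD = R ^ FP4E ^ FP4F;              // msg D never reached the leaf
    System.out.println(lostD == (R ^ B ^ C));  // leftover = R ^ B ^ C != 0
  }
}
```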
Slide 21
Approach-1: a complex example
[Figure: an 8-node topology. Node 1 consumes root message R and emits A, B, C; node 2: A→D; node 3: B→X; node 4: C→E; node 5 consumes D, X, E and emits F, G; node 6: F→H; node 7: G→I; node 8 is the leaf consuming H and I.]

Initial: R
FP1 = R ⊕ A ⊕ B ⊕ C
FP2 = FP1 ⊕ A ⊕ D
FP3 = FP1 ⊕ B ⊕ X
FP4 = FP1 ⊕ C ⊕ E
// node 5 also updates FP5 to the Acker, since it has an even number of downstreams (2)
FP5 = FP2 ⊕ D ⊕
FP3 ⊕ X ⊕
FP4 ⊕ E ⊕ (F ⊕ G)
FP6 = FP5 ⊕ F ⊕ H
FP7 = FP5 ⊕ G ⊕ I
// bolt 8 sends FP8 to the Acker
FP8 = FP6 ⊕ H ⊕ FP7 ⊕ I

Final Status = R ⊕ FP5 ⊕ FP8
= R ⊕ FP5 ⊕ (FP5 ⊕ F) ⊕ (FP5 ⊕ G)
= R ⊕ FP5 ⊕ (F ⊕ G)
= R ⊕ FP2 ⊕ D ⊕ FP3 ⊕ X ⊕ FP4 ⊕ E
= R ⊕ (FP1 ⊕ A ⊕ B ⊕ C)
= 0

Limits and notes: 1) the number of downstream msgs shall be odd (1, 3, 5, …); otherwise the bolt must send its new FP to the Acker, which XORs that FP into the status; 2) to implement this approach, a bolt ideally needs to know its total downstream count in order to generate the FP before emitting.
Slide 22
• For an input rootMsg, INIT a BIG SHARE (8B), EMBED it as metadata, and pass it down
• SPLIT the attached share (done by Storm) at each bolt, EMBED, and repeat until the leaves ...
• Only the leaves ACK the share they received to the Acker
• The Acker REDOes: it decreases by each reported share; finally 0 means OK, otherwise failure (toy sketch after the figure below)
– No random numbers (no collisions), no XOR; inline embedding; the split is transparent to the App
– +/- (mod) obeys the commutative & associative laws, resolving out-of-order issues
Approach-2: share split
[Figure: a 10-bolt topology. The Acker records srcNodeID: rootMsgID, BIG-Share = (A, 1, 100). Msg A carries share 100 and splits down the tree: B, 50 / C, 50; then D, 25 / E, 25 / I, 25 / J, 25; then F, 17 / G, 17 / H, 16 / K, 17 / L, 17 / M, 16. The leaves report (A, 0, 16) and (A, 0, 84), and 100 - 16 - 84 = 0.]
Analogy: IPO/stock shares; split or increase the share.
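A hedged sketch of the share-split bookkeeping, assuming the simplest policy (equal split with the remainder on the last message, fan-out 2, leaf-only acks); the tree shape and numbers are illustrative, not the prototype's code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy share-split run: the root message carries a big share, every hop
// splits it among its downstream messages, and only leaves report back.
// The Acker subtracts leaf reports from the initial share; 0 means done.
public class ShareSplitDemo {

  // Equal split with the remainder going to the last downstream message,
  // so no share is lost to integer division.
  static long[] split(long share, int fanout) {
    long[] parts = new long[fanout];
    long each = share / fanout;
    for (int i = 0; i < fanout; i++) parts[i] = each;
    parts[fanout - 1] += share - each * fanout;
    return parts;
  }

  public static void main(String[] args) {
    long bigShare = 100;               // INIT at the spout, recorded by the Acker
    long ackerBalance = bigShare;

    // Walk a depth-3 tree: each queued entry is {share, remaining depth}.
    Deque<long[]> queue = new ArrayDeque<>();
    queue.push(new long[]{bigShare, 3});
    while (!queue.isEmpty()) {
      long[] msg = queue.pop();
      if (msg[1] == 0) {               // leaf: ACK the share at hand
        ackerBalance -= msg[0];
        continue;
      }
      for (long part : split(msg[0], 2))    // non-leaf: split to 2 downstreams
        queue.push(new long[]{part, msg[1] - 1});
    }
    System.out.println(ackerBalance == 0);  // true: every share came back
  }
}
```

If any leaf's report were lost, the balance would stay positive past the timeout, signalling a failure exactly as a non-zero XOR status does.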
Slide 23
• Rare case: INCREASE the share if it is insufficient to split (and sync up the Acker)
• The Acker then ADDs the newly increased share (it does NOT decrease)
Approach-2: share split (cont'd)
[Figure: the same DAG plus the Acker, which records srcNodeID, RootMsgID, Share = (A, 100). B receives 99 and C receives only 1; C must emit to F, G, H, so it increases its share, syncs the Acker with (A, +99), and splits into F, 33 / G, 33 / H, 34.]
If S - S1 - S2 - … = Sn, then S - S1 - S2 - … - Sn = 0
(Acks may be batched.)
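The same toy model extended with the rare-case increase from this slide; the +99 increment and the 33/33/34 split mirror the figure, and the assumption that B's whole branch eventually reports 99 is for illustration only.

```java
// Rare case from the slide: a share of 1 cannot be split 3 ways, so the
// bolt increases it and syncs the Acker, which ADDs the increment.
public class ShareIncreaseDemo {
  public static void main(String[] args) {
    long acker = 100;            // Acker holds (A, 100) for the root message
    long boltB = 99, boltC = 1;  // B received 99, C received only 1

    // C must emit 3 downstream msgs but holds share 1: increase and sync.
    long inc = 99;
    acker += inc;                // Acker side: (A, +99), an ADD, not a decrease
    boltC += inc;                // C now holds 100, split as 33/33/34
    long[] fromC = {33, 33, 34};

    // Leaves report back; subtraction order doesn't matter (+/- commute).
    acker -= boltB;              // assume B's branch reports 99 in total
    for (long s : fromC) acker -= s;
    System.out.println(acker == 0);  // true: S - S1 - S2 - ... - Sn = 0
  }
}
```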
Slide 24
• Implemented Approach-2 (share-split)
• Integrated with Storm 1.0.1 (released in May 2016)
– Storm core (~200 LOC in Clojure, a LISP-like language) and Java APIs (~200 LOC, including some traces/tests)
• Implementation notes:
– Supports BasicBolt; removes the random number; re-uses some existing structures/APIs, e.g., Anchors-to-ids (now RootID:shareAttached) and Ack sending
– Globally pre-defined split share at all bolts (equal split)
• Next: a configurable split approach per bolt
– To split the share exactly, we built a 1-step delayed emit (see the sketch below):
• Pre-split the input share
• Once a new tuple is generated, internally queue it until the next tuple comes out
• Finally, explicitly call emitDone(); the last tuple then takes over all remaining share and is emitted
Prototyping
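A hedged sketch of the 1-step delayed emit described above; SPLIT_PARTS, the "wire" list and the class shape are invented for the illustration, and the real prototype hooks this logic into Storm's collectors instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the 1-step-delayed emit: hold each tuple until the next one
// arrives so its share can be finalized; emitDone() flushes the last
// tuple with ALL leftover share, so no share is ever stranded.
public class DelayedEmit {
  static final int SPLIT_PARTS = 4;   // hypothetical pre-defined global split count

  private final long inputShare;
  private long remainingShare;
  private String pendingTuple;                  // at most one queued outgoing tuple
  final List<String> wire = new ArrayList<>();  // stands in for the real emit path

  DelayedEmit(long inputShare) {
    this.inputShare = inputShare;
    this.remainingShare = inputShare;
  }

  void emit(String tuple) {
    if (pendingTuple != null) {                  // flush the previous tuple...
      long part = inputShare / SPLIT_PARTS;      // ...with its pre-split share
      remainingShare -= part;
      wire.add(pendingTuple + " share=" + part);
    }
    pendingTuple = tuple;                        // queue the new one
  }

  void emitDone() {                              // explicit end-of-output call
    if (pendingTuple != null)
      wire.add(pendingTuple + " share=" + remainingShare);  // takes all that's left
    pendingTuple = null;
  }

  public static void main(String[] args) {
    DelayedEmit bolt = new DelayedEmit(100);
    bolt.emit("t1"); bolt.emit("t2"); bolt.emit("t3");
    bolt.emitDone();
    bolt.wire.forEach(System.out::println);  // t1 and t2 get 25 each, t3 gets 50
  }
}
```

The one-tuple delay is the price of not knowing the fan-out in advance; emitDone() is the only App-visible change, matching the summary slide.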
Slide 25
• Function & performance
– Network traffic, CPU, latency/throughput
• Reference: IBM whitepaper (Storm vs. IBM InfoSphere), 7 layers
– We use Wikipedia as the data source; word processing
Benchmark
[Setup: 3 servers, each running Ubuntu 15.10 (kernel 4.2.0) and Storm 1.0.1 on an E5-2643 @ 3.40GHz (24 cores) with 256GB DRAM, connected by a 1000 Mbps network.]
Slide 26
• Function: inject errors and validate reliability detection: Pass
– Same-level reliability as the existing approach
• Performance: same HW/SW config and processing logic
– 16KB tuples, 100 pending, parallelism of 48 per bolt
– 4 workers & 12 Ackers per host
Result: function & performance
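The pending/worker/acker settings above map directly onto Storm's standard Config knobs; a sketch of how such a run might be configured (the class name is invented, and the slide states the worker/acker counts per host, whereas these setters apply per topology):

```java
import org.apache.storm.Config;

// How the benchmark settings above map onto standard Storm 1.x Config knobs.
public class BenchConfig {
  public static Config make() {
    Config conf = new Config();
    conf.setNumWorkers(4);         // 4 workers (per host on the slide)
    conf.setNumAckers(12);         // 12 Acker executors (per host on the slide)
    conf.setMaxSpoutPending(100);  // at most 100 pending tuples per spout task
    return conf;
  }
}
```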
Slide 27
• 1/3 Ack traffic, 18% faster, 9% less CPU
Test1: 3 layers
| Metric                  | Current | New  |
|-------------------------|---------|------|
| Ack traffic (millions)  | 3903    | 1301 |
| End-to-end latency (ms) | 241     | 197  |
| CPU (per Java worker)   | 350%    | 320% |
Slide 28
• 1/5 Ack traffic, 23% faster, 14% less CPU
Test2: 7 layers
| Metric                  | Current | New |
|-------------------------|---------|-----|
| Ack traffic (millions)  | 2685    | 537 |
| End-to-end latency (ms) | 197     | 151 |
| CPU (per Java worker)   | 250%    | 215% |
Slide 29
• Larger topologies? A quick test with 11 layers:
– 1/9 the Ack traffic
• Presumably, the larger the topology, the greater the gains
• Next
– Refine multi-Acker
– Implement the "Increase Share" operation
– A configurable split method per bolt
• So developers can specify the desired split rather than a fixed/global one
• Maybe integrate with Twitter Heron? Or apply to other areas?
– e.g., function call graphs? performance traces? (more…)
MORE
Slide 30
End-to-end IoT landscape
[Figure: an end-to-end IoT pipeline with continuous, scalable, real-time processing.]
Slide 31
• Lambda architecture: fusing "historical" + "new" data
– Proposed by Nathan Marz (~5 years ago): batch + streaming
– Widely adopted by many Internet companies
Unified data processing
Slide 32
• 2 innovative & inspiring streaming reliability algorithms
– Guaranteed processing with a minimized memory footprint
– More scalable, efficient & fast, and even beautiful
• Demonstrated in Storm
– 1/N Ack traffic, needed only at leaf nodes
• N is the topology depth; usually only a few leaves exist (for aggregation, DB saving, etc.)
• Meanwhile: 23% faster, 14% less CPU
– Transparent to the App, except the final explicit emitDone() call
• Applying to other interesting areas...
– Distributed replication, transactions, exact state tracking, …
SUMMARY
Slide 33
• Feedback or comments? Talk with us!
– Any flaws, constraints, or room to improve?
– Then discuss with the Storm community; code can be shared if needed
Junping.Zhao@emc.com ZhaoJP@gmail.com
THANK YOU!
Editor's Notes
• #3: Any official disclaimer?
• #5: May also be known as Complex Event Processing (CEP)
• #7: Trident: an abstraction on top of Storm. Besides providing higher-level, Cascading-like constructs, it batches groups of tuples to 1) make reasoning about processing easier and 2) encourage efficient data persistence, with an API that can provide exactly-once semantics for some cases. Heron: built since 2014, paper in 2015, open-sourced in May 2016 (http://twitter.github.io/heron/). API-compatible with Apache Storm, hence no code changes. "One of our primary requirements for Heron was ease of debugging and profiling"; also scheduling and optimal resource utilization (IPC layer, simplification). Flink: based on distributed checkpoints; see "Lightweight Asynchronous Snapshots for Distributed Dataflows" (ABS: Asynchronous Barrier Snapshotting, http://arxiv.org/abs/1506.08603), a variation of the Chandy-Lamport algorithm (1985). It periodically draws state snapshots of a running stream topology and stores these snapshots to durable storage. Similar to the micro-batching approach, in which all computations between two checkpoints either succeed or fail atomically as a whole. However, the similarities stop there. One great feature of Chandy-Lamport is that we never have to press the "pause" button in stream processing to schedule the next micro-batch: regular data processing keeps going, processing events as they come, while checkpoints happen in the background.
• #8: If a failure is detected, Storm can re-do from the beginning (Storm doesn't checkpoint); this is usually fast, at the ms level. Spark can re-do from the most recent checkpoint (a performance impact).
• #10: Task failed: restarted by the supervisor daemon. Supervisor/worker node failed: handled by ZK restart/re-scheduling. Master failed: handled by ZK; new tasks can't be submitted, but existing tasks should be OK. Redo (re-compute): no log/replica, for high-performance or real-time processing.
• #12: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html
• #13: It doesn't care which component failed. Once a failure is detected after a timeout (30 sec), the App should not commit the message to the data source (e.g., Kafka), so Kafka never removes that data; the App can then re-send the message and re-run the topology.
• #14: Random IDs; every bolt must send an Ack message.
• #15: Another benchmark is IBM's. IBM InfoSphere vs. Storm: https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
• #20: In practice there is a challenge to implementing this approach. Ideally: 1) we need to know how many downstream msgs will be generated, then allocate enough random IDs and calculate the FP; 2) for each downstream msg, embed the FP and emit it downstream. However, for much (maybe not all) processing logic, the total downstream msg count is probably not known beforehand (at step 1) until the logic has executed.
• #21: Same note as #20.
• #22: Same note as #20.
• #23: For example, the initial share is 100 at the Acker. Embed the share into the msg and pass it down to the downstream nodes. A source msg (root msg) is ingested at the root node (spout), which inits the BIG SHARE as the initial status and embeds the SHARE as part of the metadata. Run the topology: each node executes its pre-defined logic and also extracts the share and splits it across the downstream outputMsgs. Finally, the leaf nodes extract and report the received share to the Acker. The Acker then decreases the share: 100 - 16 - 84 = 0, and 0 means OK.
• #24: We may pre-define some rule for increases, e.g., always increase by 7B, so the Acker could use one bit to indicate an increase. A similar but different algorithm is Huang's algorithm: both use a number as a weight or share and involve a split op, but the problem area, prerequisites and algorithm steps are very different. Huang's target is more related to process (task/bolt) state, while our target is the continuously flowing messages running on those tasks. A few points, feel free to comment: Problem area: in Huang's context, a distributed task consists of different processes, each either active (and possibly going idle at any time) or idle (idle-to-active is only triggered by some msg). Huang's goal is to detect when *all processes* in the system become idle; our goal is to track the status of each message running on those tasks, usually in relation to partial failures (and we don't care which task failed or is unavailable). Prerequisites: importantly, the idle state (Huang's monitored state) is *explicitly known* by the process itself; hence his step "Upon becoming idle, a process sends a message…". In our case, a message failure/exception is hard for the component itself to know, typically due to network partitions/timeouts etc., so it must be detected by other components or a specially designed state, which adds extra challenges. Algorithm: the steps are different; our method always splits the number while flowing through the DAG, and the Acker essentially redoes the split op based on the received shares and makes sure the redo result is 0. In general, Huang's research targets processes (tens or hundreds) rather than continuously flowing messages (billions, never stopping). In practice, distributed process states are currently managed by Zookeeper (or Raft etc.), based on the Paxos algorithm, published in 1990 but widely understood and adopted only after 2001 (after Lamport's second paper explaining Paxos, and Google's validation).
• #25: A few important points in the implementation: 1. Re-use the existing Anchors-to-ids map to embed the share on emit (so no extra traffic); previously it was [RootId -> tupleID], now it is [RootID -> shareAssigned]. 2. To split the pass-down share, we need to know beforehand how many downstream outMsgs will be generated (usually hard to predict). To resolve that, we built 1-step deferred processing: 1) statically split the input share into sub-shares; 2) assign and embed, prepare to emit; 3) internally queue the current outMsg and send the previous msg; 4) emit the last msg with a new API, so the last outMsg takes over all the remaining share. With this implementation we introduce a small delay, but it is acceptable. 3. How to split the share is also important. Right now it is simply a pre-defined split method, i.e., all bolts use a pre-defined split count (could be 1 ~ 4096 or larger); in the future it shall be configurable per bolt by the developer (who presumably knows more about the topology), e.g., bolt1 may split up to 128, bolt2 up to 256, etc. An improper split may cause pressure to run out of IDs, requiring a share increase; it still depends on the topology size.
• #26: IBM InfoSphere vs. Storm: https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
• #27: Various topologies, such as top-down, bolts with multiple inputs, multiple spouts, …
• #34: CLJ source files: storm-core/src/clj/org/apache/storm/daemon/acker.clj, executor.clj; storm-core/src/clj/org/apache/storm/util.clj. Java source files: storm-core/src/jvm/org/apache/storm/topology/BasicOutputCollector.java, BasicBoltExecutor.java; storm-core/src/jvm/org/apache/storm/coordination/CoordinatedBolt.java; storm-core/src/jvm/org/apache/storm/task/IOutputCollector.java, OutputCollector.java; storm-core/src/jvm/org/apache/storm/trident/topology/TridentBoltExecutor.java