Slide 1
© Copyright 2016 EMC Corporation. All rights reserved.
Improved Reliable Streaming Processing:
Apache Storm as example
Frank Zhao, EMC CTO Office,
Fenghao Zhang*, Microsoft Bing,
Yusong Lv*, Peking University
Special thanks to EMC's Ken Taylor, John Cardente and Lincourt Robert
*Zhang and Lv contributed to the research when they worked at EMC China COE
Slide 2
The technology concepts being discussed and demonstrated are
the result of research conducted by the Advanced Research &
Development (ARD) team from the EMC Office of the CTO. Any
demonstrated capability is only for research purposes and at a
prototype phase, therefore: THERE ARE NO IMMEDIATE PLANS
NOR INDICATION OF SUCH PLANS FOR PRODUCTIZATION OF
THESE CAPABILITIES AT THE TIME OF PRESENTATION. THINGS
MAY OR MAY NOT CHANGE IN THE FUTURE.
DISCLAIMER
Slide 3
• Distributed Streaming System
• Reliable Processing
• Apache Storm’s Solution, the Challenge
• New Proposed Approaches
– Fingerprint, and share-split
• Prototyping with Apache Storm and Benchmark
• Summary and Outlook
Agenda
Slide 4
• As a service, continuously process data (a.k.a. messages or tuples)
in a scalable, reliable and high-performance (msec-level) way
– Open-source: Storm, Flink, Spark-Streaming, Samza
Streaming processing
Slide 5
|                 | Streaming processing (Storm, Spark Streaming)  | Batch processing (Hadoop MR) |
|-----------------|------------------------------------------------|------------------------------|
| Type            | Continuous (never-stop), real-time (ms level)  | Batch/periodic               |
| Model           | DAG/graph                                      | MapReduce-like jobs          |
| Workload        | CPU/memory intensive                           | CPU/memory and IO intensive  |
| State           | Stateless, may checkpoint periodically         | Stateful                     |
| Cluster         | Master-slave w/ Zookeeper (Storm)              | Master-slave or job-task     |
| Fault-tolerance | Fault-tolerance/HA                             | Fault-tolerance/HA           |
Streaming vs. batch processing
Slide 6
|                 | Storm                                        | Flink         | Spark Streaming |
|-----------------|----------------------------------------------|---------------|-----------------|
| Built since     | 2011 (Apache, Trident), 2016 (Twitter Heron) | 2014 (Apache) | ~2013           |
| Streaming       | Native (micro-batch w/ Trident)              | Native        | Micro-batch     |
| Guarantee       | At least once (exactly-once w/ Trident)      | Exactly-once  | Exactly-once    |
| Fault-tolerance | Ack per message                              | Checkpoint    | Checkpoint      |
| Latency         | 5                                            | 4             | 3               |
| Throughput      | 4                                            | 5             | 5               |
| Ecosystem       | 5                                            | 3             | 3               |
Storm, Flink, Spark streaming*
*Personal observations for reference only
Slide 7
• Every message shall be guaranteed to be processed
– At-most once
– At-least once
– Exactly once
Reliable processing
[Figure: a topology (DAG). A data source feeds a spout, which emits root message R; bolts 0-9 (worker/task/op) process and exchange messages B-M; leaf bolts may save results.]
Slide 8
• Scalable
• Fault-tolerant
• Guaranteed message processing
– At least once (default)
• Fast: ms level
– Pure in-memory computing, no checkpoint
• Simple programming model (see the sketch below)
– Topology: Spouts and Bolts
– Clojure, Java, Ruby, Python …
Apache Storm
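To make the programming model concrete, here is a minimal word-splitting topology sketched against the Storm 1.x Java API. This is an illustrative sketch only: the class names, the hard-coded sentence and the field names are invented, not taken from the talk.

```java
import java.util.Map;
import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

  // Spout: the root of the DAG; emitting with a message ID opts the
  // tuple tree into Storm's reliable (at-least-once) tracking.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector out;
    @Override public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { out = c; }
    @Override public void nextTuple() {
      Utils.sleep(100);
      out.emit(new Values("to be or not to be"), UUID.randomUUID().toString());
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("sentence")); }
  }

  // Bolt: a BasicBolt anchors its emits and acks automatically per input tuple.
  public static class SplitBolt extends BaseBasicBolt {
    @Override public void execute(Tuple in, BasicOutputCollector out) {
      for (String w : in.getStringByField("sentence").split(" ")) out.emit(new Values(w));
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("word")); }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder b = new TopologyBuilder();
    b.setSpout("sentences", new SentenceSpout(), 1);
    b.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
    new LocalCluster().submitTopology("word-demo", new Config(), b.createTopology());
  }
}
```

Emitting from the spout with a message ID, as above, is what enables per-message tracking; omitting the ID gives the faster, non-reliable mode compared later in the deck.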
Slide 9
Storm: designs for fault-tolerance
Master (Nimbus):
• Deploys topologies
• Dispatches tasks
• Monitors the cluster
Zookeeper cluster:
• Coordination
• States of Nimbus and of the supervisors
• …
Workers:
• Supervisor → Executor → Tasks
These fault-tolerance designs cover threads/tasks/jobs and nodes, NOT individual messages.
Slide 10
• Critical: message granularity (NOT thread/task/job/node)
• Need an efficient method, considering:
– Every component may fail
– Large topologies, continuously flooding messages
– Network temporarily unavailable, out-of-order traffic, …
– Minimized resource usage (network, CPU, memory)
Track processing status in DAG
[Figure: the same topology DAG as on slide 7.]
Slide 11
History of Apache Storm and lessons learned
– Nathan Marz, creator of Storm
Tough problem and Storm’s answer!
Slide 12
Storm's reliability tracking algorithm
[Figure: a 5-bolt topology. Bolt 0 consumes root message R and emits A, B, C; bolts 1-3 turn A→D, B→E, C→F; leaf bolt 4 consumes D, E, F. The Acker records srcNodeID: R, R.]

Acks sent to the Acker:
bolt 0: R ⊕ A ⊕ B ⊕ C
bolt 1: A ⊕ D
bolt 2: B ⊕ E
bolt 3: C ⊕ F
bolt 4: D ⊕ E ⊕ F

Status = R ⊕ (R ⊕ A ⊕ B ⊕ C) ⊕ (A ⊕ D) ⊕ (B ⊕ E) ⊕ (C ⊕ F) ⊕ (D ⊕ E ⊕ F) = 0

1. Each msg has an ID (an 8B random number)
2. For each input msg, each bolt computes XOR(inMsgID, outMsgID[])
3. Each bolt sends that per-input-msg XOR result to the Acker
4. The Acker XORs everything into a single 8B status (regardless of topology size)
5. Finally, within the timeout, Acker.status == 0 means OK; otherwise something failed (may false-alarm, but never misses)
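To make the XOR bookkeeping concrete, here is a toy replay of the example above in plain Java; the seven random longs stand in for Storm's 8-byte message IDs, and each "bolt" is just one XOR term folded into the Acker's status.

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy replay of Storm's XOR tracking for the 5-bolt example above.
public class XorAckDemo {
  public static void main(String[] args) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long R = rnd.nextLong(), A = rnd.nextLong(), B = rnd.nextLong(),
         C = rnd.nextLong(), D = rnd.nextLong(), E = rnd.nextLong(), F = rnd.nextLong();

    long status = R;          // Acker initialized with the root msg ID
    status ^= R ^ A ^ B ^ C;  // bolt 0: consumed R, emitted A, B, C
    status ^= A ^ D;          // bolt 1: consumed A, emitted D
    status ^= B ^ E;          // bolt 2
    status ^= C ^ F;          // bolt 3
    status ^= D ^ E ^ F;      // bolt 4 (leaf): consumed D, E, F
    System.out.println(status == 0);  // true: every ID is XORed in an even number of times

    // If bolt 1's ack is lost, A and D never cancel and the status stays non-zero:
    long failed = R ^ (R ^ A ^ B ^ C) ^ (B ^ E) ^ (C ^ F) ^ (D ^ E ^ F);
    System.out.println(failed == (A ^ D));  // true (and A ^ D != 0 unless IDs collide)
  }
}
```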
Slide 13
• RandomNum + XOR is the key foundation of Storm; it has run for 5+ years
– Smart, simple and pretty good!
– Least memory footprint at the Acker, regardless of topology
– Reliable*, regardless of Ack traffic order
– The XOR op obeys the commutative and associative laws
• Easy to handle any out-of-order delivery
Ingenious!
*: in theory, random IDs may collide
Slide 14
• Network traffic and CPU overhead → latency & throughput impact
– Possibility of random-number collisions
Limitations
[Chart: ~25,000 msg/sec with non-reliable processing vs. ~9,300 msg/sec with reliable processing.]
*3rd-party benchmark from 2012; things may have changed since
Slide 15
IS IT POSSIBLE? Ack only at the leaf?
[Figure: the same topology DAG as on slide 7, with acks drawn only at the leaf bolts.]
Current algorithm is fantastic, however
Slide 16
• Same-level guaranteed reliable processing
• More scalable, efficient and fast
– Much less Ack traffic; usually only at leaf nodes
– Same memory footprint, less CPU usage
– Eventually better latency/throughput
2 new proposed approaches
Currently in research & quick validation phase
Slide 17
• An evolution based on Random Num + XOR
Approach-1: fingerprint based
Currently: XOR over one (send, recv) pair yields 0.
Further: XOR over multiple pairs (2, 4, 6, …) still yields 0.
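A tiny self-contained demonstration of this property; the constants are arbitrary:

```java
// XOR over pairs cancels to 0 regardless of how many pairs there are
// and of the order in which they arrive.
public class XorPairs {
  public static void main(String[] args) {
    long x = 0x1234L, y = 0xBEEFL, z = 0xCAFEL;
    System.out.println((x ^ x) == 0);                  // one (send, recv) pair
    System.out.println((y ^ x ^ z ^ y ^ x ^ z) == 0);  // three pairs, shuffled order
  }
}
```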
Slide 18
• Fingerprint (FP): a digest (e.g., 8B) of {in msgs, out msgs, parent.fp} that encodes the processing context and is recursively passed down, so each downstream message inherits "genes" from all of its ancestors
– Still uses XOR of IDs; redundancy scales with the topology
– 3 rules: embedded, recursively inherited, append-only update
Approach-1: fingerprint idea
[Figure: node Ni receives msg <Mj, FPj>, computes the pass-down FP as InMsgID XOR [outMsgIDs], and embeds it in each emitted msg <Mj+1, FPj:i>, <Mj+2, FPj:i>, … sent to downstream nodes Ni+1, Ni+2, Ni+3.]
• Embedded: carried as part of message metadata
• Recursive-inherit: passed down to all downstream messages
• Append-update: via XOR
Slide 19
Fingerprint example
[Figure: the same 5-node topology. The Acker is initialized with srcNodeID: RootMsgID, R. Each bolt calculates its FP and embeds it in its output messages; the leaf's acks may be batched.]

FP0 = R ⊕ A ⊕ B ⊕ C
FP1 = FP0 ⊕ A ⊕ D
FP2 = FP0 ⊕ B ⊕ E
FP3 = FP0 ⊕ C ⊕ F

The leaf produces the only 3 Ack messages:
FP4-D = FP1 ⊕ D
FP4-E = FP2 ⊕ E
FP4-F = FP3 ⊕ F

Acker.status = R ⊕
(FP0 ⊕ A ⊕ D) ⊕ D ⊕
(FP0 ⊕ B ⊕ E) ⊕ E ⊕
(FP0 ⊕ C ⊕ F) ⊕ F = 0
Slide 20
Approach-1: failure example
[Figure: the same topology; the Acker is initialized with srcNodeID: RootMsgID, R.]

If msg D failed, node 4 only acks FP4-E and FP4-F, so finally:

Acker.status = R ⊕ FP4-E ⊕ FP4-F
= R ⊕ FP2 ⊕ E ⊕ FP3 ⊕ F
= R ⊕ (FP0 ⊕ B ⊕ E ⊕ E) ⊕ (FP0 ⊕ C ⊕ F ⊕ F)
= R ⊕ B ⊕ C != 0

→ The information about the A/D path is missing, due to the failure!
Another example: if all messages failed, the status is R != 0.
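A toy replay of the two fingerprint examples above in plain Java, assuming the same 5-node topology; it checks that leaf-only acks collapse to 0 on success and leave R ⊕ B ⊕ C when message D is lost.

```java
import java.util.concurrent.ThreadLocalRandom;

// Replay of the fingerprint examples: only the leaf acks, and the Acker
// still detects both the success case and the loss of message D.
public class FingerprintDemo {
  public static void main(String[] args) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long R = rnd.nextLong(), A = rnd.nextLong(), B = rnd.nextLong(),
         C = rnd.nextLong(), D = rnd.nextLong(), E = rnd.nextLong(), F = rnd.nextLong();

    long FP0 = R ^ A ^ B ^ C;   // node 0: in R, out A, B, C
    long FP1 = FP0 ^ A ^ D;     // node 1 inherits FP0, appends A -> D
    long FP2 = FP0 ^ B ^ E;     // node 2
    long FP3 = FP0 ^ C ^ F;     // node 3

    long FP4D = FP1 ^ D, FP4E = FP2 ^ E, FP4F = FP3 ^ F;  // leaf acks only

    long ok = R ^ FP4D ^ FP4E ^ FP4F;
    System.out.println(ok == 0);               // success: status collapses to 0

    long lostD = R ^ FP4E ^ FP4F;              // msg D never reached the leaf
    System.out.println(lostD == (R ^ B ^ C));  // leftover = R ^ B ^ C != 0
  }
}
```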
Slide 21
Approach-1: a complex example
[Figure: an 8-node topology. Node 1 consumes root message R and emits A, B, C; node 2: A→D; node 3: B→X; node 4: C→E; node 5 consumes D, X, E and emits F, G; node 6: F→H; node 7: G→I; node 8 is the leaf consuming H and I.]

Initial: R
FP1 = R ⊕ A ⊕ B ⊕ C
FP2 = FP1 ⊕ A ⊕ D
FP3 = FP1 ⊕ B ⊕ X
FP4 = FP1 ⊕ C ⊕ E
// node 5 also updates FP5 to the Acker, since it has an even number of downstreams (2)
FP5 = FP2 ⊕ D ⊕
FP3 ⊕ X ⊕
FP4 ⊕ E ⊕ (F ⊕ G)
FP6 = FP5 ⊕ F ⊕ H
FP7 = FP5 ⊕ G ⊕ I
// bolt 8 sends FP8 to the Acker
FP8 = FP6 ⊕ H ⊕ FP7 ⊕ I

Final Status = R ⊕ FP5 ⊕ FP8
= R ⊕ FP5 ⊕ (FP5 ⊕ F) ⊕ (FP5 ⊕ G)
= R ⊕ FP5 ⊕ (F ⊕ G)
= R ⊕ FP2 ⊕ D ⊕ FP3 ⊕ X ⊕ FP4 ⊕ E
= R ⊕ (FP1 ⊕ A ⊕ B ⊕ C)
= 0

Limits and notes: 1) the number of downstream msgs shall be odd (1, 3, 5, …); otherwise the bolt must send its new FP to the Acker, which XORs that FP into the status; 2) to implement this approach, a bolt ideally needs to know its total downstream count in order to generate the FP before emitting.
Slide 22
• For an input rootMsg, INIT a BIG SHARE (8B), EMBED it as metadata, and pass it down
• SPLIT the attached share (done by Storm) at each bolt, EMBED, and repeat until the leaves ...
• Only the leaves ACK the share they received to the Acker
• The Acker REDOes: it decreases by each reported share; finally 0 means OK, otherwise failure (toy sketch after the figure below)
– No random numbers (no collisions), no XOR; inline embedding; the split is transparent to the App
– +/- (mod) obeys the commutative & associative laws, resolving out-of-order issues
Approach-2: share split
[Figure: a 10-bolt topology. The Acker records srcNodeID: rootMsgID, BIG-Share = (A, 1, 100). Msg A carries share 100 and splits down the tree: B, 50 / C, 50; then D, 25 / E, 25 / I, 25 / J, 25; then F, 17 / G, 17 / H, 16 / K, 17 / L, 17 / M, 16. The leaves report (A, 0, 16) and (A, 0, 84), and 100 - 16 - 84 = 0.]
Analogy: IPO/stock shares; split or increase the share.
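A hedged sketch of the share-split bookkeeping, assuming the simplest policy (equal split with the remainder on the last message, fan-out 2, leaf-only acks); the tree shape and numbers are illustrative, not the prototype's code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy share-split run: the root message carries a big share, every hop
// splits it among its downstream messages, and only leaves report back.
// The Acker subtracts leaf reports from the initial share; 0 means done.
public class ShareSplitDemo {

  // Equal split with the remainder going to the last downstream message,
  // so no share is lost to integer division.
  static long[] split(long share, int fanout) {
    long[] parts = new long[fanout];
    long each = share / fanout;
    for (int i = 0; i < fanout; i++) parts[i] = each;
    parts[fanout - 1] += share - each * fanout;
    return parts;
  }

  public static void main(String[] args) {
    long bigShare = 100;               // INIT at the spout, recorded by the Acker
    long ackerBalance = bigShare;

    // Walk a depth-3 tree: each queued entry is {share, remaining depth}.
    Deque<long[]> queue = new ArrayDeque<>();
    queue.push(new long[]{bigShare, 3});
    while (!queue.isEmpty()) {
      long[] msg = queue.pop();
      if (msg[1] == 0) {               // leaf: ACK the share at hand
        ackerBalance -= msg[0];
        continue;
      }
      for (long part : split(msg[0], 2))    // non-leaf: split to 2 downstreams
        queue.push(new long[]{part, msg[1] - 1});
    }
    System.out.println(ackerBalance == 0);  // true: every share came back
  }
}
```

If any leaf's report were lost, the balance would stay positive past the timeout, signalling a failure exactly as a non-zero XOR status does.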
Slide 23
• Rare case: INCREASE the share if it is insufficient to split (and sync up the Acker)
• The Acker then ADDs the newly increased share (it does NOT decrease)
Approach-2: share split (cont'd)
[Figure: the same DAG plus the Acker, which records srcNodeID, RootMsgID, Share = (A, 100). B receives 99 and C receives only 1; C must emit to F, G, H, so it increases its share, syncs the Acker with (A, +99), and splits into F, 33 / G, 33 / H, 34.]
If S - S1 - S2 - … = Sn, then S - S1 - S2 - … - Sn = 0
(Acks may be batched.)
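The same toy model extended with the rare-case increase from this slide; the +99 increment and the 33/33/34 split mirror the figure, and the assumption that B's whole branch eventually reports 99 is for illustration only.

```java
// Rare case from the slide: a share of 1 cannot be split 3 ways, so the
// bolt increases it and syncs the Acker, which ADDs the increment.
public class ShareIncreaseDemo {
  public static void main(String[] args) {
    long acker = 100;            // Acker holds (A, 100) for the root message
    long boltB = 99, boltC = 1;  // B received 99, C received only 1

    // C must emit 3 downstream msgs but holds share 1: increase and sync.
    long inc = 99;
    acker += inc;                // Acker side: (A, +99), an ADD, not a decrease
    boltC += inc;                // C now holds 100, split as 33/33/34
    long[] fromC = {33, 33, 34};

    // Leaves report back; subtraction order doesn't matter (+/- commute).
    acker -= boltB;              // assume B's branch reports 99 in total
    for (long s : fromC) acker -= s;
    System.out.println(acker == 0);  // true: S - S1 - S2 - ... - Sn = 0
  }
}
```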
Slide 24
• Implemented Approach-2 (share-split)
• Integrated with Storm 1.0.1 (released in May 2016)
– Storm core (~200 LOC in Clojure, a LISP-like language) and Java APIs (~200 LOC, including some traces/tests)
• Implementation notes:
– Supports BasicBolt; removes the random number; re-uses some existing structures/APIs, e.g., Anchors-to-ids (now RootID:shareAttached) and Ack sending
– Globally pre-defined split share at all bolts (equal split)
• Next: a configurable split approach per bolt
– To split the share exactly, we built a 1-step delayed emit (see the sketch below):
• Pre-split the input share
• Once a new tuple is generated, internally queue it until the next tuple comes out
• Finally, explicitly call emitDone(); the last tuple then takes over all remaining share and is emitted
Prototyping
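A hedged sketch of the 1-step delayed emit described above; SPLIT_PARTS, the "wire" list and the class shape are invented for the illustration, and the real prototype hooks this logic into Storm's collectors instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the 1-step-delayed emit: hold each tuple until the next one
// arrives so its share can be finalized; emitDone() flushes the last
// tuple with ALL leftover share, so no share is ever stranded.
public class DelayedEmit {
  static final int SPLIT_PARTS = 4;   // hypothetical pre-defined global split count

  private final long inputShare;
  private long remainingShare;
  private String pendingTuple;                  // at most one queued outgoing tuple
  final List<String> wire = new ArrayList<>();  // stands in for the real emit path

  DelayedEmit(long inputShare) {
    this.inputShare = inputShare;
    this.remainingShare = inputShare;
  }

  void emit(String tuple) {
    if (pendingTuple != null) {                  // flush the previous tuple...
      long part = inputShare / SPLIT_PARTS;      // ...with its pre-split share
      remainingShare -= part;
      wire.add(pendingTuple + " share=" + part);
    }
    pendingTuple = tuple;                        // queue the new one
  }

  void emitDone() {                              // explicit end-of-output call
    if (pendingTuple != null)
      wire.add(pendingTuple + " share=" + remainingShare);  // takes all that's left
    pendingTuple = null;
  }

  public static void main(String[] args) {
    DelayedEmit bolt = new DelayedEmit(100);
    bolt.emit("t1"); bolt.emit("t2"); bolt.emit("t3");
    bolt.emitDone();
    bolt.wire.forEach(System.out::println);  // t1 and t2 get 25 each, t3 gets 50
  }
}
```

The one-tuple delay is the price of not knowing the fan-out in advance; emitDone() is the only App-visible change, matching the summary slide.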
Slide 25
• Function & performance
– Network traffic, CPU, latency/throughput
• Reference: IBM whitepaper (Storm vs. IBM InfoSphere), 7 layers
– We use Wikipedia as the data source; word processing
Benchmark
[Setup: 3 servers, each running Ubuntu 15.10 (kernel 4.2.0) and Storm 1.0.1 on an E5-2643 @ 3.40GHz (24 cores) with 256GB DRAM, connected by a 1000 Mbps network.]
Slide 26
• Function: inject errors and validate reliability detection: Pass
– Same-level reliability as the existing approach
• Performance: same HW/SW config and processing logic
– 16KB tuples, 100 pending, parallelism of 48 per bolt
– 4 workers & 12 Ackers per host
Result: function & performance
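The pending/worker/acker settings above map directly onto Storm's standard Config knobs; a sketch of how such a run might be configured (the class name is invented, and the slide states the worker/acker counts per host, whereas these setters apply per topology):

```java
import org.apache.storm.Config;

// How the benchmark settings above map onto standard Storm 1.x Config knobs.
public class BenchConfig {
  public static Config make() {
    Config conf = new Config();
    conf.setNumWorkers(4);         // 4 workers (per host on the slide)
    conf.setNumAckers(12);         // 12 Acker executors (per host on the slide)
    conf.setMaxSpoutPending(100);  // at most 100 pending tuples per spout task
    return conf;
  }
}
```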
Slide 27
• 1/3 Ack traffic, 18% faster, 9% less CPU
Test1: 3 layers
| Metric                  | Current | New  |
|-------------------------|---------|------|
| Ack traffic (millions)  | 3903    | 1301 |
| End-to-end latency (ms) | 241     | 197  |
| CPU (per Java worker)   | 350%    | 320% |
Slide 28
• 1/5 Ack traffic, 23% faster, 14% less CPU
Test2: 7 layers
| Metric                  | Current | New |
|-------------------------|---------|-----|
| Ack traffic (millions)  | 2685    | 537 |
| End-to-end latency (ms) | 197     | 151 |
| CPU (per Java worker)   | 250%    | 215% |
Slide 29
• Larger topologies? A quick test with 11 layers:
– 1/9 the Ack traffic
• Presumably, the larger the topology, the greater the gains
• Next
– Refine multi-Acker
– Implement the "Increase Share" operation
– A configurable split method per bolt
• So developers can specify the desired split rather than a fixed/global one
• Maybe integrate with Twitter Heron? Or apply to other areas?
– e.g., function call graphs? performance traces? (more…)
MORE
Slide 30
End-to-end IoT landscape
[Figure: an end-to-end IoT pipeline with continuous, scalable, real-time processing.]
Slide 31
• Lambda architecture: fusing "historical" + "new" data
– Proposed by Nathan Marz (~5 years ago): batch + streaming
– Widely adopted by many Internet companies
Unified data processing
Slide 32
• 2 innovative & inspiring streaming reliability algorithms
– Guaranteed processing with a minimized memory footprint
– More scalable, efficient & fast, and even beautiful
• Demonstrated in Storm
– 1/N Ack traffic, needed only at leaf nodes
• N is the topology depth; usually only a few leaves exist (for aggregation, DB saving, etc.)
• Meanwhile: 23% faster, 14% less CPU
– Transparent to the App, except the final explicit emitDone() call
• Applying to other interesting areas...
– Distributed replication, transactions, exact state tracking, …
SUMMARY
Slide 33
• Feedback or comments? Talk with us!
– Any flaws, constraints, or room to improve?
– Then discuss with the Storm community; code can be shared if needed
Junping.Zhao@emc.com ZhaoJP@gmail.com
THANK YOU!
Editor's Notes
• #3: Any official disclaimer?
• #5: May also be known as Complex Event Processing (CEP)
• #7: Trident: an abstraction on top of Storm. Besides providing higher-level, Cascading-like constructs, it batches groups of tuples to 1) make reasoning about processing easier and 2) encourage efficient data persistence, with an API that can provide exactly-once semantics for some cases. Heron: built since 2014, paper in 2015, open-sourced in May 2016 (http://twitter.github.io/heron/). API-compatible with Apache Storm, hence no code changes. "One of our primary requirements for Heron was ease of debugging and profiling"; also scheduling and optimal resource utilization (IPC layer, simplification). Flink: based on distributed checkpoints; see "Lightweight Asynchronous Snapshots for Distributed Dataflows" (ABS: Asynchronous Barrier Snapshotting, http://arxiv.org/abs/1506.08603), a variation of the Chandy-Lamport algorithm (1985). It periodically draws state snapshots of a running stream topology and stores these snapshots to durable storage. Similar to the micro-batching approach, in which all computations between two checkpoints either succeed or fail atomically as a whole. However, the similarities stop there. One great feature of Chandy-Lamport is that we never have to press the "pause" button in stream processing to schedule the next micro-batch: regular data processing keeps going, processing events as they come, while checkpoints happen in the background.
• #8: If a failure is detected, Storm can re-do from the beginning (Storm doesn't checkpoint); this is usually fast, at the ms level. Spark can re-do from the most recent checkpoint (a performance impact).
• #10: Task failed: restarted by the supervisor daemon. Supervisor/worker node failed: handled by ZK restart/re-scheduling. Master failed: handled by ZK; new tasks can't be submitted, but existing tasks should be OK. Redo (re-compute): no log/replica, for high-performance or real-time processing.
• #12: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html
• #13: It doesn't care which component failed. Once a failure is detected after a timeout (30 sec), the App should not commit the message to the data source (e.g., Kafka), so Kafka never removes that data; the App can then re-send the message and re-run the topology.
• #14: Random IDs; every bolt must send an Ack message.
• #15: Another benchmark is IBM's. IBM InfoSphere vs. Storm: https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
• #20: In practice there is a challenge to implementing this approach. Ideally: 1) we need to know how many downstream msgs will be generated, then allocate enough random IDs and calculate the FP; 2) for each downstream msg, embed the FP and emit it downstream. However, for much (maybe not all) processing logic, the total downstream msg count is probably not known beforehand (at step 1) until the logic has executed.
• #21: Same note as #20.
• #22: Same note as #20.
• #23: For example, the initial share is 100 at the Acker. Embed the share into the msg and pass it down to the downstream nodes. A source msg (root msg) is ingested at the root node (spout), which inits the BIG SHARE as the initial status and embeds the SHARE as part of the metadata. Run the topology: each node executes its pre-defined logic and also extracts the share and splits it across the downstream outputMsgs. Finally, the leaf nodes extract and report the received share to the Acker. The Acker then decreases the share: 100 - 16 - 84 = 0, and 0 means OK.
• #24: We may pre-define some rule for increases, e.g., always increase by 7B, so the Acker could use one bit to indicate an increase. A similar but different algorithm is Huang's algorithm: both use a number as a weight or share and involve a split op, but the problem area, prerequisites and algorithm steps are very different. Huang's target is more related to process (task/bolt) state, while our target is the continuously flowing messages running on those tasks. A few points, feel free to comment: Problem area: in Huang's context, a distributed task consists of different processes, each either active (and possibly going idle at any time) or idle (idle-to-active is only triggered by some msg). Huang's goal is to detect when *all processes* in the system become idle; our goal is to track the status of each message running on those tasks, usually in relation to partial failures (and we don't care which task failed or is unavailable). Prerequisites: importantly, the idle state (Huang's monitored state) is *explicitly known* by the process itself; hence his step "Upon becoming idle, a process sends a message…". In our case, a message failure/exception is hard for the component itself to know, typically due to network partitions/timeouts etc., so it must be detected by other components or a specially designed state, which adds extra challenges. Algorithm: the steps are different; our method always splits the number while flowing through the DAG, and the Acker essentially redoes the split op based on the received shares and makes sure the redo result is 0. In general, Huang's research targets processes (tens or hundreds) rather than continuously flowing messages (billions, never stopping). In practice, distributed process states are currently managed by Zookeeper (or Raft etc.), based on the Paxos algorithm, published in 1990 but widely understood and adopted only after 2001 (after Lamport's second paper explaining Paxos, and Google's validation).
• #25: A few important points in the implementation: 1. Re-use the existing Anchors-to-ids map to embed the share on emit (so no extra traffic); previously it was [RootId -> tupleID], now it is [RootID -> shareAssigned]. 2. To split the pass-down share, we need to know beforehand how many downstream outMsgs will be generated (usually hard to predict). To resolve that, we built 1-step deferred processing: 1) statically split the input share into sub-shares; 2) assign and embed, prepare to emit; 3) internally queue the current outMsg and send the previous msg; 4) emit the last msg with a new API, so the last outMsg takes over all the remaining share. With this implementation we introduce a small delay, but it is acceptable. 3. How to split the share is also important. Right now it is simply a pre-defined split method, i.e., all bolts use a pre-defined split count (could be 1 ~ 4096 or larger); in the future it shall be configurable per bolt by the developer (who presumably knows more about the topology), e.g., bolt1 may split up to 128, bolt2 up to 256, etc. An improper split may cause pressure to run out of IDs, requiring a share increase; it still depends on the topology size.
• #26: IBM InfoSphere vs. Storm: https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
• #27: Various topologies, such as top-down, bolts with multiple inputs, multiple spouts, …
• #34: CLJ source files: storm-core/src/clj/org/apache/storm/daemon/acker.clj, executor.clj; storm-core/src/clj/org/apache/storm/util.clj. Java source files: storm-core/src/jvm/org/apache/storm/topology/BasicOutputCollector.java, BasicBoltExecutor.java; storm-core/src/jvm/org/apache/storm/coordination/CoordinatedBolt.java; storm-core/src/jvm/org/apache/storm/task/IOutputCollector.java, OutputCollector.java; storm-core/src/jvm/org/apache/storm/trident/topology/TridentBoltExecutor.java