1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Generation Execution Engine
for Apache Storm
Roshan Naik, Hortonworks
Hadoop Summit, Dataworks Summit
Jun 13th 2017, San Jose
Present : Storm 1.x
 Has matured into a stable and reliable system
 Widely deployed and holding up well in production
 Scales well horizontally
 Lots of new competition
– Differentiating on Features, Performance, Ease of Use etc.
Storm 2.x
 High performance execution engine
 All Java code (transitioning away from Clojure)
 Improved Backpressure, Metrics subsystems
 Beam integration, Bounded spouts
 Scheduling Hints, Elasticity
Performance
Use Cases - Latency centric
 100ms+ : Factory automation
 10ms - 100ms : Real time gaming, scoring shopping carts to print coupons
 0-10 ms : Network threat detection
 Java based High Frequency Trading systems
– fast: under 100 micro-secs 90% of time, no GC during the trading hours
– medium: under 1ms 95% of time, and rare minor GC
– slow: under 10 ms 99 or 99.9% of time, minor GC every few mins
– Cost of being slow
• Better to turn it off than lose money by leaving it running
Performance in 2.0
 How do we know if a streaming system is “fast”?
– Faster than another system ?
– What about Hardware potential ?
• More on this later
 Dimensions
– Throughput
– Latency
– Resource utilization: CPU/Network/Memory/Disk/Power
Execution Engine - Planned Enhancements for
 Umbrella Jira : STORM-2284
– https://issues.apache.org/jira/browse/STORM-2284
Areas critical to Performance
 Messaging System
– Need Bounded Concurrent Queues that operate as fast as hardware allows
– Lock based queues not an option
– Lock free queues or preferably Wait-free queues
 Threading Model
– Fewer Threads. Less synchronization.
– Dedicated threads instead of pooled threads.
– CPU Pinning.
 Memory Model
– Lowering GC Pressure: Recycling Objects in critical path.
– Reducing CPU cache faults: Controlling Object Layout (contiguous allocation).
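The lock-free/wait-free queue requirement above can be made concrete with a minimal Lamport-style bounded single-producer/single-consumer ring buffer. This is an illustrative sketch, not Storm's actual queue (Storm 2.x relies on JCTools); the class and method names are invented. Both ends publish with lazySet, an ordered store that is cheaper than a CAS or a lock, which is what makes each operation wait-free for its single caller.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal bounded single-producer / single-consumer queue (illustrative only).
// offer() and poll() never spin on a CAS and never block, so each call
// completes in a bounded number of steps.
class SpscQueue<T> {
    private final AtomicReferenceArray<T> buffer;
    private final int mask;                           // capacity must be a power of 2
    private final AtomicLong head = new AtomicLong(); // next slot to consume
    private final AtomicLong tail = new AtomicLong(); // next slot to fill

    SpscQueue(int capacityPow2) {
        buffer = new AtomicReferenceArray<>(capacityPow2);
        mask = capacityPow2 - 1;
    }

    boolean offer(T e) {                 // producer thread only
        long t = tail.get();
        if (t - head.get() == buffer.length()) return false; // full
        buffer.lazySet((int) t & mask, e);
        tail.lazySet(t + 1);             // ordered store publishes the element
        return true;
    }

    T poll() {                           // consumer thread only
        long h = head.get();
        if (h == tail.get()) return null;             // empty
        int idx = (int) h & mask;
        T e = buffer.get(idx);
        buffer.lazySet(idx, null);       // let the element be GC'd
        head.lazySet(h + 1);
        return e;
    }

    // Push 100k ints through the queue from a second thread and sum them.
    static long demo() {
        SpscQueue<Integer> q = new SpscQueue<>(1024);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 100_000; ) if (q.offer(i)) i++;
        });
        producer.start();
        long sum = 0;
        for (int received = 0; received < 100_000; ) {
            Integer v = q.poll();
            if (v != null) { sum += v; received++; }
        }
        try { producer.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return sum; // 0 + 1 + ... + 99999
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

JCTools generalizes this pattern to multi-producer variants (e.g. its MPSC queues) that hold up under contention, which is what makes it attractive for the redesigned messaging layer.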
Messaging Subsystem
(STORM-2307)
Understanding “Fast”
Published streaming-engine throughput:

  Component                 Setup             Throughput (mill/sec)
  AKKA                      90-100 threads    50
  Flink                     per core          1.5
  Apex v3.0                 container local   4.3
  Gear Pump                 4 nodes           18
  InfoSphere Streams v3.0   -                 -

Huge Gap!

Raw queue throughput:

  Queue                              Setup            Throughput (mill/sec)
  ArrayDeque (not thread safe)       1 thread rd+wr   1063
  ArrayBlockingQueue (lock based)    1 thd rd+wr      30
                                     1 Prod, 1 Cons   4
  Disruptor 3.3.x
    (SleepingWaitStrategy,
     ProducerMode=MULTI)             1P, 1C           25
  FastQ (lazySet)                    1P, 1C           31
  JC Tools MPSC                      1P, 1C           74
                                     2P               59
                                     3P               43
                                     4P               40
                                     6P               56
                                     8P               65
                                     10P              66
                                     15P              68
                                     20P              68
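The single-queue rows above come from micro-benchmarks of this general shape. This is a rough sketch (QueueBench is a made-up harness, absolute numbers depend entirely on the hardware, and a serious measurement would use JMH with warmup); it measures the lock-based ArrayBlockingQueue case, and other BlockingQueue implementations can be swapped in for comparison.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough 1-producer / 1-consumer throughput harness (illustrative only).
class QueueBench {
    static long run(BlockingQueue<Integer> q, int n) {
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) q.put(i);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        long start = System.nanoTime();
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += q.take();
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%s: %.1f K msgs/sec%n",
                q.getClass().getSimpleName(), n / secs / 1e3);
        return sum; // checksum so the JIT cannot remove the loop
    }

    public static void main(String[] args) {
        run(new ArrayBlockingQueue<>(1024), 1_000_000); // the lock-based row
    }
}
```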
Messaging - Current Architecture
[Diagram: Worker Process - High Level View. Inbound: Network → Worker Recv Thread → Recv Q → each Bolt/Spout Executor (Recv Q → Executor Thread running user logic → Send Q → executor Send Thread). Outbound: executor Send Threads → Worker Send Thread's Send Q → Network.]
Bolt/Spout Executor - Detailed
[Diagram: each executor's RECEIVE Q and SEND Q is a Disruptor queue fronted by a batcher (one per publisher): tuples accumulate in an ArrayList holding the current batch, with a CLQ (ConcurrentLinkedQueue) as overflow, and a Flusher thread moves completed batches into the Disruptor queue. The Bolt Executor thread (user logic) publishes output through its SEND Q, from which the Send thread routes ArrayList batches either to a local executor's RECEIVE Q (local) or to the Worker's outbound queue (remote).]
New Architecture
[Diagram: identical to the detailed Bolt/Spout Executor view on the previous slide, repeated as the baseline for the redesign.]
Messaging - New Architecture
(STORM-2306)
[Diagram: the Bolt Executor thread (user logic) now publishes through a single batcher (ArrayList current batch) into a JCTools-based RECEIVE Q; outgoing messages are grouped by destination ID and delivered either to the local executor's RECEIVE Q (local) or to the Worker's outbound queue (remote). The per-executor send queue, overflow CLQ, and flusher/send threads of the old design no longer appear.]
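The batcher that appears in these architecture diagrams amounts to very little code: buffer tuples in a plain ArrayList (no synchronization in the hot path) and hand off the whole list at once, amortizing the cost of the concurrent queue across a batch. A simplified sketch (Batcher is an invented class, not Storm's; the real one also spills to an overflow list and is flushed on a timer):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified per-publisher batcher (illustrative only).
class Batcher<T> {
    private final Queue<List<T>> downstream; // stands in for the receive queue
    private final int batchSize;
    private ArrayList<T> current;

    Batcher(Queue<List<T>> downstream, int batchSize) {
        this.downstream = downstream;
        this.batchSize = batchSize;
        this.current = new ArrayList<>(batchSize);
    }

    void publish(T tuple) {              // called from the executor thread
        current.add(tuple);
        if (current.size() >= batchSize) flush();
    }

    void flush() {                       // in Storm, also driven by a timer
        if (current.isEmpty()) return;
        downstream.add(current);         // hand off the whole batch at once
        current = new ArrayList<>(batchSize);
    }

    static int demo() {
        Queue<List<Integer>> q = new ArrayDeque<>();
        Batcher<Integer> b = new Batcher<>(q, 4);
        for (int i = 0; i < 10; i++) b.publish(i);
        b.flush();                       // push out the partial last batch
        return q.size();                 // batches of 4, 4 and 2 tuples
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```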
Preliminary Numbers
LATENCY
 1 spout --> 1 bolt with 1 ACKer (all in same worker)
– v1.0.1 : 3.4 milliseconds
– v2.0 master: 7 milliseconds
– v2.0 redesigned : 60-100 microseconds (116x improvement)
Preliminary Numbers
THROUGHPUT
 1 spout --> 1 bolt [w/o ACKing]
– v1.0.1 : ?
– v2.0 master: 3.3 million /sec
– v2.0 redesigned : 5 million /sec (50% improvement)
 1 spout --> 1 bolt [with ACKing]
– v1.0 : 233 K/sec
– v2.0 master: 900 K/sec
– v2.0 redesigned : 1 million/sec (not much change – but why?)
Observations
 Latency: Dramatically improved.
 Throughput: Discovered multiple bottlenecks preventing significantly higher
throughput.
– Grouping: bottlenecks in LocalShuffle & FieldsGrouping; if these are addressed along with some others, throughput can reach ~7 million/sec.
– TupleImpl: if inefficiencies here are addressed, throughput can reach ~15 mill/sec.
– ACK-ing: the ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of the implementation, not the concept; there is room for ACKer-specific fixes that can substantially improve its throughput.
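For context on that ACKer ceiling: Storm tracks an entire tuple tree in constant space by XOR-ing random 64-bit tuple ids into a per-tree checksum, once when a tuple is anchored and once when it is acked, so the checksum returns to zero exactly when the tree is complete. A sketch of just that core idea (AckerSketch is invented; the real ACKer is a bolt that receives these updates as messages):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Core of Storm's acking trick (illustrative only): x ^ x == 0, so XOR-ing
// every tuple id in twice - on anchor and on ack - drives the per-tree
// checksum back to zero exactly when every tuple has been acked.
class AckerSketch {
    private final Map<Long, Long> pending = new HashMap<>(); // rootId -> checksum

    void anchor(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    boolean ack(long rootId, long tupleId) {
        long v = pending.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (v != 0) return false;        // tuples still outstanding
        pending.remove(rootId);
        return true;                     // whole tree acked
    }

    static boolean demo() {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random(42);     // fixed seed keeps the demo deterministic
        long root = rnd.nextLong();
        long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
        acker.anchor(root, a); acker.anchor(root, b); acker.anchor(root, c);
        boolean early = acker.ack(root, a) || acker.ack(root, b); // still pending
        return !early && acker.ack(root, c); // last ack completes the tree
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```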
Threading & Execution Model
(STORM-2307)
New Threading & Execution Model
[Diagram: the Worker Process starts/stops/monitors executors, manages metrics, handles topology reconfiguration, and heartbeats. Each Executor is a single thread with its own queue and counters: system-task executors handle inter-host input, intra-host input, and outbound messages, while regular executors run a grouper plus a Spout/Bolt task.]
CPU Pinning
(STORM-2313)
CPU cache access
 Approximate access costs
– L1 cache : 1x
– L2 cache : 2.5x
– Local L3 cache : 10-20x
– Remote L3 cache: 25-75x
CPU Affinity
 For inter-thread communication
– cache fault distance matters
– Faster between cores on same socket
• 20% latency hit when threads pinned to diff sockets
 Pinning threads to CPUs
– If done right, minimizes cache fault distance
– Threads that migrate between cores must re-warm their caches
– Unrelated threads running on the same core thrash each other's caches
 Helps perf on NUMA machines
– Pinning long running tasks reduces NUMA effects
– NUMA aware allocator introduced in Java SE 6u2
CPU Pinning Strategy
 1 thd per physical core
 Try to fit subsequent executor threads on same socket
 Logical cores – i.e. Hyperthreading?
– Avoid hyperthreading – threads sharing a physical core thrash each other's cache
– Could provide it as an option in future?
Memory Management
Memory Management
Can be decomposed into 2 key areas
– Object Recycling - in critical path
• Avoids dynamic allocation cost
• Minimizes stop-the-world GC pauses
– Contiguous allocation: arrays, data members.
• CPU likes it.
• Pre-fetch friendly.
• Fewer cache faults per object.
• Natural in C++, very painful in Java.
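Object recycling of the kind described above can be as simple as a free list in front of the allocator. A minimal sketch (RecyclingPool is an invented name; a production pool would also bound its size, reset object state, and cope with multi-threaded acquire/release):

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal object pool (illustrative only): objects used on the critical path
// are returned to a free list and reused instead of being left for the GC,
// avoiding both the allocation cost and the resulting GC pressure.
class RecyclingPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int allocations = 0;         // counts objects actually created

    RecyclingPool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = free.poll();
        if (obj != null) return obj;     // fast path: reuse a recycled object
        allocations++;
        return factory.get();            // slow path: real allocation
    }

    void release(T obj) { free.push(obj); } // caller must reset object state

    static int demo() {
        RecyclingPool<byte[]> pool = new RecyclingPool<>(() -> new byte[1024]);
        for (int i = 0; i < 1_000; i++) {
            byte[] buf = pool.acquire(); // hot loop: 1000 acquire/release cycles
            pool.release(buf);
        }
        return pool.allocations;         // only one real allocation happened
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```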
Scheduling & Elasticity
(STORM-2309)
Elasticity
 Stretching / Shrinking
– Changes Worker/Executor counts
 Current parallelism hints not good enough
 Need a better way for users to specify concurrency that enables stretching/shrinking
Topology Planning / Scheduling
(STORM-2309)
 Problem: How to line up the tasks within and across workers for optimal execution
– Lower level issue than Resource Aware Scheduling
 What is optimal ?
– Best Performance – without regard for hardware/energy utilization
– Resourceful hardware utilization – trade away the last 10-20% of perf for lower energy consumption.
 Enable user to decide what is optimal for them.
– Scheduling hints
– Allow elasticity
Scheduling Hints
 Parallelism hints
– Per worker, host, (rack), global counts
– Min and max settings
– Supervisor could have rack hints
 Distribution
– Compact packing (default)
• Pack the Worker to its max
– In order of appearance in topology definition
• Then pack host, (then rack), then cluster
– Loose packing
• Pack the Worker to the min
• Then host, then …
• Left over resources are spread out in a similar fashion
Scheduling Hints
 If TaskA --emits--> TaskB, A & B could be running on:
• Same Thread
• Same Worker different thread
• Same Host different Worker
• Different Host: Shuffle/other
 Locality Control:
– Clustering: Co-locating
– Partitioning: Avoid colocation
 Specify via arguments to groupings ?:
– shuffle(threadLocal)
– fieldsGrouping(nodeLocal)
Thank You
Questions ?
References
https://issues.apache.org/jira/browse/STORM-2284