1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Generation Execution Engine
for Apache Storm
Roshan Naik, Hortonworks
Hadoop Summit, Dataworks Summit
Jun 13th 2017, San Jose
Present : Storm 1.x
 Has matured into a stable and reliable system
 Widely deployed and holding up well in production
 Scales well horizontally
 Lots of new competition
– Differentiating on Features, Performance, Ease of Use etc.
Storm 2.x
 High performance execution engine
 All Java code (transitioning away from Clojure)
 Improved Backpressure, Metrics subsystems
 Beam integration, Bounded spouts
 Scheduling Hints, Elasticity
Performance
Use Cases - Latency centric
 100ms+ : Factory automation
 10ms - 100ms : Real time gaming, scoring shopping carts to print coupons
 0-10 ms : Network threat detection
 Java based High Frequency Trading systems
– fast: under 100 micro-secs 90% of time, no GC during the trading hours
– medium: under 1ms 95% of time, and rare minor GC
– slow: under 10 ms 99 or 99.9% of time, minor GC every few mins
– Cost of being slow
• Better to turn it off than lose money by leaving it running
Performance in 2.0
 How do we know if a streaming system is “fast”?
– Faster than another system ?
– What about Hardware potential ?
• More on this later
 Dimensions
– Throughput
– Latency
– Resource utilization: CPU/Network/Memory/Disk/Power
Execution Engine - Planned Enhancements for
 Umbrella Jira : STORM-2284
– https://issues.apache.org/jira/browse/STORM-2284
Areas critical to Performance
 Messaging System
– Need Bounded Concurrent Queues that operate as fast as hardware allows
– Lock based queues not an option
– Lock free queues or preferably Wait-free queues
 Threading Model
– Fewer Threads. Less synchronization.
– Dedicated threads instead of pooled threads.
– CPU Pinning.
 Memory Model
– Lowering GC Pressure: Recycling Objects in critical path.
– Reducing CPU cache faults: Controlling Object Layout (contiguous allocation).
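The lock-free/wait-free queue requirement above can be made concrete with a minimal Lamport-style bounded single-producer/single-consumer ring buffer. This is an illustrative sketch, not Storm's actual queue (Storm 2.x relies on JCTools); the class and method names are invented. Both ends publish with lazySet, an ordered store that is cheaper than a CAS or a lock, which is what makes each operation wait-free for its single caller.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal bounded single-producer / single-consumer queue (illustrative only).
// offer() and poll() never spin on a CAS and never block, so each call
// completes in a bounded number of steps.
class SpscQueue<T> {
    private final AtomicReferenceArray<T> buffer;
    private final int mask;                           // capacity must be a power of 2
    private final AtomicLong head = new AtomicLong(); // next slot to consume
    private final AtomicLong tail = new AtomicLong(); // next slot to fill

    SpscQueue(int capacityPow2) {
        buffer = new AtomicReferenceArray<>(capacityPow2);
        mask = capacityPow2 - 1;
    }

    boolean offer(T e) {                 // producer thread only
        long t = tail.get();
        if (t - head.get() == buffer.length()) return false; // full
        buffer.lazySet((int) t & mask, e);
        tail.lazySet(t + 1);             // ordered store publishes the element
        return true;
    }

    T poll() {                           // consumer thread only
        long h = head.get();
        if (h == tail.get()) return null;             // empty
        int idx = (int) h & mask;
        T e = buffer.get(idx);
        buffer.lazySet(idx, null);       // let the element be GC'd
        head.lazySet(h + 1);
        return e;
    }

    // Push 100k ints through the queue from a second thread and sum them.
    static long demo() {
        SpscQueue<Integer> q = new SpscQueue<>(1024);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 100_000; ) if (q.offer(i)) i++;
        });
        producer.start();
        long sum = 0;
        for (int received = 0; received < 100_000; ) {
            Integer v = q.poll();
            if (v != null) { sum += v; received++; }
        }
        try { producer.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return sum; // 0 + 1 + ... + 99999
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

JCTools generalizes this pattern to multi-producer variants (e.g. its MPSC queues) that hold up under contention, which is what makes it attractive for the redesigned messaging layer.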
Messaging Subsystem
(STORM-2307)
Understanding “Fast”
Published streaming-engine throughput:

  Component                 Setup             Throughput (mill/sec)
  AKKA                      90-100 threads    50
  Flink                     per core          1.5
  Apex v3.0                 container local   4.3
  Gear Pump                 4 nodes           18
  InfoSphere Streams v3.0   -                 -

Huge Gap!

Raw queue throughput:

  Queue                              Setup            Throughput (mill/sec)
  ArrayDeque (not thread safe)       1 thread rd+wr   1063
  ArrayBlockingQueue (lock based)    1 thd rd+wr      30
                                     1 Prod, 1 Cons   4
  Disruptor 3.3.x
    (SleepingWaitStrategy,
     ProducerMode=MULTI)             1P, 1C           25
  FastQ (lazySet)                    1P, 1C           31
  JC Tools MPSC                      1P, 1C           74
                                     2P               59
                                     3P               43
                                     4P               40
                                     6P               56
                                     8P               65
                                     10P              66
                                     15P              68
                                     20P              68
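The single-queue rows above come from micro-benchmarks of this general shape. This is a rough sketch (QueueBench is a made-up harness, absolute numbers depend entirely on the hardware, and a serious measurement would use JMH with warmup); it measures the lock-based ArrayBlockingQueue case, and other BlockingQueue implementations can be swapped in for comparison.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough 1-producer / 1-consumer throughput harness (illustrative only).
class QueueBench {
    static long run(BlockingQueue<Integer> q, int n) {
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) q.put(i);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        long start = System.nanoTime();
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += q.take();
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%s: %.1f K msgs/sec%n",
                q.getClass().getSimpleName(), n / secs / 1e3);
        return sum; // checksum so the JIT cannot remove the loop
    }

    public static void main(String[] args) {
        run(new ArrayBlockingQueue<>(1024), 1_000_000); // the lock-based row
    }
}
```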
Messaging - Current Architecture
[Diagram: Worker Process - High Level View. Inbound: Network → Worker Recv Thread → Recv Q → each Bolt/Spout Executor (Recv Q → Executor Thread running user logic → Send Q → executor Send Thread). Outbound: executor Send Threads → Worker Send Thread's Send Q → Network.]
Bolt/Spout Executor - Detailed
[Diagram: each executor's RECEIVE Q and SEND Q is a Disruptor queue fronted by a batcher (one per publisher): tuples accumulate in an ArrayList holding the current batch, with a CLQ (ConcurrentLinkedQueue) as overflow, and a Flusher thread moves completed batches into the Disruptor queue. The Bolt Executor thread (user logic) publishes output through its SEND Q, from which the Send thread routes ArrayList batches either to a local executor's RECEIVE Q (local) or to the Worker's outbound queue (remote).]
New Architecture
[Diagram: identical to the detailed Bolt/Spout Executor view on the previous slide, repeated as the baseline for the redesign.]
Messaging - New Architecture
(STORM-2306)
[Diagram: the Bolt Executor thread (user logic) now publishes through a single batcher (ArrayList current batch) into a JCTools-based RECEIVE Q; outgoing messages are grouped by destination ID and delivered either to the local executor's RECEIVE Q (local) or to the Worker's outbound queue (remote). The per-executor send queue, overflow CLQ, and flusher/send threads of the old design no longer appear.]
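The batcher that appears in these architecture diagrams amounts to very little code: buffer tuples in a plain ArrayList (no synchronization in the hot path) and hand off the whole list at once, amortizing the cost of the concurrent queue across a batch. A simplified sketch (Batcher is an invented class, not Storm's; the real one also spills to an overflow list and is flushed on a timer):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified per-publisher batcher (illustrative only).
class Batcher<T> {
    private final Queue<List<T>> downstream; // stands in for the receive queue
    private final int batchSize;
    private ArrayList<T> current;

    Batcher(Queue<List<T>> downstream, int batchSize) {
        this.downstream = downstream;
        this.batchSize = batchSize;
        this.current = new ArrayList<>(batchSize);
    }

    void publish(T tuple) {              // called from the executor thread
        current.add(tuple);
        if (current.size() >= batchSize) flush();
    }

    void flush() {                       // in Storm, also driven by a timer
        if (current.isEmpty()) return;
        downstream.add(current);         // hand off the whole batch at once
        current = new ArrayList<>(batchSize);
    }

    static int demo() {
        Queue<List<Integer>> q = new ArrayDeque<>();
        Batcher<Integer> b = new Batcher<>(q, 4);
        for (int i = 0; i < 10; i++) b.publish(i);
        b.flush();                       // push out the partial last batch
        return q.size();                 // batches of 4, 4 and 2 tuples
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```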
Preliminary Numbers
LATENCY
 1 spout --> 1 bolt with 1 ACKer (all in same worker)
– v1.0.1 : 3.4 milliseconds
– v2.0 master: 7 milliseconds
– v2.0 redesigned : 60-100 microseconds (116x improvement)
Preliminary Numbers
THROUGHPUT
 1 spout --> 1 bolt [w/o ACKing]
– v1.0.1 : ?
– v2.0 master: 3.3 million /sec
– v2.0 redesigned : 5 million /sec (50% improvement)
 1 spout --> 1 bolt [with ACKing]
– v1.0 : 233 K/sec
– v2.0 master: 900 K/sec
– v2.0 redesigned : 1 million/sec (not much change – but why?)
Observations
 Latency: Dramatically improved.
 Throughput: Discovered multiple bottlenecks preventing significantly higher
throughput.
– Grouping: bottlenecks in LocalShuffle & FieldsGrouping; if these are addressed along with some others, throughput can reach ~7 million/sec.
– TupleImpl: if inefficiencies here are addressed, throughput can reach ~15 mill/sec.
– ACK-ing: the ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of the implementation, not the concept; there is room for ACKer-specific fixes that can substantially improve its throughput.
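For context on that ACKer ceiling: Storm tracks an entire tuple tree in constant space by XOR-ing random 64-bit tuple ids into a per-tree checksum, once when a tuple is anchored and once when it is acked, so the checksum returns to zero exactly when the tree is complete. A sketch of just that core idea (AckerSketch is invented; the real ACKer is a bolt that receives these updates as messages):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Core of Storm's acking trick (illustrative only): x ^ x == 0, so XOR-ing
// every tuple id in twice - on anchor and on ack - drives the per-tree
// checksum back to zero exactly when every tuple has been acked.
class AckerSketch {
    private final Map<Long, Long> pending = new HashMap<>(); // rootId -> checksum

    void anchor(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    boolean ack(long rootId, long tupleId) {
        long v = pending.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (v != 0) return false;        // tuples still outstanding
        pending.remove(rootId);
        return true;                     // whole tree acked
    }

    static boolean demo() {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random(42);     // fixed seed keeps the demo deterministic
        long root = rnd.nextLong();
        long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
        acker.anchor(root, a); acker.anchor(root, b); acker.anchor(root, c);
        boolean early = acker.ack(root, a) || acker.ack(root, b); // still pending
        return !early && acker.ack(root, c); // last ack completes the tree
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```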
Threading & Execution Model
(STORM-2307)
New Threading & Execution Model
[Diagram: the Worker Process starts/stops/monitors executors, manages metrics, handles topology reconfiguration, and heartbeats. Each Executor is a single thread with its own queue and counters: system-task executors handle inter-host input, intra-host input, and outbound messages, while regular executors run a grouper plus a Spout/Bolt task.]
CPU Pinning
(STORM-2313)
CPU cache access
 Approximate access costs
– L1 cache : 1x
– L2 cache : 2.5x
– Local L3 cache : 10-20x
– Remote L3 cache: 25-75x
CPU Affinity
 For inter-thread communication
– cache fault distance matters
– Faster between cores on same socket
• 20% latency hit when threads pinned to diff sockets
 Pinning threads to CPUs
– If done right, minimizes cache fault distance
– Threads that migrate between cores must re-warm their caches
– Unrelated threads running on the same core thrash each other's caches
 Helps perf on NUMA machines
– Pinning long running tasks reduces NUMA effects
– NUMA aware allocator introduced in Java SE 6u2
CPU Pinning Strategy
 1 thd per physical core
 Try to fit subsequent executor threads on same socket
 Logical cores – i.e. Hyperthreading?
– Avoid hyperthreading – threads sharing a physical core thrash each other's cache
– Could provide it as an option in future?
Memory Management
Memory Management
Can be decomposed into 2 key areas
– Object Recycling - in critical path
• Avoids dynamic allocation cost
• Minimizes stop-the-world GC pauses
– Contiguous allocation: arrays, data members.
• CPU likes it.
• Pre-fetch friendly.
• Fewer cache faults per object.
• Natural in C++, very painful in Java.
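Object recycling of the kind described above can be as simple as a free list in front of the allocator. A minimal sketch (RecyclingPool is an invented name; a production pool would also bound its size, reset object state, and cope with multi-threaded acquire/release):

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal object pool (illustrative only): objects used on the critical path
// are returned to a free list and reused instead of being left for the GC,
// avoiding both the allocation cost and the resulting GC pressure.
class RecyclingPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int allocations = 0;         // counts objects actually created

    RecyclingPool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = free.poll();
        if (obj != null) return obj;     // fast path: reuse a recycled object
        allocations++;
        return factory.get();            // slow path: real allocation
    }

    void release(T obj) { free.push(obj); } // caller must reset object state

    static int demo() {
        RecyclingPool<byte[]> pool = new RecyclingPool<>(() -> new byte[1024]);
        for (int i = 0; i < 1_000; i++) {
            byte[] buf = pool.acquire(); // hot loop: 1000 acquire/release cycles
            pool.release(buf);
        }
        return pool.allocations;         // only one real allocation happened
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```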
Scheduling & Elasticity
(STORM-2309)
Elasticity
 Stretching / Shrinking
– Changes Worker/Executor counts
 Current parallelism hints not good enough
 Need a better way for users to specify concurrency that enables stretching/shrinking
Topology Planning / Scheduling
(STORM-2309)
 Problem: How to line up the tasks within and across workers for optimal execution
– Lower level issue than Resource Aware Scheduling
 What is optimal ?
– Best Performance – without regard for hardware/energy utilization
– Resourceful hardware utilization – trade away the last 10-20% of perf for lower energy consumption.
 Enable user to decide what is optimal for them.
– Scheduling hints
– Allow elasticity
Scheduling Hints
 Parallelism hints
– Per worker, host, (rack), global counts
– Min and max settings
– Supervisor could have rack hints
 Distribution
– Compact packing (default)
• Pack the Worker to its max
– In order of appearance in topology definition
• Then pack host, (then rack), then cluster
– Loose packing
• Pack the Worker to the min
• Then host, then …
• Left over resources are spread out in a similar fashion
Scheduling Hints
 If TaskA --emits--> TaskB, A & B could be running on:
• Same Thread
• Same Worker different thread
• Same Host different Worker
• Different Host: Shuffle/other
 Locality Control:
– Clustering: Co-locating
– Partitioning: Avoid colocation
 Specify via arguments to groupings ?:
– shuffle(threadLocal)
– fieldsGrouping(nodeLocal)
Thank You
Questions ?
References
https://issues.apache.org/jira/browse/STORM-2284