Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the Cloud
Yuval Degani, Mellanox Technologies
Evan Burness, Microsoft Azure
#HWCSAIS18
• End-to-end designer and supplier of interconnect solutions: network adapters, switches, system-on-a-chip, cables, silicon and software
• 10-400 Gb/s Ethernet and InfiniBand
[Figure: interconnect portfolio spanning storage (front end / back end), server / compute, and switch / gateway; labels: 56/100/200G InfiniBand, 10/25/40/50/100/200/400GbE, Virtual Protocol Interconnect]
• RDMA-capable network, powered by Mellanox
• H-series (Intel CPUs with FDR InfiniBand)
• NC-series (Nvidia GPUs with FDR InfiniBand)
• Only major cloud provider with RDMA
• Run simulation and AI workloads at large scale
• Dozens of RDMA clusters around the world
Why are we here?
• Azure hardware-accelerated networking will soon support general-purpose RDMA (on top of SR-IOV)
• The SparkRDMA Shuffle Plugin (presented at Spark Summit Europe 2017) can now be used in the cloud, providing instant speedups for Spark jobs
What’s RDMA?
• Remote Direct Memory Access
– Read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface – bypasses the kernel and TCP/IP in the I/O path
• Flow control and reliability are offloaded to hardware
• Sub-microsecond latency
• Supported on almost all mid-range/high-end network adapters
[Figure: data path from a Java app buffer to the network adapter, comparing the socket path (context switch into the OS, Sockets, TCP/IP, Driver) with the direct RDMA path]
RDMA on Azure
• No need to buy expensive hardware
• Lowest latency in the cloud (~2.5 µs)
• Pre-built OS images for easy deployment
• K80, P100, and V100 GPUs with InfiniBand
• Other use cases for RDMA on Azure:
RDMA on Azure
Azure accelerated networking is built on top of SR-IOV (Single Root Input/Output Virtualization) hardware support provided by Mellanox ConnectX network cards
Spark’s Shuffle Internals
Under the hood
Spark’s Shuffle Basics
[Figure, built up over several slides: MapReduce-style shuffle. Input feeds five Map tasks; each Map task writes its map output to a File; a Driver appears alongside the map outputs; five Reduce tasks fetch blocks from the map output files]
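The map, map output and fetch stages in the figure can be mimicked in a few lines of plain Python. This is only a toy sketch, not Spark code: the function names, the CRC32 partitioner and the word-count style aggregation are all assumptions made for illustration, and each dictionary stands in for one map output file.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(records, num_reducers):
    """Map side: split one task's (key, value) records into per-reducer blocks."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[partition_for(key, num_reducers)].append((key, value))
    return dict(blocks)  # stands in for one map output file


def run_reduce_task(reducer_id, map_outputs):
    """Reduce side: fetch this reducer's block from every map output, sum per key."""
    totals = defaultdict(int)
    for output in map_outputs:
        for key, value in output.get(reducer_id, []):  # "fetch block"
            totals[key] += value
    return dict(totals)


def toy_shuffle(input_splits, num_reducers):
    """Run all map tasks, then all reduce tasks, and merge the reducer results."""
    map_outputs = [run_map_task(split, num_reducers) for split in input_splits]
    merged = {}
    for reducer_id in range(num_reducers):
        merged.update(run_reduce_task(reducer_id, map_outputs))
    return merged


splits = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1), ("a", 1)]]
word_counts = toy_shuffle(splits, num_reducers=2)
print(word_counts)
```

Every reduce task touches every map output, which is why the number of fetched blocks grows with the product of map tasks and reduce tasks.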
Spark’s Shuffle Read Protocol
[Sequence diagram, built up over several slides: Shuffle Read between Driver, Reader and Writer]
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. Request blocks from writers
5. Locate blocks, and set up as stream
6. Request blocks from stream, one by one
7. Locate block, send back
8. Block data is now ready
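Step 3, grouping block locations by writer, is the piece of this protocol that is easiest to show in code. The sketch below is an illustration in plain Python rather than Spark's implementation: the shape of `map_statuses` (a list of dicts with `map_id`, `writer` and per-reducer `sizes`) is a hypothetical stand-in for the Map Statuses returned in step 2, and skipping zero-size blocks is an assumption.

```python
from collections import defaultdict


def group_blocks_by_writer(map_statuses, reducer_id):
    """Build writer -> [(map_id, size)] for the blocks one reader needs.

    map_statuses is a hypothetical list of dicts:
    {"map_id": int, "writer": str, "sizes": [bytes per reducer]}.
    """
    by_writer = defaultdict(list)
    for status in map_statuses:
        size = status["sizes"][reducer_id]
        if size > 0:  # assumption: empty blocks are not fetched
            by_writer[status["writer"]].append((status["map_id"], size))
    return dict(by_writer)


statuses = [
    {"map_id": 0, "writer": "exec-1", "sizes": [10, 0]},
    {"map_id": 1, "writer": "exec-2", "sizes": [5, 7]},
    {"map_id": 2, "writer": "exec-1", "sizes": [3, 4]},
]
grouped = group_blocks_by_writer(statuses, reducer_id=0)
print(grouped)
```

Grouping first means the reader can issue one "request blocks" message per writer in step 4 instead of one per block.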
The Cost of Shuffling
• Shuffling is very expensive in terms of CPU, RAM, disk and network I/O
• Spark users try to avoid shuffles as much as they can
• Speedy shuffles can relieve developers of such concerns, and simplify applications
SparkRDMA Shuffle Plugin
Accelerating Shuffle with RDMA
SparkRDMA
• Dedicated session at Spark Summit Europe 2017: Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark
• Open-source and free to use: https://github.com/Mellanox/SparkRDMA
• Supports any RDMA-capable device
– Ethernet (RoCE – RDMA over Converged Ethernet)
– InfiniBand
SparkRDMA - Design Notes
• Entire Shuffle-related communication is done with RDMA
– RPC messaging for metadata transfers
– Block transfers
• SparkRDMA is an independent plugin
– Implements the ShuffleManager interface
– No changes to Spark’s code – use with any existing Spark installation
• Reuses Spark facilities
– Maximize reliability
– Minimize impact on the data
• No functionality loss of any kind; SparkRDMA supports:
– Compression
– Spilling to disk
– Recovery from failed map or reduce tasks
[Figure labels: SortShuffleManager, RdmaShuffleManager]
Shuffle Read Protocol – Standard vs. RDMA
[Sequence diagrams, built up over several slides: Shuffle Read between Driver, Reader and Writer, with the Standard flow next to the RDMA flow]
Standard:
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. Request blocks from writers
5. Locate blocks, and set up as stream
6. Request blocks from stream, one by one
7. Locate block, send back
8. Block data is now ready
RDMA:
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. RDMA-Read blocks from writers (no-op on writer, HW offloads transfers)
5. Block data is now ready
Server-side:
• 0 CPU
• Shuffle transfers are not blocked by GC in executor
• No buffering
Client-side:
• Instant transfers
• Reduced messaging
• Direct, unblocked access to remote blocks
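The "reduced messaging" and "0 CPU" points can be made concrete with a rough count of reader-to-writer requests in each flow. The model below is a deliberately simplified sketch and not a measurement: it assumes one "request blocks" message per writer plus one request per block in the Standard flow, and exactly one RDMA read per block (with no batching) in the RDMA flow. The function names and the `blocks_per_writer` input are hypothetical.

```python
def standard_requests(blocks_per_writer):
    """Standard flow: one 'request blocks' message per writer,
    then one request per block pulled from the stream."""
    writers = len(blocks_per_writer)
    blocks = sum(blocks_per_writer.values())
    total = writers + blocks
    # Every request is handled by software on the writer.
    return {"reader_requests": total, "writer_cpu_events": total}


def rdma_requests(blocks_per_writer):
    """RDMA flow (assumption: one RDMA read per block); the writer's CPU
    is not involved because the adapter serves the read."""
    blocks = sum(blocks_per_writer.values())
    return {"reader_requests": blocks, "writer_cpu_events": 0}


blocks_per_writer = {"exec-1": 2, "exec-2": 1}
standard = standard_requests(blocks_per_writer)
rdma = rdma_requests(blocks_per_writer)
print(standard, rdma)
```

Even in this toy model the writer-side work drops to zero, which is the property the server-side bullets describe.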
Benefits
• Substantial improvements in:
– Block transfer times: latency and total transfer time
– Memory consumption and management
– CPU utilization
• Easy to deploy and configure:
– Packed into a single JAR file
– Plugin is enabled through a simple configuration handle
– Allows finer tuning with a set of configuration handles
• Configuration and deployment are on a per-job basis:
– Can be deployed incrementally
– May be limited to Shuffle-intensive jobs
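As a hedged illustration of the per-job deployment described above, the sketch below assembles a `spark-submit` command line that puts the plugin JAR on the driver and executor classpaths and switches the shuffle manager. The configuration keys and the `org.apache.spark.shuffle.rdma.RdmaShuffleManager` class name are recalled from the SparkRDMA README and should be verified against the version in use; the JAR path, the application name and the helper function itself are hypothetical.

```python
def sparkrdma_submit_args(app, jar_path):
    """Build spark-submit arguments that enable SparkRDMA for a single job."""
    conf = {
        # Make the plugin JAR visible to the driver and to every executor.
        "spark.driver.extraClassPath": jar_path,
        "spark.executor.extraClassPath": jar_path,
        # Swap the default shuffle manager for the RDMA one.
        "spark.shuffle.manager": "org.apache.spark.shuffle.rdma.RdmaShuffleManager",
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# Hypothetical JAR location and application file.
args = sparkrdma_submit_args("terasort.py", "/opt/spark-rdma/spark-rdma.jar")
print(" ".join(args))
```

Because the settings are passed per job, other jobs on the same cluster keep the default shuffle manager, which matches the incremental deployment point above.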
Demo time!
Demo Testbed
• Hardware:
– 8 Azure “h16mr” VM instances
– Intel Haswell E5-2667 v3
– InfiniBand FDR (56 Gb/s)
– 224 GiB RAM
– 2000 GiB SSD for temporary storage
• Workload:
– HiBench TeraSort
– Size: “gigantic” (320 GB)
• Ubuntu 16.04
• HDFS on Hadoop 2.7.4
– No replication
• Spark 2.2.0
– 1 Master
– 7 Workers
– 16 active Spark cores on each node, 112 total
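The testbed figures can be sanity-checked with a little arithmetic. The snippet below reproduces the 112-core total from the slide and derives the average input per worker and per core, under the assumption (not stated in the slides) that the 320 GB dataset is spread evenly across the cluster.

```python
workers = 7
cores_per_worker = 16
dataset_gb = 320  # HiBench TeraSort "gigantic" size from the slide

total_cores = workers * cores_per_worker  # matches the 112 on the slide
gb_per_worker = dataset_gb / workers      # assumes an even spread
gb_per_core = dataset_gb / total_cores    # assumes an even spread

print(f"{total_cores} cores, {gb_per_worker:.1f} GB per worker, {gb_per_core:.2f} GB per core")
```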
TeraSort - Performance Results
[Bar chart: TeraSort run time in seconds for RDMA and Standard; axis from 0 to 700 seconds]
4.4x Faster Shuffles!
[Figure: Standard vs. RDMA]
1000x Faster Transfers!
[Figure: Standard vs. RDMA]
0 Shuffle Read Time
[Figure: Standard vs. RDMA]
Recap
• SR-IOV+RDMA comes to Azure H- and N-series in Fall 2018
• Support for all major MPI implementations
– MVAPICH, OpenMPI, Intel MPI, Platform MPI, etc.
• General-purpose RDMA support
– Support for SparkRDMA, Caffe2, TensorFlow or any other RDMA application
• Be on the lookout for more at SC’18!
Thank you.
