Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
April 27, 2021
AGENDA
1. Introduction to RAPIDS Accelerator for Spark
2. RAPIDS Accelerator for Spark Performance
3. GPU Acceleration combined with Alluxio
GROWTH IN REQUIREMENT FOR DATA PROCESSING
[Figure: growth in data processing requirements on a timeline from 2000 to 2030, spanning the Hadoop Era, the Spark Era, and the Spark GPU Era. Labels: Spark 2.0 on CPUs; GPU-Accelerated Spark 3.0]
“These contributions lead to faster data pipelines, model training and scoring for more breakthroughs and insights with Apache Spark 3.0 and Databricks.”
Matei Zaharia, creator of Apache Spark and chief technologist at Databricks
SPARK 3.0 ON NVIDIA GPUs
Accelerate data science pipelines without code changes
Faster Execution Time
• Accelerate data preparation
• Quickly move to next stages of the pipeline
• Focus on most-critical activities
Streamline Analytics to AI
• Orchestrate end-to-end pipelines
• From ETL to model training to visualization
• Same infrastructure for Spark and ML/DL frameworks
Reduced Infrastructure Costs
• Complete jobs faster with less hardware
• Save on-prem and in the cloud
• Do more with less
NVIDIA BRINGS GPU ACCELERATION TO APACHE SPARK
Features
• Use existing (unmodified) customer code
• Spark features that are not GPU-enabled run transparently on the CPU
Initial Release - GPU Acceleration of:
• Spark DataFrames
• Spark SQL
• ML/DL training frameworks
Seamless integration with Spark 3.0
RAPIDS ACCELERATOR FOR APACHE SPARK
[Architecture diagram, top layer to bottom layer]
DISTRIBUTED SCALE-OUT SPARK APPLICATIONS
APACHE SPARK CORE
Spark SQL API, DataFrame API:
    if gpu_enabled(operation, data_type)
        call-out to RAPIDS
    else
        execute standard Spark operation
Spark Shuffle:
● Custom implementation of Spark Shuffle
● Optimized to use RDMA and GPU-to-GPU direct communication
RAPIDS Accelerator for Spark
JNI bindings (mapping from Java/Scala to C++)
RAPIDS libcudf (C++ libraries), UCX libraries
CUDA
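The per-operation dispatch in the diagram can be sketched in plain Python. This only illustrates the fallback idea and is not the plugin's actual planner: `GPU_SUPPORTED`, `gpu_enabled`, and `plan_operation` are hypothetical names, and the supported (operation, data type) pairs are invented for the example.

```python
# Hypothetical table of (operation, data_type) pairs that have a GPU
# implementation. The entries are illustrative, not the real support matrix.
GPU_SUPPORTED = {
    ("sort", "int"),
    ("sort", "string"),
    ("join", "int"),
    ("aggregate", "double"),
}


def gpu_enabled(operation: str, data_type: str) -> bool:
    """Return True when this (operation, data type) pair can run on the GPU."""
    return (operation, data_type) in GPU_SUPPORTED


def plan_operation(operation: str, data_type: str) -> str:
    """Pick the execution path for one operation, falling back to the CPU."""
    if gpu_enabled(operation, data_type):
        return "GPU"  # call out to RAPIDS
    return "CPU"      # execute the standard Spark operation


requests = [("sort", "int"), ("join", "decimal"), ("aggregate", "double")]
plan = [(op, dt, plan_operation(op, dt)) for op, dt in requests]
for op, dt, where in plan:
    print(f"{op}({dt}) -> {where}")
```

Because the decision is made per operation and data type, an unsupported step runs on the CPU while the rest of the query can still be accelerated, which is what lets existing code run unmodified.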
GAME-CHANGING PERFORMANCE GAINS
7x Performance Boost
90% Cost Savings
on Databricks
Opens new possibilities for AI-driven services in Adobe Experience Cloud
“We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs. With these game-changing GPU performance gains, new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.”
William Yan, Senior Director of Machine Learning at Adobe
RAPIDS ACCELERATOR ECOSYSTEM MOMENTUM
 
Databricks Machine Learning Runtime: Available Now
Google Cloud Dataproc: Available Now
Apache Spark 3.0 Community Release: Available Now
Amazon EMR: Available Now
Cloudera CDP: Available in Jun’21
Benchmark Environment – EGX
EGX / NVIDIA Certified OEM servers
Nodes:         8
CPU:           2 x AMD EPYC 7452 (64 cores/128 threads)
GPU:           2 x NVIDIA Ampere A100, PCIe, 250W, 40GB
RAM:           0.5 TB
Storage:       4 x 7.68 TB Gen4 U.2 NVMe
Networking:    1 x Mellanox CX-6 Single Port HDR100 QSFP56
Cost w/o GPU:  ~$42,000 per w/ bulk discount
Cost w/ GPU:   ~$71,000 per w/ bulk discount
Software:      HDFS (Hadoop 3.2.1), Spark 3.0.2 (standalone)
SPARK SQL QUERIES – EGX CLUSTER
Based on 97 queries from the NVIDIA Decision Support (NDS) benchmark (3 TB dataset without decimals)*
The GPU is 3.21x faster, with a cost ratio of 0.52
(the GPU run cost 52% of the CPU run).
Queries 14b and 72 were removed because of failures.
*NVIDIA Decision Support (NDS) benchmark is derived from the TPC-DS benchmark and is used for internal performance testing. Results from NDS are not comparable to TPC-DS
UCX ON VS OFF
GPU + UCX shuffle is 1.23x faster than the GPU alone.
Queries 67 and 72 were removed because of failures.
SPARK SQL QUERIES – EGX CLUSTER
GPU + UCX shuffle is 4.15x faster than the CPU, with a cost ratio of 0.41.
Queries 14a, 67, and 72 were removed because of failures.
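The slides do not say how the cost ratio is computed. One plausible reading, sketched below as an assumption, is relative hardware cost divided by speedup, using the EGX costs of ~$42,000 without GPUs and ~$71,000 with GPUs (taken here to be per node). Under that assumption the results land within about 0.01 of the reported 0.52 and 0.41.

```python
# Hardware costs from the EGX benchmark environment slide
# (assumed to be per node; the ratio is the same either way).
COST_CPU_NODE = 42_000  # USD, without GPUs
COST_GPU_NODE = 71_000  # USD, with GPUs


def cost_ratio(speedup: float,
               gpu_cost: float = COST_GPU_NODE,
               cpu_cost: float = COST_CPU_NODE) -> float:
    """Assumed definition: relative hardware cost divided by relative speed."""
    return (gpu_cost / cpu_cost) / speedup


gpu_only = cost_ratio(3.21)  # GPU vs CPU
gpu_ucx = cost_ratio(4.15)   # GPU + UCX shuffle vs CPU
print(f"GPU only:  {gpu_only:.3f}")
print(f"GPU + UCX: {gpu_ucx:.3f}")
```

A ratio below 1 means the GPU cluster finishes the same work for less hardware cost, even though each node is more expensive.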
Benchmark Environment – GCP DATAPROC
Nodes:         1 driver (CPU only), 4 workers
CPU:           n1-standard-4 (driver), 4 x n1-standard-32 (workers)
GPU:           4 x 16GB T4 per executor
RAM:           120 GB
Storage:       Google Cloud Storage/Alluxio with SSD
Networking:    32 Gbps
Cost w/o GPU:  $7.82/hour incl GCE + Dataproc
Cost w/ GPU:   $13.41/hour incl GCE + Dataproc
Software:      Dataproc Spark 3.0.1 + YARN
WHY SOME GPU QUERIES FAILED
GPU Memory Limitations/Spilling
Sort
  Problem: In cases of data skew, the amount of data being sorted can exceed the limits of the hardware/software.
  Solution: A modified external batch sort is implemented in the working branch for the 0.5 release.
Window*
  Problem: In cases of data skew, the amount of data in the window operation can exceed the limits of the hardware/software.
  Solution: Implement a chunked rank optimization. GitHub issue #1859
Join
  Problem: Worst-case join output row count is left.rows * right.rows
  Solution: Materialize the output of a join in chunks. GitHub issue #1629
  Solution: Have conditional filters run as part of the join. GitHub issue #288
* Actually, this is for rank, which we don’t support yet but plan to in the 0.5 release.
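The join row above can be illustrated with a small sketch. It shows the worst case (every left row matches every right row) and the general idea of emitting output in bounded chunks. It is a plain-Python illustration under that assumption, not the implementation tracked in the GitHub issues, and `chunked_cross_join` is a hypothetical helper name.

```python
from itertools import islice, product
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def chunked_cross_join(left: List[Row], right: List[Row],
                       chunk_rows: int) -> Iterator[List[Tuple[Row, Row]]]:
    """Yield the join output in chunks of at most chunk_rows rows.

    Only one chunk is held at a time, instead of all
    len(left) * len(right) output rows.
    """
    pairs = product(left, right)  # lazy: output rows are produced on demand
    while True:
        chunk = list(islice(pairs, chunk_rows))
        if not chunk:
            return
        yield chunk


left = [(i, "l") for i in range(4)]
right = [(j, "r") for j in range(3)]
chunks = list(chunked_cross_join(left, right, chunk_rows=5))
print([len(c) for c in chunks])
```

The total output is still left.rows * right.rows, but peak memory is bounded by the chunk size rather than by the full result.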
WHY IS THE GPU SLOWER FOR SOME QUERIES?
• Failed queries
• Small data sizes (spark.sql.adaptive.advisoryPartitionSizeInBytes=1G)
  • Q28, Q44, and Q67
• Less computation overlap (spark.rapids.sql.concurrentGpuTasks=1)
• Host/device memory transfers
  • All of them
• Cache consistency on reductions/very small aggregate results
  • Q88
• Lack of GPU support for some operations, combined with much lower CPU parallelism
  • Q44, Q49, and Q67
ALLUXIO CONFIGURATION
- Co-locate the Alluxio worker nodes with the Spark worker nodes to ensure short-circuit reads and writes.
- Size the cache according to the working set.
- Choose the right cache medium (SSD or system memory).
RAPIDS ACCELERATOR CONFIGURATION
spark.rapids.sql.enabled is the master switch for GPU acceleration
spark.rapids.sql.explain enables logging of operations that are not accelerated
- Set to NOT_ON_GPU to print only incompatible ops
spark.rapids.sql.concurrentGpuTasks controls the number of concurrent tasks per GPU
- Set to a value between 2 and 4, with 2 typically providing the most benefit
spark.rapids.memory.pinnedPool.size sets the pinned host memory pool, which significantly improves the performance of data transfers between the GPU and host memory
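As a rough sketch, the settings above could be collected in one place and rendered as `spark-submit` arguments. The `spark.plugins` entry is not on the slide; it is included here because the accelerator is loaded as a Spark plugin. The pinned pool size of `2G` and the script name `my_job.py` are placeholder assumptions, and `to_submit_args` is a hypothetical helper.

```python
# Settings named on the slide; values marked "assumed" are placeholders.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # loads the accelerator (not on the slide)
    "spark.rapids.sql.enabled": "true",             # master switch
    "spark.rapids.sql.explain": "NOT_ON_GPU",       # log only ops that stay on the CPU
    "spark.rapids.sql.concurrentGpuTasks": "2",     # slide suggests 2 to 4
    "spark.rapids.memory.pinnedPool.size": "2G",    # assumed size
}


def to_submit_args(conf: dict) -> list:
    """Render a config dict as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(rapids_conf)
print(" ".join(["spark-submit"] + submit_args + ["my_job.py"]))
```

Starting with `NOT_ON_GPU` for the explain setting is a convenient way to see which parts of a query fall back to the CPU before tuning the other values.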
WILL MY SPARK WORKLOAD ACCELERATE WITHOUT CHANGES?
If I know my Spark workload characteristics...

Data Pipeline Use Cases
  Accelerates Well on GPUs:
  ● Data Mining, Analytics and BI
  ● Batch processing and writing large datasets to a Data Warehouse
  ● Data extraction, aggregation and feature preparation for ML Training & Inference
  Not for GPUs:
  ● Real-time Streaming Analytics/AI pipeline
  ● Online Transaction Processing (OLTP)
  ● Data Pipeline with custom code

Technical Characteristics
  Accelerates Well on GPUs:
  ● Batch processing of GB+ data sets
  ● Parquet, ORC, CSV data formats
  ● HDFS, S3-compatible, or V2 data sources
  ● DataFrame/SQL (join, agg, sort, window), Selected Hive & Scala UDFs
  Not for GPUs:
  ● Stream processing
  ● Spark RDD, MLlib, Dataset, GraphX, Streaming libraries

If I am unsure...
Use the Log-Analysis Tool
● Review Spark history logs from existing CPU jobs
● Understand how much of the workloads could execute on GPUs
● Get tips on optional code optimizations for GPUs
[Figure: Apache Spark component stack (Apache Spark Core, Catalyst Query Optimizer, Spark Streaming, Spark SQL, Spark DataFrames, Spark Datasets, RDD, Spark Shuffle, Spark MLlib, GraphX), each component marked as CPU Only, GPU Aware, GPU Accelerated, or Partially GPU Accelerated]
Summary
RAPIDS Accelerator for Spark unlocks GPU acceleration for Spark DataFrames, Spark SQL, & ML/DL frameworks such as XGBoost (with more coming).
Alluxio is a high-performance data orchestration system for GPU compute.
Spark & GPUs on Alluxio optimizes for performance and cost on cloud-scale datasets.
Spark 3 & GPUs on Databricks, EMR and Dataproc are available today.
Try it yourself:
https://nvidia.github.io/spark-rapids/Getting-Started/
Developer Blog:
Accelerating Analytics and AI with Alluxio and NVIDIA GPUs
GTC Talks:
Enabling Data Orchestration with RAPIDS Accelerator [S32746]
Accelerating Apache Spark Shuffle with UCX [S31822]
Tuning GPU Network and Memory Usage in Apache Spark [S31566]
Running Large-Scale ETL Benchmarks with GPU-Accelerated Apache Spark [S31846]
... and more

