Choose Your Weapon:
Comparing GPU-based vs.
FPGA-based Acceleration of
Apache Spark
Bishwa “Roop” Ganguly
Chief Solution Architect
The Need for Higher Performance
• There is mounting pressure on companies to use big data
analysis to gain a competitive edge, and the amount of
data is ever-growing.
• CPU-based approaches are not meeting these needs satisfactorily
• Non-linear scaling
• Cost
• Moore's law slowing
• Customers are missing their SLAs because computational demand is outgrowing their budgets.
• Data scientist productivity is suffering due to the time required to run their analytics.
Today's Approach for Spark: CPU Clusters
• CPU clusters are the most prevalent way that Spark is run
• Most common ways to meet performance demands: scale-up, scale-out
• Costly
• Typically highly sub-linear in performance improvement
• i.e., adding N times the servers yields far less than Nx performance
• Solutions?
• Code optimizations
• Caching approaches
• Spark configuration optimizations
• SW-based acceleration
• Bigstream has developed SW-based acceleration
• Compiles Spark tasks into native code, running on standard CPU instances
• Zero user code change
• Complements scaling of CPU clusters
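The sub-linear scaling claim can be made concrete with a small Amdahl's-law style sketch (the 30% serial fraction is an illustrative assumption, not a measured Spark figure):

```python
# Illustrative model: speedup from scaling out when a fraction of the
# job (shuffles, coordination, skewed tasks) does not parallelize.
def scale_out_speedup(n_servers, serial_fraction=0.3):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_servers)

for n in (2, 4, 8, 16):
    print(n, round(scale_out_speedup(n), 2))
# With a 30% serial fraction, even 16 servers stay under the 1/0.3 ~ 3.3x ceiling
```

This is why code, caching, and configuration optimizations (and acceleration) matter: they attack the part of the job that adding servers cannot.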
Average Speedup: 2.2x
• All runs on identical m5d.2xlarge EMR clusters - Baseline: Spark 3.0
• 4 workers/cluster, S3 storage
• 250SF CSV gzip compressed standard data (approx. 72GB)
Bigstream (SW-only) Speedup Results over Spark 3.0
[Bar chart: Speedup (0–3) vs. TPC-DS Query Number, queries 1–100; series: Spark 3.0/Bigstream SW]
Available on AWS Marketplace
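For reference, the headline number is the arithmetic mean of per-query speedups, where each speedup is baseline runtime divided by accelerated runtime. A minimal sketch with made-up timings:

```python
# Per-query speedup = baseline runtime / accelerated runtime; the headline
# figure is the arithmetic mean across TPC-DS queries. Timings are made up.
baseline_s    = {"q1": 120.0, "q3": 90.0, "q5": 200.0}
accelerated_s = {"q1": 50.0,  "q3": 45.0, "q5": 80.0}

speedups = {q: baseline_s[q] / accelerated_s[q] for q in baseline_s}
avg = sum(speedups.values()) / len(speedups)
print(speedups)       # {'q1': 2.4, 'q3': 2.0, 'q5': 2.5}
print(round(avg, 2))  # 2.3
```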
Hardware Accelerators
• Programmable hardware: FPGA, GPU, ASIC
• Designed for efficient execution of specialized code, in contrast with general-purpose CPUs
• ASICs typically support domain-specific workloads
• FPGA, GPU: provide flexibility, but don’t natively connect to big data platforms
• Middleware needed
• Both can provide performance, power, and cost advantages w.r.t. CPU scale-up/scale-out
• Designed to attach physically to existing servers simply (e.g., a PCIe slot)
HW Accelerator Market is Trending
Source: ARK Invest “Big Ideas 2021"
• Accelerators = GPUs, ASICs, and FPGAs
• Projected to be a $41 billion industry within the next 10 years, surpassing CPUs
• Driven by big data analytics and AI
Architectural Comparison (GPU, FPGA)
GPU
• Features
• Programmable via an ISA (simpler programming)
• High degree of data-level parallelism
• Challenges
• Branch divergence can be very costly
• Power consumption has been shown to be high for some analytics operations
FPGA
• Features
• Logic configured per operation can maximize efficiency
• Low power consumption per computation
• Bit-, byte-level, and “irregular” parallelism can be leveraged
• High on-chip bandwidth
• Challenges
• Exploiting fine-grain parallelism requires FPGA architecture expertise
Example Irregular Computation - IF/ELSE
… if (shirtsize == ‘large’) { green_code(); }
else { red_code(); } …
GPU: the “if” path runs while the “else” lanes wait, due to SIMD execution (partial utilization)
FPGA: the “if” and “else” paths run in parallel, without branch divergence (full utilization)
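The utilization difference above can be expressed as a toy timing model (the function names and timings are illustrative, not measured):

```python
# Toy timing model for the if/else example: under SIMD, lanes taking the
# "if" path and lanes taking the "else" path execute serially (divergence);
# under MIMD, each lane runs its own path independently.
def simd_time(t_if, t_else, frac_if):
    # If any lane diverges, both paths are swept over the whole warp.
    return t_if + t_else if 0 < frac_if < 1 else max(t_if, t_else)

def mimd_time(t_if, t_else, frac_if):
    # Every lane finishes after its own branch; total = slowest branch.
    return max(t_if, t_else)

t_if, t_else = 10, 8
print(simd_time(t_if, t_else, 0.5))  # 18: both paths serialized
print(mimd_time(t_if, t_else, 0.5))  # 10: paths run in parallel
```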
Legend: compute lanes (GPU: SIMD lanes; FPGA: MIMD lanes); colored bars: application code; horizontal axis: time
SQL/Analytics/ML Performance Hypotheses/Observations
• Scans run better on FPGA because of the if/else (MIMD) efficiency of configured logic vs. SIMD
• SQL operations: GPU vs. FPGA depends on the degree of regularity of the operation
• Training ML: GPU is better
• Typically regular matrix operations
• Training uses floating-point; high-precision parallel computations are required
• Inference: Depends on precision being used
• If less precision can give as good answers, FPGA may have an advantage
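The precision point can be sketched concretely: an int8-quantized dot product approximates the float32 result, and fixed-point integer arithmetic is exactly what an FPGA can be configured to do cheaply (the values and scale factor below are illustrative):

```python
# Minimal sketch of reduced-precision inference: an int8 dot product
# approximates the float32 one. Values and scale are illustrative.
def quantize(xs, scale=127.0):
    return [round(x * scale) for x in xs]  # clamping omitted for brevity

weights = [0.5, -0.25, 0.125]
inputs  = [0.8,  0.4,  -0.6]

exact = sum(w * x for w, x in zip(weights, inputs))
q_w, q_x = quantize(weights), quantize(inputs)
approx = sum(a * b for a, b in zip(q_w, q_x)) / (127.0 * 127.0)

print(round(exact, 4))   # 0.225
print(round(approx, 4))  # close to 0.225
```

If the low-precision answer is "as good" for the application, the integer datapath wins on hardware cost and power.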
Spark Performance Results
• Goal: Initial assessment of FPGA-based and GPU-based solutions on AWS for a typical Spark SQL workload
• Experimental setup (both technologies)
• All runs use 4-node worker clusters
• 8 vCPUs per worker
• 1 executor/worker, 7 cores/executor
• Baseline: Spark 3.0.1 on CPU
• TPC-DS benchmark suite (90 queries)
• Identical, standard code – TPC-DS SQL as downloaded from www.tpc.org
• Identical, standard data - TPC-DS CSV format data (gzipped)
• Input data coming from same AWS S3 bucket
• Desired results: assessment of acceleration performance over standard Spark
• Note: The two solutions run on different instance types, so not a head-to-head comparison
Experimental Setup per Technology
• RAPIDS GPU-based Spark acceleration
• Cluster allocated via AWS EMR
• g4dn.2xlarge instance type (1 NVIDIA T4 GPU/instance)
• 4 physical cores (3196 MHz)
• 22GB executor memory
• Optimized Spark configurations as recommended by NVIDIA literature
• Bigstream FPGA-based Spark acceleration
• Cluster allocated via single Bigstream-provided Terraform script v1.1
• Note: F1 instances not yet available in EMR
• f1.2xlarge instance type (1 Xilinx FPGA/instance)
• 4 physical cores (2670 MHz)
• 72GB executor memory
• Optimized Spark configurations as recommended by Bigstream
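The exact tuning used in these runs is not listed in the deck, but the general shape of enabling the RAPIDS plugin in PySpark follows NVIDIA's public documentation (memory/core values echo the setup above; jar deployment is assumed handled outside this snippet):

```python
# Sketch of enabling the RAPIDS accelerator for Spark (per NVIDIA's public
# docs); assumes the rapids-4-spark jar is already on the cluster classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tpcds-rapids")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")              # turn GPU SQL on
    .config("spark.executor.memory", "22g")
    .config("spark.executor.cores", "7")
    .getOrCreate()
)
```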
RAPIDS Speedup Results
RAPIDS Average Speedup: 1.9x
• 250SF CSV gzip compressed standard data (approx. 72GB)
[Bar chart: Speedup (0.0–3.5) vs. TPC-DS Query Number, queries 1–105; series: Speedup Rapids]
Bigstream Speedup Results
Bigstream Average Speedup: 3.6x
• 250SF CSV gzip compressed standard data (approx. 72GB)
[Bar chart: Speedup (0–7) vs. TPC-DS Query Number, queries 1–105; series: Speedup Bigstream-F1]
Bigstream Hyperacceleration Layer
• Zero code changes
• Cross-platform versatility
• Up to 10x acceleration
• Adaptation: intelligent, automatic computation slicing
• Cross-acceleration hardware
• Components: Bigstream Dataflow, Bigstream Runtime
Summary: HW Accelerated Spark for Analytics
• Available today
• Cloud
• On-premise
• Zero code change
• Provides next level of performance, over and above traditional Spark optimizations
• Use case examples:
• Highest performing analytics on a given infrastructure size
• Ability to leverage more data
• More sources
• More lookback
• Larger sizes
• Overcoming cluster scaling limitations
• Total cost of operations (TCO) savings
Thank you!
roop@bigstream.co
