Parallelize R Code Using Apache Spark
Hossein Falaki
@mhfalaki
About me
• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Using Apache Spark since version 0.6
• Developed first version of Apache Spark CSV data source
• Worked on SparkR & Databricks R Notebooks
• Currently focusing on R experience at Databricks
What is SparkR
An R package distributed with Apache Spark:
• Provides an R front-end to Apache Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
Spark: robust distributed processing, data sources, off-memory data
+
R: dynamic environment, interactivity, 10K+ packages, visualizations
SparkR architecture
[Diagram: the R process talks through the RBackend to the Spark Driver JVM, which coordinates the worker JVMs and data sources.]
SparkR architecture (2.x)
[Diagram: same layout, but in 2.x each worker JVM also launches R worker processes that execute user R code.]
Overview of SparkR API
http://spark.apache.org/docs/latest/api/R/
• IO: read.df / write.df / createDataFrame / collect
• Caching: cache / persist / unpersist / cacheTable / uncacheTable
• SQL: sql / table / saveAsTable / registerTempTable / tables
• MLlib: glm / kmeans / naive Bayes / survival regression
• DataFrame API: select / subset / groupBy / head / avg / column / dim
• UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect
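A minimal session sketch tying these together (the JSON path is the example file that ships with Spark; any Spark-readable source works):

library(SparkR)
sparkR.session()

# Read a JSON file into a SparkDataFrame
people <- read.df("examples/src/main/resources/people.json", source = "json")

# DataFrame API: group, aggregate, and collect back to an R data.frame
counts <- collect(count(groupBy(people, "age")))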
SparkR UDF API
• spark.lapply: runs a function over a list of elements (spark.lapply())
• dapply: applies a function to each partition of a SparkDataFrame (dapply(), dapplyCollect())
• gapply: applies a function to each group within a SparkDataFrame (gapply(), gapplyCollect())
spark.lapply
Simplest SparkR UDF pattern. For each element of a list:
1. Sends the function to an R worker
2. Executes the function
3. Returns the results of all workers as a list to the R driver

spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
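A self-contained sketch of the same pattern, where runBootstrap stands for any user function. Here each list element seeds one bootstrap replicate of a sample mean, assuming the standard mtcars data set is available on the workers (it ships with base R):

library(SparkR)
sparkR.session()

# 100 bootstrap replicates of the mean of mtcars$mpg, one per task
boot <- spark.lapply(1:100, function(seed) {
  set.seed(seed)
  mean(sample(mtcars$mpg, replace = TRUE))
})

quantile(unlist(boot), c(0.025, 0.975))  # approximate 95% interval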
spark.lapply control flow
[Diagram: R driver, driver JVM, worker JVM, R worker]
1. Serialize the R closure (on the R driver)
2. Transfer over local socket to the driver JVM
3. Transfer the serialized closure over the network to the worker JVM
4. Transfer over local socket to the R worker
5. De-serialize the closure
6. Execution
7. Serialize the result
8. Transfer over local socket to the worker JVM
9. Transfer the serialized result over the network to the driver JVM
10. Transfer over local socket to the R driver
11. De-serialize the result
dapply
For each partition of a Spark DataFrame:
1. Collects each partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema): combines the results as a DataFrame with the provided schema
dapplyCollect(sparkDF, func): combines the results as an R data.frame
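A minimal dapply sketch, using mtcars as input; note that the schema of the output must be declared up front and must match the data.frame the function returns:

df <- createDataFrame(mtcars)

# Output schema: one row per input row, two double columns
schema <- structType(structField("mpg", "double"),
                     structField("mpg_per_cyl", "double"))

result <- dapply(df, function(part) {
  # part is one partition, delivered as an R data.frame
  data.frame(mpg = part$mpg, mpg_per_cyl = part$mpg / part$cyl)
}, schema)

head(result)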
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transfer
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transferresult deserialize
gapply
Groups a Spark DataFrame on one or more columns. For each group:
1. Collects each group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, func, schema): combines the results as a DataFrame with the provided schema
gapplyCollect(sparkDF, func): combines the results as an R data.frame
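A minimal gapply sketch, again with mtcars; the user function receives the grouping key and the group's rows as an R data.frame:

df <- createDataFrame(mtcars)

schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))

result <- gapply(df, "cyl", function(key, group) {
  # key holds the grouping value(s); group is an R data.frame of matching rows
  data.frame(cyl = key[[1]], avg_mpg = mean(group$mpg))
}, schema)

head(result)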
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transferresult deserialize
data
shuffle
gapply vs. dapply

                          gapply                            dapply
signature                 gapply(df, cols, func, schema)    dapply(df, func, schema)
                          gapply(gdf, func, schema)
user function signature   function(key, data)               function(data)
data partition            controlled by grouping            not controlled
Parallelizing data
• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
– Are partitions evenly sized?
• Auxiliary data
– Can be joined with the input DataFrame (see the sketch below)
– Can be distributed to all the workers
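For example, a small lookup table can be joined to the input DataFrame rather than captured in the UDF closure; the lookup data here is a hypothetical illustration:

lookup <- data.frame(cyl = c(4, 6, 8),
                     class = c("small", "mid", "large"))
lookupDF <- createDataFrame(lookup)

inputDF <- createDataFrame(mtcars)

# Let Spark distribute the auxiliary data via the join
joined <- join(inputDF, lookupDF, inputDF$cyl == lookupDF$cyl, "left_outer")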
Packages on workers
• SparkR closure capture does not include packages
• You need to load packages on each worker, inside your function
• You need to install packages on workers
– spark.lapply() can be used to install packages (see the sketch below)
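A sketch of that trick, assuming that running more tasks than workers touches every node (the scheduler does not strictly guarantee this); the package name and worker count are placeholders:

numWorkers <- 8  # assumption: cluster size known to the user

# Try to install the package on every worker; idempotent if already present
invisible(spark.lapply(seq_len(numWorkers * 4), function(i) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
  TRUE
}))

# Later, load the package inside the UDF itself:
# spark.lapply(tasks, function(x) { library(forecast); ... })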
Debugging user code
1. Verify the code on the driver
2. Interactively execute the code on the cluster
• When an R worker fails, the Spark driver throws an exception with the R error text
3. Inspect the details of the failed job in the Spark UI
4. Inspect the stdout/stderr of the worker
Demo
Thank You