SCALABLE DATA SCIENCE WITH SPARKR
Felix Cheung
Principal Engineer & Apache Spark Committer
Disclaimer: Apache Spark community contributions
Agenda
• Spark + R, Architecture
• Features
• SparkR for Data Science
• Ecosystem
• What’s coming
Spark in 5 seconds
• General-purpose cluster computing system
• Spark SQL + DataFrame/Dataset + data sources
• Streaming/Structured Streaming
• ML
• GraphX
R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993 = S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 10k+ packages
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL sparkR
• or as an R package loaded in IDEs like RStudio
library(SparkR)
sparkR.session()
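A minimal sketch of the DataFrame API once the session above is started (faithful is R's built-in dataset):

df <- createDataFrame(faithful)           # local R data.frame -> distributed SparkDataFrame
head(filter(df, df$waiting < 50))         # R-friendly column expressions
head(summarize(groupBy(df, df$waiting),   # aggregate with groupBy + summarize
               count = n(df$waiting)))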
Architecture
• Native R classes and methods
• RBackend
• Scala “helper” methods (ML pipeline etc.)
www.slideshare.net/SparkSummit/07-venkataraman-sun
Key Advantage
• JVM processing, full access to DAG capabilities and the Catalyst optimizer: predicate pushdown, code generation, etc.
databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
Features - What’s new in SparkR
• SQL & Data Source (JSON, csv, JDBC, libsvm)
• SparkSession & default session
• Catalog (database & table management)
• Spark packages, spark.addFiles()
• install.spark()
• ML
• R-native UDF
• Structured Streaming
• Cluster support (YARN, mesos, standalone)
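A hedged sketch of the data source bullet above; the paths and options are placeholders:

people  <- read.df("data/people.json", source = "json")
flights <- read.df("data/flights.csv", source = "csv",
                   header = "true", inferSchema = "true")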
SparkR for Data Science
Decisions, decisions?
[Decision flow: Distributed? No → native R; Yes → native-R UDF or spark.ml]
Spark ML Pipeline
[Diagram: DataFrame → Transformer → Transformer → Estimator; feature engineering stages followed by the modeling stage]
Spark ML Pipeline
• Pre-processing, feature extraction, model fitting, validation stages
• Transformer
• Estimator
• Cross-validation/hyperparameter tuning
SparkR API for ML Pipeline
spark.lda(data = text, k = 20, maxIter = 25, optimizer = "em")

[Diagram: the R API call builds an ML Pipeline in the JVM: RegexTokenizer → StopWordsRemover → CountVectorizer → LDA]
Model Operations
• summary - print a summary of the fitted model
• predict - make predictions on new data
• write.ml/read.ml - save/load fitted models (slight layout difference: pipeline model plus R metadata)
Spark.ml in SparkR 2.0.0
• Generalized Linear Model (GLM)
• Naive Bayes Model
• k-means Clustering
• Accelerated Failure Time (AFT) Survival Model
Spark.ml in SparkR 2.1.0
• Generalized Linear Model (GLM)
• Naive Bayes Model
• k-means Clustering
• Accelerated Failure Time (AFT) Survival Model
• Isotonic Regression Model
• Gaussian Mixture Model (GMM)
• Latent Dirichlet Allocation (LDA)
• Alternating Least Squares (ALS)
• Multilayer Perceptron Model (MLP)
• Kolmogorov-Smirnov Test (K-S test)
• Multiclass Logistic Regression
• Random Forest
• Gradient Boosted Tree (GBT)
RFormula
• Specify modeling in symbolic form
  y ~ f0 + f1
  response y is modeled linearly by f0 and f1
• Supports a subset of R formula operators:
  ~ , . , : , + , -
• Implemented as a feature transformer in core Spark, available to Scala/Java and Python
• String label column is indexed
• String term columns are one-hot encoded
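A hedged sketch of the supported operators, with a hypothetical SparkDataFrame df and columns y, x1, x2, id:

spark.glm(df, y ~ x1 + x2, family = "gaussian")          # y modeled by x1 and x2
spark.glm(df, y ~ . - id, family = "gaussian")           # all columns except id
spark.glm(df, y ~ x1 + x2 + x1:x2, family = "gaussian")  # interaction term x1:x2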
Generalized Linear Model
# R-like
glm(Sepal_Length ~ Sepal_Width + Species, gaussianDF, family = "gaussian")

spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width, family = "binomial")

• with family = "binomial", the string label and the prediction are output as strings
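A minimal end-to-end sketch combining spark.glm with the model operations shown earlier, assuming iris (createDataFrame converts "." in column names to "_"):

gaussianDF <- createDataFrame(iris)        # Sepal.Length -> Sepal_Length, etc.
model <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species,
                   family = "gaussian")
summary(model)                             # R-like coefficients table
preds <- predict(model, gaussianDF)        # adds a "prediction" column
head(select(preds, "Sepal_Length", "prediction"))
write.ml(model, "/tmp/glm_model")          # path is a placeholder
model2 <- read.ml("/tmp/glm_model")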
Naive Bayes
spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)

• string label is indexed; the predicted label is converted back to string
k-means
spark.kmeans(kmeansDF, ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width,
             k = 3)
Accelerated Failure Time (AFT) Survival Model
spark.survreg(aftDF, Surv(futime, fustat) ~ ecog_ps + rx)

• the formula is rewritten to extract the censor column
Isotonic Regression
spark.isoreg(df, label ~ feature, isotonic = FALSE)
Gaussian Mixture Model
spark.gaussianMixture(df, ~ V1 + V2, k = 2)
Alternating Least Squares
spark.als(df, "rating", "user", "item", rank = 20, reg = 0.1,
          maxIter = 10, nonnegative = TRUE)
Multilayer Perceptron Model
spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3),
          solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1)
Multiclass Logistic Regression
spark.logit(df, label ~ ., regParam = 0.3, elasticNetParam = 0.8,
            family = "multinomial", thresholds = c(0, 1, 1))
• binary or multiclass
Random Forest
spark.randomForest(df, Employed ~ ., type = "regression",
                   maxDepth = 5, maxBins = 16)

spark.randomForest(df, Species ~ Petal_Length + Petal_Width,
                   "classification", numTrees = 30)

• “classification”: label is indexed; the predicted label is converted back to string
Gradient Boosted Tree
spark.gbt(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)

spark.gbt(df, IndexedSpecies ~ ., type = "classification", stepSize = 0.1)

• “classification”: label is indexed; the predicted label is converted back to string
• classification is binary only
Modeling Parameters
spark.randomForest
function(data, formula, type = c("regression", "classification"),
         maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
         featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
         minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
         maxMemoryInMB = 256, cacheNodeIds = FALSE)
Spark.ml Challenges
• Limited API sets
• Keeping up with changes - almost all
• Non-trivial to map the spark.ml API to an R API
• Simple API, but fixed ML pipeline
• Debugging is hard
• Not an ML-specific problem - getting better?
Native-R UDF
• User-Defined Functions - custom transformation
• Apply by Partition
• Apply by Group
[Diagram: data.frame → UDF → data.frame]
Parallel Processing By Partition
[Diagram: each partition becomes an R data.frame, is processed by the UDF in its own R process, and comes back as a data.frame]
UDF: Apply by Partition
• Similar to R apply
• Function to process each partition of a DataFrame
• Mapping of Spark/R data types
dapply(carsSubDF,
       function(x) {
         x <- cbind(x, x$mpg * 1.61)
       },
       schema)
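The snippet above leaves schema undefined; a minimal runnable sketch assuming mtcars — the schema must describe the data.frame the UDF returns:

carsSubDF <- select(createDataFrame(mtcars), "mpg", "cyl")
schema <- structType(structField("mpg", "double"),
                     structField("cyl", "double"),
                     structField("kmpg", "double"))
out <- dapply(carsSubDF,
              function(x) { cbind(x, kmpg = x$mpg * 1.61) },
              schema)
head(collect(out))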
UDF: Apply by Partition + Collect
• No schema
out <- dapplyCollect(
  carsSubDF,
  function(x) {
    x <- cbind(x, "kmpg" = x$mpg * 1.61)
  })
Example - UDF
results <- dapplyCollect(train,
  function(x) {
    # "randomForest::" accesses the package at each invocation
    model <- randomForest::randomForest(
      as.factor(dep_delayed_15min) ~ Distance + night + early,
      data = x, importance = TRUE, ntree = 20)
    # closure capture: "t" is serialized & broadcast with the UDF
    predictions <- predict(model, t)
    data.frame(UniqueCarrier = t$UniqueCarrier, delayed = predictions)
  })
UDF: Apply by Group
• By grouping columns
gapply(carsDF, "cyl",
function(key, x) {
y <- data.frame(key, max(x$mpg))
},
schema)
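Again the schema describes the UDF output; a sketch assuming carsDF built from mtcars:

carsDF <- createDataFrame(mtcars)
schema <- structType(structField("cyl", "double"),
                     structField("max_mpg", "double"))
result <- gapply(carsDF, "cyl",
                 function(key, x) { data.frame(key, max(x$mpg)) },
                 schema)
head(collect(result))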
UDF: Apply by Group + Collect
• No Schema
out <- gapplyCollect(carsDF, "cyl",
function(key, x) {
y <- data.frame(key, max(x$mpg))
names(y) <- c("cyl", "max_mpg")
y
})
UDF: data type mapping (* not a complete list)

R                   Spark
byte                byte
integer             integer
float               float
double, numeric     double
character, string   string
binary, raw         binary
logical             boolean
POSIXct, POSIXlt    timestamp
Date                date
array, list         array
env                 map
UDF Challenges
• “struct” - no support for nested structures as columns
• Scaling up / data skew
• What if a partition or group is too big for a single R process?
• Not enough data variety to run the model?
• Performance costs
• Serialization/deserialization, data transfer
• esp. beware of closure capture
• Package management
UDF: lapply
• Like R lapply or doParallel
• Good for “embarrassingly parallel” tasks
• Such as hyperparameter tuning
UDF: lapply
• Take a native R list, distribute it
• Run the UDF in parallel
[Diagram: each element of a vector/list → UDF → *anything*, collected into a list]
UDF: parallel distributed processing
• Output is a list - needs to fit in memory at the driver
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))

train <- function(cost) {
  model <- e1071::svm(Species ~ ., iris, cost = cost)
  summary(model)
}

summaries <- spark.lapply(costs, train)
Demo at felixcheung.github.io
SparkR as a Package (target:ASAP)
• Goal: simple one-line installation of SparkR from CRAN
install.packages("SparkR")
• Spark jar downloaded from the official release and cached automatically, or manually via install.spark() since Spark 2
• R vignettes
• Community can write packages that depend on the SparkR package, e.g. SparkRext
• Advanced Spark JVM interop APIs:
  sparkR.newJObject, sparkR.callJMethod, sparkR.callJStatic
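A hedged sketch of these interop calls going through the RBackend (java.util.HashMap is just an illustrative class):

m <- sparkR.newJObject("java.util.HashMap")
sparkR.callJMethod(m, "put", "spark", "r")
sparkR.callJMethod(m, "get", "spark")                        # returns "r"
sparkR.callJStatic("java.lang.System", "currentTimeMillis")  # static method call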
Ecosystem
• RStudio sparklyr
• RevoScaleR/RxSpark, R Server
• H2O R
• Apache SystemML (R-like API)
• STC R4ML
• Renjin (not Spark)
• IBM BigInsights Big R (not Spark!)
Recap: SparkR 2.0.0, 2.1.0, 2.1.1
• SparkSession
• ML
• GLM, LDA
• UDF
• numPartitions, coalesce
• sparkR.uiWebUrl
• install.spark
What’s coming in SparkR 2.2.0
• More, richer ML
• Bisecting K-means, Linear SVM, GLM - Tweedie, FP-Growth, Decision Tree (2.3.0)
• Column functions - sketch below
• to_date/to_timestamp format, approxQuantile over multiple columns, from_json/to_json
• Structured Streaming
• DataFrame - checkpoint, hint
• Catalog API - createTable, listDatabases, refreshTable, …
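A hedged sketch of the new column functions, assuming the 2.2.0 signatures and hypothetical columns:

df$ts <- to_timestamp(df$time_str, "yyyy-MM-dd HH:mm:ss")       # format argument is new
df$d  <- to_date(df$date_str, "yyyy-MM-dd")
approxQuantile(df, c("col1", "col2"), c(0.25, 0.5, 0.75), 0.0)  # multiple columns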
SS in 1 line
library(magrittr)
kbsrvs <- "kafka-0.broker.kafka.svc.cluster.local:9092"
topic <- "test1"
read.stream("kafka", kafka.bootstrap.servers = kbsrvs,
            subscribe = topic) %>%
  selectExpr("explode(split(cast(value as string), ' ')) as word") %>%
  group_by("word") %>%
  count() %>%
  write.stream("console", outputMode = "complete")
SS-ML-UDF
See my other talk….
Thank You.
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7