SparkR
- Play Spark Using R
Gil Chen
@HadoopCon 2016
Demo: http://goo.gl/VF77ad
about me
• R, Python & Matlab User
• Taiwan R User Group
• Taiwan Spark User Group
• Co-founder
• Data Scientist @
HadoopCon 2015
Outline
• Introduction to SparkR
• Demo
• Starting to use SparkR
• DataFrames: dplyr style, SQL style
• RDD vs. DataFrames
• MLlib: GLM, K-means
• Use Case
• Median: approxQuantile()
• ID Match: dplyr style, SQL style, SparkR function
• SparkR + Shiny
• The Future of SparkR
Introduction to SparkR
Spark Origin
• Apache Spark is an open source cluster computing
framework
• Originally developed at the University of California,
Berkeley's AMPLab
• The first two contributors of SparkR:
Shivaram Venkataraman & Zongheng Yang
https://amplab.cs.berkeley.edu/
Spark History
https://en.wikipedia.org/wiki/Apache_Spark
(timeline figure marking the introduction of PySpark, DataFrames, and SparkR)
Key Advantages of Spark & R
Spark (fast, flexible, scalable) + R (statistical, interactive, rich packages)
https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
ggplot2
Google Search: ggplot2
ggplot2 is a plotting system for R, based on the grammar of graphics.
Shiny
http://shiny.rstudio.com/gallery/
and more impressive dashboards…
A web application framework for R: turn your analyses into interactive web applications,
with no HTML, CSS, or JavaScript knowledge required. (A minimal sketch follows.)
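A quick, hedged sketch of a minimal Shiny app (the data here is random and purely illustrative; in a real dashboard the values could come from a collected SparkR query):
library(shiny)
ui <- fluidPage(
  titlePanel("Minimal Shiny sketch"),
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("histPlot")
)
server <- function(input, output) {
  output$histPlot <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n))  # placeholder data
  })
}
shinyApp(ui = ui, server = server)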
Performance
https://amplab.cs.berkeley.edu/announcing-sparkr-r-on-spark/
The runtime performance of running group-by aggregation on 10 million integer pairs
on a single machine in R, Python, and Scala
(using the same dataset as https://goo.gl/iMLXnh).
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
RDD (Resilient Distributed Dataset)
https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD
Internally, each RDD is characterized by five main properties:
1. A list of partitions
2. A function for computing each split
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
https://docs.cloud.databricks.com/docs/latest/courses
RDD dependencies
• Narrow dependency: each partition of the parent RDD is used by at most one
partition of the child RDD, so the task can be executed locally and no shuffle
is needed (e.g. map, flatMap, filter, sample).
• Wide dependency: multiple child partitions may depend on one partition of the
parent RDD, so data has to be shuffled unless the parents are hash-partitioned
(e.g. sortByKey, reduceByKey, groupByKey, join). A small sketch follows.
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
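A hedged sketch of the difference, using SparkR's private RDD API that appears on later slides (the toy key-value data and the existing SparkContext sc are assumptions):
pairs   <- SparkR:::parallelize(sc, list(list("a", 1), list("b", 2), list("a", 3)))
doubled <- SparkR:::lapply(pairs, function(p) list(p[[1]], p[[2]] * 2))  # narrow: per-partition map, no shuffle
summed  <- SparkR:::reduceByKey(doubled, function(x, y) x + y, 2L)       # wide: shuffles rows by key
SparkR:::collect(summed)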
Job Scheduling
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
(Black boxes in the figure: partitions already in memory.)
RDD Operations
Transformations
map()
flatMap()
filter()
mapPartitions()
sample()
union()
intersection()
distinct()
groupByKey()
reduceByKey()
sortByKey()
join()
cogroup()
…
Actions
reduce()
collect()
count()
first()
take(num)
takeSample()
takeOrdered()
saveAsTextFile()
saveAsSequenceFile()
saveAsObjectFile()
countByValue()
countByKey()
foreach()
…
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
Transformations create either narrow or wide dependencies and are lazily evaluated:
nothing is computed until an action is called.
RDD Example
(diagram: an RDD flows through a chain of transformations; an action finally returns a value to R)
rdd <- SparkR:::textFile(sc, "txt")
words <- SparkR:::flatMap(rdd, function(line) strsplit(line, " ")[[1]])
wordCount <- SparkR:::lapply(words, function(word) list(word, 1))
counts <- SparkR:::reduceByKey(wordCount, "+", 1L)
op <- SparkR:::collect(counts)
RDD & DataFrames
(diagram: the R shell drives SparkR; before v1.6 the main abstraction was the RDD,
an array-like collection handled through general transformations and actions;
since v2.0 it is the SparkDataFrame, essentially an R data.frame plus a schema)
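A hedged sketch of moving between a local R data.frame and a SparkDataFrame with the 2.0 API (toy data, illustrative only):
local_df <- data.frame(id = 1:5, letter = letters[1:5])
sdf      <- createDataFrame(local_df)   # R data.frame -> SparkDataFrame (schema inferred)
printSchema(sdf)
local_back <- collect(sdf)              # SparkDataFrame -> local R data.frame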
DataFrames are Faster!
http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/
Beyond SQL: Speeding up Spark with DataFrames
http://www.slideshare.net/databricks/spark-sqlsse2015public
Spark Stack
https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html
Layers of the stack, bottom-up: Storage; Cluster Manager; Processing Engine; Access & Interfaces
How does SparkR work?
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
Upgrading from SparkR 1.6 to 2.0 (before 1.6.2 → since 2.0.0)
• Data type naming: DataFrame → SparkDataFrame
• Reading CSV: package from Databricks → built-in
• Functions like approxQuantile(): not available → available
• ML functions: glm only → more algorithms (or use sparklyr)
• SQLContext / HiveContext: sparkRSQL.init(sc) → merged into sparkR.session()
• Execution messages: very detailed → simple
• Launch on EC2: API available → removed
https://spark.apache.org/docs/latest/sparkr.html
Demo
http://goo.gl/VF77ad
Easy Setting
1. Download
2. Decompress and Give a Path
3. Set Path and Launch SparkR in R
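A hedged sketch of those three steps (the version and install path are illustrative; the R part is spelled out on the next two slides):
# 1. Download a pre-built package from https://spark.apache.org/downloads.html
# 2. Decompress it and note the path, e.g. in a shell:
#    tar -xzf spark-2.0.0-bin-hadoop2.7.tgz -C /usr/local/
# 3. Set the path and launch SparkR from R:
Sys.setenv(SPARK_HOME = "/usr/local/spark-2.0.0-bin-hadoop2.7/")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(appName = "Demo_SparkR")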
Documents
• If you have to use RDDs, refer to the AMPLab GitHub docs:
http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/
and use ":::", e.g. SparkR:::textFile, SparkR:::lapply
• Otherwise, refer to the official SparkR documentation:
https://spark.apache.org/docs/2.0.0/api/R/index.html
Starting to Use SparkR (v1.6.2)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize the SparkContext (sc)
sc <- sparkR.init(appName = "Demo_SparkR")
# Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
Starting to Use SparkR (v2.0.0)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize a SparkSession (since 2.0.0 this replaces the SparkContext + SQLContext setup)
sc <- sparkR.session(appName = "Demo_SparkR")
# A separate SQLContext is no longer needed since 2.0.0
# sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
DataFrames
# Load the flights CSV file using read.df
sdf <- read.df(sqlContext,"data_flights.csv",
"com.databricks.spark.csv", header = "true")
# Filter flights from JFK
jfk_flights <- filter(sdf, sdf$origin == "JFK")
# Group and aggregate flights to each destination
dest_flights <- summarize(
groupBy(jfk_flights, jfk_flights$dest),
count = n(jfk_flights$dest))
# Running SQL Queries
registerTempTable(sdf, "tempTable")
training <- sql(sqlContext,
"SELECT dest, count(dest) as cnt FROM tempTable
WHERE dest = 'JFK' GROUP BY dest")
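For comparison, a hedged sketch of the same pipeline under the SparkR 2.0 API, which drops the sqlContext argument and uses the built-in csv source (column names assumed to match the example above):
sdf <- read.df("data_flights.csv", source = "csv", header = "true")
jfk_flights <- filter(sdf, sdf$origin == "JFK")
dest_flights <- summarize(
  groupBy(jfk_flights, jfk_flights$dest),
  count = n(jfk_flights$dest))
createOrReplaceTempView(sdf, "tempTable")
training <- sql("SELECT dest, count(dest) as cnt FROM tempTable
                 WHERE dest = 'JFK' GROUP BY dest")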
Word Count
# read data into RDD
rdd <- SparkR:::textFile(sc, "data_word_count.txt")
# split word
words <- SparkR:::flatMap(rdd, function(line) {
strsplit(line, " ")[[1]]
})
# map: give 1 for each word
wordCount <- SparkR:::lapply(words, function(word) {
list(word, 1)
})
# reduce: sum the 1s by key (word)
counts <- SparkR:::reduceByKey(wordCount, "+", 2)
# convert RDD to list
op <- SparkR:::collect(counts)
RDD vs. DataFrames
# DataFrames (the %>% pipe comes from magrittr, loaded in the session)
flights_SDF <- read.df(sqlContext, "data_flights.csv",
  source = "com.databricks.spark.csv", header = "true")
SDF_op <- flights_SDF %>%
  group_by(flights_SDF$hour) %>%
  summarize(sum(flights_SDF$dep_delay)) %>%
  collect()
# RDD
flights_RDD <- SparkR:::textFile(sc, "data_flights.csv")
RDD_op <- flights_RDD %>%
  SparkR:::filterRDD(function(x) { x >= 1 }) %>%
  SparkR:::lapply(function(x) {
    y1 <- as.numeric(unlist(strsplit(x, ","))[2])
    y2 <- as.numeric(unlist(strsplit(x, ","))[6])
    return(list(y1, y2)) }) %>%
  SparkR:::reduceByKey(function(x, y) x + y, 1L) %>%
  SparkR:::collect()
SparkR on MLlib
SparkR supports a subset of the available R formula operators for model fitting,
including ~, ., :, + and -, e.g. y ~ x1 + x2. (A few examples follow.)
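A few hedged formula examples against the flights columns used below (purely illustrative):
arr_delay ~ .                    # all other columns as predictors
arr_delay ~ dep_delay + dist     # two explicit predictors
arr_delay ~ dep_delay : dist     # interaction term only
arr_delay ~ . - dest             # everything except dest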
Generalized Linear Model, GLM
# read data and cache (schema: a structType defined beforehand, omitted on the slide)
flights_SDF <- read.df("data_flights.csv", source = "csv",
                       header = "true", schema) %>% cache()
# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")
# split train/test dataset
train <- sample(flights_SDF_2, withReplacement = FALSE,
                fraction = 0.5, seed = 42)
test <- except(flights_SDF_2, train)
# build the model
gaussianGLM <- spark.glm(train, arr_delay ~ dep_delay + dist,
                         family = "gaussian")
summary(gaussianGLM)
# prediction
preds <- predict(gaussianGLM, test)
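To eyeball the fit, a hedged follow-up (predict() on a SparkR model appends a "prediction" column to the SparkDataFrame):
head(select(preds, "arr_delay", "prediction"))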
K-means
# read data and cache (schema as in the GLM example)
flights_SDF <- read.df("data_flights.csv", source = "csv",
                       header = "true", schema) %>% cache()
# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")
# clustering
kmeansModel <- spark.kmeans(flights_SDF_2, ~ arr_delay + dep_delay + dist +
                              flight + dest + cancelled + time, k = 15)
summary(kmeansModel)
cluster_op <- fitted(kmeansModel)
# clustering result
kmeansPredictions <- predict(kmeansModel, flights_SDF_2)
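A hedged way to check how the rows were spread over the 15 clusters (the "prediction" column holds the assigned cluster):
collect(count(groupBy(kmeansPredictions, "prediction")))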
Use Case
Median (approxQuantile)
gdf <- seq(1,10,1) %>% data.frame()
colnames(gdf) <- "seq"
sdf <- createDataFrame(gdf)
median_val <- approxQuantile(sdf, "seq", 0.5, 0) %>% print()
Calculating a median in plain SQL is much more complicated:
http://www.1keydata.com/tw/sql/sql-median.html
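approxQuantile() generalizes directly to other quantiles; a hedged sketch on the same toy column:
quartiles <- approxQuantile(sdf, "seq", c(0.25, 0.5, 0.75), 0)  # 1st quartile, median, 3rd quartile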
ID Match
##### method 1 : like dplyr + pipeline
join_id_m1 <- join(sdf_1, sdf_2,
sdf_1$id1 == sdf_2$id2, "inner") %>%
select("id2") %>%
collect()
##### method 2 : sql query
createOrReplaceTempView(sdf_1, "table1")
createOrReplaceTempView(sdf_2, "table2")
qry_str <- "SELECT table2.id2 FROM table1
JOIN table2 ON table1.id1 = table2.id2"
join_id_m2 <- sql(qry_str)
##### method 3 : SparkR function
join_id_m3 <- intersect(sdf_1, sdf_2) %>%
  collect()
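A hedged toy setup for trying the methods above (column names are assumptions matching the slide); note that intersect() in method 3 compares whole rows, so both SparkDataFrames would also need the same single ID column name for it to match the join approaches:
sdf_1 <- createDataFrame(data.frame(id1 = c(1, 2, 3, 4)))
sdf_2 <- createDataFrame(data.frame(id2 = c(3, 4, 5)))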
Play Pokemon Go Data
with SparkR !!
Application on SparkR
Compute Engine + Web Framework + Interactive Maps
Where is the Dragonite nest?
Port 8080 - Cluster Monitor: capacity of each worker
Port 4040 - Jobs Monitor: status of each worker and advanced performance;
cache(SparkDataFrame) makes the first run slow, later runs much faster
Some Tricks
• Customize the Spark config at launch (see the sketch below)
• cache() SparkDataFrames that are reused
• Some code can't run in RStudio; try the terminal instead
• Third-party packages, e.g. the Databricks package for reading CSV files
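A hedged example of customizing the launch config through sparkR.session() (the memory values are illustrative):
sparkR.session(appName = "Demo_SparkR",
               sparkConfig = list(spark.driver.memory   = "4g",
                                  spark.executor.memory = "2g"))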
The Future of SparkR
• More MLlib APIs
• Advanced user-defined functions (UDFs)
• The sparklyr package from RStudio
Reference
• SparkR: Scaling R Programs with Spark. Shivaram Venkataraman, Zongheng Yang, Davies Liu,
Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica,
and Matei Zaharia. SIGMOD 2016, June 2016.
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
• SparkR: Interactive R Programs at Scale. Shivaram Venkataraman, Zongheng Yang. Spark
Summit, June 2014, San Francisco.
https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
• Apache Spark Official Research
http://spark.apache.org/research.html
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Apache Spark Official Documentation
http://spark.apache.org/docs/latest/api/scala/
• AMPLab UC Berkeley - SparkR Project
https://github.com/amplab-extras/SparkR-pkg
• Databricks Official Blog
https://databricks.com/blog/category/engineering/spark
• R-bloggers: Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio
https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
RStudio on Amazon EC2
Join Us
• Fansboard
• Web Designer (PHP & JavaScript)
• Editor w/ Facebook & Instagram
• Vpon - Data Scientist
• Taiwan Spark User Group
• Taiwan R User Group
Thanks for your attention
& Taiwan Spark User Group
& Vpon Data Team
