SparkR + Zeppelin

Seattle Spark Meetup
Sept 9, 2015
Felix Cheung

Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next

R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages

Fast!
Scalable
Flexible
Statistical!
Interactive
Packages

SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL, sparkR
• or as a standard R package imported in tools like RStudio
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

History
• Shivaram Venkataraman & Zongheng Yang, AMPLab, UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)

Architecture
(Diagram: native S4 classes & methods, connected by a socket to RBackend)
• A set of native S4 classes and methods that live inside a
standard R package
• A backend that passes data structures and method calls to
Spark Scala/JVM
• A collection of “helper” methods written in Scala
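
The layering in the bullets above can be sketched in plain R. This is a toy illustration under stated assumptions, not SparkR's implementation: the class `ToyDataFrame`, the function `toyBackendCall`, and the in-memory environment standing in for the JVM are all hypothetical names. It shows only the pattern, in which an S4 object holds a handle and S4 methods forward calls to a backend.

```r
library(methods)

# Hypothetical stand-in for the JVM side: an environment mapping object ids
# to data. In SparkR the real backend is reached over a socket.
toyBackend <- new.env()
assign("obj1",
       data.frame(name = c("Ann", "Bob", "Cy"), age = c(19, 30, 41)),
       envir = toyBackend)

# Hypothetical dispatcher: look up the object by id and apply a named method.
toyBackendCall <- function(id, method) {
  target <- get(id, envir = toyBackend)
  switch(method,
         count = nrow(target),
         columns = names(target),
         stop("unknown method: ", method))
}

# The S4 class on the R side holds only a handle (the id), not the data.
setClass("ToyDataFrame", representation(id = "character"))

setGeneric("rowCount", function(x) standardGeneric("rowCount"))
setMethod("rowCount", "ToyDataFrame",
          function(x) toyBackendCall(x@id, "count"))

setGeneric("colNames", function(x) standardGeneric("colNames"))
setMethod("colNames", "ToyDataFrame",
          function(x) toyBackendCall(x@id, "columns"))

df <- new("ToyDataFrame", id = "obj1")
n <- rowCount(df)
cols <- colNames(df)
print(n)
print(cols)
```

Because the R object carries only an id, the data never has to be copied into the R process until a method asks for a result.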

Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities
and Catalyst engine,
e.g. execution plan optimization, constant-folding, predicate
pushdown, and code generation

https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations

DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically download Spark 1.5.0 release
• Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json

(extracted from the demo)
Native R

Native R and dplyr...
Similarly SparkR DataFrame…

SparkR DataFrame…

What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs

SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
• More R-like
df[df$age %in% c(19, 30), 1:2]
transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
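
For comparison with the GLM call above, here is a minimal sketch of the same formula fitted in native R on the built-in `iris` data frame. It is not the demo's code. Native `iris` uses dots in its column names (`Sepal.Length`); the underscored names on the slide presumably come from converting `iris` to a SparkR DataFrame, which is an assumption here.

```r
# Native R: fit a Gaussian GLM on the built-in iris data frame.
model <- glm(Sepal.Length ~ Sepal.Width + Species,
             data = iris, family = "gaussian")

# One intercept, one slope, and two dummy terms for the three species levels.
coefs <- coef(model)
print(names(coefs))

# Predictions come back one per input row.
fitted_vals <- predict(model, iris)
print(head(fitted_vals))
```

The formula syntax is the same in both cases; what differs is where the fitting runs.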

Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release

Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI

subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# A DataFrame can be subset on both rows and columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))
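
The calls above mirror native R data frame syntax. A minimal sketch on a plain `data.frame` follows; the sample rows are hypothetical, since the slide does not show the contents of `df`.

```r
# Hypothetical sample data; the slide does not show what df contains.
df <- data.frame(name = c("Smith", "Jones", "Lee"),
                 age = c(19, 25, 30),
                 stringsAsFactors = FALSE)

# Select a column by position or by name.
same <- identical(df[[2]], df[["age"]])

# Filter rows, then subset rows and columns together.
older <- df[df$age > 20, ]
picked <- df[df$age %in% c(19, 30), 1:2]

# subset() evaluates its condition inside df, so bare column names also work.
picked2 <- subset(df, age %in% c(19, 30), select = c(1, 2))

print(picked)
```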

Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
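
As a point of comparison, base R's `transform` has the same shape on a native data frame (`mutate` comes from dplyr rather than base R). This is a sketch with hypothetical sample values for `col1`.

```r
# Hypothetical sample data with a numeric column named col1, as on the slide.
df <- data.frame(col1 = c(5, 10, 15))

# Base R transform() adds the new columns and returns a new data frame.
newDF2 <- transform(df, newCol = col1 / 5, newCol2 = col1 * 2)

# The original data frame is left unchanged.
print(newDF2)
print(names(df))
```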
