SparkR + Zeppelin
Seattle Spark Meetup
Sept 9, 2015
Felix Cheung
Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL, sparkR
• Or as a standard R package imported in tools like RStudio
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
History
• Shivaram Venkataraman & Zongheng Yang, AMPLab – UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
Architecture
[Diagram labels: Native S4 classes & methods; socket; RBackend]
• A set of native S4 classes and methods that live inside a
standard R package
• A backend that passes data structures and method calls to
Spark Scala/JVM
• A collection of “helper” methods written in Scala
Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant folding, predicate pushdown, and code generation
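One way to see the Catalyst engine at work from R is to print the plan of a lazily built DataFrame. The following is a minimal sketch, assuming a local Spark 1.5-era installation reachable through SPARK_HOME; the master URL, application name, and the choice of the built-in faithful data set are illustrative assumptions, not taken from the talk.

```r
library(SparkR)

# Assumption: Spark 1.5-era SparkR is on the library path and can start locally.
sc <- sparkR.init(master = "local[2]", appName = "explain-sketch")
sqlContext <- sparkRSQL.init(sc)

# faithful ships with base R; it stands in for real data here.
df <- createDataFrame(sqlContext, faithful)

# filter() and select() only build up a plan; no job runs yet.
shortWaits <- select(filter(df, df$waiting < 50), "eruptions")

# Print the logical and physical plans chosen by the JVM side.
explain(shortWaits, extended = TRUE)

sparkR.stop()
```

The printed plan is where optimizations such as predicate pushdown would show up, depending on the data source behind the DataFrame.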
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations
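As a rough illustration of the Data Source API from R, the sketch below reads JSON and writes Parquet with read.df and write.df, using only sources built into Spark. The file paths, sample records, and master URL are made up for the example. A source provided by a Spark package would be named through the same source argument, which is not shown here.

```r
library(SparkR)

# Assumption: Spark 1.5-era SparkR, started locally.
sc <- sparkR.init(master = "local[2]", appName = "datasource-sketch")
sqlContext <- sparkRSQL.init(sc)

# Write a tiny JSON-lines file so the example is self-contained.
jsonPath <- tempfile(fileext = ".json")
writeLines(c('{"name":"Smith","age":30}',
             '{"name":"Lee","age":19}'), jsonPath)

# Read it through the Data Source API; the schema is inferred from the JSON.
people <- read.df(sqlContext, jsonPath, source = "json")
printSchema(people)

# Write the same data back out as Parquet, then load it again.
parquetPath <- file.path(tempdir(), "people.parquet")
write.df(people, path = parquetPath, source = "parquet", mode = "overwrite")
reloaded <- read.df(sqlContext, parquetPath, source = "parquet")
head(reloaded)
```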
SparkR in Zeppelin
Architecture
[Diagram labels: R; R adaptor]
Demo
DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically download Spark 1.5.0 release
• Need to build Zeppelin from
https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
(extracted from the demo)
Native R
(extracted from the demo)
Native R and dplyr...
Similarly SparkR DataFrame…
(extracted from the demo)
SparkR DataFrame…
What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
model <- glm(Sepal_Length ~ Sepal_Width +
Species, data = df, family = "gaussian")
• More R-like
df[df$age %in% c(19, 30), 1:2]
transform(df, newCol = df$col1 / 5, newCol2 =
df$col1 * 2)
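The glm call on this slide assumes a DataFrame df that the deck does not define. A minimal sketch of a complete run, assuming df is built from R's iris data set (createDataFrame replaces the dots in column names with underscores, which matches Sepal_Length above) and that Spark 1.5-era SparkR can start locally:

```r
library(SparkR)

# Assumption: Spark 1.5-era SparkR, started locally.
sc <- sparkR.init(master = "local[2]", appName = "glm-sketch")
sqlContext <- sparkRSQL.init(sc)

# iris column names such as Sepal.Length become Sepal_Length in Spark.
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM with an R formula, as on the slide.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

# Inspect the fitted coefficients.
summary(model)

# Score the training data; predictions come back as a DataFrame.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
```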
Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release
Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# DataFrame can be subset on both rows and Columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))
Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
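For comparison, the same subsetting and transform expressions can be tried on a local base R data.frame without Spark. This is only a sketch: the sample rows and the name local_df are made up, and base R lets subset and transform refer to bare column names where the SparkR calls above spell out df$age and df$col1.

```r
# Made-up sample data standing in for the deck's df.
local_df <- data.frame(name = c("Smith", "Lee", "Park"),
                       age  = c(30, 19, 45),
                       col1 = c(10, 20, 30),
                       stringsAsFactors = FALSE)

# Row filter plus column selection, as on the subset slide.
picked <- local_df[local_df$age %in% c(19, 30), 1:2]
same   <- subset(local_df, age %in% c(19, 30), select = 1:2)

# transform() adds derived columns, as on the Transform/mutate slide.
wider <- transform(local_df, newCol = col1 / 5, newCol2 = col1 * 2)

print(picked)
print(wider)
```

In local R, mutate comes from the dplyr package rather than base R, so only transform is shown here.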

Editor's Notes

  • #6: In RStudio: .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  • #14: Use the viewer to check out the notebook without running Zeppelin: https://www.zeppelinhub.com/viewer/
  • #15: Retail employment, in millions (2008-2014) Source: Bureau of Labor Statistics Credit: NPR
  • #23: dplyr-like syntax