A DataFrame Abstraction
Layer for SparkR
Chris Freeman
Agenda
• What is SparkR?
• History of DataFrames
• Why DataFrames?
• How do DataFrames work?
• Demo
• On the Roadmap
• Questions
What is SparkR?
• New R language API for Spark and SparkSQL
• Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
• Has its own shell, but can also be imported like a standard R package and used with RStudio.
What is SparkR?
• An opportunity to make Spark accessible to the large community of R developers who already have clear ideas about how to do analytics in R
• No need to learn a new programming paradigm when working with Spark
History of DataFrames
• SparkR began as an R package that ported Spark’s core functionality (RDDs) to the R language.
• The next logical step was to add SparkSQL and SchemaRDDs.
• An initial implementation of SQLContext and SchemaRDDs is working in SparkR.
History of DataFrames
[Slides 6–9: screenshots of an exchange between the speaker (“Me:”) and Reynold]
Maybe this isn’t such a bad thing…
How can I use Spark to do something simple?
Let’s say we wanted to do this with regular RDDs. What would that look like?
"Michael, 29"
"Andy, 30"
"Justin, 19"
"Bob, 22"
"Chris, 28"
"Garth, 36"
"Tasha, 24"
"Mac, 30"
"Neil, 32"
How can I use Spark to do something simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
                 function(line) {
                   strsplit(line, ", ")
                 })
ageInt <- lapply(lines,
                 function(line) {
                   as.numeric(line[2])
                 })
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)
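The RDD pipeline above can be sanity-checked locally: the same split / extract / sum logic in plain base R (no Spark required; the data is inlined from the slide) produces the expected average:

```r
# Same people.txt records, inlined as a base-R character vector
people <- c("Michael, 29", "Andy, 30", "Justin, 19", "Bob, 22", "Chris, 28",
            "Garth, 36", "Tasha, 24", "Mac, 30", "Neil, 32")

# strsplit plays the role of flatMap's tokenizer: one c(name, age) pair per record
parts <- strsplit(people, ", ")

# Extract the second field and convert to numeric, as the lapply step does
ages <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))

# reduce(+) / count(...) collapses locally to sum / length
avg <- sum(ages) / length(ages)
avg  # about 27.78
```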
There’s got to be a better way.
What I’d hoped to see
{"name":"Michael", "age":29}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":22}
{"name":"Chris", "age":28}
{"name":"Garth", "age":36}
{"name":"Tasha", "age":24}
{"name":"Mac", "age":30}
{"name":"Neil", "age":32}
What I’d hoped to see
df <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))
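For intuition, here is the same two-line computation on an ordinary local data.frame (data inlined from the JSON slide; base R’s mean() stands in for the avg aggregate):

```r
# Local stand-in for the people.json contents
df <- data.frame(
  name = c("Michael", "Andy", "Justin", "Bob", "Chris", "Garth", "Tasha", "Mac", "Neil"),
  age  = c(29, 30, 19, 22, 28, 36, 24, 30, 32)
)

# In SparkR this is select(df, avg(df$age)); locally, mean() does the job
avg <- mean(df$age)
avg
```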
Why DataFrames?
• Uses the distributed, parallel capabilities offered by RDDs, but imposes a schema on the data
• More structure == Easier access and manipulation
• Natural extension of existing R conventions since DataFrames are already the standard
Why DataFrames?
• Super awesome distributed, in-memory collections
• Schemas == metadata, structure, declarative instead of imperative
• ????
• Profit
DataFrames in SparkR
• Multiple components:
– A set of native S4 classes and methods that live inside a standard R package
– A SparkR backend that passes data structures and method calls to the JVM
– A set of “helper” methods written in Scala
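A minimal sketch of the first component (hypothetical names, not real SparkR internals): an S4 class whose only local state is an opaque reference to a JVM-side object, with S4 generics dispatching on it. A real backend would forward the call over a socket to the JVM instead of answering locally.

```r
library(methods)

# Hypothetical S4 wrapper: the only local state is an ID naming a JVM-side object
setClass("SparkDataFrameSketch", representation(jref = "character"))

# A generic plus a method dispatching on the wrapper class; in SparkR the
# method body would serialize the call and ship it to the JVM backend
setGeneric("jvmRef", function(x) standardGeneric("jvmRef"))
setMethod("jvmRef", "SparkDataFrameSketch", function(x) x@jref)

d <- new("SparkDataFrameSketch", jref = "df_001")
jvmRef(d)  # "df_001"
```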
Why does the structure matter?
• Native R classes allow us to extend the existing DataFrame API by adding R-like syntax and interactions
• Handoff to the JVM gives us full access to Spark’s DAG capabilities and Catalyst optimizations, e.g. constant folding, predicate pushdown, and code generation.
SparkR DataFrame Features
• Column access using ‘$’ or ‘[ ]’, just like in R
• dplyr-like DataFrame manipulation:
– filter
– groupBy
– summarize
– mutate
• Access to external R packages that extend R syntax
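These verbs mirror operations R users already perform on local data frames. For comparison, a base-R rendition (data inlined from the earlier slide; base functions stand in for the SparkR verbs, so this runs without Spark):

```r
df <- data.frame(
  name = c("Michael", "Andy", "Justin", "Bob", "Chris", "Garth", "Tasha", "Mac", "Neil"),
  age  = c(29, 30, 19, 22, 28, 36, 24, 30, 32)
)

# filter: '$' and '[ ]' indexing, the same column syntax SparkR exposes
adults <- df[df$age >= 25, ]

# mutate: add a derived column in place
df$decade <- (df$age %/% 10) * 10

# groupBy + summarize: aggregate() collapses each group to one row
by_decade <- aggregate(age ~ decade, data = df, FUN = mean)
nrow(adults)  # 6
```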
Demo Time!
On the Roadmap
• Spark 1.4: SparkR becomes an official API
– Primarily focused on the SparkSQL/DataFrame implementation
• Spark 1.5: Extend SparkR to include machine learning capabilities (e.g. sparkML)
• For more information, be sure to check out “SparkR: The Past, Present, and Future” at 4:30 on the Data Science track.
Integration with Alteryx
• Drag-and-drop GUI for data analysis
• Spark functionality built directly into existing tools using SparkR
• Interact with a remote Spark cluster from your desktop via Alteryx Designer
• Combine local and in-database data sources in one workflow.
Developer Community
• SparkR originated at the UC Berkeley AMPLab, with additional contributions from Alteryx, Intel, Databricks, and others.
• Working on integration with Spark Packages
– Easily extend Spark with new functionality and distribute via the Spark Packages repository
Questions?
Slides, Demo, and Data available on GitHub at:
https://github.com/cafreeman/SparkR_DataFrame_Demo
@15lettermax
cafreeman