SparkR
Advanced Analytics for Big Data
A workshop with the Spark-Meetup
Tuesday 17th Nov 2015
Agenda
- INTRODUCTION
- SPARK OVERVIEW
- DATAFRAMES OVERVIEW
- SPARKR
- DEMO: MACHINE LEARNING
WHO AM I?
SAMUEL SHAMIRI
PhD STATISTICS + MSc ECONOMETRICS
Senior Analyst
Samuel.Shamiri@veda.com.au
https://au.linkedin.com/pub/samuel-shamiri
http://sshamiri.blogspot.com/
Veda is a data analytics business providing information and analytic services to
businesses to assist them in making decisions and managing risks.
Veda holds data on more than 16.4 million credit-active individuals,
3.6 million companies and businesses, and 3.4 million sole traders
throughout Australia, providing customers with the
ability to make more informed decisions.
SPARK USERS
In production at over 500 organizations
Telecom Media
Retail Pharma
Investment Research Distributors
Space other
SPARK ECOSYSTEM
- *CSV, *TXT, *JSON
- Launching mode: Local, YARN, Standalone, Mesos
- APIs: DataFrames API, RDD API
- High level libraries: SQL, GraphX
INITIALIZE SPARK
- Spark Context (sc): the window to the world of Spark
- sqlContext: the window to the world of DataFrames
- Transformation (lazy): takes an RDD/DataFrame and returns a new RDD/DataFrame
- Action: causes an RDD to be evaluated (often storing the result)
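The split between a lazy transformation and an action can be tried out without a cluster. What follows is a minimal plain-Python sketch, not Spark code: the class LazyCollection and its map, filter and collect methods are hypothetical names chosen for illustration. Transformations only record work; the action is what runs it.

```python
class LazyCollection:
    """Toy stand-in for an RDD/DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a NEW collection with one more recorded step.
        return LazyCollection(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, nothing is evaluated here.
        return LazyCollection(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded pipeline evaluated.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every time the mapped function really runs


def double(x):
    calls.append(x)
    return x * 2


ages = LazyCollection([42, 81, 64, 14])
pipeline = ages.map(double).filter(lambda x: x > 100)
evaluated_before_action = len(calls)  # still 0: nothing has run yet
result = pipeline.collect()           # the action triggers evaluation
```

Until collect() is called, double is never invoked, which is the behaviour the slide describes as "lazy".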
DataFrame: read less data
- Optimally compressed
- Uses partitioning
- Skips data using statistics
[Diagram labels: mapPartitions(), ShuffledRDD, ZipPartitions()]
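"Skips data using statistics" can be pictured as keeping a minimum and maximum per chunk of data and consulting them before reading the chunk. The sketch below is plain Python under that assumption; the Partition class and the ages_over function are hypothetical and do not reflect how Spark stores or scans data internally.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    rows: list     # the ages stored in this chunk
    min_age: int   # statistics kept alongside the data
    max_age: int


def make_partition(rows):
    return Partition(rows, min(rows), max(rows))


def ages_over(partitions, threshold):
    """Return matching ages plus how many partitions were actually read."""
    matches, partitions_read = [], 0
    for part in partitions:
        if part.max_age <= threshold:
            continue  # the statistics prove no row can match: skip the chunk
        partitions_read += 1
        matches.extend(age for age in part.rows if age > threshold)
    return matches, partitions_read


partitions = [make_partition(p) for p in ([14, 23, 25], [39, 42, 59], [64, 75, 81])]
matches, partitions_read = ages_over(partitions, 60)
```

With this layout only one of the three partitions is scanned for the query "age over 60"; the other two are ruled out by their maximum alone.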
first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
How can I read this? Compute the average with…
- RDD
- DataFrame
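As a point of reference for the question on this slide, the rows shown can be read with their header and averaged in a few lines of plain Python. This is only a local sketch using the nine visible rows (the slide's data continues past the "…"), and it relies on the standard library rather than on Spark.

```python
import csv
import io
from statistics import mean

# The nine rows visible on the slide; the original file has more.
SAMPLE = """first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
"""

# The header row acts as a schema: each record becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# With named columns the computation reads as "the mean of the age column".
average_age = mean(int(row["age"]) for row in rows)
```

Because the columns are named, the code states what is wanted rather than how to locate the field, which is the same idea the DataFrame API builds on.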
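// MapReduce version (Java). The class names AgeMapper and AgeReducer are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (1, age) for every line so that all ages reach the same reducer.
class AgeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
  private final IntWritable one = new IntWritable(1);
  private final IntWritable output = new IntWritable();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Text has no split(); convert to String and split on the tab character.
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
  }
}

// Reducer: averages every age received under the single key.
class AgeReducer extends Reducer<IntWritable, IntWritable, IntWritable, DoubleWritable> {
  private final DoubleWritable average = new DoubleWritable();

  @Override
  protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
      sum += value.get();
      count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
  }
}
Super awesome distributed, in-memory collections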
Schemas == metadata, structure and declarative
WRITE LESS CODE, BETTER READABILITY
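To see what the mapper and reducer above do to the data, the three phases can be traced on a small list in plain Python. This is a local sketch, not Hadoop code: the function names are hypothetical, and the tab-separated "name, age" input layout is an assumption taken from the mapper's use of the second field.

```python
from collections import defaultdict


def map_phase(lines):
    # Mirrors the mapper: emit (1, age) for each tab-separated line.
    for line in lines:
        fields = line.split("\t")
        yield 1, int(fields[1])


def shuffle(pairs):
    # The framework groups all values that share a key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Mirrors the reducer: one average per key.
    return {key: sum(values) / len(values) for key, values in grouped.items()}


lines = ["Erin\t42", "Norman\t81", "Miguel\t64"]  # assumed input layout
averages = reduce_phase(shuffle(map_phase(lines)))
```

Every record is emitted under the same key, so a single reduce call sees all the ages; that is the whole job, and it is why the equivalent DataFrame query is so much shorter.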
RDD
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
                 function(line) {
                   strsplit(line, ", ")
                 })
ageInt <- lapply(lines,
                 function(line) {
                   as.numeric(line[2])
                 })
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)

DataFrame
df <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))
NOT R v PYTHON v SCALA, IT’S R/PYTHON/SCALA + SPARK
[Chart: Time to aggregate 10 million int pairs (secs), axis 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, DataFrame SQL. Source: https://gist.github.com/rxin/c1592c133e4bccf515dd]
- Easier to program
- Significantly fewer lines of code
- Improved performance via intelligent optimizations and code-generation
LIMITATION - COMPLICATION: R with other frameworks
R's dynamic design imposes performance problems at runtime (single-threaded, all data must fit in
memory). Data scientists therefore use R in conjunction with other frameworks.
[Diagram labels: Google, Local storage, Distributed storage, Data sources, ETL, Data warehouse, Framework (clean, transform, aggregate, filter, sample, other, …), Read and Analyse]
USE SPARK’S DISTRIBUTED, PARALLEL IN-MEMORY COLLECTION
[Diagram labels: Distributed storage/Data source, Read and Analyse]
- Distributed/robust processing, off-memory data structures for interactive analysis at speed
- Dynamic environment, interactivity, packages, visualization
- Real-time analytics pipeline
SPARKR ARCHITECTURE
[Diagram labels: Data Source; DRIVER MACHINE (Spark Context, Controller, R-JVM, JVM SparkContext); WORKER MACHINE (two JVM Executors, each holding DataFrame, RDD and running two Tasks)]
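The driver/worker split in the architecture diagram can be imitated locally to show how work is divided into tasks and the partial results are combined. This is a plain-Python sketch only: threads stand in for the JVM executors, the function names are hypothetical, and nothing here talks to Spark or to R.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n):
    """Driver side: cut the data into n roughly equal partitions."""
    return [data[i::n] for i in range(n)]


def task(partition):
    """Worker side: each task returns a partial (sum, count) for its partition."""
    return sum(partition), len(partition)


def distributed_mean(data, workers=2, partitions=4):
    # Two workers and four tasks, matching the counts drawn on the slide.
    parts = split_into_partitions(data, partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(task, parts))
    # Driver side again: combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


ages = [42, 81, 64, 14, 39, 23, 75, 59, 25]
result = distributed_mean(ages)
```

Only small (sum, count) pairs travel back to the driver, not the rows themselves, which is the property that lets the same pattern scale to data that does not fit on one machine.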
DEMO: Data wrangling and Machine learning with SparkR
Questions
Slides, Demo, and Data available on GitHub at https://github.com/SShamiri/SparkR
@SamuelShamiri
