SparkR-Advance Analytic for Big Data

SparkR
Advance Analytics for Big Data
A workshop with the Spark-Meetup
Tuesday 17th Nov 2015

Agenda
- INTRODUCTION
- SPARK OVERVIEW
- DATAFRAMES OVERVIEW
- SPARKR
- DEMO: MACHINE LEARNING

WHO AM I?
SAMUEL SHAMIRI
PhD STATISTICS + MSc ECONOMETRICS
Senior Analyst
Samuel.Shamiri@veda.com.au
https://au.linkedin.com/pub/samuel-shamiri
http://sshamiri.blogspot.com/
Veda is a data analytics business providing information and analytic services to businesses to assist them in making decisions and managing risks.
Veda holds data on more than 16.4 million credit-active individuals, 3.6 million companies and businesses, and 3.4 million sole traders throughout Australia, providing customers with the ability to make more informed decisions.

Telecom Media
Retail Pharma
Investment Research Distributors
SPARK USERS: in production at over 500 organizations
Space, other

SPARK ECOSYSTEM
Data formats: CSV, TXT, JSON
Launching modes: Local, YARN, Standalone, Mesos
APIs: DataFrames API, SQL, GraphX, high-level libraries, RDD API

INITIALIZE SPARK
Spark Context (sc): the window to the world of Spark
sqlContext: the window to the world of DataFrames
Transformation (lazy): takes an RDD/DataFrame and returns a new RDD/DataFrame
Action: causes an RDD to be evaluated (often storing the result)
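The difference between a lazy transformation and an action can be illustrated without Spark. The sketch below is a toy analogy in plain Python, not Spark's implementation: `LazyCollection`, `double`, and the sample numbers are hypothetical names introduced only for illustration. Transformations just record work and return a new collection; the action `collect()` is what finally runs it.

```python
class LazyCollection:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=None):
        self._source = source
        self._steps = list(steps or [])

    def map(self, fn):
        # Transformation: returns a NEW collection and does no work yet.
        return LazyCollection(self._source, self._steps + [("map", fn)])

    def filter(self, pred):
        # Transformation: also lazy, also returns a new collection.
        return LazyCollection(self._source, self._steps + [("filter", pred)])

    def _iterate(self):
        # Chain the recorded steps as lazy iterators over the source.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: forces evaluation and materialises the result.
        return list(self._iterate())

    def count(self):
        # Action: forces evaluation and returns the number of elements.
        return sum(1 for _ in self._iterate())


calls = []  # records every element that actually gets processed


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyCollection([1, 2, 3, 4])
big = numbers.map(double).filter(lambda x: x > 4)  # nothing runs here
evaluated_before_action = len(calls)               # still 0
result = big.collect()                             # the action triggers the work
```

Only after `collect()` does `calls` fill up, which mirrors why chaining transformations in Spark is cheap and the cost is paid when an action runs.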
DataFrame: read less data
- Optimally compressed
- Uses partitioning
- Skips data using statistics
mapPartitions() ShuffledRDD ZipPartitions()
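"Skips data using statistics" can be sketched in a few lines. The example below is a simplified illustration in plain Python, not Spark's storage layer: the `Partition` class, the `min_age`/`max_age` fields, and the sample ages are assumptions made for the sketch. Each partition carries min/max statistics, and a query skips any partition whose statistics prove that no row can match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    """A chunk of rows plus precomputed min/max statistics for 'age'."""
    ages: List[int]
    min_age: int
    max_age: int


def make_partition(ages: List[int]) -> Partition:
    # Statistics are computed once, when the data is written.
    return Partition(ages, min(ages), max(ages))


def count_older_than(partitions: List[Partition], threshold: int) -> Tuple[int, int]:
    """Count rows with age > threshold; also report how many partitions were read."""
    matches = 0
    partitions_read = 0
    for part in partitions:
        if part.max_age <= threshold:
            continue  # statistics prove no row can match: skip without reading
        partitions_read += 1
        matches += sum(1 for age in part.ages if age > threshold)
    return matches, partitions_read


partitions = [
    make_partition([14, 23, 25]),
    make_partition([39, 42, 59]),
    make_partition([64, 75, 81]),
]
matches, partitions_read = count_older_than(partitions, 60)
```

With a threshold of 60, only the last partition has to be scanned; the first two are ruled out by their `max_age` alone.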

first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
How can I read this? Compute the average with… RDD or DataFrame
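As a baseline before bringing in Spark, here is a minimal single-machine sketch in plain Python that reads the sample and averages the `age` column. It covers only the nine rows shown on the slide (the elided rest of the file is not guessed at), and `SAMPLE` and `average_age` are names introduced for this illustration.

```python
import csv
import io

# Only the rows visible on the slide; the real file continues beyond these.
SAMPLE = """first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
"""


def average_age(text: str) -> float:
    """Parse CSV text with a header row and return the mean of the 'age' column."""
    reader = csv.DictReader(io.StringIO(text))  # the header row acts as a schema
    ages = [int(row["age"]) for row in reader]
    return sum(ages) / len(ages)


avg = average_age(SAMPLE)
```

Because the header gives every column a name, the code can ask for `age` directly instead of counting field positions, which is the same convenience a DataFrame schema provides.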

// Mapper body: emit (1, age) for every tab-separated input line
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

// Reducer body: average all ages received under the single key
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
Super awesome distributed, in-memory collections
Schemas == metadata, structure and declarative
WRITE LESS CODE, BETTER READABILITY
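To make the mapper/reducer flow above easier to follow, here is a small single-process simulation in plain Python. It is a sketch of the map, shuffle, and reduce phases, not Hadoop itself. It assumes tab-separated `name<TAB>age` lines with the age in field 1, and the function names and three sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (1, age) so that every record lands under the same key."""
    fields = line.split("\t")
    return 1, int(fields[1])


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: average the values collected for one key."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    return key, total / count


lines = ["Erin\t42", "Norman\t81", "Miguel\t64"]  # assumed input layout
grouped = shuffle(map_phase(line) for line in lines)
results = dict(reduce_phase(key, values) for key, values in grouped.items())
```

`results` holds a single entry under key `1` containing the average age, which is the same shape of output the reducer writes.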
RDD
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
                 function(line) {
                   strsplit(line, ", ")
                 })
ageInt <- lapply(lines,
                 function(line) {
                   as.numeric(line[2])
                 })
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)
DataFrame
df <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))

[Bar chart: time to aggregate 10 million int pairs (secs), axis 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Source: https://gist.github.com/rxin/c1592c133e4bccf515dd
- Easier to program
- Significantly fewer lines of code
- Improved performance via intelligent optimizations and code-generation
NOT R v PYTHON v SCALA, IT’S R/PYTHON/SCALA + SPARK

LIMITATION - COMPLICATION: R with other frameworks
R's dynamic design imposes performance problems at runtime (single-threaded, all data must fit in memory). Data scientists use R in conjunction with other frameworks as follows:
[Diagram 1: Distributed storage; Framework (clean, transform, aggregate, filter, sample, other, …); Local storage; Read and Analyse]
[Diagram 2: Data sources; ETL; Data warehouse; Framework (clean, transform, aggregate, filter, sample, other, …); Read and Analyse]

USE SPARK’S DISTRIBUTED, PARALLEL IN-MEMORY COLLECTION
[Diagram: Distributed storage/Data source; Read and Analyse]
Distributed/robust processing, off-memory data structures for interactive analysis at speed
Dynamic environment, interactivity, packages, visualization
Real-time analytics pipeline

SPARKR ARCHITECTURE
[Architecture diagram. Data Source. DRIVER MACHINE: Spark Context, Controller, R-JVM, JVM SparkContext. WORKER MACHINE: JVM Executors, each holding DataFrame/RDD data and running Tasks.]
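The driver/worker split in this architecture can be mimicked on one machine. The sketch below is an analogy only: Python threads stand in for JVM executors, and `split_into_partitions`, `task`, and `driver_average` are hypothetical names, not SparkR functions. The driver partitions the data, each worker task returns a partial (sum, count), and the driver combines the partials.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n):
    """Driver side: deal the rows out into n partitions."""
    return [rows[i::n] for i in range(n)]


def task(partition):
    """Worker side: compute a partial (sum, count) for one partition."""
    return sum(partition), len(partition)


def driver_average(rows, num_workers=2):
    """Driver: schedule one task per partition, then combine partial results."""
    partitions = split_into_partitions(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(task, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


ages = [42, 81, 64, 14, 39, 23, 75, 59, 25]  # sample ages from the earlier slide
avg = driver_average(ages)
```

Returning (sum, count) pairs rather than per-partition averages keeps the combined result exact even when partitions differ in size.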

Data wrangling and
Machine learning with SparkR

Questions
Slides, demo, and data available on GitHub at https://github.com/SShamiri/SparkR
@SamuelShamiri
