Reza Zadeh
Advanced Data Science on Spark
@Reza_Zadeh | http://reza-zadeh.com
Data Science Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry
How do we program these things?
Use a Cluster
Convex Optimization
Matrix Factorization
Machine Learning

Numerical Linear Algebra
Large Graph analysis
Streaming and online algorithms
Following lectures on http://stanford.edu/~rezab/dao
Slides at http://stanford.edu/~rezab/slides/sparksummit2015
Outline
Data Flow Engines and Spark
The Three Dimensions of Machine Learning
Built-in Libraries
MLlib + {Streaming, GraphX, SQL}
Future of MLlib
Traditional Network Programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split problem across nodes?
•  Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (node not failed, but slow)
» Ethernet networking not fast
» Have to write programs for each machine
Rarely used in commodity datacenters
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
Network vs Local
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA->Netherlands->CA: 150,000,000 ns
Data Flow Models
Restrict the programming interface so that the
system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks
and where to run each task
» Run parts twice for fault recovery
Biggest example: MapReduce
[Diagram: a MapReduce job expressed as a graph of Map and Reduce tasks]
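For instance, a job expressed as a graph of high-level operators (a minimal word-count sketch in Spark, not from the deck; the input path is a placeholder):

// Word count as a graph of high-level operators; the system decides
// how to split each operator into tasks and where to run them
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)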
Example: Iterative Apps
[Diagram: iterative jobs and interactive queries on MapReduce; every iteration and every query reads from and writes to the distributed file system]
Commonly spend 90% of time doing I/O
MapReduce evolved
MapReduce is great at one-pass computation,
but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to distributed file system
» Slow due to replication & disk storage
Verdict
MapReduce algorithms research doesn’t go
to waste, it just gets sped up and easier to
use

Still useful to study as an algorithmic
framework, silly to use directly
Spark Computing Engine
Extends a programming language with a
distributed collection data-structure
» “Resilient distributed datasets” (RDD)
Open source at Apache
» Most active community in big data, with 50+
companies contributing
Clean APIs in Java, Scala, Python
Community: SparkR, being released in 1.4!
Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user
controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, …)
» The world only lets you make RDDs such that
they can be:
Automatically rebuilt on failure
Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T
val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]

// Transform using standard collection operations (lazily evaluated)
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

messages.saveAsTextFile("errors.txt") // kicks off a computation
Fault Tolerance
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)
  
[Diagram: lineage graph Input file → map → reduce → filter]
RDDs track lineage info to rebuild lost data
Partitioning
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)
RDDs know their partitioning functions
[Diagram: Input file → map → reduce → filter; the reduceByKey output is known to be hash-partitioned, and the filter output's partitioning is also known]
MLlib: Available algorithms
classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
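As a quick illustration, a minimal training call against MLlib's RDD-based API (a hedged sketch; the data path and parameter values are placeholders, not from the deck):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// Load LIBSVM-formatted examples as an RDD[LabeledPoint]
val training = MLUtils.loadLibSVMFile(sc, "data/training.libsvm").cache()

// Train logistic regression with 100 iterations of SGD
val model = LogisticRegressionWithSGD.train(training, 100)

// Score one example
val prediction = model.predict(training.first().features)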
The Three Dimensions
ML Objectives
Almost all machine learning objectives are
optimized using this update


w is a vector of dimension d;
we're trying to find the best w via optimization
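The update itself appeared as an image on the slide; a standard gradient step of the kind described (a reconstruction, since the exact formula is not in the text) is

$$ w \leftarrow w - \alpha \sum_{i=1}^{n} \nabla_w \, \ell(w;\, x_i, y_i) $$

where $\alpha$ is the step size and $\ell$ is the per-example loss.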
Scaling

1) Data size

2) Number of models

3) Model size
Logistic Regression
Goal: find best line separating two sets of points
[Figure: scatter plot of + and – points, showing a random initial line and the target separating line]
Data Scaling
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w
Separable Updates
Can be generalized for
»  Unconstrained optimization
»  Smooth or non-smooth
»  L-BFGS, Conjugate Gradient, Accelerated Gradient methods, …
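Since the update is a sum of independent per-point terms, one map and one reduce compute it. A minimal sketch, assuming data: RDD[Point] and a current weight vector w: Array[Double] (names are illustrative, not from the deck):

import scala.math.exp

case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

// Per-point logistic-loss gradient term, matching the slide's formula
def grad(w: Array[Double], p: Point): Array[Double] =
  p.x.map(xj => xj * p.y * (1.0 / (1.0 + exp(-p.y * dot(w, p.x)))))

// The separable update: map computes per-point terms, reduce sums them
val gradient = data.map(p => grad(w, p))
  .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })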
Logistic Regression Results
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30); Hadoop takes ~110 s per iteration, while Spark's first iteration takes 80 s and further iterations take about 1 s]
100 GB of data on 50 m1.xlarge EC2 machines
Behavior with Less RAM
[Chart: iteration time (s) vs % of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%]
Lots of little models
Training many small models is embarrassingly parallel

Most of the work should be handled by the data flow paradigm

The ML Pipelines API does this
Hyper-parameter Tuning
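One natural instance is a grid search over a regularization parameter; each configuration trains independently. A hedged sketch against MLlib's API (the parameter grid and the validationScore helper are illustrative, not from the deck):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val regParams = Seq(0.001, 0.01, 0.1, 1.0)

// Each configuration trains independently -- an embarrassingly parallel loop
val models = regParams.map { reg =>
  val alg = new LogisticRegressionWithSGD()
  alg.optimizer.setNumIterations(100).setRegParam(reg)
  (reg, alg.run(training))
}

// validationScore is a hypothetical scoring helper; pick the best configuration
val (bestReg, bestModel) = models.maxBy { case (_, m) => validationScore(m) }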
Model Scaling
Linear models only need to compute the dot
product of each example with model

Use a BlockMatrix to store data, use joins to
compute dot products

Coming in 1.5
Model Scaling
Data joined with model (weight):
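The image on this slide is not in the text; a minimal sketch of the join-based dot product under stated assumptions (the block layouts are illustrative, not MLlib's actual implementation):

// Assumed shapes:
//   dataBlocks:  RDD[(Int, (Long, Array[Double]))] -- (blockId, (rowId, feature slice))
//   modelBlocks: RDD[(Int, Array[Double])]         -- (blockId, weight slice)
val partials = dataBlocks.join(modelBlocks).map {
  case (_, ((rowId, xs), ws)) =>
    (rowId, xs.zip(ws).map { case (x, w) => x * w }.sum)
}

// Sum the per-block partial products to get each example's full dot product
val dotProducts = partials.reduceByKey(_ + _)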
Built-in libraries
A General Platform
[Diagram: Spark Core with standard libraries on top: Spark Streaming (real-time), Spark SQL (structured data), GraphX (graphs), MLlib (machine learning), …]
Standard libraries included with Spark
Benefit for Users
Same engine performs data extraction, model
training and interactive queries

[Diagram: with separate engines, each step (parse, train, query) does its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train, and query in one engine]
Machine Learning Library (MLlib)
70+ contributors
in past year
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
GraphX
General graph processing library

Build graph using RDDs of nodes and edges

Run standard algorithms such as PageRank
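A minimal GraphX sketch (the toy vertices and edges are illustrative, not from the deck):

import org.apache.spark.graphx.{Edge, Graph}

// Build a small graph from RDDs of vertices and edges
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.001).vertices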
Spark Streaming
Run a streaming computation as a series
of very small, deterministic batch jobs
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
•  Chop up the live stream into batches of X seconds
•  Spark treats each batch of data as RDDs and processes them using RDD operations
•  Finally, the processed results of the RDD operations are returned in batches
Spark Streaming
Run a streaming computation as a series
of very small, deterministic batch jobs
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
•  Batch sizes as low as ½ second, latency ~ 1 second
•  Potential for combining batch processing and streaming processing in the same system
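A minimal DStream sketch (the socket source on localhost:9999 is illustrative, not from the deck):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// One-second batches: each batch becomes an RDD processed with RDD operations
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()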
  
Spark SQL
// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
MLlib + {Streaming, GraphX, SQL}
A General Platform
[Diagram: Spark Core with standard libraries on top: Spark Streaming (real-time), Spark SQL (structured data), GraphX (graphs), MLlib (machine learning), …]
Standard libraries included with Spark
MLlib + Streaming
As of Spark 1.1, you can train linear models in
a streaming fashion, k-means as of 1.2

Model weights are updated via SGD, thus
amenable to streaming

More work needed for decision trees
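A minimal sketch with MLlib's streaming linear regression (the input streams and feature dimension are assumed defined):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Weights start at zero and are refined by SGD on each incoming batch
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream)          // DStream[LabeledPoint]
model.predictOn(testStream).print()    // DStream of feature vectors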
MLlib + SQL
df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)
DataFrames in Spark 1.3! (March 2015)
Powerful when coupled with the new pipeline API
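A hedged sketch of the spark.ml pipeline API this couples with (the stages and column names are illustrative, not from the deck):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A pipeline chains feature transformers and an estimator over a DataFrame
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)   // trainingDF: DataFrame, assumed defined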
MLlib + GraphX
Future of MLlib
Goals
Tighter integration with DataFrame and spark.ml API

Accelerated gradient methods & Optimization interface

Model export: PMML (current export exists in Spark 1.3, but
not PMML, which lacks distributed models)

Scaling: Model scaling
