Spark: Next Generation Hadoop

Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Contents
Big Data Computations
Hadoop 2.0 (Hadoop YARN)
Berkeley data analytics stack
• BDAS Spark
• BDAS Discretized Streams
PMML Scoring for Naïve Bayes
• PMML Primer
• Naïve Bayes Primer
Big Data Computations
Computations/Operations
• Giant 1 (simple stats) is perfect for Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient.
  • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
  • Example is social group-first approach for consumer churn analysis [2]
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations. Dremel/Drill
• Data sets – not embarrassingly parallel?
  • Deep Learning, Artificial Neural Networks
  • Machine vision from Google [3]
  • Speech analysis from Microsoft
• Giant 5 – Graph processing – GraphLab, Pregel, Giraph

[1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social
Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio
Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012:
Hadoop YARN Requirements or 1.0 shortcomings
R1: Scalability
• Single cluster limitation
R2: Multi-tenancy
• Addressed by Hadoop-on-Demand
• Security, Quotas
R3: Locality awareness
• Shuffle of records
R4: Shared cluster utilization
• Hogging by users
• Typed slots
R5: Reliability/Availability
• Job Tracker bugs
R6: Iterative Machine Learning

Vinod Kumar Vavilapalli, Arun C Murthy , Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves,
Jason Lowe , Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and
Eric Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing,
Oct 2013, ACM Press.
Iterative ML Algorithms
• What are iterative algorithms?
  • Those that need communication among the computing entities
  • Examples – neural networks, PageRank algorithms, network traffic analysis
• Conjugate gradient descent
  • Commonly used to solve systems of linear equations
  • [CB09] tried implementing CG on dense matrices
  • DAXPY – Multiplies vector x by constant a and adds y.
  • DDOT – Dot product of 2 vectors
  • MatVec – Multiply matrix by vector, produce a vector.
  • 1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.
• Other iterative algorithms – fast Fourier transform, block tridiagonal

[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing,
Technical Report, University of California, Computer Science Department, 2009.
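The primitives named on this slide can be combined into a conjugate gradient solver roughly as follows. This is a minimal single-machine sketch in plain Python, not the MapReduce implementation from [CB09]; the function names `daxpy`, `ddot`, `matvec` and `conjugate_gradient` are illustrative, and the matrix is assumed to be symmetric positive definite. Each loop pass makes one MatVec, two DDOT and three DAXPY calls, which is consistent with the six-MR-per-iteration figure if every primitive runs as its own job.

```python
def daxpy(a, x, y):
    """DAXPY: multiply vector x by constant a and add vector y."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def ddot(x, y):
    """DDOT: dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def matvec(A, x):
    """MatVec: multiply matrix A (list of rows) by vector x."""
    return [ddot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, assuming A is symmetric positive definite."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial x = 0
    p = list(r)                 # first search direction
    rs_old = ddot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:        # squared residual norm is small enough
            break
        Ap = matvec(A, p)                   # MatVec
        alpha = rs_old / ddot(p, Ap)        # DDOT
        x = daxpy(alpha, p, x)              # DAXPY
        r = daxpy(-alpha, Ap, r)            # DAXPY
        rs_new = ddot(r, r)                 # DDOT
        p = daxpy(rs_new / rs_old, p, r)    # DAXPY
        rs_old = rs_new
    return x


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(solution)
```

In a MapReduce setting each of these calls would need its own pass over the data, which is where the communication overhead described above comes from; an in-memory engine can keep `x`, `r` and `p` resident between iterations.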

Hadoop YARN Architecture
YARN Internals
Application Master
• Sends ResourceRequests to the YARN RM
• Captures containers, resources per container, locality preferences.
YARN RM
• Generates tokens and containers
• Global view of cluster – monolithic scheduling.
Node Manager
• Node health monitoring, advertises available resources through heartbeats to RM.
Berkeley Big-data Analytics Stack (BDAS)

BDAS: Spark
Transformations/Actions – Description
map(function f1) – Pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(function f2) – Select elements of RDD that return true when passed through f2.
flatMap(function f3) – Similar to map, but f3 returns a sequence to facilitate mapping single input to multiple outputs.
union(RDD r1) – Returns result of union of the RDD r1 with the self.
sample(flag, p, seed) – Returns a randomly sampled (with seed) p percentage of the RDD.
groupByKey(noTasks) – Can only be invoked on key-value paired data – returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks) – Aggregates result of applying f4 on elements with same key. No. of parallel tasks is the second argument.
join(RDD r2, noTasks) – Joins RDD r2 with self – computes all possible pairs for given key.
groupWith(RDD r3, noTasks) – Joins RDD r3 with self and groups by key.
sortByKey(flag) – Sorts the self RDD in ascending or descending order based on flag.
reduce(function f5) – Aggregates result of applying function f5 on all elements of self RDD.
collect() – Return all elements of the RDD as an array.
count() – Count no. of elements in RDD.
take(n) – Get first n elements of RDD.
first() – Equivalent to take(1).
saveAsTextFile(path) – Persists RDD in a file in HDFS or other Hadoop supported file system at given path.
saveAsSequenceFile(path) – Persists RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement Hadoop Writable interface or equivalent.
foreach(function f6) – Run f6 in parallel on elements of self RDD.

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
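To make the key-based operations above concrete, here is a small sketch of what groupByKey, reduceByKey and join compute, written against ordinary Python lists of (key, value) pairs rather than real RDDs. The helper names `group_by_key`, `reduce_by_key` and `join` and the sample data are illustrative assumptions; nothing here is distributed or uses the Spark API.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key, like groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, f):
    """Fold the values of each key with f, like reduceByKey."""
    return {key: reduce(f, values) for key, values in group_by_key(pairs).items()}


def join(left, right):
    """Emit every (key, (left value, right value)) combination, like join."""
    right_groups = group_by_key(right)
    return [(key, (v, w)) for key, v in left for w in right_groups.get(key, [])]


# Hypothetical sample data: page views and page owners.
views = [("a.html", 1), ("b.html", 1), ("a.html", 1)]
owners = [("a.html", "alice"), ("b.html", "bob")]

print(group_by_key(views))
print(reduce_by_key(views, lambda a, b: a + b))
print(join(views, owners))
```

In Spark the same operations shuffle records between nodes so that equal keys end up in the same partition, which is what the noTasks argument controls.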
BDAS: Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates VS on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive
• ML for connection analysis and video streaming optimization.
Yahoo
• Advertisement targeting: 30K nodes on Hadoop Yarn
  • Hadoop – batch processing
  • Spark – iterative processing
  • Storm – on-the-fly processing
• Content recommendation – collaborative filtering
PMML Primer

Predictive Model Markup
Language

Developed by DMG (Data
Mining Group)

PMML offers a standard
to define a model, so that
a model generated in
tool-A can be directly
used in tool-B.

XML representation of a
model.

May contain a myriad of
data transformations
(pre- and post-processing)
as well as one or more
predictive models.

Naïve Bayes Primer
A simple probabilistic
classifier based on
Bayes Theorem

Given features X1, X2, …, Xn, predict a label Y by calculating the probability for all possible values of Y.

[Equation: Bayes' theorem, with its terms labelled Likelihood, Prior and Normalization Constant]
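For reference, the standard form of the Naïve Bayes rule that the primer describes can be written as below. This is the textbook formulation rather than a transcription of the slide's own equation; the symbol y for a candidate label is introduced here for illustration.

```latex
P(Y = y \mid X_1, \dots, X_n)
  = \frac{\overbrace{P(Y = y)}^{\text{prior}}
          \;\overbrace{\prod_{i=1}^{n} P(X_i \mid Y = y)}^{\text{likelihood}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}},
\qquad
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)
```

The product form of the likelihood is the "naïve" assumption that the features are conditionally independent given the label. The normalization constant is the same for every candidate label, so it can be dropped when only the predicted label is needed.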
PMML Scoring for Naïve Bayes
Wrote a PMML based
scoring engine for
Naïve Bayes
algorithm.

This can theoretically
be used in any
framework for data
processing by
invoking the API

Deployed a Naïve
Bayes PMML
generated from R into
Storm / Spark and
Samza frameworks

Real time predictions
with the above APIs

Header
• Version and timestamp
• Model development environment information
Data Dictionary
• Variable types, missing, valid and invalid values
Data Munging/Transformation
• Normalization, mapping, discretization
Model
• Model-specific attributes
• Mining Schema
  • Treatment for missing and outlier values
• Targets
  • Prior probability and default
• Outputs
  • List of computed output fields
  • Post-processing
• Definition of model architecture/parameters.
PMML Scoring for Naïve Bayes
<DataDictionary numberOfFields="4">
<DataField name="Class" optype="categorical" dataType="string">
<Value value="democrat"/>
<Value value="republican"/>
</DataField>
<DataField name="V1" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
<DataField name="V2" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
<DataField name="V3" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
</DataDictionary>

(ctd on the next slide)

PMML Scoring for Naïve Bayes
<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification"
threshold="0.003">
<MiningSchema>
<MiningField name="Class" usageType="predicted"/>
<MiningField name="V1" usageType="active"/>
<MiningField name="V2" usageType="active"/>
<MiningField name="V3" usageType="active"/>
</MiningSchema>
<Output>
<OutputField name="Predicted_Class" feature="predictedValue"/>
<OutputField name="Probability_democrat" optype="continuous" dataType="double"
feature="probability" value="democrat"/>
<OutputField name="Probability_republican" optype="continuous" dataType="double"
feature="probability" value="republican"/>
</Output>
<BayesInputs>
(ctd on the next page)

PMML Scoring for Naïve Bayes

<BayesInputs>
<BayesInput fieldName="V1">
<PairCounts value="n">
<TargetValueCounts>
<TargetValueCount value="democrat" count="51"/>
<TargetValueCount value="republican" count="85"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="y">
<TargetValueCounts>
<TargetValueCount value="democrat" count="73"/>
<TargetValueCount value="republican" count="23"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
<BayesInput fieldName="V2">
*
<BayesInput fieldName="V3">
*
</BayesInputs>
<BayesOutput fieldName="Class">
<TargetValueCounts>
<TargetValueCount value="democrat" count="124"/>
<TargetValueCount value="republican" count="108"/>
</TargetValueCounts>
</BayesOutput>
PMML Scoring for Naïve Bayes
Definition of Elements:
DataDictionary: Definitions for fields as used in mining models (Class, V1, V2, V3)
NaiveBayesModel: Indicates that this is a NaiveBayes PMML
MiningSchema: Lists fields as used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields
Output: Describes a set of result values that can be returned from a model
PMML Scoring for Naïve Bayes
Definition of Elements (ctd.):
BayesInputs: For each value of each input field, contains the counts of each target value
BayesOutput: Contains the counts associated with the values of the target field
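The counts in BayesInputs and BayesOutput are all a scorer needs. The sketch below shows one way such scoring could work, using only the V1 counts and the threshold printed on the slides (the V2 and V3 counts are elided there, so they are left out). It is an illustrative reading of the PMML Naïve Bayes scoring rule, not the scoring engine described in this deck; the names `score`, `target_counts` and `pair_counts` are hypothetical.

```python
# threshold attribute of the NaiveBayesModel element; used in place of a
# conditional probability whenever a pair count is zero.
THRESHOLD = 0.003

# TargetValueCounts from the BayesOutput element.
target_counts = {"democrat": 124, "republican": 108}

# PairCounts from the BayesInputs element. Only V1 is listed on the slide.
pair_counts = {
    "V1": {
        "n": {"democrat": 51, "republican": 85},
        "y": {"democrat": 73, "republican": 23},
    },
}


def score(record):
    """Return (predicted class, class probabilities) for a dict of field values."""
    likelihoods = {}
    for target, total in target_counts.items():
        likelihood = float(total)  # proportional to the prior P(target)
        for field, value in record.items():
            count = pair_counts[field][value].get(target, 0)
            # P(field = value | target), or the threshold for unseen pairs
            likelihood *= count / total if count > 0 else THRESHOLD
        likelihoods[target] = likelihood
    norm = sum(likelihoods.values())  # normalization constant
    probabilities = {t: l / norm for t, l in likelihoods.items()}
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities


predicted, probabilities = score({"V1": "y"})
print(predicted, probabilities)
```

With V1 = y the unnormalized scores are 73 for democrat and 23 for republican, so the record is scored as democrat with probability 73/96.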

PMML Scoring for Naïve Bayes
Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y

• 1st , 2nd and 3rd Columns:
Predictor variables ( Attribute “name” in element MiningField )

• Using these we predict whether the Output is Democrat or
Republican ( PMML element BayesOutput)

PMML Scoring for Naïve Bayes
• 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)

Number of records (in millions) | Time taken (seconds)
0.1 | 4
0.4 | 7
1.0 | 12
2.0 | 21
10 | 129
25 | 310
PMML Scoring for Naïve Bayes
• 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)

Number of records (in millions) | Time taken
0.1 | 1 min 47 sec
0.2 | 3 min 35 sec
0.4 | 6 min 40 sec
1.0 | 35 min 17 sec
10 | More than 3 hrs
Future of Spark
• Domain specific language approach from Stanford.
  • Forge [AKS13] – a meta DSL for high performance DSLs.
  • 40X faster than Spark!
• Spark
  • Explore BLAS libraries for efficiency

[AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.
Conclusion
• Beyond Hadoop Map-Reduce philosophy
  • Optimization and other problems.
  • Real-time computation
  • Processing specialized data structures
• PMML scoring
  • Allows traditional analytical tools/algorithms to be re-used.
• Spark for batch computations
• Spark streaming and Storm for real-time.
Thank You!

Mail
• vijay.sa@impetus.co.in
LinkedIn
• http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs
• blogs.impetus.com
Twitter
• @a_vijaysrinivas
Back up slides

GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
Goals – targeted at machine learning.
• Model graph dependencies, be asynchronous, iterative, dynamic.

Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).

Update functions – live on each vertex
• Transforms data in scope of vertex.
• Can choose to trigger neighbours (for example only if Rank changes drastically)
• Run asynchronously till convergence – no global barrier.

Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates – collaborative filtering).
• GraphLab – provides varying level of consistency. Parallelism VS consistency.

Implemented several algorithms, including ALS, K-means, SVM, Belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed
GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
GraphLab 2: PowerGraph – Modeling Natural Graphs [1]

GraphLab could not scale to Altavista web graph 2002, 1.4B vertices, 6.7B edges.
• Most graph parallel abstractions assume small neighbourhoods – low degree vertices
• But natural graphs (LinkedIn, Facebook, Twitter) – power law graphs.
• Hard to partition power law graphs, high degree vertices limit parallelism.

PowerGraph provides new way of partitioning power law graphs
• Edges are tied to machines, vertices (esp. high degree ones) span machines
• Execution split into 3 phases:
  • Gather, apply and scatter.

Triangle counting on Twitter graph
• Hadoop MR took 423 minutes on 1536 machines
• GraphLab 2 took 1.5 minutes on 1024 cores (64 machines)
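The gather, apply and scatter phases mentioned above can be illustrated with PageRank. The following is a single-machine, synchronous sketch in plain Python, so it shows only the shape of a vertex program and none of PowerGraph's partitioning or asynchrony; the function name `pagerank_gas`, the damping factor and the tolerance are assumptions chosen for illustration.

```python
def pagerank_gas(edges, damping=0.85, tol=1e-9, max_sweeps=1000):
    """PageRank written as gather/apply/scatter over directed (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    in_neighbours = {v: [] for v in vertices}
    out_neighbours = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbours[dst].append(src)
        out_neighbours[src].append(dst)

    rank = {v: 1.0 for v in vertices}
    active = set(vertices)              # vertices scheduled for an update
    for _ in range(max_sweeps):
        if not active:                  # nothing was signalled: converged
            break
        new_rank = dict(rank)
        next_active = set()
        for v in active:
            # Gather: sum contributions arriving over in-edges.
            total = sum(rank[u] / out_degree[u] for u in in_neighbours[v])
            # Apply: update the vertex value from the gathered sum.
            new_rank[v] = (1.0 - damping) + damping * total
            # Scatter: signal out-neighbours only if the value changed enough.
            if abs(new_rank[v] - rank[v]) > tol:
                next_active.update(out_neighbours[v])
        rank = new_rank
        active = next_active
    return rank


# Hypothetical four-edge graph.
ranks = pagerank_gas([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(ranks)
```

The scatter step is what lets work die out in quiet regions of the graph: a vertex is recomputed only when one of its in-neighbours changed noticeably in the previous sweep.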

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph:
Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI '12).
BDAS: Discretized Streams
Treats streams as series of small time interval batch
computations
Event based APIs – stream handling
How to make interval granularity very low (milliseconds)?
• Built over Spark RDDs – in-memory distributed cache

Fault-tolerance is based on RDD lineage (series of
transformations that can be stored and recomputed on failure).
• Parallel recovery – re-computations happen in parallel across the cluster.

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
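The idea of treating a stream as a series of small batches, with runningReduce carrying state from one batch to the next, can be sketched as follows. This is plain Python over in-memory lists with made-up URLs, not the D-Streams API; the generator name `running_reduce` is illustrative.

```python
from collections import Counter


def running_reduce(batches):
    """Yield cumulative per-URL counts after each micro-batch of page views."""
    state = Counter()
    for batch in batches:
        # map step: turn every event in this interval into a (url, 1) pair
        ones = [(url, 1) for url in batch]
        # running reduce: fold this interval's pairs into the carried state
        for url, one in ones:
            state[url] += one
        yield dict(state)


# Each inner list stands for the events received in one interval.
batches = [["/home", "/about"], ["/home"], []]
results = list(running_reduce(batches))
print(results)
```

Because each interval's output depends only on the previous state and the new batch, a lost state can be rebuilt by replaying the batches, which is the intuition behind lineage-based recovery.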
BDAS: D-Streams Streaming Operators

Windowing
• pairs.window("5s").reduceByKey(_+_)

Incremental aggregation
• pairs.reduceByWindow("5s", (a, b) => a + b)

Time skewed joins
words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)
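The difference between windowing and incremental aggregation can be sketched as below: the first function recomputes the window from scratch on every interval, the second adds the newest batch and subtracts the one that slid out. This is plain Python, with the window measured in number of batches rather than seconds; both function names are illustrative and neither is the D-Streams API.

```python
from collections import Counter, deque


def windowed_counts(batches, window):
    """Recompute word counts over the last `window` batches on every interval."""
    recent = deque(maxlen=window)
    for batch in batches:
        recent.append(Counter(batch))
        total = Counter()
        for counts in recent:
            total.update(counts)
        yield dict(total)


def incremental_windowed_counts(batches, window):
    """Same result, but add the new batch and subtract the expired one."""
    recent = deque()
    total = Counter()
    for batch in batches:
        counts = Counter(batch)
        recent.append(counts)
        total.update(counts)
        if len(recent) > window:
            total.subtract(recent.popleft())
            total = +total          # drop words whose count fell to zero
        yield dict(total)


batches = [["a", "b"], ["a"], ["c"]]
print(list(windowed_counts(batches, 2)))
print(list(incremental_windowed_counts(batches, 2)))
```

The incremental form does work proportional to one batch per interval instead of the whole window, at the cost of requiring an operation that can be inverted (here, subtraction).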
Representation of an RDD
Information | HadoopRDD | FilteredRDD | JoinedRDD
Set of partitions | 1 per HDFS block | Same as parent | 1 per reduce task
Set of dependencies | None | 1-to-1 on parent | Shuffle on each parent
Function to compute data set based on parents | Read corresponding block | Compute parent and filter it | Read and join shuffled data
Meta-data on location (preferredLocations) | HDFS block location from namenode | None (parent) | None
Meta-data on partitioning (partitioningScheme) | None | None | HashPartitioner
Logistic Regression: Spark VS Hadoop

http://spark-project.org
Some Spark(ling) examples
Scala code (serial)

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}

println("Pi is roughly " + 4 * count / 100000.0)
Sample random points in the square enclosing the unit circle and count how many fall inside the circle (a fraction of roughly PI/4).
Hence, you get an approximate value for PI.
This is based on PS/PC = AS/AC = 4/PI, where PS and PC are the numbers of points in the square and in the circle and AS and AC are their areas, so PI = 4 * (PC/PS).
Some Spark(ling) examples
Spark code (parallel)
val spark = new SparkContext(<Mesos master>)

var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)
Notable points:
1. Spark context created – talks to Mesos [1] master.
2. Count becomes shared variable – accumulator.
3. For loop is an RDD – breaks Scala range object (1 to 100000) into 12 slices.
4. Parallelize method invokes foreach method of RDD.

[1] Mesos is an Apache incubated clustering system – http://mesosproject.org
Logistic Regression in Spark: Serial Code
// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
// Materialize as a List: getLines() returns an Iterator, which can only be traversed once
val points = lines.map(x => parsePoint(x)).toList
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}

println("Result: " + w)
Logistic Regression in Spark
// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
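The `scale` expression in the loop is the per-point gradient of the logistic loss. A short derivation, assuming labels y_p in {-1, +1} and an implicit step size of 1 (neither is stated on the slide):

```latex
\ell(w) = \sum_{p} \log\!\left(1 + e^{-y_p \, (w \cdot x_p)}\right)
\qquad\Longrightarrow\qquad
\nabla \ell(w) = \sum_{p}
  \underbrace{\left(\frac{1}{1 + e^{-y_p \, (w \cdot x_p)}} - 1\right) y_p}_{\texttt{scale}}
  \; x_p ,
\qquad
w \leftarrow w - \nabla \ell(w)
```

Each term of the sum depends only on one point and the current w, which is why the inner loop can run in parallel over the cached points while an accumulator collects the sum.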

More Related Content

PPTX
Next generation analytics with yarn, spark and graph lab
PPTX
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
PPTX
Big dataanalyticsbeyondhadoop public_20_june_2013
PPTX
Big Data Analytics with Storm, Spark and GraphLab
PDF
Big data distributed processing: Spark introduction
PDF
Implementation of p pic algorithm in map reduce to handle big data
PPTX
Distributed Deep Learning + others for Spark Meetup
PDF
Generalized Linear Models with H2O
Next generation analytics with yarn, spark and graph lab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Big dataanalyticsbeyondhadoop public_20_june_2013
Big Data Analytics with Storm, Spark and GraphLab
Big data distributed processing: Spark introduction
Implementation of p pic algorithm in map reduce to handle big data
Distributed Deep Learning + others for Spark Meetup
Generalized Linear Models with H2O

What's hot (20)

PPTX
High Performance Data Analytics with Java on Large Multicore HPC Clusters
PPTX
Introduction to Mahout
PPTX
dmapply: A functional primitive to express distributed machine learning algor...
PDF
Modeling with Hadoop kdd2011
PDF
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
PDF
Resilient Distributed Datasets
PDF
useR 2014 jskim
PDF
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
PDF
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
PPTX
Qiu bosc2010
PDF
Mapreduce Algorithms
PDF
Python in an Evolving Enterprise System (PyData SV 2013)
PDF
Scalable Data Analysis in R Webinar Presentation
PDF
Boston Spark Meetup event Slides Update
PDF
Optimizing Terascale Machine Learning Pipelines with Keystone ML
PDF
A sql implementation on the map reduce framework
PDF
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
PDF
Scalable Distributed Real-Time Clustering for Big Data Streams
PPTX
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Introduction to Mahout
dmapply: A functional primitive to express distributed machine learning algor...
Modeling with Hadoop kdd2011
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Resilient Distributed Datasets
useR 2014 jskim
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Qiu bosc2010
Mapreduce Algorithms
Python in an Evolving Enterprise System (PyData SV 2013)
Scalable Data Analysis in R Webinar Presentation
Boston Spark Meetup event Slides Update
Optimizing Terascale Machine Learning Pipelines with Keystone ML
A sql implementation on the map reduce framework
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Scalable Distributed Real-Time Clustering for Big Data Streams
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Ad

Similar to Yarn spark next_gen_hadoop_8_jan_2014 (20)

PPTX
Big data analytics_7_giants_public_24_sep_2013
PPT
Data science and OSS
PDF
Big Data Analytics and Ubiquitous computing
PPT
A Hands-on Intro to Data Science and R Presentation.ppt
PPTX
Topic modeling using big data analytics
PDF
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
PDF
MAD skills for analysis and big data Machine Learning
PPTX
Topic modeling using big data analytics
PDF
An efficient data mining framework on hadoop using java persistence api
PDF
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
PPTX
Large Scale Machine Learning with Apache Spark
PPTX
Big data vahidamiri-tabriz-13960226-datastack.ir
PPTX
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
PPTX
Sawmill - Integrating R and Large Data Clouds
PDF
Data Science
PPTX
Analyzing Big data in R and Scala using Apache Spark 17-7-19
PPTX
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
PPTX
Apache spark sneha challa- google pittsburgh-aug 25th
PDF
User biglm
PPTX
System mldl meetup
Big data analytics_7_giants_public_24_sep_2013
Data science and OSS
Big Data Analytics and Ubiquitous computing
A Hands-on Intro to Data Science and R Presentation.ppt
Topic modeling using big data analytics
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
MAD skills for analysis and big data Machine Learning
Topic modeling using big data analytics
An efficient data mining framework on hadoop using java persistence api
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Large Scale Machine Learning with Apache Spark
Big data vahidamiri-tabriz-13960226-datastack.ir
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Sawmill - Integrating R and Large Data Clouds
Data Science
Analyzing Big data in R and Scala using Apache Spark 17-7-19
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
Apache spark sneha challa- google pittsburgh-aug 25th
User biglm
System mldl meetup
Ad

More from Vijay Srinivas Agneeswaran, Ph.D (6)

PDF
Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1
PPTX
Distributed computing abstractions_data_science_6_june_2016_ver_0.4
PPTX
Distributed deep learning_framework_spark_4_may_2015_ver_0.7
PPTX
Open problems big_data_19_feb_2015_ver_0.1
PPTX
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
PPTX
Big data analytics_beyond_hadoop_public_18_july_2013
Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1
Distributed computing abstractions_data_science_6_june_2016_ver_0.4
Distributed deep learning_framework_spark_4_may_2015_ver_0.7
Open problems big_data_19_feb_2015_ver_0.1
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Big data analytics_beyond_hadoop_public_18_july_2013

Recently uploaded (20)

PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Approach and Philosophy of On baking technology
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Encapsulation theory and applications.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
A Presentation on Artificial Intelligence
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Modernizing your data center with Dell and AMD
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Review of recent advances in non-invasive hemoglobin estimation
Encapsulation_ Review paper, used for researhc scholars
Approach and Philosophy of On baking technology
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Big Data Technologies - Introduction.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Reach Out and Touch Someone: Haptics and Empathic Computing
Encapsulation theory and applications.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Spectral efficient network and resource selection model in 5G networks
The Rise and Fall of 3GPP – Time for a Sabbatical?
A Presentation on Artificial Intelligence
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
NewMind AI Monthly Chronicles - July 2025
MYSQL Presentation for SQL database connectivity
Modernizing your data center with Dell and AMD
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Review of recent advances in non-invasive hemoglobin estimation

Yarn spark next_gen_hadoop_8_jan_2014

  • 1. Spark: Next Generation Hadoop Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus 1
  • 2. Contents Big Data Computations Hadoop 2.0 (Hadoop YARN) Berkeley • BDAS Spark data • BDAS Discretized analytics Streams stack PMML • PMML Primer Scoring for Naïve • Naïve Bayes Primer Bayes 2
  • 3. Big Data Computations Computations/Operations Giant 1 (simple stats) is perfect for Hadoop 1.0. Giants 2 (linear algebra), 3 (Nbody), 4 (optimization) Spark from UC Berkeley is efficient. Interactive/On-the-fly data processing – Storm. Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. Example is social group-first approach for consumer churn analysis [2] OLAP – data cube operations. Dremel/Drill Data sets – not embarrassingly parallel? Machine vision from Google [3] Deep Learning Artificial Neural Networks Speech analysis from Microsoft Giant 5 – Graph processing – GraphLab, Pregel, Giraph 3 [1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013. [2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, S. 732-741 [3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012:
  • 4. Hadoop YARN Requirements or 1.0 shortcomings R1: Scalability R2: Multi-tenancy • single cluster limitation • Addressed by Hadoopon-Demand • Security, Quotas R3: Locality awareness R4: Shared cluster utilization • Shuffle of records • Hogging by users • Typed slots R5: Reliability/Availability • Job Tracker bugs R6: Iterative Machine Learning 4 Vinod Kumar Vavilapalli, Arun C Murthy , Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe , Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing, Oct 2013, ACM Press.
  • 5. Iterative ML Algorithms  What are iterative algorithms?  Those that need communication among the computing entities  Examples – neural networks, PageRank algorithms, network traffic analysis  Conjugate gradient descent  Commonly used to solve systems of linear equations  [CB09] tried implementing CG on dense matrices  DAXPY – Multiplies vector x by constant a and adds y.  DDOT – Dot product of 2 vectors  MatVec – Multiply matrix by vector, produce a vector.  1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to 10 of GBs of communication even for small matrices.  Other iterative algorithms – fast fourier transform, block tridiagonal [CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
  • 7. YARN Internals Application Master YARN RM Node Manager • Sends ResourceRequests to the YARN RM • Captures containers, resources per container, locality preferences. • Generates tokens and containers • Global view of cluster – monolithic scheduling. • Node health monitoring, advertise available resources through heartbeats to RM. 7
  • 9. BDAS: Spark Transformations/Actions Map(function f1) Filter(function f2) flatMap(function f3) Union(RDD r1) Sample(flag, p, seed) groupByKey(noTasks) Description Pass each element of the RDD through f1 in parallel and return the resulting RDD. Select elements of RDD that return true when passed through f2. Similar to Map, but f3 returns a sequence to facilitate mapping single input to multiple outputs. Returns result of union of the RDD r1 with the self. Returns a randomly sampled (with seed) p percentage of the RDD. Can only be invoked on key-value paired data – returns data grouped by value. No. of parallel tasks is given as an argument (default is 8). Aggregates result of applying f4 on elements with same key. No. of parallel tasks is the second argument. Joins RDD r2 with self – computes all possible pairs for given key. Joins RDD r3 with self and groups by key. reduceByKey(function f4, noTasks) Join(RDD r2, noTasks) groupWith(RDD r3, noTasks) sortByKey(flag) Sorts the self RDD in ascending or descending based on flag. Reduce(function f5) Aggregates result of applying function f5 on all elements of self RDD Collect() Return all elements of the RDD as an array. Count() Count no. of elements in RDD take(n) Get first n elements of RDD. First() Equivalent to take(1) saveAsTextFile(path) Persists RDD in a file in HDFS or other Hadoop supported file system at given path. saveAsSequenceFile(path Persist RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs ) that implement Hadoop writable interface or equivalent. foreach(function f6) Run f6 in parallel on elements of self Ankur [MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das,RDD. Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for inmemory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). 
USENIX Association, Berkeley, CA, USA, 2-2.
  • 10. BDAS: Use Cases Ooyala Conviva Uses Cassandra for video data personalization. Uses Hive for repeatedly running ad-hoc queries on video data. Pre-compute aggregates VS onthe-fly queries. Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive Moved to Spark for ML and computing views. ML for connection analysis and video streaming optimization. 10 Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark Yahoo Advertisement targeting: 30K nodes on Hadoop Yarn Hadoop – batch processing Spark – iterative processing Storm – on-the-fly processing Content recommendation – collaborative filtering
  • 12. PMML Primer
    • Predictive Model Markup Language, developed by the DMG (Data Mining Group).
    • PMML offers a standard to define a model, so that a model generated in tool A can be used directly in tool B.
    • XML representation of a model.
    • May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
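  • 13. Naïve Bayes Primer
    • A simple probabilistic classifier based on Bayes' theorem.
    • Given features X1, X2, …, Xn, predict a label Y by calculating the probability of every possible value of Y:
      $$P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)}$$
      Here $P(X_1, \ldots, X_n \mid Y)$ is the likelihood, $P(Y)$ is the prior, and $P(X_1, \ldots, X_n)$ is the normalization constant. The "naïve" assumption is that the features are conditionally independent given Y, so the likelihood factorizes as $\prod_i P(X_i \mid Y)$.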
  • 14. PMML Scoring for Naïve Bayes
    • Wrote a PMML-based scoring engine for the Naïve Bayes algorithm. In principle it can be used from any data processing framework by invoking its API.
    • Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark, and Samza frameworks.
    • Real-time predictions with the above APIs.
  • 15. Header
        • Version and timestamp
        • Model development environment information
    Data Dictionary
        • Variable types; missing, valid, and invalid values
    Data Munging/Transformation
        • Normalization, mapping, discretization
    Model
        • Model-specific attributes
        • Mining Schema: treatment for missing and outlier values
        • Targets: prior probability and default
        • Outputs: list of computed output fields; post-processing
        • Definition of model architecture/parameters
  • 16. PMML Scoring for Naïve Bayes
    <DataDictionary numberOfFields="4">
      <DataField name="Class" optype="categorical" dataType="string">
        <Value value="democrat"/>
        <Value value="republican"/>
      </DataField>
      <DataField name="V1" optype="categorical" dataType="string">
        <Value value="n"/>
        <Value value="y"/>
      </DataField>
      <DataField name="V2" optype="categorical" dataType="string">
        <Value value="n"/>
        <Value value="y"/>
      </DataField>
      <DataField name="V3" optype="categorical" dataType="string">
        <Value value="n"/>
        <Value value="y"/>
      </DataField>
    </DataDictionary>
    (ctd on the next slide)
  • 17. PMML Scoring for Naïve Bayes
    <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
      <MiningSchema>
        <MiningField name="Class" usageType="predicted"/>
        <MiningField name="V1" usageType="active"/>
        <MiningField name="V2" usageType="active"/>
        <MiningField name="V3" usageType="active"/>
      </MiningSchema>
      <Output>
        <OutputField name="Predicted_Class" feature="predictedValue"/>
        <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
        <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
      </Output>
      <BayesInputs>
    (ctd on the next page)
  • 18. PMML Scoring for Naïve Bayes
      <BayesInputs>
        <BayesInput fieldName="V1">
          <PairCounts value="n">
            <TargetValueCounts>
              <TargetValueCount value="democrat" count="51"/>
              <TargetValueCount value="republican" count="85"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="y">
            <TargetValueCounts>
              <TargetValueCount value="democrat" count="73"/>
              <TargetValueCount value="republican" count="23"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="V2"> *
        <BayesInput fieldName="V3"> *
      </BayesInputs>
      <BayesOutput fieldName="Class">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="124"/>
          <TargetValueCount value="republican" count="108"/>
        </TargetValueCounts>
      </BayesOutput>
  • 19. PMML Scoring for Naïve Bayes
    Definition of elements:
    • DataDictionary: Definitions for the fields used in mining models (Class, V1, V2, V3).
    • NaiveBayesModel: Indicates that this is a Naïve Bayes PMML model.
    • MiningSchema: Lists the fields used in that model. Class is the "predicted" field; V1, V2, V3 are the "active" predictor fields.
    • Output: Describes a set of result values that can be returned from a model.
  • 20. PMML Scoring for Naïve Bayes
    Definition of elements (ctd.):
    • BayesInputs: For each input field and each of its values, contains the counts of each target value.
    • BayesOutput: Contains the counts associated with the values of the target field.
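To make the role of the counts concrete, here is a minimal scoring sketch in plain Scala. It is not the scoring engine described in the slides: the object `NaiveBayesSketch` and its `score` function are hypothetical, only the V1 counts and the class counts shown in the PMML fragment are used (V2 and V3 are elided in the source), and the handling of zero counts via `threshold` follows my reading of the PMML `threshold` attribute, which should be checked against the specification.

```scala
// Illustrative sketch only; names and structure are assumptions.
object NaiveBayesSketch {
  // Counts copied from the BayesOutput element (target field "Class").
  val classCounts: Map[String, Double] =
    Map("democrat" -> 124.0, "republican" -> 108.0)

  // Counts copied from the PairCounts of BayesInput "V1".
  val v1Counts: Map[String, Map[String, Double]] = Map(
    "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
    "y" -> Map("democrat" -> 73.0, "republican" -> 23.0))

  // inputs: one (pairCounts, observedValue) entry per active predictor field.
  // Returns the normalized posterior probability of each class label.
  def score(inputs: Seq[(Map[String, Map[String, Double]], String)],
            threshold: Double = 0.003): Map[String, Double] = {
    val total = classCounts.values.sum
    val unnormalized = classCounts.map { case (label, classCount) =>
      val prior = classCount / total
      val likelihood = inputs.map { case (pairCounts, value) =>
        val count = pairCounts.getOrElse(value, Map.empty[String, Double])
          .getOrElse(label, 0.0)
        // Assumption: a zero count is replaced by the threshold probability.
        if (count == 0.0) threshold else count / classCount
      }.product
      label -> prior * likelihood
    }
    val norm = unnormalized.values.sum
    unnormalized.map { case (label, p) => label -> p / norm }
  }
}
```

With V1 = "n" as the only input, the unnormalized scores are 51/232 and 85/232, so `score(Seq((NaiveBayesSketch.v1Counts, "n")))` gives 0.375 for democrat and 0.625 for republican.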
  • 21. PMML Scoring for Naïve Bayes
    Sample input:
    Eg1 - n y y n y y n n n n n n y y y y
    Eg2 - n y n y y y n n n n n y y y n y
    • 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField).
    • Using these, we predict whether the output is Democrat or Republican (PMML element BayesOutput).
  • 22. PMML Scoring for Naïve Bayes
    • Storm cluster of 3 Xeon nodes (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)
    Number of records (in millions)    Time taken (seconds)
    0.1                                4
    0.4                                7
    1.0                                12
    2.0                                21
    10                                 129
    25                                 310
  • 23. PMML Scoring for Naïve Bayes
    • Spark cluster of 3 Xeon nodes (8 quad-core CPUs, 32 GB RAM, and 32 GB swap space)
    Number of records (in millions)    Time taken
    0.1                                1 min 47 sec
    0.2                                3 min 35 sec
    0.4                                6 min 40 sec
    1.0                                35 min 17 sec
    10                                 More than 3 hrs
  • 24. Future of Spark
    • Domain specific language approach from Stanford.
        • Forge [AKS13] – a meta DSL for high performance DSLs.
        • 40X faster than Spark!
    • Spark
        • Explore BLAS libraries for efficiency.
    [AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.
  • 25. Conclusion
    • Beyond Hadoop Map-Reduce philosophy
        • Optimization and other problems.
        • Real-time computation
        • Processing specialized data structures
    • PMML scoring
        • Allows traditional analytical tools/algorithms to be re-used.
        • Spark for batch computations
        • Spark streaming and Storm for real-time.
  • 26. Thank You!
    • Mail: vijay.sa@impetus.co.in
    • LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
    • Blogs: blogs.impetus.com
    • Twitter: @a_vijaysrinivas
  • 28. GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
    • Goals – targeted at machine learning.
        • Model graph dependencies, be asynchronous, iterative, dynamic.
    • Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
    • Update functions – live on each vertex.
        • Transform data in the scope of the vertex.
        • Can choose to trigger neighbours (for example, only if Rank changes drastically).
        • Run asynchronously till convergence – no global barrier.
    • Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates – collaborative filtering).
        • GraphLab provides varying levels of consistency. Parallelism VS consistency.
    • Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
        • Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
    [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
  • 29. GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
    • GraphLab could not scale to the Altavista web graph 2002 (1.4B vertices, 6.7B edges).
        • Most graph parallel abstractions assume small neighbourhoods – low degree vertices.
        • But natural graphs (LinkedIn, Facebook, Twitter) are power law graphs.
        • Hard to partition power law graphs; high degree vertices limit parallelism.
    • PowerGraph provides a new way of partitioning power law graphs.
        • Edges are tied to machines; vertices (esp. high degree ones) span machines.
        • Execution is split into 3 phases: gather, apply, and scatter.
    • Triangle counting on the Twitter graph
        • Hadoop MR took 423 minutes on 1536 machines.
        • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
    [1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
  • 30. BDAS: Discretized Streams
    • Treats streams as a series of small time interval batch computations.
        • Event based APIs – stream handling.
        • How to make interval granularity very low (milliseconds)?
    • Built over Spark RDDs – in-memory distributed cache.
    • Fault-tolerance is based on RDD lineage (a series of transformations that can be stored and recomputed on failure).
        • Parallel recovery – re-computations happen in parallel across the cluster.
    pageViews = readStream("http://...", "1s")
    ones = pageViews.map(event => (event.url, 1))
    counts = ones.runningReduce((a, b) => a + b)
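The idea of treating a stream as a sequence of small batches with a running aggregate can be illustrated without any streaming runtime. The following is a minimal sketch in plain Scala under that assumption: each inner sequence represents the page-view URLs that arrived in one interval, and `DStreamSketch.runningCounts` is a hypothetical helper name, not a D-Streams or Spark Streaming API.

```scala
// Illustrative sketch only: a "stream" is modeled as a finite Seq of batches.
object DStreamSketch {
  // For every interval, emit the cumulative count per URL seen so far.
  // The state after interval t depends only on the state after t-1 and the
  // batch of interval t, which is what makes lineage-based recovery possible.
  def runningCounts(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int]) { (state, batch) =>
      batch.foldLeft(state) { (acc, url) =>
        acc.updated(url, acc.getOrElse(url, 0) + 1)
      }
    }.tail // drop the initial empty state
}
```

Each output element corresponds to one interval, so a lost state can be rebuilt by replaying the batches from the last available state.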
  • 31. BDAS: D-Streams Streaming Operators
    words = sentences.flatMap(s => s.split(" "))
    pairs = words.map(w => (w, 1))
    counts = pairs.reduceByKey((a, b) => a + b)
    • Windowing
        • pairs.window("5s").reduceByKey(_+_)
    • Incremental aggregation
        • pairs.reduceByWindow("5s", (a, b) => a + b)
    • Time skewed joins
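The difference between recomputing a window from scratch and aggregating it incrementally can be sketched in plain Scala. This is an assumption-laden illustration, not the D-Streams implementation: windows are measured in number of batches rather than seconds, the incremental version assumes an invertible aggregation (subtraction for counts), and `WindowSketch` with all its members is a hypothetical name.

```scala
// Illustrative sketch only; window size is a number of batches, not a duration.
object WindowSketch {
  type Counts = Map[String, Int]

  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (w, ws) => w -> ws.size }

  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  // Inverse of merge; keys whose count drops to zero are removed.
  def subtract(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) =>
      val left = acc.getOrElse(k, 0) - v
      if (left > 0) acc.updated(k, left) else acc - k
    }

  // Windowing: every window is reduced from its own batches.
  def windowed(batches: Seq[Seq[String]], size: Int): List[Counts] = {
    require(size > 0 && batches.length >= size)
    batches.map(countBatch).sliding(size).map(_.reduce(merge)).toList
  }

  // Incremental aggregation: previous window + entering batch - leaving batch.
  def incremental(batches: Seq[Seq[String]], size: Int): List[Counts] = {
    require(size > 0 && batches.length >= size)
    val counted = batches.map(countBatch).toVector
    val first = counted.take(size).reduce(merge)
    (size until counted.length).scanLeft(first) { (window, i) =>
      subtract(merge(window, counted(i)), counted(i - size))
    }.toList
  }
}
```

Both functions return the same windows; the incremental one touches only two batches per step, which matters when the window spans many intervals.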
  • 32. Representation of an RDD
    • Set of partitions
        HadoopRDD: 1 per HDFS block
        FilteredRDD: Same as parent
        JoinedRDD: 1 per reduce task
    • Set of dependencies
        HadoopRDD: None
        FilteredRDD: 1-to-1 on parent
        JoinedRDD: Shuffle on each parent
    • Function to compute data set based on parents
        HadoopRDD: Read corresponding block
        FilteredRDD: Compute parent and filter it
        JoinedRDD: Read and join shuffled data
    • Meta-data on location (preferredLocations)
        HadoopRDD: HDFS block location from namenode
        FilteredRDD: None (parent)
        JoinedRDD: None
    • Meta-data on partitioning (partitioningScheme)
        HadoopRDD: None
        FilteredRDD: None
        JoinedRDD: HashPartitioner
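The five pieces of information in this representation map naturally onto a small interface. The sketch below is a simplified, hypothetical model in plain Scala, not Spark's actual `RDD` class: `MiniRdd`, `BlockRdd`, `FilteredRdd`, and the `Dependency` cases are illustrative names, and the dependency objects do not reference their parents as a real implementation would.

```scala
// Simplified model of the RDD representation; all names are assumptions.
object RddSketch {
  final case class Partition(index: Int)

  sealed trait Dependency
  case object OneToOne extends Dependency
  case object Shuffle extends Dependency

  trait MiniRdd[T] {
    def partitions: Seq[Partition]                           // set of partitions
    def dependencies: Seq[Dependency]                        // set of dependencies
    def compute(p: Partition): Iterator[T]                   // compute from parents
    def preferredLocations(p: Partition): Seq[String] = Nil  // location meta-data
    def partitioner: Option[String] = None                   // partitioning meta-data
  }

  // Source dataset: one partition per block, located on a known host.
  final class BlockRdd[T](blocks: Vector[Seq[T]], hosts: Vector[String])
      extends MiniRdd[T] {
    val partitions: Seq[Partition] = blocks.indices.map(i => Partition(i))
    val dependencies: Seq[Dependency] = Seq.empty
    def compute(p: Partition): Iterator[T] = blocks(p.index).iterator
    override def preferredLocations(p: Partition): Seq[String] = Seq(hosts(p.index))
  }

  // Filtered dataset: same partitions as the parent, 1-to-1 dependency.
  final class FilteredRdd[T](parent: MiniRdd[T], pred: T => Boolean)
      extends MiniRdd[T] {
    def partitions: Seq[Partition] = parent.partitions
    def dependencies: Seq[Dependency] = Seq(OneToOne)
    def compute(p: Partition): Iterator[T] = parent.compute(p).filter(pred)
  }

  // Action analog: materialize every partition in order.
  def collect[T](rdd: MiniRdd[T]): List[T] =
    rdd.partitions.toList.flatMap(p => rdd.compute(p))
}
```

Because `FilteredRdd.compute` only calls its parent's `compute` for the same partition, a lost partition can be rebuilt in isolation, which is the property lineage-based fault tolerance relies on.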
  • 33. Logistic Regression: Spark VS Hadoop http://spark-project.org
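  • 34. Some Spark(ling) examples
    Scala code (serial)
    var count = 0
    for (i <- 1 to 100000) {
      // Sample a random point in the square [-1, 1] x [-1, 1].
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      // Count the point if it falls inside the unit circle.
      if (x*x + y*y < 1) count += 1
    }
    println("Pi is roughly " + 4 * count / 100000.0)
    Sample random points in the square that encloses the unit circle and count how many fall inside the circle (a fraction of roughly PI/4). This yields an approximate value for PI. It is based on PS/PC = AS/AC = 4/PI, where PS and PC are the numbers of points in the square and in the circle and AS and AC are their areas, so PI = 4 * (PC/PS).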
  • 35. Some Spark(ling) examples
    Spark code (parallel)
    val spark = new SparkContext(<Mesos master>)
    // Shared variable that tasks can only add to.
    var count = spark.accumulator(0)
    for (i <- spark.parallelize(1 to 100000, 12)) {
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x*x + y*y < 1) count += 1
    }
    // Read the accumulated total on the driver.
    println("Pi is roughly " + 4 * count.value / 100000.0)
    Notable points:
    1. Spark context created – talks to the Mesos [1] master.
    2. Count becomes a shared variable – an accumulator.
    3. The for loop runs over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.
    4. The for loop invokes the foreach method of the RDD returned by parallelize.
    [1] Mesos is an Apache incubated clustering system – http://mesosproject.org
  • 36. Logistic Regression in Spark: Serial Code
    // Read data file and convert it into Point objects
    val lines = scala.io.Source.fromFile("data.txt").getLines()
    // Materialize the points: getLines() returns a one-shot iterator,
    // but the data is traversed once per iteration below.
    val points = lines.map(x => parsePoint(x)).toList
    // Run logistic regression
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = Vector.zeros(D)
      for (p <- points) {
        val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
        gradient += scale * p.x
      }
      w -= gradient
    }
    println("Result: " + w)
  • 37. Logistic Regression in Spark
    // Read data file and transform it into Point objects
    val spark = new SparkContext(<Mesos master>)
    val lines = spark.hdfsTextFile("hdfs://.../data.txt")
    // cache() keeps the parsed points in memory across iterations
    val points = lines.map(x => parsePoint(x)).cache()
    // Run logistic regression
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = spark.accumulator(Vector.zeros(D))
      for (p <- points) {
        val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
        gradient += scale * p.x
      }
      w -= gradient.value
    }
    println("Result: " + w)
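For readers without a cluster, the same update rule can be tried in plain Scala. This is a minimal sketch under several assumptions: labels are -1.0 or +1.0, weights start at zero instead of `Vector.random(D)` so the run is deterministic, arrays replace the Spark `Vector` type, and `LogRegSketch`, `Point`, `gradient`, and `train` are hypothetical names. The per-point map followed by a reduce mirrors the part that Spark would distribute.

```scala
// Illustrative sketch only; not the Spark example's Vector/accumulator API.
object LogRegSketch {
  final case class Point(x: Array[Double], y: Double) // y is -1.0 or +1.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Map step: per-point gradient contribution; reduce step: element-wise sum.
  def gradient(points: Seq[Point], w: Array[Double]): Array[Double] =
    points.map { p =>
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

  // Full-batch gradient descent with a step size of 1, as in the slide code.
  def train(points: Seq[Point], iterations: Int, dims: Int): Array[Double] = {
    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to iterations) {
      val g = gradient(points, w)
      w = w.zip(g).map { case (wi, gi) => wi - gi }
    }
    w
  }
}
```

On a small separable data set such as points at 2, 1, -1, -2 with labels +1, +1, -1, -1, the learned weight is positive and every point ends up on the correct side of the decision boundary.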