Resilient Distributed Datasets
A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing
frameworks handle inefficiently:
1. Iterative algorithms:
– iterative machine learning
– graph algorithms
2. Interactive data mining:
– ad-hoc queries
• In MapReduce, the only way to share data across jobs is stable storage:
slow!
Examples
Slow due to replication and disk I/O, but
necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient
Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic
transformations (map, filter, join, …)
• Efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions on failure
– No cost if nothing fails
Solution: Resilient
Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory
for efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications
• Control of each RDD’s partitioning (layout
across nodes) and persistence (storage in
RAM, on disk, etc.)
RDD Operations
Transformations
(define a new RDD)
map
filter
sample
groupByKey
reduceByKey
sortByKey
flatMap
union
join
cogroup
cross
mapValues
Actions
(return a result to
driver program)
collect
reduce
count
save
lookup
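A minimal sketch of how these compose in Spark's Scala API (a hedged example: sc is assumed to be a SparkContext, and the data is illustrative):

val nums  = sc.parallelize(1 to 100)         // base RDD from a local collection
val evens = nums.filter(_ % 2 == 0)          // transformation: lazily defines a new RDD
val pairs = evens.map(n => (n % 10, n))      // transformation: key each number by its last digit
val sums  = pairs.reduceByKey(_ + _)         // transformation: still nothing has executed
println(sums.count())                        // action: triggers evaluation, returns to the driver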
Example: Log Mining
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the driver ships tasks to three workers and collects results; each
worker reads one HDFS block (Block 1-3) and keeps its partition of cachedMsgs
in memory (Cache 1-3). Labels mark the base RDD (lines), the transformed RDD
(cachedMsgs), and the count action.]
Load error messages from a log into memory, then interactively search for
various patterns.
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that
built them (their lineage) and use it to rebuild lost data
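Spark makes this lineage inspectable; as a quick illustration (reusing cachedMsgs from the log-mining example; toDebugString is part of the RDD API):

println(cachedMsgs.toDebugString)  // prints the recorded chain of transformations (the lineage)
// If a cached partition is lost, Spark replays filter and map over just the
// missing input block to rebuild it; nothing is recomputed if nothing fails.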
Example: PageRank
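The original slide showed PageRank's data flow as a figure. A minimal sketch of the algorithm in Spark's Scala API, assuming links is an already-loaded RDD[(String, Seq[String])] of (url, outgoing neighbors) pairs:

var ranks = links.mapValues(_ => 1.0)                       // start every page at rank 1.0
for (_ <- 1 to 10) {                                        // 10 iterations for brevity
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))  // spread rank over out-links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}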
Optimizing Placement
links and ranks are repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
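URLPartitioner is application code, not a Spark built-in; a hypothetical sketch that hashes on host name (the "hash on DNS name" idea above), so all pages of one site land in the same partition:

import org.apache.spark.Partitioner
import java.net.URL

class URLPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val host = new URL(key.toString).getHost                // partition by site, not full URL
    java.lang.Math.floorMod(host.hashCode, numPartitions)   // non-negative bucket index
  }
  // In real code, also override equals/hashCode: Spark compares partitioners
  // to decide whether two RDDs are co-partitioned and a shuffle can be skipped.
}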
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the
dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its
parents
• metadata about its partitioning scheme and data placement
Representing RDDs
Operation                 | Meaning
partitions()              | Return a list of Partition objects
preferredLocations(p)     | List nodes where partition p can be accessed faster due to data locality
dependencies()            | Return a list of dependencies
iterator(p, parentIters)  | Compute the elements of partition p given iterators for its parent partitions
partitioner()             | Return metadata specifying whether the RDD is hash/range partitioned
Interface used to represent RDDs in Spark
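A simplified sketch of this interface as a Scala trait (the real org.apache.spark.rdd.RDD class carries more machinery; the placeholder types below stand in for Spark's own):

trait Partition { def index: Int }  // placeholder: one atomic piece of a dataset
trait Dependency                    // placeholder: a link to a parent RDD
trait Partitioner                   // placeholder: a hash/range partitioning scheme

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                     // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]  // nodes where p is local
  def dependencies: Seq[Dependency]                  // how partitions depend on parents
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p
  def partitioner: Option[Partitioner]               // None if not hash/range partitioned
}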
Dependencies
• Narrow dependencies
--- each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
--- multiple child partitions may depend on a single parent partition
• For example (see the sketch below):
--- map leads to a narrow dependency,
--- while join leads to wide dependencies (unless the parents are
hash-partitioned)
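A small illustration, assuming pairs1 and pairs2 are key-value RDDs with matching key types:

val mapped  = pairs1.mapValues(_.toString)  // narrow: each child partition reads one parent partition
val grouped = pairs1.groupByKey()           // wide: a child partition may read every parent partition
val joined  = pairs1.join(pairs2)           // wide in general, but narrow if both inputs
                                            // were hash-partitioned with the same partitioner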
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with
partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
--- allow pipelined execution on one cluster node, which can compute all the
parent partitions
--- recovery after a node failure is more efficient, as only the lost parent partitions
need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
--- require data from all parent partitions to be available and to be shuffled across
the nodes using a MapReduce-like operation
--- in a lineage graph, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent
RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler
examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow
dependencies as possible
• Stage boundaries are:
--- the shuffle operations required for wide dependencies
--- any already-computed partitions (which short-circuit the computation of a
parent RDD)
• The scheduler then launches tasks to compute missing partitions from
each stage until it has computed the target RDD (see the sketch below)
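For instance, in a word count (a sketch; the input path is hypothetical), the narrow flatMap and map pipeline into a single stage, while the shuffle behind reduceByKey starts a second one:

val words  = sc.textFile("hdfs://...")                     // hypothetical input
val stage1 = words.flatMap(_.split(" ")).map(w => (w, 1))  // narrow deps: pipelined in one stage
val counts = stage1.reduceByKey(_ + _)                     // wide dep: shuffle starts a new stage
counts.collect()                                           // action: scheduler builds and runs the 2-stage DAG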
Job Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Locality & data reuse aware
• Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality
using delay scheduling
--- if a task needs to process a partition that is available in
memory on a node, it is sent to that node
--- otherwise, if a task processes a partition for which the
containing RDD provides preferred locations (e.g., an HDFS
file), it is sent to those nodes
Memory Management
• in-memory storage as deserialized Java objects
---The first option provides the fastest performance, because the Java
VM can access each RDD element natively
• in-memory storage as serialized data
---The second option lets users choose a more memory-efficient
representation than Java object graphs when space is limited, at the
cost of lower performance
• on-disk storage
---The third option is useful for RDDs that are too large to keep in RAM
but costly to recompute on each use.
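These three options correspond to Spark's storage levels, chosen per RDD via persist; a sketch of the alternatives (pick one; re-persisting an already-persisted RDD at a different level is an error):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)      // deserialized Java objects: fastest access
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes: more compact, slower to read
rdd.persist(StorageLevel.DISK_ONLY)        // on disk: for data too large for RAM but costly to recompute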
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same
operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous
fine-grained updates to shared state, such as a storage system for a web
application or an incremental web crawler
Programming Models
Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark,
3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
RDDs offer a simple and efficient programming model for a broad range of
applications (their immutable nature and coarse-grained transformations suit
a wide class of applications)
They leverage the coarse-grained nature of many parallel algorithms for
low-overhead recovery
They let users control each RDD’s partitioning (layout across nodes) and
persistence (storage in RAM, on disk, etc.)
Editor's Notes
  • #9: Key idea: add “variables” to the “functions” in functional programming
  • #18: Pipelined execution: for example, one can apply a map followed by a filter on an element-by-element basis
  • #20: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3.
  • #26: My own summary: 1. Simple, efficient, and applicable to a wide range of uses. 2. Lowers the cost of fault recovery for coarse-grained parallel algorithms. 3. The user decides which data is worth reusing and persisting, and under what storage strategy; the user can also control how data is partitioned to avoid shuffles and improve efficiency (e.g. co-partitioning; the shuffle is a relatively slow, time-consuming operation). 4. More general than typical models: most existing models are special-purpose systems designed because MapReduce performs poorly in some domain, such as Google's Pregel. By comparison, Pregel's data-sharing model is implicitly tailored to graph computation, while RDDs provide a more general data-sharing model (one that can express Pregel's computation model and also serve other application scenarios; it is more general and more flexible). Difference from Pregel: A third class of systems provide high-level interfaces for specific classes of applications requiring data sharing. For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory and then decide what query to run on it. RDDs provide a distributed storage abstraction explicitly and can thus support applications that these specialized systems do not capture, such as interactive data mining. Difference from MapReduce (summarized from the Shark paper): 1. Like Dryad and Tenzing [17, 9], it supports general computation DAGs, not just the two-stage MapReduce topology. 2. It provides an in-memory storage abstraction called Resilient Distributed Datasets (RDDs) that lets applications keep data in memory across queries, and automatically reconstructs it after failures [33]. 3. The engine is optimized for low latency. It can efficiently manage tasks as short as 100 milliseconds on clusters of thousands of cores, while engines like Hadoop incur a latency of 5-10 seconds to launch each task. Four key benefits of the RDD model (summarized from the Shark paper): The RDD model offers several key benefits in our large-scale in-memory computing setting. First, RDDs can be written at the speed of DRAM instead of the speed of the network, because there is no need to replicate each byte written to another machine for fault tolerance. DRAM in a modern server is over 10x faster than even a 10-Gigabit network. Second, Spark can keep just one copy of each RDD partition in memory, saving precious memory over a replicated system, since it can always recover lost data using lineage. Third, when a node fails, its lost RDD partitions can be rebuilt in parallel across the other nodes, allowing speedy recovery. Fourth, even if a node is just slow (a “straggler”), we can recompute necessary partitions on other nodes because RDDs are immutable so there are no consistency concerns with having two copies of a partition.