Resilient Distributed Datasets
A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing
frameworks handle inefficiently:
1. Iterative algorithms:
– iterative machine learning
– graph algorithms
2. Interactive data mining:
– ad-hoc queries
• In MapReduce, the only way to share data across jobs is stable storage:
slow!
Examples
Slow due to replication and disk I/O, but
necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient
Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic
transformations (map, filter, join, …)
• Efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions on failure
– No cost if nothing fails
Solution: Resilient
Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory
for efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications
• Control of each RDD’s partitioning (layout
across nodes) and persistence (storage in
RAM, on disk, etc.)
RDD Operations
Transformations
(define a new RDD)
map
filter
sample
groupByKey
reduceByKey
sortByKey
flatMap
union
join
cogroup
cross
mapValues
Actions
(return a result to
driver program)
collect
reduce
count
save
lookup
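A minimal sketch of how these compose in Spark's Scala API (a hedged example: sc is assumed to be a SparkContext, and the data is illustrative):

val nums  = sc.parallelize(1 to 100)         // base RDD from a local collection
val evens = nums.filter(_ % 2 == 0)          // transformation: lazily defines a new RDD
val pairs = evens.map(n => (n % 10, n))      // transformation: key each number by its last digit
val sums  = pairs.reduceByKey(_ + _)         // transformation: still nothing has executed
println(sums.count())                        // action: triggers evaluation, returns to the driver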
Example: Log Mining
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the driver ships tasks to three workers and collects results; each
worker reads one HDFS block (Block 1-3) and keeps its partition of cachedMsgs
in memory (Cache 1-3). Labels mark the base RDD (lines), the transformed RDD
(cachedMsgs), and the count action.]
Load error messages from a log into memory, then interactively search for
various patterns.
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that
built them (their lineage) and use it to rebuild lost data
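Spark makes this lineage inspectable; as a quick illustration (reusing cachedMsgs from the log-mining example; toDebugString is part of the RDD API):

println(cachedMsgs.toDebugString)  // prints the recorded chain of transformations (the lineage)
// If a cached partition is lost, Spark replays filter and map over just the
// missing input block to rebuild it; nothing is recomputed if nothing fails.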
Example: PageRank
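The original slide showed PageRank's data flow as a figure. A minimal sketch of the algorithm in Spark's Scala API, assuming links is an already-loaded RDD[(String, Seq[String])] of (url, outgoing neighbors) pairs:

var ranks = links.mapValues(_ => 1.0)                       // start every page at rank 1.0
for (_ <- 1 to 10) {                                        // 10 iterations for brevity
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))  // spread rank over out-links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}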
Optimizing Placement
links and ranks are repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
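URLPartitioner is application code, not a Spark built-in; a hypothetical sketch that hashes on host name (the "hash on DNS name" idea above), so all pages of one site land in the same partition:

import org.apache.spark.Partitioner
import java.net.URL

class URLPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val host = new URL(key.toString).getHost                // partition by site, not full URL
    java.lang.Math.floorMod(host.hashCode, numPartitions)   // non-negative bucket index
  }
  // In real code, also override equals/hashCode: Spark compares partitioners
  // to decide whether two RDDs are co-partitioned and a shuffle can be skipped.
}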
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the
dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its
parents
• metadata about its partitioning scheme and data placement
Representing RDDs
Operation                 | Meaning
partitions()              | Return a list of Partition objects
preferredLocations(p)     | List nodes where partition p can be accessed faster due to data locality
dependencies()            | Return a list of dependencies
iterator(p, parentIters)  | Compute the elements of partition p given iterators for its parent partitions
partitioner()             | Return metadata specifying whether the RDD is hash/range partitioned
Interface used to represent RDDs in Spark
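A simplified sketch of this interface as a Scala trait (the real org.apache.spark.rdd.RDD class carries more machinery; the placeholder types below stand in for Spark's own):

trait Partition { def index: Int }  // placeholder: one atomic piece of a dataset
trait Dependency                    // placeholder: a link to a parent RDD
trait Partitioner                   // placeholder: a hash/range partitioning scheme

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                     // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]  // nodes where p is local
  def dependencies: Seq[Dependency]                  // how partitions depend on parents
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p
  def partitioner: Option[Partitioner]               // None if not hash/range partitioned
}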
Dependencies
• Narrow dependencies
--- each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
--- multiple child partitions may depend on a single parent partition
• For example (see the sketch below):
--- map leads to a narrow dependency,
--- while join leads to wide dependencies (unless the parents are
hash-partitioned)
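A small illustration, assuming pairs1 and pairs2 are key-value RDDs with matching key types:

val mapped  = pairs1.mapValues(_.toString)  // narrow: each child partition reads one parent partition
val grouped = pairs1.groupByKey()           // wide: a child partition may read every parent partition
val joined  = pairs1.join(pairs2)           // wide in general, but narrow if both inputs
                                            // were hash-partitioned with the same partitioner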
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with
partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
--- allow pipelined execution on one cluster node, which can compute all the
parent partitions
--- recovery after a node failure is more efficient, as only the lost parent partitions
need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
--- require data from all parent partitions to be available and to be shuffled across
the nodes using a MapReduce-like operation
--- in a lineage graph, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent
RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler
examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow
dependencies as possible
• Stage boundaries are:
--- the shuffle operations required for wide dependencies
--- any already-computed partitions (which short-circuit the computation of a
parent RDD)
• The scheduler then launches tasks to compute missing partitions from
each stage until it has computed the target RDD (see the sketch below)
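For instance, in a word count (a sketch; the input path is hypothetical), the narrow flatMap and map pipeline into a single stage, while the shuffle behind reduceByKey starts a second one:

val words  = sc.textFile("hdfs://...")                     // hypothetical input
val stage1 = words.flatMap(_.split(" ")).map(w => (w, 1))  // narrow deps: pipelined in one stage
val counts = stage1.reduceByKey(_ + _)                     // wide dep: shuffle starts a new stage
counts.collect()                                           // action: scheduler builds and runs the 2-stage DAG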
Job Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Locality & data reuse aware
• Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality
using delay scheduling
--- if a task needs to process a partition that is available in
memory on a node, it is sent to that node
--- otherwise, if a task processes a partition for which the
containing RDD provides preferred locations (e.g., an HDFS
file), it is sent to those nodes
Memory Management
• in-memory storage as deserialized Java objects
---The first option provides the fastest performance, because the Java
VM can access each RDD element natively
• in-memory storage as serialized data
---The second option lets users choose a more memory-efficient
representation than Java object graphs when space is limited, at the
cost of lower performance
• on-disk storage
---The third option is useful for RDDs that are too large to keep in RAM
but costly to recompute on each use.
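These three options correspond to Spark's storage levels, chosen per RDD via persist; a sketch of the alternatives (pick one; re-persisting an already-persisted RDD at a different level is an error):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)      // deserialized Java objects: fastest access
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes: more compact, slower to read
rdd.persist(StorageLevel.DISK_ONLY)        // on disk: for data too large for RAM but costly to recompute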
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same
operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous
fine-grained updates to shared state, such as a storage system for a web
application or an incremental web crawler
Programming Models
Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark,
3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
RDDs offer a simple and efficient programming model for a broad range of
applications (their immutable nature and coarse-grained transformations suit
a wide class of applications)
They leverage the coarse-grained nature of many parallel algorithms for
low-overhead recovery
They let users control each RDD’s partitioning (layout across nodes) and
persistence (storage in RAM, on disk, etc.)
Editor's Notes
  • #9: Key idea: add “variables” to the “functions” in functional programming
  • #18: Pipelined execution: for example, one can apply a map followed by a filter on an element-by-element basis
  • #20: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3.
  • #26: My own summary: 1. Simple, efficient, and applicable to a wide range of uses. 2. Lowers the cost of fault recovery for coarse-grained parallel algorithms. 3. The user decides which data is worth reusing and persisting, and under what storage strategy; the user can also control how data is partitioned to avoid shuffles and improve efficiency (e.g. co-partitioning; the shuffle is a relatively slow, time-consuming operation). 4. More general than typical models: most existing models are special-purpose systems designed because MapReduce performs poorly in some domain, such as Google's Pregel. By comparison, Pregel's data-sharing model is implicitly tailored to graph computation, while RDDs provide a more general data-sharing model (one that can express Pregel's computation model and also serve other application scenarios; it is more general and more flexible). Difference from Pregel: A third class of systems provide high-level interfaces for specific classes of applications requiring data sharing. For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory and then decide what query to run on it. RDDs provide a distributed storage abstraction explicitly and can thus support applications that these specialized systems do not capture, such as interactive data mining. Difference from MapReduce (summarized from the Shark paper): 1. Like Dryad and Tenzing [17, 9], it supports general computation DAGs, not just the two-stage MapReduce topology. 2. It provides an in-memory storage abstraction called Resilient Distributed Datasets (RDDs) that lets applications keep data in memory across queries, and automatically reconstructs it after failures [33]. 3. The engine is optimized for low latency. It can efficiently manage tasks as short as 100 milliseconds on clusters of thousands of cores, while engines like Hadoop incur a latency of 5-10 seconds to launch each task. Four key benefits of the RDD model (summarized from the Shark paper): The RDD model offers several key benefits in our large-scale in-memory computing setting. First, RDDs can be written at the speed of DRAM instead of the speed of the network, because there is no need to replicate each byte written to another machine for fault tolerance. DRAM in a modern server is over 10x faster than even a 10-Gigabit network. Second, Spark can keep just one copy of each RDD partition in memory, saving precious memory over a replicated system, since it can always recover lost data using lineage. Third, when a node fails, its lost RDD partitions can be rebuilt in parallel across the other nodes, allowing speedy recovery. Fourth, even if a node is just slow (a “straggler”), we can recompute necessary partitions on other nodes because RDDs are immutable so there are no consistency concerns with having two copies of a partition.