Spark and Resilient Distributed Datasets
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 1 / 49
Motivation
MapReduce greatly simplified big data analysis on large, unreliable
clusters.
But as soon as it got popular, users wanted more:
• Iterative jobs, e.g., machine learning algorithms
• Interactive analytics
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 2 / 49
Motivation
Both iterative and interactive queries need one thing that MapReduce lacks:
Efficient primitives for data sharing.
In MapReduce, the only way to share data across jobs is stable
storage, which is slow.
Replication also makes the system slow, but it is necessary for fault
tolerance.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
Proposed Solution
In-Memory Data Processing and Sharing.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 4 / 49
Challenge
How to design a distributed memory abstraction
that is both fault tolerant and efficient?
Solution
Resilient Distributed Datasets (RDD)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 7 / 49
Resilient Distributed Datasets (RDD) (1/2)
A distributed memory abstraction.
Immutable collections of objects spread across a cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 8 / 49
Resilient Distributed Datasets (RDD) (2/2)
An RDD is divided into a number of partitions, which are atomic
pieces of information.
Partitions of an RDD can be stored on different nodes of a cluster.
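A small sketch (with assumed data) showing that the number of partitions can be set explicitly when an RDD is created:
val nums = sc.parallelize(1 to 100, 4) // request 4 partitions
nums.partitions.length // 4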
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 9 / 49
Programming Model
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 10 / 49
Spark Programming Model (1/2)
The Spark programming model is based on parallelizable operators.
Parallelizable operators are higher-order functions that execute user-defined functions in parallel.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 11 / 49
Spark Programming Model (2/2)
A data flow is composed of any number of data sources, operators,
and data sinks by connecting their inputs and outputs.
Job description based on directed acyclic graphs (DAG).
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 12 / 49
Higher-Order Functions (1/3)
Higher-order functions: RDD operators.
There are two types of RDD operators: transformations and actions.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 13 / 49
Higher-Order Functions (2/3)
Transformations: lazy operators that create new RDDs.
Actions: launch a computation and return a value to the program or write data to external storage.
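As a sketch of this laziness (with assumed data), no work happens until an action is called:
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // transformation: nothing is computed yet
squares.count() // action: triggers the computation and returns 3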
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 14 / 49
Higher-Order Functions (3/3)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 15 / 49
RDD Transformations - Map
All pairs are independently processed.
// passing each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // {1, 4, 9}
// selecting those elements that func returns true.
val even = squares.filter(x => x % 2 == 0) // {4}
// mapping each element to zero or more others.
nums.flatMap(x => Range(0, x, 1)) // {0, 0, 1, 0, 1, 2}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 16 / 49
RDD Transformations - Reduce
Pairs with identical keys are grouped.
Groups are independently processed.
val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.reduceByKey((x, y) => x + y)
// {(cat, 3), (dog, 1)}
pets.groupByKey()
// {(cat, (1, 2)), (dog, (1))}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 17 / 49
RDD Transformations - Join
Performs an equi-join on the key.
Join candidates are independently processed.
val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
("about.html", "3.4.5.6"),
("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
("about.html", "About")))
visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 18 / 49
RDD Transformations - CoGroup
Groups each input on key.
Groups with identical keys are processed
together.
val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
("about.html", "3.4.5.6"),
("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
("about.html", "About")))
visits.cogroup(pageNames)
// ("index.html", (("1.2.3.4", "1.3.3.1"), ("Home")))
// ("about.html", (("3.4.5.6"), ("About")))
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 19 / 49
RDD Transformations - Union and Sample
Union: merges two RDDs and returns a single RDD using bag semantics, i.e., duplicates are not removed.
Sample: similar to mapping, except that the RDD stores a random
number generator seed for each partition to deterministically sample
parent records.
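A minimal sketch (with assumed data) of both operators; the exact elements drawn by sample depend on the seed:
val a = sc.parallelize(Array(1, 2, 3))
val b = sc.parallelize(Array(3, 4, 5))
a.union(b) // {1, 2, 3, 3, 4, 5}, duplicates kept
a.sample(false, 0.5, 42) // without replacement, fraction 0.5, seed 42; e.g., {1, 3}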
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 20 / 49
Basic RDD Actions (1/2)
Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
Return an array with the first n elements of the RDD.
nums.take(2) // Array(1, 2)
Return the number of elements in the RDD.
nums.count() // 3
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
Basic RDD Actions (2/2)
Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y)
or
nums.reduce(_ + _) // 6
Write the elements of the RDD as a text file.
nums.saveAsTextFile("hdfs://file.txt")
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 22 / 49
SparkContext
Main entry point to Spark functionality.
Available in shell as variable sc.
In standalone programs, you should make your own.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext(master, appName, [sparkHome], [jars])
local: run Spark locally with a single worker thread.
local[k]: run Spark locally with k worker threads.
spark://host:port: connect to a standalone Spark cluster.
mesos://host:port: connect to a Mesos cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 23 / 49
Creating RDDs
Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))
Load text file from local FS, HDFS, or S3.
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 24 / 49
Example (1/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
filter and map are transformations; reduce is an action.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 25 / 49
Example (2/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val count = file.filter(_.contains("SICS")).count()
filter is a transformation; count is an action.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 26 / 49
Example - Standalone Application (1/2)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val sc = new SparkContext("local", "SICS", "127.0.0.1",
List("target/scala-2.10/sics-count_2.10-1.0.jar"))
val file = sc.textFile("...").cache()
val count = file.filter(_.contains("SICS")).count()
}
}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 27 / 49
Example - Standalone Application (2/2)
sics.sbt:
name := "SICS Count"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 28 / 49
Shared Variables (1/2)
When Spark runs a function in parallel as a set of tasks on different
nodes, it ships a copy of each variable used in the function to each
task.
Sometimes, a variable needs to be shared across tasks, or between
tasks and the driver program.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 29 / 49
Shared Variables (2/2)
No updates to the variables are propagated back to the driver program.
General read-write shared variables across tasks would be inefficient.
• For example, to give every node a copy of a large input dataset.
Two types of shared variables: broadcast variables and accumulators.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 30 / 49
Shared Variables: Broadcast Variables
A read-only variable cached on each machine rather than shipping
a copy of it with tasks.
The broadcast values are not shipped to the nodes more than once.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
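As a sketch (with assumed lookup data), a broadcast value is typically read inside a transformation via .value:
val countryNames = sc.broadcast(Map("SE" -> "Sweden", "IR" -> "Iran")) // small table shipped once per node
val codes = sc.parallelize(Seq("SE", "IR", "SE"))
codes.map(c => countryNames.value(c)).collect() // Array(Sweden, Iran, Sweden)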
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 31 / 49
Shared Variables: Accumulators
Accumulators are variables that are only added to.
They can be used to implement counters or sums.
Tasks running on the cluster can add to them using the += operator.
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
scala> accum.value
res2: Int = 10
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 32 / 49
Execution Engine
(SPARK)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 33 / 49
Spark
Spark provides a programming interface in Scala.
Each RDD is represented as an object in Spark.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 34 / 49
Spark Programming Interface
A Spark application consists of a driver program that runs the user’s
main function and executes various parallel operations on a cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 35 / 49
Lineage
Lineage: transformations used to build
an RDD.
RDDs are stored as a chain of objects
capturing the lineage of each RDD.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
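As a sketch, the lineage chain of an RDD can be inspected with toDebugString (the exact output format varies across Spark versions):
println(ones.toDebugString)
// e.g., MappedRDD <- FilteredRDD <- MappedRDD <- HadoopRDD (hdfs://...)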
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 36 / 49
RDD Dependencies (1/3)
Two types of dependencies between RDDs: Narrow and Wide.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 37 / 49
RDD Dependencies: Narrow (2/3)
Narrow: each partition of a parent RDD is used by at most one
partition of the child RDD.
Narrow dependencies allow pipelined execution on one cluster node, e.g., a map followed by a filter.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 38 / 49
RDD Dependencies: Wide (3/3)
Wide: each partition of a parent RDD is used by multiple partitions
of the child RDDs.
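A small sketch (with assumed data) contrasting the two kinds of dependencies:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped = pairs.mapValues(_ * 10) // narrow: partition-local, can be pipelined
val summed = mapped.reduceByKey(_ + _) // wide: requires a shuffle across partitions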
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 39 / 49
Job Scheduling (1/2)
When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph.
A stage contains as many pipelined transformations with narrow dependencies as possible.
The boundary of a stage:
• Shuffles for wide dependencies.
• Already computed partitions.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 40 / 49
Job Scheduling (2/2)
The scheduler launches tasks to compute
missing partitions from each stage until
it computes the target RDD.
Tasks are assigned to machines based on
data locality.
• If a task needs a partition that is available in the memory of a node, the task is sent to that node.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 41 / 49
RDD Fault Tolerance (1/3)
RDDs maintain lineage information that can be used to reconstruct
lost partitions.
Logging lineage rather than the actual data.
No replication.
Recompute only the lost partitions of an RDD.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 42 / 49
RDD Fault Tolerance (2/3)
The intermediate records of wide dependencies are materialized on the nodes holding the parent partitions, to simplify fault recovery.
If a task fails, it will be re-run on another node, as long as its stage's parents are available.
If some stages become unavailable, tasks are submitted to compute the missing partitions in parallel.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 43 / 49
RDD Fault Tolerance (3/3)
Recovery may be time-consuming for RDDs with long lineage chains
and wide dependencies.
It can be helpful to checkpoint some RDDs to stable storage.
The decision about which data to checkpoint is left to users.
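A minimal sketch (with assumed HDFS paths) of checkpointing from the shell:
sc.setCheckpointDir("hdfs://namenode:9000/checkpoints") // assumed directory
val sics = sc.textFile("hdfs://...").filter(_.contains("SICS"))
sics.checkpoint() // saved to the checkpoint dir when first materialized
sics.count() // the first action triggers computation and the checkpoint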
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 44 / 49
Memory Management (1/2)
If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted.
Spark provides three options for storage of persistent RDDs:
1 In memory storage as deserialized Java objects.
2 In memory storage as serialized Java objects.
3 On disk storage.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 45 / 49
Memory Management (2/2)
When an RDD is persisted, each node stores any partitions of the
RDD that it computes in memory.
This allows future actions to be much faster.
Persisting an RDD using persist() or cache() methods.
Different storage levels:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
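A small sketch (with an assumed input path) of choosing a storage level explicitly:
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("hdfs://...")
lines.persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if needed
lines.count() // the first action materializes the persisted partitions
// cache() is equivalent to persist(StorageLevel.MEMORY_ONLY)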
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 46 / 49
RDD Applications
Applications suitable for RDDs
• Batch applications that apply the same operation to all elements of
a dataset.
Applications not suitable for RDDs
• Applications that make asynchronous fine-grained updates to shared
state, e.g., storage system for a web application.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 47 / 49
Summary
RDD: a distributed memory abstraction that is both fault tolerant and efficient.
Two types of operations: Transformations and Actions.
RDD fault tolerance: Lineage.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 48 / 49
Questions?
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 49 / 49