Introduction to Spark

Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons
Attribution 4.0 International License.

Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes the work and handles fault tolerance
• This makes analysis of large data sets easy and reliable
Emerging class of applications
• Machine learning
– e.g. K-means clustering
• Graph algorithms
– e.g. PageRank

Nature of the emerging class of applications
• Iterative computation
• Intermediate results are reused across multiple computations

Problem with Hadoop MapReduce
[Figure: each iteration reads (R) its input from HDFS and writes (W) its results back to HDFS; iteration 2 reads what iteration 1 wrote]
• Results are written to HDFS after every iteration
• A new job is launched for each iteration
• This incurs substantial storage and job launch overheads

Can we do away with these overheads?
• Persist intermediate results in memory
– Memory is 10-100X faster than disk/network
[Figure: data is loaded (L) from HDFS once; iterations 1 and 2 read (R) and write (W) intermediate results in memory; a failure is marked X]
• What if a node fails?
• Challenge: how to handle faults efficiently?

Approaches to handle faults
• Replication
[Figure: a master and two replicas, with write (W) and read (R) paths; a failure is marked X]
– With r copies, can tolerate 'r-1' failures
Issues:
– Requires more storage
– More network traffic
• Using lineage
[Figure: lineage example with nodes D1, D2, D3 and C1, C2; a failure is marked X]
– Log the operation
– Re-compute lost partitions using lineage information
Issues:
– Recovery time can be high if re-computation is very costly
• high iteration time
• wide dependencies
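The lineage approach above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `Dataset`, `Source`, `Mapped` and `Cached` are hypothetical names invented for this sketch and are not Spark classes. Each dataset records how it was derived, so a lost in-memory partition can be rebuilt by replaying the logged operation on its parent.

```scala
object LineageSketch {
  // A dataset is described by how each partition is computed, not by stored copies.
  sealed trait Dataset[A] {
    def numPartitions: Int
    // Recompute one partition by following the lineage back to the source.
    def compute(partition: Int): Vector[A]
    // A coarse-grained operation: only the function is logged, not the data.
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Stable input data (stands in for a file that can always be re-read).
  final case class Source[A](partitions: Vector[Vector[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.length
    def compute(partition: Int): Vector[A] = partitions(partition)
  }

  // A derived dataset: remembers its parent and the operation applied to it.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  // In-memory copies of partitions; a lost entry is rebuilt from lineage on demand.
  final class Cached[A](ds: Dataset[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    def get(p: Int): Vector[A] = cache.getOrElseUpdate(p, ds.compute(p))
    def lose(p: Int): Unit = cache.remove(p) // simulate a node failure
    def isCached(p: Int): Boolean = cache.contains(p)
  }
}
```

Only the lost partition is recomputed here because `map` has a narrow dependency: each output partition needs exactly one parent partition. With a wide dependency, rebuilding one partition would need data from many parent partitions, which is why recovery can become expensive.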

Spark
• RDD: Resilient Distributed Dataset
– Read-only, partitioned collection of records
– Supports only coarse-grained operations
• e.g. map and group-by transformations, reduce action
– Uses a lineage graph to recover from faults
[Figure: an RDD split into 3 partitions: D11, D12, D13]

Spark contd.
• Control over the placement of RDD partitions
– can specify the number of partitions
– can partition based on a key in each record
• useful in joins
• In-memory storage
– Up to 100X speedup over Hadoop for iterative applications
• Spark can run on Hadoop YARN and read files from HDFS
• Spark is written in Scala
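Why key-based partitioning helps joins can be seen in a small plain-Scala sketch. It assumes a simple hash-modulo scheme; `partitionOf` and `partitionByKey` are hypothetical helpers written for this illustration, not Spark APIs. All records with the same key land in the same partition, so two datasets partitioned the same way can be joined partition by partition.

```scala
object PartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // The adjustment keeps the result non-negative for negative hash codes.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Split key-value records into numPartitions groups based on the key.
  def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Vector[Vector[(K, V)]] = {
    val grouped = records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(p => grouped.getOrElse(p, Seq.empty).toVector)
  }
}
```

For example, `PartitionSketch.partitionByKey(Seq("a" -> 1, "b" -> 2, "a" -> 3), 2)` returns two partitions, and both `"a"` records end up in the same one.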

Scala overview
• Functional programming meets object orientation
• "No side effects" aids concurrent programming
• Every value is an object
• Every function is a value

Variables and Functions
var obj: java.lang.String = "Hello"
var x = new A()   // assumes a class A is defined
def square(x: Int): Int = {   // the ": Int" after the parameter list is the return type
  x * x
}

Execution of a function
def square(x: Int): Int = {
  x * x
}
scala> square(2)
res0: Int = 4
scala> square(square(6))
res1: Int = 1296

Nested Functions
def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
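The nested helper above is tail-recursive, which Scala can verify with the standard `@tailrec` annotation. The variant below is a sketch and not part of the original slides: it switches the result type to `BigInt`, since an `Int` result overflows once the argument exceeds 12.

```scala
import scala.annotation.tailrec

object FactorialSketch {
  def factorial(n: Int): BigInt = {
    // @tailrec makes the compiler reject this helper if the recursive call
    // is not in tail position, so it is guaranteed to compile to a loop.
    @tailrec
    def fact(i: Int, acc: BigInt): BigInt =
      if (i <= 1) acc
      else fact(i - 1, acc * i)

    fact(n, BigInt(1))
  }
}
```

`FactorialSketch.factorial(5)` evaluates to 120, and `factorial(13)` to 6227020800, a value that no longer fits in an `Int`.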

Higher order map functions
val add = (x: Int) => x + 1
val lst = List(1, 2, 3)
lst.map(add)         // List(2, 3, 4)
lst.map(x => x + 1)  // List(2, 3, 4)
lst.map(_ + 1)       // List(2, 3, 4)
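Besides `map`, a few other higher-order functions from the Scala standard library follow the same pattern. The sketch below uses plain Scala collections only; `applyTwice` and the value names are made up for illustration.

```scala
object HigherOrderSketch {
  val lst = List(1, 2, 3, 4)

  // filter keeps the elements for which the predicate is true.
  val evens = lst.filter(_ % 2 == 0)

  // reduce combines the elements pairwise with a binary function.
  val total = lst.reduce(_ + _)

  // A function that takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Functions are values, so they can be stored and composed.
  val addOne: Int => Int = _ + 1
  val squareThenAddOne: Int => Int = ((x: Int) => x * x).andThen(addOne)
}
```

Here `evens` is `List(2, 4)`, `total` is 10, `applyTwice(addOne, 5)` is 7, and `squareThenAddOne(3)` is 10.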

Defining Objects
object Example {
  def main(args: Array[String]): Unit = {
    // sc and logFile must be defined; their setup is not shown on the slide
    val logData = sc.textFile(logFile, 2).cache()
    // ...
    // ...
  }
}
Example.main(Array("master", "noOfMap", "noOfReducer"))

Spark: Filter transformation in RDD
val numAs = logData.filter(line => line.contains("a"))
Return the lines that contain 'a'. The filter function is applied to each line, and the result is a new RDD.
Input (logData), one record per line:
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
test

Count
val numAs = logData.filter(line => line.contains("a"))
numAs.count()   // 5: every input line except "test" contains 'a'
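The filter and count steps can be tried locally on an ordinary Scala `List`, with no Spark installation. This is only a local analogue: a `List` is evaluated eagerly and lives on one machine, whereas an RDD is partitioned across a cluster. The six input lines are the sample lines from the slide.

```scala
object FilterCountSketch {
  // The sample input from the slide, one record per line.
  val lines = List(
    "Here is a example of filter",
    "Transformation, you can",
    "notice that the filter method",
    "will be applied on each line",
    "and return a new RDD",
    "test"
  )

  // Keep only the lines that contain the letter 'a'.
  val withA = lines.filter(line => line.contains("a"))

  // Local counterpart of numAs.count().
  val numAs = withA.length
}
```

`FilterCountSketch.numAs` is 5, because only the last line, "test", has no 'a'.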

Flatmap
val words = logData.flatMap(line => line.split(" "))
Split each line on spaces and flatten the resulting arrays into a single RDD of words
Input line: Here is a example of filter map
Output: (Here, is, a, example, of, filter, map)
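The difference between `map` and `flatMap` is easiest to see side by side. The sketch below uses plain Scala lists as a stand-in for an RDD; the second input line is added only to show the flattening across lines.

```scala
object FlatMapSketch {
  val lines = List("Here is a example of filter map", "test")

  // map keeps one output element per input line: a list of word lists.
  val nested: List[List[String]] = lines.map(line => line.split(" ").toList)

  // flatMap concatenates the per-line results into a single list of words.
  val flat: List[String] = lines.flatMap(line => line.split(" ").toList)
}
```

`nested` has 2 elements (one list per line), while `flat` has 8 elements (one per word).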

Wordcount Example in Spark
// sparkHome and jars are optional arguments
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
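The same flatMap, map, and per-key reduction pipeline can be mimicked on local Scala collections to check the logic before running it on a cluster. This is a sketch, not Spark code: `wordCount` is a hypothetical helper, and `groupBy` followed by a sum stands in for `reduceByKey(_ + _)`. Unlike the Spark version above, it also drops empty strings produced by consecutive spaces.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toSeq) // one element per word
      .filter(_.nonEmpty)                     // skip empty tokens (not in the Spark version)
      .map(word => (word, 1))                 // pair each word with a count of 1
      .groupBy(_._1)                          // gather the pairs for each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // add up the counts
}
```

For example, `WordCountSketch.wordCount(Seq("a b a", "b c"))` yields `Map("a" -> 2, "b" -> 2, "c" -> 1)`.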

Limitations
• RDDs are not suitable for applications that
require fine-grained updates
– e.g. web storage system

