Introduction to Spark
Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons
Attribution 4.0 International License.
Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes work and handles fault tolerance
This makes the analysis of large data sets easy and reliable
Emerging Class of Applications
Machine learning
• K-means clustering
• ...
Graph algorithms
• PageRank
• ...
Nature of the emerging class of applications
• Iterative computation
• Intermediate results are reused across multiple computations
Problem with Hadoop MapReduce
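[Figure: iteration 1 reads (R) its input from HDFS and writes (W) its results to HDFS; iteration 2 reads those results from HDFS and writes its own results back to HDFS]
• Results are written to HDFS
• New job is launched for each iteration
• Incurs substantial storage and job launch overheads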
Can we do away with these overheads?
• Persist intermediate results in memory
• Memory is 10-100X faster than disk/network
• What if a node fails?
• Challenge: how to handle faults efficiently?
[Figure: data is loaded (L) from HDFS once; iterations 1 and 2 read (R) and write (W) intermediate results in memory; an X marks a failed node]
Approaches to handle faults
• Replication
– Can tolerate ‘r-1’ failures
Issues:
– Requires more storage
– More network traffic
• Using lineage
– Log the operation
– Re-compute lost partitions using lineage information
Issues:
– Recovery time can be high if re-computation is very costly
  – high iteration time
  – wide dependencies
[Figure: replication with a master and two replicas, one of them failed (X); a lineage graph over datasets D1, D2, D3 and C1, C2 with a lost partition marked X; wide dependencies highlighted]
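The lineage approach above can be sketched in a few lines of plain Scala. This is a toy model under stated assumptions, not Spark's implementation: the names Dataset, Source, Mapped and Cache are hypothetical, partitions are plain Vectors, and only a narrow (one-to-one) dependency is modelled. It shows that a lost in-memory partition can be rebuilt by replaying the logged operation on its parent partition only.

```scala
// Toy model of lineage-based recovery. All names here are illustrative
// assumptions; this is not how Spark is implemented internally.
object LineageSketch {
  // A dataset is either loaded from stable storage or derived from a parent
  // by a logged coarse-grained operation.
  sealed trait Dataset {
    def numPartitions: Int
    // Rebuild one partition using lineage information alone.
    def compute(partition: Int): Vector[Int]
  }

  // Data that lives in stable storage (standing in for HDFS).
  final case class Source(stable: Vector[Vector[Int]]) extends Dataset {
    def numPartitions: Int = stable.length
    def compute(partition: Int): Vector[Int] = stable(partition)
  }

  // The logged operation: a map over every record of the parent.
  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    def numPartitions: Int = parent.numPartitions
    // Narrow dependency: partition i needs only parent partition i.
    def compute(partition: Int): Vector[Int] = parent.compute(partition).map(f)
  }

  // In-memory copies of computed partitions; an entry can be lost on failure.
  final class Cache(ds: Dataset) {
    private val slots = scala.collection.mutable.Map.empty[Int, Vector[Int]]
    var recomputations = 0

    // Return the cached partition, recomputing it from lineage if missing.
    def get(partition: Int): Vector[Int] =
      slots.getOrElseUpdate(partition, { recomputations += 1; ds.compute(partition) })

    // Simulate a node failure that drops one partition from memory.
    def lose(partition: Int): Unit = { slots.remove(partition); () }
  }
}
```

Only the lost partition is recomputed; the other cached partitions are untouched. With a wide dependency, one lost partition would need many parent partitions, which is why recovery time grows in that case.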
Spark
• RDD – Resilient Distributed Datasets
– Read-only, partitioned collection of records
– Supports only coarse-grained operations
• e.g. map and group-by transformations, reduce action
– Uses lineage graph to recover from faults
[Figure: an RDD with 3 partitions: D11, D12, D13]
Spark contd.
• Control placement of partitions of RDD
– can specify number of partitions
– can partition based on a key in each record
• useful in joins
• In-memory storage
– Up to 100X speedup over Hadoop for iterative
applications
• Spark can run on Hadoop YARN and read files
from HDFS
• Spark is written in Scala
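Why partitioning on a key helps joins can be illustrated with plain Scala collections. This is a minimal sketch, assuming a simple hash partitioner; partitionOf, partitionByKey and localJoin are hypothetical helper names, not Spark API. When both datasets are partitioned with the same function, records with equal keys land in the same partition index, so each partition pair can be joined locally.

```scala
// Hash partitioning by key, modelled with local collections.
// All function names are illustrative assumptions, not Spark API.
object PartitionSketch {
  // Map a key to one of n partitions (non-negative modulo of its hash code).
  def partitionOf(key: String, n: Int): Int = {
    val m = key.hashCode % n
    if (m < 0) m + n else m
  }

  // Split key-value records into n partitions according to their key.
  def partitionByKey[V](records: Seq[(String, V)], n: Int): Vector[Seq[(String, V)]] = {
    val grouped = records.groupBy { case (k, _) => partitionOf(k, n) }
    Vector.tabulate(n)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Join two datasets that were partitioned the same way, one partition
  // pair at a time. Matching keys always share a partition index, so no
  // record has to be compared with a record from another partition.
  def localJoin[A, B](
      left: Vector[Seq[(String, A)]],
      right: Vector[Seq[(String, B)]]): Seq[(String, (A, B))] =
    left.zip(right).flatMap { case (l, r) =>
      for ((k, a) <- l; (k2, b) <- r if k == k2) yield (k, (a, b))
    }
}
```

In a cluster, each partition pair would sit on one node, so a join over co-partitioned data avoids moving records across the network.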
Scala overview
• Functional programming meets object orientation
• “No side effects” aids concurrent programming
• Every value is an object
• Every function is a value
Variables and Functions
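var obj: java.lang.String = "Hello"  // variable with an explicit type
var x = new A()                      // type is inferred; A is a class defined elsewhere
def square(x: Int): Int = {          // the Int after the parameter list is the return type
  x * x
}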
Execution of a function
def square(x: Int): Int = {
  x * x
}
scala> square(2)
res0: Int = 4
scala> square(square(6))
res1: Int = 1296
Nested Functions
def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
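A variant of the same nested function, given here as a sketch rather than as part of the original slide: the inner function makes its recursive call in tail position, so it can be marked with @tailrec and the compiler will turn it into a loop. The result type is switched to BigInt as an assumption of this example, because an Int result overflows from 13! onwards.

```scala
import scala.annotation.tailrec

object FactorialSketch {
  // Same structure as the slide: an outer function with a nested helper
  // that carries the running product in an accumulator.
  def factorial(n: Int): BigInt = {
    // @tailrec makes the compiler reject the definition if the recursive
    // call is not in tail position.
    @tailrec
    def fact(i: Int, acc: BigInt): BigInt =
      if (i <= 1) acc
      else fact(i - 1, acc * i)

    fact(n, BigInt(1))
  }
}
```

The helper fact is visible only inside factorial, which keeps the accumulator parameter out of the public signature.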
Higher-order map functions
val add = (x: Int) => x + 1
val lst = List(1, 2, 3)
lst.map(add)         // List(2, 3, 4)
lst.map(x => x + 1)  // List(2, 3, 4)
lst.map(_ + 1)       // List(2, 3, 4)
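Since every function is a value, functions can also be written that take or return other functions. The following is a small sketch; applyTwice and adder are hypothetical names chosen for illustration and do not come from the slides.

```scala
object HigherOrderSketch {
  // A function stored in a value, as in the map example.
  val add: Int => Int = x => x + 1

  // Takes a function as an argument and applies it twice.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Returns a new function that adds n to its argument.
  def adder(n: Int): Int => Int = x => x + n
}
```

Both kinds can be passed to map directly, for example List(1, 2, 3).map(HigherOrderSketch.adder(10)).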
Defining Objects
object Example {
  def main(args: Array[String]): Unit = {
    // sc (a SparkContext) and logFile (an input path) are assumed to be defined elsewhere
    val logData = sc.textFile(logFile, 2).cache()
    // ...
  }
}
Example.main(Array("master", "noOfMap", "noOfReducer"))
Spark: Filter transformation in RDD
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
“Give me those lines which contain ‘a’”
Input (contents of logFile):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
test
Output (contents of numAs):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
Count
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
numAs.count()  // 5
Input (contents of logFile):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
test
flatMap
val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))
Take each line, split it on spaces, and flatten the resulting arrays into a single collection of words
Input: Here is a example of filter map
Output: (Here, is, a, example, of, filter, map)
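The filter, count and flatMap steps can be tried without a cluster. This is a sketch that assumes the six sample lines from the slides held in a local List; Scala's collection methods of the same names stand in for the RDD operations, so nothing here is distributed or lazy.

```scala
object TransformSketch {
  // The sample input from the slides, as a local List instead of an RDD.
  val lines: List[String] = List(
    "Here is a example of filter",
    "Transformation, you can",
    "notice that the filter method",
    "will be applied on each line",
    "and return a new RDD",
    "test"
  )

  // filter: keep only the lines that contain the letter "a".
  val withA: List[String] = lines.filter(line => line.contains("a"))

  // count: number of lines that passed the filter (List uses size).
  val numAs: Int = withA.size

  // map produces one array per line ...
  val arrays: List[Array[String]] = lines.map(line => line.split(" "))

  // ... while flatMap flattens those arrays into one list of words.
  val words: List[String] = lines.flatMap(line => line.split(" "))
}
```

Every line except "test" contains an "a", so numAs is 5, which matches the count shown on the slide.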
Word count example in Spark
val spark = new SparkContext(master, appName)  // optional further arguments: sparkHome, jars
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
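The same pipeline can be mirrored on a local collection to see what each stage produces. This is a sketch under assumptions: plain Scala collections have no reduceByKey, so groupBy followed by a per-key reduce is swapped in for it, and wordCount is a hypothetical helper name.

```scala
object WordCountSketch {
  // Local stand-in for the Spark pipeline: flatMap, map, then a per-key sum.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // collect the pairs of each word
      .map { case (word, pairs) =>        // sum the 1s, as reduceByKey(_ + _) would
        (word, pairs.map(_._2).reduce(_ + _))
      }
}
```

For example, WordCountSketch.wordCount(Seq("to be or", "not to be")) gives a count of 2 for "to" and "be" and 1 for "or" and "not".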
Limitations
• RDDs are not suitable for applications that require fine-grained updates to shared state
– e.g. a storage system for a web application
References
• http://www.slideshare.net/tpunder/a-brief-intro-to-scala
• Joshua D. Suereth, “Scala in Depth”.
• Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing”, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, Berkeley, CA, USA, 2012.
• Pictures:
– http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
– http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg