Lightning-fast cluster computing
Rahul Kavale(rahulkav@thoughtworks.com)
Unmesh Joshi(uvjoshi@thoughtworks.com)
2
Some properties of “Big Data”
• Big data is inherently immutable: it is not supposed to be updated once generated.
• Write operations are mostly coarse-grained.
• Commodity hardware makes more sense for storing and processing such enormous data, so the data is distributed across a cluster of many such machines.
• The distributed nature complicates programming.
3
Brush-up on Hadoop concepts
Distributed Storage => HDFS
Cluster Manager => YARN
Fault tolerance => achieved via replication
Job scheduling => Scheduler in YARN
Mapper
Reducer
Combiner
4
http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
5
Map Reduce Programming Model
6
https://twitter.com/francesc/status/507942534388011008
7
http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
8
http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121
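The map, shuffle, and reduce phases pictured in the slides above can be sketched with plain Scala collections. This is only an in-memory illustration of the programming model, not Hadoop code; the names `MapReduceSketch`, `mapper`, `reducer`, and `run` are made up for this example.

```scala
// Plain Scala collections only; no Hadoop or Spark involved.
// All names here are hypothetical and chosen for illustration.
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce phase: fold all values that share a key into one result.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def run(lines: Seq[String]): Map[String, Int] = {
    val mapped = lines.flatMap(mapper)
    // Shuffle: group the intermediate pairs by key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    shuffled.map { case (word, counts) => reducer(word, counts) }
  }

  def main(args: Array[String]): Unit =
    println(run(Seq("to be or not to be", "to do")))
}
```

In real MapReduce each phase runs on different machines and the shuffle moves data over the network; here everything happens in one process.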
9
MapReduce pain points
• Considerable latency
• Only Map and Reduce phases
• Non-trivial to test
• Results in complex workflows
• Not suitable for iterative processing
10
Immutability and MapReduce model
• HDFS storage is immutable or append-only.
• The MapReduce model fails to exploit the immutable nature of the data.
• Intermediate results are persisted to disk, resulting in a huge amount of IO and a serious performance hit.
11
Wouldn’t it be very nice if we could have
• Low latency
• Programmer friendly programming model
• Unified ecosystem
• Fault tolerance and other typical distributed system properties
• Easily testable code
• Of course open source :)
12
What is Apache Spark
• Cluster computing Engine
• Abstracts the storage and cluster management
• Unified interfaces to data
• API in Scala, Python, Java, R*
13
Where does it fit in the existing Big Data ecosystem
http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html
14
Why should you care about Apache Spark
• Abstracts underlying storage,
• Abstracts cluster management
• Easy programming model
• Very easy to test the code
• Highly performant
15
• Petabyte sort record
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
16
• Offers in memory caching of data
• Specialized Applications
• GraphX for graph processing
• Spark Streaming
• MLlib for machine learning
• Spark SQL
• Data exploration via Spark-Shell
17
Programming model
for
Apache Spark
18
Word Count example
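// `spark` here is a SparkContext (it is predefined as `sc` in spark-shell)
val file = spark.textFile("input path")
// Split lines into words, pair each word with 1, then sum the counts per word
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")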
19
Comparing example with MapReduce
20
Spark Shell Demo
• SparkContext
• RDD
• RDD operations
21
RDD
• RDD stands for Resilient Distributed Dataset.
• basic abstraction for Spark
22
• Equivalent to a distributed collection.
• The interface makes the distributed nature of the underlying data transparent.
• An RDD is immutable.
• Can be created by:
  • parallelising a collection,
  • transforming an existing RDD by applying a transformation function,
  • reading from a persistent data store like HDFS.
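The three ways of creating an RDD listed above might look as follows. This is a minimal sketch assuming Spark is on the classpath and run in local mode; the object name `RddCreation`, the app name, and the temporary file (standing in for an HDFS path) are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of the original deck.
object RddCreation {
  def run(): (List[Int], List[Int], Long) = {
    // local[*] runs Spark inside this JVM; the app name is arbitrary.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))
    try {
      // 1. Parallelise an in-memory collection.
      val numbers: RDD[Int] = sc.parallelize(Seq(1, 2, 3, 4))

      // 2. Transform an existing RDD; `numbers` itself is left unchanged.
      val doubled: RDD[Int] = numbers.map(_ * 2)

      // 3. Read from storage. A temp file stands in for an HDFS path here.
      val path = Files.createTempFile("rdd-creation", ".txt")
      Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))
      val lines: RDD[String] = sc.textFile(path.toString)

      (numbers.collect().toList, doubled.collect().toList, lines.count())
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```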
23
RDDs are lazily evaluated
An RDD has two types of operations
• Transformations
  Build a DAG of transformations to be applied to the RDD
  Do not evaluate anything
• Actions
  Evaluate the DAG of transformations
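One way to observe the laziness described above is to count how often the function passed to a transformation actually runs. The sketch below assumes Spark 2.0 or later in local mode and uses an accumulator purely as a counter; `LazyEvaluation` and the accumulator name are illustrative choices, and accumulator updates inside transformations are only dependable in simple cases like this one where no task is retried.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the original deck.
object LazyEvaluation {
  // Returns the number of map-function calls (before the action, after the action).
  def run(): (Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-eval").setMaster("local[*]"))
    try {
      val calls = sc.longAccumulator("map calls")

      // Transformation: only recorded in the DAG, nothing executes yet.
      val squared = sc.parallelize(1 to 5).map { n => calls.add(1L); n * n }
      val before = calls.sum

      // Action: triggers evaluation of the recorded transformations.
      squared.count()
      val after = calls.sum

      (before, after)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The counter stays at zero until `count()` is called, which is the point at which the DAG is evaluated.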
24
RDD operations
Transformations
map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)
union() : (RDD[T],RDD[T]) ⇒ RDD[T]
join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
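The signatures above follow the notation of the RDD paper. As a rough sketch of how a few of them look in the Scala API (assuming Spark in local mode), the following applies `filter`, `union`, `reduceByKey`, `groupByKey`, and `join` to tiny made-up datasets; the object name and the sample data are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object and data; not part of the original deck.
object Transformations {
  def run(): (List[Int], Map[String, Int], Map[String, List[Int]], Map[String, Double]) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("transformations").setMaster("local[*]"))
    try {
      val sales = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
      val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

      // filter and union work on any RDD[T].
      val evens = sc.parallelize(1 to 6).filter(_ % 2 == 0)
      val merged = evens.union(sc.parallelize(Seq(10)))

      // The *ByKey operations and join are available on RDDs of pairs.
      val totals = sales.reduceByKey(_ + _)
      val grouped = sales.groupByKey().mapValues(_.toList.sorted)
      val revenue = totals.join(prices).mapValues { case (qty, price) => qty * price }

      // The collect calls below are actions; everything above is lazy.
      (merged.collect().toList.sorted,
        totals.collectAsMap().toMap,
        grouped.collectAsMap().toMap,
        revenue.collectAsMap().toMap)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```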
25
Actions
count() : RDD[T] ⇒ Long
collect() : RDD[T] ⇒ Seq[T]
reduce(f : (T,T) ⇒ T) : RDD[T] ⇒ T
lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
save(path : String) : Outputs RDD to a storage system, e.g., HDFS
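The actions listed above can be tried out in local mode roughly as follows. Note that the generic `save` from the slide corresponds to methods such as `saveAsTextFile` in the Scala API; the object name, sample data, and temporary output directory are assumptions for this sketch.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object and data; not part of the original deck.
object Actions {
  def run(): (Long, List[Int], Int, List[Int], Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("actions").setMaster("local[*]"))
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val values = pairs.values

      val n = pairs.count()                     // number of elements
      val all = values.collect().toList.sorted  // bring everything to the driver
      val sum = values.reduce(_ + _)            // combine all elements
      val forA = pairs.lookup("a").toList.sorted // values stored under one key

      // saveAsTextFile needs a path that does not exist yet.
      val out = Files.createTempDirectory("rdd-actions").resolve("out")
      pairs.saveAsTextFile(out.toString)
      val savedLines = sc.textFile(out.toString).count()

      (n, all, sum, forA, savedLines)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Each of these calls triggers a job, unlike the transformations, which only extend the DAG.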
26
Job Execution
27
Spark Execution in Context of YARN
http://kb.cnblogs.com/page/198414/
28
Fault tolerance via lineage
MappedRDD
FilteredRDD
FlatMappedRDD
MappedRDD
HadoopRDD
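The chain of RDDs above is the lineage Spark keeps so that a lost partition can be recomputed from its parents. A small sketch of inspecting it with `toDebugString`, assuming Spark in local mode: the object name is invented, and the class names printed depend on the Spark version (they may not match the `MappedRDD`/`FilteredRDD` names on the slide).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the original deck.
object Lineage {
  def run(): String = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage").setMaster("local[*]"))
    try {
      val words = sc.parallelize(Seq("a b", "c d e"))
        .flatMap(_.split(" "))
        .filter(_.nonEmpty)
        .map(_.toUpperCase)

      // Describes this RDD and its chain of parents; no job is run.
      words.toDebugString
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```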
29
Testing
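The deck gives no code for this slide. As one possible illustration of why Spark code is easy to test, the sketch below keeps the word count pipeline free of I/O so it can be checked against a small in-memory RDD in local mode. The names `WordCount`, `count`, and `WordCountCheck` are hypothetical, and a plain `assert` stands in for whatever test framework a project uses.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names; a sketch of one testing approach, not the authors' setup.
object WordCount {
  // The pipeline takes an RDD and returns an RDD; no file paths are baked in.
  def count(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}

object WordCountCheck {
  def run(): Map[String, Int] = {
    // A local master needs no cluster, so the check runs like a unit test.
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count-check").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq("spark is fast", "spark is fun"))
      WordCount.count(input).collectAsMap().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    assert(run() == Map("spark" -> 2, "is" -> 2, "fast" -> 1, "fun" -> 1))
    println("word count check passed")
  }
}
```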
30
Why is Spark more performant than
MapReduce
31
Reduced IO
• No disk IO between phases since phases themselves are pipelined
• No network IO involved unless a shuffle is required
32
No Mandatory Shuffle
• Programs are not bound to map and reduce phases
• No mandatory shuffle and sort
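Whether an operation needs a shuffle shows up in the dependencies an RDD records on its parent. The following sketch, assuming Spark in local mode, checks that a `map` yields a narrow dependency (pipelined, no network IO) while `reduceByKey` on an unpartitioned RDD yields a shuffle dependency. `ShuffleCheck` is an invented name, and `NarrowDependency` and `ShuffleDependency` are developer-level Spark classes used here only for inspection.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

// Hypothetical example object; not part of the original deck.
object ShuffleCheck {
  // Returns (map is narrow, reduceByKey needs a shuffle).
  def run(): (Boolean, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-check").setMaster("local[*]"))
    try {
      val words = sc.parallelize(Seq("a", "b", "a"))

      // map works partition by partition, so it is pipelined with its parent.
      val pairs = words.map(word => (word, 1))

      // reduceByKey must bring equal keys together, which requires a shuffle here.
      val counts = pairs.reduceByKey(_ + _)

      // Inspecting dependencies does not run a job.
      val mapIsNarrow = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
      val reduceShuffles = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
      (mapIsNarrow, reduceShuffles)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A job made only of narrow transformations runs without any shuffle, which is the contrast with MapReduce drawn on the slide.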
33
In memory caching of data
• Optional in-memory caching
• The DAG engine can apply certain optimisations because, when an action is called, it knows all the transformations that have to be applied
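The effect of optional caching can be seen by counting how often a transformation's function runs across two actions. This is a sketch assuming Spark 2.0 or later in local mode with enough memory to hold the tiny dataset; `CachingDemo` and its `useCache` parameter are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the original deck.
object CachingDemo {
  // Returns map-function call counts (after first action, after second action).
  def run(useCache: Boolean): (Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-demo").setMaster("local[*]"))
    try {
      val calls = sc.longAccumulator("map calls")
      val mapped = sc.parallelize(1 to 4).map { n => calls.add(1L); n * 10 }
      // cache() only marks the RDD; data is stored when it is first computed.
      val rdd = if (useCache) mapped.cache() else mapped

      rdd.count()              // first action computes the RDD
      val afterFirst = calls.sum
      rdd.collect()            // second action reuses cached data if available
      val afterSecond = calls.sum

      (afterFirst, afterSecond)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(s"without cache: ${run(useCache = false)}")
    println(s"with cache:    ${run(useCache = true)}")
  }
}
```

Without caching, the second action recomputes the whole lineage; with caching, the map function runs only during the first action.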
34
Questions?
35
Thank You!

Scrap Your MapReduce - Apache Spark
