Distributed Processing Frameworks
Author: Antonios Katsarakis
Literature
• MapReduce: Simplified Data Processing on Large Clusters
Jeff Dean et al. - OSDI’04.
• Spark: Cluster Computing with Working Sets
M. Zaharia et al. - HotCloud’10.
Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with the increasing
amount of data to process
• This leads us to the Big Data era
- Big data: data sets so large that the processing power of a
single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
MapReduce
• Proposed by Jeff Dean et al. (Google) in 2004
- Cited more than 18k times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce Program:
- Read Data
- Map: filtering of the data
- Shuffle and sort
- Reduce: summary operation on data
- Write the Results
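The five steps above can be sketched in a single process with plain Python. This is only an illustration of the programming model, assuming a word-count job. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and do not belong to any MapReduce library; a real framework would run the map and reduce calls on different machines and move the intermediate data over the network.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: summarize all values that share one key."""
    return word, sum(counts)

def run_mapreduce(splits):
    # Map phase: each input split is processed independently.
    intermediate = [
        pair for split in splits for line in split for pair in map_fn(line)
    ]
    # Shuffle and sort: group the values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

# Read data: three input splits, as in the three-way split of the input.
splits = [["big data big clusters"], ["data sets"], ["big data"]]
result = run_mapreduce(splits)
print(result)  # write the results
```

The grouping step in the middle stands in for the shuffle: it is the only point where data produced by different map calls has to be brought together.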
[Figure: MapReduce data flow. The input data is divided into three splits (1/3 of the input each), each processed by a Map task; the Map tasks produce intermediate data, which two Reduce tasks turn into output data.]
Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of computations
• Limitations:
- Restricted to batch processing
- Supports only map and reduce operations
- The shuffling phase introduces overheads
Spark
• Proposed by Matei Zaharia et al. in 2010
- Cited 1.5k times
• Another programming model based on
higher-order functions that execute
user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
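To make the two main ideas concrete, the sketch below models a computation as a small graph in which each node knows its parent and how to derive its data, and then shows what keeping one node in memory saves when two downstream computations reuse it. The `Node` class and its methods are hypothetical names chosen for illustration; this is not Spark's API.

```python
class Node:
    """One vertex of a computation DAG: a parent node plus a function."""

    def __init__(self, fn, parent=None):
        self.fn = fn
        self.parent = parent
        self.cached = False
        self._data = None
        self.evaluations = 0  # how many times this node was recomputed

    def cache(self):
        # Ask for the result to be kept in memory after the first evaluation.
        self.cached = True
        return self

    def evaluate(self):
        if self.cached and self._data is not None:
            return self._data
        self.evaluations += 1
        source = self.parent.evaluate() if self.parent else None
        data = self.fn(source)
        if self.cached:
            self._data = data
        return data

source = Node(lambda _: list(range(10)))
squares = Node(lambda xs: [x * x for x in xs], parent=source).cache()
total = Node(sum, parent=squares)     # first consumer of squares
largest = Node(max, parent=squares)   # second consumer of squares

print(total.evaluate(), largest.evaluate())
print(squares.evaluations, source.evaluations)
```

Without the `cache()` call, both `squares` and `source` would be evaluated once per consumer, which is the repeated work that in-memory caching is meant to avoid for iterative and interactive jobs.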
Spark Model
• Resilient Distributed
Datasets (RDDs):
immutable collections of
objects spread across a
cluster
• Operations over RDDs:
1. Transformations: lazy
operators that create new
RDDs
2. Actions: launch a
computation on an RDD
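The difference between the two kinds of operations can be shown with a single-process stand-in. In this sketch, `MiniRDD` is a hypothetical class, not Spark's RDD: its transformations only build a new object, and nothing runs until an action is called.

```python
class MiniRDD:
    """Immutable, lazily evaluated collection (single-process stand-in)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

log = []

def traced_double(x):
    log.append(x)  # records when the function actually runs
    return 2 * x

numbers = MiniRDD(lambda: iter(range(5)))
doubled_large = numbers.map(traced_double).filter(lambda x: x > 2)
ran_before_action = len(log)      # still 0: only transformations so far
values = doubled_large.collect()  # the action launches the computation
print(ran_before_action, values)
```

Because the transformations are chained generators, each element passes through `map` and `filter` in one pass, which is a rough analogue of pipelining operations within a stage.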
var count = readFile(…)
    .map(…)
    .filter(…)
    .reduceByKey()
    .count()
[Figure: Job (RDD) graph. The graph runs from the file split into chunks (RDD0) through RDD1, RDD2, RDD3, and RDD4 to the Result; pipelined operations are grouped into Stage 1 and Stage 2.]
Critical Reflection
• Benefits:
- High-level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis on the thread level is hard
- Multipurpose application support makes performance improvements and
tuning really challenging
- The shuffling phase introduces overheads
Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance


Editor's Notes

  • #9: High-level API (in Scala, Java, Python), usable by non computer scientists. Support for more application types (streaming, iterative and interactive). Performance optimizations (memory caching, transformation pipelining, etc.)
  • #10: 3* (in terms of performance, application support and user friendliness)