MAD SKILLS FOR ANALYSIS 
AND 
BIG DATA MACHINE LEARNING 
University of Helsinki 
Gianvito Siciliano 
(2014 - Distributed Computing Frameworks for Big Data Seminar)
COMPARISON OF 
• APPROACHES 
• PLATFORMS 
• ALGORITHMS
AGENDA 
1. Analysis intro: 
• needed skills (MAD) 
• important areas (IS, ML) 
2. Big Data intensive approaches: 
• HPC, ABDS, BDAS 
3. Machine Learning tool generations 
• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…) 
4. Large scale (ML) algorithms comparison 
• K-means, LogReg
Why data analysis? 
“So, what’s getting ubiquitous and cheap? Data. And what is 
complementary to data? Analysis.” 
The value of data analysis, uncovering the unexpected in your data, 
has entered common culture.
How to make sense of data? 
The MAD acronym is made up of three inherent aspects 
of big data analysis: 
Magnetic: attracting data from heterogeneous 
sources, regardless of the quality of the data.
Agile: performing analysis quickly, to obtain 
actions that maximize the value for the business.
Deep: enabling analysts to use both sophisticated 
statistical methods and the best-performing ML algorithms 
to study enormous datasets in distributed environments.
How to go deep? 
• Inferential statistics, which lets you capture the underlying 
properties of the population (prediction, causality analysis and 
distributional comparison) 
• Machine Learning, which “…is the unsung hero that powers many of the 
most sophisticated big data analytic applications”.
MAD skills, 2 key points 
• DB design: capture, modelling, management, querying… (SQL) 
• Programming style: extract, transform, process, investigate… (MapReduce)
MAD design for smart environment! 
Parallel DBMSs are substantially faster than the MR system once the 
data is loaded, but loading the data takes considerably longer in the 
database system. 
MapReduce has captured the interest of many developers because of its 
simple two-function paradigm, and it is widely viewed as a more attractive 
programming environment than SQL. 
The MR paradigm simplifies the schema-writing process for data: it just requires 
loading and copying the data into the storage system.
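The two-function paradigm mentioned above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for the shuffle that a real MapReduce runtime performs across machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data analysis", "big data machine learning"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

Note that the input is just raw text lines: no schema is declared before the data is processed, which is the point the slide makes about loading.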
As each approach has its own set of pros and cons, the proposal is a 
database-Hadoop hybrid approach to scalable machine learning, where 
batch learning is performed on the Hadoop platform and data are stored 
(and organised) with the help of parallel DBMSs. 
The critical skill for a MAD analyst becomes interoperability across a 
complex pipeline that includes some stages in SQL and some in 
MapReduce syntax.
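A pipeline with one SQL stage and one MapReduce-style stage might look like the sketch below. It is an illustration under loud assumptions: an in-memory SQLite database stands in for the parallel DBMS, plain Python functions stand in for Hadoop jobs, and the `events` table, its columns and its rows are made up for the example.

```python
import sqlite3
from collections import defaultdict

# SQL stage: the database stores, organises and filters the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, label INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("ann", 1, 0.9), ("ann", 0, 0.2), ("bob", 1, 0.7), ("bob", 1, None)],
)
rows = conn.execute(
    "SELECT label, score FROM events WHERE score IS NOT NULL"
).fetchall()
conn.close()


# MapReduce-style stage: batch computation over the exported rows.
def mapper(row):
    label, score = row
    yield label, (score, 1)  # partial sum and partial count


def reducer(label, values):
    total = sum(score for score, _ in values)
    count = sum(n for _, n in values)
    return label, total / count


groups = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        groups[key].append(value)

mean_score_by_label = dict(reducer(k, vs) for k, vs in groups.items())
print(mean_score_by_label)
```

The hand-off between the two stages (here a simple `fetchall()`) is where the interoperability skill matters in practice: the analyst has to know which work belongs in the query and which in the batch job.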
How to deal with Big Data and Machine Learning? 
• parallelizing and distributing data analysis 
• large-scale data sets 
• cluster and data fault tolerance 
• iterative processing
BIG DATA INTENSIVE PARADIGMS 
High Performance Computing 
The use of parallel processing to run advanced 
application programs efficiently, reliably and quickly. 
• parallel processing (MPI) 
• advanced, high-performance applications (Molecular Dynamics) 
• separate cluster (VMs), compute (SLURM) and storage (Lustre) layers 
• supercomputing 
HPC stack layers: app, proc, comm, strg
BIG DATA INTENSIVE PARADIGMS 
Apache Big Data Stack 
Based on the integration of compute and data, it introduces application-level scheduling 
to support heterogeneous application workloads and high cluster utilization. 
• MapReduce paradigm 
• integrated compute and data management 
• cheap hardware 
• low communication needs among clusters 
• many open-source implementations, support and docs 
• tight coupling between storage (HDFS) and resource management (YARN) 
• no shared memory 
• no support for iteration 
ABDS stack layers: app, proc, comm, strg
BIG DATA INTENSIVE PARADIGMS 
Berkeley Data Analytics Stack 
It emerged in response to application requirements (short-running 
tasks) and to overcome the problems of its 
predecessor (data caching). 
• Transform and Act paradigm 
• multi-level scheduler (Mesos) 
• runtime iterative processing (Spark) 
• distributed shared memory (RDD) 
• …young? 
BDAS stack layers: app, proc, comm, strg
FROM 2 PARADIGMS TO A HYBRID TOOL 
HPC - data-intensive workflows of parallel tasks 
+ 
ABDS - compute-demanding work on clusters, MapReduce style for batch processing 
= 
BDAS - provides caching and shared memory 
… 
ML - remember that algorithms need iterative processing! 
=> SPARK - distributed framework for (big) data preparation and machine learning, based 
on a resilient (cached) system that keeps data available across iterations
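Why caching matters for iterative algorithms can be shown with a toy model. The sketch below does not use the Spark API: `ToyDataset`, `cache`, `collect` and `iterate` are hypothetical names, and the `loads` counter stands in for the cost of re-reading input from storage on every pass.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that is expensive to load."""

    def __init__(self, loader):
        self._loader = loader
        self._use_cache = False
        self._cached = None
        self.loads = 0  # how many times the source was actually read

    def cache(self):
        """Ask for the data to be kept in memory after the first load."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def iterate(dataset, iterations):
    """Toy iterative job: every pass needs the full dataset again."""
    estimate = 0.0
    for _ in range(iterations):
        points = dataset.collect()
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


uncached = ToyDataset(lambda: [1.0, 2.0, 3.0])
cached = ToyDataset(lambda: [1.0, 2.0, 3.0]).cache()

result_uncached = iterate(uncached, 10)
result_cached = iterate(cached, 10)
print(uncached.loads, cached.loads)
```

Both runs compute the same estimate, but the uncached dataset is read once per iteration while the cached one is read only once, which is the cost that a batch-oriented MapReduce job pays on every pass.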
BIG DATA FRAMEWORK SPACE 
[Figure: frameworks arranged by age/maturity across three areas: Fast Data, Big Analytics, Big Application]
THREE GENERATIONS OF ML TOOLS 
First generation: traditional tools for machine learning (SAS, SPSS, Weka, R) 
• wide set of ML algorithms 
• can facilitate deep analysis 
• vertically scalable 
• non-distributed 
• smaller data sets 
Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner) 
• scale to large data sets 
• distributed 
• no database connectivity (ODBC) 
• smaller subsets of algorithms 
• low performance with multi-stage applications (e.g. machine learning and graph processing) 
• inefficient primitives for data sharing 
• poor support for ad-hoc and interactive queries 
• slow iterative computations 
Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark) 
• modularity 
• shared memory 
• iterative ML algorithms 
• asynchronous graph processing 
• cached memory across iterations/interactions
ML ALGORITHMS 
K-means for clustering analysis. 
The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids 
from a set of data points.
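The per-iteration work described above can be sketched in plain Python. This is a minimal single-machine version of the standard Lloyd iteration, not the benchmarked implementation: the function names, the sample points and the starting centroids are all assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: recompute each centroid as the mean of its members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:  # keep an empty cluster's centroid unchanged
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, labels)
```

Every iteration touches every point in both steps, so on large inputs the time per iteration is mostly arithmetic over the whole dataset.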
Logistic Regression, a type of probabilistic statistical classification model. 
In the comparison it is used for a binary classification task; it is less compute-intensive than k-means 
and more sensitive to time spent in deserialization and I/O.
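For comparison, a binary logistic regression trained by batch gradient descent can be sketched as follows. Again this is an illustration in plain Python rather than the benchmarked code: the learning rate, the iteration count and the one-dimensional toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, lr=0.5, iterations=1000):
    """Batch gradient descent on the mean logistic loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):  # one full pass over the data per iteration
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d, xi in enumerate(x):
                grad_w[d] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


xs = [(0.0,), (1.0,), (3.0,), (4.0,)]
ys = [0, 0, 1, 1]
w, b = train(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

The arithmetic per data point is a single dot product and a sigmoid, so in a distributed run the cost of reading and deserializing each record weighs more heavily than it does for k-means.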
K-MEANS 
LOG REG 
[Figures: benchmark plots, panels a) to f), showing times (s) against iterations, machines and input]
CONCLUSIONS 
• MAD design can help the analysis process, just as the Agile methodology helps the software 
development process. 
• The better performance of parallel DBMSs should be seen as complementary to MapReduce systems. 
• MapReduce provides the end user with powerful abstractions for data processing, analytics and machine learning, 
which evolve naturally into the new “transform and act” paradigm used in Spark. 
• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and is the best 
framework in this scenario. 
• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant 
abstraction for sharing data in cluster applications, and they are the added value of Spark. 
• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they 
do not appear to be mature enough.
Acknowledgements 
Dr. Sasu Tarkoma 
Dr. Mohammad Hoque 
Reviewers
Thank you! 
(gianvito.siciliano@gmail.com)
