MAD skills for analysis and big data Machine Learning

MAD SKILLS FOR ANALYSIS
AND
BIG DATA MACHINE LEARNING
University of Helsinki
Gianvito Siciliano
(2014 - Distributed Computing Frameworks for Big Data Seminar)

COMPARISON OF
• APPROACHES
• PLATFORMS
• ALGORITHMS

AGENDA
1. Analysis intro:
• needed skills (MAD)
• important areas (IS, ML)
2. Big Data intensive approaches:
• HPC, ABDS, BDAS
3. Machine Learning tool generations
• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…)
4. Large scale (ML) algorithms comparison
• K-means, LogReg

Why data analysis?
“So, what’s getting ubiquitous and cheap? Data. And what is
complementary to data? Analysis.”
The value of data analysis has entered common culture: it uncovers the
unexpected in your data.

How to make sense of data?
The MAD acronym is made up of three inherent aspects
of big data analysis:
Magnetic: attracting data from heterogeneous
sources, regardless of the quality of the data.

Agile: producing analyses quickly, so that the resulting
actions maximize value for the business.

Deep: enabling analysts to apply both sophisticated
statistical methods and the best-performing ML algorithms
to enormous datasets in distributed environments.

How to go deep?
• Inferential statistics, which lets you capture the underlying
properties of the population (prediction, causality analysis and
distributional comparison)
• Machine Learning, “…is the unsung hero that powers many of the
most sophisticated big data analytic applications”.

MAD skills, 2 key points
• DB design (SQL): capture, modelling, manage, querying…
• Programming style (MapReduce): extract, transform, process, investigate…

MAD design for smart environment!
Parallel DBMSs are substantially faster than the MR system once the
data is loaded, but loading the data takes considerably longer in the
DB system.
MapReduce has captured the interest of many developers because of its
simple two-function paradigm, and it is widely viewed as a more attractive
programming environment than SQL.
The MR paradigm removes the need to define a schema up front: it only requires
loading (copying) the data into the storage system.
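To make the two-function paradigm concrete, here is a minimal single-process sketch of a word count. The names `map_fn`, `reduce_fn` and `run_mapreduce` are illustrative assumptions, not part of Hadoop or any other framework; a real system would distribute the map, shuffle and reduce phases across machines.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())


# Raw lines are used directly: no schema is declared before processing.
lines = ["big data needs analysis", "analysis needs data"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The user only writes the two functions; grouping by key is the framework's job, which is what the toy driver stands in for here.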

MAD design for smart environment!
As each approach has its own pros and cons, one proposal is a
database-Hadoop hybrid approach to scalable machine learning, where
batch learning is performed on the Hadoop platform and data are stored
(and organised) with the help of parallel DBMSs.
The critical skill for a MAD analyst becomes interoperability across
complex pipelines that include some stages in SQL and some in
MapReduce syntax.
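A pipeline that mixes both styles might look like the sketch below. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for a parallel DBMS, plain Python functions stand in for a MapReduce job, and the `events` table and its columns are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Stage 1 (SQL): data is stored, organised and filtered in the database.
conn = sqlite3.connect(":memory:")  # stand-in for a parallel DBMS
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 10.0), ("bob", 4.0), ("ann", 6.0), ("bob", -1.0)],
)
rows = conn.execute(
    "SELECT customer, amount FROM events WHERE amount > 0"
).fetchall()
conn.close()


# Stage 2 (MapReduce style): batch computation over the extracted rows.
def map_fn(row):
    """Map: key each row by customer."""
    customer, amount = row
    return customer, amount


def reduce_fn(customer, amounts):
    """Reduce: mean amount per customer."""
    return customer, sum(amounts) / len(amounts)


groups = defaultdict(list)
for key, value in map(map_fn, rows):
    groups[key].append(value)
mean_spend = dict(reduce_fn(k, v) for k, v in groups.items())
print(mean_spend)
```

The hand-off between the two stages (here a list of rows) is where the interoperability skill matters in practice.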

How to deal with Big Data and Machine Learning?
• parallelizing and distributing data analysis
• large-scale data sets
• cluster and data fault tolerance
• iterative processing

BIG DATA INTENSIVE PARADIGMS
High Performance Computing
is the use of parallel processing for running advanced
application programs efficiently, reliably and quickly
• parallel processing (MPI)
• advanced, high-performance applications (Molecular Dynamics)
• separate cluster (VMs), compute (SLURM) and storage (LUSTRE) layers
• supercomputing
HPC stack (figure): app, proc, comm, strg

Apache Big Data Stack
Based on the integration of compute and data, it introduces application-level scheduling
to facilitate heterogeneous application workloads and high cluster utilization.
• MapReduce paradigm
• integration of compute and data management
• cheap hardware
• little need for communication across the cluster
• many open-source implementations, support and docs
• tight coupling between storage (HDFS) and resource management (YARN)
• no shared memory
• no support for iteration
ABDS stack (figure): app, proc, comm, strg

Berkeley Data Analytics Stack
It emerged in response to application requirements (short-running
tasks) and to overcome the problems of its
predecessor (data caching).
• Transform and Act paradigm
• multi-level scheduler (MESOS)
• runtime for iterative processing (SPARK)
• distributed shared memory (RDD)
• …young?
BDAS stack (figure): app, proc, comm, strg
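The "transform and act" idea with caching can be illustrated with a toy class. This is a sketch under stated assumptions, not Spark's API: `LazyDataset` and its methods are hypothetical names that only mimic the behaviour of recording transformations lazily, running them when an action is called, and keeping the result in memory for later iterations.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (not Spark)."""

    def __init__(self, source, transforms=()):
        self._source = source          # callable producing the base records
        self._transforms = transforms  # recorded steps (the lineage)
        self._use_cache = False
        self._cache = None

    def map(self, fn):
        # Transformation: only records the step, computes nothing yet.
        return LazyDataset(self._source, self._transforms + (fn,))

    def cache(self):
        # Ask for the computed records to be kept in memory after first use.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        data = list(self._source())
        for fn in self._transforms:
            data = [fn(x) for x in data]
        if self._use_cache:
            self._cache = data
        return data

    def reduce(self, fn, initial):
        # Action: triggers the actual computation.
        result = initial
        for x in self._compute():
            result = fn(result, x)
        return result


loads = []  # records how many times the source is read


def load():
    loads.append(1)
    return range(5)


squares = LazyDataset(load).map(lambda x: x * x).cache()
loads_before_action = len(loads)  # still 0: transformations are lazy

# Three "iterations" over the same data read the source only once.
totals = [squares.reduce(lambda a, b: a + b, 0) for _ in range(3)]
print(totals, len(loads))
```

Because the recorded transformations are kept alongside the cache, a lost cache could be rebuilt by replaying them from the source, which is the intuition behind the "resilient" part of RDDs.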

FROM 2 PARADIGMS TO A HYBRID TOOL
HPC - data-intensive parallel task workflows
+
ABDS - compute-demanding work on clusters and MapReduce style for batch processing
=
BDAS - provides caching and shared memory
…
ML - remember that algorithms need iterative processing!
=> SPARK - distributed framework for (big) data preparation and machine learning, built
on a resilient caching system that keeps data available across iterations

BIG DATA FRAMEWORK SPACE
[Figure: frameworks plotted by age/maturity across three categories: Fast Data, Big Analytics, Big Application]

THREE GENERATIONS OF ML TOOLS
First generation: traditional ML tools (SAS, SPSS, Weka, R)
• wide set of ML algorithms
• can facilitate deep analysis
• vertically scalable
• non-distributed
• smaller data sets
Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner)
• scale to large data sets
• distributed
• no database connectivity (ODBC)
• smaller subsets of algorithms
• low performance with multi-stage applications (e.g. machine learning and graph processing)
• inefficient primitives for data sharing
• poor support for ad-hoc and interactive queries
• slow iterative computations
Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark)
• modularity
• shared memory
• iterative ML algorithms
• asynchronous graph processing
• cached memory across iterations/interactions

ML ALGORITHMS
K-means for clustering analysis.
The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids
from a set of data points.
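As a reminder of what one k-means iteration computes, here is a minimal single-machine sketch in plain Python. The function names, the toy points and the starting centroids are assumptions made for illustration; the implementations compared in the talk run these steps in a distributed way.

```python
def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members)
                  for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centroids, final_labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(final_centroids, final_labels)
```

Every iteration touches every data point in both steps, which is why repeated passes over the same data benefit from keeping it in memory.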
Logistic Regression, a type of probabilistic statistical classification model.
For the comparison it is used for a binary classification task: it is less compute-intensive than k-means
and more sensitive to time spent in deserialization and I/O.
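For comparison, a binary logistic regression trained with batch gradient descent can be sketched as follows. This is an illustrative single-machine version; the learning rate, epoch count and toy data are assumptions, not the settings used in the comparison.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(xs, ys, lr=0.5, epochs=1000):
    """Batch gradient descent on the mean logistic loss."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # One full pass over the data per iteration: cheap arithmetic per
        # record, so reading the records tends to dominate the cost.
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(w, b, x):
    """Class 1 if the estimated probability is at least 0.5."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


xs = [(0.0,), (1.0,), (3.0,), (4.0,)]
ys = [0, 0, 1, 1]
w, b = train_logreg(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

Each iteration is a single weighted sum over the data set, so compared with k-means there is less computation per record to hide the time spent loading and deserializing it.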

[Figures: running times (s) versus iterations, number of machines and input size; panels a) to f), including LOG REG]

CONCLUSIONS
• MAD design can help the analysis process, much as the Agile methodology helps the software
development process.
• Parallel DBMSs, with their better performance, should be treated as complementary to MapReduce systems.
• MapReduce provides the end user with powerful abstractions for data processing, analytics and machine learning,
which naturally evolve into the new "transform and act" paradigm used in Spark.
• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and the best
framework in this scenario.
• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant
abstraction for sharing data in cluster applications, and they are the added value of Spark.
• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they
do not appear to be mature enough.

Acknowledgements
Dr. Sasu Tarkoma
Dr. Mohammad Hoque
Reviewers

Thank you!
(gianvito.siciliano@gmail.com)
