Big Machine Learning Libraries & Open Challenges

Based on presentations given at Brno Data Week 2018
by Prof. Sherif Sakr
Created by: Tichý, T. & Luhan, J.
(Feb. 2019)
https://www.chedteb.eu/

We try to simplify everything we do, and this is why we try to use machine learning in Big Data.
Big machine learning libraries

Mahout
• Mahout is a Java library that implements machine learning techniques (e.g. classification, clustering, recommendation) on top of the Hadoop framework.
• Example use cases:
  • Recommendation: takes users' behaviour and tries to find items they might like.
  • Clustering: takes, for example, text documents and groups them into sets of topically related documents.
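The recommendation use case can be illustrated with a minimal sketch. This is not Mahout's API and does not run on Hadoop; it is a plain Python item co-occurrence recommender, and the function names (`build_cooccurrence`, `recommend`) and the toy data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """Count how often each ordered pair of items shares a user's history."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(user, histories, co, top_n=3):
    """Score items the user has not seen by co-occurrence with seen items."""
    seen = set(histories[user])
    scores = Counter()
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    # Sort by score (descending), then by name, so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical toy data: user -> items the user interacted with.
histories = {
    "alice": ["book", "lamp", "pen"],
    "bob": ["book", "pen"],
    "carol": ["book", "lamp"],
    "dave": ["book"],
}
co = build_cooccurrence(histories)
print(recommend("dave", histories, co))
```

A library such as Mahout applies the same idea (behaviour in, candidate items out) at a scale where the co-occurrence counts have to be computed in a distributed fashion.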

Google Cloud Machine Learning
• Google Cloud Machine Learning is a managed platform that
enables its users to easily build machine learning models that
work on any type of data, of any size.

Open Challenges

Pipelining Various Big Data Jobs
• In practice, users may need to execute a computation that combines several analytics jobs.
• Existing systems do not address the challenges of data construction, transformation and post-processing.
• New trend of integrated systems: Spark and Flink.
• Emerging pipelining systems: Apache Tez and Apache MRQL.
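As a rough illustration of what pipelining means here, the sketch below chains the four kinds of step named above (construction, transformation, analytics, post-processing) into a single job. It uses only plain Python; the `pipeline` helper, the stage names and the sample records are hypothetical and are not taken from Spark, Flink, Tez or MRQL.

```python
from collections import Counter
from functools import reduce


def pipeline(*stages):
    """Compose stages left to right into one callable job."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


def construct(lines):
    """Data construction: parse raw comma-separated records."""
    return [line.split(",") for line in lines]


def transform(rows):
    """Transformation: drop rows with a non-numeric amount, convert types."""
    return [
        (user.strip(), int(amount))
        for user, amount in rows
        if amount.strip().isdigit()
    ]


def analyze(pairs):
    """Analytics: total amount per user."""
    totals = Counter()
    for user, amount in pairs:
        totals[user] += amount
    return totals


def postprocess(totals):
    """Post-processing: rank users by total, highest first."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


job = pipeline(construct, transform, analyze, postprocess)
raw = ["ann,10", "bob,5", "ann,7", "bob,n/a"]
result = job(raw)
print(result)
```

In a real deployment each stage might run on a different engine, which is where the hand-over of data between jobs becomes the difficult part.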

Lack of Declarative Interfaces
• In the early days of the Hadoop framework, the lack of declarative languages for expressing big data processing tasks limited its practicality and held back its wide acceptance and use.
• Several systems (e.g., Pig, Hive, Impala) have been designed to provide high-level languages for expressing big data tasks on top of Hadoop.
• Currently, the systems/stacks of large-scale graph and stream processing platforms face the same challenge.
• High-level language abstractions that express big data processing jobs and let the underlying systems/stacks perform automatic optimization are crucially required.
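The difference between an imperative and a declarative formulation can be shown on a tiny example. The sketch below counts click events per user twice: once with an explicit loop, and once as a SQL query run through Python's standard `sqlite3` module, which stands in here for systems such as Hive or Impala. The table name, column names and sample data are assumptions made for illustration.

```python
import sqlite3

# Hypothetical event log: (username, action).
events = [("ann", "click"), ("bob", "view"), ("ann", "click"), ("bob", "click")]

# Imperative version: the programmer spells out how to compute the result.
counts = {}
for username, action in events:
    if action == "click":
        counts[username] = counts.get(username, 0) + 1
imperative = sorted(counts.items())

# Declarative version: the query states what is wanted; the engine decides
# how to scan, filter and group the rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (username TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
declarative = conn.execute(
    "SELECT username, COUNT(*) FROM events "
    "WHERE action = 'click' GROUP BY username ORDER BY username"
).fetchall()
conn.close()

print(imperative)
print(declarative)
```

Because the declarative query does not fix an execution order, the engine is free to optimize it, which is the property the last bullet asks for in graph and stream processing stacks.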

Benchmarking Challenges
• Designing a good benchmark is a challenging task because many aspects must be considered; these can influence the adoption and the usage scenarios of the benchmark.
• There is wide variety in algorithms, systems, big dataset characteristics and application domains.
• There are not enough standard benchmarks that can be effectively employed in this domain.

Platform-Independent Analytics
• In general, more alternatives usually mean harder choices.
• Porting data and data analytics jobs between different systems is a tedious, time-consuming and costly task.
• Musketeer and Rheem systems proposed a new direction to map the frontend of the big data jobs
• More work is still required to provide platform independence and multi-platform task execution.
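One way to picture platform independence is a job description that carries no engine-specific code and can be handed to more than one backend. The sketch below is only an illustration under that assumption: it is not how Musketeer or Rheem work internally, and the `SumByKey` class, the two executor functions and the `BACKENDS` registry are hypothetical names.

```python
import sqlite3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SumByKey:
    """Backend-neutral job: sum values per key, keeping values >= min_value."""
    min_value: int = 0


def run_in_memory(job, rows):
    """Backend 1: plain Python loop."""
    totals = defaultdict(int)
    for key, value in rows:
        if value >= job.min_value:
            totals[key] += value
    return sorted(totals.items())


def run_on_sqlite(job, rows):
    """Backend 2: the same job translated to SQL and run on SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT k, SUM(v) FROM t WHERE v >= ? GROUP BY k ORDER BY k",
            (job.min_value,),
        )
        return cur.fetchall()
    finally:
        conn.close()


# Registry mapping a backend name to the function that executes the job.
BACKENDS = {"memory": run_in_memory, "sqlite": run_on_sqlite}


def execute(job, rows, backend):
    """Run the same job description on the chosen backend."""
    return BACKENDS[backend](job, rows)


rows = [("a", 3), ("b", 1), ("a", 4), ("b", 10)]
job = SumByKey(min_value=2)
results = {name: execute(job, rows, name) for name in BACKENDS}
print(results)
```

The job is written once and both backends return the same answer; the hard part in practice is doing this translation automatically for arbitrary jobs and choosing the backend that suits each one.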

And what is next?
Would you like to know more?
See the original presentation for Brno Data Week 2018 (by Prof. Sherif Sakr).
