Big Machine Learning Libraries & Open Challenges
Based on presentations during
Brno Data Week 2018
by Prof. Sherif Sakr
Created by: Tichý, T. & Luhan, J.
(Feb. 2019)
https://www.chedteb.eu/
We try to simplify everything we do.
And this is the reason why we try to use machine learning in Big Data.
Big machine learning libraries
Mahout
• Mahout is a Java library that implements machine learning
techniques (e.g. classification, clustering, recommendation)
on top of the Hadoop framework
• Example use cases:
• Recommendation: Takes users' behaviour and tries to
find items users might like
• Clustering: Takes e.g. text documents and organises them
into groups of topically related documents
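To make the recommendation use case concrete, here is a minimal single-machine sketch in Python. It is not Mahout's API and does not run on Hadoop; the sample data, the function name `recommend` and the simple co-occurrence scoring are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user -> liked items data (illustrative only).
history = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b"},
    "cat": {"b", "c", "d"},
}

def recommend(user, history, top_n=2):
    """Rank unseen items by how often they co-occur with the user's items."""
    cooccur = Counter()
    for items in history.values():
        for x, y in combinations(sorted(items), 2):
            cooccur[(x, y)] += 1
            cooccur[(y, x)] += 1
    seen = history[user]
    scores = Counter()
    for (x, y), count in cooccur.items():
        if x in seen and y not in seen:
            scores[y] += count
    # Sort by score (descending), then by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("bob", history))  # ['c', 'd']
```

Items that often appear together with what a user already likes are ranked first, which is the basic idea behind "find items users might like".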
Google Cloud Machine Learning
• Google Cloud Machine Learning is a managed platform that
enables its users to easily build machine learning models that
work on any type of data, of any size.
Open Challenges
Pipelining Various Big Data Jobs
• In practice, one of the possible scenarios is that users need to
execute a computation that combines various analytics jobs.
• Existing systems do not address the challenges of data
construction, transformation and post-processing.
• New trend of integrated systems: Spark and Flink.
• Emerging pipelining systems: Apache Tez and Apache MRQL
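The idea of combining data construction, transformation, an analytics job and post-processing into one computation can be sketched as follows. This is a plain Python illustration, not the API of Spark, Flink, Tez or MRQL; all function names and the sample data are assumptions made for the example.

```python
from functools import reduce

def construct(raw_lines):
    """Data construction: parse raw 'key,value' lines into records."""
    return [(k, int(v)) for k, v in (line.split(",") for line in raw_lines)]

def transform(records):
    """Transformation: drop records with non-positive values."""
    return [(k, v) for k, v in records if v > 0]

def analyze(records):
    """Analytics job: sum the values per key."""
    totals = {}
    for k, v in records:
        totals[k] = totals.get(k, 0) + v
    return totals

def postprocess(totals):
    """Post-processing: order keys by total, largest first."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, data)

raw = ["x,3", "y,-1", "x,2", "z,4"]
result = run_pipeline(raw, [construct, transform, analyze, postprocess])
print(result)  # [('x', 5), ('z', 4)]
```

Keeping each stage a separate function with a single input and output is what lets a pipelining system reorder, fuse or distribute the stages.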
Lack of Declarative Interfaces
• In the early days of the Hadoop framework, the lack of
declarative languages for expressing big data processing
tasks limited the framework's practicality, acceptance
and usage.
• Several systems (e.g., Pig, Hive, Impala) have been
designed to provide high-level languages for expressing
big data tasks on top of Hadoop.
• Currently, the systems/stacks of large-scale graph and
stream processing platforms face the same
challenge.
• High-level language abstractions that express big data
processing jobs and enable the underlying
systems/stack to perform automatic optimization are
urgently needed.
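The difference between an imperative and a declarative interface can be illustrated on a small scale. The sketch below uses Python's standard `sqlite3` module as a stand-in for a declarative engine; it is not Pig, Hive or Impala, and the table, column and function names are assumptions made for the example.

```python
import sqlite3

rows = [("ann", "books", 12.0), ("bob", "books", 8.0), ("ann", "music", 5.0)]

# Imperative version: the programmer spells out how to group and sum.
def total_per_buyer_imperative(rows):
    totals = {}
    for buyer, _category, amount in rows:
        totals[buyer] = totals.get(buyer, 0.0) + amount
    return sorted(totals.items())

# Declarative version: the query states what is wanted and the engine
# decides how to execute it.
def total_per_buyer_declarative(rows):
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE purchases (buyer TEXT, category TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT buyer, SUM(amount) FROM purchases "
            "GROUP BY buyer ORDER BY buyer"
        )
        return cursor.fetchall()
    finally:
        conn.close()

print(total_per_buyer_imperative(rows))
print(total_per_buyer_declarative(rows))
```

Because the declarative form does not fix an execution strategy, the underlying system is free to optimize it automatically, which is the property the slide asks for in graph and stream processing stacks.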
Benchmarking Challenges
• Designing a good benchmark is a challenging task due to
the many aspects that should be considered; these can
influence the adoption and the usage scenarios of the
benchmark.
• Variety of algorithms, systems, big dataset
characteristics and application domains.
• There are not enough standard benchmarks that can be
effectively employed in this domain.
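Even a toy benchmark has to vary both the algorithm and the dataset. The following Python sketch shows one possible harness under simplifying assumptions: it runs on a single machine, uses random numbers as the dataset, and the names `benchmark` and `insertion_sort` are hypothetical. It is not a standard big data benchmark.

```python
import random
import time

def insertion_sort(values):
    """Simple O(n^2) sort, used here only as a slow baseline."""
    for i in range(1, len(values)):
        current = values[i]
        j = i - 1
        while j >= 0 and values[j] > current:
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = current
    return values

def benchmark(algorithms, dataset_sizes, repeats=3, seed=0):
    """Time every algorithm on every dataset size; keep the best run."""
    rng = random.Random(seed)  # fixed seed so the datasets are reproducible
    results = {}
    for size in dataset_sizes:
        data = [rng.random() for _ in range(size)]
        for name, func in algorithms.items():
            best = float("inf")
            for _ in range(repeats):
                sample = list(data)  # fresh copy so runs do not interfere
                start = time.perf_counter()
                func(sample)
                best = min(best, time.perf_counter() - start)
            results[(name, size)] = best
    return results

results = benchmark({"builtin": sorted, "insertion": insertion_sort}, [100, 1000])
for (name, size), seconds in sorted(results.items()):
    print(f"{name:10s} n={size:5d} {seconds:.6f}s")
```

The measured times depend on the machine, so no expected output is shown. A real benchmark would also have to vary the system under test and the application domain, which is where the difficulty described above comes from.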
Platform-Independent Analytics
• In general, more alternatives usually mean harder
choices
• Porting data and data analytics jobs between
different systems is a tedious, time-consuming
and costly task
• The Musketeer and Rheem systems proposed a new
direction: mapping the front end of big data jobs
to different back-end execution engines
• More work is still required to tackle the challenge
of providing platform independence and
multi-platform task execution
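The idea of defining a job once and executing it on different platforms can be sketched in a few lines of Python. This is not the design of Musketeer or Rheem; the job format and the two back ends `run_local` and `run_partitioned` are hypothetical, and the reduce function is assumed to be associative and commutative.

```python
from functools import reduce

# A platform-neutral job: a list of (operation, function) steps.
job = [
    ("map", lambda x: x * x),
    ("filter", lambda x: x % 2 == 0),
    ("reduce", lambda a, b: a + b),
]

def run_local(job, data):
    """Hypothetical back end 1: run every step on one in-memory list."""
    for op, fn in job:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
        elif op == "reduce":
            return reduce(fn, data)
    return data

def run_partitioned(job, data, partitions=2):
    """Hypothetical back end 2: run steps per partition, then merge."""
    chunks = [data[i::partitions] for i in range(partitions)]
    for op, fn in job:
        if op == "map":
            chunks = [[fn(x) for x in c] for c in chunks]
        elif op == "filter":
            chunks = [[x for x in c if fn(x)] for c in chunks]
        elif op == "reduce":
            # Reduce each non-empty partition, then combine the partials.
            partials = [reduce(fn, c) for c in chunks if c]
            return reduce(fn, partials)
    return [x for c in chunks for x in c]

data = list(range(1, 7))
print(run_local(job, data))        # 56
print(run_partitioned(job, data))  # 56
```

The job description contains no platform-specific code, so the choice of back end can be made (or automated) separately from writing the job.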
And what is next?
Would you like to know more?
See the original presentation for
Brno Data Week 2018
(by Prof. Sherif Sakr)
