Available platforms for Big Data 2.0
Based on presentations during
Brno Data Week 2018
by Prof. Sherif Sakr
Created by: Tichý, T. & Luhan, J.
(Feb. 2019)
https://www.chedteb.eu/
Dealing with data diversity in Big Data 2.0
Because data is very diverse, we have to
find the platform that best suits our needs.
Big Data 2.0
Spark
• Apache Spark is a fast, general engine for large-scale data
processing on a computing cluster (a newer processing engine for Hadoop)
• Initially developed in Scala at UC Berkeley in 2009, it is
currently supported by Databricks
• One of the most active and fastest-growing Apache projects
• Committers from Cloudera, Yahoo, Databricks, UC Berkeley,
Intel, Groupon and others
Big Data 2.0: Platforms
Spark vs Hadoop
Flink
• Apache Flink is a distributed in-memory data processing framework
that supports both batch and real-time processing, which makes it a
flexible alternative to the MapReduce framework.
• Flink originated from the Stratosphere research project, which was
started at the Technical University of Berlin in 2009 and joined
the Apache Incubator in 2014.
• Instead of the map and reduce abstractions, Flink uses a directed
graph approach that leverages in-memory storage to improve
runtime performance.
• It has a fast-growing community.
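To make the directed graph idea more concrete, here is a minimal sketch in plain Python. It is not Flink's API: the `Dataflow` class, its `add` and `run` methods, and the operator names are all hypothetical and exist only for illustration. Each node is an operator, each edge names an upstream result, and intermediate results stay in memory between operators instead of being written out after every step.

```python
from collections import Counter


class Dataflow:
    """A tiny directed acyclic graph of named operators (hypothetical, not Flink's API)."""

    def __init__(self):
        self.ops = {}      # node name -> function
        self.inputs = {}   # node name -> names of upstream nodes

    def add(self, name, fn, inputs):
        self.ops[name] = fn
        self.inputs[name] = list(inputs)
        return self

    def run(self, sources):
        """Run every operator once its inputs exist, keeping all results in memory."""
        results = dict(sources)
        pending = dict(self.ops)
        while pending:
            ready = [n for n in pending if all(i in results for i in self.inputs[n])]
            if not ready:
                raise ValueError("graph has a cycle or a missing input")
            for name in ready:
                args = [results[i] for i in self.inputs[name]]
                results[name] = pending.pop(name)(*args)
        return results


# One source feeds two branches, which are merged again by the last operator.
flow = (
    Dataflow()
    .add("words", lambda lines: [w.lower() for l in lines for w in l.split()], ["lines"])
    .add("long_words", lambda words: [w for w in words if len(w) > 4], ["words"])
    .add("counts", lambda words: Counter(words), ["words"])
    .add("report", lambda counts, long_words: {w: counts[w] for w in set(long_words)},
         ["counts", "long_words"])
)

out = flow.run({"lines": ["Flink runs dataflow graphs",
                          "Spark and Flink keep data in memory"]})
print(out["report"])
```

The point of the sketch is the shape of the computation: a job is an arbitrary graph of operators with branches and merges, not a fixed map step followed by a reduce step.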
Hadoop vs Spark vs Flink
Big Data 2.0 Processing
Systems:
SQL-On-Hadoop
Apache Hive
• The first system that was introduced to support SQL-on-Hadoop with
familiar relational database concepts such as tables and columns.
• Hive has been widely used by many organizations, such as Facebook,
eBay, LinkedIn and Yahoo!, to manage and process large volumes of
data.
• It supports an SQL-like declarative language, HiveQL, which represents
a subset of SQL-92 and can therefore be easily understood by anyone
who is familiar with SQL.
• Hive queries are automatically compiled into MapReduce jobs that
run on Hadoop.
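As a rough illustration of how a declarative query can map onto MapReduce phases, consider a query such as `SELECT country, COUNT(*) FROM visits GROUP BY country`. The sketch below is plain Python, not Hive's actual compiler or execution plan; the `visits` table, its rows and the three function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical rows of a table visits(user, country).
visits = [("ann", "CZ"), ("bob", "DE"), ("eva", "CZ"), ("jan", "CZ"), ("kai", "DE")]


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for _user, country in rows:
        yield country, 1


def shuffle(pairs):
    """Shuffle: collect the emitted values per key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each group, here the equivalent of COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


result = reduce_phase(shuffle(map_phase(visits)))
print(result)  # {'CZ': 3, 'DE': 2}
```

The GROUP BY column becomes the key emitted by the map step and the aggregate function becomes the reduce step, so the user writes only the SQL-like query.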
Big Data 2.0: SQL based platforms
Impala
• Open-source project, built by Cloudera, to provide a massively
parallel processing SQL query engine that runs natively in
Apache Hadoop.
• By using Impala, the user can query data that is stored in the
Hadoop Distributed File System (HDFS).
• It uses the same metadata and SQL syntax (HiveQL) as Apache Hive.
IBM Big SQL
• The SQL interface for the IBM big data processing platform,
InfoSphere BigInsights
• Big SQL relies on a built-in query optimizer that rewrites the
input query as a local job to help minimize latencies by using
Hadoop dynamic scheduling mechanisms.
• It uses a massively parallel processing SQL engine that is
deployed directly on the physical Hadoop Distributed File
System (HDFS).
Presto
• Open-source distributed SQL query engine, built by Facebook, for
running interactive analytic queries against large-scale structured
data sources ranging in size from gigabytes to petabytes.
• Presto allows querying data where it lives, including Hive, NoSQL
databases (e.g., Cassandra, HBase), relational databases or even
proprietary data stores.
• A single Presto query can combine data from multiple sources.
• Presto has recently been adopted by large companies such as
Netflix and Airbnb.
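The idea of one query spanning several sources can be sketched with Python's built-in `sqlite3` module. This is only an analogy and does not use Presto: two separate in-memory SQLite databases stand in for two data stores, and the table names, column names and rows are invented for the example.

```python
import sqlite3

# Two independent in-memory databases play the role of two data sources.
conn = sqlite3.connect(":memory:")                  # source 1, addressed as "main"
conn.execute("ATTACH DATABASE ':memory:' AS crm")   # source 2, addressed as "crm"

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO main.orders VALUES (?, ?)",
                 [(1, 20.0), (2, 5.0), (1, 12.5)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])

# A single SQL statement joins tables that live in different databases.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Ann', 32.5), ('Bob', 5.0)]
```

In Presto the sources would be different systems (for example Hive and a relational database) rather than two SQLite databases, but the query has the same shape: each table is qualified by the source it comes from.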
To be continued
In the next episode we will focus
on big streams (data that is
constantly generated) and on how to
analyze and visualize them.
