SQL on Hadoop
Markus Olenius
BIGDATAPUMP Ltd
Background
• Big Data vs. Traditional Data Warehousing
  • Increased volumes and scale
  • Lower costs per node
  • Avoidance of vendor lock-in; open source is commonly preferred
  • Agility over correctness of data
  • Schema on read vs. schema on write
• Hadoop and MapReduce processing – support for scale, but with limitations
  • No support for BI/ETL/IDE tools
  • Missing features (optimizer, indexes, views)
  • Slow due to MapReduce latency
  • No schemas
  • Competence gap
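The "schema on read vs. schema on write" point can be illustrated with a minimal sketch. It assumes a small comma-separated text file and a hypothetical three-column schema (`id`, `name`, `amount`); neither comes from the slides. The raw data is stored untouched, and column names and types are applied only when the data is read, so malformed values show up at query time rather than at load time.

```python
import csv
import io

# Raw file contents as they might land in HDFS: nothing is validated on write.
RAW = "1,alice,34.5\n2,bob,not-a-number\n3,carol,12.0\n"

# Hypothetical schema, declared only at read time (an assumption for this sketch).
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply column names and types while reading; unparseable values become None."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # bad data surfaces at query time, not load time
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
total = sum(r["amount"] for r in rows if r["amount"] is not None)
print(total)
```

A schema-on-write system would instead reject the second record when it is loaded, which is the "correctness" side of the agility trade-off listed above.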
What is SQL on Hadoop?
“A class of analytical application tools that
combine established SQL-style querying with
newer Hadoop data framework elements”
Classic Hive Architecture
[Architecture diagram; only its labels survive. Clients: Hive CLI and BI, ETL and IDE tools connecting over ODBC/JDBC, sending HiveQL queries. Hive Server: SQL to MapReduce compiler, rule based optimizer, MR plan execution coordinator. Hive Metastore: catalog metadata (table to files, table to format). Three datanodes, each with HDFS, Map/Reduce, Hive operators and Hive SerDes.]
Hive Classic
• HiveQL is a SQL-based language that runs on top of MapReduce
• Support for JDBC/ODBC
• Familiar SQL-like interface
• Economical processing for petabytes
• Logical schema – schema on read
• Missing features:
  • ANSI SQL
  • Cost-based optimizer
  • Data types
  • Security
• Hive Classic is tied to MapReduce processing, leading to latency overhead
→ Slow but OK for batch SQL
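To make the "SQL to MapReduce" idea concrete, here is a rough single-process sketch of how a grouped aggregate can be expressed as map, shuffle and reduce phases. The `sales(region, amount)` table and its rows are hypothetical, and this is not Hive's actual compiler output; it only shows the shape of the computation. In a real MapReduce job each phase runs as separate tasks with job start-up and intermediate data written to disk, which is where much of the latency comes from.

```python
from collections import defaultdict

# Hypothetical rows of a table sales(region, amount); assumed for this sketch.
ROWS = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


def map_phase(rows):
    """Emit (key, value) pairs: the GROUP BY column becomes the key."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (SUM here) to each group of values."""
    return {key: sum(values) for key, values in groups.items()}


# Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region
result = reduce_phase(shuffle_phase(map_phase(ROWS)))
print(result)
```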
SQL on Hadoop Landscape
[Landscape diagram; only some labels survive. Categories shown: MPP Query Engines, RDBMS on Hadoop, SQL outside Hadoop, Data Virtualization, Hive and Enhancements, Relying on PostgreSQL, Relying on HBase, Others. Tool names that survive: CitusDB, Impala, Apache Kylin, Big SQL, PolyBase, Vortex.]
Remarks on solution categories
1. MPP query engines
  • Don’t use MapReduce, or use it only in some specific cases
  • Usually able to use the Hive Metastore
  • Typically support ODBC and/or JDBC
2. RDBMS on Hadoop
  • RDBMS shares nodes with Hadoop and uses HDFS
  • Typically a proprietary file format for best performance
  • May require dedicated edge nodes
3. SQL outside Hadoop
  • RDBMS sends requests to Hadoop
  • Processing work is shared between Hadoop and the RDBMS
  • Performance depends on network and RDBMS load
4. Data Virtualization solutions
  • Typically use Hive and possibly other MPP query engines
  • Use cases similar to SQL outside Hadoop
Need for Speed: methods for getting better performance
1. Minimize I/O to get needed data
  • Partitioning and indexing
  • Using columnar file formats
  • Caching and in-memory processing
2. Query execution optimization
  • Better optimization plans, cost based optimization
  • Better query execution
  • Combining intermediate results more efficiently
  • Batch (MapReduce) vs. Streaming (Impala, HAWQ) vs. Hybrid (Spark) approaches
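As an illustration of the first method (minimizing I/O through partitioning), the following sketch models a table partitioned by date and counts how many rows a scan touches with and without a filter on the partition key. The table, the partition values and the `scan` helper are all hypothetical; real engines prune at the level of directories or files rather than Python dictionaries.

```python
# Hypothetical table partitioned by date: partition key -> rows (user, amount).
PARTITIONS = {
    "2015-01-01": [("alice", 10), ("bob", 20)],
    "2015-01-02": [("alice", 5)],
    "2015-01-03": [("carol", 7), ("bob", 1)],
}


def scan(partitions, date_filter=None):
    """Return matching rows and the number of rows that were actually read."""
    rows_read = 0
    out = []
    for date, rows in partitions.items():
        if date_filter is not None and date != date_filter:
            continue  # pruned: this partition is never opened
        for row in rows:
            rows_read += 1
            out.append((date,) + row)
    return out, rows_read


full, read_all = scan(PARTITIONS)
pruned, read_pruned = scan(PARTITIONS, date_filter="2015-01-02")
print(read_all, read_pruned)
```

The pruned scan answers the filtered query after reading a single row, because the filter value selects the partition before any data is opened.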
Example – Impala Architecture
File format selection
• Best performance with compressed and columnar data
• Trade-off: time and resources used for file format transformations
• Optimal performance and compression may require using proprietary file formats
• File format alternatives:
  • CSV, Avro, JSON
  • Columnar: ORC, Parquet, RCFile
  • Proprietary formats (e.g. Vertica, JethroData)
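The benefit of columnar, compressed data can be sketched with a toy comparison. The table below is invented for illustration, and the layouts are plain newline-separated text rather than ORC or Parquet, which use far more elaborate encodings and statistics. The sketch only shows two effects: a query on one column reads fewer bytes from a columnar layout, and values of a single column compress well because they resemble each other.

```python
import zlib

# Hypothetical table with a low-cardinality column and two numeric columns.
ROWS = [("FI" if i % 2 else "SE", i, i * 1.5) for i in range(1000)]

# Row-oriented layout: one text line per record (CSV-like).
row_layout = "\n".join(f"{c},{n},{x}" for c, n, x in ROWS).encode()

# Column-oriented layout: all values of one column are stored together.
col_layout = {
    "country": "\n".join(r[0] for r in ROWS).encode(),
    "id": "\n".join(str(r[1]) for r in ROWS).encode(),
    "value": "\n".join(str(r[2]) for r in ROWS).encode(),
}

# A query that needs only `country` must read every byte of the row layout,
# but only one column chunk of the columnar layout.
bytes_row_scan = len(row_layout)
bytes_col_scan = len(col_layout["country"])

# Values within a single column tend to be similar, so they compress well.
compressed_country = len(zlib.compress(col_layout["country"]))
print(bytes_row_scan, bytes_col_scan, compressed_country)
```

Producing the columnar copy is itself extra work, which is the transformation trade-off mentioned in the list above.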
Overview of a few interesting tools
Stinger Initiative
Mainly supported by: Hortonworks
• Open Source
Hive 0.13
• From MapReduce to Tez
• Optimized for ORC file format
• Decimal, varchar, date
• Interactive queries
Stinger.next (Hive 0.14)
• ACID transactions
• Cost based query optimization
Stinger Initiative, Hive 0.13
Stinger.next, Hive 0.14
Impala
Mainly supported by: Cloudera
• An open source distributed SQL engine that works directly with HDFS
• Architected specifically to leverage the flexibility and scalability strengths of Hadoop
• The query language is compatible with HiveQL, and Impala uses the same metadata store as Apache Hive
• Built-in functions:
  • Mathematical operations
  • String manipulation
  • Type conversion
  • Date operations
• Available interfaces:
  • A command line interface
  • Hue, the Hadoop management GUI
  • ODBC
  • JDBC
• Supported file types:
  • Text files
  • Hadoop sequence files
  • Avro
  • Parquet
• Open Source
• Operations that are not supported:
  • Hive user defined functions (UDFs)
  • Hive indexes
  • Deleting individual rows (overwriting data in tables and partitions is possible)
Impala 2.1 and Beyond (Ships in 2015)
• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions)
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration
SparkSQL
• “Next generation Shark”
• Able to use existing Hive metastores, SerDes, and UDFs
• Integrated APIs in Python, Scala and Java
• JDBC and ODBC connectivity
• In-memory column store data caching
• Good support for Parquet, JSON
• Open Source
Apache Drill
Mainly supported by: MapR
• Open source, low latency SQL query for Hadoop and NoSQL
• Agile:
  • Self-service data exploration capabilities on data stored in multiple formats in files or NoSQL databases
  • Metadata definitions in a centralized store are not required
• Flexible:
  • Hierarchical columnar representation of data allows high performance queries
  • Conforms to the ANSI SQL standards
  • ODBC connector is used when integrated with BI tools
• Open Source
HP Vertica
• Enterprise-ready SQL queries on Hadoop data
• Features included:
  • Database designer
  • Management console
  • Workload management
  • Flex tables
  • External tables
  • Backup functionality
• Grid-based, columnar DBMS for data warehousing
• Handles changing workloads as elastically as the cloud
• Replication, failover and recovery in the cloud are provided
• YARN not yet supported
• Data Model: Relational structured
• Transaction Model: ACID
• Commercial: Pay per term, per node
• Features that are not included:
  • Geospatial functions
  • Live aggregate projections
  • Time series analytics
  • Text search
  • Advanced analytics packages
JethroData
• Index based, all data is indexed
• Columnar proprietary file format
• Supports use with AWS S3 data
• Runs on dedicated edge nodes beside the Hadoop cluster
• Supports use with Qlik, Tableau or MicroStrategy through ODBC/JDBC
• Commercial
Demo with Qlik:
http://jethrodata.qlik.com/hub/stream/aaec8d41-5201-43ab-809f-3063750dfafd
Summary
Tool selection based on use case requirements
https://www.mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details
Tool Focus
http://www.datasalt.com/2014/04/sql-on-hadoop-state-of-the-art/
https://www.mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details
How to select the correct tool for your SQL-on-Hadoop use cases?
• Long list of alternatives to choose from
• Many tools claim to be the fastest on the market (based on self-run evaluation tests)
• SQL and data type support varies
→ It really depends on what your requirements for this capability are; at least for batch and interactive SQL use cases there are many tools that are probably “good enough”
→ Testing and evaluation with actual use cases is typically needed
→ Start with tools recommended or certified by your selected Hadoop distribution
→ Also consider the amount of concurrent use, not only single-query performance
→ SQL-on-Hadoop capabilities are usually not a differentiator that should guide the selection of a Hadoop distribution
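The advice to test with real use cases and to look at concurrent use can be turned into a small evaluation harness. The sketch below is only a shape for such a test: `run_query` is a stand-in that runs against an in-memory SQLite database so the example is self-contained, and it would need to be replaced by a call to the engine under evaluation (for example through its ODBC/JDBC driver). The query, the data and the concurrency levels are assumptions made for illustration.

```python
import sqlite3
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(sql):
    """Stand-in for a real engine call; each call uses its own in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", [(i % 10, i) for i in range(1000)])
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start  # latency of the query itself, in seconds
    finally:
        conn.close()


def benchmark(sql, concurrency, runs_per_client=5):
    """Submit the same query from several concurrent workers and collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_query, sql) for _ in range(concurrency * runs_per_client)]
        latencies = [f.result() for f in futures]
    return {
        "concurrency": concurrency,
        "queries": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


QUERY = "SELECT k, SUM(v) FROM t GROUP BY k"
results = [benchmark(QUERY, c) for c in (1, 4, 16)]
for r in results:
    print(r)
```

Comparing the median and maximum latency across concurrency levels, using your own queries and data, gives a more useful picture than a single-query timing published by a vendor.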
BIGDATAPUMP LTD
WWW.BIGDATAPUMP.COM
