White paper
Pivotal HD: HAWQ
A true SQL engine for Hadoop
gopivotal.com

Introduction
Today’s most successful companies use data to their advantage. The data are no longer easily quantifiable
facts, such as point-of-sale transaction data. Rather, these companies retain, explore, analyze, and
manipulate all the available information in their purview. Ultimately, they search for evidence of facts: insights
that lead to new business opportunities or that leverage their existing strengths. This is the business value
behind what is often referred to as Big Data.
Hadoop has proven to be a technology well suited to tackling
Big Data problems. The fundamental principle of the Hadoop
architecture is to move analysis to the data, rather than moving
the data to a system that can analyze it. Hadoop takes advantage
of advances in commodity hardware to scale out as a company’s
data grows.
There are two key technologies that allow users of Hadoop to
successfully retain and analyze data: HDFS and MapReduce.
HDFS is a simple but extremely powerful distributed file
system. It is able to store data reliably at significant scale. HDFS
deployments exist with thousands of nodes storing hundreds of
petabytes of user data.
MapReduce is a parallel programming framework that integrates
with HDFS. It allows users to express data analysis algorithms in
terms of a small number of functions and operators, chiefly a
map function and a reduce function.
The success of MapReduce is a testament to the robustness of
HDFS — both as a system to store and access data, and as an
application programming interface (API) for Big Data analysis
frameworks.
While MapReduce is convenient when performing scheduled
analysis or manipulation of data stored on HDFS, it is not suitable
for interactive use: it is too slow and lacks the expressive power
required.
SQL
SQL is the language of data analysis. No language is more
expressive, or more commonly used, for this task. Companies
that bet big on Hadoop — such as Facebook, Twitter and
especially Yahoo! — understood this immediately. Their solution
was Hive, a SQL-like query engine which compiles a limited
SQL dialect to MapReduce. While it addresses some of the
expressivity shortcomings of MapReduce, it compounds the
performance problem.
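For illustration, a minimal sketch of the kind of declarative aggregate
Hive turns into a MapReduce job; the table and column names here are
hypothetical, not from the original text:

    -- Count page views per URL. Hive compiles the GROUP BY into a
    -- map phase (emit each URL) and a reduce phase (count per URL).
    -- Table web_logs and its columns are hypothetical.
    SELECT page_url, COUNT(*) AS views
    FROM web_logs
    GROUP BY page_url;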
Many Hadoop users are effectively shut out of true SQL systems:
either it is hard to get data into them from a Hadoop system,
or they cannot scale anywhere near as well as Hadoop, or
they become prohibitively expensive as they reach the scale of
common Hadoop clusters.
HAWQ: the marriage of Hadoop and
parallel SQL database technology
HAWQ is a parallel SQL query engine that combines the key
technological advantages of the industry-leading Pivotal Analytic
Database with the scalability and convenience of Hadoop. HAWQ
reads data from and writes data to HDFS natively, delivering
industry-leading performance and linear scalability. It gives users
the tools to confidently and successfully interact with
petabyte-range data sets through a complete, standards-compliant
SQL interface.
By using the proven parallel database technology of the Pivotal
Analytic Database, HAWQ has been shown to be consistently tens
to hundreds of times faster than the other Hadoop query engines
on the market today.
HAWQ Architecture
HAWQ has been designed from the ground up as a massively
parallel SQL processing engine optimized specifically for analytics,
with full transaction support. HAWQ breaks complex queries into
small tasks and distributes them to query processing units for
execution. It is the only Hadoop query engine to combine a
world-class cost-based query optimizer, a leading-edge network
interconnect, a feature-rich SQL and analytics interface, and a
high-performance execution runtime with a transactional
storage subsystem.
The HAWQ architecture is shown in figure 1. HAWQ’s basic unit
of parallelism is the segment instance: multiple segment
instances on commodity servers work together to form a single
parallel query processing system. When a query is submitted
to the HAWQ master, it is optimized, broken into smaller
components, and dispatched to segments that work together to
deliver a single result set. All operations — such as table scans,
joins, aggregations, and sorts — execute in parallel across the
segments simultaneously. Data from upstream components are
transmitted to downstream components through the extremely
scalable UDP interconnect.
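As a concrete illustration, consider a join-plus-aggregate query of the
following shape; the tables are hypothetical, but each operator named in
the comments corresponds to work that runs on every segment at once:

    -- Submitted to the HAWQ master; the scan, join, aggregation, and
    -- sort each execute in parallel across all segment instances.
    -- Tables sales and customers are hypothetical.
    SELECT c.region, SUM(s.amount) AS revenue
    FROM sales s
    JOIN customers c ON s.cust_id = c.cust_id
    GROUP BY c.region
    ORDER BY revenue DESC;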
HAWQ has been designed to be resilient and high performing,
even when executing the most complex of queries. Operations
can succeed irrespective of the amount of data they operate
upon, because of HAWQ’s unique ability to operate on streams
of data through what we call dynamic pipelining.
HAWQ is designed to have no single point of failure. User data is
stored on HDFS, ensuring that it is replicated. HAWQ works with
HDFS to ensure that recovery from hardware failures is automatic
and online. Internally, system health is monitored continuously.
When a server failure is detected, the failed server is removed
from the cluster dynamically and automatically while the system
continues serving queries. Recovered servers can be added back to
the system on the fly. At the master, HAWQ uses its own metadata
replication system to ensure that metadata is highly available.
Comparison with other parallel database
architectures
HAWQ’s architecture differs from that of other parallel databases.
HAWQ is metadata driven, in that much of the behavior of the
system is governed by metadata registered through HAWQ. This is
transparent to the user. Metadata is managed by what is called the
Universal Catalog Service, a metadata store that resides at the
master. The Universal Catalog Service is replicated to ensure high
availability of the metadata. When a query plan is generated,
HAWQ dispatches the plan to workers in the Hadoop cluster, along
with a payload of the metadata they need to execute the plan.
This approach is well suited to any system seeking to scale to
thousands of nodes.
Significantly, HAWQ’s query optimizer generates the entire query
plan at the master; worker nodes in the cluster merely evaluate it.
Many other parallel databases instead split queries into global
operations and local operations, with the workers generating
query plans for the local operations only. While this approach is
technically easier, it misses many important query optimization
opportunities. Much of HAWQ’s performance advantage comes
from this whole-query view of optimization.
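To make the division of labor concrete, the sketch below shows the kind
of complete plan a master might dispatch. Both the query and the plan
shape are illustrative assumptions, modeled on the Greenplum-style
EXPLAIN output HAWQ descends from, not verbatim HAWQ output:

    -- Illustrative only; plan shape modeled on Greenplum-style EXPLAIN.
    EXPLAIN SELECT c.region, SUM(s.amount)
    FROM sales s JOIN customers c ON s.cust_id = c.cust_id
    GROUP BY c.region;
    -- Gather Motion (segments -> master)    : assemble the single result set
    --   -> HashAggregate                    : per-segment aggregation
    --     -> Redistribute Motion (hash key) : move rows between segments
    --       -> Hash Join                    : join runs locally on each segment
    --         -> Seq Scan on sales          : parallel table scan
    --         -> Hash
    --           -> Seq Scan on customers    : parallel table scan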
HAWQ is also very resilient in the face of challenges common
to Big Data. The query engine is able to intelligently buffer
data, spilling intermediate results to disk when necessary. For
performance, HAWQ spills this data to local disk rather than to
HDFS. The result is that HAWQ is able to perform joins, sorts, and
OLAP operations on data sets well beyond the total size of volatile
memory in the cluster.
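For example, a full-table sort of the following kind can complete even
when its intermediate results exceed cluster memory; the table name is
hypothetical:

    -- Sorting a table far larger than aggregate cluster memory:
    -- intermediate sort runs spill to each segment's local disk.
    -- Table clickstream is hypothetical.
    SELECT * FROM clickstream ORDER BY event_time;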
Figure 1: HAWQ architecture
Key Features and Benefits of HAWQ
Extreme Scale and One Storage System
• With HAWQ, SQL can scale on Hadoop. HAWQ is designed for
petabyte-range data sets.
• Data is stored directly on HDFS, providing all the convenience
of Hadoop.
Industry-Leading Performance with Dynamic Pipelining
• HAWQ’s cutting-edge query optimizer produces execution
plans that it believes will make optimal use of the cluster’s
resources, irrespective of the complexity of the query or the
size of the data.
• HAWQ uses dynamic pipelining technology to orchestrate the
execution of a query.
• Dynamic pipelining is a parallel data flow framework that
uniquely combines:
- An adaptive, high-speed UDP interconnect, developed for
and used on Pivotal’s 1,000-server Analytics Workbench.
- A runtime execution environment, tuned for Big Data
workloads, that implements the operations underlying
all SQL queries.
- A runtime resource management layer, which ensures that
queries complete even in the presence of very demanding
queries on heavily utilized clusters.
- A seamless data partitioning mechanism, which groups
together the parts of a data set that are often used in
any given query.
• Extensive performance studies show that HAWQ is orders
of magnitude faster than existing Hadoop query engines for
common Hadoop, analytics, and data warehousing workloads.
Elastic Fault Tolerance and Transaction Support
• HAWQ’s fault tolerance, reliability, and high availability features
tolerate disk-level and node-level failures.
• HAWQ supports transactions, a first for Hadoop. Transactions
allow users to isolate concurrent activity on Hadoop and to
roll back modifications when a fault occurs, as the sketch below
illustrates.
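A minimal sketch of such an isolated, revocable load, assuming
hypothetical staging and fact tables:

    -- Tables are hypothetical; nothing is visible to concurrent
    -- queries until COMMIT.
    BEGIN;
    INSERT INTO sales_fact
    SELECT * FROM sales_staging;
    -- If a fault or validation error occurs, undo the whole load:
    ROLLBACK;
    -- Otherwise, make it visible: COMMIT;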
A Toolkit for Data Management and Analysis
• HAWQ can coexist with MapReduce, HBase, and other
database technologies common in Hadoop environments.
• HAWQ supports traditional online analytic processing (OLAP)
as well as advanced machine learning capabilities, such as
supervised and unsupervised learning, inference, and regression.
• HAWQ supports GPXF, a unique extension framework that
allows users to easily interface HAWQ with custom formats and
other parts of the Hadoop ecosystem; a sketch follows this list.
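A minimal sketch of what reading external Hadoop data through such an
extension framework might look like. The DDL is modeled on the
Greenplum-style external tables HAWQ derives from; the gpxf:// URI,
host, path, and table definition are illustrative assumptions, not
documented GPXF syntax:

    -- Illustrative sketch only: DDL modeled on Greenplum-style
    -- external tables. The gpxf:// URI, host, and path are assumptions.
    CREATE EXTERNAL TABLE raw_events (
        event_id  bigint,
        payload   text
    )
    LOCATION ('gpxf://namenode:50070/data/events')
    FORMAT 'TEXT' (DELIMITER '|');

    -- Once defined, the external data is queryable like any other table:
    SELECT COUNT(*) FROM raw_events;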
True SQL Capabilities
• HAWQ’s cost-based query optimizer can effortlessly find
optimal query plans for the most demanding of queries, such
as queries with more than thirty joins.
• HAWQ is SQL standards compliant, including features such as
correlated subqueries, window functions, rollups and cubes, a
broad range of scalar and aggregate functions, and much more;
see the example after this list.
• HAWQ is ACID compliant.
• Users can connect to HAWQ from the most popular programming
languages, and it also supports ODBC and JDBC. This means
that most business intelligence, data analysis, and data
visualization tools work with HAWQ out of the box.
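For instance, the following standard SQL combines a correlated
subquery with a window function; the orders table and its columns are
hypothetical:

    -- Orders above each customer's own average, with each order's share
    -- of that customer's total. Table orders is hypothetical.
    SELECT o.cust_id,
           o.order_id,
           o.amount,
           o.amount / SUM(o.amount) OVER (PARTITION BY o.cust_id)
               AS share_of_customer
    FROM orders o
    WHERE o.amount > (SELECT AVG(i.amount)
                      FROM orders i
                      WHERE i.cust_id = o.cust_id);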
Performance Results
As part of the development of HAWQ, we selected customer
workloads that reflected both real-world and aspirational uses of
Hadoop, whether via MapReduce, Hive, or other technologies. We
used these workloads to validate the performance and scalability
of HAWQ.
Performance studies in isolation have little real-world meaning,
so HAWQ was benchmarked against Hive and a new parallel
SQL engine for Hadoop called Impala¹. To ensure a fair
comparison, all systems were deployed on the Pivotal Analytics
Workbench² (AWB). The AWB is configured to realize the best
possible performance from Hadoop³.
Of the workloads collected during research and development,
five queries and data sets were used for the performance
comparison. Two selection criteria were applied: the queries had
to reflect different uses of Hadoop today, and they had to
complete on Hive, Impala, and HAWQ⁴.
The performance experiment took place on a 60-node
deployment within the AWB. This size was selected because it
was the point at which we saw the most stable performance
from Impala. HAWQ has been verified for stable performance
well beyond 200 nodes.
¹ https://github.com/cloudera/impala/, in beta at the time of writing
² http://www.analyticsworkbench.com/
³ http://www.greenplum.com/sites/default/files/Greenplum-Analytics-Workbench-Whitepaper_0.pdf
⁴ Many queries in the aspirational class used in excess of eight joins, but these queries did not complete on Hive or Impala.
Query type          HAWQ (secs)   Hive (secs)   Speedup   Impala (secs)   Speedup
User intelligence   4.2           198           47x       37              9x
Sales analysis      8.7           161           19x       596             69x
Click analysis      2.0           415           208x      50              25x
Data exploration    2.7           1285          476x      55              20x
BI drill down       2.8           1815          648x      59              21x

Table 1: Performance results for five real-world queries on HAWQ, Hive,
and Impala
Results of the performance comparison are contained in table 1.
It is clear that HAWQ’s performance is remarkable when
compared to the other Hadoop options currently available.
One of the most important aspects of any technology running on
top of Hadoop is its ability to scale; absolute performance alone
is secondary. As such, an experiment was undertaken to verify
performance at 15, 30, and 60 nodes.
Table 2: Relative scalability of HAWQ and Impala
Table 2 shows the relative scalability of HAWQ and Impala. For
a data set of constant size, the run times of the queries detailed
in table 1 practically halved each time the number of servers was
doubled. HAWQ has been shown to continue to scale like this
beyond 200 nodes. Surprisingly, Impala’s performance was better
with 15 nodes than with 30 or 60: when the same amount of
data was given four times the computing resources, Impala took
30% longer to return results.
Conclusion
SQL is an extremely powerful way to manipulate and understand
data. Until now, Hadoop’s SQL capability has been limited and
impractical for many users. HAWQ is the new benchmark for SQL
on Hadoop — the most functionally rich, mature, and robust SQL
offering available. Through dynamic pipelining and the significant
innovations within the Greenplum Analytic Database, it provides
performance previously unattainable on Hadoop. HAWQ changes
everything.
Learn More
To learn more about our products, services, and solutions, visit us
at goPivotal.com.

Pivotal, 1900 S Norfolk Street, San Mateo, CA 94403. goPivotal.com

GoPivotal, Pivotal, and the Pivotal logo are registered trademarks or trademarks of GoPivotal, Inc. in the United States and other countries. All other trademarks used herein are the property of their respective owners.
© Copyright 2013 GoPivotal, Inc. All rights reserved. Published in the USA. PVTL-WP-201-04/13

At Pivotal, our mission is to enable customers to build a new class of applications, leveraging big and fast data, and to do all of this with the power of cloud independence.
Uniting selected technology, people, and programs from EMC and VMware, the following products and services are now part of Pivotal: Greenplum, Cloud Foundry, Spring,
GemFire and other products from the VMware vFabric Suite, Cetas, and Pivotal Labs.