Phoenix
James Taylor
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/
We put the SQL back in NoSQL
https://github.com/forcedotcom/phoenix
In the dawn of time…
Relational Databases were invented
But we all know the problems folks ran into
And then there was HBase
And it was good
1. Horizontally scalable
2. Maintains data locality
3. Runs on commodity hardware
But somewhere, something terrible went wrong
1. It takes too much expertise to write an application
2. It takes too much code to do anything
3. Your application is tied too closely to your data model
What is Phoenix?
• SQL skin for HBase
• An alternate client API
• An embedded JDBC driver that allows you to run at HBase native speed
• Compiles your SQL into native HBase calls so you don’t have to!
Phoenix Performance
Why SQL for HBase?
• Broaden HBase adoption
• Give folks an API they already know
• Reduce the amount of code users need to write
    SELECT TRUNC(date,'DAY'), AVG(cpu_usage)
    FROM web_stat
    WHERE domain LIKE 'Salesforce%'
    GROUP BY TRUNC(date,'DAY')
• Performance optimizations transparent to the user
    • Aggregation
    • Skip Scan
    • Secondary indexing (soon!)
• Leverage existing tooling
    • SQL client/terminal
    • OLAP engine
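To illustrate the "less code" point, here is a minimal Python sketch of the aggregation logic that the four-line query expresses, run over in-memory rows rather than HBase. The function name `daily_avg_cpu`, the `(domain, date, cpu_usage)` row layout, and the sample values are assumptions made for illustration; a hand-written HBase client would additionally need scan setup and byte decoding.

```python
from collections import defaultdict
from datetime import datetime

def daily_avg_cpu(rows, prefix="Salesforce"):
    """Average cpu_usage per day for domains starting with prefix.

    rows is an iterable of (domain, date, cpu_usage) tuples (assumed layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum, count]
    for domain, when, cpu_usage in rows:
        if not domain.startswith(prefix):  # WHERE domain LIKE 'Salesforce%'
            continue
        # TRUNC(date,'DAY'): drop the time-of-day part
        day = when.replace(hour=0, minute=0, second=0, microsecond=0)
        totals[day][0] += cpu_usage
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in totals.items()}

# Made-up sample rows
sample = [
    ("Salesforce.com", datetime(2013, 6, 5, 10, 0), 10.0),
    ("Salesforce.com", datetime(2013, 6, 5, 18, 0), 30.0),
    ("Apple.com", datetime(2013, 6, 5, 12, 0), 99.0),
]
averages = daily_avg_cpu(sample)
```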
Example
Over metrics data for clusters of servers with a schema like this:
Server Metrics
Row Key
    HOST VARCHAR
    DATE DATE
Key Values
    RESPONSE_TIME INTEGER
    GC_TIME INTEGER
    CPU_TIME INTEGER
    IO_TIME INTEGER
    …
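Why the row key columns matter can be sketched in Python: HBase keeps rows sorted by row key bytes, so a composite key built from HOST and then DATE keeps each host's readings adjacent and in time order. The encoding below (UTF-8 host, a zero-byte separator, an 8-byte big-endian millisecond timestamp) is a simplifying assumption for illustration, and `encode_row_key` is a hypothetical helper; Phoenix's actual serialization may differ in detail. The year 2013 is also assumed, since the sample data only gives month and day.

```python
import struct
from datetime import datetime, timezone

def encode_row_key(host: str, when: datetime) -> bytes:
    """Hypothetical composite key: host bytes, zero separator, timestamp."""
    millis = int(when.timestamp() * 1000)
    # Big-endian unsigned 8-byte integer, so byte order matches time order
    # (this sketch only handles non-negative timestamps).
    return host.encode("utf-8") + b"\x00" + struct.pack(">Q", millis)

rows = [
    ("sf3.s1", datetime(2013, 6, 5, 10, 10, 10, tzinfo=timezone.utc)),
    ("sf1.s1", datetime(2013, 6, 5, 11, 18, 28, tzinfo=timezone.utc)),
    ("sf1.s1", datetime(2013, 6, 5, 10, 10, 10, tzinfo=timezone.utc)),
]
# Sorting the encoded keys mimics the order in which HBase would store them.
sorted_keys = sorted(encode_row_key(host, when) for host, when in rows)
```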
Example
With 90 days of data that looks like this:
SERVER METRICS
HOST DATE RESPONSE_TIME GC_TIME
sf1.s1 Jun 5 10:10:10.234 1234
sf1.s1 Jun 5 11:18:28.456 8012
…
sf3.s1 Jun 5 10:10:10.234 2345
sf3.s1 Jun 6 12:46:19.123 2340
sf7.s9 Jun 4 08:23:23.456 5002 1234
…
Example
Walk through query processing for three scenarios
1. Chart Response Time Per Cluster
2. Identify 5 Longest GC Times
3. Identify 5 Longest GC Times again and again
Scenario 1
Chart Response Time Per Cluster
SELECT substr(host, 1, 3), trunc(date, 'DAY'),
min(response_time), max(response_time)
FROM server_metrics
WHERE date > CURRENT_DATE() - 7
AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
GROUP BY substr(host, 1, 3), trunc(date, 'DAY')
Step 1: Client
Identify Row Key Ranges from Query
Row Key Ranges
HOST DATE
sf1 t1 – *
sf3 t1 – *
sf7 t1 – *
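A minimal Python sketch of this step, assuming the client turns each host prefix from the IN list into a half-open key range and the date predicate into a lower bound with no upper bound (the `t1 – *` entry). The helper names `prefix_upper_bound`, `host_prefix_ranges`, and `date_lower_bound` are hypothetical, and the sketch assumes non-empty prefixes whose last character can be incremented.

```python
from datetime import datetime, timedelta

def prefix_upper_bound(prefix: str) -> str:
    """Smallest string that sorts after every string starting with prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def host_prefix_ranges(prefixes):
    """One half-open [start, stop) host range per prefix, in key order."""
    return [(p, prefix_upper_bound(p)) for p in sorted(prefixes)]

def date_lower_bound(now: datetime, days: int = 7) -> datetime:
    """t1 in the slide: rows must be newer than now minus the given days."""
    return now - timedelta(days=days)

ranges = host_prefix_ranges(["sf7", "sf1", "sf3"])
t1 = date_lower_bound(datetime(2013, 6, 10))  # assumed "current date"
```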
Step 2: Client
Overlay Row Key Ranges with Regions
[Diagram: regions R1, R2, R3, R4 with region boundary labels sf1, sf4, sf6, overlaid with the row key ranges sf1, sf3, sf7]
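The overlay can be sketched in Python as an intersection of key ranges with region boundaries. The region layout below is an assumption read off the diagram labels (R1 holds keys below `sf1`, R2 starts at `sf1`, R3 at `sf4`, R4 at `sf6`), and `regions_for_range` is a hypothetical helper, not an HBase or Phoenix API.

```python
from bisect import bisect_right

# Assumed region start keys; "" means "from the beginning of the table".
REGION_STARTS = ["", "sf1", "sf4", "sf6"]
REGION_NAMES = ["R1", "R2", "R3", "R4"]

def regions_for_range(start: str, stop: str):
    """Names of the regions that intersect the half-open range [start, stop)."""
    # The last region whose start key is <= start contains the range start.
    first = bisect_right(REGION_STARTS, start) - 1
    hits = [REGION_NAMES[first]]
    for i in range(first + 1, len(REGION_STARTS)):
        if REGION_STARTS[i] >= stop:
            break  # this region begins at or after the end of the range
        hits.append(REGION_NAMES[i])
    return hits

key_ranges = [("sf1", "sf2"), ("sf3", "sf4"), ("sf7", "sf8")]
plan = {start: regions_for_range(start, stop) for start, stop in key_ranges}
```

Under these assumed boundaries, regions R1 and R3 hold no keys of interest and never need to be scanned.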
Step 3: Client
Execute Parallel Scans
[Diagram: regions R1, R2, R3, R4 with region boundary labels sf1, sf4, sf6; the row key ranges sf1, sf3, sf7 are executed as scan1, scan2, scan3]
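As a rough Python sketch of the idea, each key range can be handed to its own worker and the results collected in range order. The in-memory `TABLE`, the `scan` function, and `parallel_scans` are stand-ins invented for illustration; they are not the HBase client API.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy sorted table of (host, day) keys standing in for region data.
TABLE = sorted([
    ("sf1.s1", 5), ("sf1.s2", 9), ("sf3.s1", 1), ("sf4.s1", 2), ("sf7.s9", 8),
])

def scan(start, stop):
    """Return rows with start <= host < stop, in key order."""
    return [row for row in TABLE if start <= row[0] < stop]

def parallel_scans(ranges):
    """Run one scan per key range concurrently; results keep the range order."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # Executor.map yields results in the order of the input ranges.
        return list(pool.map(lambda r: scan(*r), ranges))

results = parallel_scans([("sf1", "sf2"), ("sf3", "sf4"), ("sf7", "sf8")])
```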
Step 4: Server
Filter using Skip Scan
sf1.s1 t0 SKIP
sf1.s1 t1 INCLUDE
sf1.s2 t0 SKIP
sf1.s2 t1 INCLUDE
sf1.s3 t0 SKIP
sf1.s3 t1 INCLUDE
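The SKIP/INCLUDE sequence can be sketched in Python over a sorted list of `(host, t)` keys: whenever a row falls before `t1`, the scan seeks straight to the first key of that host at or after `t1` instead of reading every row in between. This is a simplified model, not Phoenix's filter implementation; `skip_scan` is a hypothetical name and integer timestamps stand in for dates.

```python
from bisect import bisect_left

def skip_scan(keys, host_start, host_stop, t1):
    """Return keys with host in [host_start, host_stop) and t >= t1.

    keys must be a sorted list of (host, t) tuples.
    """
    included = []
    i = bisect_left(keys, (host_start,))  # seek to the start of the host range
    while i < len(keys) and keys[i][0] < host_stop:
        host, t = keys[i]
        if t < t1:
            # SKIP: jump to the first key of this host at or after t1.
            i = bisect_left(keys, (host, t1), i)
        else:
            # INCLUDE: the row lies inside the date range.
            included.append(keys[i])
            i += 1
    return included

t0, t1 = 0, 1
keys = sorted([
    ("sf1.s1", t0), ("sf1.s1", t1), ("sf1.s2", t0), ("sf1.s2", t1),
    ("sf1.s3", t0), ("sf1.s3", t1), ("sf2.s1", t1),
])
kept = skip_scan(keys, "sf1", "sf2", t1)
```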
Step 5: Server
Intercept Scan in Coprocessor
SERVER METRICS (rows scanned)
HOST DATE
sf1.s1 Jun 2 10:10:10.234
sf1.s2 Jun 3 23:05:44.975
sf1.s2 Jun 9 08:10:32.147
sf1.s3 Jun 1 11:18:28.456
sf1.s3 Jun 3 22:03:22.142
sf1.s4 Jun 1 10:29:58.950
sf1.s4 Jun 2 14:55:34.104
sf1.s4 Jun 3 12:46:19.123
sf1.s5 Jun 8 08:23:23.456
sf1.s6 Jun 1 10:31:10.234
SERVER METRICS (grouped by cluster and day)
HOST DATE
sf1 Jun 1
sf1 Jun 2
sf1 Jun 3
sf1 Jun 8
sf1 Jun 9
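A minimal Python sketch of the server-side grouping, assuming the coprocessor collapses the scanned rows into one row per (cluster, day) with the running minimum and maximum response time. `aggregate_region` is a hypothetical name, days are plain integers, and the response times are made up, since the slide shows only HOST and DATE.

```python
def aggregate_region(rows):
    """Collapse (host, day, response_time) rows into per-group min and max.

    Returns a list of (cluster, day, min_rt, max_rt) sorted by (cluster, day).
    """
    groups = {}
    for host, day, rt in rows:
        key = (host[:3], day)  # substr(host, 1, 3), trunc(date, 'DAY')
        lo, hi = groups.get(key, (rt, rt))
        groups[key] = (min(lo, rt), max(hi, rt))
    return [(c, d, lo, hi) for (c, d), (lo, hi) in sorted(groups.items())]

# Hosts and days follow the slide; response times are invented.
scanned = [
    ("sf1.s1", 2, 120), ("sf1.s2", 3, 90), ("sf1.s2", 9, 70),
    ("sf1.s3", 1, 200), ("sf1.s3", 3, 150), ("sf1.s4", 1, 80),
]
partial = aggregate_region(scanned)
```

Only the grouped rows travel back to the client, which is typically far less data than the raw scan.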
Step 6: Client
Perform Final Merge Sort
[Diagram: results of scan1, scan2, scan3 over regions R1, R2, R3, R4 are merged on the client]
SERVER METRICS
HOST DATE
sf1 Jun 5
sf1 Jun 9
sf3 Jun 1
sf3 Jun 2
sf7 Jun 1
sf7 Jun 8
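The final merge can be sketched in Python with `heapq.merge`, assuming every scan returns its partial aggregates already sorted by (cluster, day). Groups that appear in more than one scan are combined by taking the overall minimum and maximum. `merge_partials` is a hypothetical name and the numbers are made up.

```python
import heapq
from itertools import groupby

def merge_partials(partials):
    """Merge sorted per-scan lists of (cluster, day, min_rt, max_rt) rows."""
    merged = heapq.merge(*partials)  # streaming merge of the sorted inputs
    out = []
    for (cluster, day), group in groupby(merged, key=lambda r: (r[0], r[1])):
        group = list(group)
        out.append((cluster, day,
                    min(g[2] for g in group),
                    max(g[3] for g in group)))
    return out

# Invented per-scan results; ("sf1", 9) appears in two scans on purpose.
scan1 = [("sf1", 5, 100, 200), ("sf1", 9, 50, 60)]
scan2 = [("sf1", 9, 40, 55), ("sf3", 1, 10, 20)]
scan3 = [("sf7", 1, 5, 6)]
final = merge_partials([scan1, scan2, scan3])
```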
Scenario 2
Find 5 Longest GC Times
SELECT host, date, gc_time
FROM server_metrics
WHERE date > CURRENT_DATE() - 7
AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
ORDER BY gc_time DESC
LIMIT 5
• Same client parallelization and server skip scan filtering
• Server holds the 5 longest GC_TIME values for each scan
R2
SERVER METRICS
HOST DATE GC_TIME
sf1.s1 Jun 2 10:10:10.234 22123
sf1.s1 Jun 3 23:05:44.975 19876
sf1.s1 Jun 9 08:10:32.147 11345
sf1.s2 Jun 1 11:18:28.456 10234
sf1.s2 Jun 3 22:03:22.142 10111
• Client performs final merge sort among parallel scans
Scan1, Scan2, Scan3
SERVER METRICS
HOST DATE GC_TIME
sf1.s1 Jun 2 10:10:10.234 25865
sf1.s1 Jun 3 23:05:44.975 22123
sf1.s1 Jun 9 08:10:32.147 20176
sf1.s2 Jun 1 11:18:28.456 19876
sf1.s2 Jun 3 22:03:22.142 17111
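The two-level top-N can be sketched in Python: each scan keeps only its own five largest GC_TIME rows, and the client merges those short lists and keeps the first five overall. The function names are hypothetical. The GC_TIME values come from the slides, but their assignment to hosts, days, and scans here is invented for illustration.

```python
import heapq
from itertools import islice

def server_top_n(rows, n=5):
    """Per scan: keep the n rows with the largest gc_time, in descending order."""
    return heapq.nlargest(n, rows, key=lambda r: r[2])

def client_top_n(per_scan, n=5):
    """Merge the descending per-scan lists and keep the overall top n."""
    merged = heapq.merge(*per_scan, key=lambda r: r[2], reverse=True)
    return list(islice(merged, n))

# Rows are (host, day, gc_time); the grouping into scans is assumed.
scan_a = [("sf1.s1", 2, 22123), ("sf1.s1", 3, 19876), ("sf1.s1", 9, 11345),
          ("sf1.s2", 1, 10234), ("sf1.s2", 3, 10111), ("sf1.s2", 4, 9000)]
scan_b = [("sf3.s1", 5, 25865), ("sf3.s1", 6, 17111)]
scan_c = [("sf7.s9", 4, 20176), ("sf7.s9", 5, 5002)]

top5 = client_top_n([server_top_n(s) for s in (scan_a, scan_b, scan_c)])
```

Each scan returns at most five rows, so the client never has to sort the full result set.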
Scenario 3
Find 5 Longest GC Times
CREATE INDEX gc_time_index
ON server_metrics (gc_time DESC, date DESC)
INCLUDE (host, response_time)
Row Key
Server Metrics GC Time Index
GC_TIME INTEGER
DATE DATE
HOST VARCHAR
RESPONSE_TIME INTEGER
SELECT host, date, gc_time
FROM server_metrics
WHERE date > CURRENT_DATE() - 7
AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
ORDER BY gc_time DESC
LIMIT 5
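Why the index helps when this query runs again and again can be sketched in Python: if index rows are stored in `gc_time DESC, date DESC` order and also carry `host`, the query can walk the index from the top, apply the remaining filters, and stop after five matches, with no sort step. This is a simplified in-memory model under those assumptions; `build_index` and `top_gc_from_index` are hypothetical names, days are integers, and the data is made up.

```python
from itertools import islice

def build_index(rows):
    """Order (host, day, gc_time, response_time) rows by gc_time DESC, day DESC."""
    return sorted(rows, key=lambda r: (-r[2], -r[1]))

def top_gc_from_index(index, min_day, clusters, n=5):
    """Walk the index in order and stop after n rows that pass the filters."""
    matches = (r for r in index
               if r[1] > min_day and r[0][:3] in clusters)
    return list(islice(matches, n))  # lazy: stops reading once n rows are found

rows = [
    ("sf1.s1", 5, 900, 10), ("sf2.s1", 6, 5000, 11), ("sf3.s1", 1, 4000, 12),
    ("sf3.s1", 7, 800, 13), ("sf7.s9", 6, 950, 14),
]
index = build_index(rows)
top2 = top_gc_from_index(index, min_day=3, clusters={"sf1", "sf3", "sf7"}, n=2)
```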
Phoenix Roadmap
• Secondary Indexing
• Hash Joins
• Apache Drill integration
• Count distinct and percentile
• Derived tables
    • SELECT * FROM (SELECT * FROM t)
• Cost-based query optimizer
• OLAP extensions
    • WINDOW, PARTITION OVER, RANK
• Monitoring and management
• Transactions
Thank you!
Questions/comments?

HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
