Anil Gupta
Omkar Nalawade
06/18/2018
Assumptions:
• The audience has basic knowledge of HBase/Phoenix
• Actual performance improvement varies with your workload
• Due to time constraints, we cover only the most important tuning tips
Agenda:
• Data Architecture at TRUECar
• Use Cases for Apache HBase/Phoenix
• Performance Optimization Techniques
 Cluster Settings
 Table Settings
 Data Modelling
 Instance Type
Data Architecture at TRUECar
[Diagram: Storage Cluster and Compute Cluster]
Isolating the compute and storage clusters:
• Reduces interference between compute and storage jobs
• Lets us use different EC2 instance types for HBase and YARN
• Gives better consistency and debugging capability
Use Cases for Apache HBase/Phoenix
• Data store for historical data
• Data store for highly unstructured data (primarily HBase)
• Data store for semi-structured data (dynamic columns of Phoenix)
• In-memory cache for small datasets
• We try to denormalize data to avoid joins in HBase/Phoenix
Cluster Settings
• UPDATE_CACHE_FREQUENCY
  Default value is ALWAYS
  With ALWAYS, SYSTEM.CATALOG is queried for every instantiation of Statement/PreparedStatement
  This causes a hotspot in SYSTEM.CATALOG
  We set “phoenix.default.update.cache.frequency” to 120000 (milliseconds)
  Can also be set per table
  Saw a 5x performance improvement in some jobs
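The per-table form of this setting can be sketched as follows. This is a minimal illustration that only builds the statement text, assuming Phoenix's ALTER TABLE ... SET syntax for table properties; the helper name and the table name MY_SCHEMA.MY_TABLE are hypothetical.

```python
def update_cache_frequency_ddl(table: str, millis: int) -> str:
    """Build an ALTER TABLE statement that sets UPDATE_CACHE_FREQUENCY on one table.

    `millis` is how long a client may reuse cached table metadata before
    going back to SYSTEM.CATALOG.
    """
    if millis < 0:
        raise ValueError("millis must be non-negative")
    return f"ALTER TABLE {table} SET UPDATE_CACHE_FREQUENCY={millis}"


# MY_SCHEMA.MY_TABLE is a hypothetical table name used only for illustration.
statement = update_cache_frequency_ddl("MY_SCHEMA.MY_TABLE", 120000)
print(statement)
```

The generated statement would be run once through any Phoenix client; the cluster-wide default from the slide stays in effect for tables that do not override it.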
Table Settings
• Pre-splitting the table
• Pre-splitting the secondary index
• Bloom Filter
• Hints
• SMALL
• NO_CACHE
• IN_MEMORY
Pre-split! Pre-split! Pre-split!
• Without pre-splitting, Phoenix tables are seeded with 1 region
• Pre-splitting avoids hotspotting when writing data to new tables
• It leads to better distribution of table data across the cluster
• Significant performance improvement (a few times faster) during the initial data load of a table
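One way to pre-split is to compute evenly spaced split points and pass them in the table DDL. The sketch below assumes row keys begin with uniformly distributed lowercase hex characters (for example a hash prefix) and uses Phoenix's SPLIT ON clause; the table name, column names, and helper functions are hypothetical.

```python
from typing import List


def hex_split_points(num_regions: int, prefix_len: int = 2) -> List[str]:
    """Return num_regions - 1 evenly spaced hex prefixes to use as split points."""
    if num_regions < 2:
        return []
    keyspace = 16 ** prefix_len
    if num_regions > keyspace:
        raise ValueError("prefix_len is too short for this many regions")
    return [
        format(i * keyspace // num_regions, f"0{prefix_len}x")
        for i in range(1, num_regions)
    ]


def create_table_ddl(table: str, split_points: List[str]) -> str:
    """Build a CREATE TABLE statement that pre-splits the table at the given points."""
    quoted = ", ".join(f"'{point}'" for point in split_points)
    return (
        f"CREATE TABLE {table} (ID VARCHAR PRIMARY KEY, PAYLOAD VARCHAR) "
        f"SPLIT ON ({quoted})"
    )


points = hex_split_points(4)
print(points)  # ['40', '80', 'c0']
print(create_table_ddl("EXAMPLE_TABLE", points))
```

Split points only help if they match the real key distribution, so keys that are not hex-prefixed would need boundaries sampled from the actual data instead.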
Pre-splitting Global Secondary Index
• Global secondary index data is stored in another Phoenix table.
• Not pre-splitting the index table can lead to:
  A hotspot in the index table
  Slow writes to the primary table (even though it is pre-split)
Bloom Filter
• A lightweight in-memory structure that reduces the number of negative reads (lookups for data that is not in a store file)
• It can be enabled per column family:
  ROW (default): if the table doesn't have a lot of dynamic columns
  ROWCOL: if the table has lots of dynamic columns
We saw a 2x improvement in read performance on a table that had close to 40,000 dynamic columns
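To illustrate the idea, here is a toy Bloom filter. It is a generic sketch, not HBase's implementation: the class, its sizes, and the key formats are assumptions chosen for readability. A ROW-style filter is keyed on the row key alone, while a ROWCOL-style filter is keyed on the row key plus the column qualifier, so it can also rule out columns that a row does not have.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent", so the read can be skipped.
        return all(self.bits[position] for position in self._positions(key))


row_filter = BloomFilter()      # ROW style: keyed on the row key only
rowcol_filter = BloomFilter()   # ROWCOL style: keyed on row key + column
row_filter.add("19UDE2F30HA000958")
rowcol_filter.add("19UDE2F30HA000958/COLOR")

# The row exists, so a ROW filter cannot rule out a read for any of its columns.
print(row_filter.might_contain("19UDE2F30HA000958"))
# A ROWCOL filter can usually rule out a column that was never written.
print(rowcol_filter.might_contain("19UDE2F30HA000958/MILEAGE"))
```

With many dynamic columns, most rows carry only a small subset of them, which is where the finer-grained ROWCOL key tends to pay off.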
Hints
NO_CACHE
• Prevents query results from populating the HBase block cache
• Use it for ad hoc or nightly data exports
• Reduces unnecessary churn in the LRU cache
SMALL HINT
 Data set:
   Main table consists of 50 columns
   2 million rows
 Case 1: Secondary index without hint
   Secondary index on the main table to retrieve 2 columns
   CREATE INDEX TEST_IDX ON TEST_TABLE (COLUMN_1)
   Query: SELECT * FROM TEST_IDX WHERE COLUMN_1=100
   Performance: 10.44 ms/query
SMALL HINT
 Case 2: Covered index without hint
   Covered index to retrieve 2 columns
   CREATE INDEX TEST_IDX ON TEST_TABLE (COLUMN_1) INCLUDE (COLUMN_2, COLUMN_3)
   Query: SELECT COLUMN_2, COLUMN_3 FROM TEST_IDX WHERE COLUMN_1=100
   Performance: ~1.8 ms/query
SMALL HINT
 Case 3: Covered index with SMALL hint
   Covered index with SMALL hint to retrieve 2 columns
   Query: SELECT /*+ SMALL */ COLUMN_2, COLUMN_3 FROM TEST_IDX WHERE COLUMN_1=100
   Performance: ~1.2 ms/query
SMALL Hint: Performance
IN_MEMORY Option
• Use the IN_MEMORY option to cache small data sets
• Fast reads (single-digit milliseconds)
• We try to restrict the IN_MEMORY option to data sets smaller than 1 GB
• Don’t forget to pre-split the table
Data Modeling: Incremental Key
• Rows in Phoenix are sorted lexicographically by row key
• Sequential keys lead to hotspotting because of the non-uniform read/write pattern
• Common example: sequence IDs generated by an RDBMS
Data Modeling: Incremental Key
• Reversing the key
  Reversing the primary key randomizes the row keys
  Reversing works only if the data is read with point queries
  Range scans are not feasible with reversed keys
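Key reversal can be sketched as below. The fixed width of 10 digits and the function names are assumptions for illustration; the point is that consecutive sequence IDs end up with different leading characters, so they sort far apart.

```python
def reverse_key(sequence_id: int, width: int = 10) -> str:
    """Zero-pad a sequential ID to a fixed width and reverse its digits."""
    if sequence_id < 0:
        raise ValueError("sequence_id must be non-negative")
    padded = str(sequence_id).zfill(width)
    if len(padded) > width:
        raise ValueError("sequence_id does not fit in the given width")
    return padded[::-1]


def original_id(row_key: str) -> int:
    """Recover the sequence ID from a reversed row key (needed for point lookups)."""
    return int(row_key[::-1])


for sequence_id in (1000, 1001, 1002):
    print(sequence_id, "->", reverse_key(sequence_id))
# 1000 -> 0001000000
# 1001 -> 1001000000
# 1002 -> 2001000000
```

A point query reverses the ID the same way before the lookup. Neighbouring IDs are no longer adjacent in the table, which is why range scans stop being practical.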
Why Reversing key rather than Salting?
• Salting requires the number of buckets to be specified at table creation time
• The number of salt buckets stays the same even as the data size keeps growing
• Range scans are not feasible with salting either
Data Modelling: Read Most Recent Data
• Sample problem:
  We want to store the sales transactions of vehicles
  Applications want to read the latest sale data per vehicle (VIN)
  We can still do a range scan on the primary key prefix, i.e. VIN
Primary key: <(String) VIN><(long) epoch time at Jan-01-2100 00:00 - SaleDate>
Phoenix query to read the latest sale: SELECT * FROM vin_sales WHERE vin='x' LIMIT 1;
Data Modelling: Read Most Recent Data
Rowkey: VIN, SALE_DATE
Query: will need an ORDER BY on SALE_DATE

VIN                SALE_DATE
19UDE2F30HA000958  20170924
19UDE2F30HA000958  20180402

Rowkey: VIN, MILLIS_UNTIL_EPOCH
Query: SELECT ... WHERE vin='19UDE2F30HA000958' LIMIT 1

VIN                MILLIS_UNTIL_EPOCH  SALE_DATE
19UDE2F30HA000958  2609193660000       20180402
19UDE2F30HA000958  2609280060000       20170924
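The second key component can be computed as in the sketch below. It assumes the reference instant is midnight UTC on Jan 1, 2100 and that sale dates are yyyymmdd strings; the exact reference and time zone used in the talk are not stated, so the values produced here are not meant to reproduce the sample figures above.

```python
from datetime import datetime, timedelta, timezone

# Assumed reference instant: Jan 1, 2100 00:00 UTC.
REFERENCE = datetime(2100, 1, 1, tzinfo=timezone.utc)


def millis_until_reference(sale_date: str) -> int:
    """Milliseconds from a yyyymmdd sale date to the reference instant."""
    sale = datetime.strptime(sale_date, "%Y%m%d").replace(tzinfo=timezone.utc)
    return (REFERENCE - sale) // timedelta(milliseconds=1)


vin = "19UDE2F30HA000958"
# Row keys are (VIN, millis until reference); the sale date rides along as data.
rows = sorted(
    (vin, millis_until_reference(sale_date), sale_date)
    for sale_date in ("20170924", "20180402")
)
latest = rows[0]  # the first row in key order, i.e. what LIMIT 1 would return
print(latest[2])  # 20180402
```

Because a later sale date yields a smaller remaining interval, the newest sale sorts first within each VIN and no ORDER BY is needed.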
EC2 Instance Types
                            d2.xlarge             i3.2xlarge
Memory                      30.5 GB               61 GB
vCPUs                       4                     8
Instance storage            6 TB (spinning disk)  1.9 TB NVMe SSD (fastest disk)
Network performance         Moderate              Up to 10 Gigabit
Cost (On-Demand Instances)  $0.69/hr              $0.62/hr
Cost (Reserved Instances)   $0.40/hr              $0.43/hr
EC2 Instance Types
i3.2xlarge instances provided a 25-120% performance improvement in our jobs, mainly due to better disks, without a significant increase in cost.
Thanks & Questions
(PS:We are hiring!)
Tuning Apache Phoenix/HBase
