HBaseConAsia2018
August 17, 2018
Gehua New Century Hotel, Beijing, China
HBase On Persistent Memory
Anoop Sam John, Ramkrishna S Vasudevan
hosted by
Content
01 HBase Present Model
02 Region Replica
03 Persistent Memory Technology
04 HBase On Persistent Memory
05 Performance Numbers
Apache HBase Present Model
▪ Accumulate data in memory
o Sorted map
o Cell data bytes in Local Allocation Buffers (LABs) in DRAM
o Each LAB is 2 MB
▪ Also write to the Write Ahead Log (WAL)
o Cell data is otherwise only in volatile RAM
o Used to recover from a server crash
o HDFS interaction adds latency (hsync vs hflush, HBASE-19024)
▪ Flush to HDFS as files when the memstore reaches its flush size
o 128 MB default flush size
▪ Replay the WAL on server crash
o Data unavailable until replay completes
o Large Mean Time To Recover (MTTR)
o Takes several minutes on a large cluster (a complaint from many users, e.g. Alibaba at HBaseConAsia, Huawei)
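The write path above can be sketched as a toy model. This is not HBase code: the class `WalBackedMemstoreSketch`, its fields, and the tiny flush threshold are hypothetical, a `ConcurrentSkipListMap` stands in for the memstore, and an in-memory list stands in for the WAL on HDFS. It only illustrates the ordering (log first, then sorted map), the size-triggered flush, and why a crash forces a replay of every unflushed edit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of a WAL-backed write path. Hypothetical names; not HBase code. */
final class WalBackedMemstoreSketch {
    /** Stand-in for the WAL on HDFS: an append-only list of {row, value} edits. */
    final List<String[]> wal = new ArrayList<>();
    /** Stand-in for the memstore: a sorted concurrent map. */
    final ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
    /** Stand-in for flushed files: one sorted snapshot per flush. */
    final List<ConcurrentSkipListMap<String, String>> flushedFiles = new ArrayList<>();

    private final long flushSizeBytes;
    private long memstoreBytes;

    WalBackedMemstoreSketch(long flushSizeBytes) {
        this.flushSizeBytes = flushSizeBytes;
    }

    void put(String row, String value) {
        wal.add(new String[] {row, value}); // 1. log the edit first
        apply(row, value);                  // 2. then update the sorted map
        if (memstoreBytes >= flushSizeBytes) {
            flush();                        // 3. flush once the threshold is reached
        }
    }

    private void apply(String row, String value) {
        memstore.put(row, value);
        memstoreBytes += row.getBytes(StandardCharsets.UTF_8).length
                + value.getBytes(StandardCharsets.UTF_8).length;
    }

    void flush() {
        flushedFiles.add(new ConcurrentSkipListMap<>(memstore));
        memstore.clear();
        memstoreBytes = 0;
        wal.clear(); // flushed edits no longer need replay
    }

    /** Simulates a crash (memstore lost) followed by WAL replay; returns edits replayed. */
    int crashAndReplay() {
        memstore.clear();
        memstoreBytes = 0;
        for (String[] edit : wal) {
            apply(edit[0], edit[1]);
        }
        return wal.size();
    }

    public static void main(String[] args) {
        WalBackedMemstoreSketch region = new WalBackedMemstoreSketch(16);
        region.put("row1", "aaaa");
        region.put("row2", "bbbb"); // 16 bytes accumulated: triggers a flush
        region.put("row3", "cccc"); // stays in the memstore and the WAL
        System.out.println("replayed edits: " + region.crashAndReplay());
    }
}
```

In this model the replay cost grows with the number of unflushed edits, which is the MTTR concern the slide raises for large memstores.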
Apache HBase Present Model
Region Replica for better availability (?)
▪ Read-only replica regions on other RegionServers (RSs)
o Refer to the same HFiles in HDFS
o Memstore data can also be replicated, using the WAL read replication path
o Eventual consistency
o Can read from replica regions when the primary is down
o No strong consistency guarantees
o Replica reads happen only when a Scan/Get asks for TIMELINE consistency (not the default)
o Better availability, but only for selected use cases: no strong consistency
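The difference between a default read and a TIMELINE read can be sketched with a toy model. Everything here is hypothetical and simplified: the class `TimelineReadSketch`, the `ReadResult` type and the single replica are illustrations only, and the enum merely mirrors the STRONG/TIMELINE distinction named on the slide. The sketch assumes a replica that lags the primary and is consulted only when the primary is down and the caller opted in.

```java
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of STRONG vs TIMELINE reads with one replica. Hypothetical names; not HBase code. */
final class TimelineReadSketch {
    enum Consistency { STRONG, TIMELINE }

    /** A value plus a flag saying whether a replica served it, so it may be stale. */
    static final class ReadResult {
        final String value;
        final boolean stale;

        ReadResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    final ConcurrentSkipListMap<String, String> primary = new ConcurrentSkipListMap<>();
    final ConcurrentSkipListMap<String, String> replica = new ConcurrentSkipListMap<>();
    boolean primaryOnline = true;

    /** Writes are accepted by the primary only. */
    void put(String row, String value) {
        primary.put(row, value);
    }

    /** Stand-in for asynchronous replication catching up with the primary. */
    void replicate() {
        replica.putAll(primary);
    }

    ReadResult get(String row, Consistency consistency) {
        if (primaryOnline) {
            return new ReadResult(primary.get(row), false);
        }
        if (consistency == Consistency.TIMELINE) {
            // The caller accepted possibly stale data, so the replica may answer.
            return new ReadResult(replica.get(row), true);
        }
        throw new IllegalStateException("primary is down and STRONG consistency was requested");
    }

    public static void main(String[] args) {
        TimelineReadSketch table = new TimelineReadSketch();
        table.put("row1", "v1");
        table.replicate();
        table.put("row1", "v2");      // not yet replicated
        table.primaryOnline = false;  // simulate a crash of the primary's server
        ReadResult r = table.get("row1", Consistency.TIMELINE);
        System.out.println(r.value + " stale=" + r.stale);
    }
}
```

The stale flag is what makes this acceptable only for selected use cases: the reader has to tolerate seeing an older value while the primary is unavailable.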
Persistent Memory Technology
Persistent Memory
o Data survives power cycles
o Accessed using memory APIs
o Processor load and store instructions
3D XPoint
o New NVM technology
o Stackable cross-gridded data access array
Intel Apache Pass (AEP)
o Persistent memory (pmem)
o Big, affordable and persistent
o Accessed like volatile memory, using processor load and store instructions
Library
o NVML (now called PMDK)
o Java wrappers around it
o Apache Mnemonic (used for the PoC)
o https://github.com/pmem/pcj
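The "accessed using memory APIs" idea can be approximated with plain JDK classes. The sketch below is an analogy only, under a stated assumption: it memory-maps an ordinary temporary file with `FileChannel.map` and uses it as a stand-in for a persistent memory region. It does not use PMDK, Mnemonic or pcj, and the class and method names are hypothetical. Stores go through buffer puts rather than write calls, `force()` asks for the data to be made durable, and a fresh mapping reads the value back.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Memory-mapped file as a stand-in for persistent memory. Hypothetical names. */
final class MappedPersistenceSketch {
    static final int REGION_BYTES = 4096;

    /** Creates a temporary backing file (a real deployment would use a pmem device instead). */
    static Path newBackingFile() {
        try {
            Path file = Files.createTempFile("pmem-sketch", ".bin");
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores a long at offset 0 of the mapped region and forces it to the backing store. */
    static void store(Path file, long value) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, REGION_BYTES);
            region.putLong(0, value); // a store into mapped memory, not a write() call
            region.force();           // request durability of the mapped contents
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the region again, as after a restart, and loads the value back. */
    static long load(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, 0, REGION_BYTES);
            return region.getLong(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = newBackingFile();
        store(file, 42L);
        System.out.println("value after remap: " + load(file));
    }
}
```

With real persistent memory the mapping would sit on the device itself, so the loads and stores would reach persistent media without a block-storage layer in between; this file-backed version only mimics the programming model.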
Apache HBase On Persistent Memory
▪ Accumulate data in memory
o Sorted map
o Cell data bytes in 2 MB Local Allocation Buffers
▪ Region replicas on other servers
o Replica regions feature is already in place
o Synchronous replication to replica regions
▪ No need to write to the Write Ahead Log (WAL)
o Cell data is in non-volatile AEP
▪ Server crash
o Fast switch to replica regions
o Consistent data in replica regions
o Full cluster down: data is in the non-volatile area, so the in-memory map can be recreated quickly
▪ HBASE-20003
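The "recreate the in-memory map" step can be sketched as follows. This is a toy under stated assumptions, not the PoC's design: a heap `ByteBuffer` stands in for a LAB chunk in persistent memory, the length-prefixed cell layout and the end-offset header are invented for illustration, and the class `PersistentLabSketch` is hypothetical. The point is that only the sorted index is volatile; after a restart it can be rebuilt by scanning the cell bytes, with no log to replay.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model: cell bytes in a persistent chunk, sorted index rebuilt by a scan. Hypothetical. */
final class PersistentLabSketch {
    /** The first int of the chunk holds the end offset of valid cell data. */
    static final int HEADER_BYTES = Integer.BYTES;

    final ByteBuffer chunk; // stand-in for a LAB chunk in persistent memory
    /** Volatile sorted index: key -> offset of the newest cell for that key. */
    final ConcurrentSkipListMap<String, Integer> index = new ConcurrentSkipListMap<>();

    PersistentLabSketch(ByteBuffer chunk) {
        this.chunk = chunk;
        if (chunk.getInt(0) == 0) {
            chunk.putInt(0, HEADER_BYTES); // fresh chunk: no cells yet
        }
    }

    /** Appends [keyLen][valueLen][key][value]; throws if the chunk is too small. */
    void put(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int offset = chunk.getInt(0);
        int p = offset;
        chunk.putInt(p, k.length);
        chunk.putInt(p + 4, v.length);
        p += 8;
        for (byte b : k) chunk.put(p++, b);
        for (byte b : v) chunk.put(p++, b);
        chunk.putInt(0, p); // advance the end offset last, so a partial cell is never visible
        index.put(key, offset);
    }

    String get(String key) {
        Integer offset = index.get(key);
        if (offset == null) return null;
        int kLen = chunk.getInt(offset);
        int vLen = chunk.getInt(offset + 4);
        return readString(chunk, offset + 8 + kLen, vLen);
    }

    /** Rebuilds the sorted index by scanning the cells already present in the chunk. */
    static PersistentLabSketch recover(ByteBuffer chunk) {
        PersistentLabSketch lab = new PersistentLabSketch(chunk);
        int end = chunk.getInt(0);
        int p = HEADER_BYTES;
        while (p < end) {
            int kLen = chunk.getInt(p);
            int vLen = chunk.getInt(p + 4);
            lab.index.put(readString(chunk, p + 8, kLen), p); // later cells win
            p += 8 + kLen + vLen;
        }
        return lab;
    }

    private static String readString(ByteBuffer buf, int offset, int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) bytes[i] = buf.get(offset + i);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer chunk = ByteBuffer.allocate(1024);
        PersistentLabSketch lab = new PersistentLabSketch(chunk);
        lab.put("row2", "b");
        lab.put("row1", "a");
        PersistentLabSketch recovered = PersistentLabSketch.recover(chunk); // index rebuilt from bytes
        System.out.println(recovered.index.keySet() + " row1=" + recovered.get("row1"));
    }
}
```

The sketch leaves out synchronous replication to replica regions, which the slide relies on for single-server crashes; it covers only the full-restart case where the local persistent data is scanned.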
Apache HBase On Persistent Memory
▪ More memory available: DRAM in the 100s of GBs, AEP even more
▪ More and more data in memory
o Large memstore size and global memstore size
• More data in memory
• Fewer flushes and compactions = less IO
• Subsequent reads can mostly get data from memory (?)
• Larger memstores => larger Java heap => longer GC pauses
• More cell entries in the CSLM (ConcurrentSkipListMap), so more compares for ordering => lower throughput
• Server down: more data in live WAL files to replay => higher MTTR
o Larger Java heap -> off-heap writes using off-heap memstores
o More cell entries in the CSLM -> Compacting Memstore work by Yahoo, new faster CSLM implementation by Alibaba
o More data in live WALs -> HBase on AEP with no WALs: persistent memstore LABs, instant rebuild of the memstore CSLM
Performance Results
▪ PerformanceEvaluation tool
o Write-only workload
o 4-node cluster
o 100 client threads
o 250 GB total data
o Single column per row
o WAL: 3 replicas (HDFS replicas)
o WALLess: primary and 2 replica regions
➢ Average throughput is more than 2x that of the with-WAL cases.
➢ Latency stays consistent throughout for the WALLess case. In the WAL case, latency varies as the amount of data to 'sync' increases (here again, latency is higher with 'fsync' than with 'hflush').
Performance Results
▪ PerformanceEvaluation tool
o Random reads
o 4-node cluster
o 100 client threads
o ZK session timeouts of 5 sec and 30 sec
o One RS node crashes during the PE run
➢ Max latency is 44x larger with WAL (ZK session timeout = 5 sec)
➢ Max latency is 104x larger with WAL (ZK session timeout = 30 sec)
Apache HBase On Persistent Memory
Project Status
▪ HBASE-20003
▪ https://docs.google.com/document/d/1sYJS9lMZa_EMhTTOJ7y_KzVUjXdBgdKdPWmsN5p2lH0/
▪ Write path and read path changes done in the PoC
▪ Pending: WAL-based features (inter-cluster replication, backup)
o Similar issue to that in HBASE-20951 (Ratis LogService backed WALs)
o Work with this project: HBase improvements for Cloud
▪ Testing: full cluster restart and rolling restart scenarios
▪ Load balancer, AM (AssignmentManager) stabilization
o Especially with region replicas. Many bugs; fixes in progress.
▪ http://pmem.io/
Thanks

HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
