HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
About Us
Liang Xie, Honghua Feng
Outline
 Introduction
 Latency practice
 Some patches we contributed
 Some ongoing patches
 Q&A
About Xiaomi
 Mobile internet company founded in 2010
 Sold 18.7 million phones in 2013
 Over $5 billion revenue in 2013
 Sold 11 million phones in Q1, 2014
Hardware
Software
Internet Services
About Our HBase Team
 Founded in October 2012
 5 members
 Liang Xie
 Shaohui Liu
 Jianwei Cui
 Liangliang He
 Honghua Feng
 Resolved 130+ JIRAs so far
Our Clusters and Scenarios
 15 Clusters : 9 online / 2 processing / 4 test
 Scenarios
 MiCloud
 MiPush
 MiTalk
 Perf Counter
Our Latency Pain Points
 Java GC
 Stable page write in OS layer
 Slow buffered IO (FS journal IO)
 Read/Write IO contention
HBase GC Practice
 Bucket cache with off-heap mode
 Xmn / SurvivorRatio / MaxTenuringThreshold
 PretenureSizeThreshold & replication source buffer size
 GC concurrent thread number
GC time per day: [2500, 3000]s -> [300, 600]s !!!
Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <Stuck here often!>
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket() :
->receiveNextPacket
->mirrorPacketTo(mirrorOut) //write packet to the mirror
->out.write/flush //write data to local disk. <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled local write was the culprit; strace results also confirmed it (a sketch of this kind of timing check follows below)
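HDFS-6110 added timing around the DataNode's local write path. The snippet below is a minimal sketch of that kind of instrumentation, illustrative only and not the actual HDFS code; the 100 ms threshold matches the one used on the testing slide later.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Minimal sketch of write-latency instrumentation in the spirit of HDFS-6110
 *  (illustrative only, not the actual DataNode code). */
class SlowWriteDetector {
  private static final long THRESHOLD_MS = 100;  // flag buffered writes slower than 100 ms

  static void timedWrite(OutputStream out, byte[] buf, int off, int len) throws IOException {
    long start = System.nanoTime();
    out.write(buf, off, len);                    // buffered IO: usually fast, but it can stall
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    if (elapsedMs > THRESHOLD_MS) {
      System.err.println("slow buffered write: " + elapsedMs + " ms for " + len + " bytes");
    }
  }
}
```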
Root Cause of Write Latency Spikes
 write() is expected to be fast
 But it is sometimes blocked by page write-back!
Stable Page Write Issue Workaround
Workaround: move the kernel off 2.6.32-279 (RHEL 6.3), either back to 2.6.32-220 (RHEL 6.2)
or forward to 2.6.32-358 (RHEL 6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS on recent kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios such as HBase on HDFS
Write Latency Spikes Testing
Setup: 8 YCSB threads; 20 million rows written, each 3 x 200 bytes; 3 DataNodes; kernel 3.12.17
Counted the stalled write() calls that took more than 100 ms
The largest write() latency on ext4: ~600 ms!
Hedged Read (HDFS-5776)
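HDFS-5776 makes the DFSClient issue a second, speculative read against another replica when the first one is slow. Below is a minimal sketch of enabling it from the HBase client side, assuming a Hadoop 2.4+ DFSClient; the two keys are the standard hedged-read settings, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Thread pool used for speculative (hedged) reads; 0 disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica has not answered within this many ms, read another replica in parallel.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
    return conf;
  }
}
```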
Other Meaningful Latency Work
 Long first “put” issue (HBASE-10010)
 Token invalid (HDFS-5637)
 Retry/timeout setting in DFSClient
 Reduce write traffic? (HLog compression)
 HDFS IO Priority (HADOOP-10410)
Wish List
 Real-time HDFS, especially priority-related work
 GC-friendly core data structures
 More off-heap; Shenandoah GC
 TCP / disk IO characteristic analysis
We need more eyes on the OS layer
Stay tuned…
Some Patches Xiaomi Contributed
 New write thread model (HBASE-8755)
 Reverse scan (HBASE-4811)
 Per table/CF replication (HBASE-8751)
 Block index key optimization (HBASE-7845)
1. New Write Thread Model

Old model: 256 WriteHandler threads, each doing everything itself: append to the local buffer, write to HDFS, and sync to HDFS
Problem: because every WriteHandler does everything, there is severe lock contention!
New Write Thread Model

New model: 256 WriteHandlers append to a local buffer; 1 AsyncWriter writes to HDFS; 4 AsyncSyncer threads sync to HDFS; 1 AsyncNotifier notifies the waiting writers
(a minimal sketch of the single-writer idea follows below)
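A minimal sketch of the single-writer idea, under a deliberately simplified structure; the real HBASE-8755 patch additionally splits syncing and notification into separate AsyncSyncer and AsyncNotifier threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative sketch only, not the actual HBase code: handlers enqueue edits,
 *  a single writer thread batches them, then pays the sync cost once per batch. */
class BatchedWalSketch {
  private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

  /** Called concurrently by many WriteHandler threads. */
  void append(byte[] edit) throws InterruptedException {
    queue.put(edit);
  }

  /** Run by one background thread: batch, write, sync. */
  void writerLoop() throws InterruptedException {
    List<byte[]> batch = new ArrayList<>();
    while (true) {
      batch.add(queue.take());   // block until at least one edit is available
      queue.drainTo(batch);      // grab whatever else has piled up
      for (byte[] edit : batch) {
        writeToHdfs(edit);       // buffered write of each edit
      }
      syncToHdfs();              // one expensive sync for the whole batch
      batch.clear();
    }
  }

  private void writeToHdfs(byte[] edit) { /* e.g. SequenceFile writer append */ }
  private void syncToHdfs()             { /* e.g. DFSOutputStream hflush/hsync */ }
}
```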
New Write Thread Model
 Low load : No improvement
 Heavy load : Huge improvement (3.5x)
2. Reverse Scan
[Diagram: three sorted KV sources (memstore / HFiles), each holding an interleaved subset of the key-values of Row1–Row6]
1. All scanners seek to their 'previous' row (SeekBefore)
2. Figure out the next row: the max of those 'previous' rows
3. All scanners seek to the first KV of that next row (SeekTo)
Performance: about 70% of forward scan (client usage is sketched below)
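On the client side, reverse scan is exposed through Scan#setReversed (available since HBase 0.98). A minimal usage sketch, assuming a table named 't1' and an HBase 1.x-style client API; rows come back in descending order starting from the given start row.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("Row5"));  // the scan walks backwards from this row
      scan.setReversed(true);                   // HBASE-4811
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}
```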
3. Per Table/CF Replication

Source cluster: T1 (cfA, cfB), T2 (cfX, cfY); PeerA is a full backup, PeerB only needs T2:cfX
 PeerB creates T2 only: replication can't work!
 PeerB creates both T1 & T2: all the data gets replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication

Source cluster: T1 (cfA, cfB), T2 (cfX, cfY)
PeerA receives everything (T1:cfA,cfB; T2:cfX,cfY); PeerB receives only T2:cfX
 add_peer 'PeerA', 'PeerA_ZK'
 add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
4. Block Index Key Optimization

Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world"
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (a fake key chosen so that k1 < key <= k2)
 Reduces block index size
 Saves seeking into the previous block when the search key falls in ["ac", "ah, hello world")
(a minimal sketch of deriving such a fake key follows below)
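The core of the optimization is computing the shortest "fake" key that sorts after the last key of Block 1 and no later than the first key of Block 2. A minimal byte-level sketch of that idea, illustrative only; the real HBASE-7845 code operates on full HBase keys rather than raw byte arrays.

```java
import java.util.Arrays;

/** Minimal sketch of deriving a short separator key k with k1 < k <= k2
 *  (illustrative only; not the actual HBASE-7845 implementation). */
class ShortSeparator {
  static byte[] separator(byte[] k1, byte[] k2) {
    int i = 0, min = Math.min(k1.length, k2.length);
    while (i < min && k1[i] == k2[i]) i++;            // length of the common prefix
    if (i < min && (k1[i] & 0xff) + 1 < (k2[i] & 0xff)) {
      byte[] sep = Arrays.copyOf(k1, i + 1);          // common prefix + one more byte
      sep[i]++;                                       // "ab" vs "ah, hello world" -> "ac"
      return sep;
    }
    return k2;                                        // fall back to Block 2's real first key
  }
}
```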
Some Ongoing Patches

 Cross-table, cross-row transactions (HBASE-10999)
 HLog compactor (HBASE-9873)
 Adjusted delete semantics (HBASE-8721)
 Coordinated compaction (HBASE-9528)
 Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis

http://github.com/xiaomi/themis
 Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
 Two-phase commit: strong cross-table / cross-row consistency
 Global timestamp server: globally, strictly incremental timestamps
 No changes to HBase internals: built on the HBase client and coprocessors
 Performance: read 90%, write 23% (the same downgrade as Google Percolator reports)
 More details: HBASE-10999
(a hypothetical two-phase-commit sketch follows below)
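For illustration only, here is a hypothetical sketch of what a Percolator-style two-phase commit looks like from the client's point of view. The TimestampOracle and TxnStore names are invented for this sketch and are not the actual Themis API; see the repository above for the real one.

```java
/** Hypothetical sketch of a Percolator-style two-phase commit (illustrative names only). */
public class TwoPhaseCommitSketch {
  interface TimestampOracle { long next(); }          // global strictly incremental timestamps

  interface TxnStore {
    void prewrite(byte[] row, byte[] value, long startTs, byte[] primaryRow);  // phase 1: lock + write
    void commit(byte[] row, long startTs, long commitTs);                      // phase 2: make visible
  }

  static void writeTwoRows(TimestampOracle oracle, TxnStore store,
                           byte[] rowA, byte[] rowB, byte[] value) {
    long startTs = oracle.next();                     // all writes share one start timestamp
    store.prewrite(rowA, value, startTs, rowA);       // rowA acts as the primary row
    store.prewrite(rowB, value, startTs, rowA);       // secondary rows point at the primary lock
    long commitTs = oracle.next();                    // commit timestamp from the same oracle
    store.commit(rowA, startTs, commitTs);            // committing the primary decides the txn
    store.commit(rowB, startTs, commitTs);            // secondaries can then be committed
  }
}
```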
2. HLog Compactor

Problem: Region x gets few writes, but its entries scatter across many HLogs (HLog 1, 2, 3), so none of those HLogs can be archived
PeriodicMemstoreFlusher flushes old memstores forcefully, but:
 'flushCheckInterval' / 'flushPerChanges' are hard to configure
 Forced flushes result in 'tiny' HFiles
 HBASE-10499: a problematic region can't be flushed at all!
HLog Compactor

 Compact: HLog 1,2,3,4 -> HLog x (rewrite only the entries still held in some memstore)
 Archive: HLog 1,2,3,4
(a hypothetical sketch of the compaction step follows below)
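A hypothetical sketch of the compaction step; the Wal and LogEntry types below are illustrative stand-ins, not HBase classes. Only entries still held in some memstore are rewritten, after which every old HLog can be archived without forcing a flush.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the HLog compactor idea (HBASE-9873); not HBase code. */
class HLogCompactorSketch {
  interface LogEntry {}
  interface Wal {
    List<LogEntry> readEntries() throws IOException;
    void append(LogEntry entry) throws IOException;
  }

  /** Copies still-needed entries into a fresh log so every old log becomes archivable. */
  static void compact(List<Wal> oldLogs, Wal newLog,
                      Predicate<LogEntry> stillInMemstore) throws IOException {
    for (Wal log : oldLogs) {
      for (LogEntry entry : log.readEntries()) {
        if (stillInMemstore.test(entry)) {  // not yet flushed to an HFile
          newLog.append(entry);             // still needed for recovery
        }                                   // else: already durable in an HFile, drop it
      }
    }
    // After this point the old logs only contain flushed data and can all be archived.
  }
}
```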
3. Adjusted Delete Semantic

Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out

Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out

Fix: "a delete can't mask KVs with a larger mvcc (i.e. ones put later)"
(a client-side reproduction of Scenario 1 is sketched below)
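Scenario 1 can be reproduced with the standard client API. A minimal sketch, assuming a table 't' with family 'f' and an Admin handle for the flush (HBase 1.x-style API):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSemanticRepro {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("kvA");
    byte[] f = Bytes.toBytes("f"), q = Bytes.toBytes("q");
    long t0 = 1000L;
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"));
         Admin admin = conn.getAdmin()) {
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v1")));  // 1. write kvA at t0
      table.delete(new Delete(row).addColumn(f, q, t0));                 // 2. delete kvA at t0
      admin.flush(TableName.valueOf("t"));                               //    flush the marker (async; may need a short wait)
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v2")));  // 3. write kvA at t0 again
      Result r = table.get(new Get(row));                                // 4. empty without the fix
      System.out.println("kvA visible: " + !r.isEmpty());
    }
  }
}
```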
4. Coordinated Compaction

Compact storm!
 Compaction consumes a global resource (HDFS), yet whether to compact is decided locally by each region server!
Coordinated Compaction

Each region server asks the master ("Can I?") before compacting; the master answers OK or NO based on cluster-wide HDFS load
 Compactions are scheduled by the master, so compact storms no longer happen
(a hypothetical sketch of the protocol follows below)
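A hypothetical sketch of the protocol; the CompactionCoordinator interface and its method names are invented for illustration and are not the HBASE-9528 API.

```java
/** Hypothetical sketch of master-coordinated compaction (illustrative names only). */
public class CoordinatedCompactionSketch {
  interface CompactionCoordinator {
    /** Master-side admission check based on cluster-wide HDFS load. */
    boolean requestSlot(String regionServer, long estimatedIoBytes);
    void releaseSlot(String regionServer);
  }

  /** Region-server side: only compact when the master grants a slot. */
  static void maybeCompact(CompactionCoordinator master, String rsName,
                           long ioEstimate, Runnable compaction) {
    if (!master.requestSlot(rsName, ioEstimate)) {
      return;                        // master says NO while HDFS is loaded; retry later
    }
    try {
      compaction.run();              // the actual compaction against HDFS
    } finally {
      master.releaseSlot(rsName);    // free the slot so other region servers can compact
    }
  }
}
```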
5. Quorum Master

Current design: one active and one standby master coordinate through ZooKeeper (zk1/zk2/zk3); masters and region servers read info/states from ZooKeeper
 While the active master serves, the standby master stays 'really' idle
 When the standby master becomes active, it has to rebuild the in-memory state
Quorum Master

New design: three master instances replicate the in-memory state among themselves via a consensus protocol, with no ZooKeeper in the failover path
 Better master failover performance: no phase to rebuild the in-memory state
 No external (ZooKeeper) dependency
 No potential consistency issues
 Simpler deployment
 Better restart performance for BIG clusters (10K+ regions)
Acknowledgement

Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com

Editor's Notes

  • New Write Thread Model (performance): This is the throughput comparison against a single region server. When the write load is low there is almost no improvement, but as the write load gets heavier the improvement becomes substantial, up to 3.5x. Actually, when the write load is very low the new model shows a small downgrade (about 10%); Michael Stack fixed that downgrade in another patch. Thanks, Stack!
  • Reverse Scan: Before explaining how reverse scan works, one important fact helps in understanding this patch: the granularity of a scan is the row, not the key-value. All key-values of a row are read out in order from the HFiles or the memstore, assembled into a result row in the region server's memory, and then returned to the client. This is the same for both forward and reverse scan. So the difficulty of reverse scan is, when the current row is done, figuring out which row comes next, jumping to that row, and starting to scan. Since there are two extra seek operations compared to forward scan, performance is about 30% lower than forward scan, almost the same as in LevelDB. Finally, thanks to Chunhui for porting our patch to trunk!
  • Per Table/CF Replication (problem): Suppose we have a source cluster with two tables and four column families, all of which can be replicated. For data safety we deployed a peer cluster as a backup, and the source cluster replicates all of its data to this backup cluster; that is exactly what we want, and replication works well. Then, for data analysis or experimental purposes, we deployed another peer cluster whose program only needs the data from cfX of table T2. Ideally only the data from cfX of T2 would be replicated, but replication can't work that way: we have to create all the tables and column families on PeerB, and all of the data gets replicated. That is really bad, both for the bandwidth between the source and PeerB and for PeerB's resource usage.
  • Per Table/CF Replication (solution): We implemented this feature so that you can specify which data is replicated to a peer cluster. For PeerA the add_peer command is the same as before, since PeerA wants all of the data; for PeerB, add_peer takes an additional argument specifying which tables or column families to replicate. The implementation change is quite straightforward: on the source cluster, when parsing log entries, the replication source thread replicates only the entries from cfX of table T2 and ignores all the others.
  • Block Index Key Optimization: This patch reduces the overall block index size. Suppose two contiguous blocks: the row of the last key-value of Block 1 is "ab" and the row of the first key-value of Block 2 is "ah, hello world". Before the patch, the block index key of Block 2 is "ah, hello world" (the first key-value of Block 2). After the patch it is "ac", a fake key: the minimal key-value that is larger than the last key-value of Block 1 and less than or equal to the first key-value of Block 2, with the shortest row length. The new block index key is much shorter than the old one.
  • Some Ongoing Patches: Now let's continue with some work items we are currently working on.
  • HLog Compactor (problem): Its target is to keep as few HLogs as possible, so ultimately it improves region server failover performance: the fewer HLog files to split, the faster failover is. A region server typically serves many regions, and their write patterns can be quite different, so their flush frequency and timing also differ. Consider a region x whose memstore contains very few entries: no flush is triggered for a long time, and its entries scatter across many HLogs. Even though every other entry in those HLogs has been flushed to HFiles, they still can't be archived because they contain entries from region x. We do have a background flusher thread that flushes old memstores forcefully, but it has obvious drawbacks: it is hard to configure good-enough flushCheckInterval and flushPerChanges values, forced flushes result in tiny HFiles, and, as in HBASE-10499, some problematic regions can't be flushed at all by this background thread.
  • HLog Compactor (solution): Our patch introduces another background thread, the HLog compactor. When the total HLog size is too large compared to the memstore size (which means we have flushed enough but not archived enough), we trigger the HLog compactor: it reads the entries from all active HLog files, writes an entry to a new HLog file if it is still in some region's memstore, and ignores it if it is not (meaning it has already been flushed to an HFile). After the compaction we can archive all of the old HLog files without flushing any memstore. We have finished this feature and are testing it on our test cluster; we'll share the patch after the test.
  • Adjusted Delete Semantic: Consider two scenarios. In the first, we write kvA at timestamp t0, delete it and flush, then write it again, and finally try to read it; the result is that kvA can't be read, since both writes are masked by the delete. The second scenario is the same except that before writing kvA the second time we trigger a major compaction; this time kvA can be read, because the delete marker was collected by the major compaction. This is inconsistent: major compaction is transparent to the client, yet the read result depends on whether one has occurred. The root cause is that a delete can mask even key-values put later than it. The fix is simple: since mvcc represents the order in which all writes (puts and deletes) enter HBase, we use it as an additional criterion to prevent a delete from masking a later put. We have had some intense discussion about this patch; personally I still think it deserves further thought and discussion.
  • Coordinated Compaction (problem): We talk about compact storms from time to time; here is how they happen. When a region server wants to compact, it simply triggers the compaction, which reads from HDFS and writes back to HDFS, and a region server can trigger a new compaction no matter how overloaded the whole system is. So the problem is that what a compaction ultimately uses is the global HDFS, but whether to trigger one is a local decision made by each region server.
  • Coordinated Compaction (solution): We propose using the master as a coordinator for compaction scheduling. When a region server wants to compact, it asks the master; if the master says yes, it triggers the compaction, and if the master thinks the system is overloaded, it rejects further compaction requests until the system is no longer loaded.
  • Quorum Master (problem): This is a master redesign, and there has already been some discussion of it; Jimmy Xiang from Cloudera and Mikhail from WANdisco have put some effort into it, which is great. The current master design has two problems. First, some system-wide metadata and state are maintained only in the active master; for master failover they are also stored in ZooKeeper, and during failover the new active master has to read from ZooKeeper to rebuild its in-memory state. Second, ZooKeeper is used as the communication channel between the master and the region servers for the region-assignment state machine, and ZooKeeper's asynchronous notification mechanism is simply not suitable for state-machine logic; it is the root cause of many tricky bugs found so far.
  • Quorum Master (solution): In the new design, instead of storing in-memory state in ZooKeeper, we replicate it among all master instances using a consensus protocol such as Raft or Paxos. When the active master fails, a new active master is elected via the consensus protocol among the alive standby masters, and it serves immediately without reading from anywhere else. The quorum master has several advantages: better master failover performance; better restart performance for big clusters, since communication between the master and ZooKeeper is the bottleneck when a large number of region assignment tasks happen concurrently; no external dependency on ZooKeeper; no more potential consistency issues; and simpler deployment.