HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
About Us
Liang Xie, Honghua Feng
Outline
 Introduction
 Latency practice
 Some patches we contributed
 Some ongoing patches
 Q&A
About Xiaomi
 Mobile internet company founded in 2010
 Sold 18.7 million phones in 2013
 Over $5 billion revenue in 2013
 Sold 11 million phones in Q1, 2014
Hardware
Software
Internet Services
About Our HBase Team
 Founded in October 2012
 5 members
 Liang Xie
 Shaohui Liu
 Jianwei Cui
 Liangliang He
 Honghua Feng
 Resolved 130+ JIRAs so far
Our Clusters and Scenarios
 15 Clusters : 9 online / 2 processing / 4 test
 Scenarios
 MiCloud
 MiPush
 MiTalk
 Perf Counter
Our Latency Pain Points
 Java GC
 Stable page write in OS layer
 Slow buffered IO (FS journal IO)
 Read/Write IO contention
HBase GC Practice
 Bucket cache with off-heap mode
 Xmn / SurvivorRatio / MaxTenuringThreshold
 PretenureSizeThreshold & replication source buffer size
 GC concurrent thread number
GC time per day: [2500, 3000]s -> [300, 600]s !!!
Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <Stuck here often!>
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket() :
->receiveNextPacket
->mirrorPacketTo(mirrorOut) //write packet to the mirror
->out.write/flush //write data to local disk. <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled local write was the culprit; strace results also confirmed it (a sketch of this kind of timing check follows below)
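HDFS-6110 added timing around the DataNode's local write path. The snippet below is a minimal sketch of that kind of instrumentation, illustrative only and not the actual HDFS code; the 100 ms threshold matches the one used on the testing slide later.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Minimal sketch of write-latency instrumentation in the spirit of HDFS-6110
 *  (illustrative only, not the actual DataNode code). */
class SlowWriteDetector {
  private static final long THRESHOLD_MS = 100;  // flag buffered writes slower than 100 ms

  static void timedWrite(OutputStream out, byte[] buf, int off, int len) throws IOException {
    long start = System.nanoTime();
    out.write(buf, off, len);                    // buffered IO: usually fast, but it can stall
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    if (elapsedMs > THRESHOLD_MS) {
      System.err.println("slow buffered write: " + elapsedMs + " ms for " + len + " bytes");
    }
  }
}
```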
Root Cause of Write Latency Spikes
 write() is expected to be fast
 But it is sometimes blocked by page write-back!
Stable Page Write Issue Workaround
Workaround: move the kernel off 2.6.32-279 (RHEL 6.3), either back to 2.6.32-220 (RHEL 6.2)
or forward to 2.6.32-358 (RHEL 6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS on recent kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios such as HBase on HDFS
Write Latency Spikes Testing
Setup: 8 YCSB threads; 20 million rows written, each 3 x 200 bytes; 3 DataNodes; kernel 3.12.17
Counted the stalled write() calls that took more than 100 ms
The largest write() latency on ext4: ~600 ms!
Hedged Read (HDFS-5776)
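HDFS-5776 makes the DFSClient issue a second, speculative read against another replica when the first one is slow. Below is a minimal sketch of enabling it from the HBase client side, assuming a Hadoop 2.4+ DFSClient; the two keys are the standard hedged-read settings, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Thread pool used for speculative (hedged) reads; 0 disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica has not answered within this many ms, read another replica in parallel.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
    return conf;
  }
}
```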
Other Meaningful Latency Work
 Long first “put” issue (HBASE-10010)
 Token invalid (HDFS-5637)
 Retry/timeout setting in DFSClient
 Reduce write traffic? (HLog compression)
 HDFS IO Priority (HADOOP-10410)
Wish List
 Real-time HDFS, especially priority-related work
 GC-friendly core data structures
 More off-heap; Shenandoah GC
 TCP / disk IO characteristic analysis
We need more eyes on the OS layer
Stay tuned…
Some Patches Xiaomi Contributed
 New write thread model (HBASE-8755)
 Reverse scan (HBASE-4811)
 Per table/CF replication (HBASE-8751)
 Block index key optimization (HBASE-7845)
1. New Write Thread Model

Old model: 256 WriteHandler threads, each doing everything itself: append to the local buffer, write to HDFS, and sync to HDFS
Problem: because every WriteHandler does everything, there is severe lock contention!
New Write Thread Model

New model: 256 WriteHandlers append to a local buffer; 1 AsyncWriter writes to HDFS; 4 AsyncSyncer threads sync to HDFS; 1 AsyncNotifier notifies the waiting writers
(a minimal sketch of the single-writer idea follows below)
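A minimal sketch of the single-writer idea, under a deliberately simplified structure; the real HBASE-8755 patch additionally splits syncing and notification into separate AsyncSyncer and AsyncNotifier threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative sketch only, not the actual HBase code: handlers enqueue edits,
 *  a single writer thread batches them, then pays the sync cost once per batch. */
class BatchedWalSketch {
  private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

  /** Called concurrently by many WriteHandler threads. */
  void append(byte[] edit) throws InterruptedException {
    queue.put(edit);
  }

  /** Run by one background thread: batch, write, sync. */
  void writerLoop() throws InterruptedException {
    List<byte[]> batch = new ArrayList<>();
    while (true) {
      batch.add(queue.take());   // block until at least one edit is available
      queue.drainTo(batch);      // grab whatever else has piled up
      for (byte[] edit : batch) {
        writeToHdfs(edit);       // buffered write of each edit
      }
      syncToHdfs();              // one expensive sync for the whole batch
      batch.clear();
    }
  }

  private void writeToHdfs(byte[] edit) { /* e.g. SequenceFile writer append */ }
  private void syncToHdfs()             { /* e.g. DFSOutputStream hflush/hsync */ }
}
```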
New Write Thread Model
 Low load : No improvement
 Heavy load : Huge improvement (3.5x)
2. Reverse Scan
[Diagram: three sorted KV sources (memstore / HFiles), each holding an interleaved subset of the key-values of Row1–Row6]
1. All scanners seek to their 'previous' row (SeekBefore)
2. Figure out the next row: the max of those 'previous' rows
3. All scanners seek to the first KV of that next row (SeekTo)
Performance: about 70% of forward scan (client usage is sketched below)
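On the client side, reverse scan is exposed through Scan#setReversed (available since HBase 0.98). A minimal usage sketch, assuming a table named 't1' and an HBase 1.x-style client API; rows come back in descending order starting from the given start row.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("Row5"));  // the scan walks backwards from this row
      scan.setReversed(true);                   // HBASE-4811
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}
```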
3. Per Table/CF Replication

Source cluster: T1 (cfA, cfB), T2 (cfX, cfY); PeerA is a full backup, PeerB only needs T2:cfX
 PeerB creates T2 only: replication can't work!
 PeerB creates both T1 & T2: all the data gets replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication

Source cluster: T1 (cfA, cfB), T2 (cfX, cfY)
PeerA receives everything (T1:cfA,cfB; T2:cfX,cfY); PeerB receives only T2:cfX
 add_peer 'PeerA', 'PeerA_ZK'
 add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
4. Block Index Key Optimization

Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world"
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (a fake key chosen so that k1 < key <= k2)
 Reduces block index size
 Saves seeking into the previous block when the search key falls in ["ac", "ah, hello world")
(a minimal sketch of deriving such a fake key follows below)
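The core of the optimization is computing the shortest "fake" key that sorts after the last key of Block 1 and no later than the first key of Block 2. A minimal byte-level sketch of that idea, illustrative only; the real HBASE-7845 code operates on full HBase keys rather than raw byte arrays.

```java
import java.util.Arrays;

/** Minimal sketch of deriving a short separator key k with k1 < k <= k2
 *  (illustrative only; not the actual HBASE-7845 implementation). */
class ShortSeparator {
  static byte[] separator(byte[] k1, byte[] k2) {
    int i = 0, min = Math.min(k1.length, k2.length);
    while (i < min && k1[i] == k2[i]) i++;            // length of the common prefix
    if (i < min && (k1[i] & 0xff) + 1 < (k2[i] & 0xff)) {
      byte[] sep = Arrays.copyOf(k1, i + 1);          // common prefix + one more byte
      sep[i]++;                                       // "ab" vs "ah, hello world" -> "ac"
      return sep;
    }
    return k2;                                        // fall back to Block 2's real first key
  }
}
```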
Some Ongoing Patches

 Cross-table, cross-row transactions (HBASE-10999)
 HLog compactor (HBASE-9873)
 Adjusted delete semantics (HBASE-8721)
 Coordinated compaction (HBASE-9528)
 Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis

http://github.com/xiaomi/themis
 Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
 Two-phase commit: strong cross-table / cross-row consistency
 Global timestamp server: globally, strictly incremental timestamps
 No changes to HBase internals: built on the HBase client and coprocessors
 Performance: read 90%, write 23% (the same downgrade as Google Percolator reports)
 More details: HBASE-10999
(a hypothetical two-phase-commit sketch follows below)
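For illustration only, here is a hypothetical sketch of what a Percolator-style two-phase commit looks like from the client's point of view. The TimestampOracle and TxnStore names are invented for this sketch and are not the actual Themis API; see the repository above for the real one.

```java
/** Hypothetical sketch of a Percolator-style two-phase commit (illustrative names only). */
public class TwoPhaseCommitSketch {
  interface TimestampOracle { long next(); }          // global strictly incremental timestamps

  interface TxnStore {
    void prewrite(byte[] row, byte[] value, long startTs, byte[] primaryRow);  // phase 1: lock + write
    void commit(byte[] row, long startTs, long commitTs);                      // phase 2: make visible
  }

  static void writeTwoRows(TimestampOracle oracle, TxnStore store,
                           byte[] rowA, byte[] rowB, byte[] value) {
    long startTs = oracle.next();                     // all writes share one start timestamp
    store.prewrite(rowA, value, startTs, rowA);       // rowA acts as the primary row
    store.prewrite(rowB, value, startTs, rowA);       // secondary rows point at the primary lock
    long commitTs = oracle.next();                    // commit timestamp from the same oracle
    store.commit(rowA, startTs, commitTs);            // committing the primary decides the txn
    store.commit(rowB, startTs, commitTs);            // secondaries can then be committed
  }
}
```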
2. HLog Compactor

Problem: Region x gets few writes, but its entries scatter across many HLogs (HLog 1, 2, 3), so none of those HLogs can be archived
PeriodicMemstoreFlusher flushes old memstores forcefully, but:
 'flushCheckInterval' / 'flushPerChanges' are hard to configure
 Forced flushes result in 'tiny' HFiles
 HBASE-10499: a problematic region can't be flushed at all!
HLog Compactor

 Compact: HLog 1,2,3,4 -> HLog x (rewrite only the entries still held in some memstore)
 Archive: HLog 1,2,3,4
(a hypothetical sketch of the compaction step follows below)
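A hypothetical sketch of the compaction step; the Wal and LogEntry types below are illustrative stand-ins, not HBase classes. Only entries still held in some memstore are rewritten, after which every old HLog can be archived without forcing a flush.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the HLog compactor idea (HBASE-9873); not HBase code. */
class HLogCompactorSketch {
  interface LogEntry {}
  interface Wal {
    List<LogEntry> readEntries() throws IOException;
    void append(LogEntry entry) throws IOException;
  }

  /** Copies still-needed entries into a fresh log so every old log becomes archivable. */
  static void compact(List<Wal> oldLogs, Wal newLog,
                      Predicate<LogEntry> stillInMemstore) throws IOException {
    for (Wal log : oldLogs) {
      for (LogEntry entry : log.readEntries()) {
        if (stillInMemstore.test(entry)) {  // not yet flushed to an HFile
          newLog.append(entry);             // still needed for recovery
        }                                   // else: already durable in an HFile, drop it
      }
    }
    // After this point the old logs only contain flushed data and can all be archived.
  }
}
```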
3. Adjusted Delete Semantic

Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out

Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out

Fix: "a delete can't mask KVs with a larger mvcc (i.e. ones put later)"
(a client-side reproduction of Scenario 1 is sketched below)
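Scenario 1 can be reproduced with the standard client API. A minimal sketch, assuming a table 't' with family 'f' and an Admin handle for the flush (HBase 1.x-style API):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSemanticRepro {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("kvA");
    byte[] f = Bytes.toBytes("f"), q = Bytes.toBytes("q");
    long t0 = 1000L;
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"));
         Admin admin = conn.getAdmin()) {
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v1")));  // 1. write kvA at t0
      table.delete(new Delete(row).addColumn(f, q, t0));                 // 2. delete kvA at t0
      admin.flush(TableName.valueOf("t"));                               //    flush the marker (async; may need a short wait)
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v2")));  // 3. write kvA at t0 again
      Result r = table.get(new Get(row));                                // 4. empty without the fix
      System.out.println("kvA visible: " + !r.isEmpty());
    }
  }
}
```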
4. Coordinated Compaction

Compact storm!
 Compaction consumes a global resource (HDFS), yet whether to compact is decided locally by each region server!
Coordinated Compaction

Each region server asks the master ("Can I?") before compacting; the master answers OK or NO based on cluster-wide HDFS load
 Compactions are scheduled by the master, so compact storms no longer happen
(a hypothetical sketch of the protocol follows below)
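A hypothetical sketch of the protocol; the CompactionCoordinator interface and its method names are invented for illustration and are not the HBASE-9528 API.

```java
/** Hypothetical sketch of master-coordinated compaction (illustrative names only). */
public class CoordinatedCompactionSketch {
  interface CompactionCoordinator {
    /** Master-side admission check based on cluster-wide HDFS load. */
    boolean requestSlot(String regionServer, long estimatedIoBytes);
    void releaseSlot(String regionServer);
  }

  /** Region-server side: only compact when the master grants a slot. */
  static void maybeCompact(CompactionCoordinator master, String rsName,
                           long ioEstimate, Runnable compaction) {
    if (!master.requestSlot(rsName, ioEstimate)) {
      return;                        // master says NO while HDFS is loaded; retry later
    }
    try {
      compaction.run();              // the actual compaction against HDFS
    } finally {
      master.releaseSlot(rsName);    // free the slot so other region servers can compact
    }
  }
}
```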
5. Quorum Master

Current design: one active and one standby master coordinate through ZooKeeper (zk1/zk2/zk3); masters and region servers read info/states from ZooKeeper
 While the active master serves, the standby master stays 'really' idle
 When the standby master becomes active, it has to rebuild the in-memory state
Quorum Master

New design: three master instances replicate the in-memory state among themselves via a consensus protocol, with no ZooKeeper in the failover path
 Better master failover performance: no phase to rebuild the in-memory state
 No external (ZooKeeper) dependency
 No potential consistency issues
 Simpler deployment
 Better restart performance for BIG clusters (10K+ regions)
Acknowledgement

Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com

Editor's Notes

  • New Write Thread Model (performance): This is the throughput comparison against a single region server. When the write load is low there is almost no improvement, but as the write load gets heavier the improvement becomes substantial, up to 3.5x. Actually, when the write load is very low the new model shows a small downgrade (about 10%); Michael Stack fixed that downgrade in another patch. Thanks, Stack!
  • Reverse Scan: Before explaining how reverse scan works, one important fact helps in understanding this patch: the granularity of a scan is the row, not the key-value. All key-values of a row are read out in order from the HFiles or the memstore, assembled into a result row in the region server's memory, and then returned to the client. This is the same for both forward and reverse scan. So the difficulty of reverse scan is, when the current row is done, figuring out which row comes next, jumping to that row, and starting to scan. Since there are two extra seek operations compared to forward scan, performance is about 30% lower than forward scan, almost the same as in LevelDB. Finally, thanks to Chunhui for porting our patch to trunk!
  • Per Table/CF Replication (problem): Suppose we have a source cluster with two tables and four column families, all of which can be replicated. For data safety we deployed a peer cluster as a backup, and the source cluster replicates all of its data to this backup cluster; that is exactly what we want, and replication works well. Then, for data analysis or experimental purposes, we deployed another peer cluster whose program only needs the data from cfX of table T2. Ideally only the data from cfX of T2 would be replicated, but replication can't work that way: we have to create all the tables and column families on PeerB, and all of the data gets replicated. That is really bad, both for the bandwidth between the source and PeerB and for PeerB's resource usage.
  • Per Table/CF Replication (solution): We implemented this feature so that you can specify which data is replicated to a peer cluster. For PeerA the add_peer command is the same as before, since PeerA wants all of the data; for PeerB, add_peer takes an additional argument specifying which tables or column families to replicate. The implementation change is quite straightforward: on the source cluster, when parsing log entries, the replication source thread replicates only the entries from cfX of table T2 and ignores all the others.
  • Block Index Key Optimization: This patch reduces the overall block index size. Suppose two contiguous blocks: the row of the last key-value of Block 1 is "ab" and the row of the first key-value of Block 2 is "ah, hello world". Before the patch, the block index key of Block 2 is "ah, hello world" (the first key-value of Block 2). After the patch it is "ac", a fake key: the minimal key-value that is larger than the last key-value of Block 1 and less than or equal to the first key-value of Block 2, with the shortest row length. The new block index key is much shorter than the old one.
  • Some Ongoing Patches: Now let's continue with some work items we are currently working on.
  • HLog Compactor (problem): Its target is to keep as few HLogs as possible, so ultimately it improves region server failover performance: the fewer HLog files to split, the faster failover is. A region server typically serves many regions, and their write patterns can be quite different, so their flush frequency and timing also differ. Consider a region x whose memstore contains very few entries: no flush is triggered for a long time, and its entries scatter across many HLogs. Even though every other entry in those HLogs has been flushed to HFiles, they still can't be archived because they contain entries from region x. We do have a background flusher thread that flushes old memstores forcefully, but it has obvious drawbacks: it is hard to configure good-enough flushCheckInterval and flushPerChanges values, forced flushes result in tiny HFiles, and, as in HBASE-10499, some problematic regions can't be flushed at all by this background thread.
  • HLog Compactor (solution): Our patch introduces another background thread, the HLog compactor. When the total HLog size is too large compared to the memstore size (which means we have flushed enough but not archived enough), we trigger the HLog compactor: it reads the entries from all active HLog files, writes an entry to a new HLog file if it is still in some region's memstore, and ignores it if it is not (meaning it has already been flushed to an HFile). After the compaction we can archive all of the old HLog files without flushing any memstore. We have finished this feature and are testing it on our test cluster; we'll share the patch after the test.
  • Adjusted Delete Semantic: Consider two scenarios. In the first, we write kvA at timestamp t0, delete it and flush, then write it again, and finally try to read it; the result is that kvA can't be read, since both writes are masked by the delete. The second scenario is the same except that before writing kvA the second time we trigger a major compaction; this time kvA can be read, because the delete marker was collected by the major compaction. This is inconsistent: major compaction is transparent to the client, yet the read result depends on whether one has occurred. The root cause is that a delete can mask even key-values put later than it. The fix is simple: since mvcc represents the order in which all writes (puts and deletes) enter HBase, we use it as an additional criterion to prevent a delete from masking a later put. We have had some intense discussion about this patch; personally I still think it deserves further thought and discussion.
  • Coordinated Compaction (problem): We talk about compact storms from time to time; here is how they happen. When a region server wants to compact, it simply triggers the compaction, which reads from HDFS and writes back to HDFS, and a region server can trigger a new compaction no matter how overloaded the whole system is. So the problem is that what a compaction ultimately uses is the global HDFS, but whether to trigger one is a local decision made by each region server.
  • Coordinated Compaction (solution): We propose using the master as a coordinator for compaction scheduling. When a region server wants to compact, it asks the master; if the master says yes, it triggers the compaction, and if the master thinks the system is overloaded, it rejects further compaction requests until the system is no longer loaded.
  • Quorum Master (problem): This is a master redesign, and there has already been some discussion of it; Jimmy Xiang from Cloudera and Mikhail from WANdisco have put some effort into it, which is great. The current master design has two problems. First, some system-wide metadata and state are maintained only in the active master; for master failover they are also stored in ZooKeeper, and during failover the new active master has to read from ZooKeeper to rebuild its in-memory state. Second, ZooKeeper is used as the communication channel between the master and the region servers for the region-assignment state machine, and ZooKeeper's asynchronous notification mechanism is simply not suitable for state-machine logic; it is the root cause of many tricky bugs found so far.
  • Quorum Master (solution): In the new design, instead of storing in-memory state in ZooKeeper, we replicate it among all master instances using a consensus protocol such as Raft or Paxos. When the active master fails, a new active master is elected via the consensus protocol among the alive standby masters, and it serves immediately without reading from anywhere else. The quorum master has several advantages: better master failover performance; better restart performance for big clusters, since communication between the master and ZooKeeper is the bottleneck when a large number of region assignment tasks happen concurrently; no external dependency on ZooKeeper; no more potential consistency issues; and simpler deployment.