Date-tiered Compaction Policy for Time-series Data

Date Tiered Compaction
Clara Xiong

▪ Not a time-series system (row keys are independent of time), but the read/write pattern is similar.
▪ Exponential data growth pushes disk IO to the limit.
▪ Date tiered compaction partitions the storage by time.
▪ Resulted in drastically lower latency and a massive reduction in disk IO consumption.
Why Date Tiered Compaction for Flurry?

▪ Inspired by Cassandra’s Date Tiered Compaction.
▪ When a new window appears, older windows are merged into exponentially larger windows.
What is Date Tiered Compaction?
Base window = 1 hour, windows per tier = 4
From https://labs.spotify.com/2014/12/18/date-tiered-compaction/
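
The tiering described on this slide can be sketched in a few lines of Python. This is a minimal illustration of the exponential windowing scheme, assuming windows are aligned to multiples of their own size; it is not the actual HBase or Cassandra implementation. The names `Window`, `incoming_window`, `next_earlier_window` and `windows_back_from` are hypothetical. The defaults mirror the slide: base window = 1 hour, windows per tier = 4.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000  # one hour in milliseconds


@dataclass(frozen=True)
class Window:
    size_ms: int  # window length in milliseconds
    index: int    # window covers [index * size_ms, (index + 1) * size_ms)

    @property
    def start_ms(self) -> int:
        return self.index * self.size_ms

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.size_ms


def incoming_window(now_ms: int, base_ms: int = HOUR_MS) -> Window:
    """The newest (smallest) window, the one containing `now_ms`."""
    return Window(base_ms, now_ms // base_ms)


def next_earlier_window(w: Window, windows_per_tier: int = 4) -> Window:
    """Step back one window; move up a tier when a tier boundary is reached."""
    if w.index % windows_per_tier > 0:
        # Still inside the current tier: same size, previous slot.
        return Window(w.size_ms, w.index - 1)
    # Tier boundary: the next older window is windows_per_tier times larger.
    return Window(w.size_ms * windows_per_tier, w.index // windows_per_tier - 1)


def windows_back_from(now_ms: int, count: int,
                      base_ms: int = HOUR_MS,
                      windows_per_tier: int = 4) -> list:
    """The `count` most recent windows, newest first."""
    w = incoming_window(now_ms, base_ms)
    result = [w]
    for _ in range(count - 1):
        w = next_earlier_window(w, windows_per_tier)
        result.append(w)
    return result


if __name__ == "__main__":
    for w in windows_back_from(22 * HOUR_MS + 5, 5):
        print(w.size_ms // HOUR_MS, "h window starting at hour", w.start_ms // HOUR_MS)
```

With the current time in hour 22, the sketch yields three 1-hour windows (hours 22, 21, 20), then a 4-hour window covering hours 16 to 20, then a 16-hour window covering hours 0 to 16. The windows stay contiguous while growing by a factor of four per tier.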

▪ Exponential sizing balances the tradeoff between optimal scan performance on recent data and total file count.
▪ Windows beyond the configured age range are no longer merged, which keeps data retention management efficient.
What is Date Tiered Compaction?
(cont’d)

▪ Use maxTimestamp for file selection to favor late-arriving data.
▪ Normalize timestamp by the order of sequence id.
▪ Plugged-in per-window compaction policy to reduce wasteful compaction.
File Selection
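
The first two bullets can be illustrated with a short Python sketch. It assumes one reading of "normalize by the order of sequence id": walk the files in sequence-id order and carry a running maximum of `maxTimestamp`, so that a file with late-arriving (older) data is grouped with the files written around the same time instead of falling back into an old window. `StoreFile`, `normalize_max_timestamps` and `group_by_window` are hypothetical names, and the fixed-size window is a simplification of the tiered windows.

```python
from collections import defaultdict
from typing import NamedTuple


class StoreFile(NamedTuple):
    seq_id: int  # write order of the file
    max_ts: int  # largest cell timestamp in the file (ms)


def normalize_max_timestamps(files):
    """Pair each file with a running max of max_ts, in sequence-id order."""
    normalized = []
    running_max = None
    for f in sorted(files, key=lambda f: f.seq_id):
        running_max = f.max_ts if running_max is None else max(running_max, f.max_ts)
        normalized.append((f, running_max))
    return normalized


def group_by_window(files, window_ms):
    """Assign files to fixed-size windows by normalized max timestamp."""
    buckets = defaultdict(list)
    for f, ts in normalize_max_timestamps(files):
        buckets[ts // window_ms].append(f)
    return dict(buckets)


if __name__ == "__main__":
    files = [StoreFile(1, 100), StoreFile(2, 50), StoreFile(3, 250)]
    print(group_by_window(files, window_ms=100))
```

In the usage example, the file with sequence id 2 holds late-arriving data (`max_ts` = 50) but is normalized to 100, so it lands in the same window as the file written just before it.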

A Date Tiered Layout in Production
[Figure: File Size (20MB) by Window (Days)]

[Figure: HDFS Read Bytes by Days After DTC Was On; series: 7-day job, 1-day job]
[Figure: Mapper Run Time Minutes by Days After DTC Was On; series: 7-day job (avg), 7-day job (max), 1-day job (avg), 1-day job (max)]

                      Reduction on HDFS IO (%)   Notes
1-day look back job   79%
7-day look back job   97%
Minor Compaction      85%                        Still provisional
IO Reduction Summary

▪ Date Tiered Compactor (Duo Zhang @xiaomi)
● Outputs store files for given time boundaries.
● Allows us to split the data by window boundaries.
▪ Date tiered major compaction
● Maintains the tiered layout while processing deletes, updates and locality.
● Restores a layout skewed by out-of-order writes, bulk loads, etc.
▪ Window factory and minor compaction for archiving (Duo Zhang)
Enhancements

▪ What will benefit most?
● Workloads where most writes have a cell timestamp near the write time and most reads cover a limited range of recent data out of the full data set
▪ What will not?
● Random GETs without a time range
● Frequent out-of-order writes that create long tails, especially writes with future timestamps
● Frequent bulk loads of heavily overlapping time ranges
Is Date Tiered Compaction for You?
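
A small Python sketch shows why time-range reads benefit and random GETs do not. It assumes each store file records the minimum and maximum cell timestamp it contains and that a scan with a time range can skip files whose range does not overlap. `StoreFile` and `files_to_read` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class StoreFile(NamedTuple):
    name: str
    min_ts: int  # smallest cell timestamp in the file
    max_ts: int  # largest cell timestamp in the file


def files_to_read(files, scan_start, scan_end):
    """Files whose [min_ts, max_ts] overlaps the scan range [scan_start, scan_end)."""
    return [f for f in files if f.max_ts >= scan_start and f.min_ts < scan_end]


if __name__ == "__main__":
    # Time-partitioned layout: each file covers a distinct time range.
    tiered = [StoreFile("a", 0, 99), StoreFile("b", 100, 199), StoreFile("c", 200, 299)]
    # Layout without time partitioning: every file spans the full range.
    mixed = [StoreFile("x", 0, 299), StoreFile("y", 0, 299), StoreFile("z", 0, 299)]

    print(len(files_to_read(tiered, 250, 300)))  # only the newest file overlaps
    print(len(files_to_read(mixed, 250, 300)))   # every file must be read
```

A GET without a time range corresponds to a scan range that covers everything, so no file can be skipped in either layout. Long tails from out-of-order writes widen each file's range and have a similar effect.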

▪ Don’t turn it on cluster-wide: the meta table should not use Date Tiered Compaction. Scope all related configuration changes to the table or column family.
▪ Increase the blocking file count and max compaction file count settings generously.
▪ Use a pressure-aware compaction throughput throttler to limit the impact of synchronized large minor compactions.
Best Practice Recommendations
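
As a sketch of the first two recommendations, the Python snippet below renders an HBase shell `alter` command that scopes the settings to a single column family. The table name `events`, the family name `d`, the helper `render_alter` and the value 60 are illustrative assumptions, not recommendations from the slides; check the property names against the documentation for your HBase version before use. The throttler setting is not shown.

```python
# Per-column-family overrides; the values are illustrative only.
DTC_SETTINGS = {
    "hbase.hstore.engine.class":
        "org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine",
    "hbase.hstore.blockingStoreFiles": "60",  # blocking file count
    "hbase.hstore.compaction.max": "60",      # max files per compaction
}


def render_alter(table: str, family: str, settings: dict) -> str:
    """Build an HBase shell command that applies settings to one column family."""
    pairs = ", ".join(f"'{k}' => '{v}'" for k, v in settings.items())
    return f"alter '{table}', {{NAME => '{family}', CONFIGURATION => {{{pairs}}}}}"


if __name__ == "__main__":
    # 'events' and 'd' are hypothetical table and column family names.
    print(render_alter("events", "d", DTC_SETTINGS))
```

Keeping the overrides in the column family descriptor leaves the cluster-wide defaults, and therefore the meta table, untouched.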

▪ Design specification
https://docs.google.com/document/d/1_AmlNb2N8Us1xICsTeGDLKIqL6T-oHoRLZ323MG_uy8
▪ Email
claraxiong@yahoo-inc.com
Questions?
