Druid - DevconTLV X

Nov 2016
DRUID: Estimating Set Cardinalities

DRUID
“Very fast, highly scalable columnar data-store”

Roll-Up
Raw events:
Event Time   Id                                 Attribute   Daily Unique   Monthly Unique
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02   111111      1              1
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02   222222      1              1
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   111111      0              0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   222222      1              0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   333333      1              0

After roll-up with SumAggregator (grouped by Event Time and Attribute):
Event Time   Attribute   Daily Unique Count   Monthly Unique Count
2016-11-15   111111      1                    1
2016-11-15   222222      2                    1
2016-11-15   333333      1                    0
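
The roll-up can be sketched in a few lines of Python. This is only an illustration of the idea (drop the device Id, then sum the metric columns per Event Time and Attribute), not how Druid implements it; the function name `roll_up` is hypothetical, and attribute ids are written with six digits throughout.

```python
from collections import defaultdict

# Raw events from the slide:
# (event_time, device_id, attribute, daily_unique, monthly_unique)
raw_events = [
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "111111", 1, 1),
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "222222", 1, 1),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "111111", 0, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "222222", 1, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "333333", 1, 0),
]


def roll_up(events):
    """Drop the device id and sum both metrics per (event_time, attribute)."""
    totals = defaultdict(lambda: [0, 0])
    for event_time, _device_id, attribute, daily, monthly in events:
        totals[(event_time, attribute)][0] += daily
        totals[(event_time, attribute)][1] += monthly
    return {key: tuple(value) for key, value in totals.items()}


rolled = roll_up(raw_events)
for (event_time, attribute), (daily, monthly) in sorted(rolled.items()):
    print(event_time, attribute, daily, monthly)
```

Five raw rows collapse into three rolled-up rows, one per attribute for the day.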

Count-Distinct problem
❏ Find the number of distinct elements in a data stream that contains repeated elements
❏ eXelate business question
  ❏ How many unique devices has eXelate encountered
    ❏ for a given set-theoretic expression of attributes (segments, labels, regions, etc.)
    ❏ over a given date range
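
For a small data set the business question can be answered exactly with plain sets. The sketch below is illustrative only: the events, the device names and the helper `devices_with` are hypothetical, and the expression evaluated is (111111 AND 222222) OR 333333.

```python
from datetime import date

# Hypothetical events: (event_date, device_id, attribute)
events = [
    (date(2016, 11, 15), "dev-a", "111111"),
    (date(2016, 11, 15), "dev-a", "222222"),
    (date(2016, 11, 15), "dev-b", "222222"),
    (date(2016, 11, 16), "dev-b", "333333"),
    (date(2016, 11, 16), "dev-c", "111111"),
    (date(2016, 11, 16), "dev-a", "111111"),  # repeated element in the stream
]


def devices_with(attribute, start, end):
    """Exact set of devices seen with `attribute` in the inclusive date range."""
    return {
        device
        for event_date, device, attr in events
        if attr == attribute and start <= event_date <= end
    }


start, end = date(2016, 11, 15), date(2016, 11, 16)
a = devices_with("111111", start, end)
b = devices_with("222222", start, end)
c = devices_with("333333", start, end)

# Set-theoretic expression: (111111 AND 222222) OR 333333
unique_devices = len((a & b) | c)
print(unique_devices)
```

This exact approach needs every device id to be kept per attribute and per day, which is what makes it expensive at scale.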

Count-Distinct Approaches
• Store everything
• Store only 1 bit per device
  • 10B devices: 1.25 GB/day
  • 10B devices × 80K attributes: 100 TB/day
• Approximate
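
The storage figures for the 1-bit-per-device approach follow from simple arithmetic. The check below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^3 GB), which the slide does not state.

```python
DEVICES = 10_000_000_000   # 10B devices
ATTRIBUTES = 80_000        # 80K attributes

# One bit per device per day, converted to gigabytes (8 bits per byte).
gb_per_day = DEVICES / 8 / 1e9

# One such bitmap per attribute, converted to terabytes.
tb_per_day_all_attributes = gb_per_day * ATTRIBUTES / 1e3

print(f"{gb_per_day} GB/day for a single bitmap")
print(f"{tb_per_day_all_attributes} TB/day for all attributes")
```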

ThetaSketch
• K Minimum Values (KMV)
• Estimates set cardinality
• Supports set-theoretic operations
[Figure: sets X and Y]
• ThetaSketch mathematical framework: a generalization of KMV
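
A toy version of the idea, for illustration only. It is not the DataSketches or Druid implementation: the class `ThetaLikeSketch` and the helper `hash01` are hypothetical names. Each item is hashed to a value in [0, 1); the sketch keeps at most k hashes below a threshold theta and estimates the cardinality as (retained hashes) / theta. Union and intersection work on the retained hashes under the smaller of the two thresholds.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ThetaLikeSketch:
    """Toy KMV-style sketch: at most k hashes, all strictly below theta."""

    def __init__(self, k, theta=1.0, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes)
        self._trim()

    def _trim(self):
        # Drop hashes at or above theta; if more than k remain, lower theta
        # to the (k+1)-th smallest hash and keep only the k smallest.
        self.hashes = {h for h in self.hashes if h < self.theta}
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def update(self, item):
        h = hash01(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Exact while theta == 1.0, approximate once the sketch is full.
        return len(self.hashes) / self.theta

    def union(self, other):
        theta = min(self.theta, other.theta)
        return ThetaLikeSketch(self.k, theta, self.hashes | other.hashes)

    def intersection(self, other):
        theta = min(self.theta, other.theta)
        return ThetaLikeSketch(self.k, theta, self.hashes & other.hashes)


def build(k, start, stop):
    sketch = ThetaLikeSketch(k)
    for i in range(start, stop):
        sketch.update(f"device-{i}")
    return sketch


# X holds devices 0..19999, Y holds devices 10000..29999.
x = build(1024, 0, 20_000)
y = build(1024, 10_000, 30_000)
print("X      ~", round(x.estimate()))
print("X or Y ~", round(x.union(y).estimate()))
print("X and Y ~", round(x.intersection(y).estimate()))
```

The true sizes here are 20,000 for X, 30,000 for the union and 10,000 for the intersection; the estimates land close to them while each sketch stores only 1,024 hashes.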

ThetaSketch Error
Number of Std Dev      1         2
Confidence Interval    68.27%    95.45%
Sketch size (k)        Relative error
128                    8.87%     17.75%
8,192                  1.10%     2.21%
16,384                 0.78%     1.56%
32,768                 0.55%     1.10%
65,536                 0.39%     0.78%
131,072                0.28%     0.55%
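
The slide does not give a formula, but the rows above are consistent with a relative standard error of 1/sqrt(k - 1), where k is the sketch size, multiplied by the number of standard deviations. Under that assumption (the function name `relative_error` is hypothetical) the table can be reproduced as follows.

```python
import math


def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Assumed error bound: num_std_dev / sqrt(k - 1) for a sketch of size k."""
    return num_std_dev / math.sqrt(k - 1)


for k in (128, 8_192, 16_384, 32_768, 65_536, 131_072):
    one_sigma = relative_error(k, 1)
    two_sigma = relative_error(k, 2)
    print(f"{k:>7,}  {one_sigma:6.2%}  {two_sigma:6.2%}")
```

Halving the error requires roughly four times the sketch size.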

Solution using Elasticsearch
• Document structure
{
  "id": "3a4c1f2d84a5c179435c1fea86e6ae02",
  "events": [
    {
      "date": "15-11-2016",
      "attributes": [ "111111", "222222" ]
    },
    {
      "date": "16-11-2016",
      "attributes": [ "222222", "333333" ]
    }
  ]
}
• Exploit Elasticsearch's inverted index

Elasticsearch Issues
• Indexing data
  • 250 GB of daily data takes 10 hours to index
  • Indexing affects query time
  • Large index: 2.5 TB
• Querying
  • Low concurrency
  • Each query spans all the machines in the cluster
• Cost
  • $100K monthly

What We Tried
• Pre-processing
  • Too many combinations
• HyperLogLog
  • No good support for set-theoretic operations
  • Calculated at query time

Druid Solution
(timestamp, device_id, attribute) → ThetaSketchAggregator
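
A hedged sketch of what a query against such a data source could look like, built as a Python dict and printed as JSON. It assumes the Druid DataSketches extension is loaded and that device ids were ingested into a theta-sketch metric; the data source name `device_events`, the metric name `device_id_sketch` and the helper `attribute_sketch` are hypothetical and not taken from the talk. The query estimates the number of unique devices that have both attribute 111111 and attribute 222222 in a date range.

```python
import json


def attribute_sketch(name, attribute_value):
    """Theta-sketch aggregator restricted to rows with one attribute value."""
    return {
        "type": "filtered",
        "filter": {
            "type": "selector",
            "dimension": "attribute",
            "value": attribute_value,
        },
        "aggregator": {
            "type": "thetaSketch",
            "name": name,
            "fieldName": "device_id_sketch",  # hypothetical metric name
        },
    }


query = {
    "queryType": "timeseries",
    "dataSource": "device_events",  # hypothetical data source name
    "granularity": "all",
    "intervals": ["2016-11-01/2016-12-01"],
    "aggregations": [
        attribute_sketch("devices_111111", "111111"),
        attribute_sketch("devices_222222", "222222"),
    ],
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "unique_devices_with_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both_attributes",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "devices_111111"},
                    {"type": "fieldAccess", "fieldName": "devices_222222"},
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

Other set-theoretic expressions follow the same pattern by nesting set operations with `UNION`, `INTERSECT` or `NOT` as the `func` value.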

Benchmark
• Druid cluster: 1x Broker (r3.8xlarge), 8x Historical (r3.8xlarge)
• Elasticsearch cluster: 20 nodes (r3.8xlarge)

Druid vs. ES
                 DRUID          ES
Data             10 TB          250 GB
Indexing time    4 hours        10 hours
Index size       160 GB         2.5 TB
Query time       280-350 ms     500-6000 ms
Cost             $40K/mo        $100K/mo

We Are Hiring
❏ Web application team leader
❏ Frontend developer
❏ Java developer & machine learning
❏ Senior Java developer
❏ IT production engineer
❏ Node.js developer
http://exelate.com/about-us/careers
