Nov 2016
DRUID: Estimating Set Cardinalities
ABOUT ME
DRUID
“Very fast, highly scalable columnar data store”
Roll-Up
Event Time   Id                                Attribute  Daily Unique  Monthly Unique
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02  11111      1             1
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02  22222      1             1
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02  11111      0             0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02  22222      1             0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02  33333      1             0

Event Time   Attribute  Daily Unique Count  Monthly Unique Count
2016-11-15   111111     1                   1
2016-11-15   222222     2                   1
2016-11-15   333333     1                   0
SumAggregator
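The roll-up step can be sketched in a few lines of Python. This is only an illustration of the idea (group by event time and attribute, drop the device id, sum the flags), not Druid's implementation. The rows are copied from the raw-event table, and the function name `roll_up` is made up for this example.

```python
from collections import defaultdict

# Rows from the raw-event table:
# (event_time, device_id, attribute, daily_unique, monthly_unique)
RAW_EVENTS = [
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "11111", 1, 1),
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "22222", 1, 1),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "11111", 0, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "22222", 1, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "33333", 1, 0),
]


def roll_up(events):
    """Group by (event_time, attribute), discard the device id, sum both flags."""
    totals = defaultdict(lambda: [0, 0])
    for event_time, _device_id, attribute, daily, monthly in events:
        totals[(event_time, attribute)][0] += daily
        totals[(event_time, attribute)][1] += monthly
    return {key: tuple(counts) for key, counts in totals.items()}


ROLLED_UP = roll_up(RAW_EVENTS)
for (event_time, attribute), (daily, monthly) in sorted(ROLLED_UP.items()):
    print(event_time, attribute, daily, monthly)
```

Five raw rows collapse into three, one per attribute. The device id is gone after roll-up, which is why unique counts have to be prepared as summable flags beforehand.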
Druid Architecture
Count-Distinct problem
❏ Find the number of distinct elements in a data stream with repeated elements
❏ eXelate business question
  ❏ How many unique devices has eXelate encountered:
    ❏ for a given set-theoretic expression of attributes (segments, labels, regions, etc.)
    ❏ over a given date range
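As a baseline, the business question can be answered exactly with plain sets. The sketch below uses made-up observations (the device names `dev-a`, `dev-b`, `dev-c` and the helper `devices_with` are hypothetical) and only shows what "a set-theoretic expression over a date range" means. It is not how any of the systems discussed here store the data.

```python
from datetime import date

# Hypothetical observations: (day, device_id, attribute).
OBSERVATIONS = [
    (date(2016, 11, 15), "dev-a", "111111"),
    (date(2016, 11, 15), "dev-a", "222222"),
    (date(2016, 11, 15), "dev-b", "222222"),
    (date(2016, 11, 16), "dev-b", "333333"),
    (date(2016, 11, 16), "dev-c", "111111"),
    (date(2016, 11, 16), "dev-a", "111111"),  # repeated device, counted once
]


def devices_with(attribute, start, end):
    """Distinct devices seen with `attribute` in the inclusive range [start, end]."""
    return {
        device
        for day, device, attr in OBSERVATIONS
        if attr == attribute and start <= day <= end
    }


START, END = date(2016, 11, 15), date(2016, 11, 16)
a = devices_with("111111", START, END)
b = devices_with("222222", START, END)
c = devices_with("333333", START, END)

# Set-theoretic expression: (111111 AND 222222) OR 333333
UNIQUE_DEVICES = len((a & b) | c)
print(UNIQUE_DEVICES)
```

This is exact but needs every device id in memory for every attribute, which is what makes the problem hard at scale.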
Nielsen Marketing Cloud
Count-Distinct Approaches
• Store everything
• Store only 1 bit per device
  • 10B devices: 1.25 GB/day
  • 10B devices * 80K attributes: 100 TB/day
• Approximate
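The storage figures for the 1-bit-per-device approach follow from simple arithmetic. The check below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which is what makes the slide's numbers come out exactly.

```python
DEVICES = 10_000_000_000   # 10B devices (from the slide)
ATTRIBUTES = 80_000        # 80K attributes (from the slide)
BITS_PER_BYTE = 8

# One bit per device gives one bitmap per day.
bytes_per_bitmap = DEVICES // BITS_PER_BYTE
gb_per_day = bytes_per_bitmap / 10**9

# One such bitmap per attribute per day.
tb_per_day_all_attributes = bytes_per_bitmap * ATTRIBUTES / 10**12

print(f"{gb_per_day} GB/day for a single bitmap")
print(f"{tb_per_day_all_attributes} TB/day across all attributes")
```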
ThetaSketch
• K Minimum Values (KMV)
• Estimate set cardinality
• Supports set-theoretic operations
[Figure: Venn diagram of sets X and Y]
• ThetaSketch mathematical framework - generalization of KMV
[Figure: Venn diagram of sets X and Y]
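The KMV idea and its set operations can be sketched as a toy Python class. This is a simplified illustration under stated assumptions, not the DataSketches implementation: the names `MiniThetaSketch`, `hash01`, and the `*_estimate` helpers are invented here, SHA-256 stands in for the real hash function, and the device ids are synthetic. Each sketch keeps its k smallest hash values, all below a threshold theta, and estimates cardinality as (retained count) / theta.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) via SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class MiniThetaSketch:
    """Toy KMV-style sketch: retains the k smallest hashes, all below theta."""

    def __init__(self, k: int):
        self.k = k
        self.theta = 1.0      # every retained hash is < theta
        self.hashes = set()   # retained hash values
        self._heap = []       # max-heap (negated values) over self.hashes

    def add(self, item: str) -> None:
        h = hash01(item)
        if h >= self.theta or h in self.hashes:
            return            # too large to matter, or a repeated element
        self.hashes.add(h)
        heapq.heappush(self._heap, -h)
        if len(self._heap) > self.k:
            # Evict the largest hash; it becomes the new threshold.
            self.theta = -heapq.heappop(self._heap)
            self.hashes.discard(self.theta)

    def estimate(self) -> float:
        return len(self.hashes) / self.theta


def _estimate(a, b, hashes) -> float:
    """Count the given hashes below the smaller theta and scale up."""
    theta = min(a.theta, b.theta)
    return sum(1 for h in hashes if h < theta) / theta


def union_estimate(a, b) -> float:
    return _estimate(a, b, a.hashes | b.hashes)


def intersection_estimate(a, b) -> float:
    return _estimate(a, b, a.hashes & b.hashes)


def difference_estimate(a, b) -> float:
    return _estimate(a, b, a.hashes - b.hashes)


# Synthetic data: X has 60,000 devices, Y has 60,000, and they share 20,000.
x, y = MiniThetaSketch(k=4096), MiniThetaSketch(k=4096)
for i in range(60_000):
    x.add(f"device-{i}")
for i in range(40_000, 100_000):
    y.add(f"device-{i}")

UNION = union_estimate(x, y)                 # true value: 100,000
INTERSECTION = intersection_estimate(x, y)   # true value: 20,000
X_NOT_Y = difference_estimate(x, y)          # true value: 40,000
print(round(UNION), round(INTERSECTION), round(X_NOT_Y))
```

Because both sketches hash with the same function, a device present in both sets produces the same hash in both, so intersections and differences can be computed on the retained hashes alone. The relative error of an intersection grows as the intersection gets small compared with the union.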
ThetaSketch Error
Sketch size (k)   Error at 1 std dev       Error at 2 std dev
                  (68.27% confidence)      (95.45% confidence)
128               8.87%                    17.75%
8,192             1.10%                    2.21%
16,384            0.78%                    1.56%
32,768            0.55%                    1.10%
65,536            0.39%                    0.78%
131,072           0.28%                    0.55%
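The error figures above are consistent with a relative standard error of 1/sqrt(k - 1), where k is the sketch size in the first column. That formula is an inference from the numbers rather than something stated on the slide. The short check below reproduces the table under that assumption.

```python
import math


def relative_standard_error(k: int) -> float:
    """Assumed error model: RSE = 1 / sqrt(k - 1) for a sketch of size k."""
    return 1.0 / math.sqrt(k - 1)


SKETCH_SIZES = [128, 8192, 16384, 32768, 65536, 131072]

# For each k: (error at 1 std dev, error at 2 std dev), formatted as percentages.
ERROR_TABLE = {
    k: tuple(f"{100 * n * relative_standard_error(k):.2f}%" for n in (1, 2))
    for k in SKETCH_SIZES
}

for k, (one_sigma, two_sigma) in ERROR_TABLE.items():
    print(f"{k:>7,}  {one_sigma:>7}  {two_sigma:>7}")
```

Halving the error requires roughly four times the sketch size, which is the trade-off to keep in mind when choosing k.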
Solution using Elasticsearch
• Document structure
{
  "id": "3a4c1f2d84a5c179435c1fea86e6ae02",
  "events": [
    {
      "date": "15-11-2016",
      "attributes": [ "111111", "222222" ]
    },
    {
      "date": "16-11-2016",
      "attributes": [ "222222", "333333" ]
    }
  ]
}
• Exploit the Elasticsearch inverted index
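The index lookup this approach relies on can be illustrated with a toy index in Python. This is not how Elasticsearch stores or queries nested documents; it only shows the principle of mapping each (date, attribute) pair to the set of device ids that contain it. The first document is the one shown above. The second document and the helper names are hypothetical.

```python
from collections import defaultdict

DOCUMENTS = [
    {
        "id": "3a4c1f2d84a5c179435c1fea86e6ae02",
        "events": [
            {"date": "15-11-2016", "attributes": ["111111", "222222"]},
            {"date": "16-11-2016", "attributes": ["222222", "333333"]},
        ],
    },
    {
        # Hypothetical second device, added only to make the counts non-trivial.
        "id": "5dd59f9bd068f802a7c6dd832bf60d02",
        "events": [{"date": "15-11-2016", "attributes": ["222222"]}],
    },
]


def build_inverted_index(documents):
    """Map (date, attribute) -> set of device ids that have that attribute on that date."""
    index = defaultdict(set)
    for doc in documents:
        for event in doc["events"]:
            for attribute in event["attributes"]:
                index[(event["date"], attribute)].add(doc["id"])
    return index


def count_devices(index, day, all_of):
    """Count devices that have every attribute in `all_of` on `day`."""
    postings = [index.get((day, attribute), set()) for attribute in all_of]
    return len(set.intersection(*postings)) if postings else 0


INDEX = build_inverted_index(DOCUMENTS)
print(count_devices(INDEX, "15-11-2016", ["111111", "222222"]))
print(count_devices(INDEX, "15-11-2016", ["222222"]))
```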
Elasticsearch Issues
• Indexing data
  • 250 GB of daily data takes 10 hours to index
  • Indexing affects query time
  • Large index: 2.5 TB
• Querying
  • Low concurrency
  • Each query spans all the machines in the cluster
• Cost
  • $100K monthly
What We Tried
• Pre-processing
  • Too many combinations
• HyperLogLog
  • No good support for set-theoretic operations
  • Calculated during query time
Druid Solution
(timestamp,device_id,attribute)
ThetaSketchAggregator
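A query against such a data source might look like the sketch below, built as a Python dict and serialized to JSON. The aggregator and post-aggregator type names (`thetaSketch`, `thetaSketchSetOp`, `thetaSketchEstimate`) follow the druid-datasketches extension as commonly documented and should be verified against the Druid version in use. The data source name, metric name, output names, and interval are hypothetical.

```python
import json


def theta_for_attribute(name, attribute):
    """Theta sketch of devices, restricted to rows carrying one attribute."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
        "aggregator": {
            "type": "thetaSketch",
            "name": name,
            "fieldName": "device_id_sketch",  # hypothetical metric built at ingestion
            "size": 16384,
        },
    }


QUERY = {
    "queryType": "timeseries",
    "dataSource": "device_attributes",        # hypothetical data source name
    "granularity": "all",
    "intervals": ["2016-11-01/2016-12-01"],   # hypothetical date range
    "aggregations": [
        theta_for_attribute("devices_111111", "111111"),
        theta_for_attribute("devices_222222", "222222"),
    ],
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_with_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "devices_111111"},
                    {"type": "fieldAccess", "fieldName": "devices_222222"},
                ],
            },
        }
    ],
}

QUERY_JSON = json.dumps(QUERY, indent=2)
print(QUERY_JSON)
```

The set operation runs on sketches that were built at ingestion time, so the query only merges small pre-aggregated structures instead of scanning raw device ids.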
Benchmark
• Druid cluster: 1x Broker (r3.8xlarge), 8x Historical (r3.8xlarge)
• Elasticsearch cluster: 20 nodes (r3.8xlarge)
Druid vs. ES
                 Druid          ES
Daily data       10TB           250GB
Indexing time    4 Hours        10 Hours
Index size       160GB          2.5TB
Query latency    280ms-350ms    500ms-6000ms
Cost             $40K/mo        $100K/mo
Druid - DevconTLV X
We Are Hiring
❏ Web application team leader
❏ Frontend developer
❏ Java developer & machine learning
❏ Senior Java developer
❏ IT Production Engineer
❏ Node.js Developer
http://exelate.com/about-us/careers
