Introduction to Druid
Gagan Gupta
How do we answer questions like these?
● Which channels were viewed during prime time in Karnataka over the weekend?
● How many ad requests did we get from a particular Android OS version?
● Given a region, to how many unique users are we showing ads per device?
● What are our sales this quarter?
● What was the trend this week as compared to last week?
SQL
[Figure: query latency, data storage]
Key Value Store
[Figure: query latency, data storage]
Map-Reduce/Spark
[Figure: query latency, data storage]
Introduction to Druid and Druidry
Druid
[Figure: query latency, data storage]
Druid
Time-series
Real-time
Interactive
Column oriented
Dictionary Encoding
Device    City       Page     Time stayed
Pixel 2   Bengaluru  Landing  60
Pixel 2   Gurugram   Landing  30
iPhone X  Bengaluru  Landing  10
Bengaluru : 0
Gurugram : 1
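
The City mapping above can be sketched in a few lines of Java. This is only an illustration of the idea, not Druid's actual implementation; the class and method names are hypothetical. Each distinct string gets the next free integer id, and the column is then stored as small integers instead of repeated strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dictionary encoding for a single string column.
class DictionaryEncodingSketch {

    // Assign each distinct value the next free id, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<String> column) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String value : column) {
            dictionary.putIfAbsent(value, dictionary.size());
        }
        return dictionary;
    }

    // Replace every value in the column with its integer id.
    static List<Integer> encode(List<String> column, Map<String, Integer> dictionary) {
        List<Integer> encoded = new ArrayList<>();
        for (String value : column) {
            encoded.add(dictionary.get(value));
        }
        return encoded;
    }

    public static void main(String[] args) {
        // The City column from the table above.
        List<String> city = List.of("Bengaluru", "Gurugram", "Bengaluru");
        Map<String, Integer> dictionary = buildDictionary(city);
        System.out.println(dictionary);               // {Bengaluru=0, Gurugram=1}
        System.out.println(encode(city, dictionary)); // [0, 1, 0]
    }
}
```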
Roll up
Timestamp  Page           Language  Country  Added  Deleted
00:00:35   Justin Bieber  en        USA      10     65
00:00:53   Justin Bieber  en        USA      15     62
00:01:51   Justin Bieber  en        USA      32     45
00:01:11   Kesha          en        CA       17     87
00:01:24   Kesha          en        CA       43     99
00:02:03   Kesha          en        CA       12     53
Roll up
Timestamp  Page           Language  Country  Added  Deleted
00:00:00   Justin Bieber  en        USA      25     127
00:01:00   Justin Bieber  en        USA      32     45
00:01:00   Kesha          en        CA       60     186
00:02:00   Kesha          en        CA       12     53
Minute Level Rollup
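
A minimal sketch of minute-level roll-up in Java, under stated assumptions: the class and method names are hypothetical, and the raw timestamps are illustrative values chosen so that the result matches the rolled-up table above. Rows sharing the same truncated timestamp and the same dimension values are merged, and their metrics (Added, Deleted) are summed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of roll-up: group by (minute, dimensions), sum the metrics.
class RollupSketch {

    // One raw event: a timestamp, three dimensions, and two metrics.
    static final class Event {
        final String time, page, language, country;
        final long added, deleted;

        Event(String time, String page, String language, String country,
              long added, long deleted) {
            this.time = time;
            this.page = page;
            this.language = language;
            this.country = country;
            this.added = added;
            this.deleted = deleted;
        }
    }

    // "00:01:51" -> "00:01:00" (assumes the hh:mm:ss layout used on the slides).
    static String truncateToMinute(String time) {
        return time.substring(0, 5) + ":00";
    }

    // Returns key -> {sum of added, sum of deleted}, in order of first appearance.
    static Map<String, long[]> rollUp(List<Event> events) {
        Map<String, long[]> rolled = new LinkedHashMap<>();
        for (Event e : events) {
            String key = String.join("|",
                    truncateToMinute(e.time), e.page, e.language, e.country);
            long[] sums = rolled.computeIfAbsent(key, k -> new long[2]);
            sums[0] += e.added;
            sums[1] += e.deleted;
        }
        return rolled;
    }

    // Illustrative raw events (timestamps are assumptions, see the note above).
    static List<Event> sampleEvents() {
        return List.of(
                new Event("00:00:35", "Justin Bieber", "en", "USA", 10, 65),
                new Event("00:00:53", "Justin Bieber", "en", "USA", 15, 62),
                new Event("00:01:51", "Justin Bieber", "en", "USA", 32, 45),
                new Event("00:01:11", "Kesha", "en", "CA", 17, 87),
                new Event("00:01:24", "Kesha", "en", "CA", 43, 99),
                new Event("00:02:03", "Kesha", "en", "CA", 12, 53));
    }

    public static void main(String[] args) {
        rollUp(sampleEvents()).forEach((key, sums) ->
                System.out.println(key + " added=" + sums[0] + " deleted=" + sums[1]));
    }
}
```

Six raw rows collapse into four. The trade-off is that the individual events can no longer be recovered once they have been rolled up.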
Inverted Index
Timestamp  Page           Language  Country  Added  Deleted
00:00:00   Justin Bieber  en        USA      25     127
00:01:00   Justin Bieber  en        USA      32     45
00:01:00   Kesha          en        CA       60     186
00:02:00   Kesha          en        CA       12     53
00:02:00   Kesha          en        IND      12     53
Justin Bieber [1,1,0,0,0]
Kesha [0,0,1,1,1]
IND [0,0,0,0,1]
Inverted Index
Justin Bieber [1,1,0,0,0]
Kesha [0,0,1,1,1]
IND [0,0,0,0,1]
Characters added for (Justin Bieber OR Kesha) from India
(11000 OR 00111) AND 00001
= 00001
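
The same filter can be sketched with java.util.BitSet. This is an illustration only: Druid uses compressed bitmaps internally, and the class and method names here are hypothetical. One bitmap per dimension value marks the rows that contain it, so a boolean filter becomes bitwise OR/AND, and only the matching rows of the Added column are read.

```java
import java.util.BitSet;

// Hypothetical sketch of filtering with an inverted (bitmap) index.
class InvertedIndexSketch {

    // Build a bitmap from 0/1 flags, one flag per row.
    static BitSet bitmap(int... flags) {
        BitSet set = new BitSet(flags.length);
        for (int row = 0; row < flags.length; row++) {
            if (flags[row] == 1) {
                set.set(row);
            }
        }
        return set;
    }

    // (justin OR kesha) AND india, without mutating the inputs.
    static BitSet matchingRows(BitSet justin, BitSet kesha, BitSet india) {
        BitSet result = (BitSet) justin.clone();
        result.or(kesha);
        result.and(india);
        return result;
    }

    // Sum the metric column over the rows whose bit is set.
    static long sumAdded(BitSet rows, long[] added) {
        long total = 0;
        for (int row = rows.nextSetBit(0); row >= 0; row = rows.nextSetBit(row + 1)) {
            total += added[row];
        }
        return total;
    }

    public static void main(String[] args) {
        BitSet justin = bitmap(1, 1, 0, 0, 0);
        BitSet kesha = bitmap(0, 0, 1, 1, 1);
        BitSet india = bitmap(0, 0, 0, 0, 1);
        long[] added = {25, 32, 60, 12, 12}; // the Added column from the table

        BitSet rows = matchingRows(justin, kesha, india);
        System.out.println(rows);                  // {4}
        System.out.println(sumAdded(rows, added)); // 12
    }
}
```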
Column Oriented
Architecture
Real-time node
Responsible for ingesting data in real time
Historical node
Where the data actually resides
Coordinator node
Decides which data (segments) should live on which node
Broker node
Routes each query to the correct nodes and merges their results
External Dependencies
1. Deep Storage (HDFS/S3)
2. ZooKeeper
3. Metadata store (MySQL)
Trade-offs
No joins
Poor for high-cardinality dimensions
Cannot query below the specified (roll-up) granularity
Exact unique counts not available
Druid @ Production
● 3+ trillion events/month
● 3M+ events/sec
● 100+ PB of raw data
● 50+ trillion events
● 1000 queries/sec
Query
POST <queryable_host>:<port>/druid/v2/?pretty
Payload: Query JSON
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}
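
As a rough sketch, the request above could be built with the JDK's own HTTP client (Java 11+). The host and port (`http://localhost:8082`) and the class and method names are assumptions for illustration; substitute the address of your own queryable node. The sketch only builds the request, since sending it requires a live cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: build the POST request for the topN query shown above.
class DruidQueryRequestSketch {

    // The topN query from the slide, as a compact JSON string.
    static final String TOP_N_QUERY = "{"
            + "\"queryType\":\"topN\","
            + "\"dataSource\":\"sample_data\","
            + "\"dimension\":\"sample_dim\","
            + "\"threshold\":5,"
            + "\"metric\":\"count\","
            + "\"granularity\":\"all\","
            + "\"aggregations\":[{\"type\":\"longSum\",\"name\":\"count\",\"fieldName\":\"count\"}],"
            + "\"intervals\":[\"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000\"]"
            + "}";

    // baseUrl is an assumption, e.g. "http://localhost:8082".
    static HttpRequest buildRequest(String baseUrl, String queryJson) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/druid/v2/?pretty"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(queryJson))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:8082", TOP_N_QUERY);
        System.out.println(request.method() + " " + request.uri());
        // Against a running cluster, the request could then be sent with:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```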
Dashboards
Pivot (no longer open source)
Turnilo (open-source fork of Pivot)
Superset by Airbnb
Metabase
http://druid.io/libraries.html
Messy stuff
Spelling mistakes
No type safety
No check that the query is correct
Poor readability
Druidry
An Open-Source Java Client for Druid
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}
DruidAggregator aggregator = new LongSumAggregator("count", "count");
Interval interval = new Interval(startTime, endTime);
DruidTopNQuery query = DruidTopNQuery.builder()
    .dataSource("sample_data")
    .dimension(new SimpleDimension("sample_dim"))
    .threshold(5)
    .granularity(new SimpleGranularity(PredefinedGranularity.ALL))
    .aggregators(Collections.singletonList(aggregator))
    .intervals(Collections.singletonList(interval))
    .build();
https://github.com/zapr-oss

Editor's Notes

  • #2: Introduction. There are two parts to the big data problem: gathering the data and making sense of it.
  • #3: Making sense of what is happening when you don’t know what you are looking for: BI queries of an exploratory nature, answered in milliseconds.
  • #4: Traditional DB solutions are not meant for data this large (millions of events streamed per day). They are meant for metadata and aggregates.
  • #5: We would need to precompute results for all combinations of dimensions. Querying is lightning fast. Storage is slightly better because the data is aggregated, but it grows exponentially. Meant for caches.
  • #6: Compression is available at the storage layer.
  • #7: (Demo here)
  • #9: Difference between real-time and interactive.
  • #11: Actual data
  • #12: Actual data
  • #13: Minute-level rollup: 6 rows -> 4 rows
  • #16: Fewer disk scans. Bad for fetching or updating data for a particular user.
  • #18: There is a node type for each function. If an external dependency fails, the cluster should not fail.
  • #19: Also stores some real-time data for query purposes.
  • #21: Recent data is spread out.
  • #22: Decides which historical nodes to query, and the real-time node if necessary.
  • #23: Deep storage: persistent. ZooKeeper: coordination between nodes, availability, and which node has which data. Metadata store: tier rules and segments.
  • #24: Another tool in the kit, not a replacement.
  • #27: Not feasible.
  • #28: http://druid.io/libraries.html What if the querying is programmatic? Programmatic querying can be painful.
  • #33: Officially listed
  • #34: Thanks. Building on your contributions.