Deep Dive Into Druid
Sep 2017
Agenda
➔ Where is Druid?
➔ Columnar Databases / Distributed Database
➔ Druid Concepts
➔ Indexing Process
➔ Querying Process
➔ Some Benchmarks
➔ Monitoring / Logging
Columnar Databases + Distributed Databases
Row Store
Row Store - On Disk
Column Store
Column Store - On Disk
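The row-store and column-store slides can be illustrated with a small sketch. The table, its column names and its values are made up for this example; the point is only that an aggregation over one column reads a contiguous run of values in the columnar layout.

```python
# Toy table of (timestamp, country, clicks); all values are invented for illustration.
rows = [
    ("2017-09-01", "US", 10),
    ("2017-09-01", "IL", 7),
    ("2017-09-02", "US", 3),
]


def row_layout(rows):
    """Row store: the fields of each record sit next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column store: the values of each column sit next to each other."""
    return [list(column) for column in zip(*rows)]


row_disk = row_layout(rows)
col_disk = column_layout(rows)

# Row layout: summing clicks means striding across every record.
clicks_from_rows = sum(row_disk[2::3])
# Column layout: summing clicks reads one contiguous column and nothing else.
clicks_from_columns = sum(col_disk[2])
```

Both sums give the same answer; the difference is how much unrelated data has to be scanned to get there.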
Timestamped Column Store
Distributed Database - Sharding
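A minimal sketch of sharding, assuming a hash-based scheme (the slides do not say which scheme they illustrate). The key names, the shard count and the choice of CRC32 are all assumptions made for the example.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard they hash to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user-{i}" for i in range(10)]  # hypothetical record keys
shards = distribute(keys, num_shards=3)
```

Because the hash is stable, any node can work out which shard owns a key without consulting a central directory.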
Druid Concepts
What is Druid?
• OLAP
• Columnar
• Real-Time
• Distributed
• Optimized for Aggregations
• Optimized for query performance
• Horizontally Scalable
What it’s not:
• Relational - (no joins!)
– lookup feature as partial solution
• Good at ad-hoc row retrieval
• Fully ACID
– no transactions
– eventually consistent
• Easy to change/delete data
• Simple
Data Model
• Cluster - An instance of the Druid system
• Data Source - A collection of data, equivalent to a table
• Segment - An immutable, timestamped part of a data source (a file)
• Shard / Partition - For big segments, enabling partitioning splits the segment
Data Source “Tracker_Stats_Zynga”
• Segment May-17
– Partition P0
– Partition P1
• Segment June-17
– Partition P0
– Partition P1
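The data source / segment / partition hierarchy can be modelled roughly as follows. This is a sketch of the concepts only, not Druid's internal classes; the class names and the identifier format returned by `partition_ids` are hypothetical. Segments and partitions are frozen to mirror the "immutable" property from the data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    number: int


@dataclass(frozen=True)
class Segment:
    interval: str                      # time chunk the segment covers
    partitions: Tuple[Partition, ...]  # big segments are split into partitions


@dataclass
class DataSource:
    name: str
    segments: List[Segment] = field(default_factory=list)

    def partition_ids(self):
        """Hypothetical identifier format, used only for display in this sketch."""
        return [
            f"{self.name}/{segment.interval}/P{partition.number}"
            for segment in self.segments
            for partition in segment.partitions
        ]


tracker = DataSource(
    name="Tracker_Stats_Zynga",
    segments=[
        Segment("May-17", (Partition(0), Partition(1))),
        Segment("June-17", (Partition(0), Partition(1))),
    ],
)
```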
• Deep Storage (Segments Store) - S3 / HDFS
• Indexing Service (Batch / RT) - Overlord, MiddleManagers
• Query Layer - Brokers, Historicals
• Management - Coordinator, ZooKeeper, Metadata-DB
Indexing Service
Indexing Process - Batch Ingestion Task
Diagram labels: Overlord; Singular Load-Launcher; API; Pending, Waiting, Running, Completed; Peons (x4); MiddleManagers (x2); EMR Indexing; To S3; Singular Load-Monitor; ZooKeeper; Metadata-DB
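As a rough sketch of how a batch ingestion task might be handed to the Overlord: Druid's indexing service accepts task JSON over HTTP at `/druid/indexer/v1/task`. The host name, interval and input path below are hypothetical, and the spec is deliberately abbreviated (a real Hadoop-based task also needs a parser, metrics and tuning configuration), so treat it as an outline of the request rather than a working spec. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical host; 8090 is assumed here as the Overlord port.
OVERLORD_URL = "http://overlord.example.com:8090"


def build_batch_task(data_source, interval, input_paths):
    """Abbreviated Hadoop-based batch task; parser, metrics and tuning are omitted."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "MONTH",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": input_paths},
            },
        },
    }


def build_submit_request(task):
    """Prepare (but do not send) the POST that submits a task to the Overlord."""
    return urllib.request.Request(
        url=f"{OVERLORD_URL}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


task = build_batch_task(
    "Tracker_Stats_Zynga",
    "2017-05-01/2017-06-01",                  # hypothetical interval
    "s3n://example-bucket/tracker/2017-05/",  # hypothetical input path
)
request = build_submit_request(task)
```

Passing `request` to `urllib.request.urlopen` would submit the task; a launcher process could then poll the task status until it completes.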
Indexing Console
Querying Infra
Loading Segments For Query
Diagram labels: Metadata-DB; Deep Storage; Coordinator; Historicals (x3); Segments; Disk; RAM
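A toy sketch of the idea behind this slide: something has to decide which Historical loads which segment from deep storage. The greedy "least-loaded node" rule below is an assumption chosen for simplicity and is not Druid's actual balancing logic; the segment names, sizes and node names are invented.

```python
def assign_segments(segment_sizes, historicals):
    """Place each segment, largest first, on the historical currently holding the least data."""
    load = {name: 0 for name in historicals}
    assignment = {name: [] for name in historicals}
    for segment, size in sorted(segment_sizes.items(), key=lambda item: -item[1]):
        target = min(historicals, key=lambda name: load[name])
        assignment[target].append(segment)
        load[target] += size
    return assignment, load


# Hypothetical segment sizes in MB.
segments = {
    "May-17/P0": 500,
    "May-17/P1": 400,
    "June-17/P0": 300,
    "June-17/P1": 200,
}
assignment, load = assign_segments(segments, ["historical-1", "historical-2"])
```

With these sizes the two nodes end up holding 700 MB each.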
Coordinator Console
Query Process
Diagram labels: Singular WebApp / API; Brokers (x2); Historicals (x3); Aggregations; Lookups; Segments; Disk; RAM; Cache
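The query path can be sketched as scatter-gather: a Broker fans a query out to the Historicals, each Historical aggregates over the segments it holds, and the Broker merges the partial results. The rows, node names and the placement of the cache at the Broker are assumptions made for this illustration.

```python
from collections import Counter

# Rows held by each historical as (country, clicks); the data is invented.
historicals = {
    "historical-1": [("US", 10), ("IL", 7)],
    "historical-2": [("US", 3), ("IL", 1), ("FR", 4)],
}


def partial_aggregate(rows):
    """Work done on one historical: aggregate only the rows it holds."""
    totals = Counter()
    for country, clicks in rows:
        totals[country] += clicks
    return totals


class Broker:
    def __init__(self, historicals):
        self.historicals = historicals
        self.cache = {}  # per-historical partial results (cache placement is an assumption)

    def clicks_by_country(self):
        merged = Counter()
        for name, rows in self.historicals.items():
            if name not in self.cache:
                self.cache[name] = partial_aggregate(rows)
            merged.update(self.cache[name])  # merge partial sums
        return dict(merged)


broker = Broker(historicals)
result = broker.clicks_by_country()
```

A second call to `clicks_by_country` is served from the cached partial results and returns the same totals.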
Numbers
➔ Druid Servers (Prod): 20
➔ EMR Indexing Servers: 32
➔ AVG Indexing Task Time: 4 Minutes
➔ Daily Indexing Tasks: 15,000
➔ Total Cluster Size: 3 TB
➔ Time to deploy a new cluster: 5 Minutes
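A back-of-the-envelope reading of these figures, assuming tasks are spread evenly over the day and across the EMR indexing servers (the slides do not state either):

```python
daily_tasks = 15_000
avg_task_minutes = 4
emr_servers = 32
minutes_per_day = 24 * 60

# Total indexing work per day, in task-minutes.
task_minutes_per_day = daily_tasks * avg_task_minutes

# Average number of tasks running at any moment, if load were perfectly even.
avg_concurrent_tasks = task_minutes_per_day / minutes_per_day

# Average concurrent tasks per EMR indexing server under the same assumption.
avg_tasks_per_server = avg_concurrent_tasks / emr_servers
```

That works out to 60,000 task-minutes per day, roughly 42 tasks in flight on average, or about 1.3 per server. Real load is likely burstier than this average suggests.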
The End
Questions?
