Druid and Hive Together : Use Cases and Best Practices

© Cloudera, Inc. All rights reserved.
Nishant Bangarwa

AGENDA
Motivation
Introduction to Druid
Hive and Druid
Performance Numbers
Demo

Database popularity trend in last 24 months

Challenges with specialized DBs
• Each specialized DB has its own dialect and API
• Diverse security and audit mechanisms
• Different governance models
• Data from different sources needs to be combined at client side
• Need a solution to provide performance without added complexity

Query Federation with Apache Hive
Extensible Storage Handler
• Input Format
• Output Format
• SerDe
• Rules for pushing computations
• Filters, Aggregates, Sort, Limit, etc.
• Transform from SQL to special dialects

Introduction to Apache Druid
High performance analytics data store for timeseries data

Companies Using Druid
http://druid.io/druid-powered

When to use Druid?
• Event Data/ Timeseries data
• Realtime – Need to analyze events as they happen.
• Delays can lead to business loss e.g. Fraud Detection
• High Data Ingestion rate
• Scalable horizontally
• Queries generally involve aggregations and filtering on time
• Results for last quarter
• Aggregate comparisons over time, this week compared to last week etc.
• Result set is much smaller than the actual dataset being queried

Common Use Cases
• User activity and behavior analysis
• clickstreams, viewstreams and activity streams
• measuring user engagement, tracking A/B test data for product releases, and understanding usage patterns
• Application performance management
• operational data generated by applications
• identify bottlenecks and troubleshoot issues in real time
• IoT and device metrics
• Ingest machine generated data in real-time
• optimize hardware resources, identify issues, anomaly detection.
• Digital marketing
• understand advertising campaign performance, click through rates, conversion rates

When NOT to use Druid?
• updating existing records using a primary key
• updates need to be done via Rebuilding Segments (Re-Ingestion)
• Queries involve dumping entire dataset
• joining one big fact table to another big fact table
• query latency is not very important for business use case
• offline reporting system

Key Druid Features
• Column-oriented Storage
• Sub-Second query times
• Arbitrary slicing and dicing of data
• Native Search Indexes
• Horizontally Scalable
• Streaming and Batch Ingestion
• Automatic Data Summarization
• Time based partition
• Flexible Schemas
• Rolling Upgrades

Druid Concepts
Time Based Partitioning
1. Time partitioned Segment Files
2. Segments are versioned to support batch overrides
3. By Segment Query Results are Cached
Segments along the time axis:
• Monday: Segment 1 (version1)
• Tuesday: Segment 2 (version1)
• Wednesday: Segment 3 (version2)
• Thursday: Segment 4 (version1)
• Friday: Segment 5_1 (version1), Segment 5_2 (version1)
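
The versioning rule above can be sketched in a few lines of Python. This is a toy model, not Druid's implementation: the integer versions, the tuple layout, and the extra "Wednesday version 1" segment are assumptions added only to show an override (the slide shows just the version2 segment).

```python
from collections import defaultdict

# Hypothetical in-memory model: (time interval, version, partition number).
SEGMENTS = [
    ("Monday", 1, 0),
    ("Tuesday", 1, 0),
    ("Wednesday", 1, 0),  # assumed older segment, replaced by a batch override
    ("Wednesday", 2, 0),
    ("Thursday", 1, 0),
    ("Friday", 1, 1),     # Segment 5_1
    ("Friday", 1, 2),     # Segment 5_2
]


def visible_segments(segments):
    """Return, per time interval, the partitions of the highest version."""
    latest = {}
    for interval, version, _ in segments:
        latest[interval] = max(version, latest.get(interval, version))
    visible = defaultdict(list)
    for interval, version, partition in segments:
        if version == latest[interval]:
            visible[interval].append((version, partition))
    return dict(visible)


if __name__ == "__main__":
    for interval, parts in visible_segments(SEGMENTS).items():
        print(interval, parts)
```

In this model a query over Wednesday only ever sees the version 2 segment, while Friday keeps both partitions because they share the same version.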

Druid Architecture
[Figure: Druid architecture. Components shown: streaming data, realtime index tasks, realtime nodes, batch data, historical nodes, broker nodes. Realtime segments are handed off to historical nodes.]

Apache Hive and Apache Druid
Apache Hive:
• Large Scale Queries
• Joins, Subqueries
• Windowing Functions
• Transformations
• Complex Aggregations
• Advanced Sorting
• UDFs
Apache Druid:
• Queries to power visualizations
• Needles-in-a-haystack
• Dimensional Aggregates
• TopN queries
• Timeseries Queries
• Min/Max Values
• Streaming Ingestion

Integration Benefits
1. Streaming Ingestion
2. Single SQL dialect and API
3. Central security controls and audit trail
4. Unified governance
5. Ability to combine data from multiple sources
6. Data independence

Druid data sources in Hive
Registering Existing Druid data sources
Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
• druid_table_1: Hive table name
• org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler classname
• wikiticker: Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
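
As a rough illustration of schema discovery, the sketch below builds a Druid `segmentMetadata` query body and derives Hive column types from a response. The sample response is hand-written and the type mapping in `ASSUMED_TYPE_MAP` is an assumption for illustration; neither is taken from the Hive storage handler's code.

```python
import json

# Illustrative mapping only; the real storage handler may map types differently.
ASSUMED_TYPE_MAP = {
    "STRING": "string",
    "LONG": "bigint",
    "FLOAT": "float",
    "DOUBLE": "double",
}

# Hand-written sample shaped like a segmentMetadata result (assumption).
SAMPLE_RESPONSE = [
    {
        "columns": {
            "__time": {"type": "LONG"},
            "page": {"type": "STRING"},
            "c_added": {"type": "LONG"},
        }
    }
]


def segment_metadata_query(data_source):
    """Build the JSON body of a segmentMetadata query for one data source."""
    return {"queryType": "segmentMetadata", "dataSource": data_source}


def infer_hive_columns(response):
    """Merge column info across segments into {column name: Hive type}."""
    columns = {}
    for segment in response:
        for name, info in segment["columns"].items():
            if name == "__time":
                columns[name] = "timestamp"  # Druid's time column
            else:
                columns[name] = ASSUMED_TYPE_MAP.get(info["type"], "string")
    return columns


if __name__ == "__main__":
    print(json.dumps(segment_metadata_query("wikiticker")))
    print(infer_hive_columns(SAMPLE_RESPONSE))
```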

Creating Druid data sources
Use Create Table As Select (CTAS) statement
CREATE EXTERNAL TABLE druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY") AS
SELECT `__time`, `page`, `user`, `c_added`, `c_removed` FROM src;
• druid_table: Hive table name
• "druid.segment.granularity": Druid segment granularity
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type

File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Optional Data Summarization
Original CTAS physical plan: Table Scan → Select → File Sink
CTAS query results:
__time                page    user   c_added  c_removed
2011-01-01T01:05:00Z  Justin  Boxer  1800     25
2011-01-02T19:00:00Z  Justin  Reach  2912     42
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
2011-01-02T18:00:00Z  Miley   Ashu   2232     34

Rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink
CTAS query results:
__time                page    user   c_added  c_removed  __time_granularity
2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
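
The extra `__time_granularity` column can be reproduced with a short Python sketch: each timestamp is floored to the segment granularity (DAY here) and rows are grouped by that value, which is roughly what the added Reduce step partitions on. The function names are illustrative and not part of Hive.

```python
from collections import defaultdict
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Sample rows from the slide: (__time, page, user, c_added, c_removed).
ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]


def truncate_to_day(ts):
    """Floor an ISO-8601 UTC timestamp string to the start of its day."""
    parsed = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(TS_FORMAT)


def partition_by_granularity(rows):
    """Group rows by their day bucket, one bucket per future segment interval."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[truncate_to_day(row[0])].append(row)
    return dict(buckets)


if __name__ == "__main__":
    for bucket, rows in sorted(partition_by_granularity(ROWS).items()):
        print(bucket, [r[2] for r in rows])
```

For the five sample rows this yields two buckets, 2011-01-01 with two rows and 2011-01-02 with three, matching the `__time_granularity` values in the table.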

Creating Streaming Druid data sources
Use CREATE EXTERNAL TABLE statement with Kafka properties
CREATE EXTERNAL TABLE druid_streaming
(`__time` timestamp, `dimension1` string, `metric1` int, `metric2` double, ...)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY",
"kafka.bootstrap.servers" = "localhost:9092", "kafka.topic" = "topic1");
• druid_streaming: Hive table name
• "druid.segment.granularity": Druid segment granularity
• "kafka.bootstrap.servers", "kafka.topic": Kafka related properties

Managing Streaming Ingestion from Hive
Use Alter Table statement
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'START');
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'STOP');
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'RESET');
• druid_streaming: Hive table name
• 'druid.kafka.ingestion': Kafka related property
⇢ RESET resets the offsets that Druid maintains for ingestion

Querying Druid datasources
• Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different
kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
• Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
• Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
• It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed

Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Top 10 users who added the most characters from the beginning of 2010 until the end of 2011
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink

• Initial Plan:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive

• Rewriting rules push computation into Druid
– Need to check that an operator meets some pre-conditions before pushing it to Druid
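
A toy Python model of this pushdown is sketched below. It is not Calcite's rule API: the plan is a plain bottom-up list of operator names, the `can_push` predicate stands in for the real pre-condition checks, and the query type is simplified to select versus groupBy (the real integration also produces Timeseries and TopN queries).

```python
# Bottom-up logical plan from the slide, modelled as operator names only.
PLAN = ["Druid Scan", "Filter", "Project", "Aggregate", "Sort Limit", "Sink"]

# Assumed set of operators whose pre-conditions hold for this query.
PUSHABLE = {"Filter", "Project", "Aggregate", "Sort Limit"}


def push_down(plan, can_push):
    """Absorb operators into the Druid query until one fails its pre-condition.

    Returns (druid_query_type, operators_pushed_to_druid, operators_left_in_hive).
    """
    if not plan or plan[0] != "Druid Scan":
        raise ValueError("plan must start with a Druid Scan")
    pushed = []
    remaining = list(plan[1:])
    while remaining and can_push(remaining[0]):
        pushed.append(remaining.pop(0))
    query_type = "groupBy" if "Aggregate" in pushed else "select"
    return query_type, pushed, remaining


if __name__ == "__main__":
    print(push_down(PLAN, lambda op: op in PUSHABLE))
    # With no pushable operators the plan stays at the initial state:
    print(push_down(PLAN, lambda op: False))
```

Stopping at the first operator that cannot be pushed keeps the contract that the query is always executed: whatever is not absorbed simply runs in Hive on top of the Druid results.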

[Figure: the rewriting rules applied step by step to the query logical plan (Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink). As operators are pushed into the Druid Scan, the Druid query changes from select to groupBy.]

Final plan: the Druid query (groupBy) covers Druid Scan, Filter, Project, Aggregate, and Sort Limit; Hive keeps the Sink
Druid JSON query:
{
  "queryType": "groupBy",
  "dataSource": "users_index",
  "granularity": "all",
  "dimensions": ["user"],
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Query physical plan: Druid Scan → File Sink
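
To make the shape of the generated query concrete, here is a small Python sketch that assembles an equivalent groupBy body as a dictionary and serializes it. The helper name and its parameters are hypothetical; the field names follow Druid's native groupBy format, which expects `dimensions` as a list and a `type` on the `limitSpec`.

```python
import json


def build_group_by(data_source, dimension, metric, alias, limit, interval):
    """Build a Druid groupBy query: sum one metric per dimension value, top N."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimensions": [dimension],
        "aggregations": [
            {"type": "longSum", "name": alias, "fieldName": metric}
        ],
        "limitSpec": {
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": alias, "direction": "descending"}],
        },
        "intervals": [interval],
    }


QUERY = build_group_by(
    data_source="users_index",
    dimension="user",
    metric="c_added",
    alias="s",
    limit=10,
    interval="2010-01-01T00:00:00.000/2012-01-01T00:00:00.000",
)

if __name__ == "__main__":
    print(json.dumps(QUERY, indent=2))
```

The `EXTRACT(year ...) BETWEEN 2010 AND 2011` filter from the SQL query shows up only as the `intervals` entry, which is why no separate filter clause appears in the body.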

Druid input format
• Submits query to Druid and generates records out of the query results
• Current version
– Timeseries, TopN, and GroupBy queries are not partitioned; they are sent directly to the Druid broker
– Scan queries: realtime and historical nodes are contacted directly
[Figure: Table Scan operators, each with a record reader, contacting Druid nodes. Panels: "Timeseries, TopN, GroupBy" and "Select".]

Performance and Scalability: Fast Facts
• Most Events per Day: 300 Billion Events / Day (Metamarkets)
• Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
• Largest Cluster: 200 Nodes (Snap Inc)
• Largest Hourly Ingestion: 2 TB per Hour (Netflix)

Performance Numbers
• Query Latency
• average: 500 ms
• 90th percentile < 1 sec
• 95th percentile < 5 sec
• 99th percentile < 10 sec
• Query Volume
• 1000s of queries per minute
• Benchmarking code
• https://github.com/druid-io/druid-benchmark

Performance Numbers
SSB Benchmark 1TB Scale

Useful Resources
• Druid website – http://druid.io
• Druid User Group – users@druid.incubator.apache.org
• Druid Dev Group – dev@druid.incubator.apache.org
• Hive Druid Integration – https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
• Blogs – https://hortonworks.com/blog/apache-hive-druid-part-1-3/
• Query Federation with Apache Hive – https://hortonworks.com/blog/query-federation-with-hive/
