Interactive Realtime
Dashboards on Data Streams
Nishant Bangarwa
Hortonworks
Druid Committer, PMC
June 2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sample Data Stream : Wikipedia Edits
Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
Step by Step Breakdown
Consume Events
Enrich / Transform (Add Geolocation from IP Address)
Store Events
Visualize Events
Sample Event : [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact
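A minimal sketch of how the sample event above could be split into fields before enrichment. The field names (page, url, user, delta, comment) and the regular expression are assumptions inferred from the single sample line; the real feed may contain variants this pattern does not cover.

```python
import re

# Assumed layout, inferred from the one sample line on this slide:
#   [[page]] url * user-or-ip * (+/-bytes) comment
EVENT_RE = re.compile(
    r"\[\[(?P<page>.+?)\]\]\s+(?P<url>\S+)\s+\*\s+(?P<user>.+?)\s+\*\s+"
    r"\((?P<delta>[+-]\d+)\)\s*(?P<comment>.*)"
)

def parse_edit(line):
    """Return a dict of fields, or None if the line does not match."""
    match = EVENT_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["delta"] = int(event["delta"])  # "+167" -> 167
    return event

sample = (
    "[[Eoghan Harris]] "
    "https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607"
    " * 7.114.169.238 * (+167) Added fact"
)
event = parse_edit(sample)
print(event["page"], event["user"], event["delta"], event["comment"])
```

Anonymous edits carry an IP address in the user field, which is what the geolocation step later keys on.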
Required Components
 Event Flow
 Event Processing
 Data Store
 Visualization Layer
Event Flow
Event Flow : Requirements
Event Producers → Queue → Event Consumers
 Low latency
 High Throughput
 Failure Handling
 Message delivery guarantees: at-least once, exactly once, at-most once
 Message ordering
 Scalability
 Fault tolerant
Apache Kafka
 Low Latency
 High Throughput
 Message Delivery guarantees
 At-least once
 Exactly once (fully introduced in Apache Kafka v0.11.0, June 2017)
 Reliable design to Handle Failures
 Message Acks between producers and brokers
 Data Replication on brokers
 Consumers can Read from any desired offset
 Handle multiple producers/consumers
 Scalable
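The offset and delivery-guarantee points above can be illustrated with a toy in-memory log. This is a plain-Python simulation under stated assumptions, not the Kafka client API: PartitionLog, consume, and the crash_before_commit_at parameter are hypothetical names. It shows why committing the offset only after processing gives at-least-once delivery: a crash between processing and commit causes a replay.

```python
class PartitionLog:
    """Toy append-only log; each record is addressed by its offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Consumers may start reading at any offset they choose.
        return list(enumerate(self._records))[offset:]


def consume(log, committed, handler, crash_before_commit_at=None):
    """Process records starting at `committed`; return the new committed offset."""
    for offset, value in log.read_from(committed):
        handler(value)
        if offset == crash_before_commit_at:
            return committed  # simulated crash: processed but never committed
        committed = offset + 1
    return committed


log = PartitionLog()
for value in ["a", "b", "c"]:
    log.append(value)

seen = []
committed = consume(log, 0, seen.append, crash_before_commit_at=1)
committed = consume(log, committed, seen.append)  # restart from last commit
print(seen, committed)
```

Record "b" is handled twice, so downstream processing must be idempotent unless exactly-once support is used.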
Event Processing
Event Processing : Requirements
 Consume-Process-Produce Pattern
 Enrich and Transform event streams
 Windowing
 Apply business logic
 Consume and Join multiple streams into single
 Failure Handling
 Scalability
Source → (Consume) → Process → (Produce) → Sink
Kafka Streams
 Rich Lightweight Stream processing library
 Event-at-a-time
 Stateful processing : windowing, joining, aggregation operators
 Local state using RocksDB
 Backed by changelog in Kafka
 Highly scalable, distributed, fault tolerant
 Compared to a standard Kafka consumer:
 Higher level: faster to build a sophisticated app
 Less control for very fine-grained consumption
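To make the windowing operator concrete, here is a minimal sketch of a tumbling-window count in plain Python. It is an illustration of the idea only, not the Kafka Streams API; the function name, the (epoch_seconds, key) event shape, and the in-memory dict standing in for the local state store are all assumptions.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) pairs.
    """
    counts = defaultdict(int)  # stand-in for a local state store
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "en"), (30, "en"), (59, "de"), (60, "en"), (125, "en")]
counts = tumbling_window_counts(events, window_seconds=60)
print(counts)
```

In a real stream processor the counts would live in a fault-tolerant state store rather than a local dict, so that they survive restarts.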
Kafka Streams : Wikipedia Data Enrichment
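The enrichment step follows the consume-process-produce pattern: read a raw event, look up the geolocation for its IP address, and emit an enriched event. A minimal plain-Python sketch, assuming a hypothetical in-memory lookup table (GEO_BY_IP) in place of a real GeoIP database; the field names and placeholder location values are illustrative, not taken from the talk.

```python
# Hypothetical lookup table; a real pipeline would query a GeoIP database.
GEO_BY_IP = {"7.114.169.238": {"city": "Example City", "country": "XX"}}
UNKNOWN = {"city": None, "country": None}

def enrich(raw_events, geo_by_ip):
    """Yield a copy of each event with city and country fields added."""
    for event in raw_events:
        # Registered users have a name instead of an IP, so the lookup misses.
        geo = geo_by_ip.get(event["user"], UNKNOWN)
        yield {**event, **geo}

raw = [
    {"page": "Eoghan Harris", "user": "7.114.169.238", "delta": 167},
    {"page": "Some Page", "user": "SomeEditor", "delta": -4},
]
enriched = list(enrich(raw, GEO_BY_IP))
for event in enriched:
    print(event)
```

Each input event is left unmodified; the enriched copy is what would be produced to the downstream topic.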
Data Store
Data Store : Requirements
Processed Events → Data Store ← Queries
 Ability to ingest Streaming data
 Power Interactive dashboards
 Sub-Second Query Response time
 Ad-hoc arbitrary slicing and dicing of data
 Data Freshness
 Summarized/aggregated data is queried
 Scalability
 High Availability
Druid
 Column-oriented distributed datastore
 Sub-Second query times
 Realtime streaming ingestion
 Arbitrary slicing and dicing of data
 Automatic Data Summarization
 Approximate algorithms (HyperLogLog, theta sketches)
 Scalable to petabytes of data
 Highly available
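As an illustration of slicing and dicing, the sketch below builds a Druid native topN query as a Python dict and prints it as JSON. The datasource name "wikipedia", the column names (page, language, sum_added, count), and the interval are assumptions for this example; the query is only constructed here, not sent. Native queries of this shape are normally POSTed as JSON to a broker.

```python
import json

def top_pages_query(datasource, interval, limit=5):
    """Build a topN query: top pages by bytes added, English edits only."""
    return {
        "queryType": "topN",
        "dataSource": datasource,        # assumed datasource name
        "intervals": [interval],         # ISO-8601 start/end
        "granularity": "all",
        "dimension": "page",
        "metric": "added",               # rank by this aggregation
        "threshold": limit,
        "filter": {"type": "selector", "dimension": "language", "value": "en"},
        "aggregations": [
            {"type": "longSum", "name": "added", "fieldName": "sum_added"},
            {"type": "longSum", "name": "edits", "fieldName": "count"},
        ],
    }

query = top_pages_query("wikipedia", "2011-01-01/2011-01-03")
print(json.dumps(query, indent=2))
```

Changing the dimension or filter is all it takes to slice the same data a different way, which is what a dashboard does on each interaction.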
Suitable Use Cases
 Powering Interactive user facing applications
 Arbitrary slicing and dicing of large datasets
 User behavior analysis
 measuring distinct counts
 retention analysis
 funnel analysis
 A/B testing
 Exploratory analytics/root cause analysis
 Queries that do not need to dump the entire dataset
Druid: Segments
 Data in Druid is stored in Segment Files.
 Partitioned by time
 Ideally, segment files are each smaller than 1GB.
 If files are large, smaller time partitions are needed.
Time →
Segment 1: Monday
Segment 2: Tuesday
Segment 3: Wednesday
Segment 4: Thursday
Segment 5_1: Friday
Segment 5_2: Friday
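A minimal sketch of time partitioning, assuming rows are grouped into one segment per day and a day is split into numbered partitions once it grows too large (as with the two Friday segments above). Row count stands in for file size here, and the segment naming scheme and function name are hypothetical.

```python
from collections import defaultdict

def assign_segments(rows, max_rows_per_segment):
    """Group rows by day; split a day into numbered partitions if it is too big."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["timestamp"][:10]].append(row)  # "YYYY-MM-DD" prefix
    segments = {}
    for day, day_rows in sorted(by_day.items()):
        starts = range(0, len(day_rows), max_rows_per_segment)
        for part, start in enumerate(starts, 1):
            segments[f"{day}_{part}"] = day_rows[start:start + max_rows_per_segment]
    return segments

rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:04:51Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha"},
    {"timestamp": "2011-01-02T00:08:35Z", "page": "Selena Gomes"},
]
segments = assign_segments(rows, max_rows_per_segment=2)
for name, segment_rows in segments.items():
    print(name, len(segment_rows))
```

Because segments are keyed by time, a query over a given interval only needs to touch the segments that overlap it.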
Example Wikipedia Edit Dataset
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
Column groups: Timestamp (timestamp), Dimensions (page, language, city, country, …), Metrics (added, deleted)
Data Rollup
Raw events:
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
Rollup by hour:
timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  max_added  …
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         32
2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         43
2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         12
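The rollup above can be reproduced with a short sketch: truncate each timestamp to the hour, group by the truncated timestamp plus all dimensions, and aggregate the metrics. The function name and row layout are assumptions for illustration. Timestamps are copied verbatim from the slide and truncated as strings, so no date parsing is involved.

```python
ROWS = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:03:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T00:04:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T00:05:35Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T00:06:41Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-02T00:08:35Z", "Selena Gomes", "en", "Calgary", "CA", 12, 53),
]

def rollup_by_hour(rows):
    """Aggregate metrics per (hour, page, language, city, country)."""
    out = {}
    for ts, page, language, city, country, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # "2011-01-01T00" -> "2011-01-01T00:00:00Z"
        key = (hour, page, language, city, country)
        agg = out.get(key)
        if agg is None:
            out[key] = {"count": 1, "sum_added": added, "sum_deleted": deleted,
                        "min_added": added, "max_added": added}
        else:
            agg["count"] += 1
            agg["sum_added"] += added
            agg["sum_deleted"] += deleted
            agg["min_added"] = min(agg["min_added"], added)
            agg["max_added"] = max(agg["max_added"], added)
    return out

rolled = rollup_by_hour(ROWS)
for key, agg in rolled.items():
    print(key, agg)
```

Six raw rows collapse to three stored rows; the saving grows with the number of events that share the same hour and dimension values.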
Dictionary Encoding
 Create and store Ids for each value
 e.g. page column
 Values - Justin Bieber, Ke$ha, Selena Gomes
 Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
 Column Data - [0 0 0 1 1 2]
 city column - [0 0 0 1 1 1]
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
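A minimal sketch of dictionary encoding for the page and city columns of this dataset. IDs are assigned in order of first appearance, which happens to match the slide; a real implementation may order its dictionary differently. The function name is an assumption.

```python
def dictionary_encode(values):
    """Return (dictionary, encoded column) with integer IDs replacing strings."""
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused ID
        encoded.append(dictionary[value])
    return dictionary, encoded

page = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
        "Ke$ha", "Ke$ha", "Selena Gomes"]
city = ["SF", "SF", "SF", "Calgary", "Calgary", "Calgary"]

page_dict, page_ids = dictionary_encode(page)
city_dict, city_ids = dictionary_encode(city)
print(page_dict, page_ids)
print(city_dict, city_ids)
```

Each distinct string is stored once, and the column itself becomes a compact array of small integers.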
Bitmap Indices
 Store Bitmap Indices for each value
 Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
 Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
 Selena Gomes -> [5] -> [0 0 0 0 0 1]
 Queries
 Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
 language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
 Indexes compressed with Concise or Roaring encoding
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
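The bitmap lookups and boolean filters above can be sketched with plain Python lists of 0/1 values, one bitmap per distinct column value. This is an uncompressed illustration only; the function names are assumptions, and it does not implement Concise or Roaring encoding.

```python
def build_bitmaps(column):
    """Map each distinct value to a bitmap with a 1 at every row holding it."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps

def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
language = ["en"] * 6
country = ["USA"] * 3 + ["CA"] * 3

page_bm = build_bitmaps(page)
lang_bm = build_bitmaps(language)
country_bm = build_bitmaps(country)

bieber_or_kesha = bitmap_or(page_bm["Justin Bieber"], page_bm["Ke$ha"])
en_and_ca = bitmap_and(lang_bm["en"], country_bm["CA"])
print(bieber_or_kesha, en_and_ca)
```

A filter is resolved to a set of matching rows with bitwise operations alone, before any metric column is read.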
Approximate Sketch Columns
Raw events:
timestamp             page           userid       language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  user1111111  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  user1111111  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  user2222222  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          user3333333  en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          user4444444  en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   user1111111  en        Calgary  CA          12     53
Rollup by hour:
timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  userid_sketch  …
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         {sketch}
2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         {sketch}
2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         {sketch}
Approximate Algorithms
 Store Sketch objects, instead of raw column values
 Better rollup for high cardinality columns e.g userid
 Reduced storage size
 Use Cases
 Fast approximate distinct counts
 Approximate histograms
 Funnel/retention analysis
 Limitations
 Not possible to do exact counts
 Not possible to filter on individual row values
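To show how a sketch column can replace raw userid values, here is a minimal K-Minimum-Values (KMV) sketch, a simple relative of theta sketches. It is an illustrative stand-in, not Druid's implementation; the class name, the SHA-1 hash choice, and k=256 are assumptions. Each rolled-up row would hold one sketch, and sketches are merged at query time.

```python
import hashlib
import heapq

class KMVSketch:
    """Keeps the k smallest hash values seen; estimates distinct counts."""

    def __init__(self, k=256):
        self.k = k
        self.hashes = set()

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64  # roughly uniform in [0, 1)

    def add(self, value):
        self.hashes.add(self._hash(value))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # keep only the k smallest

    def merge(self, other):
        merged = KMVSketch(self.k)
        merged.hashes = set(heapq.nsmallest(self.k, self.hashes | other.hashes))
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct values: exact
        return (self.k - 1) / max(self.hashes)

def sketch_of(user_ids):
    sketch = KMVSketch()
    for user_id in user_ids:
        sketch.add(user_id)
    return sketch

bieber = sketch_of(["user1111111", "user1111111", "user2222222"])
gomes = sketch_of(["user1111111"])
print(bieber.estimate(), bieber.merge(gomes).estimate())
```

The sketch stores only hash values, which is why the original user IDs can no longer be filtered on or counted exactly once the raw column is dropped.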
Druid Architecture
[Diagram: Streaming Data → Realtime Index Tasks (Realtime Nodes) → Handoff → Historical Nodes; Batch Data → Historical Nodes; Broker Nodes in front of Realtime and Historical Nodes]
Performance and Scalability : Fast Facts
 Most Events per Day: 300 Billion Events / Day (Metamarkets)
 Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
 Largest Cluster: 200 Nodes (Metamarkets)
 Largest Hourly Ingestion: 2TB per Hour (Netflix)
Companies Using Druid
Visualization Layer
Visualization Layer : Requirements
 Rich dashboarding capabilities
 Work with multiple data sources
 Security/Access control
 Allow for extension
 Add custom visualizations
[Diagram: Data Store, Visualization Layer, Dashboards, User]
Superset
 Python backend
 Flask AppBuilder
 Authentication
 Pandas for rich analytics
 SQLAlchemy as the SQL toolkit
 JavaScript frontend
 React, NVD3
 Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Wikipedia Real-Time Dashboard
[Diagram: Kafka Connect → wikipedia-raw topic → IP-to-Geolocation Processor → wikipedia-enriched topic]
Project Websites
 Kafka - http://kafka.apache.org
 Druid - http://druid.io
 Superset - http://superset.incubator.apache.org
Thank you! Questions?
 Twitter - @NishantBangarwa
 Email - nbangarwa@hortonworks.com
 LinkedIn - https://www.linkedin.com/in/nishant-bangarwa
Off The Record (OTR) session
Experiences and challenges in working with Druid
at 03:25 PM - 04:10 PM on 28 July, 2017
in Room 1 MLR Convention Centre, Whitefield

