Interactive ad-hoc analysis at petabyte scale with HDInsight Interactive Query
@ashishth
[Figure: pipeline stages Ingest, Transform, Convert to ORC/Parquet, Load to Relational Store, Serve]
[Figure: pipeline stages Ingest, Transform, Convert to ORC/Parquet, Load to Relational Store, Serve; axis: Time]
LLAP
[Figure: HDInsight cluster; Network; Azure BLOB Store / Azure Data Lake Store]
[Figure: Hadoop Cluster, Spark Cluster, LLAP Cluster, Presto Cluster; Network; Azure BLOB Store / Azure Data Lake Store]
[Figure: On Demand Processing Clusters; Data Serving Clusters; Azure BLOB Store / Azure Data Lake Store; Common Hive Metastore]
[Figure: Storage; HDInsight Spark/Hive/MR/Pig]
1. Create cluster
2. Submit jobs
6. Drop cluster jobs
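The create, submit, drop lifecycle listed above can be sketched roughly as follows. This is only an illustration of the on-demand pattern: FakeClusterService, on_demand_cluster and run_batch are hypothetical names, and the in-memory service stands in for whatever provisioning API is actually used. The one design point it shows is that the drop step runs even if a job fails.

```python
from contextlib import contextmanager


class FakeClusterService:
    """Hypothetical in-memory stand-in for a cluster provisioning API."""

    def __init__(self):
        self.active = set()
        self.log = []

    def create(self, name):
        self.active.add(name)
        self.log.append(("create", name))

    def submit(self, name, job):
        if name not in self.active:
            raise RuntimeError(f"cluster {name!r} does not exist")
        self.log.append(("submit", name, job))

    def drop(self, name):
        self.active.discard(name)
        self.log.append(("drop", name))


@contextmanager
def on_demand_cluster(service, name):
    """Create a cluster, hand it to the caller, and always drop it afterwards."""
    service.create(name)
    try:
        yield name
    finally:
        service.drop(name)


def run_batch(service, name, jobs):
    """Run every job on a short-lived cluster and return the service's action log."""
    with on_demand_cluster(service, name) as cluster:
        for job in jobs:
            service.submit(cluster, job)
    return service.log
```

Because the data lives in external storage rather than on the cluster, dropping the cluster at the end of the batch does not lose any data.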
[Figure: Azure BLOB Store / Azure Data Lake Store; Common Hive Metastore]
[Figure: Azure HDInsight; Analyst, Power User, Data Engineer, Data Scientist]
[Figure: pipeline stages Ingest, Transform, Convert to ORC/Parquet, Load to Relational Store, Serve; axis: Time]
[Figure: pipeline stages Ingest, Serve; axis: Time]
• Hive Low Latency Analytical Processing (LLAP)
• Serves queries directly from Azure BLOB/ADLS
• Works with TEXT, JSON, CSV, TSV, ORC, Parquet
• Super fast performance with TEXT data
• Modern scalable query concurrency architecture
• Security with Apache Ranger and Active Directory
HDInsight Interactive Query architecture
Memory + SSD cache
Metastore Cache
[Figure: LLAP; Executor; IO Thread]
Intelligent cache
Automatically reacts to changes in underlying data
o Shared cache between queries
o Cache eviction is based on source file last modified date
o Every query will check modified date, and reload if a new file has arrived
[Figure: cache tiers DRAM, SSD, ADLS/BLOBStore; Updates]
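The eviction rule described above (check the source file's last modified date on every query, reload on change) can be illustrated with a small sketch. This is not LLAP's implementation; ModifiedDateCache and its stat/read callbacks are hypothetical names chosen for the example, and the storage layer is passed in as plain functions so the sketch runs without any cloud access.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    modified: float  # last-modified timestamp seen when the data was loaded
    data: bytes


class ModifiedDateCache:
    """Shared cache that reloads a file when its last-modified date changes."""

    def __init__(self, stat, read):
        self._stat = stat    # hypothetical callback: path -> last-modified timestamp
        self._read = read    # hypothetical callback: path -> file contents
        self._entries = {}
        self.loads = 0       # counts reads that actually hit the source store

    def get(self, path):
        current = self._stat(path)
        entry = self._entries.get(path)
        # A missing entry or a changed timestamp both force a reload.
        if entry is None or entry.modified != current:
            entry = CachedFile(current, self._read(path))
            self._entries[path] = entry
            self.loads += 1
        return entry.data
```

Every call pays for one cheap metadata lookup, and the expensive read happens only when the file is new or has changed, which is the trade-off the slide describes.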
• LLAP, Spark, and Presto compared on 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
• We used a number of different concurrency levels to test concurrency performance
• 99 queries on 1 TB of data, on a cluster with 32 worker nodes and max concurrency set to 32
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
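A test plan like the one above could be driven by a small harness along these lines. This is a sketch under assumptions, not the harness used for the talk: run_at_concurrency and sweep are hypothetical names, and run_query is whatever callable submits one query to the engine under test (a stub in the usage below).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_at_concurrency(queries, run_query, concurrency):
    """Run all queries with at most `concurrency` in flight; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps results in the same order as the input queries.
        results = list(pool.map(run_query, queries))
    return results, time.perf_counter() - start


def sweep(queries, run_query, levels=(1, 2, 4, 8, 16, 32, 64)):
    """Repeat the full query set at each concurrency level; return level -> seconds."""
    return {
        level: run_at_concurrency(queries, run_query, level)[1]
        for level in levels
    }
```

For example, sweep([f"q{i}" for i in range(1, 100)], submit) would time the 99-query set at each of the seven levels, where submit is a caller-supplied function.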
Capability | Interactive Query | Spark SQL | Presto
Interactive Query Speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early Support
Intelligent Cache Eviction | Yes | No | No
Complex Fact to Fact Joins | Yes | Yes | No
Transactions | Yes | No | No
Query Concurrency | High | Low | Low
Row, Column level security | Yes [Apache Ranger + AAD] | High | Medium
Rich end user Tools | Yes | Yes | Yes
Language Support | SQL, UDF | SQL, Scala, Python | SQL
Data Source Connector Support | Storage Handlers | Data Sources | High number of connectors
Perimeter level security
Authentication
Authorization
Data security
Roadmap US Gov
DataLakeProbe
HBaseHealthProbe
HBaseMetricsProbe
HBaseProbe
HdfsProbe
HdinsightZookeeperProbe
……..
EdgenodeSSHWatchdog
GatewayTCPPingWatchdog
SSHTCPPingWatchdog
RStudioWatchdog
CertRolloverWatchdog
JobSubmissionPingWatchdog
OozieWatchdog
DataNodesUpWatchdog
NodeManagersUpWatchdog
ResourceHealthWatchdog
AzureNodeStatusWatchdog
ClusterMALoggingHashWatchdog
ClusterAvailabilityWatchdog
ClusterHealthWatchdog
……..
namenode_ha_health
ams_metrics_collector_process
ams_metrics_collector_autostart
ams_metrics_collector_hbase_master_process
namenode_last_checkpoint
namenode_webui
increase_nn_heap_usage_daily
hive_metastore_process
ambari_server_stale_alerts
ambari_server_agent_heartbeat
metrics_monitor_process_percent
……….
HDInsight Log Analytics Architecture
[Figure: HDInsight nodes (Head, Worker, Zookeeper) -> OMS Agent for Linux (FluentD + HDInsight plugin) -> Log Analytics (OMS) Service; per-service Config: HBase, Spark, Hive/LLAP, Storm, Kafka; osmconfig]
HDInsight plugin:
1. Plugin for 'in_tail' for all logs; allows a regexp to create a JSON object
2. Filter for WARN and above for each log type, using the `grep` filter plugin
3. Output to out_oms_api type
4. Exec plugin for metrics
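The first two plugin steps (a regexp that turns each log line into a JSON object, then a filter that keeps WARN and above) can be mimicked in a few lines. This sketch is not FluentD configuration and not the plugin's code; the log line layout, the LOG_LINE pattern and the SEVERITY ordering are assumptions made for illustration.

```python
import json
import re

# Assumed log layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL message"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)

# Assumed severity ordering, lowest to highest.
SEVERITY = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4, "FATAL": 5}


def parse_line(line):
    """Return the named regexp groups as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def warn_and_above(lines):
    """Yield a JSON string for each parsed line whose level is WARN or higher."""
    for line in lines:
        record = parse_line(line)
        if record and SEVERITY.get(record["level"], -1) >= SEVERITY["WARN"]:
            yield json.dumps(record)
```

Filtering on the node, before anything is sent out, keeps the volume of data shipped to the log service down to the records that matter for alerting.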
Microsoft Azure Estimate
Your Estimate
Service type | Custom name | Region | Description | Estimated Cost
HDInsight | | East US | Interactive Query Component: 2 A3 (4 cores, 7 GB RAM) Head nodes x 730 Hours, 6 D14V2 (16 cores, 112 GB RAM) Region nodes x 730 Hours, 3 A1 (1 cores, 1.75 GB RAM) Zookeeper nodes x 730 Hours, 0 D4V2 (8 cores, 28 GB RAM) Edge nodes x 730 Hours | $7,163.27
Storage | | East US | Block Blob Storage, General Purpose V2, LRS Redundancy, Hot Access Tier, 100 TB Capacity, 10,000,000 Write operations, 100,000 List and Create Container Operations, 99,999,000 Read operations, 9,990,000 Other operations. 500 TB Data Retrieval, 50 TB Data Write | $2,181.82
Support | | | Support | $0.00
Monthly Total | | | | $9,345.09
Annual Total | | | | $112,141.06
Disclaimer
All prices shown are in US Dollar ($). This is a summary estimate, not a quote. For up to date pricing information please visit https://azure.microsoft.com/pricing/calculator/
This estimate was created at 4/13/2018 7:48:34 PM UTC.
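As a quick check of the arithmetic in the estimate, the line items can be summed with exact decimal arithmetic. The three amounts are taken from the estimate; the function names are made up for this sketch. Note that twelve times the rounded monthly total comes to $112,141.08, two cents more than the listed annual total of $112,141.06. A plausible explanation is that the calculator multiplies unrounded amounts, but the estimate itself does not say.

```python
from decimal import Decimal

# Monthly line items as listed in the estimate (US dollars).
LINE_ITEMS = {
    "HDInsight": Decimal("7163.27"),
    "Storage": Decimal("2181.82"),
    "Support": Decimal("0.00"),
}


def monthly_total(items):
    """Sum the monthly line items exactly."""
    return sum(items.values(), Decimal("0"))


def annual_from_monthly(monthly):
    """Twelve times the (already rounded) monthly total."""
    return monthly * 12


if __name__ == "__main__":
    monthly = monthly_total(LINE_ITEMS)
    print(monthly)                       # 9345.09
    print(annual_from_monthly(monthly))  # 112141.08
```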