HDInsight for Architects

• The most trusted and compliant platform
A secure and managed Apache Hadoop and Spark platform for building data lakes in the Cloud
@ashishth

OSS Framework Choices
Security
HA & DR
Storage
Monitoring
Cost Optimization

[Figure: reference architecture. Sources: devices & sensors, corporate data, SaaS data, web data. Storage: Data Lake Store Gen 2, Blob Storage. Processing: speed layer, ETL, serving layer (Hive LLAP). Consumers: streaming/real-time applications; advanced analytics & data science (machine learning; R, Python, APIs); analytics (data exploration, corporate reporting, self-service BI).]

[Figure: the same reference architecture, with four components marked "?" to pose the framework choices.]

                      Spark                        Pig              Hive
Designed for          ETL                          ETL              Data warehousing
Adoption              High, increasing             Low, decreasing  Stable
Number of connectors  Highest                      High             High
Languages             Python, R, Scala, Java, SQL  Pig              SQL
Performance           High                         Medium           Medium

                            Spark Structured Streaming   Storm
Adoption                    High, increasing             Decreasing
Event processing guarantee  Exactly once                 At least once
Throughput                  High                         Low
Processing model            Micro batch                  Real-time
Latency                     High                         Low
Event time support          Yes                          Yes
Languages                   Python, R, Scala, Java, SQL  Java

Capability                     Hive LLAP                  Spark SQL           Presto
Interactive Query Speed        High                       High                Medium
Scale                          High                       High                Low
Caching                        Yes                        Yes                 Early Support
Result Caching                 Yes                        No                  No
Intelligent Cache Eviction     Yes                        No                  No
Materialized Views             Yes                        No                  No
Complex Fact to Fact Joins     Yes                        Yes                 No
Transactions                   Yes                        No                  No
Query Concurrency              High                       Low                 Low
Row, Column level security     Yes [Apache Ranger + AAD]  Medium              Medium
Rich end user Tools            Yes                        Yes                 Yes
Language Support               SQL, UDF                   SQL, Scala, Python  SQL
Data Source Connector Support  Storage Handlers           Data Sources        High number of connectors

[Figure: Spark metadata and Hive metadata in Azure HDInsight 3.6 with Hadoop 2.6 versus Azure HDInsight 4.0 with Hadoop 3.x.]
Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/

                     ADF                  Airflow                Oozie
Service management   Azure PaaS           IaaS VM                HDInsight
Code                 JSON                 Python                 Java
GUI                  ADF V2 has great UX  Good UX                Below Average UX
Community            Microsoft            Growing (10893 Stars)  Declining (454 Stars)
On-demand clusters   Yes                  No, but extensible     No
Extensibility        Custom action-only   Full, graph + actions  Custom action-only
Pipeline definition  JSON/UX              Python/UX              XML/UX
DevOps-first design  Yes                  Yes                    Yes
Pipeline monitoring  Yes                  Yes                    Yes
Scheduling           Event, Time          Event                  Event, Time

Motivation and benefits
Architecture best practices
Infrastructure best practices
Storage best practices
Data migration best practices
Security and DevOps best practices
https://azure.microsoft.com/en-us/blog/migrating-on-premises-hadoop-infrastructure-to-azure-hdinsight/

[Figure: end-to-end analytics architecture. Data sources (apps, sensors and devices) feed Data Ingestion, Data Prep/Management, Advanced Analytics and BI/Visualization, consumed by people, automated systems and apps (web, mobile, bots). Supporting layers: Data catalog/Governance/Lineage; Connectors: JDBC, ODBC; Productivity Tools; Enterprise grade add-ons (hybrid, backup, DR, security, performance).]

Data movement
Caching
Storage options and tradeoffs

          Network Bandwidth
Data Qty  45 Mbps (T3)  100 Mbps  1 Gbps
1 TB      2 days        1 day     2 hours
10 TB     22 days       10 days   1 day
35 TB     76 days       34 days   3 days
200 TB    1 year        194 days  19 days
500 TB    3 years       1 year    49 days
1 PB      6 years       3 years   97 days
2 PB      12 years      5 years   194 days
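The transfer times above can be approximated with simple arithmetic. The following is a minimal Python sketch, not the method used to produce the slide: it assumes binary data units (1 TB = 2**40 bytes), decimal link speeds (1 Mbps = 10**6 bits/s) and a fully utilised link with no protocol overhead. The function name `transfer_days` is hypothetical. Under these assumptions the results land close to, but not exactly on, the slide's figures.

```python
# Back-of-the-envelope transfer time for bulk data migration.
# Assumptions (not stated on the slide): binary data units (1 TB = 2**40 bytes),
# decimal link speeds (1 Mbps = 10**6 bits/s), and a fully utilised link.

SECONDS_PER_DAY = 86_400


def transfer_days(size_tb: float, bandwidth_mbps: float) -> float:
    """Return the ideal transfer time in days for size_tb over bandwidth_mbps."""
    bits = size_tb * 2**40 * 8
    seconds = bits / (bandwidth_mbps * 1_000_000)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Print one row per data quantity, one column per link speed.
    for size_tb in (1, 10, 35, 200, 500):
        row = [f"{transfer_days(size_tb, mbps):8.1f}" for mbps in (45, 100, 1000)]
        print(f"{size_tb:>4} TB", *row)
```

Real transfers are usually slower than this ideal figure because of shared links, TLS and retry overhead, which is why the offline shipping options become attractive for the larger data quantities.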

Network transfer with TLS
• Over Internet
• ExpressRoute
• Data Box online transfer
Shipping data offline
• Import/Export service
• Data Box offline data transfer

https://github.com/alkohli/azure-docs-pr/blob/4023eb52cc6ed103e0fa7e794e039c143b6d2a6a/articles/storage/blobs/data-lake-storage-migrate-on-prem-HDFS-cluster.md

Storage                Type          Latency   Consistency  Workloads                 Bandwidth      Key Benefits
ADLS Gen 1             Hierarchical  10-100ms  Low          HDInsight 3.6 (No HBase)  High           Atomic Rename, File/Folder level ACLs
ADLS Gen 2             Hierarchical  10-50ms   Medium       HDInsight 3.6 & 4.0       Unconstrained  Atomic Rename, File/Folder level ACLs
Standard Blob          Object Store  10-50ms   Medium       HDInsight 3.6 & 4.0       Unconstrained  Mature
Premium Blob           Object Store  ~5ms      High         HBase in preview          Unconstrained  Fast
Premium Managed Disks  Hierarchical  ~5ms      High         Kafka, HBase in preview   Based on disk  Consistent

Scenario                                                             Supported  Workaround
HDInsight 3.6 & 4.0 with Standard Blob as primary and/or secondary   Yes
HDInsight 3.6 & 4.0 with ADLS Gen2 as primary                        Yes
HDInsight 3.6 & 4.0 with ADLS Gen2 as primary & Blob as additional   Yes
HDInsight 3.6 & 4.0 with Blob as primary & ADLS Gen2 as additional   No
HDInsight 3.6 with multiple ADLS Gen2 accounts                       Yes
HDInsight 3.6 & 4.0 with ADLS Gen1 and ADLS Gen2                     No         Distcp across two clusters
HDInsight 4.0 with ADLS Gen1                                         No         Distcp across two clusters

[Figure: HBase storage layout on HDInsight. The client (Put, Delete, Get) talks to a RegionServer hosting three Regions, with a Log and a Flusher; Store Files (HFiles) reside on WASB.]

Write path challenges with Write Ahead Log
[Figure: client operations (Put/Insert, Update, Get, Delete) reach the RegionServer, whose Log and Flusher are backed by WASB; the Log write is a sync operation.]
• Inconsistent latencies
• High latencies

Introducing Premium Managed Disk for WAL
[Figure: client operations (Put/Insert, Update, Get, Delete) reach the RegionServer, whose Log is backed by Premium Managed Disk(s); the Log write is a sync operation.]
• Consistent latencies
• Low latencies
• Data durability

[Figure: HBase storage layout with the Log on Premium Managed Disk(s) and Store Files (HFiles) on WASB.]

[Figure: HBase storage layout with the Log on Premium Managed Disk(s), Store Files (HFiles) on Premium Blob, and Local SSD + DRAM.]

Cluster Type                Operation    Row Size   # Ops          # Region Servers  Region Server Node Size  # Clients  Throughput  Avg Latency (ms)  Run Time (min)
Standard                    Write        1KB        107,374,182    4                 Standard_D4_V2           2          37,958      0.417             47
Premium WAL                 Write        1KB        107,374,182    4                 Standard_D4_V2           2          57,812      0.271             31
Standard                    Small Write  100 Bytes  1,073,741,824  4                 Standard_DS4_V2          2          84,910      0.186             210
Premium WAL                 Small Write  100 Bytes  1,073,741,824  4                 Standard_DS4_V2          2          701,234     0.016             25
Standard                    Read         100 Bytes  925,075        4                 Standard_D4_V2           2          256         62                60
Premium WAL & Premium Blob  Read         100 Bytes  33,503,676     4                 Standard_D4_V2           2          9,306       1.7               60
Standard                    Large Read   1K         945,682        4                 Standard_D4_V2           2          262         61                60
Premium WAL & Premium Blob  Large Read   1K         24,846,209     4                 Standard_D4_V2           2          6,901       2.3               60
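The benchmark figures can be cross-checked and summarised with a little arithmetic. The sketch below is illustrative only: it copies the operation counts, throughput and run times from the table, assumes the throughput column is operations per second (the slide gives no unit), and reads the Standard Large Read row as 2 clients with a throughput of 262. The names `BENCHMARK`, `ops_per_second` and `speedup` are hypothetical.

```python
# (ops, reported throughput, run time in minutes), copied from the benchmark table.
BENCHMARK = {
    "Write": {
        "standard": (107_374_182, 37_958, 47),
        "premium": (107_374_182, 57_812, 31),
    },
    "Small Write": {
        "standard": (1_073_741_824, 84_910, 210),
        "premium": (1_073_741_824, 701_234, 25),
    },
    "Read": {
        "standard": (925_075, 256, 60),
        "premium": (33_503_676, 9_306, 60),
    },
    "Large Read": {
        "standard": (945_682, 262, 60),
        "premium": (24_846_209, 6_901, 60),
    },
}


def ops_per_second(ops: int, minutes: float) -> float:
    """Throughput implied by the operation count and run time."""
    return ops / (minutes * 60)


def speedup(operation: str) -> float:
    """Ratio of premium to standard reported throughput for one operation."""
    standard = BENCHMARK[operation]["standard"][1]
    premium = BENCHMARK[operation]["premium"][1]
    return premium / standard


if __name__ == "__main__":
    for operation in BENCHMARK:
        print(f"{operation:<12} {speedup(operation):5.1f}x")
```

The implied operations per second agree with the reported throughput to within about 2% in every row, which supports the per-second reading of that column.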

Workload         Caching Options                        Key benefits
Spark            Spark IO Cache                         Up to ~8 to 10x perf improvements
HBase & Phoenix  Bucket cache                           Up to 5-10x perf gains on recently read or written data
Hive + LLAP      LLAP Intelligent cache / Result Cache  Up to ~4-100x gain on cached data

Azure Data Lake Storage
INSTANCE  CORE  RAM        TEMP SSD
D1 v2     1     3.50 GiB   50 GiB
D2 v2     2     7.00 GiB   100 GiB
D3 v2     4     14.00 GiB  200 GiB
D4 v2     8     28.00 GiB  400 GiB
D5 v2     16    56.00 GiB  800 GiB
• Significant Spark performance speed up with IO cache (up to 9X perf gains)
• Automatic cache resource management
• DRAM + Temp SSD makes large cache pool

[Figure: HDInsight cluster layout. Gateways (https://cluster.azurehdinsight.net/APIs) in front of Head Node 1 and Head Node 2, four worker nodes, three Zookeeper nodes, Hive Metastore and YARN.]

Workload      DR Option
Spark / Hive  Manual, partner solution
HBase         HBase replication, snapshot export, Import/Export, CopyTable
Kafka         MirrorMaker

[Figure: Active / Hot standby. Both sites run Ingest, Process, Publish.]
RPO: Low
RTO: None
Cost: High

[Figure: Active / Cold standby. The active site runs Ingest, Process, Publish, with replication to the standby.]
RPO: Medium
RTO: Medium
Cost: High

[Figure: Active site (Ingest, Process, Publish) with replication to DR cloud storage.]
RPO: Highest
RTO: Highest
Cost: Lowest
Effort: Highest

[Figure: Active / Active. Both sites run Ingest, Process, Publish.]
RPO: Lowest
RTO: None
Cost: Highest

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-mirroring
https://github.com/anagha-microsoft/hdi-spark-dr
https://github.com/anagha-microsoft/hdi-kafka-dr
https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication

[Figure: Virtual Network (10.1.0.0/16) containing the HDInsight cluster in a subnet (10.1.1.0/24): gateways, Head Node 1, Head Node 2 and four worker nodes. Two rules are labeled "Allow VNet (10.1.0.0/16)"; the Hive Metastore is shown alongside the cluster.]
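The address ranges on this slide can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch that models only the address-range part of an "Allow VNet" rule; it is not how Azure evaluates network security rules, and the helper name `is_allowed_by_vnet_rule` is hypothetical.

```python
import ipaddress

# Address ranges taken from the slide.
VNET = ipaddress.ip_network("10.1.0.0/16")
CLUSTER_SUBNET = ipaddress.ip_network("10.1.1.0/24")


def is_allowed_by_vnet_rule(source_ip: str) -> bool:
    """Return True if source_ip falls inside the VNet range (10.1.0.0/16)."""
    return ipaddress.ip_address(source_ip) in VNET


if __name__ == "__main__":
    # The cluster subnet must be carved out of the VNet address space.
    print("subnet inside VNet:", CLUSTER_SUBNET.subnet_of(VNET))
    print("addresses in cluster subnet:", CLUSTER_SUBNET.num_addresses)
    for ip in ("10.1.1.4", "10.1.200.7", "10.2.0.4"):
        print(ip, "allowed" if is_allowed_by_vnet_rule(ip) else "denied")
```

A check like this is useful when planning subnets for several clusters in one virtual network, since overlapping or out-of-range subnets are rejected at deployment time.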

Scenario                                                                   Authorizing Component
YARN: Submit-App                                                           Apache Ranger: YARN plugin
Hive operations: Create, Select, Update, Drop, Index, Lock, Read, Write,   Apache Ranger: Hive plugin
  Masking, Row level filter on Hive database, table & columns
Create/Alter Table with storage location reference                         Apache Ranger + ADLS Gen 2 ACLs
Spark SQL access with Hive Metastore                                       Apache Ranger: Hive plugin
HBase access policies                                                      Apache Ranger: HBase plugin
Kafka access policies                                                      Apache Ranger: Kafka plugin
Access Azure Data Lake Storage Gen2 using the Spark DataFrame API          ADLS Gen 2 ACLs
Access Azure Data Lake Storage Gen2 using the RDD API                      ADLS Gen 2 ACLs
HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp etc.      ADLS Gen 2 ACLs
Running MapReduce jobs                                                     ADLS Gen 2 ACLs

Apache Ambari
Azure Log Analytics Integration
HDInsight Cluster Metrics

INSTANCE  VCPU  RAM         TEMPORARY STORAGE  PAYG
D14 v2    16    112.00 GiB  800 GiB            $1.196/hour
E16 v3    16    128.00 GiB  400 GiB            $1.064/hour
12.5%
11%
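The two percentages on this slide are unlabeled. One reading that is consistent with the table is that D14 v2 has 12.5% less RAM than E16 v3, while E16 v3 costs about 11% less per hour. The Python sketch below only reproduces that arithmetic under this assumed interpretation; the helper name `percent_lower` is hypothetical.

```python
# Figures copied from the instance table on the slide.
D14_V2 = {"ram_gib": 112.0, "price_per_hour": 1.196}
E16_V3 = {"ram_gib": 128.0, "price_per_hour": 1.064}


def percent_lower(smaller: float, larger: float) -> float:
    """How much lower `smaller` is than `larger`, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


# Assumed interpretation of the slide's two unlabeled percentages.
ram_gap = percent_lower(D14_V2["ram_gib"], E16_V3["ram_gib"])
price_saving = percent_lower(E16_V3["price_per_hour"], D14_V2["price_per_hour"])

if __name__ == "__main__":
    print(f"D14 v2 has {ram_gap:.1f}% less RAM than E16 v3")
    print(f"E16 v3 costs {price_saving:.1f}% less per hour than D14 v2")
```

The same comparison can be repeated for any pair of instance sizes when checking whether a newer VM series offers more memory for a lower pay-as-you-go price.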
