Building Big Data Applications using Spark, Hive, HBase and Kafka

[Diagram: reference architecture for a big data application]
• Data sources: corporate data; devices and sensors
• Hot path: ingest real-time data, stream processing, real-time NoSQL store; store real-time data for long-term analysis
• Cold path: ingest batch data, ETL, ad hoc query in the data lake
• Storage: HDFS-compliant storage (data lake)
• Governance: metadata management, security / access control
• Orchestration
• Serving layer and downstream applications
• Consumers: advanced analytics and data science (machine learning; R, Python, APIs), analytics and data exploration, corporate reporting, self-service BI, streaming/real-time applications
@ashishth

[Diagram: Azure services for each part of the architecture: Azure SDK, Azure Data Factory, Azure Import/Export Service, Azure CLI, Cognitive Services, Bot Service, Azure Search, Azure Data Catalog, Azure ExpressRoute, Azure Network Security Groups, Azure Functions, Visual Studio, Operations Management Suite, Azure Active Directory, Azure Key Management Service, Azure Storage Blobs, Azure Data Lake Storage, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI, Azure HDInsight, Azure Databricks, Azure Stream Analytics, Azure ML, ML Server]

Azure HDInsight
A secure and managed Apache Hadoop and Spark platform for building data lakes in the cloud
• The most trusted and compliant platform

[Diagram: technology choices by layer]
• Storage: Which storage system? How to transfer the data?
• ETL: Pig, Hive, or Spark?
• Event Processing: Spark Streaming or Storm?
• Serving Layer: Presto or Hive LLAP?
• Orchestration: ADF/Airflow or Oozie?
• Monitoring & Security

 | Spark | Pig | Hive
Designed for | ETL | ETL | Data warehousing
Adoption | High, increasing | Low, decreasing | Stable
Number of connectors | Highest | High | High
Languages | Python, R, Scala, Java, SQL | Pig | SQL
Performance | High | Medium | Medium

 | Spark Structured Streaming | Storm
Adoption | High, increasing | Decreasing
Event processing guarantee | Exactly once | At least once
Throughput | High | Low
Processing model | Micro-batch | Real-time
Latency | High | Low
Event time support | Yes | Yes
Languages | Python, R, Scala, Java, SQL | Java
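The latency gap between the micro-batch and real-time processing models can be illustrated with a small simulation. This is a minimal sketch, not Spark or Storm code: the function names and arrival times are hypothetical, and it assumes fixed trigger intervals and zero processing time, so the only latency counted is the wait for the next batch to fire.

```python
def micro_batch_latencies(arrival_ms, trigger_ms):
    """Wait (ms) each event incurs before its micro-batch fires.

    Assumption: batches fire at fixed multiples of trigger_ms and processing
    itself takes no time. An event arriving exactly on a trigger boundary
    joins the next batch.
    """
    return [(t // trigger_ms + 1) * trigger_ms - t for t in arrival_ms]


def per_event_latencies(arrival_ms):
    """Per-event processing handles each event on arrival: no batching wait."""
    return [0 for _ in arrival_ms]


arrivals = [100, 450, 900, 1200]  # hypothetical arrival times in ms
batch_waits = micro_batch_latencies(arrivals, trigger_ms=1000)
event_waits = per_event_latencies(arrivals)

print("micro-batch waits:", batch_waits)
print("per-event waits:  ", event_waits)
```

Shortening the trigger interval reduces the batching wait, at the cost of more, smaller batches.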

Capability | Hive LLAP | Spark SQL | Presto
Interactive query speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early support
Result caching | Yes | No | No
Intelligent cache eviction | Yes | No | No
Materialized views | Yes | No | No
Complex fact-to-fact joins | Yes | Yes | No
Transactions | Yes | No | No
Query concurrency | High | Low | Low
Row- and column-level security | Yes [Apache Ranger + AAD] | Medium | Medium
Rich end-user tools | Yes | Yes | Yes
Language support | SQL, UDF | SQL, Scala, Python | SQL
Data source connector support | Storage handlers | Data sources | Connectors

[Diagram: Hive metadata and Spark metadata in Azure HDInsight 3.6 with Hadoop 2.6 versus Azure HDInsight 4.0 with Hadoop 3.x]
Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/

 | ADF | Airflow | Oozie
Service management | Azure PaaS | IaaS VM | HDInsight
Code | JSON | Python | Java
GUI | ADF V2 has great UX | Good UX | Below-average UX
Community | Microsoft | Growing (12,133 stars) | Declining (483 stars)
On-demand clusters | Yes | No, but extensible | No
Extensibility | Custom action only | Full, graph + actions | Custom action only
Pipeline definition | JSON/UX | Python/UX | XML/Java/UX
DevOps-first design | Yes | Yes | Yes
Pipeline monitoring | Yes | Yes | Yes
Scheduling | Event, Time | Event | Event, Time
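All three orchestrators model a pipeline as tasks with dependencies between them. The idea can be sketched in plain Python with the standard-library `graphlib` module. This is only an illustration under stated assumptions: the task names are hypothetical, and the code uses none of the ADF, Airflow, or Oozie APIs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_batch_data": set(),
    "etl": {"ingest_batch_data"},
    "load_serving_layer": {"etl"},
    "refresh_reports": {"load_serving_layer"},
}


def run_pipeline(dag, actions):
    """Run each task's action only after all of its upstream tasks have run."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        actions[task]()
        executed.append(task)
    return executed


# Placeholder actions that just print; a real pipeline would submit jobs here.
actions = {name: (lambda name=name: print("running", name)) for name in pipeline}
order = run_pipeline(pipeline, actions)
```

A real orchestrator adds what this sketch leaves out: scheduling, retries, monitoring, and cluster provisioning.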

• Data movement
• Storage options and tradeoffs
• Caching

Transfer time by data quantity and network bandwidth
Data Qty | 45 Mbps (T3) | 100 Mbps | 1 Gbps
1 TB | 2 days | 1 day | 2 hours
10 TB | 22 days | 10 days | 1 day
35 TB | 76 days | 34 days | 3 days
200 TB | 1 year | 194 days | 19 days
500 TB | 3 years | 1 year | 49 days
1 PB | 6 years | 3 years | 97 days
2 PB | 12 years | 5 years | 194 days
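Estimates like these follow from dividing the data size in bits by the link speed. Below is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s) and a fully utilized link by default. The function name and the `efficiency` parameter are hypothetical. Some of the slide's figures come out slightly higher than this calculation, so the slide presumably uses different unit or overhead assumptions.

```python
def transfer_days(size_tb, bandwidth_mbps, efficiency=1.0):
    """Estimated days to move size_tb terabytes over a bandwidth_mbps link.

    Assumptions: decimal units, and `efficiency` is the fraction of the
    nominal bandwidth actually achieved (1.0 means a fully utilized link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86400  # seconds per day


for size_tb in (1, 10, 200):
    days = [round(transfer_days(size_tb, mbps), 1) for mbps in (45, 100, 1000)]
    print(f"{size_tb} TB -> days at 45 Mbps / 100 Mbps / 1 Gbps: {days}")
```

Passing a lower `efficiency` models a link shared with other traffic, which is usually the realistic case for a production network.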

Network transfer with TLS
• Over the internet
• ExpressRoute
• Data Box online transfer
Shipping data offline
• Data Box offline data transfer

Use Azure Data Box to migrate data from an on-premises HDFS store to Azure Storage
• Data Box Disk: USB 3.1 SSD disks, 8 TB each; order up to 5 in each pack (up to 40 TB)
• Data Box: ruggedized, self-contained appliance, 100 TB
• Data Box Heavy: ruggedized, self-contained appliance, 1 PB

Storage | Type | Latency (consistency of latency) | Workloads | Bandwidth | Key benefits
ADLS Gen 2 | Hierarchical | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Atomic rename; file- and folder-level ACLs
Standard Blob | Object store | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Mature
Premium Blob | Object store | ~5 ms (High) | HBase in preview | Unconstrained | Fast
Premium Managed Disks | Hierarchical | ~5 ms (High) | Kafka, HBase in preview | Based on disk | Consistent latency
ADLS Gen 1 | Hierarchical | 10-100 ms (Low) | HDInsight 3.6 (no HBase) | High | Atomic rename; file- and folder-level ACLs

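[Diagram: HBase RegionServer. The client sends Put, Delete, and Get requests to the RegionServer, which hosts several Regions. Each Region has a Memstore and HFiles; the RegionServer also has a Log and a Flusher. The data is persisted to Storage.]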

Remote store write path challenges with Write Ahead Log
[Diagram: the client sends Put, Update, Get, and Delete requests to the RegionServer. Inserts, updates, gets, and deletes go through the Log and the Flusher, and the write to remote Storage is a sync operation.]
• Inconsistent latencies
• High latencies

Introducing Premium Managed Disks for the WAL
[Diagram: the client sends Put, Update, Get, and Delete requests to the RegionServer. The Log is written as a sync operation to Premium Managed Disk(s).]
• Consistent latencies
• Low latencies
• Data durability
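The write path in these diagrams can be sketched as a toy model: every mutation is first appended to the log, then buffered in the memstore, and the flusher later writes the memstore out as an immutable file. This is a simplified illustration, not HBase code. The class, its methods, and the row-count flush threshold are hypothetical, and the in-memory list standing in for the log only marks where the synchronous write to storage would happen.

```python
_TOMBSTONE = object()  # marks a deleted row in the memstore and flushed files


class ToyRegion:
    """Toy model of one region's write path: log append, memstore, flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # write-ahead log: appended before anything else
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed, immutable files (newest last)
        self.flush_threshold = flush_threshold

    def _write(self, op, row, value):
        # In a real system this append is the synchronous, latency-sensitive
        # step, which is why the storage behind the log matters.
        self.wal.append((op, row))
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def put(self, row, value):
        self._write("put", row, value)

    def delete(self, row):
        self._write("delete", row, _TOMBSTONE)

    def get(self, row):
        # Newest data wins: check the memstore, then files from newest to oldest.
        for store in [self.memstore, *reversed(self.hfiles)]:
            if row in store:
                value = store[row]
                return None if value is _TOMBSTONE else value
        return None

    def flush(self):
        if self.memstore:
            self.hfiles.append(dict(self.memstore))
            self.memstore.clear()


region = ToyRegion(flush_threshold=2)
region.put("row1", "a")
region.put("row2", "b")   # reaches the threshold and triggers a flush
region.delete("row1")     # logged, then buffered as a tombstone
print(region.get("row1"), region.get("row2"))
```

Because each write waits on the log append, a log with slow or inconsistent latency slows every Put and Delete, regardless of how fast the memstore is.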

Low-latency workload: HBase / small writes
[Diagram: HBase RegionServer with Regions, Memstores, HFiles, Log, and Flusher. The client sends Put, Delete, and Get requests. Backed by Storage and Premium Managed Disk(s).]

[Diagram: HBase RegionServer with Regions, Memstores, HFiles, Log, and Flusher. The client sends Put, Delete, and Get requests. Backed by Premium Blob Storage and Premium Managed Disk(s).]

Workload | Caching options | Key benefits
Spark | Spark IO Cache | Up to ~8-10x performance improvement
HBase & Phoenix | Bucket cache | Up to 5-10x performance gains on recently read or written data
Hive + LLAP | LLAP intelligent cache / result cache | Up to ~4-100x gain on cached data

Azure Data Lake Storage
Instance | Cores | RAM | Temp SSD
D1 v2 | 1 | 3.50 GiB | 50 GiB
D2 v2 | 2 | 7.00 GiB | 100 GiB
D3 v2 | 4 | 14.00 GiB | 200 GiB
D4 v2 | 8 | 28.00 GiB | 400 GiB
D5 v2 | 16 | 56.00 GiB | 800 GiB
• Significant Spark performance speedup with IO cache (up to 9x performance gains)
• Automatic cache resource management
• DRAM + temp SSD makes a large cache pool

PERIMETER
• Isolate clusters within VNETs
• Service Endpoint support for WASB, Azure DB, Cosmos DB
• Restrict outbound traffic using NVAs*
AUTHENTICATION
• Azure Active Directory
• Kerberos with Active Directory
AUTHORIZATION
• Role-Based Access Control
• Apache Ranger based Access Control
DATA PROTECTION
• Encryption on-the-wire with HTTPS enforced
• Encryption at rest using Azure Key Vault
• Auditing of all data operations and configuration changes

Apache Ranger ADLS Gen 2 ACLs

Scenario | Authorizing component
YARN: Submit-App | Apache Ranger: YARN plugin
Hive operations: Select, Drop, Index, Lock, Read, Write, Masking, Row-level filter on Hive database, table & columns | Apache Ranger: Hive plugin
Create/Alter Table with storage location reference | Apache Ranger + ADLS Gen 2 ACLs
Spark SQL access with Hive Metastore | Apache Ranger: Hive plugin
HBase access policies | Apache Ranger: HBase plugin
Kafka access policies | Apache Ranger: Kafka plugin
Access Azure Data Lake Storage Gen2 using the Spark DataFrame API | ADLS Gen 2 ACLs
Access Azure Data Lake Storage Gen2 using the RDD API | ADLS Gen 2 ACLs
HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp, etc. | ADLS Gen 2 ACLs
Running MapReduce jobs | ADLS Gen 2 ACLs

# Get the current safe mode status
hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode get
# A report that shows the details of HDFS state
hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -report
# Get HDFS out of safe mode
hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode leave
# Get HDFS into safe mode
hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode enter
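To script these checks, the commands can be wrapped in a small helper. This is a minimal sketch, assuming the `hdfs` binary is on the PATH and the file system URI is `hdfs://mycluster/` as in the commands above. The helper names `dfsadmin_command` and `run_dfsadmin` are hypothetical.

```python
import subprocess

HDFS_URI = "hdfs://mycluster/"  # file system URI taken from the commands above


def dfsadmin_command(*args, fs_uri=HDFS_URI):
    """Build the argv list for an `hdfs dfsadmin` call against fs_uri."""
    return ["hdfs", "dfsadmin", "-D", f"fs.default.name={fs_uri}", *args]


def run_dfsadmin(*args, fs_uri=HDFS_URI):
    """Run the command and return its stdout; raises if hdfs exits non-zero.

    Assumption: the `hdfs` binary is installed and on the PATH.
    """
    result = subprocess.run(
        dfsadmin_command(*args, fs_uri=fs_uri),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Building the command does not require a cluster; running it does.
safemode_get = dfsadmin_command("-safemode", "get")
print(" ".join(safemode_get))
```

On a cluster node, `run_dfsadmin("-safemode", "get")` would return the same status text the first command prints.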

Set up Autoscale
• Customize to your own scenario
• Pay for ONLY what you need
• Monitor scaling history easily
• Graceful scale-down

[Diagram: HDInsight cluster: gateways (https://cluster.azurehdinsight.net/APIs), Head Node 1 and Head Node 2, four worker nodes, three ZooKeeper nodes, Hive Metastore, YARN]

Workload | DR option
Spark / Hive | Manual, partner solution
HBase | HBase replication, snapshot export, Import/Export, CopyTable
Kafka | MirrorMaker
https://github.com/anagha-microsoft/hdi-spark-dr
https://github.com/anagha-microsoft/hdi-kafka-dr
https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication

Apache Ambari, Azure Log Analytics integration, HDInsight cluster metrics

Motivation and benefits
Architecture best practices
Infrastructure best practices
Storage best practices
Data migration best practices
Security and DevOps best practices
https://azure.microsoft.com/en-us/blog/migrating-on-premises-hadoop-infrastructure-to-azure-hdinsight/
