Best practices:
Hadoop eco-system migration
from on-premises to Azure HDInsight
PASS SUMMIT 2018 | Seattle | Nov 7th 2018
• The most trusted and compliant platform
A secure and managed Apache Hadoop and Spark platform for building data lakes in Azure
Best Practices: Hadoop migration to Azure HDInsight
Workload | HDInsight cluster type
Batch processing (ETL / ELT) | Hadoop, Spark
Data warehousing | Hadoop, Spark, Interactive Query
IoT / Streaming | Kafka, Storm, Spark
NoSQL transactional processing | HBase
Interactive and faster queries with in-memory caching | Interactive Query
Data science | ML Services, Spark
• Clusters can be deleted once the workload has completed successfully
• Deleting a cluster does not delete the storage account or the external metadata associated with the cluster
• Storage does not need to be co-located with compute
• Data can be in Azure Storage, Azure Data Lake Store, or both
• A Hadoop credential provider path can be used to protect storage keys in
  • Cluster configs
  • DistCp jobs
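As a rough illustration of the credential provider bullet above, the sketch below assembles a DistCp command line that passes `hadoop.security.credential.provider.path` as a `-D` option, so the storage key itself does not appear on the command line. It assumes the key was stored beforehand in a JCEKS keystore (for example with `hadoop credential create`); the keystore location, NameNode host, container, and account names are placeholders, not values from the deck.

```python
from typing import List


def build_distcp_command(
    source: str,
    destination: str,
    credential_provider_path: str,
) -> List[str]:
    """Return argv for a DistCp run that reads the storage key from a
    Hadoop credential provider (a JCEKS keystore) rather than from
    plain-text configuration."""
    if not credential_provider_path.startswith("jceks://"):
        raise ValueError("expected a jceks:// credential provider path")
    return [
        "hadoop",
        "distcp",
        # Generic -D options go before the source and target paths.
        "-D",
        f"hadoop.security.credential.provider.path={credential_provider_path}",
        source,
        destination,
    ]


# All names below are hypothetical placeholders.
command = build_distcp_command(
    source="hdfs://onprem-nn:8020/data/sales",
    destination="wasbs://container@account.blob.core.windows.net/data/sales",
    credential_provider_path="jceks://hdfs@onprem-nn/user/etl/azure.jceks",
)
print(" ".join(command))
```

The function only builds the argument list; running it (for instance through `subprocess.run`) is left to the caller so the command can be reviewed first.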
• Identify the number of worker nodes
• Choose the VM size and type
• Choose the Region
• Choose storage location and size
Recommended VM sizes by node type and cluster type:

Node type | Hadoop | HBase | Interactive Query | Storm | Spark | ML Server
Head | D3 v2, D4 v2, D12 v2 | D3 v2, D4 v2, D12 v2 | D13, D14 | A4 v2, A8 v2, A2m v2 | D12 v2, D13 v2, D14 v2 | D12 v2, D13 v2, D14 v2
Worker | D3 v2, D4 v2, D12 v2 | D3 v2, D4 v2, D12 v2 | D13, D14 | D3 v2, D4 v2, D12 v2 | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2
Zookeeper | - | A4 v2, A8 v2, A2m v2 | - | A2 v2, A4 v2, A8 v2 | - | -
Edge | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2 | D4 v2, D12 v2, D13 v2, D14 v2
• Secure communication between Azure resources
• Ability to filter and route network traffic
• Securely connect to
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Cosmos DB
  • SQL databases
• Traffic flows through secured route from within the Azure data center
• HDInsight cluster is joined to the Active Directory domain
• Supports
  • Active Directory-based authentication
  • Multiuser support
  • Role-based access control
  • Auditing
• Provides elasticity to scale the number of worker nodes up and down
• Lets you shrink the cluster after hours or on weekends and expand it during peak business demand
• An edge node is a Linux VM with the same client tools configured as on the head node
• An edge node can be used
  • to access the cluster
  • to test client applications
  • to host client applications
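The after-hours scaling pattern described above could be driven by a small scheduling rule like the one below. This is only a sketch of the decision logic: the node counts and business hours are made-up defaults, and the actual resize call (portal, CLI, or SDK) is deliberately not shown.

```python
from datetime import datetime


def target_worker_count(
    now: datetime,
    peak_workers: int = 10,
    off_peak_workers: int = 3,
    business_start_hour: int = 8,
    business_end_hour: int = 18,
) -> int:
    """Pick a worker-node count: the larger size during weekday business
    hours, the smaller size after hours and on weekends."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    in_business_hours = business_start_hour <= now.hour < business_end_hour
    if is_weekend or not in_business_hours:
        return off_peak_workers
    return peak_workers


print(target_worker_count(datetime(2018, 11, 7, 10, 0)))   # a Wednesday morning
print(target_worker_count(datetime(2018, 11, 10, 10, 0)))  # a Saturday morning
```

A scheduler would call this periodically and resize the cluster only when the returned count differs from the current one.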
• Main metastores
  • Hive
  • Oozie
  • Ranger
• Azure SQL Database is used for the metastores
• Clusters can be created and deleted without losing metadata
• A single metastore database can be shared across different types of clusters
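To make the shared-metastore idea above more concrete, the sketch below builds the standard Hive `javax.jdo.option.*` connection properties for a metastore hosted in Azure SQL Database and reuses them for two clusters. The server, database, and user names are hypothetical, and on HDInsight the external metastore is normally chosen when the cluster is created rather than by editing these properties by hand, so treat this as an illustration of what the settings look like.

```python
from typing import Dict


def metastore_properties(server: str, database: str, user: str) -> Dict[str, str]:
    """Build Hive metastore connection properties that point at an
    Azure SQL Database. The password is deliberately left out; it
    would come from a secret store, not from source code."""
    jdbc_url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;trustServerCertificate=false"
    )
    return {
        "javax.jdo.option.ConnectionURL": jdbc_url,
        "javax.jdo.option.ConnectionDriverName":
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "javax.jdo.option.ConnectionUserName": user,
    }


# Hypothetical names: two clusters of different types reuse one metastore.
shared = metastore_properties("contoso-meta", "hivemetastore", "hiveadmin")
spark_cluster_config = dict(shared)
llap_cluster_config = dict(shared)
print(spark_cluster_config["javax.jdo.option.ConnectionURL"])
```

Because both clusters point at the same database, table definitions created from one are visible to the other and survive deletion of either cluster.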
• Consider using an LLAP cluster for interactive Hive queries
• Consider using Spark jobs in place of Hive jobs
• Consider replacing Impala-based queries with LLAP queries
• Consider replacing MapReduce jobs with Spark jobs
• Consider replacing low-latency Spark batch jobs with Spark Structured Streaming jobs
• Data orchestration: consider using Azure Data Factory (ADF) 2.0
• Consider Ambari for cluster management
• Change data storage from on-premises HDFS to WASB or ADLS
• Consider using Ranger RBAC on Hive tables and auditing
• Transfer data over network with TLS
  • DistCp
  • Azure Data Factory
  • AzureCp
  • Third-party tools, including WANDisco
  • Kafka MirrorMaker
  • Sqoop
• Shipping data
  • Import / Export service
  • Data Box
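Choosing between network transfer and shipping data usually comes down to simple arithmetic on data volume and usable bandwidth. The sketch below is one way to estimate the network option; the 80% utilization default and the example sizes are assumptions for illustration, not figures from the deck.

```python
def network_transfer_days(
    data_terabytes: float,
    link_megabits_per_second: float,
    utilization: float = 0.8,
) -> float:
    """Estimate days needed to copy data over a network link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s) and
    assumes only a fraction of the link (utilization) is usable.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_bits = data_terabytes * 10**12 * 8
    usable_bits_per_second = link_megabits_per_second * 10**6 * utilization
    seconds = total_bits / usable_bits_per_second
    return seconds / 86_400  # seconds per day


# Example figures are made up for illustration.
for size_tb in (1, 50, 500):
    days = network_transfer_days(size_tb, link_megabits_per_second=1000)
    print(f"{size_tb} TB over 1 Gbps: about {days:.1f} days")
```

When the estimate runs into weeks, an offline option such as the Import / Export service or Data Box becomes worth comparing against it.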
• Hive metastore migration using scripts
  • Generate the Hive DDLs
  • Edit the generated DDL to replace HDFS URLs with WASB/ADLS/ABFS URLs
  • Execute the updated DDL on the metastore from the HDI cluster
• Hive metastore migration using DB replication
• Ranger metastore migration
  • Export on-premises Ranger policies to XML files
  • Transform on-premises HDFS-based paths to WASB/ADLS
  • Import the policies into Ranger running on HDI
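The "edit the generated DDL" step above is essentially a prefix rewrite on storage URLs. The sketch below shows one possible way to do it with a regular expression; the table definition, host name, and `wasbs://` target are hypothetical, and a real migration would likely need per-table mapping rules rather than a single target root.

```python
import re

# Matches the scheme and authority of an HDFS URI, e.g. "hdfs://namenode:8020"
# or an HA nameservice such as "hdfs://nameservice1".
HDFS_AUTHORITY = re.compile(r"hdfs://[\w.-]+(?::\d+)?")


def rewrite_locations(ddl: str, target_root: str) -> str:
    """Replace the hdfs://host[:port] prefix of every location in the
    DDL text with the given cloud storage root, keeping the path."""
    root = target_root.rstrip("/")
    # A function replacement avoids any escape processing of the new root.
    return HDFS_AUTHORITY.sub(lambda _match: root, ddl)


# Hypothetical DDL and storage names, for illustration only.
ddl = (
    "CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)\n"
    "STORED AS ORC\n"
    "LOCATION 'hdfs://onprem-nn:8020/warehouse/sales';"
)
print(rewrite_locations(ddl, "wasbs://data@account.blob.core.windows.net"))
```

The same kind of path substitution applies to the exported Ranger policies, although their file format is not covered here.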
• Remediate applications
• Perform Tests
• Optimize
https://aka.ms/PASS2018Survey
Take the survey at our survey station or on
your mobile device!
Once completed, come by the reception
desk for your Microsoft prize, and to collect
your raffle ticket!



Editor's Notes

  • #3: Azure HDInsight is a secure and managed platform for building data lakes on Azure based on the Apache Hadoop and Spark frameworks. So, what does HDInsight have to offer?

    Reliable open source analytics with an industry-leading SLA: HDInsight allows you to easily spin up open source cluster types guaranteed with the industry’s best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core, more than all other managed cloud providers combined, to support your deployment, and the ability to fix and commit code back to Hadoop.

    Enterprise-grade security and monitoring: HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI, and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).

    Most productive platform for developers and scientists: HDInsight offers developers tailored experiences through rich productivity suites for Hadoop and Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration with Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data and up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.

    Cost-effective cloud scale: HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You only pay for the compute and storage you use and are given the ability to choose any Azure VM types that enable the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*

    Integration with leading productivity applications: In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale, and StreamSets are a few such applications, which are very popular on the HDInsight platform today.

    Easy for administrators to manage: With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There’s also no time-consuming installation or setup, and no need to patch the operating system or upgrade the Hadoop versions. Azure does it for you. Launch your first cluster in minutes.
  • #16: So, to bring it all together, here's where Microsoft has invested, across these four areas: identity and access management, information protection, threat protection, and security management. We’ve put a tremendous amount of investment into these, and the way it shows up is across a pretty broad array of product areas and features. Our Identity and Access Management tools enable you to take an identity-based approach to security and establish truly conditional access policies. Our Information Protection solutions help you apply protection that travels with the information as it moves around, both inside and outside your organization. Our Threat Protection capabilities are built in to the platform, so you can strengthen both pre-breach protection, with deep capabilities across e-mail, collaboration services, and end points including hardware-based protection, and post-breach detection that includes memory- and kernel-based protection and response with automation. And our Security Management tools give you the visibility and, more importantly, the guidance to manage policy centrally.