Five essential new enhancements in Azure HDInsight
AskHDInsight@microsoft.com
https://blogs.msdn.microsoft.com/ashish/
• The most trusted and compliant platform
A secure and managed Hadoop and Spark cloud platform for building data lakes for enterprises
1. Fast BI
2. Transactions
3. Security
4. New Spark IO Cache
5. Developer Productivity
@ashishth
Azure Data Lake Storage
Logs (unstructured)
Media (unstructured)
Files (unstructured)
Business/custom apps (structured)
Apache Hive LLAP for interactive querying
Azure HDInsight
VS Code / JDBC Client
New in Hive LLAP 3.0
• Result Caching
• Dynamic Materialized Views
• Workload Management
• JDBC Integration
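Result caching, the first item above, is described in the editor's notes (#7) as reusing a previously computed result when the same query is processed again. The Python sketch below illustrates only that idea; it is not how Hive LLAP implements it. `ResultCache`, `run_on_cluster`, the query text, and the returned rows are all made up for illustration.

```python
class ResultCache:
    """Toy query-result cache keyed by the query text (illustrative only)."""

    def __init__(self, execute):
        self._execute = execute   # callable that actually runs a query
        self._results = {}        # query text -> cached result
        self.hits = 0

    def query(self, sql):
        key = sql.strip()
        if key in self._results:
            # Same query seen before: return the stored result, skip execution.
            self.hits += 1
            return self._results[key]
        result = self._execute(sql)
        self._results[key] = result
        return result

    def invalidate(self):
        # A real engine has to drop cached results when underlying data changes.
        self._results.clear()


calls = []

def run_on_cluster(sql):
    # Hypothetical stand-in for an expensive query execution.
    calls.append(sql)
    return [("2018", 42)]

cache = ResultCache(run_on_cluster)
first = cache.query("SELECT year, COUNT(*) FROM sales GROUP BY year")
second = cache.query("SELECT year, COUNT(*) FROM sales GROUP BY year")
print(len(calls), cache.hits)
```

The second call never reaches `run_on_cluster`, which is the compute saving the notes refer to.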
Dimensional Aggregates, Top N Queries, Min, Max, Avg, Aggregations, Time Series
Azure Data Lake Storage
Advantages
• ACID (Atomicity, Consistency, Isolation, and Durability) is the default in Hive 3.0
• No performance overhead
• No bucketing required
• Spark can read and write Hive ACID tables via the Hive Warehouse Connector
BRK2371 - Gaining deeper insights from big data using open source analytics on Azure HDInsight
Azure Data Lake Storage
INSTANCE  CORES  RAM        TEMP SSD
D1 v2     1      3.50 GiB   50 GiB
D2 v2     2      7.00 GiB   100 GiB
D3 v2     4      14.00 GiB  200 GiB
D4 v2     8      28.00 GiB  400 GiB
D5 v2     16     56.00 GiB  800 GiB
• Significant Spark performance speedup with IO Cache (up to 9X perf gains)
• Automatic cache resource management
• DRAM + temp SSD make a large cache pool
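The editor's notes (#19) add that, for tables in a columnar storage layout, only the columns being accessed are cached. A minimal Python sketch of that idea follows. It is a toy least-recently-used cache, not the HDInsight IO Cache implementation; `ColumnCache`, `fetch_from_remote`, and the table and column names are hypothetical.

```python
from collections import OrderedDict


class ColumnCache:
    """Toy LRU cache that stores individual columns rather than whole tables."""

    def __init__(self, fetch_column, capacity):
        self._fetch = fetch_column   # callable (table, column) -> list of values
        self._capacity = capacity    # maximum number of cached columns
        self._cache = OrderedDict()

    def read(self, table, columns):
        out = {}
        for col in columns:
            key = (table, col)
            if key in self._cache:
                self._cache.move_to_end(key)          # mark as recently used
                values = self._cache[key]
            else:
                values = self._fetch(table, col)      # cache miss: go to storage
                self._cache[key] = values
                if len(self._cache) > self._capacity:
                    self._cache.popitem(last=False)   # evict least recently used
            out[col] = values
        return out


remote_reads = []

def fetch_from_remote(table, column):
    # Hypothetical stand-in for a read from remote storage.
    remote_reads.append((table, column))
    return [1, 2, 3]

cache = ColumnCache(fetch_from_remote, capacity=2)
cache.read("sales", ["price"])
cache.read("sales", ["price", "qty"])
print(remote_reads)
```

The second read fetches only `qty` from remote storage; `price` is served from the cache, and columns that are never queried are never cached.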
[Chart: Top 20 queries with most gains. Running time in seconds (S) per query, Without IO Cache vs. With IO Cache.]
Queries:          Q52 Q27 Q46 Q48 Q3 Q36 Q79 Q59 Q44 Q55 Q18 Q12 Q61 Q20 Q42 Q68 Q56 Q33 Q73
Without IO Cache: 177 189 172 182 113 166 155 269 284 104 166 126 235 132 98 117 183 214 98
With IO Cache:    18 24 26 28 18 27 25 45 49 18 30 23 44 27 20 24 39 48 22
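The per-query gains in this chart can be checked with a few lines of Python. This is a sketch that assumes the data labels pair up with the query names in the order they appear on the slide; the variable names are illustrative.

```python
# Running times in seconds, as read off the chart. Pairing each value with a
# query name assumes the labels appear in the same order as the queries.
queries = ["Q52", "Q27", "Q46", "Q48", "Q3", "Q36", "Q79", "Q59", "Q44", "Q55",
           "Q18", "Q12", "Q61", "Q20", "Q42", "Q68", "Q56", "Q33", "Q73"]
without_cache = [177, 189, 172, 182, 113, 166, 155, 269, 284, 104,
                 166, 126, 235, 132, 98, 117, 183, 214, 98]
with_cache = [18, 24, 26, 28, 18, 27, 25, 45, 49, 18,
              30, 23, 44, 27, 20, 24, 39, 48, 22]

# Speedup = running time without the cache divided by running time with it.
speedups = {q: off / on for q, off, on in zip(queries, without_cache, with_cache)}

best = max(speedups, key=speedups.get)
worst = min(speedups, key=speedups.get)
print(f"largest gain:  {best} {speedups[best]:.1f}x")
print(f"smallest gain: {worst} {speedups[worst]:.1f}x")
```

Under that pairing, the largest gain is a little under 10x (Q52), in line with the "up to 9X" figure quoted for IO Cache.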
[Chart: Total running time (secs)]
HDInsight (no IO Cache): 11,967
HDInsight IO Cache: 5,191
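For reference, the overall gain implied by these two totals works out as follows. The arithmetic uses only the two numbers on the slide; the variable names are illustrative.

```python
# Total running times in seconds, taken from the slide.
no_io_cache_secs = 11_967
io_cache_secs = 5_191

speedup = no_io_cache_secs / io_cache_secs
reduction = 1 - io_cache_secs / no_io_cache_secs

print(f"overall speedup: {speedup:.2f}x")
print(f"running time reduced by {reduction:.1%}")
```

The whole-run speedup is smaller than the best per-query figures because the total covers every query in the run, not just those that gain most from caching.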
Built-in templates supporting Maven and SBT projects
Local and Cluster Run & Debug
Job Executions Graph – Visualize Spark job execution with heatmap, playback, and outlier detection.
Job Debugging & Diagnosis
• Data & Time Skew – Skew detection and analysis with built-in rules and customized rules.
• Executor usage analysis – Visualize executors' allocation and actual execution utilization.
Job Data Management – Access and manage data input, output, and table operations for Spark jobs.
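To make the Data & Time Skew item concrete, here is a sketch of what a customized skew rule might look like. The rule (flag any task that reads more than twice the median input size) and the function name `find_skewed_tasks` are hypothetical, not the tool's built-in rules, and the input sizes are made up.

```python
from statistics import median


def find_skewed_tasks(task_bytes, ratio=2.0):
    """Return ids of tasks whose input exceeds `ratio` times the median input.

    `task_bytes` maps a task id to the number of bytes that task read.
    This threshold rule is a hypothetical example of a custom skew rule.
    """
    typical = median(task_bytes.values())
    return sorted(t for t, size in task_bytes.items() if size > ratio * typical)


# Made-up input sizes (bytes) for the tasks of one stage.
stage = {0: 120_000, 1: 130_000, 2: 125_000, 3: 900_000, 4: 118_000}
print(find_skewed_tasks(stage))
```

The same shape of rule applies to time skew by passing task durations instead of input sizes.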
http://myignite.microsoft.com
https://aka.ms/ignite.mobileapp

Editor's Notes

  • #4: Reliable open source analytics with an industry-leading SLA
    HDInsight allows you to easily spin up enterprise-grade open source cluster types guaranteed with the industry’s best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core (more than all other managed cloud providers combined) to support your deployment, and the ability to fix and commit code back to Hadoop.

    Enterprise-grade security and monitoring
    HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI, and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).

    Most productive platform for developers and scientists
    HDInsight offers developers tailored experiences through rich productivity suites for Hadoop and Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration with Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.

    Cost-effective cloud scale
    HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You only pay for the compute and storage you use, and you can choose any Azure VM type that enables the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*

    Integration with leading productivity applications
    In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale, and StreamSets are a few such applications that are very popular on the HDInsight platform today.

    Easy for administrators to manage
    With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There is no time-consuming installation or setup, and no need to patch the operating system or upgrade Hadoop versions. Azure does it for you. Launch your first cluster in minutes.
  • #7: Fast interactive BI, data security, and end-user adoption are three critical challenges for successful big data analytics implementations. Without the right architecture and tools, many big data and analytics projects fail to catch on with common BI users and enterprise security architects. Last year at Ignite we announced HDInsight Interactive Query, which produces significantly faster queries on raw data stored in commodity storage systems such as Azure Blob store or Azure Data Lake Store. In HDInsight 4.0, we have taken this even further with key improvements such as:

    LLAP result caching: Caching query results allows a previously computed query result to be reused when the same query is processed by Hive again. This feature dramatically speeds up frequently used queries. When result caching is enabled, your cluster saves compute resources and returns the previously cached results much more quickly, improving the performance of common queries submitted by users.

    Dynamic materialized views: Hive now supports dynamic materialized views. Pre-computation of summaries (materialized views) is a query speed-up technique in traditional data warehousing systems. Once created, materialized views can be stored natively in Hive or in an Apache Druid layer, and they can seamlessly use LLAP acceleration. The optimizer then relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations.

    Transactions: While the previous version of ACID (Atomicity, Consistency, Isolation, and Durability) in Hive needed specialized configurations such as enabling transactions and implementing bucketing, ACID v2 in Hive 3.0 brings performance improvements in both the storage format and the execution engine, with equal or better performance compared to non-ACID tables. ACID is enabled by default to allow full support for data updates.

    JDBC integration: This helps you query data in other data sources such as SQL Server or Oracle.

    Learn more: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/
  • #9: While Hive LLAP is great for providing an interactive experience on complex queries, it is not built to be an OLAP system. To provide a single solution for both complex SQL and OLAP-type queries, we are bringing Apache Druid integration with Hive LLAP to HDInsight. Druid is a high-performance, column-oriented, distributed data store that is well suited for user-facing analytic applications and real-time architectures. Druid is optimized for sub-second queries to slice and dice, drill down, search, filter, and aggregate event streams. Druid is commonly used to power interactive applications where sub-second performance with thousands of concurrent users is expected. By combining Druid with LLAP in a single stack, we are enabling a powerful BI solution for our customers. Users and applications can use a JDBC endpoint to submit a query, and depending on the nature of the query, it can be answered by the Druid layer or the LLAP layer.
  • #11: Simple queries can be answered directly from Druid and benefit from Druid’s extensive OLAP optimizations. More complex operations will push work down into Druid when it can, then run the remaining bits of the query in Hive LLAP.
  • #13: Hive has supported transactions since v0.14, but with many limitations:
      • ACID is not a default option
      • Bucketing is mandatory
      • Subpar performance
      • Spark cannot read or write ACID tables
    HDInsight 4.0 changes this.
  • #14: The new integration between Apache Spark and Hive LLAP in HDInsight 4.0 delivers new capabilities for business analysts, data scientists, and data engineers. Business analysts get a performant SQL engine in the form of Hive LLAP (Interactive Query) while data scientists and data engineers get a great platform for ML experimentation and ETL with Apache Spark over transactional data in Hive tables.
  • #19: Announcing local disk cache for Spark. Better caching for tables in a columnar storage layout: only the columns being accessed are cached.
  • #20: Top queries show up to 9X improvements
  • #23: For Java and Scala developers, HDInsight Tools for IntelliJ provides a native authoring experience and supports Maven and SBT projects. It enables local and cluster run and debugging. The HDInsight tools also integrate well with Azure for cluster and resource management. With the recent updates, the IntelliJ tools support ESP as well as ADLS Gen 2, and also enable external JAR submission and GitHub integration for issue tracking and feedback collection.

    Talking points on tool overview:
      • Built-in templates with native authoring support for Scala and Java Spark apps, which lower the learning curve and make getting started easy.
      • Support for Apache Maven and Simple Build Tool (SBT) projects, enabling more flexibility in building Spark projects.
      • Easy switching between local and remote cluster run and debugging, which greatly facilitates the short, iterative code-run-debug cycle.
      • Azure integration, which gives access to your HDInsight clusters, WASB/ADLS storage, and your Spark jobs.

    Recent releases:
      • Support for ADLS Gen 2 and HDI 4.0 with Hadoop 3.0.1.
      • Integration with the HDInsight Enterprise Security Package (ESP).
      • Ambari connection for job authoring and submission.
      • Integration with external JARs for job submission.
      • GitHub integration for issue tracking and feedback submission.
  • #24: In this video, I will walk you through a rich set of Spark debugging and diagnosis capabilities provided by Microsoft HDInsight. The default Spark history server user experience is now enhanced in HDInsight with powerful interactive visualization of Job Graphs, Data Flows and job diagnosis analysis charts. The new features greatly assist HDInsight Spark developers in job data management, data sampling, job monitoring and job diagnosis.