Five essential new enhancements in Azure HDInsight

This document discusses features of Apache Spark on Azure HDInsight including a new Spark IO cache that provides significant performance improvements of up to 9x for Spark queries. It also discusses other HDInsight features like Hive LLAP for interactive querying, data analytics templates, and tools for Spark job debugging and diagnosis. Azure HDInsight is presented as a secure, managed Hadoop and Spark cloud platform for building data lakes on Azure.

Data & Analytics

AskHDInsight@microsoft.com
https://blogs.msdn.microsoft.com/ashish/

The most trusted and compliant platform
A secure and managed Hadoop and Spark cloud platform for building data lakes for enterprises

1. Fast BI
2. Transactions
3. Security
4. New Spark IO Cache
5. Developer Productivity
@ashishth

Apache Hive LLAP for interactive querying
Data sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Storage: Azure Data Lake Storage
Platform: Azure HDInsight
Clients: VS Code / JDBC Client

New in Hive LLAP 3.0
• Result Caching
• Dynamic Materialized Views
• Workload Management
• JDBC Integration

Dimensional Aggregates, Top N Queries, Min, Max, Avg, Aggregations, Time Series
Azure Data Lake Storage

Advantages
• ACID (Atomicity, Consistency, Isolation, and Durability) is the default in Hive 3.0
• No performance overhead
• No bucketing required
• Spark can read and write to Hive ACID tables via the Hive Warehouse Connector

BRK2371 - Gaining deeper insights from big data using open source analytics on Azure HDInsight

Azure Data Lake Storage

INSTANCE  CORES  RAM        TEMP SSD
D1 v2     1      3.50 GiB   50 GiB
D2 v2     2      7.00 GiB   100 GiB
D3 v2     4      14.00 GiB  200 GiB
D4 v2     8      28.00 GiB  400 GiB
D5 v2     16     56.00 GiB  800 GiB

• Significant Spark performance speed-up with IO Cache (up to 9X perf gains)
• Automatic cache resource management
• DRAM + Temp SSD makes a large cache pool

Top 20 queries with most gains (running time in seconds)

Query  Without IO Cache  With IO Cache
Q52    177               18
Q27    189               24
Q46    172               26
Q48    182               28
Q3     113               18
Q36    166               27
Q79    155               25
Q59    269               45
Q44    284               49
Q55    104               18
Q18    166               30
Q12    126               23
Q61    235               44
Q20    132               27
Q42    98                20
Q68    117               24
Q56    183               39
Q33    214               48
Q73    98                22

Total running time (seconds)
HDInsight (no IO Cache): 11,967
HDInsight IO Cache: 5,191
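
The speed-up figures implied by the benchmark slides can be checked with a few lines of arithmetic. The sketch below is not part of the original deck. It assumes that the chart's data labels pair with the query names in the order they appear, and the names `speedup`, `per_query` and `total_speedup` are illustrative.

```python
# Running times in seconds, as read off the benchmark slides.
# Assumption: data labels pair with query names in the order shown.
queries = ["Q52", "Q27", "Q46", "Q48", "Q3", "Q36", "Q79", "Q59", "Q44", "Q55",
           "Q18", "Q12", "Q61", "Q20", "Q42", "Q68", "Q56", "Q33", "Q73"]
without_cache = [177, 189, 172, 182, 113, 166, 155, 269, 284, 104,
                 166, 126, 235, 132, 98, 117, 183, 214, 98]
with_cache = [18, 24, 26, 28, 18, 27, 25, 45, 49, 18,
              30, 23, 44, 27, 20, 24, 39, 48, 22]


def speedup(before: float, after: float) -> float:
    """Ratio of the running time without the cache to the time with it."""
    return before / after


# Per-query speed-up factor.
per_query = {
    q: speedup(b, a) for q, b, a in zip(queries, without_cache, with_cache)
}
best_query = max(per_query, key=per_query.get)

# Whole-benchmark speed-up from the total running time slide.
total_speedup = speedup(11_967, 5_191)

print(f"best query: {best_query} at {per_query[best_query]:.2f}x")
print(f"total running time: {total_speedup:.2f}x faster")
```

Under that pairing assumption, the largest per-query gain is consistent with the "up to 9X" claim, while the gain over the total running time is considerably smaller than the best single query.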

Built-in templates supporting Maven and SBT projects
Local and Cluster Run & Debug

Job Debugging & Diagnosis
• Job Executions Graph - Visualize Spark job execution with heatmap, playback and outlier detection.
• Data & Time Skew - Skew detection and analysis with built-in rules and customized rules.
• Executor usage analysis - Visualize executors' allocation and actual execution utilization.
• Job Data Management - Access and manage data input, output and table operations for Spark jobs.
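
To make the idea of a time-skew rule concrete, here is a minimal sketch. It is not the rule set used by the HDInsight tooling, which the slides do not describe. The function name `find_skewed_tasks` and the threshold of 3x the stage median are assumptions chosen for illustration.

```python
from statistics import median


def find_skewed_tasks(durations_s, ratio=3.0):
    """Return indices of tasks that ran longer than ratio x the stage median.

    Hypothetical rule for illustration only: a task counts as skewed when
    its duration exceeds `ratio` times the median duration of its stage.
    """
    if not durations_s:
        return []
    typical = median(durations_s)
    if typical <= 0:
        return []
    return [i for i, d in enumerate(durations_s) if d > ratio * typical]


# Task durations (seconds) for one stage; one straggler at index 4.
stage = [12.0, 11.5, 13.2, 12.8, 95.0, 12.1]
print(find_skewed_tasks(stage))
```

Using the median rather than the mean keeps a single straggler from inflating the baseline it is compared against.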

http://myignite.microsoft.com
https://aka.ms/ignite.mobileapp

Editor's Notes

#4: Reliable open source analytics with an industry-leading SLA
HDInsight allows you to easily spin up enterprise-grade open source cluster types guaranteed with the industry’s best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core (more than all other managed cloud providers combined) to support your deployment and the ability to fix and commit code back to Hadoop.
Enterprise-grade security & monitoring
HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).
Most productive platform for developers and scientists
HDInsight offers developers tailored experiences through rich productivity suites for Hadoop & Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration with Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.
Cost-effective cloud scale
HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You only pay for the compute and storage you use, and you can choose any Azure VM type that enables the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*
Integration with leading productivity applications
In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale and StreamSets are a few such applications, which are very popular on the HDInsight platform today.
Easy for administrators to manage
With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There’s also no time-consuming installation or setup. There is also no need to patch the operating system or upgrade the Hadoop versions. Azure does it for you. Launch your first cluster in minutes.
#7: Fast interactive BI, data security and end user adoption are three critical challenges for successful big data analytics implementations. Without the right architecture and tools, many big data and analytics projects fail to catch on with common BI users and enterprise security architects. Last year at Ignite we announced HDInsight Interactive Query, which produces significantly faster queries on raw data stored in commodity storage systems such as Azure Blob store or Azure Data Lake Store. In HDInsight 4.0 we have taken this even further, with key improvements such as:
LLAP result caching: Caching query results allows a previously computed query result to be re-used in the event that the same query is processed by Hive. This feature dramatically speeds up frequently used queries. When result caching is enabled, your cluster saves compute resources and returns the previously cached results much more quickly, improving the performance of common queries submitted by users.
Dynamic materialized views: Hive now supports dynamic materialized views. Pre-computation of summaries (materialized views) is a query speed-up technique in traditional data warehousing systems. Once created, materialized views can be stored natively in Hive or in an Apache Druid layer, and they can seamlessly use LLAP acceleration. The optimizer then relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations.
Transactions: While the previous version of ACID (Atomicity, Consistency, Isolation, and Durability) in Hive needed specialized configurations such as enabling transactions and implementing bucketing, ACID v2 in Hive 3.0 brings performance improvements in both the storage format and execution engine, with either equal or better performance when compared to non-ACID tables. ACID is enabled by default to allow full support for data updates.
JDBC integration: This helps you query data in other data sources such as SQL Server or Oracle.
Learn more: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/
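
The result-caching behaviour described in note #7 (an identical query is answered from a stored result instead of being recomputed) can be illustrated with a toy sketch. This is not how Hive LLAP implements its cache. The `ResultCache` class, the whitespace-only normalization and the `fake_engine` stand-in are all assumptions made for illustration.

```python
class ResultCache:
    """Toy query-result cache: identical query text is executed only once."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that actually executes SQL
        self._results = {}           # normalized query text -> cached result
        self.executions = 0          # how many times the engine was invoked

    @staticmethod
    def _key(sql):
        # Simplification: treat queries that differ only in whitespace as equal.
        return " ".join(sql.split())

    def query(self, sql):
        key = self._key(sql)
        if key not in self._results:
            self.executions += 1
            self._results[key] = self._run_query(sql)
        return self._results[key]

    def invalidate(self):
        # A real system must drop cached results when underlying data changes.
        self._results.clear()


def fake_engine(sql):
    """Hypothetical stand-in for an expensive query execution."""
    return [("2018", 42)]


cache = ResultCache(fake_engine)
first = cache.query("SELECT year, COUNT(*) FROM sales GROUP BY year")
second = cache.query("SELECT year,  COUNT(*)  FROM sales GROUP BY year")
print(cache.executions)
```

The second call is served from the cache, so the engine runs once. The `invalidate` method marks where a real cache has to account for data updates, which matters once transactional tables are in use.
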
#9: While Hive LLAP is great for providing an interactive experience on complex queries, it is not built to be an OLAP system. To provide a single solution for both complex SQL and OLAP-type queries, we are bringing Apache Druid integration with Hive LLAP to HDInsight. Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is optimized for sub-second queries to slice-and-dice, drill down, search, filter, and aggregate event streams. Druid is commonly used to power interactive applications where sub-second performance with thousands of concurrent users is expected. By combining Druid with LLAP in a single stack, we are enabling a powerful BI solution for our customers. Users and applications can use a JDBC endpoint to submit a query and, depending upon the nature of the query, it can be answered by the Druid layer or the LLAP layer.
#11: Simple queries can be answered directly from Druid and benefit from Druid’s extensive OLAP optimizations. More complex operations will push work down into Druid when it can, then run the remaining bits of the query in Hive LLAP.
#13: Hive has supported transactions since v0.14, but with many limitations:
• ACID is not a default option
• Bucketing is mandatory
• Subpar performance
• Spark doesn’t read or write ACID tables
HDInsight 4.0 changes this.
#14: The new integration between Apache Spark and Hive LLAP in HDInsight 4.0 delivers new capabilities for business analysts, data scientists, and data engineers. Business analysts get a performant SQL engine in the form of Hive LLAP (Interactive Query) while data scientists and data engineers get a great platform for ML experimentation and ETL with Apache Spark over transactional data in Hive tables.
#19: Announcing local disk cache for Spark. Caching is improved for tables in a columnar storage layout: only the columns being accessed are cached.
#20: Top queries show up to 9X improvements
#23: For Java / Scala developers, our HDInsight Tools for IntelliJ provide a native authoring experience and support Maven and SBT projects. They enable local and cluster run & debugging. The HDInsight tools also integrate well with Azure for cluster and resource management. With the recent updates, the IntelliJ tools support ESP as well as ADLS Gen 2, and also enable external JAR submission and GitHub integration for issue tracking and feedback collection.
Talking points on tool overview:
• Built-in template with native authoring support for Scala and Java Spark apps, which lowers the learning curve and makes getting started very easy.
• Support for Apache Maven and Simple Build Tool (SBT) projects, enabling more flexibility to build the Spark projects.
• Easily switch between local and remote cluster run & debugging, which greatly facilitates the short, iterative code-run-debug cycle.
• Azure integration allows you to access your HDInsight clusters, WASB/ADLS storage and your Spark jobs.
Recent releases:
• Support ADLS Gen 2 and HDI 4.0 with Hadoop 3.0.1.
• Integrate with HDInsight Enterprise Security Package (ESP).
• Enable Ambari connection for job authoring and submission.
• Integration with external JARs for job submission.
• GitHub integration for issue tracking and feedback submission.
#24: In this video, I will walk you through a rich set of Spark debugging and diagnosis capabilities provided by Microsoft HDInsight. The default Spark history server user experience is now enhanced in HDInsight with powerful interactive visualization of Job Graphs, Data Flows and job diagnosis analysis charts. The new features greatly assist HDInsight Spark developers in job data management, data sampling, job monitoring and job diagnosis.
