Analyzing StackExchange
with Azure Data Lake
Tom Kerkhove
NDC Sydney 2017
Nice to meet you
Tom Kerkhove
Azure Consultant @ Codit
Microsoft Azure MVP & Advisor
Member of Azug.be
“Integration of Things” whitepaper (https://bit.ly/azure-iot)
blog.tomkerkhove.be
@TomKerkhove
tomkerkhove
Agenda
• Introduction to Azure Data Lake
• What is Azure Data Lake Store?
• What is Azure Data Lake Analytics?
• Analyzing StackExchange with Azure Data Lake
Let’s go open-source, right?!
➔ Comes with a few challenges for C#/SQL professionals
➔ New languages to learn & maintain
➔ Rapidly evolving ecosystem
➔ Cluster management
➔ Typically Linux machines
Analyzing Big Data in Azure
Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
Data Warehousing vs Data Lakes
Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has to be transformed first
➔ Data Lakes
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, turns into a data swamp pretty fast
Martin Fowler on Data Lake & Data Warehouses: https://bit.ly/martin-fowler-data-lake
Security
➔ Role-based Access Control (RBAC)
➔ Grant users/groups access to folders/files (https://bit.ly/data-lake-store-acls)
➔ Firewall (off by default)
➔ Encryption at rest
  ➔ Keys managed by Microsoft
  ➔ Bring-your-own-key with Azure Key Vault
Pricing
➔ ~$0.032/GB stored per month
➔ Transaction costs
  ➔ ~$0.043 per 1M write transactions
  ➔ ~$0.0034 per 1M read transactions
  ➔ 1 transaction is a block of up to 128 kB
➔ Regular egress fees
➔ Monthly commitment packages
  ➔ Save up to 33%
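
To make the storage prices on this slide concrete, here is a minimal Python sketch of a monthly estimate. The rates are the approximate 2017 figures quoted above. The function names, the example workload, and the reading of "128 kB" as 128 × 1024 bytes are assumptions made for illustration, and egress fees and commitment discounts are left out.

```python
import math

# Approximate rates quoted on the slide (USD, 2017); treat them as illustrative.
STORAGE_PER_GB_MONTH = 0.032
WRITE_PER_MILLION = 0.043
READ_PER_MILLION = 0.0034
# Assumption: "128 kB" is read here as 128 * 1024 bytes.
TRANSACTION_BLOCK_BYTES = 128 * 1024


def transactions_for(num_bytes: int) -> int:
    """Billable transactions for one operation of num_bytes (at least one)."""
    return max(1, math.ceil(num_bytes / TRANSACTION_BLOCK_BYTES))


def monthly_store_cost(stored_gb: float, write_tx: int, read_tx: int) -> float:
    """Estimate one month of Data Lake Store cost, excluding egress."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    writes = write_tx / 1_000_000 * WRITE_PER_MILLION
    reads = read_tx / 1_000_000 * READ_PER_MILLION
    return storage + writes + reads


if __name__ == "__main__":
    # Hypothetical workload: 150 GB stored, 10M writes and 100M reads per month.
    print(f"${monthly_store_cost(150, 10_000_000, 100_000_000):.2f}")
```

With these rates the storage term dominates for a data set of this size, so transaction counts only start to matter for workloads that touch the data very frequently.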
Azure Data Lake Store vs Blob Storage
➔ No Limitations: Store whatever you want in any format
➔ Security: Built-in Azure Active Directory support
➔ Pricing: More expensive than Storage GRS
➔ Redundancy: It’s there but no control over it
➔ Built for Scale: Optimized for high-scale reads
➔ Integration: With Data Factory, Data Catalog & HDInsight
Full comparison on https://bit.ly/adls-vs-storage
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ No maintenance ~ Serverless
➔ Written in U-SQL
➔ SQL Syntax
➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
Data Sources
➔ Built-in partitioned tables
➔ Query data where it lives
➔ No need to prepare data
➔ One query that runs on multiple data stores
➔ Use the correct data store for the job
Writing U-SQL scripts
Extract from a data source by using built-in or custom extractors.
Transform / Analyse the data using SQL-syntax, in-line C# or C# method calls.
Output the result to a data source by using built-in or custom outputters.
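
The extract / transform / output shape of a script can be mimicked outside the service. The following is a minimal sketch in Python, not U-SQL: the sample data, the column names, and the reputation threshold are hypothetical, and the three functions only stand in for the roles an extractor, a query, and an outputter play.

```python
import csv
import io
from collections import Counter

# Hypothetical input: a tiny CSV standing in for a file in the store.
SAMPLE_INPUT = """Id,Site,Reputation
1,stackoverflow,1500
2,serverfault,40
3,stackoverflow,320
"""


def extract(text: str) -> list[dict]:
    """Extract: parse raw text into typed rows (the role of an extractor)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "Id": int(record["Id"]),
            "Site": record["Site"],
            "Reputation": int(record["Reputation"]),
        })
    return rows


def transform(rows: list[dict]) -> list[dict]:
    """Transform: filter and aggregate, similar to WHERE plus GROUP BY."""
    counts = Counter(row["Site"] for row in rows if row["Reputation"] >= 100)
    return [{"Site": site, "Users": n} for site, n in sorted(counts.items())]


def output(rows: list[dict]) -> str:
    """Output: serialise the result rows (the role of an outputter)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Site", "Users"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(output(transform(extract(SAMPLE_INPUT))), end="")
```

Keeping the three phases as separate functions mirrors the way a script swaps a built-in extractor or outputter for a custom one without touching the transformation in the middle.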
Extensibility
➔ C# Expressions
➔ User-Defined Functions (UDF)
➔ User-Defined Operators (UDO)
➔ User-Defined Aggregators (UDAGG)
User-Defined Operators (UDO)
➔ User-Defined Extractors
➔ User-Defined Processors
  ➔ Take one row and produce one row
  ➔ Pass-through versus transforming
➔ User-Defined Reducers
  ➔ Take n rows and produce 1 row
➔ User-Defined Outputters
➔ User-Defined Appliers
  ➔ Take one row and produce 0 to n rows
  ➔ Used with OUTER/CROSS APPLY
➔ User-Defined Combiners
  ➔ Combines rowsets (like a user-defined join)
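
The row-count contracts listed on this slide can be illustrated with plain functions. This is a conceptual sketch in Python that follows the slide's descriptions, not the C# interfaces you would actually implement; the row layout (`Id`, `Title`, `Tags`) and all function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator

Row = dict


def processor(row: Row) -> Row:
    """Processor: one row in, one row out (here: add a derived column)."""
    return {**row, "TitleLength": len(row["Title"])}


def applier(row: Row) -> Iterator[Row]:
    """Applier: one row in, zero to n rows out (here: one row per tag)."""
    for tag in row["Tags"]:
        yield {"PostId": row["Id"], "Tag": tag}


def reducer(rows: Iterable[Row]) -> Row:
    """Reducer: n rows of one group in, one row out (here: a count per tag)."""
    group = list(rows)
    return {"Tag": group[0]["Tag"], "Posts": len(group)}


def combiner(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Combiner: merges two rowsets, here as a simple inner join on key."""
    index: dict = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            yield {**left_row, **right_row}


if __name__ == "__main__":
    posts = [
        {"Id": 1, "Title": "How to U-SQL?", "Tags": ["azure", "u-sql"]},
        {"Id": 2, "Title": "Hello", "Tags": ["azure"]},
        {"Id": 3, "Title": "Untagged", "Tags": []},
    ]
    tag_rows = [tag_row for post in posts for tag_row in applier(post)]
    by_tag = sorted(tag_rows, key=itemgetter("Tag"))
    tag_counts = [reducer(g) for _, g in groupby(by_tag, key=itemgetter("Tag"))]
    print([processor(post)["TitleLength"] for post in posts])
    print(tag_counts)
    print(list(combiner(tag_rows, tag_counts, "Tag")))
```

The post without tags shows the "0 to n rows" case: the applier yields nothing for it, so it disappears from the downstream rowsets, which is the behaviour of a CROSS APPLY rather than an OUTER APPLY.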
Metadata Model
U-SQL Batch Job Execution Lifetime
Michael Rys on “Tuning & Optimizing U-SQL” https://bit.ly/tuning-optimizing-u-sql
Job States
Security
➔ Role-based Access Control (RBAC)
➔ Firewall (off by default)
➔ Access control on service catalog
➔ Access control on a per-database level
Resource Management
➔ Account-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum number of concurrent jobs
  ➔ Days to retain queries
➔ Job-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum priority
  ➔ Granted per user and/or group
Pricing
➔ Billed for processing time, not per job
➔ Billed per second
➔ $1.687 per hour per Analytics Unit
  ➔ ~$0.028 per minute
➔ Monthly commitment packages
  ➔ Save up to 74%
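
As a worked example of per-second billing, the sketch below turns the hourly rate on this slide into a job cost. It assumes the cost scales with the number of Analytics Units multiplied by elapsed time; the function name and the example job are hypothetical, and commitment discounts are ignored.

```python
# Rate quoted on the slide (USD per Analytics Unit per hour, 2017); illustrative only.
AU_PRICE_PER_HOUR = 1.687


def job_cost(analytics_units: int, duration_seconds: float) -> float:
    """Cost of one job, billed per second for the AUs assigned to it."""
    au_hours = analytics_units * duration_seconds / 3600
    return au_hours * AU_PRICE_PER_HOUR


if __name__ == "__main__":
    # Hypothetical job: 10 AUs for 30 minutes, i.e. 5 AU-hours.
    print(job_cost(10, 30 * 60))
    # One AU for one minute, which the slide rounds to about $0.028.
    print(job_cost(1, 60))
```

Because the bill follows AU time rather than job count, doubling the AUs only pays off when it cuts the runtime by close to half.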
Demo
Meet StackExchange
➔ Over 280 websites
➔ 150+ GB of open-source data
➔ Different kinds of data
➔ Posts
➔ Users
➔ Votes
➔ ...
➔ A big data sample data set
What Are We Going To Do?
➔ Acquiring The Data
  • Download the original data set
➔ Moving The Data
  • Upload data set to Azure
  • Determine what service to use
➔ Aggregating The Data
  • Merging data from each site into one file
  • Conversion from XML to CSV
➔ Analyzing The Data
  • Run business logic on it
  • Attempt to gain knowledge from it
➔ Visualizing The Data
  • Visualize what we’ve learned
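
The aggregation step (merge every site into one file, convert XML to CSV) can be sketched locally in Python, independent of how the demo implements it on Azure Data Lake Analytics. The sketch assumes each record is a `<row>` element whose attributes hold the column values; the column subset, the site names, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, TextIO

# Assumed column subset for illustration; "Site" is added during the merge.
COLUMNS = ["Site", "Id", "Reputation", "DisplayName"]


def rows_from_xml(site: str, xml_source: BinaryIO) -> Iterator[dict]:
    """Stream <row .../> elements from one site's XML file as dictionaries."""
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == "row":
            row = {"Site": site}
            for column in COLUMNS[1:]:
                row[column] = element.attrib.get(column, "")
            yield row
            # Drop the element's attributes once the row has been emitted.
            element.clear()


def merge_to_csv(sites: dict, out: TextIO) -> None:
    """Write the rows of every site into a single CSV with a Site column."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for site, source in sites.items():
        for row in rows_from_xml(site, source):
            writer.writerow(row)


if __name__ == "__main__":
    # Hypothetical in-memory stand-ins for two per-site user files.
    site_a = io.BytesIO(b'<users><row Id="1" Reputation="1500" DisplayName="Alice" /></users>')
    site_b = io.BytesIO(b'<users><row Id="7" Reputation="40" DisplayName="Bob" /></users>')
    merged = io.StringIO()
    merge_to_csv({"stackoverflow": site_a, "serverfault": site_b}, merged)
    print(merged.getvalue(), end="")
```

Adding the site name as a column is what makes a single merged file usable later: without it, rows from different sites with the same `Id` could not be told apart.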
How is it set up?
Operations
➔ Azure Data Lake Store
  ➔ Graphs
    • Storage Utilization
    • Read/Write
    • Ingress/Egress
  ➔ Audit & Request logs
  ➔ No Metrics
  ➔ No Alerts
➔ Azure Data Lake Analytics
  ➔ Graphs
    • Job status
    • Used # of AU time
  ➔ Metrics
    • Job status
    • Used # of AU time
  ➔ Audit & Request logs
  ➔ No Alerts
Azure Data Lake tools for Visual Studio
➔ Store Explorer
➔ Browse store
➔ Download complete / subset of file
➔ Preview
➔ Only in Visual Studio
➔ Job Visualizer
➔ Determine bottlenecks by using heatmaps
➔ Play back jobs based on telemetry
➔ Query optimization
➔ Job Profiler
Azure Data Lake tools for Visual Studio (Code)
➔ Integration with source control
➔ Unit Testing extensibility
➔ Local execution
➔ Simulate Data Lake Store
➔ Run & debug jobs
Integration with Azure Services
➔ Integrate with your data pipelines in Azure Data Factory
➔ Move data from Azure Data Lake Store to other stores
➔ Move data to Azure Data Lake Store
➔ Run U-SQL jobs within a pipeline
➔ Integration with Azure Data Catalog
➔ Register your Azure Data Lake Store assets
Learn more!
➔ Azure Data Lake Best Practices by Microsoft (Contact me)
➔ “Mastering Azure Analytics” by Zoiner Tejada (https://bit.ly/mastering-azure-analytics)
➔ MVA “Introducing Azure Data Lake” (https://bit.ly/intro-to-azure-data-lake)
➔ Azure Data Lake GitHub Repo (https://azure.github.io/AzureDataLake/)
➔ U-SQL Documentation (https://usql.io)
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
➔ Analyse today & explore tomorrow
➔ Data Swamps
➔ Data Lake Analytics
➔ No cluster management ~ “Serverless”
➔ Re-use existing skills
➔ Pay for what we use
➔ Big Data in Azure? Use Azure Data Lake!
