26/08/2019 1
Azure Data
2
alexandre.bergere@gmail.com
https://fr.linkedin.com/in/alexandrebergere
@AlexPhile
Avanade
2016 - 2019
Senior Analyst, Data Engineering
Having worked for three years as a senior analyst at
Avanade France, I developed my skills in data
analysis (MSBI, Power BI, R, Python, Spark,
Cosmos DB) by working on innovative projects
and proofs of concept in the energy industry.
ESAIP
Teacher
2016 - present
Data Freelance
2019 - present
26/08/2019
26/08/2019 3
Sources
This deck is a summary of the following learning paths:
o Azure for the Data Engineer
o Store data in Azure
o Work with relational data in Azure
o Large Scale Data Processing with Azure Data Lake Storage Gen2
o Implement a Data Streaming Solution with Azure Streaming Analytics
o Implement a Data Warehouse with Azure SQL Data Warehouse
on Microsoft Learn.
26/08/2019 4
Storage introduction
26/08/2019 5
Classify your data
NoSQL
26/08/2019 6
Distributed Systems and the CAP Theorem
NoSQL databases are distributed systems designed to operate across multiple nodes, where a single node is a virtual machine
or even a physical machine. The NoSQL database runs across a cluster of nodes so that it can distribute both query processing
and data storage.
As a distributed system, NoSQL databases offer some benefits in availability and scalability that a single SQL database does
not. If a node goes down, the database remains available and very little data is lost, if any. There are even features such as
quorum agreement when querying or writing data to ensure the stability and availability of data stored in the NoSQL
database.
What is the CAP Theorem?
When discussing any distributed system, it’s important to understand the CAP Theorem. The CAP Theorem, also known as
Brewer’s theorem, was first presented by Eric Brewer, a computer scientist at the University of California, Berkeley in 1998.
It helps describe the behaviour of distributed data systems.
The CAP Theorem states that it’s impossible for a distributed data store to simultaneously provide more than 2 of the
following 3 guarantees:
o Consistency: All queries receive the most recent data written to the data store.
o Availability: All requests receive a valid response without error.
o Partition Tolerance: All requests are processed and handled without error regardless of network issues or failed
nodes.
https://buildazure.com/nosql-vs-sql-demystifying-nosql-databases/
Structured data
26/08/2019 7
Structured data is data that adheres to a schema, so all of the data has the same fields or properties. Structured data can
be stored in a database table with rows and columns. Structured data relies on keys to indicate how one row in a table
relates to data in another row of another table. Structured data is also referred to as relational data, as the data's schema
defines the table of data, the fields in the table, and the clear relationship between the two.
Structured data is straightforward in that it's easy to enter, query, and analyze. All of the data follows the same format.
However, forcing a consistent structure also means evolution of the data is more difficult as each record has to be updated
to conform to the new structure.
Examples of structured data include:
o Sensor data
o Financial data
o CRM + OLAP …
Semi-structured data
26/08/2019 8
Semi-structured data is less organized than structured data, and is not stored in a relational format, as the fields do not
neatly fit into tables, rows, and columns. Semi-structured data contains tags that make the organization and hierarchy of
the data apparent. Semi-structured data is also referred to as non-relational or NoSQL data.
Examples of semi-structured data include:
o Key / Value pairs
o Graph data
o JSON files
o XML files
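For example, a single (hypothetical) product document in JSON carries its own structure through tags, even though other documents in the same store may use different fields:

{
  "id": "sku-12345",
  "name": "Trail runner",
  "price": 89.90,
  "options": { "sizes": [41, 42, 43], "colors": ["blue", "black"] },
  "bluetoothEnabled": true
}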
Unstructured data
26/08/2019 9
The organization of unstructured data is generally ambiguous. Unstructured data is often delivered in files, such as photos
or videos. The video file itself may have an overall structure and come with semi-structured metadata, but the data that
comprises the video itself is unstructured. Therefore, photos, videos, and other similar files are classified as unstructured
data.
Examples of unstructured data include:
o Media files, such as photos, videos, and audio files
o Office files, such as Word documents
o Text files
o Log files
Data classification
26/08/2019 10
Product catalog data
Product catalog data for an online retail business is fairly structured in nature, as each product has a product SKU, a description, a quantity, a
price, size options, color options, a photo, and possibly a video. So, this data appears relational to start with, as it all has the same structure.
However, as you introduce new products or different kinds of products, you may want to add different fields as time goes on. For example, new
tennis shoes you're carrying are Bluetooth-enabled, to relay sensor data from the shoe to a fitness app on the user’s phone. This appears to be
a growing trend, and you want to enable customers to filter on "Bluetooth-enabled" shoes in the future. You don't want to go back and update
all your existing shoe data with a Bluetooth-enabled property, you simply want to add it to new shoes.
With the addition of the Bluetooth-enabled property, your shoe data is no longer homogeneous, as you've introduced differences in the schema.
If this is the only exception you expect to encounter, you can go back and normalize the existing data so that all products include a "Bluetooth-
enabled" field to maintain a structured, relational organization. However, if this is just one of many specialty fields that you envision supporting
in the future, then the classification of the data is semi-structured. The data is organized by tags, but each product in the catalog can contain
unique fields.
Data classification: Semi-structured
Photos and videos
The photos and videos displayed on product pages are unstructured data. Although the media file may contain metadata, the body of the
media file is unstructured.
Data classification: Unstructured
Business data
Business analysts want to implement business intelligence to perform inventory pipeline evaluations and sales data reviews. In order to
perform these operations, data from multiple months needs to be aggregated together, and then queried. Because of the need to aggregate
similar data, this data must be structured, so that one month can be compared against the next.
Data classification: Structured
26/08/2019 11
Determine operational needs
Operations and latency:
What are the main operations you'll be completing on each data type, and what are the performance requirements?
Ask yourself these questions:
o Will you be doing simple lookups using an ID?
o Do you need to query the database for one or more fields?
o How many create, update, and delete operations do you expect?
o Do you need to run complex analytical queries?
o How quickly do these operations need to complete?
Operations and latency
26/08/2019 12
Product catalog data:
For product catalog data in the online retail scenario, customer needs are the highest priority. Customers will want to query the product catalog
to find, for example, all men's shoes, then men's shoes on sale, and then men's shoes on sale in a particular size. Customer needs may require
lots of read operations, with the ability to query on certain fields.
In addition, when customers place orders, the application must update product quantities. The update operations need to happen just as
quickly as the read operations so that users don't put an item in their shopping carts when that item has just sold out. This will not only result in
a large number of read operations, but will also require increased write operations for product catalog data. Be sure to determine the priorities
for all the users of the database, not just the primary ones.
Photos and videos:
The photos and videos that are displayed on product pages have different requirements. They need fast retrieval times so that they are
displayed on the site at the same time as product catalog data, but they don't need to be queried independently. Instead, you can rely on the
results of the product query, and include the video ID or URL as a property on the product data. So, photos and videos need only be retrieved
by their ID.
In addition, users will not make updates to existing photos or videos. They may, however, add new photos for product reviews. For example, a
user might upload an image of themselves showing off their new shoes. As an employee, you also upload and delete product photos from your
vendor, but that update doesn't need to happen as fast as your other product data updates.
In summary, photos and videos can be queried by ID to return the whole file, but creates and updates will be less frequent and are less of a
priority.
Business data:
For business data, all the analysis is happening on historical data. No original data is updated based on the analysis, so business data is read-
only. Users don't expect their complex analytics to run instantly, so having some latency in the results is okay. In addition, business data will be
stored in multiple data sets, as users will have different access to write to each data set. However, business analysts will be able to read from all
databases.
26/08/2019 13
Group multiple operations in a transaction
What is a transaction?
26/08/2019 14
A transaction is a logical group of database operations that execute together.
Here's the question to ask yourself regarding whether you need to use transactions in your application: Will a change to
one piece of data in your dataset impact another? If the answer is yes, then you'll need support for transactions in your
database service.
Transactions are often defined by a set of four requirements, referred to as ACID guarantees. ACID stands for Atomicity,
Consistency, Isolation, and Durability:
o Atomicity means that either all the operations in the transaction are executed, or none of them are.
o Consistency ensures that if something happens partway through the transaction, a portion of the data isn't left out of
the updates. Across the board, the transaction is applied consistently or not at all.
o Isolation ensures that one transaction is not impacted by another transaction.
o Durability means that the changes made due to the transaction are permanently saved in the system. Committed data
is saved by the system so that even in the event of a failure and system restart, the data is available in its correct state.
When a database offers ACID guarantees, these principles are applied to any transactions in a consistent manner.
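As an illustration of grouping operations, here is a minimal sketch of an explicit transaction run from Python with pyodbc against an Azure SQL Database; the connection string, table, and column names are hypothetical:

import pyodbc

# Hypothetical connection string, table, and column names.
conn_str = "Driver={ODBC Driver 17 for SQL Server};Server=tcp:<server>.database.windows.net;Database=<db>;Uid=<user>;Pwd=<password>"
conn = pyodbc.connect(conn_str)  # autocommit is off by default, so a transaction is already open
cursor = conn.cursor()
try:
    cursor.execute("UPDATE Inventory SET Quantity = Quantity - 1 WHERE Sku = ?", "sku-12345")
    cursor.execute("INSERT INTO Orders (Sku, CustomerId) VALUES (?, ?)", "sku-12345", 42)
    conn.commit()    # both changes are applied together (atomicity)
except Exception:
    conn.rollback()  # on any failure, neither change is applied
    raise

If either statement fails, the rollback leaves the database exactly as it was, which is the atomicity guarantee described above.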
OLTP vs OLAP
26/08/2019 15
Transactional databases are often called OLTP (Online Transaction Processing) systems. OLTP systems commonly support
lots of users, have quick response times, and handle large volumes of data. They are also highly available (meaning they
have very minimal downtime), and typically handle small or relatively simple transactions.
In contrast, OLAP (Online Analytical Processing) systems commonly support fewer users, have longer response times,
can be less available, and typically handle large and complex transactions.
The terms OLTP and OLAP aren't used as frequently as they used to be, but understanding them makes it easier to
categorize the needs of your application.
Now that you're familiar with transactions, OLTP, and OLAP, let's walk through each of the data sets in the online retail
scenario, and determine the need for transactions.
Transactional needs
26/08/2019 16
Product catalog data:
Product catalog data should be stored in a transactional database. When users place an order and the payment is verified, the inventory for the
item should be updated. Likewise, if the customer's credit card is declined, the order should be rolled back, and the inventory should not be
updated. These relationships all require transactions.
Photos and videos:
Photos and videos in a product catalog don't require transactional support. These files are changed only when an update is made or new files
are added. Even though there is a relationship between the image and the actual product data, it's not transactional in nature.
Business data:
For the business data, because all of the data is historical and unchanging, transactional support is not required. The business analysts working
with the data also have unique needs in that they often require working with aggregates in their queries, so that they can work with the totals
of other smaller data points.
26/08/2019 17
Choose a storage solution on Azure
Choose a storage solution on Azure
26/08/2019 18
Product catalog data:
o Data classification: semi-structured, because of the need to extend or modify the schema for new products.
o Operations: customers require a high number of read operations, with the ability to query on many fields within the database; the business requires a high number of write operations to track the constantly changing inventory.
o Latency & throughput: high throughput and low latency.
o Transactional support: required.
o Recommended service: Azure Cosmos DB.

Photos and videos:
o Data classification: unstructured.
o Operations: only need to be retrieved by ID; customers require a high number of read operations with low latency; creates and updates will be somewhat infrequent and can have higher latency than read operations.
o Latency & throughput: retrievals by ID need to support low latency and high throughput; creates and updates can have higher latency than read operations.
o Transactional support: not required.
o Recommended service: Azure Blob storage.

Business data:
o Data classification: structured.
o Operations: read-only, complex analytical queries across multiple databases.
o Latency & throughput: some latency in the results is expected based on the complex nature of the queries.
o Transactional support: required.
o Recommended service: Azure SQL Database.
26/08/2019 19
Data Tools
Azure Data Studio
26/08/2019 20
Azure Data Studio is a cross-platform database tool that you can run on Windows, macOS, and Linux. You'll use it to
connect to SQL Data Warehouse and Azure SQL Database.
Previously released under the preview name SQL Operations Studio, Azure Data Studio offers a modern editor experience
with IntelliSense, code snippets, source control integration, and an integrated terminal. It is engineered with the data
platform user in mind, with built-in charting of query result sets and customizable dashboards.
Storage Explorer
26/08/2019 21
Begin by downloading and installing Storage Explorer. You can use Storage Explorer to do several operations against data in
your Azure Storage account and data lake:
o Upload files or folders from your local computer into Azure Storage.
o Download cloud-based data to your local computer.
o Copy or move files and folders around in the storage account.
o Delete data from the storage account.
Data Migration Assistant (DMA)
26/08/2019 22
The Data Migration Assistant (DMA) helps you upgrade to a modern data platform by detecting compatibility issues that
can impact database functionality in your new version of SQL Server or Azure SQL Database. DMA recommends
performance and reliability improvements for your target environment and allows you to move your schema, data, and
uncontained objects from your source server to your target server.
To install DMA, download the latest version of the tool from the Microsoft Download Center, and then run
the DataMigrationAssistant.msi file.
Sources:
o SQL Server 2005
o SQL Server 2008
o SQL Server 2008 R2
o SQL Server 2012
o SQL Server 2014
o SQL Server 2016
o SQL Server 2017 on Windows
Targets:
o SQL Server 2012
o SQL Server 2014
o SQL Server 2016
o SQL Server 2017 on Windows and Linux
o Azure SQL Database
o Azure SQL Database Managed Instance
Azure SQL Data Sync
26/08/2019 23
SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across
multiple SQL databases and SQL Server instances.
When to use Data Sync:
Data Sync is useful in cases where data needs to be kept up-to-date across several Azure SQL databases or SQL Server
databases. Here are the main use cases for Data Sync:
o Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises
databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who
are considering moving to the cloud and would like to put some of their application in Azure.
o Distributed Applications: In many cases, it's beneficial to separate different workloads across different
databases. For example, if you have a large production database, but you also need to run a reporting or
analytics workload on this data, it's helpful to have a second database for this additional workload. This
approach minimizes the performance impact on your production workload. You can use Data Sync to keep these
two databases synchronized.
o Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To
minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily
keep databases in regions around the world synchronized.
Azure Database Migration Service
26/08/2019 24
Accelerate your transition to the cloud:
o Use a simple, self-guided migration process
o Get comprehensive assessments detailing pre-migration steps
o Migrate at scale from multiple sources to your target database
Accelerate your database migration:
Reduce the complexity of your cloud migration by using a single comprehensive service instead of multiple tools. Azure
Database Migration Service is designed as a seamless, end-to-end solution for moving on-premises SQL Server databases
to the cloud. Use the Database Migration Guide for recommendations, step-by-step guidance, and expert tips on your
specific database migration.
Finish faster with our guided process:
No specialty skills are required to get reliable outcomes with our highly-resilient and self-healing migration. The options
presented through the guided process are easy to understand and implement—so you can get the job done right the first
time.
Azure Cosmos DB: Data migration tools
26/08/2019 Big Data class by Alexandre Bergere 25
Data migration tools are available for each Azure Cosmos DB API:
o SQL API
o MongoDB API
o Graph API
o Table API
o Cassandra API
Summary
26/08/2019 26
Scenario and recommended solution:
o Disaster recovery: Azure geo-redundant backups
o Read scale: use read-only replicas to load balance read-only query workloads (preview)
o ETL (OLTP to OLAP): Azure Data Factory or SQL Server Integration Services
o Migration from on-premises SQL Server to Azure SQL Database: Azure Database Migration Service
o Keeping data up to date across several Azure SQL databases or SQL Server databases: Azure SQL Data Sync
26/08/2019 27
Data Storage
Understand data storage in Azure Storage
26/08/2019 28
Azure Storage accounts are the base storage type within Azure. Azure Storage offers a very scalable object store for data
objects and file system services in the cloud. It can also provide a messaging store for reliable messaging, or it can act as a
NoSQL store.
Azure Storage offers four configuration options:
o Azure Blob: A scalable object store for text and binary data
o Azure Files: Managed file shares for cloud or on-premises deployments
o Azure Queue: A messaging store for reliable messaging between application components
o Azure Table: A NoSQL store for no-schema storage of structured data
You can use Azure Storage as the storage basis when you're provisioning a data platform technology such as Azure Data
Lake Storage and HDInsight. But you can also provision Azure Storage for standalone use. For example, you provision an
Azure Blob store either as standard storage in the form of magnetic disk storage or as premium storage in the form of solid-
state drives (SSDs).
Databases
26/08/2019 29
o Azure SQL Database: managed relational SQL database as a service
o Azure Cosmos DB: globally distributed, multi-model database for any scale
o Table storage: NoSQL key-value store using semi-structured datasets
o Azure Database for MySQL: managed MySQL database service for app developers
o Azure Database Migration Service: simplify on-premises database migration to the cloud
Analytics
26/08/2019 30
o Azure Databricks: fast, easy, and collaborative Apache Spark-based analytics platform
o Azure Stream Analytics: real-time data stream processing from millions of IoT devices
o Data Factory: hybrid data integration at enterprise scale, made easy
o Azure Analysis Services: enterprise-grade analytics engine as a service
o SQL Data Warehouse: elastic data warehouse as a service with enterprise-class features
Storage
26/08/2019 31
o Blob storage: REST-based object storage for unstructured data
o Azure Backup: simplify data protection and protect against ransomware
o Azure Data Lake Storage: massively scalable, secure data lake functionality built on Azure Blob storage
Blob storage
26/08/2019 32
If you need to provision a data store that will store but not query data, your cheapest option is to set up a storage account
as a Blob store. Blob storage works well with images and unstructured data, and it's the cheapest way to store data in
Azure.
Azure Storage accounts are scalable and secure, durable, and highly available. Azure handles your hardware maintenance,
updates, and critical issues. It also provides REST APIs and SDKs for Azure Storage in various languages. Supported
languages include .NET, Java, Node.js, Python, PHP, Ruby, and Go. Azure Storage also supports scripting in Azure
PowerShell and the Azure CLI.
Data ingestion:
To ingest data into your system, use Azure Data Factory, Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. If
you use the File Upload feature to import file sizes above 2 GB, use PowerShell or Visual Studio. AzCopy supports a
maximum file size of 1 TB and automatically splits data files that exceed 200 GB.
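Beyond these tools, you can also ingest data programmatically through the SDKs mentioned earlier. A minimal sketch using a recent version of the Python azure-storage-blob package; the connection string, container, and blob names are hypothetical:

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, container, and blob names.
conn_str = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="imports", blob="sales/2019/august.csv")
with open("august.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)  # uploads the local file as a block blob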
Queries:
If you create a storage account as a Blob store, you can't query the data directly. To query it, either move the data to a
store that supports queries or configure the Azure Storage account as a Data Lake Storage account.
Data security:
Azure Storage encrypts all data that's written to it. Azure Storage also provides you with fine-grained control over who has
access to your data. You'll secure the data by using keys or shared access signatures.
Azure Resource Manager provides a permissions model that uses role-based access control (RBAC). Use this functionality
to set permissions and assign roles to users, groups, or applications.
Azure Data Lake Storage
26/08/2019 33
Azure Data Lake Storage is a Hadoop-compatible data repository that can store any size or type of data. This storage
service is available as Generation 1 (Gen1) or Generation 2 (Gen2). Data Lake Storage Gen1 users don't have to upgrade to
Gen2, but they forgo some benefits.
Data Lake Storage Gen2 users take advantage of Azure Blob storage, a hierarchical file system, and performance tuning
that helps them process big-data analytics solutions. In Gen2, developers can access data through either the Blob API or
the Data Lake file API. Gen2 can also act as a storage layer for a wide range of compute platforms, including Azure
Databricks, Hadoop, and Azure HDInsight, but data doesn't need to be loaded into the platforms.
Data Lake Storage is designed to store massive amounts of data for big-data analytics. For example, Contoso Life Sciences is
a cancer research center that analyzes petabytes of genetic data, patient data, and records of related sample data. Data
Lake Storage Gen2 reduces computation times, making the research faster and less expensive.
The compute aspect that sits above this storage can vary. The aspect can include platforms like HDInsight, Hadoop,
Cloudera, Azure Databricks, and Hortonworks.
Here are the key features of Data Lake Storage:
o Unlimited scalability
o Hadoop compatibility
o Security support for both access control lists (ACLs) and POSIX compliance
o An optimized Azure Blob File System (ABFS) driver that's designed for big-data analytics
o Zone-redundant storage
o Geo-redundant storage
Azure Data Lake Storage
26/08/2019 34
Data ingestion:
To ingest data into your system, use Azure Data Factory, Apache Sqoop, Azure Storage Explorer, the AzCopy tool,
PowerShell, or Visual Studio. To use the File Upload feature to import file sizes above 2 GB, use PowerShell or Visual Studio.
AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB.
Queries:
In Data Lake Storage Gen1, data engineers query data by using the U-SQL language. In Gen 2, use the Azure Blob Storage
API or the Azure Data Lake System (ADLS) API.
Data security:
Because Data Lake Storage supports Azure Active Directory ACLs, security administrators can control data access by using
the familiar Active Directory Security Groups. Role-based access control (RBAC) is available in Gen1. Built-in security groups
include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers.
Enable the firewall to limit traffic to only Azure services. Data Lake Storage automatically encrypts data at rest, protecting
data privacy.
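For Gen2, a minimal sketch of creating a directory and writing a file through the hierarchical file system, using a recent version of the Python azure-storage-file-datalake package; the connection string, file system, and path names are hypothetical:

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical connection string, file system, and path names.
conn_str = "<storage-account-connection-string>"
service = DataLakeServiceClient.from_connection_string(conn_str)
file_system = service.get_file_system_client(file_system="research")
directory = file_system.create_directory("genomics/2019")
file_client = directory.create_file("sample-001.csv")
data = b"id,gene,expression\n1,BRCA1,0.82\n"
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))  # commits the appended data to the file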
Azure Cosmos DB
26/08/2019 35
Azure Cosmos DB is a globally distributed, multimodel database. You can deploy it by using several API models:
o SQL API
o MongoDB API
o Cassandra API
o Gremlin API
o Table API
Because of the multimodel architecture of Azure Cosmos DB, you benefit from each model's inherent capabilities. For
example, you can use MongoDB for semistructured data, Cassandra for wide columns, or Gremlin for graph databases.
Using Gremlin, you could create graph entities and graph queries to traverse vertices and edges. This method can get you
subsecond response times for complex scenarios like natural language processing (NLP) and social networking associations.
When you move the database server from SQL, MongoDB, or Cassandra to Azure Cosmos DB, applications that are built on
SQL, MongoDB, or Cassandra will continue to operate.
Azure Cosmos DB
26/08/2019 36
When to use Azure Cosmos DB:
Deploy Azure Cosmos DB when you need a NoSQL database of the supported API model, at planet scale, and with low
latency performance. Currently, Azure Cosmos DB supports five-nines uptime (99.999 percent). It can support response
times below 10 ms when it's provisioned correctly.
Consider this example where Azure Cosmos DB helps resolve a business problem. Contoso is an e-commerce retailer based
in Manchester, UK. The company sells children's toys. After reviewing Power BI reports, Contoso's managers notice a
significant increase in sales in Australia. Managers review customer service cases in Dynamics 365 and see many Australian
customer complaints that their site's shopping cart is timing out.
Contoso's network operations manager confirms the problem. It's that the company's only data center is located in
London. The physical distance to Australia is causing delays. Contoso applies a solution that uses the Microsoft Australia
East datacenter to provide a local version of the data to users in Australia. Contoso migrates their on-premises SQL
Database to Azure Cosmos DB by using the SQL API. This solution improves performance for Australian users. The data can
be stored in the UK and replicated to Australia to improve throughput times.
Azure Cosmos DB
26/08/2019 37
Key features:
Azure Cosmos DB supports 99.999 percent uptime. You can invoke a regional failover programmatically or by using the Azure
portal. An Azure Cosmos DB database will automatically fail over if there's a regional disaster.
By using multimaster replication in Azure Cosmos DB, you can often achieve a response time of less than one second from
anywhere in the world. Azure Cosmos DB is guaranteed to achieve a response time of less than 10 ms for reads and writes.
To maintain the consistency of the data in Azure Cosmos DB, the Cosmos DB engineering team introduced a set of
consistency levels that address the unique challenges of planet-scale solutions. Consistency levels include strong, bounded
staleness, session, consistent prefix, and eventual.
Data ingestion:
To ingest data into Azure Cosmos DB, use Azure Data Factory, create an application that writes data into Azure Cosmos DB
through its API, upload JSON documents, or directly edit the document.
Queries:
As a data engineer, you can create stored procedures, triggers, and user-defined functions (UDFs). Or use the JavaScript
query API. You'll also find other methods to query the other APIs within Azure Cosmos DB. For example, in the Data
Explorer component, you can use the graph visualization pane.
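As a sketch of the SQL API from code, here is a minimal example that writes and queries a document using a recent version of the Python azure-cosmos package; the endpoint, key, database, container, and fields are hypothetical:

from azure.cosmos import CosmosClient

# Hypothetical endpoint, key, and database/container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("retail").get_container_client("products")

container.upsert_item({
    "id": "sku-12345",
    "category": "shoes",
    "name": "Trail runner",
    "bluetoothEnabled": True,
})

items = container.query_items(
    query="SELECT c.id, c.name FROM c WHERE c.category = @cat",
    parameters=[{"name": "@cat", "value": "shoes"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item)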
Data security:
Azure Cosmos DB supports data encryption, IP firewall configurations, and access from virtual networks. Data is encrypted
automatically. User authentication is based on tokens, and Azure Active Directory provides role-based security.
Azure Cosmos DB meets many security compliance certifications, including HIPAA, FedRAMP, SOCS, and HITRUST.
Azure SQL Database
26/08/2019 38
Azure SQL Database is a managed relational database service. It supports structures such as relational data and
unstructured formats such as spatial and XML data. SQL Database provides online transaction processing (OLTP) that can
scale on demand. You'll also find the comprehensive security and availability that you appreciate in Azure database
services.
When to use SQL Database:
Use SQL Database when you need to scale up and scale down OLTP systems on demand. SQL Database is a good solution
when your organization wants to take advantage of Azure security and availability features. Organizations that choose SQL
Database also avoid the risks of capital expenditures and of increasing operational spending on complex on-premises
systems.
SQL Database can be more flexible than an on-premises SQL Server solution because you can provision and configure it in
minutes. Even more, SQL Database is backed up by the Azure service-level agreement (SLA).
Key features:
SQL Database delivers predictable performance for multiple resource types, service tiers, and compute sizes. Requiring
almost no administration, it provides dynamic scalability with no downtime, built-in intelligent optimization, global
scalability and availability, and advanced security options. These capabilities let you focus on rapid app development and
on speeding up your time to market. You no longer have to devote precious time and resources to managing virtual
machines and infrastructure.
Azure SQL Database
26/08/2019 39
Ingesting and processing data:
SQL Database can ingest data through application integration from a wide range of developer SDKs. Allowed programming
languages include .NET, Python, Java, and Node.js. Beyond applications, you can also ingest data through Transact-SQL (T-
SQL) techniques and from the movement of data using Azure Data Factory.
Queries:
Use T-SQL to query the contents of a SQL Database. This method benefits from a wide range of standard SQL features to
filter, order, and project the data into the form you need.
Data security:
SQL Database provides a range of built-in security and compliance features. These features help your application meet
security and compliance requirements like these:
o Advanced Threat Protection
o SQL Database auditing
o Data encryption
o Azure Active Directory authentication
o Multifactor authentication
o Compliance certification
SQL Data Warehouse
26/08/2019 40
Azure SQL Data Warehouse is a cloud-based enterprise data warehouse. It can process massive amounts of data and
answer complex business questions.
When to use SQL Data Warehouse:
Data loads can increase the processing time for on-premises data warehousing solutions. Organizations that face this issue
might look to a cloud-based alternative to reduce processing time and release business intelligence reports faster. But
many organizations first consider scaling up on-premises servers. As this approach reaches its physical limits, they look for
a solution on a petabyte scale that doesn't involve complex installations and configurations.
Key features:
SQL Data Warehouse uses massively parallel processing (MPP) to quickly run queries across petabytes of data. Because the
storage is separated from the compute nodes, you can scale the compute nodes independently to meet any demand at any
time.
In SQL Data Warehouse, Data Movement Service (DMS) coordinates and transports data between compute nodes as
necessary. But you can use a replicated table to reduce data movement and improve performance. SQL Data Warehouse
supports two types of distributed tables: hash and round-robin. Use these tables to tune performance.
Importantly, SQL Data Warehouse can also pause and resume the compute layer. This means you pay only for the
computation you use. This capability is useful in data warehousing.
SQL Data Warehouse
26/08/2019 41
Ingesting and processing data:
SQL Data Warehouse uses the extract, load, and transform (ELT) approach for bulk data. SQL professionals are already
familiar with bulk-copy tools such as bcp and the SQLBulkCopy API. Data engineers who work with SQL Data Warehouse
will soon learn how quickly PolyBase can load data.
PolyBase is a technology that removes complexity for data engineers. They take advantage of techniques for big-data
ingestion and processing by offloading complex calculations to the cloud. Developers use PolyBase to apply stored
procedures, labels, views, and SQL to their applications. You can also use Azure Data Factory to ingest and process data.
Queries:
As a data engineer, you can use the familiar Transact-SQL to query the contents of SQL Data Warehouse. This method takes
advantage of a wide range of features, including the WHERE, ORDER BY, and GROUP BY clauses. Load data fast by using
PolyBase with additional Transact-SQL constructs such as CREATE TABLE AS SELECT (CTAS).
Data security:
SQL Data Warehouse supports both SQL Server authentication and Azure Active Directory. For high-security environments,
set up multifactor authentication. From a data perspective, SQL Data Warehouse supports security at the level of both
columns and rows.
Azure HDInsight
26/08/2019 42
Azure HDInsight provides technologies to help you ingest, process, and analyze big data. It supports batch processing, data
warehousing, IoT, and data science.
Key features:
HDInsight is a low-cost cloud solution. It includes Apache Hadoop, Spark, Kafka, HBase, Storm, and Interactive Query.
o Hadoop includes Apache Hive, HBase, Spark, and Kafka. Hadoop stores data in a file system (HDFS). Spark stores data
in memory. This difference in storage makes Spark about 100 times faster.
o HBase is a NoSQL database built on Hadoop. It's commonly used for search engines. HBase offers automatic failover.
o Storm is a distributed real-time streaming analytics solution.
o Kafka is an open-source platform that's used to compose data pipelines. It offers message queue functionality, which
allows users to publish or subscribe to real-time data streams.
Azure HDInsight
26/08/2019 43
Ingesting data:
As a data engineer, use Hive to run ETL operations on the data you're ingesting. Or orchestrate Hive queries in Azure Data
Factory.
Data processing:
In Hadoop, use Java and Python to process big data. Mapper consumes and analyzes input data. It then emits tuples that
Reducer can analyze. Reducer runs summary operations to create a smaller combined result set.
Spark processes streams by using Spark Streaming. For machine learning, use the 200 preloaded Anaconda libraries with
Python. Use GraphX for graph computations.
Developers can remotely submit and monitor jobs from Spark. Storm supports common programming languages like Java,
C#, and Python.
Queries:
Hadoop supports the Pig and HiveQL query languages. In Spark, data engineers use Spark SQL.
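As an illustration of the Spark SQL approach, here is a minimal PySpark sketch that could run on an HDInsight Spark cluster; the storage path, view, and column names are hypothetical:

from pyspark.sql import SparkSession

# Hypothetical storage path and column names.
spark = SparkSession.builder.appName("sales-summary").getOrCreate()
sales = spark.read.csv("wasbs://data@<account>.blob.core.windows.net/sales/*.csv",
                       header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")
spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show()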
Data security:
Hadoop supports encryption, Secure Shell (SSH), shared access signatures, and Azure Active Directory security.
26/08/2019 44
Azure Storage
Azure Storage
26/08/2019 45
Microsoft Azure Storage is a managed service that provides durable, secure, and scalable storage in the cloud. Let's break
these terms down.
o Durable: Redundancy ensures that your data is safe in the event of transient hardware failures. You can also replicate
data across datacenters or geographical regions for additional protection from local catastrophe or natural disaster.
Data replicated in this way remains highly available in the event of an unexpected outage.
o Secure: All data written to Azure Storage is encrypted by the service. Azure Storage provides you with fine-grained
control over who has access to your data.
o Scalable: Azure Storage is designed to be massively scalable to meet the data storage and performance needs of
today's applications.
o Managed: Microsoft Azure handles maintenance and any critical problems for you.
A single Azure subscription can host up to 200 storage accounts, each of which can hold 500 TB of data. If you have a
business case, you can talk to the Azure Storage team and get approval for up to 250 storage accounts in a subscription,
which pushes your max storage up to 125 Petabytes!
What is Azure Storage?
26/08/2019 46
Azure provides many ways to store your data. There are multiple database options like Azure SQL Database, Azure Cosmos DB,
and Azure Table Storage. Azure offers multiple ways to store and send messages, such as Azure Queues and Event Hubs.
You can even store loose files using services like Azure Files and Azure Blobs.
Azure selected four of these data services and placed them together under the name Azure Storage. The four services are
Azure Blobs, Azure Files, Azure Queues, and Azure Tables. The following illustration shows the elements of Azure Storage.
https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need
These four were given special treatment because they are all primitive, cloud-based storage services and are often used
together in the same application.
What is Azure Storage?
26/08/2019 47
Azure storage includes four types of data:
o Blobs: A massively scalable object store for text and binary data.
o Files: Managed file shares for cloud or on-premises deployments.
o Queues: A messaging store for reliable messaging between application components.
o Tables: A NoSQL store for schemaless storage of structured data. This service has been replaced by Azure Cosmos DB
and will not be discussed here.
All of these data types in Azure Storage are accessible from anywhere in the world over HTTP or HTTPS. Microsoft provides
SDKs for Azure Storage in a variety of languages, as well as a REST API. You can also visually explore your data right in the
Azure portal.
https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need
What is a storage account?
26/08/2019 48
A storage account is a container that groups a set of Azure Storage services together. Only data services from Azure Storage
can be included in a storage account (Azure Blobs, Azure Files, Azure Queues, and Azure Tables). The following illustration
shows a storage account containing several data services.
https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need
Combining data services into a storage account lets you manage them as a group. The settings you specify when you create
the account, or any that you change after creation, are applied to everything in the account. Deleting the storage account
deletes all of the data stored inside it.
A storage account is an Azure resource and is included in a resource group. The following illustration shows an Azure
subscription containing multiple resource groups, where each group contains one or more storage accounts.
Choose your account settings
26/08/2019 49
The storage account settings we've already covered apply to the data services in the account. Here, we will discuss the
three settings that apply to the account itself, rather than to the data stored in the account:
o Name: Each storage account has a name. The name must be globally unique within Azure, use only lowercase letters and digits and be
between 3 and 24 characters.
o Deployment model: A deployment model is the system Azure uses to organize your resources. The model defines the API that you use
to create, configure, and manage those resources. Azure provides two deployment models:
o Resource Manager: the current model that uses the Azure Resource Manager API (recommended)
o Classic: a legacy offering that uses the Azure Service Management API
o Account kind: Storage account kind is a set of policies that determine which data services you can include in the account and the pricing
of those services. There are three kinds of storage accounts:
o StorageV2 (general purpose v2): the current offering that supports all storage types and all of the latest features (recommended)
o Storage (general purpose v1): a legacy kind that supports all storage types but may not support all features
o Blob storage: a legacy kind that allows only block blobs and append blobs
These settings impact how you manage your account and the cost of the services within it.
https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/3-choose-your-account-settings
Azure storage accounts
26/08/2019 50
To access any of these services from an application, you have to create a storage account. The storage account provides a
unique namespace in Azure to store and access your data objects. A storage account contains any blobs, files, queues,
tables, and VM disks that you create under that account.
You can create an Azure storage account using the Azure portal, Azure PowerShell, or Azure CLI. Azure Storage provides
three distinct account options with different pricing and features supported.
o General-purpose v2 (GPv2): General-purpose v2 (GPv2) accounts are storage accounts that support all of the latest
features for blobs, files, queues, and tables. Pricing for GPv2 accounts has been designed to deliver the lowest per
gigabyte prices.
o General-purpose v1 (GPv1): General-purpose v1 (GPv1) accounts provide access to all Azure Storage services but may
not have the latest features or the lowest per gigabyte pricing. For example, cool storage and archive storage are not
supported in GPv1. Pricing is lower for GPv1 transactions, so workloads with high churn or high read rates may benefit
from this account type.
o Blob storage accounts: A legacy account type, blob storage accounts support all the same block blob features as GPv2,
but they are limited to supporting only block and append blobs. Pricing is broadly similar to pricing for general-purpose
v2 accounts.
Blob storage
26/08/2019 51
Azure Blob storage is an object storage solution optimized for storing massive amounts of unstructured data, such as text
or binary data. Blob storage is ideal for:
o Serving images or documents directly to a browser, including full static websites.
o Storing files for distributed access.
o Streaming video and audio.
o Storing data for backup and restoration, disaster recovery, and archiving.
o Storing data for analysis by an on-premises or Azure-hosted service.
Azure Storage supports three kinds of blobs:
o Block blobs: Block blobs are used to hold text or binary files up to ~5 TB (50,000 blocks of 100 MB) in size. The primary
use case for block blobs is the storage of files that are read from beginning to end, such as media files or image files for
websites. They are named block blobs because files larger than 100 MB must be uploaded as small blocks, which are
then consolidated (or committed) into the final blob.
o Page blobs: Page blobs are used to hold random-access files up to 8 TB in size. Page blobs are used primarily as the
backing storage for the VHDs used to provide durable disks for Azure Virtual Machines (Azure VMs). They are named
page blobs because they provide random read/write access to 512-byte pages.
o Append blobs: Append blobs are made up of blocks like block blobs, but they are optimized for append operations.
These are frequently used for logging information from one or more sources into the same blob. For example, you
might write all of your trace logging to the same append blob for an application running on multiple VMs. A single
append blob can be up to 195 GB.
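As a sketch of the logging pattern described for append blobs, using a recent version of the Python azure-storage-blob package; the connection string, container, and blob names are hypothetical:

from azure.storage.blob import BlobClient

# Hypothetical connection string, container, and blob names.
conn_str = "<storage-account-connection-string>"
blob = BlobClient.from_connection_string(conn_str, container_name="logs",
                                         blob_name="app/trace-2019-08-26.log")
blob.create_append_blob()  # creates (or resets) an empty append blob
blob.append_block(b"2019-08-26T10:15:00Z INFO request handled in 42 ms\n")
blob.append_block(b"2019-08-26T10:15:02Z WARN retrying upstream call\n")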
Files
26/08/2019 52
Azure Files enables you to set up highly available network file shares that can be accessed by using the standard Server
Message Block (SMB) protocol. This means that multiple VMs can share the same files with both read and write access.
You can also read the files using the REST interface or the storage client libraries. You can also associate a unique URL to
any file to allow fine-grained access to a private file for a set period of time. File shares can be used for many common
scenarios:
o Storing shared configuration files for VMs, tools, or utilities so that everyone is using the same version.
o Log files such as diagnostics, metrics, and crash dumps.
o Shared data between on-premises applications and Azure VMs to allow migration of apps to the cloud over a period
of time.
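Besides SMB mounts, files can also be reached from code through the storage client libraries. A minimal sketch using a recent version of the Python azure-storage-file-share package; the connection string, share name, and file path are hypothetical:

from azure.storage.fileshare import ShareFileClient

# Hypothetical connection string, share name, and file path.
conn_str = "<storage-account-connection-string>"
file = ShareFileClient.from_connection_string(conn_str, share_name="config",
                                              file_path="tools/settings.json")
with open("settings.json", "rb") as source:
    file.upload_file(source)  # uploads the local file to the share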
Queues
26/08/2019 53
The Azure Queue service is used to store and retrieve messages. Queue messages can be up to 64 KB in size, and a queue
can contain millions of messages. Queues are generally used to store lists of messages to be processed asynchronously.
You can use queues to loosely connect different parts of your application together. For example, we could perform image
processing on the photos uploaded by our users. Perhaps we want to provide some sort of face detection or tagging
capability, so people can search through all the images they have stored in our service. We could use queues to pass
messages to our image processing service to let it know that new images have been uploaded and are ready for
processing. This sort of architecture would allow you to develop and update each part of the service independently.
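A minimal sketch of that pattern using a recent version of the Python azure-storage-queue package; the connection string, queue name, and message contents are hypothetical:

from azure.storage.queue import QueueClient

# Hypothetical connection string and queue name.
conn_str = "<storage-account-connection-string>"
queue = QueueClient.from_connection_string(conn_str, queue_name="thumbnail-requests")

# Web front end: announce that a new image is ready for processing.
queue.send_message("images/user42/photo-001.jpg")

# Image processing service: pull and process pending messages.
for message in queue.receive_messages():
    print("processing", message.content)
    queue.delete_message(message)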
REST API endpoint
26/08/2019 54
In addition to access keys for authentication to storage accounts, your app will need to know the storage service endpoints
to issue the REST requests.
The REST endpoint is a combination of your storage account name, the data type, and a known domain. For example:
o Blobs: https://[name].blob.core.windows.net/
o Queues: https://[name].queue.core.windows.net/
o Tables: https://[name].table.core.windows.net/
o Files: https://[name].file.core.windows.net/
The simplest way to handle access keys and endpoint URLs within applications is to use storage account connection strings.
A connection string provides all needed connectivity information in a single text string.
Azure Storage connection strings look similar to the example below but with the access key and account name of your
specific storage account:
DefaultEndpointsProtocol=https;AccountName={your-storage};
AccountKey={your-access-key};
EndpointSuffix=core.windows.net
26/08/2019 55
Secure your Azure Storage account
Azure Storage security features (1/3)
26/08/2019 56
Encryption at rest:
All data written to Azure Storage is automatically encrypted by Storage Service Encryption (SSE) with a 256-bit Advanced
Encryption Standard (AES) cipher. SSE automatically encrypts data when writing it to Azure Storage. When you read data
from Azure Storage, Azure Storage decrypts the data before returning it. This process incurs no additional charges and
doesn't degrade performance. It can't be disabled.
For virtual machines (VMs), Azure lets you encrypt virtual hard disks (VHDs) by using Azure Disk Encryption. This
encryption uses BitLocker for Windows images, and it uses dm-crypt for Linux.
Azure Key Vault stores the keys automatically to help you control and manage the disk-encryption keys and secrets. So
even if someone gets access to the VHD image and downloads it, they can't access the data on the VHD.
Encryption in transit:
Keep your data secure by enabling transport-level security between Azure and the client. Always use HTTPS to secure
communication over the public internet. When you call the REST APIs to access objects in storage accounts, you can
enforce the use of HTTPS by requiring secure transfer for the storage account. After you enable secure transfer,
connections that use HTTP will be refused. This flag will also enforce secure transfer over SMB by requiring SMB 3.0 for all
file share mounts.
Azure Storage security features (2/3)
26/08/2019 57
CORS support:
You can store several website asset types in Azure Storage. These types include images and videos. To secure browser
apps, Contoso locks GET requests down to specific domains.
Azure Storage supports cross-domain access through cross-origin resource sharing (CORS). CORS uses HTTP headers so that
a web application at one domain can access resources from a server at a different domain. By using CORS, web apps
ensure that they load only authorized content from authorized sources.
CORS support is an optional flag you can enable on Storage accounts. The flag adds the appropriate headers when you use
HTTP GET requests to retrieve resources from the Storage account.
Role-based access control:
To access data in a storage account, the client makes a request over HTTP or HTTPS. Every request to a secure resource
must be authorized. The service ensures that the client has the permissions required to access the data. You can choose
from several access options. Arguably, the most flexible option is role-based access.
Azure Storage supports Azure Active Directory and role-based access control (RBAC) for both resource management and
data operations. To security principals, you can assign RBAC roles that are scoped to the storage account. Use Active
Directory to authorize resource management operations, such as configuration. Active Directory is supported for data
operations on Blob and Queue storage.
To a security principal or a managed identity for Azure resources, you can assign RBAC roles that are scoped to a
subscription, a resource group, a storage account, or an individual container or queue.
Azure Storage security features (3/3)
26/08/2019 58
Auditing access:
Auditing is another part of controlling access. You can audit Azure Storage access by using the built-in Storage Analytics
service.
Storage Analytics logs every operation in real time, and you can search the Storage Analytics logs for specific requests.
Filter based on the authentication mechanism, the success of the operation, or the resource that was accessed.
Storage account keys
26/08/2019 59
Azure Storage accounts can create authorized apps in Active Directory to control access to the data in blobs and queues.
This authentication approach is the best solution for apps that use Blob storage or Queue storage.
For other storage models, clients can use a shared key, or shared secret. This authentication option is one of the easiest to
use, and it supports blobs, files, queues, and tables. The client embeds the shared key in the HTTP Authorization header of
every request, and the Storage account validates the key.
For example, an application can issue a GET request against a blob resource:
HTTP headers control the version of the REST API, the date, and the encoded shared key:
GET http://myaccount.blob.core.windows.net/?restype=service&comp=stats
x-ms-version: 2018-03-28
Date: Wed, 23 Oct 2018 21:00:44 GMT
Authorization: SharedKey
myaccount:CY1OP3O3jGFpYFbTCBimLn0Xov0vt0khH/E5Gy0fXvg=
Shared access signatures (1/2)
26/08/2019 60
As a best practice, you shouldn't share storage account keys with external third-party applications. If these apps need
access to your data, you'll need to secure their connections without using storage account keys.
For untrusted clients, use a shared access signature (SAS). A shared access signature is a string that contains a security
token that can be attached to a URI. Use a shared access signature to delegate access to storage objects and specify
constraints, such as the permissions and the time range of access.
You can give a customer a shared access signature token, for example, so they can upload pictures to a file system in Blob
storage. Separately, you can give a web application permission to read those pictures. In both cases, you allow only the
access that the application needs to do the task.
Types of shared access signatures:
You can use a service-level shared access signature to allow access to specific resources in a storage account. You'd use this
type of shared access signature, for example, to allow an app to retrieve a list of files in a file system or to download a file.
Use an account-level shared access signature to allow access to anything that a service-level shared access signature can
allow, plus additional resources and abilities. For example, you can use an account-level shared access signature to allow
the ability to create file systems.
Shared access signatures (2/2)
26/08/2019 61
You'd typically use a shared access signature for a service where users read and write their data to your storage account.
Accounts that store user data have two typical designs:
Clients upload and download data through a front-end proxy service, which performs authentication. This front-end proxy
service has the advantage of allowing validation of business rules. But if the service must handle large amounts of data or
high-volume transactions, you might find it complicated or expensive to scale this service to match demand.
A lightweight service authenticates the client as needed. Then it generates a shared access signature. After receiving the
shared access signature, the client can access storage account resources directly. The shared access signature defines the
client's permissions and access interval. The shared access signature reduces the need to route all data through the front-
end proxy service.
https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/4-shared-access-signatures
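A sketch of how such a lightweight service might issue a short-lived, read-only, service-level SAS for a single blob, using a recent version of the Python azure-storage-blob package; the account, key, container, and blob names are hypothetical:

from datetime import datetime, timedelta
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

# Hypothetical account details; the account key never leaves the server side.
sas_token = generate_blob_sas(
    account_name="mystorage",
    container_name="pictures",
    blob_name="user42/photo-001.jpg",
    account_key="<account-key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
url = "https://mystorage.blob.core.windows.net/pictures/user42/photo-001.jpg?" + sas_token

The client then uses the returned URL directly against Blob storage until the token expires.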
Control network access to your storage account
26/08/2019 62
By default, storage accounts accept connections from clients on any network. To limit access to selected networks, you
must first change the default action. You can restrict access to specific IP addresses, ranges, or virtual networks.
https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/5-control-network-access
Advanced Threat Protection for Azure Storage
26/08/2019 63
Advanced Threat Protection, now in public preview, detects anomalies in account activity. It then notifies you of potentially
harmful attempts to access your account. You don't have to be a security expert or manage security monitoring systems to
take advantage of this layer of threat protection.
Currently, Advanced Threat Protection for Azure Storage is available for the Blob service. Security alerts are integrated with
Azure Security Center. The alerts are sent by email to subscription admins.
https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/6-advanced-threat-protection
Azure Data Lake Storage security features
26/08/2019 64
Azure Data Lake Storage Gen2 provides a first-class data lake solution that allows enterprises to pull together their data.
It's built on Azure Blob storage, so it inherits all of the security features we've reviewed in this module.
Along with role-based access control (RBAC), Azure Data Lake Storage Gen2 provides access control lists (ACLs) that are
POSIX-compliant and that restrict access to only authorized users, groups, or service principals. It applies restrictions in a
way that's flexible, fine-grained, and manageable. Azure Data Lake Storage Gen2 authenticates through Azure Active
Directory OAuth 2.0 bearer tokens. This allows for flexible authentication schemes, including federation with Azure AD
Connect and multifactor authentication that provides stronger protection than just passwords.
More significantly, these authentication schemes are integrated into the main analytics services that use the data. These
services include Azure Databricks, HDInsight, and SQL Data Warehouse. Management tools such as Azure Storage Explorer
are also included. After authentication finishes, permissions are applied at the finest granularity to ensure the right level of
authorization for an enterprise's big-data assets.
The Azure Storage end-to-end encryption of data and transport layer protections complete the security shield for an
enterprise data lake. The same set of analytics engines and tools can take advantage of these additional layers of
protection, resulting in complete protection of your analytics pipelines.
26/08/2019 65
Store application data with Azure Blob
storage
What are blobs? (1/2)
26/08/2019 66
Blobs are "files for the cloud". Apps work with blobs in much the same way as they would work with files on a disk, like
reading and writing data. However, unlike a local file, blobs can be reached from anywhere with an internet connection.
Azure Blob storage is unstructured, meaning that there are no restrictions on the kinds of data it can hold. For example, a
blob can hold a PDF document, a JPG image, a JSON file, video content, etc. Blobs aren't limited to common file formats —
a blob could contain gigabytes of binary data streamed from a scientific instrument, an encrypted message for another
application, or data in a custom format for an app you're developing.
Blobs are usually not appropriate for structured data that needs to be queried frequently. They have higher latency than
memory and local disk and don't have the indexing features that make databases efficient at running queries. However,
blobs are frequently used in combination with databases to store non-queryable data. For example, an app with a
database of user profiles could store profile pictures in blobs. Each user record in the database would include the name or
URL of the blob containing the user's picture.
What are blobs? (2/2)
26/08/2019 67
Blobs are used for data storage in many ways across all kinds of applications and architectures:
o Apps that need to transmit large amounts of data using a messaging system that supports only small messages. These
apps can store the data in blobs and send the blob URLs in messages.
o Blob storage can be used like a file system for storing and sharing documents and other personal data.
o Static web assets like images can be stored in blobs and made available for public download as if they were files on a
web server.
o Many Azure components use blobs behind the scenes. For example, Azure Cloud Shell stores your files and
configuration in blobs, and Azure Virtual Machines uses blobs for hard-disk storage.
Some apps will constantly create, update, and delete blobs as part of their work. Others will use a small set of blobs and
rarely change them.
Storage accounts, containers, and metadata:
In Blob storage, every blob lives inside a blob container. You can store an unlimited number of blobs in a container and an
unlimited number of containers in a storage account. Containers are "flat" — they can only store blobs, not other
containers.
Blobs and containers support metadata in the form of name-value string pairs. Your apps can use metadata for anything
you like: a human-readable description of a blob's contents to be displayed by the application, a string that your app uses
to determine how to process the blob's data, etc.
Design a storage organization strategy (1/2)
26/08/2019 68
Blob name prefixes (virtual directories):
Technically, containers are "flat" and do not support any kind of nesting or hierarchy. But if you give your blobs hierarchical
names that look like file paths (such as finance/budgets/2017/q1.xls), the API's listing operation can filter results to specific
prefixes. This allows you to navigate the list as if it was a hierarchical system of files and folders.
This feature is often called virtual directories because some tools and client libraries use it to visualize and navigate Blob
storage as if it was a file system. Each folder navigation triggers a separate call to list the blobs in that folder.
Using names that are like file names for blobs is a common technique for organizing and navigating complex blob data.
Design a storage organization strategy (2/2)
26/08/2019 69
Public access and containers as security boundaries:
By default, all blobs require authentication to access. However, individual containers can be configured to allow public
downloading of their blobs without authentication. This feature supports many use cases, such as hosting static website
assets and sharing files. This is because downloading blob contents works the same way as reading any other kind of data
over the web: you just point a browser or anything that can make a GET request at the blob URL.
Enabling public access is important for scalability because data downloaded directly from Blob storage doesn't generate
any traffic in your server-side app. Even if you don't immediately take advantage of public access or if you will use a
database to control data access via your application, plan on using separate containers for data you want to be publicly
available.
Caution: Blobs in a container configured for public access can be downloaded without any kind of authentication or
auditing by anyone who knows their storage URLs. Never put blob data in a public container that you don't intend to share
publicly.
In addition to public access, Azure has a shared access signature feature that allows fine-grained permissions control on
containers. Precision access control enables scenarios that further improve scalability, so thinking about containers as
security boundaries in general is a helpful guideline.
Blob types
26/08/2019 70
There are three different kinds of blobs you can store data in:
o Block blobs are composed of blocks of different sizes that can be uploaded independently and in parallel. Writing to a
block blob involves uploading data to blocks and committing them to the blob.
o Append blobs are specialized block blobs that support only appending new data (not updating or deleting existing
data), but they're very efficient at it. Append blobs are great for scenarios like storing logs or writing streamed data.
o Page blobs are designed for scenarios that involve random-access reads and writes. Page blobs are used to store the
virtual hard disk (VHD) files used by Azure Virtual Machines, but they're great for any scenario that involves random
access.
26/08/2019 72
Large Scale Data Processing with Azure Data
Lake Storage Gen2
Many organizations have spent the last two decades building data warehouses and business intelligence (BI) solutions based on
relational database systems. Many BI solutions have missed opportunities to use unstructured data because of the cost and
complexity of storing and processing these kinds of data in relational databases.
Suppose that you're a data engineering consultant doing work for Contoso. They're interested in using Azure to analyze all of their
business data. In this role, you'll provide guidance on how Azure can enhance their existing business intelligence systems. You'll also
offer advice about using Azure's storage capabilities to store large amounts of unstructured data to add value to their BI solution.
Because of their needs, you plan to recommend Azure Data Lake Storage. Data Lake Storage provides a repository where you can
upload and store huge amounts of unstructured data with an eye toward high-performance big data analytics.
Azure Data Lake Storage Gen2 (1/2)
26/08/2019 73
A data lake is a repository of data that is stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a
comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure.
Azure Data Lake Storage combines a file system with a storage platform to help you quickly identify insights into your data.
Data Lake Storage Gen2 builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads. This
integration enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the
high-availability, security, and durability capabilities of Azure Storage.
The variety and volume of data that is generated and analyzed today is increasing. Companies have multiple sources of
data, from websites to Point of Sale (POS) systems, and more recently from social media sites to Internet of Things (IoT)
devices. Each source provides an essential aspect of data that needs to be collected, analyzed, and potentially acted upon.
Data Lake Storage Gen2 is designed to deal with this variety and volume of data at exabyte scale while securely handling
hundreds of gigabytes of throughput. With this, you can use Data Lake Storage Gen2 as the basis for both real-time and
batch solutions. Here is a list of additional benefits that Data Lake Storage Gen2 brings:
Hadoop compatible access:
A benefit of Data Lake Storage Gen2 is that you can treat the data as if it's stored in a Hadoop Distributed File System. With
this feature, you can store the data in one place and access it through compute technologies including Azure Databricks,
Azure HDInsight, and Azure SQL Data Warehouse without moving the data between environments.
Azure Data Lake Storage Gen2 (2/2)
26/08/2019 74
Security:
Data Lake Storage Gen2 supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions.
You can set permissions at a directory level or file level for the data stored within the data lake. This security is configurable
through technologies such as Hive and Spark, or utilities such as Azure Storage Explorer. All data that is stored is encrypted
at rest by using either Microsoft or customer-managed keys.
Performance:
Azure Data Lake Storage organizes the stored data into a hierarchy of directories and subdirectories, much like a file
system, for easier navigation. As a result, data processing requires fewer computational resources, reducing both the time
and cost.
Data redundancy:
Data Lake Storage Gen2 takes advantage of the Azure Blob replication models that provide data redundancy in a single
data center with locally redundant storage (LRS), or to a secondary region by using the Geo-redundant storage (GRS)
option. This feature ensures that your data is always available and protected if catastrophe strikes.
Compare Azure Data Lake Store to Azure Blob storage
26/08/2019 75
In Azure Blob storage, you can store large amounts of unstructured ("object") data in a single hierarchy, also known as a
flat namespace. You can access this data by using HTTP or HTTPS. Azure Data Lake Storage Gen2 builds on Blob storage and
optimizes I/O of high-volume data by using a hierarchical namespace (enabled through the Hierarchical Namespace option).
Hierarchical namespaces organize blob data into directories and store metadata about each directory and the files within
it. This structure allows operations, such as directory renames and deletes, to be performed in a single atomic operation.
Flat namespaces, by contrast, require several operations proportional to the number of objects in the structure.
Hierarchical namespaces keep the data organized, which yields better storage and retrieval performance for an analytical
use case and lowers the cost of analysis.
Azure Blob storage vs. Azure Data Lake Storage:
If you want to store data without performing analysis on the data, set the Hierarchical Namespace option to Disabled to
set up the storage account as an Azure Blob storage account. You can also use blob storage to archive rarely used data or to
store website assets such as images and media.
If you are performing analytics on the data, set up the storage account as an Azure Data Lake Storage Gen2 account by
setting the Hierarchical Namespace option to Enabled. Because Azure Data Lake Storage Gen2 is integrated into the Azure
Storage platform, applications can use either the Blob APIs or the Azure Data Lake Storage Gen2 file system APIs to access
data.
26/08/2019 76
Examine uses for Azure Data Lake Storage
Gen2
Creating a modern data warehouse
26/08/2019 77
Imagine you're a Data Engineering consultant for Contoso. In the past, they've created an on-premises business
intelligence solution that used a Microsoft SQL Server Database Engine, Azure Integration Services, Azure Analysis Services,
and SQL Server Reporting Services to provide historical reports. They tried using the Analysis Services Data Mining
component to create a predictive analytics solution to predict the buying behavior of customers. While this approach
worked well with low volumes of data, it couldn't scale after more than a gigabyte of data was collected. Furthermore,
they were never able to deal with the JSON data that a third-party application generated when a customer used the
feedback module of the point of sale (POS) application.
Contoso has turned to you for help with creating an architecture that can scale with the data needs that are required to
create a predictive model and to handle the JSON data so that it's integrated into the BI solution. You suggest the following
architecture.
The architecture uses Azure Data Lake Storage at the center of the solution for a modern data warehouse. Integration
Services is replaced by Azure Data Factory to ingest data into the Data Lake from a business application. This is the source
for the predictive model that is built into Azure Databricks. PolyBase is used to transfer the historical data into a big data
relational format that is held in Azure SQL Data Warehouse, which also stores the results of the trained model from
Databricks. Azure Analysis Services provides the caching capability for SQL Data Warehouse to service many users and to
present the data through Power BI reports.
Creating a modern data warehouse
26/08/2019 78
Advanced analytics for big data
26/08/2019 79
In this second use case, Azure Data Lake Storage plays an important role in providing a large-scale data store. Your skills are
needed by AdventureWorks, which is a global seller of bicycles and cycling components through a chain of resellers and on
the internet. As their customers browse the product catalog on their websites and add items to their baskets, a
recommendation engine that is built into Azure Databricks recommends other products. They need to make sure that the
results of their recommendation engine can scale globally. The recommendations are based on the web log files that are
stored on the web servers and transferred to the Azure Databricks model hourly. The response time for the
recommendation should be less than 1 ms. You propose the following architecture.
In this solution, Azure Data Factory transfers terabytes of web logs from a web server to the Azure Data Lake on an hourly
basis. This data is provided as features to the predictive model in Azure Databricks, which is then trained and scored. The
results are distributed globally by using Azure Cosmos DB, which the real-time app (the AdventureWorks website) will use
to provide recommendations to customers as they add products to their online baskets.
To complete this architecture, PolyBase is used against the Data Lake to transfer descriptive data to the SQL Data
Warehouse for reporting purposes. Azure Analysis Services provides the caching capability for SQL Data Warehouse to
service many users and to display the data through Power BI reports.
Advanced analytics for big data
26/08/2019 80
Real-time analytical solutions
26/08/2019 81
For real-time analytical solutions, the ingestion phase of the big data architecture changes. In this architecture, note the
introduction of Apache Kafka for Azure HDInsight to ingest streaming data from an
Internet of Things (IoT) device, although this could be replaced with Azure IoT Hub and Azure Stream Analytics. The key
point is that the data is persisted in Data Lake Storage Gen2 to service other parts of the solution.
In this use case, you are a Data Engineer for Trey Research, an organization that is working with a transport company to
monitor the fleet of Heavy Goods Vehicles (HGV) that drive around Europe. Each HGV is equipped with sensor hardware
that will continuously report telemetry on the temperature, the speed, and the oil and brake fluid levels of the HGV.
When the engine is turned off, the sensor also outputs a file with summary information about a trip, including the mileage
and elevation of a trip. A trip is a period in which the HGV engine is turned on and off.
Both the real-time data and the batch data are processed in a machine learning model to predict a maintenance schedule for
each of the HGVs. This data is made available to the downstream application that third-party garage companies can use if
an HGV breaks down anywhere in Europe. In addition, historical reports about the HGV should be visually presented to
users. As a result, the following architecture is proposed.
In this architecture, there are two ingestion streams. Azure Data Factory ingests the summary files that are generated
when the HGV engine is turned off. Apache Kafka provides the real-time ingestion engine for the telemetry data. Both data
streams are stored in Azure Data Lake Store for use in the future, but they are also passed on to other technologies to
meet business needs. Both streaming and batch data are provided to the predictive model in Azure Databricks, and the
results are published to Azure Cosmos DB to be used by the third-party garages. PolyBase transfers data from the Data
Lake Store into SQL Data Warehouse where Azure Analysis Services creates the HGV reports by using Power BI.
Real-time analytical solutions
26/08/2019 82
26/08/2019 83
Work with relational data in
Azure
26/08/2019 84
Azure SQL database
Why choose Azure SQL Database?
26/08/2019 85
Convenience
Setting up SQL Server on a VM or on physical hardware requires you to know about hardware and software requirements. You'll need to
understand the latest security best practices and manage operating system and SQL Server patches on a routine basis. You also need to manage
backup and data retention issues yourself.
With Azure SQL Database, we manage the hardware, software updates, and OS patches for you. All you specify is the name of your database
and a few options. You'll have a running SQL database in minutes.
You can bring up and tear down Azure SQL Database instances at your convenience. Azure SQL Database comes up fast and is easy to configure.
You can focus less on configuring software and more on making your app great.
Cost
Because we manage things for you, there are no systems for you to buy, provide power for, or otherwise maintain.
Azure SQL Database has several pricing options. These pricing options enable you to balance performance versus cost. You can start for just a
few dollars a month.
Scale
You find that the amount of transportation logistics data you must store doubles every year. When running on-premises, how much excess
capacity should you plan for?
With Azure SQL Database, you can adjust the performance and size of your database on the fly when your needs change.
Security
Azure SQL Database comes with a firewall that's automatically configured to restrict connections from the Internet.
You can "whitelist" IP addresses you trust. Whitelisting lets you use Visual Studio, SQL Server Management Studio, or other tools to manage
your Azure SQL database.
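Besides the portal and client tools, server-level firewall rules can also be managed with T-SQL against the master database. Here is a minimal sketch, assuming a hypothetical trusted client IP address and a connection as the server admin:
-- Run in the master database of the logical server (the IP address is a placeholder).
EXECUTE sp_set_firewall_rule
    @name = N'AllowTrustedClient',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';
-- Review the current server-level firewall rules.
SELECT * FROM sys.firewall_rules;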
Azure SQL database
26/08/2019 86
When you create your first Azure SQL database, you also create an Azure SQL logical server. Think of a logical server as an
administrative container for your databases. You can control logins, firewall rules, and security policies through the logical
server. You can also override these policies on each database within the logical server.
A logical server enables you to add more databases later and to tune performance across all of your databases.
DTUs versus vCores: Azure SQL Database has two purchasing models: DTU and vCore.
o DTUs: DTU stands for Database Transaction Unit and is a combined measure of compute, storage, and IO resources.
Think of the DTU model as a simple, preconfigured purchase option. Because your logical server can hold more than
one database, there's also the idea of eDTUs, or elastic Database Transaction Units. This option enables you to choose
one price but allows each database in the pool to consume more or fewer resources depending on the current load.
o vCores: vCore gives you greater control over what compute and storage resources you create and pay for. While the
DTU model provides fixed combinations of compute, storage, and IO resources, the vCore model enables you to
configure resources independently. For example, with the vCore model you can increase storage capacity but keep the
existing amount of compute and IO throughput. Your transportation and logistics prototype only needs one Azure SQL
Database instance. You decide on the DTU option because it provides a good balance of compute, storage, and IO
performance and is less expensive to get started.
Azure SQL database
26/08/2019 87
Azure SQL database
26/08/2019 88
SQL elastic pools
When you create your Azure SQL database, you can create a SQL elastic pool.
SQL elastic pools relate to eDTUs. They enable you to buy a set of compute and storage resources that are shared among
all the databases in the pool. Each database can use the resources it needs, within the limits you set, depending on
current load.
What is collation?
Collation refers to the rules that sort and compare data. Collation helps you define sorting rules when case sensitivity,
accent marks, and other language characteristics are important.
o Latin1_General refers to the family of Western European languages.
o CP1 refers to code page 1252, a popular character encoding of the Latin alphabet.
o CI means that comparisons are case insensitive. For example, "HELLO" compares equally to "hello".
o AS means that comparisons are accent sensitive. For example, "résumé" doesn't compare equally to "resume".
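For illustration, collation can be chosen when a database is created or overridden on individual columns. A minimal sketch with hypothetical database, table, and column names:
-- Database-level collation, chosen at creation time.
CREATE DATABASE SalesDb COLLATE Latin1_General_CP1_CI_AS;
-- Column-level collation that overrides the database default (case sensitive, accent sensitive).
CREATE TABLE dbo.Customers (
    CustomerId int NOT NULL,
    CustomerName nvarchar(100) COLLATE Latin1_General_CP1_CS_AS
);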
Authentication
26/08/2019 89
Authentication is the process of verifying an identity. This identity could be a user, a service running on a system, or a
system itself (such as a virtual machine). Through the process of authentication, we ensure that the person or system is
who they claim to be. SQL Database supports two types of authentication: SQL authentication and Azure Active Directory
authentication.
SQL authentication
The SQL authentication method uses a username and password. User accounts can be created in the master database and can
be granted permissions in all databases on the server, or they can be created in the database itself (called contained users)
and given access to only that database. When you created the logical server for your database, you specified a "server
admin" login with a username and password. Using these credentials, you can authenticate to any database on that server
as the database owner, or "dbo".
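As an illustration, a contained database user that uses SQL authentication can be created directly in the user database; the user name and password below are hypothetical:
-- Run in the user database, not in master.
CREATE USER app_reader WITH PASSWORD = 'Str0ng!Passw0rd1';
-- Grant read access through a built-in database role.
ALTER ROLE db_datareader ADD MEMBER app_reader;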
Azure Active Directory authentication
This authentication method uses identities managed by Azure Active Directory (AD) and is supported for managed and
integrated domains. Use Azure AD authentication (integrated security) whenever possible. With Azure AD authentication,
you can centrally manage the identities of database users and other Microsoft services in one central location. Central ID
management provides a single place to manage database users and simplifies permission management. If you want to use
Azure AD authentication, you must create another server admin called the "Azure AD admin," which is allowed to
administer Azure AD users and groups. This admin can also perform all operations that a regular server admin can.
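Once the Azure AD admin is configured, Azure AD identities can be added as contained database users, for example (the identity below is hypothetical):
-- Run in the user database while connected as the Azure AD admin.
CREATE USER [data.engineer@contoso.com] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [data.engineer@contoso.com];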
Authorization
26/08/2019 90
Authorization refers to what an identity can do within an Azure SQL Database. This is controlled by permissions granted
directly to the user account and/or database role memberships. A database role is used to group permissions together to
ease administration, and a user is added to a role to be granted the permissions the role has. These permissions can grant
things such as the ability to log in to the database, the ability to read a table, and the ability to add and remove columns
from a database. As a best practice, you should grant users the least privileges necessary. The process of granting
authorization to both SQL and Azure AD users is the same.
In our example here, the server admin account you are connecting with is a member of the db_owner role, which has
authority to do anything within the database.
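A minimal sketch of role-based authorization, with hypothetical role and user names; the same statements work for both SQL and Azure AD users:
-- Group permissions in a custom database role and grant least privilege.
CREATE ROLE reporting_reader;
GRANT SELECT ON SCHEMA::dbo TO reporting_reader;
-- Add a user to the role to give it those permissions.
ALTER ROLE reporting_reader ADD MEMBER app_reader;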
Advanced Data Security for Azure SQL Database
26/08/2019 91
Advanced Data Security (ADS) provides a set of advanced SQL security capabilities, including data discovery &
classification, vulnerability assessment, and Advanced Threat Protection.
o Data discovery & classification (currently in preview) provides capabilities built into Azure SQL Database for
discovering, classifying, labeling & protecting the sensitive data in your databases. It can be used to provide visibility
into your database classification state, and to track the access to sensitive data within the database and beyond its
borders.
o Vulnerability assessment is an easy-to-configure service that can discover, track, and help you remediate potential
database vulnerabilities. It provides visibility into your security state, and includes actionable steps to resolve security
issues, and enhance your database fortifications.
o Advanced Threat Protection detects anomalous activities indicating unusual and potentially harmful attempts to
access or exploit your database. It continuously monitors your database for suspicious activities, and provides
immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns.
Advanced Threat Protection alerts provide details of the suspicious activity and recommend action on how to
investigate and mitigate the threat.
https://docs.microsoft.com/en-us/learn/modules/secure-your-azure-sql-database/5-monitor-your-database
Connect with Azure Cloud Shell
26/08/2019 92
You can use Visual Studio, SQL Server Management Studio, Azure Data Studio, or other tools to manage your Azure SQL
database.
The az commands you'll run require the name of your resource group and the name of your Azure SQL logical server. To
save typing, run this az configure command to specify them as default values. Replace <server-name> with the name of
your Azure SQL logical server:
Run az sql db list to list all databases on your Azure SQL logical server:
Run this az sql db show-connection-string command to get the connection string to the Logistics database in a format that
sqlcmd can use:
Run the sqlcmd statement from the output of the previous step to create an interactive session. Remove the surrounding
quotes and replace <username> and <password> with the username and password you specified when you created your
database.
> az configure --defaults group=Learn-9b1284ba-506d-42d1-94cb-82e3d1d149af sql-server=<server-name>
> az sql db list
"sqlcmd -S tcp:contoso-1.database.windows.net,1433 -d Logistics -U <username> -P <password> -N -l 30"
sqlcmd -S tcp:contoso-1.database.windows.net,1433 -d Logistics -U martina -P "password1234$" -N -l 30
Dynamic data masking
26/08/2019 93
You might have noticed when we ran our query in the previous unit that some of the information in the database is
sensitive; there are phone numbers, email addresses, and other information that we may not want to fully display to
everyone with access to the data.
Maybe we don't want our users to be able to see the full phone number or email address, but we'd still like to make
a portion of the data available for customer service representatives to identify a customer. By using the dynamic data
masking feature of Azure SQL Database, we can limit the data that is displayed to the user. Dynamic data masking is a
policy-based security feature that hides the sensitive data in the result set of a query over designated database
fields, while the data in the database is not changed.
Data masking rules consist of the column to apply the mask to, and how the data should be masked. You can create
your own masking format, or use one of the standard masks such as:
o Default value, which displays the default value for that data type instead.
o Credit card value, which shows only the last four digits of the number, converting all other digits to lowercase x's.
o Email, which hides the domain name and all but the first character of the email account name.
o Number, which replaces the value with a random number from a range you specify. For example, for the credit card expiry
month and year, you could select random months from 1 to 12 and set the year range from 2018 to 3000.
o Custom string. This allows you to set the number of characters exposed from the start of the data, the number of
characters exposed from the end of the data, and the characters to repeat for the remainder of the data.
When querying the columns, database administrators will still see the original values, but non-administrators will see
the masked values. You can allow other users to see the non-masked versions by adding them to the SQL users
excluded from masking list.
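For example, masks can be applied and exemptions granted with T-SQL; the table, column, and user names below are hypothetical:
-- Apply the built-in email mask to an existing column.
ALTER TABLE dbo.Customers
    ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');
-- Custom string mask: expose the first 3 characters and pad the rest.
ALTER TABLE dbo.Customers
    ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(3,"XXXXXXX",0)');
-- Allow a specific non-administrator to see unmasked values.
GRANT UNMASK TO support_supervisor;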
Azure SQL Database Documentation
26/08/2019 94
26/08/2019 95
Azure Database for PostgreSQL
Azure Database for PostgreSQL
26/08/2019 96
Azure Database for PostgreSQL is a relational database service in the Microsoft cloud. The server software is based on the
community version of the open-source PostgreSQL database engine. Your familiarity with tools and expertise with
PostgreSQL is applicable when using Azure Database for PostgreSQL.
Moreover, Azure Database for PostgreSQL delivers the following benefits:
o Built-in high availability compared to on-premises resources. There is no additional configuration, replication, or cost
required to make sure your applications are always available.
o Simple and flexible pricing. You have predictable performance based on a selected pricing tier choice that includes
software patching, automatic backups, monitoring, and security.
o Scale up or down as needed within seconds. You can scale compute or storage independently as needed, to make sure
you adapt your service to match usage.
o Adjustable automatic backups and point-in-time-restore for up to 35 days.
o Enterprise-grade security and compliance to protect sensitive data at-rest and in-motion that covers data encryption
on disk and SSL encryption between client and server communication.
Azure Database for PostgreSQL
26/08/2019 97
What is an Azure Database for PostgreSQL server?
The PostgreSQL server is a central administration point for one or more databases. The PostgreSQL service in Azure is a
managed resource that provides performance guarantees, and provides access and features at the server level.
An Azure Database for PostgreSQL server is the parent resource for a database. A resource is a manageable item that's
available through Azure. Creating this resource allows you to configure your server instance.
What is an Azure Database for PostgreSQL server resource?
An Azure Database for PostgreSQL server resource is a container with strong lifetime implications for your server and
databases. If the server resource is deleted, all databases are also deleted. Keep in mind that all resources belonging to the
parent are hosted in the same region.
The server resource name is used to define the server endpoint name. For example, if the resource name is mypgsqlserver,
then the server name becomes mypgsqlserver.postgres.database.azure.com.
The server resource also provides the connection scope for management policies that apply to its database. For example:
login, firewall, users, roles, and configuration.
Just like the open-source version of PostgreSQL, the server is available in several versions and allows for the installation of
extensions. You'll choose which server version to install.
26/08/2019 98
Scale multiple Azure SQL Databases with SQL
elastic pools
Azure Database for PostgreSQL
26/08/2019 99
SQL elastic pools are a resource allocation service used to scale and manage the performance and cost of a group of Azure
SQL databases. Elastic pools allow you to purchase resources for the group. You set the amount of resources available to
the pool, add databases to the pool, and set minimum and maximum resource limits for the databases within the pool.
The pool resource requirements are set based on the overall needs of the group. The pool allows the databases within the
pool to share the allocated resources. SQL elastic pools are used to manage the budget and performance of multiple SQL
databases.
When to use an elastic pool?
SQL elastic pools are ideal when you have several SQL databases with low average utilization but infrequent, high
utilization spikes. In this scenario, you can allocate enough capacity in the pool to manage the spikes for the group,
but the total resources can be less than the sum of all of the peak demand of all of the databases. Since the spikes are
infrequent, a spike from one database will be unlikely to impact the capacity of the other databases in the pool.
How many databases to add to a pool?
The general guidance is: if the combined resources you would need for individual databases to meet capacity spikes are
more than 1.5 times the capacity required for the elastic pool, then the pool will be cost-effective.
At a minimum, it is recommended to add at least two S3 databases or fifteen S0 databases to a single pool for it to have
potential cost savings.
Depending on the performance tier, you can add up to 100 or 500 databases to a single pool.
26/08/2019 100
Azure SQL Data Warehouse
26/08/2019 101
Data warehousing solutions
The types of data warehousing solutions
26/08/2019 102
There are three common types of data warehouse solutions:
o Enterprise data warehouse: An enterprise data warehouse (EDW) is a centralized data store that provides analytics
and decision support services across the entire enterprise. Data flows into the warehouse from multiple transactional
systems, relational databases, and other data sources on a periodic basis. The stored data is used for historical and
trend analysis reporting. The data warehouse acts as a central repository for many subject areas. It contains the "single
source of truth."
o Data mart: A data mart serves the same purpose as a data warehouse, but it's designed for the needs of a single team
or business unit, like sales or human resources. A data mart is smaller and more focused, and it tends to serve a single
subject area.
o Operational data store: An operational data store (ODS) is an interim store that's used to integrate real-time or near
real-time data from multiple sources for additional operations on the data. Because the data is refreshed in real time,
it's widely preferred for performing routine activities like storing the records of employees or aggregating data from
multiple sites for real-time reporting.
Designing a data warehouse solution
26/08/2019 103
Data warehouse solutions can be classified by their technical relational constructs and the methodologies that are used to
define them. There are two typical architectural approaches used to design data warehouses:
Bottom-Up Architectural Design, by Ralph Kimball:
Bottom-up architectural design is based on connected data marts. You start creating your data warehouse by building
individual data marts for departmental subject areas. Then, you connect the data marts via a data warehouse bus by using
conforming dimensions. The core building blocks of this approach are dependent on the star schema model that provides
the multidimensional OLAP cube capabilities for data analysis within a relational database and SQL.
The benefit of this approach is that you can start small. You gradually build up the larger data warehouse through relatively
smaller investments over a period of time. This option provides a quicker path to return on investment (ROI) in the
solution.
Top-Down Architectural Design, by Bill Inmon:
The top-down architectural approach is about creating a single, integrated normalized warehouse. In a normalized
warehouse, the internal relational constructs follow the rules of normalization. This architectural style supports
transactional integrity because the data comes directly from the OLTP source systems.
This approach can provide more consistent and reliable data, but it requires rigorous design, development, and testing.
This approach also requires substantial financial investments early on, and the ROI doesn't start until the entire data
warehouse becomes fully functional.
26/08/2019 104
SQL Data Warehouse
SQL Data Warehouse
26/08/2019 105
Azure SQL Data Warehouse is a cloud-based enterprise data warehouse (EDW) that uses massively parallel processing
(MPP) to run complex queries across petabytes of data quickly. SQL Data Warehouse is the most appropriate solution
when you need to keep historical data separate from source transaction systems for performance reasons. The source data
might come from other storage mediums, such as network shares, Azure Storage blobs, or a data lake.
SQL Data Warehouse stores incoming data in relational tables that use columnar storage. The format significantly reduces
data storage costs and improves query performance. After information is stored in SQL Data Warehouse, you can run
analytics on a massive scale. Analysis queries in SQL Data Warehouse finish in a fraction of the time it takes to run them in
a traditional database system.
SQL Data Warehouse is a key component in creating end-to-end relational big data solutions in the cloud. It allows data to
be ingested from a variety of data sources, and it leverages a scale-out architecture to distribute computational processing
of data across a large cluster of nodes. Organizations can use SQL Data Warehouse to:
o Independently size compute power, regardless of storage needs.
o Grow or shrink compute power without moving data.
o Pause compute capacity while leaving data intact, so they pay only for storage.
o Resume compute capacity during operational hours.
These advantages are possible because computation and storage are decoupled in the MPP architecture.
Massively parallel processing (MPP) concepts (1/2)
26/08/2019 106
Azure SQL Data Warehouse separates computation from the underlying storage, so you can scale your computing power
independently of data storage. To do this, CPU, memory, and I/O are abstracted and bundled into units of compute scale
called data warehouse units (DWUs).
A DWU represents an abstract, normalized measure of compute resources and performance. By changing the service level,
you alter the number of DWUs that are allocated to the system. In turn, the performance and cost of the system are
adjusted. To achieve higher performance, increase the number of DWUs. This also increases associated costs. To achieve a
lower cost, reduce the number of DWUs. This lowers the performance. Storage and compute costs are billed separately, so
changing the number of DWUs doesn't affect storage costs.
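For example, the service level (and therefore the DWU allocation) can be changed with T-SQL from the master database; the database name and target service objective below are placeholders:
-- Scale the data warehouse to a different service objective.
ALTER DATABASE DemoDW MODIFY (SERVICE_OBJECTIVE = 'DW300c');
-- Check the current service objective of each database.
SELECT db.[name], ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db ON ds.database_id = db.database_id;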
SQL Data Warehouse uses a node-based architecture.
Applications connect and issue T-SQL commands to a control
node, which is the single point of entry for the data warehouse.
The control node runs the massively parallel processing (MPP)
engine, which optimizes queries for parallel processing. Then, it
passes operations to compute nodes to do their work in
parallel. Compute nodes store all user data in Azure Storage
and run parallel queries. The Data Movement Service (DMS) is
a system-level internal service that moves data across nodes as
necessary to run queries in parallel and return accurate results.
https://docs.microsoft.com/en-us/learn/modules/design-azure-sql-data-warehouse/4-massively-parallel-processing
Massively parallel processing (MPP) concepts (2/2)
26/08/2019 107
Control node:
The control node is the brain of the data warehouse. It's the front end that interacts with all applications and connections.
The MPP engine runs on the control node to optimize and coordinate parallel queries. When you submit a T-SQL query to
the SQL data warehouse, the control node transforms it into queries that run against each distribution in parallel.
Compute nodes:
The compute nodes provide the computational power. Distributions map to compute nodes for processing. As you pay for
more compute resources, SQL Data Warehouse remaps the distributions to the available compute nodes. The number of
compute nodes ranges from 1 to 60 and is determined by the service level for the data warehouse.
Data Movement Service:
DMS is the data transport technology that coordinates data movement between compute nodes. When SQL Data
Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller queries
runs on one of the underlying data distributions. A distribution is the basic unit of storage and processing for parallel
queries that run on distributed data. Some queries require data movement across nodes to ensure that parallel queries
return accurate results. When data movement is required, DMS ensures that the correct data gets to the correct location.
Connect to Azure SQL Data Warehouse
26/08/2019 108
A database connection string identifies the protocol, URL, port, and security options that are used to communicate with
the database. The client uses this information to establish a network connection to the database. Select Show database
connection on the Overview page to get the connection string in a variety of framework formats:
o ADO.NET
o JDBC
o ODBC
o PHP
For example, here's the general shape of an ADO.NET connection string (shown with placeholder server, database, and credential values; copy the exact string from the portal):
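Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=DemoDW;Persist Security Info=False;User ID=your_username;Password=your_password;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;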
Notice that the connection string identifies the protocol as TCP/IP, includes the URL, and uses the port 1433, which is the
default in this case.
TCP:1433 is a well-known and public port. After your SQL Data Warehouse server's name is made public, it might invite
denial-of-service (DoS) attacks. To protect the server from such attacks, configure the Azure firewall to restrict network
access to specific IP addresses.
https://docs.microsoft.com/en-us/learn/modules/query-azure-sql-data-warehouse/3-connect-dw-using-ssms
Table geometries
26/08/2019 109
Azure SQL Data Warehouse uses Azure Storage to keep user data safe. Because data is stored and managed by Azure
Storage, SQL Data Warehouse charges separately for storage consumption. The data is sharded into distributions to
optimize system performance. When you define the table, you can choose which sharding pattern to use to distribute the
data. SQL Data Warehouse supports these sharding patterns:
o Hash
o Round-robin
o Replicated
You can use the following strategies to determine which pattern is most suitable for your scenario:
Replicated
o Great fit for: small-dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression).
o Watch out if: many write transactions are on the table (insert/update/delete); you change DWU provisioning frequently;
you use only 2-3 columns but your table has many columns; or you index a replicated table.
Round-robin (default)
o Great fit for: temporary/staging tables, or tables with no obvious joining key or good candidate column.
o Watch out if: performance is slow due to data movement.
Hash
o Great fit for: fact tables and large-dimension tables.
o Watch out if: the distribution key can't be updated.
Table geometries
26/08/2019 110
Here are some tips that can help you choose a strategy:
o Start with round-robin, but aspire to a hash-distribution strategy to take advantage of a massively parallel processing
(MPP) architecture.
o Make sure that common hash keys have the same data format.
o Don't distribute on varchar format.
o Use a common hash key between dimension tables and the associated fact table to allow join operations to be hash-
distributed.
o Use sys.dm_pdw_request_steps to analyze data movements behind queries and monitor how long broadcast and
shuffle operations take. This is helpful when you review your distribution strategy.
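A minimal sketch of that kind of check; the request ID in the second query is a hypothetical value taken from the first:
-- Find recent requests and note the request_id of the query you're investigating.
SELECT TOP 10 request_id, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;
-- Inspect its steps; look for BroadcastMoveOperation and ShuffleMoveOperation and their durations.
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234';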
26/08/2019 111
Table geometries
Hash-distributed tables
A hash-distributed table can deliver the highest query performance for joins and aggregations on large
tables.
To shard data into a hash-distributed table, SQL Data Warehouse uses a hash function to assign each row
to one distribution deterministically. In the table definition, one of the columns is designated the
distribution column. The hash function uses the values in the distribution column to assign each row to a
distribution.
Here's an example of a CREATE TABLE statement that defines a hash distribution:
CREATE TABLE [dbo].[EquityTimeSeriesData](
[Date] [varchar](30) ,
[BookId] [decimal](38, 0) ,
[P&L] [decimal](31, 7) ,
[VaRLower] [decimal](31, 7)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([P&L])
)
;
26/08/2019 112
Table geometries
Round-robin distributed tables
A round-robin table is the most straightforward kind of table to create, and it delivers fast load performance when used
as a staging table.
A round-robin distributed table distributes data evenly across the table but without additional
optimization. A distribution is first chosen at random. Then, buffers of rows are assigned to distributions
sequentially. Loading data into a round-robin table is quick, but query performance often is better in
hash-distributed tables. Joins on round-robin tables require reshuffling data and this takes more time.
Here's an example of a CREATE TABLE statement that defines a round-robin distribution:
CREATE TABLE [dbo].[Dates](
[Date] [datetime2](3) ,
[DateKey] [decimal](38, 0) ,
..
..
[WeekDay] [nvarchar](100) ,
[Day Of Month] [decimal](38, 0)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
)
;
26/08/2019 113
Table geometries
Replicated tables
A replicated table provides the fastest query performance for small tables.
A replicated table caches a full copy of itself on each compute node. Consequently, replicating a table
removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables
work best for small tables. Extra storage is required, and additional overhead is incurred when writing
data, which makes large replicated tables impractical.
Here's an example of a CREATE TABLE statement that defines a replicated distribution:
CREATE TABLE [dbo].[BusinessHierarchies](
[BookId] [nvarchar](250) ,
[Division] [nvarchar](100) ,
[Cluster] [nvarchar](100) ,
[Desk] [nvarchar](100) ,
[Book] [nvarchar](100) ,
[Volcker] [nvarchar](100) ,
[Region] [nvarchar](100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
)
;
26/08/2019 114
Import data into Azure SQL Data Warehouse
by using PolyBase
Introduction to PolyBase
26/08/2019 115
A key feature of Azure SQL Data Warehouse is that you pay for only the processing you need. You can decide how much
parallelism is needed for your work. You also can pause the compute nodes when they're not in use. In this way, you pay for
only the CPU time you use.
Azure SQL Data Warehouse supports many loading methods. These methods include non-PolyBase options such as BCP
and the SQL Bulk Copy API. The fastest and most scalable way to load data is through PolyBase. PolyBase is a technology
that accesses external data stored in Azure Blob storage, Hadoop, or Azure Data Lake Store via the Transact-SQL language.
The following architecture diagram shows how loading is achieved with each HDFS bridge of the data movement service
(DMS) on every compute node that connects to an external resource such as Azure Blob storage. PolyBase then
bidirectionally transfers data between SQL Data Warehouse and the external resource to provide the fast load
performance.
https://docs.microsoft.com/en-us/learn/modules/import-data-into-asdw-with-polybase/2-introduction-to-polybase
Introduction to PolyBase
26/08/2019 116
PolyBase can read data from several file formats and data sources. Before you upload your data into Azure SQL Data
Warehouse, you must prepare the source data into an acceptable format for PolyBase. These formats include:
o Comma-delimited text files (UTF-8 and UTF-16).
o Hadoop file formats, such as RC files, Optimized Row Columnar (ORC) files, and Parquet files.
o Gzip and Snappy compressed files.
If the data is coming from a relational database, such as Microsoft SQL Server, pull the data and then load it into an
acceptable data store, such as an Azure Blob storage account. Tools such as SQL Server Integration Services can ease this
transfer process.
Retrieve the URL and access key for the storage account
26/08/2019 117
Clients and applications must be authenticated to connect to an Azure Blob storage account. There are several ways to do
this. The easiest approach for trusted applications is to use the storage key. Since we're managing the import process, let's
use this approach.
We need two pieces of information to connect PolyBase to the Azure Storage account:
o The URL of the storage account
o The private storage key
26/08/2019 118
1. Create an import
database
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
The first step in using PolyBase is to create a database-scoped credential that secures the credentials
to the blob storage. Create a master key first, and then use this key to encrypt the database-scoped
credential named AzureStorageCredential.
Paste the following code into the query window. Replace the SECRET value with the access key you
retrieved in the previous exercise.
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = 'DemoDwStorage',
SECRET = 'THE-VALUE-OF-THE-ACCESS-KEY' -- put key1's value here
;
Select Run to run the query. It should report Query succeeded: Affected rows: 0.
26/08/2019 119
2. Create an external data
source connection
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
Use the database-scoped credential to create an external data source named AzureStorage. Note that
the LOCATION URL points to the container named data-files that you created in blob storage. The
type Hadoop is used for both Hadoop-based and Azure Blob storage-based access.
Paste the following code into the query window. Replace the LOCATION value with your correct
value from the previous exercise.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://data-files@demodwstorage.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
Select Run to run the query. It should report Query succeeded: Affected rows: 0.
26/08/2019 120
3. Define the import file
format
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
Define the external file format named TextFile. This definition indicates to PolyBase that the format of
the text file is DelimitedText and that the field terminator is a comma.
Paste the following code into the query window:
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
Select Run to run the query. It should report Query succeeded: Affected rows: 0.
26/08/2019 121
4. Create a temporary
table
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
Create an external table named dbo.Temp with the column definition for your table. At the bottom
of the query, a WITH clause references the data source named AzureStorage and the file format
named TextFile that you defined earlier. The LOCATION value denotes that the files for the load are
in the root folder of the data source.
The table definition must match the fields defined in the input file. There are 12 defined columns,
with data types that match the input file data.
Note: An external table stores only the table definition in SQL Data Warehouse; the data itself remains
in the external source and is read at query time. External tables can be queried like any other table.
-- Create a temp table to hold the imported data
CREATE EXTERNAL TABLE dbo.Temp (
[Date] datetime2(3) NULL,
[DateKey] decimal(38, 0) NULL,
[MonthKey] decimal(38, 0) NULL,
[Month] nvarchar(100) NULL,
[Quarter] nvarchar(100) NULL,
[Year] decimal(38, 0) NULL,
[Year-Quarter] nvarchar(100) NULL,
[Year-Month] nvarchar(100) NULL,
[Year-MonthKey] nvarchar(100) NULL,
[WeekDayKey] decimal(38, 0) NULL,
[WeekDay] nvarchar(100) NULL,
[Day Of Month] decimal(38, 0) NULL
)
WITH (
LOCATION='/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile
);
26/08/2019 122
5. Create a destination
table
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
Create a physical table in the SQL Data Warehouse database. In the following example, you create a
table named dbo.StageDate. The table has a clustered column store index defined on all the
columns. It uses a table geometry of round_robin by design because round_robin is the best table
geometry to use for loading data.
-- Load the data from Azure Blob storage to SQL Data Warehouse
CREATE TABLE [dbo].[StageDate]
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[Temp];
26/08/2019 123
6. Add statistics onto
columns to improve query
performance
Import data from blob storage to Azure
SQL Data Warehouse by using PolyBase
As an optional step, create statistics on columns that feature in queries to improve the query
performance against the table.
-- Create statistics on the new data
CREATE STATISTICS [DateKey] on [StageDate] ([DateKey]);
CREATE STATISTICS [Quarter] on [StageDate] ([Quarter]);
CREATE STATISTICS [Month] on [StageDate] ([Month]);
You've loaded your first staging table in Azure SQL Data Warehouse. From here, you can write
further Transact-SQL queries to perform transformations into dimension and fact tables. Try it out by
querying the StageDate table in the query explorer or in another query tool. Refresh the view on the
left to see the new table or tables that you created. Reuse the previous steps in a persistent SQL
script to load additional data, as necessary.
Enable data encryption
26/08/2019 124
Encrypt data at rest:
Data encryption at rest helps protect data in your data warehouse and satisfies a critical compliance requirement. Azure
SQL Data Warehouse provides transparent data-at-rest encryption capabilities without affecting the client applications
that connect to it.
https://docs.microsoft.com/en-us/learn/modules/data-warehouse-security/2-azure-dw-security-control
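If encryption isn't already enabled on the data warehouse, it can be turned on with T-SQL; the database name below is a placeholder:
-- Enable transparent data encryption (TDE) for the data warehouse.
ALTER DATABASE DemoDW SET ENCRYPTION ON;
-- Verify: is_encrypted = 1 once encryption completes.
SELECT [name], is_encrypted FROM sys.databases;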
Encrypt data in transit in your application
26/08/2019 125
Encrypting data in transit helps prevent man-in-the-middle attacks. To encrypt data in transit, specify encrypt=true in the
connection string in your client applications, as follows. This ensures that all data sent between your client application and
SQL Data Warehouse is encrypted with SSL/TLS.
String connectionURL =
    "jdbc:sqlserver://<your_azure_sql_datawarehouse_fqdn>:1433;" +
    "databaseName=DemoDW;user=your_username;password=your_password;" +
    "encrypt=true;";
The difference between a login and a user
26/08/2019 126
Logins provide access to a server, and users provide access to databases within it.
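A minimal illustration with hypothetical names:
-- In the master database: a login grants access to the logical server.
CREATE LOGIN report_login WITH PASSWORD = 'Str0ng!Passw0rd1';
-- In a user database: a user mapped to that login grants access to the database.
CREATE USER report_user FOR LOGIN report_login;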
26/08/2019 127
Data Streams
26/08/2019 128
Data streams
In the context of analytics, data streams are event data generated by sensors or other sources that can be analyzed by another
technology. Analyzing a data stream is typically done to measure the state change of a component or to capture information on an
area of interest. The intent is to:
o Continuously analyze data to detect issues and understand or respond to them.
o Understand component or system behavior under various conditions to fuel further enhancements of said component or
system.
o Trigger specific actions when certain thresholds are identified.
In today's world, data streams are ubiquitous. Companies can harness the latent knowledge in data streams to improve
efficiencies and further innovation. Examples of use cases that analyze data streams include:
o Stock market trends.
o Monitoring data of water pipelines and electrical transmission and distribution systems by utility companies.
o Mechanical component health monitoring data in automotive and automobile industries.
o Monitoring data from industrial and manufacturing equipment.
o Sensor data in transportation, such as traffic management and highway toll lanes.
o Patient health monitoring data in the healthcare industry.
o Satellite data in the space industry.
o Fraud detection in the banking and finance industries.
o Sentiment analysis of social media posts.
Approaches to data stream processing
26/08/2019 129
There are two approaches to processing data streams: on-demand and live.
o Streaming data can be collected over time and persisted in storage as static data. The data can then be processed
when convenient or during times when compute costs are lower. The downside to this approach is the cost of storing
the data.
o In contrast, live data streams have relatively low storage requirements. They also require more processing power to
run computations in sliding windows over continuously incoming data to generate the insights.
Event processing
26/08/2019 130
The process of consuming data streams, analyzing them, and deriving actionable insights out of them is called event
processing. An event processing pipeline has three distinct components:
o Event producer: Examples include sensors or processes that generate data continuously, such as a heart rate monitor
or a highway toll lane sensor.
o Event processor: An engine to consume event data streams and derive insights from them. Depending on the problem
space, event processors either process one incoming event at a time, such as a heart rate monitor, or process multiple
events at a time, such as Azure Stream Analytics processing the highway toll lane sensor data.
o Event consumer: An application that consumes the data and takes specific action based on the insights. Examples of
event consumers include alert generation, dashboards, or even sending data to another event processing engine.
https://docs.microsoft.com/en-us/learn/modules/introduction-to-data-streaming/3-event-processing
Process events with Azure Stream Analytics
26/08/2019 131
Microsoft Azure Stream Analytics is an event processing engine. It enables the consumption and analysis of high volumes
of streaming data generated by sensors, devices, or applications. Stream Analytics processes the data in real time. A typical
event processing pipeline built on top of Stream Analytics consists of the following four components:
o Event producer: Any application, system, or sensor that continuously produces event data of interest. Examples range
from a sensor that tracks the flow of water in a utility pipe to an application such as Twitter that generates tweets
against a single hashtag.
o Event ingestion system: Takes the data from the source system or application to pass onto an analytics engine. Azure
Event Hubs, Azure IoT Hub, or Azure Blob storage can all serve as the ingestion system.
o Stream analytics engine: Where compute is run over the incoming streams of data and insights are extracted. Azure
Stream Analytics exposes the Stream Analytics query language (SAQL), a subset of Transact-SQL that's tailored to
perform computations over streaming data. The engine supports windowing functions that are fundamental to stream
processing and are implemented by using the SAQL.
o Event consumer: A destination of the output from the stream analytics engine. The target can be storage, such as
Azure Data Lake, Azure Cosmos DB, Azure SQL Database, or Azure Blob storage, or dashboards powered by Power BI.
https://docs.microsoft.com/en-us/learn/modules/introduction-to-data-streaming/4-event-with-asa
26/08/2019 132
Azure Stream Analytics
Stream Analytics jobs
26/08/2019 133
In Azure Stream Analytics, a job is a unit of execution. A Stream Analytics job pipeline consists of three parts:
o An input that provides the source of the data stream.
o A transformation query that acts on the input. For example, a transformation query could aggregate the data.
o An output that identifies the destination of the transformed data.
The Stream Analytics pipeline provides a transformed data flow from input to output, as the following diagram shows.
https://docs.microsoft.com/en-us/learn/modules/transform-data-with-azure-stream-analytics/2-stream-analytics-workflow
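As an illustrative sketch of such a transformation query, the SAQL below aggregates toll-lane events into five-minute tumbling windows; the bracketed input and output names are assumed aliases configured on the job:

-- Count vehicles per toll booth in 5-minute tumbling windows
SELECT
    TollId,
    COUNT(*) AS VehicleCount,
    System.Timestamp() AS WindowEnd
INTO
    [powerbi-output]
FROM
    [eventhub-input] TIMESTAMP BY EntryTime
GROUP BY
    TollId,
    TumblingWindow(minute, 5)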
Stream Analytics job inputs
26/08/2019 134
An Azure Stream Analytics job supports three input types:
o Azure Event Hubs: consumes live streaming data from applications with low latency and high throughput.
o Azure IoT Hub: consumes live streaming events from IoT devices. The service enables bidirectional communication, so
commands can be sent back to the devices to trigger specific actions based on analysis of the streams they send.
o Azure Blob storage: serves as the input source for files that are persisted in blob storage.
26/08/2019 135
Data Job
26/08/2019 136
Identify job roles
Role differences
The roles of the data engineer, AI engineer, and data scientist differ. Each role solves a different problem.
Data engineers primarily provision data stores. They make sure that massive amounts of data are securely and cost-effectively extracted,
loaded, and transformed.
AI engineers add the intelligent capabilities of vision, voice, language, and knowledge to applications. To do this, they use the Cognitive
Services offerings that are available out of the box.
When a Cognitive Services application reaches its capacity, AI engineers call on data scientists. Data scientists develop machine learning
models and customize components for an AI engineer's application.
Each data-technology role is distinct, and each contributes an important part to digital transformation projects.
Data engineer
26/08/2019 137
Data engineers provision and set up data platform technologies that are on-premises and in the cloud. They manage and
secure the flow of structured and unstructured data from multiple sources. The data platforms they use can include
relational databases, nonrelational databases, data streams, and file stores. Data engineers also ensure that data services
securely and seamlessly integrate with other data platform technologies or application services such as Azure Cognitive
Services, Azure Search, or even bots.
The Azure data engineer focuses on data-related tasks in Azure. Primary responsibilities include using services and tools to
ingest, egress, and transform data from multiple sources. Azure data engineers collaborate with business stakeholders to
identify and meet data requirements. They design and implement solutions. They also manage, monitor, and ensure the
security and privacy of data to satisfy business needs.
The role of data engineer is different from the role of a database administrator. A data engineer's scope of work goes well
beyond looking after a database and the server where it's hosted. Data engineers must also get, ingest, transform,
validate, and clean up data to meet business requirements. This process is called data wrangling.
A data engineer adds tremendous value to both business intelligence and data science projects. Data wrangling can
consume a lot of time. When the data engineer wrangles data, projects move more quickly because data scientists can
focus on their own areas of work.
Both database administrators and business intelligence professionals can easily transition to a data engineer role. They just
need to learn the tools and technology that are used to process large amounts of data.
AI engineer
26/08/2019 138
AI engineers work with AI services such as Cognitive Services, Cognitive Search, and Bot Framework. Cognitive Services
includes Computer Vision, Text Analytics, Bing Search, and Language Understanding (LUIS).
Rather than creating models, AI engineers apply the prebuilt capabilities of Cognitive Services APIs. AI engineers embed
these capabilities within a new or existing application or bot. AI engineers rely on the expertise of data engineers to store
information that's generated from AI.
For example, an AI engineer might be working on a Computer Vision application that processes images. This AI engineer
would ask a data engineer to provision an Azure Cosmos DB instance to store the metadata and tags that the Computer
Vision application generates.
26/08/2019 140
Data Transformation
Change data processes
26/08/2019 141
As a data engineer, you'll extract raw data from a structured or unstructured data pool and migrate it to a staging data
repository. Because the data source might have a different structure than the target destination, you'll transform the data
from the source schema to the destination schema. This process is called transformation. You'll then load the transformed
data into the data warehouse. Together, these steps form a process called extract, transform, and load (ETL).
A disadvantage of the ETL approach is that the transformation stage can take a long time. This stage can potentially tie up
source system resources.
An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a
large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in process reduces the resource
contention on source systems. Data engineers can begin transforming the data as soon as the load is complete.
ELT also has more architectural flexibility to support multiple transformations. For example, how the marketing
department needs to transform the data can be different than how the operations department needs to transform that
same data.
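As a minimal sketch of the transform-after-load step in ELT on SQL Data Warehouse, assuming the raw files have already been loaded to the lake and exposed through an external table named ext.RawSales (all object names here are illustrative):

-- CTAS: reshape raw, already-loaded data into a distributed warehouse table
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey,
       SUM(Quantity) AS TotalQuantity
FROM ext.RawSales -- external table over files in Azure Data Lake Storage
GROUP BY ProductKey;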
Evolution from ETL
26/08/2019 142
Azure has opened the way for technologies that can handle unstructured data at an unlimited scale. This change has
shifted the paradigm for loading and transforming data from ETL to extract, load, and transform (ELT).
The benefit of ELT is that you can store data in its original format, be it JSON, XML, PDF, or images. In ELT, you define the
data's structure during the transformation phase, so you can use the source data in multiple downstream systems.
In an ELT process, data is extracted and loaded in its native format. This change reduces the time required to load the data
into a destination system. The change also limits resource contention on the data sources.
The steps for the ELT process are the same as for the ETL process. They just follow a different order.
Another process like ELT is called extract, load, transform, and load (ELTL). The difference with ELTL is that it has a final load
into a destination system.
Holistic data engineering
26/08/2019 143
Organizations are changing their analysis types to incorporate predictive and preemptive analytics. Because of these
changes, as a data engineer you should look at data projects holistically. Data professionals used to focus on ETL, but
developments in data platform technologies lend themselves to an ELT approach.
Design data projects in phases that reflect the ELT approach:
o Source: Identify the source systems to extract from.
o Ingest: Identify the technology and method to load the data.
o Prepare: Identify the technology and method to transform or prepare the data.
Also consider the technologies you'll use to analyze and consume the data within the project. These are the next two
phases of the process:
o Analyze: Identify the technology and method to analyze the data.
o Consume: Identify the technology and method to consume and present the data.
In traditional descriptive analytics projects, you might have transformed data in Azure Analysis Services and then used
Power BI to consume the analyzed data. New AI technologies such as Azure Machine Learning services and Azure
Notebooks provide a wider range of technologies to automate some of the required analysis.
These project phases don't necessarily have to flow linearly. For example, because machine learning experimentation is
iterative, the Analyze phase sometimes reveals issues such as missing source data or transformation steps. To get the
results you need, you might need to repeat earlier phases.
More Related Content

PPTX
Building Modern Data Platform with Microsoft Azure
PDF
PPTX
Chug building a data lake in azure with spark and databricks
PDF
Data warehouse con azure synapse analytics
PPTX
Data Lake Overview
PPTX
Power BI for Big Data and the New Look of Big Data Solutions
PDF
Big Data Analytics from Azure Cloud to Power BI Mobile
PPTX
Microsoft cloud big data strategy
Building Modern Data Platform with Microsoft Azure
Chug building a data lake in azure with spark and databricks
Data warehouse con azure synapse analytics
Data Lake Overview
Power BI for Big Data and the New Look of Big Data Solutions
Big Data Analytics from Azure Cloud to Power BI Mobile
Microsoft cloud big data strategy

What's hot (20)

PDF
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
PDF
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDCOslo 2020 - Lac...
PDF
Taming the shrew Power BI
PPTX
Machine Learning and AI
DOCX
Varadarajan CV
PDF
Analytics in a Day Virtual Workshop
 
PDF
Building Lakehouses on Delta Lake with SQL Analytics Primer
PDF
Speed up data preparation for ML pipelines on AWS
PPTX
Streaming Real-time Data to Azure Data Lake Storage Gen 2
PPTX
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
PPTX
Azure Databricks for Data Scientists
PDF
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
PPTX
Designing big data analytics solutions on azure
PPTX
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
PPTX
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
PDF
Democratizing Data Science on Kubernetes
PPTX
Is the traditional data warehouse dead?
PPTX
Modernize & Automate Analytics Data Pipelines
PPTX
Running cost effective big data workloads with Azure Synapse and Azure Data L...
PPTX
RDX Insights Presentation - Microsoft Business Intelligence
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDCOslo 2020 - Lac...
Taming the shrew Power BI
Machine Learning and AI
Varadarajan CV
Analytics in a Day Virtual Workshop
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Speed up data preparation for ML pipelines on AWS
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Azure Databricks for Data Scientists
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Designing big data analytics solutions on azure
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
Democratizing Data Science on Kubernetes
Is the traditional data warehouse dead?
Modernize & Automate Analytics Data Pipelines
Running cost effective big data workloads with Azure Synapse and Azure Data L...
RDX Insights Presentation - Microsoft Business Intelligence
Ad

Similar to Azure data stack_2019_08 (20)

PDF
Database Revolution - Exploratory Webcast
PDF
Database revolution opening webcast 01 18-12
PPTX
NoSQL and The Big Data Hullabaloo
PPTX
A Practical Look at the NOSQL and Big Data Hullabaloo
PDF
Dbms intro
PDF
NoSQL (Not Only SQL)
PDF
Big data rmoug
PDF
Architecting for the cloud storage misc topics
PDF
CCS367-Storage-Technologies-Lecture-Notes-1.pdf
PPTX
Understanding data
PPTX
No sql introduction_v1.1.1
PPTX
model data objects concepts of entitty.pptx
PPTX
Database Management System
PDF
Database Systems - A Historical Perspective
PPTX
bigdata introduction for students pg msc
PPTX
INFO8116 - Week 9 - Slides.pptx big data arc
PPTX
INFO8116 -Big data architecture and analytics
PPTX
Presentation1
PPT
Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh
PPTX
UNIT I Introduction to NoSQL.pptx
Database Revolution - Exploratory Webcast
Database revolution opening webcast 01 18-12
NoSQL and The Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data Hullabaloo
Dbms intro
NoSQL (Not Only SQL)
Big data rmoug
Architecting for the cloud storage misc topics
CCS367-Storage-Technologies-Lecture-Notes-1.pdf
Understanding data
No sql introduction_v1.1.1
model data objects concepts of entitty.pptx
Database Management System
Database Systems - A Historical Perspective
bigdata introduction for students pg msc
INFO8116 - Week 9 - Slides.pptx big data arc
INFO8116 -Big data architecture and analytics
Presentation1
Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh
UNIT I Introduction to NoSQL.pptx
Ad

More from Alexandre BERGERE (8)

PDF
Databases - beyond SQL : Cosmos DB (part 6)
PDF
20210427 azure lille_meetup_azure_data_stack
PDF
comparatifs des familles NoSQL & concepts de modélisation
PDF
Big dataclasses 2019_nosql
PDF
Azure fundamentals
PDF
Iot streaming with Azure Stream Analytics from IotHub to the full data slack
PDF
Cloud architecture - Azure - AWS
PDF
MongoDB classes 2019
Databases - beyond SQL : Cosmos DB (part 6)
20210427 azure lille_meetup_azure_data_stack
comparatifs des familles NoSQL & concepts de modélisation
Big dataclasses 2019_nosql
Azure fundamentals
Iot streaming with Azure Stream Analytics from IotHub to the full data slack
Cloud architecture - Azure - AWS
MongoDB classes 2019

Recently uploaded (20)

PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
Moving the Public Sector (Government) to a Digital Adoption
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Launch Your Data Science Career in Kochi – 2025
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PDF
Fluorescence-microscope_Botany_detailed content
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Moving the Public Sector (Government) to a Digital Adoption
Data_Analytics_and_PowerBI_Presentation.pptx
Major-Components-ofNKJNNKNKNKNKronment.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Launch Your Data Science Career in Kochi – 2025
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
1_Introduction to advance data techniques.pptx
Supervised vs unsupervised machine learning algorithms
Fluorescence-microscope_Botany_detailed content
.pdf is not working space design for the following data for the following dat...
Acceptance and paychological effects of mandatory extra coach I classes.pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
iec ppt-1 pptx icmr ppt on rehabilitation.pptx

Azure data stack_2019_08

  • 2. 2 alexandre.bergere@gmail.com https://fr.linkedin.com/in/alexandrebergere @AlexPhile Avanade 2016 - 2019 Sr Anls, Data Engineering Worked for 3 years as a senior analyst at Avanade France, I have developed my skills in data analysis (MSBI, Power BI, R, Python, Spark, Cosmos DB) by working on innovative projects and proofs of concept in the energy industry. ESAIP Teacher 2016 - x Data Freelance 2019 - x 26/08/2019
  • 3. 26/08/2019 3 Sources This support is a summary from the paths: o Azure for the Data Engineer o Store data in Azure o Work with relational data in Azure o Large Scale Data Processing with Azure Data Lake Storage Gen2 o Implement a Data Streaming Solution with Azure Streaming Analytics o Implement a Data Warehouse with Azure SQL Data Warehouse in Microsoft Learn.
  • 6. No SQL 26/08/2019 6 Distributed Systems and the CAP Theorem NoSQL databases are distributed systems designed to operate across multiple nodes. A single node is a Virtual Machine or even a physical machine. The NoSQL database will be running across a cluster of nodes so it can perform query and data storage distribution. As a distributed system, NoSQL databases offer some benefits in availability and scalability that a single SQL database does not. If a node goes down, the database remains available and very little data is lost, if any. There are even features such as quorum agreement when querying or writing data to ensure the stability and availability of data stored in the NoSQL database. What is the CAP Theorem? When discussing any distributed system, it’s important to understand the CAP Theorem. The CAP Theorem, also known as Brewer’s theorem, was first presented by Eric Brewer, a computer scientist at the University of California, Berkeley in 1998. It helps describe the behaviour of distributed data systems. The CAP Theorem states that it’s impossible for a distributed data store to simultaneously provide more than 2 of the following 3 guarantees: o Consistency: All queries receive the most recent data written to the data store. o Availability: All requests receive a valid response without error. o Partition Tolerance: All requests are processed and handled without error regardless of network issues or failed nodes. https://buildazure.com/nosql-vs-sql-demystifying-nosql-databases/
  • 7. Structured data 26/08/2019 7 Structured data is data that adheres to a schema, so all of the data has the same fields or properties. Structured data can be stored in a database table with rows and columns. Structured data relies on keys to indicate how one row in a table relates to data in another row of another table. Structured data is also referred to as relational data, as the data's schema defines the table of data, the fields in the table, and the clear relationship between the two. Structured data is straightforward in that it's easy to enter, query, and analyze. All of the data follows the same format. However, forcing a consistent structure also means evolution of the data is more difficult as each record has to be updated to conform to the new structure. Examples of structured data include: o Sensor data o Financial data o CRM + OLAP …
  • 8. Semi-structured data 26/08/2019 8 Semi-structured data is less organized than structured data, and is not stored in a relational format, as the fields do not neatly fit into tables, rows, and columns. Semi-structured data contains tags that make the organization and hierarchy of the data apparent. Semi-structured data is also referred to as non-relational or NoSQL data. Examples of semi-structured data include: o Key / Value pairs o Graph data o JSON files o XML files
  • 9. Unstructured data 26/08/2019 9 The organization of unstructured data is generally ambiguous. Unstructured data is often delivered in files, such as photos or videos. The video file itself may have an overall structure and come with semi-structured metadata, but the data that comprises the video itself is unstructured. Therefore, photos, videos, and other similar files are classified as unstructured data. Examples of unstructured data include: o Media files, such as photos, videos, and audio files o Office files, such as Word documents o Text files o Log files
  • 10. Data classification 26/08/2019 10 Product catalog data Product catalog data for an online retail business is fairly structured in nature, as each product has a product SKU, a description, a quantity, a price, size options, color options, a photo, and possibly a video. So, this data appears relational to start with, as it all has the same structure. However, as you introduce new products or different kinds of products, you may want to add different fields as time goes on. For example, new tennis shoes you're carrying are Bluetooth-enabled, to relay sensor data from the shoe to a fitness app on the user’s phone. This appears to be a growing trend, and you want to enable customers to filter on "Bluetooth-enabled" shoes in the future. You don't want to go back and update all your existing shoe data with a Bluetooth-enabled property, you simply want to add it to new shoes. With the addition of the Bluetooth-enabled property, your shoe data is no longer homogenous, as you've introduced differences in the schema. If this is the only exception you expect to encounter, you can go back and normalize the existing data so that all products included a "Bluetooth- enabled" field to maintain a structured, relational organization. However, if this is just one of many specialty fields that you envision supporting in the future, then the classification of the data is semi-structured. The data is organized by tags, but each product in the catalog can contain unique fields. Data classification: Semi-structured Photos and videos The photos and videos displayed on product pages are unstructured data. Although the media file may contain metadata, the body of the media file is unstructured. Data classification: Unstructured Business data Business analysts want to implement business intelligence to perform inventory pipeline evaluations and sales data reviews. In order to perform these operations, data from multiple months needs to be aggregated together, and then queried. Because of the need to aggregate similar data, this data must be structured, so that one month can be compared against the next. Data classification: Structured
  • 11. 26/08/2019 11 Determine operational needs Operations and latency: What are the main operations you'll be completing on each data type, and what are the performance requirements? Ask yourself these questions: o Will you be doing simple lookups using an ID? o Do you need to query the database for one or more fields? o How many create, update, and delete operations do you expect? o Do you need to run complex analytical queries? o How quickly do these operations need to complete?
  • 12. Operations and latency 26/08/2019 12 Product catalog data: For product catalog data in the online retail scenario, customer needs are the highest priority. Customers will want to query the product catalog to find, for example, all men's shoes, then men's shoes on sale, and then men's shoes on sale in a particular size. Customer needs may require lots of read operations, with the ability to query on certain fields. In addition, when customers place orders, the application must update product quantities. The update operations need to happen just as quickly as the read operations so that users don't put an item in their shopping carts when that item has just sold out. This will not only result in a large number of read operations, but will also require increased write operations for product catalog data. Be sure to determine the priorities for all the users of the database, not just the primary ones. Photos and videos: The photos and videos that are displayed on product pages have different requirements. They need fast retrieval times so that they are displayed on the site at the same time as product catalog data, but they don't need to be queried independently. Instead, you can rely on the results of the product query, and include the video ID or URL as a property on the product data. So, photos and videos need only be retrieved by their ID. In addition, users will not make updates to existing photos or videos. They may, however, add new photos for product reviews. For example, a user might upload an image of themselves showing off their new shoes. As an employee, you also upload and delete product photos from your vendor, but that update doesn't need to happen as fast as your other product data updates. In summary, photos and videos can be queried by ID to return the whole file, but creates and updates will be less frequent and are less of a priority. Business data: For business data, all the analysis is happening on historical data. No original data is updated based on the analysis, so business data is read- only. Users don't expect their complex analytics to run instantly, so having some latency in the results is okay. In addition, business data will be stored in multiple data sets, as users will have different access to write to each data set. However, business analysts will be able to read from all databases.
  • 13. 26/08/2019 13 Group multiple operations in a transaction
  • 14. What is a transaction? 26/08/2019 14 A transaction is a logical group of database operations that execute together. Here's the question to ask yourself regarding whether you need to use transactions in your application: Will a change to one piece of data in your dataset impact another? If the answer is yes, then you'll need support for transactions in your database service. Transactions are often defined by a set of four requirements, referred to as ACID guarantees. ACID stands for Atomicity, Consistency, Isolation, and Durability: o Atomicity means that either all the operations in the transaction are executed, or none of them are. o Consistency ensures that if something happens partway through the transaction, a portion of the data isn't left out of the updates. Across the board, the transaction is applied consistently or not at all. o Isolation ensures that one transaction is not impacted by another transaction. o Durability means that the changes made due to the transaction are permanently saved in the system. Committed data is saved by the system so that even in the event of a failure and system restart, the data is available in its correct state. When a database offers ACID guarantees, these principles are applied to any transactions in a consistent manner.
  • 15. OLTP vs OLAP 26/08/2019 15 Transactional databases are often called OLTP (Online Transaction Processing) systems. OLTP systems commonly support lots of users, have quick response times, and handle large volumes of data. They are also highly available (meaning they have very minimal downtime), and typically handle small or relatively simple transactions. On the contrary, OLAP (Online Analytical Processing) systems commonly support fewer users, have longer response times, can be less available, and typically handle large and complex transactions. The terms OLTP and OLAP aren't used as frequently as they used to be, but understanding them makes it easier to categorize the needs of your application. Now that you're familiar with transactions, OLTP, and OLAP, let's walk through each of the data sets in the online retail scenario, and determine the need for transactions.
  • 16. Operations and latency 26/08/2019 16 Product catalog data: Product catalog data should be stored in a transactional database. When users place an order and the payment is verified, the inventory for the item should be updated. Likewise, if the customer's credit card is declined, the order should be rolled back, and the inventory should not be updated. These relationships all require transactions. Photos and videos: Photos and videos in a product catalog don't require transactional support. These files are changed only when an update is made or new files are added. Even though there is a relationship between the image and the actual product data, it's not transactional in nature. Business data: For the business data, because all of the data is historical and unchanging, transactional support is not required. The business analysts working with the data also have unique needs in that they often require working with aggregates in their queries, so that they can work with the totals of other smaller data points.
  • 17. 26/08/2019 17 Choose a storage solution on Azure
  • 18. Choose a storage solution on Azure 26/08/2019 18 Data classification Operations Latency & throughput Transactional support Recommended service Product catalog data Semi-structured because of the need to extend or modify the schema for new products o Customers require a high number of read operations, with the ability to query on many fields within the database. o The business requires a high number of write operations to track the constantly changing inventory. High throughput and low latency Required Azure Cosmos DB Photos and videos Unstructured o Only need to be retrieved by ID. o Customers require a high number of read operations with low latency. o Creates and updates will be somewhat infrequent and can have higher latency than read operations. Retrievals by ID need to support low latency and high throughput. Creates and updates can have higher latency than read operations. Not required Azure Blob storage Business data Structured Read-only, complex analytical queries across multiple databases Some latency in the results is expected based on the complex nature of the queries Required Azure SQL Database
  • 20. Azure Data Studio 26/08/2019 20 Azure Data Studio is a cross-platform database tool that you can run on Windows, macOS, and Linux. You'll use it to connect to SQL Data Warehouse and Azure SQL Database. Previously released under the preview name SQL Operations Studio, Azure Data Studio offers a modern editor experience with IntelliSense, code snippets, source control integration, and an integrated terminal. It is engineered with the data platform user in mind, with built in charting of query result sets and customizable dashboards.
  • 21. Storage Explorer 26/08/2019 21 Begin by downloading and installing Storage Explorer. You can use Storage Explorer to do several operations against data in your Azure Storage account and data lake: o Upload files or folders from your local computer into Azure Storage. o Download cloud-based data to your local computer. o Copy or move files and folders around in the storage account. o Delete data from the storage account.
  • 22. Data Migration Assistant (DMA) 26/08/2019 22 The Data Migration Assistant (DMA) helps you upgrade to a modern data platform by detecting compatibility issues that can impact database functionality in your new version of SQL Server or Azure SQL Database. DMA recommends performance and reliability improvements for your target environment and allows you to move your schema, data, and uncontained objects from your source server to your target server. To install DMA, download the latest version of the tool from the Microsoft Download Center, and then run the DataMigrationAssistant.msi file. Sources Target o SQL Server 2005 o SQL Server 2008 o SQL Server 2008 R2 o SQL Server 2012 o SQL Server 2014 o SQL Server 2016 o SQL Server 2017 on Windows o SQL Server 2012 o SQL Server 2014 o SQL Server 2016 o SQL Server 2017 on Windows and Linux o Azure SQL Database o Azure SQL Database Managed Instance
  • 23. Azure SQL Data Sync 26/08/2019 23 SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances. When to use Data Sync: Data Sync is useful in cases where data needs to be kept up-to-date across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync: o Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who are considering moving to the cloud and would like to put some of their application in Azure. o Distributed Applications: In many cases, it's beneficial to separate different workloads across different databases. For example, if you have a large production database, but you also need to run a reporting or analytics workload on this data, it's helpful to have a second database for this additional workload. This approach minimizes the performance impact on your production workload. You can use Data Sync to keep these two databases synchronized. o Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily keep databases in regions around the world synchronized.
  • 24. Azure Database Migration Service 26/08/2019 24 Accelerate your transition to the cloud: o Use a simple, self-guided migration process o Get comprehensive assessments detailing pre-migration steps o Migrate at scale from multiple sources to your target database Accelerate your database migration: Reduce the complexity of your cloud migration by using a single comprehensive service instead of multiple tools. Azure Database Migration Service is designed as a seamless, end-to-end solution for moving on-premises SQL Server databases to the cloud. Use the Database Migration Guide for recommendations, step-by-step guidance, and expert tips on your specific database migration. Finish faster with our guided process: No specialty skills are required to get reliable outcomes with our highly-resilient and self-healing migration. The options presented through the guided process are easy to understand and implement—so you can get the job done right the first time.
  • 25. Azure Cosmos DB : Data migration tools 26/08/2019 Big Data class by Alexandre Bergere 25 Data Migration Tools SQL API Mongo DB API Graph APITable API Cassandra API
  • 26. Summary 26/08/2019 26 Scenario Some recommended solutions Disaster Recovery Azure geo-redundant backups Read Scale Use read-only replicas to load balance read-only query workloads (preview) ETL (OLTP to OLAP) Azure Data Factory or SQL Server Integration Services Migration from on-premises SQL Server to Azure SQL Database Azure Database Migration Service Kept up-to-date across several Azure SQL databases or SQL Server database Azure SQL Data Sync
  • 28. Understand data storage in Azure Storage 26/08/2019 28 Azure Storage accounts are the base storage type within Azure. Azure Storage offers a very scalable object store for data objects and file system services in the cloud. It can also provide a messaging store for reliable messaging, or it can act as a NoSQL store. Azure Storage offers four configuration options: o Azure Blob: A scalable object store for text and binary data o Azure Files: Managed file shares for cloud or on-premises deployments o Azure Queue: A messaging store for reliable messaging between application components o Azure Table: A NoSQL store for no-schema storage of structured data You can use Azure Storage as the storage basis when you're provisioning a data platform technology such as Azure Data Lake Storage and HDInsight. But you can also provision Azure Storage for standalone use. For example, you provision an Azure Blob store either as standard storage in the form of magnetic disk storage or as premium storage in the form of solid- state drives (SSDs).
  • 29. Databases 26/08/2019 29 Managed relational SQL Database as a service Azure SQL Database Globally distributed, multi-model database for any scale Azure Cosmos DB NoSQL key-value store using semi- structured datasets Table Storage Managed MySQL database service for app developers Azure Database for MySQL Simplify on-premises database migration to the cloud Azure Database Migration Service
  • 30. Analytics 26/08/2019 30 Fast, easy, and collaborative Apache Spark-based analytics platform Azure Databricks Real-time data stream processing from millions of IoT devices Azure Stream Analytics Hybrid data integration at enterprise scale, made easy Data Factory Enterprise-grade analytics engine as a service Azure Analysis Services Elastic data warehouse as a service with enterprise-class features SQL Data Warehouse
  • 31. Storage 26/08/2019 31 REST-based object storage for unstructured data Blob Storage Simplify data protection and protect against ransomware Azure Backup Massively scalable, secure data lake functionality built on Azure Blob Storage Azure Data Lake Storage
  • 32. Blob storage 26/08/2019 32 If you need to provision a data store that will store but not query data, your cheapest option is to set up a storage account as a Blob store. Blob storage works well with images and unstructured data, and it's the cheapest way to store data in Azure. Azure Storage accounts are scalable and secure, durable, and highly available. Azure handles your hardware maintenance, updates, and critical issues. It also provides REST APIs and SDKs for Azure Storage in various languages. Supported languages include .NET, Java, Node.js, Python, PHP, Ruby, and Go. Azure Storage also supports scripting in Azure PowerShell and the Azure CLI. Data ingestion: To ingest data into your system, use Azure Data Factory, Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. If you use the File Upload feature to import file sizes above 2 GB, use PowerShell or Visual Studio. AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB. Queries: If you create a storage account as a Blob store, you can't query the data directly. To query it, either move the data to a store that supports queries or set up the Azure Storage account for a data lake storage account. Data security: Azure Storage encrypts all data that's written to it. Azure Storage also provides you with fine-grained control over who has access to your data. You'll secure the data by using keys or shared access signatures. Azure Resource Manager provides a permissions model that uses role-based access control (RBAC). Use this functionality to set permissions and assign roles to users, groups, or applications.
  • 33. Azure Data Lake Storage 26/08/2019 33 Azure Data Lake Storage is a Hadoop-compatible data repository that can store any size or type of data. This storage service is available as Generation 1 (Gen1) or Generation 2 (Gen2). Data Lake Storage Gen1 users don't have to upgrade to Gen2, but they forgo some benefits. Data Lake Storage Gen2 users take advantage of Azure Blob storage, a hierarchical file system, and performance tuning that helps them process big-data analytics solutions. In Gen2, developers can access data through either the Blob API or the Data Lake file API. Gen2 can also act as a storage layer for a wide range of compute platforms, including Azure Databricks, Hadoop, and Azure HDInsight, but data doesn't need to be loaded into the platforms. Data Lake Storage is designed to store massive amounts of data for big-data analytics. For example, Contoso Life Sciences is a cancer research center that analyzes petabytes of genetic data, patient data, and records of related sample data. Data Lake Storage Gen2 reduces computation times, making the research faster and less expensive. The compute aspect that sits above this storage can vary. The aspect can include platforms like HDInsight, Hadoop, Cloudera, Azure Databricks, and Hortonworks. Here are the key features of Data Lake Storage: o Unlimited scalability o Hadoop compatibility o Security support for both access control lists (ACLs) o POSIX compliance o An optimized Azure Blob File System (ABFS) driver that's designed for big-data analytics o Zone-redundant storage o Geo-redundant storage
  • 34. Azure Data Lake Storage 26/08/2019 34 Data ingestion: To ingest data into your system, use Azure Data Factory, Apache Sqoop, Azure Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. To use the File Upload feature to import file sizes above 2 GB, use PowerShell or Visual Studio. AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB. Queries: In Data Lake Storage Gen1, data engineers query data by using the U-SQL language. In Gen 2, use the Azure Blob Storage API or the Azure Data Lake System (ADLS) API. Data security: Because Data Lake Storage supports Azure Active Directory ACLs, security administrators can control data access by using the familiar Active Directory Security Groups. Role-based access control (RBAC) is available in Gen1. Built-in security groups include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers. Enable the firewall to limit traffic to only Azure services. Data Lake Storage automatically encrypts data at rest, protecting data privacy.
  • 35. Azure Cosmos DB 26/08/2019 35 Azure Cosmos DB is a globally distributed, multimodel database. You can deploy it by using several API models: o SQL API o MongoDB API o Cassandra API o Gremlin API o Table API Because of the multimodel architecture of Azure Cosmos DB, you benefit from each model's inherent capabilities. For example, you can use MongoDB for semistructured data, Cassandra for wide columns, or Gremlin for graph databases. Using Gremlin, you could create graph entities and graph queries to traverse vertices and edges. This method can get you subsecond response times for complex scenarios like natural language processing (NLP) and social networking associations. When you move the database server from SQL, MongoDB, or Cassandra to Azure Cosmos DB, applications that are built on SQL, MongoDB, or Cassandra will continue to operate.
  • 36. Azure Cosmos DB 26/08/2019 36 When to use Azure Cosmos DB: Deploy Azure Cosmos DB when you need a NoSQL database of the supported API model, at planet scale, and with low latency performance. Currently, Azure Cosmos DB supports five-nines uptime (99.999 percent). It can support response times below 10 ms when it's provisioned correctly. Consider this example where Azure Cosmos DB helps resolve a business problem. Contoso is an e-commerce retailer based in Manchester, UK. The company sells children's toys. After reviewing Power BI reports, Contoso's managers notice a significant increase in sales in Australia. Managers review customer service cases in Dynamics 365 and see many Australian customer complaints that their site's shopping cart is timing out. Contoso's network operations manager confirms the problem. It's that the company's only data center is located in London. The physical distance to Australia is causing delays. Contoso applies a solution that uses the Microsoft Australia East datacenter to provide a local version of the data to users in Australia. Contoso migrates their on-premises SQL Database to Azure Cosmos DB by using the SQL API. This solution improves performance for Australian users. The data can be stored in the UK and replicated to Australia to improve throughput times.
  • 37. Azure Cosmos DB 26/08/2019 37 Key features: Azure Cosmos DB supports 99.999 percent uptime. You can invoke a regional failover by using programing or the Azure portal. An Azure Cosmos DB database will automatically fail over if there's a regional disaster. By using multimaster replication in Azure Cosmos DB, you can often achieve a response time of less than one second from anywhere in the world. Azure Cosmos DB is guaranteed to achieve a response time of less than 10 ms for reads and writes. To maintain the consistency of the data in Azure Cosmos DB, your engineering team should introduce a new set of consistency levels that address the unique challenges of planet-scale solutions. Consistency levels include strong, bounded staleness, session, consistent prefix, and eventual. Data ingestion: To ingest data into Azure Cosmos DB, use Azure Data Factory, create an application that writes data into Azure Cosmos DB through its API, upload JSON documents, or directly edit the document. Queries: As a data engineer, you can create stored procedures, triggers, and user-defined functions (UDFs). Or use the JavaScript query API. You'll also find other methods to query the other APIs within Azure Cosmos DB. For example, in the Data Explorer component, you can use the graph visualization pane. Data security: Azure Cosmos DB supports data encryption, IP firewall configurations, and access from virtual networks. Data is encrypted automatically. User authentication is based on tokens, and Azure Active Directory provides role-based security. Azure Cosmos DB meets many security compliance certifications, including HIPAA, FedRAMP, SOCS, and HITRUST.
  • 38. Azure SQL Database 26/08/2019 38 Azure SQL Database is a managed relational database service. It supports structures such as relational data and unstructured formats such as spatial and XML data. SQL Database provides online transaction processing (OLTP) that can scale on demand. You'll also find the comprehensive security and availability that you appreciate in Azure database services. When to use SQL Database: Use SQL Database when you need to scale up and scale down OLTP systems on demand. SQL Database is a good solution when your organization wants to take advantage of Azure security and availability features. Organizations that choose SQL Database also avoid the risks of capital expenditures and of increasing operational spending on complex on-premises systems. SQL Database can be more flexible than an on-premises SQL Server solution because you can provision and configure it in minutes. Even more, SQL Database is backed up by the Azure service-level agreement (SLA). Key features: SQL Database delivers predictable performance for multiple resource types, service tiers, and compute sizes. Requiring almost no administration, it provides dynamic scalability with no downtime, built-in intelligent optimization, global scalability and availability, and advanced security options. These capabilities let you focus on rapid app development and on speeding up your time to market. You no longer have to devote precious time and resources to managing virtual machines and infrastructure.
  • 39. Azure SQL Database 26/08/2019 39 Ingesting and processing data: SQL Database can ingest data through application integration from a wide range of developer SDKs. Allowed programming languages include .NET, Python, Java, and Node.js. Beyond applications, you can also ingest data through Transact-SQL (T- SQL) techniques and from the movement of data using Azure Data Factory. Queries: Use T-SQL to query the contents of a SQL Database. This method benefits from a wide range of standard SQL features to filter, order, and project the data into the form you need. Data security: SQL Database provides a range of built-in security and compliance features. These features help your application meet security and compliance requirements like these: o Advanced Threat Protection o SQL Database auditing o Data encryption o Azure Active Directory authentication o Multifactor authentication o Compliance certification
  • 40. SQL Data Warehouse 26/08/2019 40 Azure SQL Data Warehouse is a cloud-based enterprise data warehouse. It can process massive amounts of data and answer complex business questions. When to use SQL Data Warehouse: Data loads can increase the processing time for on-premises data warehousing solutions. Organizations that face this issue might look to a cloud-based alternative to reduce processing time and release business intelligence reports faster. But many organizations first consider scaling up on-premises servers. As this approach reaches its physical limits, they look for a solution on a petabyte scale that doesn't involve complex installations and configurations. Key features: SQL Data Warehouse uses massively parallel processing (MPP) to quickly run queries across petabytes of data. Because the storage is separated from the compute nodes, you can scale the compute nodes independently to meet any demand at any time. In SQL Data Warehouse, Data Movement Service (DMS) coordinates and transports data between compute nodes as necessary. But you can use a replicated table to reduce data movement and improve performance. SQL Data Warehouse supports two types of distributed tables: hash and round-robin. Use these tables to tune performance. Importantly, SQL Data Warehouse can also pause and resume the compute layer. This means you pay only for the computation you use. This capability is useful in data warehousing.
  • 41. SQL Data Warehouse 26/08/2019 41 Ingesting and processing data: SQL Data Warehouse uses the extract, load, and transform (ELT) approach for bulk data. SQL professionals are already familiar with bulk-copy tools such as bcp and the SQLBulkCopy API. Data engineers who work with SQL Data Warehouse will soon learn how quickly PolyBase can load data. PolyBase is a technology that removes complexity for data engineers. They take advantage of techniques for big-data ingestion and processing by offloading complex calculations to the cloud. Developers use PolyBase to apply stored procedures, labels, views, and SQL to their applications. You can also use Azure Data Factory to ingest and process data. Queries: As a data engineer, you can use the familiar Transact-SQL to query the contents of SQL Data Warehouse. This method takes advantage of a wide range of features, including the WHERE, ORDER BY, and GROUP BY clauses. Load data fast by using PolyBase with additional Transact-SQL constructs such as CREATE TABLE and AS SELECT. Data security: SQL Data Warehouse supports both SQL Server authentication and Azure Active Directory. For high-security environments, set up multifactor authentication. From a data perspective, SQL Data Warehouse supports security at the level of both columns and rows.
  • 42. Azure HDInsight 26/08/2019 42 Azure HDInsight provides technologies to help you ingest, process, and analyze big data. It supports batch processing, data warehousing, IoT, and data science. Key features: HDInsight is a low-cost cloud solution. It includes Apache Hadoop, Spark, Kafka, HBase, Storm, and Interactive Query. o Hadoop includes Apache Hive, HBase, Spark, and Kafka. Hadoop stores data in a file system (HDFS). Spark stores data in memory. This difference in storage makes Spark about 100 times faster. o HBase is a NoSQL database built on Hadoop. It's commonly used for search engines. HBase offers automatic failover. o Storm is a distributed real-time streamlining analytics solution. o Kafka is an open-source platform that's used to compose data pipelines. It offers message queue functionality, which allows users to publish or subscribe to real-time data streams.
  • 43. Azure HDInsight 26/08/2019 43 Ingesting data: As a data engineer, use Hive to run ETL operations on the data you're ingesting. Or orchestrate Hive queries in Azure Data Factory. Data processing: In Hadoop, use Java and Python to process big data. Mapper consumes and analyzes input data. It then emits tuples that Reducer can analyze. Reducer runs summary operations to create a smaller combined result set. Spark processes streams by using Spark Streaming. For machine learning, use the 200 preloaded Anaconda libraries with Python. Use GraphX for graph computations. Developers can remotely submit and monitor jobs from Spark. Storm supports common programming languages like Java, C#, and Python. Queries: In Hadoop supports Pig and HiveQL languages. In Spark, data engineers use Spark SQL. Data security: Hadoop supports encryption, Secure Shell (SSH), shared access signatures, and Azure Active Directory security.
  • 45. Azure Storage 26/08/2019 45 Microsoft Azure Storage is a managed service that provides durable, secure, and scalable storage in the cloud. Let's break these terms down. o Durable: Redundancy ensures that your data is safe in the event of transient hardware failures. You can also replicate data across datacenters or geographical regions for additional protection from local catastrophe or natural disaster. Data replicated in this way remains highly available in the event of an unexpected outage. o Secure: All data written to Azure Storage is encrypted by the service. Azure Storage provides you with fine-grained control over who has access to your data. o Scalable: Azure Storage is designed to be massively scalable to meet the data storage and performance needs of today's applications. o Managed: Microsoft Azure handles maintenance and any critical problems for you. A single Azure subscription can host up to 200 storage accounts, each of which can hold 500 TB of data. If you have a business case, you can talk to the Azure Storage team and get approval for up to 250 storage accounts in a subscription, which pushes your max storage up to 125 Petabytes!
  • 46. What is Azure Storage? 26/08/2019 46 Azure provides many ways to store your data. There are multiple database options like Azure SQL Server, Azure Cosmos DB, and Azure Table Storage. Azure offers multiple ways to store and send messages, such as Azure Queues and Event Hubs. You can even store loose files using services like Azure Files and Azure Blobs. Azure selected four of these data services and placed them together under the name Azure Storage. The four services are Azure Blobs, Azure Files, Azure Queues, and Azure Tables. The following illustration shows the elements of Azure Storage. https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need These four were given special treatment because they are all primitive, cloud-based storage services and are often used together in the same application.
  • 47. What is Azure Storage? 26/08/2019 47 Azure storage includes four types of data: o Blobs: A massively scalable object store for text and binary data. o Files: Managed file shares for cloud or on-premises deployments. o Queues: A messaging store for reliable messaging between application components. o Tables: A NoSQL store for schemaless storage of structured data. This service has been replaced by Azure Cosmos DB and will not be discussed here. All of these data types in Azure Storage are accessible from anywhere in the world over HTTP or HTTPS. Microsoft provides SDKs for Azure Storage in a variety of languages, as well as a REST API. You can also visually explore your data right in the Azure portal. https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need
  • 48. What is a storage account? 26/08/2019 48 A storage account is a container that groups a set of Azure Storage services together. Only data services from Azure Storage can be included in a storage account (Azure Blobs, Azure Files, Azure Queues, and Azure Tables). The following illustration shows a storage account containing several data services. https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/2-decide-how-many-storage-accounts-you-need Combining data services into a storage account lets you manage them as a group. The settings you specify when you create the account, or any that you change after creation, are applied to everything in the account. Deleting the storage account deletes all of the data stored inside it. A storage account is an Azure resource and is included in a resource group. The following illustration shows an Azure subscription containing multiple resource groups, where each group contains one or more storage accounts.
  • 49. Choose your account settings 26/08/2019 49 The storage account settings we've already covered apply to the data services in the account. Here, we will discuss the three settings that apply to the account itself, rather than to the data stored in the account: o Name: Each storage account has a name. The name must be globally unique within Azure, use only lowercase letters and digits, and be between 3 and 24 characters long. o Deployment model: A deployment model is the system Azure uses to organize your resources. The model defines the API that you use to create, configure, and manage those resources. Azure provides two deployment models: o Resource Manager: the current model that uses the Azure Resource Manager API (recommended) o Classic: a legacy offering that uses the Azure Service Management API o Account kind: Storage account kind is a set of policies that determine which data services you can include in the account and the pricing of those services. There are three kinds of storage accounts: o StorageV2 (general purpose v2): the current offering that supports all storage types and all of the latest features (recommended) o Storage (general purpose v1): a legacy kind that supports all storage types but may not support all features o Blob storage: a legacy kind that allows only block blobs and append blobs These settings impact how you manage your account and the cost of the services within it. https://docs.microsoft.com/en-us/learn/modules/create-azure-storage-account/3-choose-your-account-settings
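  For illustration, the Azure CLI command below creates a StorageV2 account under the Resource Manager deployment model; the account and resource group names are placeholders you would replace with your own.
  # Create a general-purpose v2 storage account (name: lowercase letters and digits, 3-24 characters)
  az storage account create \
    --name <storage-account> \
    --resource-group <resource-group> \
    --location westeurope \
    --sku Standard_LRS \
    --kind StorageV2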
  • 50. Azure storage accounts 26/08/2019 50 To access any of these services from an application, you have to create a storage account. The storage account provides a unique namespace in Azure to store and access your data objects. A storage account contains any blobs, files, queues, tables, and VM disks that you create under that account. You can create an Azure storage account using the Azure portal, Azure PowerShell, or Azure CLI. Azure Storage provides three distinct account options with different pricing and features supported. o General-purpose v2 (GPv2): General-purpose v2 (GPv2) accounts are storage accounts that support all of the latest features for blobs, files, queues, and tables. Pricing for GPv2 accounts has been designed to deliver the lowest per gigabyte prices. o General-purpose v1 (GPv1): General-purpose v1 (GPv1) accounts provide access to all Azure Storage services but may not have the latest features or the lowest per gigabyte pricing. For example, cool storage and archive storage are not supported in GPv1. Pricing is lower for GPv1 transactions, so workloads with high churn or high read rates may benefit from this account type. o Blob storage accounts: A legacy account type, blob storage accounts support all the same block blob features as GPv2, but they are limited to supporting only block and append blobs. Pricing is broadly similar to pricing for general-purpose v2 accounts.
  • 51. Blob storage 26/08/2019 51 Azure Blob storage is an object storage solution optimized for storing massive amounts of unstructured data, such as text or binary data. Blob storage is ideal for: o Serving images or documents directly to a browser, including full static websites. o Storing files for distributed access. o Streaming video and audio. o Storing data for backup and restoration, disaster recovery, and archiving. o Storing data for analysis by an on-premises or Azure-hosted service. Azure Storage supports three kinds of blobs: o Blob type: Description o Block blobs: Block blobs are used to hold text or binary files up to ~5 TB (50,000 blocks of 100 MB) in size. The primary use case for block blobs is the storage of files that are read from beginning to end, such as media files or image files for websites. They are named block blobs because files larger than 100 MB must be uploaded as small blocks, which are then consolidated (or committed) into the final blob. o Page blobs: Page blobs are used to hold random-access files up to 8 TB in size. Page blobs are used primarily as the backing storage for the VHDs used to provide durable disks for Azure Virtual Machines (Azure VMs). They are named page blobs because they provide random read/write access to 512-byte pages. o Append blobs: Append blobs are made up of blocks like block blobs, but they are optimized for append operations. These are frequently used for logging information from one or more sources into the same blob. For example, you might write all of your trace logging to the same append blob for an application running on multiple VMs. A single append blob can be up to 195 GB.
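  As a minimal sketch (the container, blob, and file names are hypothetical), uploading a local file with the Azure CLI creates a block blob:
  # Create a container, then upload a local file as a block blob
  az storage container create --account-name <storage-account> --name media
  az storage blob upload \
    --account-name <storage-account> \
    --container-name media \
    --name videos/intro.mp4 \
    --file ./intro.mp4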
  • 52. Files 26/08/2019 52 Azure Files enables you to set up highly available network file shares that can be accessed by using the standard Server Message Block (SMB) protocol. This means that multiple VMs can share the same files with both read and write access. You can also read the files using the REST interface or the storage client libraries. You can also associate a unique URL to any file to allow fine-grained access to a private file for a set period of time. File shares can be used for many common scenarios: o Storing shared configuration files for VMs, tools, or utilities so that everyone is using the same version. o Log files such as diagnostics, metrics, and crash dumps. o Shared data between on-premises applications and Azure VMs to allow migration of apps to the cloud over a period of time.
  • 53. Queues 26/08/2019 53 The Azure Queue service is used to store and retrieve messages. Queue messages can be up to 64 KB in size, and a queue can contain millions of messages. Queues are generally used to store lists of messages to be processed asynchronously. You can use queues to loosely connect different parts of your application together. For example, we could perform image processing on the photos uploaded by our users. Perhaps we want to provide some sort of face detection or tagging capability, so people can search through all the images they have stored in our service. We could use queues to pass messages to our image processing service to let it know that new images have been uploaded and are ready for processing. This sort of architecture would allow you to develop and update each part of the service independently.
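  A minimal sketch of that pattern with the Azure CLI, assuming a placeholder storage account and a hypothetical queue for image-processing jobs:
  # Create the queue, then enqueue a message for the image-processing service to pick up
  az storage queue create --account-name <storage-account> --name image-jobs
  az storage message put \
    --account-name <storage-account> \
    --queue-name image-jobs \
    --content "new-image:media/photo-001.jpg"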
  • 54. REST API endpoint 26/08/2019 54 In addition to access keys for authentication to storage accounts, your app will need to know the storage service endpoints to issue the REST requests. The REST endpoint is a combination of your storage account name, the data type, and a known domain. For example: Data type Example endpoint Blobs https://[name].blob.core.windows.net/ Queues https://[name].queue.core.windows.net/ Table https://[name].table.core.windows.net/ Files https://[name].file.core.windows.net/ The simplest way to handle access keys and endpoint URLs within applications is to use storage account connection strings. A connection string provides all needed connectivity information in a single text string. Azure Storage connection strings look similar to the example below but with the access key and account name of your specific storage account: DefaultEndpointsProtocol=https;AccountName={your-storage}; AccountKey={your-access-key}; EndpointSuffix=core.windows.net
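  Rather than assembling the endpoint and key by hand, the full connection string can be retrieved with the Azure CLI; the names below are placeholders.
  # Print the storage account connection string (account name, key, and endpoint suffix)
  az storage account show-connection-string \
    --name <storage-account> \
    --resource-group <resource-group> \
    --output tsv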
  • 55. 26/08/2019 55 Secure your Azure Storage account
  • 56. Azure Storage security features (1/3) 26/08/2019 56 Encryption at rest: All data written to Azure Storage is automatically encrypted by Storage Service Encryption (SSE) with a 256-bit Advanced Encryption Standard (AES) cipher. SSE automatically encrypts data when writing it to Azure Storage. When you read data from Azure Storage, Azure Storage decrypts the data before returning it. This process incurs no additional charges and doesn't degrade performance. It can't be disabled. For virtual machines (VMs), Azure lets you encrypt virtual hard disks (VHDs) by using Azure Disk Encryption. This encryption uses BitLocker for Windows images, and it uses dm-crypt for Linux. Azure Key Vault stores the keys automatically to help you control and manage the disk-encryption keys and secrets. So even if someone gets access to the VHD image and downloads it, they can't access the data on the VHD. Encryption in transit: Keep your data secure by enabling transport-level security between Azure and the client. Always use HTTPS to secure communication over the public internet. When you call the REST APIs to access objects in storage accounts, you can enforce the use of HTTPS by requiring secure transfer for the storage account. After you enable secure transfer, connections that use HTTP will be refused. This flag will also enforce secure transfer over SMB by requiring SMB 3.0 for all file share mounts.
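  Requiring secure transfer is a single account-level switch; a sketch with the Azure CLI and placeholder names:
  # Refuse HTTP connections; SMB mounts must use SMB 3.0 with encryption
  az storage account update \
    --name <storage-account> \
    --resource-group <resource-group> \
    --https-only true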
  • 57. Azure Storage security features (2/3) 26/08/2019 57 CORS support: You can store several website asset types in Azure Storage. These types include images and videos. To secure browser apps, you can lock GET requests down to specific domains. Azure Storage supports cross-domain access through cross-origin resource sharing (CORS). CORS uses HTTP headers so that a web application at one domain can access resources from a server at a different domain. By using CORS, web apps ensure that they load only authorized content from authorized sources. CORS support is an optional flag you can enable on Storage accounts. The flag adds the appropriate headers when you use HTTP GET requests to retrieve resources from the Storage account. Role-based access control: To access data in a storage account, the client makes a request over HTTP or HTTPS. Every request to a secure resource must be authorized. The service ensures that the client has the permissions required to access the data. You can choose from several access options; arguably the most flexible is role-based access. Azure Storage supports Azure Active Directory and role-based access control (RBAC) for both resource management and data operations. You can assign RBAC roles to a security principal or a managed identity for Azure resources and scope them to a subscription, a resource group, a storage account, or an individual container or queue. Use Azure Active Directory to authorize resource management operations, such as configuration; it is also supported for data operations on Blob and Queue storage.
  • 58. Azure Storage security features (3/3) 26/08/2019 58 Auditing access: Auditing is another part of controlling access. You can audit Azure Storage access by using the built-in Storage Analytics service. Storage Analytics logs every operation in real time, and you can search the Storage Analytics logs for specific requests. Filter based on the authentication mechanism, the success of the operation, or the resource that was accessed.
  • 59. Storage account keys 26/08/2019 59 Azure Storage accounts can create authorized apps in Active Directory to control access to the data in blobs and queues. This authentication approach is the best solution for apps that use Blob storage or Queue storage. For other storage models, clients can use a shared key, or shared secret. This authentication option is one of the easiest to use, and it supports blobs, files, queues, and tables. The client embeds the shared key in the HTTP Authorization header of every request, and the Storage account validates the key. For example, an application can issue a GET request against a blob resource; HTTP headers control the version of the REST API, the date, and the encoded shared key:
  GET https://myaccount.blob.core.windows.net/?restype=service&comp=stats
  x-ms-version: 2018-03-28
  Date: Wed, 23 Oct 2018 21:00:44 GMT
  Authorization: SharedKey myaccount:CY1OP3O3jGFpYFbTCBimLn0Xov0vt0khH/E5Gy0fXvg=
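  The account keys themselves can be listed and rotated with the Azure CLI, for example (placeholder names):
  # List both account keys, then regenerate the primary key to rotate it
  az storage account keys list --account-name <storage-account> --resource-group <resource-group>
  az storage account keys renew --account-name <storage-account> --resource-group <resource-group> --key primary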
  • 60. Shared access signatures (1/2) 26/08/2019 60 As a best practice, you shouldn't share storage account keys with external third-party applications. If these apps need access to your data, you'll need to secure their connections without using storage account keys. For untrusted clients, use a shared access signature (SAS). A shared access signature is a string that contains a security token that can be attached to a URI. Use a shared access signature to delegate access to storage objects and specify constraints, such as the permissions and the time range of access. You can give a customer a shared access signature token, for example, so they can upload pictures to a file system in Blob storage. Separately, you can give a web application permission to read those pictures. In both cases, you allow only the access that the application needs to do the task. Types of shared access signatures: You can use a service-level shared access signature to allow access to specific resources in a storage account. You'd use this type of shared access signature, for example, to allow an app to retrieve a list of files in a file system or to download a file. Use an account-level shared access signature to allow access to anything that a service-level shared access signature can allow, plus additional resources and abilities. For example, you can use an account-level shared access signature to allow the ability to create file systems.
  • 61. Shared access signatures (2/2) 26/08/2019 61 You'd typically use a shared access signature for a service where users read and write their data to your storage account. Accounts that store user data have two typical designs: Clients upload and download data through a front-end proxy service, which performs authentication. This front-end proxy service has the advantage of allowing validation of business rules. But if the service must handle large amounts of data or high-volume transactions, you might find it complicated or expensive to scale this service to match demand. A lightweight service authenticates the client as needed. Then it generates a shared access signature. After receiving the shared access signature, the client can access storage account resources directly. The shared access signature defines the client's permissions and access interval. The shared access signature reduces the need to route all data through the front-end proxy service. https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/4-shared-access-signatures
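  As a sketch, a service-level SAS granting read-only access to a container can be generated with the Azure CLI; the container name and expiry date are examples.
  # Generate a read-only, time-limited SAS token for the 'media' container
  az storage container generate-sas \
    --account-name <storage-account> \
    --name media \
    --permissions r \
    --expiry 2019-12-31T23:59Z \
    --output tsv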
  • 62. Control network access to your storage account 26/08/2019 62 By default, storage accounts accept connections from clients on any network. To limit access to selected networks, you must first change the default action. You can restrict access to specific IP addresses, ranges, or virtual networks. https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/5-control-network-access
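  A minimal sketch with the Azure CLI and placeholder names: deny traffic by default, then allow a single trusted IP address.
  # Deny access from all networks by default, then add an IP network rule
  az storage account update --name <storage-account> --resource-group <resource-group> --default-action Deny
  az storage account network-rule add --account-name <storage-account> --resource-group <resource-group> --ip-address 203.0.113.10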
  • 63. Advanced Threat Protection for Azure Storage 26/08/2019 63 Advanced Threat Protection, now in public preview, detects anomalies in account activity. It then notifies you of potentially harmful attempts to access your account. You don't have to be a security expert or manage security monitoring systems to take advantage of this layer of threat protection. Currently, Advanced Threat Protection for Azure Storage is available for the Blob service. Security alerts are integrated with Azure Security Center. The alerts are sent by email to subscription admins. https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/6-advanced-threat-protection
  • 64. Azure Data Lake Storage security features 26/08/2019 64 Azure Data Lake Storage Gen2 provides a first-class data lake solution that allows enterprises to pull together their data. It's built on Azure Blob storage, so it inherits all of the security features we've reviewed in this module. Along with role-based access control (RBAC), Azure Data Lake Storage Gen2 provides access control lists (ACLs) that are POSIX-compliant and that restrict access to only authorized users, groups, or service principals. It applies restrictions in a way that's flexible, fine-grained, and manageable. Azure Data Lake Storage Gen2 authenticates through Azure Active Directory OAuth 2.0 bearer tokens. This allows for flexible authentication schemes, including federation with Azure AD Connect and multifactor authentication that provides stronger protection than just passwords. More significantly, these authentication schemes are integrated into the main analytics services that use the data. These services include Azure Databricks, HDInsight, and SQL Data Warehouse. Management tools such as Azure Storage Explorer are also included. After authentication finishes, permissions are applied at the finest granularity to ensure the right level of authorization for an enterprise's big-data assets. The Azure Storage end-to-end encryption of data and transport layer protections complete the security shield for an enterprise data lake. The same set of analytics engines and tools can take advantage of these additional layers of protection, resulting in complete protection of your analytics pipelines.
  • 65. 26/08/2019 65 Store application data with Azure Blob storage
  • 66. What are blobs? (1/2) 26/08/2019 66 Blobs are "files for the cloud". Apps work with blobs in much the same way as they would work with files on a disk, like reading and writing data. However, unlike a local file, blobs can be reached from anywhere with an internet connection. Azure Blob storage is unstructured, meaning that there are no restrictions on the kinds of data it can hold. For example, a blob can hold a PDF document, a JPG image, a JSON file, video content, etc. Blobs aren't limited to common file formats — a blob could contain gigabytes of binary data streamed from a scientific instrument, an encrypted message for another application, or data in a custom format for an app you're developing. Blobs are usually not appropriate for structured data that needs to be queried frequently. They have higher latency than memory and local disk and don't have the indexing features that make databases efficient at running queries. However, blobs are frequently used in combination with databases to store non-queryable data. For example, an app with a database of user profiles could store profile pictures in blobs. Each user record in the database would include the name or URL of the blob containing the user's picture.
  • 67. What are blobs? (2/2) 26/08/2019 67 Blobs are used for data storage in many ways across all kinds of applications and architectures: o Apps that need to transmit large amounts of data using messaging system that supports only small messages. These apps can store data in blobs and send the blob URLs in messages. o Blob storage can be used like a file system for storing and sharing documents and other personal data. o Static web assets like images can be stored in blobs and made available for public download as if they were files on a web server. o Many Azure components use blobs behind the scenes. For example, Azure Cloud Shell stores your files and configuration in blobs, and Azure Virtual Machines uses blobs for hard-disk storage. Some apps will constantly create, update, and delete blobs as part of their work. Others will use a small set of blobs and rarely change them. Storage accounts, containers, and metadata: In Blob storage, every blob lives inside a blob container. You can store an unlimited number of blobs in a container and an unlimited number of containers in a storage account. Containers are "flat" — they can only store blobs, not other containers. Blobs and containers support metadata in the form of name-value string pairs. Your apps can use metadata for anything you like: a human-readable description of a blob's contents to be displayed by the application, a string that your app uses to determine how to process the blob's data, etc.
  • 68. Design a storage organization strategy (1/2) 26/08/2019 68 Blob name prefixes (virtual directories): Technically, containers are "flat" and do not support any kind of nesting or hierarchy. But if you give your blobs hierarchical names that look like file paths (such as finance/budgets/2017/q1.xls), the API's listing operation can filter results to specific prefixes. This allows you to navigate the list as if it was a hierarchical system of files and folders. This feature is often called virtual directories because some tools and client libraries use it to visualize and navigate Blob storage as if it was a file system. Each folder navigation triggers a separate call to list the blobs in that folder. Using names that are like file names for blobs is a common technique for organizing and navigating complex blob data.
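  For example, listing by prefix is how tools emulate folder navigation; the container and prefix below are hypothetical.
  # List only blobs whose names start with the 'finance/budgets/2017/' prefix
  az storage blob list \
    --account-name <storage-account> \
    --container-name reports \
    --prefix finance/budgets/2017/ \
    --output table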
  • 69. Design a storage organization strategy (2/2) 26/08/2019 69 Public access and containers as security boundaries: By default, all blobs require authentication to access. However, individual containers can be configured to allow public downloading of their blobs without authentication. This feature supports many use cases, such as hosting static website assets and sharing files. This is because downloading blob contents works the same way as reading any other kind of data over the web: you just point a browser or anything that can make a GET request at the blob URL. Enabling public access is important for scalability because data downloaded directly from Blob storage doesn't generate any traffic in your server-side app. Even if you don't immediately take advantage of public access or if you will use a database to control data access via your application, plan on using separate containers for data you want to be publicly available. Caution: Blobs in a container configured for public access can be downloaded without any kind of authentication or auditing by anyone who knows their storage URLs. Never put blob data in a public container that you don't intend to share publicly. In addition to public access, Azure has a shared access signature feature that allows fine-grained permissions control on containers. Precision access control enables scenarios that further improve scalability, so thinking about containers as security boundaries in general is a helpful guideline.
  • 70. Blob types 26/08/2019 70 There are three different kinds of blobs you can store data in: o Block blobs are composed of blocks of different sizes that can be uploaded independently and in parallel. Writing to a block blob involves uploading data to blocks and committing them to the blob. o Append blobs are specialized block blobs that support only appending new data (not updating or deleting existing data), but they're very efficient at it. Append blobs are great for scenarios like storing logs or writing streamed data. o Page blobs are designed for scenarios that involve random-access reads and writes. Page blobs are used to store the virtual hard disk (VHD) files used by Azure Virtual Machines, but they're great for any scenario that involves random access.
  • 71. 26/08/2019 72 Large Scale Data Processing with Azure Data Lake Storage Gen2 Many organizations have spent the last two decades building data warehouses and business intelligence (BI) solutions based on relational database systems. Many BI solutions have lost out on opportunities to store unstructured data due to cost and complexity in these types of data and databases. Suppose that you're a data engineering consultant doing work for Contoso. They're interested in using Azure to analyze all of their business data. In this role, you'll provide guidance on how Azure can enhance their existing business intelligence systems. You'll also offer advice about using Azure's storage capabilities to store large amounts of unstructured data to add value to their BI solution. Because of their needs, you plan to recommend Azure Data Lake Storage. Data Lake Storage provides a repository where you can upload and store huge amounts of unstructured data with an eye toward high-performance big data analytics.
  • 72. Azure Data Lake Storage Gen2 (1/2) 26/08/2019 73 A data lake is a repository of data that is stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure. Azure Data Lake Storage combines a file system with a storage platform to help you quickly identify insights into your data. Data Lake Storage Gen2 builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads. This integration enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the high-availability, security, and durability capabilities of Azure Storage. The variety and volume of data that is generated and analyzed today is increasing. Companies have multiple sources of data, from websites to Point of Sale (POS) systems, and more recently from social media sites to Internet of Things (IoT) devices. Each source provides an essential aspect of data that needs to be collected, analyzed, and potentially acted upon. Data Lake Storage Gen2 is designed to deal with this variety and volume of data at exabyte scale while securely handling hundreds of gigabytes of throughput. With this, you can use Data Lake Storage Gen2 as the basis for both real-time and batch solutions. Here is a list of additional benefits that Data Lake Storage Gen2 brings: Hadoop compatible access: A benefit of Data Lake Storage Gen2 is that you can treat the data as if it's stored in a Hadoop Distributed File System. With this feature, you can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure SQL Data Warehouse without moving the data between environments.
  • 73. Azure Data Lake Storage Gen2 (2/2) 26/08/2019 74 Security: Data Lake Storage Gen2 supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions. You can set permissions at a directory level or file level for the data stored within the data lake. This security is configurable through technologies such as Hive and Spark, or utilities such as Azure Storage Explorer. All data that is stored is encrypted at rest by using either Microsoft or customer-managed keys. Performance: Azure Data Lake Storage organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation. As a result, data processing requires less computational resources, reducing both the time and cost. Data redundancy: Data Lake Storage Gen2 takes advantage of the Azure Blob replication models that provide data redundancy in a single data center with locally redundant storage (LRS), or to a secondary region by using the Geo-redundant storage (GRS) option. This feature ensures that your data is always available and protected if catastrophe strikes.
  • 74. Compare Azure Data Lake Store to Azure Blob storage 26/08/2019 75 In Azure Blob storage, you can store large amounts of unstructured ("object") data, in a single hierarchy, also known as a flat namespace. You can access this data by using HTTP or HTTPS. Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace, which you enable when you create the storage account. Hierarchical namespaces organize blob data into directories and store metadata about each directory and the files within it. This structure allows operations, such as directory renames and deletes, to be performed in a single atomic operation. Flat namespaces, by contrast, require several operations proportionate to the number of objects in the structure. Hierarchical namespaces keep the data organized, which yields better storage and retrieval performance for an analytical use case and lowers the cost of analysis. Azure Blob storage vs. Azure Data Lake Storage: If you want to store data without performing analysis on the data, set the Hierarchical Namespace option to Disabled to set up the storage account as an Azure Blob storage account. You can also use blob storage to archive rarely used data or to store website assets such as images and media. If you are performing analytics on the data, set up the storage account as an Azure Data Lake Storage Gen2 account by setting the Hierarchical Namespace option to Enabled. Because Azure Data Lake Storage Gen2 is integrated into the Azure Storage platform, applications can use either the Blob APIs or the Azure Data Lake Storage Gen2 file system APIs to access data.
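  A sketch of creating a Data Lake Storage Gen2 account with the Azure CLI; the names are placeholders, and the hierarchical namespace flag (shown here as --hns) may appear as --enable-hierarchical-namespace depending on your CLI version.
  # A StorageV2 account with the hierarchical namespace enabled is a Data Lake Storage Gen2 account
  az storage account create \
    --name <storage-account> \
    --resource-group <resource-group> \
    --location westeurope \
    --sku Standard_LRS \
    --kind StorageV2 \
    --hns true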
  • 75. 26/08/2019 76 Examine uses for Azure Data Lake Storage Gen2
  • 76. Creating a modern data warehouse 26/08/2019 77 Imagine you're a Data Engineering consultant for Contoso. In the past, they've created an on-premises business intelligence solution that used a Microsoft SQL Server Database Engine, SQL Server Integration Services, SQL Server Analysis Services, and SQL Server Reporting Services to provide historical reports. They tried using the Analysis Services Data Mining component to create a predictive analytics solution to predict the buying behavior of customers. While this approach worked well with low volumes of data, it couldn't scale after more than a gigabyte of data was collected. Furthermore, they were never able to deal with the JSON data that a third-party application generated when a customer used the feedback module of the point of sale (POS) application. Contoso has turned to you for help with creating an architecture that can scale with the data needs that are required to create a predictive model and to handle the JSON data so that it's integrated into the BI solution. You suggest the following architecture. The architecture uses Azure Data Lake Storage at the center of the solution for a modern data warehouse. Integration Services is replaced by Azure Data Factory to ingest data into the Data Lake from a business application. This is the source for the predictive model that is built into Azure Databricks. PolyBase is used to transfer the historical data into a big data relational format that is held in Azure SQL Data Warehouse, which also stores the results of the trained model from Databricks. Azure Analysis Services provides the caching capability for SQL Data Warehouse to service many users and to present the data through Power BI reports.
  • 77. Creating a modern data warehouse 26/08/2019 78
  • 78. Advanced analytics for big data 26/08/2019 79 In this second use case, Azure Data Lake Storage plays an important role in providing a large-scale data store. Your skills are needed by AdventureWorks, which is a global seller of bicycles and cycling components through a chain of resellers and on the internet. As their customers browse the product catalog on their websites and add items to their baskets, a recommendation engine that is built into Azure Databricks recommends other products. They need to make sure that the results of their recommendation engine can scale globally. The recommendations are based on the web log files that are stored on the web servers and transferred to the Azure Databricks model hourly. The response time for the recommendation should be less than 1 ms. You propose the following architecture. In this solution, Azure Data Factory transfers terabytes of web logs from a web server to the Azure Data Lake on an hourly basis. This data is provided as features to the predictive model in Azure Databricks, which is then trained and scored. The results are distributed globally by using Azure Cosmos DB, which the real-time app (the AdventureWorks website) will use to provide recommendations to customers as they add products to their online baskets. To complete this architecture, PolyBase is used against the Data Lake to transfer descriptive data to the SQL Data Warehouse for reporting purposes. Azure Analysis Services provides the caching capability for SQL Data Warehouse to service many users and to display the data through Power BI reports.
  • 79. Advanced analytics for big data 26/08/2019 80
  • 80. Real-time analytical solutions 26/08/2019 81 To perform real-time analytical solutions, the ingestion phase of the architecture is changed for processing big data solutions. In this architecture, note the introduction of Apache Kafka for Azure HDInsight to ingest streaming data from an Internet of Things (IoT) device, although this could be replaced with Azure IoT Hub and Azure Stream Analytics. The key point is that the data is persisted in Data Lake Storage Gen2 to service other parts of the solution. In this use case, you are a Data Engineer for Trey Research, an organization that is working with a transport company to monitor the fleet of Heavy Goods Vehicles (HGV) that drive around Europe. Each HGV is equipped with sensor hardware that will continuously report metric data on the temperature, the speed, and the oil and brake solution levels of an HGV. When the engine is turned off, the sensor also outputs a file with summary information about a trip, including the mileage and elevation of a trip. A trip is a period in which the HGV engine is turned on and off. Both the real-time data and batch data is processed in a machine learning model to predict a maintenance schedule for each of the HGVs. This data is made available to the downstream application that third-party garage companies can use if an HGV breaks down anywhere in Europe. In addition, historical reports about the HGV should be visually presented to users. As a result, the following architecture is proposed. In this architecture, there are two ingestion streams. Azure Data Factory ingests the summary files that are generated when the HGV engine is turned off. Apache Kafka provides the real-time ingestion engine for the telemetry data. Both data streams are stored in Azure Data Lake Store for use in the future, but they are also passed on to other technologies to meet business needs. Both streaming and batch data are provided to the predictive model in Azure Databricks, and the results are published to Azure Cosmos DB to be used by the third-party garages. PolyBase transfers data from the Data Lake Store into SQL Data Warehouse where Azure Analysis Services creates the HGV reports by using Power BI.
  • 82. 26/08/2019 83 Work with relational data in Azure
  • 84. Why choose Azure SQL Database? 26/08/2019 85 Convenience Setting up SQL Server on a VM or on physical hardware requires you to know about hardware and software requirements. You'll need to understand the latest security best practices and manage operating system and SQL Server patches on a routine basis. You also need to manage backup and data retention issues yourself. With Azure SQL Database, we manage the hardware, software updates, and OS patches for you. All you specify is the name of your database and a few options. You'll have a running SQL database in minutes. You can bring up and tear down Azure SQL Database instances at your convenience. Azure SQL Database comes up fast and is easy to configure. You can focus less on configuring software and more on making your app great. Cost Because we manage things for you, there are no systems for you to buy, provide power for, or otherwise maintain. Azure SQL Database has several pricing options. These pricing options enable you to balance performance versus cost. You can start for just a few dollars a month. Scale You find that the amount of transportation logistics data you must store doubles every year. When running on-premises, how much excess capacity should you plan for? With Azure SQL Database, you can adjust the performance and size of your database on the fly when your needs change. Security Azure SQL Database comes with a firewall that's automatically configured to restrict connections from the Internet. You can "whitelist" IP addresses you trust. Whitelisting lets you use Visual Studio, SQL Server Management Studio, or other tools to manage your Azure SQL database.
  • 85. Azure SQL database 26/08/2019 86 When you create your first Azure SQL database, you also create an Azure SQL logical server. Think of a logical server as an administrative container for your databases. You can control logins, firewall rules, and security policies through the logical server. You can also override these policies on each database within the logical server. A logical server enables you to add more databases later and tune performance across all your databases. DTUs versus vCores: Azure SQL Database has two purchasing models: DTU and vCore. o DTUs: DTU stands for Database Transaction Unit and is a combined measure of compute, storage, and IO resources. Think of the DTU model as a simple, preconfigured purchase option. Because your logical server can hold more than one database, there's also the idea of eDTUs, or elastic Database Transaction Units. This option enables you to choose one price, but allow each database in the pool to consume fewer or greater resources depending on current load. o vCores: vCore gives you greater control over what compute and storage resources you create and pay for. While the DTU model provides fixed combinations of compute, storage, and IO resources, the vCore model enables you to configure resources independently. For example, with the vCore model you can increase storage capacity but keep the existing amount of compute and IO throughput. Your transportation and logistics prototype only needs one Azure SQL Database instance. You decide on the DTU option because it provides a good balance of compute, storage, and IO performance and is less expensive to get started.
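  As a sketch (the server name, resource group, and credentials are placeholders), the logical server and a DTU-based database could be created with the Azure CLI like this:
  # Create the logical server, then a Standard S0 (DTU model) database on it
  az sql server create \
    --name <server-name> \
    --resource-group <resource-group> \
    --location westeurope \
    --admin-user serveradmin \
    --admin-password "<password>"
  az sql db create \
    --resource-group <resource-group> \
    --server <server-name> \
    --name Logistics \
    --service-objective S0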
  • 87. Azure SQL database 26/08/2019 88 SQL elastic pools When you create your Azure SQL database, you can create a SQL elastic pool. SQL elastic pools relate to eDTUs. They enable you to buy a set of compute and storage resources that are shared among all the databases in the pool. Each database can use the resources they need, within the limits you set, depending on current load. What is collation? Collation refers to the rules that sort and compare data. Collation helps you define sorting rules when case sensitivity, accent marks, and other language characteristics are important. o Latin1_General refers to the family of Western European languages. o CP1 refers to code page 1252, a popular character encoding of the Latin alphabet. o CI means that comparisons are case insensitive. For example, "HELLO" compares equally to "hello". o AS means that comparisons are accent sensitive. For example, "résumé" doesn't compare equally to "resume".
  • 88. Authentication 26/08/2019 89 Authentication is the process of verifying an identity. This identity could be a user, a service running on a system, or a system itself (such as a virtual machine). Through the process of authentication, we ensure that the person or system is who they claim to be. SQL Database supports two types of authentication: SQL authentication and Azure Active Directory authentication. SQL authentication SQL authentication method uses a username and password. User accounts can be created in the master database and can be granted permissions in all databases on the server, or they can be created in the database itself (called contained users) and given access to only that database. When you created the logical server for your database, you specified a "server admin" login with a username and password. Using these credentials, you can authenticate to any database on that server as the database owner, or "dbo". Azure Active Directory authentication This authentication method uses identities managed by Azure Active Directory (AD) and is supported for managed and integrated domains. Use Azure AD authentication (integrated security) whenever possible. With Azure AD authentication, you can centrally manage the identities of database users and other Microsoft services in one central location. Central ID management provides a single place to manage database users and simplifies permission management. If you want to use Azure AD authentication, you must create another server admin called the "Azure AD admin," which is allowed to administer Azure AD users and groups. This admin can also perform all operations that a regular server admin can.
  • 89. Authorization 26/08/2019 90 Authorization refers to what an identity can do within an Azure SQL Database. This is controlled by permissions granted directly to the user account and/or database role memberships. A database role is used to group permissions together to ease administration, and a user is added to a role to be granted the permissions the role has. These permissions can grant things such as the ability to log in to the database, the ability to read a table, and the ability to add and remove columns from a database. As a best practice, you should grant users the least privileges necessary. The process of granting authorization to both SQL and Azure AD users is the same. In our example here, the server admin account you are connecting with is a member of the db_owner role, which has authority to do anything within the database.
  • 90. Advanced Data Security for Azure SQL Database 26/08/2019 91 Advanced Data Security (ADS) provides a set of advanced SQL security capabilities, including data discovery & classification, vulnerability assessment, and Advanced Threat Protection. o Data discovery & classification (currently in preview) provides capabilities built into Azure SQL Database for discovering, classifying, labeling & protecting the sensitive data in your databases. It can be used to provide visibility into your database classification state, and to track the access to sensitive data within the database and beyond its borders. o Vulnerability assessment is an easy to configure service that can discover, track, and help you remediate potential database vulnerabilities. It provides visibility into your security state, and includes actionable steps to resolve security issues, and enhance your database fortifications. o Advanced Threat Protection detects anomalous activities indicating unusual and potentially harmful attempts to access or exploit your database. It continuously monitors your database for suspicious activities, and provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns. Advanced Threat Protection alerts provide details of the suspicious activity and recommend action on how to investigate and mitigate the threat. https://docs.microsoft.com/en-us/learn/modules/secure-your-azure-sql-database/5-monitor-your-database
  • 91. Connect with Azure Cloud Shell 26/08/2019 92 You can use Visual Studio, SQL Server Management Studio, Azure Data Studio, or other tools to manage your Azure SQL database. The az commands you'll run require the name of your resource group and the name of your Azure SQL logical server. To save typing, run this az configure command to specify them as default values. Replace <server-name> with the name of your Azure SQL logical server:
  > az configure --defaults group=Learn-9b1284ba-506d-42d1-94cb-82e3d1d149af sql-server=<server-name>
  Run az sql db list to list all databases on your Azure SQL logical server:
  > az sql db list
  Run the az sql db show-connection-string command to get the connection string to the Logistics database in a format that sqlcmd can use; it returns output such as:
  "sqlcmd -S tcp:contoso-1.database.windows.net,1433 -d Logistics -U <username> -P <password> -N -l 30"
  Run the sqlcmd statement from the output of the previous step to create an interactive session. Remove the surrounding quotes and replace <username> and <password> with the username and password you specified when you created your database:
  sqlcmd -S tcp:contoso-1.database.windows.net,1433 -d Logistics -U martina -P "password1234$" -N -l 30
  • 92. Dynamic data masking 26/08/2019 93 You might have noticed when we ran our query in the previous unit that some of the information in the database is sensitive; there are phone numbers, email addresses, and other information that we may not want to fully display to everyone with access to the data. Maybe we don't want our users to be able to see the full phone number or email address, but we'd still like to make a portion of the data available for customer service representatives to identify a customer. By using the dynamic data masking feature of Azure SQL Database, we can limit the data that is displayed to the user. Dynamic data masking is a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed. Data masking rules consist of the column to apply the mask to, and how the data should be masked. You can create your own masking format, or use one of the standard masks such as: o Default value, which displays the default value for that data type instead. o Credit card value, which only shows the last four digits of the number, converting all other numbers to lower case x’s. o Email, which hides the domain name and all but the first character of the email account name. o Number, which specifies a random number between a range of values. For example, on the credit card expiry month and year, you could select random months from 1 to 12 and set the year range from 2018 to 3000. o Custom string. This allows you to set the number of characters exposed from the start of the data, the number of characters exposed from the end of the data, and the characters to repeat for the remainder of the data. When querying the columns, database administrators will still see the original values, but non-administrators will see the masked values. You can allow other users to see the non-masked versions by adding them to the SQL users excluded from masking list.
  • 93. Azure SQL Database Documentation 26/08/2019 94
  • 95. Azure Database for PostgreSQL 26/08/2019 96 Azure Database for PostgreSQL is a relational database service in the Microsoft cloud. The server software is based on the community version of the open-source PostgreSQL database engine. Your familiarity with tools and expertise with PostgreSQL is applicable when using Azure Database for PostgreSQL. Moreover, Azure Database for PostgreSQL delivers the following benefits: o Built-in high availability compared to on-premises resources. There is no additional configuration, replication, or cost required to make sure your applications are always available. o Simple and flexible pricing. You have predictable performance based on a selected pricing tier choice that includes software patching, automatic backups, monitoring, and security. o Scale up or down as needed within seconds. You can scale compute or storage independently as needed, to make sure you adapt your service to match usage. o Adjustable automatic backups and point-in-time-restore for up to 35 days. o Enterprise-grade security and compliance to protect sensitive data at-rest and in-motion that covers data encryption on disk and SSL encryption between client and server communication.
  • 96. Azure Database for PostgreSQL 26/08/2019 97 What is an Azure Database for PostgreSQL server? The PostgreSQL server is a central administration point for one or more databases. The PostgreSQL service in Azure is a managed resource that provides performance guarantees, and provides access and features at the server level. An Azure Database for PostgreSQL server is the parent resource for a database. A resource is a manageable item that's available through Azure. Creating this resource allows you to configure your server instance. What is an Azure Database for PostgreSQL server resource? An Azure Database for PostgreSQL server resource is a container with strong lifetime implications for your server and databases. If the server resource is deleted, all databases are also deleted. Keep in mind that all resources belonging to the parent are hosted in the same region. The server resource name is used to define the server endpoint name. For example, if the resource name is mypgsqlserver, then the server name becomes mypgsqlserver.postgres.database.azure.com. The server resource also provides the connection scope for management policies that apply to its database. For example: login, firewall, users, roles, and configuration. Just like the open-source version of PostgreSQL, the server is available in several versions and allows for the installation of extensions. You'll choose which server version to install.
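  A minimal sketch of provisioning such a server with the Azure CLI, assuming placeholder names and a General Purpose (Gen 5, 2 vCores) tier:
  # Create the PostgreSQL server; the endpoint becomes <server-name>.postgres.database.azure.com
  az postgres server create \
    --name <server-name> \
    --resource-group <resource-group> \
    --location westeurope \
    --admin-user pgadmin \
    --admin-password "<password>" \
    --sku-name GP_Gen5_2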
  • 97. 26/08/2019 98 Scale multiple Azure SQL Databases with SQL elastic pools
  • 98. SQL elastic pools 26/08/2019 99 SQL elastic pools are a resource allocation service used to scale and manage the performance and cost of a group of Azure SQL databases. Elastic pools allow you to purchase resources for the group. You set the amount of resources available to the pool, add databases to the pool, and set minimum and maximum resource limits for the databases within the pool. The pool resource requirements are set based on the overall needs of the group. The pool allows the databases within the pool to share the allocated resources. SQL elastic pools are used to manage the budget and performance of multiple SQL databases. When to use an elastic pool? SQL elastic pools are ideal when you have several SQL databases that have a low average utilization but have infrequent but high utilization spikes. In this scenario, you can allocate enough capacity in the pool to manage the spikes for the group but the total resources can be less than the sum of all of the peak demand of all of the databases. Since the spikes are infrequent, a spike from one database will be unlikely to impact the capacity of the other databases in the pool. How many databases to add to a pool? The general guidance is, if the combined resources you would need for individual databases to meet capacity spikes is more than 1.5 times the capacity required for the elastic pool, then the pool will be cost effective. At a minimum, it is recommended to add at least two S3 databases or fifteen S0 databases to a single pool for it to have potential cost savings. Depending on the performance tier, you can add up to 100 or 500 databases to a single pool.
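  A sketch of creating a Standard-tier pool with per-database eDTU limits follows; the pool and server names are placeholders, and parameter names can differ slightly between CLI versions.
  # A 100 eDTU pool in which each database may use between 10 and 50 eDTUs
  az sql elastic-pool create \
    --resource-group <resource-group> \
    --server <server-name> \
    --name <pool-name> \
    --edition Standard \
    --dtu 100 \
    --db-dtu-min 10 \
    --db-dtu-max 50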
  • 99. 26/08/2019 100 Azure SQL Data Warehouse
  • 101. The types of data warehousing solutions 26/08/2019 102 There are three common types of data warehouse solutions: o Enterprise data warehouse: An enterprise data warehouse (EDW) is a centralized data store that provides analytics and decision support services across the entire enterprise. Data flows into the warehouse from multiple transactional systems, relational databases, and other data sources on a periodic basis. The stored data is used for historical and trend analysis reporting. The data warehouse acts as a central repository for many subject areas. It contains the "single source of truth." o Data mart: A data mart serves the same purpose as a data warehouse, but it's designed for the needs of a single team or business unit, like sales or human resources. A data mart is smaller and more focused, and it tends to serve a single subject area. o Operational data store: An operational data store (ODS) is an interim store that's used to integrate real-time or near real-time data from multiple sources for additional operations on the data. Because the data is refreshed in real time, it's widely preferred for performing routine activities like storing the records of employees or aggregating data from multiple sites for real-time reporting.
  • 102. Designing a data warehouse solution 26/08/2019 103 Data warehouse solutions can be classified by their technical relational constructs and the methodologies that are used to define them. There are two typical architectural approaches used to design data warehouses: Bottom-Up Architectural Design, by Ralph Kimball: Bottom-up architectural design is based on connected data marts. You start creating your data warehouse by building individual data marts for departmental subject areas. Then, you connect the data marts via a data warehouse bus by using conforming dimensions. The core building blocks of this approach are dependent on the star schema model that provides the multidimensional OLAP cube capabilities for data analysis within a relational database and SQL. The benefit of this approach is that you can start small. You gradually build up the larger data warehouse through relatively smaller investments over a period of time. This option provides a quicker path to return on investment (ROI) in the solution. Top-Down Architectural Design, by Bill Inmon: The top-down architectural approach is about creating a single, integrated normalized warehouse. In a normalized warehouse, the internal relational constructs follow the rules of normalization. This architectural style supports transactional integrity because the data comes directly from the OLTP source systems. This approach can provide more consistent and reliable data, but it requires rigorous design, development, and testing. This approach also requires substantial financial investments early on, and the ROI doesn't start until the entire data warehouse becomes fully functional.
  • 104. SQL Data Warehouse 26/08/2019 105 Azure SQL Data Warehouse is a cloud-based enterprise data warehouse (EDW) that uses massively parallel processing (MPP) to run complex queries across petabytes of data quickly. SQL Data Warehouse is the most appropriate solution when you need to keep historical data separate from source transaction systems for performance reasons. The source data might come from other storage mediums, such as network shares, Azure Storage blobs, or a data lake. SQL Data Warehouse stores incoming data in relational tables that use columnar storage. The format significantly reduces data storage costs and improves query performance. After information is stored in SQL Data Warehouse, you can run analytics on a massive scale. Analysis queries in SQL Data Warehouse finish in a fraction of the time it takes to run them in a traditional database system. SQL Data Warehouse is a key component in creating end-to-end relational big data solutions in the cloud. It allows data to be ingested from a variety of data sources, and it leverages a scale-out architecture to distribute computational processing of data across a large cluster of nodes. Organizations can use SQL Data Warehouse to: o Independently size compute power, regardless of storage needs. o Grow or shrink compute power without moving data. o Pause compute capacity while leaving data intact, so they pay only for storage. o Resume compute capacity during operational hours. These advantages are possible because computation and storage are decoupled in the MPP architecture.
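  For example, pausing and resuming compute is exposed directly in the Azure CLI; the server and warehouse names below are placeholders.
  # Pause compute (you pay only for storage while paused), then resume it for working hours
  az sql dw pause --resource-group <resource-group> --server <server-name> --name <warehouse-name>
  az sql dw resume --resource-group <resource-group> --server <server-name> --name <warehouse-name>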
  • 105. Massively parallel processing (MPP) concepts (1/2) 26/08/2019 106 Azure SQL Data Warehouse separates computation from the underlying storage, so you can scale your computing power independently of data storage. To do this, CPU, memory, and I/O are abstracted and bundled into units of compute scale called data warehouse units (DWUs). A DWU represents an abstract, normalized measure of compute resources and performance. By changing the service level, you alter the number of DWUs that are allocated to the system. In turn, the performance and cost of the system are adjusted. To achieve higher performance, increase the number of DWUs. This also increases associated costs. To achieve a lower cost, reduce the number of DWUs. This lowers the performance. Storage and compute costs are billed separately, so changing the number of DWUs doesn't affect storage costs. SQL Data Warehouse uses a node-based architecture. Applications connect and issue T-SQL commands to a control node, which is the single point of entry for the data warehouse. The control node runs the massively parallel processing (MPP) engine, which optimizes queries for parallel processing. Then, it passes operations to compute nodes to do their work in parallel. Compute nodes store all user data in Azure Storage and run parallel queries. The Data Movement Service (DMS) is a system-level internal service that moves data across nodes as necessary to run queries in parallel and return accurate results. https://docs.microsoft.com/en-us/learn/modules/design-azure-sql-data-warehouse/4-massively-parallel-processing
• 106. Massively parallel processing (MPP) concepts (2/2) 26/08/2019 107
Control node: The control node is the brain of the data warehouse. It's the front end that interacts with all applications and connections. The MPP engine runs on the control node to optimize and coordinate parallel queries. When you submit a T-SQL query to the SQL data warehouse, the control node transforms it into queries that run against each distribution in parallel.
Compute nodes: The compute nodes provide the computational power. Distributions map to compute nodes for processing. As you pay for more compute resources, SQL Data Warehouse remaps the distributions to the available compute nodes. The number of compute nodes ranges from 1 to 60 and is determined by the service level for the data warehouse.
Data Movement Service: DMS is the data transport technology that coordinates data movement between compute nodes. When SQL Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller queries runs on one of the underlying data distributions. A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. Some queries require data movement across nodes to ensure that parallel queries return accurate results. When data movement is required, DMS ensures that the correct data gets to the correct location.
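To see this node and distribution layout from inside the warehouse, you can query the system views; a minimal sketch (the exact output depends on your service level):

-- List the control and compute nodes of the data warehouse
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes;

-- Show how the 60 distributions are mapped to compute nodes
SELECT distribution_id, pdw_node_id
FROM sys.pdw_distributions;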
• 107. Connect to Azure SQL Data Warehouse 26/08/2019 108
A database connection string identifies the protocol, URL, port, and security options that are used to communicate with the database. The client uses this information to establish a network connection to the database. Select Show database connection on the Overview page to get the connection string in a variety of framework formats:
o ADO.NET
o JDBC
o ODBC
o PHP
For example, here's the general shape of the connection string for ADO.NET. Notice that the connection string identifies the protocol as TCP/IP, includes the URL, and uses port 1433, which is the default in this case. TCP:1433 is a well-known public port. After your SQL Data Warehouse server's name is made public, it might invite denial-of-service (DoS) attacks. To protect the server from such attacks, configure the Azure firewall to restrict network access to specific IP addresses. https://docs.microsoft.com/en-us/learn/modules/query-azure-sql-data-warehouse/3-connect-dw-using-ssms
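The string itself isn't reproduced on the slide; a generic ADO.NET template for a SQL Data Warehouse database looks roughly like this, with the server name, database name, and credentials as placeholders:

Server=tcp:<your_server_name>.database.windows.net,1433;Database=DemoDW;User ID=<your_username>;Password=<your_password>;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;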
• 108. Table geometries 26/08/2019 109
Azure SQL Data Warehouse uses Azure Storage to keep user data safe. Because data is stored and managed by Azure Storage, SQL Data Warehouse charges separately for storage consumption. The data is sharded into distributions to optimize system performance. When you define a table, you choose which sharding pattern to use to distribute the data. SQL Data Warehouse supports these sharding patterns:
o Hash
o Round-robin
o Replicated
You can use the following guidelines to determine which pattern is most suitable for your scenario:
Replicated
o Great fit for: small-dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression).
o Watch out if: many write transactions are on the table (insert/update/delete); you change DWU provisioning frequently; you use only 2-3 columns but your table has many columns; you index a replicated table.
Round-robin (default)
o Great fit for: temporary/staging tables; tables with no obvious joining key or good candidate column.
o Watch out if: performance is slow due to data movement.
Hash
o Great fit for: fact tables; large-dimension tables.
o Watch out if: the distribution key can't be updated.
• 109. Table geometries 26/08/2019 110
Here are some tips that can help you choose a strategy:
o Start with round-robin, but aspire to a hash-distribution strategy to take advantage of the massively parallel processing (MPP) architecture.
o Make sure that common hash keys have the same data format.
o Don't distribute on a varchar column.
o Use a common hash key between dimension tables and the associated fact table so that join operations can be hash-distributed.
o Use sys.dm_pdw_request_steps to analyze the data movements behind queries and monitor how long broadcast and shuffle operations take; this is helpful when you review your distribution strategy (see the example query below).
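For example, a quick way to inspect the steps behind a recently run query is sketched below; the request ID is a placeholder you would look up first:

-- Find the request ID of a recently submitted query
SELECT TOP 10 request_id, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;

-- Inspect the steps of that request; look for BroadcastMoveOperation or
-- ShuffleMoveOperation steps and how long each one took
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'   -- placeholder: use a request_id from the query above
ORDER BY step_index;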
• 110. 26/08/2019 111 Table geometries
Hash-distributed tables
A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables. To shard data into a hash-distributed table, SQL Data Warehouse uses a hash function to assign each row to one distribution deterministically. In the table definition, one of the columns is designated as the distribution column. The hash function uses the values in the distribution column to assign each row to a distribution. Here's an example of a CREATE TABLE statement that defines a hash distribution:

CREATE TABLE [dbo].[EquityTimeSeriesData](
    [Date] [varchar](30) ,
    [BookId] [decimal](38, 0) ,
    [P&L] [decimal](31, 7) ,
    [VaRLower] [decimal](31, 7)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX ,
    DISTRIBUTION = HASH([P&L])
)
;
• 111. 26/08/2019 112 Table geometries
Round-robin distributed tables
A round-robin table is the most straightforward table to create, and it delivers fast performance when used as a staging table for loads. A round-robin distributed table distributes data evenly across the table but without additional optimization. A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. Loading data into a round-robin table is quick, but query performance is often better with hash-distributed tables: joins on round-robin tables require reshuffling data, which takes more time. Here's an example of a CREATE TABLE statement that defines a round-robin distribution:

CREATE TABLE [dbo].[Dates](
    [Date] [datetime2](3) ,
    [DateKey] [decimal](38, 0) ,
    ..
    ..
    [WeekDay] [nvarchar](100) ,
    [Day Of Month] [decimal](38, 0)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX ,
    DISTRIBUTION = ROUND_ROBIN
)
;
• 112. 26/08/2019 113 Table geometries
Replicated tables
A replicated table provides the fastest query performance for small tables. A replicated table caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables work best for small tables: extra storage is required, and additional overhead is incurred when writing data, which makes large tables impractical. Here's an example of a CREATE TABLE statement that defines a replicated distribution:

CREATE TABLE [dbo].[BusinessHierarchies](
    [BookId] [nvarchar](250) ,
    [Division] [nvarchar](100) ,
    [Cluster] [nvarchar](100) ,
    [Desk] [nvarchar](100) ,
    [Book] [nvarchar](100) ,
    [Volcker] [nvarchar](100) ,
    [Region] [nvarchar](100)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX ,
    DISTRIBUTION = REPLICATE
)
;
  • 113. 26/08/2019 114 Import data into Azure SQL Data Warehouse by using PolyBase
• 114. Introduction to PolyBase 26/08/2019 115
A key feature of Azure SQL Data Warehouse is that you pay for only the processing you need. You can decide how much parallelism is needed for your work, and you can pause the compute resources when they're not in use. In this way, you pay for only the CPU time you use. Azure SQL Data Warehouse supports many loading methods, including non-PolyBase options such as BCP and the SQL Bulk Copy API. The fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob storage, Hadoop, or Azure Data Lake Store via the Transact-SQL language. The following architecture diagram shows how loading is achieved: each HDFS bridge of the Data Movement Service (DMS) on every compute node connects to an external resource such as Azure Blob storage, and PolyBase then bidirectionally transfers data between SQL Data Warehouse and the external resource to provide fast load performance. https://docs.microsoft.com/en-us/learn/modules/import-data-into-asdw-with-polybase/2-introduction-to-polybase
  • 115. Introduction to PolyBase 26/08/2019 116 PolyBase can read data from several file formats and data sources. Before you upload your data into Azure SQL Data Warehouse, you must prepare the source data into an acceptable format for PolyBase. These formats include: o Comma-delimited text files (UTF-8 and UTF-16). o Hadoop file formats, such as RC files, Optimized Row Columnar (ORC) files, and Parquet files. o Gzip and Snappy compressed files. If the data is coming from a relational database, such as Microsoft SQL Server, pull the data and then load it into an acceptable data store, such as an Azure Blob storage account. Tools such as SQL Server Integration Services can ease this transfer process.
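For instance, a Snappy-compressed Parquet source could be described with an external file format along these lines (the format name ParquetSnappy is arbitrary):

-- External file format for Snappy-compressed Parquet files
CREATE EXTERNAL FILE FORMAT ParquetSnappy
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);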
  • 116. Retrieve the URL and access key for the storage account 26/08/2019 117 Clients and applications must be authenticated to connect to an Azure Blob storage account. There are several ways to do this. The easiest approach for trusted applications is to use the storage key. Since we're managing the import process, let's use this approach. We need two pieces of information to connect PolyBase to the Azure Storage account: o The URL of the storage account o The private storage key
• 117. 26/08/2019 118
1. Create an import database (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
The first step in using PolyBase is to create a database-scoped credential that secures the credentials to the blob storage. Create a master key first, and then use this key to encrypt the database-scoped credential named AzureStorageCredential. Paste the following code into the query window. Replace the SECRET value with the access key you retrieved in the previous exercise.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'DemoDwStorage',
    SECRET = 'THE-VALUE-OF-THE-ACCESS-KEY' -- put key1's value here
;

Select Run to run the query. It should report Query succeeded: Affected rows: 0.
• 118. 26/08/2019 119
2. Create an external data source connection (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
Use the database-scoped credential to create an external data source named AzureStorage. Note that the location URL points to the container named data-files that you created in the blob storage. The type Hadoop is used for both Hadoop-based and Azure Blob storage-based access. Paste the following code into the query window. Replace the LOCATION value with your correct value from the previous exercise.

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://data-files@demodwstorage.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

Select Run to run the query. It should report Query succeeded: Affected rows: 0.
• 119. 26/08/2019 120
3. Define the import file format (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
Define the external file format named TextFile. This name indicates to PolyBase that the format of the text file is DelimitedText and the field terminator is a comma. Paste the following code into the query window:

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Select Run to run the query. It should report Query succeeded: Affected rows: 0.
• 120. 26/08/2019 121
4. Create a temporary table (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
Create an external table named dbo.Temp with the column definition for your table. At the bottom of the query, use a WITH clause to call the data source definition named AzureStorage and the file format named TextFile, both defined previously. The location denotes that the files for the load are in the root folder of the data source. The table definition must match the fields defined in the input file. There are 12 defined columns, with data types that match the input file data.
Note: External tables hold only metadata in the data warehouse; the data itself stays in the external storage and isn't persisted to the warehouse's own disks. External tables can be queried like any other table.

-- Create a temp table to hold the imported data
CREATE EXTERNAL TABLE dbo.Temp (
    [Date] datetime2(3) NULL,
    [DateKey] decimal(38, 0) NULL,
    [MonthKey] decimal(38, 0) NULL,
    [Month] nvarchar(100) NULL,
    [Quarter] nvarchar(100) NULL,
    [Year] decimal(38, 0) NULL,
    [Year-Quarter] nvarchar(100) NULL,
    [Year-Month] nvarchar(100) NULL,
    [Year-MonthKey] nvarchar(100) NULL,
    [WeekDayKey] decimal(38, 0) NULL,
    [WeekDay] nvarchar(100) NULL,
    [Day Of Month] decimal(38, 0) NULL
)
WITH (
    LOCATION='/',
    DATA_SOURCE=AzureStorage,
    FILE_FORMAT=TextFile
);
• 121. 26/08/2019 122
5. Create a destination table (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
Create a physical table in the SQL Data Warehouse database. In the following example, you create a table named dbo.StageDate. The table has a clustered columnstore index defined on all the columns. It uses a ROUND_ROBIN table geometry by design, because round-robin is the best table geometry to use for loading data.

-- Load the data from Azure Blob storage to SQL Data Warehouse
CREATE TABLE [dbo].[StageDate]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[Temp];
• 122. 26/08/2019 123
6. Add statistics onto columns to improve query performance (Import data from blob storage to Azure SQL Data Warehouse by using PolyBase)
As an optional step, create statistics on columns that feature in queries to improve query performance against the table.

-- Create statistics on the new data
CREATE STATISTICS [DateKey] on [StageDate] ([DateKey]);
CREATE STATISTICS [Quarter] on [StageDate] ([Quarter]);
CREATE STATISTICS [Month] on [StageDate] ([Month]);

You've loaded your first staging table in Azure SQL Data Warehouse. From here, you can write further Transact-SQL queries to perform transformations into dimension and fact tables. Try it out by querying the StageDate table in the query explorer or in another query tool. Refresh the view on the left to see the new table or tables that you created. Reuse the previous steps in a persistent SQL script to load additional data, as necessary.
  • 123. Enable data encryption 26/08/2019 124 Encrypt data at rest: Data encryption at rest helps protect data in your data warehouse and satisfies a critical compliance requirement. Azure SQL Data Warehouse provides transparent data-at-rest encryption capabilities without affecting the client applications that connect to it. https://docs.microsoft.com/en-us/learn/modules/data-warehouse-security/2-azure-dw-security-control
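Transparent data encryption can be enabled per database from the Azure portal or with T-SQL; a minimal sketch, assuming a warehouse named DemoDW:

-- Run against the master database of the logical server; DemoDW is a placeholder name.
ALTER DATABASE DemoDW
SET ENCRYPTION ON;

-- Verify that encryption is enabled (is_encrypted = 1)
SELECT [name], is_encrypted
FROM sys.databases
WHERE [name] = 'DemoDW';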
• 124. Encrypt data in transit in your application 26/08/2019 125
Encrypting data in transit helps prevent man-in-the-middle attacks. To encrypt data in transit, specify encrypt=true in the connection string in your client applications, as follows. This ensures that all data sent between your client application and SQL Data Warehouse is encrypted with SSL.

String connectionURL =
    "jdbc:sqlserver://<your_azure_sql_datawarehouse_fqdn>:1433;" +
    "databaseName=DemoDW;user=your_username;password=your_password;" +
    "encrypt=true";
• 125. The difference between a login and a user 26/08/2019 126
Logins provide access to a server; users provide access to the databases within it.
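A minimal sketch of the distinction (the login, user, and password shown are placeholders): the login is created in the server's master database, and a user mapped to that login is created in each database it should be able to access.

-- In the master database: create a server-level login
CREATE LOGIN LoaderLogin WITH PASSWORD = 'aStrong!Passw0rd';

-- In the DemoDW database: create a database user mapped to that login
CREATE USER LoaderUser FOR LOGIN LoaderLogin;

-- Optionally grant the user a database role
EXEC sp_addrolemember 'db_datareader', 'LoaderUser';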
• 127. 26/08/2019 128
Data streams
In the context of analytics, data streams are event data generated by sensors or other sources that can be analyzed by another technology. Analyzing a data stream is typically done to measure the state change of a component or to capture information on an area of interest. The intent is to:
o Continuously analyze data to detect issues and understand or respond to them.
o Understand component or system behavior under various conditions to fuel further enhancements of said component or system.
o Trigger specific actions when certain thresholds are identified.
In today's world, data streams are ubiquitous. Companies can harness the latent knowledge in data streams to improve efficiencies and further innovation. Examples of use cases that analyze data streams include:
o Stock market trends.
o Monitoring data of water pipelines and electrical transmission and distribution systems by utility companies.
o Mechanical component health monitoring data in the automotive industry.
o Monitoring data from industrial and manufacturing equipment.
o Sensor data in transportation, such as traffic management and highway toll lanes.
o Patient health monitoring data in the healthcare industry.
o Satellite data in the space industry.
o Fraud detection in the banking and finance industries.
o Sentiment analysis of social media posts.
  • 128. Approaches to data stream processing 26/08/2019 129 There are two approaches to processing data streams: on-demand and live. o Streaming data can be collected over time and persisted in storage as static data. The data can then be processed when convenient or during times when compute costs are lower. The downside to this approach is the cost of storing the data. o In contrast, live data streams have relatively low storage requirements. They also require more processing power to run computations in sliding windows over continuously incoming data to generate the insights.
  • 129. Event processing 26/08/2019 130 The process of consuming data streams, analyzing them, and deriving actionable insights out of them is called event processing. An event processing pipeline has three distinct components: o Event producer: Examples include sensors or processes that generate data continuously, such as a heart rate monitor or a highway toll lane sensor. o Event processor: An engine to consume event data streams and derive insights from them. Depending on the problem space, event processors either process one incoming event at a time, such as a heart rate monitor, or process multiple events at a time, such as Azure Stream Analytics processing the highway toll lane sensor data. o Event consumer: An application that consumes the data and takes specific action based on the insights. Examples of event consumers include alert generation, dashboards, or even sending data to another event processing engine. https://docs.microsoft.com/en-us/learn/modules/introduction-to-data-streaming/3-event-processing
• 130. Process events with Azure Stream Analytics 26/08/2019 131
Microsoft Azure Stream Analytics is an event processing engine. It enables the consumption and analysis of high volumes of streaming data generated by sensors, devices, or applications. Stream Analytics processes the data in real time. A typical event processing pipeline built on top of Stream Analytics consists of the following four components:
o Event producer: Any application, system, or sensor that continuously produces event data of interest. Examples range from a sensor that tracks the flow of water in a utility pipe to an application such as Twitter that generates tweets against a single hashtag.
o Event ingestion system: Takes the data from the source system or application and passes it on to an analytics engine. Azure Event Hubs, Azure IoT Hub, or Azure Blob storage can all serve as the ingestion system.
o Stream analytics engine: Where compute is run over the incoming streams of data and insights are extracted. Azure Stream Analytics exposes the Stream Analytics query language (SAQL), a subset of Transact-SQL that's tailored to perform computations over streaming data. The engine supports windowing functions that are fundamental to stream processing and are implemented by using SAQL.
o Event consumer: A destination for the output of the stream analytics engine. The target can be storage, such as Azure Data Lake, Azure Cosmos DB, Azure SQL Database, or Azure Blob storage, or dashboards powered by Power BI.
https://docs.microsoft.com/en-us/learn/modules/introduction-to-data-streaming/4-event-with-asa
• 132. Stream Analytics workflow 26/08/2019 133
In Azure Stream Analytics, a job is the unit of execution. A Stream Analytics job pipeline consists of three parts:
o An input that provides the source of the data stream.
o A transformation query that acts on the input. For example, a transformation query could aggregate the data.
o An output that identifies the destination of the transformed data.
The Stream Analytics pipeline provides a transformed data flow from input to output, as the following diagram shows.
https://docs.microsoft.com/en-us/learn/modules/transform-data-with-azure-stream-analytics/2-stream-analytics-workflow
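As a rough illustration, a transformation query written in SAQL might count events per device over one-minute tumbling windows; the input alias, output alias, and column names below are hypothetical and would match the job's configured inputs and outputs:

-- Hypothetical SAQL transformation: events per device per one-minute tumbling window
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO
    OutputAlias
FROM
    InputAlias TIMESTAMP BY EventTime
GROUP BY
    DeviceId,
    TumblingWindow(minute, 1)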
• 133. Stream Analytics job inputs 26/08/2019 134
An Azure Stream Analytics job supports three input types:
o Azure Event Hubs: Consumes live streaming data from applications with low latency and high throughput.
o Azure IoT Hub: Consumes live streaming events from IoT devices. This service enables bidirectional communication, so commands can be sent back to IoT devices to trigger specific actions based on analysis of the streams the devices send to the service.
o Azure Blob storage: Serves as the input source to consume files that are persisted in blob storage.
  • 135. 26/08/2019 136 Identify job roles Role differences The roles of the data engineer, AI engineer, and data scientist differ. Each role solves a different problem. Data engineers primarily provision data stores. They make sure that massive amounts of data are securely and cost-effectively extracted, loaded, and transformed. AI engineers add the intelligent capabilities of vision, voice, language, and knowledge to applications. To do this, they use the Cognitive Services offerings that are available out of the box. When a Cognitive Services application reaches its capacity, AI engineers call on data scientists. Data scientists develop machine learning models and customize components for an AI engineer's application. Each data-technology role is distinct, and each contributes an important part to digital transformation projects.
  • 136. Data engineer 26/08/2019 137 Data engineers provision and set up data platform technologies that are on-premises and in the cloud. They manage and secure the flow of structured and unstructured data from multiple sources. The data platforms they use can include relational databases, nonrelational databases, data streams, and file stores. Data engineers also ensure that data services securely and seamlessly integrate with other data platform technologies or application services such as Azure Cognitive Services, Azure Search, or even bots. The Azure data engineer focuses on data-related tasks in Azure. Primary responsibilities include using services and tools to ingest, egress, and transform data from multiple sources. Azure data engineers collaborate with business stakeholders to identify and meet data requirements. They design and implement solutions. They also manage, monitor, and ensure the security and privacy of data to satisfy business needs. The role of data engineer is different from the role of a database administrator. A data engineer's scope of work goes well beyond looking after a database and the server where it's hosted. Data engineers must also get, ingest, transform, validate, and clean up data to meet business requirements. This process is called data wrangling. A data engineer adds tremendous value to both business intelligence and data science projects. Data wrangling can consume a lot of time. When the data engineer wrangles data, projects move more quickly because data scientists can focus on their own areas of work. Both database administrators and business intelligence professionals can easily transition to a data engineer role. They just need to learn the tools and technology that are used to process large amounts of data.
  • 137. AI engineer 26/08/2019 138 AI engineers work with AI services such as Cognitive Services, Cognitive Search, and Bot Framework. Cognitive Services includes Computer Vision, Text Analytics, Bing Search, and Language Understanding (LUIS). Rather than creating models, AI engineers apply the prebuilt capabilities of Cognitive Services APIs. AI engineers embed these capabilities within a new or existing application or bot. AI engineers rely on the expertise of data engineers to store information that's generated from AI. For example, an AI engineer might be working on a Computer Vision application that processes images. This AI engineer would ask a data engineer to provision an Azure Cosmos DB instance to store the metadata and tags that the Computer Vision application generates.
• 140. Change data processes 26/08/2019 141
As a data engineer, you'll extract raw data from a structured or unstructured data pool and migrate it to a staging data repository. Because the data source might have a different structure than the target destination, you'll transform the data from the source schema to the destination schema. This process is called transformation. You'll then load the transformed data into the data warehouse. Together, these steps form a process called extract, transform, and load (ETL). A disadvantage of the ETL approach is that the transformation stage can take a long time and can potentially tie up source system resources. An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in process reduces the resource contention on source systems. Data engineers can begin transforming the data as soon as the load is complete. ELT also has more architectural flexibility to support multiple transformations. For example, how the marketing department needs to transform the data can be different from how the operations department needs to transform that same data.
  • 141. Evolution from ETL 26/08/2019 142 Azure has opened the way for technologies that can handle unstructured data at an unlimited scale. This change has shifted the paradigm for loading and transforming data from ETL to extract, load, and transform (ELT). The benefit of ELT is that you can store data in its original format, be it JSON, XML, PDF, or images. In ELT, you define the data's structure during the transformation phase, so you can use the source data in multiple downstream systems. In an ELT process, data is extracted and loaded in its native format. This change reduces the time required to load the data into a destination system. The change also limits resource contention on the data sources. The steps for the ELT process are the same as for the ETL process. They just follow a different order. Another process like ELT is called extract, load, transform, and load (ELTL). The difference with ELTL is that it has a final load into a destination system.
• 142. Holistic data engineering 26/08/2019 143
Organizations are changing their analysis types to incorporate predictive and preemptive analytics. Because of these changes, as a data engineer you should look at data projects holistically. Data professionals used to focus on ETL, but developments in data platform technologies lend themselves to an ELT approach.
Design data projects in phases that reflect the ELT approach:
o Source: Identify the source systems to extract from.
o Ingest: Identify the technology and method to load the data.
o Prepare: Identify the technology and method to transform or prepare the data.
Also consider the technologies you'll use to analyze and consume the data within the project. These are the next two phases of the process:
o Analyze: Identify the technology and method to analyze the data.
o Consume: Identify the technology and method to consume and present the data.
In traditional descriptive analytics projects, you might have transformed data in Azure Analysis Services and then used Power BI to consume the analyzed data. New AI technologies such as Azure Machine Learning services and Azure Notebooks provide a wider range of technologies to automate some of the required analysis. These project phases don't necessarily have to flow linearly. For example, because machine learning experimentation is iterative, the Analyze phase sometimes reveals issues such as missing source data or transformation steps. To get the results you need, you might need to repeat earlier phases.