© Copyright 2016 Hitachi Consulting
Microsoft Azure Batch
High Performance Computing with an Application of
Scalable Files Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
Outline
 What is Azure Batch and High Performance Computing?
 When to Use Azure Batch?
 Azure Batch Constructs
 Scalable Data Loading Solution with Azure Batch
 .NET Code Walk-through & Demo
 Useful Resources
High Performance Computing
What is Azure Batch?
Yet another Azure service…
A High Performance Computing (HPC) environment on Azure, used to scale out and parallelize compute-intensive workloads on a managed cluster of VMs. The computation on the cluster is managed using the Azure Batch APIs.
• On-demand – pay as you use
• Elastic – scale up/down or shut down
• PaaS – no infrastructure configuration is needed
Computing Example – Sequential Processing
A Job is made up of Task 1 … Task 6, executed one after another on a single compute unit. If each task takes time X:
• Start: T = 0
• Task 1 finishes at T = 1X, Task 2 at T = 2X, … Task 6 at T = 6X
• End: T = 6X+
High Performance Computing
Refers to the use of parallel processing to run compute-intensive job programs efficiently by aggregating compute power:
• Divide – a job is decomposed into multiple independent tasks
• Distribute – tasks are processed on separate compute nodes, simultaneously
• Scale out – using multiple compute units
Computing Example – Parallel Processing
The same job's tasks (Task 1 … Task 6) are distributed across a compute cluster and executed simultaneously:
• Start: T = 0
• Each of Task 1 … Task 6 finishes at T = 1X
• End: T = 1X+
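The contrast between the two examples can be sketched with a toy workload: the same six tasks run sequentially and then on a pool of workers. The 1X task duration and the pool size are illustrative assumptions, not part of the original slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TASK_DURATION = 0.05  # the "1X" unit, purely illustrative

def run_task(task_id):
    time.sleep(TASK_DURATION)  # stand-in for compute-intensive work
    return task_id

tasks = range(1, 7)  # Task 1 .. Task 6

# Sequential processing: one compute unit, total time ~ 6X
start = time.perf_counter()
sequential_results = [run_task(t) for t in tasks]
sequential_time = time.perf_counter() - start

# Parallel processing: six workers, total time ~ 1X
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    parallel_results = list(pool.map(run_task, tasks))
parallel_time = time.perf_counter() - start

print(parallel_time < sequential_time)
```

Threads overlap here because `sleep` releases the GIL; on a real compute-bound workload the parallelism would come from separate nodes, as in the Azure Batch cluster above.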
Big Data vs. Big Compute
The big brothers
Big Data
• Data centric
• Increase in data Volume + Velocity + Variety
= technologies to store and process the data efficiently
• Azure HDInsight
Big Compute
• CPU & memory intensive
• Increase in computation and algorithm complexity
= technologies to parallelize/distribute the workload
• Azure Batch
Big Data processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems.
When to use Azure Batch
Use cases for Big Compute: intrinsically parallel (also known as "embarrassingly parallel") applications
• Image rendering and graphics processing
• Search and optimization problems
• Various experimental/simulation computing applications
• Massively parallel data file processing & loading
• Executing thousands of DB stored procedures simultaneously? NO! Remember where the computation occurs!
For applications that need task-to-task interaction, Message Passing Interface (MPI) is supported in Azure Batch – distributed processing.
In some cases, communication between tasks can be managed via a shared data store – parallel processing.
Azure Batch
Azure Batch Constructs
Putting together the pieces of the picture
Azure Batch Account
• Pool
− Number of VMs
− VM size
− VM OS family
• Job
− Set of tasks
− Priority
− Max. execution time
• Task
− Parent job
− Resources (.config, .dlls)
− Cmd executable (.exe)
− Cmd parameters
Azure Storage Account
• Hosts all the task resources (.dlls & .exe)
A Batch account can contain multiple pools, each defined by its number of nodes, OS family, and node size. Jobs (each with a priority and a max execution time) are submitted to a pool, and each job is made up of tasks that run on that pool's nodes.
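The hierarchy above can be summarised as a minimal object model. The class and field names below are illustrative stand-ins for the purpose of showing the structure; they are not the actual Azure Batch SDK types.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    task_id: str
    command_line: str            # cmd executable (.exe) + parameters
    resource_files: List[str]    # .config / .dll resources hosted in Azure Storage

@dataclass
class Job:
    job_id: str
    priority: int
    max_wall_clock_minutes: int
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Pool:
    pool_id: str
    vm_count: int
    vm_size: str       # e.g. "small"
    os_family: str     # e.g. "4"
    jobs: List[Job] = field(default_factory=list)

# One pool can run several jobs, each made of independent tasks
pool = Pool("pool1", vm_count=10, vm_size="small", os_family="4")
job = Job("job1", priority=1, max_wall_clock_minutes=60)
job.tasks.append(Task("task1", "ProcessFeed.exe Feed1",
                      ["ProcessFeed.exe", "Parsers.dll"]))
pool.jobs.append(job)
```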
Compute Size

Resource                    Default limit    Maximum limit
Azure Batch accounts        1                50
Pools per Batch account     20               5000
Cores per Batch account     20               N/A
Tasks per compute node      1                4 × node cores

Number of nodes vs. node size:
• Many small nodes → many tasks that are not compute/memory intensive
• Few big nodes → few compute/memory-intensive tasks (potential multi-threading per task)
• Task queueing is automatically managed by Azure Batch
Compute Size
What if:
• Pool size = 10 nodes
• Node size = Small (1 core)
• Total cores = 10
And you have:
• 2 jobs, each with 7 tasks
• Total tasks = 14
By default, 1 core can process only 1 task. Then:
• The 7 tasks of the higher-priority job will be executed (status = "Running")
• The first 3 tasks added to the lower-priority job will be executed (status = "Running")
• The remaining 4 tasks of the lower-priority job will be queued (status = "Active")
• As soon as a "Running" task finishes (status = "Completed"), an "Active" task is assigned to the freed compute node
• If a job is already executing (status = "Running") and a higher-priority job is submitted to the same pool, Azure Batch will "pause" tasks of the lower-priority job (status = "Suspended") to free cores for the higher-priority job, then resume them when resources become available
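The scheduling walk-through can be reproduced with a small simulation. The status names follow the slide; the greedy assignment below is a simplified illustration, not the actual Azure Batch scheduler.

```python
def assign_tasks(total_cores, jobs):
    """Greedily assign tasks to cores, higher-priority jobs first.

    jobs: list of (job_id, priority, task_count); returns {(job, n): status}.
    """
    statuses = {}
    free_cores = total_cores
    # Higher priority value wins; ties keep submission order
    for job_id, _, task_count in sorted(jobs, key=lambda j: -j[1]):
        for t in range(1, task_count + 1):
            if free_cores > 0:
                statuses[(job_id, t)] = "Running"
                free_cores -= 1
            else:
                statuses[(job_id, t)] = "Active"  # queued
    return statuses

# 10 single-core nodes, 2 jobs with 7 tasks each
statuses = assign_tasks(10, [("jobA", 2, 7), ("jobB", 1, 7)])
running_a = sum(1 for (j, _), s in statuses.items() if j == "jobA" and s == "Running")
running_b = sum(1 for (j, _), s in statuses.items() if j == "jobB" and s == "Running")
queued_b = sum(1 for (j, _), s in statuses.items() if j == "jobB" and s == "Active")
print(running_a, running_b, queued_b)  # → 7 3 4
```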
Use Case: Parallel Data File Loading
Parallel Data Loading with Azure Batch
Problem Context
• Source data is a set of files, in different formats (fixed width, delimited, XML, JSON, mainframe, other), in Azure Blob Storage
• Blob Storage structure: "<DataDomain>/<DataFeed>/<DataFeed>_<Timestamp>.<ext>"
• 200+ data feeds, each producing 1–3 files daily
• Data feed formats (columns, data types, file format) are described in a metadata DB (Azure SQL DB)
• The objective is to build a data loading solution that:
− Parses the files and loads them into a database (Azure SQL DW)
− Is scalable: used for ongoing data loading and historical data migration
− Is metadata driven: new data feeds can be handled by adding metadata
− Logs execution history and errors
Parallel Data Loading with Azure Batch
Parallelism Level
The task (the unit of parallelization, or granule) can be:
• Processing a feed
− balanced number of files/file sizes in each feed
− files are loaded in sequence
− files can be processed simultaneously on the same node using multithreading (CPU/memory implications)
• Processing a file
− no file sequence is needed
− fine grained: more control, better utilization of resources
− less manageable (many tasks per job)
• Processing a file line
− multithreading on the same node
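File-level granularity, the middle option, can be sketched with a thread pool that treats each file as an independent unit of work. The file names and the processing step are illustrative stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    # Stand-in for: download from Blob Storage, parse, load to the DW
    return (file_name, "loaded")

files = [f"Feed1_2016010{i}.txt" for i in range(1, 6)]

# Each file is processed independently; no ordering is required
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_file, files))

print(len(results))  # → 5
```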
Parallel Data Loading with Azure Batch
Solution Architecture
Components: a Runner <Host>, Metadata <Azure SQL DB>, Source <Azure Blob Storage> (holding Feed 1 … Feed N), a Compute Cluster <Azure Batch Pool>, and the Destination <Azure SQL DW>.
1. The Runner gets the list of feeds to process from the metadata DB
2. It creates a job
3. It creates a task for each feed
4. It adds the tasks to the job
5. It submits the job to the Azure Batch pool
Each task (Task 1 … Task N) then reads its feed's files (File 1, File 2, …) from Blob Storage, uses the metadata DB for the feed format, and loads the resulting datasets (DS 1, DS 2, …) into the destination (Azure SQL DW).
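Steps 1–5 of the Runner can be sketched as follows. The executable name, its parameter, and the feed list are illustrative assumptions; in the real solution the Azure Batch API is used to create the job and submit the tasks to the pool.

```python
def get_feeds_to_process():
    # Stand-in for step 1: query the metadata DB (Azure SQL DB)
    return ["CustomerFeed", "OrdersFeed", "ProductsFeed"]

def build_job(feeds):
    # Steps 2-4: create a job and one task per feed, with a command
    # line pointing at the task console app and its feed parameter
    job = {"job_id": "DataLoading", "tasks": []}
    for i, feed in enumerate(feeds, start=1):
        job["tasks"].append({
            "task_id": f"task{i}",
            "command_line": f"DataLoader.exe --feed {feed}",
        })
    return job

job = build_job(get_feeds_to_process())
# Step 5 would submit `job` to the Azure Batch pool
print([t["command_line"] for t in job["tasks"]])
```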
Parallel Data Loading with Azure Batch
Task Processing Steps
1. Get the feed format info from the metadata DB
2. Create the destination tables
3. Get the list of files to process
4. Load the parser class to use
5. For each file to process:
− Load the file content from Blob Storage
− Parse the file content into a DataTable
− Dump the DataTable content to the destination (DW)
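A single task's steps can be sketched end-to-end on an in-memory example. The metadata, parser, and destination here are stand-ins; the real solution reads from Blob Storage and bulk-copies into Azure SQL DW.

```python
# Step 1: feed format info, as the metadata DB would describe it
feed_format = {"delimiter": "|", "columns": ["id", "name", "amount"]}

# Steps 2-3: destination table and list of files (stand-ins)
destination = []          # plays the role of the DW table
files = {"CustomerFeed_20160101.txt": "1|alice|10\n2|bob|20"}

# Step 4: choose a parser based on the metadata
def parse_delimited(content, fmt):
    for line in content.splitlines():
        values = line.split(fmt["delimiter"])
        yield dict(zip(fmt["columns"], values))

# Step 5: for each file, load -> parse -> dump
for name, content in files.items():
    rows = list(parse_delimited(content, feed_format))  # the "DataTable"
    destination.extend(rows)                            # bulk-copy stand-in

print(destination[0])  # → {'id': '1', 'name': 'alice', 'amount': '10'}
```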
.NET Solution Structure
• Processing Logic (class library)
− Model
− Database services
− Blob Storage services
− Parsers
• Task (console app) – hosted in Azure Blob Storage
− Receives command-line parameters
− Performs the operation according to the supplied parameters
• Runner (console app) – runs on a host
− Azure Batch services
− Creates pools/jobs/tasks
Hosting the Azure Batch Runner
• None – one-off execution
• SQL Agent job (VM + SQL Server)
• SQL Server Integration Services (VM + SQL Server)
• Azure WebJob + Azure Scheduler (or on-demand)
• Azure Data Factory
• Azure Orchestration???
Code Walk-through
Code Walk-through
This is how we do it
• Solution Structure
• Azure Batch Bits
• Azure Blob Storage Bits
• Text File Processing
• XML & JSON (Quick and Dirty)
• SQL Bulk Copy with Retry Pattern
Code Walk-through
Solution Structure
Code Walk-through
Azure Batch Bits
Very useful if you want to sync with subsequent processing steps, i.e., start a subsequent step only when the job finishes.
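The synchronisation pattern hinted at here — starting the next step only when every task is done — is typically a poll over the job's task states. The snapshot list below simulates successive polls; the real code queries the Batch service for task states and sleeps between polls.

```python
def all_tasks_completed(task_states):
    return all(state == "Completed" for state in task_states)

# Simulated snapshots of the job's task states over successive polls
snapshots = [
    ["Running", "Running", "Active"],
    ["Completed", "Running", "Running"],
    ["Completed", "Completed", "Completed"],
]

polls = 0
for states in snapshots:        # in real code: loop + sleep between polls
    polls += 1
    if all_tasks_completed(states):
        break                   # safe to start the subsequent processing step

print(polls)  # → 3
```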
Code Walk-through
Azure Blob Storage Bits
Streaming is very efficient for processing large files, instead of downloading the whole file before processing it.
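The streaming point can be illustrated with a file-like stream processed line by line, so only one line is held in memory at a time. Here `io.StringIO` stands in for the blob read stream.

```python
import io

def count_records(stream):
    # Process the stream line by line instead of reading it all at once
    count = 0
    for line in stream:            # one line in memory at a time
        if line.strip():
            count += 1
    return count

blob_stream = io.StringIO("row1\nrow2\nrow3\n")
print(count_records(blob_stream))  # → 3
```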
Code Walk-through
Text File Parsing – FileHelpers Library
Parallel processing at the file level (a separate thread per line to parse).
Code Walk-through
XML & JSON File Parsing – Quick & Dirty
• The content of the whole file is loaded into a dataset
• Data cannot be flushed in batches
• Unlike streaming, this is a more memory-intensive approach
Code Walk-through
SQL Bulk Copy – Loading in Batches
Batch size < (available memory / record size)
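Loading in batches keeps memory bounded by flushing every N records; a generator makes the rule "batch size < available memory / record size" concrete. The sizes in the comment are illustrative.

```python
def batches(rows, batch_size):
    """Yield lists of at most batch_size rows, so at most one batch
    is held in memory before being flushed to the destination."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# e.g. 64 MB available / ~64 KB per record -> batch size of at most 1000
chunks = list(batches(range(2500), 1000))
print([len(c) for c in chunks])  # → [1000, 1000, 500]
```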
Code Walk-through
SQL Bulk Copy – Asynchronous
Code Walk-through
SQL Bulk Copy – Retry Pattern
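A minimal version of the retry pattern used around the bulk copy — retry transient failures with exponential backoff — might look like this. The delays, attempt count, and exception type are illustrative; the .NET code in the walkthrough retries SQL transient errors specifically.

```python
import time

def with_retry(operation, max_attempts=3, base_delay=0.01):
    """Run operation(), retrying on failure with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: propagate
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}

def flaky_bulk_copy():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "copied"

result = with_retry(flaky_bulk_copy)
print(result, calls["count"])  # → copied 3
```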
Some Important Notes – Polybase
• Since the destination database is Azure SQL DW, Polybase (a Big Data technology) is the best option for loading data from Blob Storage into it, by creating external tables that define the format of the data files.
• However, to use Polybase, the Blob Storage account needs to be locally redundant, and each folder should contain only one data file type.
• A pre-processing step moves the data files from the original Blob Storage (which might be geo-redundant) into a temporary, locally redundant Blob Storage account with a suitable folder structure.
• Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of a file into a one-column table, where T-SQL is used to parse it.
• If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate a workflow), Azure Batch is the right tool.
Useful Resources
Check these out…
• Azure Batch Documentation
https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview
• Azure Batch Explorer
https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer
• HPC and data orchestration using Azure Batch and Data Factory
https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch
• FileHelpers Library
http://www.filehelpers.net
• Retry Pattern
https://msdn.microsoft.com/en-us/library/dn589788.aspx
• Spinning up 16,000 A1 Virtual Machines on Azure Batch
https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch
• Parallel Computing
https://en.wikipedia.org/wiki/Parallel_computing
Acknowledgement
These guys are awesome…
Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome data landing solution for Argos. Nirav is currently the master of the landing solution.
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
− classification rule induction,
− decision tree construction,
− Bayesian classification modelling,
− data reduction,
− instance-based learning,
− evolving neural networks, and
− data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData.
• ResearchGate.org
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPT
Quality review (1)_presentation of this 21
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PDF
Mega Projects Data Mega Projects Data
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
Global journeys: estimating international migration
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
IB Computer Science - Internal Assessment.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Database Infoormation System (DBIS).pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Introduction to Business Data Analytics.
IBA_Chapter_11_Slides_Final_Accessible.pptx
Introduction to Knowledge Engineering Part 1
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Quality review (1)_presentation of this 21
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
Mega Projects Data Mega Projects Data
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Global journeys: estimating international migration
climate analysis of Dhaka ,Banglades.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
1_Introduction to advance data techniques.pptx
Business Acumen Training GuidePresentation.pptx

Microsoft Azure Batch

  • 1. | © Copyright 2016 Hitachi Consulting1 Microsoft Azure Batch High Performance Computing with an Application of Scalable Files Processing Khalid M. Salama, Ph.D. Business Insights & Analytics Hitachi Consulting UK We Make it Happen. Better.
  • 2. | © Copyright 2016 Hitachi Consulting2 Outline  What is Azure Batch and High Performance Computing?  When to Use Azure Batch?  Azure Batch Constructs  Scalable Data Loading Solution with Azure Batch  .NET Code Walk-through & Demo  Useful Resources
  • 3. | © Copyright 2016 Hitachi Consulting3 High Performance Computing
  • 4. | © Copyright 2016 Hitachi Consulting4 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure.
  • 5. | © Copyright 2016 Hitachi Consulting5 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 6. | © Copyright 2016 Hitachi Consulting6 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 7. | © Copyright 2016 Hitachi Consulting7 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. On-demand – Pay for what you use Elastic – Scale up/down or shut down PaaS – No infrastructure configuration is needed Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 8. | © Copyright 2016 Hitachi Consulting8 Computing Example Job Job Sequential Processing
  • 9. | © Copyright 2016 Hitachi Consulting9 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing
  • 10. | © Copyright 2016 Hitachi Consulting10 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing Single Compute Unit
  • 11. | © Copyright 2016 Hitachi Consulting11 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 Sequential Processing Single Compute Unit Start T = 0
  • 12. | © Copyright 2016 Hitachi Consulting12 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 2 Sequential Processing Task 1 T = 1X Start T = 0 Single Compute Unit
  • 13. | © Copyright 2016 Hitachi Consulting13 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 3 Sequential Processing Task 1 T = 1X Start T = 0 Task 2 T = 2X Single Compute Unit
  • 14. | © Copyright 2016 Hitachi Consulting14 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 T = 1X Start T = 0 Task 2 T = 2X Task 3 T = 3X Task 4 T = 4X Task 5 T = 5X Task 6 T = 6X Sequential Processing End T = 6X+ Single Compute Unit
  • 15. | © Copyright 2016 Hitachi Consulting15 High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power
  • 16. | © Copyright 2016 Hitachi Consulting16 High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power Scale out Using multiple compute units Divide A job is decomposed into multiple independent tasks Distribute Tasks are processed on separate compute nodes, simultaneously
  • 17. | © Copyright 2016 Hitachi Consulting17 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing
  • 18. | © Copyright 2016 Hitachi Consulting18 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster
  • 19. | © Copyright 2016 Hitachi Consulting19 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 Task 2 Task 3 Task 4 Task 5 Task 6
  • 20. | © Copyright 2016 Hitachi Consulting20 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 T = 1X Start T = 0 Task 2 T = 1X Task 3 T = 1X Task 4 T = 1X Task 5 T = 1X Task 6 T = 1X End T = 1X+
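The timing story on slides 8–20 can be demonstrated with a small sketch (Python here purely for illustration; the deck's actual solution is .NET): six tasks of duration X take roughly 6X when run one after another on a single compute unit, but roughly 1X when spread across six parallel workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Simulate one compute-intensive task of duration X = 0.1 s.
    time.sleep(0.1)
    return n

# Sequential processing: total time ~ 6X
start = time.perf_counter()
for n in range(6):
    task(n)
sequential = time.perf_counter() - start

# Parallel processing on 6 "compute units": total time ~ 1X
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(task, range(6)))
parallel = time.perf_counter() - start
```

With 0.1 s tasks, `sequential` lands near 0.6 s and `parallel` near 0.1 s, mirroring the T = 6X vs. T = 1X comparison on the slides.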
  • 21. | © Copyright 2016 Hitachi Consulting21 Big Data vs. Big Compute The big brothers Big Data  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight
  • 22. | © Copyright 2016 Hitachi Consulting22 Big Data vs. Big Compute The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 23. | © Copyright 2016 Hitachi Consulting23 Big Data vs. Big Compute Big Data Processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 24. | © Copyright 2016 Hitachi Consulting24 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications Use cases for Big Compute
  • 25. | © Copyright 2016 Hitachi Consulting25 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading Use cases for Big Compute
  • 26. | © Copyright 2016 Hitachi Consulting26 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously Use cases for Big Compute
  • 27. | © Copyright 2016 Hitachi Consulting27 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! Use cases for Big Compute
  • 28. | © Copyright 2016 Hitachi Consulting28 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! For applications that need task-to-task interaction, Message Passing Interfaces (MPI) are supported in Azure Batch – Distributed Processing In some cases, communication between tasks can be managed via a shared data store – Parallel Processing Use cases for Big Compute
  • 29. | © Copyright 2016 Hitachi Consulting29 Azure Batch
  • 30. | © Copyright 2016 Hitachi Consulting30 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 31. | © Copyright 2016 Hitachi Consulting31 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 32. | © Copyright 2016 Hitachi Consulting32 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 33. | © Copyright 2016 Hitachi Consulting33 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 34. | © Copyright 2016 Hitachi Consulting34 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 35. | © Copyright 2016 Hitachi Consulting35 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 36. | © Copyright 2016 Hitachi Consulting36 Compute Size Resource Default Maximum Limit Azure Batch Account 1 50 Pools per Batch Account 20 5000 Cores per Batch Account 20 N/A Tasks per Compute Node 1 4 × node cores Number of Nodes vs Node Size:  Many small nodes → many tasks, not compute/memory intensive  Few big nodes → few tasks, compute/memory intensive (potential multi-threading per task)  Task queueing is automatically managed by Azure Batch Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 37. | © Copyright 2016 Hitachi Consulting37 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task
  • 38. | © Copyright 2016 Hitachi Consulting38 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher priority job will be executed (status = “Running”)  The first 3 added tasks of the lower priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node
  • 39. | © Copyright 2016 Hitachi Consulting39 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher priority job will be executed (status = “Running”)  The first 3 added tasks of the lower priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node  If a job was already running (status = “Running”) and a higher priority job is submitted to the same pool: − Azure Batch will “pause” tasks of the lower priority job (status = “Suspended”) to free resources (cores) for the higher priority job, − then resume them when resources become available
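The queueing behaviour on slides 37–39 can be sketched with a toy scheduler (Python, purely illustrative; this is not the real Batch scheduler, which also handles suspension and resumption): with 10 single-core nodes and two 7-task jobs, the higher-priority job's tasks all run, the lower-priority job gets the 3 leftover cores, and 4 tasks queue as “Active”.

```python
def schedule(total_cores, jobs):
    # jobs: list of (job_name, priority, task_count); each task needs one core.
    # Higher-priority jobs get cores first; leftover tasks are queued.
    running, active = [], []
    free_cores = total_cores
    for name, priority, count in sorted(jobs, key=lambda j: -j[1]):
        for i in range(1, count + 1):
            task_id = f"{name}-task{i}"
            if free_cores > 0:
                running.append(task_id)   # status = "Running"
                free_cores -= 1
            else:
                active.append(task_id)    # status = "Active" (queued)
    return running, active

# 10 single-core nodes, two jobs with 7 tasks each:
running, active = schedule(10, [("low", 1, 7), ("high", 2, 7)])
```

This reproduces the slide's split: 10 tasks “Running” (all 7 high-priority plus the first 3 low-priority) and 4 low-priority tasks “Active”.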
  • 40. | © Copyright 2016 Hitachi Consulting40 Use Case: Parallel Data Files Loading
  • 41. | © Copyright 2016 Hitachi Consulting41 Parallel Data Loading with Azure Batch  Source data is a set of files, with different formats (Fixed width, Delimited, XML, JSON, Mainframe, Other), in Azure Blob Storage  Blob Storage Structure: “<DataDomain>/<DataFeed>/<DataFeed>_<Timestamp>.<ext>”  200+ data feeds, each produces 1-3 files daily  Data feed formats (columns, data types, file format) are described in MetadataDB (Azure SQL DB)  The objective is to build a Data Loading Solution to:  Parse the files and load them into a database (Azure SQL DW)  Be scalable – used for ongoing data loading and history data migration  Be metadata driven – new data feeds can be handled by the solution by adding metadata  Log execution history and errors Problem Context
  • 42. | © Copyright 2016 Hitachi Consulting42 Parallel Data Loading with Azure Batch The task (unit of parallelization, or granule) can be:  Processing a Feed  balanced number of files/file sizes in each feed  loading files in sequence  files can be processed simultaneously on the same node using multithreading (CPU/Memory implications)  Processing a File  no file sequence is needed  fine grained, more control, better utilization of resources  less manageable (many tasks per job)  Processing a File Line  multithreading on the same node. Parallelism Level
  • 43. | © Copyright 2016 Hitachi Consulting43 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Destination <Azure SQL DW> Metadata <Azure SQL DB>
  • 44. | © Copyright 2016 Hitachi Consulting44 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process Destination <Azure SQL DW>
  • 45. | © Copyright 2016 Hitachi Consulting45 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process 2 – Create a Job 3 – Create a task for each feed 4 – add the tasks to the job 5 – Submit the job Metadata <Azure SQL DB> Destination <Azure SQL DW>
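Steps 2–5 of the runner on this slide (create a job, create a task for each feed, add the tasks, submit) can be sketched as follows. This is an illustrative Python sketch with an in-memory stand-in client; the deck's actual runner is a .NET console app using the Azure Batch SDK, and the `FakeBatchClient`, `FeedLoader.exe` command line, and `"daily-load"` job id are all hypothetical.

```python
class FakeBatchClient:
    # Hypothetical in-memory stand-in for the Azure Batch client;
    # the real runner talks to a Batch account and pool via the .NET SDK.
    def __init__(self):
        self.jobs = {}

    def create_job(self, job_id):
        self.jobs[job_id] = []

    def add_task(self, job_id, command_line):
        self.jobs[job_id].append(command_line)

def submit_feed_job(client, job_id, feeds):
    # 2 - create a job; 3 - create a task per feed; 4 - add the tasks.
    # With the real SDK, committing the job makes it live on the pool (5).
    client.create_job(job_id)
    for feed in feeds:
        client.add_task(job_id, f"FeedLoader.exe --feed {feed}")
    return len(client.jobs[job_id])

client = FakeBatchClient()
task_count = submit_feed_job(client, "daily-load", ["Feed1", "Feed2", "FeedN"])
```

Each task's command line carries the feed name, so every compute node knows which feed to process once the task is scheduled.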
  • 46. | © Copyright 2016 Hitachi Consulting46 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Task 1 Task 2 Task N Destination <Azure SQL DW>
  • 47. | © Copyright 2016 Hitachi Consulting47 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . File 1 File 2 . . . DS 1 DS 2 . . . Task 1 Task 2 Task N Metadata <Azure SQL DB> Destination <Azure SQL DW>
  • 48. | © Copyright 2016 Hitachi Consulting48 Parallel Data Loading with Azure Batch Task Processing Steps Get feed format Info from Metadata
  • 49. | © Copyright 2016 Hitachi Consulting49 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Task Processing Steps
  • 50. | © Copyright 2016 Hitachi Consulting50 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Task Processing Steps
  • 51. | © Copyright 2016 Hitachi Consulting51 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use Task Processing Steps
  • 52. | © Copyright 2016 Hitachi Consulting52 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Task Processing Steps
  • 53. | © Copyright 2016 Hitachi Consulting53 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Task Processing Steps
  • 54. | © Copyright 2016 Hitachi Consulting54 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Task Processing Steps
  • 55. | © Copyright 2016 Hitachi Consulting55 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Dump DataTable content to destination (DW) Task Processing Steps
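The steps built up on slides 48–55 amount to one per-task pipeline. A Python sketch with stand-in metadata, blob-storage, and destination objects (all class and path names here are hypothetical; the real task is a .NET console app):

```python
steps = []  # records the order of operations, for illustration only

class Metadata:
    def get_feed_format(self, feed):
        steps.append("get-format")
        # A trivial "parser class": comma-delimited lines to rows.
        return {"parser": lambda text: [line.split(",") for line in text.splitlines()]}

class BlobStore:
    def list_files(self, feed):
        steps.append("list-files")
        return [f"Sales/{feed}/{feed}_20160101.csv"]

    def read(self, name):
        steps.append("read-file")
        return "1,a\n2,b"

class Destination:
    def __init__(self):
        self.rows = []

    def create_tables(self, fmt):
        steps.append("create-tables")

    def bulk_copy(self, table):
        steps.append("bulk-copy")
        self.rows.extend(table)

def process_feed(feed, metadata, blob_store, destination):
    fmt = metadata.get_feed_format(feed)   # get feed format info from metadata
    destination.create_tables(fmt)         # create destination tables
    files = blob_store.list_files(feed)    # get list of files to process
    parser = fmt["parser"]                 # load the parser class to use
    for name in files:                     # for each file to process:
        content = blob_store.read(name)    #   load file content from blob storage
        table = parser(content)            #   parse content into rows ("DataTable")
        destination.bulk_copy(table)       #   dump the rows to the destination (DW)

dest = Destination()
process_feed("Feed1", Metadata(), BlobStore(), dest)
```

The `steps` list makes the ordering explicit: format lookup, table creation, and file listing happen once per feed, while read/parse/copy repeat per file.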
  • 56. | © Copyright 2016 Hitachi Consulting56 .NET Solution Structure • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 57. | © Copyright 2016 Hitachi Consulting57 .NET Solution Structure }Azure Blob Storage } A Host • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 58. | © Copyright 2016 Hitachi Consulting58 Hosting Azure Batch Runner None! – One-off execution SQL Agent Job (VM + SqlServer) SQL Server Integration Services (VM + SqlServer) Azure WebJob + Azure Scheduler (or on-demand) Azure Data Factory Azure Orchestration???
  • 59. | © Copyright 2016 Hitachi Consulting59 Code Walk-through
  • 60. | © Copyright 2016 Hitachi Consulting60 Code Walk-through  Solution Structure  Azure Batch Bits  Azure Blob Storage Bits  Text File Processing  XML & JSON – (Quick and Dirty)  SQL Bulk Copy with Retry Pattern This is how we do it
  • 61. | © Copyright 2016 Hitachi Consulting61 Code Walk-through Solution Structure
  • 62. | © Copyright 2016 Hitachi Consulting62 Code Walk-through Azure Batch Bits Very useful if you want to sync with subsequent processing steps. I.e., start a subsequent step only when the job finishes.
  • 63. | © Copyright 2016 Hitachi Consulting63 Code Walk-through Azure Batch Bits
  • 64. | © Copyright 2016 Hitachi Consulting64 Code Walk-through Azure Batch Bits
  • 65. | © Copyright 2016 Hitachi Consulting65 Code Walk-through Azure Blob Storage Streaming is very efficient for processing large files, as it avoids downloading the whole file before processing it
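The streaming idea on this slide can be sketched as follows (illustrative Python; the deck streams from Blob Storage in .NET). The parser consumes the stream one line at a time and flushes parsed rows in small batches, so memory use stays flat regardless of file size; the `parse_stream` helper and the comma-delimited format are assumptions for the sketch.

```python
import io

def parse_stream(stream, flush, batch_size=2):
    # Read the stream one line at a time and flush parsed rows in small
    # batches, instead of materializing the whole file in memory.
    batch = []
    for line in stream:  # iterating a text stream yields one line at a time
        batch.append(line.rstrip("\n").split(","))
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # flush the final partial batch

# A small in-memory stream standing in for a blob download stream:
flushed = []
parse_stream(io.StringIO("1,a\n2,b\n3,c\n"), flushed.append)
```

The same loop works unchanged whether `stream` wraps a 1 KB test string or a multi-gigabyte blob download.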
  • 66. | © Copyright 2016 Hitachi Consulting66 Code Walk-through Text File Parsing – FileHelpers Library Parallel processing at the file level (a separate thread per line to parse)
  • 67. | © Copyright 2016 Hitachi Consulting67 Code Walk-through XML & JSON Files Parsing – Quick & Dirty • The content of the whole file is loaded into a dataset • Cannot flush data in batches • Unlike streaming, it is a more memory-intensive approach
  • 68. | © Copyright 2016 Hitachi Consulting68 Code Walk-through SQL Bulk Copy – Loading in Batches Batch size < (available memory / record size)
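The rule of thumb on this slide (batch size < available memory / record size) could be applied with a small helper like the following. This is a hypothetical sketch, not code from the deck; `safe_batch_size` and the hard cap are assumptions, and the real solution sets the equivalent value on SqlBulkCopy's batch size.

```python
def safe_batch_size(available_memory_bytes, record_size_bytes, hard_cap=100_000):
    # Apply the slide's rule: batch size < available memory / record size,
    # with a hard cap so batches stay reasonably small, and a floor of 1.
    limit = available_memory_bytes // record_size_bytes
    return max(1, min(hard_cap, limit - 1))

# e.g. ~512 MB of headroom and ~2 KB per record:
size = safe_batch_size(512 * 1024 ** 2, 2 * 1024)
```

With 512 MB of headroom and 2 KB records the memory limit (~262k rows) far exceeds the cap, so the cap wins; with tiny headroom the floor of 1 keeps the copy progressing row by row.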
  • 69. | © Copyright 2016 Hitachi Consulting69 Code Walk-through SQL Bulk Copy – Asynchronous
  • 70. | © Copyright 2016 Hitachi Consulting70 Code Walk-through SQL Bulk Copy – Retry Pattern
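The retry pattern referenced here can be sketched as follows (illustrative Python; the deck implements it in .NET around SqlBulkCopy): retry only transient failures, back off exponentially between attempts, and re-raise once the attempts are exhausted. The `with_retry` helper and the `flaky_bulk_copy` stand-in are assumptions for the sketch.

```python
import time

def with_retry(operation, attempts=4, base_delay=0.01, transient=(ConnectionError,)):
    # Retry-pattern sketch: retry transient failures with exponential
    # backoff, re-raising once the attempts are exhausted.
    for attempt in range(attempts):
        try:
            return operation()
        except transient:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 10 ms, 20 ms, 40 ms, ...

calls = {"count": 0}

def flaky_bulk_copy():
    # Fails twice with a transient error, then succeeds - a stand-in for
    # a bulk copy hitting a throttled Azure SQL DW connection.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retry(flaky_bulk_copy)
```

Catching only the `transient` exception types matters: a permanent error (bad schema, bad credentials) should fail fast rather than burn through retries.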
  • 71. | © Copyright 2016 Hitachi Consulting71 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.
  • 72. | © Copyright 2016 Hitachi Consulting72 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.
  • 73. | © Copyright 2016 Hitachi Consulting73 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.
  • 74. | © Copyright 2016 Hitachi Consulting74 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of the file into a one-column table, where T-SQL is used to parse it.
  • 75. | © Copyright 2016 Hitachi Consulting75 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of the file into a one-column table, where T-SQL is used to parse it.  If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate a workflow), Azure Batch is the right tool.
  • 76. | © Copyright 2016 Hitachi Consulting76 Useful Resources Check these out… • Azure Batch Documentation https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview • Azure Batch Explorer https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer • HPC and data orchestration using Azure Batch and Data Factory https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch • FileHelpers Library http://www.filehelpers.net • Retry Pattern https://msdn.microsoft.com/en-us/library/dn589788.aspx • Spinning up 16,000 A1 Virtual Machines on Azure Batch https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch • Parallel Computing https://en.wikipedia.org/wiki/Parallel_computing
  • 77. | © Copyright 2016 Hitachi Consulting77 Acknowledgement These guys are awesome… Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome Data Landing Solution for Argos. Nirav is currently the master of the landing solution 
  • 78. | © Copyright 2016 Hitachi Consulting78 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org