© Copyright 2016 Hitachi Consulting
Microsoft Azure Batch
High Performance Computing with an Application of
Scalable Files Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
Outline
 What is Azure Batch and High Performance Computing?
 When to Use Azure Batch?
 Azure Batch Constructs
 Scalable Data Loading Solution with Azure Batch
 .NET Code Walk-through & Demo
 Useful Resources
High Performance Computing
What is Azure Batch?
Yet another Azure service…
A High Performance Computing (HPC) environment on Azure, used to scale out and parallelize compute-intensive workloads on a managed cluster of VMs. The computation on the cluster is managed using the Azure Batch APIs.
• On-demand – pay as you use
• Elastic – scale up/down or shut down
• PaaS – no infrastructure configuration is needed
Computing Example – Sequential Processing
A Job is made up of Task 1 … Task 6, executed one after another on a single compute unit. If each task takes time X:
• Start: T = 0
• Task 1 finishes at T = 1X, Task 2 at T = 2X, … Task 6 at T = 6X
• End: T = 6X+
High Performance Computing
Refers to the use of parallel processing to run compute-intensive job programs efficiently by aggregating compute power:
• Divide – a job is decomposed into multiple independent tasks
• Distribute – tasks are processed on separate compute nodes, simultaneously
• Scale out – using multiple compute units
Computing Example – Parallel Processing
The same job's tasks (Task 1 … Task 6) are distributed across a compute cluster and executed simultaneously:
• Start: T = 0
• Each of Task 1 … Task 6 finishes at T = 1X
• End: T = 1X+
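The contrast between the two examples can be sketched with a toy workload: the same six tasks run sequentially and then on a pool of workers. The 1X task duration and the pool size are illustrative assumptions, not part of the original slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TASK_DURATION = 0.05  # the "1X" unit, purely illustrative

def run_task(task_id):
    time.sleep(TASK_DURATION)  # stand-in for compute-intensive work
    return task_id

tasks = range(1, 7)  # Task 1 .. Task 6

# Sequential processing: one compute unit, total time ~ 6X
start = time.perf_counter()
sequential_results = [run_task(t) for t in tasks]
sequential_time = time.perf_counter() - start

# Parallel processing: six workers, total time ~ 1X
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    parallel_results = list(pool.map(run_task, tasks))
parallel_time = time.perf_counter() - start

print(parallel_time < sequential_time)
```

Threads overlap here because `sleep` releases the GIL; on a real compute-bound workload the parallelism would come from separate nodes, as in the Azure Batch cluster above.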
Big Data vs. Big Compute
The big brothers
Big Data
• Data centric
• Increase in data Volume + Velocity + Variety
= technologies to store and process the data efficiently
• Azure HDInsight
Big Compute
• CPU & memory intensive
• Increase in computation and algorithm complexity
= technologies to parallelize/distribute the workload
• Azure Batch
Big Data processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems.
When to use Azure Batch
Use cases for Big Compute: intrinsically parallel (also known as "embarrassingly parallel") applications
• Image rendering and graphics processing
• Search and optimization problems
• Various experimental/simulation computing applications
• Massively parallel data file processing & loading
• Executing thousands of DB stored procedures simultaneously? NO! Remember where the computation occurs!
For applications that need task-to-task interaction, Message Passing Interface (MPI) is supported in Azure Batch – distributed processing.
In some cases, communication between tasks can be managed via a shared data store – parallel processing.
Azure Batch
Azure Batch Constructs
Putting together the pieces of the picture
Azure Batch Account
• Pool
− Number of VMs
− VM size
− VM OS family
• Job
− Set of tasks
− Priority
− Max. execution time
• Task
− Parent job
− Resources (.config, .dlls)
− Cmd executable (.exe)
− Cmd parameters
Azure Storage Account
• Hosts all the task resources (.dlls & .exe)
A Batch account can contain multiple pools, each defined by its number of nodes, OS family, and node size. Jobs (each with a priority and a max execution time) are submitted to a pool, and each job is made up of tasks that run on that pool's nodes.
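The hierarchy above can be summarised as a minimal object model. The class and field names below are illustrative stand-ins for the purpose of showing the structure; they are not the actual Azure Batch SDK types.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    task_id: str
    command_line: str            # cmd executable (.exe) + parameters
    resource_files: List[str]    # .config / .dll resources hosted in Azure Storage

@dataclass
class Job:
    job_id: str
    priority: int
    max_wall_clock_minutes: int
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Pool:
    pool_id: str
    vm_count: int
    vm_size: str       # e.g. "small"
    os_family: str     # e.g. "4"
    jobs: List[Job] = field(default_factory=list)

# One pool can run several jobs, each made of independent tasks
pool = Pool("pool1", vm_count=10, vm_size="small", os_family="4")
job = Job("job1", priority=1, max_wall_clock_minutes=60)
job.tasks.append(Task("task1", "ProcessFeed.exe Feed1",
                      ["ProcessFeed.exe", "Parsers.dll"]))
pool.jobs.append(job)
```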
Compute Size

Resource                    Default limit    Maximum limit
Azure Batch accounts        1                50
Pools per Batch account     20               5000
Cores per Batch account     20               N/A
Tasks per compute node      1                4 × node cores

Number of nodes vs. node size:
• Many small nodes → many tasks that are not compute/memory intensive
• Few big nodes → few compute/memory-intensive tasks (potential multi-threading per task)
• Task queueing is automatically managed by Azure Batch
Compute Size
What if:
• Pool size = 10 nodes
• Node size = Small (1 core)
• Total cores = 10
And you have:
• 2 jobs, each with 7 tasks
• Total tasks = 14
By default, 1 core can process only 1 task. Then:
• The 7 tasks of the higher-priority job will be executed (status = "Running")
• The first 3 tasks added to the lower-priority job will be executed (status = "Running")
• The remaining 4 tasks of the lower-priority job will be queued (status = "Active")
• As soon as a "Running" task finishes (status = "Completed"), an "Active" task is assigned to the freed compute node
• If a job is already executing (status = "Running") and a higher-priority job is submitted to the same pool, Azure Batch will "pause" tasks of the lower-priority job (status = "Suspended") to free cores for the higher-priority job, then resume them when resources become available
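The scheduling walk-through can be reproduced with a small simulation. The status names follow the slide; the greedy assignment below is a simplified illustration, not the actual Azure Batch scheduler.

```python
def assign_tasks(total_cores, jobs):
    """Greedily assign tasks to cores, higher-priority jobs first.

    jobs: list of (job_id, priority, task_count); returns {(job, n): status}.
    """
    statuses = {}
    free_cores = total_cores
    # Higher priority value wins; ties keep submission order
    for job_id, _, task_count in sorted(jobs, key=lambda j: -j[1]):
        for t in range(1, task_count + 1):
            if free_cores > 0:
                statuses[(job_id, t)] = "Running"
                free_cores -= 1
            else:
                statuses[(job_id, t)] = "Active"  # queued
    return statuses

# 10 single-core nodes, 2 jobs with 7 tasks each
statuses = assign_tasks(10, [("jobA", 2, 7), ("jobB", 1, 7)])
running_a = sum(1 for (j, _), s in statuses.items() if j == "jobA" and s == "Running")
running_b = sum(1 for (j, _), s in statuses.items() if j == "jobB" and s == "Running")
queued_b = sum(1 for (j, _), s in statuses.items() if j == "jobB" and s == "Active")
print(running_a, running_b, queued_b)  # → 7 3 4
```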
Use Case: Parallel Data File Loading
Parallel Data Loading with Azure Batch
Problem Context
• Source data is a set of files, in different formats (fixed width, delimited, XML, JSON, mainframe, other), in Azure Blob Storage
• Blob Storage structure: "<DataDomain>/<DataFeed>/<DataFeed>_<Timestamp>.<ext>"
• 200+ data feeds, each producing 1–3 files daily
• Data feed formats (columns, data types, file format) are described in a metadata DB (Azure SQL DB)
• The objective is to build a data loading solution that:
− Parses the files and loads them into a database (Azure SQL DW)
− Is scalable: used for ongoing data loading and historical data migration
− Is metadata driven: new data feeds can be handled by adding metadata
− Logs execution history and errors
Parallel Data Loading with Azure Batch
Parallelism Level
The task (the unit of parallelization, or granule) can be:
• Processing a feed
− balanced number of files/file sizes in each feed
− files are loaded in sequence
− files can be processed simultaneously on the same node using multithreading (CPU/memory implications)
• Processing a file
− no file sequence is needed
− fine grained: more control, better utilization of resources
− less manageable (many tasks per job)
• Processing a file line
− multithreading on the same node
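File-level granularity, the middle option, can be sketched with a thread pool that treats each file as an independent unit of work. The file names and the processing step are illustrative stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    # Stand-in for: download from Blob Storage, parse, load to the DW
    return (file_name, "loaded")

files = [f"Feed1_2016010{i}.txt" for i in range(1, 6)]

# Each file is processed independently; no ordering is required
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_file, files))

print(len(results))  # → 5
```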
Parallel Data Loading with Azure Batch
Solution Architecture
Components: a Runner <Host>, Metadata <Azure SQL DB>, Source <Azure Blob Storage> (holding Feed 1 … Feed N), a Compute Cluster <Azure Batch Pool>, and the Destination <Azure SQL DW>.
1. The Runner gets the list of feeds to process from the metadata DB
2. It creates a job
3. It creates a task for each feed
4. It adds the tasks to the job
5. It submits the job to the Azure Batch pool
Each task (Task 1 … Task N) then reads its feed's files (File 1, File 2, …) from Blob Storage, uses the metadata DB for the feed format, and loads the resulting datasets (DS 1, DS 2, …) into the destination (Azure SQL DW).
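Steps 1–5 of the Runner can be sketched as follows. The executable name, its parameter, and the feed list are illustrative assumptions; in the real solution the Azure Batch API is used to create the job and submit the tasks to the pool.

```python
def get_feeds_to_process():
    # Stand-in for step 1: query the metadata DB (Azure SQL DB)
    return ["CustomerFeed", "OrdersFeed", "ProductsFeed"]

def build_job(feeds):
    # Steps 2-4: create a job and one task per feed, with a command
    # line pointing at the task console app and its feed parameter
    job = {"job_id": "DataLoading", "tasks": []}
    for i, feed in enumerate(feeds, start=1):
        job["tasks"].append({
            "task_id": f"task{i}",
            "command_line": f"DataLoader.exe --feed {feed}",
        })
    return job

job = build_job(get_feeds_to_process())
# Step 5 would submit `job` to the Azure Batch pool
print([t["command_line"] for t in job["tasks"]])
```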
Parallel Data Loading with Azure Batch
Task Processing Steps
1. Get the feed format info from the metadata DB
2. Create the destination tables
3. Get the list of files to process
4. Load the parser class to use
5. For each file to process:
− Load the file content from Blob Storage
− Parse the file content into a DataTable
− Dump the DataTable content to the destination (DW)
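A single task's steps can be sketched end-to-end on an in-memory example. The metadata, parser, and destination here are stand-ins; the real solution reads from Blob Storage and bulk-copies into Azure SQL DW.

```python
# Step 1: feed format info, as the metadata DB would describe it
feed_format = {"delimiter": "|", "columns": ["id", "name", "amount"]}

# Steps 2-3: destination table and list of files (stand-ins)
destination = []          # plays the role of the DW table
files = {"CustomerFeed_20160101.txt": "1|alice|10\n2|bob|20"}

# Step 4: choose a parser based on the metadata
def parse_delimited(content, fmt):
    for line in content.splitlines():
        values = line.split(fmt["delimiter"])
        yield dict(zip(fmt["columns"], values))

# Step 5: for each file, load -> parse -> dump
for name, content in files.items():
    rows = list(parse_delimited(content, feed_format))  # the "DataTable"
    destination.extend(rows)                            # bulk-copy stand-in

print(destination[0])  # → {'id': '1', 'name': 'alice', 'amount': '10'}
```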
.NET Solution Structure
• Processing Logic (class library)
− Model
− Database services
− Blob Storage services
− Parsers
• Task (console app) – hosted in Azure Blob Storage
− Receives command-line parameters
− Performs the operation according to the supplied parameters
• Runner (console app) – runs on a host
− Azure Batch services
− Creates pools/jobs/tasks
Hosting the Azure Batch Runner
• None – one-off execution
• SQL Agent job (VM + SQL Server)
• SQL Server Integration Services (VM + SQL Server)
• Azure WebJob + Azure Scheduler (or on-demand)
• Azure Data Factory
• Azure Orchestration???
Code Walk-through
Code Walk-through
This is how we do it
• Solution Structure
• Azure Batch Bits
• Azure Blob Storage Bits
• Text File Processing
• XML & JSON (Quick and Dirty)
• SQL Bulk Copy with Retry Pattern
Code Walk-through
Solution Structure
Code Walk-through
Azure Batch Bits
Very useful if you want to sync with subsequent processing steps, i.e., start a subsequent step only when the job finishes.
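The synchronisation pattern hinted at here — starting the next step only when every task is done — is typically a poll over the job's task states. The snapshot list below simulates successive polls; the real code queries the Batch service for task states and sleeps between polls.

```python
def all_tasks_completed(task_states):
    return all(state == "Completed" for state in task_states)

# Simulated snapshots of the job's task states over successive polls
snapshots = [
    ["Running", "Running", "Active"],
    ["Completed", "Running", "Running"],
    ["Completed", "Completed", "Completed"],
]

polls = 0
for states in snapshots:        # in real code: loop + sleep between polls
    polls += 1
    if all_tasks_completed(states):
        break                   # safe to start the subsequent processing step

print(polls)  # → 3
```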
Code Walk-through
Azure Blob Storage Bits
Streaming is very efficient for processing large files, instead of downloading the whole file before processing it.
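The streaming point can be illustrated with a file-like stream processed line by line, so only one line is held in memory at a time. Here `io.StringIO` stands in for the blob read stream.

```python
import io

def count_records(stream):
    # Process the stream line by line instead of reading it all at once
    count = 0
    for line in stream:            # one line in memory at a time
        if line.strip():
            count += 1
    return count

blob_stream = io.StringIO("row1\nrow2\nrow3\n")
print(count_records(blob_stream))  # → 3
```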
Code Walk-through
Text File Parsing – FileHelpers Library
Parallel processing at the file level (a separate thread per line to parse).
Code Walk-through
XML & JSON File Parsing – Quick & Dirty
• The content of the whole file is loaded into a dataset
• Data cannot be flushed in batches
• Unlike streaming, this is a more memory-intensive approach
Code Walk-through
SQL Bulk Copy – Loading in Batches
Batch size < (available memory / record size)
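Loading in batches keeps memory bounded by flushing every N records; a generator makes the rule "batch size < available memory / record size" concrete. The sizes in the comment are illustrative.

```python
def batches(rows, batch_size):
    """Yield lists of at most batch_size rows, so at most one batch
    is held in memory before being flushed to the destination."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# e.g. 64 MB available / ~64 KB per record -> batch size of at most 1000
chunks = list(batches(range(2500), 1000))
print([len(c) for c in chunks])  # → [1000, 1000, 500]
```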
Code Walk-through
SQL Bulk Copy – Asynchronous
Code Walk-through
SQL Bulk Copy – Retry Pattern
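A minimal version of the retry pattern used around the bulk copy — retry transient failures with exponential backoff — might look like this. The delays, attempt count, and exception type are illustrative; the .NET code in the walkthrough retries SQL transient errors specifically.

```python
import time

def with_retry(operation, max_attempts=3, base_delay=0.01):
    """Run operation(), retrying on failure with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: propagate
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}

def flaky_bulk_copy():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "copied"

result = with_retry(flaky_bulk_copy)
print(result, calls["count"])  # → copied 3
```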
Some Important Notes – Polybase
• Since the destination database is Azure SQL DW, Polybase (a Big Data technology) is the best option for loading data from Blob Storage into it, by creating external tables that define the format of the data files.
• However, to use Polybase, the Blob Storage account needs to be locally redundant, and each folder should contain only one data file type.
• A pre-processing step moves the data files from the original Blob Storage (which might be geo-redundant) into a temporary, locally redundant Blob Storage account with a suitable folder structure.
• Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of a file into a one-column table, where T-SQL is used to parse it.
• If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate a workflow), Azure Batch is the right tool.
Useful Resources
Check these out…
• Azure Batch Documentation
https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview
• Azure Batch Explorer
https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer
• HPC and data orchestration using Azure Batch and Data Factory
https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch
• FileHelpers Library
http://www.filehelpers.net
• Retry Pattern
https://msdn.microsoft.com/en-us/library/dn589788.aspx
• Spinning up 16,000 A1 Virtual Machines on Azure Batch
https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch
• Parallel Computing
https://en.wikipedia.org/wiki/Parallel_computing
Acknowledgement
These guys are awesome…
Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome data landing solution for Argos. Nirav is currently the master of the landing solution.
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
− classification rule induction,
− decision tree construction,
− Bayesian classification modelling,
− data reduction,
− instance-based learning,
− evolving neural networks, and
− data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData.
• ResearchGate.org
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPT
Quality review (1)_presentation of this 21
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PDF
Mega Projects Data Mega Projects Data
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
Global journeys: estimating international migration
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
IB Computer Science - Internal Assessment.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Database Infoormation System (DBIS).pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Introduction to Business Data Analytics.
IBA_Chapter_11_Slides_Final_Accessible.pptx
Introduction to Knowledge Engineering Part 1
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Quality review (1)_presentation of this 21
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
Mega Projects Data Mega Projects Data
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Global journeys: estimating international migration
climate analysis of Dhaka ,Banglades.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
1_Introduction to advance data techniques.pptx
Business Acumen Training GuidePresentation.pptx

Microsoft Azure Batch

  • 1. | © Copyright 2016 Hitachi Consulting1 Microsoft Azure Batch High Performance Computing with an Application of Scalable Files Processing Khalid M. Salama, Ph.D. Business Insights & Analytics Hitachi Consulting UK We Make it Happen. Better.
  • 2. | © Copyright 2016 Hitachi Consulting2 Outline  What is Azure Batch and High Performance Computing?  When to Use Azure Batch?  Azure Batch Constructs  Scalable Data Loading Solution with Azure Batch  .NET Code Walk-through & Demo  Useful Resources
  • 3. | © Copyright 2016 Hitachi Consulting3 High Performance Computing
  • 4. | © Copyright 2016 Hitachi Consulting4 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure.
  • 5. | © Copyright 2016 Hitachi Consulting5 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 6. | © Copyright 2016 Hitachi Consulting6 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 7. | © Copyright 2016 Hitachi Consulting7 What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. On-demand – Pay for what you use Elastic – Scale up/down or shut down PaaS – No infrastructure configuration is needed Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 8. | © Copyright 2016 Hitachi Consulting8 Computing Example Job Job Sequential Processing
  • 9. | © Copyright 2016 Hitachi Consulting9 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing
  • 10. | © Copyright 2016 Hitachi Consulting10 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing Single Compute Unit
  • 11. | © Copyright 2016 Hitachi Consulting11 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 Sequential Processing Single Compute Unit Start T = 0
  • 12. | © Copyright 2016 Hitachi Consulting12 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 2 Sequential Processing Task 1 T = 1X Start T = 0 Single Compute Unit
  • 13. | © Copyright 2016 Hitachi Consulting13 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 3 Sequential Processing Task 1 T = 1X Start T = 0 Task 2 T = 2X Single Compute Unit
  • 14. | © Copyright 2016 Hitachi Consulting14 Computing Example Job Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 T = 1X Start T = 0 Task 2 T = 2X Task 3 T = 3X Task 4 T = 4X Task 5 T = 5X Task 6 T = 6X Sequential Processing End T = 6X+ Single Compute Unit
  • 15. | © Copyright 2016 Hitachi Consulting15 High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power
  • 16. | © Copyright 2016 Hitachi Consulting16 High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power Scale out Using multiple compute units Divide A job is decomposed into multiple independent tasks Distribute Tasks are processed on separate compute nodes, simultaneously
  • 17. | © Copyright 2016 Hitachi Consulting17 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing
  • 18. | © Copyright 2016 Hitachi Consulting18 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster
  • 19. | © Copyright 2016 Hitachi Consulting19 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 Task 2 Task 3 Task 4 Task 5 Task 6
  • 20. | © Copyright 2016 Hitachi Consulting20 Computing Example JobJob Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 T = 1X Start T = 0 Task 2 T = 1X Task 3 T = 1X Task 4 T = 1X Task 5 T = 1X Task 6 T = 1X End T = 1X+
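The timing story on slides 8–20 can be demonstrated with a small sketch (Python here purely for illustration; the deck's actual solution is .NET): six tasks of duration X take roughly 6X when run one after another on a single compute unit, but roughly 1X when spread across six parallel workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Simulate one compute-intensive task of duration X = 0.1 s.
    time.sleep(0.1)
    return n

# Sequential processing: total time ~ 6X
start = time.perf_counter()
for n in range(6):
    task(n)
sequential = time.perf_counter() - start

# Parallel processing on 6 "compute units": total time ~ 1X
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(task, range(6)))
parallel = time.perf_counter() - start
```

With 0.1 s tasks, `sequential` lands near 0.6 s and `parallel` near 0.1 s, mirroring the T = 6X vs. T = 1X comparison on the slides.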
  • 21. | © Copyright 2016 Hitachi Consulting21 Big Data vs. Big Compute The big brothers Big Data  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight
  • 22. | © Copyright 2016 Hitachi Consulting22 Big Data vs. Big Compute The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 23. | © Copyright 2016 Hitachi Consulting23 Big Data vs. Big Compute Big Data Processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 24. | © Copyright 2016 Hitachi Consulting24 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications Use cases for Big Compute
  • 25. | © Copyright 2016 Hitachi Consulting25 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading Use cases for Big Compute
  • 26. | © Copyright 2016 Hitachi Consulting26 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously Use cases for Big Compute
  • 27. | © Copyright 2016 Hitachi Consulting27 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! Use cases for Big Compute
  • 28. | © Copyright 2016 Hitachi Consulting28 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! For applications that need task-to-task interaction, Message Passing Interfaces (MPI) are supported in Azure Batch – Distributed Processing In some cases, communication between tasks can be managed via a shared data store – Parallel Processing Use cases for Big Compute
  • 29. | © Copyright 2016 Hitachi Consulting29 Azure Batch
  • 30. | © Copyright 2016 Hitachi Consulting30 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 31. | © Copyright 2016 Hitachi Consulting31 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 32. | © Copyright 2016 Hitachi Consulting32 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 33. | © Copyright 2016 Hitachi Consulting33 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 34. | © Copyright 2016 Hitachi Consulting34 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 35. | © Copyright 2016 Hitachi Consulting35 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 36. | © Copyright 2016 Hitachi Consulting36 Compute Size Resource Default Maximum Limit Azure Batch Account 1 50 Pools per Batch Account 20 5000 Cores per Batch Account 20 N/A Tasks per Compute Node 1 4 × node cores Number of Nodes vs Node Size:  Many small nodes → many tasks, not compute/memory intensive  Few big nodes → few tasks, compute/memory intensive (potential multi-threading per task)  Task queueing is automatically managed by Azure Batch Azure Batch Account  Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 37. | © Copyright 2016 Hitachi Consulting37 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task
  • 38. | © Copyright 2016 Hitachi Consulting38 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher priority job will be executed (status = “Running”)  The first 3 added tasks of the lower priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node
  • 39. | © Copyright 2016 Hitachi Consulting39 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher priority job will be executed (status = “Running”)  The first 3 added tasks of the lower priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node  If a job was already running (status = “Running”) and a higher priority job is submitted to the same pool: − Azure Batch will “pause” tasks of the lower priority job (status = “Suspended”) to free resources (cores) for the higher priority job, − then resume them when resources become available
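The queueing behaviour on slides 37–39 can be sketched with a toy scheduler (Python, purely illustrative; this is not the real Batch scheduler, which also handles suspension and resumption): with 10 single-core nodes and two 7-task jobs, the higher-priority job's tasks all run, the lower-priority job gets the 3 leftover cores, and 4 tasks queue as “Active”.

```python
def schedule(total_cores, jobs):
    # jobs: list of (job_name, priority, task_count); each task needs one core.
    # Higher-priority jobs get cores first; leftover tasks are queued.
    running, active = [], []
    free_cores = total_cores
    for name, priority, count in sorted(jobs, key=lambda j: -j[1]):
        for i in range(1, count + 1):
            task_id = f"{name}-task{i}"
            if free_cores > 0:
                running.append(task_id)   # status = "Running"
                free_cores -= 1
            else:
                active.append(task_id)    # status = "Active" (queued)
    return running, active

# 10 single-core nodes, two jobs with 7 tasks each:
running, active = schedule(10, [("low", 1, 7), ("high", 2, 7)])
```

This reproduces the slide's split: 10 tasks “Running” (all 7 high-priority plus the first 3 low-priority) and 4 low-priority tasks “Active”.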
  • 40. | © Copyright 2016 Hitachi Consulting40 Use Case: Parallel Data Files Loading
  • 41. | © Copyright 2016 Hitachi Consulting41 Parallel Data Loading with Azure Batch  Source data is a set of files, with different formats (Fixed width, Delimited, XML, JSON, Mainframe, Other), in Azure Blob Storage  Blob Storage Structure: “<DataDomain>/<DataFeed>/<DataFeed>_<Timestamp>.<ext>”  200+ data feeds, each produces 1-3 files daily  Data feed formats (columns, data types, file format) are described in MetadataDB (Azure SQL DB)  The objective is to build a Data Loading Solution to:  Parse the files and load them into a database (Azure SQL DW)  Be scalable – used for ongoing data loading and history data migration  Be metadata driven – new data feeds can be handled by the solution by adding metadata  Log execution history and errors Problem Context
  • 42. | © Copyright 2016 Hitachi Consulting42 Parallel Data Loading with Azure Batch The task (unit of parallelization, or granule) can be:  Processing a Feed  balanced number of files/file sizes in each feed  loading files in sequence  files can be processed simultaneously on the same node using multithreading (CPU/Memory implications)  Processing a File  no file sequence is needed  fine grained, more control, better utilization of resources  less manageable (many tasks per job)  Processing a File Line  multithreading on the same node. Parallelism Level
  • 43. | © Copyright 2016 Hitachi Consulting43 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Destination <Azure SQL DW> Metadata <Azure SQL DB>
  • 44. | © Copyright 2016 Hitachi Consulting44 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process Destination <Azure SQL DW>
  • 45. | © Copyright 2016 Hitachi Consulting45 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process 2 – Create a Job 3 – Create a task for each feed 4 – add the tasks to the job 5 – Submit the job Metadata <Azure SQL DB> Destination <Azure SQL DW>
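Steps 2–5 of the runner on this slide (create a job, create a task for each feed, add the tasks, submit) can be sketched as follows. This is an illustrative Python sketch with an in-memory stand-in client; the deck's actual runner is a .NET console app using the Azure Batch SDK, and the `FakeBatchClient`, `FeedLoader.exe` command line, and `"daily-load"` job id are all hypothetical.

```python
class FakeBatchClient:
    # Hypothetical in-memory stand-in for the Azure Batch client;
    # the real runner talks to a Batch account and pool via the .NET SDK.
    def __init__(self):
        self.jobs = {}

    def create_job(self, job_id):
        self.jobs[job_id] = []

    def add_task(self, job_id, command_line):
        self.jobs[job_id].append(command_line)

def submit_feed_job(client, job_id, feeds):
    # 2 - create a job; 3 - create a task per feed; 4 - add the tasks.
    # With the real SDK, committing the job makes it live on the pool (5).
    client.create_job(job_id)
    for feed in feeds:
        client.add_task(job_id, f"FeedLoader.exe --feed {feed}")
    return len(client.jobs[job_id])

client = FakeBatchClient()
task_count = submit_feed_job(client, "daily-load", ["Feed1", "Feed2", "FeedN"])
```

Each task's command line carries the feed name, so every compute node knows which feed to process once the task is scheduled.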
  • 46. | © Copyright 2016 Hitachi Consulting46 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Task 1 Task 2 Task N Destination <Azure SQL DW>
  • 47. | © Copyright 2016 Hitachi Consulting47 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . File 1 File 2 . . . DS 1 DS 2 . . . Task 1 Task 2 Task N Metadata <Azure SQL DB> Destination <Azure SQL DW>
  • 48. | © Copyright 2016 Hitachi Consulting48 Parallel Data Loading with Azure Batch Task Processing Steps Get feed format Info from Metadata
  • 49. | © Copyright 2016 Hitachi Consulting49 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Task Processing Steps
  • 50. | © Copyright 2016 Hitachi Consulting50 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Task Processing Steps
  • 51. | © Copyright 2016 Hitachi Consulting51 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use Task Processing Steps
  • 52. | © Copyright 2016 Hitachi Consulting52 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Task Processing Steps
  • 53. | © Copyright 2016 Hitachi Consulting53 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Task Processing Steps
  • 54. | © Copyright 2016 Hitachi Consulting54 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Task Processing Steps
  • 55. | © Copyright 2016 Hitachi Consulting55 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Dump DataTable content to destination (DW) Task Processing Steps
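The steps built up on slides 48–55 amount to one per-task pipeline. A Python sketch with stand-in metadata, blob-storage, and destination objects (all class and path names here are hypothetical; the real task is a .NET console app):

```python
steps = []  # records the order of operations, for illustration only

class Metadata:
    def get_feed_format(self, feed):
        steps.append("get-format")
        # A trivial "parser class": comma-delimited lines to rows.
        return {"parser": lambda text: [line.split(",") for line in text.splitlines()]}

class BlobStore:
    def list_files(self, feed):
        steps.append("list-files")
        return [f"Sales/{feed}/{feed}_20160101.csv"]

    def read(self, name):
        steps.append("read-file")
        return "1,a\n2,b"

class Destination:
    def __init__(self):
        self.rows = []

    def create_tables(self, fmt):
        steps.append("create-tables")

    def bulk_copy(self, table):
        steps.append("bulk-copy")
        self.rows.extend(table)

def process_feed(feed, metadata, blob_store, destination):
    fmt = metadata.get_feed_format(feed)   # get feed format info from metadata
    destination.create_tables(fmt)         # create destination tables
    files = blob_store.list_files(feed)    # get list of files to process
    parser = fmt["parser"]                 # load the parser class to use
    for name in files:                     # for each file to process:
        content = blob_store.read(name)    #   load file content from blob storage
        table = parser(content)            #   parse content into rows ("DataTable")
        destination.bulk_copy(table)       #   dump the rows to the destination (DW)

dest = Destination()
process_feed("Feed1", Metadata(), BlobStore(), dest)
```

The `steps` list makes the ordering explicit: format lookup, table creation, and file listing happen once per feed, while read/parse/copy repeat per file.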
  • 56. | © Copyright 2016 Hitachi Consulting56 .NET Solution Structure • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 57. | © Copyright 2016 Hitachi Consulting57 .NET Solution Structure }Azure Blob Storage } A Host • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 58. | © Copyright 2016 Hitachi Consulting58 Hosting Azure Batch Runner None! – One-off execution SQL Agent Job (VM + SqlServer) SQL Server Integration Services (VM + SqlServer) Azure WebJob + Azure Scheduler (or on-demand) Azure Data Factory Azure Orchestration???
  • 59. | © Copyright 2016 Hitachi Consulting59 Code Walk-through
  • 60. | © Copyright 2016 Hitachi Consulting60 Code Walk-through  Solution Structure  Azure Batch Bits  Azure Blob Storage Bits  Text File Processing  XML & JSON – (Quick and Dirty)  SQL Bulk Copy with Retry Pattern This is how we do it
  • 61. | © Copyright 2016 Hitachi Consulting61 Code Walk-through Solution Structure
  • 62. | © Copyright 2016 Hitachi Consulting62 Code Walk-through Azure Batch Bits Very useful if you want to sync with subsequent processing steps. I.e., start a subsequent step only when the job finishes.
  • 63. | © Copyright 2016 Hitachi Consulting63 Code Walk-through Azure Batch Bits
  • 64. | © Copyright 2016 Hitachi Consulting64 Code Walk-through Azure Batch Bits
  • 65. | © Copyright 2016 Hitachi Consulting65 Code Walk-through Azure Blob Storage Streaming is very efficient for processing large files, as it avoids downloading the whole file before processing it
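The streaming idea on this slide can be sketched as follows (illustrative Python; the deck streams from Blob Storage in .NET). The parser consumes the stream one line at a time and flushes parsed rows in small batches, so memory use stays flat regardless of file size; the `parse_stream` helper and the comma-delimited format are assumptions for the sketch.

```python
import io

def parse_stream(stream, flush, batch_size=2):
    # Read the stream one line at a time and flush parsed rows in small
    # batches, instead of materializing the whole file in memory.
    batch = []
    for line in stream:  # iterating a text stream yields one line at a time
        batch.append(line.rstrip("\n").split(","))
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # flush the final partial batch

# A small in-memory stream standing in for a blob download stream:
flushed = []
parse_stream(io.StringIO("1,a\n2,b\n3,c\n"), flushed.append)
```

The same loop works unchanged whether `stream` wraps a 1 KB test string or a multi-gigabyte blob download.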
  • 66. | © Copyright 2016 Hitachi Consulting66 Code Walk-through Text File Parsing – FileHelpers Library Parallel processing at the file level (a separate thread per line to parse)
  • 67. | © Copyright 2016 Hitachi Consulting67 Code Walk-through XML & JSON Files Parsing – Quick & Dirty • The content of the whole file is loaded into a dataset • Cannot flush data in batches • Unlike streaming, it is a more memory-intensive approach
  • 68. | © Copyright 2016 Hitachi Consulting68 Code Walk-through SQL Bulk Copy – Loading in Batches Batch size < (available memory / record size)
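The rule of thumb on this slide (batch size < available memory / record size) could be applied with a small helper like the following. This is a hypothetical sketch, not code from the deck; `safe_batch_size` and the hard cap are assumptions, and the real solution sets the equivalent value on SqlBulkCopy's batch size.

```python
def safe_batch_size(available_memory_bytes, record_size_bytes, hard_cap=100_000):
    # Apply the slide's rule: batch size < available memory / record size,
    # with a hard cap so batches stay reasonably small, and a floor of 1.
    limit = available_memory_bytes // record_size_bytes
    return max(1, min(hard_cap, limit - 1))

# e.g. ~512 MB of headroom and ~2 KB per record:
size = safe_batch_size(512 * 1024 ** 2, 2 * 1024)
```

With 512 MB of headroom and 2 KB records the memory limit (~262k rows) far exceeds the cap, so the cap wins; with tiny headroom the floor of 1 keeps the copy progressing row by row.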
  • 69. | © Copyright 2016 Hitachi Consulting69 Code Walk-through SQL Bulk Copy – Asynchronous
  • 70. | © Copyright 2016 Hitachi Consulting70 Code Walk-through SQL Bulk Copy – Retry Pattern
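The retry pattern referenced here can be sketched as follows (illustrative Python; the deck implements it in .NET around SqlBulkCopy): retry only transient failures, back off exponentially between attempts, and re-raise once the attempts are exhausted. The `with_retry` helper and the `flaky_bulk_copy` stand-in are assumptions for the sketch.

```python
import time

def with_retry(operation, attempts=4, base_delay=0.01, transient=(ConnectionError,)):
    # Retry-pattern sketch: retry transient failures with exponential
    # backoff, re-raising once the attempts are exhausted.
    for attempt in range(attempts):
        try:
            return operation()
        except transient:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 10 ms, 20 ms, 40 ms, ...

calls = {"count": 0}

def flaky_bulk_copy():
    # Fails twice with a transient error, then succeeds - a stand-in for
    # a bulk copy hitting a throttled Azure SQL DW connection.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retry(flaky_bulk_copy)
```

Catching only the `transient` exception types matters: a permanent error (bad schema, bad credentials) should fail fast rather than burn through retries.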
  • 71. | © Copyright 2016 Hitachi Consulting71 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.
  • 72. | © Copyright 2016 Hitachi Consulting72 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.
  • 73. | © Copyright 2016 Hitachi Consulting73 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.
  • 74. | © Copyright 2016 Hitachi Consulting74 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of the file into a one-column table, where T-SQL is used to parse it.
  • 75. | © Copyright 2016 Hitachi Consulting75 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data file.  However, to use Polybase, the Blob Storage needs to be locally-redundant, and each folder should have only one data file type.  A pre-processing step is to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of the file into a one-column table, where T-SQL is used to parse it.  If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate a workflow), Azure Batch is the right tool.
  • 76. | © Copyright 2016 Hitachi Consulting76 Useful Resources Check these out… • Azure Batch Documentation https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview • Azure Batch Explorer https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer • HPC and data orchestration using Azure Batch and Data Factory https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch • FileHelpers Library http://www.filehelpers.net • Retry Pattern https://msdn.microsoft.com/en-us/library/dn589788.aspx • Spinning up 16,000 A1 Virtual Machines on Azure Batch https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch • Parallel Computing https://en.wikipedia.org/wiki/Parallel_computing
  • 77. | © Copyright 2016 Hitachi Consulting77 Acknowledgement These guys are awesome… Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome Data Landing Solution for Argos. Nirav is currently the master of the landing solution 
  • 78. | © Copyright 2016 Hitachi Consulting78 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org