60 min
Big data on Azure for Architects
Data Complexity: Variety and Velocity
Terabytes (10^12)
Gigabytes (10^9)
Megabytes (10^6)
Petabytes (10^15)
Exabytes (10^18)
Volume Velocity
Variety Variability
Reduces
NoSQL:
• No cleansing!
• No ETL!
• No load!
• Analyze the data where it lands!
Store now, question later
RDBMS:
1. Data arrives
2. Derive a schema
3. Cleanse the data
4. Transform the data
5. Load the data
6. SQL queries
NoSQL:
1. Data arrives
2. Application program
HOW?? IF I DON’T KNOW THE STRUCTURE?
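One way to read "store now, question later" is schema-on-read: the raw lines are kept as they arrive and a structure is applied only when a question is asked. The sketch below is illustrative only; the sample events and the function name read_with_schema are assumptions, not part of the deck.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no cleansing, no ETL).
RAW_LINES = [
    '{"user": "anna", "action": "login", "ms": 120}',
    '{"user": "bob", "action": "view", "page": "/home"}',
    'not even json',
    '{"user": "anna", "action": "view", "page": "/cart", "ms": 80}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: skip unparseable lines, project the fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is dropped at query time, not at load time
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of failing the whole load.
        yield {name: record.get(name) for name in fields}


# The "question" is decided only now: which pages were viewed, and by whom?
views = [r for r in read_with_schema(RAW_LINES, ["user", "page"]) if r["page"]]
print(views)
```

The RDBMS flow would have rejected or repaired the third line before loading; here that decision is deferred to each query.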
Distributed Storage (HDFS)
Query (Hive)
Distributed Processing (MapReduce)
Data Integration (ODBC/SQOOP/REST)
Event Pipeline (Event Hub/Flume)
YARN
Legend:
Red = Core Hadoop
Blue = Data processing
Gray = Microsoft integration points and value adds
Orange = Data Movement
Green = Packages
Name Node
Data Node
HDFS API
DFS (1 Data Node per Worker Role) and Compute Cluster / VM
Azure Storage (WASB)
Benefits:
Data reuse and sharing
Data storage cost
Elastic scale-out
Geo-replication
…
Data Node
Most important benefit:
Data is INDEPENDENT of the cluster
And WASB is FAST…
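Because the data lives in Azure Storage rather than on the cluster, jobs address it by URI. The sketch below splits a wasb: URI into its parts, assuming the commonly documented form wasb[s]://container@account.blob.core.windows.net/path, where an empty authority (wasb:///...) means the cluster's default container. The function name parse_wasb and the account name myaccount are hypothetical.

```python
from urllib.parse import urlsplit


def parse_wasb(uri):
    """Split a wasb/wasbs URI into container, storage account host, and blob path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    return {
        "container": parts.username,     # None means the default container
        "account_host": parts.hostname,  # None means the default storage account
        "path": parts.path,
    }


# Path form used by the jobs in this deck: default container and account.
default_form = parse_wasb("wasb:///example/data/gutenberg/davinci.txt")

# Fully qualified form; "myaccount" is a made-up storage account name.
full_form = parse_wasb("wasb://data@myaccount.blob.core.windows.net/logs/a.log")

print(default_form)
print(full_form)
```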
SOSP Paper - Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
http://nasuni.com
Report link is here
[Diagram: Windows Azure Storage write path]
Incoming Write Request
Front End Layer: FE servers
Partition Layer: Partition Master, Partition Servers, Lock Service
Stream Layer: M, Paxos, Extent Nodes (EN)
Ack
[Diagram: Blob Index and Partition Map inside a Storage Stamp]
Blob Index, sorted by Account Name, Container Name, Blob Name:
Account Name   Container Name   Blob Name   Served by
aaaa           aaaa             aaaaa       PS 1
………            ………              ………         PS 1
harry          pictures         sunrise     PS 1
harry          pictures         sunset      PS 2
………            ………              ………         PS 2
richard        videos           soccer      PS 2
richard        videos           tennis      PS 3
………            ………              ………         PS 3
zzzz           zzzz             zzzzz       PS 3
Partition Map (kept by the Partition Master and the Front-End Server):
A-H: PS1
H’-R: PS2
R’-Z: PS3
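The partition map on this slide is a range lookup over the sorted (account, container, blob) key. A minimal sketch of that lookup follows, using the example rows from the slide as range boundaries; the bisect-based approach and the names RANGE_STARTS and lookup are illustrative assumptions, not the actual Azure Storage implementation.

```python
from bisect import bisect_right

# Lowest key served by each partition server, taken from the slide's example rows.
RANGE_STARTS = [
    ("aaaa", "aaaa", "aaaaa"),         # PS1 serves A-H
    ("harry", "pictures", "sunset"),   # PS2 serves H'-R
    ("richard", "videos", "tennis"),   # PS3 serves R'-Z
]
SERVERS = ["PS1", "PS2", "PS3"]


def lookup(account, container, blob):
    """Return the partition server whose key range contains the blob."""
    key = (account, container, blob)
    # Find the last range whose starting key is <= the requested key.
    index = bisect_right(RANGE_STARTS, key) - 1
    if index < 0:
        raise KeyError(f"{key} sorts before the first range")
    return SERVERS[index]


print(lookup("harry", "pictures", "sunrise"))
print(lookup("harry", "pictures", "sunset"))
print(lookup("richard", "videos", "soccer"))
print(lookup("richard", "videos", "tennis"))
```

Because the map is small and sorted, a front-end server can route any request with a single binary search, and moving a boundary only changes the map, not the blobs.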
• Programming framework (library and runtime) for analyzing datasets stored in HDFS
• Composed of user-supplied Map and Reduce functions:
• Map() - subdivide and conquer
• Reduce() - combine and reduce cardinality
Do work() Do work() Do work() ………
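The Map() and Reduce() roles can be sketched in a few lines of single-process Python. This only illustrates the programming model (map, shuffle by key, reduce); it is not Hadoop's API, and the function names are assumptions.

```python
from collections import defaultdict


def map_words(line):
    """Map(): subdivide one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce(): combine all values for one key into a single result."""
    return word, sum(counts)


def run_job(lines):
    groups = defaultdict(list)
    for line in lines:                       # map phase, one call per input record
        for key, value in map_words(line):
            groups[key].append(value)        # shuffle: group values by key
    # reduce phase, one call per distinct key
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


result = run_job(["to be or not to be", "be quick"])
print(result)
```

On a real cluster the map calls run in parallel next to the data blocks and the shuffle moves data between nodes; the user-supplied functions keep this same shape.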
Mapper: context.write(word, one);
Reducer: context.write(key, new IntWritable(sum));
Input: wasb:///example/data/gutenberg/davinci.txt
Output: wasb:///example/data/WordCountOutput
Run in PowerShell:
Start-AzureHDInsightJob
Get-AzureStorageBlob
https://pltkhdc01.azurehdinsight.net:443/ambari/api/v1/clusters/pltkhdc01.azurehdinsight.net/services/yarn
• It’s important to check that the results generated by queries are realistic, valid, and useful, so that the solution delivers a better ROI.
• Automate tasks in a repeatable solution, and run the solution from a remote computer rather than directly from the cluster server desktop.
• There’s a huge range of tools that you can use with Hadoop, and choosing the most appropriate one can be difficult.
• If you decide to use a resource-intensive application such as HBase or Storm, you should consider running it on a separate cluster.
Data-flow platform to transform and analyze HDFS data
Scripting – No Java Needed!
Focus on semantics, not on implementation
Extensible through user-defined functions and methods
Pigs Eat Anything
Pig can operate on data whether it has metadata or not.
Pigs Live Anywhere
Pig is not tied to one particular parallel framework.
Pigs Are Domestic Animals
Pig is designed to be easily controlled. Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain.
Pigs Fly
Pig processes data quickly. The system automatically optimizes execution of Pig jobs, so the user can focus on semantics.
LOGS = LOAD 'wasb:///example/data/sample.log';
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
RESULT = order FREQUENCIES by COUNT desc;
DUMP RESULT;
STORE RESULT INTO 'tkR1';
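For readers who do not know Pig Latin, the same data flow (extract the level, drop lines without one, group, count, sort descending) can be approximated in plain Python. This is a rough single-machine equivalent for illustration; the sample log lines are made up and the function name level_frequencies is an assumption.

```python
import re
from collections import Counter

# Same alternation as in the Pig script's REGEX_EXTRACT call.
LEVEL_PATTERN = re.compile(r"(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)")


def level_frequencies(lines):
    """Count log levels and return (level, count) pairs, most frequent first."""
    matches = (LEVEL_PATTERN.search(line) for line in lines)          # LEVELS
    counts = Counter(m.group(1) for m in matches if m is not None)    # FILTER + GROUP + COUNT
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)  # ORDER ... DESC


# Hypothetical sample input; the real job reads sample.log from WASB.
sample = [
    "2012-02-03 18:35:34 SampleClass6 [INFO] everything normal",
    "2012-02-03 18:35:35 SampleClass4 [ERROR] something failed",
    "a line without any level",
    "2012-02-03 18:35:36 SampleClass6 [INFO] still normal",
]
frequencies = level_frequencies(sample)
print(frequencies)
```

Pig turns the same steps into distributed jobs, which is why the script names each intermediate relation.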
Check the result in PowerShell
Hadoop 2.0
What is Machine Learning (ML)
Solve extremely hard problems
Extract more value from Big Data
Drive a shift in business analytics
Machine Learning Process Model
Based on the CRISP-DM Model
Stages shown in the diagram: Idea, Business Knowledge, Data Understanding, Data, Data Preparation, Modelling, Evaluation, Publish
Volume, batch processing
Events, real-time processing
Relay: NAT and Firewall Traversal Service; Request/Response Services; Unbuffered with TCP Throttling; Hybrid Connection
Queue: Transactional Cloud AMQP/HTTP Broker; High-Scale, High-Reliability Messaging; Sessions, Scheduled Delivery, etc.
Topic: Transactional Message Distribution; Up to 2000 subscriptions per Topic; Up to 2K/100K filter rules per subscription
Notification Hub: High-scale notification distribution; Most mobile push notification services; Millions of notification targets
Event Hub: EVENTS, MASSIVE SCALE
Event Producers: > 1M producers, > 1 GB/sec aggregate throughput
Partitions: events are sent directly to a partition or routed by PartitionKey hash
Throughput Units:
• 1 ≤ TUs ≤ Partition Count
• TU: 1 MB/s writes, 2 MB/s reads
• We pay per TU
AMQP 1.0
Credit-based flow control
Client-side cursors
Offset by Id or Timestamp
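The throughput-unit rules on this slide translate into simple sizing arithmetic, sketched below together with key-based partition routing. The per-TU limits come from the slide; the function names are hypothetical, and crc32 is only a stand-in hash, since the hash Event Hubs actually uses is not given in the deck.

```python
import math
import zlib

INGRESS_MB_S_PER_TU = 1  # from the slide: 1 MB/s writes per TU
EGRESS_MB_S_PER_TU = 2   # from the slide: 2 MB/s reads per TU


def required_throughput_units(ingress_mb_s, egress_mb_s, partition_count):
    """Smallest TU count covering both directions, within 1 <= TUs <= partitions."""
    needed = max(
        1,
        math.ceil(ingress_mb_s / INGRESS_MB_S_PER_TU),
        math.ceil(egress_mb_s / EGRESS_MB_S_PER_TU),
    )
    if needed > partition_count:
        raise ValueError("load exceeds what this partition count allows")
    return needed


def partition_for(partition_key, partition_count):
    """Route a key to a partition; the same key always lands on the same partition."""
    # crc32 is an illustrative choice, not the service's real hash function.
    return zlib.crc32(partition_key.encode("utf-8")) % partition_count


tus = required_throughput_units(ingress_mb_s=3.5, egress_mb_s=4, partition_count=8)
print(tus)
print(partition_for("device-42", 8) == partition_for("device-42", 8))
```

Since billing follows the TU count, the write rate usually drives cost: in this example 3.5 MB/s of writes needs 4 TUs while 4 MB/s of reads would need only 2.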
[Diagram: event processing pipeline]
Event producers: Applications; Legacy IoT (custom protocols); Devices: IP-capable devices (Windows/Linux), Low-power devices (RTOS)
Collection: Cloud gateways (web APIs); Field gateways
Ingestor (broker): Event hubs
Transformation: Storage adapters; Stream processing (Stream Analytics, Storm, IEventProcessor)
Long-term storage: Azure DBs; Azure storage; HDInsight
Presentation and action: Search and query; Data analytics (Excel); Web/thick client dashboards; Service bus; Devices to take action
Daughter jumping in garage
Me with compressed (cold) air
Me with small dryer
* The tick tuple scheme is Storm’s built-in mechanism for generating tuples and sending them to each bolt in the topology at specified intervals.
Worth checking: https://storm.apache.org/apidocs/backtype/storm/topology/TopologyBuilder.BoltGetter.html
EventHubSpout
spoutConfig.getPartitionCount
PartialCountBolt
EventHubSpout
DBGlobalCountBolt
collector.emit
collector.ack
db.insertValue(System.currentTimeMillis(), partialCount);
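The names above suggest a topology in which a partial-count bolt accumulates events and, on each tick tuple, emits its running count to a global-count bolt that writes a timestamped row. The following is a plain-Python simulation of that idea under those assumptions; it does not use Storm's API, GlobalCountBolt only mimics the role of DBGlobalCountBolt with a list instead of a database, and the fake clock is for determinism.

```python
TICK = object()  # stand-in for Storm's periodic tick tuple


class PartialCountBolt:
    """Counts events between ticks; emits and resets the count on each tick."""

    def __init__(self):
        self.partial_count = 0

    def execute(self, tup):
        if tup is TICK:
            flushed, self.partial_count = self.partial_count, 0
            return flushed          # plays the role of collector.emit
        self.partial_count += 1     # ordinary event from the spout
        return None


class GlobalCountBolt:
    """Stores one (timestamp, partial count) row per flush, like db.insertValue."""

    def __init__(self):
        self.rows = []

    def execute(self, partial_count, now_ms):
        self.rows.append((now_ms, partial_count))

    @property
    def total(self):
        return sum(count for _, count in self.rows)


def run_topology(stream):
    partial, global_bolt = PartialCountBolt(), GlobalCountBolt()
    clock_ms = 0  # fake clock so the run is deterministic
    for tup in stream:
        clock_ms += 1000
        emitted = partial.execute(tup)
        if emitted is not None:
            global_bolt.execute(emitted, clock_ms)
    return global_bolt


result = run_topology(["e1", "e2", TICK, "e3", TICK])
print(result.rows, result.total)
```

Batching on ticks means the database sees one write per interval per bolt instance rather than one write per event.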
[Diagram: Azure data services, repeated once per data journey]
Data Sources: Feeds; IoT
Orchestration: Service bus; Event Hub; Data Factory
Compute: Stream Analytics; HDInsight; Machine Learning; Virtual Machines
Storage: Table Storage; Blob Storage; SQL Azure; DocumentDB
Visualisation: Power BI
Diagram titles, in order: Data Journeys; Predictive Analytics; Near real time analysis; Big Data; “Traditional” BI
tkopacz@microsoft.com
Service Fabric
Runs on: Azure (Windows Server, Linux); Hosted Clouds (Windows Server, Linux); Private Clouds (Windows Server, Linux)
Capabilities: High Availability; Hyper-Scale; Hybrid Operations; High Density; Microservices; Rolling Upgrades; Stateful services; Low Latency; Fast startup & shutdown; Container Orchestration & lifecycle management; Replication & Failover; Simple programming models; Load balancing; Self-healing; Data Partitioning; Automated Rollback; Health Monitoring; Placement Constraints