Adnan Masood, PhD
Systems Architect / Software Engineer
adnan.masood@owasp.org (http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood), Twitter (@adnanmasood).
Presented at Microsoft Data Science Group - Tampa Analytics Professionals
http://www.meetup.com/Analytics-Professionals-of-Tampa/events/228796343/
Data Science with Windows Azure - A Brief Introduction
About the Speaker
Adnan Masood, Ph.D. is a developer, software architect, and researcher who specializes in machine
learning and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid
financial technology institution), he was a principal engineer at a start-up and worked for a
leading UK-based nonprofit organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the Open Web
Application Security Project (OWASP), an organization dedicated to software security. In the .NET
community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been
successfully leading for 8 years. He has led a number of successful enterprise solutions and consulted
on projects for several Fortune 500 companies.
Adnan devotes himself to his own continual, practical education. He holds certifications in big data,
machine learning, and systems architecture from Massachusetts Institute of Technology; an Application
Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon
University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions
Developer, and Sun Certified Java Developer.
Key Takeaways from this Talk
Understand what Microsoft offers for data science in Windows Azure (or
how to write MapReduce jobs in C#).
Diagrams are Courtesy of Microsoft Corporation
6/16/2015
What is Hadoop?
• At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and others built an open source counterpart, the Hadoop Distributed File System (HDFS), modeled on Google's published description of GFS; much of the early work was done at Yahoo!.
• The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
• Hadoop is open source and distributed by Apache.
MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
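To make the definition concrete, here is a minimal single-process sketch of the three phases a MapReduce framework performs: map, shuffle (group by key), and reduce. The helper name `run_map_reduce` and the sample mapper and reducer are hypothetical, chosen only for illustration; on a real cluster the map and reduce calls would run in parallel on many nodes.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Simulate map, shuffle, and reduce in one process (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: fold each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def length_mapper(word):
    # Key each word by its length so the reducer can count words per length.
    yield (len(word), 1)


def count_reducer(key, values):
    return sum(values)


result = run_map_reduce(["map", "reduce", "job", "node"], length_mapper, count_reducer)
print(result)  # {3: 2, 6: 1, 4: 1}
```

Because every key is reduced independently of the others, the reduce calls can be spread across nodes without coordination, which is what makes a problem "parallelizable" in this sense.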
Classes of problems “mapreducable”
• Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
• Google uses it for word count, AdWords, PageRank, and indexing data.
• Simple algorithms such as grep, text indexing, and reverse (inverted) indexing
• Bayesian classification: data mining domain
• Facebook uses it for various operations, such as demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extra-terrestrial objects
• Expected to play a critical role in the semantic web and in web 3.0
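As one illustration of the indexing problems listed above, the following sketch builds an inverted index (word to list of documents) in map/reduce style. All function names and the sample documents are hypothetical; the shuffle step is simulated with an in-memory dictionary rather than a cluster.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield (word, doc_id)


def index_reducer(word, doc_ids):
    # The posting list for a word is the sorted set of documents containing it.
    return sorted(set(doc_ids))


def build_inverted_index(documents):
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in index_mapper(doc_id, text):
            postings[word].append(emitted_id)  # stand-in for the shuffle step
    return {word: index_reducer(word, ids) for word, ids in postings.items()}


docs = {"d1": "hadoop stores data", "d2": "spark reads hadoop data"}
index = build_inverted_index(docs)
print(index["hadoop"])  # ['d1', 'd2']
print(index["spark"])   # ['d2']
```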
Apache Spark
• Apache Spark is an open source cluster computing framework originally developed in the AMPLab at UC Berkeley.
• Spark's in-memory processing can be up to 100 times faster than disk-based MapReduce for certain applications.
• Spark is well suited for machine learning algorithms.
• Spark requires a cluster manager and a distributed storage system.
• Spark supports Hadoop YARN as a cluster manager.
How Hadoop Operates
Example: counting the number of occurrences for each word
in a collection of documents
• The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
• (w1, 1), (w2, 1), ..., (wn, 1)
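The Map task described above can be sketched in a few lines of Python. The function name `map_document` and the sample sentence are assumptions made for illustration; the point is that a word appearing twice produces two separate pairs, and the counts are only added up later by the Reduce step.

```python
def map_document(document):
    """Break a document into words and emit (word, 1) for each occurrence."""
    return [(word, 1) for word in document.split()]


pairs = map_document("the cat saw the dog")
print(pairs)
# [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]

# A repeated word appears once per occurrence; the pairs are not yet combined.
# Adding up the 1s for each key is the job of the Reduce step.
totals = {}
for word, one in pairs:
    totals[word] = totals.get(word, 0) + one
print(totals)  # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}
```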
Key Players in Hadoop World
• Hortonworks
• Cloudera
• MapR
Hortonworks
• Hortonworks is a business software company based in Palo Alto, California.
• Hortonworks develops and supports the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers.
• It is a sponsor of the Apache Software Foundation.
• Founded in June 2011 by Yahoo and Benchmark Capital as an independent company, it went public in December 2014.
• Companies that have collaborated with Hortonworks:
  • Microsoft, in October 2011, to develop Hadoop for Azure and Windows Server
  • Informatica, in November 2011, to develop HParser
  • Teradata, in February 2012, to develop the Aster data system
  • SAP AG, which announced in September 2012 that it would resell the Hortonworks distribution
About Cloudera
• Cloudera is “The commercial Hadoop company”
• Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
• Provides consulting and training services for Hadoop users
• Staff includes several committers to Hadoop projects
HaaS example
Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based
platform for data analysis, with S3 as the storage system and EC2 as the compute system.
Microsoft HDInsight (the Windows Azure HDInsight Service), Cloudera CDH3, IBM
InfoSphere BigInsights, and EMC Greenplum HD are among the primary
HaaS offerings from global IT vendors.
What is MapReduce Used For?
• In research:
  • Analyzing Wikipedia conflicts (PARC)
  • Natural language processing (CMU)
  • Climate simulation (Washington)
  • Bioinformatics (Maryland)
  • Particle physics (Nebraska)
  • <Your application here>
Example: Word Count
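def mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all the counts collected for one word.
    return (key, sum(values))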
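For a version that runs end to end, the sketch below follows the convention used by Hadoop Streaming, where the mapper and reducer exchange tab-separated "key<TAB>value" text lines and the framework sorts the mapper output by key before the reducer sees it. The function names are hypothetical, and as a simplifying assumption the functions take iterables of lines instead of reading standard input, with `sorted()` standing in for the framework's sort.

```python
from itertools import groupby


def stream_mapper(lines):
    # Emit one "word<TAB>1" line per word, as a streaming mapper would print.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def stream_reducer(sorted_lines):
    # Input must be sorted by key, so equal words are adjacent.
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = stream_mapper(["to be or not", "to be"])
# sorted() stands in for the sort that the framework performs between phases.
counts = list(stream_reducer(sorted(mapped)))
print(counts)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

The reducer relies on sorted input: `groupby` only merges adjacent equal keys, so feeding it unsorted lines would produce several partial counts for the same word.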
Key Cloud Solution Providers for Hadoop as a Service
• Windows Azure
• AWS
• Google
Windows Azure
• Enterprise-level on-demand capacity builder
• A fabric of compute cycles and storage, available on request for a fee
• You have to use the Azure API to work with the infrastructure offered by Microsoft
• Significant features: web roles, worker roles, blob storage, table storage, and drive storage
Amazon EC2
• EC2 provides an API for instantiating computing instances with any of the supported operating systems.
• Excellent distribution, load balancing, and cloud monitoring tools
Google App Engine
• App Engine offers reliability, availability, and scalability on par with Google’s own applications
MapReduce Engine
• MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
• Hadoop provides both: HDFS (the file system discussed earlier) and the JobTracker + TaskTracker system.
• The JobTracker is simply a scheduler.
• A TaskTracker is assigned Map or Reduce tasks (or other operations). The tasks run on the same node as the TaskTracker, and each task runs in its own JVM on that node.
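The division of labor between the two components can be sketched as a toy simulation. The classes below are hypothetical stand-ins that only borrow the names JobTracker and TaskTracker; they are not Hadoop's API, and the round-robin assignment is an assumption made to keep the example small.

```python
from itertools import cycle


class TaskTracker:
    """Runs the tasks assigned to one node (hypothetical, simplified)."""

    def __init__(self, node):
        self.node = node
        self.completed = []

    def run(self, task_id, func, split):
        # On a real cluster each task would run in its own JVM on this node.
        result = func(split)
        self.completed.append(task_id)
        return result


class JobTracker:
    """Schedules tasks across trackers; it does no data processing itself."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_map_phase(self, func, splits):
        # Round-robin assignment is an assumption made for this sketch.
        assignment = zip(cycle(self.trackers), enumerate(splits))
        return [tracker.run(task_id, func, split)
                for tracker, (task_id, split) in assignment]


trackers = [TaskTracker("node-a"), TaskTracker("node-b")]
job_tracker = JobTracker(trackers)
word_counts = job_tracker.run_map_phase(lambda s: len(s.split()),
                                        ["a b c", "d e", "f"])
print(word_counts)                      # [3, 2, 1]
print([t.completed for t in trackers])  # [[0, 2], [1]]
```

The scheduler never touches the data itself; it only decides which tracker runs which task, which mirrors the slide's point that the JobTracker is simply a scheduler.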
Building a Custom MapReduce Job in .NET
• A .NET map-reduce program comprises a number of parts:
  • Job definition
  • Mapper, Reducer, and Combiner classes
  • Input data
  • Job executor
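The four parts of a custom job (job definition, mapper/reducer/combiner, input data, and job executor) can be illustrated with a language-neutral sketch. The code below is Python rather than C#, and every name in it (`JobDefinition`, `execute`, and so on) is hypothetical; it does not reproduce the .NET SDK's classes. It is meant only to show how the parts fit together and what a combiner contributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobDefinition:
    """Hypothetical stand-in for a job definition: it names the job's parts."""
    mapper: Callable
    reducer: Callable
    combiner: Optional[Callable] = None


def word_mapper(line):
    for word in line.split():
        yield (word, 1)


def sum_values(key, values):
    # Summation is associative, so the same function can serve as the
    # combiner (per-split pre-aggregation) and as the reducer.
    return sum(values)


def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def execute(job, splits):
    """Hypothetical job executor: runs the job over a list of input splits."""
    shuffled = []
    for split in splits:
        pairs = [pair for line in split for pair in job.mapper(line)]
        if job.combiner is not None:
            # Combine locally so fewer pairs cross the (simulated) network.
            pairs = [(k, job.combiner(k, v)) for k, v in group_by_key(pairs).items()]
        shuffled.extend(pairs)
    counts = {k: job.reducer(k, v) for k, v in group_by_key(shuffled).items()}
    return counts, len(shuffled)


job = JobDefinition(mapper=word_mapper, reducer=sum_values, combiner=sum_values)
input_data = [["a b a", "a"], ["b b"]]  # two splits of input lines
counts, pairs_shuffled = execute(job, input_data)
print(counts)          # {'a': 3, 'b': 3}
print(pairs_shuffled)  # 3
```

Running the same job without a combiner gives the same counts but shuffles six pairs instead of three, which is the reason a combiner is worth defining when the reduce operation allows it.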
References & Further Reading
• https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
• https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
• https://azure.microsoft.com/en-us/services/machine-learning/
Questions

Editor's Notes

  • #37: The Map() function alone is enough for a simple calculation like determining square roots, so the Reducer class would not contain any processing logic in this case. You can omit it, because Reduce and Combine are optional operations in a MapReduce job. However, it is good practice to keep a skeleton Reducer class, which derives from the ReducerCombinerBase class; you can then add code to the overridden Reduce() method later if you need to implement any reduce operations.
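The idea in this note, a job whose map output is already the final answer, can be sketched as follows. This is Python rather than the .NET classes the note refers to, and `run_job` and `sqrt_mapper` are hypothetical names; the optional `reducer` parameter plays the role of the skeleton Reducer that can be filled in later.

```python
import math


def sqrt_mapper(number):
    # Each input value maps to exactly one output pair; no grouping is needed.
    yield (number, math.sqrt(number))


def run_job(inputs, mapper, reducer=None):
    """Hypothetical runner: with no reducer, the map output is the job output."""
    mapped = [pair for item in inputs for pair in mapper(item)]
    if reducer is None:
        return mapped
    groups = {}
    for key, value in mapped:
        groups.setdefault(key, []).append(value)
    return [(key, reducer(key, values)) for key, values in groups.items()]


print(run_job([4, 9, 16], sqrt_mapper))  # [(4, 2.0), (9, 3.0), (16, 4.0)]
```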