MapReduce in the Clouds for Science
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
{tgunarat, taklwu, xqiu, gcf}@indiana.edu
CloudCom 2010, Nov 30 – Dec 3, 2010
Introduction
  • Cloud computing combined with cloud infrastructure services
      • A very viable alternative for scientists
  • MapReduce frameworks
      • Scalability
      • Excellent fault tolerance features
      • Ease of use
  • Several options for using MapReduce in cloud environments
      • MapReduce as a service
      • Setting up a MapReduce cluster on cloud instances
      • Specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services
Introduction
  • Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments
      • Sequence alignment
      • Sequence assembly
  • AzureMapReduce
      • Provides a decentralized, on-demand MapReduce framework
      • Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services
  • Sustained performance of clouds
Platforms
  • Apache Hadoop
      • On bare metal
      • On EC2
  • Amazon Web Services
      • Elastic MapReduce
  • Microsoft Azure
      • AzureMapReduce
Challenges for MapReduce in the clouds
  • Data storage
  • Reliability
  • Master node
  • Metadata storage
  • Performance consistency
  • Communication consistency and scalability
  • CPU performance
  • Choosing suitable instance types
  • Logging
AzureMapReduce
  • Built using Azure cloud services
  • Distributed, highly scalable & highly available services
  • Minimal management / maintenance overhead
  • Reduced footprint
  • Co-exists with the eventual consistency & high latency of cloud services
  • Decentralized control
AzureMapReduce Features
  • Ability to dynamically scale up/down
  • Familiar programming model
  • Fault tolerance
  • Easy testing and deployment
  • Combiner step
  • Web-based monitoring console
AzureMapReduce Architecture
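Per the editor's notes for the architecture slides, a client driver places tasks on queues, workers pick them up, fetch input from blob storage, and record status in a task monitoring table. A minimal single-process sketch of that flow is given here. It uses in-memory stand-ins rather than the Azure Queue, Table, and Blob services, and every name in it (`map_queue`, `task_table`, `blob_store`, `client_submit`, `map_worker`) is hypothetical rather than taken from AzureMapReduce.

```python
import queue

# In-memory stand-ins (assumptions): a real deployment would use Azure Queue,
# Table, and Blob storage. None of these names come from AzureMapReduce.
map_queue = queue.Queue()   # stands in for the map task queue
task_table = {}             # stands in for the task monitoring table
blob_store = {"in/0": "a b a", "in/1": "b c"}  # stands in for Blob storage


def client_submit(input_keys):
    """Client driver: register each map task, then enqueue it."""
    for task_id, key in enumerate(input_keys):
        task_table[task_id] = "queued"
        map_queue.put((task_id, key))


def map_worker(map_fn):
    """Worker: pull tasks until the queue is empty, updating status as it goes."""
    while True:
        try:
            task_id, key = map_queue.get_nowait()
        except queue.Empty:
            return
        task_table[task_id] = "running"
        # Download the input, process it, and upload the result.
        blob_store[f"out/{task_id}"] = map_fn(blob_store[key])
        task_table[task_id] = "finished"


def word_count(text):
    """Example map function: count the words in one input."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


client_submit(["in/0", "in/1"])
map_worker(word_count)
```

Because workers only talk to the queue and the table, any number of them could run this loop side by side without a master node, which is the sense of "decentralized control" used in the slides.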
AzureMapReduce Architecture
  • The Sort & Reduce phases of a reduce task start
      • when all the map tasks are finished, and
      • when the reduce task has finished downloading all of its intermediate data products
  • There is no guarantee of when all the intermediate data will appear in the task tables
  • Each map task stores the number of intermediate data products it generated for each reduce task
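The start condition above can be sketched as a small check. This is an illustration under assumptions, not the AzureMapReduce implementation: the table layout (a `status` field and a per-reducer `products` count for each map task) and the function names are hypothetical. The point is that the per-map product counts tell a reducer how many downloads to expect, so it does not have to trust that an eventually consistent table already lists every entry.

```python
def expected_products(map_task_table, reduce_id):
    """Sum the per-reducer product counts recorded by the map tasks."""
    return sum(entry["products"].get(reduce_id, 0)
               for entry in map_task_table.values())


def reduce_can_start(map_task_table, reduce_id, downloaded):
    """True once every map task is finished and all products are downloaded."""
    all_maps_done = all(entry["status"] == "finished"
                        for entry in map_task_table.values())
    if not all_maps_done:
        return False
    return len(downloaded) == expected_products(map_task_table, reduce_id)


# Hypothetical snapshot: one map task finished, one still running.
map_task_table = {
    "m0": {"status": "finished", "products": {"r0": 1, "r1": 1}},
    "m1": {"status": "running", "products": {}},
}
```

A reducer would poll this check while it keeps downloading products, and begin sorting only when it returns true.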
Performance
  • Parallel efficiency
  • AzureMapReduce
      • Azure small instances: single core (1.7 GB memory)
  • Hadoop Bare Metal
      • IBM iDataPlex cluster
      • Two quad-core CPUs (Xeon 2.33 GHz), 16 GB memory, Gigabit Ethernet per node
  • EMR & Hadoop on EC2
      • Cap3: HighCPU Extra Large instances (8 cores, 20 CU, 7 GB memory per instance)
      • SWG: Extra Large instances (4 cores, 8 CU, 15 GB memory per instance)
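The slide names parallel efficiency as the metric without spelling it out. Assuming the usual definition, efficiency is the single-core run time divided by the product of the core count and the parallel run time. The sketch below uses that definition; the function name and the timings in the usage note are made up for illustration and are not measurements from the talk.

```python
def parallel_efficiency(t1, cores, tp):
    """Usual definition: T(1) / (p * T(p)); 1.0 means perfect scaling.

    t1    -- run time on a single core
    cores -- number of cores p used in the parallel run
    tp    -- run time on p cores
    """
    if cores <= 0 or tp <= 0:
        raise ValueError("cores and tp must be positive")
    return t1 / (cores * tp)
```

For example, a job that takes 100 s on one core and 30 s on four cores (hypothetical numbers) has an efficiency of 100 / (4 × 30), roughly 0.83.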
Sequence Alignment
  • Smith-Waterman-GOTOH to calculate all-pairs dissimilarity
  • Figure: output files OutFile1, OutFile2, OutFile3, OutFile4
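The editor's notes for this slide mention block decomposition restricted to the lower triangle of the all-pairs matrix. A minimal sketch of that idea follows, assuming a symmetric dissimilarity so that each unordered pair needs computing only once. The function names are hypothetical, and the load-balancing algorithm mentioned in the notes is not reproduced here.

```python
def lower_triangle_blocks(n_sequences, block_size):
    """Yield (rows, cols) index ranges for blocks on or below the diagonal."""
    n_blocks = -(-n_sequences // block_size)  # ceiling division
    for bi in range(n_blocks):
        for bj in range(bi + 1):
            rows = range(bi * block_size,
                         min((bi + 1) * block_size, n_sequences))
            cols = range(bj * block_size,
                         min((bj + 1) * block_size, n_sequences))
            yield rows, cols


def pairs_in_block(rows, cols):
    """Unordered pairs (i, j) with i > j that fall inside one block."""
    return [(i, j) for i in rows for j in cols if i > j]
```

Each block would become one map task, and a reducer would collect the blocks belonging to a row of the matrix. With 10 sequences and a block size of 4, this yields 6 blocks covering all 45 pairs.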
Sequence Alignment Performance
Sequence Assembly
  • Assemble sequences using Cap3
  • Pleasingly parallel
  • Map only
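"Map only" means each input file is processed independently and the map output is the final result, with no reduce step. A small sketch of that pattern follows, using a thread pool as a stand-in for a pool of cloud workers. `run_map_only` and `fake_assemble` are hypothetical names, and the placeholder does not invoke Cap3.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_only(inputs, map_fn, workers=4):
    """Apply map_fn to every input independently; there is no reduce step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each input with its output.
        return dict(zip(inputs, pool.map(map_fn, inputs)))


def fake_assemble(file_name):
    # Placeholder for running an assembler such as Cap3 on one input file.
    return file_name + ".out"


outputs = run_map_only(["a.fsa", "b.fsa"], fake_assemble)
```

Since no task depends on another, adding workers shortens the run until the number of workers reaches the number of input files.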
Sequence Assembly Performance
Sustained performance of clouds
Conclusion
  • MapReduce in cloud infrastructures provides an easy-to-use, economical option for performing loosely coupled scientific computations.
  • Cloud infrastructure services can successfully be leveraged to build distributed parallel systems with acceptable performance and consistency.
  • For non-IO-intensive workloads, cloud performance was sustained well.
Thanks
http://salsahpc.indiana.edu/azuremapreduce/
Acknowledgements
  • All the SALSA group members for their support.
  • Microsoft for their technical support on Azure.
  • This work was made possible using the compute use grant provided by Amazon Web Services, which is titled "Proof of concepts linking FutureGrid users to AWS".
  • This work is partially funded by Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.

Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)

Editor's Notes

  • #3: The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a very viable alternative to traditional servers and computing clusters. The MapReduce distributed data processing architecture has become the weapon of choice for data-intensive analyses in the clouds and in commodity clusters due to its excellent fault tolerance features, scalability, and ease of use. Currently, there are several options for using MapReduce in cloud environments, such as using MapReduce as a service, setting up one's own MapReduce cluster on cloud instances, or using specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services. In this paper, we introduce AzureMapReduce, a novel MapReduce runtime built using the Microsoft Azure cloud infrastructure services. The AzureMapReduce architecture successfully leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services to provide an efficient, on-demand alternative to traditional MapReduce clusters. Further, we evaluate the use and performance of MapReduce frameworks, including AzureMapReduce, in cloud environments for scientific applications, using sequence assembly and sequence alignment as use cases.
  • #6: Challenges for MapReduce in the clouds:
      • Data storage: Clouds typically provide a variety of storage options, such as off-instance cloud storage (e.g., Amazon S3), mountable off-instance block storage (e.g., Amazon EBS), and virtualized instance storage (persistent for the lifetime of the instance), which can be used to set up a file system similar to HDFS [13]. Choosing the storage best suited to a particular MapReduce deployment is crucial, because the performance of data-intensive applications depends heavily on the storage location and the storage bandwidth.
      • Metadata storage: MapReduce frameworks need to maintain metadata to manage the jobs as well as the infrastructure. This metadata needs to be stored reliably, with good scalability and accessibility, to avoid single points of failure and performance bottlenecks in the MapReduce computation.
      • Communication consistency and scalability: Cloud infrastructures are known to exhibit inter-node I/O performance fluctuations (due to the shared network and unknown topology), which affect the intermediate data transfer performance of MapReduce applications.
      • Performance consistency (sustained performance): Clouds are implemented as shared infrastructures operating on virtual machines. Performance can fluctuate based on the load of the underlying infrastructure services as well as on the load from other users of the shared physical node that hosts the virtual machine (see Section VII).
      • Reliability (node failures): Node failures are to be expected whenever large numbers of nodes are used for computations, and they become more prevalent when virtual instances run on top of non-dedicated hardware. While MapReduce frameworks can recover jobs from worker node failures, failures of master nodes (nodes that store metadata, handle the job scheduling queue, etc.) can be disastrous.
      • Choosing a suitable instance type: Clouds offer users several types of instances, with different configurations and price points (see Sections B and D). It is important to select the instance type that best matches a particular MapReduce job, in terms of both performance and cost.
      • Logging: Cloud instance storage is preserved only for the lifetime of the instance, so information logged to instance storage is lost when the instance terminates. This matters if the logs need to be processed afterwards, for example to identify a software-caused instance failure. On the other hand, excessive logging to a bandwidth-limited off-instance storage location can become a performance bottleneck for the MapReduce computation.
  • #10: Client driver loads the map & reduce tasks to queues in parallel using TPL (the .NET Task Parallel Library). Create the task monitoring table. Standalone client or a web client. Can wait for completion. Explain the advantages of using Azure queues. Explain the advantages of using Azure table: scalability, ease of use, no maintenance overhead, no need to install a DB. Easily visualize using a web role.
  • #11: Map & Reduce workers pick up map tasks from the queue
  • #12: Map workers download data from Blob storage and start processing, then update the status in the task monitoring table. Advantages of blob storage. Custom input/output formats & keys.
  • #13: Finished map tasks upload result data sets to Azure Storage, add entries to the respective reduce task tables, and update the status. Then they get the next task from the queue and start processing it. Custom part
  • #14: Reduce tasks notice the intermediate data product metadata in the reduce task tables, start downloading the products, and update the reduce task tables. This happens while the map workers are processing the next set of map tasks.
  • #15: Reduce tasks start reducing when all the map tasks are finished and when the respective reduce tasks have finished downloading the intermediate data products. Custom output formats.
  • #16: Global barrier…
  • #17: iDataPlex: two quad-core CPUs (Intel Xeon E5410, 2.33 GHz), 16 GB memory, Gigabit Ethernet network interface.
  • #18: Use block decomposition. Lower triangle only, using a load balancing algorithm. Each row block is collected by reducers. Relatively small amount of input data, but large intermediate and output data.
  • #19: ~123 million sequence alignments for under $30, with zero up-front hardware cost.
  • #22: SWG - In these tests, 32 cores were used to align 4000 sequences. Standard deviations of 1.56% for EMR and 2.25% for AzureMapReduce.
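The percentages in the last note are presumably standard deviations expressed relative to the mean run time. A short sketch of how such a figure could be computed is given below. The use of the sample standard deviation is an assumption, and the timings are invented for illustration; they are not data from the talk.

```python
import statistics


def relative_std_percent(samples):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


# Made-up run times in seconds for repeated executions of the same job.
hypothetical_run_times = [100.0, 102.0, 98.0]
spread = relative_std_percent(hypothetical_run_times)
```

A low value across runs taken at different times of day and week is what "sustained performance" refers to in the slides.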