Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)

MapReduce in the Clouds for Science
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
{tgunarat, taklwu, xqiu, gcf}@indiana.edu
CloudCom 2010, Nov 30 – Dec 3, 2010

Introduction
- Cloud computing combined with cloud infrastructure services
  - A very viable alternative for scientists
- MapReduce frameworks
  - Scalability
  - Excellent fault tolerance features
  - Ease of use
- Several options for using MapReduce in cloud environments
  - MapReduce as a service
  - Setting up a MapReduce cluster on cloud instances
  - Specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services

Introduction
- Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments
  - Sequence alignment
  - Sequence assembly
- AzureMapReduce
  - Provides a decentralized, on-demand MapReduce framework
  - Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services
- Sustained performance of clouds

Platforms
- Apache Hadoop
  - On bare metal
  - On EC2
- Amazon Web Services
  - Elastic MapReduce
- Microsoft Azure
  - AzureMapReduce

Challenges for MapReduce in the clouds
- Data storage
- Reliability
- Master node
- Metadata storage
- Performance consistency
- Communication consistency and scalability
- CPU performance
- Choosing suitable instance types
- Logging

AzureMapReduce
- Built using Azure cloud services
  - Distributed, highly scalable & highly available services
  - Minimal management / maintenance overhead
  - Reduced footprint
- Co-exists with the eventual consistency & high latency of cloud services
- Decentralized control

AzureMapReduce Features
- Ability to dynamically scale up/down
- Familiar programming model
- Fault tolerance
- Easy testing and deployment
- Combiner step
- Web-based monitoring console

AzureMapReduce Architecture
- The Sort & Reduce phases of a reduce task start when
  - all the map tasks are finished, and
  - the reduce task has finished downloading all of its intermediate data products
- There is no guarantee when all the intermediate data will appear in the Task tables
- Each map task stores the number of intermediate data products it generated for each reduce task
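The start condition above can be sketched as follows. This is a minimal illustration under stated assumptions, not AzureMapReduce's actual code: the record type, its field names and both function names are hypothetical, and the task table is modeled as a plain in-memory list. The point it shows is why the per-reducer counts matter: with eventually consistent tables, a reduce task cannot tell from the table contents alone that it has seen everything, but it can compare what it has downloaded against the totals reported by the map tasks.

```python
from dataclasses import dataclass, field


@dataclass
class MapTaskRecord:
    """Hypothetical stand-in for one map task's row in the task table."""
    finished: bool
    # reducer id -> number of intermediate products this map task
    # generated for that reducer
    products_per_reducer: dict = field(default_factory=dict)


def expected_products(map_records, reducer_id):
    """Total intermediate products the given reducer should receive."""
    return sum(r.products_per_reducer.get(reducer_id, 0) for r in map_records)


def can_start_reduce(map_records, reducer_id, downloaded_count):
    """True once all map tasks are done and every product is downloaded."""
    if not all(r.finished for r in map_records):
        return False
    return downloaded_count >= expected_products(map_records, reducer_id)
```

A reduce task would poll a check like this and keep downloading until it returns True, rather than assuming the table listing it last read was complete.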

Performance
- Parallel efficiency
- AzureMapReduce
  - Azure small instances: single core (1.7 GB memory)
- Hadoop bare metal: IBM iDataplex cluster
  - Two quad-core CPUs (Xeon 2.33 GHz), 16 GB memory, Gigabit Ethernet per node
- EMR & Hadoop on EC2
  - Cap3: HighCPU Extra Large instances (8 cores, 20 CU, 7 GB memory per instance)
  - SWG: Extra Large instances (4 cores, 8 CU, 15 GB memory per instance)
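The slide names parallel efficiency without giving its formula in the extracted text. The sketch below assumes the usual definition, E(p) = T(1) / (p * T(p)), where T(1) is the sequential run time and T(p) the run time on p cores; the function and parameter names are illustrative only.

```python
def parallel_efficiency(t1, tp, p):
    """Parallel efficiency under the standard definition T(1) / (p * T(p)).

    t1: run time on one core
    tp: run time on p cores (same units as t1)
    p:  number of cores used
    """
    if p <= 0 or tp <= 0:
        raise ValueError("p and tp must be positive")
    return t1 / (p * tp)
```

For example, a job that takes 100 time units on one core and 25 on eight cores has an efficiency of 0.5. Normalizing per core in this way is what allows platforms with different core counts per instance to be compared at all.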

Sequence Alignment
- Smith-Waterman-GOTOH (SWG) to calculate all-pairs dissimilarity
- [Figure: output files OutFile1, OutFile2, OutFile3, OutFile4]
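As a rough illustration of the computation named above, here is a small sketch of a Smith-Waterman local alignment score with Gotoh's affine gap penalties, applied to every pair of sequences. The scoring parameters (match, mismatch, gap_open, gap_extend) are arbitrary assumptions, not the values used in the study, and the sketch stops at the raw score because the slides do not say how a score is converted into a dissimilarity.

```python
def swg_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score with affine gaps (Gotoh recurrences).

    gap_open is the cost of the first gap position, gap_extend the cost
    of each further position. Only two rows are kept in memory.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # H: best score ending at (i-1, j)
    f_prev = [neg_inf] * cols  # F: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # E: best score ending in a horizontal gap
        for j in range(1, cols):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def all_pairs_scores(seqs):
    """Score every pair once and mirror it, since the score is symmetric."""
    n = len(seqs)
    scores = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            scores[i][j] = scores[j][i] = swg_score(seqs[i], seqs[j])
    return scores
```

Each pair is independent of every other pair, so the work can be split into separate tasks without any communication between them.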

Sequence Alignment Performance

Sequence Assembly
- Assemble sequences using Cap3
- Pleasingly parallel
- Map only
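A map-only, pleasingly parallel job of this shape can be sketched as below. This is a generic illustration, not the framework's API: run_map_only and its parameters are hypothetical names, a local thread pool stands in for cloud workers, and map_fn is a placeholder for whatever processes one input file (in this setting it would wrap a Cap3 run, which is not shown).

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_only(inputs, map_fn, workers=4):
    """Apply map_fn to each input independently; there is no reduce phase.

    Returns a dict mapping each input to its map output.
    """
    inputs = list(inputs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs inputs with results
        return dict(zip(inputs, pool.map(map_fn, inputs)))
```

Because no task depends on another task's output, adding workers shortens the run without any change to the per-file logic.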

Sustained performance of clouds

Conclusion
- MapReduce in cloud infrastructures provides an easy-to-use, economical option for performing loosely coupled scientific computations.
- Cloud infrastructure services can successfully be leveraged to build distributed parallel systems with acceptable performance and consistency.
- For non-IO-intensive workloads, cloud performance was sustained well.

Thanks
http://salsahpc.indiana.edu/azuremapreduce/

Acknowledgements
- All the SALSA group members for their support.
- Microsoft for their technical support on Azure.
- This work was made possible by the compute use grant provided by Amazon Web Services, titled "Proof of concepts linking FutureGrid users to AWS".
- This work is partially funded by the Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.
