Optimizing Bursty Hadoop on AWS - Big Data Cloud - June 3rd Meetup

Big Data Cloud Meetup
Big Data & Cloud Computing - Help, Educate & Demystify.
June 3rd 2011

Optimizing Bursty Hadoop
Who I am: Paul Baclace
Hadoop/Nutch work:
- 2005-2006 Internet Archive with Doug Cutting
- 2008-2010 AT&T Interactive
- 2010-present Euclid Elements, Yoterra
- Contributed patches to Hadoop/Nutch
Optimizing Bursty Hadoop on AWS, June 3rd 2011 Meetup

Options
- Storage on S3, EBS, local disk
- Latencies, prices, and stretchy clusters
- Amazon Elastic-MapReduce and customized EC2

Goals
- Optimize bursty Hadoop analysis demands
- Optimize testing demands

Logical Information Flow
DataSource --> CloudStorage --> MapReduce --> CloudStorage --> Reports

Variable Cost Factors
- Storage, GB per-month
- Access, IO operations
- Latency (human attention)
- Compute cores

Price Insensitive, Permanent EC2 Solution
HDFS(local_disk) --> EC2nodes --> HDFS(local_disk)
- Local disks only (fast access, low latency)
- All data lost if master node terminates
- Difficult to migrate to new machines
- Cluster start/stop latency

Elastic-MapReduce Solution
Keep data in S3 and run EMR jobs
S3 --> Elastic-MapReduce --> S3
- Cluster start/stop latency
- S3 data load time: 5-10 min for 2GB in 1500 parts
- S3 data store time: 1 hour
- Rate of writing to S3 is about 5X slower than reading
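The load-time figure above can be turned into an effective throughput with a little arithmetic. The sketch below is only an illustration: the function name is made up here, and it assumes 1 GB = 1024 MB. It converts "2GB in 5-10 min" into MB/sec and computes the average part size for 1500 parts.

```python
def throughput_mb_per_sec(size_gb: float, seconds: float) -> float:
    """Effective throughput in MB/sec, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / seconds


# Figures from the slide: 2 GB loaded in 5 to 10 minutes, split into 1500 parts.
fast = throughput_mb_per_sec(2, 5 * 60)
slow = throughput_mb_per_sec(2, 10 * 60)
avg_part_mb = 2 * 1024 / 1500

print(f"effective load rate: {slow:.1f} to {fast:.1f} MB/sec")
print(f"average part size: {avg_part_mb:.2f} MB")
```

The parts average well under 2 MB each, so per-object request overhead may plausibly account for much of the load time, although the slides do not say so.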

EBS HDFS Solution
HDFS(EBS) --> EC2nodes --> HDFS(EBS)
- Cluster start/stop latency: in standby mode with minimal nodes, no waiting
- No waiting after map-reduce job finishes
- Keep a minimal standby HDFS cluster for HDFS queries and low-cost testing

EBS HDFS Solution (2)
- Can be shut down and resumed if the OS is also on EBS
- Data blocks on EBS networked storage
- Task-only nodes need no EBS storage and can be added while a job is running

Performance
Typical Performance of EBS and S3

       write latency  read rate  write rate
EBS    5-25 msec      65 MB/sec  21 MB/sec
S3     400 msec       15 MB/sec  1.5 MB/sec
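To get a feel for what these rates mean at the data sizes used in the results, here is a rough sketch that converts them into transfer times. The helper name and the `streams` parameter are illustrative assumptions. Linear scaling across parallel streams is also an assumption, not a measurement from the talk.

```python
# Typical rates quoted on the slide, in MB/sec.
RATES_MB_PER_SEC = {
    ("EBS", "read"): 65.0,
    ("EBS", "write"): 21.0,
    ("S3", "read"): 15.0,
    ("S3", "write"): 1.5,
}


def transfer_hours(size_gb: float, rate_mb_per_sec: float, streams: int = 1) -> float:
    """Hours to move size_gb at the given per-stream rate.

    Assumes 1 GB = 1024 MB and that throughput scales linearly with the
    number of parallel streams (an idealization).
    """
    return size_gb * 1024 / (rate_mb_per_sec * streams) / 3600


for (store, op), rate in RATES_MB_PER_SEC.items():
    hours = transfer_hours(100, rate)
    print(f"{store} {op}: {hours:.2f} h for 100 GB on one stream")
```

On a single stream, writing 100 GB to S3 at 1.5 MB/sec would take roughly 19 hours, which is why parallel writers matter for the S3-based option.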

Results: 4 jobs per month, 100GB

                    StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
Perm_EC2_Cluster             0    5.525         0.000           0.00000     3978.0000
S3_and_EMR                1200   14.000         0.000           6.35375       39.4150
EBS_HDFS_EC2_Tasks           0   20.000         0.425           5.67880       42.7152

Results: 40 jobs per month, 100GB

                    StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
Perm_EC2_Cluster             0    5.525         0.000           0.00000      3978.000
S3_and_EMR                1200   14.000         0.000           6.35375       268.150
EBS_HDFS_EC2_Tasks           0   20.000         0.425           5.67880       247.152

Results: 4 jobs per month, 1TB

                    StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
Perm_EC2_Cluster             0    5.525         0.000           0.00000      3978.000
S3_and_EMR                1200  140.000         0.000           6.35375       165.415
EBS_HDFS_EC2_Tasks           0  200.000         0.425           7.06300       228.252

Results: 40 jobs per month, 1TB

                    StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
Perm_EC2_Cluster             0    5.525         0.000           0.00000       3978.00
S3_and_EMR                1200  140.000         0.000           6.35375        394.15
EBS_HDFS_EC2_Tasks           0  200.000         0.425           7.06300        482.52
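The monthly figures in the four result tables are consistent with a simple model: CostPerMonth = CostOff + jobs x CostPerMapReduce for the S3 and EBS rows, and 5.525 x 720 hours for the permanent cluster. The slides do not state this formula; it is inferred from the numbers. The sketch below reproduces the table values under that assumption. The function names and the 720-hour month are choices made here, and the StandbyPerHr column is not used because the published totals match without it.

```python
HOURS_PER_MONTH = 720  # assumed 30-day month; 5.525 * 720 = 3978


def on_demand_monthly(cost_off: float, cost_per_job: float, jobs: int) -> float:
    """Fixed monthly cost plus a per-job charge (S3_and_EMR and EBS rows)."""
    return cost_off + jobs * cost_per_job


def permanent_monthly(hourly_rate: float) -> float:
    """Always-on cluster billed for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH


def break_even_jobs(plan_a, plan_b):
    """Jobs per month at which two (cost_off, cost_per_job) plans cost the same.

    Returns None when the plans never cross for a positive job count.
    """
    (fixed_a, per_job_a), (fixed_b, per_job_b) = plan_a, plan_b
    if per_job_a == per_job_b:
        return None
    jobs = (fixed_b - fixed_a) / (per_job_a - per_job_b)
    return jobs if jobs > 0 else None


# (CostOff, CostPerMapReduce) taken from the result tables.
SCENARIOS = {
    "100GB": {"S3_and_EMR": (14.0, 6.35375), "EBS_HDFS_EC2_Tasks": (20.0, 5.67880)},
    "1TB": {"S3_and_EMR": (140.0, 6.35375), "EBS_HDFS_EC2_Tasks": (200.0, 7.06300)},
}

for size, plans in SCENARIOS.items():
    for jobs in (4, 40):
        for name, (cost_off, per_job) in plans.items():
            total = on_demand_monthly(cost_off, per_job, jobs)
            print(f"{size:>5} {jobs:>2} jobs  {name:<19} {total:9.3f}")
    crossover = break_even_jobs(plans["S3_and_EMR"], plans["EBS_HDFS_EC2_Tasks"])
    print(f"{size:>5} break-even jobs/month: {crossover}")

print(f"Perm_EC2_Cluster {permanent_monthly(5.525):.1f}")
```

Under this reading, the EBS option overtakes S3 plus EMR at about 9 jobs per month for 100GB. At 1TB it never does, because both its fixed and per-job costs are higher.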

EC2 Hadoop Set Up Tips
- Put CDH distribution on custom AMI for task-only nodes
- Use Whirr from Cloudera
- One security group for HDFS cluster
- Temporary tasktracker nodes should be excluded from serving blocks

EC2 Hadoop Set Up Tips (2)
- For large map-reduce jobs, spawn tasktracker-only nodes
- Spot check with nmon to determine whether machines are limited by disk, network, or CPU
- Resources must all be in the same availability zone
- Billing is rounded up to hours, so provision jobs to take just under N hours of runtime
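The billing tip can be illustrated with a small calculation. This is a sketch under simplifying assumptions: the job is a fixed amount of work measured in node-hours, it scales perfectly linearly with node count, and every node is billed for whole hours rounded up. The function name and the 20 node-hour job are hypothetical.

```python
import math


def billed_node_hours(work_node_hours: float, nodes: int) -> int:
    """Node-hours billed when each node's runtime is rounded up to whole hours.

    Assumes the work divides evenly across nodes (ideal linear scaling).
    """
    runtime_hours = work_node_hours / nodes
    return math.ceil(runtime_hours) * nodes


WORK = 20  # hypothetical job size in node-hours

for nodes in (7, 9, 10, 11):
    runtime = WORK / nodes
    billed = billed_node_hours(WORK, nodes)
    print(f"{nodes:>2} nodes: runtime {runtime:.2f} h, "
          f"billed {billed} node-hours, idle {billed - WORK}")
```

With 9 nodes the job runs 2.22 hours but pays for 3 hours on every node (27 node-hours). With 7 nodes it finishes just under 3 hours and pays for only 21, which is the effect the tip describes.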
