Big Data Cloud Meetup
Big Data & Cloud Computing - Help, Educate & Demystify
June 3rd, 2011

Optimizing Bursty Hadoop
Who I am: Paul Baclace
Hadoop/Nutch work:
  • 2005-2006: Internet Archive, with Doug Cutting
  • 2008-2010: AT&T Interactive
  • 2010-present: Euclid Elements, Yoterra
  • Contributed patches to Hadoop/Nutch
Optimizing Bursty Hadoop on AWS, June 3rd 2011 Meetup

Options
  • Storage on S3, EBS, local disk
  • Latencies, prices, and stretchy clusters
  • Amazon Elastic-MapReduce and customized EC2

Goals
  • Optimize bursty Hadoop analysis demands
  • Optimize testing demands

Logical Information Flow
  DataSource --> CloudStorage --> MapReduce --> CloudStorage --> Reports

Variable Cost Factors
  • Storage, GB per month
  • Access, IO operations
  • Latency (human attention)
  • Compute cores

Price Insensitive, Permanent EC2 Solution
  HDFS(local_disk) --> EC2 nodes --> HDFS(local_disk)
  • Local disks only (fast access, low latency)
  • All data lost if master node terminates
  • Difficult to migrate to new machines
  • Cluster start/stop latency

Elastic-MapReduce Solution
  • Keep data in S3 and run EMR jobs
  • S3 --> Elastic-MapReduce --> S3
  • Cluster start/stop latency
  • S3 data load time: 5-10 min for 2 GB in 1500 parts
  • S3 data store time: 1 hour
  • Writing to S3 is about 5X slower than reading
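
As a rough check on the load-time figure above, the 5-10 minute range can be converted into an effective throughput. This is only a sketch: it assumes 1 GB = 1024 MB and that the quoted time covers the whole 2 GB transfer, and the function name is hypothetical.

```python
def effective_rate_mb_per_sec(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/sec for moving size_gb in the given minutes."""
    return (size_gb * 1024) / (minutes * 60)


SIZE_GB = 2        # from the slide: 2 GB ...
NUM_PARTS = 1500   # ... split into 1500 parts

fast = effective_rate_mb_per_sec(SIZE_GB, 5)    # best case: 5 minutes
slow = effective_rate_mb_per_sec(SIZE_GB, 10)   # worst case: 10 minutes
avg_part_mb = SIZE_GB * 1024 / NUM_PARTS        # mean size of one part

print(f"effective load rate: {slow:.1f}-{fast:.1f} MB/sec")
print(f"average part size: {avg_part_mb:.2f} MB")
```

With parts averaging well under 2 MB, per-request latency rather than raw bandwidth may account for much of the load time, though the slides do not say so directly.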

EBS HDFS Solution
  HDFS(EBS) --> EC2 nodes --> HDFS(EBS)
  • Cluster start/stop latency: in standby mode with minimal nodes, no waiting
  • No waiting after map-reduce job finishes
  • Keep a minimal standby HDFS cluster for HDFS queries and low-cost testing

EBS HDFS Solution (2)
  • Can be shut down and resumed if the OS is also on EBS
  • Data blocks on EBS networked storage
  • Task-only nodes need no EBS storage and can be added while a job is running

Performance
Typical performance of EBS and S3:

         Write latency   Read rate    Write rate
  EBS    5-25 msec       65 MB/sec    21 MB/sec
  S3     400 msec        15 MB/sec    1.5 MB/sec
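
To get a feel for what these rates mean at the data sizes used in the results, the sketch below estimates how long a single stream would take to move 100 GB. It assumes 1 GB = 1024 MB and one sequential stream at the typical rate; a real cluster transfers in parallel across nodes, so these are upper bounds per stream, not job times. The function and dictionary names are hypothetical.

```python
# Typical rates from the slide, in MB/sec.
RATES_MB_PER_SEC = {
    ("EBS", "read"): 65.0,
    ("EBS", "write"): 21.0,
    ("S3", "read"): 15.0,
    ("S3", "write"): 1.5,
}


def transfer_hours(size_gb: float, store: str, op: str) -> float:
    """Hours to move size_gb over ONE stream at the slide's typical rate."""
    seconds = size_gb * 1024 / RATES_MB_PER_SEC[(store, op)]
    return seconds / 3600


for store, op in RATES_MB_PER_SEC:
    hours = transfer_hours(100, store, op)
    print(f"{store} {op}: {hours:.1f} h for 100 GB")
```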

Results: 4 jobs per month, 100 GB

                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
  Perm_EC2_Cluster               0    5.525         0.000           0.00000     3978.0000
  S3_and_EMR                  1200   14.000         0.000           6.35375       39.4150
  EBS_HDFS_EC2_Tasks             0   20.000         0.425           5.67880       42.7152

Results: 40 jobs per month, 100 GB

                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
  Perm_EC2_Cluster               0    5.525         0.000           0.00000      3978.000
  S3_and_EMR                  1200   14.000         0.000           6.35375       268.150
  EBS_HDFS_EC2_Tasks             0   20.000         0.425           5.67880       247.152

Results: 4 jobs per month, 1 TB

                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
  Perm_EC2_Cluster               0    5.525         0.000           0.00000      3978.000
  S3_and_EMR                  1200  140.000         0.000           6.35375       165.415
  EBS_HDFS_EC2_Tasks             0  200.000         0.425           7.06300       228.252

Results: 40 jobs per month, 1 TB

                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
  Perm_EC2_Cluster               0    5.525         0.000           0.00000       3978.00
  S3_and_EMR                  1200  140.000         0.000           6.35375        394.15
  EBS_HDFS_EC2_Tasks             0  200.000         0.425           7.06300        482.52
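
The monthly totals in the four result tables are consistent with a simple model: CostPerMonth = CostOff + jobs x CostPerMapReduce for the two bursty options, and 5.525 x 720 hours for the permanent cluster. This model is inferred from the numbers, not stated on the slides; under this reading the StandbyPerHr figure does not appear to be included in the totals. The sketch below, with hypothetical function names and an assumed 720-hour month, recomputes each total and compares it with the slide value.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30-day month


def bursty_monthly_cost(cost_off: float, cost_per_job: float, jobs: int) -> float:
    """Fixed monthly cost plus a per-job charge (inferred model)."""
    return cost_off + jobs * cost_per_job


def permanent_monthly_cost(hourly_rate: float) -> float:
    """Always-on cluster billed for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH


# (label, CostOff, CostPerMapReduce, jobs, CostPerMonth as shown on the slides)
ROWS = [
    ("S3_and_EMR 100GB", 14.0, 6.35375, 4, 39.4150),
    ("S3_and_EMR 100GB", 14.0, 6.35375, 40, 268.150),
    ("S3_and_EMR 1TB", 140.0, 6.35375, 4, 165.415),
    ("S3_and_EMR 1TB", 140.0, 6.35375, 40, 394.15),
    ("EBS_HDFS_EC2_Tasks 100GB", 20.0, 5.67880, 4, 42.7152),
    ("EBS_HDFS_EC2_Tasks 100GB", 20.0, 5.67880, 40, 247.152),
    ("EBS_HDFS_EC2_Tasks 1TB", 200.0, 7.06300, 4, 228.252),
    ("EBS_HDFS_EC2_Tasks 1TB", 200.0, 7.06300, 40, 482.52),
]

for label, cost_off, per_job, jobs, slide_total in ROWS:
    computed = bursty_monthly_cost(cost_off, per_job, jobs)
    matches = math.isclose(computed, slide_total, abs_tol=1e-6)
    print(f"{label:26s} jobs={jobs:2d} computed={computed:9.4f} matches={matches}")

print("Perm_EC2_Cluster", permanent_monthly_cost(5.525))
```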

EC2 Hadoop Set Up Tips
  • Put the CDH distribution on a custom AMI for task-only nodes
  • Use Whirr from Cloudera
  • One security group for the HDFS cluster
  • Temporary tasktracker nodes should be excluded from serving blocks

EC2 Hadoop Set Up Tips (2)
  • For large map-reduce jobs, spawn tasktracker-only nodes
  • Spot check with nmon to determine whether machines are limited by disk, network, or CPU
  • Resources must all be in the same availability zone
  • Billing is rounded up to whole hours, so provision jobs to take just under N hours of runtime
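
The billing tip can be illustrated with a small calculation. The sketch below assumes the hourly rounding described on the slide and, as a simplification, that a job scales perfectly with node count (runtime = work / nodes). The job size and function name are hypothetical.

```python
import math


def billed_node_hours(work_node_sec: int, nodes: int) -> int:
    """Node-hours billed when every node is charged for whole hours.

    Assumes perfect scaling, so runtime = work / nodes.
    """
    runtime_sec = work_node_sec / nodes
    hours_per_node = math.ceil(runtime_sec / 3600)  # round up to whole hours
    return nodes * hours_per_node


WORK = 10 * 3600  # hypothetical job: 10 node-hours of work

for nodes in (9, 10, 11, 20):
    billed = billed_node_hours(WORK, nodes)
    print(f"{nodes:2d} nodes -> {billed} node-hours billed")
```

Under these assumptions, 9 nodes run slightly over one hour and are billed for two hours each, while 10 nodes finish in exactly one hour and pay for no idle time. Adding nodes beyond that point shortens the job but bills for unused parts of the hour.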
