‘Amazon EMR’ coming up…by Sujee Maniyam
Big Data Cloud Meetup
Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
Sujee Maniyam
hello@sujee.net | www.sujee.net
July 08, 2011
Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
Sujee Maniyam
http://sujee.net | hello@sujee.net
Quiz (PRIZE!): Where was this picture taken?
Quiz : Where was this picture taken?
Answer : Montara Light House
Hi, I’m Sujee
- 10+ years of software development: enterprise apps → web apps → iphone apps → Hadoop
- Hands-on experience with Hadoop / HBase / Amazon ‘cloud’
- More: http://sujee.net/tech
I am  an ‘expert’ 
Ah.. Data
Nature of Data…
- Primary data: email, blogs, pictures, tweets. Critical for operation (Gmail can’t lose emails)
- Secondary data: Wikipedia access logs, Google search logs. Not ‘critical’, but used to ‘enhance’ the user experience: search logs help predict ‘trends’; Yelp can figure out you like Chinese food
Data Explosion
- Primary data has grown phenomenally, but secondary data has exploded in recent years: “log everything and ask questions later”
- Used for recommendations (books, restaurants, etc.), predicting trends (job skills in demand), showing ads ($$$), etc.
- ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook); startups are struggling to get on top of ‘big data’
Hadoop to the Rescue
- Hadoop can help with Big Data: it has been proven in the field and is under active development
- Throw hardware at the problem: hardware is getting cheaper by the year
- Bleeding-edge technology: hire good people!
Hadoop: It is a CAREER
Data Spectrum
Who is Using Hadoop?
Big Guys
Startups
Startups and bigdata
About This Presentation
- Based on my experience with a startup: 5 people (3 engineers), ad-serving space
- Amazon EC2 is our ‘data center’
- Technologies: web stack (Python, Tornado, PHP, MySQL, LAMP); Amazon EMR to crunch data
- Data size: 1 TB / week
Story of a Startup… Month 1
- Each web server writes logs locally
- Logs were copied to a log server and purged from web servers
- Log data size: ~100-200 GB
Story of a Startup… Month 6
- More web servers come online
- Aggregate log server falls behind
Data @ 6 Months
- 2 TB of data already; 50-100 GB of new data / day
- And we were operating at 20% of our capacity!
Future…
Solution?
- Scalable database (NoSQL): HBase, Cassandra
- Hadoop log processing / Map Reduce
What We Evaluated
1) HBase cluster
2) Hadoop cluster
3) Amazon EMR
Hadoop on Amazon EC2
1) Permanent cluster
2) On-demand cluster (Elastic Map Reduce)
1) Permanent Hadoop Cluster
Architecture 1
Hadoop Cluster
- 7 c1.xlarge machines, 15 TB of EBS volumes
- Sqoop exports MySQL log tables into HDFS
- Logs are compressed (gz) to minimize disk usage (data-locality trade-off)
- All is working well…
Lessons Learned
- c1.xlarge is pretty stable (8 cores / 8 GB memory)
- EBS volumes: max size is 1 TB, so string a few together for higher density per node
- DON’T RAID them; let Hadoop handle them as individual disks
- ??: Skip EBS. Use instance-store disks, and store data in S3
Amazon Storage Options
2 Months Later
- A couple of EBS volumes DIE; a couple of EC2 instances DIE
- Maintaining the Hadoop cluster is a mechanical job → less appealing
- COST! Our job utilization is about 50%, but we are still paying for machines running 24x7
Amazon EC2 Cost
Hadoop Cluster on EC2 Cost
- $3,500 = 7 c1.xlarge @ $500 / month
- $1,500 = 15 TB EBS storage @ $0.10 per GB
- $500 = EBS I/O requests @ $0.10 per 1 million I/O requests
- Total: $5,500 / month ≈ $66,000 / year!
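The monthly total above can be verified with a quick back-of-the-envelope calculation (a sketch; the per-instance and per-GB figures are the 2011-era numbers from the slide):

```python
# Rough monthly cost of the permanent 7-node cluster, using the slide's prices.
instances = 7
instance_cost = 500        # $/month per c1.xlarge
ebs_gb = 15 * 1000         # 15 TB of EBS volumes, in GB
ebs_rate = 0.10            # $/GB-month
io_cost = 500              # $/month of EBS I/O requests, as given on the slide

monthly = instances * instance_cost + ebs_gb * ebs_rate + io_cost
print(round(monthly))   # 5500
```

Note that $5,500 × 12 is $66,000 a year, a bit more than the round $60k often quoted.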
Buy / Rent?
- Typical Hadoop machine cost: $10k; a 10-node cluster = $100k, plus data-center costs, plus IT-ops costs
- Amazon EC2 10-node cluster: $500 × 10 = $5,000 / month = $60k / year
Buy / Rent
- Amazon EC2 is great for quickly getting started, for startups, and for scaling on demand / rapidly adding more servers (popular social games)
- Netflix story: streaming is powered by EC2; encoding movies etc. uses 1000s of instances
- Not so economical for running clusters 24x7
Next : Amazon EMR
Where was this picture taken?
Answer : Pacifica Pier
Amazon’s Solution: Elastic Map Reduce
- Store data on Amazon S3
- Kick off a Hadoop cluster to process the data
- Shut down when done
- Pay for the HOURS used
Architecture : Amazon EMR
Moving Parts
- Logs go into Scribe; the Scribe master ships logs into S3, gzipped
- Spin up an EMR cluster, run the job, done
- Using the same old Java MR jobs on EMR
- Summary data gets updated directly into MySQL
EMR Launch Scripts
- Scripts to launch jar EMR jobs, with custom parameters depending on job needs (instance types, size of cluster, etc.)
- Monitor job progress; save logs for later inspection; report job status (finished / cancelled)
- https://github.com/sujee/amazon-emr-beyond-basics
Sample Launch Script

#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
export t1=$(date +%s)
export JOBID=$(elastic-mapreduce --plain-output --create \
  --name "${JOBNAME}__${TIMESTAMP}" \
  --num-instances "$INSTANCES" \
  --master-instance-type "$MASTER_INSTANCE_TYPE" \
  --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
  --jar s3://my_bucket/jars/adp.jar \
  --main-class com.adpredictive.hadoop.mr.SiteStats4 \
  --arg s3://my_bucket/jars/sitestats4-prod.config \
  --log-uri s3://my_bucket/emr-logs/ \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh
Mapred-config-m1-xl.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3000M</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>3</value>
        <description>4 is running out of memory</description>
    </property>
    <property>
        <name>mapred.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
    </property>
</configuration>
emr-wait-for-completion.sh
- Polls for job status periodically
- Saves the logs
- Calculates job run time
Saved Logs
Sample Saved Log
Data Joining (x-ref)
- Data is split across log files; we need to cross-reference during the Map phase
- We used to load the data into the mapper’s memory (the data was small and in MySQL); now we use Membase (memcached)
- Two MR jobs are chained: the first processes logfile_type_A and populates Membase (very quick, takes minutes); the second processes logfile_type_B, cross-referencing values from Membase
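A minimal sketch of that two-job chaining pattern, with a plain dict standing in for Membase (the record layout and field names here are made up for illustration):

```python
# Job 1: scan logfile_type_A and publish an id -> attribute lookup table.
# In production this table lives in Membase; a dict stands in for it here.
lookup = {}

def map_a(line):
    rec_id, campaign = line.split(",")         # hypothetical 2-field layout
    lookup[rec_id] = campaign                  # "write to Membase"

# Job 2: process logfile_type_B, cross-referencing values from job 1.
def map_b(line):
    rec_id, url = line.split(",")
    campaign = lookup.get(rec_id, "unknown")   # "read from Membase"
    return (campaign, url)

for line in ["id1,summer_promo", "id2,fall_promo"]:
    map_a(line)

print(map_b("id1,http://example.com/a"))   # ('summer_promo', 'http://example.com/a')
```

The point of the split is that job 1's output is tiny and fast to build, while job 2 streams the large log and does constant-time lookups per record.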
X-ref
EMR Wins
- Cost → only pay for use: http://aws.amazon.com/elasticmapreduce/pricing/
- Example: EMR ran on 5 c1.xlarge for 3 hrs
- EC2 instances for 3 hrs: $0.68 per hr × 5 instances × 3 hrs = $10.20
- EMR cost: 5 instances × 3 hrs × 8 normalized hrs × $0.12 = $14.40 (1 hour of c1.xlarge = 8 hours of normalized compute time; http://aws.amazon.com/elasticmapreduce/faqs/#billing-4)
- Plus S3 storage cost: 1 TB / month = $150; data bandwidth from S3 to EC2 is FREE!
- ~$25 bucks per run
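The per-run arithmetic above, as a quick sketch (2011 prices as given on the slide):

```python
# Cost of the example EMR run: 5 c1.xlarge for 3 hours, at the slide's 2011 rates.
instances, hours = 5, 3
ec2 = 0.68 * instances * hours        # on-demand c1.xlarge instance-hours
emr = instances * hours * 8 * 0.12    # 1 c1.xlarge hr = 8 normalized hrs @ $0.12
total = ec2 + emr
print(round(ec2, 2), round(emr, 2), round(total, 2))   # 10.2 14.4 24.6
```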
EMR Wins
- No Hadoop cluster to maintain: no failed nodes / disks
- Bonus: can tailor the cluster for various jobs: smaller jobs → fewer machines; memory-hungry tasks → m1.xlarge; CPU-hungry tasks → c1.xlarge
Design Wins
- Bidders now write logs to Scribe directly: no MySQL at the web-server machines, and writes are much faster!
- S3 has been reliable and cheap storage
Next : Lessons Learned
Where was this pic taken?
Answer : Foster City
Lessons Learned: Logfile Format (CSV → JSON)
- Started with CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL"
- 20-40 fields… fragile, position-dependent, hard to code: url = csv[18]… counting position numbers gets old after the 100th time around
- If (csv.length == 29) url = csv[28] else url = csv[26]
- JSON: { exchange_id: 2, url: “http://housemdvideos.com/seasons/video.php?s=01&e=07” … }
- Self-describing, easy to add new fields, easy to process: url = map.get(‘url’)
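The fragility argument in miniature (a sketch with a shortened, hypothetical 3-field record rather than the real 20-40 field log line):

```python
import json

# Position-based CSV parsing: fragile, breaks if a field is inserted before url.
csv_line = '2,26,http://example.com/page'   # hypothetical 3-field record
fields = csv_line.split(',')
url_csv = fields[2]                          # magic index, must track the schema

# JSON parsing: self-describing and order-independent; new fields are harmless.
json_line = '{"exchange_id": 2, "url": "http://example.com/page"}'
record = json.loads(json_line)
url_json = record['url']                     # named access, no index counting

print(url_csv == url_json)   # True
```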
Lessons Learned: Control the Amount of Input
- We get different types of events: event A (freq: 10,000) >>> event B (100) >> event C (1)
- Initially we put them all into a single log file: AAAABAABC
Control Input…
- We have to process the entire file even if we are interested only in ‘event C’ → too much wasted processing
- So we split the logs: log_A….gz, log_B….gz, log_C….gz
- Now we process only a fraction of our logs. Input: s3://my_bucket/logs/log_B*
- x-ref using memcache if needed
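The splitting step can be sketched as a tiny router (file naming and event format are illustrative, not the production layout):

```python
from collections import defaultdict

# Route events to per-type log files so a job can read only the type it needs,
# e.g. Input: s3://my_bucket/logs/log_B*
events = ["A,click", "A,click", "B,imp", "A,click", "C,conv"]

files = defaultdict(list)   # each list stands in for a per-type .gz file
for line in events:
    event_type = line.split(",")[0]
    files["log_" + event_type].append(line)

# A job interested only in event C now reads 1 line instead of all 5.
print(len(files["log_C"]), len(events))   # 1 5
```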
Lessons Learned: Incremental Log Processing
- Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
- Adding a ‘time window’ to our stats: only process newer logs → faster
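A minimal sketch of such a time window, assuming date-stamped log keys (the key naming scheme here is invented for illustration):

```python
from datetime import date

# Keep only log files whose date stamp falls inside the processing window.
def in_window(key, today, days=7):
    day = date.fromisoformat(key.split("/")[-1][:10])   # e.g. "logs/2011-07-08.gz"
    return (today - day).days < days

keys = ["logs/2011-07-08.gz", "logs/2011-06-01.gz", "logs/2011-07-03.gz"]
today = date(2011, 7, 8)
recent = [k for k in keys if in_window(k, today)]
print(recent)   # ['logs/2011-07-08.gz', 'logs/2011-07-03.gz']
```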
EMR Trade-offs
- Lower performance on MR jobs compared to a permanent cluster: reduced data throughput (S3 isn’t the same as local disk); data is streamed from S3 for each job
- EMR Hadoop is not the latest version
- Missing tools: Oozie
- Right now, trading performance for convenience and cost
Next Steps: Faster Processing
- Streaming S3 data for each MR job is not optimal
- Instead: spin up a cluster, copy data from S3 to HDFS, run all MR jobs (making use of data locality), terminate
Next Steps: More Processing
- More MR jobs; more frequent data processing
- Frequent log rolls; smaller delta window
Next Steps: New Software
- Python, mrjob (from Yelp)
- Scribe → Cloudera Flume?
- Use workflow tools like Oozie
- Hive? Ad-hoc SQL-like queries
Next Steps: SPOT Instances
- SPOT instances: name your price (eBay style)
- Available on EC2 for a while; just became available for Elastic Map Reduce!
- New cluster setup: 10 normal instances + 10 spot instances
- Spots may go away at any time. That is fine! Hadoop will handle node failures
- Bigger cluster: cheaper & faster
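The trade-off behind the mixed setup, sketched with a made-up spot price (only the $0.68 on-demand c1.xlarge rate comes from the slides; spot prices fluctuate):

```python
# Same nominal capacity, two pricing mixes. Spot nodes can vanish mid-job,
# but Hadoop re-runs their tasks, so the risk is wasted time, not lost data.
on_demand_price = 0.68   # $/hr, c1.xlarge (from the slides)
spot_price = 0.25        # $/hr, hypothetical spot bid
hours = 3

all_on_demand_20 = 20 * on_demand_price * hours
mixed_20 = (10 * on_demand_price + 10 * spot_price) * hours

print(mixed_20 < all_on_demand_20)   # True: 20-node capacity at lower cost
```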
Example Price Comparison
Next Steps: NoSQL
- Summary data goes into MySQL: a potential weak link (some tables have ~100 million rows and growing)
- Evaluating NoSQL solutions; using Membase in a limited capacity
- Watch out for Amazon’s HBase offering
Take a Test Drive
- Just bring your credit card: http://aws.amazon.com/elasticmapreduce/
- Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
Thanks! Questions?
Sujee Maniyam
http://sujee.net | hello@sujee.net
(Photo: Devil’s Slide, Pacifica)
