Ephemeral Hadoop Clusters in the Cloud

[1]

Greg Fodor, Etsy
gfodor@etsy.com
  
about me

gfodor@etsy.com
@gfodor
Data Wrangler

about etsy

the world’s handmade marketplace
total members: 9,000,000
total active shops: 800,000
items listed: 9.5M
page views per month: >1B
2010 sales: $314.3M

lots of data
  
about this talk

ephemeral?

[5]

“elastic” to the extreme
  
how did we get here?

wanted to dip our toes
stop hitting the database
stop grepping log files

2 data sources -> S3
  
database snapshots

input: nightly diffs
(SELECT * FROM <table> WHERE update_date > 1 day ago)

output: full tables as sequence files
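
One way to turn nightly diffs into full tables is to overlay each diff onto the previous snapshot by primary key. The sketch below assumes every row carries an `id` primary key next to the `update_date` column from the query; `merge_snapshot` and the sample rows are hypothetical, and the real job ran on Hadoop and wrote sequence files rather than Python lists.

```python
from datetime import date


def merge_snapshot(previous, diff):
    """Overlay a nightly diff onto the previous full snapshot.

    Both arguments are lists of row dicts. A diff row replaces the snapshot
    row with the same id when its update_date is at least as new.
    """
    merged = {row["id"]: row for row in previous}
    for row in diff:
        current = merged.get(row["id"])
        if current is None or row["update_date"] >= current["update_date"]:
            merged[row["id"]] = row
    # Sort by id so the output is deterministic.
    return [merged[key] for key in sorted(merged)]


previous = [
    {"id": 1, "name": "shop-a", "update_date": date(2011, 5, 1)},
    {"id": 2, "name": "shop-b", "update_date": date(2011, 5, 1)},
]
diff = [
    {"id": 2, "name": "shop-b-renamed", "update_date": date(2011, 5, 2)},
    {"id": 3, "name": "shop-c", "update_date": date(2011, 5, 2)},
]
full_table = merge_snapshot(previous, diff)
```

Rows that did not change since the last snapshot pass through untouched, so the output is always a complete table.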
  
visit logs

input: akamai access logs (event beacons)

output: [visit_id, [event]]
  
processing the data

[2]

data flow
joins, group bys, etc.

cascading
Chris Wensel
http://www.cascading.org/

great implementation
Java syntax

[10]

cascading.jruby
Grégoire Marabout (Qualtera), Matt Walker (Etsy), Stefan Karpinski (Etsy), Steve Mardenfeld (Etsy)
github: http://bit.ly/o3DNtC
blog: http://etsy.me/cFytuL
  
“push” job binaries to S3

run on Elastic Map/Reduce
starts cluster, runs, shuts down

access results on S3
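
The slides do not show how the job was launched. As a rough modern equivalent, the sketch below builds a request in the shape accepted by boto3's `run_job_flow`, which postdates this talk. The bucket names, jar path, instance types, and release label are all assumptions; the part that matters is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster shut down once its only step is done.

```python
def ephemeral_job_flow(name, jar_uri, args, log_uri, instance_count):
    """Build a run_job_flow request for a cluster that exists for one job only."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumption: any current release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "SlaveInstanceType": "m5.xlarge",   # assumption
            "InstanceCount": instance_count,
            # Shut the cluster down as soon as the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": name,
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": jar_uri, "Args": list(args)},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = ephemeral_job_flow(
    name="nightly-visits",
    jar_uri="s3://example-jobs/visits.jar",  # hypothetical bucket
    args=["s3://example-data/in", "s3://example-data/out"],
    log_uri="s3://example-logs/",
    instance_count=10,
)
# With AWS credentials configured, this could be submitted with
# boto3.client("emr").run_job_flow(**request).
```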
  
next project:
shop recommendations

3 steps:
✔ data preparation - Cascading
✖ analysis/training
✖ prediction

sparse implementation of SVD

3 steps:
✔ data preparation - Cascading
✖ analysis/training - MATLAB
✖ prediction - MATLAB

“MATLAB, in my Hadoop cluster?”
  
hadoop streaming
arbitrary scripts for map & reduce
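
Hadoop streaming feeds records to a script on standard input, one per line, and reads tab-separated key/value lines back from standard output. The sketch below shows a mapper and reducer as plain Python generators and simulates the shuffle locally with `sorted`. The input layout (key in the first tab-separated column) and the sample records are assumptions, not data from the talk.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for the first tab-separated field of each input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum the counts per key; Hadoop delivers reducer input sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> shuffle/sort -> reduce.
raw = ["shop_7\tview\n", "shop_3\tview\n", "shop_7\tfavorite\n"]
counts = list(reducer(sorted(mapper(raw))))
```

In a real job each function would live in its own script that reads `sys.stdin`, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options.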
  
Swiss army knife

Full dataset analysis
Matlab, Ruby scripts

‘Artifact’ outputs
Tokyo Cabinet, Lucene, SQLite

Side-effects
MySQL, CloudFront

[3]

3 steps:
✔ data preparation - Cascading
✔ analysis/training - MATLAB
✔ prediction - MATLAB

Job 2
Job 1

[4]

Barnum
Sinatra web service on EC2
  
barnum starts job and passes callback URL

when job finishes, hadoop hits callback URL to barnum to proceed
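
The callback handshake can be sketched as a small in-memory state machine. Barnum itself was a Sinatra (Ruby) service and its code is not shown in the slides, so every name here (`Workflow`, `on_callback`, the URL layout, the host) is hypothetical; the sketch only illustrates "start a job with a callback URL, advance when the callback arrives".

```python
class Workflow:
    """Run jobs one after another, advancing whenever a callback arrives."""

    def __init__(self, jobs, launch, base_url):
        self.jobs = list(jobs)
        self.launch = launch    # callable(job_name, callback_url)
        self.base_url = base_url
        self.position = -1      # index of the job currently running
        self.finished = False

    def callback_url(self, index):
        return f"{self.base_url}/callback/{index}"

    def start(self):
        self._launch_next()

    def on_callback(self, index):
        """Handle the request the cluster makes when job `index` finishes."""
        if index != self.position:
            return  # stale or duplicate callback; ignore it
        self._launch_next()

    def _launch_next(self):
        self.position += 1
        if self.position >= len(self.jobs):
            self.finished = True
            return
        self.launch(self.jobs[self.position], self.callback_url(self.position))


launched = []
flow = Workflow(
    jobs=["prepare_data", "train", "predict"],
    launch=lambda job, url: launched.append((job, url)),
    base_url="http://barnum.example",  # hypothetical host
)
flow.start()
flow.on_callback(0)
flow.on_callback(0)  # duplicate, ignored
flow.on_callback(1)
flow.on_callback(2)
```

Ignoring callbacks that do not match the running job keeps a retried request from launching the same step twice.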
  
Barnum constructs
  
3 steps:
✔ data preparation - Cascading
✔ analysis/training - MATLAB
✔ prediction - MATLAB

suggested_shops.yaml:
  
  
getting data back to web stack?

[6]

v1
ad-hoc shell scripts
TSV into unsharded MySQL
not re-usable

[6]

v2
  
datasets are versioned based upon job execution time
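
A minimal sketch of time-based versioning, assuming each run writes to a table named `<dataset>_<YYYYMMDDHHMMSS>`. That naming scheme and both function names are hypothetical; the slides only state that versions are keyed on job execution time.

```python
from datetime import datetime, timezone


def versioned_name(dataset, executed_at):
    """Append the job's execution time so each run writes to a new table."""
    return f"{dataset}_{executed_at.strftime('%Y%m%d%H%M%S')}"


def latest_version(dataset, table_names):
    """Pick the newest table for a dataset; fixed-width timestamps sort as text."""
    prefix = dataset + "_"
    candidates = [
        name for name in table_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    ]
    return max(candidates) if candidates else None


run = datetime(2011, 5, 17, 3, 30, 0, tzinfo=timezone.utc)
name = versioned_name("suggested_shops", run)
tables = ["suggested_shops_20110516033000", name, "suggested_shops_meta"]
current = latest_version("suggested_shops", tables)
```

Because every run lands in a fresh table, readers can keep using the previous version until the new one is complete.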
  
MySQL Tables:

Memcache Cluster:

Output dataset <-> ORM Model

PHP:

PHP:

Cascading:

PHP:

Cascading:

PHP:

Old tables regularly dropped
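
Dropping old tables could look like the sketch below. It assumes each dataset version is a table named `<dataset>_<timestamp>` with a fixed-width numeric timestamp and that the newest two versions are kept; the naming scheme, the retention count, and the function names are all assumptions rather than details from the talk.

```python
def tables_to_drop(dataset, table_names, keep=2):
    """Return the versions of a dataset older than the newest `keep` tables."""
    prefix = dataset + "_"
    versions = sorted(
        name for name in table_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    return versions[:-keep] if keep > 0 else versions


def drop_statements(names):
    # Backticks quote the identifier in MySQL.
    return [f"DROP TABLE IF EXISTS `{name}`;" for name in names]


tables = [
    "suggested_shops_20110515033000",
    "suggested_shops_20110516033000",
    "suggested_shops_20110517033000",
    "shop_stats_20110517033000",
]
stale = tables_to_drop("suggested_shops", tables, keep=2)
statements = drop_statements(stale)
```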
  
how we’re using this stack

analytics (internal)
products (external)

analytics
  
products	
  
search quality

recommendations

May 2011:
4,926 successful job runs

[5]

scale up from zero

isolation

isolation across runs
fresh machine each time

isolation between developers
no toe-stepping

heterogeneous clusters

big RAM when you need it
(but not when you don’t)

need one machine?
use one machine.

writing jobs

PHENOMENAL COSMIC POWERS

[7]
  
prototyping
run slow, unoptimized version on 500 machines for < $100
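
The budget works out as machines × hours × hourly price. The sketch below uses a purely hypothetical rate of $0.10 per machine-hour, which is not a figure from the talk, to show how long a 500-machine cluster could run before reaching $100.

```python
def max_runtime_hours(budget, machines, hourly_rate):
    """Hours a cluster can run before machines * hours * rate reaches the budget."""
    return budget / (machines * hourly_rate)


HYPOTHETICAL_RATE = 0.10  # assumed dollars per machine-hour, for illustration only
hours = max_runtime_hours(budget=100, machines=500, hourly_rate=HYPOTHETICAL_RATE)
```

At that assumed rate the cluster gets about two hours, so the trade is a large, short-lived cluster instead of time spent optimizing the job first.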
  
parameter tuning
Try N=1, 2, 5, 10 and see which results in best output
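
Since each run gets its own cluster, the values of N can be tried side by side. The sketch below is a minimal version of that sweep; `toy_job` is a stand-in scoring function invented for illustration, where a real `run_job` would launch a cluster with the given parameter and score its output.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep(values, run_job):
    """Run one job per parameter value concurrently and return (best, scores)."""
    with ThreadPoolExecutor(max_workers=len(values)) as pool:
        scores = dict(zip(values, pool.map(run_job, values)))
    best = max(scores, key=scores.get)
    return best, scores


def toy_job(n):
    # Stand-in for launching a cluster with parameter n and scoring its output.
    return -(n - 5) ** 2


best_n, scores = sweep([1, 2, 5, 10], toy_job)
```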
  
[9]	
  
questions?
  
photo credits

[1] by elfike http://www.flickr.com/photos/elfike/157439707/
[2] by Dan4th http://www.flickr.com/photos/43264265@N00/5371557240/
[3] by mandolux http://www.flickr.com/photos/73935252@N00/34418046/
[4] by The Suss-Man http://www.flickr.com/photos/8692813@N06/4580254188/
[5] by Stephen Rees http://www.flickr.com/photos/60142746@N00/214461223/
[6] by Let Ideas Compete http://www.flickr.com/photos/question_everything/3414827746/
[7] by funkandjazz http://www.flickr.com/photos/phunk/2484159004/
[8] by ViaMoi http://www.flickr.com/photos/12187843@N07/3343619603/
[9] by kreg.steppe http://www.flickr.com/photos/spyndle/500305000/
[10] clipart (really)
[11] by Chris Pirillo http://www.flickr.com/photos/49503157467@N01/34588230/
  
