Optimized for change:
                Architecture @ Etsy
                Kellan Elliott-McCrea
                @kellan
                CTO, Etsy




Monday, June 18, 12
Launched June 18, 2005
                      875,000 active sellers
                      33.5MM items for sale
                      $65.9MM in sales, in May
                      1.4B page views, in May
                      102 engineers
                      32 releases, last Friday



LAMP
                                                 any questions?




8BitLit, http://www.etsy.com/listing/90066890/
Why?

3 inevitabilities we design for:

         1. Things break, unexpectedly
         2. What we're building changes
         3. We don't get to start over



2 years of change.




Architectural Principles
               * Don't bet against the future.
               * Our customers are humans.
               * Simplicity always wins, in the end.
               * Favor global vs local optimization.
               * Ambiguity kills momentum.
               * Make failure cheap.
               * Technical debt is an inevitable by-product
               of shipping code.
               * Optimize for change.


Cleverness
Ckrickett, http://www.etsy.com/listing/90611466
Complex systems and change
         1. Distributed systems are inherently complex.

       2. The outcome of change in complex systems is hard to
       predict.

        3. The outcomes of small, frequent, measurable changes
       are easier to predict, easier to recover from, and promote
       learning.




Continuous deployment, Metrics
Driven Development, Blameless
        Post-Mortems



Continuous deployment: Small,
            frequent changes to production




Continuous Deployment:
                                  No branching.

       “All existing revision control systems were
       built by people who build installed
       software”
       - Paul Hammond,
       Always Ship Trunk, Velocity 2010
       Thursday, March 17, 2011




Continuous Deployment:

                       feature flags
               if ($cfg['awesome_new_search']) {
                   # new hotness
                   $rsp = do_solr();
               } else {
                   # boring old stuff
                   $rsp = do_grep();
               }



Continuous Deployment:
                      Ramp-ups
                      (on top of feature flags)


        1. Launch to staff only
        2. Launch to 1% of all users
        3. Launch to members of a beta group




Continuous Deployment:


                      any engineer can launch a feature to

                      1% of users


Continuous Deployment:


           ~200 experiments
           live right now


Metrics driven development:

           introspection isn’t
           optional.
           measure everything,
           log everything
Metrics driven development:

           Metrics happen when
           you make it easy. And
           visible.

Metrics driven development:
   Teach computer to read graphs



                      holtWintersConfidence(Upper|Lower)




Metrics driven development:
    More info: http://www.slideshare.net/mikebrittain/metricsdriven-engineering




Optimize for MTTR, not MTBF


How?

Etsy



Etsy

                                        EMR/S3
                      PCI

                            BCP, Cold


[Architecture diagram: path of an inbound request]

 inbound request
         * CDNs - diversified at the DNS level
         * Internet providers - diversified at borders
         * Etsy network appliances

 Etsy
         * etsy.com/, api.etsy.com, /atlas: apache, php application
           (MySQL, search, memcache, async http, StatsD, sqlite, gearman)
         * bcn.etsy.com: apache, logs, logrotate, HDFS, analytics
         * etsystatic.com/photos: Squid, apache, php, imstor, NFS
         * MySQL: dbindex, dbshards, dbaux, dbdata
         * search: Thrift, Jetty, Solr slaves, datasets, Solr master,
           HBase, sharded MySQL
         * mail out: SMTP, X-Yarnblaster, etc
         * logs, server/OS, hardware
         * PCI: via jsonp, no privileged access

 AWS
         * analytics and imstor: EMR, S3, JRuby/Cascading, PHP, MySQL
CDNs: Put a slider on it




   Just works via weighted DNS


Apache
        * Well known
        * PHP is native
        * apache_note
        * fast start time
        * cheap in place replacement
        * .htaccess
        * Challenge: memory usage




Apache: apache_note
              apache_note('etsy_uaid', $id);

              introspection through the lifecycle
              Additive! insanely useful!




Apache: log format

      LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
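
The `%{...}n` fields in the log format above are Apache notes, which PHP can set with `apache_note()`. A minimal sketch of how the memory and timing notes might be filled in follows. The `log_note()` wrapper, the use of a shutdown function, and the choice of `memory_get_peak_usage()` are assumptions for illustration, not a description of Etsy's code.

```php
<?php
/**
 * Record a request-scoped value so Apache can log it via %{name}n.
 * apache_note() only exists when PHP runs inside Apache, so this
 * falls back to a no-op elsewhere (CLI, tests).
 */
function log_note(string $name, string $value): bool
{
    if (!function_exists('apache_note')) {
        return false;
    }
    apache_note($name, $value);
    return true;
}

$start = microtime(true);

// At the end of the request, expose memory and elapsed time to the access log.
register_shutdown_function(function () use ($start) {
    log_note('php_memory_usage_bytes', (string) memory_get_peak_usage());
    log_note('php_time_microsec', (string) (int) round((microtime(true) - $start) * 1e6));
});
```

Because notes are additive, any code path can attach another field (a shop id, an A/B selection) without touching the rest of the logging pipeline.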



Etsy: the App

         * 487,000 lines of PHP
         * 214,000 lines of JavaScript
         * Monolithic codebase
         * 3 front ends, Etsy.com, API, Atlas




Etsy: the App

         * routing handled by Apache
         * scripts fronting OO PHP5
         * PHP, fast by default
         * opcode caching
         * Challenge: liveliness when calling services




Etsy: coding patterns

         * light weight, home rolled “framework”
         * ORM handles DAO across backends
         * config and feature flags systems used
         everywhere
         * small slow moving datasets stored as PHP
         arrays
         * A/B tests
         * Smarty
         * StatsD
         * Concurrency
         * memcache

Etsy: A/B tests

        * beaconed
        * inserted into logs via apache_note
        * conditionalized on feature flags
        * nightly reports on conversion, bounce rate,
        etc
        * nightly reports on page speed, memory
        usage, etc




Etsy: Smarty

        * pre-compiled
        * pre-compiled per language




Etsy: StatsD


      StatsD::increment("logins.success");
      StatsD::timing("gearman.time", $msec);


       * 340,000 application metrics
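
What sits behind calls like these can be very small. The sketch below is an illustrative client, not the one Etsy ships: the class body, the host setting, and the format helpers are assumptions. The wire format (`name:value|c` for counters, `name:value|ms` for timers, sent over UDP, default port 8125) is StatsD's own.

```php
<?php
// Illustrative client; the class name mirrors the slide, the body is an assumption.
class StatsD
{
    public static $host = '127.0.0.1';
    public static $port = 8125;   // StatsD's default UDP port

    /** Format a counter increment, e.g. "logins.success:1|c". */
    public static function formatIncrement(string $stat, int $by = 1): string
    {
        return $stat . ':' . $by . '|c';
    }

    /** Format a timing in milliseconds, e.g. "gearman.time:320|ms". */
    public static function formatTiming(string $stat, float $msec): string
    {
        return $stat . ':' . $msec . '|ms';
    }

    public static function increment(string $stat, int $by = 1): void
    {
        self::send(self::formatIncrement($stat, $by));
    }

    public static function timing(string $stat, float $msec): void
    {
        self::send(self::formatTiming($stat, $msec));
    }

    /** Fire-and-forget over UDP: a metrics failure must never break a page. */
    private static function send(string $payload): void
    {
        $sock = @fsockopen('udp://' . self::$host, self::$port, $errno, $errstr, 0.1);
        if ($sock === false) {
            return;
        }
        @fwrite($sock, $payload);
        fclose($sock);
    }
}
```

UDP is what makes "measure everything" cheap: the application never waits for the metrics server, and a dead collector costs nothing but lost data points.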




Etsy: Concurrency

        * no native concurrency in PHP
        * asynchronous HTTP calls
        * Gearman




Etsy: Async HTTP calls

       * curl_multi_exec
       * non-blocking, per request time outs
       * used for optional aspects of a page
       * curl against http://localhost to avoid
       network overhead
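
The pattern above can be sketched with PHP's curl extension as follows. The `fetch_all()` helper, its 200 ms default, and the rule that a failed call yields `null` are assumptions for illustration; the `curl_multi_*` functions are the standard ones.

```php
<?php
/**
 * Fetch several URLs concurrently; returns [key => body|null].
 * A null body means the call failed or timed out, which is acceptable
 * for optional parts of a page.
 */
function fetch_all(array $urls, int $timeoutMs = 200): array
{
    if ($urls === []) {
        return [];
    }
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeoutMs);   // per-request timeout
        curl_setopt($ch, CURLOPT_NOSIGNAL, true);           // helps sub-second timeouts on some builds
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }
    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 0.05);   // wait for activity instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the handles whose transfer ended in an error (timeouts included).
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }
    $results = [];
    foreach ($handles as $key => $ch) {
        $ok = !in_array($ch, $failed, true)
            && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        $results[$key] = $ok ? curl_multi_getcontent($ch) : null;
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}
```

A page would call it with a keyed list of optional fragments, for example `fetch_all(['related' => 'http://localhost/fragment/related'])` (a hypothetical path), and simply leave out any fragment that comes back `null`.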




Etsy: Gearman

      * language agnostic job server
      * don’t use an MQ when you want a job
      server
      * 150 job types
      * persistent jobs flushed to MySQL, read
      from memory
      * non-persistent jobs just stored in memory
      * NP queue is wicked fast.




Etsy: Gearman

      * scaling CPU of cron jobs
      * denormalizing data
      * pushing to 3rd party services




Etsy: Challenges

      * Apache memory usage
      * liveliness talking to services, no
      concurrency, blocking by default




Etsy: graph of distributed failure




Etsy: Challenges
      * Apache memory usage
      * liveliness talking to services: no
      concurrency, blocking by default



    Enforce liveliness with a judicious
           application of force



Etsy: judicious application of force


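      // statm fields, in pages: total size, resident, shared, ...
      list($v, $res, $shar) = explode(' ', @file_get_contents('/proc/self/statm'));
      $mine = $res - $shar;   // resident pages not shared with other processes
      if ($mine > $cfg['sizelimit']) {
        $pid = getmypid();
        // USR1 is Apache's graceful signal
        @exec("kill -USR1 $pid");
      }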




Etsy: judicious application of force

        Bowhunter
        * Find long running PHP processes
        * Try to avoid those mid-post


        open(APACHE, "/usr/bin/curl -s http://localhost/server-status|") || die "$!";




Etsy: judicious application of force


        Query_killer
        * Same idea, long running queries
        * MySQL “SHOW PROCESSLIST;”




Memcache

       * Caching, obviously
       * Cache invalidation is hard
       * Write buffering
       * multi_get
       * rate limits




Memcache

       * atomic INCR is awesome
       * slice your time windows to reduce risk of
       cache eviction
       * we’ve been unlucky, lots of segfaults :(
       * multi_get slows down the more boxes in the
       pool
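
One reading of "atomic INCR" plus "slice your time windows", applied to rate limits, is sketched below. Everything here is an assumption for illustration: `FakeCache` is an in-memory stand-in for a memcached client (it ignores expiry), and `allow()` and its key layout are hypothetical.

```php
<?php
// In-memory stand-in for a memcached client (add/increment only),
// so the sketch runs without a server. Expiry is ignored here.
class FakeCache
{
    private $data = [];

    public function add(string $key, int $value, int $ttl): bool
    {
        if (isset($this->data[$key])) {
            return false;                      // add never overwrites
        }
        $this->data[$key] = $value;
        return true;
    }

    public function increment(string $key, int $by = 1)
    {
        if (!isset($this->data[$key])) {
            return false;                      // like memcached: no auto-create
        }
        return $this->data[$key] += $by;
    }
}

/**
 * Allow at most $limit actions per $window seconds.
 * The key embeds the current slice number, so a lost or evicted
 * counter only affects one short slice of time.
 */
function allow($cache, string $who, int $limit, int $window, int $now): bool
{
    $slice = intdiv($now, $window);
    $key = 'rate:' . $who . ':' . $slice;
    $cache->add($key, 0, $window * 2);         // create the counter if missing
    $count = $cache->increment($key);          // atomic on a real memcached
    return $count !== false && $count <= $limit;
}
```

Shorter slices bound the damage of an eviction at the cost of more keys; the right slice length depends on how strict the limit needs to be.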




MySQL: By the numbers

      * 25K+ queries/sec avg
      * 3TB InnoDB buffer pool
      * 15TB + data stored
      * 50 servers
      * 99.99% queries under 1ms




MySQL: a NotMuchSQL server
      * no joins
      * no foreign keys
      * no transactions or locks
      * no sub-selects
      * store data like you want to read it.
      * also: no auto_increment




MySQL: a NotMuchSQL server




               “Normalization is for sissies.”
                          - Cal Henderson, Flickr




MySQL: scale horizontally

        * objects sharded by key
        * lookups maintained in dbindex (MySQL is a
        FAST key-value store)
        * avoid key hashing, range partitions, and
        partitioning functions
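
The difference between an index lookup and key hashing can be sketched as below. The arrays stand in for the dbindex table and the shard list, and the host names and `shard_for()` helper are hypothetical.

```php
<?php
// Stand-in for the dbindex lookup table: user id => shard number.
// In production this would be a MySQL table; an array keeps the sketch runnable.
$dbindex = [8528531 => 5, 42 => 2];

// Hypothetical shard DSNs, keyed by shard number.
$shards = [2 => 'mysql:host=dbshard02', 5 => 'mysql:host=dbshard05'];

/**
 * Find the shard for an object by looking it up, not by hashing its key.
 * Because placement is data, moving a user means updating one index row.
 */
function shard_for(array $dbindex, array $shards, int $userId): string
{
    if (!isset($dbindex[$userId])) {
        throw new OutOfBoundsException("no shard assigned for user $userId");
    }
    return $shards[$dbindex[$userId]];
}
```

With hashing, adding a shard changes where existing keys land; with a lookup, new shards only receive the objects that are explicitly assigned or migrated to them.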


        more: http://www.slideshare.net/jgoulah/the-etsy-shard-architecture-starts-with-s-and-ends-with-hard




MySQL: Master-Master

        * objects hashed to a side, avoid split brain
        * allows in place schema upgrades without
        slave promotion
        * simplified capacity planning


        more: http://codeascraft.etsy.com/2012/04/20/two-sides-for-salvation/




MySQL: Introspection
  web0038 : [Mon Jun 18 09:58:38 2012] [error] [client 10.101.1.12]
  [C6kds9y1MVptEDMoOe5KCYha9VWl] [error] [ORM_LONG_QUERY] [/var/etsy/
  current/phplib/EtsyORM/Query/RawSql.php:752] [15877310] Query exceeded 10
  seconds: long_query_time=83.0927 long_query_string='/* [etsy_shard_005_A] [/
  remove_favorite_listing.php] */ DELETE FROM `users_favoritelistings` WHERE
  `user_id` = ? AND `listing_id` = ?' long_query_trace='#10 __construct() /EtsyModel/
  UserFavoriteListingMirror.php:310 #4 delete() /EtsyModel/UserFavoriteListing.php:39
  #3 delete() /EtsyModel/User.php:1840 #2 unfavoriteListing() /Controller/
  Favorites.php:344 #1 removeFavoriteListingRecord() /Controller/Favorites.php:94 #0
  performRemoveFavoriteListing() /var/etsy/current/htdocs/remove_favorite_listing.php:
  9', referer: http://www.etsy.com/people/kellanem/favorites?page=5




   SQL Comments are awesome!
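
The comment at the front of the logged query above (`/* [etsy_shard_005_A] [/remove_favorite_listing.php] */`) can be produced by a helper along these lines. The `tag_query()` function is a hypothetical sketch, not the code in EtsyORM.

```php
<?php
/**
 * Prefix a query with a comment naming the shard and the calling script,
 * in the style of the log line above, so slow-query logs and
 * SHOW PROCESSLIST output point back to the code that issued the query.
 */
function tag_query(string $sql, string $shard, string $script): string
{
    // Strip comment delimiters so a tag can never break out of the comment.
    $clean = function (string $s): string {
        return str_replace(['*/', '/*'], '', $s);
    };
    return sprintf('/* [%s] [%s] */ %s', $clean($shard), $clean($script), $sql);
}
```

Since the comment travels with the statement, tools that only see the query text can still attribute load to a page without any extra lookup.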



MySQL: Deletes are expensive


        * update objects to state='deleted'
        * use partitions
        * truncatenator - on ext3, hard link file, move,
        delete slowly.




Anatomy of a feature: Shop Stats




Anatomy of a feature: Shop Stats



              “Never get into a land war in Asia, and never
                build an analytics tool on top of MySQL.”




Anatomy of a feature: Shop Stats


        * buffer writes in Memcache using
        predictable keys
        * flush to MySQL tables periodically via cron
        * bake old data into all possible date ranges,
        and archive to S3
        * truncate tables
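
The first two steps above (buffer in Memcache under predictable keys, flush from cron) can be sketched as follows. An array stands in for memcache, and the key layout, the per-minute bucketing, and the function names are assumptions for illustration.

```php
<?php
// Array stand-in for memcache so the sketch runs anywhere.
$cache = [];

/** Predictable key: anyone who knows the shop and the minute can rebuild it. */
function stats_key(int $shopId, int $ts): string
{
    return sprintf('shopstats:%d:%s', $shopId, gmdate('YmdHi', $ts));
}

/** Hot path: count a view in the cache instead of writing to MySQL. */
function record_view(array &$cache, int $shopId, int $ts): void
{
    $key = stats_key($shopId, $ts);
    $cache[$key] = ($cache[$key] ?? 0) + 1;   // INCR on a real memcache
}

/**
 * Cron path: for a finished minute, rebuild the keys for known shops,
 * collect the counts as rows for one bulk MySQL insert, and clear them.
 */
function flush_minute(array &$cache, array $shopIds, int $ts): array
{
    $rows = [];
    foreach ($shopIds as $shopId) {
        $key = stats_key($shopId, $ts);
        if (isset($cache[$key])) {
            $rows[] = [
                'shop_id' => $shopId,
                'minute'  => gmdate('YmdHi', $ts),
                'views'   => $cache[$key],
            ];
            unset($cache[$key]);
        }
    }
    return $rows;
}
```

Predictable keys matter because memcache cannot list its contents: the flush job has to be able to reconstruct every key it needs from data it already has.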




bcn.etsy.com: beaconed event stream

        * Server-side and javascript event stream
        * At least one per page view
        * Apache serving static assets
        * Aggregated on HDFS via logrotate
        * Archived on S3
        * Analyzed via JRuby/Cascading on Hadoop
        * Doesn’t use: Flume, Scribe, etc




bcn.etsy.com: beaconed event stream

    {"event_guid":"c2ffb51808b.6d2be52959ef","user_id":
    8528531,"php_event_name":"s2","php_unique_id":"4fdf1cb5d5c078.37523961","php_event_dat
    e":"18/Jun/2012:08:19:01","locale_currency_code":"USD","pref_language":"en-
    US","region":"US","detected_region":"US","accept-languages":"en-
    US,en","isMobileDevice":"0","isMobileSupported":"0","isTabletSupported":"0","isTouch":"0","isEt
    syApp":"0","listing_ids":[60274277,101504389,98682771,88585080],"cids":
    [14103953,14239293,14247717,14209614],"query":"blue","keywords":
    ["blue","blue","blue","blue"],"position":1,"replay_number":1,"s2_cached":
    1,"php_ab_test_names":"orm_record_instance_caching;mobile_detector.all_blackberry;multila
    ng_shops_listings.view;ga_replacement_cookie;disable_search_autosuggest;admin_toolbar;tra
    nslations.live_translations;ab_analytics_test;search_type_experiment;search_ads.max_replays_
    less;search_diversity_experiment;search_cached_listing_cards;placefinder.cache_memcached_
    migration;search_stream_a;search_all_items_ignores_supplies;search_default_type;search.two
    _cluster_deploy;search_parameter_sample;thrift_category2_transform;search.similar_listing_b
    rowse_page;orm_replicant_safe_find_many;bottom_first;foreign_language_carousel;search.rel
    ated_searches_all_items;weddings.srp_promos;search_log_page_position;newrelic;clientlog;go
    ogle_analytics_async;personalized_endpoint;search_no_dropdown;community_nav_popout;se
    curity_settings;search_changes_tooltip;inline_listing_hearts;framelogger;log_normal;analytics_
    second_beacon;analytics_second_beacon_privileged;analytics_second_beacon_mobile","php_a
    b_var_names":"1;1;1;1;control;1;0;A;ponycorn_v3;1;threshold_off;1;1;1;0;all_sans_supplies;
    0;1;1;1;1;0;top;0;0;1;0;1;0;1;1;1;0;1;1;1;0;1;0;1","php_ab_selector_names":"




Search
[Diagram: Search Master, Search Slave01..SlaveNN, Web01..WebNN]

               * Search Master: BitTorrent to distribute indexes
               * Search Slave01, Slave02 ... SlaveNN: 100% of all
               indexes on each slave
               * Web01, Web02 ... WebNN to slaves: Thrift, with server
               affinity to improve cache hit ratio, just returns ids
               * Web servers: hydrate IDs via multi-get from databases
               and memcache, ignore a few failures
               * incremental index, every 7 minutes, avoid even
               numbered cron times
               * pull via cron, push via gearman
               * denormalized listing store, transition from MySQL to
               HBase, not user facing

Search
               * Solr trunk
               * Custom ranking via crunched datasets
               * BitSet fields for personalized search
               * Scaling the JVM
               * 32% of visits, 40% of sales
               * Also powers categories, unshardable
               queries
               * Next time, just use HTTP
               * Up next: custom codecs
               * Avoiding sharding


Search
               * JVM slow start
               * Search deployinator does rolling restart
               * HotSpot and GC cause unpredictable
               throughput
               * Overfetch - ask multiple servers, go with 1st
               response
               * Index size is important. Don’t store too
               much.
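
Since search returns only ids, the web tier has to turn them into listings. The sketch below shows one way "hydrate IDs via multi-get, ignore a few failures" might look. The `hydrate()` function, the callable it takes, and the miss-ratio threshold are all assumptions; the slides do not say how many failures count as "a few".

```php
<?php
/**
 * Turn search result ids into listings using one multi-get, keeping the
 * ranked order and silently dropping ids that could not be loaded.
 * $multiGet is any callable taking a list of ids and returning
 * [id => listing] for the ones it found (memcache first, then database).
 */
function hydrate(array $ids, callable $multiGet, float $maxMissRatio = 0.2): array
{
    $found = $multiGet($ids);
    $listings = [];
    foreach ($ids as $id) {
        if (isset($found[$id])) {
            $listings[] = $found[$id];
        }
    }
    // A few misses are fine; a large share suggests something is broken.
    $missing = count($ids) - count($listings);
    if ($ids !== [] && $missing / count($ids) > $maxMissRatio) {
        throw new RuntimeException("too many listings missing: $missing");
    }
    return $listings;
}
```

Keeping listing data out of the index is also what lets the index stay small, which the slide above calls out as important.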




Photos
                                                      * 400 million photos
                                                      * Uploaded locally, then
                                                      streamed to S3
                                                      * GraphicsMagick FTW
                                                      * Working set is tiny, served
                                                      out of Squid
                                                      * 2% read failure rate during
                                                      full S3 outage.
                                                      * 0% write failure rate
                                                      during full S3 outage.




JonathanOtis, http://www.etsy.com/listing/96361102/

Technology no longer part of the stack

       * Python Twisted
       * PostgreSQL and stored procedures
       * Scala and MongoDB
       * Clojure and Tokyo Tyrant
       * Rails
       * ActiveMQ
       * RabbitMQ
       * a "Routes" framework
       * building RPMs
       * Lighttpd

Takeaways
       1. A few simple, boring, well known
       components
       2. Extensive instrumentation
       3. Rapid iteration and feedback loops
       4. Human centric
       5. A few tweaks on the classics for scale
       6. Technology supports business goals

Questions?

       More info:
       http://codeascraft.etsy.com
       http://slideshare.net/etsy
       http://github.com/etsy
       http://www.etsy.com/jobs
       kellan@etsy.com


Architecting for Change: QCONNYC 2012

  • 1. Optimized for change: Architecture @ Etsy Kellan Elliott-McCrea @kellan CTO, Etsy Monday, June 18, 12
  • 3. Launched June 18, 2005 875,000 active sellers 33.5MM items for sale $65.9MM in sales, in May 1.4B page views, in May 102 engineers 32 releases, last Friday Monday, June 18, 12
  • 4. LAMP any questions? 8BitLit, http://www.etsy.com/listing/90066890/ Monday, June 18, 12
  • 6. 3 inevitabilities we design for: 1. Things break, unexpectedly 2. What we're building changes 3. We don't get to start over Monday, June 18, 12
  • 7. 2 years of change. Monday, June 18, 12
  • 8. Architectural Principles * Don't bet against the future. * Our customers are humans. * Simplicity always wins, in the end. * Favor global vs local optimization. * Ambiguity kills momentum. * Make failure cheap. * Technical debt is an inevitable by-product of shipping code. * Optimize for change. Monday, June 18, 12
  • 10. Complex systems and change 1. Distributed systems are inherently complex. 2. The outcome of change in complex systems is hard to predict. 3. The outcome of small, frequent, measurable changes are easier to predict, easier to recover from, and promote learning. Ckrickett, http://www.etsy.com/listing/90611466 Monday, June 18, 12
  • 11. Continuous deployment, Metrics Driven Development, Blameless Post-Mortems Ckrickett, http://www.etsy.com/listing/90611466 Monday, June 18, 12
  • 12. Continuous deployment: Small, frequent changes to production Ckrickett, http://www.etsy.com/listing/90611466 Monday, June 18, 12
  • 13. Continuous Deployment: No branching. “All existing revision control systems were built by people who build installed software” - Paul Hammond, Always Ship Trunk, Velocity 2010 Thursday, March 17, 2011 Monday, June 18, 12
  • 14. Continuous Deployment: feature flags if ($cfg[‘awesome_new_search’]) { # new hotness $rsp = do_solr(); } else { # boring old stuff $rsp = do_grep(); } Monday, June 18, 12
  • 15. Continuous Deployment: Ramp - ups (on top of feature flags) 1. Launch to staff only 2. Launch to 1% of all users 3. Launch to members of a beta group Monday, June 18, 12
  • 16. Continuous Deployment: any engineer can launch a feature to 1% of users Monday, June 18, 12
  • 17. Continuous Deployment: ~200 experiments live right now Monday, June 18, 12
  • 18. Metrics driven development: introspection isn’t optional. measure everything, log everything Monday, June 18, 12
  • 19. Metrics driven development: Metrics happen when you make it easy. And visible. Monday, June 18, 12
  • 20. Metrics driven development: Teach computer to read graphs holtWintersConfidence(Upper|Lower) Monday, June 18, 12
  • 21. Metrics driven development: More info: http://www.slideshare.net/ mikebrittain/metricsdriven-engineering Monday, June 18, 12
  • 22. Optimize for MTTR, not MTBF Monday, June 18, 12
  • 25. Etsy EMR/S3 PCI BCP, Cold Monday, June 18, 12
  • 26. inbound request CDNs - diversified at the DNS level Internet providers - diversified at borders AWS Etsy network appliances analytics imstor etsystatic.com/ etsy.com/ bcn.etsy.com EMR S3 photos api.etsy.com JRuby/ /atlas Squid Cascading apache apache apache S3 logs php application php PHP logrotate MySQL imstor MySQL HDFS search analytics NFS memcache async http StatsD sqlite gearman logs MySQL server/OS search mail out PCI hardware Thrift SMTP dbindex Jetty dbshards X-Yarnblaster Solr slaves via jsonp, dbaux datasets no privileged access dbdata Solr master etc HBase sharded MySQL Monday, June 18, 12
  • 27. CDNs: Put a slider on it Just works via weighted DNS Monday, June 18, 12
  • 28. Apache * Well known * PHP is native * apache_note * fast start time * cheap in place replacement * .htaccess * Challenge: memory usage Monday, June 18, 12
  • 29. Apache: apache_note intr Addit osp ive! ecti insa on nely thro apache_note('etsy_uaid', $id); ugh usefu the l! life cyc le Monday, June 18, 12
  • 30. Apache: log format LogFormat "%{X-Forwarded-For}i % {True-Client-IP}i %l %u %t "%r" %>s %b "%{Referer}i" "%{User- Agent}i" % {etsy_shop_id}n % {etsy_uaid}n %V % {etsy_ab_selections}n % {etsy_request_uuid}n % {etsy_api_consumer_key}n % {etsy_api_method_name}n % {php_memory_usage_bytes}n % {php_time_microsec}n %D" combined Monday, June 18, 12
  • 31. Etsy: the App * 487,000 lines of PHP * 214,000 lines of Javascript * Monolithic codebase * 3 front ends, Etsy.com, API, Atlas Monday, June 18, 12
  • 32. Etsy: the App * routing handled by Apache * scripts fronting OO PHP5 * PHP, fast by default * opcode caching * Challenge: liveliness when calling services Monday, June 18, 12
  • 33. Etsy: coding patterns * light weight, home rolled “framework” * ORM handles DAO across backends * config and feature flags systems used everywhere * small slow moving datasets stored as PHP arrays * A/B tests * Smarty * StatsD * Concurrency * memcache Monday, June 18, 12
  • 34. Etsy: A/B tests * beaconed * inserted into logs via apache_note * conditionalized on feature flags * nightly reports on conversion, bounce rate, etc * nightly reports on page speed, memory usage, etc Monday, June 18, 12
  • 35. Etsy: Smarty * pre-compiled * pre-compiled per language Monday, June 18, 12
  • 36. Etsy: StatsD StatsD::increment("logins.success"); StatsD::timing("gearman.time", $msec); * 340,000 application metrics Monday, June 18, 12
  • 37. Etsy: Concurrency * no native concurrency in PHP * asynchronous HTTP calls * Gearman Monday, June 18, 12
  • 38. Etsy: Async HTTP calls * curl_multi_exec * non-blocking, per request time outs * used for optional aspects of a page * curl against http://localhost to avoid network overhead Monday, June 18, 12
  • 39. Etsy: Gearman * language agnostic job server * don’t use an MQ when you want a job server * 150 job types * persistent jobs flushed to MySQL, read from memory * non-persistent jobs just stored in memory * NP queue is wicked fast.
  • 40. Etsy: Gearman * scaling CPU of cron jobs * denormalizing data * pushing to 3rd party services
  • 41. Etsy: Challenges * Apache memory usage * liveliness talking to services, no concurrency, blocking by default
  • 42. Etsy: graph of distributed failure
  • 43. Etsy: Challenges * Apache memory usage * liveliness talking to services: no concurrency, blocking by default Enforce liveliness with a judicious application of force
  • 44. Etsy: judicious application of force /* statm fields: total size, resident, shared (in pages) */ list($v, $res, $shar) = explode(' ', @file_get_contents('/proc/self/statm')); $mine = $res - $shar; if ($mine > $cfg['sizelimit']) { $pid = getmypid(); @exec("kill -USR1 $pid"); }
  • 45. Etsy: judicious application of force Bowhunter * Find long running PHP processes * Try to avoid those mid-post open(APACHE, "/usr/bin/curl -s http://localhost/server-status|") || die "$!";
  • 46. Etsy: judicious application of force Query_killer * Same idea, long running queries * MySQL “SHOW PROCESSLIST”
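The slide gives only the outline of Query_killer, so the following is a guess at its selection logic rather than the actual tool. It assumes rows shaped like the columns of MySQL's SHOW PROCESSLIST output (Id, User, Command, Time, Info); the 10-second threshold and the protected-user list are illustrative assumptions.

```python
def queries_to_kill(processlist, max_seconds=10, protected_users=("system user",)):
    """Return connection ids of statements running longer than max_seconds.

    `processlist` is a list of dicts keyed like SHOW PROCESSLIST columns.
    """
    victims = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # skip idle (Sleep) connections and replication threads
        if row["User"] in protected_users:
            continue  # never kill replication or other protected accounts
        if row["Time"] > max_seconds:
            victims.append(row["Id"])
    return victims


def kill_statements(ids):
    """KILL QUERY aborts the running statement but keeps the connection open."""
    return [f"KILL QUERY {conn_id}" for conn_id in ids]
```

Separating the selection from the kill makes the policy easy to dry-run: log what would be killed for a while before letting the tool act.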
  • 47. Memcache * Caching, obviously * Cache invalidation is hard * Write buffering * multi_get * rate limits
  • 48. Memcache * atomic INCR is awesome * slice your time windows to reduce risk of cache eviction * we’ve been unlucky, lots of segfaults :( * multi_get slows down the more boxes in the pool
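One way to read "atomic INCR" plus "slice your time windows" is a rate limiter that keeps one counter per small time slice and sums the recent slices, so an evicted key loses only a fraction of the window. The sketch below is an interpretation, not Etsy's code: `FakeCache` is an in-memory stand-in for a memcached client, and the key layout and limits are assumptions.

```python
import time


class FakeCache:
    """In-memory stand-in for a memcached client (add, incr, get only)."""

    def __init__(self):
        self.data = {}

    def add(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

    def get(self, key):
        return self.data.get(key)


def allow(cache, user_id, limit=100, window_sec=3600, slice_sec=300, now=None):
    """Count this request and report whether the user is within the limit."""
    now = time.time() if now is None else now
    current = int(now // slice_sec)
    key = f"rl:{user_id}:{current}"
    cache.add(key, 0)  # no-op when this slice's counter already exists
    cache.incr(key)  # atomic on a real memcached server
    slices = window_sec // slice_sec
    total = sum(cache.get(f"rl:{user_id}:{current - i}") or 0 for i in range(slices))
    return total <= limit
```

On a real pool the per-slice reads would be a single multi_get, and each key would carry an expiry slightly longer than the window.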
  • 49. MySQL: By the numbers * 25K+ queries/sec avg * 3TB InnoDB buffer pool * 15TB+ data stored * 50 servers * 99.99% of queries under 1ms
  • 50. MySQL: a NotMuchSQL server * no joins * no foreign keys * no transactions or locks * no sub-selects * store data like you want to read it. * also: no auto_increment
  • 51. MySQL: a NotMuchSQL server “Normalization is for sissies.” - Cal Henderson, Flickr
  • 52. MySQL: scale horizontally * objects sharded by key * lookups maintained in dbindex (MySQL is a FAST key-value store) * avoid key hashing, range partitions, and partitioning functions more: http://www.slideshare.net/jgoulah/the-etsy-shard-architecture-starts-with-s-and-ends-with-hard
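The preference for an index lookup over key hashing is easier to see in code: with an explicit key-to-shard table, adding a shard or moving one object never remaps anything else. The class below is a minimal sketch of that idea; a dict stands in for the dbindex database, and the least-loaded placement policy and shard names are assumptions for illustration.

```python
class ShardIndex:
    """Lookup-table sharding: an index maps each key to its shard explicitly."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.index = {}  # stands in for the index database

    def assign(self, key):
        """Place a new key on the least-loaded shard (hypothetical policy)."""
        if key not in self.index:
            load = {shard: 0 for shard in self.shards}
            for shard in self.index.values():
                load[shard] += 1
            self.index[key] = min(self.shards, key=lambda shard: load[shard])
        return self.index[key]

    def lookup(self, key):
        return self.index[key]

    def migrate(self, key, shard):
        """Move one object by rewriting a single index row."""
        self.index[key] = shard

    def add_shard(self, shard):
        """New capacity takes new keys; existing keys stay where they are."""
        self.shards.append(shard)
```

With modulo hashing, going from two shards to three would change the home of most keys; here existing assignments are untouched and hot objects can be rebalanced one at a time.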
  • 53. MySQL: Master-Master * objects hashed to a side, avoid split brain * allows in-place schema upgrades without slave promotion * simplified capacity planning more: http://codeascraft.etsy.com/2012/04/20/two-sides-for-salvation/
  • 54. MySQL: Introspection web0038 : [Mon Jun 18 09:58:38 2012] [error] [client 10.101.1.12] [C6kds9y1MVptEDMoOe5KCYha9VWl] [error] [ORM_LONG_QUERY] [/var/etsy/current/phplib/EtsyORM/Query/RawSql.php:752] [15877310] Query exceeded 10 seconds: long_query_time=83.0927 long_query_string='/* [etsy_shard_005_A] [/remove_favorite_listing.php] */ DELETE FROM `users_favoritelistings` WHERE `user_id` = ? AND `listing_id` = ?' long_query_trace='#10 __construct() /EtsyModel/UserFavoriteListingMirror.php:310 #4 delete() /EtsyModel/UserFavoriteListing.php:39 #3 delete() /EtsyModel/User.php:1840 #2 unfavoriteListing() /Controller/Favorites.php:344 #1 removeFavoriteListingRecord() /Controller/Favorites.php:94 #0 performRemoveFavoriteListing() /var/etsy/current/htdocs/remove_favorite_listing.php:9', referer: http://www.etsy.com/people/kellanem/favorites?page=5 SQL Comments are awesome!
  • 55. MySQL: Deletes are expensive * update objects to state=‘deleted’ * use partitions * truncatenator - on ext3, hard link file, move, delete slowly.
  • 56. Anatomy of a feature: Shop Stats
  • 57. Anatomy of a feature: Shop Stats “Never get into a land war in Asia, and never build an analytics tool on top of MySQL.”
  • 58. Anatomy of a feature: Shop Stats * buffer writes in Memcache using predictable keys * flush to MySQL tables periodically via cron * bake old data into all possible date ranges, and archive to S3 * truncate tables
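The first two Shop Stats bullets describe a write buffer: counters accumulate under keys the flusher can compute on its own, and a periodic job drains them into durable storage. The sketch below illustrates that pattern under stated assumptions; a dict stands in for Memcache, a `defaultdict` stands in for the MySQL table, and the key format is invented for the example.

```python
from collections import defaultdict


def counter_key(shop_id, day, hour):
    """Predictable key: the flusher can enumerate keys without scanning the cache."""
    return f"shopstats:{shop_id}:{day}:{hour:02d}"


def record_view(cache, shop_id, day, hour):
    """Buffer one page view; this would be an atomic INCR on a real memcached."""
    key = counter_key(shop_id, day, hour)
    cache[key] = cache.get(key, 0) + 1


def flush(cache, shop_ids, day, hours, table):
    """Cron-style flush: drain buffered counters into the durable table."""
    for shop_id in shop_ids:
        for hour in hours:
            count = cache.pop(counter_key(shop_id, day, hour), 0)
            if count:
                table[(shop_id, day, hour)] += count
```

Against a real cache, the read-then-delete in `flush` is not atomic with concurrent increments, so a production version would need to tolerate or guard against a small amount of loss around flush time.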
  • 60. bcn.etsy.com: beaconed event stream * Server-side and JavaScript event stream * At least one per page view * Apache serving static assets * Aggregated on HDFS via logrotate * Archived on S3 * Analyzed via JRuby/Cascading on Hadoop * Doesn’t use: Flume, Scribe, etc
  • 61. bcn.etsy.com: beaconed event stream {"event_guid":"c2ffb51808b.6d2be52959ef{".user_id": 8528531,"php_event_name":"s2","php_unique_id":"4fdf1cb5d5c078.37523961","php_event_date":"18/Jun/2012:08:19:01","locale_currency_code":"USD","pref_language":"en-US","region":"US","detected_region":"US","accept-languages":"en-US,en","isMobileDevice":"0","isMobileSupported":"0","isTabletSupported":"0","isTouch":"0","isEtsyApp":"0","listing_ids":[60274277,101504389,98682771,88585080],"cids":[14103953,14239293,14247717,14209614],"query":"blue","keywords":["blue","blue","blue","blue"],"position":1,"replay_number":1,"s2_cached":1,"php_ab_test_names":"orm_record_instance_caching;mobile_detector.all_blackberry;multilang_shops_listings.view;ga_replacement_cookie;disable_search_autosuggest;admin_toolbar;translations.live_translations;ab_analytics_test;search_type_experiment;search_ads.max_replays_less;search_diversity_experiment;search_cached_listing_cards;placefinder.cache_memcached_migration;search_stream_a;search_all_items_ignores_supplies;search_default_type;search.two_cluster_deploy;search_parameter_sample;thrift_category2_transform;search.similar_listing_browse_page;orm_replicant_safe_find_many;bottom_first;foreign_language_carousel;search.related_searches_all_items;weddings.srp_promos;search_log_page_position;newrelic;clientlog;google_analytics_async;personalized_endpoint;search_no_dropdown;community_nav_popout;security_settings;search_changes_tooltip;inline_listing_hearts;framelogger;log_normal;analytics_second_beacon;analytics_second_beacon_privileged;analytics_second_beacon_mobile","php_ab_var_names":"1;1;1;1;control;1;0;A;ponycorn_v3;1;threshold_off;1;1;1;0;all_sans_supplies;0;1;1;1;1;0;top;0;0;1;0;1;0;1;1;1;0;1;1;1;0;1;0;1","php_ab_selector_names":"
  • 62. Search [Diagram: Search Master; Search Slave01, Slave02 ... SlaveNN; Web01, Web02 ... WebNN; denormalized listing store; databases and memcache] * BitTorrent to distribute indexes * Thrift, with server affinity to improve cache hit ratio, just returns ids * 100% of all indexes on each slave * incremental index, every 7 minutes, avoid even numbered cron times * hydrate IDs via multi-get, ignore a few failures * pull via cron, push via gearman * transition from MySQL to HBase, not user facing
  • 63. Search * Solr trunk * Custom ranking via crunched datasets * BitSet fields for personalized search * Scaling the JVM * 32% of visits, 40% of sales * Also powers categories, unshardable queries * Next time, just use HTTP * Up next: custom codecs * Avoiding sharding
  • 64. Search * JVM slow start * Search deployinator does rolling restart * HotSpot and GC cause unpredictable throughput * Overfetch - ask multiple servers, go with 1st response * Index size is important. Don’t store too much.
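"Overfetch" trades extra load for predictable latency: send the same query to more than one server and take whichever answer arrives first, so one server stuck in a GC pause does not stall the page. The sketch below shows the pattern with a thread pool; the server callables, fan-out of two, and timeout are assumptions for illustration, not details from the talk.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def overfetch(query, servers, fanout=2, timeout_sec=1.0):
    """Send the same query to `fanout` servers and return the first good answer.

    `servers` is a list of callables taking the query; these are hypothetical
    stand-ins for search clients.
    """
    pool = ThreadPoolExecutor(max_workers=fanout)
    pending = {pool.submit(server, query) for server in servers[:fanout]}
    deadline = time.monotonic() + timeout_sec
    try:
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            done, pending = wait(pending, timeout=remaining, return_when=FIRST_COMPLETED)
            for future in done:
                if future.exception() is None:
                    return future.result()  # first successful response wins
        raise RuntimeError("no search server answered in time")
    finally:
        pool.shutdown(wait=False)  # let the losing request finish on its own
```

A failed server is simply skipped, so the same loop covers both slow and broken replicas; the cost is that every query consumes capacity on `fanout` servers.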
  • 65. Photos * 400 million photos * Uploaded locally, then streamed to S3 * GraphicsMagick FTW * Working set is tiny, served out of Squid * 2% read failure rate during full S3 outage. * 0% write failure rate during full S3 outage. JonathanOtis, http://www.etsy.com/listing/96361102/
  • 66. Technology no longer part of the stack * Python Twisted * PostgreSQL and stored procedures * Scala and MongoDB * Clojure and Tokyo Tyrant * Rails * ActiveMQ * RabbitMQ * a "Routes" framework * building RPMs * Lighttpd
  • 67. Takeaways 1. A few simple, boring, well-known components 2. Extensive instrumentation 3. Rapid iteration and feedback loops 4. Human-centric 5. A few tweaks on the classics for scale 6. Technology supports business goals
  • 68. Questions? More info: http://codeascraft.etsy.com http://slideshare.net/etsy http://github.com/etsy http://www.etsy.com/jobs kellan@etsy.com