Designing, Scoping, and Configuring
    Scalable LAMP
    Infrastructure


    Presented 2010-05-19 by David Strauss

Wed 2010-06-09
About me
     ‣   Founded Four Kitchens in 2006 while at UT Austin
     ‣   In 2008, launched Pressflow,
         which now powers the largest Drupal sites
     ‣   Worked with some of the largest sites in the world:
         Lifetime Digital, Mansueto Ventures, Wikipedia, The
         Internet Archive, and The Economist
     ‣   Engineered the LAMP stack, deployment tools, and
         management tools for Yale University, multiple
         NBC-Universal properties, and Drupal.org
     ‣   Engineered development workflows for Examiner.com




Wed 2010-06-09
About me
     ‣   Founded Four Kitchens in 2006 while at UT Austin
     ‣   In 2008, launched Pressflow,
         which now powers the largest Drupal sites
     ‣   Worked with some of the largest sites in the world:
         Lifetime Digital, Mansueto Ventures, Wikipedia, The
         Internet Archive, and The Economist
     ‣   Engineered the LAMP stack, deployment tools, and
         management tools for Yale University, multiple NBC-
         Universal properties, and Drupal.org
     ‣   Engineered development workflows for Examiner.com
     ‣   Contributor to Drupal, Bazaar, Ubuntu, BCFG2,
         Varnish, and other open-source projects

Some assumptions
     ‣   You have more than one web server
     ‣   You have root access
     ‣   You deploy to Linux
         (though PHP on Windows is more sane than ever)
     ‣   Database and web servers occupy separate boxes
     ‣   Your application behaves more or less
         like Drupal, WordPress, or MediaWiki


Understanding
     Load Distribution



Predicting peak traffic
         Traffic over the day can be highly irregular. To plan
         for peak loads, design as if all traffic were as heavy
         as the peak hour of a typical month, then add headroom
         for growth.
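A back-of-the-envelope version of this planning rule, with illustrative numbers (the hit counts, per-server rates, and growth factor below are assumptions for the sketch, not figures from this deck):

```python
import math

def servers_needed(peak_hour_hits, per_server_req_per_s, growth_factor=1.5):
    """Servers required to absorb the peak hour, scaled for growth headroom."""
    peak_req_per_s = peak_hour_hits / 3600      # hits in the busiest hour
    target_req_per_s = peak_req_per_s * growth_factor
    return math.ceil(target_req_per_s / per_server_req_per_s)

# Example: 1.8M hits in the peak hour; assume each app server handles ~50 req/s.
print(servers_needed(1_800_000, 50))  # 500 req/s, times 1.5 headroom -> 15 servers
```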




Analyzing hit distribution

    [Chart, reconstructed from the slide: 100% of hits split into
    Static Content (30%) and Dynamic Pages (70%). Dynamic Pages split
    into Anonymous (50%) and Authenticated (20%). Anonymous hits split
    into Human (40%) and Web Crawler (10%). Authenticated hits split
    into No Special Treatment (3%) and "Pay Wall" Bypass (7%);
    percentages as extracted from the slide.]
Throughput vs. Delivery Methods

                              Green        Yellow                  Red
                              (Static)     (Dynamic, Cacheable)    (Dynamic)
    Content Delivery
    Network                   ●●●●●●●●●●   ✖ ²                     ✖
    Reverse Proxy Cache
      (~5000 req/s)           ●●●●●●●●     ●●●●●●●                 ✖
    PHP + APC + memcached     ●●●● ¹       ●●●                     ●●●
    PHP + APC                 ●●●● ¹       ●●                      ●●
    PHP (No APC)              ●●●● ¹       ●                       ● (~10 req/s)

    More dots = more throughput.
    ¹ Delivered by Apache without PHP.
    ² Some actually can do this.
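One way to read the table: blend the hit mix with per-method throughput to estimate overall capacity. A sketch under assumed numbers (the req/s rates and the mix below are illustrative, not measurements from this deck):

```python
# Blended capacity when each traffic class is served by a different layer.
# Shares and rates are illustrative assumptions: static and anonymous-dynamic
# hits come from the reverse proxy cache, authenticated hits fall through to PHP.
mix  = {"static": 0.30, "anon_dynamic": 0.50, "auth_dynamic": 0.20}
rate = {"static": 5000.0, "anon_dynamic": 5000.0, "auth_dynamic": 50.0}

# Time to serve one "average" hit is the mix-weighted sum of per-class
# service times (1/rate); overall throughput is its reciprocal.
avg_time = sum(share / rate[k] for k, share in mix.items())
blended_rps = 1.0 / avg_time
print(round(blended_rps))  # ~240 req/s: the slow 20% dominates the blend
```

The point of the arithmetic: a small slice of slow, uncacheable traffic drags blended throughput far below the fast layers' numbers, which is why pushing hits up the table matters.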
Objective

                 Deliver hits using the
                 fastest, most scalable
                   method available


Layering: Less Traffic at Each Step

    [Diagram, reconstructed from the slide: incoming Traffic is
    distributed by DNS Round Robin between the CDN and Your Datacenter;
    inside the datacenter, hits flow Load Balancer → Reverse Proxy
    Cache → Application Server → Database, with each layer passing
    less traffic to the next.]
Offload from the master database
                     Your master database is the
                     single greatest limitation on
                     scalability.

    [Diagram, reconstructed from the slide: the Application Server
    offloads work from the Master Database onto a Search service, a
    Slave Database, and a Memory Cache; only the remaining queries
    reach the Master Database.]
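The offloading idea reduces to a routing rule: writes (and reads that must be fresh) go to the master, everything else to a slave or the memory cache. A minimal sketch, assuming hypothetical `cache`, `slave`, and `master` connection objects (none of these names come from this deck):

```python
# Route queries away from the master wherever possible.
# `cache`, `slave`, and `master` are hypothetical connection objects.

def fetch(key, sql, cache, slave, params=()):
    """Read path: memory cache first, then a slave database."""
    value = cache.get(key)
    if value is None:
        value = slave.query(sql, params)   # replica lag acceptable for reads
        cache.set(key, value)
    return value

def store(sql, master, cache, invalidate=()):
    """Write path: always the master, then drop stale cache entries."""
    master.execute(sql)
    for key in invalidate:
        cache.delete(key)
```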



Tools to use
     ‣   Apache Solr or Sphinx for search
          ‣   Solr can be fronted with Varnish or another
              proxy cache if queries are repetitive.
     ‣   Varnish, nginx, Squid, or Traffic Server
         for reverse proxy caching
     ‣   Any third-party service for CDN



Do the math
    ‣   All non-CDN traffic travels through your load
        balancers and reverse proxy caches. Even traffic
        passed through to application servers must run
        through the initial layers.


        Internal         Load           Reverse
                                                       Application
                                         Proxy
         Traffic          Balancer         Cache           Server



                    What hit rate is each layer getting?
                    How many servers share the load?

    David Strauss

Wed 2010-06-09
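A tiny illustrative calculation of the same math (the request rate, hit rate, and per-server capacity below are assumed numbers, not benchmarks):

```python
import math

def requests_reaching_app(total_rps, cache_hit_rate):
    """Traffic that misses the reverse proxy cache falls through to the app servers."""
    return total_rps * (1 - cache_hit_rate)

def servers_needed(rps, per_server_rps):
    """Round up to whole boxes, then add one spare for N+1 redundancy."""
    return math.ceil(rps / per_server_rps) + 1

# With a 90% cache hit rate, only a tenth of 2000 req/s reaches PHP.
app_rps = requests_reaching_app(total_rps=2000, cache_hit_rate=0.9)
print(servers_needed(app_rps, per_server_rps=100))  # 3 app servers (2 + 1 spare)
```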
Get a management/monitoring box
     (maybe even two, and have them specialize or be redundant)

         The management box connects to every other tier:
         load balancer, reverse proxy cache, application servers, and database.

Planning + Scoping

Infrastructure goals
     ‣   Redundancy: tolerate failure
     ‣   Scalability: engage more users
     ‣   Performance: ensure each user’s experience is fast
     ‣   Manageability: stay sane in the process

Redundancy
     ‣   When one server fails, the website should
         be able to recover without taking too long.
     ‣   This requires at least N+1, putting a floor
         on system requirements even for small sites.
     ‣   How long can your site be down?
          ‣   Automatic versus manual failover
          ‣   Warning: over-automation can reduce uptime

Performance
     ‣   Find the “sweet spot” for hardware. This is the
         best price/performance point.
     ‣   Avoid overspending on any type of component
     ‣   Yet, avoid creating bottlenecks
     ‣   Swapping memory to disk is very dangerous
          ‣   Don’t skimp on RAM

Relative importance

                          Processors/Cores   Memory   Disk Speed
     Reverse Proxy Cache  ●●                 ●●●      ●●
     Web Server           ●●●●●              ●●       ●
     Database Server      ●●●                ●●●●     ●●●●
     Monitoring           ●                  ●        ●

All of your servers
     ‣   64-bit: no excuse to use anything less in 2010
     ‣   RHEL/CentOS and Ubuntu have the broadest
         adoption for large-scale LAMP
          ‣   But pick one, and stick with it for development,
              staging, and production
     ‣   Some disk redundancy: rebuilding a server
         is time-consuming unless you’re very automated

Reverse proxy caches
     ‣   Varnish and nginx have modern architecture and
         broad adoption
          ‣   Sites often front Varnish with nginx
              for gzip and/or SSL
     ‣   Squid and Traffic Server are clunky
         but reliable alternatives

         CPU: save your money
         + Memory: 1 GB base system + 3 GB for caching
         + Disk: slow, small, redundant
         = 5000 req/s

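As a rough illustration, a Varnish VCL fragment (2.1-style syntax; the asset pattern and TTL are assumptions, not recommendations) that makes static assets cacheable by ignoring cookies:

```vcl
# Illustrative only: strip cookies on static assets so they can be cached.
sub vcl_recv {
    if (req.url ~ "\.(css|js|png|jpg|gif)$") {
        unset req.http.Cookie;
    }
}

sub vcl_fetch {
    if (req.url ~ "\.(css|js|png|jpg|gif)$") {
        unset beresp.http.Set-Cookie;
        set beresp.ttl = 1h;
    }
}
```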
Web servers
     ‣   Apache 2.2 + mod_php + memcached
     ‣   FastCGI is a bad idea
          ‣   Memory improvements are redundant w/ Varnish
          ‣   Higher latency, and less efficient use of the APC opcode cache
     ‣   Check the memory your app takes per process
     ‣   Tune MaxClients to around 25 × cores

         CPU: max out cores (but prefer fast cores to density)
         + Memory: 1 GB base system + 1 GB memcached
           + 25 × cores × per-process app memory
         + Disk: slow, small, redundant
         = 100 req/s

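A sketch of that MaxClients arithmetic in Apache 2.2 prefork terms (the core count and limits are assumptions for an 8-core box, not universal values):

```apache
# Illustrative prefork tuning for an 8-core web server.
# Also check RAM: ~1 GB base + 1 GB memcached
#                 + MaxClients × per-process app memory must fit.
<IfModule mpm_prefork_module>
    StartServers          20
    MinSpareServers       20
    MaxSpareServers       50
    ServerLimit          200
    MaxClients           200    # 25 × 8 cores
    MaxRequestsPerChild 1000    # recycle processes to bound memory growth
</IfModule>
```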
Database servers
     ‣   Insist on MySQL 5.1+ and InnoDB
     ‣   Consider Percona builds and (eventually) MariaDB
     ‣   Every Apache process generally needs at least one
         connection available, and leave some headroom
     ‣   Tune the InnoDB buffer pool to at least half of RAM

         CPU: no more than 8–12 cores
         + Memory: as much as you can afford (even RAM not
           used by MySQL caches disk content)
         + Disk: fast, large, redundant
         = 3000 queries/s

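As a hedged starting point, a my.cnf fragment reflecting these rules of thumb (the sizes assume a hypothetical dedicated 16 GB database box):

```ini
[mysqld]
# At least half of RAM on a dedicated box.
innodb_buffer_pool_size = 10G
# Trade strict per-transaction durability for write throughput.
innodb_flush_log_at_trx_commit = 2
# >= total Apache MaxClients across all web servers, plus headroom.
max_connections = 500
```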
Management server
     ‣   Nagios: service outage monitoring
     ‣   Cacti: trend monitoring
     ‣   Hudson: builds, deployment, and automation
     ‣   Yum/Apt repo: cluster package distribution
     ‣   Puppet/BCFG2/Chef: configuration management

         CPU: save your money
         + Memory: save your money
         + Disk: slow, large, redundant
         = good enough

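For example, a minimal Nagios service definition (the host name and template are assumptions) that alerts when a web server stops answering HTTP:

```cfg
define service {
    use                   generic-service   ; inherit check interval and notifications
    host_name             web1
    service_description   HTTP
    check_command         check_http
}
```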
Assembling the numbers
     ‣   Start with an architecture providing redundancy.
          ‣   Two servers, each running the whole stack
     ‣   Increase the number of proxy caches based on
         anonymous and search engine traffic.
     ‣   Increase the number of web servers based on
         authenticated traffic.
     ‣   Databases are harder to predict, but large sites
         should run them on at least two separate boxes
         with replication.

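The sizing logic above can be sketched with the ballpark per-box figures from the earlier slides (all inputs here are assumptions for illustration):

```python
import math

def size_tier(rps, per_box_rps):
    """Boxes needed for a tier, plus one spare for N+1 redundancy."""
    return max(1, math.ceil(rps / per_box_rps)) + 1

anonymous_rps = 4000      # served mostly by the proxy caches
authenticated_rps = 250   # passes through to the web servers

# Proxies see all non-CDN traffic; web servers see only cache misses.
proxies = size_tier(anonymous_rps + authenticated_rps, 5000)
web_servers = size_tier(authenticated_rps, 100)
print(proxies, web_servers)  # 2 4
```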
Extreme measures for performance and scalability

Wed 2010-06-09
When caching and search
     offloading isn’t enough




    David Strauss

Wed 2010-06-09
When caching and search
     offloading isn’t enough
     ‣   Some sites have intense custom page needs
          ‣   High proportion of authenticated users
          ‣   Lots of targeted content for anonymous users




    David Strauss

Wed 2010-06-09
When caching and search
     offloading isn’t enough
     ‣   Some sites have intense custom page needs
          ‣   High proportion of authenticated users
          ‣   Lots of targeted content for anonymous users
     ‣   Too much data to process real-time on an RDBMS




    David Strauss

Wed 2010-06-09
When caching and search
     offloading isn’t enough
     ‣   Some sites have intense custom page needs
          ‣   High proportion of authenticated users
          ‣   Lots of targeted content for anonymous users
     ‣   Too much data to process real-time on an RDBMS
     ‣   Data is so volatile that maintaing standard caches
         outweighs the overhead of regeneration

    David Strauss

Wed 2010-06-09
Non-relational/NoSQL tools
     ‣   Most web applications can run well
         on less-than-ACID persistence engines
     ‣   In some cases, like MongoDB, easier to use than
         SQL in addition to being higher performance
          ‣   Interested? You’ve already missed the tutorial.
     ‣   In other cases, like Cassandra, considerably harder
         to use than SQL but massively scalable
     ‣   Current Erlang-based systems are neat but slow
     ‣   Many require a special PHP extension,
         at least for ideal performance

Offline processing
     ‣   Gearman
          ‣   Primarily asynchronous job manager
     ‣   Hadoop
          ‣   MapReduce framework
     ‣   Traditional message queues
          ‣   ActiveMQ + Stomp is easy from PHP
          ‣   Allows you to build your own job manager

Edge-side includes

      Page template sent to the ESI processor (Varnish, Akamai, other):

      <html>
      <body>
         <esi:include href="http://drupal.org/block/views/3" />
      </body>
      </html>

      Fragment fetched for the include:

      <div>
         My block HTML.
      </div>

      Assembled response delivered to the client:

      <html>
      <body>
         <div>
            My block HTML.
         </div>
      </body>
      </html>

     ‣   Blocks of HTML are integrated into the page at the edge layer.
     ‣   Non-primary page content often occupies >50% of PHP execution time.
     ‣   Decouples block and page cache lifetimes

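As a sketch, in Varnish 2.x ESI processing is switched on per response in vcl_fetch (the URL pattern here is an assumption; match it to wherever your pages emit esi:include tags):

```vcl
# Illustrative only: enable ESI assembly for full pages, not fragments or assets.
sub vcl_fetch {
    if (req.url ~ "^/(node|page)/") {
        esi;   /* tell Varnish to parse and assemble <esi:include> tags */
    }
}
```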
HipHop PHP
     ‣   Compiles PHP to a C++-based binary
          ‣   Integrated HTTP server
     ‣   Supports a subset of PHP and extensions
     ‣   Requires an organizational commitment to
         building, testing, and deploying on HipHop
     ‣   Scott MacVicar has a presentation on HipHop later
         today at 16:00.

Cluster Problems



    Credits

Wed 2010-06-09
Server failure
     ‣   Load balancers can remove broken or overloaded
         application reverse proxy caches.
     ‣   Reverse proxy caches like Varnish can
         automatically use only functional application
         servers.
     ‣   Memcached clients automatically handle failure.
     ‣   Virtual service IP management tools like
         heartbeat2 can manage which MySQL servers
         receive connections to automate failover.
     ‣   Conclusion: Each layer intelligently monitors and
         uses the servers beneath it.
    David Strauss

Wed 2010-06-09
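For example, Varnish (2.x syntax) can health-probe its application servers and route only to the ones passing the probe; the addresses, /healthcheck URL, and thresholds below are hypothetical:

```vcl
backend app1 {
    .host = "10.0.0.11";
    .port = "8080";
    .probe = {
        .url = "/healthcheck";   # must return 200 when healthy
        .interval = 5s;
        .timeout = 1s;
        .window = 5;             # consider the last 5 probes
        .threshold = 3;          # 3 of 5 must pass
    }
}

backend app2 {
    .host = "10.0.0.12";
    .port = "8080";
    .probe = {
        .url = "/healthcheck";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}

# Requests are only distributed to backends whose probes pass.
director app_pool round-robin {
    { .backend = app1; }
    { .backend = app2; }
}
```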
Cluster coherency
     ‣   Systems that run properly on single boxes may
         lose coherency when run on a networked cluster.
     ‣   Some caches, like APC’s object cache, have no
         ability to handle network-level coherency. (APC’s
         opcode cache is safe to use on clusters, though.)
     ‣   memcached, if misconfigured, can hash values
         inconsistently across the cluster, resulting in
         different servers using different memcached
         instances for the same keys.
     ‣   Session coherency issues can be mitigated with
         load-balancer affinity or by storing sessions
         in memcached.
    David Strauss

Wed 2010-06-09
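With the pecl memcache extension, consistent hashing is a php.ini setting; every application server must share it, along with an identical, identically ordered server list:

```ini
; Hash keys with consistent hashing so every web server maps
; the same key to the same memcached instance, and so adding
; or losing an instance remaps only a fraction of the keys.
memcache.hash_strategy = consistent
memcache.allow_failover = 1
```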
Cache regeneration races
    ‣   Downside to network cache coherency:
        synchronized expiration
    ‣   Requires a locking framework (like ZooKeeper)

    Time →  [ Old Cached Item ] → Expiration →
            { all servers regenerating the item } →
            [ New Cached Item ]

    David Strauss

Wed 2010-06-09
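The timeline above is the dog-pile problem: every server sees the item expire at once and regenerates it in parallel. A minimal single-process sketch of the lock pattern in Python, using a threading.Lock where a real cluster would hold a ZooKeeper (or memcached add-based) lock, with one winner regenerating while the others serve stale data:

```python
import threading
import time

cache = {}             # key -> (value, expires_at)
locks = {}             # key -> per-key lock (stand-in for a cluster lock)
locks_guard = threading.Lock()

def get_with_lock(key, ttl, regenerate):
    """Return cached data; on expiry, only one caller regenerates."""
    now = time.time()
    entry = cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                      # fresh hit
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    if lock.acquire(blocking=False):         # the winner regenerates
        try:
            value = regenerate()
            cache[key] = (value, time.time() + ttl)
            return value
        finally:
            lock.release()
    if entry:
        return entry[0]                      # losers serve the stale copy
    with lock:                               # no stale copy: wait for winner
        return cache[key][0]
```

Serving the stale copy while one server rebuilds is what flattens the regeneration spike.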
Broken replication
     ‣   MySQL slave servers get out of sync and fall
         further behind.
     ‣   No (sane) method of automated recovery.
     ‣   Only solvable with good monitoring and recovery
         procedures.
     ‣   Blacklisting a broken slave can be automated,
         but requires cluster management tools.


    David Strauss

Wed 2010-06-09
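Monitoring usually watches SHOW SLAVE STATUS. A sketch of the blacklisting decision in Python (the column names are MySQL's actual SHOW SLAVE STATUS fields; the 30-second lag threshold is an assumed example):

```python
MAX_LAG_SECONDS = 30  # assumed threshold; tune per site

def slave_health(status):
    """Decide whether a slave should be blacklisted, given a dict
    parsed from SHOW SLAVE STATUS output."""
    if (status.get("Slave_IO_Running") != "Yes"
            or status.get("Slave_SQL_Running") != "Yes"):
        return "blacklist: replication stopped"
    lag = status.get("Seconds_Behind_Master")
    if lag is None or lag > MAX_LAG_SECONDS:
        # NULL lag means replication is broken, not caught up
        return "blacklist: lagging"
    return "ok"
```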
All content in this presentation, except where noted otherwise, is Creative Commons Attribution-
             ShareAlike 3.0 licensed and copyright 2009 Four Kitchen Studios, LLC.

Wed 2010-06-09
DrupalCamp Stockholm
     Presentation Ended Here



    David Strauss

Wed 2010-06-09
Managing the Cluster



    Credits

Wed 2010-06-09
The problem
                                 Software and
                                 Configuration
                                       ↓

         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server


        Objectives:
        Fast, atomic deployment and rollback
        Minimize single points of failure and contention
        Restart services
        Integrate with version control systems
    Credits

Wed 2010-06-09
Manual updates and deployment

            Human        Human        Human        Human        Human
              ↓            ↓            ↓            ↓            ↓
         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server


      Why not: slow deployment,
      non-atomic/difficult rollbacks


    Credits

Wed 2010-06-09
Shared storage
         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server
              ↓            ↓            ↓            ↓            ↓
                                    NFS


      Why not: single point of contention and failure



    Credits

Wed 2010-06-09
rsync
                                 Synchronized
                                  with rsync
                                       ↓
         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server


      Why not: non-atomic, does not manage services



    Credits

Wed 2010-06-09
Capistrano
                                 Deployed with
                                  Capistrano
                                       ↓
         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server


  Capistrano provides near-atomic deployment,
  service restarts, automated rollback, test automation,
  and version control integration (tagged releases).

    Credits

Wed 2010-06-09
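A minimal Capistrano 2 deploy.rb sketch; the application name, repository URL, host names, and restart command are hypothetical:

```ruby
# config/deploy.rb
set :application, "example"
set :scm, :git
set :repository, "git@example.com:example/site.git"
set :deploy_to, "/var/www/example"
set :keep_releases, 5        # old releases kept for rollback

role :web, "web1.example.com", "web2.example.com"

namespace :deploy do
  # Restart services once the "current" symlink points at the new release.
  task :restart, :roles => :web do
    run "sudo /sbin/service httpd reload"
  end
end
```

`cap deploy` checks the tagged release into a fresh directory on every server and atomically flips the `current` symlink; `cap deploy:rollback` flips it back to the previous release.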
Multistage deployment

        Deployments can be staged:
          cap staging deploy
          cap production deploy

        Development/Integration  →  Staging  →  Production
        (each environment deployed with Capistrano)

         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server


    Credits

Wed 2010-06-09
But your application isn’t the only
               thing to manage.



    Credits

Wed 2010-06-09
Beneath the application
            Reverse
             Proxy               Database          ← Cluster-level
             Cache                                   configuration


         Application   Application   Application   Application   Application
           Server        Server        Server        Server        Server

     Cluster management applies to package
     management, updates, and software
     configuration.
     cfengine and bcfg2 are popular
     cluster-level system configuration tools.
    Credits

Wed 2010-06-09
System configuration management
     ‣   Deploys and updates packages, cluster-wide or
         selectively.
     ‣   Manages arbitrary text configuration files.
     ‣   Analyzes inconsistent configurations (and
         converges them).
     ‣   Manages device classes (app servers, database
         servers, etc.).
     ‣   Allows confident configuration testing on a
         staging server.


    Credits

Wed 2010-06-09
All on the management box

                 The management box hosts:

                   ‣   Development/Integration
                   ‣   Staging
                   ‣   Deployment Tools
                   ‣   Monitoring

    Credits

Wed 2010-06-09
Monitoring



    Credits

Wed 2010-06-09
Types of monitoring
                        Failure              Capacity/Load

                   Analyzing Downtime       Analyzing Trends
                   Viewing Failover         Predicting Load
                   Troubleshooting          Checking Results
                   Notification             of Configuration
                                            and Software Changes


    David Strauss

Wed 2010-06-09
Everyone needs both.




    Credits

Wed 2010-06-09
What to use

                    Failure/Uptime   Capacity/Load

                       Nagios            Cacti

                       Hyperic          Munin




    David Strauss

Wed 2010-06-09
Nagios
     ‣   Highly recommended.
     ‣   Used by Four Kitchens and Tag1 Consulting for
         client work, Drupal.org, Wikipedia, etc.
     ‣   Easy to install on CentOS 5 using EPEL packages.
     ‣   Easy to install NRPE agents to monitor diverse
         services.
     ‣   Can notify administrators on failure.
     ‣   We use this on Drupal.org

    David Strauss

Wed 2010-06-09
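A typical pairing (plugin path, host name, and thresholds are hypothetical): an NRPE command on each monitored box, and a Nagios service on the management box that polls it:

```cfg
# nrpe.cfg on each monitored server: commands the agent will run
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6

# Nagios object configuration on the management box
define service {
    use                   generic-service
    host_name             web1
    service_description   Load
    check_command         check_nrpe!check_load
}
```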
Cacti
     ‣   Highly annoying to set up.
     ‣   One instance generally collects all statistics.
         (No “agents” on the systems being monitored.)
     ‣   Provides flexible graphs that can be customized
         on demand.




    Credits

Wed 2010-06-09
Munin
       ‣   Fairly easy to set up.
       ‣   One instance generally collects all statistics.
           (No “agents” on the systems being monitored.)
       ‣   Provides static graphs that cannot be
           customized.




    Credits

Wed 2010-06-09
Pressflow
     Make Drupal sites scale by upgrading core
     with a compatible, powerful replacement.




    David Strauss

Wed 2010-06-09
Common large-site issues
     ‣   Drupal core requires patching to effectively
         support the advanced scalability techniques
         discussed here.
     ‣   Patches often conflict and have to be reapplied
         with each Drupal upgrade.
     ‣   The original patches are often unmaintained.
     ‣   Sites stagnate, running old, insecure versions of
         Drupal core because updating is too difficult.


    David Strauss

Wed 2010-06-09
What is Pressflow?
     ‣   Pressflow is a derivative of Drupal core that
         integrates the most popular performance and
         scalability enhancements.
     ‣   Pressflow is completely compatible with existing
         Drupal 5 and 6 modules, both standard and
         custom.
     ‣   Pressflow installs as a drop-in replacement for
         standard Drupal.
     ‣   Pressflow is free as long as the matching version of
         Drupal is also supported by the community.
    David Strauss

Wed 2010-06-09
What are the enhancements?
     ‣   Reverse proxy support
     ‣   Database replication support
     ‣   Lower database and session management load
     ‣   More efficient queries
     ‣   Testing and optimization by Four Kitchens
         with standard high-performance software
         and hardware configuration
     ‣   Industry-leading scalability support
         by Four Kitchens and Tag1 Consulting
    David Strauss

Wed 2010-06-09
Four Kitchens + Tag1
     ‣   Provide the development, support, scalability, and
         performance services behind Pressflow
     ‣   Comprise most members of the Drupal.org
         infrastructure team
     ‣   Have the most experience scaling Drupal sites
         of all sizes and all types




    David Strauss

Wed 2010-06-09
Ready to scale?
     ‣   Learn more about Pressflow:
          ‣   Pick up pamphlets in the lobby
          ‣   Request Pressflow releases at fourkitchens.com
     ‣   Get the help you need to make it happen:
          ‣   Talk to me (David) or Todd here at DrupalCamp
          ‣   Email shout@fourkitchens.com



    David Strauss

Wed 2010-06-09

Planning LAMP infrastructure

  • 1. Designing, Scoping, and Configuring Scalable LAMP Infrastructure Presented 2010-05-19 by David Strauss Wed 2010-06-09
  • 3. About me ‣ Founded Four Kitchens in 2006 while at UT Austin Wed 2010-06-09
  • 4. About me ‣ Founded Four Kitchens in 2006 while at UT Austin ‣ In 2008, launched Pressflow, which now powers the largest Drupal sites Wed 2010-06-09
  • 5. About me ‣ Founded Four Kitchens in 2006 while at UT Austin ‣ In 2008, launched Pressflow, which now powers the largest Drupal sites ‣ Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist Wed 2010-06-09
  • 6. About me ‣ Founded Four Kitchens in 2006 while at UT Austin ‣ In 2008, launched Pressflow, which now powers the largest Drupal sites ‣ Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist ‣ Engineered the LAMP stack, deployment tools, and management tools for Yale University, multiple NBC- Universal properties, and Drupal.org Wed 2010-06-09
  • 7. About me ‣ Founded Four Kitchens in 2006 while at UT Austin ‣ In 2008, launched Pressflow, which now powers the largest Drupal sites ‣ Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist ‣ Engineered the LAMP stack, deployment tools, and management tools for Yale University, multiple NBC- Universal properties, and Drupal.org ‣ Engineered development workflows for Examiner.com Wed 2010-06-09
  • 8. About me ‣ Founded Four Kitchens in 2006 while at UT Austin ‣ In 2008, launched Pressflow, which now powers the largest Drupal sites ‣ Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist ‣ Engineered the LAMP stack, deployment tools, and management tools for Yale University, multiple NBC- Universal properties, and Drupal.org ‣ Engineered development workflows for Examiner.com ‣ Contributor to Drupal, Bazaar, Ubuntu, BCFG2, Varnish, and other open-source projects Wed 2010-06-09
  • 9. Some assumptions David Strauss Wed 2010-06-09
  • 10. Some assumptions ‣ You have more than one web server David Strauss Wed 2010-06-09
  • 11. Some assumptions ‣ You have more than one web server ‣ You have root access David Strauss Wed 2010-06-09
  • 12. Some assumptions ‣ You have more than one web server ‣ You have root access ‣ You deploy to Linux (though PHP on Windows is more sane than ever) David Strauss Wed 2010-06-09
  • 13. Some assumptions ‣ You have more than one web server ‣ You have root access ‣ You deploy to Linux (though PHP on Windows is more sane than ever) ‣ Database and web servers occupy separate boxes David Strauss Wed 2010-06-09
  • 14. Some assumptions ‣ You have more than one web server ‣ You have root access ‣ You deploy to Linux (though PHP on Windows is more sane than ever) ‣ Database and web servers occupy separate boxes ‣ Your application behaves more or less like Drupal, WordPress, or MediaWiki David Strauss Wed 2010-06-09
  • 15. Understanding Load Distribution David Strauss Wed 2010-06-09
  • 16. Predicting peak traffic Traffic over the day can be highly irregular. To plan for peak loads, design as if all traffic were as heavy as the peak hour of load in a typical month — and then plan for some growth. David Strauss Wed 2010-06-09
  • 17. Analyzing hit distribution David Strauss Wed 2010-06-09
  • 18. Analyzing hit distribution 100% David Strauss Wed 2010-06-09
  • 19. Analyzing hit distribution nt n te Co tic Sta 100% David Strauss Wed 2010-06-09
  • 20. Analyzing hit distribution 30% nt n te Co tic Sta 100% David Strauss Wed 2010-06-09
  • 21. Analyzing hit distribution 30% nt n te Co tic Sta 100% Dy Pag na es m ic David Strauss Wed 2010-06-09
  • 22. Analyzing hit distribution 30% nt n te Co tic Sta 100% Dy Pag na es m ic 70% David Strauss Wed 2010-06-09
  • 23. Analyzing hit distribution 30% nt n te Co tic Sta 100% Dy Pag na es m ic 70% Auth enticat ed David Strauss Wed 2010-06-09
  • 24. Analyzing hit distribution 30% nt n te Co tic Sta 100% Dy Pag na es m ic 70% Auth enticat ed 20% David Strauss Wed 2010-06-09
  • 25. Analyzing hit distribution 30% nt n te Co tic Sta 100% s ou m ny Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 26. Analyzing hit distribution 30% nt n te Co 50% tic Sta 100% s ou m ny Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 27. Analyzing hit distribution 30% an H um nt n te Co 50% tic Sta 100% s ou m ny Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 28. Analyzing hit distribution 40% 30% an H um nt n te Co 50% tic Sta 100% s ou m ny Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 29. Analyzing hit distribution 40% 30% an H um nt n te Co 50% tic Sta W wl C eb er ra 100% s ou m ny Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 30. Analyzing hit distribution 40% 30% an H um nt n te Co 50% tic Sta W wl C eb er ra 100% s ou m ny 10% Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 31. Analyzing hit distribution 40% 30% an H um nt n te Co 50% tic Sta e nt m at W wl Tre C eb er ra l 100% cia s ou o Sp e m N ny 10% Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 32. Analyzing hit distribution 40% 30% an H um nt n te Co 50% 3% tic Sta e nt m at W wl Tre C eb er ra l 100% cia s ou o Sp e m N ny 10% Dy Pag no na es A m ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 33. Analyzing hit distribution 40% 30% an H um nt n te Co 50% 3% tic Sta e nt m at W wl Tre C eb er ra l 100% cia s ou o Sp e m N ny 10% Dy Pag no “Pay na es W A Byp all” m ass ic 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 34. Analyzing hit distribution 40% 30% an H um nt n te Co 50% 3% tic Sta e nt m at W wl Tre C eb er ra l 100% cia s ou o Sp e m N ny 10% Dy Pag no “Pay na es W A Byp all” m ass ic 7% 70% Auth en ticat ed 20% David Strauss Wed 2010-06-09
  • 35. Throughput vs. Delivery Methods Yellow Green Red (Dynamic, (Static) (Dynamic) Cacheable) 2 Content Delivery Network ●●●●●●●●●● ✖ ✖ Reverse Proxy Cache ●●●●●●●● ●●●●●●● ✖ 5000 req/s 1 PHP + APC + ●●●● ●●● ●●● memcached 1 PHP + APC ●●●● ●● ●● 1 PHP (No APC) ●●●● ● ● 10 req/s 1 Delivered by Apache without PHP More dots = More throughput 2 Some actually can do this. David Strauss Wed 2010-06-09
  • 36. Objective Deliver hits using the fastest, most scalable method available David Strauss Wed 2010-06-09
  • 37. Layering: Less Traffic at Each Step David Strauss Wed 2010-06-09
  • 38. Layering: Less Traffic at Each Step Traffic David Strauss Wed 2010-06-09
  • 39. Layering: Less Traffic at Each Step Traffic David Strauss Wed 2010-06-09
  • 40. Layering: Less Traffic at Each Step Traffic CDN David Strauss Wed 2010-06-09
  • 41. Layering: Less Traffic at Each Step Your Datacenter Traffic CDN David Strauss Wed 2010-06-09
  • 42. Layering: Less Traffic at Each Step Your Datacenter Traffic CDN David Strauss Wed 2010-06-09
  • 43. Layering: Less Traffic at Each Step Your Datacenter Traffic DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 44. Layering: Less Traffic at Each Step Your Datacenter Load Traffic Balancer DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 45. Layering: Less Traffic at Each Step Your Datacenter Load Traffic Balancer DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 46. Layering: Less Traffic at Each Step Your Datacenter Load Reverse Traffic Proxy Balancer Cache DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 47. Layering: Less Traffic at Each Step Your Datacenter Load Reverse Traffic Proxy Balancer Cache DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 48. Layering: Less Traffic at Each Step Your Datacenter Load Reverse Traffic Proxy Application Balancer Cache Server DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 49. Layering: Less Traffic at Each Step Your Datacenter Load Reverse Traffic Proxy Application Balancer Cache Server DNS Round Robin CDN David Strauss Wed 2010-06-09
  • 50. Layering: Less Traffic at Each Step Your Datacenter Load Reverse Traffic Proxy Application Balancer Cache Server DNS Round Robin CDN Database David Strauss Wed 2010-06-09
  • 51. Offload from the master database Your master database is the single greatest limitation on scalability. David Strauss Wed 2010-06-09
  • 52. Offload from the master database Your master database is the single greatest limitation on scalability. Application Server Master Database David Strauss Wed 2010-06-09
  • 53. Offload from the master database Your master database is the single greatest limitation on scalability. Application Server Master Memory Cache Database David Strauss Wed 2010-06-09
  • 54. Offload from the master database Your master database is the single greatest limitation on scalability. Application Slave Server Database Master Memory Cache Database David Strauss Wed 2010-06-09
  • 55. Offload from the master database Search Your master database is the single greatest limitation on scalability. Application Slave Server Database Master Memory Cache Database David Strauss Wed 2010-06-09
  • 56. Tools to use David Strauss Wed 2010-06-09
  • 57. Tools to use ‣ Apache Solr or Sphinx for search ‣ Solr can be fronted with Varnish or another proxy cache if queries are repetitive. David Strauss Wed 2010-06-09
  • 58. Tools to use ‣ Apache Solr or Sphinx for search ‣ Solr can be fronted with Varnish or another proxy cache if queries are repetitive. ‣ Varnish, nginx, Squid, or Traffic Server for reverse proxy caching David Strauss Wed 2010-06-09
  • 59. Tools to use ‣ Apache Solr or Sphinx for search ‣ Solr can be fronted with Varnish or another proxy cache if queries are repetitive. ‣ Varnish, nginx, Squid, or Traffic Server for reverse proxy caching ‣ Any third-party service for CDN David Strauss Wed 2010-06-09
  • 60. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. David Strauss Wed 2010-06-09
  • 61. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Traffic David Strauss Wed 2010-06-09
  • 62. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Traffic Balancer David Strauss Wed 2010-06-09
  • 63. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Reverse Proxy Traffic Balancer Cache David Strauss Wed 2010-06-09
  • 64. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Reverse Proxy Traffic Balancer Cache David Strauss Wed 2010-06-09
  • 65. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Reverse Application Proxy Traffic Balancer Cache Server David Strauss Wed 2010-06-09
  • 66. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Reverse Application Proxy Traffic Balancer Cache Server David Strauss Wed 2010-06-09
  • 67. Do the math ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers. Internal Load Reverse Application Proxy Traffic Balancer Cache Server What hit rate is each layer getting? How many servers share the load? David Strauss Wed 2010-06-09
• 74. Get a management/monitoring box
‣ The management box watches and manages every other layer: load balancer, reverse proxy cache, application server, and database.
‣ (maybe even two, and have them specialize or be redundant)
• 75. Planning + Scoping
• 80. Infrastructure goals
‣ Redundancy: tolerate failure
‣ Scalability: engage more users
‣ Performance: ensure each user’s experience is fast
‣ Manageability: stay sane in the process
• 86. Redundancy
‣ When one server fails, the website should be able to recover without taking too long.
‣ This requires at least N+1, putting a floor on system requirements even for small sites.
‣ How long can your site be down?
‣ Automatic versus manual failover
‣ Warning: over-automation can reduce uptime
• 92. Performance
‣ Find the “sweet spot” for hardware. This is the best price/performance point.
‣ Avoid overspending on any type of component
‣ Yet, avoid creating bottlenecks
‣ Swapping memory to disk is very dangerous
‣ Don’t skimp on RAM
• 93. Relative importance

                       Processors/Cores   Memory   Disk Speed
  Reverse Proxy Cache  ●●                 ●●●      ●●
  Web Server           ●●●●●              ●●       ●
  Database Server      ●●●                ●●●●     ●●●●
  Monitoring           ●                  ●        ●
• 98. All of your servers
‣ 64-bit: no excuse to use anything less in 2010
‣ RHEL/CentOS and Ubuntu have the broadest adoption for large-scale LAMP
‣ But pick one, and stick with it for development, staging, and production
‣ Some disk redundancy: rebuilding a server is time-consuming unless you’re very automated
• 105. Reverse proxy caches
‣ Varnish and nginx have modern architecture and broad adoption
‣ Sites often front Varnish with nginx for gzip and/or SSL
‣ Squid and Traffic Server are clunky but reliable alternatives

  CPU: save your money
  Memory: 1 GB base system + 3 GB for caching
  Disk: slow + small + redundant
  = 5000 req/s
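The 3 GB caching figure maps directly onto Varnish’s storage setting. A minimal sketch of the daemon options, assuming a RHEL/CentOS-style sysconfig layout; ports and paths are hypothetical:

```
# /etc/sysconfig/varnish fragment — hypothetical values
# Serve on :80, keep the cache in a 3 GB in-memory (malloc) store,
# leaving roughly 1 GB of RAM for the base system.
DAEMON_OPTS="-a :80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -s malloc,3G"
```

Keeping the cache in malloc storage rather than on disk matches the “slow + small” disk spec above: the disk only matters for the OS, not for serving hits.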
• 114. Web servers
‣ Apache 2.2 + mod_php + memcached
‣ FastCGI is a bad idea
   ‣ Memory improvements are redundant with Varnish
   ‣ Higher latency + less efficient with the APC opcode cache
‣ Check the memory your app takes per process
‣ Tune MaxClients to around 25 × cores

  CPU: max out cores (but prefer fast cores to density)
  Memory: 1 GB base system + 1 GB memcached + 25 × cores × per-process app memory
  Disk: slow + small + redundant
  = 100 req/s
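The memory formula above is worth working through once. A sketch with hypothetical numbers (measure your own per-process footprint; it varies widely by application):

```python
# Sizing a web server with the deck's rule of thumb (hypothetical numbers).
cores = 8            # physical cores on the box
per_process_mb = 60  # resident memory per Apache/mod_php process — measure yours

max_clients = 25 * cores  # MaxClients ≈ 25 × cores
ram_needed_mb = (
    1024                           # base system
    + 1024                         # memcached
    + max_clients * per_process_mb # Apache worker pool
)

print(max_clients)    # 200
print(ram_needed_mb)  # 14048 MB ≈ 13.7 GB → provision a 16 GB box
```

If MaxClients × per-process memory overruns physical RAM, the box swaps, which the Performance slide above flags as very dangerous.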
• 123. Database servers
‣ Insist on MySQL 5.1+ and InnoDB
‣ Consider Percona builds and (eventually) MariaDB
‣ Every Apache process generally needs at least one connection available, and leave some headroom
‣ Tune the InnoDB buffer pool to at least half of RAM

  CPU: no more than 8–12 cores
  Memory: as much as you can afford (even RAM not used by MySQL caches disk content)
  Disk: fast + large + redundant
  = 3000 queries/s
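The two tunables above translate into a short my.cnf fragment. All values here are hypothetical, assuming a dedicated 16 GB database box behind two web servers each running MaxClients 200:

```
# my.cnf fragment — hypothetical values for a dedicated 16 GB database box
[mysqld]
# At least half of RAM for the InnoDB buffer pool
innodb_buffer_pool_size = 8G

# One connection per Apache process across the cluster, plus headroom:
# 2 web servers × MaxClients 200 = 400, rounded up
max_connections = 450
```

If max_connections is lower than the cluster-wide Apache process count, a traffic spike produces "Too many connections" errors instead of merely slow pages.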
• 133. Management server
‣ Nagios: service outage monitoring
‣ Cacti: trend monitoring
‣ Hudson: builds, deployment, and automation
‣ Yum/Apt repo: cluster package distribution
‣ Puppet/BCFG2/Chef: configuration management

  CPU: save your money
  Memory: save your money
  Disk: slow + large + redundant
  = good enough
• 138. Assembling the numbers
‣ Start with an architecture providing redundancy.
   ‣ Two servers, each running the whole stack
‣ Increase the number of proxy caches based on anonymous and search engine traffic.
‣ Increase the number of web servers based on authenticated traffic.
‣ Databases are harder to predict, but large sites should run them on at least two separate boxes with replication.
• 139. Extreme measures for performance and scalability
• 143. When caching and search offloading aren’t enough
‣ Some sites have intense custom page needs
   ‣ High proportion of authenticated users
   ‣ Lots of targeted content for anonymous users
‣ Too much data to process in real time on an RDBMS
‣ Data is so volatile that maintaining standard caches outweighs the overhead of regeneration
• 150. Non-relational/NoSQL tools
‣ Most web applications can run well on less-than-ACID persistence engines
‣ In some cases, like MongoDB, easier to use than SQL in addition to being higher performance
‣ Interested? You’ve already missed the tutorial.
‣ In other cases, like Cassandra, considerably harder to use than SQL but massively scalable
‣ Current Erlang-based systems are neat but slow
‣ Many require a special PHP extension, at least for ideal performance
• 154. Offline processing
‣ Gearman
   ‣ Primarily an asynchronous job manager
‣ Hadoop
   ‣ MapReduce framework
‣ Traditional message queues
   ‣ ActiveMQ + Stomp is easy from PHP
   ‣ Allows you to build your own job manager
• 163. Edge-side includes
‣ Blocks of HTML are integrated into the page at the edge layer.
‣ Non-primary page content often occupies >50% of PHP execution time.
‣ Decouples block and page cache lifetimes

  Page template sent to the ESI processor (Varnish, Akamai, other):
  <html>
  <body>
  <esi:include href="http://drupal.org/block/views/3" />
  </body>
  </html>

  Fragment fetched for the include:
  <div>
  My block HTML.
  </div>

  Assembled response:
  <html>
  <body>
  <div>
  My block HTML.
  </div>
  </body>
  </html>
• 167. HipHop PHP
‣ Compiles PHP to a C++-based binary
‣ Integrated HTTP server
‣ Supports a subset of PHP and extensions
‣ Requires an organizational commitment to building, testing, and deploying on HipHop
‣ Scott MacVicar has a presentation on HipHop later today at 16:00.
• 168. Cluster Problems
• 174. Server failure
‣ Load balancers can remove broken or overloaded reverse proxy caches.
‣ Reverse proxy caches like Varnish can automatically use only functional application servers.
‣ Memcached clients automatically handle failure.
‣ Virtual service IP management tools like heartbeat2 can manage which MySQL servers receive connections to automate failover.
‣ Conclusion: each layer intelligently monitors and uses the servers beneath it.
• 179. Cluster coherency
‣ Systems that run properly on single boxes may lose coherency when run on a networked cluster.
‣ Some caches, like APC’s object cache, have no ability to handle network-level coherency. (APC’s opcode cache is safe to use on clusters, though.)
‣ memcached, if misconfigured, can hash values inconsistently across the cluster, resulting in different servers using different memcached instances for the same keys.
‣ Session coherency issues can be helped with load balancer affinity or storage in memcached
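The memcached hashing point is easiest to see by comparison. This sketch (server names hypothetical, pure-Python stand-in for a memcached client) contrasts naive modulo hashing, which remaps most keys when the server list changes, with a consistent-hash ring, which remaps only the keys owned by the changed server:

```python
import hashlib

def h(s):
    # Stable hash so results are reproducible across processes
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def modulo_lookup(key, servers):
    # Naive scheme: any change to len(servers) reshuffles almost everything
    return servers[h(key) % len(servers)]

def ring_lookup(key, servers, vnodes=100):
    # Consistent hashing: each server owns many points on a ring;
    # a key belongs to the first point at or after its own hash.
    ring = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
    k = h(key)
    for point, server in ring:
        if k <= point:
            return server
    return ring[0][1]  # wrap around the ring

keys = [f"cache-item-{i}" for i in range(1000)]
full = ["mc1", "mc2", "mc3", "mc4"]
minus_one = ["mc1", "mc2", "mc3"]  # mc4 removed from the pool

moved_modulo = sum(modulo_lookup(k, full) != modulo_lookup(k, minus_one) for k in keys)
moved_ring = sum(ring_lookup(k, full) != ring_lookup(k, minus_one) for k in keys)

print(moved_modulo, moved_ring)  # the ring moves far fewer keys
```

The practical rule follows: every client in the cluster must use the same hashing scheme and the same server list, or the same key lands on different memcached instances from different web servers.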
• 187. Cache regeneration races
‣ Downside to network cache coherency: synched expiration
‣ Requires a locking framework (like ZooKeeper)
‣ Timeline: the old cached item expires, then every server notices at once and regenerates the item in parallel until the new cached item lands.
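One common fix for the race is a try-lock with stale fallback: a single worker rebuilds the expired item while everyone else keeps serving the old copy. A minimal single-process sketch; the dict-based cache and lock set stand in for memcached and a real lock service such as ZooKeeper:

```python
import time

cache = {}     # key -> (value, expires_at); stand-in for memcached
locks = set()  # held regeneration locks; stand-in for a lock service

def get_with_stale_fallback(key, regenerate, ttl=60):
    now = time.time()
    value, expires_at = cache.get(key, (None, 0))
    if value is not None and now < expires_at:
        return value   # fresh hit
    if key in locks:
        return value   # someone else is rebuilding: serve the stale copy
    locks.add(key)     # we won the race: rebuild the item ourselves
    try:
        value = regenerate()
        cache[key] = (value, now + ttl)
        return value
    finally:
        locks.discard(key)

calls = []
def build():
    calls.append(1)
    return "page-html"

print(get_with_stale_fallback("front", build))  # miss -> regenerates
print(get_with_stale_fallback("front", build))  # fresh hit, no rebuild
print(len(calls))  # 1: only one regeneration happened
```

In a real cluster the lock acquisition must itself be atomic (e.g. memcached `add` or a ZooKeeper ephemeral node); the point here is only the serve-stale-while-one-rebuilds pattern.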
• 192. Broken replication
‣ MySQL slave servers get out of synch and fall further behind
‣ No (sane) method of automated recovery
‣ Only solvable with good monitoring and recovery procedures
‣ Can automate DB slave blacklisting from use, but requires cluster management tools
• 193. All content in this presentation, except where noted otherwise, is Creative Commons Attribution-ShareAlike 3.0 licensed and copyright 2009 Four Kitchen Studios, LLC.
• 194. DrupalCamp Stockholm Presentation Ended Here
• 195. Managing the Cluster
• 196. The problem
‣ Getting software and configuration onto every application server in the cluster
‣ Objectives:
   ‣ Fast, atomic deployment and rollback
   ‣ Minimize single points of failure and contention
   ‣ Restart services
   ‣ Integrate with version control systems
• 197. Manual updates and deployment
‣ One human per application server, applying updates by hand
‣ Why not: slow deployment, non-atomic/difficult rollbacks
• 198. Shared storage
‣ All application servers mount the same NFS volume
‣ Why not: single point of contention and failure
• 199. rsync
‣ Application servers synchronized with rsync
‣ Why not: non-atomic, does not manage services
• 200. Capistrano
‣ Application servers deployed with Capistrano
‣ Capistrano provides near-atomic deployment, service restarts, automated rollback, test automation, and version control integration (tagged releases).
• 201. Multistage deployment
‣ Deployments can be staged: cap staging deploy, then cap production deploy
‣ Development → Integration → Staging (deployed with Capistrano) → Production application servers (deployed with Capistrano)
• 202. But your application isn’t the only thing to manage.
• 203. Beneath the application
‣ Cluster-level configuration also covers the reverse proxy cache and database, not just the application servers.
‣ Cluster management applies to package management, updates, and software configuration.
‣ cfengine and bcfg2 are popular cluster-level system configuration tools.
• 204. System configuration management
‣ Deploys and updates packages, cluster-wide or selectively
‣ Manages arbitrary text configuration files
‣ Analyzes inconsistent configurations (and converges them)
‣ Manages device classes (app servers, database servers, etc.)
‣ Allows confident configuration testing on a staging server
• 205. All on the management box
‣ Development, integration, and staging environments
‣ Deployment tools
‣ Monitoring
• 206. Monitoring
• 207. Types of monitoring
‣ Failure: analyzing downtime, viewing failover, notification
‣ Capacity/Load: analyzing trends, predicting load, troubleshooting, checking results of configuration and software changes
• 208. Everyone needs both.
• 209. What to use
‣ Failure/Uptime: Nagios, Hyperic
‣ Capacity/Load: Cacti, Munin
• 210. Nagios
‣ Highly recommended; used by Four Kitchens and Tag1 Consulting for client work, Drupal.org, Wikipedia, etc.
‣ Easy to install on CentOS 5 using EPEL packages
‣ Easy to install nrpe agents to monitor diverse services
‣ Can notify administrators on failure
• 211. Cacti
‣ Highly annoying to set up
‣ One instance generally collects all statistics (no “agents” on the systems being monitored)
‣ Provides flexible graphs that can be customized on demand
• 212. Munin
‣ Fairly easy to set up
‣ One instance generally collects all statistics (no “agents” on the systems being monitored)
‣ Provides static graphs that cannot be customized
• 213. Pressflow
‣ Make Drupal sites scale by upgrading core with a compatible, powerful replacement.
• 214. Common large-site issues
‣ Drupal core requires patching to effectively support the advanced scalability techniques discussed here.
‣ Patches often conflict and have to be reapplied with each Drupal upgrade.
‣ The original patches are often unmaintained.
‣ Sites stagnate, running old, insecure versions of Drupal core because updating is too difficult.
• 215. What is Pressflow?
‣ Pressflow is a derivative of Drupal core that integrates the most popular performance and scalability enhancements.
‣ Pressflow is completely compatible with existing Drupal 5 and 6 modules, both standard and custom.
‣ Pressflow installs as a drop-in replacement for standard Drupal.
‣ Pressflow is free as long as the matching version of Drupal is also supported by the community.
• 216. What are the enhancements?
‣ Reverse proxy support
‣ Database replication support
‣ Lower database and session management load
‣ More efficient queries
‣ Testing and optimization by Four Kitchens with standard high-performance software and hardware configuration
‣ Industry-leading scalability support by Four Kitchens and Tag1 Consulting
• 217. Four Kitchens + Tag1
‣ Provide the development, support, scalability, and performance services behind Pressflow
‣ Comprise most members of the Drupal.org infrastructure team
‣ Have the most experience scaling Drupal sites of all sizes and all types
• 218. Ready to scale?
‣ Learn more about Pressflow:
   ‣ Pick up pamphlets in the lobby
   ‣ Request Pressflow releases at fourkitchens.com
‣ Get the help you need to make it happen:
   ‣ Talk to me (David) or Todd here at DrupalCamp
   ‣ Email shout@fourkitchens.com