Fixing Twitter
... and Finding your own Fail Whale

John Adams
Twitter Operations
<jna@twitter.com>
Operations
• Small team, growing rapidly.
• What do we do?
 • Software Performance (back-end)
 • Availability
 • Capacity Planning (metrics-driven)
 • Configuration Management
• We don’t deal with the physical plant.
Managed Services
• Dedicated team (NTTA)
• 24/7 Hands on remote support
• No clouds. We tried that!
 • Need raw processing power, latency too
     high in existing cloud offerings
• Frees us to deal with real, intellectual,
  computer science problems.
2008 Growth: 752%
[Chart: Unique Visitors (in Millions), Dec 07 to Dec 08; y-axis from 0 to 5]
That was only the beginning...
[Chart: later growth, with the span of the previous graph marked]
Uniques
 • Not slowing down, despite what outsiders say.
 • Hard for outsiders to measure API usage!
Growth = Pain
 + an appreciation for Institutionalized Fear
Mantra!
 • Find Weakest Point: Metrics + Logs + Science = Analysis
 • Take Corrective Action: Process
 • Move to Next Weakest Point: Repeatability
Find the Weakest Point

  • Metrics + Graphs
   • Individual metrics are irrelevant
  • Logs
  • SCIENCE!
  • Find out what the actionable items are.
Instrument Everything
[Photo: (cc) seenoevil@flickr]
Monitoring

 • Graph and report critical metrics in as near
   real time as possible
 • You already have the tools.
  • RRD
  • Ganglia + custom gmetric scripts
  • MRTG
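A custom Ganglia metric script can be very small. The sketch below only builds the command line for one `gmetric` call and does not execute it; the metric name, value, and units are made up for illustration.

```python
import shlex

def gmetric_command(name, value, metric_type="uint32", units=""):
    """Build the argv for one `gmetric` call (nothing is executed here)."""
    argv = ["gmetric", "--name", name, "--value", str(value),
            "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv

# Hypothetical metric: HTTP 503 responses counted over the last minute.
cmd = gmetric_command("http_503_per_min", 12, units="responses")
print(" ".join(shlex.quote(part) for part in cmd))
```

A cron job or daemon would pass the resulting list to `subprocess.run` on a host that has Ganglia installed.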
Dashboards
 • “Criticals” view
 • Smokeping/MRTG
 • Google Analytics
  • Not just for
     HTTP 200s/SEO
 • XML Feeds from
   managed services
 • Data Porn!
Analyze
  • Turn data into information
   • Where is the code base going?
   • Are things worse than they were?
     • Understand the impact of the last
        software deploy
      • Run check scripts during and after
        deploys
  • Capacity Planning, not Fire Fighting!
Forecasting
  • Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit)
  [Chart: status_id growth with a fitted curve, r² = 0.99; the signed int (32 bit) and unsigned int (32 bit) limits are each marked "Twitpocolypse"]
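As a rough illustration of this kind of forecast, the sketch below fits a straight line to the logarithm of status_id (that is, it assumes exponential growth) and solves for the day each 32-bit limit is reached. The sample data is invented, and the talk does not say which curve family was actually fitted.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def day_limit_reached(days, ids, limit):
    """Day on which the fitted exponential curve crosses `limit`."""
    slope, intercept = fit_line(days, [math.log(i) for i in ids])
    return (math.log(limit) - intercept) / slope

# Invented samples: status_id doubling every 100 days.
days = [0, 100, 200, 300]
ids = [250_000_000, 500_000_000, 1_000_000_000, 2_000_000_000]

SIGNED_MAX = 2**31 - 1     # signed 32-bit int limit
UNSIGNED_MAX = 2**32 - 1   # unsigned 32-bit int limit

signed_day = day_limit_reached(days, ids, SIGNED_MAX)
unsigned_day = day_limit_reached(days, ids, UNSIGNED_MAX)
print(f"signed limit around day {signed_day:.0f}, "
      f"unsigned limit around day {unsigned_day:.0f}")
```

With real data, checking the goodness of fit (the r² on the slide) is what tells you whether the extrapolation deserves any trust.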
Deploys

  • Graph time-of-deploy alongside server CPU and latency
  • Display time-of-last-deploy on dashboard
  [Screenshot: last deploy times]
Whale-Watcher
•   Simple shell script.
    •   MASSIVE WIN.
•   Whale = HTTP 503 (timeout)
•   Robot = HTTP 500 (error)
•   Examines last 100,000 lines of aggregated
    daemon / www logs
•   “Whales per Second” > W_threshold
    •   Thar be whales! Call in ops.
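The original is a shell script that is not shown in the slides. The sketch below illustrates the same idea in Python under stated assumptions: each log line is assumed to start with an epoch timestamp followed by the HTTP status, and the threshold value is invented.

```python
from collections import deque

WHALE, ROBOT = "503", "500"   # status codes as named on the slide
W_THRESHOLD = 0.5             # hypothetical whales-per-second limit

def tail(lines, n=100_000):
    """Keep only the last n lines, like `tail -n`."""
    return deque(lines, maxlen=n)

def rates(lines):
    """Return (whales per second, robots per second) for the given lines."""
    whales = robots = 0
    first = last = None
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip lines that do not match the assumed format
        ts, status = int(parts[0]), parts[1]
        if first is None:
            first = ts
        last = ts
        if status == WHALE:
            whales += 1
        elif status == ROBOT:
            robots += 1
    seconds = max(1, last - first) if first is not None else 1
    return whales / seconds, robots / seconds

sample = ["100 200", "100 503", "101 503", "102 500", "104 503", "garbage"]
whale_rate, robot_rate = rates(tail(sample))
if whale_rate > W_THRESHOLD:
    print("Thar be whales! Call in ops.")
```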
Take Action!
Feature “Darkmode”
 • Specific site controls to enable and
   disable computationally heavy or I/O-heavy
   site functions
 • The “Emergency Stop” button
 • Changes logged and reported to all teams
 • Around 60 switches we can throw
 • Static / Read-only mode
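A minimal sketch of such a switchboard follows, assuming an in-process dictionary and invented feature names. The slides do not describe the real storage or reporting mechanism, so `DarkmodeSwitches` and its methods are hypothetical.

```python
import logging

logger = logging.getLogger("darkmode")

class DarkmodeSwitches:
    """Named on/off switches with an audit trail of every change."""

    def __init__(self, features):
        self._enabled = {name: True for name in features}
        self.audit_log = []

    def is_enabled(self, feature):
        return self._enabled[feature]

    def set(self, feature, enabled, who, reason):
        if feature not in self._enabled:
            raise KeyError(feature)
        self._enabled[feature] = enabled
        self.audit_log.append((feature, enabled, who, reason))
        # Stand-in for "changes logged and reported to all teams".
        logger.warning("%s set %s=%s (%s)", who, feature, enabled, reason)

    def emergency_stop(self, who, reason):
        """Turn every switch off at once."""
        for feature in list(self._enabled):
            self.set(feature, False, who, reason)

# Hypothetical feature names.
switches = DarkmodeSwitches(["search", "people_suggestions"])
switches.set("search", False, who="ops", reason="db load")
```

Request handlers would check `is_enabled` before doing the expensive work and fall back to a cheap response when the switch is off.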
Configuration Management
• Start automated configuration management
  EARLY in your company.
• Don’t wait until it’s too late.
• Twitter started within the first few months.
Configuration Management
   • Complex Environment
   • Multiple Admins
   • Unknown Interactions
   • Solution: 2nd set of eyes.
Process through Reviews
Reviewboard (www.review-board.org)

 • SVN pre-commit hook causes a failure if
   the log message doesn’t include
   ‘reviewed’
 • SVN post-commit hook informs people
   what changed via email
 • Watches the entire SVN tree
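A pre-commit hook of this kind could look like the sketch below. Subversion passes the repository path and transaction name to the hook, and `svnlook log -t` prints the pending log message. Matching the word case-insensitively is an assumption, and this is not Twitter's actual hook.

```python
import subprocess
import sys

def message_is_reviewed(message):
    """True if the log message mentions 'reviewed' (case-insensitive: an assumption)."""
    return "reviewed" in message.lower()

def main(argv):
    repos, txn = argv[1], argv[2]
    # Read the log message of the transaction that is about to be committed.
    message = subprocess.run(
        ["svnlook", "log", "-t", txn, repos],
        capture_output=True, text=True, check=True,
    ).stdout
    if not message_is_reviewed(message):
        sys.stderr.write("Commit rejected: log message must include 'reviewed'.\n")
        return 1  # a non-zero exit status makes Subversion abort the commit
    return 0

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```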
Improve Communication
 • Campfire
Subsystems
Many limiting factors in the request pipeline
 • Apache: MPM Model, MaxClients, TCP Listen queue depth
 • Rails (mongrel): 2:1 oversubscribed to cores
 • Memcached: # connections
 • MySQL: # db connections
 • Varnish (search): # threads
Make an attack plan.
 Symptom     Bottleneck   Vector         Solution
 Bandwidth   Network      HTTP Latency   Servers++
 Timeline    Database     Update Delay   Better algorithm
 Search      Database     Delays         DBs++, Code
 Updates     Algorithm    Latency        Algorithms
CPU: More with Less
• 40% reduction in CPU by replacing dual-
  and quad-core machines with 8-core machines
• Switching from AMD to Intel Xeon = 30%
  gain
• Saved data center space, power, and cost per
  month.
• Not the best option if you own machines.
  Capital expenditure = hard to realize new
  technology gains.
Rails
• Stop blaming Rails.
• Analysis found:
 • Caching + Cache invalidation problems
 • Bad queries generated by ActiveRecord,
    resulting in slow queries against the db
 • Queue Latency
 • Memcache / Page Cache Corruption
 • Replication Lag
Disk is the new Tape.
• Social Networking application profile has
  many O(n^y) operations.
• Page requests have to happen in < 500 ms
  or users start to notice. Goal: 250-300 ms
• Web 2.0 isn’t possible without lots of RAM
• What to do?
Caching
  • We’re the real-time web, but lots of caching
    opportunity
  • Most caching strategies rely on long TTLs
    (>60 s)
  • Separate memcache pools for different data
    types to prevent eviction
  • Optimize Ruby Gem to libmemcached +
    FNV Hash instead of Ruby + MD5
  • Twitter now largest contributor to
    libmemcached
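To make the pool and hash bullets concrete, here is a small sketch of FNV hashing used to pick a server inside a per-data-type pool. It assumes the 32-bit FNV-1a variant and simple modulo distribution; the slides do not say which FNV variant or distribution scheme was used, and the pool layout is invented.

```python
FNV_OFFSET_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h

# Hypothetical pools: one set of memcached servers per data type, so a
# flood of one kind of object cannot evict another kind.
POOLS = {
    "timeline": ["tl-cache-1:11211", "tl-cache-2:11211"],
    "user": ["user-cache-1:11211", "user-cache-2:11211", "user-cache-3:11211"],
}

def server_for(data_type: str, key: str) -> str:
    servers = POOLS[data_type]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]

print(server_for("user", "user:42"))
```

FNV needs only one xor and one multiply per byte, which is part of why it is cheaper than MD5 for key distribution.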
Caching
  • 50% decrease in load with native C gem + libmemcached
Cache Money!
• Active Record Plugin
 • Cache when reading from the DB
 • Cache when writing to the DB
• Transparently provides caching
 • Removes need for set/get cache code
 • Open Source!
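The read-through and write-through behavior described above can be sketched as follows. This is a plain Python illustration of the pattern, not Cache Money's Ruby API; the class and method names are hypothetical, and plain dictionaries stand in for the database and memcached.

```python
class WriteThroughStore:
    """Wraps a 'database' so callers never write set/get cache code."""

    def __init__(self, db, cache):
        self.db = db          # stand-in for the database
        self.cache = cache    # stand-in for memcached
        self.db_reads = 0     # counts how often a read missed the cache

    def find(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value   # cache when reading from the DB
        return value

    def save(self, key, value):
        self.db[key] = value
        self.cache[key] = value       # cache when writing to the DB

store = WriteThroughStore(db={"user:1": "alice"}, cache={})
store.find("user:1")   # miss: goes to the DB and fills the cache
store.find("user:1")   # hit: served from the cache
store.save("user:2", "bob")
```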
Caching

 • “Cache Everything!” not the best policy
 • Invalidating caches at the right time is
   difficult.
 • Cold Cache problem
 • Network Memory Bus != Infinite
Memcached
• memcached isn’t perfect.
 • Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for
  important configuration data
  (loss of darkmode flags, for example)
• Data and Hash Corruption (even in 1.2.6)
 • Exposed corruption issue with specific
    inputs causing SEGV and unexpected
    behavior
API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts
  (failover to another backend if Varnish
  dies)
• Remove cache-busting query strings before
  applying the hash algorithm
• Using ESI to cache jQuery requests that
  specify a callback= parameter: big win.
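Stripping cache-busting parameters before hashing might look like the sketch below. The parameter names in `CACHE_BUSTERS` and the example host are assumptions, and sorting the remaining parameters is an extra normalization step that the slide does not mention. In production this logic would live in the Varnish configuration rather than in Python.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list; "_" is the timestamp parameter jQuery appends when
# caching is disabled on a request.
CACHE_BUSTERS = {"_", "nocache"}

def cache_key(url):
    """Normalize a URL so cache-busting parameters do not change the key."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in CACHE_BUSTERS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

key_a = cache_key("http://search.example.com/search.json?q=whale&_=1245")
key_b = cache_key("http://search.example.com/search.json?_=999&q=whale")
print(key_a == key_b)
```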
Relational Databases not a Panacea
• Good for:
 • Users, Relational Data, Transactions
• Bad for:
 • Queues. Polling operations. Caching.
• You don’t need ACID for everything.
• Enter the message queue...
Queues
• Many message queue solutions on the
  market
• At high loads, most perform poorly when
  used in ‘durable’ mode.
• Erlang-based queues work well
  (RabbitMQ), but you need in-house Erlang
  experience.
• We wrote our own.
 • Kestrel to the rescue!
Kestrel
Falco tinnunculus




  • Works like memcache (same protocol)
  • SET = enqueue | GET = dequeue
  • No strict ordering of jobs
  • No shared state between servers
  • Written in Scala.
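The semantics in these bullets can be modeled in a few lines. The sketch below is an in-memory toy, not a Kestrel client: each server is an independent FIFO, the client picks servers at random, and that is why global ordering is not guaranteed. All class names are hypothetical.

```python
import random
from collections import deque

class QueueServer:
    """One server: named FIFO queues, no knowledge of other servers."""

    def __init__(self):
        self.queues = {}

    def set(self, name, item):            # SET = enqueue
        self.queues.setdefault(name, deque()).append(item)

    def get(self, name):                  # GET = dequeue
        q = self.queues.get(name)
        return q.popleft() if q else None

class QueueClient:
    """Spreads work over servers at random; ordering is only per server."""

    def __init__(self, servers, rng=None):
        self.servers = servers
        self.rng = rng or random.Random()

    def enqueue(self, name, item):
        self.rng.choice(self.servers).set(name, item)

    def dequeue(self, name):
        # Simplification: try the servers in random order until one has work.
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get(name)
            if item is not None:
                return item
        return None

client = QueueClient([QueueServer() for _ in range(3)], rng=random.Random(7))
for job_id in range(10):
    client.enqueue("timeline_jobs", job_id)

received = []
while (job := client.dequeue("timeline_jobs")) is not None:
    received.append(job)
```

Every job comes back exactly once, but `received` is generally not in enqueue order.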
Asynchronous Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline should not be used to
  handle 3rd party communications or
  back-end work.
• Daemons, Daemons, Daemons.
Don’t make services dependent
• Move operations out of the synchronous
  request cycle
 • Email
 • Complex object generation (timelines)
 • 3rd party services (bit.ly, sms, etc.)
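Moving work out of the request cycle can be illustrated with a queue and a background worker. In this sketch a thread and the standard library queue stand in for a separate daemon and a message queue; the function names and the "accepted" return value are invented.

```python
import queue
import threading

jobs = queue.Queue()
sent = []   # stand-in for email actually delivered

def handle_request(user, text):
    """Runs in the request cycle: enqueue the slow work and return at once."""
    jobs.put(("email", user, text))
    return "accepted"

def worker():
    """Runs outside the request cycle, like a daemon draining a queue."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut down
            jobs.task_done()
            break
        _kind, user, text = job
        sent.append((user, text))   # the slow third-party call would go here
        jobs.task_done()

daemon = threading.Thread(target=worker, daemon=True)
daemon.start()

status = handle_request("alice", "Your timeline is ready")
jobs.join()        # only for the demo: wait until the worker has caught up
jobs.put(None)
daemon.join()
```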
Daemons
• Many different types at Twitter.
• Number of daemons has to match the workload
 • Early Kestrel would crash if queues filled
• “Seppuku” patch
 • Kill daemons after n requests
• Long-running daemons = low memory
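The "kill after n requests" idea can be sketched as a worker loop that exits once it has handled a fixed number of jobs, on the assumption that a supervisor starts a fresh process afterwards. The function and parameter names are hypothetical; the actual patch is not shown in the slides.

```python
def run_daemon(fetch_job, handle_job, max_requests):
    """Handle at most max_requests jobs, then return so the process can exit."""
    handled = 0
    while handled < max_requests:
        job = fetch_job()
        if job is None:       # nothing left to do
            break
        handle_job(job)
        handled += 1
    return handled

# Demo with an in-memory job source; each call stands for one process lifetime.
pending = iter(range(5))
done = []
first_life = run_daemon(lambda: next(pending, None), done.append, max_requests=3)
second_life = run_daemon(lambda: next(pending, None), done.append, max_requests=3)
```

Restarting on a request budget bounds how much memory any single daemon can accumulate.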
MySQL Challenges
• Replication Delay
 • Single threaded. Slow.
• Social Networking not good for RDBMS
 • N x N relationships and social graph /
    tree traversal
 • Sharding importance
 • Disk issues (FS Choice, noatime,
    scheduling algorithm)
MySQL

• Replication delay and cache eviction
  produce inconsistent results to the end
  user.
• Locks create resource contention for
  popular data
Database Replication
  • Major issues around users and statuses
    tables
  • Multiple functional masters (FRP, FWP)
  • Make sure your code reads and writes to
    the right DBs. Reading from master = slow
    death
    • Monitor the DB. Find slow / poorly
      designed queries
  • Kill long running queries before they kill
    you (mkill)
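The slides do not show how mkill works. As a hedged illustration of the idea only, the sketch below takes rows shaped like the output of MySQL's `SHOW PROCESSLIST` and produces `KILL` statements for queries that have run longer than a threshold. The sample rows and the 30-second limit are invented.

```python
def kill_statements(processlist, max_seconds):
    """Return KILL statements for queries running longer than max_seconds."""
    return [
        f"KILL {row['Id']};"
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]

# Invented rows using SHOW PROCESSLIST column names.
rows = [
    {"Id": 11, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 12, "Command": "Query", "Time": 95, "Info": "SELECT * FROM statuses"},
    {"Id": 13, "Command": "Sleep", "Time": 400, "Info": None},
]

statements = kill_statements(rows, max_seconds=30)
print(statements)
```

Idle connections (`Sleep`) are left alone here; only active queries count against the limit.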
status.twitter.com

• Keep users in the loop, or suffer.
  • No matter how little information you have
    available.
• Hosted on a different service (Tumblr)
Key Points

• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
• Don’t make services dependent
• Process asynchronously when possible
Thanks!
Twitter Open Source (Apache License):

- CacheMoney Gem (Write through Caching)
http://github.com/nkallen/cache-money/tree/master

- Libmemcached
http://tangent.org/552/libmemcached.html

- Kestrel (Memcache-like message queue)
http://github.com/robey/kestrel

- mod_memcache_block (Apache 2.x Limiter/blocker)
http://github.com/netik/mod_memcache_block
