Designing for Scale
    Knut Nesheim @knutin
   Paolo Negri @hungryblank
About this talk

2 developers and Erlang
           vs.
  1 million daily users
Social Games
Flash client (game)   HTTP API
Social Games
Flash client


               • Game actions need to be
                 persisted and validated

               • 1 API call every 2 secs
Social Games
                            HTTP API



• @ 1 000 000 daily users
• 5000 HTTP reqs/sec
• more than 90% writes
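
A back-of-the-envelope check of these figures, as a sketch: if every connected client sends one API call every 2 seconds, 5000 reqs/sec corresponds to roughly 10 000 players online at the same moment. The function name is illustrative, and the assumption that load is spread evenly across clients is ours, not something the slides state.

```python
def concurrent_players(requests_per_sec, seconds_between_calls):
    """Players online at once, assuming each sends one call per interval."""
    return requests_per_sec * seconds_between_calls

peak_players = concurrent_players(5000, 2)
share_of_dau = peak_players / 1_000_000   # fraction of daily users online at peak

print(peak_players)            # 10000
print(f"{share_of_dau:.1%}")   # 1.0%
```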
The hard nut


http://www.flickr.com/photos/mukluk/315409445/
Users we expect
[Chart: “Monster World” daily active users (DAU), July to December 2010; y-axis from 0 to 1 000 000]
Users we have
[Chart: new game daily active users (DAU), March to June 2011; y-axis from 0 to 50]
What to do?


1 Simulate users
Simulating users

• Must not be too synthetic (like
  apachebench)
• Must look like a meaningful game session
• Users must come online at a given rate and
  play
Tsung


• Multi-protocol (HTTP, XMPP) benchmarking tool
• Able to test non-trivial call sequences
• Can actually simulate a scripted gaming session

http://tsung.erlang-projects.org/
Tsung - configuration
Fixed content, with dynamic parameters wrapped in %%...%%:
<request subst="true">
  <http url="http://server.wooga.com/users/%%ts_user_server:get_unique_id%%/resources/column/5/row/14?%%_routing_key%%"
        method="POST" contents='{"parameter1":"value1"}'>
  </http>
</request>
Tsung - configuration
         • Not something you fancy writing
         • We’re in development, calls change and we
              constantly add new calls
         • A session might contain hundreds of
              requests
         • All the calls must refer to a consistent game
              state

Tsung - configuration
• From our Ruby test code
user.resources(:column => 5, :row => 14)

• Same as
<request subst="true">
  <http url="http://server.wooga.com/users/%%ts_user_server:get_unique_id%%/resources/column/5/row/14?%%_routing_key%%"
        method="POST" contents='{"parameter1":"value1"}'>
  </http>
</request>
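
The slides do not show how the test code was turned into Tsung XML. As a rough sketch of the idea only, the snippet below renders one call description into a `<request>` element. The function name `tsung_request`, the placeholder host `http://game.example.com`, and the use of Python instead of the Ruby test code are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

def tsung_request(path, body, host="http://game.example.com"):
    """Render one Tsung <request> element for a POST call with a JSON body."""
    request = ET.Element("request", subst="true")
    ET.SubElement(
        request,
        "http",
        # %%...%% markers are left untouched so Tsung can substitute them at run time
        url=f"{host}/users/%%ts_user_server:get_unique_id%%/{path}?%%_routing_key%%",
        method="POST",
        contents=json.dumps(body),
    )
    return ET.tostring(request, encoding="unicode")

# Hypothetical equivalent of user.resources(:column => 5, :row => 14)
xml_snippet = tsung_request("resources/column/5/row/14", {"parameter1": "value1"})
print(xml_snippet)
```

Generating the XML from one call description per request keeps a session of hundreds of requests in sync with the API as calls change.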
Tsung - configuration

• Session: a group of requests
 • requests
• Arrival phase: sessions arrive in phases with a specific arrival rate
 • duration
 • arrival rate
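
To make the session and arrival phase vocabulary concrete, here is a small sketch that turns a list of phases into session start times. The `ArrivalPhase` class, the evenly spaced arrivals, and the example numbers are simplifying assumptions for illustration; they are not Tsung's implementation.

```python
from dataclasses import dataclass

@dataclass
class ArrivalPhase:
    duration_s: int       # how long the phase lasts, in seconds
    arrival_rate: float   # new sessions started per second

def session_start_times(phases):
    """Return the start time (seconds from t=0) of every session, phase after phase."""
    starts = []
    phase_begin = 0.0
    for phase in phases:
        interval = 1.0 / phase.arrival_rate
        count = int(phase.duration_s * phase.arrival_rate)
        starts.extend(phase_begin + i * interval for i in range(count))
        phase_begin += phase.duration_s
    return starts

# Hypothetical ramp-up: 10 s at 2 users/s, then 10 s at 5 users/s.
starts = session_start_times([ArrivalPhase(10, 2), ArrivalPhase(10, 5)])
print(len(starts))  # 70
```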


Tsung - setup
[Diagram: in the benchmarking cluster a tsung master controls the tsung workers over ssh; the workers send HTTP reqs to the application's app servers]

Tsung

         • Generates ~ 2500 reqs/sec on AWS
              m1.large
         • Flexible but hard to extend
         • Code base rather obscure

What to do?


2 Collect metrics
Tsung-metrics

• Tsung collects measurements and provides reports
• But these measurements include Tsung's own network/CPU congestion
• The Tsung machines aren’t a good vantage point

HAproxy
[Diagram: haproxy sits between the tsung workers and the app servers; all HTTP reqs pass through it]
HAproxy

  “The Reliable, High Performance TCP/
  HTTP Load Balancer”
• Placed in front of http servers
• Load balancing
• Fail over
HAproxy - syslog


• Easy to set up
• Efficient (UDP)
• Provides 5 timings for each request
HAproxy
• Time to receive request from client
• Time spent in HAproxy queue
• Time to connect to the server
• Time to receive response headers from server
• Total session duration time
[Diagram, repeated for each timing: tsung workers, haproxy, app servers]
HAproxy - syslog

• Application URLs directly identify the server call
• Application URLs are easy to parse
• Processing the HAproxy syslog gives per-call metrics
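
A minimal sketch of that processing step, assuming HAProxy's HTTP log format where the five timings appear as one slash-separated field (Tq/Tw/Tc/Tr/Tt). The sample log line, the regular expression, and the `:n` placeholder used to normalize ids in the URL are made up for illustration.

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<Tq>-?\d+)/(?P<Tw>-?\d+)/(?P<Tc>-?\d+)/(?P<Tr>-?\d+)/(?P<Tt>\+?\d+) '
    r'(?P<status>\d{3}) .*"(?P<method>[A-Z]+) (?P<path>\S+)'
)
DIGITS = re.compile(r"\d+")

def parse_line(line):
    """Extract the five timings (ms) and a normalized call name from one log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    record = {name: int(match[name]) for name in ("Tq", "Tw", "Tc", "Tr", "Tt")}
    record["status"] = int(match["status"])
    path = match["path"].split("?")[0]           # drop the query string
    normalized = DIGITS.sub(":n", path)          # /users/981/... -> /users/:n/...
    record["call"] = match["method"] + " " + normalized
    return record

# Hypothetical log line, shaped like HAProxy's HTTP log format.
sample = ('haproxy[1234]: 10.0.0.1:5000 [01/Mar/2011:10:00:00.000] http-in app/srv1 '
          '3/0/1/42/46 200 512 - - ---- 1/1/1/1/0 0/0 '
          '"POST /users/981/resources/column/5/row/14?k HTTP/1.1"')
record = parse_line(sample)
print(record["call"], record["Tr"])
```

Normalizing ids out of the path lets every request to the same server call fall into one group, whichever user made it.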
What to do?


3 Understand metrics
Reading/aggregating metrics

• Python to parse/normalize syslog
• R language to analyze/visualize data
• R language console to interactively explore
  benchmarking results
R is a free software environment for
 statistical computing and graphics.
What you get

• Aggregate performance levels (throughput,
  latency)
• Detailed performance per call type
• Statistical analysis (outliers, trends,
  regression, correlation, frequency, standard
  deviation)
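
The talk did this analysis in R. Purely to illustrate what "detailed performance per call type" can mean, here is a sketch in Python that groups latencies by call and computes a few summary statistics. The call names and numbers are invented sample data.

```python
import statistics
from collections import defaultdict

def summarize(samples):
    """samples: iterable of (call_name, latency_ms). Returns per-call summary stats."""
    by_call = defaultdict(list)
    for call, latency_ms in samples:
        by_call[call].append(latency_ms)
    summary = {}
    for call, values in by_call.items():
        values.sort()
        p95_index = min(len(values) - 1, int(0.95 * len(values)))
        summary[call] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values),   # population standard deviation
            "p95": values[p95_index],
            "max": values[-1],
        }
    return summary

# Invented sample data: (call, latency in ms)
samples = [("resources", 10), ("resources", 12), ("resources", 14),
           ("friends", 100), ("friends", 300)]
summary = summarize(samples)
for call, stats in summary.items():
    print(call, stats)
```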
What you get
What to do?


4 go deeper
Digging into the data

• From HAproxy log analysis one call
  emerged as exceptionally slow
• Using eprof we were able to determine
  that most of the time was spent in a Redis
  query fetching many keys (MGET)
Tracing erldis query
• More than 60% of runtime is spent
  manipulating the socket
• gen_tcp:recv/2 is the culprit
• But why is it called so many times?
Understanding the redis protocol
C: LRANGE mylist 0 2
S: *2
S: $5
S: Hello
S: $5
S: World

The same reply as a single binary:
<<"*2\r\n$5\r\nHello\r\n$5\r\nWorld\r\n">>
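
A small sketch of decoding that reply, which also counts the protocol lines involved. A reply with N values spans 1 + 2N lines, so if a parser asks the socket for each line separately (our reading of the problem, not something the slides state in these words), a large MGET means a lot of socket calls. Splitting on CRLF is a simplification that ignores null values and values containing CRLF.

```python
def parse_multi_bulk(reply):
    """Decode a multi-bulk reply; return (values, number_of_protocol_lines)."""
    lines = reply.split(b"\r\n")[:-1]       # drop the empty piece after the final CRLF
    if not lines or not lines[0].startswith(b"*"):
        raise ValueError("not a multi-bulk reply")
    count = int(lines[0][1:])               # "*2" -> 2 values follow
    values = []
    cursor = 1
    for _ in range(count):
        length = int(lines[cursor][1:])     # "$5" -> next value is 5 bytes long
        value = lines[cursor + 1]
        if len(value) != length:
            raise ValueError("length prefix does not match value")
        values.append(value)
        cursor += 2
    return values, cursor

values, line_count = parse_multi_bulk(b"*2\r\n$5\r\nHello\r\n$5\r\nWorld\r\n")
print(values)       # [b'Hello', b'World']
print(line_count)   # 5
```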
Understanding erldis
• recv_value/2 is used in the protocol parser
  to get the next data to parse
A different approach
• Two ways to use gen_tcp: active or passive
• In passive, use gen_tcp:recv to explicitly ask
  for data, blocking
• In active, gen_tcp will send the controlling
  process a message when there is data
• Hybrid: active once
A different approach

• Are active sockets faster?
• A proof-of-concept showed active sockets
  to be faster
• Change erldis or write a new driver?
A different approach

• Radical change => new driver
• Keep Erldis queuing approach
• Think about error handling from the start
• Use active sockets
A different approach
• Active socket, parse partial replies
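
With an active socket, data arrives in chunks of arbitrary size, so the parser has to cope with a reply that is cut off in the middle. The sketch below shows the idea in Python: buffer what has arrived, return a reply only once it is complete. It is not eredis code; the class name is hypothetical and only multi-bulk replies with non-null values are handled.

```python
class MultiBulkParser:
    """Accumulates socket chunks and returns complete multi-bulk replies."""

    def __init__(self):
        self.buffer = b""

    def feed(self, chunk):
        """Add newly received bytes; return every reply that is now complete."""
        self.buffer += chunk
        replies = []
        while True:
            parsed = self._try_parse(self.buffer)
            if parsed is None:              # reply still incomplete, wait for more data
                return replies
            values, consumed = parsed
            replies.append(values)
            self.buffer = self.buffer[consumed:]

    @staticmethod
    def _read_line(data, pos):
        end = data.find(b"\r\n", pos)
        if end == -1:
            return None
        return data[pos:end], end + 2

    def _try_parse(self, data):
        header = self._read_line(data, 0)
        if header is None:
            return None
        line, pos = header
        values = []
        for _ in range(int(line[1:])):      # "*2" -> two values expected
            size_line = self._read_line(data, pos)
            if size_line is None:
                return None
            size, pos = int(size_line[0][1:]), size_line[1]
            if len(data) < pos + size + 2:  # value or its trailing CRLF not here yet
                return None
            values.append(data[pos:pos + size])
            pos += size + 2
        return values, pos

parser = MultiBulkParser()
first = parser.feed(b"*2\r\n$5\r\nHel")             # partial reply, nothing to return yet
second = parser.feed(b"lo\r\n$5\r\nWorld\r\n")      # rest arrives, reply is complete
print(first, second)
```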
Circuit breaker
• eredis has a simple circuit breaker for when
  Redis is down/unreachable
• eredis returns immediately to clients if
  connection is down
• Reconnecting is done outside request/
  response handling
• Robust handling of errors
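
The behaviour described above can be sketched as follows: requests fail immediately while the connection is down, and reconnecting happens in a separate step that is never part of a request. This is an illustration in Python, not the eredis implementation; every class and function name here is hypothetical.

```python
class ConnectionDown(Exception):
    """Raised immediately when no connection is available."""

class FailFastClient:
    def __init__(self, connect):
        self._connect = connect       # callable returning a connection, or raising OSError
        self._connection = None

    def try_reconnect(self):
        """Meant to run from a background loop, never from request()."""
        try:
            self._connection = self._connect()
        except OSError:
            self._connection = None
        return self._connection is not None

    def request(self, command):
        if self._connection is None:
            raise ConnectionDown("redis unreachable, failing fast")
        try:
            return self._connection.send(command)
        except OSError:
            self._connection = None   # mark as down; reconnection happens elsewhere
            raise ConnectionDown("connection lost") from None

class FakeConnection:
    """Stand-in for a real socket connection, for illustration only."""
    def send(self, command):
        return "PONG"

def unreachable():
    raise OSError("redis is down")

client = FailFastClient(connect=unreachable)
client.try_reconnect()                # fails, client stays in the "down" state
try:
    client.request("PING")
except ConnectionDown as error:
    print("failed fast:", error)

client = FailFastClient(connect=FakeConnection)
client.try_reconnect()                # succeeds
print(client.request("PING"))         # PONG
```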
Benchmarking eredis

• Redis driver critical for our application
• Must perform well
• Must be stable
• How do we test this?
Basho bench

• Basho produces the Riak KV store
• Basho built a tool to test KV servers:
  Basho bench
• We used Basho bench to test eredis
Basho bench
• Create callback module
Basho bench
• Configuration term-file
Basho bench output
eredis is open source




https://github.com/wooga/eredis
What to do?


5 measure internals
Measure internals

The HAproxy point of view is valid, but how do we
measure the internals of our application while
we are live, without the overhead of tracing?
Think Basho bench

• Basho bench can benchmark a redis driver
• Redis is very fast, 100K ops/sec
• Basho bench overhead is acceptable
• The code is very simple
Cherry pick ideas from Basho Bench
• Creates a histogram of timings on the fly,
  reducing the number of data points
• Dumps to disk every N seconds
• Allows statistical tools to work on already
  aggregated data
• Near real-time, from event to stats in N+5
  seconds
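
A rough sketch of the histogram idea: instead of keeping every timing, count how many fall into each fixed-width bucket and hand over the counts periodically. The class name, the 10 ms bucket width, and the sample latencies are assumptions for illustration; this is neither Basho Bench's code nor the authors'.

```python
from collections import Counter

class LatencyHistogram:
    """Counts timings per fixed-width bucket instead of storing every sample."""

    def __init__(self, bucket_ms=10):
        self.bucket_ms = bucket_ms
        self.counts = Counter()

    def record(self, latency_ms):
        # Round down to the lower edge of the bucket, e.g. 18 ms -> bucket 10.
        bucket = int(latency_ms // self.bucket_ms) * self.bucket_ms
        self.counts[bucket] += 1

    def flush(self):
        """Return the aggregated buckets and reset; call this every N seconds."""
        snapshot = dict(sorted(self.counts.items()))
        self.counts.clear()
        return snapshot

histogram = LatencyHistogram(bucket_ms=10)
for latency in (3, 7, 12, 18, 19, 250):
    histogram.record(latency)
snapshot = histogram.flush()
print(snapshot)  # {0: 2, 10: 3, 250: 1}
```

Six samples collapse into three data points here; under real load the reduction is far larger, which is what keeps the measurement overhead low.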
Homegrown stats
• Measures latency from the edges of our
  system (excludes HTTP handling)
• And at interesting points inside the system
• Statistical analysis using R
• Correlate with HAproxy data
• Produces graphs and data specific to our
  application
Homegrown stats
Recap

  Measure:
• From an external point of view (HAproxy)
• At the edge of the system (excluding
  HTTP handling)
• Internals in the single process (eprof)
Recap
  Analyze:
• Aggregated measures
• Statistical properties of measures
 • standard deviation
 • distribution
 • trends
Thanks!

http://www.wooga.com/jobs

knut.nesheim@wooga.com       @knutin
paolo.negri@wooga.com    @hungryblank
