2013-10-18
MONITOR SOME OF THE THINGS
High Performance MySQL, 3rd Edition
Optimization, Backups, Replication, and more
Covers Version 5.5
Baron Schwartz, Peter Zaitsev & Vadim Tkachenko
ME 
• Cofounder of @VividCortex
• Author of High Performance MySQL
• @xaprb on Twitter
• baron@vividcortex.com
• http://www.linkedin.com/in/xaprb
RANT, RECAPPED 
• The sky is falling
• Tools drive processes, and we need better tools designed for methods
• Pay attention to CAPS (Capacity, Availability, Performance, Scalability)
• Monitoring tools need to be a lot smarter
• Measure and monitor “work getting done”
HARD CAPACITY 
• Disk volume
• CPU cycles
• max_connections
• File descriptors, sockets, TCP port numbers, etc.
• %used, absolute quantity available
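As an illustration of the "%used, absolute quantity available" pair, here is a minimal Python sketch. The function name and the sample numbers (180 of 200 connections) are assumptions made up for this example, not values from the talk.

```python
def capacity(used: float, limit: float) -> tuple[float, float]:
    """Return (percent used, absolute quantity still available)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit, limit - used


# Hypothetical sample: 180 connections in use against max_connections = 200.
pct_used, available = capacity(used=180, limit=200)
print(f"{pct_used:.0f}% used, {available:.0f} available")
```

Reporting both numbers matters because 90% of a small limit and 90% of a large one leave very different headroom.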
SOFT CAPACITY 
• Neil Gunther’s Universal Scalability Law
• %used, absolute quantity available
• Throughput, concurrency, errors
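For reference, the Universal Scalability Law models throughput at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)). The sketch below evaluates that formula in Python; the coefficient values are hypothetical and would normally be fitted to measured throughput and concurrency.

```python
import math


def usl_throughput(n: float, lam: float, sigma: float, kappa: float) -> float:
    """Throughput predicted by the Universal Scalability Law at concurrency n.

    lam   = throughput at n = 1
    sigma = contention (serialization) coefficient
    kappa = coherency (crosstalk) coefficient
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def peak_concurrency(sigma: float, kappa: float) -> float:
    """Concurrency at which predicted throughput is highest (requires kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)


# Hypothetical coefficients, for illustration only.
LAM, SIGMA, KAPPA = 1000.0, 0.05, 0.001
n_star = peak_concurrency(SIGMA, KAPPA)
print(f"peak near N = {n_star:.1f}, X = {usl_throughput(n_star, LAM, SIGMA, KAPPA):.0f}/s")
```

Past the peak, adding concurrency lowers throughput, which is why this counts as a soft limit rather than a hard one.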
AVAILABILITY 
• Availability is absence of downtime
• %used, absolute quantity available
• Throughput, concurrency, errors
• MTBF, MTTR, MTTD, %availability
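A rough Python sketch of how these four numbers could be derived from an outage log. The record layout, the sample outages, and the choice to measure MTTR from the start of an outage (rather than from detection) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start: float     # seconds since start of window: outage began
    detected: float  # monitoring noticed it
    end: float       # service restored


def availability_metrics(outages: list[Outage], window_s: float) -> dict[str, float]:
    """Compute MTBF, MTTR, MTTD and availability over one observation window."""
    count = len(outages)
    downtime = sum(o.end - o.start for o in outages)
    uptime = window_s - downtime
    return {
        "mtbf": uptime / count if count else float("inf"),
        "mttr": downtime / count if count else 0.0,
        "mttd": sum(o.detected - o.start for o in outages) / count if count else 0.0,
        "availability": uptime / window_s,
    }


# Hypothetical 30-day window with two outages.
window = 30 * 24 * 3600
log = [Outage(1000, 1300, 2800), Outage(500000, 500060, 500600)]
m = availability_metrics(log, window)
print(f"MTBF {m['mtbf']:.0f}s, MTTR {m['mttr']:.0f}s, "
      f"MTTD {m['mttd']:.0f}s, availability {100 * m['availability']:.3f}%")
```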
TASK PERFORMANCE 
• Task performance is consistently fast response time.
• Measure an SLA in percentile response time per task, over observation intervals
• %used, absolute quantity available
• Throughput, concurrency, errors
• MTBF, MTTR, MTTD, %availability
• Response time, 95% response time
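One way to read "percentile response time per task, over observation intervals" is sketched below in Python. The sample tuple layout, the 60-second interval, and the nearest-rank percentile definition are assumptions chosen for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_task_and_interval(samples, interval_s=60):
    """Group (timestamp, task, response_time) samples and return the p95 per group.

    Keys of the result are (task, interval_index).
    """
    buckets = defaultdict(list)
    for ts, task, response_time in samples:
        buckets[(task, int(ts // interval_s))].append(response_time)
    return {key: percentile(times, 95) for key, times in buckets.items()}


# Hypothetical samples: (seconds since start, task name, response time in seconds).
samples = [
    (0, "login", 0.10),
    (10, "login", 0.20),
    (70, "login", 0.50),
    (20, "search", 1.00),
]
for (task, interval), p95 in sorted(p95_by_task_and_interval(samples).items()):
    print(f"{task} interval {interval}: p95 = {p95:.2f}s")
```

An SLA check would then compare each interval's value against the target for that task.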
RESOURCE PERFORMANCE 
• Resource performance is ability to run tasks consistently fast.
• %used, absolute quantity available
• Throughput, concurrency, busy time, total response time, backlog/queue
• MTBF, MTTR, MTTD, %availability
• Response time, 95% response time
• Throughput, concurrency, errors
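If a resource exposes cumulative counters for completed tasks, busy time, and total response time, the derived metrics fall out of two snapshots. The Python sketch below assumes such counters exist; the `Snapshot` fields and sample values are hypothetical, and average concurrency uses Little's Law (summed response time divided by elapsed time).

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    timestamp: float            # seconds
    completed: int              # cumulative tasks completed
    busy_time: float            # cumulative seconds with at least one task in progress
    total_response_time: float  # cumulative sum of per-task response times, seconds


def derive(before: Snapshot, after: Snapshot) -> dict[str, float]:
    """Derive rate metrics for the interval between two counter snapshots."""
    elapsed = after.timestamp - before.timestamp
    done = after.completed - before.completed
    resp = after.total_response_time - before.total_response_time
    return {
        "throughput": done / elapsed,
        "utilization": (after.busy_time - before.busy_time) / elapsed,
        "concurrency": resp / elapsed,  # Little's Law: average tasks in progress
        "mean_response_time": resp / done if done else 0.0,
    }


# Hypothetical snapshots taken 10 seconds apart.
m = derive(Snapshot(100.0, 5000, 40.0, 90.0), Snapshot(110.0, 6000, 48.0, 115.0))
print(m)
```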
SCALABILITY 
• Universal Scalability Law again
• %used, absolute quantity available
• Throughput, concurrency, errors
• MTBF, MTTR, MTTD, %availability
• Response time, 95% response time
• Throughput, concurrency, busy time, total response time, backlog/queue
STALL DETECTION 
• Overloaded or underperforming?
• %used, absolute quantity available
• Throughput, concurrency, errors
• MTBF, MTTR, MTTD, %availability
• Response time, 95% response time
• Throughput, concurrency, busy time, total response time, backlog/queue
• Utilization, saturation, errors, sources of load/demand
GIT ‘ER DONE 
MONITOR WORK AND RESOURCES
WHAT NOT TO DO 
• Don’t use top-N lists from Google
• Don’t just do what’s included in some Nagios plugin
№1 
TOP 10 LIST 
1. MySQL availability 
2. Presence of insecure users and databases 
3. Aborted connects 
4. Error log 
5. Deadlocks 
6. Change in server configuration 
7. Slow query log 
8. Slave lag 
9. Percentage of maximum allowed connections 
10. Percentage of full table scans
№2 
TOP 10 LIST 
1. Threads_connected 
2. Created_tmp_disk_tables 
3. Handler_read_first 
4. Innodb_buffer_pool_wait_free 
5. Key_reads 
6. Max_used_connections 
7. Open_tables 
8. Select_full_join 
9. Slow_queries 
10. Uptime
№1 
PLUGIN 
1. threadcache-hitrate (Hit rate of the thread-cache) 
2. slave-io-running (Slave io running: Yes) 
3. slave-sql-running (Slave sql running: Yes) 
4. qcache-hitrate (Query cache hitrate) 
5. qcache-lowmem-prunes (Query cache entries pruned because of low memory) 
6. keycache-hitrate (MyISAM key cache hitrate) 
7. bufferpool-hitrate (InnoDB buffer pool hitrate) 
8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available) 
9. log-waits (InnoDB log waits because of a too small log buffer) 
10. tablecache-hitrate (Table cache hitrate) 
11. table-lock-contention (Table lock contention) 
12. index-usage (Usage of indices) 
13. tmp-disk-tables (Percent of temp tables created on disk) 
14. long-running-procs (long running processes)
№2 
PLUGIN 
1. connection-time 
2. uptime 
3. threads-connected 
4. threadcache-hitrate 
5. q[uery]cache-hitrate 
6. q[uery]cache-lowmem-prunes 
7. [myisam-]keycache-hitrate 
8. [innodb-]bufferpool-hitrate 
9. [innodb-]bufferpool-wait-free 
10. [innodb-]log-waits 
11. tablecache-hitrate 
12. table-lock-contention 
13. index-usage 
14. tmp-disk-tables 
15. slow-queries 
16. long-running-procs 
17. slave-lag 
18. slave-io-running 
19. slave-sql-running 
20. sql 
21. open-files 
22. encode 
23. cluster-ndb-running
№3 
PLUGIN
HTTP://WWW.FLICKR.COM/PHOTOS/NASAMARSHALL/5926864640/ 
SURFACE AREA
DUPLICATE SIGNALS 
• Queries
• Com_admin_commands
• Com_assign_to_keycache
• Com_alter_db
• Com_alter_db_upgrade
• Com_alter_event
• Com_alter_function
• Com_alter_procedure
• Com_alter_server
• Com_alter_table
• Com_alter_tablespace
• Com_alter_user
• Com_analyze
• Com_begin
• Com_binlog
• Com_ad_nauseum
DESIRABLE METRICS 
• %used, absolute quantity available
• Throughput, concurrency, errors
• MTBF, MTTR, MTTD, %availability
• Response time, 95% response time
• Throughput, concurrency, busy time, total response time, backlog/queue
• Utilization, saturation, errors, sources of load/demand
Desirable Easy
IRRELEVANT 
EXAMPLE PLEASE?
RESOURCE LIMITS 
• Threads_connected near max_connections?
• %table cache used?
• Open file handles?
• Long-running queries/transactions?
ERRORS 
• Deadlocks?
• Aborted connects?
AVAILABILITY 
• Ability to connect and run a query?
• Uptime is small?
• Replication is running?
PERFORMANCE 
• You can get throughput (Queries) and concurrency (Threads_running) from MySQL
• But in a Nagios check, no context to know whether they’re good or bad
• You generally can’t get response time, busy time, utilization, backlog, etc.
• You can aggregate thread states, thread times, users, databases, query abstracts...
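A small Python sketch of the aggregation idea in the last bullet. It assumes the rows have already been fetched as dictionaries keyed by the `SHOW PROCESSLIST` column names (`User`, `db`, `Time`, `State`); the sample rows and the `summarize_threads` helper are invented for illustration.

```python
from collections import Counter


def summarize_threads(rows):
    """Aggregate process-list rows by state and by user, and find the oldest thread."""
    by_state = Counter(row["State"] or "(none)" for row in rows)
    by_user = Counter(row["User"] for row in rows)
    oldest_s = max((row["Time"] for row in rows), default=0)
    return by_state, by_user, oldest_s


# Hypothetical rows shaped like SHOW PROCESSLIST output.
rows = [
    {"User": "app", "db": "shop", "Time": 12, "State": "Waiting for table metadata lock"},
    {"User": "app", "db": "shop", "Time": 9, "State": "Waiting for table metadata lock"},
    {"User": "report", "db": "shop", "Time": 86400, "State": "Sending data"},
    {"User": "app", "db": "shop", "Time": 0, "State": ""},
]
by_state, by_user, oldest_s = summarize_threads(rows)
print(by_state.most_common(1), by_user.most_common(1), oldest_s)
```

A single check can then alert on a pile-up in one state instead of on each thread separately.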
NAGIOS IS BEST AT LIVING IN THE MOMENT
THOU SHALT NOT 
• Cache hit ratios
    • Thread cache hit ratio
    • Buffer pool cache hit ratio
    • Table cache hit ratio
    • Key cache hit ratio
    • Query cache hit ratio
• Rates of “bad” queries
    • % temp tables on disk
    • % full table scans
    • % slow queries
• Unfixable things
    • Replication delay
WHY NOT? 
• Those are properties of the workload and application
• They are not conditions to alert/warn about
• They are not fixable / actionable in the service
ALERTS ARE BETTER TOGETHER
QUESTION: WHAT IS BETTER?
№1 ALERT!!!!! 
Disk CRIT 100% /dev/sda2
№2 ALERT!!!!! 
Replication CRIT Slave I/O Thread No
№3 ALERT!!!!! 
Replication CRIT Slave SQL Thread No
№4 ALERT!!!!! 
Replication CRIT Seconds_Behind_Master NULL
№5 ALERT!!!!! 
MySQL CRIT oldest transaction: 86400 seconds
- OR -
№1 ALERT!!!!! 
CRIT 
* Disk /dev/sda2 full 
* Replication stopped 
* Oldest transaction 86400 seconds 
* 4999 threads in status “Waiting for table metadata lock”
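The combined alert above could be assembled along these lines. This is only a sketch in Python: the `(status, message)` result format, the severity ordering, and the `combine` helper are assumptions, not part of any particular monitoring tool.

```python
SEVERITY = {"OK": 0, "WARN": 1, "CRIT": 2}


def combine(results):
    """Fold several (status, message) check results into one alert body.

    The first line is the worst status seen; each problem follows as a bullet.
    """
    problems = [(status, msg) for status, msg in results if SEVERITY[status] > 0]
    if not problems:
        return "OK"
    worst = max((status for status, _ in problems), key=SEVERITY.__getitem__)
    return "\n".join([worst] + [f"* {msg}" for _, msg in problems])


# Hypothetical check results mirroring the slide.
results = [
    ("CRIT", "Disk /dev/sda2 full"),
    ("CRIT", "Replication stopped"),
    ("CRIT", "Oldest transaction 86400 seconds"),
    ("OK", "Can connect and run a query"),
]
print(combine(results))
```

One message with every symptom lets the reader see the likely chain (full disk, stalled replication, stuck transaction) at a glance.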
HOLLER AT ME 
QUESTIONS? 
@XAPRB / BARON@VIVIDCORTEX.COM
RESOURCES 
• Chapter 3 of High Performance MySQL, 3rd Edition
• Percona White Papers
    • Causes of Downtime in Production MySQL Servers
    • Preventing MySQL Emergencies
    • Goal-Driven Performance Optimization
    • Forecasting MySQL Scalability with the Universal Scalability Law
• Method R: Optimizing Oracle Performance, Cary Millsap
• The Goal, Eli Goldratt
• The USE Method (Brendan Gregg) & his new book
• Guerrilla Capacity Planning, Neil J. Gunther
• Fundamental Performance & Scalability Instrumentation
