Performance optimization in Linux
Tales from the trenches
Alex Chistyakov, Principal Engineer, Git in Sky
Linux Piter 2015
Who are we?
● A small consulting company based in SPb.
● Web operations
● Automation
● Performance tuning
● Sponsors of local DevOps meetup
Who are you?
● Linux fans?
● Developers?
● Web developers, maybe?
● System architects?
● Performance engineers?
Okay, why Linux?
● Is there anything else?
● According to W3Cook stats, Linux serves 95.8% of public web sites
● And it's on the desktop too!
● (At least on my desktop)
Linux in the perfect world
Linux in the real world
504 on the main page!
● The customer is extremely stressed
● The reaction must be quick and effective
● The obvious plan does not work
● We should be prepared!
The obvious plan
● 1) Change something
● 2) Expect the situation to become better
● 3) Wait anxiously
● 4) ????
● 5) PROFIT!!!
● In fact, this plan is quite popular for some reason (simplicity?)
The proper plan (top secret!)
● 1) Gather metrics (you have them already, don't you?)
● 2) Analyze metrics
● 3) Formulate a hypothesis
● 4) Plan and make a single change
● 5) Repeat until success
● If you were good at physics in high school, this plan should sound extremely familiar
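The measure, hypothesize, change-one-thing loop can be sketched in Python. This is only an illustration: the `tune` function, the `(name, apply, revert)` shape of a change, the "lower is better" metric, and the rule of reverting a change that did not help are all assumptions made for the sketch, not something stated in the talk.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical shape of a change: (name, apply, revert).
# `measure` returns a metric where lower is better, e.g. p95 latency in ms.
Change = Tuple[str, Callable[[], None], Callable[[], None]]


def tune(measure: Callable[[], float], changes: Iterable[Change], target: float) -> List[str]:
    """Apply one change at a time and keep it only if the metric improves."""
    kept: List[str] = []
    baseline = measure()                  # gather metrics before touching anything
    for name, apply, revert in changes:   # exactly one change per iteration
        if baseline <= target:            # success: stop changing things
            break
        apply()
        after = measure()                 # compare against the previous baseline
        if after < baseline:
            kept.append(name)
            baseline = after
        else:
            revert()                      # assumption: undo changes that did not help
    return kept
```

Measuring after every single change is what makes it possible to say which change actually helped.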
How to collect/analyze metrics?
● Brendan Gregg's observability tools diagram
How do we collect/analyze?
● atop (30 sec intervals)
● Ex-graphite stack (Grafana, InfluxDB or OpenTSDB or Cyanite/Cassandra, but NOT Whisper, collectd)
● NewRelic
● pidstat (not iotop)
● perf top and perf record
● sysdig
● iostat -x 1
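For readers who want to see where a disk utilization figure comes from, here is a minimal Python sketch that approximates what the `%util` column of `iostat -x 1` reports. It reads the "time spent doing I/Os" counter (the tenth statistics field of `/proc/diskstats`) twice and divides the difference by the interval. The function names are made up for this example, and it is not how iostat itself is implemented.

```python
import time
from typing import Dict


def parse_io_ticks(diskstats_text: str) -> Dict[str, int]:
    """Map device name -> milliseconds spent doing I/O (field 10 of /proc/diskstats)."""
    ticks: Dict[str, int] = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        # fields[0:3] are major, minor, device name; the 10th counter sits at index 12.
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before: Dict[str, int], after: Dict[str, int], interval_s: float) -> Dict[str, float]:
    """Percentage of the interval during which each device had I/O in flight."""
    interval_ms = interval_s * 1000.0
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample(interval_s: float = 1.0, path: str = "/proc/diskstats") -> Dict[str, float]:
    """Take two snapshots `interval_s` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_io_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())
    return utilization(before, after, interval_s)
```

Calling `sample()` on a Linux box returns one percentage per block device. The parsing and the arithmetic are kept in separate functions so they can be checked against saved snapshots.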
Most common case (a no-brainer)
● CPU time is too high (a lucky customer got an SSD)
● Disk saturation is above 60% (a not-so-lucky customer)
● PHP and, of course, MySQL
● InnoDB buffers are too small
● Synchronous commit is 'on'
● Too many slow queries
● Queries with 'filesort' in the execution plan
How to solve
● Install Anemometer, turn on the slow query log
● Rank queries by their cumulative execution time
● Read and understand execution plans
● Blame developers
● Cry in vain
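Ordering queries by cumulative execution time is the step that decides what to look at first, so here is a small Python sketch of the idea. It does not parse the real MySQL slow log or use Anemometer. It assumes the entries have already been extracted as hypothetical `(sql, seconds)` pairs, and `fingerprint` is a deliberately naive normalizer invented for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fingerprint(sql: str) -> str:
    """Collapse literals so that queries differing only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare integers
    return re.sub(r"\s+", " ", sql).strip().lower()


def rank_by_total_time(entries: Iterable[Tuple[str, float]]) -> List[Tuple[str, float, int]]:
    """Return (fingerprint, total seconds, call count), largest total first."""
    total: Dict[str, float] = defaultdict(float)
    calls: Dict[str, int] = defaultdict(int)
    for sql, seconds in entries:
        key = fingerprint(sql)
        total[key] += seconds
        calls[key] += 1
    return sorted(((k, total[k], calls[k]) for k in total),
                  key=lambda row: row[1], reverse=True)
```

A query that takes 50 ms but runs thousands of times can outrank a single 10-second report query, which is why the total matters more than the worst single run.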
Story time!
● Case #1: measuring the measurer
● It seems that everybody still uses Graphite/Whisper
● Even the big guys like Mail.Ru
● Well, because SAS HDDs are cheap...
● But...
A challenger appears!
● InfluxDB vs. Whisper, July 2015
● The same set of metrics (carbon-relay-ng in the middle)
● And the winner is...
Okay, we hate magic
● Whisper is just a set of RRD-like files on a plain old FS
● 20000 metrics mean 20000 files
● Touching 20000 files every 10 seconds is a major pain
● InfluxDB is a time series database based on an LSM-tree
● InfluxDB is much more write-optimized than 20000 separate files on your ext4/XFS/you-name-it
● But, of course, SAS drives are quite cheap...
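The arithmetic behind the pain is short enough to write down. The sketch below uses the 20000 metrics and the 10-second interval from the slide. The figure of 200 random IOPS per spinning drive is a rough rule of thumb assumed for the example, not a number from the talk, and the model ignores any write coalescing in the page cache.

```python
def file_updates_per_second(metrics: int, interval_s: float) -> float:
    """One Whisper file per metric, each touched once per reporting interval."""
    return metrics / interval_s


def drives_needed(updates_per_s: float, iops_per_drive: float) -> float:
    """Lower bound on spindles if every update costs at least one random I/O."""
    return updates_per_s / iops_per_drive


rate = file_updates_per_second(20000, 10)  # numbers from the slide
print(rate)                                # 2000.0 updates per second

# ASSUMPTION: ~200 random IOPS per spinning drive is a rough rule of thumb;
# substitute a measured value for real hardware.
print(drives_needed(rate, 200))            # 10.0
```

An LSM-tree sidesteps this by turning many small scattered updates into a few large sequential writes.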
In case you are not scared
● Case #2: the site got SUDS (sudden unexpected death syndrome)
● Symptoms: everything slows down to a crawl (sounds familiar)
● NewRelic shows nothing unusual
● Well, if parameters more suitable for a busy site than for a very low-traffic one can be called “nothing unusual”
● And this site is not busy at all
Diagnostic card
● PHP is OK
● MySQL does not sort anything
● Top queries in MySQL sorted by total exec time are all indexed
● Every MySQL query runs very slowly under even moderate load
But how did we solve it?
● Even a modern rig w/decent Xeons and SATA HDDs can be turned into a slug
● It is as simple as disabling AHCI in the BIOS and staying on plain IDE
● Well, this one was not that hard, but it was quite unusual
● Rented servers do not suffer from problems like this because they are configured uniformly
● I can't easily explain how I came upon this solution; pure intuition seems to have been involved
Not scared enough yet?
● Then, case #3: another site got SUDS
Diagnostic card
● NewRelic blames PHP code
● Even the SSH console is slow
● Nothing unusual or unexpected in daily CPU load graphs
● CPU flamegraph shows nothing
What is a «CPU flamegraph»?
How did we solve it
● Analyzed the stats atop recorded during the outage periods
● atop is actually quite smart and colors suspicious values in red or blue
● IRQ % is over 50%
● But what is “IRQ %” anyway?
● Oh, who cares, let's install Munin and get per-interrupt graphs
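As for what an interrupt percentage is: on Linux, the aggregate `cpu` line of `/proc/stat` counts ticks spent in each state, including `irq` (hardware interrupts) and `softirq`. The Python sketch below turns two samples of that line into percentages. The talk does not say exactly which counters atop folds into its figure, so reporting the two values separately is a choice made for this example. The sketch also assumes a kernel that exposes at least the first eight fields.

```python
from typing import Dict, List

FIELDS: List[str] = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Read the aggregate 'cpu' line of /proc/stat into named tick counters."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError("no aggregate 'cpu' line found")


def interrupt_share(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Percentage of CPU time spent in hard and soft interrupts between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {"irq": 0.0, "softirq": 0.0}
    return {
        "irq": 100.0 * delta["irq"] / total,
        "softirq": 100.0 * delta["softirq"] / total,
    }
```

To use it, read `/proc/stat` twice a few seconds apart and pass both parsed samples to `interrupt_share`. Which interrupt is responsible is a separate question that `/proc/interrupts` answers per IRQ line.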
A blast from the past
How did we solve it
● Well, we have not solved it yet
● The graph on the previous slide covers the past two days
● But at least we have a plan!
● https://help.ubuntu.com/community/ReschedulingInterrupts
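Rescheduling an interrupt on Linux comes down to writing a hexadecimal CPU bitmask to `/proc/irq/<N>/smp_affinity`. The Python sketch below shows how such a mask is built. The helper names are invented for this example, it assumes a machine with fewer than 32 CPUs (larger masks use comma-separated groups), and it is not necessarily the plan the author had in mind. A running irqbalance daemon may rewrite the mask afterwards.

```python
from typing import Iterable


def affinity_mask(cpus: Iterable[int]) -> str:
    """Hex bitmask with bit N set for every CPU N allowed to service the IRQ."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers start at 0")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq: int, cpus: Iterable[int]) -> None:
    """Write the mask to /proc/irq/<irq>/smp_affinity (needs root, Linux only)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(affinity_mask(cpus))
```

For example, `affinity_mask([0, 2])` yields `5`, which allows CPUs 0 and 2 to handle the interrupt.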
Summary
● Linux is cool
● Performance engineering is hard
● Don't panic!
Thank you!
● Questions?
● Oh, BTW you can hire us!
● http://gitinsky.com
● alex@gitinsky.com
● Please do not forget to attend our meetups:
● http://meetup.com/Docker-Spb, http://meetup.com/Ansible-Spb, http://meetup.com/DevOps-40