Common Sense Performance Indicators

Nick Gerner
June 24, 2010
Goals
 Common Sense in the Cloud
     same as outside the cloud


1. Tune performance
2. Investigate issues
3. Visualize architecture
Nick Gerner
              www.nickgerner.com
                  @gerner

•   Formerly senior engineer at SEOmoz
•   Linkscape: index of the web for SEO
•   Lead data services
•   Developer
•   Back-end ops guy
SEOmoz
• Seattle-based Startup (~7 engineers)
• SEO Blog and Community
• Toolset and Platform
    OpenSiteExplorer.org
• 300TB/month processing pipeline
• 5 mil req/day API hits
SEOmoz Engineering
• 50 < nodes < 500
• AWS based since 2008
  – EC2 – linux root access to bare VM
  – S3 – networked disk
  – EBS – local disk I/O
  – ELB – load balancing as a service
SEOmoz Architecture: Processing
[Diagram: The Web, Crawlers, Raw Storage, Process, Prepare; labeled "Data Pipeline"]
SEOmoz Architecture: API
[Diagram: S3; three nodes, each running Memcache, App, Lighttpd; ELB; clients: Partners and SEOmoz Apps]
End-to-End Performance Indicators
• Latency
• Conversion Rate
• DNS
• Time to On-load
• Web Object Count
Great
...but not the focus of this talk

Performance Indicators
• System Characteristics: CPU, Mem, Disk, Net
• App Stack: Front-End, Middleware, Caching, Back-end, Database, WS-API
[Diagram: the two groups are linked by arrows labeled "Drives" and "Competes For"]

http://www.flickr.com/photos/dnisbet/3118888630/
/proc
• System stats
• Per-process stats
• It all comes from here
    ...but use tools to see it
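
To make "it all comes from here" concrete, here is a minimal sketch of reading per-process stats straight out of /proc. It assumes the "Key: value" layout of /proc/<pid>/status on Linux; the sample text and the function names are made up for illustration, and real tools read far more than this.

```python
def parse_proc_status(text):
    """Parse the 'Key: value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rss_kib(fields):
    """Resident set size in KiB, or None if there is no VmRSS line."""
    value = fields.get("VmRSS")
    return int(value.split()[0]) if value else None


# Hypothetical sample. On a Linux box, read open("/proc/self/status").read() instead.
SAMPLE_STATUS = (
    "Name:\tlighttpd\n"
    "State:\tS (sleeping)\n"
    "Pid:\t4242\n"
    "VmRSS:\t   10240 kB\n"
)

if __name__ == "__main__":
    info = parse_proc_status(SAMPLE_STATUS)
    print(info["Name"], info["Pid"], rss_kib(info), "KiB resident")
```

The parser is kept separate from the file read so it can be tried on any machine, not only on Linux.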
System Characteristics

      Load Average
          CPU
        Memory
          Disk
        Network
Load Average
• Combines a few things
• Good place to start
• Explains nothing


http://www.flickr.com/photos/maple03/4176389418/
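
A small sketch of why load average is a starting point rather than an explanation: the raw number only means something next to the CPU count, and even then it does not say whether CPU or I/O is behind it. This assumes the /proc/loadavg text format on Linux; the sample line, the 4-CPU figure, and the function names are illustrative.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_per_cpu(load, cpu_count):
    """Load normalized by CPU count; sustained values above 1.0 suggest queuing."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load / cpu_count


# Hypothetical sample. On Linux, read open("/proc/loadavg").read() instead.
SAMPLE_LOADAVG = "3.20 1.10 0.45 2/345 12345\n"

if __name__ == "__main__":
    one, five, fifteen = parse_loadavg(SAMPLE_LOADAVG)
    # Assumed 4 CPUs for the example; os.cpu_count() would give the real figure.
    print("1-minute load per CPU:", load_per_cpu(one, 4))
```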
CPU
• Break out by process
• Break out user vs system
• User, System, I/O wait, Idle


http://www.flickr.com/photos/pacdog/213442876/
Why watch CPU?
•   Who's doing work?
•   Is CPU maxed?
•   Blocked on I/O?
•   Compare to Load Average

http://www.flickr.com/photos/pacdog/213442876/
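
The user / system / I/O wait / idle breakdown can be sketched from two samples of the aggregate "cpu" line in /proc/stat. This assumes the Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the two sample lines and the function names are made up for illustration.

```python
# Field order of the aggregate "cpu" line in /proc/stat (values are clock ticks).
# The trailing guest counters are skipped because they are already part of user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, (int(v) for v in parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Share of time spent in each state between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples taken a few seconds apart.
SAMPLE_BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SAMPLE_AFTER = "cpu  150 0 70 820 60 0 0 0 0 0\ncpu0 150 0 70 820 60 0 0 0 0 0\n"

if __name__ == "__main__":
    pct = cpu_percentages(parse_cpu_line(SAMPLE_BEFORE), parse_cpu_line(SAMPLE_AFTER))
    for name in ("user", "system", "iowait", "idle"):
        print(name, round(pct[name], 1))
```

A high iowait share alongside a high load average points at disk rather than CPU, which is the comparison the slide asks for.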
Memory
• Break out by Process
• Free, cached, used



http://www.flickr.com/photos/williamhook/3118248600/
Why watch memory?
• Cached + Free = Available
• Do you have spare memory?
  – App uses
  – Memcache
  – DB cache

http://www.flickr.com/photos/williamhook/3118248600/
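
The "Cached + Free = Available" arithmetic can be sketched against /proc/meminfo. This follows the slide's rule of thumb and deliberately ignores Buffers and other reclaimable memory; newer Linux kernels also report a MemAvailable line that is a better estimate. The sample values and function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {name: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(meminfo):
    """Apply the rule of thumb: available = free + cached, used = total - available."""
    free = meminfo["MemFree"]
    cached = meminfo["Cached"]
    available = free + cached
    return {
        "free_kb": free,
        "cached_kb": cached,
        "available_kb": available,
        "used_kb": meminfo["MemTotal"] - available,
    }


# Hypothetical sample. On Linux, read open("/proc/meminfo").read() instead.
SAMPLE_MEMINFO = (
    "MemTotal:        8000000 kB\n"
    "MemFree:         1000000 kB\n"
    "Buffers:          200000 kB\n"
    "Cached:          3000000 kB\n"
)

if __name__ == "__main__":
    print(memory_summary(parse_meminfo(SAMPLE_MEMINFO)))
```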
Disk
• Read bytes/sec
• Write bytes/sec
• Disk utilization


http://www.flickr.com/photos/robfon/2174992215/
Why watch disk?

• Is disk busy?
• When?
• Who's using it?

http://www.flickr.com/photos/robfon/2174992215/
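
Read bytes/sec, write bytes/sec and utilization can all be derived from two samples of /proc/diskstats. The sketch below assumes the Linux layout (sectors read in the 6th column, sectors written in the 10th, milliseconds spent doing I/O in the 13th, 512-byte sectors); the device name, sample lines and function names are made up for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats(text):
    """Return {device: counters} from /proc/diskstats text."""
    disks = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # skip blank or unexpected lines
        disks[parts[2]] = {
            "sectors_read": int(parts[5]),
            "sectors_written": int(parts[9]),
            "io_ms": int(parts[12]),  # total time the device was busy
        }
    return disks


def disk_rates(before, after, interval_s):
    """Throughput and utilization for one device between two samples."""
    read = (after["sectors_read"] - before["sectors_read"]) * SECTOR_BYTES
    written = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES
    busy_ms = after["io_ms"] - before["io_ms"]
    return {
        "read_bytes_per_s": read / interval_s,
        "write_bytes_per_s": written / interval_s,
        "util_pct": 100.0 * busy_ms / (interval_s * 1000.0),
    }


# Hypothetical samples for one device, taken 10 seconds apart.
SAMPLE_T0 = "   8       0 sda 1000 0 20000 500 2000 0 40000 900 0 1200 1400\n"
SAMPLE_T1 = "   8       0 sda 1100 0 22000 550 2300 0 50000 1000 0 6200 1700\n"

if __name__ == "__main__":
    t0 = parse_diskstats(SAMPLE_T0)["sda"]
    t1 = parse_diskstats(SAMPLE_T1)["sda"]
    print(disk_rates(t0, t1, interval_s=10.0))
```

This answers "is disk busy?" and "when?"; "who's using it?" needs per-process data, which is what iotop reports.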
Network
• Read bytes/sec
• Write bytes/sec
• Established connections


http://www.flickr.com/photos/ahkitj/20853609/
Why watch network?
• Max connections
      (~1024 is magic)
• Bandwidth is $$$
• When are you busy?
• SOA considerations

http://www.flickr.com/photos/ahkitj/20853609/
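
Established connections can be counted from /proc/net/tcp, which is where netstat gets them. A minimal sketch, assuming the Linux format (hex "address:port" columns, state in the fourth column, 01 meaning ESTABLISHED) and IPv4 only, since IPv6 sockets live in /proc/net/tcp6. The sample rows are shortened and the function names are made up for illustration.

```python
from collections import Counter

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def established_rows(proc_net_tcp_text):
    """Yield the split columns of every ESTABLISHED row, skipping the header."""
    for line in proc_net_tcp_text.splitlines()[1:]:
        parts = line.split()
        if len(parts) > 3 and parts[3] == TCP_ESTABLISHED:
            yield parts


def count_established(proc_net_tcp_text):
    """Total number of established TCP connections."""
    return sum(1 for _ in established_rows(proc_net_tcp_text))


def established_by_local_port(proc_net_tcp_text):
    """Established connections grouped by local port (ports are hex in the file)."""
    ports = Counter()
    for parts in established_rows(proc_net_tcp_text):
        ports[int(parts[1].rsplit(":", 1)[1], 16)] += 1
    return dict(ports)


# Hypothetical, shortened sample: one listener on port 80, two established
# connections to port 80, and one established connection on port 11211.
SAMPLE_TCP = (
    "  sl  local_address rem_address   st tx_queue rx_queue\n"
    "   0: 00000000:0050 00000000:0000 0A 00000000:00000000\n"
    "   1: 0100007F:0050 0200007F:C001 01 00000000:00000000\n"
    "   2: 0100007F:0050 0300007F:C002 01 00000000:00000000\n"
    "   3: 0100007F:2BCB 0100007F:C003 01 00000000:00000000\n"
)

if __name__ == "__main__":
    print(count_established(SAMPLE_TCP), established_by_local_port(SAMPLE_TCP))
```

Grouping by port gives a rough per-service view, which helps when several services on a node talk to each other.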
Perf Monitoring Solution
FREE, in Apt

  1. data collection (collectd)
  2. data storage (rrdtool)
  3. dashboard management (drraw)
Perf Monitoring Architecture
• Multiple clusters
• Multiple applications
• Nodes come up and go down
[Diagram: two clusters]
Perf Monitoring Architecture
• collectd agents
• New nodes get generic config
• Node names follow convention according to role
[Diagram: two clusters]
Perf Monitoring Architecture
• On its own server:
  – collectd server
  – Web server
  – drraw.cgi
• Allows connections from new nodes
• Perf data backed up daily
[Diagram: Perf Monitoring Server and two clusters]
Perf Monitoring Architecture
• Happy Sysadmin
• Visibility into system
• History of performance
[Diagram: Perf Monitoring Server and two clusters]
Perf Dashboard Features

1. Summarize nodes/systems
2. Visualize data over time
3. Stack measurements
  – Per-process
  – Per-node
4. Handle new nodes
Batch Mode Dashboard
[Screenshots: CPU, Memory, Disk, Network]
Web Server Dashboard
[Screenshots: Web Requests, mod_status]
System-Wide Dashboard
[Screenshot: Per-request]
Graph Summary
•   cpu, mem, disk, net
•   over time
•   per node
•   per process
•   Throw in relevant app measures
      e.g. per request stats:
       • req/sec
       • median latency/req
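
The two per-request measures named above can be sketched as follows. The input shape, a list of (timestamp, latency in ms) pairs collected over a known window, is an assumption for illustration; in practice the numbers would come from an access log or an app counter, and the function name is made up.

```python
from statistics import median


def request_stats(requests, window_s):
    """Requests/sec and median latency for (timestamp_s, latency_ms) pairs."""
    if not requests or window_s <= 0:
        return {"req_per_sec": 0.0, "median_latency_ms": None}
    latencies = [latency_ms for _, latency_ms in requests]
    return {
        "req_per_sec": len(requests) / window_s,
        "median_latency_ms": median(latencies),
    }


# Hypothetical sample: six requests observed in a two-second window.
SAMPLE_REQUESTS = [
    (0.1, 12), (0.4, 15), (0.9, 9),
    (1.2, 40), (1.5, 11), (1.8, 13),
]

if __name__ == "__main__":
    print(request_stats(SAMPLE_REQUESTS, window_s=2.0))
```

The median is used rather than the mean so that one slow request (the 40 ms one here) does not dominate the graph.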
Ad-hoc Tools
• $ dstat -cdnml
    system characteristics
• $ iotop
    per-process disk I/O
• $ iostat -x 3
    detailed disk stats
• $ netstat -tnp
    fast, per-process TCP connection stats
Resources
• Perf Testing: What, How, Why
      http://www.nickgerner.com/2010/02/performance-testing-what-andhow-why/

• Perf Testing Case Study: OSE
      http://www.nickgerner.com/2010/01/performance-testing-case-study-ose/

• S3 Benchmarks
      http://twopieceset.blogspot.com/2009/06/s3-performance-benchmarks.html

• Perf Measurement
      http://twopieceset.blogspot.com/2009/03/performance-measurement-for-small-and.html
More Resources
•   http://www.collectd.org
•   http://oss.oetiker.ch/rrdtool/
•   http://web.taranis.org/drraw/
•   http://dag.wieers.com/home-made/dstat/

• $ man proc
Q: Why? A: Perf Tuning
[Cycle diagram: Test, Measure, Interpret, Improve, Validate]
Q: Why? A: System Arch
• Better Devs/Ops
• Identify Bottlenecks
• Scaling Considerations
Q: Why? A: Issue Investigation
•   Machine Specific?
•   System Wide?
•   Which Component?
•   Timeline?
•   Cascading Failures?
