Common Sense Performance Indicators in the Cloud

Nick Gerner
June 24, 2010

Goals
Common Sense in the Cloud
same as outside the cloud

1. Tune performance
2. Investigate issues
3. Visualize architecture

Nick Gerner
www.nickgerner.com
@gerner

• Formerly senior engineer at SEOmoz
• Linkscape: index of the web for SEO
• Lead data services
• Developer
• Back-end ops guy

SEOmoz
• Seattle-based Startup (~7 engineers)
• SEO Blog and Community
• Toolset and Platform
OpenSiteExplorer.org
• 300TB/month processing pipeline
• 5 mil req/day API hits

SEOmoz Engineering
• 50 < nodes < 500
• AWS based since 2008
– EC2 – linux root access to bare VM
– S3 – networked disk
– EBS – local disk I/O
– ELB – load balancing as a service

SEOmoz Architecture: Processing
[Diagram: data pipeline. Labels: The Raw Web, Crawlers, Storage, Process, Prepare]

SEOmoz Architecture: API
[Diagram: API serving. Labels: S3; three Memcache / App / Lighttpd stacks; ELB; clients: Partners, SEOmoz Apps]

End-to-End Performance Indicators
• Latency
• Conversion Rate
• DNS
• Time to On-load
• Web Object Count

Great
...but not the focus of this talk

System Characteristics and App Stack
[Diagram: System Characteristics (CPU, Mem, Net, Disk) and App Stack (Front-End, Middleware, Caching, Back-end, Database, WS-API), joined by arrows labeled "Drives" and "Competes For"]

http://www.flickr.com/photos/dnisbet/3118888630/

/proc
• System stats
• Per-process stats
• It all comes from here
...but use tools to see it
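
As a rough illustration of "it all comes from here", the sketch below parses two /proc files directly: /proc/loadavg for a system stat and /proc/<pid>/status for a per-process stat. The function names are made up for this example, and the file reads only happen on a Linux host where /proc exists.

```python
from pathlib import Path


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_status_rss_kb(text):
    """Return resident memory (VmRSS, in kB) from /proc/<pid>/status text."""
    for line in text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


if __name__ == "__main__":
    proc = Path("/proc")
    if (proc / "loadavg").exists():  # only meaningful on Linux
        print("load averages:", parse_loadavg((proc / "loadavg").read_text()))
        status = (proc / "self" / "status").read_text()
        print("this process RSS (kB):", parse_status_rss_kb(status))
```

In practice the tools listed later (dstat, iotop, collectd) read these same files, so a script like this is mostly useful for understanding what the tools report.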

System Characteristics

Load Average
CPU
Memory
Disk
Network

Load Average
• Combines a few things
• Good place to start
• Explains nothing
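
One way to read the load average as a starting point is to compare it with the number of cores. On Linux the figure counts runnable tasks plus tasks in uninterruptible sleep (usually waiting on I/O), which is why it "combines a few things". The sketch below is only a first-pass check; the threshold of 1.0 per core is an assumption chosen for illustration, not a rule from the talk.

```python
import os


def load_per_core(load1, cores):
    """Normalize the 1-minute load average by the number of cores."""
    return load1 / max(cores, 1)


def first_look(load1, cores, threshold=1.0):
    """Say whether the load is worth a closer look (threshold is an assumption)."""
    if load_per_core(load1, cores) > threshold:
        return "investigate: check CPU, disk and network next"
    return "probably fine"


if __name__ == "__main__":
    try:
        load1 = os.getloadavg()[0]  # available on Unix-like systems
    except (AttributeError, OSError):
        load1 = None
    if load1 is not None:
        print(first_look(load1, os.cpu_count() or 1))
```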

http://www.flickr.com/photos/maple03/4176389418/

CPU
• Break out by process
• Break out user vs system
• User, System, I/O wait, Idle

http://www.flickr.com/photos/pacdog/213442876/

Why watch it?
• Who's doing work
• Is CPU maxed?
• Blocked on I/O?
• Compare to Load Average
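
The user / system / I/O wait / idle breakdown can be sketched from two samples of the aggregate "cpu" line in /proc/stat. The counters there are cumulative, so the percentages come from the difference between samples. Function names and the one-second sampling interval are assumptions for this example.

```python
import time
from pathlib import Path

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return cumulative ticks per CPU state from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Turn two cumulative samples into percentages of elapsed CPU time."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    stat = Path("/proc/stat")
    if stat.exists():  # Linux only
        first = parse_cpu_line(stat.read_text())
        time.sleep(1)
        second = parse_cpu_line(stat.read_text())
        for name, pct in cpu_percentages(first, second).items():
            print(f"{name:8s} {pct:5.1f}%")
```

A high iowait share with a low user share suggests the node is blocked on I/O rather than short of CPU, which is the comparison against load average that the slide points to.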

Memory
• Break out by Process
• Free, cached, used

http://www.flickr.com/photos/williamhook/3118248600/

Why watch it?
• Cached + Free = Available
• Do you have spare memory?
– App uses
– Memcache
– DB cache
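
The "Cached + Free = Available" rule of thumb can be sketched against /proc/meminfo as follows. The function names are hypothetical. Newer kernels also report a MemAvailable line with the kernel's own estimate, which may differ from this simple sum.

```python
from pathlib import Path


def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def spare_memory_kb(info):
    """Apply the slide's rule of thumb: cached + free = available."""
    return info["MemFree"] + info["Cached"]


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        info = parse_meminfo(meminfo.read_text())
        print("spare memory (kB):", spare_memory_kb(info))
```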


Disk
• Read bytes/sec
• Write bytes/sec
• Disk utilization

http://www.flickr.com/photos/robfon/2174992215/

Why watch it?

• Is disk busy?
• When?
• Who's using it?
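
Read bytes/sec, write bytes/sec and utilization can all be derived from two samples of /proc/diskstats. The sketch below assumes the standard field layout (sectors read, sectors written, milliseconds spent doing I/O) and 512-byte sectors. Function names and the sampling interval are assumptions for this example.

```python
import time
from pathlib import Path

SECTOR_BYTES = 512  # /proc/diskstats reports 512-byte sectors


def parse_diskstats(text):
    """Map device name -> cumulative counters from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed rows
        stats[f[2]] = {
            "sectors_read": int(f[5]),
            "sectors_written": int(f[9]),
            "io_ms": int(f[12]),  # time the device had I/O in flight
        }
    return stats


def disk_rates(before, after, seconds):
    """Compute per-second rates and utilization between two samples of one device."""
    read_bps = (after["sectors_read"] - before["sectors_read"]) * SECTOR_BYTES / seconds
    write_bps = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES / seconds
    util_pct = 100.0 * (after["io_ms"] - before["io_ms"]) / (seconds * 1000.0)
    return {"read_bps": read_bps, "write_bps": write_bps, "util_pct": util_pct}


if __name__ == "__main__":
    path = Path("/proc/diskstats")
    if path.exists():  # Linux only
        first = parse_diskstats(path.read_text())
        time.sleep(1)
        second = parse_diskstats(path.read_text())
        for dev in first:
            if dev in second:
                print(dev, disk_rates(first[dev], second[dev], 1.0))
```

This answers "is disk busy?" and "when?" but not "who's using it?"; per-process disk I/O needs a tool such as iotop.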


Network
• Read bytes/sec
• Write bytes/sec
• Established connections

http://www.flickr.com/photos/ahkitj/20853609/

Why watch it?
• Max connections
(~1024 is magic)
• Bandwidth is $$$
• When are you busy?
• SOA considerations
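
The network measures above can be sketched from two /proc files: /proc/net/tcp lists IPv4 TCP sockets with a hex state code (01 is ESTABLISHED), and /proc/net/dev holds cumulative byte counters per interface. Function names are hypothetical; IPv6 sockets live in /proc/net/tcp6 and are not counted here.

```python
from pathlib import Path

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def count_established(proc_net_tcp_text):
    """Count sockets in the ESTABLISHED state in /proc/net/tcp text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


def interface_bytes(proc_net_dev_text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev text."""
    totals = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip two header rows
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) >= 9:
            totals[name.strip()] = (int(fields[0]), int(fields[8]))
    return totals


if __name__ == "__main__":
    tcp, dev = Path("/proc/net/tcp"), Path("/proc/net/dev")
    if tcp.exists() and dev.exists():  # Linux only
        print("established connections:", count_established(tcp.read_text()))
        print("cumulative bytes (rx, tx):", interface_bytes(dev.read_text()))
```

The byte counters are cumulative, so bytes/sec comes from sampling twice and dividing the difference by the interval, as with the CPU and disk counters.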

Perf Monitoring Solution
FREE, in Apt

1. data collection (collectd)
2. data storage (rrdtool)
3. dashboard management (drraw)

Perf Monitoring Architecture
• Multiple clusters
• Multiple applications
• Nodes come up and go down


• collectd agents on the cluster nodes
• New nodes get a generic config
• Node names follow a convention according to role
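
A role-based naming convention is what lets a dashboard pick up new nodes without per-node setup. The sketch below groups node names by role. The "role-NN" pattern (for example "api-03") is purely hypothetical, since the talk does not state the actual convention.

```python
import re
from collections import defaultdict

# Hypothetical convention: lowercase role, a dash, then a numeric index.
NODE_NAME = re.compile(r"^(?P<role>[a-z]+)-(?P<index>\d+)$")


def group_by_role(node_names):
    """Group node names by the role encoded in each name."""
    groups = defaultdict(list)
    for name in node_names:
        match = NODE_NAME.match(name)
        role = match.group("role") if match else "unknown"
        groups[role].append(name)
    return {role: sorted(names) for role, names in groups.items()}


if __name__ == "__main__":
    nodes = ["api-02", "api-01", "crawler-17", "scratchbox"]
    for role, names in sorted(group_by_role(nodes).items()):
        print(role, names)
```

With a grouping like this, a graph for the "api" role includes a newly launched "api-03" as soon as it starts reporting.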


Perf Monitoring Server
• Runs on its own server
• collectd server
• Web server with drraw.cgi
• Allows connections from new nodes
• Perf data backed up daily

Happy Sysadmin
• Visibility into system
• History of performance

Perf Dashboard Features

1. Summarize nodes/systems
2. Visualize data over time
3. Stack measurements
– Per-process
– Per-node
4. Handle new nodes

Graph Summary
• cpu, mem, disk, net
• over time
• per node
• per process
• Throw in relevant app measures
e.g. per request stats:
• req/sec
• median latency/req

Ad-hoc Tools
• $ dstat -cdnml
system characteristics
• $ iotop
per-process disk I/O
• $ iostat -x 3
detailed disk stats
• $ netstat -tnp
fast, per-process TCP connection stats

Resources
• Perf Testing: What, How, Why
http://www.nickgerner.com/2010/02/performance-testing-what-andhow-why/

• Perf Testing Case Study: OSE
http://www.nickgerner.com/2010/01/performance-testing-case-study-ose/

• S3 Benchmarks
http://twopieceset.blogspot.com/2009/06/s3-performance-benchmarks.html

• Perf Measurement
http://twopieceset.blogspot.com/2009/03/performance-measurement-for-small-and.html

More Resources
• http://www.collectd.org
• http://oss.oetiker.ch/rrdtool/
• http://web.taranis.org/drraw/
• http://dag.wieers.com/home-made/dstat/
• $ man proc

Q: Why? A: Perf Tuning
Cycle: Test, Measure, Interpret, Improve, Validate, then Test again

Q: Why? A: System Arch
• Better Devs/Ops
• Identify Bottlenecks
• Scaling Considerations

Q: Why? A: Issue Investigation
• Machine Specific?
• System Wide?
• Which Component?
• Timeline?
• Cascading Failures?
