Proactive performance monitoring with adaptive thresholds
Proactive Performance Monitoring with Adaptive Thresholds
John Beresniewicz 
Consulting Member of Technical Staff 
Oracle USA
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda 
• Performance Monitoring 
• Understanding Metrics 
• Baselines and Adaptive Thresholds 
• Enterprise Manager Use Cases
Performance Monitoring
A brief history 
• Availability monitoring 
• Simple Boolean (up/down) using ping 
• Notification frameworks constructed 
• Performance monitoring 
• Fixed thresholds over system-level counters (V$SYSSTAT) 
• Use existing frameworks 
• Vendor metric madness 
• More metrics must be better 
• User complaints are still the primary alerting mechanism
Performance alerting is difficult 
• Performance is subjective and variable 
• Better or worse, not best or worst 
• Applications vary in performance characteristics 
• Workloads vary predictably within system 
• Many metrics, few good signals 
• DB Time metrics far superior to counter-based ones 
• Metrics lack semantic framework 
• Do alerts point at symptoms, causes, both? 
• Setting thresholds manually is labor intensive 
• The M x N problem (M targets and N metrics)
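• e.g. 50 targets × 30 metrics = 1,500 individual thresholds to set and maintain by hand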
Understanding Metrics
Classifying metrics 
• Identify a set of basic metrics 
• PERFORMANCE: Time-based metrics 
• KING KONG: Average Active Sessions 
• Response time per Txn, Response time per call 
• WORKLOAD TYPE 
• What kind of work is system doing? 
• Typically the “per txn” metrics 
• WORKLOAD VOLUME 
• How much demand is being placed on system? 
• Typically the “per sec” metrics 
• Triage performance effects by correlating with causes
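As an illustrative sketch of triage-by-correlation (in Python, not Oracle's implementation; the metric values are invented), one can correlate the performance metric against candidate workload-volume metrics to see which demand source tracks it:

# Sketch only: correlate a performance metric (average active sessions)
# against candidate workload-volume metrics to see which demand source
# best explains a performance change. All values here are invented.
import numpy as np

aas           = np.array([2.1, 2.3, 5.9, 6.2, 2.2, 2.0, 6.5, 2.4])  # avg active sessions
calls_per_sec = np.array([310, 320, 890, 920, 315, 300, 950, 330])
txns_per_sec  = np.array([ 40,  42,  41,  44,  39,  41,  43,  40])

for name, series in [("calls/sec", calls_per_sec), ("txn/sec", txns_per_sec)]:
    r = np.corrcoef(aas, series)[0, 1]
    print(f"corr(AAS, {name}) = {r:+.2f}")
# High correlation with calls/sec but not txn/sec suggests the slowdown
# tracks call volume rather than transaction mix.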
Demand varies predictably 
Autocorrelation of calls per second for an email system
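The chart itself is not reproduced here; as a sketch of the idea (synthetic data, not the email system's), the autocorrelation of an hourly demand series peaks at lags of one day and one week when demand is periodic:

# Sketch: autocorrelation of an hourly demand series with a daily cycle.
# Synthetic data; peaks near lag 24 (hours) and 168 (one week) reveal
# the periodicity that fixed thresholds cannot accommodate.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                      # two weeks of hourly samples
demand = 100 + 40 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)

def autocorr(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 6, 12, 24, 48, 168):
    print(f"lag {lag:3d}h: r = {autocorr(demand, lag):+.2f}")
# r approaches +1 at lags 24 and 168 and is strongly negative at 12h.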
Executions per second over a week 
• Weekdays show clear hour-of-day pattern
• Weekends different 
• What threshold to set?
Average active sessions 
Scotty, I think we have a problem
Outliers or events?
In a stable system, metrics should be statistically stable, and rare observations may signal events.
Are these significant?
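By definition, values above the 99.9th percentile of several weeks of history should occur less than 0.1% of the time, so a cluster of such extremes is unlikely to be noise and probably signals a real event.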
Baselines and Adaptive Thresholds
Operational requirements 
• Set alert thresholds automatically 
• Determine thresholds relative to baseline behavior 
• Adjust thresholds for expected workload changes 
• Adapt thresholds to system evolution
AWR Baselines 
• Captured AWR snapshots representing expected performance under common workload
• Capture can be pre-configured using templates 
• SYSTEM_MOVING_WINDOW 
• Trailing N days of data 
• Compare performance against recent history 
• N is settable in days; 3 or 5 whole weeks (21 or 35 days) are good settings (see the example below)
• Out-of-box baseline in RDBMS 11g
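• For example, a 21-day window with Day-of-Week time grouping gives each day-of-week group exactly three days of history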
Time-grouping 
• Captures workload periodicity by grouping data into common diurnal time buckets
• Daily periodicity 
• All hours, Day-Night, Hour-of-Day 
• Weekly periodicity 
• All days, Weekday-Weekend, Day-of-Week 
• Time-grouping combines daily and weekly periodicities
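A minimal sketch of the bucketing idea (the grouping names follow the slide; the function itself is illustrative, not Oracle's code):

# Sketch: map a timestamp to a (weekly, daily) time-group bucket using
# a Weekday-Weekend x Hour-of-Day grouping, per the slide.
from datetime import datetime

def time_group(ts: datetime) -> tuple:
    weekly = "WEEKEND" if ts.weekday() >= 5 else "WEEKDAY"  # Sat/Sun vs Mon-Fri
    daily = ts.hour                                         # Hour-of-Day bucket
    return (weekly, daily)

print(time_group(datetime(2008, 3, 3, 14)))  # ('WEEKDAY', 14) - Monday 2pm
print(time_group(datetime(2008, 3, 8, 2)))   # ('WEEKEND', 2)  - Saturday 2am

Metric observations falling into the same bucket are then pooled when computing baseline statistics, so a Monday 2pm value is judged against weekday-afternoon history rather than against weekend lows.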
Metric statistics 
• Basic metrics only 
• Computed over SYSTEM_MOVING_WINDOW 
• Standard stats: MIN, MAX, AVG, STDDEV 
• Percentiles: 
• Measured: 25, 50 (median), 75, 90, 95, 99 
• Estimated: 99.9, 99.99 
• Computed over time-groups 
• Automatically determined in 11g 
• Computed weekly 
• Scheduler job runs Saturday at midnight
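As a sketch, the per-group statistics might be computed as below; the measured percentiles follow the slide, while the normal-tail extrapolation used for the 99.9/99.99 estimates is purely an assumption (the slide does not specify the estimator):

# Sketch: compute the slide's statistics for one metric within one time group.
# The "estimated" 99.9/99.99 percentiles use a simple normal-tail
# extrapolation as a stand-in for whatever estimator is actually used.
import numpy as np

values = np.random.default_rng(1).lognormal(mean=1.0, sigma=0.4, size=500)

stats = {"MIN": values.min(), "MAX": values.max(),
         "AVG": values.mean(), "STDDEV": values.std(ddof=1)}
for p in (25, 50, 75, 90, 95, 99):
    stats[f"P{p}"] = np.percentile(values, p)
for p, z in ((99.9, 3.09), (99.99, 3.72)):   # one-sided normal quantiles
    stats[f"P{p} est"] = stats["AVG"] + z * stats["STDDEV"]

for name, v in stats.items():
    print(f"{name:10s} {v:8.2f}")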
Time-grouped statistics
Adaptive alert thresholds 
• Percent of maximum thresholds 
• User-input multiplier over the time-group maximum
• Good for detecting load peaks 
• Significance level thresholds 
• Signal on unusual metric values 
• HIGH (95 pctile) 
• VERY HIGH (99 pctile) 
• SEVERE (99.9 pctile) 
• EXTREME (99.99 pctile) 
• Computed and set automatically 
• Thresholds can reset every hour (MMON task)
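A sketch of how the two threshold types could be evaluated against a new observation (the mechanics follow the slide; the statistics and numbers are illustrative):

# Sketch: evaluate one new metric value against adaptive thresholds for its
# time group. Percentile values would come from the baseline statistics.
group_stats = {"MAX": 12.0, "P95": 6.0, "P99": 8.5, "P99.9": 10.0, "P99.99": 11.5}

def check(value, pct_of_max=1.2):
    alerts = []
    if value > pct_of_max * group_stats["MAX"]:      # percent-of-maximum test
        alerts.append(f"load peak: > {pct_of_max:.0%} of group max")
    for level, key in [("EXTREME", "P99.99"), ("SEVERE", "P99.9"),
                       ("VERY HIGH", "P99"), ("HIGH", "P95")]:
        if value > group_stats[key]:                 # significance-level test
            alerts.append(f"{level}: > {key}")
            break                                    # report most severe only
    return alerts or ["OK"]

print(check(7.0))   # ['HIGH: > P95']
print(check(15.0))  # load peak plus EXTREME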
Enterprise Manager User Interface
Early 10g visualization: seismograph
Enterprise Manager entry points 
• DB home page: Related Links 
• 10g: Metric Baselines 
• Need to enable metric persistence 
• Static and moving window baselines 
• Time grouping selected by user 
• 11g: Baseline Metric Thresholds 
• Out-of-box metric persistence and statistics computation 
• Improved use-case-based interface
• Automatic time grouping selection 
• Statistics computed over SYSTEM_MOVING_WINDOW
RDBMS 11g use case goals 
• Quickly configure Adaptive Thresholds 
• Adjust thresholds in context 
• Identify signals for known problem 
• Advanced metric analysis
Baseline Metric Thresholds page
Quickly configure Adaptive Thresholds
Quick configure: OLTP
Quick configure: Data Warehouse
Adjust thresholds in context
Identify signals for known problem
Advanced metric analysis