In a recent blog, we discussed some surprising findings from a survey of 319 IT professionals administered at Cisco Live 2025. More than half (50.8 percent) of respondents said their organizations discover performance problems only when employees report them to IT or the help desk. We also learned that, more than 80 percent of the time, respondents felt that problems took anywhere from several hours to a week to resolve.
Two recent outages come to mind. First, in July 2025, Alaska Airlines grounded its airplanes for roughly three hours after a critical piece of multiredundant hardware in its data centers failed; more than 200 flights were cancelled and more than 15,000 travelers were impacted. Then, in early August, United Airlines planes were grounded by an outage of the Unimatic system, which provides flight information for weight and balance calculations and flight time tracking. Starting shortly after 6 p.m. ET and taking several hours to resolve, the outage delayed more than 1,000 flights and forced at least 40 cancellations. Both situations are examples of how quickly problem resolution time can stretch to a few hours or more.
Costs of airline downtime are significant, with one report concluding that every hour a single commercial aircraft is grounded costs an airline between $10,000 and $15,000. Outages such as these, particularly in the evenings, are even more expensive for airlines once hotel and food vouchers, crew and other staff costs, and rebooking expenses are factored in. Customer attrition can also affect future ticket sales.
How Can Observability Reduce MTTR?
Problem resolution in today’s enterprises is complicated. Complex, multivendor environments, coupled with a lack of eyes, ears, and hands at the remote site, can mean unacceptable incident resolution times. With survey data showing that problems take a few hours or more to resolve, it’s important to understand how mean time to resolution (MTTR) is calculated.
The MTTR Lifecycle
There are four stages in the MTTR lifecycle for resolving problems (see Figure 1):
- Identify what problem exists
- Know where and why the problem is occurring
- Fix the problem that has been found and investigated
- Verify that the fix has worked and service is fully restored
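Viewed as simple arithmetic, overall MTTR is the sum of the time spent in these four stages, so cutting the earliest stages directly cuts the total. The short Python sketch below illustrates that relationship; the stage durations are hypothetical values for illustration, not figures from the survey.

```python
# Illustrative only: MTTR as the sum of its four stage durations (hours).
# The values below are hypothetical, not survey data.
stages = {
    "MTTI": 5.0,   # time to identify that a problem exists
    "MTTK": 3.0,   # time to know where and why it is occurring
    "MTTF": 1.5,   # time to fix the problem
    "MTTV": 0.5,   # time to verify service is fully restored
}

print(f"MTTR = {sum(stages.values())} hours")  # MTTR = 10.0 hours

# Early detection and faster root-cause analysis shrink MTTI and MTTK,
# which directly shrinks overall MTTR.
improved = {**stages, "MTTI": 0.25, "MTTK": 0.5}
print(f"Improved MTTR = {sum(improved.values())} hours")  # 2.75 hours
```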
Reducing MTTI
The first stage is to reduce the mean time to identify (MTTI) that a problem exists. Using proactive synthetic testing to evaluate user experience from remote sites, the IT organization can identify a disruption in its earliest stages. The data from configurable, consistent transaction testing of key services and applications, run from critical locations, can be trended even when users are not active. When a deviation is identified, the IT team can be notified immediately. This is how the “what” of a problem is revealed.
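To make the idea concrete, here is a minimal sketch of synthetic transaction testing with deviation-based alerting. It is not a description of any particular product; the target URL, test interval, baseline window, and threshold are all assumptions chosen for illustration.

```python
"""Minimal sketch of proactive synthetic transaction testing.

The target URL, interval, and alerting rule are illustrative assumptions,
not part of any specific observability product.
"""
import statistics
import time
import urllib.request

TARGET = "https://portal.example.com/login"   # hypothetical key service
INTERVAL_SECONDS = 300                        # run the test every 5 minutes, 24x7
history: list[float] = []                     # recent response times (seconds)


def run_synthetic_test(url: str) -> float | None:
    """Time one scripted transaction; None means the test failed outright."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        return time.monotonic() - start
    except Exception:
        return None


def notify_it(message: str) -> None:
    """Placeholder for a paging or ticketing integration."""
    print(f"ALERT: {message}")


while True:
    elapsed = run_synthetic_test(TARGET)
    if elapsed is None:
        notify_it(f"{TARGET} unreachable from this site")
    elif len(history) >= 12:
        baseline = statistics.mean(history)
        if elapsed > 2 * baseline:            # simple deviation rule
            notify_it(f"{TARGET} response {elapsed:.2f}s vs. baseline {baseline:.2f}s")
    if elapsed is not None:
        history = (history + [elapsed])[-288:]  # keep roughly one day of samples
    time.sleep(INTERVAL_SECONDS)
```

Because the test runs on a fixed schedule regardless of user activity, a 2 a.m. failure raises an alert at 2 a.m., not when the first help desk ticket arrives.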
Identifying issues in the earliest stages of the problem lifecycle is a big step in reducing MTTI and accelerating overall MTTR. Imagine a failure in one of your VPN gateways at a colocation (co-lo) site that facilitates employee access to corporate applications from your most profitable region. It happens at 2 a.m. According to more than 60 percent of IT executives in our survey, IT won’t know about this until your employees start work between 7 and 8 a.m. that morning and report it to IT.
It is a complex problem. IT won’t know how many offices are impacted; which co-lo site is involved; or if it is an office problem, a wide-area network (WAN) issue, a disruption at the co-lo site, or a gateway issue. By then, help desk tickets will be piling up.
Automated early detection, using synthetic business transaction testing from the remote sales offices, would have identified the VPN unavailability when it began at 2 a.m. IT would have had a powerful head start troubleshooting and isolating the problem before the workday even started, and potentially enough time to implement a workaround and avoid employee disruption altogether.
Reducing MTTK
But the “what” alone does not solve the problem. Yes, it is better than waiting for employees to report the problem, as IT does more than 50 percent of the time, which gives the problem an opportunity to impact even more users and business services. What is needed for faster problem resolution is a way to answer the “why” and the “where” of a problem faster.
This is what makes the second stage so important. To reduce mean time to knowledge (MTTK), real-world insights are necessary to discover why and where a disruption or outage exists. Comprehensive observability depends on continuous deep packet intelligence (DPI) derived from real-time monitoring of the inbound and outbound traffic at those remote locations.
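As a vendor-neutral illustration of what packet-level evidence can reveal, the sketch below uses the open-source scapy library (an assumption for this example, not the DPI technology described above) to passively count TCP connection attempts that go unanswered. The five-attempt threshold and 60-second window are arbitrary choices for the sketch.

```python
"""Minimal sketch of passive traffic analysis at a remote site.

Uses scapy purely to illustrate the concept; the "failed handshake"
heuristic and thresholds are assumptions. Requires root privileges to sniff.
"""
from collections import Counter
from scapy.all import IP, TCP, sniff

syn_sent: Counter = Counter()      # client SYNs observed, keyed by server address
syn_acked: Counter = Counter()     # server SYN-ACKs observed, keyed by server address


def observe(pkt) -> None:
    """Tally TCP handshake attempts and responses per server address."""
    if IP in pkt and TCP in pkt:
        flags = pkt[TCP].flags
        if flags & 0x02 and not flags & 0x10:      # SYN without ACK: new attempt
            syn_sent[pkt[IP].dst] += 1
        elif flags & 0x02 and flags & 0x10:        # SYN-ACK: server answered
            syn_acked[pkt[IP].src] += 1


# Watch inbound/outbound TCP traffic for 60 seconds (illustrative window).
sniff(filter="tcp", prn=observe, store=False, timeout=60)

# Servers with many unanswered connection attempts point to "where" the
# failure sits: the service itself, the path to it, or the gateway in front of it.
for server, attempts in syn_sent.most_common():
    if attempts >= 5 and syn_acked[server] == 0:
        print(f"{server}: {attempts} connection attempts, none answered")
```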
By leveraging vendor-independent, ecosystemwide observability between the remote locations and wherever application or communications services are hosted, IT teams have the smart data and analytics they need to pinpoint the true root cause of user-impacting degradations. War rooms and vendor-specific point tools are inefficient and often unsuccessful because each can rule out only part of the environment. What is necessary to meaningfully improve MTTK is a systemwide observability solution that can analyze the overall environment, including remote locations, and pinpoint the source of the problem.
Implementing a fix and verifying that it works are the final two stages of this lifecycle. The mean times to fix and verify (MTTF and MTTV) that the problem is rectified depend less on observability than on what the problem requires for restoration, for instance, whether IT needs its WAN provider to correct a misconfiguration or add bandwidth at a site, or needs to dispatch a hardware technician to repair equipment. Even so, organizations benefit from having observability in place to support a faster, more reliable verification process. Figure 2 illustrates how improvements in the identification and knowledge phases can significantly reduce overall MTTR.
Best Practices
As headlines abound with reports of “network glitches” causing aircraft groundings, banking disruptions, and the rescheduling of surgeries at hospitals, best practices that include end-through-end observability to reduce MTTR are essential. Synthetic testing can tell you the “what” of a problem, identifying a potential issue as it begins to emerge, which reduces MTTI. However, it cannot by itself discover “why” and “where” the problem exists. DPI from those same locations pinpoints the root cause and reduces MTTK. The sooner that information is available, the sooner IT can begin fixing and verifying the problem and restoring quality services to employees and customers in remote offices.
See how one manufacturer put this approach into practice to solve problems in its remote locations better and faster with NETSCOUT nGenius solutions for observability.