Rethinking Maintenance and Support: How AI Agents Will Run Your Core

Disclaimer

The views presented in this document are entirely my own. They reflect my personal analysis, experience, and aspirations for the future of technology-driven enterprises. This piece is also a way for me to put evolving thoughts on paper about a rapidly emerging topic. As such, some perspectives shared here may prove to be incomplete or even incorrect over time. They are not intended to represent the positions or opinions of any current or former employer or partner.

— Jaco van Staden

Executive Manifesto – Maintenance Without Maintenance

For decades, IT maintenance has been seen as a sunk cost—critical, complex, and increasingly invisible to the business. Support functions are often measured by response time and resolution rate, not by their strategic impact. And yet, the resources consumed by “keeping the lights on” continue to exceed those invested in innovation.

But I believe we’re now entering a decisive shift.

AI agents are no longer theoretical or confined to dashboards—they’re beginning to operate inside core systems. In a growing number of enterprises, these agents are moving beyond passive observability towards partial ownership of resolution, triage, and remediation. They interpret telemetry, reason over logs, trigger actions, and learn from system responses—all in live environments. This doesn’t eliminate the role of our support and maintenance employees and talent—it repositions them as designers, supervisors, and orchestrators of intelligent systems.

Gartner’s 2024 Hype Cycle for AIOps places AI-driven remediation and platform-native observability agents at the early slope of adoption. Meanwhile, companies like Dynatrace, Microsoft (Automanage), and ServiceNow have begun embedding proactive agents directly into operations. These are early signals—but they are real.

This shift forces us to rethink not just how we maintain systems, but what maintenance should actually mean in an AI-first enterprise. We no longer need to treat support as a disconnected function that reacts after failure. Instead, we can start treating it as a self-stewarding layer—where the very flows that generate service are also capable of sustaining it.

This is where Intelligent Flow Engineering (IFE) comes into play—a concept I introduced in “The AI-First Operating Model” as a foundational design principle for embedding intelligence directly into enterprise flows. In that context, IFE enabled AI-driven decision-making across processes and value chains. In this piece, we extend that construct deeper into support and maintenance—where IFE becomes the enabler of agentic remediation, telemetry-native flows, and continuous learning in the run layer itself.

We should also reconsider how we view Run the Business (RTB) functions themselves. Too often, they’re treated as operational burdens—ripe for outsourcing or cost-cutting. But RTB holds a unique advantage: it is repeatable, measurable, and saturated with real-time signal.

These characteristics make it the ideal environment to introduce, train, and validate new intelligent capabilities—before they move upstream into customer-facing or change-oriented functions. In this sense, AI-driven maintenance becomes more than operational hygiene—it becomes a strategic platform for innovation, system learning, and long-overdue tech debt reduction. RTB, reimagined, is where intelligence earns its right to scale.

In this piece, I want to explore how that evolution is taking shape—not in some distant future, but across the service towers, partner ecosystems, and operational environments we work in today. And I want to unpack what it means for roles, providers, and the intelligence fabric that increasingly binds the enterprise together.

1. The Hidden Cost of Traditional Maintenance

Support and maintenance have not been neglected. Over the past two decades, these functions have undergone wave after wave of transformation—process harmonisation, automation, ITIL standardisation, tooling consolidation, offshoring, nearshoring, managed services, and targeted AI investments. The goal has been consistent: control cost, reduce noise, and maintain service quality at scale.

Many organisations have delivered substantial improvements. Application Production Support (APS) teams have reduced incident volumes through automation and observability. ITSM platforms have been streamlined, with increased adoption of self-heal scripts, predictive alerts, and workflow integrations. Ambitions around “Zero-Touch Operations” have driven platforms to eliminate repetitive, high-volume tickets altogether.

These are not trivial accomplishments—they are critical foundations for what comes next.

Yet despite these advances, support remains structurally treated as a cost centre. The prevailing measure of success is how efficiently issues are resolved—not whether the system as a whole is becoming more intelligent, resilient, or adaptive. Most of the work has focused on containment, rather than contribution.

Support flows, though rich in system signal and behavioural insight, are rarely connected to upstream engineering, architecture, or design. Incidents are resolved in isolation, lessons are trapped in resolution notes, and friction is often localised rather than abstracted into enterprise patterns.

The result is a missed opportunity. The very function that sees the most failures, captures the most real-world signal, and operates closest to live system behaviour is excluded from transformation. It is optimised—but disconnected.

As we move toward agentic operations, this disconnect becomes a constraint. Autonomous remediation, self-improving logic, and intelligent co-stewardship require more than automation—they require flows engineered for learning, traceability, and action.

This is the moment to reposition support—not as back-office stability, but as the proving ground for enterprise intelligence. A place where systems become self-sustaining, employees become orchestrators, and maintenance becomes a source of continuous improvement.

2. Enter the AI Agent – Not a Tool, but a Teammate

The shift from rule-based automation to dynamic intelligence is no longer hypothetical—it’s happening across the ecosystem. But it’s not just about “AI agents.” What we’re seeing is the emergence of an intelligence layer across the enterprise run stack: a convergence of data agents, policy-aware automation, embedded telemetry, platform-native orchestration, and context-driven remediation.

This evolution goes beyond scripting the past. It introduces systems that observe, reason, and act—not based on pre-defined playbooks, but based on real-time signals and accumulated understanding of operational patterns.

In this model, intelligence is distributed:

  • Data agents contextualise logs, telemetry, and metrics—identifying patterns, not just outliers.
  • Platform-native services such as Microsoft Automanage, ServiceNow Predictive AIOps, and Dynatrace Grail proactively initiate remediation rather than simply raising alerts.
  • Domain-specific agents embedded in infrastructure, SAP environments, or security stacks trigger precision interventions—without human initiation.
  • And LLM-powered orchestration frameworks are beginning to stitch these together into adaptive, explainable flows.

This isn’t a monolith—it’s a composite architecture of intelligent components, each playing a role in redefining how enterprise systems maintain themselves.

But to function effectively, these systems need more than integration—they need intelligent flow design.

Without visibility into state, history, dependencies, and business context, even the most advanced agent becomes brittle. This is where Intelligent Flow Engineering (IFE) becomes critical. IFE ensures that every support flow—whether it's application recovery, user provisioning, or config rollback—is structured to:

  • Expose context to the agent
  • Embed observability into the flow logic itself
  • Capture action and outcome to feed learning loops
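
To make those three properties concrete, here is a minimal sketch in Python of what an IFE-structured flow step could look like. The class and field names (FlowContext, FlowStep, signals, history) are illustrative assumptions, not the API of any particular platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class FlowContext:
    """Everything an agent needs to reason about a step: live signals plus prior history."""
    service: str
    signals: dict[str, Any]                              # telemetry, log excerpts, metrics
    history: list[dict] = field(default_factory=list)    # prior actions and their outcomes


@dataclass
class FlowStep:
    """A support-flow step that exposes context, emits observability, and records its outcome."""
    name: str
    action: Callable[[FlowContext], dict]

    def run(self, ctx: FlowContext) -> dict:
        started = datetime.now(timezone.utc).isoformat()
        outcome = self.action(ctx)                        # agent- or human-initiated action
        record = {
            "step": self.name,
            "started": started,
            "signals_seen": sorted(ctx.signals),          # observability embedded in the flow logic
            "outcome": outcome,
        }
        ctx.history.append(record)                        # feeds the learning loop
        return record


# A restart action, wrapped so that its context, action, and result are captured in one place.
restart = FlowStep("restart-order-service", lambda ctx: {"status": "restarted"})
restart.run(FlowContext("order-service", signals={"error_rate": 0.12}))
```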

Today, some organisations are already putting this into practice:

  • ServiceNow instances are leveraging Predictive Intelligence and Virtual Agent flows to pre-resolve known incident types and automate classification across hundreds of categories.
  • Azure Automanage allows system baselines to be maintained and automatically corrected, reducing configuration drift at scale.
  • Custom-built agents in high-performing APS environments are parsing error logs, identifying code regressions, and flagging rollback candidates—all before the incident reaches a human.

The opportunity is no longer theoretical. The technology exists. The question is whether enterprise environments are designed to accommodate it.

And that’s the missing link: most support flows weren’t built to be observed, reasoned over, or adapted in real time. They were built to be executed and tracked. This is where the shift to agentic operations requires more than tool deployment—it requires a structural rethink of how runbooks, policies, and response paths are engineered.

The benefit? Once these intelligent constructs are in place, support becomes something else entirely:

  • Silent self-repair of known failure patterns
  • Pre-emptive suppression of false alerts based on cross-system awareness
  • Automated impact assessment of config changes based on historical correlations
  • Agentic escalation only when contextual thresholds are exceeded—not just when static rules fire
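
As an illustration of the last point above, contextual escalation can be thought of as a scoring decision over several live signals rather than a single static rule. The weights and thresholds below are placeholders, not recommended values.

```python
def should_escalate(confidence: float, blast_radius: int,
                    business_hours: bool, similar_auto_fixes: int) -> bool:
    """Escalate to a human only when contextual risk outweighs the agent's earned confidence.

    Every weight and threshold here is an illustrative placeholder.
    """
    risk = blast_radius * (2.0 if business_hours else 1.0)     # more systems affected, higher risk
    trust = confidence + 0.05 * min(similar_auto_fixes, 10)    # successful prior auto-fixes build trust
    return risk > 5 and trust < 0.8


# A low-confidence fix on a widely shared service during business hours escalates; the same
# pattern seen and fixed many times before would be handled autonomously.
print(should_escalate(confidence=0.55, blast_radius=4, business_hours=True, similar_auto_fixes=1))
print(should_escalate(confidence=0.70, blast_radius=4, business_hours=True, similar_auto_fixes=8))
```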

And critically, all of this still includes the human. Employees don’t disappear—they shift into flow designers, escalation architects, and agent supervisors. The work becomes higher-order, more strategic, and more directly connected to service resilience.

This isn’t about replacing teams with tools. It’s about building systems where intelligence runs alongside people—at scale, at speed, and without requiring manual triggers.

This is why the support and maintenance layer matters. Not just because it can be optimised—but because it offers the ideal environment to validate enterprise-grade intelligence.

With the right telemetry, engineered flows, and human oversight in place, this becomes more than automation. It becomes the proving ground for cognitive capability—where enterprise intelligence is not only tested, but refined before scaling across the broader business.

3. From Tiers to Threads – Rethinking the Structure of Support

Most support models today are still shaped by inherited constructs: L1, L2, L3. Each tier represents an escalation in complexity, specialisation, and time-to-resolution. But in practice, this model often leads to fragmentation of insight, delays in diagnosis, and loss of institutional memory between handoffs. Each ticket becomes an isolated artefact, disconnected from the system’s broader state and evolution.

In the era of agentic operations, this structure no longer fits.

Intelligent systems don’t operate in tiers—they operate in threads: persistent, context-rich sequences of system behaviour, action, and resolution. These threads are not routed by human queues—they are initiated and maintained by agents, flows, and observability triggers that track state across boundaries.

In a thread-based support model:

  • The “incident” isn’t the unit of work. The flow is.
  • Context is preserved across touchpoints—human or machine.
  • Diagnosis and remediation are interleaved with learning and adaptation.
  • Resolution is no longer the endpoint—post-action insight becomes the default byproduct.

This model requires intentional re-architecture. Flows must be designed to:

  • Surface events as part of a broader system narrative—not isolated alerts
  • Capture actions, outcomes, and variations to support learning loops
  • Allow both humans and agents to interact, escalate, and intervene at the appropriate juncture
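
One way to picture a thread as a persistent, context-rich object rather than a ticket is sketched below; the structure and field names are assumptions for illustration, not a product schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ThreadEvent:
    source: str     # "telemetry", "agent", or "human"
    kind: str       # "signal", "diagnosis", "action", "outcome", "insight"
    detail: dict
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


@dataclass
class SupportThread:
    """A persistent sequence of behaviour, action, and resolution that outlives any single incident."""
    service: str
    events: list[ThreadEvent] = field(default_factory=list)

    def record(self, source: str, kind: str, **detail) -> None:
        self.events.append(ThreadEvent(source, kind, detail))

    def narrative(self) -> list[str]:
        """The thread replayed as a story rather than a queue entry."""
        return [f"{e.at} [{e.source}] {e.kind}: {e.detail}" for e in self.events]


thread = SupportThread(service="checkout-api")
thread.record("telemetry", "signal", pattern="latency climbing on /payments")
thread.record("agent", "diagnosis", cause="connection pool exhaustion")
thread.record("agent", "action", change="recycle pool, raise max connections")
thread.record("human", "insight", follow_up="review pool sizing in the next design iteration")
```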

In leading environments, we are already seeing this emerge:

  • ServiceNow and PagerDuty integrations are shifting from static ticketing to event-driven orchestration.
  • High-maturity APS teams are collapsing tiers into cross-functional pods with flow-level observability and shared accountability.
  • Incident suppression logic is being tuned based on multi-signal correlation rather than static thresholds—blending operational telemetry with business context.

Embedded Case Example: Intelligent Threads in Action

In one global consumer goods company, what began as a standard L2 escalation for intermittent SAP order posting failures evolved into a fully observable support thread. The incident was first flagged by a telemetry pattern detected by Dynatrace, integrated into their ServiceNow ITSM stack. Instead of routing the issue to a human queue, a domain-specific agent triggered an analysis thread that ran across SAP IDoc logs, middleware transaction timings, and infrastructure CPU/memory spikes.

Using ServiceNow Predictive AIOps, the system correlated these into a single service-impact narrative. A remediation suggestion was generated—a queue buffer config change in the SAP PI layer—reviewed via a Microsoft Teams approval flow, and implemented automatically through an Ansible Tower job, governed by policy-aware automation.

Total time to resolution: under 8 minutes. Traditional route? 3–4 hours minimum with 2–3 handoffs.

Behind the scenes, this capability had been incrementally implemented over ~6 weeks:

  • Week 1–2: Flow telemetry instrumentation and Dynatrace log enrichment
  • Week 3–4: Intelligent correlation and ServiceNow Virtual Agent triggers
  • Week 5–6: Policy guardrails, audit logging, and automation linkage

Critically, the entire thread—signals, actions, approvals, and outcomes—was captured as a persistent system object. This meant that when a similar issue reappeared a month later, it was resolved automatically. The APS team used the logged flow to inform a system redesign—removing the original failure vector entirely.

As a final step, the system automatically generated a structured insight card and flagged it to the Change-the-Business (CTB) backlog via Azure DevOps. The insight—complete with signal pattern, agent action trace, and remediation impact—was logged as a design issue for future change.

This created a closed-loop from incident to improvement, allowing the transformation team to evaluate and redesign the SAP integration logic in the next sprint. What began as a system fix became a CTB-level intervention—closing the loop between support flow and design backlog, and reducing future tech debt with minimal manual effort.
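
Stripped to its shape, the thread above is a short orchestration. The sketch below is not the client's implementation; every helper (fetch_correlated_signals, request_teams_approval, run_ansible_job, push_insight_to_backlog) is a hypothetical stub standing in for the Dynatrace, Teams, Ansible Tower, and Azure DevOps integrations described.

```python
# Hypothetical stubs standing in for the real integrations described in the case.
def fetch_correlated_signals(sources):            return {s: "sample signal" for s in sources}
def request_teams_approval(proposal):             return True            # pretend the approver said yes
def run_ansible_job(template, params):            return {"template": template, "status": "ok"}
def push_insight_to_backlog(thread, board, tag):  print(f"insight logged to {board} as {tag}")


def remediate_order_posting_failure() -> list[dict]:
    thread = []                                    # the persistent thread, kept as a simple event list here
    signals = fetch_correlated_signals(["sap_idoc_logs", "middleware_timings", "infra_metrics"])
    thread.append({"kind": "signal", "correlated": sorted(signals)})

    proposal = {"change": "increase PI queue buffer", "rollback": "restore previous configuration"}
    thread.append({"kind": "diagnosis", "proposal": proposal})

    if request_teams_approval(proposal):           # human checkpoint inside the flow
        result = run_ansible_job("sap-pi-queue-buffer", proposal)     # policy-governed automation
        thread.append({"kind": "action", "result": result})
        push_insight_to_backlog(thread, board="CTB backlog", tag="design-issue")   # closes the loop to CTB

    return thread


remediate_order_posting_failure()
```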

This is the structural underpinning of Intelligent Flow Engineering (IFE). IFE is not just about flow instrumentation—it’s about flow design as a first-class discipline. Support becomes a co-designed experience, where system resilience is engineered at the level of flow logic, not reaction time.

And for service providers, this represents a fundamental shift. Rather than staffing based on ticket volumes and escalation tiers, providers must deliver value through:

  • Flow design and optimisation expertise
  • Embedded AI observability and intervention logic
  • Outcome-level performance tied to system health—not SLA resolution time

The result is a support model that is dynamic, learning-driven, and aligned with how modern systems operate—not how support has historically been organised.

4. If AI Agents Are the Actors, Then Flow Design Is the Script

AI agents are powerful. But they’re not autonomous gods. They act based on the signals they see—and the flows they’re allowed to follow.

This makes flow design the real unlock.

In most enterprises, support flows are static. They follow ITIL-prescribed paths, defined once, rarely adapted, and invisible to the people actually working within them. Even as automation has increased, the logic behind it has remained locked in tickets, scripts, or rigid platform workflows.

If we want AI agents to become scalable contributors—not fragile bots—we need to make support flows observable, adaptive, and co-designed.

This is where Intelligent Flow Engineering (IFE) becomes more than architecture. It becomes craft—a discipline of designing flows that:

  • Can expose real-time context to agents
  • Embed decision logic that evolves over time
  • Include human checkpoints, observability, and safe rollback
  • Create versionable flow objects that can be updated like software, not policies
  • Are co-owned by support teams, not just designed upstream

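One way to picture such a versionable flow object is as a declarative spec that support teams edit, review, and roll back like code. The schema below is an assumption for illustration, not the format of any specific flow platform.

```python
KNOWN_ERROR_FLOW_V2 = {
    "flow": "known-error-remediation",
    "version": "2.3.0",                                         # updated like software, with semantic versions
    "owners": ["application-support", "platform-engineering"],  # co-owned, not just designed upstream
    "triggers": ["alert:error-rate-spike", "virtual_agent:repeat-complaint"],
    "steps": [
        {"id": "collect-context", "expose": ["recent_logs", "config_baseline", "change_history"]},
        {"id": "match-known-signature", "on_no_match": "escalate-with-context"},
        {"id": "remediate", "action": "apply-known-fix",
         "checkpoint": "human-approval-if-production", "rollback": "restore-previous-state"},
    ],
    "observability": {"emit": ["step_started", "step_outcome"], "retain_days": 90},
}
```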

Embedded Case Example: Elevating the Endpoint Support Loop

At a large European insurance firm, the Endpoint Support team had long struggled with recurring device performance issues—slow logins, profile corruption, and failed patching. Each case followed the same dance: incident raised, script run, reboot requested, root cause unclear.

But in mid-2024, the team piloted a new model: using Microsoft Intune, ServiceNow Flow Designer, and Windows Autopilot diagnostics, they created a diagnose-and-remediate flow embedded directly in the Virtual Agent.

Here’s how it worked:

  1. The Virtual Agent detected recurring keywords in user complaints (e.g. "slow startup", "black screen").
  2. An embedded agent launched a context-aware diagnostics flow—pulling logs via Intune, comparing config baselines, and matching against a known issue signature (a corrupted Start Menu policy).
  3. If matched, a remediation flow auto-ran via PowerShell + Intune device sync, while logging each action.
  4. If uncertain, the case escalated to L2—but with all context bundled in.
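
Compressed into code, the four steps above look roughly like the sketch below. The helper functions standing in for Intune, the Virtual Agent, and the remediation scripts are hypothetical stubs, not Microsoft or ServiceNow APIs.

```python
# Hypothetical stubs standing in for the Intune and Virtual Agent integrations.
def pull_device_logs(device_id):        return {"policy_state": "corrupted start menu policy"}
def match_known_issue(diagnostics):     return diagnostics.get("policy_state", "")
def run_remediation(device_id, fix):    print(f"running {fix} on {device_id}")
def escalate_to_l2(device_id, context): return "escalated-with-context"


KNOWN_SIGNATURES = {"corrupted start menu policy": "reset-start-menu-policy"}
TRIGGER_PHRASES = ("slow startup", "black screen")


def handle_complaint(complaint_text: str, device_id: str) -> str:
    if not any(phrase in complaint_text.lower() for phrase in TRIGGER_PHRASES):
        return "no-matching-trigger"                                 # step 1: keyword detection
    diagnostics = pull_device_logs(device_id)                        # step 2: context-aware diagnostics
    signature = match_known_issue(diagnostics)
    if signature in KNOWN_SIGNATURES:
        run_remediation(device_id, KNOWN_SIGNATURES[signature])      # step 3: auto-remediate and log
        return "auto-remediated"
    return escalate_to_l2(device_id, diagnostics)                    # step 4: escalate with context bundled in


print(handle_complaint("laptop has a slow startup every morning", "device-042"))
```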

Here’s what changed:

  • Time to resolution dropped from 3 days to 22 minutes on average
  • Flows were versioned, tracked, and updated weekly based on emerging patterns
  • The support team could tweak flow logic live, test outcomes, and flag recurring issues upstream
  • Every intervention became a signal to the CTB backlog via Azure DevOps, creating a feedback loop between operations and engineering

They didn’t just fix issues faster—they transformed how issues were recognised, tracked, and elevated for structural change.

This is what happens when flow design becomes the script, and support teams become the writers.

Flow Evolution Timeline – From Execution to Cognition

This isn’t just about better automation. It’s about building a runtime system of intelligence—where agents can act with context, people can guide outcomes, and flows become both the instruction and the insight.

When we shift from “ticket follows process” to “flow guides intelligence,” support becomes more than reaction. It becomes a learning surface, a design platform, and the place where enterprise cognition begins to take form.

5. Redesigning the Maintenance Mindset: From Reactive Fixes to Intelligent Continuity

For decades, maintenance has been synonymous with reactivity: something breaks, someone fixes it. Even with the rise of preventive maintenance, the core construct remained linear—detect, diagnose, resolve. The promise of zero-touch or lights-out operations has lingered on PowerPoint slides, but rarely moved beyond a tightly scoped automation loop.

What’s changing now is not just the toolset, but the mindset. Maintenance is no longer just about “keeping the lights on.” It’s becoming a live optimisation layer, an intelligent loop where fixes, forecasts, and functional upgrades blend into one adaptive system. This evolution builds directly on our previous framing of the enterprise support landscape as a test bed for intelligence, not just a cost centre.

The Architecture Behind It

This shift is powered by three interlocking components:

  • AI and Data Agents capable of observing and acting across telemetry, logs, and event signals
  • Intelligent Flow Engineering (IFE), which orchestrates the logic and decision pathways within runtime contexts
  • A learning loop between RTB and CTB—where each intervention informs future improvements, backlog priorities, and technical debt retirement

Maintenance, under this model, is no longer an afterthought—it’s a design artefact and an observability surface. The intelligence we embed here doesn’t just stabilise operations. It teaches the enterprise how to improve.

Embedded Example: Proactive Capacity Tuning in Cloud Infrastructure

At a European food manufacturing client, infrastructure maintenance had long been governed by static thresholds and reactive escalations. Peak season events would trigger war rooms, not scale plans.

But in 2024, the InfraOps team rewired their model.

Using a combination of Azure Monitor, Log Analytics, and an OpenAI-powered agent running in a secure container, they introduced a proactive flow for capacity tuning:

  1. The agent scanned utilisation trends weekly, identifying under- and over-provisioned workloads.
  2. A Data Agent flagged specific patterns in telemetry, routing this insight into the IFE-managed flow engine.
  3. The system launched a ServiceNow workflow to validate potential savings or risk scenarios, tagging workloads for review.
  4. Any high-value interventions were pushed as signals to the CTB backlog, tagged with expected impact and related KPIs.
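
In spirit, the weekly scan in steps 1 and 2 reduces to a utilisation check per workload before anything is routed onward. The thresholds and sample figures below are illustrative placeholders, not the client's actual values.

```python
def classify_workloads(avg_cpu_by_workload: dict[str, float]) -> dict[str, list[str]]:
    """Bucket workloads by seven-day average CPU utilisation. Thresholds are illustrative."""
    buckets = {"under_provisioned": [], "over_provisioned": [], "ok": []}
    for workload, avg_cpu in avg_cpu_by_workload.items():
        if avg_cpu > 0.80:
            buckets["under_provisioned"].append(workload)     # candidate for scale-up review
        elif avg_cpu < 0.15:
            buckets["over_provisioned"].append(workload)      # candidate for right-sizing and savings
        else:
            buckets["ok"].append(workload)
    return buckets


# High-value findings would then be routed into the ServiceNow review workflow and, where
# relevant, pushed as tagged signals onto the CTB backlog (steps 3 and 4).
weekly = classify_workloads({"sap-app-01": 0.91, "reporting-batch": 0.07, "web-frontend": 0.42})
for workload in weekly["over_provisioned"]:
    print(f"flag {workload} for right-sizing review")
```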

What changed:

  • Monthly cloud spend dropped 17% within the first 90 days
  • Unused compute capacity was flagged within hours—not weeks
  • Support teams created flow variants via GitOps submissions, reviewed by platform engineering
  • Data Agents triggered real-time insights that were reusable across similar environments
  • CTB teams used signal telemetry to schedule refactor work and guide cloud re-architecture

This wasn’t just predictive maintenance. It was adaptive optimisation, driven by an interplay of data, flows, and learning agents.

Why This Matters Now

As enterprises increase their reliance on interconnected platforms and distributed environments, downtime costs are no longer just financial—they're reputational, regulatory, and operational. But the real opportunity lies in rethinking support and maintenance not as overhead—but as the runtime memory of the business.

We’ve spent years underinvesting in maintenance functions, outsourcing them to control costs and chase SLAs. But these very teams see every exception, workaround, patch, and regression.

They are the frontline of insight—and with the right architecture, they can become the first responders to complexity and the pilots of enterprise cognition.

By shifting from reactive to intelligent continuity, we unlock more than efficiency. We activate the feedback system between design, operation, and learning—and finally make maintenance a strategic layer of the intelligent enterprise.

6. From Playbook to Practice: Operationalising the New Support Model

The shift to intelligent maintenance isn’t just a technology play—it’s an operational shift. To realise its full value, organisations must restructure how support is planned, executed, and continuously improved. This means evolving not just the systems, but the roles, processes, and incentives around them.

Where Traditional Support Models Fall Short

Legacy support models are often constrained by:

  • Ticket-driven workflows focused on SLA compliance, not system health
  • Siloed teams (Infra, App, Network, Security) with limited shared telemetry or flow logic
  • Restricted agency for support staff, who are trained to follow scripts—not design flows
  • One-way escalations that separate RTB from CTB, limiting shared learning

Even when automation is added, it’s typically task-based and isolated, not system-aware or context-sensitive.

What Operational Excellence Looks Like in This New Model

The intelligent support model introduces a new set of capabilities and expectations.

Support teams are no longer process executors. They become flow designers, data interpreters, and platform collaborators—contributing directly to both operational stability and platform evolution.

The Architecture of Practice

To embed this in daily operations, a few foundations are essential:

  • Integrated flow platforms (e.g. ServiceNow Flow Designer, Azure Logic Apps, or PagerDuty Process Automation) that are co-owned by RTB and CTB teams
  • Developer-accessible telemetry layers and structured data models that support intelligent triggers and agents
  • Version-controlled flow libraries with observability, rollback, and reuse standards
  • Agentic audit logs that capture how decisions were made, not just that they happened
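
The last point is worth making concrete: an agentic audit log records the evidence and reasoning behind a decision, not just the decision itself. The record shape below is an assumption for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(agent: str, decision: str, inputs: dict, rationale: str,
                 confidence: float, approved_by: Optional[str]) -> str:
    """One audit entry: what was decided, on what evidence, with what confidence, and who signed off."""
    return json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "decision": decision,
        "inputs": inputs,              # the signals the agent actually saw
        "rationale": rationale,        # why this action, in a reviewable form
        "confidence": confidence,
        "approved_by": approved_by,    # None when the flow permitted autonomous action
    })


print(audit_record("capacity-agent", "right-size reporting-batch",
                   {"avg_cpu_7d": 0.07}, "sustained low utilisation over four weeks",
                   confidence=0.92, approved_by=None))
```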

In short, practice becomes a living system—versioned, inspectable, and improvable over time.

Human Impact: The Talent Shift

This model only works if we evolve our view of talent. Maintenance engineers and support analysts are no longer "non-core" workers.

They are:

  • Custodians of enterprise telemetry
  • Early signal detectors of technical debt
  • Flow crafters whose work feeds CTB strategy

This requires a deliberate shift in training, empowerment, and recognition. Support must be embedded within agile delivery models, treated with the same investment in tooling, skills, and retrospectives as any product team.

7. Building the Observability Core: Signals, Feedback, and Data Contracts

If flow is the script and agents are the actors, then observability is the stage—the place where context, continuity, and control converge.

The move toward intelligent support cannot succeed without a foundational investment in a robust observability fabric. This layer transforms support from a reactive sequence of break-fix actions into a closed-loop system of signal, sense, and respond.

Observability is Not Just Monitoring

Traditional monitoring checks if things are working. Observability asks: Why isn’t it working? In the intelligent enterprise, it also asks: What can we learn from it?

To do this well, organisations must treat observability as a design concern, not an afterthought.

The key shift is from instrumentation of systems to instrumentation of flows:

  • Events are no longer just infrastructure anomalies—they’re cues in a decision narrative
  • Logs are not just error records—they’re signals in a broader context chain
  • Metrics are not just performance thresholds—they’re patterns in flow behaviour

Data Contracts and Feedback Loops

What enables this shift is the emergence of data contracts—agreements between producers (apps, services, platforms) and consumers (agents, analysts, flows) on the shape, quality, and semantics of data.

Data contracts:

  • Provide schema assurance for telemetry pipelines
  • Enable safe handshakes between CTB and RTB domains
  • Allow flow designers to reference consistent signals across change cycles
  • Power feedback loops where insights from support operations influence system design
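
A minimal way to express a data contract in code is a declared schema that producers are validated against before their telemetry is consumed by agents or flows. The contract content below is illustrative only.

```python
API_LATENCY_CONTRACT = {
    "name": "api-latency-v1",
    "fields": {"service": str, "endpoint": str, "latency_ms": float, "timestamp": str},
}


def violations(event: dict, contract: dict) -> list[str]:
    """Return contract violations; an empty list means the producer honoured the contract."""
    problems = []
    for field_name, expected_type in contract["fields"].items():
        if field_name not in event:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], expected_type):
            problems.append(f"{field_name}: expected {expected_type.__name__}, "
                            f"got {type(event[field_name]).__name__}")
    return problems


event = {"service": "partner-portal", "endpoint": "/orders",
         "latency_ms": 412.0, "timestamp": "2024-09-01T10:00:00Z"}
assert violations(event, API_LATENCY_CONTRACT) == []
```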

This is where our prior concept of Data Agents re-emerges: they don’t just query telemetry—they negotiate it. They align signals across services, validate context windows, and identify what feedback needs to return into the CTB backlog.

Embedded Example: Signal Drift in API Performance

At a global consumer goods company, several APIs powering partner portals began showing increased latency. Traditional APM flagged the issue but offered little insight beyond stack traces.

An observability agent detected signal drift—a subtle change in call sequencing patterns. Using a telemetry data contract, it validated which downstream systems were contributing to response delays.

Instead of escalating a vague ticket, the system:

  1. Triggered a flow artefact that identified the drifted signature.
  2. Annotated it against historical baselines.
  3. Routed the insight to the CTB backlog with a linked dashboard widget for real-time tracking.
  4. A platform engineer reused the same telemetry construct in another API domain, preventing recurrence.
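
Signal drift of this kind can be approximated with a simple baseline comparison; the window sizes and threshold below are illustrative only, not the detection logic actually used in the case above.

```python
from statistics import mean, pstdev


def drift_score(baseline: list[float], recent: list[float]) -> float:
    """How many baseline standard deviations the recent mean has moved. Purely illustrative."""
    sigma = pstdev(baseline) or 1.0
    return abs(mean(recent) - mean(baseline)) / sigma


baseline_latency_ms = [118, 121, 119, 123, 120, 122, 117]   # historical window
recent_latency_ms   = [131, 149, 154, 160, 158, 166, 171]   # current window

if drift_score(baseline_latency_ms, recent_latency_ms) > 3.0:   # threshold is a placeholder
    print("signal drift detected: annotate against baseline and route the insight to the CTB backlog")
```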

This didn’t just fix the incident—it strengthened the platform.

Why This Layer Matters

Without a mature observability core, agentic systems become brittle, blind, and biased. Worse, they risk hallucinating actions without context—introducing new tech debt while attempting to fix the old.

Done right, observability becomes:

  • A trust anchor for automation
  • A collaboration layer between RTB and CTB
  • A platform for continuous learning, built into the very telemetry of the enterprise

In this model, support becomes insight at scale—with every intervention teaching the system, informing the backlog, and raising the bar for how change is absorbed and intelligence is embedded.

8. A New Compact: Repositioning Support as an Engine of Intelligence

It’s time we stop seeing support and maintenance as the leftovers of transformation. In an AI-first enterprise, these functions become core to how the business learns, adapts, and grows.

We’ve built systems that can recover, but now we must build systems that can teach. Every flow, every signal, every agent intervention becomes part of a living body of operational intelligence. This is where Intelligent Flow Engineering (IFE) no longer sits in the realm of design—it becomes the day-to-day mechanism of running, refining, and reshaping the enterprise.

Support Is the System

This article has argued that support and maintenance should no longer be viewed as cost centres or outsourced necessities. When reframed through the lens of agentic automation, observability, and IFE, they become:

  • The first mile for innovation rollout
  • The control plane for runtime intelligence
  • The arena where AI agents, flows, and humans co-orchestrate decisions
  • The operational reality check for every strategy committed to code

Rather than react to problems, the intelligent support function prevents, predicts, and prioritises—acting as a strategic filter between what is and what could be.

What This Demands

To unlock this model, we must:

  • Redesign roles so that support engineers own and improve flows, not just fix issues
  • Replatform our tooling so that AI agents, data contracts, and telemetry are first-class citizens
  • Reconnect CTB and RTB into a continuous flow of intervention and improvement
  • Reposition service providers from labour arbitrage to flow intelligence partners

Support becomes the heartbeat of change, and talent in this domain shifts from “operating cost” to change catalyst.

The New Compact

This is the compact the AI-first enterprise must now make:

We will no longer wait for issues to teach us. We will use every intervention as a trigger for intelligence. We will treat support as a platform for elevation, not a dumping ground for cost.

In doing so, we shift from treating RTB as something to control, to something that controls the quality of the enterprise itself.
