The Current State of Self-Healing Software

The marketing around "self-healing software" suggests that with AI your systems can fix themselves. I don't have direct experience with self-healing systems yet, but it's a topic I keep running into, so I spent some time looking into what practitioners and researchers are actually saying.

Three separate conversations

The term "self-healing" covers three distinct tracks that vendors blur together but practitioners don't:

Self-healing infrastructure from the SRE and chaos engineering world, descended from IBM's autonomic computing initiative back in 2001.
Automated program repair from academia and ML research, now with tools like Codex and Claude Code.
Agentic AIOps, where LLM-based agents sit on top of observability platforms and propose or execute fixes.

The maturity levels of each are very different, and treating them as one thing leads to confused expectations.

[Image: three railroad tracks diverging in different directions, representing the three distinct fields that get lumped together as self-healing software]

What the practitioners say

At QCon London 2026, Netflix engineers Prasanna Vijayanathan and Renzo Sanchez-Silva described a recent outage that took four hours to resolve and pulled in nine teams and thirty engineers across three related incidents. Their response isn't "add AI." It's to build an end-to-end knowledge graph that models users, clients, services, and infrastructure as RDF-style triples.

"the ontology is the contract between chaos and understanding."

In the current setup, Claude proposes pull requests and a human reviews and merges them. Their stated roadmap is: automate root cause analysis, then auto-remediation, then self-healing. The ordering matters. Netflix is explicit that the knowledge graph is the prerequisite, not the AI model.
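To make the "RDF-style triples" idea concrete, here is a minimal sketch of how such a graph could answer a blast-radius question. It is not Netflix's implementation: every node name, predicate, and function below is made up for illustration, and a real system would presumably use a proper triple store rather than a Python list.

```python
from collections import defaultdict, deque

# Hypothetical (subject, predicate, object) triples. None of these
# names come from the talk; they only illustrate the shape of the data.
TRIPLES = [
    ("user:alice", "uses", "client:tv-app"),
    ("client:tv-app", "calls", "service:playback"),
    ("service:playback", "calls", "service:license"),
    ("service:license", "runs_on", "cluster:a"),
    ("service:playback", "runs_on", "cluster:b"),
    ("client:web", "calls", "service:search"),
    ("service:search", "runs_on", "cluster:b"),
]


def build_reverse_index(triples):
    """Map each object to the set of subjects that point at it."""
    index = defaultdict(set)
    for subject, _predicate, obj in triples:
        index[obj].add(subject)
    return index


def impacted_by(node, triples):
    """Return everything that depends on `node`, directly or transitively."""
    reverse = build_reverse_index(triples)
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for dependant in reverse[current]:
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return seen


if __name__ == "__main__":
    # Which services, clients, and users sit downstream of a failing cluster?
    for name in sorted(impacted_by("cluster:a", TRIPLES)):
        print(name)
```

The point of the sketch is the ordering argument: a question like "who is affected if this cluster degrades?" only becomes a cheap traversal once the relationships have been written down, whatever model later reads the answer.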

[Image: a tower under construction, with solid blue foundation blocks, a yellow middle layer being assembled by cranes, and a pink scaffolding outline at the top representing the not-yet-built future]

Meta avoids the "self-healing" label entirely. Ian Thomas from Reality Labs presented an "Assess and Grow" maturity model that addresses senior engineers' concerns about code quality and review fatigue.

"If you're doing things badly in the first place, you're only going to be doing things much worse."

Google's position has two strands. On the infrastructure side, their Vertex AI Agent Builder shipped prebuilt plugins to help AI agents self-heal. That's agents healing themselves, not classical infrastructure self-healing. On the code side, Sundar Pichai claimed that 75% of Google's new code is now AI-generated and approved by engineers. What that figure doesn't tell us is how much of the AI-generated code gets rewritten before it ships.

What the vendors say

The IDC 2026 AIOps MarketScape noted that New Relic's SRE Agent features "human-in-the-loop controls, approval gates, and example workflows for automated rollbacks." That's not fully autonomous remediation.

An independent comparison of AI in observability says of Dynatrace's Davis AI:

"Davis can trigger workflows in response to problems, but the 'self-healing' positioning overpromises. Most enterprises use it to page the right team, not to auto-fix." On New Relic: "Auto-remediation capabilities are minimal."

Their summary applies across the board:

"No vendor has cracked truly autonomous operations yet, and any sales pitch claiming otherwise is ahead of the product reality."

What's actually shipping, per a 2026 buyer's guide: AI SRE tools reason toward root cause, generate timeline reconstructions, and recommend remediation with audit trails. The buyer's guide describes a "trust-gradient model: start with AI-assisted investigation, expand to human-approved remediation, then bounded autonomous execution." Most organisations are still on step one.
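The trust gradient is easier to reason about as a policy gate than as a slogan. Below is a minimal sketch under stated assumptions: the three level names paraphrase the buyer's guide, while `decide`, `BOUNDED_ACTIONS`, and the action strings are hypothetical and not taken from any vendor's product.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """The three steps of the trust gradient, in the order they are adopted."""
    ASSISTED_INVESTIGATION = 1
    HUMAN_APPROVED_REMEDIATION = 2
    BOUNDED_AUTONOMY = 3


# Hypothetical allow-list: the only actions the agent may run unattended.
BOUNDED_ACTIONS = frozenset({"rollback_deployment"})


def decide(level, action, human_approved=False):
    """Return 'recommend', 'await_approval', or 'execute' for a proposed action."""
    if level == TrustLevel.ASSISTED_INVESTIGATION:
        # Step one: the agent only investigates and recommends.
        return "recommend"
    if human_approved:
        # Steps two and three: an explicit human approval always unlocks execution.
        return "execute"
    if level == TrustLevel.BOUNDED_AUTONOMY and action in BOUNDED_ACTIONS:
        # Step three: unattended execution, but only inside the allow-list.
        return "execute"
    return "await_approval"


if __name__ == "__main__":
    for level in TrustLevel:
        print(level.name, decide(level, "rollback_deployment"))
```

Note that in this sketch moving up a level never removes the approval path. It only adds a narrow set of actions that no longer need it.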

[Image: three ascending blocks from large to small, with a solid blue base, a solid yellow middle step, and a translucent pink top step, representing the trust gradient that most organisations have barely started climbing]

The counter-argument

The most important thing I found in this survey wasn't a vendor or a big tech talk. It was J. Paul Reed, Staff Incident Operations Manager at Chime, reviving Lisanne Bainbridge's 1983 paper Ironies of Automation at both SRECon and QCon London 2026.

[Image: the Bainbridge Paradox. On the left a person leans casually against a simple machine; on the right the same person is alert and equipped with tools beside a complex automated system, with an upward arrow showing that greater automation demands greater human skill]

"When you go into an incident and your first thought is, I bet AI can solve this problem, you are betting on efficiency again, and you already lost that bet."

His empirical finding from real incidents is that over-reliance on AI can double recovery times when humans are pulled in cold after the automation fails. People who have been out of the loop take much longer to respond because they lack the context and a full understanding of what has already happened. This effect has been documented as cognitive debt and skill atrophy due to automation.

Bainbridge's structural argument still holds after 43 years:

The most successful automated systems, those with the rarest need for human intervention, are precisely the systems that require the greatest investment in human skill.

and

"What remains after automation is not a simplified role but an arbitrary residue of the most demanding, most ambiguous, and least supported work in the entire system."

The human job doesn't get easier; it gets harder and less practised.

The incident.io engineering team holds the principle that AI tools should provide "credible theories, relevant data, and paths worth exploring," not "steer people into autopilot mode." When their agent reaches a conclusion, it shows how it got there. As they put it:

"No 'just trust me' vibes."

Where are we now?

The vendor narrative is "AI now makes self-healing real." The practitioner narrative is: we're getting better at the steps before self-healing. The things to get right first are connected observability data, knowledge graphs, and AI-assisted triage. Then we can aim for better AI-generated PR suggestions and eventually fully self-healing systems. Every serious organisation keeps humans in the loop. Auto-execution of remediation is rare and usually limited to well-understood actions like rolling back a recent deployment.
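What "well-understood" means for a rollback can be written down as explicit preconditions. The sketch below is an illustration only: the `Deployment` fields, the 30-minute window, and `may_auto_rollback` are assumptions of mine, not something any of the organisations above described.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of the most recent deployment of a service."""
    service: str
    version: str
    previous_version: Optional[str]
    deployed_at: datetime


# Assumed threshold: how soon after a deploy an incident must start
# for the deploy to count as the likely cause.
ROLLBACK_WINDOW = timedelta(minutes=30)


def may_auto_rollback(deployment, incident_start, window=ROLLBACK_WINDOW):
    """Allow an unattended rollback only when the action is well understood."""
    if deployment.previous_version is None:
        # Nothing known-good to roll back to.
        return False
    elapsed = incident_start - deployment.deployed_at
    # The incident must start after the deploy and within the window.
    return timedelta(0) <= elapsed <= window


if __name__ == "__main__":
    deployed = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
    deploy = Deployment("playback", "v42", "v41", deployed)
    print(may_auto_rollback(deploy, deployed + timedelta(minutes=10)))
    print(may_auto_rollback(deploy, deployed + timedelta(hours=3)))
```

Anything that fails the check falls back to a human decision, which is where most remediation still lives today.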

Self-healing software is a direction of travel. The interesting work is in the enablers (ontologies, knowledge graphs, structured incident memory) and in human-AI workflow design. As I start to build these enablers and move in the direction of self-healing software, I do wonder how Bainbridge's paradox will play out. When the automation gets better, does the human job become harder?