AI Incident Response: What to Do When Your Model Fails in Production
When an AI model fails in production, the failure is often silent — no error code, just degrading outputs. This guide is a practical incident response playbook for ML and LLM systems: detection, severity classification, rollback, stakeholder communication, and post-incident review, built for technical leaders who need to extend their existing incident processes to cover AI-specific failure modes.

Production AI systems fail. Not if, but when. A model that performs well in testing can degrade in ways that traditional software monitoring only partly catches: a latency spike under load will show up on a dashboard, but hallucination rates creeping up, a safety filter blocking legitimate requests, or a recommendation engine quietly drifting from the data distribution it was trained on will not. When that happens, you need a playbook, not a post-mortem started three days later.
This guide is a practical incident response framework for ML and LLM systems in production. It covers detection, severity classification, rollback, stakeholder communication, and post-incident review. It is written for technical leaders who already have software incident response processes and need to extend them to cover the specific failure modes AI introduces.
Why AI Incidents Are Different from Software Incidents
AI failures are qualitatively different from traditional software bugs. A software bug typically produces a hard error — a 500 status code, an exception, a test that fails deterministically. An AI failure is often soft and probabilistic: the system keeps returning 200 OK while the outputs degrade in quality, accuracy, or safety. This makes detection harder and makes the blast radius harder to measure.

The core differences that shape your response process:
- Failure is probabilistic, not binary. A model rarely crashes outright; more often it drifts. You may not know it has failed until users or downstream systems surface the symptoms.
- Root cause is often upstream. Input distribution shift, a changed upstream data pipeline, or a prompt injection in user input can degrade output quality without any change to the model itself.
- Rollback is more complex. Rolling back a software deployment is well-understood. Rolling back to a previous model version requires that version to be available and tested, with serving infrastructure ready to run it.
- The failure mode taxonomy is different. Latency degradation, hallucination spikes, safety filter miscalibration, embedding drift, context window misuse, and retrieval quality failures are AI-specific failure classes.
Your incident response process needs to account for all of these.
Step 1: Detection — Knowing When Something Has Gone Wrong
Detecting an AI incident early requires purpose-built observability layered on top of your standard infrastructure monitoring. Standard APM tools will tell you about latency and error rates. They will not tell you that your LLM is hallucinating product prices or that your recommendation model has started surfacing irrelevant results.

What to monitor
A production AI system should emit signals across four layers:
Infrastructure signals — latency (p50, p95, p99), error rates, throughput, token consumption (for LLM APIs), queue depth, and GPU/CPU utilisation. These are the first things that alert on hard failures.
Output quality signals — these require application-level instrumentation. Depending on your system, this might include: response length distributions, refusal rates (for LLMs), confidence score distributions, semantic similarity of outputs over time, downstream business metrics like click-through rate or conversion, or user feedback signals like thumbs down and explicit corrections.
Data pipeline signals — for systems that consume a retrieval layer (RAG) or live feature data, monitor the freshness, schema, and statistical distribution of inputs. A changed upstream schema or a stale embedding index is a common AI incident trigger.
Safety and policy signals — track safety filter trigger rates, content policy violations flagged, and any anomalous patterns in user inputs (prompt injection attempts, jailbreak probes).
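As a rough illustration of the output quality layer, the sketch below aggregates one monitoring window of logged responses into a few of the signals listed above. The record fields (`text`, `thumbs_down`), the `REFUSAL_MARKERS` phrases, and the function names are assumptions made for this example, not part of any particular logging library. Real refusal detection would normally rely on a classifier or provider-side flags rather than string matching.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical refusal phrases, for illustration only.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


@dataclass
class QualitySnapshot:
    n_responses: int
    refusal_rate: float
    mean_length: float
    thumbs_down_rate: float


def summarise_window(responses: list[dict]) -> QualitySnapshot:
    """Aggregate one monitoring window of logged responses.

    Each response dict is assumed to carry 'text' (str) and
    'thumbs_down' (bool).
    """
    if not responses:
        raise ValueError("empty window: nothing to summarise")
    n = len(responses)
    refusals = sum(
        any(marker in r["text"].lower() for marker in REFUSAL_MARKERS)
        for r in responses
    )
    return QualitySnapshot(
        n_responses=n,
        refusal_rate=refusals / n,
        mean_length=mean(len(r["text"]) for r in responses),
        thumbs_down_rate=sum(bool(r["thumbs_down"]) for r in responses) / n,
    )
```

Emitting a snapshot like this per window gives you a time series for each signal, which is what the baselining step needs as input.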
Alert thresholds and baselines
You cannot alert on output quality signals without baselines. Before you go to production, instrument your system and establish baseline distributions for the signals above. Alert on statistically significant deviations — not arbitrary thresholds. A spike in refusal rate might be benign (a viral prompt pattern) or a sign of miscalibration. You need baseline data to tell the difference.
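One simple way to express "statistically significant deviation" for a rate signal is a z-score against the baseline rate. The sketch below assumes a refusal-rate baseline measured before go-live and uses a normal approximation to the binomial, which is only reasonable when the window is large enough to expect at least a handful of refusals. The function names and the default threshold of 3 are illustrative choices, not a recommendation for every system.

```python
import math


def refusal_rate_z(baseline_rate: float, window_refusals: int, window_total: int) -> float:
    """z-score of the window's refusal rate against a fixed baseline rate."""
    if window_total <= 0:
        raise ValueError("window_total must be positive")
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    observed = window_refusals / window_total
    # Standard error of a proportion under the baseline rate.
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / window_total)
    return (observed - baseline_rate) / std_err


def should_alert(z: float, threshold: float = 3.0) -> bool:
    """Alert on large deviations in either direction."""
    return abs(z) >= threshold
```

With a 2% baseline and a window of 1,000 requests, 45 refusals is far outside normal variation, while 22 refusals is not. A fixed threshold such as "alert above 3%" cannot make that distinction across different window sizes.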
Alerts on hard failures should page your on-call rotation. AI quality signals that degrade slowly may not justify waking anyone at 3am, but they should still create tickets and be reviewed in a daily ops check.
Step 2: Severity Classification
Not every AI anomaly is a P1. Before you page your entire engineering team, classify the incident. A clear severity framework prevents both under-reaction (ignoring a safety failure because it looks small) and over-reaction (all-hands for a cosmetic output degradation).
A working severity taxonomy for AI incidents:
| Severity | Description | Example | Response target |
|---|---|---|---|
| P1 — Critical | Safety failure or data exposure. User harm possible. Regulatory exposure. | Safety filter disabled, PII leaking in outputs, model producing harmful outputs in response to adversarial prompts | Immediate rollback or kill switch. Comms within 30 minutes. |
| P2 — High | Core product functionality broken or severely degraded. | LLM returning empty responses, recommendation engine serving 100% identical results, RAG retrieval broken | Rollback within 1-2 hours. Internal comms within 1 hour. |
| P3 — Medium | Quality degradation visible to users but system functional. | Hallucination rate elevated but not harmful, response latency up 3x but within SLA | Investigate and fix within 1 business day. Monitor closely. |
| P4 — Low | Marginal quality drift, no user-visible impact. | Output length distribution slightly shifted, confidence scores trending lower | Backlog ticket. Review in next sprint. |
The classification should happen fast — within 10-15 minutes of an alert. Assign an incident commander (the on-call engineer or ML engineer on rotation) who owns classification and coordinates response.
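To keep classification fast, some teams encode the taxonomy as a small decision function the incident commander can run through in order. The sketch below is one possible encoding of the table above; the three yes/no questions in `IncidentFacts` are an assumed simplification, and real classification will still need judgement.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


@dataclass(frozen=True)
class IncidentFacts:
    safety_or_data_exposure: bool   # user harm or regulatory exposure possible
    core_function_broken: bool      # core product functionality broken or severely degraded
    user_visible_degradation: bool  # quality drop users can see, system still functional


# Response targets copied from the severity table.
RESPONSE_TARGET = {
    Severity.P1: "Immediate rollback or kill switch. Comms within 30 minutes.",
    Severity.P2: "Rollback within 1-2 hours. Internal comms within 1 hour.",
    Severity.P3: "Investigate and fix within 1 business day. Monitor closely.",
    Severity.P4: "Backlog ticket. Review in next sprint.",
}


def classify(facts: IncidentFacts) -> Severity:
    """Check the most severe condition first so a safety issue is never downgraded."""
    if facts.safety_or_data_exposure:
        return Severity.P1
    if facts.core_function_broken:
        return Severity.P2
    if facts.user_visible_degradation:
        return Severity.P3
    return Severity.P4
```

Ordering the checks from most to least severe means a small-looking safety failure still classifies as P1, which guards against the under-reaction case.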
Step 3: Immediate Response — Contain First, Diagnose Second
The first priority in any production incident is containment. Understand what the failure is doing and stop it from spreading before you try to understand why it happened.
Containment options for AI systems
Kill switch / feature flag — the simplest and fastest containment. Every AI feature in production should sit behind a feature flag that can be toggled off without a deployment. If your AI feature is failing, turn it off and fall back to the non-AI behaviour. Users get a degraded but functioning product. This should be achievable in under five minutes.
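A minimal sketch of the kill switch pattern follows. `FlagStore` is an in-memory stand-in for whatever feature flag service you use, and `with_kill_switch` and the flag name in the usage note are hypothetical. Falling back when the AI path raises an exception is an extra design choice in this sketch, not something the flag alone provides.

```python
from typing import Callable


class FlagStore:
    """In-memory stand-in for a real feature flag service (hypothetical)."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, False)


def with_kill_switch(
    flags: FlagStore,
    flag_name: str,
    ai_path: Callable[[str], str],
    fallback_path: Callable[[str], str],
) -> Callable[[str], str]:
    """Wrap an AI feature so it can be switched off without a deployment."""

    def handler(request: str) -> str:
        if not flags.is_enabled(flag_name):
            return fallback_path(request)
        try:
            return ai_path(request)
        except Exception:
            # Assumed behaviour: hard failures also degrade to the fallback.
            return fallback_path(request)

    return handler
```

In use, the handler is built once at startup, for example `with_kill_switch(flags, "ai_summary", generate_summary, plain_excerpt)`, and the on-call engineer only ever touches the flag.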
Traffic throttling — if the failure appears related to load (latency degradation, token budget exhaustion), reduce traffic to the AI system. Route a percentage of requests to a fallback path.
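Percentage-based routing can be sketched with a stable hash of a request or user identifier, so the same identifier always takes the same path while the percentage is unchanged. The function name and the choice of SHA-256 are assumptions for this example.

```python
import hashlib


def routes_to_ai(request_id: str, ai_traffic_percent: int) -> bool:
    """Deterministically send roughly ai_traffic_percent% of requests to the AI path."""
    if not 0 <= ai_traffic_percent <= 100:
        raise ValueError("ai_traffic_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < ai_traffic_percent
```

Dropping `ai_traffic_percent` from 100 to 25 sends about three quarters of requests to the fallback path. Hashing on a user identifier rather than a request identifier keeps each user's experience consistent during the incident.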
Model version rollback — if the failure was triggered by a model update, a prompt change, or a RAG index refresh, roll back to the last known good configuration. This requires you to have versioned your model artefacts, prompts, and retrieval indices. If you have not, you cannot roll back, which is a gap to close before your next incident.
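One way to make "last known good configuration" concrete is to version the model, prompt, and retrieval index together as a single release record. The sketch below assumes that convention; `Release`, `ReleaseHistory`, and the version strings in the usage note are hypothetical names, not an existing registry API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Release:
    model_version: str
    prompt_version: str
    index_version: str
    known_good: bool = False


class ReleaseHistory:
    """Ordered record of deployed configurations (in-memory for illustration)."""

    def __init__(self) -> None:
        self._releases: list[Release] = []

    def deploy(self, release: Release) -> None:
        self._releases.append(release)

    def mark_current_good(self) -> None:
        """Call once the current release has been observed healthy in production."""
        self._releases[-1] = replace(self._releases[-1], known_good=True)

    def rollback_target(self) -> Release:
        """Most recent known-good release before the current one."""
        for release in reversed(self._releases[:-1]):
            if release.known_good:
                return release
        raise LookupError("no known-good release to roll back to")
```

If a prompt change from `p1` to `p2` triggers the incident, `rollback_target()` returns the release that still pairs `p1` with the model and index it was tested against. The `LookupError` is the "you cannot roll back" case made explicit.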
Input/output filtering — if the failure is a safety or policy issue limited to a specific input pattern, a temporary filter on that pattern can contain the blast radius while a proper fix is prepared.
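A temporary input filter can be as small as one pattern check in front of the model call. The pattern and fallback message below are purely illustrative, and a single regular expression is easy to evade, so treat this as short-term containment rather than a fix.

```python
import re
from typing import Callable

# Hypothetical pattern for the input that triggers the failure.
BLOCKED_INPUT = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
FALLBACK_MESSAGE = "This request can't be processed right now."


def guarded_call(user_input: str, model_call: Callable[[str], str]) -> str:
    """Short-circuit inputs matching the incident pattern; pass everything else through."""
    if BLOCKED_INPUT.search(user_input):
        return FALLBACK_MESSAGE
    return model_call(user_input)
```

Logging each blocked request alongside the incident identifier also gives you a measure of how large the affected input pattern actually is.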
What to do while containing
- Open an incident channel (Slack, Teams, or equivalent) immediately. All coordination happens there and is logged.
- Assign roles: incident commander, comms lead, technical lead.
- Start a running log of actions taken and timestamps. You will need this for the post-incident review and potentially for regulatory purposes.
- Do not deploy additional changes to the affected system during the incident unless they are the rollback itself.
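The running log does not need tooling beyond timestamps, actors, and actions. If your incident channel does not capture this reliably, a structure like the sketch below is enough; `IncidentLog` and its fields are assumptions for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    incident_id: str
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, actor: str, action: str, at: Optional[datetime] = None) -> None:
        """Append an action; timestamps default to the current time in UTC."""
        timestamp = at or datetime.now(timezone.utc)
        self.entries.append((timestamp, actor, action))

    def render(self) -> str:
        """Chronological, one line per action, ready to paste into the PIR."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{ts.isoformat()} [{actor}] {action}" for ts, actor, action in ordered
        )
```

Recording in UTC avoids ambiguity when responders sit in different time zones or when the log is later read by a regulator.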
Step 4: Diagnosis — Finding the Root Cause
Once the incident is contained, you have breathing room to diagnose. AI root cause analysis requires a different mental model than software debugging. Work through these layers systematically:
Deployment change
Was anything deployed in the window before the incident? A new model version, a prompt template change, a RAG index rebuild, a feature flag change, a dependency update? The deployment log is your first stop. Most production AI incidents are triggered by a change — even a small one.
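If the deployment log is queryable, the first diagnostic step can be a simple filter for changes in the window before the incident. The record shape (`change`, `deployed_at`) and the 24-hour default lookback are assumptions in this sketch.

```python
from datetime import datetime, timedelta


def changes_before(
    deploy_log: list[dict],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),
) -> list[dict]:
    """Return changes deployed in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [
        change
        for change in deploy_log
        if window_start <= change["deployed_at"] <= incident_start
    ]
    return sorted(recent, key=lambda change: change["deployed_at"], reverse=True)
```

This only works if prompt changes, index rebuilds, and flag changes are written to the same log as code deployments, which is worth checking before you need it.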
Input distribution shift
Did the inputs to the system change? A new user cohort, a changed upstream data pipeline, a new integration sending differently structured requests? Statistical drift in inputs can cause a well-trained model to produce degraded outputs without any change to the model itself.
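One common way to quantify drift in a single numeric input is the population stability index (PSI), which compares binned proportions between a baseline sample and a current sample. The sketch below is a plain-Python version under stated assumptions: equal-width bins taken from the baseline range, out-of-range values clamped to the edge bins, and a small floor to avoid taking the logarithm of zero. A widely used rule of thumb treats PSI above roughly 0.25 as a significant shift, but that cut-off should be checked against your own baselines.

```python
import math


def population_stability_index(
    baseline: list[float], current: list[float], bins: int = 10
) -> float:
    """PSI between a baseline sample and a current sample of one numeric feature."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp values outside the baseline range
            counts[index] += 1
        # Floor empty bins so the log term below stays finite.
        return [max(count / len(values), 1e-6) for count in counts]

    base_p = proportions(baseline)
    curr_p = proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_p, curr_p))
```

Running this per feature (or per embedding summary statistic) on a schedule turns "did the inputs change?" from a hunch into a number you can compare across days.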
Retrieval or context quality
For RAG-based systems: is the retrieval layer returning relevant, fresh, well-formed context? A stale or corrupted document index, a changed embedding model, or a schema mismatch between retrieval and generation can cause quality failures that look like model failures.
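Several of these retrieval checks can be automated against index metadata. The sketch below assumes your index build writes a small metadata record with `built_at`, `embedding_model`, and `document_count`; those field names, the model identifiers in the usage note, and the 24-hour freshness limit are all hypothetical.

```python
from datetime import datetime, timedelta


def retrieval_health_issues(
    index_meta: dict,
    query_embedding_model: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    issues = []
    age = now - index_meta["built_at"]
    if age > max_age:
        issues.append(f"stale index: built {age} ago")
    if index_meta["embedding_model"] != query_embedding_model:
        issues.append(
            "embedding model mismatch: index uses "
            f"{index_meta['embedding_model']}, queries use {query_embedding_model}"
        )
    if index_meta["document_count"] == 0:
        issues.append("empty index")
    return issues
```

A mismatch such as an index built with `embed-v1` while queries are embedded with `embed-v2` produces plausible-looking but irrelevant retrievals, which is why it tends to be mistaken for a model failure.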
Upstream data pipeline
For models consuming live feature data: is the upstream data pipeline healthy? Stale features, null values where the model expects numeric inputs, or schema changes can cause silent degradation.
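A lightweight validation step in front of the model can surface nulls and schema changes as explicit problems instead of silent degradation. The feature names below are hypothetical, and the check covers only the cases named above (missing, null, non-numeric, unexpected fields); staleness would need a timestamp on the feature row.

```python
import math

# Hypothetical feature names; substitute the numeric features your model expects.
EXPECTED_NUMERIC_FEATURES = ("account_age_days", "orders_last_30d")


def feature_problems(row: dict) -> list[str]:
    """Return schema and value problems for one feature row; empty list if clean."""
    problems = []
    for name in EXPECTED_NUMERIC_FEATURES:
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        if value is None:
            problems.append(f"{name}: null")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: non-numeric ({type(value).__name__})")
        elif math.isnan(value):
            problems.append(f"{name}: NaN")
    for name in sorted(set(row) - set(EXPECTED_NUMERIC_FEATURES)):
        problems.append(f"{name}: unexpected field")
    return problems
```

Counting rows with problems per window gives you a data pipeline signal that can be baselined and alerted on like any other.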
External API or model provider
If you are using a third-party LLM API (OpenAI, Anthropic, Google, etc.), check their status page. Provider-side latency, model version updates, or safety policy changes can affect your system's behaviour without any change on your side.
Step 5: Stakeholder Communication
AI incidents require clear, honest, and timely communication — internally to your team and externally to affected users or customers. Poor communication during an incident is often remembered longer than the incident itself.
Internal communication
Keep leadership informed from the moment you classify an incident as P1 or P2. A brief, factual update every 30 minutes is better than silence followed by a detailed post-mortem. The incident channel log should be accessible to anyone who needs to know the status.
External communication
The threshold for external communication depends on the severity and the nature of your product. Use these principles:
- Be honest about what happened. If your AI system produced incorrect information, say so. Do not obscure AI involvement in the failure.
- Be specific about impact. Tell users what was affected, when, and whether their data or decisions may have been compromised.
- Avoid technical blame-shifting. "Our AI vendor had an issue" may be accurate but is not the whole story — you chose and integrated that vendor.
- Communicate remediation. Tell users what you have done to fix the issue and what you are doing to prevent recurrence.
For regulated industries such as financial services, healthtech, and legal tech, there may be mandatory breach or incident notification obligations under Australian law. Understand your obligations before an incident, not during one. The Privacy Act 1988 (Cth) and its Notifiable Data Breaches scheme are the baseline; sector-specific obligations (APRA CPS 230, for example) may impose tighter requirements.
Step 6: Post-Incident Review
A post-incident review (PIR) — sometimes called a post-mortem — is how your team learns from the incident and prevents recurrence. For AI systems, the standard software PIR template needs to be extended.
When to run it
Run the review within 48-72 hours of resolution for P1 and P2 incidents, while the detail is fresh. The goal is learning, not blame.
What to cover
Timeline — a factual, timestamped account of events from first signal to resolution. What happened, in order.
Detection gap — how long between the failure starting and the alert firing? If it was hours or days, why? What monitoring gap allowed that?
Root cause — the actual technical root cause, and the contributing organisational causes. Most incidents have both.
AI-specific root cause analysis — standard software PIR templates ask about code changes and infrastructure. For AI incidents, also ask:
- Was the model evaluated against the input distribution it encountered in production?
- Were there data pipeline assumptions that were not validated?
- Were prompts version-controlled and tested before the change that triggered the incident?
- Did the retrieval layer have quality monitoring?
- Was there a kill switch available? Was it used? If not, why not?
Action items — specific, assigned, time-boxed. Not "improve monitoring" — "instrument hallucination rate metric and alert at 2x baseline, owned by [name], due [date]."
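The timeline and detection gap questions above reduce to a few intervals that are worth computing the same way for every incident, so they can be compared over time. The sketch below assumes four timestamps taken from the incident log; the function and key names are illustrative.

```python
from datetime import datetime, timedelta


def pir_intervals(
    failure_started: datetime,
    alert_fired: datetime,
    contained: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Compute the headline intervals for a post-incident review."""
    points = [failure_started, alert_fired, contained, resolved]
    if points != sorted(points):
        raise ValueError("timeline events must be in chronological order")
    return {
        "detection_gap": alert_fired - failure_started,
        "time_to_contain": contained - alert_fired,
        "time_to_resolve": resolved - alert_fired,
    }
```

A six-hour detection gap followed by a four-minute containment tells you the kill switch worked and the monitoring did not, which points the action items at the right place.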
Blameless culture
AI systems are complex. Incidents happen to teams with good people and good intentions. A blameless PIR process, focused on systemic causes and system improvements rather than individual errors, produces better outcomes and a better organisational safety culture.
The Operational Foundations That Make Response Faster
Incident response speed is largely determined by what you built before the incident. Teams that respond well to AI incidents typically have these foundations in place:
Model and prompt versioning — every model artefact, prompt template, and RAG index has a version identifier and can be rolled back.
Feature flags on all AI features — every AI-powered feature can be turned off without a deployment.
Baseline observability — output quality signals are baselined before go-live and monitored continuously.
Runbooks — documented procedures for common failure scenarios. On-call engineers should not be improvising at 2am.
Staged rollouts — model updates and prompt changes go to a small percentage of traffic before full rollout. This limits blast radius when something goes wrong.
Human-in-the-loop for high-stakes outputs — in domains where a wrong answer has serious consequences (medical, financial, legal), build human review checkpoints before AI output reaches end users or triggers downstream actions.
These are the same foundations that underpin good AI engineering practice more broadly — incident response readiness is not a separate concern from production engineering quality.
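Of the foundations above, staged rollouts are the easiest to sketch in code. The example below gates each traffic increase on the canary's error rate staying within a tolerance of the baseline. The stage percentages, the 1.5x tolerance, and the use of error rate as the gating metric are all assumptions; any baselined quality signal could take its place.

```python
# Hypothetical rollout stages, as a percentage of traffic.
ROLLOUT_STAGES = [1, 5, 25, 100]


def next_stage(
    current_percent: int,
    canary_error_rate: float,
    baseline_error_rate: float,
    tolerance: float = 1.5,
) -> int:
    """Advance one stage if the canary looks healthy; otherwise drop to 0% (roll back)."""
    if canary_error_rate > baseline_error_rate * tolerance:
        return 0
    index = ROLLOUT_STAGES.index(current_percent)  # raises ValueError for unknown stages
    return ROLLOUT_STAGES[min(index + 1, len(ROLLOUT_STAGES) - 1)]
```

Because a failed gate returns 0 rather than holding at the current stage, a bad model or prompt change is exposed to at most the current stage's share of traffic before it is pulled.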
Building AI Incident Response Into Your Broader AI Strategy
AI incident response should not be an afterthought added after your model is live. It should be designed into your AI product from the beginning — as part of your AI product strategy and your production engineering standards.
The teams that handle AI incidents well are usually the same teams that spent time on observability design, failure mode analysis, and operational runbooks before they launched. The teams that handle them poorly are usually the ones who shipped fast and assumed the model would just work.
Production AI is genuinely hard. The failure modes are non-obvious, the monitoring is non-trivial, and the rollback paths require upfront investment to exist. That is not a reason to avoid production AI — it is a reason to take the operational side as seriously as the model performance side.
For more thinking on the foundations that make AI production-ready, see our insights on data infrastructure, MLOps, and AI engineering.
Summary: The AI Incident Response Checklist
| Phase | Key actions |
|---|---|
| Detection | Monitor infrastructure, output quality, data pipeline, and safety signals. Establish baselines before go-live. |
| Classification | Assign P1–P4 severity within 15 minutes. Assign incident commander. |
| Containment | Use kill switch or feature flag first. Roll back if a clear change triggered the incident. |
| Diagnosis | Check deployment log, input distribution, retrieval quality, upstream data, and external providers. |
| Communication | Honest, timely internal updates. External comms proportional to impact. Understand legal obligations. |
| Post-incident review | Run within 48-72 hours. Cover AI-specific root causes. Produce specific, assigned action items. |
| Prevention | Version everything. Feature-flag all AI features. Baseline and monitor output quality. Write runbooks. Stage rollouts. |
If your team is building AI systems toward production and wants to get the operational foundations right from the start — monitoring, rollback paths, staging, and incident process — we can help. Getting this right before launch is significantly cheaper than rebuilding it after your first major incident.
Chris Kerr
Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.


