What does it mean to turn a production failure into an eval case?

It means capturing the exact input, context, and trace that caused a failure, then writing a test that asserts the correct behavior. The test runs on every change from then on, so the same failure cannot silently return.

How is an AI eval feedback loop different from regular software testing?

Regular tests check deterministic code paths. An eval feedback loop tests probabilistic agent behavior against real production cases, mixing hard assertions with model-graded checks. The suite grows from incidents rather than being written upfront.

What are the main types of AI agent failures?

Most failures fall into seven types: prompt, tool, retrieval, permission, routing, workflow, and policy. Tagging each incident by type tells you which layer to fix and which evals you are missing.

How often should we review AI agent failures?

A weekly cadence works for most mid-market deployments. It is frequent enough to catch drift early and infrequent enough that you batch incidents, find patterns, and write evals against classes of failure rather than one-offs.

Who should own the eval suite for a production AI agent?

One named owner, usually the engineer or operator responsible for the workflow. Shared ownership means no ownership. The owner runs the weekly review and enforces the rule that no incident closes without an eval.

Every Production Failure Should Become an Eval Case: The AI Agent Feedback Loop That Compounds

A support agent at a mid-market SaaS company approved a $4,000 refund that broke policy. The team traced it within an hour, edited the system prompt to add a refund ceiling, shipped the fix before lunch, and closed the ticket. Six weeks later a near-identical refund slipped through on a differently shaped request. Same root cause, new clothing. The prompt edit was real work, and it was correct. The problem is that it left no trace the system could check itself against, so the agent had no memory of ever having been wrong.

This is the quiet failure mode of most AI agent programs. Teams fix incidents one at a time, get a little faster at firefighting, and never get safer. The agents that actually improve in production share one habit that has nothing to do with model choice or framework: every failure becomes a permanent eval case before anyone is allowed to call it resolved. The fix is not the prompt edit. The fix is the regression test that guarantees the prompt edit stays true.

The Failure You Fix Twice Is the Failure You Never Captured

There is a reason software engineering converged on regression tests decades ago. When you find a bug, you do not just patch it. You write a test that fails before the patch and passes after, so the bug can never return without someone noticing. The patch fixes today. The test fixes forever.

AI agents need the same discipline, and they need it more, not less. A deterministic function either returns the right value or it does not. An agent's behavior shifts when you change a prompt, swap a model version, add a tool, or update a knowledge base. Any of those changes can quietly reintroduce a failure you fixed three months ago, and you will not know until a customer tells you. Without a growing eval suite built from your own incidents, you are not improving the system. You are redecorating it.

The reframe is simple and it changes how teams operate: a production incident is not a fire to put out. It is a free, perfectly realistic test case that your users wrote for you. The only question is whether you keep it. Most teams throw it away the moment the ticket closes. The ones who keep it build a suite that gets sharper every week, anchored to behavior that actually matters because every case came from a real failure.

A Taxonomy of Agent Failures

You cannot fix what you cannot classify. The single most useful artifact in a mature agent program is a shared vocabulary for failure, because the right fix depends entirely on which layer broke. A retrieval problem dressed up as a prompt problem leads to weeks of prompt tweaking that never works.

Seven categories cover almost everything we see in production:

Failure type	What actually broke	Example
Prompt	Instructions were ambiguous or incomplete; the model did the wrong thing with the right inputs	Agent summarized a contract but ignored the cancellation clause because the prompt never asked for it
Tool	A tool call was malformed, the wrong tool was chosen, or a tool error was ignored	Agent called the "create invoice" API twice after the first call timed out
Retrieval	The context layer returned stale, wrong, or missing information	RAG surfaced last quarter's pricing because the index was never refreshed
Permission	The agent acted beyond its allowed scope, or was blocked when it should have acted	Agent read a customer record it had no business accessing during a routine lookup
Routing	The wrong path, branch, or sub-agent was chosen	A billing question got routed to the technical troubleshooting flow
Workflow	Orchestration, state, or sequencing broke	A multi-step approval skipped the human gate after a retry reset the state
Policy	The output violated a business rule even though the data was correct	The refund example: right numbers, right customer, wrong decision

Tagging every incident with one of these types does two things. It points the fix at the correct layer, and it reveals the shape of your weakness over time. If 40% of your incidents are retrieval failures, no amount of prompt engineering will save you. If policy failures dominate, your guardrails are advisory when they need to be enforced. This is the same discipline that good distributed-systems teams apply to incident classification, the kind documented in Google's SRE practices: you measure the categories that let you act, not the ones that look impressive on a slide.
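To make the tagging concrete, here is a minimal sketch of how a team might tally incidents by type. It assumes each incident carries exactly one tag. The `FailureType` enum mirrors the seven categories in the table; the `share_by_type` helper and the sample `incidents` list are hypothetical names and made-up data, used only for illustration.

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    """The seven failure categories from the taxonomy table."""
    PROMPT = "prompt"
    TOOL = "tool"
    RETRIEVAL = "retrieval"
    PERMISSION = "permission"
    ROUTING = "routing"
    WORKFLOW = "workflow"
    POLICY = "policy"


def share_by_type(tags):
    """Return (failure type, share of incidents) pairs, largest share first."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ftype, count / total) for ftype, count in counts.most_common()]


# Illustrative sample only: five tagged incidents from a hypothetical week.
incidents = [
    FailureType.RETRIEVAL,
    FailureType.RETRIEVAL,
    FailureType.POLICY,
    FailureType.ROUTING,
    FailureType.PROMPT,
]

for ftype, share in share_by_type(incidents):
    print(f"{ftype.value:<10} {share:.0%}")
```

In this sample, retrieval leads the list, which is the signal to look at the context layer before touching the prompt.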

The Weekly Failure Review

A taxonomy without a ritual is just a spreadsheet. The mechanism that turns failures into improvement is a recurring review with one non-negotiable rule: no incident is closed until it has an eval.

Borrow the structure from blameless postmortems, the practice popularized when John Allspaw's team at Etsy made the case that you learn nothing from incidents when people are busy protecting themselves. His write-up on blameless postmortems and a just culture is still the clearest argument for why blame destroys the signal you need. The agent version of this review is lighter weight but follows the same spirit. Once a week, the owner of the workflow pulls every incident from the last seven days and runs each one through four questions:

What was the input, in full, including the trace and tool calls? If you cannot reconstruct it, your observability is the first thing to fix.
Which of the seven types is this? One tag per incident, chosen deliberately.
What is the correct behavior, stated precisely enough to test? "Be more careful" is not testable. "Refunds above $500 require a human approval step" is.
What eval captures this, and who owns it? The case does not close until this line is filled in.
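The four questions map onto a small record, and the closing rule becomes a check that refuses to pass while any answer is blank. This is a sketch under assumptions: the `ReviewRecord` class, its field names, and the `unanswered` and `can_close` helpers are hypothetical, not the schema of any particular incident tracker.

```python
from dataclasses import dataclass
from typing import List, Optional

FAILURE_TYPES = {
    "prompt", "tool", "retrieval", "permission",
    "routing", "workflow", "policy",
}


@dataclass
class ReviewRecord:
    """One incident's answers to the four review questions (hypothetical schema)."""
    incident_id: str
    trace_ref: Optional[str] = None         # Q1: where the full input and trace are stored
    failure_type: Optional[str] = None      # Q2: exactly one of the seven types
    correct_behavior: Optional[str] = None  # Q3: a testable statement of the right behavior
    eval_id: Optional[str] = None           # Q4: the eval case that captures it
    eval_owner: Optional[str] = None        # Q4: who owns that eval


def unanswered(record: ReviewRecord) -> List[str]:
    """List the review questions that still block closing this incident."""
    gaps = []
    if not record.trace_ref:
        gaps.append("input and trace")
    if record.failure_type not in FAILURE_TYPES:
        gaps.append("failure type")
    if not record.correct_behavior:
        gaps.append("correct behavior")
    if not (record.eval_id and record.eval_owner):
        gaps.append("eval and owner")
    return gaps


def can_close(record: ReviewRecord) -> bool:
    """An incident closes only when every question has an answer."""
    return not unanswered(record)
```

A record with a trace, a tag, and a precise statement of correct behavior still fails `can_close` until the eval and its owner are filled in, which is the point of the rule.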

The weekly cadence matters more than it might seem. It is frequent enough to catch drift before it compounds and infrequent enough that you batch incidents and see patterns. Three separate routing failures in one week are not three tickets. They are one missing eval that covers a class of inputs. Reviewing daily turns you into a firefighter. Reviewing monthly lets small failures harden into customer churn. Weekly is the cadence where the loop starts to compound, and it fits naturally into the operating model you set up after launch.

The output of every review is the same: the suite is bigger than it was last week, and the new cases came from reality rather than imagination.

Turning a Failure Into a Regression Test

Here is where most teams stall. They agree the idea is good and then never write the test, because they imagine eval engineering is harder than it is. It is not. A useful eval case has five parts, and you already have four of them from the incident itself.

The input. The real production trace that caused the failure. Not a sanitized approximation. The actual messy request, with the same context the agent saw.
The assertion. What must be true about the output. For the refund case: the agent must not approve, and it must route to human approval. This can be a hard check (a string the output must or must not contain, a tool that must or must not be called) or a model-graded check for fuzzier behavior.
The type tag. One of the seven, carried over from the review, so you can slice your suite by failure category.
The owner and date. Who wrote it and when, so the suite has provenance.
The expected layer. Which part of the system this case is meant to exercise, so a future failure tells you where to look.
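The five parts fit in one small structure. The sketch below assumes the agent's output can be reduced to a dict with a `tool_calls` list of tool names. The `EvalCase` class, the tool names `approve_refund` and `request_human_approval`, the owner, the date, and the input payload are all hypothetical placeholders; a real case would store the actual production trace instead.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One regression case built from a production incident (hypothetical schema)."""
    case_id: str
    input_trace: dict                  # 1. the input: the captured production trace
    assertion: Callable[[dict], bool]  # 2. the assertion: what must be true of the output
    failure_type: str                  # 3. the type tag: one of the seven categories
    owner: str                         # 4. the owner...
    created: date                      #    ...and date, for provenance
    expected_layer: str                # 5. the layer this case is meant to exercise


def refund_needs_human(output: dict) -> bool:
    """Hard check: the agent must not approve and must route to a human."""
    calls = output.get("tool_calls", [])
    return "approve_refund" not in calls and "request_human_approval" in calls


refund_case = EvalCase(
    case_id="refund-ceiling-001",
    # Placeholder payload; in practice this is the real trace, unsanitized.
    input_trace={"message": "<original customer request>", "amount_usd": 4000},
    assertion=refund_needs_human,
    failure_type="policy",
    owner="workflow-owner",      # placeholder name
    created=date(2026, 1, 5),    # placeholder date
    expected_layer="policy guardrail",
)
```

Keeping the assertion as a plain function means the same case can start as a hard check and later swap in a model-graded check without changing the rest of the record.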

Most cases start as Level 1 assertions, the cheap and fast checks that run on every change. Hamel Husain's argument that your AI product needs evals is built around exactly this progression: start with assertions, graduate to model-graded evals where the behavior is too subtle for a string match, and reserve human review for the cases that genuinely need judgment. You do not need a grand framework to begin. You need a folder of cases that grows every week and a CI step that runs them on every prompt edit, model bump, and tool change.
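A folder of cases and a CI step can be very little code. The sketch below assumes each case is a JSON file with hypothetical keys `id`, `input`, `must_call`, and `must_not_call`, and that the agent is any callable that takes the input and returns a dict with a `tool_calls` list. None of these names come from a specific framework.

```python
import json
from pathlib import Path


def check_case(case, output):
    """Apply one case's hard assertions to an agent output.

    Returns a list of human-readable problems; an empty list means the case passed.
    """
    called = set(output.get("tool_calls", []))
    problems = []
    for tool in case.get("must_call", []):
        if tool not in called:
            problems.append(f"expected tool call missing: {tool}")
    for tool in case.get("must_not_call", []):
        if tool in called:
            problems.append(f"forbidden tool call made: {tool}")
    return problems


def run_suite(case_dir, agent):
    """Run every *.json case in case_dir through agent; return problems keyed by case id."""
    failures = {}
    for path in sorted(Path(case_dir).glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        problems = check_case(case, agent(case["input"]))
        if problems:
            failures[case["id"]] = problems
    return failures
```

In CI, a small wrapper script would call `run_suite` on the cases folder and exit non-zero when the returned dict is not empty, so a prompt edit, model bump, or tool change that reintroduces a gap fails the build.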

This is the practical core of evaluation-driven development: the eval suite is not documentation you write after the fact. It is the spec, assembled from the failures your users found for you. And because it runs automatically, the refund that took six weeks to recur becomes a test that goes red the instant someone reintroduces the gap.

One prerequisite makes all of this possible: you have to capture the trace in the first place. You cannot turn a failure into an eval if you do not know what the agent saw, which tools it called, and in what order. Charity Majors has argued for years that this kind of high-cardinality observability is the difference between systems you can debug and systems you can only pray over. For agents, it is the raw material of every eval you will ever write.
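As a rough illustration of the minimum worth capturing, the sketch below records what the agent saw, which tools it called, and in what order, then serializes the run so it can be replayed later as an eval input. The `AgentTrace` and `ToolCall` classes and their fields are hypothetical; a production setup would more likely lean on an existing tracing or observability tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    order: int        # position in the run, starting at 1
    name: str         # which tool was called
    arguments: dict   # what it was called with
    result: str       # what came back, including errors


@dataclass
class AgentTrace:
    """Everything needed to reconstruct one agent run (hypothetical schema)."""
    run_id: str
    user_input: str
    context: list = field(default_factory=list)     # retrieved documents, prior turns
    tool_calls: list = field(default_factory=list)  # ToolCall records, in call order
    final_output: str = ""

    def record_tool_call(self, name, arguments, result):
        """Append a tool call, numbering it by its position in the run."""
        self.tool_calls.append(
            ToolCall(len(self.tool_calls) + 1, name, arguments, result)
        )

    def to_json(self):
        """Serialize the whole run so it can be stored next to the incident."""
        return json.dumps(asdict(self), sort_keys=True)
```

If a stored trace can be loaded back and fed to the agent unchanged, the first review question is already answered for that incident.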

Why This Is an Operational Value Lever, Not a Cleanup Chore

It is tempting to file all of this under engineering hygiene. For mid-market operators, and especially for the private-equity-backed companies investing in AI right now, it is something more specific: a durable operational asset.

The 2026 private equity story is about disciplined operational value creation rather than financial engineering. McKinsey's analysis of private markets and Bain's Global Private Equity Report both point in the same direction: with cheap leverage gone, returns increasingly come from making portfolio companies genuinely better at running themselves. An AI agent that handles support, billing triage, or contract review is exactly the kind of operational improvement that thesis depends on, but only if it keeps working as the business changes around it.

An eval suite is the proof. It is the difference between "we deployed an AI agent" and "we have an agent whose reliability is measured, improving, and defensible." When an operating partner or an acquirer asks how you know the system is getting better, the answer is not a vibe. It is a chart of eval cases over time, sliced by failure type, with the recurrence rate of fixed bugs trending to zero. That is an asset that survives staff turnover, model deprecations, and the next round of business rule changes. A clever prompt is not transferable. A growing, well-typed eval suite is institutional memory you can hand to the next owner.
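The two numbers behind that chart are simple to compute. The sketch below assumes each eval case and each incident carries an ISO week label such as "2026-W01", and that an incident stores the id of the earlier eval case it repeats in a hypothetical `recurrence_of` field. The function names are illustrative; slicing by failure type would just mean filtering the lists before calling them.

```python
from collections import defaultdict


def cumulative_suite_size(cases):
    """Total number of eval cases in the suite at the end of each week."""
    added = defaultdict(int)
    for case in cases:
        added[case["week"]] += 1
    running, sizes = 0, {}
    for week in sorted(added):
        running += added[week]
        sizes[week] = running
    return sizes


def recurrence_rate_by_week(incidents):
    """Share of each week's incidents that repeat an already-captured failure."""
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for incident in incidents:
        totals[incident["week"]] += 1
        if incident.get("recurrence_of"):
            repeats[incident["week"]] += 1
    return {week: repeats[week] / totals[week] for week in sorted(totals)}
```

A rising suite size alongside a falling recurrence rate is the pattern that shows the loop is compounding rather than just accumulating tests.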

The teams that treat each failure as a one-time fix are buying a depreciating asset. The teams that convert each failure into a permanent eval are building a compounding one. Over a year, the gap between those two approaches is the gap between an agent everyone quietly stops trusting and an agent the business plans around.

How OpenNash Can Help

This discipline is easy to describe and easy to skip when the queue is full. The pattern we build for clients makes skipping it the harder path.

Audit: We map your current agent workflows and trace coverage, then identify where failures are slipping through without capture. If you cannot reconstruct a failed run, that is the first gap we close.
Design: We define your failure taxonomy, the weekly review ritual, and the assertion-first eval structure so the loop fits your team rather than a generic template.
Build: We wire eval suites into CI so every prompt edit, model change, and tool update runs against the cases your own incidents produced. New failures become tests as part of the workflow, not as a someday backlog item.
Deploy and own: You get the suite, the review process, and the documentation, fully owned by your team. The point is an operational habit that keeps running after we hand it off, not a dependency on us.

If your agents are in production and you are still fixing the same class of failure more than once, that is the signal to formalize the loop. Book a call to map this pattern to your workflow and turn your incident history into the eval suite you should already have.

The next time an agent fails in production, you have two options. Patch it and move on, and accept that you will see it again. Or spend the extra twenty minutes to capture it as a test, and make sure you never do. One of those choices compounds.