The demo always works. That is the problem.

An AI workflow that codes invoices, drafts support replies, or screens deal documents looks flawless in a controlled walkthrough, because a person picked the inputs. Production picks its own inputs, and they are uglier: the scanned PDF rotated 90 degrees, the customer who asks two questions in one message, the vendor name spelled three different ways across three systems. The question that decides whether a pilot becomes a deployment is not "does it work?" It is "where does it fail, and what happens when it does?"

That reframing matters because AI agents are no longer experimental. Analysts expect agents to be embedded in roughly 40% of enterprise applications by 2026, per Apptad's mid-year enterprise AI review. That moves the interesting decision from "should we try this?" to "is this specific workflow safe to turn on?" The honest answer is that no universal pass/fail score exists. What follows is the checklist we actually use, and how to set the bar for your own situation.

"Production-Ready" Has No Universal Score

The first mistake buyers make is asking for a single number. "Is it 95% accurate?" is the wrong question, because 95% accuracy on a draft-only assistant is excellent and 95% accuracy on an agent that posts journal entries to your ledger is a slow-motion audit finding.

Readiness is relative to two things: how much autonomy the workflow has, and how large its blast radius is when it is wrong. A grading rubric beats a single score here. The most durable framework I have seen comes from Google's machine learning teams, whose ML Test Score rubric grades a system across data, model, infrastructure, and monitoring tests instead of collapsing everything into one accuracy figure. The same spirit applies to AI agents: you pass by clearing categories, not by averaging them.

Before you score anything, write down two sentences, one answering each of these questions:

  • What is the worst thing this workflow can do if it is wrong and nobody catches it?
  • How would we know it happened?

If you cannot answer the second question, the workflow is not ready, no matter how good the accuracy looks.

A useful mental model is an autonomy ladder. At the bottom, the agent drafts and a human ships. One rung up, it classifies and routes. Higher still, it recommends actions. Above that, it executes with approval. At the top, it executes autonomously with audits. Each rung raises the bar. A drafting workflow can ship with a forgiving threshold and a human in the loop. An execution workflow cannot. Most failed rollouts are workflows that were graded as if they sat one rung lower than they actually did. The work of moving safely up that ladder is the subject of our pilot-to-production playbook; this post is the gate you run before each promotion.

The Go/No-Go Checklist: Seven Gates

Treat these as gates, not a weighted average. A workflow that aces six gates and has no audit trail is a workflow that will eventually do something you cannot explain to a client, a regulator, or your own CFO.

| Gate | What you are verifying | A workflow passes when |
|------|------------------------|------------------------|
| 1. Eval coverage | The test set reflects real inputs | Evals cover the top failure modes from error analysis, not just happy paths |
| 2. Edge-case pass rate | Behavior on the ugly 10% | Critical edge cases hit your threshold with no silent failures |
| 3. Human approval rules | Who signs off on risky actions | Every high-impact action has an explicit approval gate |
| 4. Rollback plan | You can undo a bad rollout | Rollback is tested, not theoretical |
| 5. System writeback tests | Writes to real systems behave | Retries are idempotent and partial failures are handled |
| 6. Audit trail | You can reconstruct any decision | Every run logs inputs, tool calls, and outputs |
| 7. Operator training | Humans know what to do | Operators can spot, escalate, and override failures |
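
To make the "gates, not a weighted average" idea concrete, here is a minimal Python sketch. The `GateResult` type, the `go_no_go` function, and the gate names are illustrative assumptions, not part of any particular tool. It shows a workflow that clears six of seven gates, averages well, and is still a no-go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str      # hypothetical gate identifier
    passed: bool


def go_no_go(results: list[GateResult]) -> tuple[bool, list[str]]:
    """Return (go, failed_gate_names). Any single failed gate blocks launch."""
    failed = [r.name for r in results if not r.passed]
    return (len(failed) == 0, failed)


# Example: six gates pass, the audit trail does not.
results = [
    GateResult("eval_coverage", True),
    GateResult("edge_case_pass_rate", True),
    GateResult("human_approval_rules", True),
    GateResult("rollback_plan", True),
    GateResult("system_writeback_tests", True),
    GateResult("audit_trail", False),
    GateResult("operator_training", True),
]

go, failed = go_no_go(results)
# The average looks healthy, which is exactly why it is the wrong summary.
average = sum(r.passed for r in results) / len(results)
print(go, failed, round(average, 2))
```

The design choice is that the failed gate names are returned alongside the decision, so the person who owns the go/no-go sees what to fix, not only that the answer is no.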

Gate 1: Eval coverage. Your evaluation set has to look like your inbox, not your demo. Hamel Husain's work on why your AI product needs evals makes the point that generic benchmarks tell you almost nothing about your specific failure modes. The test cases that matter come from error analysis: read fifty real transcripts, tag what went wrong, and build assertions for those patterns. A workflow with twelve hand-picked test cases has not been evaluated. It has been rehearsed.

Gate 2: Edge-case pass rate. Separate your test set into the routine 90% and the ugly 10%, then grade them differently. The routine cases will pass; they always do. The deployment risk lives in the edge cases, and a worrying number of pilots have no measured pass rate there at all. Iternal's writeup on the edge cases that kill pilots catalogs the usual suspects: missing fields, conflicting inputs, out-of-scope requests, and inputs that look valid but are not. Set a hard threshold for the critical subset and refuse to round up.

Gate 3: Human approval rules. Decide, in writing, which actions a human must approve before the agent commits them. "Send the refund" and "flag the refund for approval" are different products. The rule should be specific: dollar thresholds, action types, confidence cutoffs. Vague guidance like "escalate when unsure" is not a rule, because the model's sense of "unsure" is exactly what you cannot trust yet.
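
A written rule of this kind can be small enough to read in one sitting. The sketch below is one possible shape, in Python. The dollar limit, confidence cutoff, and action names are hypothetical placeholders that your own policy would replace.

```python
from dataclasses import dataclass

# Hypothetical policy values; set these from your own risk analysis.
AUTO_APPROVE_MAX_USD = 100.00
MIN_CONFIDENCE = 0.90
ALWAYS_REVIEW_ACTIONS = {"post_journal_entry", "change_vendor_bank_details"}


@dataclass(frozen=True)
class ProposedAction:
    action_type: str
    amount_usd: float
    confidence: float  # model-reported score in [0, 1]


def requires_human_approval(action: ProposedAction) -> bool:
    """Explicit rule: any one trigger sends the action to a human."""
    if action.action_type in ALWAYS_REVIEW_ACTIONS:
        return True
    if action.amount_usd > AUTO_APPROVE_MAX_USD:
        return True
    if action.confidence < MIN_CONFIDENCE:
        return True
    return False


small_refund = ProposedAction("send_refund", 25.00, 0.97)
large_refund = ProposedAction("send_refund", 2500.00, 0.99)
print(requires_human_approval(small_refund), requires_human_approval(large_refund))
```

Note that the confidence cutoff is only one trigger among three. The action-type and dollar rules still apply when the model is confidently wrong.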

Gate 4: Rollback plan. You need a way to turn the workflow off and revert to the previous process in minutes, and you need to have practiced it. Progressive rollout patterns from classic deployment engineering apply directly here. Martin Fowler's notes on the canary release describe routing a small slice of traffic to the new system first, then widening only if it holds. Run the agent on 5% of volume, watch, then scale. An untested rollback is a hope, not a plan.
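
One common way to carve out a stable 5% slice is deterministic hash bucketing, sketched below in Python. This is an assumption about implementation, not something the canary pattern prescribes, and `in_canary` and the item IDs are hypothetical names. Setting the percentage to 0 doubles as the off switch.

```python
import hashlib


def in_canary(item_id: str, percent: int) -> bool:
    """Deterministically assign an item to the canary slice.

    The same item_id always lands in the same bucket, so widening the
    rollout from 5 to 25 only adds items and never reshuffles them.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Route a batch: canary items go to the agent, the rest to the old process.
ids = [f"invoice-{n}" for n in range(10_000)]
canary_ids = [i for i in ids if in_canary(i, 5)]
print(len(canary_ids) / len(ids))  # expected to land close to 0.05
```

Rolling back is then a configuration change (percent set to 0), which is the part worth rehearsing before launch.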

Gate 5: System writeback tests. This is the gate teams skip most often and regret most. More on it below.

Gate 6: Audit trail. Every run should log the input it received, the tools it called, the data it read, and the output it produced. The NIST AI Risk Management Framework treats traceability as a baseline property of trustworthy systems, not a nice-to-have, and for good reason: the first serious incident will be a forensic exercise. If you cannot reconstruct why the agent did what it did, you cannot fix it, and you cannot defend it.
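
As a rough illustration of what "every run logs" can mean, the Python sketch below builds one structured record per run and serializes it as a single JSON line. The field names and the `build_run_record` helper are assumptions for illustration; NIST does not prescribe a schema.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_record(workflow, input_payload, tool_calls, data_read, output):
    """Assemble one audit record covering input, tools, data read, and output."""
    return {
        "run_id": str(uuid.uuid4()),
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "workflow": workflow,
        "input": input_payload,
        "tool_calls": tool_calls,  # ordered list of {"name": ..., "args": ...}
        "data_read": data_read,    # identifiers of records the agent looked at
        "output": output,
    }


def to_json_line(record: dict) -> str:
    """Serialize to one line so records can be appended to a log file."""
    return json.dumps(record, sort_keys=True)


record = build_run_record(
    workflow="invoice_coding",
    input_payload={"invoice_no": "INV-1001"},
    tool_calls=[{"name": "lookup_vendor", "args": {"name": "Acme"}}],
    data_read=["vendor:123"],
    output={"gl_code": "6100"},
)
line = to_json_line(record)
print(line)
```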

Gate 7: Operator training. The humans supervising the workflow need to know what good and bad output look like, how to override a decision, and who to escalate to. A polished agent handed to an untrained operator fails on contact, because the person watching it does not know which failures are normal noise and which are the start of a bad day.

Setting the Numbers: What "Pass" Looks Like

The thresholds are yours to set, but they should follow from cost, not from a round number someone liked. A 70% pass rate is a common internal default, and it is fine for a drafting tool where a human reads everything. It is reckless for an agent that updates a system of record.

A tiered testing approach keeps the numbers honest. Borrowing the structure many practitioners use:

  • Assertions run on every change. These are cheap, deterministic checks: the output is valid JSON, the refund amount is non-negative, the email contains a required disclosure. Pass rate here should be 100%, because these are bugs, not judgment calls.
  • Model-graded evals run on a cadence. These score subjective quality against your rubric. Set the bar by impact: 85-90% may be acceptable for routine cases, with a higher floor for critical ones.
  • Human review and live A/B tests run after meaningful changes. This is where you catch the failures your offline set never imagined.
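
The first tier is the easiest to automate. A minimal Python sketch of the three deterministic checks named above might look like this. The output schema (`refund_amount`, `email_body`) and the disclosure text are hypothetical.

```python
import json

# Hypothetical disclosure text; substitute whatever your policy requires.
REQUIRED_DISCLOSURE = "This message was drafted with AI assistance."


def check_output(raw: str) -> list[str]:
    """Return the names of failed checks; an empty list means all passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["valid_json"]

    failures = []
    amount = data.get("refund_amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount < 0:
        failures.append("refund_non_negative")

    body = data.get("email_body")
    if not isinstance(body, str) or REQUIRED_DISCLOSURE not in body:
        failures.append("disclosure_present")
    return failures


good = json.dumps({"refund_amount": 20.0,
                   "email_body": "Hi. " + REQUIRED_DISCLOSURE})
bad = json.dumps({"refund_amount": -5, "email_body": "Hi."})
print(check_output(good), check_output(bad), check_output("not json"))
```

Because these checks are deterministic, any failure is a bug to fix before merging, which is why the bar for this tier is 100%.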

The trap is averaging across tiers. With a 90/10 split between routine and critical cases, an agent that scores 99% on the routine cases and 60% on the critical subset has a headline number near 95% and a real-world risk profile that should stop the rollout. Grade the subset that can hurt you separately, and let it veto the launch on its own.
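
The arithmetic is worth seeing once. The Python sketch below assumes critical cases are 10% of the test set (the routine 90% and ugly 10% split from Gate 2) and uses hypothetical floors of 85% for routine and 95% for critical cases.

```python
def blended_pass_rate(routine_rate: float, critical_rate: float,
                      critical_share: float) -> float:
    """Weighted average across subsets: the number that hides the problem."""
    return routine_rate * (1 - critical_share) + critical_rate * critical_share


def launch_allowed(routine_rate: float, critical_rate: float,
                   routine_floor: float, critical_floor: float) -> bool:
    """Each subset must clear its own floor; either one can veto."""
    return routine_rate >= routine_floor and critical_rate >= critical_floor


headline = blended_pass_rate(0.99, 0.60, critical_share=0.10)
# 0.99 * 0.90 + 0.60 * 0.10 = 0.891 + 0.060 = 0.951
decision = launch_allowed(0.99, 0.60, routine_floor=0.85, critical_floor=0.95)
print(round(headline, 3), decision)
```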

The Tests Teams Skip

Three tests get cut under deadline pressure, and all three show up later as incidents.

Writeback and idempotency. When your agent writes to Salesforce, NetSuite, or a ticketing system, the network will occasionally drop a response after the write succeeds. A naive retry then writes again, and you have duplicate invoices or two refunds for one customer. Test that every write is idempotent: the same logical action, run twice, produces one result. Test partial failures, where the agent updates two of three systems before something breaks. This is the same discipline as identifying your system of record before you build: knowing which system owns the truth tells you which writes must never duplicate.
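
A dropped-response retry test can be sketched in a few lines of Python. `FakeLedger` below is an in-memory stand-in written for this example, not a Salesforce or NetSuite client, and deriving the idempotency key from the payload is one assumed approach among several.

```python
import hashlib
import json


class FakeLedger:
    """In-memory stand-in for a system of record (hypothetical, for tests)."""

    def __init__(self):
        self.rows = {}
        self.drop_next_response = False

    def create_invoice(self, idempotency_key: str, payload: dict) -> str:
        # The write is keyed: a repeated key does not create a second row.
        if idempotency_key not in self.rows:
            self.rows[idempotency_key] = payload
        if self.drop_next_response:
            # Simulate the network losing the response AFTER the write landed.
            self.drop_next_response = False
            raise TimeoutError("response lost after write")
        return idempotency_key


def make_key(action: str, payload: dict) -> str:
    """Same logical action and payload always yield the same key."""
    canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_with_retry(ledger: FakeLedger, payload: dict, attempts: int = 3) -> str:
    key = make_key("create_invoice", payload)
    for _ in range(attempts):
        try:
            return ledger.create_invoice(key, payload)
        except TimeoutError:
            continue  # safe to retry because the key is unchanged
    raise RuntimeError("write was never confirmed")


ledger = FakeLedger()
ledger.drop_next_response = True
write_with_retry(ledger, {"invoice_no": "INV-1001", "vendor": "Acme", "amount": 120.0})
print(len(ledger.rows))  # one invoice, despite two write attempts
```

The key here depends on the payload carrying a natural identifier such as `invoice_no`. Two genuinely distinct writes with identical payloads would collide, so a caller-generated key per logical action may suit some workflows better.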

Realistic edge cases, not synthetic ones. Generated test data is too clean. It does not contain the customer who replies to a closed ticket, the deal memo with a redacted page, or the invoice in a currency you do not handle. Pull a sample of real production-shaped inputs and run them. In private equity diligence, where AI is now triaging data rooms at speed, Third Bridge's analysis of AI in due diligence is blunt that the value shows up only when the workflow survives messy, real documents rather than tidy samples. The same holds for a midmarket back office: the agent that aces a curated test set can still choke on a real Tuesday.

Audit-trail verification under failure. It is not enough that logging works when everything goes right. Force a failure (kill a tool mid-run, feed a malformed input) and confirm the trail still captures what happened. The log you need most is the one written during the incident, and that is exactly the path teams forget to test.
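
A forced-failure test of the audit path might look like the Python sketch below. The `run_with_audit` wrapper and the record fields are illustrative assumptions. The point is that the record is written in a `finally` block, so it survives the exception.

```python
def run_with_audit(tool, payload: dict, audit_log: list):
    """Run one tool call and always leave an audit record, even on failure."""
    record = {"input": payload, "tool_calls": [], "output": None, "error": None}
    try:
        record["tool_calls"].append(tool.__name__)
        record["output"] = tool(payload)
        return record["output"]
    except Exception as exc:
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the failure
    finally:
        audit_log.append(record)  # runs on success and on failure


def broken_tool(payload: dict):
    """Simulates a tool that dies mid-run."""
    raise ConnectionError("tool killed mid-run")


audit_log = []
try:
    run_with_audit(broken_tool, {"ticket": "T-1"}, audit_log)
except ConnectionError:
    pass  # expected: the failure was forced on purpose

print(audit_log[0]["error"])
```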

Who Owns the Go/No-Go

A checklist without an owner becomes a document nobody reads. Name one accountable person for the go/no-go decision and give them the authority to say no. In practice this is a product or operations lead, not the engineer who built it, because the builder is too close to grade their own work.

Governance has to be signed before launch, not assembled after the first complaint. For anything touching regulated data, customer money, or client deliverables, get legal and compliance to approve the allowed actions and the approval rules in advance. This is cheaper than it sounds and far cheaper than the alternative. The teams that move fastest into production agents are usually the ones who front-loaded this conversation, because they never had to stop a live workflow to retrofit a control.

The graduation rule is simple: run supervised, measure, then expand autonomy based on evidence. Do not promote a workflow to full autonomy on a calendar date. Promote it when its edge-case pass rate, its incident count, and its audit trail say it has earned the next rung. The most autonomous deployments I trust have months of operational data behind them, because that data is how they learned the failure modes no pre-launch test set contained.
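
One way to keep promotion tied to evidence is to encode the rule as a check that returns blockers. The Python sketch below does that. The `Evidence` fields, the minimum run count, and the 98% floor are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    supervised_runs: int
    critical_edge_pass_rate: float
    open_incidents: int
    audit_trail_verified: bool


def promotion_blockers(e: Evidence, min_runs: int = 500,
                       critical_floor: float = 0.98) -> list[str]:
    """Return reasons the workflow has NOT earned the next rung."""
    blockers = []
    if e.supervised_runs < min_runs:
        blockers.append("not enough supervised runs")
    if e.critical_edge_pass_rate < critical_floor:
        blockers.append("critical edge-case pass rate below floor")
    if e.open_incidents > 0:
        blockers.append("unresolved incidents")
    if not e.audit_trail_verified:
        blockers.append("audit trail not verified under failure")
    return blockers


evidence = Evidence(supervised_runs=1200, critical_edge_pass_rate=0.96,
                    open_incidents=0, audit_trail_verified=True)
print(promotion_blockers(evidence))
```

Nothing in the check refers to a date. An empty list is the only thing that unlocks the next rung.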

How OpenNash Can Help

If you are staring at a pilot and trying to decide whether to turn it on, the seven gates map cleanly onto how we work. The audit step defines the workflow's blast radius and writes the two sentences that set the bar. The design step turns approval rules, rollback, and guardrails into concrete controls before any code ships. The build step produces the eval set and edge-case tests from real inputs, not synthetic ones. The deploy step runs the canary, verifies writeback idempotency and the audit trail under failure, and trains the operators. You own all of it afterward: the workflows, the evals, the runbooks, and the CI/CD that keeps them honest.

We are not the right call for every situation. If your use case is well served by an off-the-shelf platform with a built-in approval flow, buy the platform. If you are exploring and learning, stay in pilot a while longer and gather operational data. A custom build makes sense when the workflow is specific to your operation, the writebacks touch systems you cannot afford to corrupt, and you need an auditable trail you fully control.

Book a call to map this checklist to a workflow you are trying to move into production, and we will tell you honestly which gates it clears today and which ones it does not.