A finance team I spoke with last quarter had a problem that looked like a win. Their new invoice-coding agent passed 94% of its evaluation suite, the demo dazzled the steering committee, and the engineering channel was full of green checkmarks. Then month-end close took exactly as long as it had the year before. The agent was excellent at producing plausible general-ledger codes. It was useless at the thing the CFO actually cared about, which was closing the books faster. Nobody had connected the model's score to a number on the operations report.
This is the most common failure in mid-market AI adoption, and it has nothing to do with model quality. Teams optimize the artifact they can see (the prompt, the eval pass rate, the latency chart) instead of the outcome they were hired to change. Gartner has predicted that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Forrester's 2026 predictions make a similar bet that a wave of AI investment will disappoint. Both forecasts point at the same gap: a working agent that no one can connect to a metric leadership recognizes.
The prompt is not the product
Prompt quality and eval scores are real and necessary. They are also the floor, not the ceiling. Hamel Husain makes the case well in his guide on why your AI product needs evals: you build custom evaluators for your specific failure modes because generic benchmarks do not predict whether your system works. That is correct, and it is where most teams should start. The trap is treating the eval score as the finish line. An eval answers "did the model produce an acceptable output on this test case." It does not answer "did the queue get shorter, did the cost per case fall, did the customer get a faster answer."
Those are different questions with different owners. The eval score belongs to the engineering team. The outcome belongs to the COO and the CFO, and they will not fund a second project on the strength of a confusion matrix. When the only number you can show is model accuracy, you have built a science fair project that happens to run in production.
Here is the reframe that changes the conversation: an AI workflow is an intervention in a business process, and you measure interventions by their effect on the process, not by the internal cleanliness of the intervention. A surgeon is not graded on how elegant the incision was. They are graded on whether the patient recovered. Your agent is the incision. The outcome metric is the recovery.
Nine numbers a CFO will actually sign off on
The catalog below is deliberately boring, because the numbers a finance or operations leader trusts are the ones already on their reports. Map each workflow to one primary metric from this list, plus one or two supporting ones. Resist the urge to track all nine for every agent. The point is a clear causal claim, not a dashboard nobody reads.
| Metric | What it answers | Typical owner |
|---|---|---|
| Cycle time | How long does a case take end to end? | Operations |
| Exception rate | What share of cases can the agent not complete cleanly? | Operations / Risk |
| Rework rate | How often does a finished case have to be redone? | Quality |
| SLA attainment | What percentage of cases met the promised turnaround? | Service / Ops |
| Cost per case | What is the fully loaded cost to process one unit of work? | Finance |
| Reviewer time | How many human minutes go to checking or correcting AI output? | Operations |
| Revenue response | What is the effect on conversion, retention, or speed-to-revenue? | Revenue / Finance |
| Auditability | What share of decisions have a complete, reviewable trace? | Risk / Compliance |
| Customer outcome | Did the customer get a better result (CSAT, resolution, churn)? | CX / Product |
Two of these deserve a closer look because teams routinely skip them.
Reviewer time is the metric that quietly kills ROI. An agent that drafts a response in two seconds but then needs a human to read, verify, and rewrite it for eight minutes has saved little or nothing. It has moved the work from "writing" to "checking," and checking is often slower than doing because the reviewer has to reconstruct the agent's reasoning. If reviewer time per case is not falling, your headline savings are fictional. Measure it explicitly, per case, and watch the trend.
Auditability is the metric that most teams treat as a compliance checkbox rather than a number. It is a number. For every decision the agent makes, can you reconstruct what it saw, what tools it called, and why it acted? Express that as a percentage of decisions with a complete trace. When audit coverage drops below 100%, you have decisions you cannot explain, and in a regulated mid-market workflow that is a liability accruing silently.
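To make "a percentage of decisions with a complete trace" concrete, here is a minimal sketch. The trace shape and the three field names (`inputs_seen`, `tool_calls`, `rationale`) are assumptions that mirror the three questions above, not a standard schema, and the sample records are invented for illustration.

```python
# Hypothetical trace fields: what the agent saw, what it called, why it acted.
REQUIRED_FIELDS = ("inputs_seen", "tool_calls", "rationale")


def is_complete(trace: dict) -> bool:
    """A trace counts as complete only if every required field was recorded.

    An empty list of tool calls is a valid record ("no tools were called");
    a missing key or a None value means nothing was captured.
    """
    return all(trace.get(name) is not None for name in REQUIRED_FIELDS)


def audit_coverage(traces: list[dict]) -> float:
    """Percentage of decisions that have a complete trace."""
    if not traces:
        raise ValueError("no decisions to measure")
    complete = sum(1 for trace in traces if is_complete(trace))
    return 100.0 * complete / len(traces)


# Illustrative records only.
traces = [
    {"inputs_seen": ["inv-1.pdf"], "tool_calls": ["po_lookup"], "rationale": "PO match"},
    {"inputs_seen": ["inv-2.pdf"], "tool_calls": [], "rationale": "no PO required"},
    {"inputs_seen": ["inv-3.pdf"], "tool_calls": ["po_lookup"]},  # rationale never logged
    {"inputs_seen": ["inv-4.pdf"], "tool_calls": None, "rationale": "coded to 6100"},
]
coverage = audit_coverage(traces)
print(f"audit coverage: {coverage:.0f}%")
```

The design choice worth copying is the strict definition: a trace with two of three fields is counted as missing, because a partially explained decision is still a decision you cannot defend.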
For cycle time specifically, the software delivery world already solved the measurement problem. Google's DORA research program built its reputation on a handful of throughput and stability metrics, with lead time for changes among the best known, because they are harder to game than activity counts. Borrow the discipline. Measure the time from "work arrives" to "work is done and accepted," not the time the model spent thinking.
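A rough sketch of that measurement, assuming each case carries two timestamps: when the work arrived and when the finished result was accepted. The timestamps are made up, and reporting the median is my choice here, not something the approach requires.

```python
from datetime import datetime
from statistics import median

SECONDS_PER_DAY = 86_400


def cycle_time_days(arrived_at: datetime, accepted_at: datetime) -> float:
    """Elapsed days from 'work arrives' to 'work is done and accepted'."""
    return (accepted_at - arrived_at).total_seconds() / SECONDS_PER_DAY


# (arrived_at, accepted_at) pairs; illustrative values only.
cases = [
    (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 5, 9, 0)),
    (datetime(2025, 3, 3, 12, 0), datetime(2025, 3, 4, 12, 0)),
    (datetime(2025, 3, 4, 9, 0), datetime(2025, 3, 8, 21, 0)),
]

durations = [cycle_time_days(arrived, accepted) for arrived, accepted in cases]
median_days = median(durations)
print(f"median cycle time: {median_days:.1f} days")
```

Note what is absent: model latency never enters the calculation. The clock starts when the work lands and stops when a human or downstream system accepts it.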
Build the counterfactual before you build the agent
The single most expensive mistake is launching without a baseline, because then any improvement is unprovable and any regression is invisible. Our earlier field guide on building a workflow baseline covered the mechanics of capturing cycle time, volume, exceptions, cost, and quality before you write a line of orchestration. The principle that matters for this post is simpler: the proof of value is a delta, and a delta needs two measured states.
A demo is not a delta. A demo shows the agent doing the thing once, under favorable conditions, narrated by the person who built it. Leadership has seen a hundred demos. What moves budget is a sentence like this: "Over eight weeks, invoices routed through the agent closed in 1.9 days versus 4.6 days for the control group, at 41% lower cost per case, with no increase in posting errors." That sentence requires a control group.
You have three honest ways to get one:
- Holdout. Send a random slice of volume (say 20%) through the old process for a fixed window. The comparison is clean because both paths run over the same period, absorbing the same seasonality and staffing.
- Staged rollout. Turn the agent on for one region, team, or product line first. The untouched units are your control.
- Before-and-after with adjustment. The weakest option, used only when you cannot split volume. Compare a measured pre-period against the post-period and explicitly subtract known confounders like a headcount change or a pricing event.
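For the holdout option, one way to get a random but repeatable split is to hash each case ID into a bucket. This is a sketch under assumptions: the function name, the salt string, and the ID format are all hypothetical, and it presumes every case has a stable identifier at intake.

```python
import hashlib


def assign_path(case_id: str, holdout_share: float = 0.20,
                salt: str = "ap-holdout-v1") -> str:
    """Deterministically route a case to 'control' (old process) or 'agent'.

    Hashing the ID means the same case always lands on the same path,
    and nobody can steer easy cases toward the agent by hand.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "control" if bucket < holdout_share else "agent"


# Hypothetical invoice IDs, used only to check the realized split.
ids = [f"INV-{n}" for n in range(10_000)]
control_share = sum(assign_path(i) == "control" for i in ids) / len(ids)
print(f"control share: {control_share:.1%}")
```

Changing the salt reshuffles the assignment for a new test window, so the same invoices are not permanently stuck in the control group.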
McKinsey's State of AI research has consistently found that the organizations capturing real value from AI are the ones with formal measurement and governance, not the ones with the most pilots. The pattern holds in the mid-market data too: firms that pair AI deployment with disciplined tracking report savings that clear seven figures annually, while the pilot-rich, measurement-poor crowd reports vibes. The difference is almost never the model.
A counter-intuitive consequence: you should be able to turn the agent off. If you cannot point to a metric that visibly degrades when the agent is disabled, you have not proven the agent is doing anything. The "off switch test" is the cheapest ROI audit you will ever run, and it terrifies the projects that were theater all along.
Governance is a metric, not a memo
Most AI governance lives in a slide deck: a policy, a RACI chart, a list of prohibited use cases. Useful, but a memo does not tell you whether your controls are holding this week. Metrics do. Three of the nine numbers above are governance instruments in disguise, and reading them as governance signals is what separates a controlled workflow from an accident waiting to be discovered.
Exception rate is your governance pulse. A well-designed agentic workflow does not try to handle everything. It handles the clean cases and routes the ambiguous ones to a human, on purpose. A stable exception rate means the boundary between "agent decides" and "human decides" is calibrated. A rate that climbs means the world has drifted away from what the agent was built for, often before any single bad decision shows up. A rate that falls toward zero is not a victory. It usually means the agent has started guessing at cases it should have escalated.
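Reading the exception rate as a pulse can be as simple as comparing the current rate to the rate the workflow was calibrated for. The sketch below is illustrative: the five-point tolerance band and the three labels are assumptions you would tune to your own volume and risk appetite.

```python
def exception_rate(routed_to_human: int, total_cases: int) -> float:
    """Share of cases the agent escalated instead of completing."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return routed_to_human / total_cases


def classify_drift(current: float, expected: float, tolerance: float = 0.05) -> str:
    """Label the current rate relative to the calibrated one.

    'climbing'   -> inputs may have drifted from what the agent was built for
    'collapsing' -> the agent may be guessing at cases it should escalate
    'stable'     -> the agent/human boundary looks calibrated
    """
    if current > expected + tolerance:
        return "climbing"
    if current < expected - tolerance:
        return "collapsing"
    return "stable"


# Illustrative weekly counts against an expected 18% exception rate.
for routed, total in [(180, 1000), (270, 1000), (40, 1000)]:
    rate = exception_rate(routed, total)
    print(f"{rate:.0%} -> {classify_drift(rate, expected=0.18)}")
```

Both directions raise a flag on purpose. An alert that only fires when exceptions rise would miss the quieter failure, where the agent stops asking for help.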
Reviewer time doubles as a control-effectiveness measure. If reviewers are rubber-stamping, time per case collapses and you have a human-in-the-loop in name only. If reviewers are drowning, the design pushed too much marginal work to people. The healthy zone is a reviewer who spends real but bounded time on the genuinely hard cases.
Audit coverage is the metric a regulator or an acquirer will ask about first. PwC's 2026 AI business predictions put responsible-AI operating models at the center of the year ahead, and the operational translation is unglamorous: every consequential decision needs a trace, and the share of decisions with a complete trace needs to read 100%. Industry analysis of 2026 financial-management trends lands in the same place for finance functions, where explainability is a precondition for putting AI anywhere near the ledger.
Read together, these three numbers turn governance from an annual policy review into a live readout. You do not wait for an incident to discover your controls slipped. You watch the exception rate creep, reviewer time collapse, or audit coverage dip, and you fix the boundary before the incident happens.
A worked example: AP exception handling
Concrete beats abstract, so here is the invoice team from the opening, rebuilt the right way. The workflow handles accounts-payable invoices: the agent reads each invoice, matches it to a purchase order, codes it to the right GL account, and either posts it or routes it to a human when something does not reconcile.
Before touching the model, the team measured a four-week baseline:
| Metric | Baseline (human-only) |
|---|---|
| Cycle time (receipt to posted) | 4.6 days |
| Exception rate | n/a (all manual) |
| Cost per invoice | $6.10 fully loaded |
| Reviewer time per invoice | 7.2 minutes |
| Posting error rate (rework) | 2.3% |
| Audit trace coverage | 64% (partial notes) |
Then they ran the agent on 80% of volume for eight weeks, with 20% held out through the old process. The agent-handled path produced:
| Metric | Agent path (8-week holdout test) |
|---|---|
| Cycle time | 1.9 days |
| Exception rate | 18% routed to human |
| Cost per invoice | $3.60 fully loaded |
| Reviewer time per invoice | 2.4 minutes (on the 18% exceptions) |
| Posting error rate (rework) | 1.9% |
| Audit trace coverage | 100% |
The story those numbers tell is specific and defensible. Cycle time more than halved against a live control. Cost per invoice fell 41%, and the savings are real because reviewer time fell rather than shifting. The 18% exception rate is a feature, not a failure: those are the cases the agent correctly declined to guess on. Audit coverage went from "some sticky notes" to complete, which is the line item the auditor and the eventual buyer will care about most. Notice that model accuracy does not appear on this report at all. It was tracked internally as a supporting signal, and it earned exactly the attention it deserved, which was the engineering team's and nobody else's.
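The percentages in that summary are plain arithmetic on the two tables, and it is worth being able to reproduce them on demand. A small sketch, using the baseline table as the comparison point and hypothetical dictionary keys for the three metrics that appear as numbers in both tables:

```python
# Figures copied from the baseline and agent-path tables above.
baseline = {"cycle_time_days": 4.6, "cost_per_invoice": 6.10, "posting_error_pct": 2.3}
agent = {"cycle_time_days": 1.9, "cost_per_invoice": 3.60, "posting_error_pct": 1.9}


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the metric went down."""
    return 100.0 * (after - before) / before


deltas = {
    name: round(pct_change(baseline[name], agent[name]), 1)
    for name in baseline
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
# cycle_time_days: -58.7%
# cost_per_invoice: -41.0%
# posting_error_pct: -17.4%
```

The cost figure matches the 41% quoted above, and the 58.7% drop in cycle time is what "more than halved" means in numbers.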
How OpenNash can help
If your AI initiatives keep stalling at the demo stage, the missing piece is usually measurement design, not model selection. OpenNash starts every engagement with an audit that maps the target process and captures the baseline numbers you will later prove value against, so the workflow is wired to a metric before anyone writes orchestration code. From there we design the human-in-the-loop boundaries, exception routing, and audit trace that make the governance metrics readable, then build and deploy with a holdout in place so the ROI claim survives contact with finance.
The deliverable is yours at the end: the workflow, the dashboards, and the trace infrastructure, with full ownership and documentation handed over. If you are weighing build versus buy, the honest answer is that an off-the-shelf platform is the right call when your process is generic and your governance needs are light. Custom is worth it when the workflow is specific to how you operate, when auditability is a hard requirement, and when the metric you need to move is one a buyer or regulator will eventually inspect.
Pick the one workflow where you can name the metric out loud, instrument the baseline this month, and run the off-switch test before you scale. Book a call to map this to your workflow if you want a second set of eyes on the measurement design.