A model vendor walked into a finance team's conference room with a slide that read 94 percent. The benchmark was real and the number was honest. Three weeks later the same model was misreading purchase order numbers on roughly one in six scanned invoices, because the client's vendors used a layout the benchmark never contained. Nobody lied. The score just measured a different job than the one the team needed done.
This gap between a benchmark headline and your actual results is the single most expensive misunderstanding in mid-market AI adoption. Teams approve a deployment based on a number that describes general capability, then discover their workflow has its own formats, exceptions, and rules that no public test ever touched. The fix is not a better benchmark. It is a different kind of test entirely.
The Benchmark That Tells You Nothing About Your Job
Public benchmarks exist to compare models against each other. MMLU measures broad knowledge, SWE-bench measures coding on open-source issues, and reasoning suites measure math and logic on curated problem sets. These are useful for the people building foundation models. They are close to useless for deciding whether a specific agent will handle your accounts payable queue.
Two structural problems make the scores misleading for your purposes. First, the data overlaps with training. When the questions or something close to them sit somewhere in the training corpus, the score reflects memorization as much as ability. Apple researchers showed how fragile this can be in GSM-Symbolic, where changing only the names and numbers in standard math problems lowered accuracy across the models tested and made scores vary noticeably from one variant of a question to the next. If a cosmetic edit moves the score, the score was never measuring robust reasoning.
Second, benchmarks average across a population that is not yours. A model can score well overall while failing badly on the narrow slice you care about. Stanford's HELM project was built partly to make this visible, reporting performance across many scenarios and metrics instead of one headline number, precisely because a single figure hides where a model is weak. As the team behind a 2026 review of AI benchmarks put it, the leaderboard rarely survives contact with a real deployment.
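To make the averaging problem concrete, here is a minimal sketch with hypothetical numbers: 100 invoices, 10 of them in an unusual vendor layout. The slice names, counts, and the `pass_rate_by_slice` helper are all invented for illustration.

```python
from collections import defaultdict


def pass_rate_by_slice(results):
    """Return (overall_rate, {slice_name: rate}) for (slice_name, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_slice = {name: passes[name] / totals[name] for name in totals}
    return overall, per_slice


# Hypothetical results: 90 standard invoices, 10 in an unusual layout.
results = (
    [("standard_layout", True)] * 88
    + [("standard_layout", False)] * 2
    + [("unusual_layout", True)] * 4
    + [("unusual_layout", False)] * 6
)

overall, per_slice = pass_rate_by_slice(results)
print(f"overall: {overall:.0%}")
for name, rate in per_slice.items():
    print(f"{name}: {rate:.0%}")
```

With these made-up counts the headline is 92 percent, while the unusual-layout slice passes only 40 percent of the time. If that slice is where your vendors live, the headline describes someone else's workload.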
The takeaway for a buyer is blunt. A benchmark answers "is this model generally capable." Your business needs the answer to "does this agent process my work correctly." Those are different questions, and only the second one keeps you out of trouble.
What a Workflow-Specific Eval Actually Is
A workflow-specific eval is a test suite assembled from your own cases and graded against your own definition of correct. It treats the agent the way you would treat a new hire on day one: here are real examples of the work, here is what a good outcome looks like, now show me you can do it.
The structure is simple. Each test case has an input drawn from real work, an expected outcome defined by someone who owns the process, and a grading rule that decides pass or fail. The art is in choosing cases that represent the full shape of the job, especially the parts that break.
| Workflow | Example test case | Pass / fail rule |
|---|---|---|
| Invoice triage | Real scanned invoice with a discount line and a freight charge | Extracted total matches ledger to the cent; vendor matched to correct record |
| Support tickets | Past ticket where the customer asked two questions in one message | Routed to correct queue; both questions acknowledged in the draft reply |
| Sales quotes | Historical RFQ with a non-standard volume discount | Quote applies the tiered price defined in the pricing policy, not list price |
| Customer onboarding | Real onboarding case missing a tax ID at intake | Agent flags the gap and pauses, does not fabricate or proceed |
| Reports | Past month-end summary request | Numbers reconcile to source system; no figure invented or carried over |
Notice that none of these rules mention how smart the model is. They mention whether the total reconciles, whether the routing is right, whether the policy was applied. That is the whole point. You are testing the outcome of the workflow, not the eloquence of the language model.
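One way to hold a test case in code is a small record with the three parts described above: input, expected outcome, and grading rule. This is a sketch under assumptions, not a prescribed schema; `EvalCase`, `onboarding_gap_flagged`, `run_case`, and every field name are hypothetical, and the agent is assumed to be a callable that takes a dict and returns a dict.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalCase:
    case_id: str
    workflow: str
    input_data: dict[str, Any]   # drawn from real work
    expected: dict[str, Any]     # defined by the process owner
    rule: Callable[[dict[str, Any], dict[str, Any]], bool]  # (output, expected) -> passed


def onboarding_gap_flagged(output, expected):
    """Pass only if the agent paused and flagged exactly the expected missing fields."""
    return (
        output.get("status") == expected["status"]
        and set(output.get("missing_fields", [])) == set(expected["missing_fields"])
    )


# Hypothetical case modeled on the onboarding row in the table.
case = EvalCase(
    case_id="onb-0042",
    workflow="customer_onboarding",
    input_data={"company": "Example Co", "tax_id": None},
    expected={"status": "paused", "missing_fields": ["tax_id"]},
    rule=onboarding_gap_flagged,
)


def run_case(case, agent):
    """Run the agent on one case and grade the output with the case's own rule."""
    output = agent(case.input_data)
    return case.rule(output, case.expected)
```

Keeping the rule next to the case means the grading logic travels with the example, so a reviewer can read one record and see both what was asked and what counts as correct.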
Mine Your Test Cases From Real Work
The best source of test cases is the work you have already done. Pull 50 to 100 historical examples from the system of record: closed tickets, processed invoices, sent quotes, completed onboarding files. Resist the urge to write synthetic cases. Synthetic data tends to be clean and reasonable, while the inputs that trip agents up in production are neither.
Weight the set toward the exceptions. Succeeding on the typical case and failing on the messy 15 percent is the default failure mode of AI deployments, because the messy 15 percent is where the money and the risk live. Hunt for the disputed invoice, the angry ticket with three issues buried in it, the quote where the customer negotiated a one-off term, the onboarding file with a missing field. Hamel Husain, who has written the most practical guidance on this, argues that error analysis should consume the majority of your evaluation effort, because looking closely at real failures is what tells you which test cases matter.
Labeling is the step teams want to skip and cannot afford to. Sit with the accounts payable lead, the support manager, the deal desk. For each case, capture the outcome they consider correct and, just as important, the reason. "This invoice should route to manual review because the PO is missing" is a label that teaches you a business rule. Collect enough of those reasons and your eval set becomes a written specification of the workflow that no one had bothered to document before. That artifact is worth the exercise on its own.
A practical sequence:
- Export 50 to 100 real cases spanning the last few quarters.
- Tag each as typical, edge, or exception, and over-sample the last two.
- Have the process owner record the expected outcome and the governing rule.
- Freeze this as your golden set and version it like code.
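The sampling and freezing steps above can be sketched as follows. The export, tag counts, and quotas are hypothetical, and `build_golden_set` and `fingerprint` are illustrative helpers, not part of any library. The sketch assumes each exported case is a JSON-serializable dict that already carries an `id` and a `tag`.

```python
import hashlib
import json
import random


def build_golden_set(cases, quotas, seed=0):
    """Pick up to quotas[tag] cases per tag, deterministically for a given seed."""
    rng = random.Random(seed)
    golden = []
    for tag, quota in quotas.items():
        pool = sorted((c for c in cases if c["tag"] == tag), key=lambda c: c["id"])
        golden.extend(rng.sample(pool, min(quota, len(pool))))
    return sorted(golden, key=lambda c: c["id"])


def fingerprint(golden):
    """Short, stable hash of the frozen set; changes whenever any case changes."""
    payload = json.dumps(golden, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical export: edge and exception cases are 20% of real volume.
export = (
    [{"id": f"typ-{i:03d}", "tag": "typical"} for i in range(160)]
    + [{"id": f"edge-{i:03d}", "tag": "edge"} for i in range(25)]
    + [{"id": f"exc-{i:03d}", "tag": "exception"} for i in range(15)]
)

# Over-sample the hard tags: here they make up 40 of the 70 golden cases.
golden = build_golden_set(export, {"typical": 30, "edge": 25, "exception": 15})
print(len(golden), fingerprint(golden))
```

Recording the fingerprint alongside each eval run is one simple way to know which version of the golden set produced a given score; committing the set to the same repository as the prompts achieves the same thing.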
Score Rules and Outcomes, Not Vibes
Once you have cases and expected outcomes, you have to grade. The instinct is to reach for a generic text-similarity metric, comparing the agent's output to a reference answer with something like ROUGE or BERTScore. Skip it. ROUGE counts overlapping words and BERTScore compares embeddings, so both reward sounding similar, not being correct. A fluent reply with the wrong invoice total differs from the reference by a few characters and will sail right through.
Grade in three tiers, matched to the kind of judgment each case needs.
Tier one: deterministic assertions. For anything tied to a hard business rule, write a plain check. Does the extracted total equal the ledger total? Is the routing queue correct? Was list price overridden when the policy said it should be? These run in milliseconds, cost nothing, and never disagree with themselves. Most of a well-designed eval suite lives here.
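A deterministic check for the invoice case might look like the sketch below. The field names (`total`, `ledger_total`, `vendor_id`) and the helper names are assumptions for illustration. Amounts are compared as `Decimal` values rather than floats, so "1204.5" and "1204.50" match exactly and binary rounding never causes a false failure.

```python
from decimal import Decimal, InvalidOperation


def total_matches_ledger(extracted, ledger):
    """True only if both amounts parse and are numerically equal."""
    try:
        return Decimal(extracted) == Decimal(ledger)
    except (InvalidOperation, TypeError, ValueError):
        # A missing or unparseable total counts as a failure, not a crash.
        return False


def check_invoice(output, expected):
    """Return the names of failed checks; an empty list means the case passes."""
    failures = []
    if not total_matches_ledger(output.get("total"), expected["ledger_total"]):
        failures.append("total_mismatch")
    if output.get("vendor_id") != expected["vendor_id"]:
        failures.append("wrong_vendor")
    return failures


# Hypothetical agent output and process-owner label.
expected = {"ledger_total": "1204.50", "vendor_id": "V-17"}
print(check_invoice({"total": "1204.5", "vendor_id": "V-17"}, expected))
print(check_invoice({"total": "1240.50", "vendor_id": "V-71"}, expected))
```

Returning the list of failed checks, rather than a single boolean, makes the later error analysis easier: you can count which rule fails most often.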
Tier two: model-based grading. For subjective quality, like whether a drafted reply is complete and on-tone, an LLM-as-judge can score at scale. The catch is that you must validate the judge against human grades on a sample before you trust it. An unvalidated judge is just a second opinion you have not checked. Frameworks for this, including the metric-and-rubric approach described in Galileo's agent evaluation guide, exist to make these grades reproducible rather than ad hoc.
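Validating a judge can start as simply as comparing its verdicts to human verdicts on the same sample. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for what two graders would reach by chance. The labels are hypothetical, and the threshold you require before trusting the judge is a decision for your team, not something this code supplies.

```python
def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each grader's label frequencies.
    chance = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if chance == 1.0:
        return observed, float("nan")  # kappa is undefined when only one label is used
    return observed, (observed - chance) / (1 - chance)


# Hypothetical sample: ten draft replies graded by a human and by the LLM judge.
human = ["pass", "pass", "pass", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement: {observed:.2f}, kappa: {kappa:.2f}")
```

In this made-up sample the judge agrees with the human on 8 of 10 cases, but kappa is only about 0.52, because both graders pass most replies and would agree often by chance alone. The two disagreements are also the first cases worth reading by hand.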
Tier three: human review. Reserve scarce expert attention for the genuinely ambiguous cases and for periodic spot checks of the other two tiers. Tooling such as LangSmith's evaluation features helps here by letting you store a dataset, run the agent against it, and track which cases pass over time, so human reviewers look at deltas instead of re-reading everything.
The discipline is to push every check as far down the tiers as it will go. If a rule can be a deterministic assertion, never make it an LLM judgment, and never make it a human's job.
Wire Evals Into the Lifecycle, Not the Demo
A test suite that runs once before launch is theater. The value comes from running it continuously, because the things that change underneath an agent never stop changing: prompts get edited, models get upgraded, vendors change their invoice templates, a new product line lands in the quote engine.
Treat the cheap assertion tier like unit tests and run it on every prompt or code change. When someone tweaks a prompt to fix one case, the suite tells you immediately whether they broke three others. This is the difference between an agent that improves over time and one that drifts silently until a customer complains.
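The "fixed one, broke three" check is a comparison of two runs over the same golden set. A minimal sketch, assuming each run is stored as a mapping from case ID to pass or fail; `diff_runs` and the case IDs are hypothetical.

```python
def diff_runs(baseline, candidate):
    """Compare two {case_id: passed} runs of the same golden set."""
    shared = baseline.keys() & candidate.keys()
    regressed = sorted(c for c in shared if baseline[c] and not candidate[c])
    fixed = sorted(c for c in shared if not baseline[c] and candidate[c])
    return {"regressed": regressed, "fixed": fixed}


# Hypothetical runs before and after a prompt tweak aimed at inv-007.
baseline = {"inv-001": True, "inv-002": True, "inv-003": True, "inv-007": False, "inv-009": True}
candidate = {"inv-001": False, "inv-002": False, "inv-003": True, "inv-007": True, "inv-009": False}

report = diff_runs(baseline, candidate)
print(report)
```

In a CI job, a non-empty `regressed` list would typically fail the build, so the person who edited the prompt sees the damage before anyone else does.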
Set an explicit release bar. Decide, with the business owner, what pass rate on the golden set is required to ship and which categories are non-negotiable. A 90 percent overall pass rate might be fine, but a single failure on "never fabricate a financial figure" should block release regardless of the average. Reliability over long, multi-step tasks deserves special attention, because errors compound: if each of ten steps succeeds 95 percent of the time and failures are independent, the whole chain succeeds only about 60 percent of the time. Research from METR on the length of tasks agents can complete reliably shows success rates falling as tasks get longer, which is why a multi-step process needs evals at each step rather than only at the end.
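A release bar of this kind reduces to two conditions: the overall pass rate clears a threshold, and no case in a blocking category fails. The sketch below assumes results are stored as dicts with `case_id`, `category`, and `passed`; the category name `no_fabricated_figures`, the 0.90 default, and the `release_decision` helper are illustrative, not a standard.

```python
def release_decision(results, min_pass_rate=0.90, blocking_categories=("no_fabricated_figures",)):
    """Decide whether to ship from a list of {'case_id', 'category', 'passed'} dicts."""
    if not results:
        raise ValueError("no eval results to judge")
    pass_rate = sum(r["passed"] for r in results) / len(results)
    # Any failure in a blocking category stops the release, whatever the average.
    blockers = [
        r["case_id"]
        for r in results
        if r["category"] in blocking_categories and not r["passed"]
    ]
    ship = pass_rate >= min_pass_rate and not blockers
    return {"ship": ship, "pass_rate": pass_rate, "blockers": blockers}


# Hypothetical run: 97 of 100 cases pass, but one failure is a fabricated figure.
results = (
    [{"case_id": f"gen-{i:03d}", "category": "general", "passed": True} for i in range(90)]
    + [{"case_id": f"gen-{i:03d}", "category": "general", "passed": False} for i in range(90, 92)]
    + [{"case_id": f"fab-{i:03d}", "category": "no_fabricated_figures", "passed": True} for i in range(7)]
    + [{"case_id": "fab-007", "category": "no_fabricated_figures", "passed": False}]
)

decision = release_decision(results)
print(decision)
```

Here the pass rate is 97 percent, comfortably above the bar, and the release is still blocked because `fab-007` failed. That is the behavior the business owner should expect and sign off on in advance.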
The honest version of this work is unglamorous. You will spend more time staring at failed cases than writing prompts. That is the job. The teams whose agents survive contact with production are the ones who built the test set first and let it tell them the truth.
How OpenNash Can Help
Most teams know they should evaluate their agents and stall on where to start. The answer is almost always the same: pull the real cases, sit with the people who own the process, and write down what correct means before writing a single prompt.
That is the sequence we run. In the audit phase we map the workflow and pull historical cases from the system of record. In design we turn the process owner's labels into deterministic rules and a release bar, so "good" is defined before anything is built. We build the agent against that golden set, wire the eval suite into CI so every change is checked, and hand over both the agent and the test harness at deployment. You own the evals, which means you can verify the system yourself long after the engagement ends.
If you are evaluating a vendor and they cannot explain their eval strategy in concrete, workflow-specific terms, treat it as a warning. A real eval plan names your documents, your rules, and your failure modes. Book a call to map this to your workflow and build the test set that proves your agent actually works.
The next move is small and you can make it today. Open your ticketing or invoicing system, export 50 of last quarter's hardest cases, and write down what the right answer should have been for each one. That list is your first eval, and it will tell you more than any leaderboard ever could.