After Zero to Agent · no prior background needed

How to think about testing your agents

A plain-English field guide to evals. We follow one refund agent the whole way through, so every idea lands on something real. By the end you will know what to watch, what to save, and what to test before the agent meets a customer.

The whole guide, in five words

Watch: read real runs

Name: label the failure

Save: turn into a test

Test: before go-live

Watch: real use, on loop

Every reliable agent sits on a loop like this. That's it.

Orientation

How to use this guide

There is a lot of mystique around evals. There should not be. Underneath the buzzwords, the whole thing rests on a small loop: watch real runs, name what broke, save the important cases, test before letting changes go live, and keep watching real customer use.

This guide builds those ideas one at a time. Each module assumes only what came before it, and every module returns to the same refund agent.

If you have not read Zero to Agent yet, start there. It gives you the basic mental model for what is happening in this space. This guide is for what comes after: you built an agent, the prototype works most of the time, and now you need confidence before it touches real customers.

Building a prototype that works 80% of the time is easy. That is not enough for real customer use, especially in large enterprises or sensitive healthcare, finance, and insurance workflows. If you want a simple way to test an agent you built as a fast prototype, this guide is for you.

If you only remember one sentence

An eval is not a grade. It is a memory of a mistake you do not want to repeat. Your agent told a customer the refund window was 60 days. The real policy is 30. You save that exact case. From now on, every new version must answer it correctly before it goes live.

This is not a phase you do after the "real" work. It is part of building, the way reviewing support calls is part of improving a support process. You do not wait for an incident to learn what customers are hearing. The minimum viable version is simple: review a small batch of real runs after meaningful changes, save the failures that matter, and keep the saved cases small enough that people still trust them.

A real run, often called a trace, is the full recording of one conversation: what the customer said, what the agent did, every lookup or action, what came back, and every reply. Think of it like a call recording plus the agent's work notes. Traces are where evals come from.

Meet our refund agent

We will follow one agent through the entire guide. A customer messages your support bot:

Customer: Can I get a refund on invoice #123?

A bad agent answers from memory and guesses the policy. A good agent looks up the invoice, checks the refund policy, decides based on what it found, and explains the next step.

An eval is what makes sure your next version still behaves like the good one. Every module comes back to this example.

Start here · 30-minute eval · Do this today

Read 10 runs

Pick real conversations. Not imagined tests.

Mark first failure

Find the earliest step where the agent went wrong.

Name top 3

Group the repeated mistakes. Count them.

Save 5 cases

Write pass/fail checks for the important ones.

Run before go-live

If a must-pass case fails, the change waits.

That is the simple version. The output is concrete: your top 3 failure patterns, 5 saved cases, and one check before go-live.

One eval card · in plain English · What you are making

Customer asks

Use the real message or a close copy of it.

Agent must

Name the behaviour you expect in plain language.

Passes if

List the checks that make this safe to trust.

Fails if

Name the mistakes that should block the change.

Owner

Pick the person who can say yes or no.

Run before go-live

Every new version must pass before customers see it.

This is the thing you are building. The rest of the guide shows where each field comes from.

Watch Real Runs

Before you test anything, read what happened

Zero to Agent ended on three words: watch the traces. This is where you do it. A trace is the full recording of one run: the customer's message, every lookup or action, what the agent found, and the final answer.

Do not start by writing tests. Start by reading real conversations and asking three questions:

What did the customer want?
What did the agent do?
Where was the first mistake?

That third question matters most. Mistakes cascade: a wrong step early tends to push every later step off course too. A wrong final answer tells you less than the first broken step does. When you are starting out, mark only the first failure in each run. Fix the cause and the later mess often clears up on its own.

Customer

The message or task the agent received.

Agent step

What the agent decided to do next.

Lookup / action

The search, database check, email, refund, or other action it requested.

Tool result

What came back from the tool.

Final answer

What the customer or user finally saw.
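As a rough sketch, the parts of a trace listed above could be recorded like this. The `Step` and `Trace` classes, the `kind` values, and the `lookupInvoice` call are assumptions made for illustration. They are not a format that this guide or any particular tool prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str      # "customer", "agent_step", "action", "tool_result", or "final_answer"
    content: str   # what was said, decided, requested, or returned


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)

    def of_kind(self, kind: str) -> list[Step]:
        """Return every step of one kind, in the order it happened."""
        return [s for s in self.steps if s.kind == kind]


# A hypothetical good run for the refund agent.
trace = Trace(
    run_id="example-1",
    steps=[
        Step("customer", "Can I get a refund on invoice #123?"),
        Step("agent_step", "I need the invoice before I can decide."),
        Step("action", "lookupInvoice(123)"),
        Step("tool_result", "invoice 123: paid 12 days ago"),
        Step("final_answer", "Your invoice is eligible. Here is the next step."),
    ],
)

for number, step in enumerate(trace.steps, start=1):
    print(f"{number:02d} {step.kind}: {step.content}")
```

Keeping the steps in order is the point. It lets a reviewer read the run top to bottom and mark the first row that went wrong.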

Demo · Mark the first failure · FIG 01

A real refund-agent trace, played back. Find the row where it first goes wrong, then look at what happens to everything after it.

refund-agent · run 8f2a · failed · 6 steps

01 · customer

Can I get a refund on invoice #123?

02 · agent · think

Customer wants a refund. I know our refund policy is 60 days. I will tell them yes.

03 · agent · reply

Yes, you're within our 60-day refund window. I'll process your refund now.

04 · action

createRefund({ invoice: "123" })

05 · tool result

error: invoice not found

06 · agent · reply

There was a small issue on our end. Your refund is being processed and should appear in 3 to 5 business days.


The final reply (row 06) is obviously bad. But fixing row 06 fixes nothing. The whole run was broken back at row 02, when the agent promised a refund before checking the invoice, customer eligibility, or policy. First failure first.

A few habits worth picking up

Read it like a customer first. Before diving into technical details, read the conversation the way the user would. Usually the failure is obvious from the surface and you save yourself the spelunking.

The trace does not end when the AI hands off. If your agent escalates to a human, the run is not over until the customer's problem is actually solved. Some of the worst failures live right at that handoff: escalating too early, too late, or without giving the human any context.

Ask the agent itself. Replay the run as the model saw it, then ask the same model: "You got this wrong. The answer was X. What would I have needed to change for you to get it right?" Its reply is not gospel. It is a fast clue about where your instructions are being misread.

Do this now: read 10 traces and mark only the first bad step in each one.

Name the Failure

"Bad answer" isn't a category

Once you have read enough runs, you will notice the same problems recurring. The skill is naming them precisely. "Bad answer" tells you nothing. The actual pattern tells you what to fix.

Vague vs precise · FIG 02a

Vague label	What it actually was
Bad answer	Invented the refund policy
Lookup issue	Used the wrong lookup or action
Hallucination	Answered without using the policy or account information it had looked up
Bad workflow	Skipped the human-approval step
Weird behaviour	Looped instead of stopping

The left column is what you say in standup. The right column is what you fix on Monday.

The method is plain. Read runs, write quick notes about what broke, then group the similar notes into named patterns and count them. That is it: write notes, then group notes. Each good failure name points to an ownerable fix: policy, lookup, workflow, approval, or escalation.

Start with 10 runs. If the same 3 to 5 mistakes keep appearing, you have enough to act. As the agent matters more, read more. The goal is not to catalogue every conceivable failure. It is to find the ones that happen most.

Demo · Notes, then groups · FIG 02b


What this should produce: a short list like "invented policy: 3, wrong tool: 1, skipped approval: 1." That list tells you what to fix first.
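The notes-then-groups step can be sketched in a few lines. The run ids and labels below are made-up illustrations, and in practice the labels come from a person reading the runs, not from code.

```python
from collections import Counter

# Hypothetical scratch notes: (run id, failure label chosen by the reviewer).
notes = [
    ("run-1", "invented policy"),
    ("run-2", "invented policy"),
    ("run-3", "wrong tool"),
    ("run-4", "invented policy"),
    ("run-5", "skipped approval"),
    ("run-6", None),  # this run had no failure
]


def count_failures(notes):
    """Count how often each named failure appears, most frequent first."""
    labels = [label for _, label in notes if label is not None]
    return Counter(labels).most_common()


summary = count_failures(notes)
print(", ".join(f"{label}: {n}" for label, n in summary))
```

A spreadsheet does the same job. What matters is that every label is specific and every label is counted.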

That counted list is where your evals come from. Six runs in, the top three problems are obvious, and you know exactly what to fix first.

One person owns the quality bar

Put a single accountable reviewer in charge, ideally someone who knows the domain: a support lead for a support bot. One owner kills the endless debates about whether something is really a failure. Add more reviewers only if your domain genuinely demands it.

Do not hand this off to an outside vendor. The whole value of reading your own failures is the product intuition you build doing it. Outsource that and you break the link between seeing a failure and knowing how to fix it. You can use an AI to help group your notes. But you read the raw runs yourself, always.

How to get your team to care

Do not pitch "evals." Do the reading yourself, then tell the story of what you found. "Here are the top five ways we are failing real customers, here is how often, and here is the one I already fixed." Nobody argues with a list of real, embarrassing failures.

Do this now: turn your trace notes into the top 3 repeated failure names.

Trust Beats Scores

A high score can still hide a dangerous failure

Before you write pass/fail checks, decide what failure means. A 99% pass rate is useless if the 1% is the agent refunding the wrong customer, leaking private data, or bluffing when it should ask for help.

For an agent that stands in for a human, your goal is not a pretty average score. Your goal is trust where trust matters. Rank failures by the harm they cause: money, privacy, legal exposure, customer trust, and time wasted by your team.

The litmus test

If you could ship at a 90% pass rate or a 99% pass rate, which do you pick? If your gut says "99%, obviously," you are chasing scores. If your first question is "which 1% fails, and how badly?", you are earning trust.

Think about a finance agent that cancels subscriptions and moves money. Impressive. Until the day it confidently gives the wrong balance or moves money without being sure. Then you stop trusting all of it. For agents that replace human work, a confident wrong answer is worse than an honest "I don't know." Trust beats cleverness.

Demo · How autonomous is your agent? · FIG 03

← human checks everything · agent acts on its own →

Autocomplete: word suggestions

Copilot: writes code, you approve

Support bot: refunds, ships emails

Banking agent: moves real money

AI doctor: diagnoses unsupervised

Chasing scores

Looks better in a dashboard

Make it more capable. Higher average quality. Better demos. A confident wrong answer is part of the cost of doing business.

Earning trust

Works better in real use

Make it reliable where reliability matters. An honest "I don't know" is a correct outcome. Refusing to invent is a feature, not a regression.

The more the agent can do without a person, the more one bad failure matters. Most product teams sit further right than they think.

The cheapest reliability win there is

Teach the agent to refuse. If it is not confident, if the question is out of scope, if it cannot find the policy or the evidence is weak, it should say "I am not sure, let me get a human." It feels like a step backward (your "success rate" dips), but an honest "I don't know" is a correct outcome, not a failure.

For our refund agent: if it cannot find invoice #123, it should say so and escalate. Never invent a status.

Do this now: write the 3 failures that would make you lose trust, even if everything else looked good.

Save Your Golden Cases

The 5 to 10 things your agent must never get wrong

Now you turn the failures that matter into tests. Start with golden cases: a small set of must-handle customer situations your agent must always handle correctly.

A golden case is not an engineering test first. It is a customer moment where a wrong answer costs trust, money, escalation time, or legal risk.

You do not need fancy tooling. Write them down somewhere and start with the single most common request your agent must never botch. For a refund or CX agent, the list might be:

G-01

Eligible refund within 30 days

G-02

Shipping status with order lookup

G-03

Cancel subscription safely

G-04

Angry customer asks for escalation

G-05

Missing order needs investigation

G-06

Human handoff with context

G-07

Pricing question cites source

G-08

Out-of-scope request refuses clearly

The rule that makes this work

If the agent fails a golden case, you do not let the change go live. No exceptions. That is the whole point of designating them.

Write each case as pass or fail. Never a 1 to 5 rating. Scores like "4 out of 5" feel informative but they are a trap. The gap between a 3 and a 4 is subjective, reviewers disagree, and everyone hides indecision in the middle. Pass or fail forces a clear call. If you want more detail, break one case into several pass or fail checks.

Same answer, two ways to grade it · FIG 04

1 to 5 rating

Subjective, slow, debated

"Yes you can. I'll process the refund now."

Alex = 3

Sam = 4

Jordan = 3

Average 3.3. Now what? Reviewers spent 20 minutes arguing about whether it's a 3 or a 4. Nobody knows what changed when next week's run is "3.7."

Pass or fail

Fast, clear, countable

"Yes you can. I'll process the refund now."

Looked up invoice
Checked policy
Stayed in scope
Flagged uncertainty

3 of 4 checks FAIL. The team knows exactly what to fix and exactly when the next version is better.

If a score feels informative but never changes behaviour, it is not informative. Pass or fail checks change behaviour because they point at the specific thing to fix.

Here is our refund case as a real eval. Plain English first:

G-01 · Standard refund request

Customer asks: "Can I get a refund on invoice #123?"

Passes if:

It looks up the invoice.
It checks the refund policy.
It approves only if the invoice is actually eligible.
It explains the next step clearly.

Fails if:

It invents a policy.
It refunds without checking.
It answers confidently while missing data.
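For teams that want the same case in code, here is a minimal sketch. It assumes you can summarise a run into a few yes/no facts. The `RunSummary` fields and the `grade_g01` function are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """What one run did. All field names are hypothetical."""
    looked_up_invoice: bool
    checked_policy: bool
    invoice_eligible: Optional[bool]  # None means the agent never found out
    issued_refund: bool
    explained_next_step: bool


def grade_g01(run: RunSummary) -> dict[str, bool]:
    """Return one pass/fail verdict per check for golden case G-01."""
    return {
        "looked up the invoice": run.looked_up_invoice,
        "checked the refund policy": run.checked_policy,
        # A refund is acceptable only when eligibility was confirmed.
        "refunded only if eligible": (not run.issued_refund) or run.invoice_eligible is True,
        "explained the next step": run.explained_next_step,
    }


def passes(checks: dict[str, bool]) -> bool:
    """The case passes only if every check passes."""
    return all(checks.values())


# The failed run from the trace demo: refund promised with no lookup.
bad_run = RunSummary(False, False, None, True, False)
good_run = RunSummary(True, True, True, True, True)

for name, run in [("bad", bad_run), ("good", good_run)]:
    print(name, "PASS" if passes(grade_g01(run)) else "FAIL")
```

Each entry is a yes or a no. There is no partial credit, which is what keeps the verdict easy to act on.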

Copy this · one eval card · keep it pass/fail

Customer asks

Can I get a refund on invoice #123?

Agent must

Look up the invoice, check policy, then decide.

Passes if

Uses the right tools, approves only if eligible, explains next step.

Fails if

Invents policy, skips lookup, refunds while missing data.

Owner

Support lead owns the quality call.

Run when

Before every go-live, and after refund-policy changes.

If a golden case fails, do not let the change go live. That is the whole power of the card.

Blank card · print this · fill one per failure

Customer asks

Write the exact user message.

Agent must

Write the behaviour you want.

Passes if

List the yes/no checks.

Fails if

List the mistakes to block.

Owner

Name the person who decides.

Run when

Before go-live, after policy changes, or both.

Simple test, clear owner, no fuzzy score.

Two traps to avoid

Don't write a giant set of tests before you've seen real failures. It is tempting to imagine every way things could break and test all of them up front. Resist it. You will spend effort on problems that never happen while the real ones walk past. Write tests for failures you have actually seen. The one exception is hard, known rules. "Never recommend a competitor" is fine to lock in from day one, because you already know exactly what success looks like.

Fix the obvious stuff instead of testing it. A huge share of "failures" are things you never told the agent. Too verbose? You never asked for brevity. Wrong format? Wasn't in the prompt. Fix the instruction. Save evals for failures that survive the easy fixes.

Do this now: write 5 golden cases using the card above.

Test the Real Agent

Not the prompt. The whole system.

The most expensive mistake here is testing your prompt in isolation. Once an agent is wired up to the invoice lookup, policy source, approval rules, and what the agent remembers, the behaviour lives in the whole system, not in the prompt text.

A prompt that looks perfect on its own can fall apart the moment a real lookup returns a messy result. So good offline tests take an input, run the real agent, and check what actually happened: the answer, the lookups or actions, and what it actually changed, like creating a refund or sending an email.

For our refund agent, a real test checks four things:

Did it use the right lookups or actions? Invoice lookup before policy decision.
Did it use the result correctly? No guessing after a failed lookup.
Did it avoid risky action? No refund unless eligibility is clear.
Did it explain the next step? Clear answer, no vague reassurance.

No-code version · what the test proves · For the business owner

Question asked

"Can I get a refund on invoice #123?"

Evidence needed

Invoice record, refund policy, and whether this customer is eligible.

Pass condition

The agent checks the evidence first, then approves, refuses, or escalates clearly.

Risk blocked

No invented policy, no refund without eligibility, no vague promise to the customer.

That is the whole test in plain English: the agent must check the right evidence first, make the right decision, and avoid risky actions when the customer is not eligible.
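For engineers, one way to check the path and not just the reply is to hand the agent tools that record every call. The sketch below makes several assumptions: the tool names, the `careful_agent` stand-in (your real agent goes there), and the shape of the invoice record are all hypothetical. It covers only the path-based checks. Judging whether the explanation was clear still needs a person or a separate check.

```python
def make_recording_tools(invoices, policy_days):
    """Build fake lookup/action tools that log every call in order."""
    calls = []

    def lookup_invoice(invoice_id):
        calls.append("lookup_invoice")
        return invoices.get(invoice_id)  # None when the invoice is missing

    def get_refund_policy():
        calls.append("get_refund_policy")
        return {"window_days": policy_days}

    def create_refund(invoice_id):
        calls.append("create_refund")
        return "ok"

    tools = {
        "lookup_invoice": lookup_invoice,
        "get_refund_policy": get_refund_policy,
        "create_refund": create_refund,
    }
    return tools, calls


def careful_agent(message, tools):
    """Stand-in for your real agent: looks things up before acting."""
    invoice = tools["lookup_invoice"]("123")
    if invoice is None:
        return "I could not find that invoice. Let me get a human."
    policy = tools["get_refund_policy"]()
    if invoice["age_days"] <= policy["window_days"]:
        tools["create_refund"]("123")
        return "Your refund is approved. You will get a confirmation email."
    return "This invoice is outside the refund window."


def check_run(calls, refund_allowed):
    """Pass/fail checks on the path the agent took."""
    refunded = "create_refund" in calls
    looked_up_first = "lookup_invoice" in calls and (
        not refunded or calls.index("lookup_invoice") < calls.index("create_refund")
    )
    return {
        "lookup before decision": looked_up_first,
        "no refund unless allowed": refund_allowed or not refunded,
    }


# Scenario: the invoice does not exist, so no refund should be issued.
tools, calls = make_recording_tools(invoices={}, policy_days=30)
reply = careful_agent("Can I get a refund on invoice #123?", tools)
verdicts = check_run(calls, refund_allowed=False)
print(calls, verdicts)
```

The fake tools stand in for the real systems so the test is safe to run. The agent logic under test should be the real one.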

Two testing jobs, two setups

Before each go-live. Run a small, curated set: your golden cases plus saved past bugs. It runs constantly, so keep it cheap and fast.

In real customer use. Sample real traffic. You usually have no "correct answer" to compare against, so you watch quality trends over time instead.

The handshake: when real customer use surfaces a new failure, add it to the go-live set so it can never sneak back. That is the whole loop, in one sentence.
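A minimal sketch of that handshake, assuming each saved case can be run and reduced to a pass or a fail. The case names and the stub verdicts are placeholders. A real `run_case` would run your agent.

```python
def go_live_gate(case_names, run_case):
    """Run every saved case. The change may go live only if none fail."""
    failed = [name for name in case_names if not run_case(name)]
    return len(failed) == 0, failed


# The go-live set: golden cases plus saved past bugs.
go_live_set = [
    "G-01 eligible refund within 30 days",
    "G-08 out-of-scope request refuses clearly",
]

# The handshake: a failure seen in real customer use joins the set.
go_live_set.append("BUG escalated a simple refund question")

# Stub verdicts standing in for real agent runs (hypothetical).
stub_verdicts = {
    "G-01 eligible refund within 30 days": True,
    "G-08 out-of-scope request refuses clearly": True,
    "BUG escalated a simple refund question": False,
}

ok, failed = go_live_gate(go_live_set, lambda name: stub_verdicts[name])
print("go live" if ok else f"blocked by: {failed}")
```

Once the bug is fixed and its case passes, the case stays in the set, so the same failure blocks any later change that brings it back.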

Do this now: pick one golden case and ask your team to show the full run: user question, lookups, decision, and any action taken.

Block vs Measure

Two kinds of safety check. Different timing.

The two differ in when they act. Blocking guardrails stop risky actions before they happen. Measuring reviewers help you learn what to improve after the fact.

Where each one sits · FIG 06

customer

→

refund agent

→

blocking guardrail · live

→

customer sees reply

after reply · measuring reviewer reads a copy → dashboard

Blocking guardrail

In the live path · fast · simple

Fires before the customer sees anything. Use dead-simple rules for clear, high-stakes problems: leaked personal data, profanity, malformed output. Or, for our agent: refunds above $X go to a support lead before anything is sent or processed. When it trips, the system blocks, redacts, or asks for sign-off. Because customers feel it when it fires, treat a false block as a serious bug. Keep guardrails conservative.

Measuring reviewer

After the response · usually in the background

Measures the nuanced things a simple rule cannot. Was the tone right? Was the policy explained correctly? Did it overpromise? Engineers may call this an evaluator. Its verdicts feed your dashboards and your next round of fixes. It does not block the live answer.

A blocking guardrail is a gate. A measuring reviewer is a thermometer. You need both, but never confuse them. A slow AI judge in the live path is a slow product.

The rule of thumb

Anything that touches money, customer data, or real-world actions gets a blocking guardrail with an approval gate. Reading and looking things up is low-risk. Doing things (refunding, emailing, deleting) is where you put the hard stops.
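The rule of thumb could look something like this in code. This is only a sketch: the action names are hypothetical, and the threshold is passed in because the right amount is a business decision. The value used in the example is an arbitrary placeholder, not a recommendation.

```python
from decimal import Decimal

# Hypothetical read-only actions that never need sign-off.
READ_ONLY = {"lookup_invoice", "get_refund_policy"}


def guard_action(action, amount, approval_threshold):
    """Decide whether a requested action may run without a human.

    Returns "allow" or "needs_approval". Lookups pass through, small
    refunds pass through, and everything else waits for sign-off.
    """
    if action in READ_ONLY:
        return "allow"
    if action == "create_refund" and amount <= approval_threshold:
        return "allow"
    return "needs_approval"


threshold = Decimal("50")  # placeholder value for illustration only

print(guard_action("lookup_invoice", None, threshold))
print(guard_action("create_refund", Decimal("20"), threshold))
print(guard_action("create_refund", Decimal("500"), threshold))
print(guard_action("delete_account", None, threshold))
```

Defaulting unknown actions to "needs_approval" keeps the guardrail conservative: a new action is gated until someone decides otherwise.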

On AI judges

Use an AI judge when you need judgment, not just rules. "Was this empathetic?" "Did it explain the policy correctly?" Sometimes you need a model to grade another model's output. The technical label is LLM-as-judge. Powerful but expensive. It needs human-labelled examples and ongoing validation to be trustworthy.

Do not start here. Use cheap rules first. Did it include an order number? Was the answer in the right format? Did it request the refund action? Reach for a judge only for nuanced failures you will be working on for a while. Validation notes are in Only After This Works.
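Those three cheap rules fit in a few lines. This sketch assumes an order number looks like `#123`, that "right format" means a non-empty reply under 600 characters, and that the refund action is called `create_refund`. All three are illustrative assumptions to replace with your own.

```python
import re


def cheap_checks(reply, requested_actions):
    """Rule-based pass/fail checks that need no model to grade."""
    return {
        "mentions an order number": re.search(r"#\d+", reply) is not None,
        "right format": 0 < len(reply.strip()) < 600,
        "requested the refund action": "create_refund" in requested_actions,
    }


result = cheap_checks(
    "Your refund for invoice #123 has been requested.",
    ["lookup_invoice", "create_refund"],
)
print(result)
```

Rules like these run in milliseconds and cost nothing per run, so they can cover every case in the go-live set.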

Do this now: choose one live action that needs approval before the agent can do it.

Keep Watching Real Use

The loop that makes the agent less embarrassing over time

Offline tests catch what you thought would break. Production, meaning real customer use, shows you what actually confuses real customers. This is where evals stop being theory and become an operating rhythm.

Watch real runs. Find a failure. Reproduce it. Save it as an eval. Fix it. Ship. Keep watching.

Demo · One real bug travels the whole loop · FIG 07

STEP 01

Watch

Sample real customer use. Read what your agent actually said this week.

STEP 02

Name

Spot the pattern. Write the precise failure label.

STEP 03

Save

Reproduce locally. Add as a golden case, a must-pass example.

STEP 04

Test

Fix. Re-run the saved cases. Verify the case now passes.

STEP 05

Watch

Go live. The new test is part of every future go-live check.

A real bug, end to end. The agent kept escalating simple refund questions unnecessarily. By step 5 the fix is live and that exact failure is locked out of every future go-live check. That is what the loop earns you.

How much machinery you need scales with your volume. At a few runs a day, just read them all. As traffic grows, you let the system tell you what deserves attention. Recurring stumbles become tracked issues. Big changes get tested on real customer traffic before rolling out to everyone. Do not reach for the heavy monitoring setup before you have read enough raw runs to know what you are looking for.

Prune ruthlessly

The trap: you add every bug you ever find to the saved cases. Six months later you have 500 checks, 400 of them weird edge cases that never recur, your go-live checks take 20 minutes, and the team starts ignoring failures because "it's always something."

Twenty high-signal tests beat two hundred low-signal ones. If a test has not failed in three months, ask whether it is still earning its place.
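If you record when each saved case last failed, a short script can list the candidates for review. This sketch treats "three months" as 90 days, which is an assumption, and the case names and dates are made up. It only flags cases. A person still decides what to drop, and golden cases usually stay even when they never fail.

```python
from datetime import date, timedelta


def stale_cases(last_failed, today, max_age_days=90):
    """Return cases that have not failed within max_age_days, for human review."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, failed_on in last_failed.items()
        if failed_on is None or failed_on < cutoff
    )


# Hypothetical history: case name -> date of its most recent failure.
history = {
    "G-01 eligible refund": date(2025, 5, 20),
    "edge-17 emoji in invoice id": date(2024, 11, 2),
    "edge-42 refund in two currencies": None,  # has never failed
}

print(stale_cases(history, today=date(2025, 6, 1)))
```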

The honest commitment

Plan to spend a meaningful, permanent slice of your agent's upkeep on this work. Call it 10 to 20%. Reading runs. Tuning checks. Investigating issues. This is the cost of a reliable agent. Teams that skip it pay more, later, in incidents.

You've reached the top

The whole thing, in one breath: read real runs, name the failures that matter, save the important ones as pass or fail tests, test the real agent before every go-live, keep watching real customer use and prune as you go.

Every reliable agent sits on a loop like this. Read 10 runs today. Read more when patterns stop surprising you.

You are ready when

You have read real runs.

You know the top failure patterns.

You have 5 to 10 golden cases.

Each case is pass or fail.

The cases run before go-live.

Someone keeps watching real customer use.

Five questions to ask your vendor or team

01 Show me real traces.

02 Show me the top failure patterns.

03 Show me the golden cases.

04 Show me what blocks go-live.

05 Show me what you monitor in real customer use.

Good answer: They can point to real examples, not vibes.

Want help running this loop?

OpenNash doesn't just build agents. We help you build the operating loop around them, so agent changes can go live without guessing.

We can review your first 50 runs, identify the top failure patterns, give you your first 10 golden evals, and set up the go-live checks that keep those failures from coming back.


★ Only After This Works

Optional engineer notes

Stop here if you are new. The main loop is enough to begin. These notes are only for teams already running evals and ready for the next layer.

0 · Minimum viable setup

Read traces before buying tools

After a meaningful change, review 20 to 50 real runs by hand. For a serious cycle, review around 100 or continue until new traces stop revealing new failure patterns.

A · Validating a judge

Check the judge

An unvalidated judge is a second opinion you cannot trust. Hand-label a small set yourself, then check whether the judge agrees with those human labels before you trust its dashboard.
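One way to run that check, sketched under the assumption that both the human and the judge give a pass (True) or fail (False) for the same runs. The labels below are invented for illustration.

```python
def judge_agreement(human, judge):
    """Compare judge verdicts with human pass/fail labels on the same runs."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    judge_on_human_fail = [j for h, j in pairs if not h]
    judge_on_human_pass = [j for h, j in pairs if h]
    return {
        "agreement": agree / len(pairs),
        # Of the runs humans failed, what share did the judge also fail?
        "caught_failures": (
            sum(not j for j in judge_on_human_fail) / len(judge_on_human_fail)
            if judge_on_human_fail else None
        ),
        # Of the runs humans passed, what share did the judge also pass?
        "kept_passes": (
            sum(j for j in judge_on_human_pass) / len(judge_on_human_pass)
            if judge_on_human_pass else None
        ),
    }


# Hypothetical labels for eight runs.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]

report = judge_agreement(human_labels, judge_labels)
print(report)
```

Overall agreement alone can flatter a judge when most runs pass, which is why the sketch also reports how many human-marked failures the judge caught.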

B · Multi-step agents

Find the broken handoff

For agents with several steps, make a simple grid: last good step by first failed step. The biggest cell tells you where to fix first.
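The grid is just a count of pairs. In this sketch the step names and the reviewed runs are made up. Each pair is what a reviewer wrote down for one failed run.

```python
from collections import Counter

# Hypothetical steps for the refund agent, in order.
STEPS = ["understand request", "look up invoice", "check policy", "decide", "reply"]

# One entry per failed run: (last good step, first failed step).
failures = [
    ("understand request", "look up invoice"),
    ("understand request", "look up invoice"),
    ("look up invoice", "check policy"),
    ("understand request", "look up invoice"),
    ("check policy", "decide"),
]

grid = Counter(failures)

# Print the grid: rows are the last good step, columns the first failed step.
for good in STEPS:
    row = [grid.get((good, bad), 0) for bad in STEPS]
    print(f"{good:<20}", *row)

worst_pair, count = grid.most_common(1)[0]
print("fix first:", worst_pair, "seen", count, "times")
```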

C · No real customer traffic yet?

Generating synthetic test data

Do not ask an LLM for "test queries." Write a small table of situations first, then have the LLM turn those situations into natural messages. Replace them with real examples as soon as you have traffic.
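A sketch of the table-first approach. The three dimensions and their values are assumptions chosen for the refund agent. The code only builds the instructions and makes no model call, so you can send them to whichever model you use.

```python
from itertools import product

# Hypothetical dimensions of the situations table.
invoice_states = ["within the refund window", "outside the refund window", "not found"]
customer_moods = ["calm", "frustrated"]
requests = ["full refund", "partial refund"]


def build_prompts():
    """Turn every combination of situation values into one instruction for an LLM."""
    prompts = []
    for state, mood, request in product(invoice_states, customer_moods, requests):
        prompts.append(
            f"Write one realistic support message from a {mood} customer asking for a "
            f"{request}. Their invoice is {state}. Do not mention these instructions."
        )
    return prompts


prompts = build_prompts()
print(len(prompts), "situations")
print(prompts[0])
```

Writing the table first means you choose the coverage. The model only supplies the wording.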

D · When to stop reading runs

Stop when new examples stop teaching you

When new runs stop showing new failure patterns, you have seen enough for now. Re-run the exercise after meaningful changes: new feature, prompt rewrite, model swap.

E · Cheap checks first

Do not start with an LLM judge

Use assertions, schemas, exact matches, execution tests, policy thresholds, and citation checks before fuzzy grading. Add an LLM judge only when the failure is important and subjective.

F · Agent workflows

Score the outcome, then the first failure

For multi-step agents, first ask whether the user goal was completed. Then mark the earliest upstream failure and only add step-level diagnostics where failures cluster.

Further reading: Hamel Husain and Shreya Shankar's LLM Evals FAQ is a useful companion on trace review, error analysis, binary checks, judge validation, RAG evals, and agent workflow debugging.

★ Glossary · quick reference · APPENDIX

Eval: A saved example of something your agent must get right. A memory of a mistake you refuse to repeat. Not the same as a model benchmark.
Benchmark: A broad, generic capability score for a model (the kind labs publish). Useful for picking a model. Useless for knowing if your agent works.
Trace: The full recording of one run: the message, every lookup or action, what the agent found, and the final answer. What you read and what you test.
Error analysis: Reading real runs, writing notes on what broke, and grouping those notes into named, counted failure patterns. The highest-return activity in building agents.
Golden case: One of the 5 to 10 must-handle customer situations your agent must always pass. If a golden case fails, the change does not go live.
Pass-fail eval: A clear yes-or-no judgment. Preferred over 1 to 5 ratings, which are subjective and inconsistent. Track detail with several pass-fail checks instead.
Refusal: An honest "I don't know" or "I need a human" when the agent is missing data or the task is out of scope.
Offline eval: A test that runs the real agent before go-live. Checks both the answer and the path: lookups, actions, and outcomes. Not the prompt alone.
Online eval: Measurement during real customer use, where you usually have no "correct answer" to compare against. Quality trends over time.
Guardrail: A fast, simple blocking check that fires before the customer sees the output. Blocks a refund over $X. Catches leaked data. A false block is a serious bug.
Evaluator: The technical name for a measuring reviewer: an after-the-fact check of nuanced quality (tone, correctness). Feeds dashboards and fixes. Does not block the live answer.
LLM-as-judge: A model grading another model's output for fuzzy qualities a rule cannot catch. Powerful but expensive. Needs human-labelled examples and validation. Do not start here.
First-failure-first: Marking only the earliest mistake in a run, since later errors usually cascade from it. Fix the cause and the rest often clears up.
Saturation: The point where new runs stop revealing new failure patterns. A signal that you have read enough for now.
Regression test: A saved past failure, re-run before every go-live, so the bug cannot quietly come back.
Synthetic data: Test cases generated by a model. Useful as a starter when you have no real traffic, but unreliable for high-stakes domains.
The loop: Watch, Name, Save, Test, Watch. The whole guide in five words.