How to think about testing your agents?
A plain-English field guide to evals. We follow one refund agent the whole way through, so every idea lands on something real. By the end you will know what to watch, what to save, and what to test before the agent meets a customer.
Orientation
There is a lot of mystique around evals. There should not be. Underneath the buzzwords, the whole thing rests on a small loop: watch real runs, name what broke, save the important cases, test before shipping, and keep watching in production.
This guide builds those ideas one at a time. Each module assumes only what came before it, and every module returns to the same refund agent.
If you have not read Zero to Agent yet, start there. It gives you the basic mental model for what is happening in this space. This guide is for what comes after: you built an agent, the prototype works most of the time, and now you need confidence before it touches real customers.
Building a prototype that works 80% of the time is easy. That is not enough for production, especially in large enterprises or sensitive healthcare, finance, and insurance workflows. If you want a simple way to test the agent you just vibe-coded, this guide is for you.
An eval is not a grade. It is a memory of a mistake you do not want to repeat. Your agent told a customer the refund window was 60 days. The real policy is 30. You save that exact case. From now on, every new version must answer it correctly before it ships.
This is not a phase you do after the "real" work. It is part of building, the way debugging is part of writing software. You do not budget separately for debugging. You just debug. Evals work the same way. The minimum viable version is simple: review a small batch of real runs after meaningful changes, save the failures that matter, and keep the suite small enough that people still trust it.
A trace is the full recording of one conversation: what the customer said, what the agent did, every lookup, every tool result, and every reply. Traces are where evals come from.
Meet our refund agent
We will follow one agent through the entire guide. A customer messages your support bot: "Can I get a refund on invoice #123?"
A bad agent answers from memory and guesses the policy. A good agent looks up the invoice, checks the refund policy, decides based on what it found, and explains the next step.
An eval is what makes sure your next version still behaves like the good one. Every module comes back to this example.
Watch Real Runs
Zero to Agent ended on three words: watch the traces. This is where you do it. A trace is the full recording of one run: the customer's message, every tool call, what got retrieved, and the final answer.
Do not start by writing tests. Start by reading real conversations and asking three questions:
- What did the customer want?
- What did the agent do?
- Where was the first mistake?
That third question matters most. Mistakes cascade: a wrong step early forces every later step to be wrong too. The agent's final answer being wrong tells you less than the first broken step. When you are starting out, mark only the first failure in each run. Fix the cause and the later mess often clears up on its own.
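If it helps to see "mark only the first failure" as data, here is a minimal sketch. The `Step` shape, the `ok` flag, and the example trace are assumptions made up for illustration; your tracing tool will have its own format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded event in a run (hypothetical shape)."""
    kind: str        # "user", "tool", or "agent"
    content: str     # the message text or tool call, as recorded
    ok: bool = True  # the reviewer's call: was this step correct?
    note: str = ""   # the reviewer's quick note when ok is False


def first_failure(trace: list[Step]) -> Optional[int]:
    """Return the index of the earliest step marked wrong, or None."""
    for index, step in enumerate(trace):
        if not step.ok:
            return index
    return None


# A made-up reviewed trace for the refund example.
trace = [
    Step("user", "Can I get a refund on invoice #123?"),
    Step("agent", "Our refund window is 60 days, so you qualify.",
         ok=False, note="Invented the policy; never looked it up"),
    Step("tool", 'createRefund({ invoice: "123" })',
         ok=False, note="Refund attempted without an eligibility check"),
]

index = first_failure(trace)
print(index, trace[index].note)  # 1 Invented the policy; never looked it up
```

Two steps are wrong here, but only the first one gets recorded. The refund attempt is a consequence of the invented policy, not a separate cause.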
A real refund-agent trace, played back. Find the row where it first goes wrong, then watch what happens to everything after it.
createRefund({ invoice: "123" }) -> error: invoice not found
Read it like a customer first. Before diving into tool calls and JSON, read the conversation the way the user would. Usually the failure is obvious from the surface and you save yourself the spelunking.
The trace does not end when the AI hands off. If your agent escalates to a human, the run is not over until the customer's problem is actually solved. Some of the worst failures live right at that handoff: escalating too early, too late, or without giving the human any context.
Ask the agent itself. Replay the run as the model saw it, then ask the same model: "You got this wrong. The answer was X. What would I have needed to change for you to get it right?" Its reply is not gospel. It is a fast clue about where your instructions are being misread.
Name the Failure
Once you have read enough runs, you will notice the same problems recurring. The skill is naming them precisely. "Bad answer" tells you nothing. The actual pattern tells you what to fix.
| Vague label | What it actually was |
|---|---|
| Bad answer | Invented the refund policy |
| Tool issue | Called the wrong tool |
| Hallucination | Answered without checking retrieved context |
| Bad workflow | Skipped the human-approval step |
| Weird behaviour | Looped instead of stopping |
The method is plain. Read runs, write quick notes about what broke, then group the similar notes into named patterns and count them. That is it: write notes, then group notes. It is the highest-return activity in the whole field.
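The "group and count" step needs nothing more than a tally. A minimal sketch, assuming your notes are already labelled with a pattern name; the run ids and notes below are invented for illustration.

```python
from collections import Counter

# Hypothetical review notes: (run id, named failure pattern).
notes = [
    ("run-01", "Invented the refund policy"),
    ("run-02", "Called the wrong tool"),
    ("run-03", "Invented the refund policy"),
    ("run-04", "Skipped the human-approval step"),
    ("run-05", "Invented the refund policy"),
]

# Count how often each named pattern appears, most frequent first.
pattern_counts = Counter(pattern for _, pattern in notes)
for pattern, count in pattern_counts.most_common():
    print(f"{count}  {pattern}")
```

A spreadsheet with a pivot table does the same job. The point is the counted list, not the tool.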
Keep going until you stop learning anything new. When the same handful of mistakes keep repeating and fresh runs do not surface new ones, you have seen enough. Start with a small batch if that is all you have. As the agent matters more, read more. The goal is not to catalogue every conceivable failure. It is to find the ones that happen most.
Put a single accountable reviewer in charge. Ideally someone who knows the domain. A support lead for a support bot. One owner kills the endless debates about whether something is really a failure. Add more reviewers only if your domain genuinely demands it.
Do not hand this off to an outside vendor. The whole value of reading your own failures is the product intuition you build doing it. Outsource that and you break the link between seeing a failure and knowing how to fix it. You can use an AI to help group your notes. But you read the raw runs yourself, always.
Do not pitch "evals." Do the reading yourself, then tell the story of what you found. "Here are the top five ways we are failing real customers, here is how often, and here is the one I already fixed." Nobody argues with a list of real, embarrassing failures.
Trust Beats Scores
Before you write pass/fail checks, decide what failure means. A 99% pass rate is useless if the 1% is the agent refunding the wrong customer, leaking private data, or bluffing when it should ask for help.
For an agent that stands in for a human, your goal is not a pretty average score. Your goal is trust where trust matters.
If you could ship at a 90% pass rate or a 99% pass rate, which do you pick? If your gut says "99%, obviously," you are chasing scores. If your first question is "Which 1% fails, and how badly?" you are earning trust.
Think about a finance agent that cancels subscriptions and moves money. Impressive. Until the day it occasionally cannot answer "how much is in my account?" Then you stop trusting all of it. For agents that replace human work, a confident wrong answer is worse than an honest "I don't know." Trust beats cleverness.
Chasing scores
Make it more capable. Higher average quality. Better demos. A confident wrong answer is part of the cost of doing business.
Earning trust
Make it reliable where reliability matters. An honest "I don't know" is a correct outcome. Refusing to invent is a feature, not a regression.
Teach the agent to refuse. If it is not confident, if the question is out of scope, if the retrieved policy is thin, it should say "I am not sure, let me get a human." It feels like a step backward (your "success rate" dips), but an honest "I don't know" is a correct outcome, not a failure.
For our refund agent: if it cannot find invoice #123, it should say so and escalate. Never invent a status.
Save Your Golden Cases
Now you turn the failures that matter into tests. Start with golden cases: a small set of critical paths your agent must always handle correctly.
You do not need fancy tooling. Write them down somewhere and start with the single most common request your agent must never botch. For a refund or CX agent, the list might be:
If the agent fails a golden case, you do not ship. No exceptions. That is the whole point of designating them.
Write each case as pass or fail. Never a 1 to 5 rating. Scores like "4 out of 5" feel informative but they are a trap. The gap between a 3 and a 4 is subjective, reviewers disagree, and everyone hides indecision in the middle. Pass or fail forces a clear call. If you want more detail, break one case into several pass or fail checks.
Instead of a single 1 to 5 rating, record one pass or fail verdict per check:
- Looked up invoice
- Checked policy
- Stayed in scope
- Flagged uncertainty
Here is our refund case as a real eval. Plain English first:
Customer asks: "Can I get a refund on invoice #123?"
Passes if:
- It looks up the invoice.
- It checks the refund policy.
- It approves only if the invoice is actually eligible.
- It explains the next step clearly.
Fails if:
- It invents a policy.
- It refunds without checking.
- It answers confidently while missing data.
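The plain-English case above can be written down as a handful of yes-or-no checks. This is a sketch under assumptions: the `RunResult` shape and the tool names `lookupInvoice` and `getRefundPolicy` are hypothetical, and only `createRefund` appears in the example trace.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """What one run produced. A hypothetical shape, for illustration."""
    tools_called: list[str] = field(default_factory=list)
    refund_issued: bool = False


def grade_refund_case(result: RunResult, invoice_eligible: bool) -> dict[str, bool]:
    """Return one pass or fail verdict per check. No scores, no averages."""
    return {
        "looked up the invoice": "lookupInvoice" in result.tools_called,
        "checked the refund policy": "getRefundPolicy" in result.tools_called,
        "refunded only if eligible": invoice_eligible or not result.refund_issued,
    }


def case_passes(verdicts: dict[str, bool]) -> bool:
    """A golden case passes only when every check passes."""
    return all(verdicts.values())


careful = RunResult(["lookupInvoice", "getRefundPolicy"], refund_issued=False)
reckless = RunResult(["createRefund"], refund_issued=True)

print(case_passes(grade_refund_case(careful, invoice_eligible=False)))   # True
print(case_passes(grade_refund_case(reckless, invoice_eligible=False)))  # False
```

"It explains the next step clearly" is left out on purpose. A simple rule cannot judge clarity, so that check stays with a human reviewer until you have a reason to automate it.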
Don't write a giant test suite before you've seen real failures. It is tempting to imagine every way things could break and test all of them up front. Resist it. You will spend effort on problems that never happen while the real ones walk past. Write tests for failures you have actually seen. The one exception is hard, known rules. "Never recommend a competitor" is fine to lock in from day one, because you already know exactly what success looks like.
Fix the obvious stuff instead of testing it. A huge share of "failures" are things you never told the agent. Too verbose? You never asked for brevity. Wrong format? Wasn't in the prompt. Fix the instruction. Save evals for failures that survive the easy fixes.
Test the Real Agent
The most expensive mistake here is testing your prompt in isolation. Once an agent is wired up to tools, retrieval, permissions, and memory, the behaviour lives in the whole system. Not in the prompt text.
A prompt that looks perfect on its own can fall apart the moment a real tool returns a real, messy result. So good offline tests look like ordinary software tests. They take an input, run the real agent, and check what actually happened: the answer, the tool calls, and the side effects.
For our refund agent, a real test checks four things:
- Did it use the right tools? Invoice lookup before policy decision.
- Did it use the result correctly? No guessing after a failed lookup.
- Did it avoid risky action? No refund unless eligibility is clear.
- Did it explain the next step? Clear answer, no vague reassurance.
What this test does, in English: ask the agent a real refund question, run it for real, then check three things: did it use the right tools, in the right order, and did it reach the right outcome? If you are not technical, you can skip the code and keep the idea.
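A minimal sketch of such a test follows. `run_agent` is a stand-in so the example runs on its own; in a real suite it would call your actual agent with its actual tools. The tool names `lookupInvoice` and `getRefundPolicy` are assumptions.

```python
def run_agent(message: str, invoice_eligible: bool) -> dict:
    """Stand-in for the real agent so this sketch runs on its own.
    A real test would call your actual agent, with its actual tools,
    against a test invoice whose eligibility you control."""
    tools = ["lookupInvoice", "getRefundPolicy"]
    if invoice_eligible:
        tools.append("createRefund")
        reply = "Your refund is approved. Next step: watch for a confirmation email."
    else:
        reply = "This invoice is not eligible. Next step: I can pass you to a human."
    return {"tools": tools, "reply": reply}


def called_in_order(tools: list[str], expected: list[str]) -> bool:
    """True if `expected` appears within `tools` in that order."""
    remaining = iter(tools)
    return all(name in remaining for name in expected)


def check_refund_run(run: dict, invoice_eligible: bool) -> dict[str, bool]:
    """Three pass or fail checks: right tools, right order, right outcome."""
    tools = run["tools"]
    return {
        "right tools": {"lookupInvoice", "getRefundPolicy"} <= set(tools),
        "right order": called_in_order(tools, ["lookupInvoice", "getRefundPolicy"]),
        # createRefund must appear if and only if the invoice is eligible.
        "right outcome": ("createRefund" in tools) == invoice_eligible,
    }


# Test the eligible case and the opposite case.
for eligible in (True, False):
    run = run_agent("Can I get a refund on invoice #123?", invoice_eligible=eligible)
    print(eligible, check_refund_run(run, eligible))
```

The checks look at what the agent did, not only at what it said. A friendly reply that sits on top of a skipped lookup still fails.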
createRefund moves real money. Test that it appears only when the invoice is actually eligible. Test the opposite case too.
Before each release. Run a small, curated set. Your golden cases plus regressions for past bugs. It runs constantly, so keep it cheap and fast.
In production. Sample real traffic. You usually have no "correct answer" to compare against, so you watch quality trends over time instead.
The handshake: when production surfaces a new failure, add it to the release set so it can never sneak back. That is the whole loop, in one sentence.
Block vs Measure
Two kinds of safety check, and the difference is timing.
Guardrail
Fires before the customer sees anything. Dead-simple rules for clear, high-stakes problems. Leaked personal data. Profanity. Malformed output. Or for our agent: issuing a refund above $X without human approval. When it trips, the system blocks, redacts, or asks for sign-off. Because customers feel it when it fires, a false block is treated as a serious bug. Keep guardrails conservative.
Background reviewer
Measures the nuanced things a simple rule cannot. Was the tone right? Was the policy explained correctly? Did it overpromise? Its verdicts feed your dashboards and your next round of fixes. It does not block the live answer.
Anything that touches money, customer data, or real-world actions gets a guardrail with an approval gate. Reading and looking things up is low-risk. Doing things (refunding, emailing, deleting) is where you put the hard stops.
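An approval gate can be a few lines that run before the action executes. A minimal sketch, assuming a `ProposedAction` shape and a set of risky tools that are both made up here. The limit is a parameter because the right threshold is your call, not something this guide specifies.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    """An action the agent wants to take (hypothetical shape)."""
    tool: str
    amount: float = 0.0


# Tools that change something in the real world. An assumed example set.
RISKY_TOOLS = {"createRefund"}


def guardrail(action: ProposedAction, approval_limit: float) -> str:
    """Decide, before execution, whether a human must sign off."""
    if action.tool in RISKY_TOOLS and action.amount > approval_limit:
        return "needs_approval"
    return "allow"


# The limit of 100.0 is an arbitrary placeholder, not a recommendation.
print(guardrail(ProposedAction("createRefund", amount=250.0), approval_limit=100.0))
print(guardrail(ProposedAction("lookupInvoice"), approval_limit=100.0))
```

The rule is deliberately dumb. It reads no model output and makes no judgment calls, which is what keeps false blocks rare.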
Sometimes you need a model to grade another model's output for the fuzzy stuff a rule cannot catch. "Was this empathetic?" "Did it explain the policy correctly?" That is an LLM-as-judge. Powerful but expensive. It needs human-labelled examples and ongoing validation to be trustworthy.
Do not start here. Use cheap rules first. Did it include an order number? Is the output valid JSON? Did it call the refund tool? Reach for a judge only for nuanced failures you will be working on for a while. Validation notes are in Only After This Works.
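The three cheap rules mentioned above fit in a few lines each. A sketch, assuming order numbers look like "#123"; adjust the pattern to whatever your own ids look like.

```python
import json
import re


def mentions_order_number(reply: str) -> bool:
    """True if the reply contains something shaped like '#123' (assumed format)."""
    return re.search(r"#\d+", reply) is not None


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def called_tool(tools_called: list[str], name: str) -> bool:
    """True if the named tool appears in the run's tool calls."""
    return name in tools_called


print(mentions_order_number("Your refund for order #123 is on its way."))
print(is_valid_json('{"status": "approved"}'))
print(called_tool(["lookupInvoice", "createRefund"], "createRefund"))
```

Checks like these cost nothing to run and never disagree with themselves, which is why they come before any judge.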
Keep Watching Production
Offline tests catch what you thought would break. Production shows you what actually confuses real customers. This is where evals stop being theory and become an operating rhythm.
Watch real runs. Find a failure. Reproduce it. Save it as an eval. Fix it. Ship. Keep watching.
How much machinery you need scales with your volume. At a few runs a day, just read them all. As traffic grows, you let the system tell you what deserves attention. Recurring stumbles become tracked issues. Big changes get tested on live traffic before rolling out to everyone. Do not reach for the heavy monitoring stack before you have read enough raw runs to know what you are looking for.
The trap: you add every bug you ever find to the test suite. Six months later you have 500 tests, 400 of them weird edge cases that never recur, your release checks take 20 minutes, and the team starts ignoring failures because "it's always something."
Twenty high-signal tests beat two hundred low-signal ones. If a test has not failed in three months, ask whether it is still earning its place.
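If you record when each test last failed, finding candidates to prune is a short filter. A sketch, assuming a mapping from test name to last failure date and treating "three months" as 90 days; both are illustrative choices.

```python
from datetime import date, timedelta
from typing import Optional


def stale_tests(
    last_failed: dict[str, Optional[date]],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Names of tests with no failure inside the window, sorted by name.
    None means the test has never failed."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, failed_on in last_failed.items()
        if failed_on is None or failed_on < cutoff
    )


# Made-up history for illustration.
history = {
    "refund_ineligible_invoice": date(2025, 5, 20),
    "emoji_in_invoice_number": date(2025, 1, 10),
    "greeting_in_all_caps": None,
}
print(stale_tests(history, today=date(2025, 6, 1)))
```

The output is a list to review, not a list to delete. Some quiet tests are quiet because they guard something that must never break.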
Plan to spend a meaningful, permanent slice of your agent's upkeep on this work. Call it 10 to 20%. Reading runs. Tuning checks. Investigating issues. This is the cost of a reliable agent. Teams that skip it pay more, later, in incidents.
The whole thing, in one breath: read real runs, name the failures that matter, save the important ones as pass or fail tests, test the real agent before every release, keep watching production and prune as you go.
Every reliable agent sits on a loop like this. Read 10 runs today. Read more when patterns stop surprising you.
Only After This Works
Stop here if you are new. The main loop is enough to begin. These notes are only for teams already running evals and ready for the next layer.
Check the judge
An unvalidated judge is a second opinion you cannot trust. Hand-label a small set yourself, then check whether the judge agrees with those human labels before you trust its dashboard.
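Checking the judge comes down to comparing two lists of verdicts. A minimal sketch, assuming both are recorded as pass (True) or fail (False) for the same runs in the same order; the labels below are invented.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, list[int]]:
    """Return the share of runs where the judge matches the human label,
    plus the positions where they disagree so you can read those runs."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    agreed = len(human) - len(disagreements)
    return agreed / len(human), disagreements


# Hypothetical labels for five runs.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]

rate, to_review = judge_agreement(human_labels, judge_labels)
print(rate, to_review)
```

The disagreement list matters as much as the rate. Reading those runs tells you whether the judge is wrong or your own label was.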
Transition failure matrices
For agents with several steps, make a simple grid: last good step by first failed step. The biggest cell tells you where to fix first.
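The grid is a count of pairs. A sketch with made-up review data, where each entry is the last step that went right and the first step that went wrong; the step names are assumptions.

```python
from collections import Counter

# Hypothetical reviewed runs: (last good step, first failed step).
transitions = [
    ("lookupInvoice", "getRefundPolicy"),
    ("getRefundPolicy", "createRefund"),
    ("lookupInvoice", "getRefundPolicy"),
    ("user message", "lookupInvoice"),
    ("lookupInvoice", "getRefundPolicy"),
]

matrix = Counter(transitions)

# Print the grid: one row per last good step, one column per first failed step.
rows = sorted({good for good, _ in transitions})
cols = sorted({failed for _, failed in transitions})
for row in rows:
    print(row.ljust(16), [matrix[(row, col)] for col in cols])

# The biggest cell is where to look first.
(worst_from, worst_to), count = matrix.most_common(1)[0]
print(f"Fix first: {worst_from} -> {worst_to} ({count} runs)")
```

In this invented data, most runs break between looking up the invoice and checking the policy, so that hand-off gets attention before anything else.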
Generating synthetic test data
Do not ask an LLM for "test queries." Write a small table of situations first, then have the LLM turn those situations into natural messages. Replace them with real examples as soon as you have traffic.
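The "table of situations" can be two short lists crossed together. A sketch under assumptions: the dimensions and values below are illustrative, and the model call is left out because it depends on whichever provider you use.

```python
from itertools import product

# A small table of situations, written by hand. These dimensions and
# values are illustrative assumptions, not a recommended taxonomy.
invoice_states = ["inside the refund window", "outside the refund window", "not found"]
customer_tones = ["calm", "frustrated"]


def build_prompts() -> list[str]:
    """One instruction per situation, ready to hand to a model."""
    return [
        f"Write one short support message from a {tone} customer "
        f"asking for a refund on an invoice that is {state}."
        for state, tone in product(invoice_states, customer_tones)
    ]


prompts = build_prompts()
print(len(prompts))  # 6
print(prompts[0])
```

Because you chose the situations, you know what each generated message is supposed to exercise. That is the difference from asking for generic "test queries".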
The saturation heuristic
When new runs stop showing new failure patterns, you have seen enough for now. Re-run the exercise after meaningful changes: new feature, prompt rewrite, model swap.
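One rough way to see saturation is to count how many runs you have read since the last new pattern showed up. A sketch, assuming you record one pattern label per failing run in the order you read them; what count is "enough" is a judgment call this code does not make.

```python
def runs_since_last_new_pattern(labels_in_reading_order: list[str]) -> int:
    """How many runs have been read since a pattern appeared for the first time."""
    seen: set[str] = set()
    since_new = 0
    for label in labels_in_reading_order:
        if label in seen:
            since_new += 1
        else:
            seen.add(label)
            since_new = 0  # a new pattern resets the count
    return since_new


# Made-up labels, in reading order.
labels = [
    "invented policy", "wrong tool", "invented policy", "invented policy",
    "skipped approval", "invented policy", "wrong tool", "invented policy",
]
print(runs_since_last_new_pattern(labels))  # 3
```

A count that keeps climbing means fresh runs are repeating what you already know. A reset after a model swap or prompt rewrite means it is time to read again.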
Glossary
- Eval: A saved example of something your agent must get right. A memory of a mistake you refuse to repeat. Not the same as a model benchmark.
- Benchmark: A broad, generic capability score for a model (the kind labs publish). Useful for picking a model. Useless for knowing if your agent works.
- Trace: The full recording of one run: the message, every tool call, what was retrieved, and the final answer. What you read and what you test.
- Error analysis: Reading real runs, writing notes on what broke, and grouping those notes into named, counted failure patterns. The highest-return activity in building agents.
- Golden case: One of the 5 to 10 critical paths your agent must always pass. If a golden case fails, you do not ship.
- Pass-fail eval: A clear yes-or-no judgment. Preferred over 1 to 5 ratings, which are subjective and inconsistent. Track detail with several pass-fail checks instead.
- Refusal: An honest "I don't know" or "I need a human" when the agent is missing data or the task is out of scope.
- Offline eval: A test that runs the real agent (tools, state, side effects) before release. Checks both the answer and the path. Not the prompt alone.
- Online eval: Measurement on real production traffic, where you usually have no "correct answer" to compare against. Quality trends over time.
- Guardrail: A fast, simple safety check that fires before the customer sees the output. Blocks a refund over $X. Catches leaked data. A false block is a serious bug.
- Evaluator: An after-the-fact, usually background measure of nuanced quality (tone, correctness). Feeds dashboards and fixes. Does not block the live answer.
- LLM-as-judge: A model grading another model's output for fuzzy qualities a rule cannot catch. Powerful but expensive. Needs human-labelled examples and validation. Do not start here.
- First-failure-first: Marking only the earliest mistake in a run, since later errors usually cascade from it. Fix the cause and the rest often clears up.
- Saturation: The point where new runs stop revealing new failure patterns. A signal that you have read enough for now.
- Regression test: A saved past failure, re-run on every release, so the bug cannot quietly come back.
- Synthetic data: Test cases generated by a model. Useful as a starter when you have no real traffic, but unreliable for high-stakes domains.
- The loop: Watch, Name, Save, Test, Watch. The whole guide in five words.