A founder showed me an AI agent last month. The demo was clean: the agent read a customer email, looked up the order in Shopify, drafted a refund response, and asked for approval. Three steps, perfect output, polite tone. He asked if we could help him roll it out to 40 support reps.

I asked him one question: "Show me what it did on the last 100 emails it has never seen."

He could not, because nobody had built that test. The agent worked in the room. Whether it worked in production was a guess. This is the gap that howtoeval.com and a growing body of 2026 practice are trying to close. Final-answer accuracy is the wrong lens for agentic systems. Agents do not just answer. They choose tools, fetch context, route between steps, ask for approvals, and write back to your systems. Every one of those decisions can be right or wrong, and the only way to know is to measure them.

This is a field guide for evaluating AI agents the way you would evaluate any other production system: continuously, with evidence, and with a clear definition of failure.

Why Final-Answer Accuracy Stopped Being Enough

When LLMs were judged on chat completions, you could grade the output and move on. An agent is different. By the time the agent produces a final answer, it has already made a chain of decisions: which tool to call, what arguments to pass, what context to retrieve, whether to escalate, whether to commit a write. A wrong answer at the end is often the visible symptom of a wrong choice three steps earlier.

Anthropic's Building Effective Agents guide makes this concrete by describing agents as LLMs running in a loop with tools. Every iteration of that loop is a chance for the system to drift. The current consensus, captured in Hamel Husain's evals FAQ on agentic workflows, is that you have to evaluate the trajectory, not just the destination.

A practical trajectory eval grades five things on every run:

  1. Tool choice: did the agent pick the right tool for this step?
  2. Tool arguments: were the arguments well-formed and complete?
  3. Retrieval grounding: did the cited context actually support the response?
  4. Step efficiency: was the path reasonable, or did it loop?
  5. Termination: did it stop when the task was done, or keep going?

If any of these is wrong, a polished final answer is still a failure. Worse, it is a failure that hides.
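The five checks above can be sketched as a single grading function. This is a minimal illustration, not a standard API: the `Step` and `Expected` shapes, the tool names, and the step budget are all assumptions made for the example, and the grounding check here only confirms that cited documents were retrieved at all.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict
    cited_doc_ids: list = field(default_factory=list)


@dataclass
class Expected:
    tools: list          # expected tool sequence for this scenario
    required_args: dict  # tool name -> set of required argument names
    max_steps: int       # budget for a "reasonable" path
    final_tool: str      # the step that should end the run


def grade_trajectory(steps, retrieved_doc_ids, expected):
    """Return one pass/fail flag per trajectory dimension."""
    tools = [s.tool for s in steps]
    # Treat two identical consecutive calls as a loop.
    repeated = any(a.tool == b.tool and a.args == b.args
                   for a, b in zip(steps, steps[1:]))
    return {
        "tool_choice": tools == expected.tools,
        "tool_arguments": all(
            expected.required_args.get(s.tool, set()) <= set(s.args)
            for s in steps),
        # Only checks that cited documents were retrieved; whether they
        # support the claim still needs a judge or a human.
        "retrieval_grounding": all(
            d in retrieved_doc_ids for s in steps for d in s.cited_doc_ids),
        "step_efficiency": len(steps) <= expected.max_steps and not repeated,
        "termination": bool(steps) and steps[-1].tool == expected.final_tool,
    }


# Hypothetical refund scenario, mirroring the demo in the introduction.
expected = Expected(
    tools=["lookup_order", "draft_refund", "request_approval"],
    required_args={"lookup_order": {"order_id"},
                   "draft_refund": {"order_id", "amount"}},
    max_steps=4,
    final_tool="request_approval",
)
good_run = [
    Step("lookup_order", {"order_id": "A1"}),
    Step("draft_refund", {"order_id": "A1", "amount": 20.0}, ["policy-7"]),
    Step("request_approval", {"draft_id": "d1"}),
]
report = grade_trajectory(good_run, {"policy-7"}, expected)
print(report)
```

Returning one flag per dimension, rather than a single score, keeps a lucky final answer from masking a bad path.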

Demo, Test Suite, Operating Metric: Three Different Things

Most teams conflate these. They should not.

A demo is a curated scenario. It proves the system can succeed once. Demos are useful for buy-in and onboarding. They are not evidence of reliability.

A test suite is a collection of graded scenarios that run on every change. It catches regressions and locks in known-good behavior. A good agent test suite is small, fast, and built from real cases. Eugene Yan's writing on patterns for LLM systems describes this as eval-driven development: write the failing case first, then fix the agent, then keep the case in the suite forever.

An operating metric is a live signal you watch in production: tool error rate this hour, escalation rate this week, cost per resolved ticket, percentage of writes that required human override. This is what tells you the system is still working under real traffic, with real users, against the real state of your data.

You need all three. A demo gets you a contract. A test suite gets you to launch. Operating metrics keep you alive after launch. If your team only has the first one, you do not have an evaluated agent. You have a hopeful one.

The Practical Eval Stack

Here is the stack we use when building production agents. It is opinionated, and it is built from what actually catches failures, not what scores well on leaderboards.

Layer 1: Task Success on Graded Scenarios

Start with 50 to 100 scenarios that represent the work the agent is supposed to do. Each scenario has an input, a definition of success, and a grading rule. Some grades are exact (did the agent issue the refund?). Some are rubric-based (did the response cite the right policy section?). Microsoft Research has written about this scenario-first approach and how it forces teams to write down what "working" actually means before they start scoring.

The scenarios should be biased toward edge cases. The middle of the distribution is easy. The tail is where agents fail and where customers notice.
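A scenario suite along these lines might look like the sketch below. Everything in it is hypothetical: the `Scenario` shape, the stub agent, and the policy section number exist only to show exact and rubric-style grades side by side, with pass rates reported per category.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    category: str
    input: str
    grade: Callable[[dict], bool]  # grading rule applied to the agent result


def run_suite(agent, scenarios):
    """Run every scenario and return the pass rate per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for sc in scenarios:
        result = agent(sc.input)
        totals[sc.category][0] += int(sc.grade(result))
        totals[sc.category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}


def stub_agent(email):
    """Stand-in for the real agent; returns an action and a reply."""
    if "refund" in email.lower():
        return {"action": "issue_refund",
                "reply": "Refund issued per policy section 4.2."}
    return {"action": "none", "reply": "Thanks for reaching out."}


scenarios = [
    # Exact grade: did the agent issue the refund?
    Scenario("simple refund", "refund", "Please refund order 1001",
             lambda r: r["action"] == "issue_refund"),
    # Rubric-style grade: did the reply cite the right policy section?
    Scenario("policy citation", "refund", "I want a refund, it arrived broken",
             lambda r: "section 4.2" in r["reply"]),
    # Edge case: a chargeback should go to a human.
    Scenario("chargeback", "edge_case", "My bank reversed the charge, now what?",
             lambda r: r["action"] == "escalate"),
]
pass_rates = run_suite(stub_agent, scenarios)
print(pass_rates)
```

In this toy run the stub agent passes the middle of the distribution and fails the one edge case, which is the pattern the per-category breakdown is meant to expose.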

Layer 2: Tool-Call Correctness

For every tool the agent can call, you need a separate set of tests. The Anthropic tool evaluation cookbook shows the pattern: hold the tool definition constant, vary the user input, and check whether the model called the right tool with the right arguments. Then check the inverse: when the user input is irrelevant, did the agent correctly decide not to call the tool?

This is the eval that catches the most expensive class of agent bugs: confident calls to the wrong tool, malformed arguments, and hallucinated parameters that look correct until they hit your API.
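As a rough sketch of that pattern, the check below holds one tool definition constant and grades both directions: the right call when the input is relevant, and no call when it is not. The schema, the `stub_choose_tool` router, and the test inputs are assumptions for illustration; in practice the router would be a call to your model.

```python
import re

# Hypothetical tool definition, held constant across all test inputs.
TOOL_SCHEMA = {
    "lookup_order": {"required": {"order_id"},
                     "allowed": {"order_id", "include_items"}},
}


def check_call(call, expected_tool):
    """Grade one tool decision. `call` is None or a (name, args) tuple."""
    if expected_tool is None:
        return call is None  # correct behavior is to call nothing
    if call is None:
        return False
    name, args = call
    if name != expected_tool or name not in TOOL_SCHEMA:
        return False
    schema = TOOL_SCHEMA[name]
    # Required arguments present, and no parameters outside the schema.
    return schema["required"] <= set(args) <= schema["allowed"]


def stub_choose_tool(text):
    """Stand-in for the model's tool decision."""
    match = re.search(r"order\s+#?(\d+)", text, re.IGNORECASE)
    if match:
        return ("lookup_order", {"order_id": match.group(1)})
    return None


cases = [
    ("Where is order #1234?", "lookup_order"),  # should call the tool
    ("What are your opening hours?", None),     # should not call anything
]
results = [check_call(stub_choose_tool(text), want) for text, want in cases]
print(results)
```

The `allowed` set is what catches hallucinated parameters: a call with an extra, plausible-looking argument fails the check before it ever reaches the API.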

Layer 3: Source Grounding for Retrieval

If your agent uses RAG or any document retrieval, you need a grounding eval. For each response that cites a source, ask two questions: was the cited document actually retrieved, and does it actually support the claim? The Ragas framework and similar tools score faithfulness and answer relevance, and you can run them at scale. But you also need human spot review, because an LLM judge can be fooled by fluent paraphrase. The cheapest way to find a hallucination is still a human looking at a trace.
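The two grounding questions can be sketched as follows. The record layout is hypothetical, and `keyword_judge` is a deliberately crude stand-in for a real faithfulness scorer; it does not reproduce how Ragas or any other framework computes its metrics.

```python
def keyword_judge(claim, doc_text):
    """Crude support check: every long word in the claim appears in the doc."""
    doc = doc_text.lower()
    return all(w in doc for w in claim.lower().split() if len(w) > 4)


def grounding_report(responses, judge):
    """Answer both grounding questions for each cited response."""
    rows = []
    for r in responses:
        # Question 1: was the cited document actually retrieved?
        retrieved = r["cited_id"] in r["retrieved"]
        # Question 2: does it support the claim, according to the judge?
        supported = retrieved and judge(r["claim"], r["retrieved"][r["cited_id"]])
        rows.append({"id": r["id"], "retrieved": retrieved,
                     "supported": supported, "needs_human": not supported})
    return rows


POLICY_7 = "Refunds are allowed within 30 days of delivery."
responses = [
    {"id": "r1", "claim": "Refunds allowed within 30 days",
     "cited_id": "policy-7", "retrieved": {"policy-7": POLICY_7}},
    {"id": "r2", "claim": "Refunds allowed within 90 days",
     "cited_id": "policy-9", "retrieved": {"policy-7": POLICY_7}},
    {"id": "r3", "claim": "Shipping is always free",
     "cited_id": "ship-1",
     "retrieved": {"ship-1": "Shipping costs 5 dollars per order."}},
]
rows = grounding_report(responses, keyword_judge)
print(rows)
```

Note that this judge would accept "Refunds allowed within 90 days" against the 30-day policy text, because it ignores short tokens. That is a small version of the fluent-paraphrase problem, and it is why a sample of passing rows should also go to a human, not only the failures.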

Layer 4: Policy and Safety Compliance

Every agent that takes action in a business system needs a policy layer, and that layer needs its own evals. Common policy checks: refusing to act on inputs that violate scope, escalating when confidence is low, redacting personal information before logging, refusing to commit writes above a threshold. Simon Willison's writing on the lethal trifecta is the cleanest description of what happens when an agent has private data, untrusted content, and exfiltration capability all at once. If your agent has those three, you need adversarial evals built specifically to break the policy layer.
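A policy layer can be tested like any other function. The sketch below assumes a hypothetical refund limit, a single in-scope action, and a simple email pattern for redaction; it covers boundary cases only, and a real adversarial suite would add prompt-injection inputs aimed at the same gate.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
REFUND_LIMIT = 100.0  # hypothetical write threshold


def redact(text):
    """Replace email addresses before the text is logged."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def policy_gate(action, amount):
    """Decide whether a proposed write is allowed, escalated, or refused."""
    if action != "refund":
        return "refuse"    # out of scope for this agent
    if amount > REFUND_LIMIT:
        return "escalate"  # above the write threshold
    return "allow"


# (action, amount, expected decision), concentrated around the boundary.
BOUNDARY_CASES = [
    ("refund", 20.0, "allow"),
    ("refund", 100.0, "allow"),
    ("refund", 100.01, "escalate"),
    ("delete_customer", 0.0, "refuse"),
]
failures = [(action, amount) for action, amount, want in BOUNDARY_CASES
            if policy_gate(action, amount) != want]
log_line = redact("Customer jane.doe@example.com asked for a refund")
print(failures, log_line)
```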

Layer 5: Escalation Precision

Most production agents are not autonomous. They route to humans when they are unsure. The eval question is: when the agent escalated, did it need to? When it did not escalate, should it have? Both directions matter. An agent that escalates everything is a forwarding service. An agent that escalates nothing is a liability.

Build this eval from real escalation traces. Label them as appropriate, premature, or missed. Track the rate over time. If your missed-escalation rate trends up, your agent is becoming overconfident, often after someone changed a prompt or upgraded a model.
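With those three labels, precision and recall fall out of simple counts: precision is the share of escalations that were needed, and recall is the share of needed escalations that actually happened. A minimal sketch, using made-up label counts:

```python
from collections import Counter


def escalation_metrics(labels):
    """Compute escalation precision and recall from labeled traces.

    Labels: "appropriate" (escalated, needed), "premature" (escalated,
    not needed), "missed" (not escalated, should have been).
    """
    counts = Counter(labels)
    escalated = counts["appropriate"] + counts["premature"]
    needed = counts["appropriate"] + counts["missed"]
    return {
        "precision": counts["appropriate"] / escalated if escalated else None,
        "recall": counts["appropriate"] / needed if needed else None,
        "missed_rate": counts["missed"] / needed if needed else None,
    }


# Illustrative sample: 6 appropriate, 2 premature, 2 missed.
labels = ["appropriate"] * 6 + ["premature"] * 2 + ["missed"] * 2
metrics = escalation_metrics(labels)
print(metrics)
```

Low precision is the forwarding-service failure; low recall is the liability failure.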

Layer 6: Latency and Cost as First-Class Metrics

A correct agent that takes 90 seconds and costs $0.40 per task is a failed agent for most workflows. Chip Huyen's book on AI engineering makes this point repeatedly: latency and cost are not optimizations you do later. They are constraints you design against from day one. Track p50, p95, and p99 latency on every task type. Track cost per resolved task, not cost per API call. A cheaper model that needs three retries can end up costing more per resolved task than the model you were trying to replace.
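Both measurements are short to compute. The sketch below uses a nearest-rank percentile and made-up prices: a hypothetical cheap model at $0.05 per attempt that needs three retries, against a stronger model at $0.15 that resolves on the first attempt.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_resolved(attempts):
    """Total spend across all attempts divided by tasks actually resolved."""
    resolved = sum(1 for a in attempts if a["resolved"])
    total = sum(a["cost"] for a in attempts)
    return total / resolved if resolved else float("inf")


# Two tasks each: the cheap model fails three times before succeeding.
cheap_attempts = (
    [{"cost": 0.05, "resolved": False}] * 3 + [{"cost": 0.05, "resolved": True}]
) * 2
strong_attempts = [{"cost": 0.15, "resolved": True}] * 2

cheap_cost = cost_per_resolved(cheap_attempts)    # about 0.20 per resolved task
strong_cost = cost_per_resolved(strong_attempts)  # about 0.15 per resolved task

latencies_s = [1.2, 0.9, 1.1, 14.0, 1.0, 1.3, 0.8, 1.1, 2.5, 1.0]
print(percentile(latencies_s, 50), percentile(latencies_s, 95))
print(cheap_cost, strong_cost)
```

Per API call the cheap model is a third of the price. Per resolved task it is the more expensive option in this example, which is why the denominator matters.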

Layer 7: Regression Tests From Real Failures

Every time a real user reports a failure, three things should happen: the failure trace gets captured, the failure case becomes a permanent test in your suite, and the fix is verified against that test before it ships. Six months in, your regression suite is your most valuable asset, because it encodes everything your team has learned about how this specific agent breaks in your specific environment.

This is the discipline most teams skip. They fix the bug, ship the patch, and move on. Then the same class of failure shows up in a new form and nobody remembers. A growing test suite is the only thing that compounds your reliability over time.
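The capture-then-verify loop can be sketched in a few lines. The trace fields, ticket ID, and both stub agents below are hypothetical; the point is that the reported failure becomes a stored case that fails before the fix and passes after it.

```python
import json


def trace_to_case(trace):
    """Turn a reported failure trace into a permanent regression case."""
    return {"id": trace["ticket_id"],
            "input": trace["input"],
            "expected_action": trace["correct_action"]}


def failing_cases(agent, cases):
    """Return the IDs of regression cases the agent currently fails."""
    return [c["id"] for c in cases
            if agent(c["input"])["action"] != c["expected_action"]]


# A user-reported failure: the agent refunded a gift card purchase.
reported = {"ticket_id": "T-481",
            "input": "Refund my order, it was a gift card purchase",
            "observed_action": "issue_refund",
            "correct_action": "escalate"}
suite = [trace_to_case(reported)]
suite_json = json.dumps(suite, indent=2)  # the form that would live in the repo


def agent_before_fix(text):
    return {"action": "issue_refund"}


def agent_after_fix(text):
    if "gift card" in text.lower():
        return {"action": "escalate"}
    return {"action": "issue_refund"}


print(failing_cases(agent_before_fix, suite))
print(failing_cases(agent_after_fix, suite))
```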

What a Real Eval Report Looks Like

If you are buying an AI agent in 2026 and you ask the vendor for their evaluation evidence, this is what a credible answer looks like:

  • A scenario suite of 100+ graded cases with pass rates by category
  • Tool-call accuracy broken down per tool, with false-positive and false-negative rates
  • Grounding scores on retrieval-based responses, with sample traces
  • Escalation precision and recall over the last 30 days of production traffic
  • Latency and cost distributions, not just averages
  • A regression suite that grows monthly, with a public changelog of new cases added
  • Live dashboards for the operating metrics, not screenshots in a deck

If a vendor gives you a single accuracy number and a demo video, you are looking at a prototype with a marketing team. That is not the same thing as a production system. LangChain's documentation on agent evaluation and the open-source tooling around LangSmith, Arize Phoenix, and Braintrust all make this kind of evidence cheap to produce. There is no longer an excuse for not having it.

The Mistakes That Quietly Kill Agent Programs

A short list of failures I see repeatedly:

Grading the final answer only. The agent looks fine until you check the trajectory and discover it was right by accident. The next input it has never seen will not get the same lucky path.

Evaluating once at launch. Models change. Tools change. Data changes. An eval suite that you ran in March and never re-ran in April is a snapshot of a system that no longer exists.

Synthetic test sets. Generating eval cases with an LLM is fast and produces clean-looking pass rates. It also misses every weird thing your real users do. Build from production traces.

No human in the loop on the eval itself. LLM-as-judge is useful, but it is uncalibrated until you compare it against human labels. Without periodic human spot checks, you are measuring a measurement with another measurement.

No write/read separation. Treating a read-only research agent and a write-capable workflow agent with the same eval rigor is dangerous. Writes need adversarial testing, sandboxed dry runs, and approval gates that are tested as carefully as the agent itself.
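On the LLM-as-judge point above, one way to run the human spot check is to have a person label a sample of the same traces and measure agreement with the judge. The sketch below reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance; the labels are invented for illustration.

```python
from collections import Counter


def judge_agreement(human, judge):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled independently at their base rates.
    expected = sum(h_counts[label] * j_counts[label]
                   for label in set(human) | set(judge)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


human = ["pass", "pass", "fail", "fail", "pass",
         "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass",
         "fail", "pass", "pass", "pass", "pass"]
observed, kappa = judge_agreement(human, judge)
print(round(observed, 2), round(kappa, 2))
```

Here the judge agrees with the human on 8 of 10 traces, but both disagreements are cases the judge passed and the human failed. Raw agreement looks healthy; kappa is noticeably lower, and the direction of the errors is what matters for a safety-relevant eval.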

How OpenNash Can Help

If you are building an agent and you do not yet have a trajectory eval suite, this is the part of the program we tend to set up first, before any production deployment. Our work usually starts with an audit of the existing system: what the agent claims to do, what its trajectories actually look like on a sample of real inputs, and where the gaps sit between demo behavior and production behavior. From there we build the scenario suite, the tool-call checks, the grounding evals, and the live operating dashboards in your environment, with your data and your policies.

The deliverable is owned by your team. The regression suite lives in your repo. The metrics flow into your observability stack. We are useful when there is real money or real risk attached to the decisions the agent makes, and when "looks good in the demo" is not an acceptable standard for shipping. If that sounds like your situation, book a call and we will map this framework to your specific workflow.

The reason this matters is simple. The teams that win with agents in 2026 are not the ones with the best models. They are the ones who can prove, on any given Tuesday, that their system is still doing what it was hired to do. That proof is what an eval stack produces. Without it, you have a hopeful agent. With it, you have evidence.