Your AI agent demo looked incredible. The pilot went smoothly. Three months into production, your support team is drowning in escalations because the agent keeps giving customers confidently wrong answers.
This story plays out constantly. The gap between "impressive demo" and "reliable production system" is almost entirely about evaluation - and most organizations skip it entirely. They treat AI agents like traditional software: ship it, wait for bug reports, fix what breaks. But AI systems fail differently. They degrade gradually. They hallucinate selectively. They work perfectly on Tuesday and mysteriously struggle on Thursday.
The good news: measuring whether your AI agent actually works isn't rocket science. It's just a discipline that most teams haven't learned yet. This guide translates Hamel Husain's comprehensive evals FAQ into practical guidance for leaders who need to hold vendors accountable and hold internal teams to clear standards.
Why Generic AI Metrics Are Useless for Your Business
Here's an uncomfortable truth from Hamel's research: the metrics that AI vendors love to cite (BERTScore, ROUGE, cosine similarity, "95% accuracy") are almost never useful for real applications.
Why? Because they measure similarity to a reference answer, not whether the output actually helps your users. Your customer service agent might score 92% on semantic similarity while still giving advice that gets customers angry. Your document summarizer might have great ROUGE scores while missing the one sentence that actually mattered.
The problem isn't that these metrics are wrong. They're measuring something real. But they're measuring the wrong thing for business purposes.
What to measure instead:
| Agent Type | Vendor Metric (Useless) | Business Metric (Useful) |
|---|---|---|
| Email drafter | "95% grammar accuracy" | Reply rate, time-to-response |
| Support router | "89% classification accuracy" | Resolution time, escalation rate |
| Document analyzer | "High semantic similarity" | Decisions made correctly, time saved |
| Research assistant | "Comprehensive coverage" | Actionable insights surfaced |
The shift: stop asking "Is the output good?" and start asking "Does the output produce the business outcome we want?"
This requires custom evaluation. You can't buy it off the shelf. And that's exactly why most vendors don't do it well - it's expensive to build evaluation systems specific to each customer's use case.
The Three Levels of AI Agent Testing
Hamel's framework breaks evaluation into three levels, each serving a different purpose. Think of them like the layers of a pyramid - you need the foundation before the top works.
Level 1: Assertions (Run on Every Change)
These are like unit tests for AI. They don't tell you if the output is "good" - they tell you if basic requirements are met.
Examples:
- Response is valid JSON (for structured outputs)
- Response doesn't contain forbidden phrases
- Response stays under token limits
- Response includes required sections
- Response doesn't mention competitors by name
These run automatically on every code change. They're fast, cheap, and catch obvious regressions. A support agent that suddenly starts returning malformed JSON is broken, regardless of how eloquent its responses are.
The trap: Many teams stop here. "All our assertions pass!" But assertions only catch catastrophic failures. An agent can pass every assertion while still being subtly, dangerously wrong.
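The assertion examples above can be sketched as plain check functions that run in CI. This is a minimal sketch, assuming the agent returns a raw JSON string; the field names, forbidden phrases, and token limit are all illustrative assumptions, not fixed requirements:

```python
import json

MAX_TOKENS_APPROX = 500                   # illustrative token budget
FORBIDDEN = {"guaranteed", "acme corp"}   # hypothetical forbidden phrases
REQUIRED_KEYS = {"answer", "sources"}     # hypothetical required sections

def run_assertions(raw_response: str) -> list[str]:
    """Return the names of failed assertions (empty list means all passed)."""
    failures = []

    # Assertion: response is valid JSON with the required sections
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["valid_json"]  # other checks assume parseable output
    if not REQUIRED_KEYS.issubset(payload):
        failures.append("required_sections")

    # Assertion: no forbidden phrases (e.g. competitor names)
    text = raw_response.lower()
    if any(phrase in text for phrase in FORBIDDEN):
        failures.append("forbidden_phrase")

    # Assertion: stays under the token budget (rough whitespace tokenization)
    if len(raw_response.split()) > MAX_TOKENS_APPROX:
        failures.append("token_limit")

    return failures
```

Wiring a function like this into the deployment pipeline is what makes assertions cheap: a non-empty return value blocks the release automatically.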
Level 2: Model-Based Evals (Run on a Cadence)
This is where you use a more capable model (usually GPT-4 or Claude) to grade your agent's outputs. It's the AI equivalent of peer review.
How it works:
- Collect a set of real inputs your agent has handled
- Run them through your agent
- Have a grader model evaluate the outputs against criteria you define
- Track scores over time
The criteria matter enormously. "Is this response helpful?" is too vague. "Does this response directly answer the user's question without making claims beyond the provided context?" is specific enough to be useful.
Run frequency: Weekly for fast-moving systems, bi-weekly or monthly for stable ones. The goal is catching drift - the gradual degradation that happens as user inputs evolve away from your training data.
Cost reality: Model-based evals aren't free. If you're evaluating 1,000 interactions per week with GPT-4, that's roughly $20-50 in API costs. Worth it compared to the cost of shipping broken experiences.
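The grading loop described above can be sketched with the model call injected as a plain function, so the same code works against any provider SDK. The criterion wording and the PASS/FAIL protocol below are illustrative assumptions, not a standard API:

```python
from typing import Callable

GRADER_PROMPT = """You are grading an AI agent's output.
Criterion: Does the response directly answer the user's question
without making claims beyond the provided context?
Question: {question}
Response: {response}
Reply with exactly PASS or FAIL."""

def grade_outputs(
    samples: list[dict],               # [{"question": ..., "response": ...}, ...]
    call_model: Callable[[str], str],  # wraps your provider SDK (OpenAI, Anthropic, ...)
) -> float:
    """Return the fraction of samples the grader model marks PASS."""
    passes = 0
    for sample in samples:
        prompt = GRADER_PROMPT.format(**sample)
        verdict = call_model(prompt).strip().upper()
        passes += verdict == "PASS"
    return passes / len(samples) if samples else 0.0
```

Tracking this score weekly, on a fixed sample size, is what turns "we monitor quality" into a trend line you can alert on.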
Level 3: Human Evaluation (After Significant Changes)
Some things only humans can evaluate. Tone. Cultural appropriateness. Whether a response would actually convince a skeptical customer. Whether the agent handled a delicate situation gracefully.
Human evaluation is expensive - both in time and in the cognitive load on your reviewers. Use it strategically:
- After major model updates
- After significant prompt changes
- When model-based evals show unexpected shifts
- On a random sample of production traffic (weekly or monthly)
The calibration problem: Different humans grade differently. Build a rubric. Have multiple reviewers grade the same outputs. Measure inter-rater reliability. If your reviewers can't agree on what "good" looks like, your agent certainly won't figure it out.
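Inter-rater reliability can be measured with Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. A minimal sketch for two reviewers assigning pass/fail grades to the same outputs:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters grading the same items (e.g. 'pass'/'fail')."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A common rule of thumb treats kappa above roughly 0.6 as substantial agreement; below that, tighten the rubric before trusting the grades.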
Error Analysis: Where Most Teams Fail
According to Hamel's research, error analysis should consume 60-80% of AI development time. Most teams spend less than 10%.
Error analysis means actually reading your agent's failures. Not in aggregate. Not as statistics. Reading the specific inputs that broke, understanding why, and categorizing the failure modes.
A framework for error analysis:
1. Collect failures: user complaints, low model-based eval scores, assertion failures
2. Categorize: what type of failure? (Hallucination, wrong tone, missing information, format error)
3. Prioritize: which failures cost the most? (Customer churn, support escalations, compliance risk)
4. Root cause: why did this specific failure happen? (Bad prompt, missing context, edge case, model limitation)
5. Fix and verify: change something, confirm the failure category improves
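The categorize-and-prioritize steps can be sketched as a weighted tally. The failure categories and per-incident cost weights below are illustrative assumptions; your own numbers come from the business-impact analysis:

```python
from collections import Counter

# Hypothetical per-incident business cost by failure category (your numbers will differ)
COST_WEIGHTS = {
    "hallucination": 50.0,        # compliance / trust risk
    "missing_information": 10.0,
    "wrong_tone": 5.0,
    "format_error": 1.0,
}

def prioritize_failures(failure_categories: list[str]) -> list[tuple[str, float]]:
    """Rank failure categories by estimated total cost, highest first."""
    counts = Counter(failure_categories)
    ranked = [
        (category, count * COST_WEIGHTS.get(category, 1.0))
        for category, count in counts.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

This is how a rare-but-expensive failure mode (the 2% hallucination rate in the legal example below) ends up at the top of the queue ahead of a frequent-but-cheap one.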
The insight most teams miss: your failure modes are specific to your application. Generic benchmarks won't find them. Only careful analysis of your actual traffic will.
Example: A legal document review agent might hallucinate citations 2% of the time. That 2% could expose you to malpractice claims. A marketing copy agent might have the same hallucination rate, but nobody gets sued over a slightly inaccurate product description. Same technical problem, wildly different business impact.
Red Flags When Evaluating AI Vendors
After eight years in this space, I've developed a simple vendor screening test: ask them about evals. Their answer tells you everything about whether their AI actually works or just demos well.
Questions to ask:
1. "What specific metrics do you track for this use case?"
   - Good answer: "We track X, Y, Z, and here's why those map to business outcomes"
   - Bad answer: "We continuously monitor quality" (means nothing)
2. "How do you catch quality degradation before users report it?"
   - Good answer: "We run automated evals on N samples every week, alerting on drops"
   - Bad answer: "Our models are very stable" (ignores distribution shift)
3. "Can you show me evaluation results from the last month?"
   - Good answer: Actual dashboard, specific numbers, trend lines
   - Bad answer: "That's proprietary" (translation: we don't have it)
4. "What's your process when evaluation scores drop?"
   - Good answer: Specific escalation path, root cause analysis, rollback criteria
   - Bad answer: "We investigate and fix issues" (no process)
5. "How did you validate that your evaluation criteria correlate with user satisfaction?"
   - Good answer: "We ran a study comparing model grades to human grades and found..."
   - Bad answer: Long pause, then something about "extensive testing"
Vendors with mature evaluation practices are proud of them. They'll volunteer this information. Vendors who demo well but deploy poorly will deflect these questions toward feature lists and accuracy claims.
Building an Evaluation Culture (Not Just an Evaluation System)
The technical infrastructure for evals is straightforward. The hard part is making evaluation a habit rather than a checkbox.
Principles that work:
Fail fast, not silently. Configure your systems to alert on evaluation score drops, not just on crashes. A 10% drop in model-based eval scores should wake someone up - it means something changed.
Budget for evaluation time. If a feature takes 3 weeks to build, budget 1 week for evaluation setup. This feels slow. It's faster than debugging production failures for the next 6 months.
Make evaluation visible. Put eval scores on dashboards that leadership sees. When quality is measurable and visible, it gets prioritized. When it's hidden in engineering logs, it gets ignored.
Tie evaluation to deployment. Your CI/CD pipeline should include evaluation gates. Agent fails Level 1 assertions? Deployment blocked. Model-based eval drops 15%? Human review required before shipping.
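The deployment gate can be sketched as a single decision function the pipeline calls before shipping. The thresholds mirror the examples in the text (blocked on assertion failure, human review on a 15% eval drop) and are assumptions to tune for your own risk tolerance:

```python
def deployment_gate(
    assertions_passed: bool,
    eval_score: float,       # current model-based eval score, 0.0-1.0
    baseline_score: float,   # score from the last approved deploy
) -> str:
    """Return 'deploy', 'human_review', or 'blocked' for the CI/CD pipeline."""
    if not assertions_passed:
        return "blocked"                  # Level 1 failure: never ship
    if baseline_score > 0:
        drop = (baseline_score - eval_score) / baseline_score
        if drop >= 0.15:
            return "human_review"         # significant eval drop: humans decide
    return "deploy"
```

For example, an agent whose eval score slips from 0.90 to 0.70 (a 22% relative drop) gets routed to human review even though every assertion passes.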
Rotate evaluation duty. The team members who review agent failures gain intuition that makes them better at prompt engineering and system design. Don't let this knowledge concentrate in one person.
Anthropic's engineering blog on agent evaluation makes a point worth repeating: evaluation is not a one-time activity. It's an ongoing practice that evolves with your system. The evaluation criteria you start with will be wrong. That's fine. Update them as you learn what actually matters.
Getting Started: The First 30 Days
If you're starting from zero, here's a practical roadmap:
Week 1: Inventory
- What does your agent do? List every task type.
- How would you know if each task was done well? Write it down.
- What failures would be catastrophic? What would be annoying but tolerable?
Week 2: Level 1 Setup
- Build 5-10 assertions for your most common task types
- Integrate into your deployment pipeline
- Set up alerts for assertion failures
Week 3: Baseline
- Collect 100 real interactions from production
- Manually grade them using your criteria from Week 1
- This becomes your benchmark for model-based evals
Week 4: Level 2 Foundation
- Build a model-based grader using your criteria
- Run it on your 100-sample baseline
- Compare to your human grades - calibrate until they roughly match
- Schedule weekly runs
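Week 4's calibration step can be sketched as a raw agreement check between the grader model and your 100-sample human baseline. The 0.8 agreement threshold below is an illustrative assumption for "roughly match", not a standard:

```python
def grader_agreement(model_grades: list[str], human_grades: list[str]) -> float:
    """Fraction of the baseline where the model grader matches the human grade."""
    matches = sum(m == h for m, h in zip(model_grades, human_grades))
    return matches / len(human_grades)

def is_calibrated(model_grades: list[str], human_grades: list[str],
                  threshold: float = 0.8) -> bool:
    """Treat the grader as usable once it roughly matches human judgment."""
    return grader_agreement(model_grades, human_grades) >= threshold
```

If the grader falls short, adjust the criteria in its prompt and rerun against the same baseline until agreement clears the bar; only then schedule the weekly runs.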
This won't make your agent perfect. But it will make it measurably better over time - and that's the difference between AI that delivers value and AI that delivers demos.
The teams I've seen succeed with AI agents all share one trait: they're obsessive about knowing whether their systems work. Not hoping. Not assuming. Knowing, because they built the infrastructure to measure it.
Your competitors are probably still shipping and hoping. That's your advantage if you're willing to do the work.