Your AI agent demo looked incredible. The pilot went smoothly. Three months into production, your support team is drowning in escalations because the agent keeps giving customers confidently wrong answers.
This story plays out constantly. The gap between "impressive demo" and "reliable production system" is almost entirely about evaluation - and most organizations skip it entirely. They treat AI agents like traditional software: ship it, wait for bug reports, fix what breaks. But AI systems fail differently. They degrade gradually. They hallucinate selectively. They work perfectly on Tuesday and mysteriously struggle on Thursday.
The good news: measuring whether your AI agent actually works isn't rocket science. It's just a discipline that most teams haven't learned yet. This guide translates Hamel Husain's comprehensive evals FAQ into practical guidance for leaders who need to hold vendors accountable and hold internal teams to clear standards.
Why Generic AI Metrics Are Useless for Your Business
Here's an uncomfortable truth from Hamel's research: the metrics that AI vendors love to cite (BERTScore, ROUGE, cosine similarity, "95% accuracy") are almost never useful for real applications.
Why? Because they measure similarity to a reference answer, not whether the output actually helps your users. Your customer service agent might score 92% on semantic similarity while still giving advice that gets customers angry. Your document summarizer might have great ROUGE scores while missing the one sentence that actually mattered.
The problem isn't that these metrics are wrong. They're measuring something real. But they're measuring the wrong thing for business purposes.
What to measure instead:
| Agent Type | Vendor Metric (Useless) | Business Metric (Useful) |
|---|---|---|
| Email drafter | "95% grammar accuracy" | Reply rate, time-to-response |
| Support router | "89% classification accuracy" | Resolution time, escalation rate |
| Document analyzer | "High semantic similarity" | Decisions made correctly, time saved |
| Research assistant | "Comprehensive coverage" | Actionable insights surfaced |
The shift: stop asking "Is the output good?" and start asking "Does the output produce the business outcome we want?"
This requires custom evaluation. You can't buy it off the shelf. And that's exactly why most vendors don't do it well - it's expensive to build evaluation systems specific to each customer's use case.
The Three Levels of AI Agent Testing
Hamel's framework breaks evaluation into three levels, each serving a different purpose. Think of them like the layers of a pyramid - you need the foundation before the top works.
Level 1: Assertions (Run on Every Change)
These are like unit tests for AI. They don't tell you if the output is "good" - they tell you if basic requirements are met.
Examples:
- Response is valid JSON (for structured outputs)
- Response doesn't contain forbidden phrases
- Response stays under token limits
- Response includes required sections
- Response doesn't mention competitors by name
These run automatically on every code change. They're fast, cheap, and catch obvious regressions. A support agent that suddenly starts returning malformed JSON is broken, regardless of how eloquent its responses are.
The trap: Many teams stop here. "All our assertions pass!" But assertions only catch catastrophic failures. An agent can pass every assertion while still being subtly, dangerously wrong.
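The assertion examples above can be sketched as plain check functions that run in CI. This is a minimal sketch, assuming the agent returns a raw JSON string; the field names, forbidden phrases, and token limit are all illustrative assumptions, not fixed requirements:

```python
import json

MAX_TOKENS_APPROX = 500                   # illustrative token budget
FORBIDDEN = {"guaranteed", "acme corp"}   # hypothetical forbidden phrases
REQUIRED_KEYS = {"answer", "sources"}     # hypothetical required sections

def run_assertions(raw_response: str) -> list[str]:
    """Return the names of failed assertions (empty list means all passed)."""
    failures = []

    # Assertion: response is valid JSON with the required sections
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["valid_json"]  # other checks assume parseable output
    if not REQUIRED_KEYS.issubset(payload):
        failures.append("required_sections")

    # Assertion: no forbidden phrases (e.g. competitor names)
    text = raw_response.lower()
    if any(phrase in text for phrase in FORBIDDEN):
        failures.append("forbidden_phrase")

    # Assertion: stays under the token budget (rough whitespace tokenization)
    if len(raw_response.split()) > MAX_TOKENS_APPROX:
        failures.append("token_limit")

    return failures
```

Wiring a function like this into the deployment pipeline is what makes assertions cheap: a non-empty return value blocks the release automatically.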
Level 2: Model-Based Evals (Run on a Cadence)
This is where you use a more capable model (usually GPT-4 or Claude) to grade your agent's outputs. It's the AI equivalent of peer review.
How it works:
- Collect a set of real inputs your agent has handled
- Run them through your agent
- Have a grader model evaluate the outputs against criteria you define
- Track scores over time
The criteria matter enormously. "Is this response helpful?" is too vague. "Does this response directly answer the user's question without making claims beyond the provided context?" is specific enough to be useful.
Run frequency: Weekly for fast-moving systems, bi-weekly or monthly for stable ones. The goal is catching drift - the gradual degradation that happens as user inputs evolve away from your training data.
Cost reality: Model-based evals aren't free. If you're evaluating 1,000 interactions per week with GPT-4, that's roughly $20-50 in API costs. Worth it compared to the cost of shipping broken experiences.
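The grading loop described above can be sketched with the model call injected as a plain function, so the same code works against any provider SDK. The criterion wording and the PASS/FAIL protocol below are illustrative assumptions, not a standard API:

```python
from typing import Callable

GRADER_PROMPT = """You are grading an AI agent's output.
Criterion: Does the response directly answer the user's question
without making claims beyond the provided context?
Question: {question}
Response: {response}
Reply with exactly PASS or FAIL."""

def grade_outputs(
    samples: list[dict],               # [{"question": ..., "response": ...}, ...]
    call_model: Callable[[str], str],  # wraps your provider SDK (OpenAI, Anthropic, ...)
) -> float:
    """Return the fraction of samples the grader model marks PASS."""
    passes = 0
    for sample in samples:
        prompt = GRADER_PROMPT.format(**sample)
        verdict = call_model(prompt).strip().upper()
        passes += verdict == "PASS"
    return passes / len(samples) if samples else 0.0
```

Tracking this score weekly, on a fixed sample size, is what turns "we monitor quality" into a trend line you can alert on.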
Level 3: Human Evaluation (After Significant Changes)
Some things only humans can evaluate. Tone. Cultural appropriateness. Whether a response would actually convince a skeptical customer. Whether the agent handled a delicate situation gracefully.
Human evaluation is expensive - both in time and in the cognitive load on your reviewers. Use it strategically:
- After major model updates
- After significant prompt changes
- When model-based evals show unexpected shifts
- On a random sample of production traffic (weekly or monthly)
The calibration problem: Different humans grade differently. Build a rubric. Have multiple reviewers grade the same outputs. Measure inter-rater reliability. If your reviewers can't agree on what "good" looks like, your agent certainly won't figure it out.
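Inter-rater reliability can be measured with Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. A minimal sketch for two reviewers assigning pass/fail grades to the same outputs:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters grading the same items (e.g. 'pass'/'fail')."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A common rule of thumb treats kappa above roughly 0.6 as substantial agreement; below that, tighten the rubric before trusting the grades.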
Error Analysis: Where Most Teams Fail
According to Hamel's research, error analysis should consume 60-80% of AI development time. Most teams spend less than 10%.
Error analysis means actually reading your agent's failures. Not in aggregate. Not as statistics. Reading the specific inputs that broke, understanding why, and categorizing the failure modes.
A framework for error analysis:
1. Collect failures: user complaints, low model-based eval scores, assertion failures
2. Categorize: what type of failure? (Hallucination, wrong tone, missing information, format error)
3. Prioritize: which failures cost the most? (Customer churn, support escalations, compliance risk)
4. Root cause: why did this specific failure happen? (Bad prompt, missing context, edge case, model limitation)
5. Fix and verify: change something, confirm the failure category improves
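The categorize-and-prioritize steps can be sketched as a weighted tally. The failure categories and per-incident cost weights below are illustrative assumptions; your own numbers come from the business-impact analysis:

```python
from collections import Counter

# Hypothetical per-incident business cost by failure category (your numbers will differ)
COST_WEIGHTS = {
    "hallucination": 50.0,        # compliance / trust risk
    "missing_information": 10.0,
    "wrong_tone": 5.0,
    "format_error": 1.0,
}

def prioritize_failures(failure_categories: list[str]) -> list[tuple[str, float]]:
    """Rank failure categories by estimated total cost, highest first."""
    counts = Counter(failure_categories)
    ranked = [
        (category, count * COST_WEIGHTS.get(category, 1.0))
        for category, count in counts.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

This is how a rare-but-expensive failure mode (the 2% hallucination rate in the legal example below) ends up at the top of the queue ahead of a frequent-but-cheap one.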
The insight most teams miss: your failure modes are specific to your application. Generic benchmarks won't find them. Only careful analysis of your actual traffic will.
Example: A legal document review agent might hallucinate citations 2% of the time. That 2% could expose you to malpractice claims. A marketing copy agent might have the same hallucination rate, but nobody gets sued over a slightly inaccurate product description. Same technical problem, wildly different business impact.
Red Flags When Evaluating AI Vendors
After eight years in this space, I've developed a simple vendor screening test: ask them about evals. Their answer tells you everything about whether their AI actually works or just demos well.
Questions to ask:
1. "What specific metrics do you track for this use case?"
   - Good answer: "We track X, Y, Z, and here's why those map to business outcomes"
   - Bad answer: "We continuously monitor quality" (means nothing)
2. "How do you catch quality degradation before users report it?"
   - Good answer: "We run automated evals on N samples every week, alerting on drops"
   - Bad answer: "Our models are very stable" (ignores distribution shift)
3. "Can you show me evaluation results from the last month?"
   - Good answer: Actual dashboard, specific numbers, trend lines
   - Bad answer: "That's proprietary" (translation: we don't have it)
4. "What's your process when evaluation scores drop?"
   - Good answer: Specific escalation path, root cause analysis, rollback criteria
   - Bad answer: "We investigate and fix issues" (no process)
5. "How did you validate that your evaluation criteria correlate with user satisfaction?"
   - Good answer: "We ran a study comparing model grades to human grades and found..."
   - Bad answer: Long pause, then something about "extensive testing"
Vendors with mature evaluation practices are proud of them. They'll volunteer this information. Vendors who demo well but deploy poorly will deflect these questions toward feature lists and accuracy claims.
Building an Evaluation Culture (Not Just an Evaluation System)
The technical infrastructure for evals is straightforward. The hard part is making evaluation a habit rather than a checkbox.
Principles that work:
Fail fast, not silently. Configure your systems to alert on evaluation score drops, not just on crashes. A 10% drop in model-based eval scores should wake someone up - it means something changed.
Budget for evaluation time. If a feature takes 3 weeks to build, budget 1 week for evaluation setup. This feels slow. It's faster than debugging production failures for the next 6 months.
Make evaluation visible. Put eval scores on dashboards that leadership sees. When quality is measurable and visible, it gets prioritized. When it's hidden in engineering logs, it gets ignored.
Tie evaluation to deployment. Your CI/CD pipeline should include evaluation gates. Agent fails Level 1 assertions? Deployment blocked. Model-based eval drops 15%? Human review required before shipping.
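The deployment gate can be sketched as a single decision function the pipeline calls before shipping. The thresholds mirror the examples in the text (blocked on assertion failure, human review on a 15% eval drop) and are assumptions to tune for your own risk tolerance:

```python
def deployment_gate(
    assertions_passed: bool,
    eval_score: float,       # current model-based eval score, 0.0-1.0
    baseline_score: float,   # score from the last approved deploy
) -> str:
    """Return 'deploy', 'human_review', or 'blocked' for the CI/CD pipeline."""
    if not assertions_passed:
        return "blocked"                  # Level 1 failure: never ship
    if baseline_score > 0:
        drop = (baseline_score - eval_score) / baseline_score
        if drop >= 0.15:
            return "human_review"         # significant eval drop: humans decide
    return "deploy"
```

For example, an agent whose eval score slips from 0.90 to 0.70 (a 22% relative drop) gets routed to human review even though every assertion passes.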
Rotate evaluation duty. The team members who review agent failures gain intuition that makes them better at prompt engineering and system design. Don't let this knowledge concentrate in one person.
Anthropic's engineering blog on agent evaluation makes a point worth repeating: evaluation is not a one-time activity. It's an ongoing practice that evolves with your system. The evaluation criteria you start with will be wrong. That's fine. Update them as you learn what actually matters.
Getting Started: The First 30 Days
If you're starting from zero, here's a practical roadmap:
Week 1: Inventory
- What does your agent do? List every task type.
- How would you know if each task was done well? Write it down.
- What failures would be catastrophic? What would be annoying but tolerable?
Week 2: Level 1 Setup
- Build 5-10 assertions for your most common task types
- Integrate into your deployment pipeline
- Set up alerts for assertion failures
Week 3: Baseline
- Collect 100 real interactions from production
- Manually grade them using your criteria from Week 1
- This becomes your benchmark for model-based evals
Week 4: Level 2 Foundation
- Build a model-based grader using your criteria
- Run it on your 100-sample baseline
- Compare to your human grades - calibrate until they roughly match
- Schedule weekly runs
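Week 4's calibration step can be sketched as a raw agreement check between the grader model and your 100-sample human baseline. The 0.8 agreement threshold below is an illustrative assumption for "roughly match", not a standard:

```python
def grader_agreement(model_grades: list[str], human_grades: list[str]) -> float:
    """Fraction of the baseline where the model grader matches the human grade."""
    matches = sum(m == h for m, h in zip(model_grades, human_grades))
    return matches / len(human_grades)

def is_calibrated(model_grades: list[str], human_grades: list[str],
                  threshold: float = 0.8) -> bool:
    """Treat the grader as usable once it roughly matches human judgment."""
    return grader_agreement(model_grades, human_grades) >= threshold
```

If the grader falls short, adjust the criteria in its prompt and rerun against the same baseline until agreement clears the bar; only then schedule the weekly runs.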
This won't make your agent perfect. But it will make it measurably better over time - and that's the difference between AI that delivers value and AI that delivers demos.
The teams I've seen succeed with AI agents all share one trait: they're obsessive about knowing whether their systems work. Not hoping. Not assuming. Knowing, because they built the infrastructure to measure it.
Your competitors are probably still shipping and hoping. That's your advantage if you're willing to do the work.