Your CTO just demoed the new customer support agent to leadership. It handled the scripted queries beautifully. Three weeks into production, support tickets are up 40% because the agent confidently provides wrong answers that sound authoritative.

The post-mortem reveals a familiar pattern: the team evaluated the agent using BERTScore and cosine similarity against a reference dataset. Both metrics showed 0.85+ scores. The metrics were green. The users were furious.

This gap between evaluation metrics and production reality is the central challenge of enterprise AI deployment. Generic NLP metrics were designed for translation and summarization tasks with clear reference answers. They measure textual similarity, not task completion. An agent could produce a response with zero overlap with your reference that perfectly resolves the customer's issue - or one with high similarity scores that completely misses the point.

The good news: there's a proven evaluation framework that actually predicts production success. It requires more upfront work than dropping in an off-the-shelf metric, but it's the difference between deploying with confidence and deploying with hope.

Why Generic Metrics Fail for Agents

Hamel Husain's evaluation FAQ is direct about this: "Generic metrics like BERTScore, ROUGE, and cosine similarity are NOT useful for most AI applications."

The fundamental problem is that these metrics measure the wrong thing. ROUGE counts n-gram overlap. BERTScore computes embedding similarity. Neither captures whether the agent:

  • Completed the actual task
  • Followed the required constraints
  • Avoided harmful or incorrect information
  • Handled edge cases appropriately

Consider a customer support agent asked "How do I cancel my subscription?" A response explaining the cancellation policy in completely different words than your reference answer might score poorly on ROUGE but be perfectly correct. Meanwhile, a response that copies your reference text verbatim but appends incorrect billing information would score high on similarity metrics while creating legal liability.

The metrics optimize for the wrong target. And in machine learning, you get what you measure.

Metric Type        What It Measures                What It Misses
ROUGE              N-gram overlap with reference   Semantic correctness, task completion
BERTScore          Embedding similarity            Domain-specific requirements, constraints
Cosine Similarity  Vector distance                 Factual accuracy, safety violations
Perplexity         Model confidence                Whether confidence is warranted

This doesn't mean these metrics are useless everywhere. For summarization tasks where you genuinely want the output to resemble a reference, they provide signal. But agents operate in domains where the "correct" answer depends on context, constraints, and user intent that generic metrics can't capture.
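To make the mismatch concrete, here is a toy unigram-overlap scorer - a simplified stand-in for ROUGE-1 recall, not the real implementation - applied to the cancellation example. The reference text and responses are made up for illustration:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style recall: fraction of reference words found in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)

reference = "go to account settings and click cancel subscription"

# A correct answer in completely different words scores zero
paraphrase = "open your profile page then choose end membership"

# The reference copied verbatim plus an incorrect claim scores perfectly
verbatim_plus_error = reference + " refunds are never issued under any circumstances"

print(unigram_overlap(paraphrase, reference))           # 0.0
print(unigram_overlap(verbatim_plus_error, reference))  # 1.0
```

The paraphrase resolves the issue and scores 0.0; the response that appends a false refund policy scores a perfect 1.0. That is the wrong target in miniature.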

The 3-Level Testing Framework

The evaluation approach that works treats LLM testing like software testing - with different levels for different purposes. Hamel Husain and Shreya Shankar outline three levels that map directly to CI/CD workflows:

Level 1: Assertions (Run on Every Code Change)

These are the unit tests of LLM evaluation. They check for known failure modes with deterministic rules:

  • Does the response contain required elements? (specific fields in structured output)
  • Does it avoid prohibited content? (PII patterns, competitor mentions, profanity)
  • Does it stay within length constraints?
  • Does it follow formatting requirements?

Assertions are cheap, fast, and run on every commit. They catch regressions immediately. The tradeoff: they only catch problems you've explicitly coded for.

def test_response_structure(response):
    # Level 1: Assertions that run on every change
    assert "recommendation" in response, "Missing recommendation field"
    assert len(response["recommendation"]) < 500, "Recommendation too long"
    # contains_pii: your own regex- or classifier-based helper
    assert not contains_pii(response["recommendation"]), "PII detected"
    assert 0.0 <= response["confidence"] <= 1.0, "Confidence out of range"

Level 2: Model-Based Evaluators (Run on Schedule)

Here's where you use LLMs to evaluate LLMs. A separate model (often a larger one, or one with different training) judges whether outputs meet quality criteria.

This sounds circular, but it works when done correctly. The key is building evaluators for specific dimensions you care about and validating them against human judgment before trusting them.

Model-based evals are more expensive than assertions, so you run them on a cadence - nightly, or on significant changes. They catch qualitative issues that assertions miss.

def evaluate_helpfulness(response, context):
    # Level 2: Model-based evaluation
    eval_prompt = f"""
    Rate this customer support response on helpfulness (1-5):
    
    Customer Question: {context['question']}
    Response: {response}
    
    Score only. No explanation.
    """
    # The judge returns the score as text; parse it before comparing
    score = int(llm_judge(eval_prompt))
    return score >= 4  # Threshold determined by human validation

Level 3: Human Evaluation (After Significant Changes)

The gold standard remains human judgment. After major model updates, prompt changes, or new feature deployments, you need humans to review a sample of outputs.

This is expensive and slow, which is why it's Level 3. But it's also how you validate that your Level 1 and Level 2 evaluators are still calibrated to what actually matters.

The framework scales investment to risk. Minor changes get assertions. Regular releases get model-based evals. Major changes get human review.

Error Analysis: Where 80% of Your Time Should Go

Here's the counterintuitive part: building good evaluators requires spending most of your time not building evaluators. Hamel's guidance is specific - 60-80% of evaluation development time should go to error analysis.

Error analysis means manually reviewing failures, categorizing them, and understanding root causes before attempting to automate detection. This is unglamorous work. Engineers want to build systems, not read through logs. But skipping this step is why most evaluation pipelines fail to catch real problems.

The process:

  1. Collect failure examples - Not just obvious crashes, but cases where users complained, where the agent's response was technically correct but unhelpful, where you got lucky and nothing bad happened yet.

  2. Categorize failures by type - Is this a retrieval problem (wrong context)? A reasoning problem (right context, wrong conclusion)? A formatting problem? A constraint violation? Categories emerge from the data, not from preconceptions.

  3. Identify patterns - Do failures cluster around specific query types, user segments, or edge cases? Are there temporal patterns (failures increase after context window fills up)?

  4. Prioritize by impact - Not all failures are equal. A wrong answer about refund policy costs money. A wrong answer about medication could cost lives. Weight your evaluation investment accordingly.

  5. Build evaluators for each category - Only now do you start automating. And each automated evaluator gets validated against your manually labeled examples.

This is how you build evaluators that catch the failures that matter rather than optimizing for metrics that don't correlate with user outcomes.

Error Category        Example                                Evaluator Type
Hallucination         Agent invents product features         Fact-checking against knowledge base
Constraint Violation  Response exceeds character limit       Assertion
Wrong Retrieval       Pulls competitor info instead of ours  Source attribution check
Reasoning Failure     Correct facts, wrong conclusion        Model-based logic evaluation
Tone Mismatch         Too casual for enterprise context      Style classifier
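Steps 2 and 4 of the process can start as a few lines of Python. This sketch assumes you have already hand-labeled failures with categories like those in the table; the impact weights are illustrative, not a prescribed scale:

```python
from collections import Counter

# Hand-labeled failures from manual review (categories from your own taxonomy)
failures = [
    {"category": "hallucination", "query": "does the pro plan include sso"},
    {"category": "wrong_retrieval", "query": "compare your pricing to others"},
    {"category": "hallucination", "query": "what is the refund window"},
    {"category": "tone_mismatch", "query": "escalate to my account manager"},
    {"category": "hallucination", "query": "is my data encrypted at rest"},
]

# Assumed business-impact weight per category (illustrative numbers)
impact = {"hallucination": 10, "wrong_retrieval": 5, "tone_mismatch": 1}

# Prioritize by frequency x impact, then build evaluators in that order
counts = Counter(f["category"] for f in failures)
priority = sorted(counts, key=lambda c: counts[c] * impact[c], reverse=True)
print(priority)  # ['hallucination', 'wrong_retrieval', 'tone_mismatch']
```

The point is not the arithmetic - it's that the priority list comes from labeled data rather than from whichever failure mode was discussed in the last meeting.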

Building Custom Evaluators That Actually Work

The key insight from production deployments: evaluators must be validated against human judgment before you trust them. An evaluator that doesn't correlate with human assessment isn't evaluating - it's just producing numbers.

Step 1: Create a Golden Dataset

Sample 200-500 examples covering your key use cases and failure modes. Have humans label each example on the dimensions you care about (helpfulness, accuracy, safety, etc.). Use multiple annotators and measure inter-annotator agreement.

If your annotators can't agree, your evaluation criteria aren't clear enough. Fix the rubric before proceeding.
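Inter-annotator agreement can be checked with Cohen's kappa, implemented here from scratch to keep the sketch dependency-free. The annotator labels are made-up example data:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same 8 responses pass/fail (illustrative)
ann_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
ann_2 = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]

print(f"kappa = {cohen_kappa(ann_1, ann_2):.2f}")  # 0.75 here
```

A common rule of thumb treats kappa below roughly 0.6 as a sign the rubric, not the annotators, needs work.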

Step 2: Build Candidate Evaluators

For each dimension, build multiple candidate evaluators:

  • Rule-based (regex patterns, keyword detection)
  • Classifier-based (fine-tuned small model)
  • LLM-as-judge (prompted large model)
  • Hybrid approaches

Each has tradeoffs. Rules are fast and interpretable but brittle. Classifiers need labeled training data but are cheap to run and consistent once trained. LLM judges are flexible but expensive and sometimes inconsistent.
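As a sketch of the rule-based option, here is a prohibited-content checker built from regex patterns. The patterns and competitor names are placeholders - nowhere near production-grade PII detection:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
COMPETITORS = re.compile(r"\b(acme|globex)\b", re.IGNORECASE)  # hypothetical names

def passes_rules(response: str) -> bool:
    """True if the response violates none of the prohibited-content rules."""
    return not (EMAIL.search(response)
                or US_PHONE.search(response)
                or COMPETITORS.search(response))

print(passes_rules("You can cancel anytime from account settings."))  # True
print(passes_rules("Email us at support@example.com to cancel."))     # False
```

A rule like this is a Level 1 assertion candidate: deterministic, milliseconds to run, and easy to explain when it fires.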

Step 3: Measure Correlation

Run each evaluator against your golden dataset. Measure correlation with human labels. Eugene Yan's work on evaluation emphasizes this step - an evaluator below 0.7 correlation with human judgment isn't reliable enough for automated decisions.

from scipy.stats import spearmanr

def validate_evaluator(evaluator, golden_dataset):
    automated_scores = [evaluator(x) for x in golden_dataset]
    human_scores = [x['human_label'] for x in golden_dataset]
    
    correlation, p_value = spearmanr(automated_scores, human_scores)
    
    if correlation < 0.7:
        print(f"Warning: Correlation {correlation:.2f} - evaluator not reliable")
    return correlation

Step 4: Set Thresholds Empirically

Don't guess at pass/fail thresholds. Use your golden dataset to find the threshold that maximizes alignment with human judgment. The threshold for "acceptable quality" should be determined by the data, not intuition.
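One way to do this empirically is a simple sweep: for each candidate cutoff, measure agreement with the human pass/fail labels and keep the best. The judge scores and labels below are toy data:

```python
def best_threshold(judge_scores, human_pass, candidates=(1, 2, 3, 4, 5)):
    """Pick the judge-score cutoff that best agrees with human pass/fail labels."""
    def accuracy(t):
        preds = [s >= t for s in judge_scores]
        return sum(p == h for p, h in zip(preds, human_pass)) / len(human_pass)
    return max(candidates, key=accuracy)

# LLM-judge scores (1-5) and human pass/fail labels for the same outputs
judge_scores = [5, 4, 4, 3, 2, 5, 3, 1]
human_pass   = [True, True, True, False, False, True, False, False]

print(best_threshold(judge_scores, human_pass))  # 4
```

On a real golden dataset you would also look at the precision/recall tradeoff at each cutoff, since false passes and false fails rarely cost the same.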

Step 5: Monitor and Recalibrate

Evaluators drift. User expectations change. New failure modes emerge. Schedule regular recalibration where you re-run correlation checks against fresh human labels. Quarterly is typical for stable domains; monthly for rapidly evolving ones.

Evaluating Agentic Workflows vs Autonomous Agents

The evaluation strategy differs based on how much autonomy your agent has.

For Workflows (Predictable Paths)

When you're using what Anthropic calls "workflow patterns" - prompt chaining, routing, parallelization - you can test each step independently. The system follows predetermined paths, so you can write assertions for each node:

  • Router sends query to correct specialist
  • Specialist generates response meeting schema requirements
  • Aggregator combines results without losing information

This is closer to traditional software testing. You know what each component should do, so you can test whether it does it.
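A workflow node test can look like an ordinary unit test. Here `route_query` is a hypothetical router interface, and the keyword router is a trivial stand-in for your real component:

```python
def check_router(route_query):
    """Level 1-style assertion suite for a routing node (example cases)."""
    cases = [
        ("How do I cancel my subscription?", "billing"),
        ("The app crashes when I upload a file", "technical"),
        ("Can I get an invoice for last month?", "billing"),
    ]
    for query, expected in cases:
        actual = route_query(query)
        assert actual == expected, f"routed {query!r} to {actual}, expected {expected}"

# Trivial keyword router standing in for the real component under test
def keyword_router(query):
    billing_terms = ("subscription", "invoice", "refund", "billing")
    return "billing" if any(t in query.lower() for t in billing_terms) else "technical"

check_router(keyword_router)  # raises AssertionError on any misroute
```

Because the path is predetermined, a failure here points at a specific node, not at the system as a whole.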

For Autonomous Agents (Dynamic Decisions)

When the agent decides its own path - selecting tools, determining when to stop, recovering from failures - evaluation gets harder. You need to evaluate:

  1. Final outcome quality - Did the agent complete the task?
  2. Intermediate reasoning - Were the steps sensible, even when the outcome happened to be good?
  3. Recovery capability - When the agent made a mistake, did it recognize and correct it?
  4. Efficiency - Did it take a reasonable path, or waste tokens on loops?
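The efficiency check in particular lends itself to automation. This sketch assumes your agent runtime records a trace as a list of (tool, arguments) steps; the step budget and repeat limit are arbitrary defaults, not recommendations:

```python
from collections import Counter

def check_trajectory(trace, max_steps=15, max_repeats=3):
    """Flag inefficient or looping agent trajectories from a recorded trace."""
    issues = []
    if len(trace) > max_steps:
        issues.append(f"inefficient: {len(trace)} steps (budget {max_steps})")
    for step, n in Counter(trace).items():
        if n > max_repeats:
            issues.append(f"possible loop: {step} called {n} times")
    return issues

# An agent that retried the same knowledge-base search five times
trace = [("search_kb", "cancel policy")] * 5 + [("respond", "final answer")]
print(check_trajectory(trace))
```

Outcome-level evals would score this run as a success if the final answer was right; trajectory-level checks are what surface the wasted loop.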

Chip Huyen's observation applies here: "The journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging." Agents that work 80% of the time in demos often work 40% of the time in production because the evaluation didn't stress-test edge cases.

For autonomous agents, include adversarial test cases:

  • Ambiguous queries where multiple tools could apply
  • Queries requiring multi-step reasoning with opportunities to go wrong
  • Queries where the correct answer is "I don't know" or "I can't do that"
  • Queries designed to trigger known failure modes

The Evaluation Playbook CIOs Actually Need

If you're evaluating an AI agent for production deployment - whether building internally or buying from a vendor - here's the checklist:

Before Deployment:

  • Error analysis completed on pilot data (minimum 100 failure examples categorized)
  • Custom evaluators built for top 5 failure modes
  • Each evaluator validated against human judgment (correlation > 0.7)
  • Level 1 assertions covering structural requirements
  • Level 2 model-based evals for quality dimensions
  • Baseline human evaluation scores established

During Deployment:

  • Assertions running on every code change
  • Model-based evals running nightly
  • Human evaluation sample (20-50 cases) weekly for first month
  • Monitoring for score drift and new failure patterns

Ongoing:

  • Quarterly recalibration of evaluators against fresh human labels
  • Error analysis refresh when new failure modes emerge
  • Evaluation coverage expansion as use cases grow

Red Flags When Evaluating Vendors:

  • Can't explain their evaluation methodology beyond "we use GPT-4 to judge"
  • Quote generic metrics (BLEU, ROUGE) as primary quality indicators
  • No human-in-the-loop evaluation process
  • Can't show correlation data between their evals and human judgment
  • Evaluation doesn't cover your specific failure modes

The organizations deploying AI agents successfully aren't the ones with the most sophisticated models. They're the ones with the most rigorous evaluation pipelines - pipelines built on error analysis, validated against human judgment, and continuously recalibrated against production reality.

The metrics might be green. The question is whether they're measuring what matters.