Your CTO just demoed the new customer support agent to leadership. It handled the scripted queries beautifully. Three weeks into production, support tickets are up 40% because the agent confidently provides wrong answers that sound authoritative.

The post-mortem reveals a familiar pattern: the team evaluated the agent using BERTScore and cosine similarity against a reference dataset. Both metrics showed 0.85+ scores. The metrics were green. The users were furious.

This gap between evaluation metrics and production reality is the central challenge of enterprise AI deployment. Generic NLP metrics were designed for translation and summarization tasks with clear reference answers. They measure textual similarity, not task completion. An agent could produce a response with zero overlap with your reference that perfectly resolves the customer's issue - or one with high similarity scores that completely misses the point.

The good news: there's a proven evaluation framework that actually predicts production success. It requires more upfront work than dropping in an off-the-shelf metric, but it's the difference between deploying with confidence and deploying with hope.

Why Generic Metrics Fail for Agents

Hamel Husain's evaluation FAQ is direct about this: "Generic metrics like BERTScore, ROUGE, and cosine similarity are NOT useful for most AI applications."

The fundamental problem is that these metrics measure the wrong thing. ROUGE counts n-gram overlap. BERTScore computes embedding similarity. Neither captures whether the agent:

  • Completed the actual task
  • Followed the required constraints
  • Avoided harmful or incorrect information
  • Handled edge cases appropriately

Consider a customer support agent asked "How do I cancel my subscription?" A response explaining the cancellation policy in completely different words than your reference answer might score poorly on ROUGE but be perfectly correct. Meanwhile, a response that copies your reference text verbatim but appends incorrect billing information would score high on similarity metrics while creating legal liability.

The metrics optimize for the wrong target. And in machine learning, you get what you measure.

Metric Type        What It Measures                What It Misses
ROUGE              N-gram overlap with reference   Semantic correctness, task completion
BERTScore          Embedding similarity            Domain-specific requirements, constraints
Cosine Similarity  Vector distance                 Factual accuracy, safety violations
Perplexity         Model confidence                Whether confidence is warranted

This doesn't mean these metrics are useless everywhere. For summarization tasks where you genuinely want the output to resemble a reference, they provide signal. But agents operate in domains where the "correct" answer depends on context, constraints, and user intent that generic metrics can't capture.
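To make the mismatch concrete, here is a toy unigram-overlap scorer - a simplified stand-in for ROUGE-1 recall, not the real implementation - applied to the cancellation example. The reference text and responses are made up for illustration:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style recall: fraction of reference words found in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)

reference = "go to account settings and click cancel subscription"

# A correct answer in completely different words scores zero
paraphrase = "open your profile page then choose end membership"

# The reference copied verbatim plus an incorrect claim scores perfectly
verbatim_plus_error = reference + " refunds are never issued under any circumstances"

print(unigram_overlap(paraphrase, reference))           # 0.0
print(unigram_overlap(verbatim_plus_error, reference))  # 1.0
```

The paraphrase resolves the issue and scores 0.0; the response that appends a false refund policy scores a perfect 1.0. That is the wrong target in miniature.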

The 3-Level Testing Framework

The evaluation approach that works treats LLM testing like software testing - with different levels for different purposes. Hamel Husain and Shreya Shankar outline three levels that map directly to CI/CD workflows:

Level 1: Assertions (Run on Every Code Change)

These are the unit tests of LLM evaluation. They check for known failure modes with deterministic rules:

  • Does the response contain required elements? (specific fields in structured output)
  • Does it avoid prohibited content? (PII patterns, competitor mentions, profanity)
  • Does it stay within length constraints?
  • Does it follow formatting requirements?

Assertions are cheap, fast, and run on every commit. They catch regressions immediately. The tradeoff: they only catch problems you've explicitly coded for.

def test_response_structure(response):
    # Level 1: Assertions that run on every change
    assert "recommendation" in response, "Missing recommendation field"
    assert len(response["recommendation"]) < 500, "Recommendation too long"
    # contains_pii: your own regex- or classifier-based helper
    assert not contains_pii(response["recommendation"]), "PII detected"
    assert 0.0 <= response["confidence"] <= 1.0, "Confidence out of range"

Level 2: Model-Based Evaluators (Run on Schedule)

Here's where you use LLMs to evaluate LLMs. A separate model (often a larger one, or one with different training) judges whether outputs meet quality criteria.

This sounds circular, but it works when done correctly. The key is building evaluators for specific dimensions you care about and validating them against human judgment before trusting them.

Model-based evals are more expensive than assertions, so you run them on a cadence - nightly, or on significant changes. They catch qualitative issues that assertions miss.

def evaluate_helpfulness(response, context):
    # Level 2: Model-based evaluation
    eval_prompt = f"""
    Rate this customer support response on helpfulness (1-5):
    
    Customer Question: {context['question']}
    Response: {response}
    
    Score only. No explanation.
    """
    # The judge returns the score as text; parse it before comparing
    score = int(llm_judge(eval_prompt))
    return score >= 4  # Threshold determined by human validation

Level 3: Human Evaluation (After Significant Changes)

The gold standard remains human judgment. After major model updates, prompt changes, or new feature deployments, you need humans to review a sample of outputs.

This is expensive and slow, which is why it's Level 3. But it's also how you validate that your Level 1 and Level 2 evaluators are still calibrated to what actually matters.

The framework scales investment to risk. Minor changes get assertions. Regular releases get model-based evals. Major changes get human review.

Error Analysis: Where 80% of Your Time Should Go

Here's the counterintuitive part: building good evaluators requires spending most of your time not building evaluators. Hamel's guidance is specific - 60-80% of evaluation development time should go to error analysis.

Error analysis means manually reviewing failures, categorizing them, and understanding root causes before attempting to automate detection. This is unglamorous work. Engineers want to build systems, not read through logs. But skipping this step is why most evaluation pipelines fail to catch real problems.

The process:

  1. Collect failure examples - Not just obvious crashes, but cases where users complained, where the agent's response was technically correct but unhelpful, where you got lucky and nothing bad happened yet.

  2. Categorize failures by type - Is this a retrieval problem (wrong context)? A reasoning problem (right context, wrong conclusion)? A formatting problem? A constraint violation? Categories emerge from the data, not from preconceptions.

  3. Identify patterns - Do failures cluster around specific query types, user segments, or edge cases? Are there temporal patterns (failures increase after context window fills up)?

  4. Prioritize by impact - Not all failures are equal. A wrong answer about refund policy costs money. A wrong answer about medication could cost lives. Weight your evaluation investment accordingly.

  5. Build evaluators for each category - Only now do you start automating. And each automated evaluator gets validated against your manually labeled examples.

This is how you build evaluators that catch the failures that matter rather than optimizing for metrics that don't correlate with user outcomes.

Error Category        Example                                Evaluator Type
Hallucination         Agent invents product features         Fact-checking against knowledge base
Constraint Violation  Response exceeds character limit       Assertion
Wrong Retrieval       Pulls competitor info instead of ours  Source attribution check
Reasoning Failure     Correct facts, wrong conclusion        Model-based logic evaluation
Tone Mismatch         Too casual for enterprise context      Style classifier
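Steps 2 and 4 of the process can start as a few lines of Python. This sketch assumes you have already hand-labeled failures with categories like those in the table; the impact weights are illustrative, not a prescribed scale:

```python
from collections import Counter

# Hand-labeled failures from manual review (categories from your own taxonomy)
failures = [
    {"category": "hallucination", "query": "does the pro plan include sso"},
    {"category": "wrong_retrieval", "query": "compare your pricing to others"},
    {"category": "hallucination", "query": "what is the refund window"},
    {"category": "tone_mismatch", "query": "escalate to my account manager"},
    {"category": "hallucination", "query": "is my data encrypted at rest"},
]

# Assumed business-impact weight per category (illustrative numbers)
impact = {"hallucination": 10, "wrong_retrieval": 5, "tone_mismatch": 1}

# Prioritize by frequency x impact, then build evaluators in that order
counts = Counter(f["category"] for f in failures)
priority = sorted(counts, key=lambda c: counts[c] * impact[c], reverse=True)
print(priority)  # ['hallucination', 'wrong_retrieval', 'tone_mismatch']
```

The point is not the arithmetic - it's that the priority list comes from labeled data rather than from whichever failure mode was discussed in the last meeting.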

Building Custom Evaluators That Actually Work

The key insight from production deployments: evaluators must be validated against human judgment before you trust them. An evaluator that doesn't correlate with human assessment isn't evaluating - it's just producing numbers.

Step 1: Create a Golden Dataset

Sample 200-500 examples covering your key use cases and failure modes. Have humans label each example on the dimensions you care about (helpfulness, accuracy, safety, etc.). Use multiple annotators and measure inter-annotator agreement.

If your annotators can't agree, your evaluation criteria aren't clear enough. Fix the rubric before proceeding.
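Inter-annotator agreement can be checked with Cohen's kappa, implemented here from scratch to keep the sketch dependency-free. The annotator labels are made-up example data:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same 8 responses pass/fail (illustrative)
ann_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
ann_2 = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]

print(f"kappa = {cohen_kappa(ann_1, ann_2):.2f}")  # 0.75 here
```

A common rule of thumb treats kappa below roughly 0.6 as a sign the rubric, not the annotators, needs work.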

Step 2: Build Candidate Evaluators

For each dimension, build multiple candidate evaluators:

  • Rule-based (regex patterns, keyword detection)
  • Classifier-based (fine-tuned small model)
  • LLM-as-judge (prompted large model)
  • Hybrid approaches

Each has tradeoffs. Rules are fast and interpretable but brittle. Classifiers need labeled training data but are cheap to run and consistent once trained. LLM judges are flexible but expensive and sometimes inconsistent.
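As a sketch of the rule-based option, here is a prohibited-content checker built from regex patterns. The patterns and competitor names are placeholders - nowhere near production-grade PII detection:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
COMPETITORS = re.compile(r"\b(acme|globex)\b", re.IGNORECASE)  # hypothetical names

def passes_rules(response: str) -> bool:
    """True if the response violates none of the prohibited-content rules."""
    return not (EMAIL.search(response)
                or US_PHONE.search(response)
                or COMPETITORS.search(response))

print(passes_rules("You can cancel anytime from account settings."))  # True
print(passes_rules("Email us at support@example.com to cancel."))     # False
```

A rule like this is a Level 1 assertion candidate: deterministic, milliseconds to run, and easy to explain when it fires.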

Step 3: Measure Correlation

Run each evaluator against your golden dataset. Measure correlation with human labels. Eugene Yan's work on evaluation emphasizes this step - an evaluator below 0.7 correlation with human judgment isn't reliable enough for automated decisions.

from scipy.stats import spearmanr

def validate_evaluator(evaluator, golden_dataset):
    automated_scores = [evaluator(x) for x in golden_dataset]
    human_scores = [x['human_label'] for x in golden_dataset]
    
    correlation, p_value = spearmanr(automated_scores, human_scores)
    
    if correlation < 0.7:
        print(f"Warning: Correlation {correlation:.2f} - evaluator not reliable")
    return correlation

Step 4: Set Thresholds Empirically

Don't guess at pass/fail thresholds. Use your golden dataset to find the threshold that maximizes alignment with human judgment. The threshold for "acceptable quality" should be determined by the data, not intuition.
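One way to do this empirically is a simple sweep: for each candidate cutoff, measure agreement with the human pass/fail labels and keep the best. The judge scores and labels below are toy data:

```python
def best_threshold(judge_scores, human_pass, candidates=(1, 2, 3, 4, 5)):
    """Pick the judge-score cutoff that best agrees with human pass/fail labels."""
    def accuracy(t):
        preds = [s >= t for s in judge_scores]
        return sum(p == h for p, h in zip(preds, human_pass)) / len(human_pass)
    return max(candidates, key=accuracy)

# LLM-judge scores (1-5) and human pass/fail labels for the same outputs
judge_scores = [5, 4, 4, 3, 2, 5, 3, 1]
human_pass   = [True, True, True, False, False, True, False, False]

print(best_threshold(judge_scores, human_pass))  # 4
```

On a real golden dataset you would also look at the precision/recall tradeoff at each cutoff, since false passes and false fails rarely cost the same.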

Step 5: Monitor and Recalibrate

Evaluators drift. User expectations change. New failure modes emerge. Schedule regular recalibration where you re-run correlation checks against fresh human labels. Quarterly is typical for stable domains; monthly for rapidly evolving ones.

Evaluating Agentic Workflows vs Autonomous Agents

The evaluation strategy differs based on how much autonomy your agent has.

For Workflows (Predictable Paths)

When you're using what Anthropic calls "workflow patterns" - prompt chaining, routing, parallelization - you can test each step independently. The system follows predetermined paths, so you can write assertions for each node:

  • Router sends query to correct specialist
  • Specialist generates response meeting schema requirements
  • Aggregator combines results without losing information

This is closer to traditional software testing. You know what each component should do, so you can test whether it does it.
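A workflow node test can look like an ordinary unit test. Here `route_query` is a hypothetical router interface, and the keyword router is a trivial stand-in for your real component:

```python
def check_router(route_query):
    """Level 1-style assertion suite for a routing node (example cases)."""
    cases = [
        ("How do I cancel my subscription?", "billing"),
        ("The app crashes when I upload a file", "technical"),
        ("Can I get an invoice for last month?", "billing"),
    ]
    for query, expected in cases:
        actual = route_query(query)
        assert actual == expected, f"routed {query!r} to {actual}, expected {expected}"

# Trivial keyword router standing in for the real component under test
def keyword_router(query):
    billing_terms = ("subscription", "invoice", "refund", "billing")
    return "billing" if any(t in query.lower() for t in billing_terms) else "technical"

check_router(keyword_router)  # raises AssertionError on any misroute
```

Because the path is predetermined, a failure here points at a specific node, not at the system as a whole.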

For Autonomous Agents (Dynamic Decisions)

When the agent decides its own path - selecting tools, determining when to stop, recovering from failures - evaluation gets harder. You need to evaluate:

  1. Final outcome quality - Did the agent complete the task?
  2. Intermediate reasoning - Were the steps sensible, even when the outcome happened to be good?
  3. Recovery capability - When the agent made a mistake, did it recognize and correct it?
  4. Efficiency - Did it take a reasonable path, or waste tokens on loops?
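The efficiency check in particular lends itself to automation. This sketch assumes your agent runtime records a trace as a list of (tool, arguments) steps; the step budget and repeat limit are arbitrary defaults, not recommendations:

```python
from collections import Counter

def check_trajectory(trace, max_steps=15, max_repeats=3):
    """Flag inefficient or looping agent trajectories from a recorded trace."""
    issues = []
    if len(trace) > max_steps:
        issues.append(f"inefficient: {len(trace)} steps (budget {max_steps})")
    for step, n in Counter(trace).items():
        if n > max_repeats:
            issues.append(f"possible loop: {step} called {n} times")
    return issues

# An agent that retried the same knowledge-base search five times
trace = [("search_kb", "cancel policy")] * 5 + [("respond", "final answer")]
print(check_trajectory(trace))
```

Outcome-level evals would score this run as a success if the final answer was right; trajectory-level checks are what surface the wasted loop.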

Chip Huyen's observation applies here: "The journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging." Agents that work 80% of the time in demos often work 40% of the time in production because the evaluation didn't stress-test edge cases.

For autonomous agents, include adversarial test cases:

  • Ambiguous queries where multiple tools could apply
  • Queries requiring multi-step reasoning with opportunities to go wrong
  • Queries where the correct answer is "I don't know" or "I can't do that"
  • Queries designed to trigger known failure modes

The Evaluation Playbook CIOs Actually Need

If you're evaluating an AI agent for production deployment - whether building internally or buying from a vendor - here's the checklist:

Before Deployment:

  • Error analysis completed on pilot data (minimum 100 failure examples categorized)
  • Custom evaluators built for top 5 failure modes
  • Each evaluator validated against human judgment (correlation > 0.7)
  • Level 1 assertions covering structural requirements
  • Level 2 model-based evals for quality dimensions
  • Baseline human evaluation scores established

During Deployment:

  • Assertions running on every code change
  • Model-based evals running nightly
  • Human evaluation sample (20-50 cases) weekly for first month
  • Monitoring for score drift and new failure patterns

Ongoing:

  • Quarterly recalibration of evaluators against fresh human labels
  • Error analysis refresh when new failure modes emerge
  • Evaluation coverage expansion as use cases grow

Red Flags When Evaluating Vendors:

  • Can't explain their evaluation methodology beyond "we use GPT-4 to judge"
  • Quote generic metrics (BLEU, ROUGE) as primary quality indicators
  • No human-in-the-loop evaluation process
  • Can't show correlation data between their evals and human judgment
  • Evaluation doesn't cover your specific failure modes

The organizations deploying AI agents successfully aren't the ones with the most sophisticated models. They're the ones with the most rigorous evaluation pipelines - pipelines built on error analysis, validated against human judgment, and continuously recalibrated against production reality.

The metrics might be green. The question is whether they're measuring what matters.