A team ships an AI-powered document summarizer. Three engineers read a dozen outputs, nod approvingly, and push to production. Two weeks later, support tickets spike. The model started hallucinating dates from a specific PDF format that nobody tested against. The fix takes four days - not because the bug was hard, but because the team had to manually re-check every other case to make sure the patch didn't break anything else.

This is vibe-check development. It is the default at most companies shipping AI features today. And it is the single biggest reason those features stop improving after launch.

Vibe-Check Development Is the Norm (And It's Expensive)

Here's how most AI teams work right now: an engineer changes a prompt, reads a handful of outputs, decides it "looks right," and ships. If something breaks, a user reports it days or weeks later. The team scrambles to fix it, which introduces new regressions that won't surface until the next complaint.

Hamel Husain, who literally wrote the practitioner's guide to LLM evaluation, makes a point that should make every AI team uncomfortable: the single most important activity in AI development is error analysis, and most teams spend approximately 0% of their time on it.

The cost isn't just bad outputs. It's velocity. Without evals, every change to your AI system is a gamble. Engineers become afraid to modify prompts because they can't predict what will break. Product managers can't prioritize improvements because there's no data on what's actually failing. The whole team moves slower because nobody has confidence in the system's behavior.

This is the paradox that catches teams off guard: skipping evals feels faster. It is objectively slower. Every prompt change without a test suite requires someone to manually check outputs, and "someone" usually means the same senior engineer who should be building the next feature.

The Three Levels of AI Testing

Not all evals serve the same purpose. The most practical framework breaks LLM evaluation into three levels, each with different speed, cost, and coverage tradeoffs.

Level 1: Assertions (Run on Every Change)

These are the closest thing to unit tests for AI. They check deterministic properties of the output - things that are either true or false, with no subjectivity:

  • Does the output contain required fields?
  • Is the response under the token limit?
  • Does it avoid banned terms or leaked PII?
  • Does the JSON parse correctly?

def test_response_format(response):
    assert len(response) < 500, "Response too long for chat widget"
    assert not any(term in response.lower()
                   for term in ["competitor_x", "internal_only"]), "Leaked restricted content"
    assert response.count("http") <= 2, "Too many links, likely hallucinated"

Assertions are cheap, fast, and should run in CI. They won't tell you if your summary is good, but they catch the cases where your model returns garbage, switches languages, or ignores formatting instructions.
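The first two bullets above - required fields and valid JSON - also reduce to assertions. A minimal sketch, assuming the model returns a JSON string; the field names here are hypothetical, not from any real schema:

```python
import json

REQUIRED_FIELDS = {"summary", "source_id"}  # hypothetical schema, for illustration

def check_json_contract(raw_response: str) -> None:
    # The output must parse as JSON before any other check matters.
    data = json.loads(raw_response)
    # Every required field must be present and non-empty.
    missing = REQUIRED_FIELDS - data.keys()
    assert not missing, f"Missing required fields: {missing}"
    assert all(data[f] for f in REQUIRED_FIELDS), "Required field is empty"

check_json_contract('{"summary": "Q3 revenue grew 12%.", "source_id": "doc-42"}')
```

A malformed response fails loudly at the first line it violates, which makes CI failures easy to diagnose.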

Level 2: Model-Graded Evaluations (Run on a Cadence)

This is where you use one LLM to judge another. You define criteria - relevance, accuracy, tone, completeness - and have a grader model score outputs against reference answers or rubrics.

Model-graded evals are the workhorse of eval-driven development. They're not perfect, but they scale in ways human review never can. Maxim AI and similar platforms provide frameworks for building these scoring pipelines without writing the infrastructure from scratch.

Here's the thing most teams get wrong: they reach for generic metrics like ROUGE, BLEU, or cosine similarity. These measure surface-level text overlap, not whether your agent actually helped the user. A response can score high on cosine similarity with the reference answer while being completely wrong in context. Build custom evaluators for your specific failure modes instead.
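To make "custom evaluator" concrete: the opening anecdote's failure was hallucinated dates, and a targeted check for that is a few lines. A sketch, assuming ISO-formatted dates; the point is that cosine similarity would miss exactly this error while a purpose-built check catches it:

```python
import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates; adapt to your formats

def dates_are_grounded(source: str, summary: str) -> bool:
    """Pass only if every date in the summary also appears in the source document."""
    source_dates = set(DATE_PATTERN.findall(source))
    summary_dates = set(DATE_PATTERN.findall(summary))
    return summary_dates <= source_dates

source = "The contract was signed on 2023-05-01 and renewed on 2024-05-01."
assert dates_are_grounded(source, "Signed 2023-05-01, renewed 2024-05-01.")
assert not dates_are_grounded(source, "Signed on 2023-06-15.")  # hallucinated date
```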

Level 3: Human Evaluation (After Significant Changes)

Some qualities only humans can judge. Tone, trustworthiness, whether the response "feels" helpful - these are subjective dimensions that no automated metric captures reliably. But human review is expensive and slow, so reserve it for high-stakes moments:

  • After model upgrades or major prompt rewrites
  • When launching a new capability
  • To validate that your Level 2 graders still correlate with real quality

Level          Speed           Cost                   What It Catches                         When to Run
Assertions     Seconds         Free                   Format errors, guardrail violations     Every commit
Model-graded   Minutes         $0.01-0.10 per eval    Quality regressions, relevance drift    Daily or weekly
Human review   Hours to days   $10-50/hour            Subtle tone issues, trust problems      After major changes

The goal is to keep Level 3 rare by making Levels 1 and 2 reliable enough to catch most problems automatically.

Building Your First Eval Suite in a Week

The biggest blocker isn't tooling. It's the belief that you need thousands of test cases to start. You don't.

Day 1-2: Collect Your Failure Cases

Go through your support tickets, user feedback, and internal bug reports. Find 20-30 real examples where your AI feature produced bad output. These are your golden test cases.

For each one, document:

  • The input that triggered the bad output
  • What the model actually said
  • What it should have said (or what properties a good response would have)

This is error analysis - the activity that should consume 60-80% of development time on AI features. Most teams spend zero, then wonder why their product isn't improving.
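The three documented properties map naturally onto a small record type, so cases can live in version control from day one. A sketch; the field names and example values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class FailureCase:
    """One golden test case mined from a real production failure."""
    input_text: str        # the input that triggered the bad output
    actual_output: str     # what the model actually said
    expectation: str       # what it should have said, or properties of a good response

case = FailureCase(
    input_text="Summarize invoice_march.pdf",
    actual_output="Invoice dated 2031-03-14 ...",   # hallucinated date
    expectation="Every date in the summary must appear in the source document",
)
```

Keeping the expectation as a stated property, rather than a single reference answer, is what lets the case later become an assertion or a rubric.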

Day 3-4: Write Assertions and Rubrics

Turn your failure cases into automated checks. Some become simple assertions:

# Real failure: model recommended a discontinued product
def test_no_discontinued_products(response):
    discontinued = ["ProductX v1", "LegacyTool", "OldAPI"]
    for product in discontinued:
        assert product not in response, f"Referenced discontinued: {product}"

Others need model-graded rubrics:

rubric = """
Score the response from 1-5 on ACCURACY:
5: All claims are factually correct and verifiable
4: Minor imprecisions that don't mislead the user
3: One significant factual error
2: Multiple factual errors
1: Predominantly incorrect or hallucinated content
"""

Day 5: Wire It Into Your Deploy Pipeline

The eval suite is only useful if it runs without someone remembering to trigger it. Start simple:

if ! python run_evals.py --suite core --threshold 0.85; then
    echo "Eval score below threshold. Blocking deploy."
    exit 1
fi
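The script behind that command can be minimal. A sketch of what `run_evals.py` might compute, with case loading and per-case scoring stubbed out - the shapes here are assumptions, not a prescribed interface:

```python
def run_suite(cases, evaluate, threshold: float) -> int:
    """Run every eval case, print the pass rate, return a shell exit code."""
    passed = sum(1 for case in cases if evaluate(case))
    rate = passed / len(cases)
    print(f"Pass rate: {rate:.0%} ({passed}/{len(cases)})")
    return 0 if rate >= threshold else 1

# Toy run: 9 of 10 cases pass, threshold 0.85 -> exit code 0 (deploy proceeds)
exit_code = run_suite(list(range(10)), lambda case: case != 0, threshold=0.85)
assert exit_code == 0
```

In the real script you would end with `sys.exit(run_suite(...))` so the shell gate above sees the exit code.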

Braintrust and other evaluation platforms can handle this orchestration if you outgrow shell scripts, but don't let tooling decisions delay getting started. A janky eval suite that runs is infinitely more valuable than a beautifully architected one that doesn't exist.

The counter-intuitive part: 20-50 test cases feels inadequate. It isn't. Those initial cases, drawn from real failures, will catch more regressions than a thousand synthetic examples generated by an LLM. The eval set grows naturally - every time a user reports a new problem, you add it. Within three months, you'll have a comprehensive suite that reflects your actual production failure modes, not hypothetical ones.

The Eval-Driven Loop in Practice

Once you have evals, the development workflow changes completely.

Before (the vibe-check loop):

  1. Engineer changes prompt
  2. Engineer reads 3-5 outputs
  3. "Looks good to me"
  4. Ship to production
  5. Wait for user complaints
  6. Panic fix
  7. Go back to step 2

After (eval-driven AI feature iteration):

  1. Engineer changes prompt
  2. Eval suite runs automatically (2-5 minutes)
  3. Dashboard shows: 94% pass rate, down from 96% - two new failures in the date-parsing category
  4. Engineer fixes the regression before shipping
  5. Eval suite confirms: 97% pass rate
  6. Ship with confidence
  7. Monitor production scores, add new cases from any user-reported issues
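The dashboard comparison in step 3 boils down to diffing per-category pass rates between two runs. A sketch, assuming results arrive as (category, passed) pairs - the category names are illustrative:

```python
from collections import defaultdict

def pass_rates_by_category(results):
    """results: iterable of (category, passed) pairs -> {category: pass_rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += passed
    return {cat: passes[cat] / totals[cat] for cat in totals}

def regressions(baseline, current):
    """Categories whose pass rate dropped relative to the baseline run."""
    return {cat: (baseline[cat], rate)
            for cat, rate in current.items()
            if cat in baseline and rate < baseline[cat]}

baseline = {"date-parsing": 1.0, "tone": 0.9}
current = pass_rates_by_category(
    [("date-parsing", True), ("date-parsing", False), ("tone", True)]
)
assert regressions(baseline, current) == {"date-parsing": (1.0, 0.5)}
```

Surfacing the category, not just the aggregate number, is what turns "94%, down from 96%" into "two new failures in date-parsing."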

The difference isn't just quality. It's speed. Engineers who trust their eval suite make bolder changes. They'll try a completely different prompting strategy because they know the evals will catch breakage. Teams running on vibe checks make tiny, conservative edits because they're afraid of causing damage they can't detect.

Google's web.dev documentation on eval-driven development frames this well: evaluation isn't a quality gate you add at the end. It's the development methodology itself. You write the eval first, then make the change, then verify. If that sounds like test-driven development, it should - it's the same loop, adapted for non-deterministic outputs.

What Eval Infrastructure Looks Like at Scale

For teams past the "shell script in CI" phase, mature eval infrastructure includes several components that compound over time.

A versioned eval dataset. Your test cases live in version control alongside your prompts. When you add a new eval case, it's a pull request that gets reviewed. This creates an institutional record of every failure mode your system has encountered - which turns out to be one of the most valuable artifacts an AI team produces.

Regression tracking over time. Not just pass/fail snapshots, but trend lines. If your accuracy score drops from 94% to 91% over three weeks, that's a signal even if no single change triggered a clear failure. Amplitude's approach to eval-driven analytics integrates this kind of longitudinal tracking into the product development workflow, treating eval scores as product metrics rather than engineering metrics.
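Detecting that kind of slow slide doesn't require heavy infrastructure. A minimal sketch, assuming you store one aggregate score per run in chronological order; the window and tolerance values are arbitrary starting points:

```python
def drifted(scores, window=3, tolerance=0.01):
    """Flag a slow slide: latest score below the recent average by more than tolerance.

    `scores` is a chronological list of aggregate eval scores, one per run.
    """
    if len(scores) < window + 1:
        return False  # not enough history to judge a trend
    earlier = scores[-(window + 1):-1]
    baseline = sum(earlier) / len(earlier)
    return scores[-1] < baseline - tolerance

# 94% drifting to 91%: no single run looks alarming, but the trend check fires.
assert drifted([0.94, 0.93, 0.92, 0.91])
assert not drifted([0.94, 0.94, 0.93, 0.94])
```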

Separate eval sets for separate capabilities. Your customer support agent needs different evals than your document summarizer. Don't lump everything into one aggregate score. A single "quality" number hides which capabilities are improving and which are degrading - and it hides it from exactly the people who need to know.

Cost tracking per eval run. Model-graded evals cost money. A team running 500 evaluations across 50 test cases at $0.05 per eval is spending $25 per run. That's sustainable at a daily cadence, uncomfortable at an hourly one. Know your burn rate and adjust accordingly.

A feedback loop from production. The eval set is never "done." Every user escalation, every flagged output, every edge case discovered in production becomes a candidate for the eval suite. The best teams we've seen have a lightweight process for this: a shared channel where anyone can submit a bad output, and a weekly triage to decide which ones become permanent test cases.

The Business Case in One Sentence

Teams without evals ship slower because every change is a gamble.

You can dress it up with ROI calculations and support-ticket-reduction projections, but the core argument is that simple. Eval-driven development lets your team make changes with confidence, which means they make more changes, which means the product improves faster.

The teams we work with that adopt continuous improvement AI agents through eval-driven loops typically see two things happen in the first month. First, they discover existing problems they didn't know about - their initial eval suite fails on 15-30% of test cases, which is normal and useful. Second, they start shipping prompt improvements 3-4x faster because the feedback loop shrinks from "wait for user complaints" to "run the suite."

If you're shipping AI features without evals, you're not moving fast. You're moving blind.