A team ships an AI-powered document summarizer. Three engineers read a dozen outputs, nod approvingly, and push to production. Two weeks later, support tickets spike. The model started hallucinating dates from a specific PDF format that nobody tested against. The fix takes four days - not because the bug was hard, but because the team had to manually re-check every other case to make sure the patch didn't break anything else.

This is vibe-check development. It is the default at most companies shipping AI features today. And it is the single biggest reason those features stop improving after launch.

Vibe-Check Development Is the Norm (And It's Expensive)

Here's how most AI teams work right now: an engineer changes a prompt, reads a handful of outputs, decides it "looks right," and ships. If something breaks, a user reports it days or weeks later. The team scrambles to fix it, which introduces new regressions that won't surface until the next complaint.

Hamel Husain, who literally wrote the practitioner's guide to LLM evaluation, makes a point that should make every AI team uncomfortable: the single most important activity in AI development is error analysis, and most teams spend approximately 0% of their time on it.

The cost isn't just bad outputs. It's velocity. Without evals, every change to your AI system is a gamble. Engineers become afraid to modify prompts because they can't predict what will break. Product managers can't prioritize improvements because there's no data on what's actually failing. The whole team moves slower because nobody has confidence in the system's behavior.

This is the paradox that catches teams off guard: skipping evals feels faster. It is objectively slower. Every prompt change without a test suite requires someone to manually check outputs, and "someone" usually means the same senior engineer who should be building the next feature.

The Three Levels of AI Testing

Not all evals serve the same purpose. The most practical framework breaks LLM evaluation into three levels, each with different speed, cost, and coverage tradeoffs.

Level 1: Assertions (Run on Every Change)

These are the closest thing to unit tests for AI. They check deterministic properties of the output - things that are either true or false, with no subjectivity:

  • Does the output contain required fields?
  • Is the response under the token limit?
  • Does it avoid banned terms or leaked PII?
  • Does the JSON parse correctly?

def test_response_format(response):
    assert len(response) < 500, "Response too long for chat widget"
    assert not any(term in response.lower()
                   for term in ["competitor_x", "internal_only"]), "Leaked restricted content"
    assert response.count("http") <= 2, "Too many links, likely hallucinated"

Assertions are cheap, fast, and should run in CI. They won't tell you if your summary is good, but they catch the cases where your model returns garbage, switches languages, or ignores formatting instructions.
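The first two bullets above - required fields and valid JSON - also reduce to assertions. A minimal sketch, assuming the model returns a JSON string; the field names here are hypothetical, not from any real schema:

```python
import json

REQUIRED_FIELDS = {"summary", "source_id"}  # hypothetical schema, for illustration

def check_json_contract(raw_response: str) -> None:
    # The output must parse as JSON before any other check matters.
    data = json.loads(raw_response)
    # Every required field must be present and non-empty.
    missing = REQUIRED_FIELDS - data.keys()
    assert not missing, f"Missing required fields: {missing}"
    assert all(data[f] for f in REQUIRED_FIELDS), "Required field is empty"

check_json_contract('{"summary": "Q3 revenue grew 12%.", "source_id": "doc-42"}')
```

A malformed response fails loudly at the first line it violates, which makes CI failures easy to diagnose.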

Level 2: Model-Graded Evaluations (Run on a Cadence)

This is where you use one LLM to judge another. You define criteria - relevance, accuracy, tone, completeness - and have a grader model score outputs against reference answers or rubrics.

Model-graded evals are the workhorse of eval-driven development. They're not perfect, but they scale in ways human review never can. Maxim AI and similar platforms provide frameworks for building these scoring pipelines without writing the infrastructure from scratch.

Here's the thing most teams get wrong: they reach for generic metrics like ROUGE, BLEU, or cosine similarity. These measure surface-level text overlap, not whether your agent actually helped the user. A response can score high on cosine similarity with the reference answer while being completely wrong in context. Build custom evaluators for your specific failure modes instead.
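To make "custom evaluator" concrete: the opening anecdote's failure was hallucinated dates, and a targeted check for that is a few lines. A sketch, assuming ISO-formatted dates; the point is that cosine similarity would miss exactly this error while a purpose-built check catches it:

```python
import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates; adapt to your formats

def dates_are_grounded(source: str, summary: str) -> bool:
    """Pass only if every date in the summary also appears in the source document."""
    source_dates = set(DATE_PATTERN.findall(source))
    summary_dates = set(DATE_PATTERN.findall(summary))
    return summary_dates <= source_dates

source = "The contract was signed on 2023-05-01 and renewed on 2024-05-01."
assert dates_are_grounded(source, "Signed 2023-05-01, renewed 2024-05-01.")
assert not dates_are_grounded(source, "Signed on 2023-06-15.")  # hallucinated date
```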

Level 3: Human Evaluation (After Significant Changes)

Some qualities only humans can judge. Tone, trustworthiness, whether the response "feels" helpful - these are subjective dimensions that no automated metric captures reliably. But human review is expensive and slow, so reserve it for high-stakes moments:

  • After model upgrades or major prompt rewrites
  • When launching a new capability
  • To validate that your Level 2 graders still correlate with real quality

Level          Speed           Cost                   What It Catches                         When to Run
Assertions     Seconds         Free                   Format errors, guardrail violations     Every commit
Model-graded   Minutes         $0.01-0.10 per eval    Quality regressions, relevance drift    Daily or weekly
Human review   Hours to days   $10-50/hour            Subtle tone issues, trust problems      After major changes

The goal is to keep Level 3 rare by making Levels 1 and 2 reliable enough to catch most problems automatically.

Building Your First Eval Suite in a Week

The biggest blocker isn't tooling. It's the belief that you need thousands of test cases to start. You don't.

Day 1-2: Collect Your Failure Cases

Go through your support tickets, user feedback, and internal bug reports. Find 20-30 real examples where your AI feature produced bad output. These are your golden test cases.

For each one, document:

  • The input that triggered the bad output
  • What the model actually said
  • What it should have said (or what properties a good response would have)

This is error analysis - the activity that should consume 60-80% of development time on AI features. Most teams spend zero, then wonder why their product isn't improving.
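The three documented properties map naturally onto a small record type, so cases can live in version control from day one. A sketch; the field names and example values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class FailureCase:
    """One golden test case mined from a real production failure."""
    input_text: str        # the input that triggered the bad output
    actual_output: str     # what the model actually said
    expectation: str       # what it should have said, or properties of a good response

case = FailureCase(
    input_text="Summarize invoice_march.pdf",
    actual_output="Invoice dated 2031-03-14 ...",   # hallucinated date
    expectation="Every date in the summary must appear in the source document",
)
```

Keeping the expectation as a stated property, rather than a single reference answer, is what lets the case later become an assertion or a rubric.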

Day 3-4: Write Assertions and Rubrics

Turn your failure cases into automated checks. Some become simple assertions:

# Real failure: model recommended a discontinued product
def test_no_discontinued_products(response):
    discontinued = ["ProductX v1", "LegacyTool", "OldAPI"]
    for product in discontinued:
        assert product not in response, f"Referenced discontinued: {product}"

Others need model-graded rubrics:

rubric = """
Score the response from 1-5 on ACCURACY:
5: All claims are factually correct and verifiable
4: Minor imprecisions that don't mislead the user
3: One significant factual error
2: Multiple factual errors
1: Predominantly incorrect or hallucinated content
"""

Day 5: Wire It Into Your Deploy Pipeline

The eval suite is only useful if it runs without someone remembering to trigger it. Start simple:

if ! python run_evals.py --suite core --threshold 0.85; then
    echo "Eval score below threshold. Blocking deploy."
    exit 1
fi
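The script behind that command can be minimal. A sketch of what `run_evals.py` might compute, with case loading and per-case scoring stubbed out - the shapes here are assumptions, not a prescribed interface:

```python
def run_suite(cases, evaluate, threshold: float) -> int:
    """Run every eval case, print the pass rate, return a shell exit code."""
    passed = sum(1 for case in cases if evaluate(case))
    rate = passed / len(cases)
    print(f"Pass rate: {rate:.0%} ({passed}/{len(cases)})")
    return 0 if rate >= threshold else 1

# Toy run: 9 of 10 cases pass, threshold 0.85 -> exit code 0 (deploy proceeds)
exit_code = run_suite(list(range(10)), lambda case: case != 0, threshold=0.85)
assert exit_code == 0
```

In the real script you would end with `sys.exit(run_suite(...))` so the shell gate above sees the exit code.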

Braintrust and other evaluation platforms can handle this orchestration if you outgrow shell scripts, but don't let tooling decisions delay getting started. A janky eval suite that runs is infinitely more valuable than a beautifully architected one that doesn't exist.

The counter-intuitive part: 20-50 test cases feels inadequate. It isn't. Those initial cases, drawn from real failures, will catch more regressions than a thousand synthetic examples generated by an LLM. The eval set grows naturally - every time a user reports a new problem, you add it. Within three months, you'll have a comprehensive suite that reflects your actual production failure modes, not hypothetical ones.

The Eval-Driven Loop in Practice

Once you have evals, the development workflow changes completely.

Before (the vibe-check loop):

  1. Engineer changes prompt
  2. Engineer reads 3-5 outputs
  3. "Looks good to me"
  4. Ship to production
  5. Wait for user complaints
  6. Panic fix
  7. Go back to step 2

After (eval-driven AI feature iteration):

  1. Engineer changes prompt
  2. Eval suite runs automatically (2-5 minutes)
  3. Dashboard shows: 94% pass rate, down from 96% - two new failures in the date-parsing category
  4. Engineer fixes the regression before shipping
  5. Eval suite confirms: 97% pass rate
  6. Ship with confidence
  7. Monitor production scores, add new cases from any user-reported issues
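The dashboard comparison in step 3 boils down to diffing per-category pass rates between two runs. A sketch, assuming results arrive as (category, passed) pairs - the category names are illustrative:

```python
from collections import defaultdict

def pass_rates_by_category(results):
    """results: iterable of (category, passed) pairs -> {category: pass_rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += passed
    return {cat: passes[cat] / totals[cat] for cat in totals}

def regressions(baseline, current):
    """Categories whose pass rate dropped relative to the baseline run."""
    return {cat: (baseline[cat], rate)
            for cat, rate in current.items()
            if cat in baseline and rate < baseline[cat]}

baseline = {"date-parsing": 1.0, "tone": 0.9}
current = pass_rates_by_category(
    [("date-parsing", True), ("date-parsing", False), ("tone", True)]
)
assert regressions(baseline, current) == {"date-parsing": (1.0, 0.5)}
```

Surfacing the category, not just the aggregate number, is what turns "94%, down from 96%" into "two new failures in date-parsing."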

The difference isn't just quality. It's speed. Engineers who trust their eval suite make bolder changes. They'll try a completely different prompting strategy because they know the evals will catch breakage. Teams running on vibe checks make tiny, conservative edits because they're afraid of causing damage they can't detect.

Google's web.dev documentation on eval-driven development frames this well: evaluation isn't a quality gate you add at the end. It's the development methodology itself. You write the eval first, then make the change, then verify. If that sounds like test-driven development, it should - it's the same loop, adapted for non-deterministic outputs.

What Eval Infrastructure Looks Like at Scale

For teams past the "shell script in CI" phase, mature eval infrastructure includes several components that compound over time.

A versioned eval dataset. Your test cases live in version control alongside your prompts. When you add a new eval case, it's a pull request that gets reviewed. This creates an institutional record of every failure mode your system has encountered - which turns out to be one of the most valuable artifacts an AI team produces.

Regression tracking over time. Not just pass/fail snapshots, but trend lines. If your accuracy score drops from 94% to 91% over three weeks, that's a signal even if no single change triggered a clear failure. Amplitude's approach to eval-driven analytics integrates this kind of longitudinal tracking into the product development workflow, treating eval scores as product metrics rather than engineering metrics.
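Detecting that kind of slow slide doesn't require heavy infrastructure. A minimal sketch, assuming you store one aggregate score per run in chronological order; the window and tolerance values are arbitrary starting points:

```python
def drifted(scores, window=3, tolerance=0.01):
    """Flag a slow slide: latest score below the recent average by more than tolerance.

    `scores` is a chronological list of aggregate eval scores, one per run.
    """
    if len(scores) < window + 1:
        return False  # not enough history to judge a trend
    earlier = scores[-(window + 1):-1]
    baseline = sum(earlier) / len(earlier)
    return scores[-1] < baseline - tolerance

# 94% drifting to 91%: no single run looks alarming, but the trend check fires.
assert drifted([0.94, 0.93, 0.92, 0.91])
assert not drifted([0.94, 0.94, 0.93, 0.94])
```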

Separate eval sets for separate capabilities. Your customer support agent needs different evals than your document summarizer. Don't lump everything into one aggregate score. A single "quality" number hides which capabilities are improving and which are degrading - and it hides it from exactly the people who need to know.

Cost tracking per eval run. Model-graded evals cost money. A team running 500 evaluations across 50 test cases at $0.05 per eval is spending $25 per run. That's sustainable at a daily cadence, uncomfortable at an hourly one. Know your burn rate and adjust accordingly.

A feedback loop from production. The eval set is never "done." Every user escalation, every flagged output, every edge case discovered in production becomes a candidate for the eval suite. The best teams we've seen have a lightweight process for this: a shared channel where anyone can submit a bad output, and a weekly triage to decide which ones become permanent test cases.

The Business Case in One Sentence

Teams without evals ship slower because every change is a gamble.

You can dress it up with ROI calculations and support-ticket-reduction projections, but the core argument is that simple. Eval-driven development lets your team make changes with confidence, which means they make more changes, which means the product improves faster.

The teams we work with that adopt continuous improvement AI agents through eval-driven loops typically see two things happen in the first month. First, they discover existing problems they didn't know about - their initial eval suite fails on 15-30% of test cases, which is normal and useful. Second, they start shipping prompt improvements 3-4x faster because the feedback loop shrinks from "wait for user complaints" to "run the suite."

If you're shipping AI features without evals, you're not moving fast. You're moving blind.