Your AI agent demo went great. The CEO saw it summarize a contract in six seconds. The head of operations watched it route a support ticket to the right team without human intervention. Everyone clapped. Budget was approved.

That was four months ago. The agent is still running on a developer's laptop, processing test data, while the Slack channel named #ai-pilot slowly goes quiet.

This is not a hypothetical. Gartner's 2025 survey found that fewer than 25% of enterprise AI pilots advance to production. The failure mode is almost never "the model isn't good enough." It is almost always "we didn't build the operational scaffolding to run this thing for real."

This checklist is the scaffolding. It covers the seven areas where pilots die on the way to production, and what to do about each one.

1. Evaluation: You Need a Test Suite, Not a Demo

The single biggest gap between a pilot and a production system is evaluation. In a pilot, someone watches the agent work and says "that looks right." In production, nobody is watching - so you need automated checks that catch when things go wrong.

What "ready" looks like:

  • A test set of 50+ real historical cases with known-good outputs
  • Automated scoring that runs on every code change and model update
  • Separate metrics for accuracy, completeness, and safety (not one blended number)
  • A baseline: what is the current human performance on these same tasks?

The test set is where most teams get stuck. Hamel Husain's evaluation framework makes the case that you should spend 60-80% of your development time on error analysis and building domain-specific evaluators - not on prompt tuning. If your team has spent three months refining prompts and zero time building evals, you have the ratio backwards.

A practical starting point: Pull 50 real examples from whatever system the agent is supposed to replace. Have your best human operator label the correct outputs. Run the agent against all 50. Score the results. That is your baseline. If you cannot build this test set - because the task is too ambiguous, or the "right answer" is subjective - that tells you something important about production readiness.
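The scoring step above can be sketched in a few lines. This is a minimal harness, not a full eval framework: `run_agent` is a stand-in for whatever callable wraps your agent, and exact-match scoring is an assumption that only fits tasks with a single correct output.

```python
def score_baseline(cases, run_agent):
    """Run the agent over labeled cases and report exact-match accuracy.

    `cases` is a list of {"input": ..., "expected": ...} dicts;
    `run_agent` is whatever callable wraps your agent (an assumption here).
    """
    results = []
    for case in cases:
        output = run_agent(case["input"])
        results.append({
            "input": case["input"],
            "expected": case["expected"],
            "actual": output,
            "correct": output == case["expected"],
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# Demo with a stand-in "agent" that uppercases its input:
cases = [{"input": "ok", "expected": "OK"}, {"input": "no", "expected": "N0"}]
accuracy, results = score_baseline(cases, lambda s: s.upper())
# accuracy is 0.5: the stand-in matched 1 of the 2 labeled outputs
```

Keep the per-case records, not just the aggregate number: the failures are the input to your error analysis.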

One pattern that works well for enterprise teams: the three-level testing approach from Shreya Shankar's research on LLM pipelines. Level 1 is deterministic assertions (did the agent return valid JSON? Did it stay within the allowed actions?). Level 2 is model-graded evaluation (does a second LLM judge the output as correct?). Level 3 is periodic human review of a random sample. You need all three.
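Level 1 is the cheapest to build and catches a surprising share of failures. A minimal sketch, assuming the agent emits a JSON object with an `action` field drawn from a known vocabulary (the `ALLOWED_ACTIONS` set and the `justification` field here are illustrative, not part of any cited framework):

```python
import json

# Assumption: your agent's action vocabulary - replace with your own
ALLOWED_ACTIONS = {"lookup_order", "send_reply", "escalate"}

def level1_checks(raw_output: str) -> list[str]:
    """Level 1: deterministic assertions on one agent output.
    Returns a list of failure descriptions (empty means pass)."""
    failures = []
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    action = parsed.get("action")
    if action not in ALLOWED_ACTIONS:
        failures.append(f"disallowed action: {action!r}")
    if "justification" not in parsed:
        failures.append("missing justification field")
    return failures
```

Run these assertions on every output in production, not just in tests; any nonempty result is a candidate for your Level 3 human-review sample.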

2. Monitoring: What Breaks at 3 AM on a Tuesday

Pilot agents run during business hours while someone watches. Production agents run at 3 AM processing overnight batch jobs. The monitoring gap is where most production incidents start.

Minimum viable monitoring:

| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| Task success rate | Catch quality drops | Below 90% over 1-hour window |
| Latency per task | Detect stuck loops | Above 3x normal p95 |
| Cost per task | Prevent bill shock | Above 2x average |
| Error rate by type | Identify new failure modes | Any new error type |
| Escalation rate | Track human fallback usage | Above 30% (adjust per use case) |
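The thresholds above translate directly into a small check you can run on every metrics window. A sketch under assumed field names (`success_rate`, `p95_latency_s`, and so on are placeholders for whatever your metrics pipeline emits):

```python
def check_alerts(window, baseline):
    """Compare a 1-hour metrics window against the thresholds in the table.
    `window` and `baseline` are plain dicts; the field names are assumptions."""
    alerts = []
    if window["success_rate"] < 0.90:
        alerts.append("success rate below 90%")
    if window["p95_latency_s"] > 3 * baseline["p95_latency_s"]:
        alerts.append("latency above 3x normal p95")
    if window["cost_per_task"] > 2 * baseline["cost_per_task"]:
        alerts.append("cost per task above 2x average")
    new_errors = set(window["error_types"]) - set(baseline["error_types"])
    if new_errors:
        alerts.append(f"new error types: {sorted(new_errors)}")
    if window["escalation_rate"] > 0.30:
        alerts.append("escalation rate above 30%")
    return alerts
```

Wire the returned list into whatever pages you - a Slack webhook is enough to start.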

The cost metric deserves special attention. Anthropic's agent design guidelines emphasize that agents running tools in a loop will sometimes enter failure spirals - retrying the same action, calling tools unnecessarily, or expanding scope beyond the original task. Without a cost-per-task monitor, these spirals turn into surprise invoices.

Logging that actually helps: Log every LLM call with its full prompt, response, latency, and token count. Log every tool invocation with its input, output, and duration. When something breaks in production, you will need to replay the exact sequence of events. The Braintrust AI logging framework is one approach; even a structured JSON log to a database works if you are consistent.

Skip the fancy dashboards at first. A Slack alert that fires when success rate drops below your threshold is worth more than a Grafana dashboard nobody checks.

3. Cost Controls: Set the Ceiling Before You Hit It

Here is a number that surprises people: a single GPT-4-class agent task that involves research, analysis, and writing can cost $0.50-$2.00 in API calls. That sounds cheap until you multiply by 10,000 tasks per month and realize you are looking at $5,000-$20,000 in inference costs alone - before compute, storage, or engineering time.

Cost control checklist:

  • Hard per-task budget ceiling (kill the agent if it exceeds N dollars on a single task)
  • Per-day and per-month spending limits with automatic shutoff
  • Model tiering: use cheaper models for simple subtasks, expensive models only where needed
  • Caching layer for repeated queries (many enterprise tasks are variations on the same theme)
  • Token budget per LLM call (set max_tokens explicitly, do not rely on the model to stop)
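The first item on that list - the hard per-task ceiling - is the one to implement before anything else. A minimal sketch: every LLM and tool call charges against a per-task budget object, and the task dies the moment it goes over. The $2.00 default is an assumption; tune it to your task economics.

```python
class BudgetExceeded(Exception):
    """Raised when a task spends past its hard ceiling."""

class TaskBudget:
    """Hard per-task cost ceiling: charge() every LLM/tool call against it,
    and the task is killed (exception raised) the moment it goes over."""

    def __init__(self, ceiling_usd=2.00):
        self.ceiling = ceiling_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        self.spent += cost_usd
        if self.spent > self.ceiling:
            raise BudgetExceeded(
                f"task spent ${self.spent:.2f}, ceiling ${self.ceiling:.2f}"
            )
```

Catch `BudgetExceeded` in your agent loop, log it as its own error type, and route the task to the human queue rather than retrying.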

The model tiering point is underappreciated. Google DeepMind's research on "Scaling LLM Test-Time Compute" showed that giving a smaller model more "thinking time" (via chain-of-thought or multiple attempts) often beats a single call to a larger model - at lower cost. Your classification step probably does not need your most expensive model. Your final output generation might.

A useful mental model: treat your agent's inference budget like a cloud computing budget. Set alerts at 50%, 75%, and 90% of monthly allocation. Review spending weekly for the first month, then monthly once patterns stabilize.

4. Security and Access: The Permissions You Forgot

The pilot had access to a test database with fake data. Production means real customer information, real financial records, real PII. The security model that worked for the demo will not work here.

The question to ask for every production agent: Does this agent simultaneously (a) access private data, (b) process untrusted input, and (c) have the ability to send data externally? If yes to all three, you have what Simon Willison calls the "lethal trifecta" - a configuration where prompt injection attacks can exfiltrate sensitive data.

Access control checklist:

  • Principle of least privilege: the agent gets only the data it needs for its specific task
  • No direct database write access (use an API layer with validation)
  • Separate credentials per agent (not shared service accounts)
  • Input sanitization on all user-facing inputs before they reach the LLM
  • Output filtering to catch PII, credentials, or internal URLs before they reach external systems
  • Audit log of every data access, with retention policy

The OWASP Top 10 for LLM Applications is worth reviewing with your security team. The relevant risks for production agents are prompt injection (LLM01), insecure output handling (LLM02), and excessive agency (LLM08) - where the agent has more permissions than the task requires.

One practical pattern: build a "permissions manifest" for each agent. A simple YAML file that lists exactly what data sources it can read, what actions it can take, and what external services it can call. Review this manifest during your security approval process. If the manifest looks scary, simplify the agent.
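A manifest like that might look as follows. This is an illustrative schema, not a standard - the field names and the invoice-triage example are assumptions; adapt them to your stack.

```yaml
# permissions-manifest.yml - illustrative schema, adapt field names to your stack
agent: invoice-triage-agent
data_sources:
  read:
    - invoices_db.invoices          # read-only, via the API layer
    - vendors_db.vendor_directory
  write: []                          # no direct database writes
actions:
  - extract_invoice_fields
  - flag_for_review
external_services: []                # nothing leaves the network
credentials: svc-invoice-triage      # dedicated service account, not shared
audit_log_retention_days: 365
```

An empty `external_services` list is the easiest way to break the lethal trifecta described above: the agent can read private data and see untrusted input, but has no channel to exfiltrate anything.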

5. Human Handoff: The Escalation Path Nobody Designs

Every production agent needs a clear answer to: "What happens when the agent cannot handle a task?"

This sounds obvious. In practice, most pilot-to-production transitions skip it entirely, leaving users to figure out workarounds when the agent fails. The result is either (a) people stop trusting the agent and route around it, or (b) people blindly accept wrong outputs because they assume the AI must be right.

Handoff design checklist:

  • Explicit confidence thresholds: below X, the agent flags the task for human review instead of completing it
  • A queue for flagged tasks, with SLA for human response time
  • Context preservation: when a human picks up a flagged task, they see everything the agent already did (not starting from scratch)
  • Feedback loop: human corrections on flagged tasks feed back into the evaluation suite
  • Graceful degradation: if the agent service is down, tasks route to the manual process automatically

The confidence threshold is tricky because LLMs are famously poorly calibrated - they sound confident even when wrong. A 2024 study from Tsinghua University on LLM calibration found that post-hoc calibration techniques (like temperature scaling on a held-out set) can improve reliability of confidence scores significantly. But the simpler approach works too: define specific conditions that trigger escalation (e.g., "if the extracted amount differs from the invoice total by more than $100, flag for review") rather than relying on the model to know when it is uncertain.
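The invoice example in that paragraph is exactly the kind of rule worth encoding. A sketch, with the $100 tolerance as a configurable assumption:

```python
def needs_review(extracted_amount, invoice_total, tolerance_usd=100.0):
    """Deterministic escalation rule: flag the task when the extracted
    amount drifts too far from the invoice total, instead of trusting
    the model's self-reported confidence."""
    return abs(extracted_amount - invoice_total) > tolerance_usd

needs_review(4950.00, 5000.00)  # $50 gap, within tolerance: agent proceeds
needs_review(4800.00, 5000.00)  # $200 gap: route to the human review queue
```

A handful of rules like this, derived from your error analysis, usually beats a calibrated confidence score for deciding when to hand off.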

The trust equation: Microsoft Research's work on human-AI collaboration found that users calibrate trust based on early experiences. If the first five escalations are handled smoothly - fast response, clear context, good resolution - users will trust the system. If the first five are fumbled, you will fight adoption for months. Invest in the escalation experience early.

6. The Operational Checklist: 19 Items Before Go-Live

Here is the complete pre-production checklist, organized by the team that owns each item. Print this out. Tape it to a wall. Do not launch until every row has an owner and a completion date.

Engineering

| Item | Owner | Status |
| --- | --- | --- |
| Evaluation suite with 50+ test cases | | |
| Automated scoring on every deployment | | |
| Per-task cost ceiling implemented | | |
| Monthly spending alert configured | | |
| Structured logging for all LLM calls and tool use | | |
| Retry logic with exponential backoff | | |
| Circuit breaker for downstream service failures | | |
| Model version pinned (not "latest") | | |

Security

| Item | Owner | Status |
| --- | --- | --- |
| Permissions manifest reviewed by security team | | |
| PII filtering on agent outputs | | |
| Input sanitization on all user-facing inputs | | |
| Separate credentials (not shared service accounts) | | |
| Audit logging with retention policy | | |

Operations

| Item | Owner | Status |
| --- | --- | --- |
| Escalation queue with SLA | | |
| Confidence thresholds configured | | |
| Graceful degradation to manual process | | |
| On-call rotation for agent incidents | | |
| Weekly performance review for first month | | |
| Quarterly evaluation benchmark refresh | | |

This list is not exhaustive. Your domain will add items. But if you cannot check off these 19, you are shipping a demo, not a product.

7. The Model Upgrade Trap (and How to Avoid It)

One last item that catches teams off guard: model updates. Your agent was built on Claude 3.5 Sonnet or GPT-4o. Six months from now, a newer model is available. Do you upgrade?

The wrong answer is "yes, obviously, newer is better." Newer models can behave differently in subtle ways - different output formatting, different tool-calling patterns, different failure modes. Chip Huyen's analysis of AI engineering pitfalls documents cases where model upgrades broke production systems that worked fine on the previous version.

The right process:

  1. Pin your model version explicitly (never use "latest" in production)
  2. When a new model is available, run your full evaluation suite against it
  3. Compare results side-by-side: accuracy, latency, cost, and edge case behavior
  4. If the new model is better on your evals, promote it to production
  5. If it is worse on any dimension, investigate before upgrading
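Steps 3-5 reduce to a mechanical comparison once you have eval results for both models. A sketch, assuming your suite reports the three dimensions named above (the metric names are placeholders):

```python
def compare_models(eval_results):
    """Side-by-side comparison of candidate vs pinned model on the same
    eval suite. Promote only if the candidate regresses on no dimension."""
    current = eval_results["current"]
    candidate = eval_results["candidate"]
    regressions = []
    if candidate["accuracy"] < current["accuracy"]:
        regressions.append("accuracy")
    if candidate["p95_latency_s"] > current["p95_latency_s"]:
        regressions.append("latency")
    if candidate["cost_per_task"] > current["cost_per_task"]:
        regressions.append("cost")
    if regressions:
        return ("investigate", regressions)
    return ("promote", [])
```

The returned tuple is the row for your tracking spreadsheet: a decision plus the specific dimensions that need investigation.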

This is exactly how you would treat any other dependency upgrade. You would not blindly update your database engine in production. Treat model versions with the same discipline.

The teams that do this well maintain a simple spreadsheet: model version, eval score, cost per task, date tested. When someone asks "should we upgrade to the new model?" the answer is data, not opinion.

What Comes After the Checklist

This checklist is a starting point, not a finish line. The first month of production will surface failure modes your pilot never encountered - edge cases in real data, user behaviors you did not anticipate, cost patterns that differ from your projections.

The teams that succeed treat the first 90 days of production as an extended evaluation period. Weekly reviews. Fast iteration on the evaluation suite. Honest conversations about whether the agent is actually saving time or just shifting work from one team to another.

The teams that struggle are the ones who check the boxes, launch, and move on to the next project. Production AI agents are not deploy-and-forget systems. They are more like new team members who need onboarding, feedback, and occasional correction.

If your pilot is sitting in limbo and you are not sure which of these gaps is the blocker, start with evaluation. Build the test suite. Run the numbers. Everything else on this list gets easier once you can measure whether your agent actually works.