A prototype proves it can happen once
A prototype shows the model completing a clean case in a controlled setting. Implementation asks a harder question: can this run every day, against messy data, timeouts, policy changes, user errors, and incomplete context, without doing something expensive when no one is watching?
That gap is where most agent risk lives. The hard part is not the model; it is everything the demo skipped: real permissions, messy data, partial system outages, ambiguous cases, and the question of who is accountable when the agent is wrong. The model working is not the finish line. It is closer to the point where the real work starts.
The failure modes implementation has to handle
Tool errors and partial failure are common. An API times out halfway through a workflow. A weak implementation retries from the top and repeats a completed action. A production implementation persists state at each step so the system resumes safely rather than duplicating work.
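As a minimal sketch of step-level persistence, assume a workflow is an ordered list of named steps and that state can live in a local JSON file. The function name `run_workflow`, the file layout, and the step names in the usage note are all hypothetical; a real system would more likely use a database or a workflow engine.

```python
import json
from pathlib import Path


def run_workflow(run_id, steps, state_dir):
    """Run (name, action) steps in order, skipping steps already completed.

    Completed step names are checkpointed to <state_dir>/<run_id>.json
    (a hypothetical layout), so a retry resumes instead of starting over.
    """
    state_file = Path(state_dir) / f"{run_id}.json"
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt; do not repeat it
        action()  # may raise, for example on an API timeout
        done.append(name)
        state_file.write_text(json.dumps(done))  # checkpoint after each step
    return done
```

If a step such as `send_email` times out after `issue_refund` has completed, calling `run_workflow` again with the same `run_id` skips the refund and retries only the email. One gap remains in this sketch: a crash between the action finishing and the checkpoint being written would still repeat that one step, so actions with side effects would also need to be idempotent on their own.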
Messy input is another gap between demo and reality. Real records have missing fields, contradictory values, malformed attachments, and outdated notes. The agent has to know when to ask, when to escalate, and when to stop. Guessing should be treated as a failure mode.
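One way to make "ask, escalate, or stop" explicit is a triage step that runs before the agent acts. The sketch below is illustrative only: the field names (`customer_id`, `amount`, `currency`, `invoice_total`, `attachment_unreadable`) and the rules themselves are assumptions, not a recommended schema.

```python
from enum import Enum


class Next(Enum):
    PROCEED = "proceed"
    ASK = "ask"            # request the missing information
    ESCALATE = "escalate"  # hand the case to a human
    STOP = "stop"          # do nothing further


# Hypothetical required fields for an illustrative payment record.
REQUIRED_FIELDS = ("customer_id", "amount", "currency")


def triage(record):
    """Decide what to do with a record before the agent acts on it."""
    if record.get("attachment_unreadable"):
        return Next.STOP  # malformed input: do not guess at its contents
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        return Next.ASK
    total = record.get("invoice_total")
    if total is not None and total != record["amount"]:
        return Next.ESCALATE  # contradictory values need a human decision
    return Next.PROCEED
```

The point of the design is that there is no branch where the agent fills in a missing or contradictory value by itself.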
Unbounded tool access is the dangerous one. Read access is usually manageable. Write access needs limits, approval rules, and traces. The limit should be enforced outside the model, so that the agent cannot take certain actions above a threshold without human review, whatever it decides.
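A rough sketch of such a limit, assuming every write passes through a single wrapper that the agent cannot bypass. `WriteGuard`, `ApprovalRequired`, and the amount-based threshold are hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


class ApprovalRequired(Exception):
    """Raised when a write exceeds the limit and no reviewer has approved it."""


@dataclass
class WriteGuard:
    limit: float  # hypothetical threshold, e.g. a refund amount
    audit_log: list = field(default_factory=list)

    def execute(self, action_name, amount, do_write, approved_by=None):
        """Run do_write() only if it is under the limit or has an approver."""
        if amount > self.limit and approved_by is None:
            self.audit_log.append((action_name, amount, "blocked"))
            raise ApprovalRequired(f"{action_name} of {amount} needs review")
        result = do_write()
        self.audit_log.append((action_name, amount, approved_by or "auto"))
        return result
```

Because the check sits in the wrapper rather than in the prompt, a confused or manipulated model still cannot complete the write; the attempt is only recorded in `audit_log` as blocked.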
Finally, drift is inevitable. Policies change, knowledge changes, models change, and user behavior changes. Offline tests before release are not enough. You need online sampling after launch and a process for turning reviewed failures into new regression cases.
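Online sampling and the failure-to-regression loop can both be small. The sketch below assumes each run has a string ID and a trace stored as a dictionary with `input` and `run_id` keys; the function names and the trace shape are hypothetical.

```python
import hashlib


def sampled_for_review(run_id, rate):
    """Deterministically select roughly `rate` of runs for human review.

    Hashing the run ID (instead of calling random) means the same run
    always gets the same answer, so the sample can be reproduced later.
    """
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def to_regression_case(trace, expected):
    """Turn a reviewed failure into a case for the offline test set."""
    return {
        "input": trace["input"],
        "expected": expected,  # the outcome the reviewer says was correct
        "source_run": trace["run_id"],
    }
```

Each reviewed failure then becomes one more entry in the labeled set, so the offline tests grow from what actually went wrong in production.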
The rollout ladder
Do not jump from demo to full autonomy. Use a rollout ladder and move up only when the evidence supports it.
In shadow mode, the agent runs on real inputs but takes no action. You compare what it proposed with what the human actually did. In human-approval mode, the agent prepares work but a human approves before anything leaves the system. In supervised autonomy, the agent acts only on proven low-risk cases and escalates the rest. Full autonomy, where it makes sense at all, is reserved for bounded slices with long clean histories and survivable failure modes.
Most business workflows should stay at human-approval or supervised autonomy for a long time. The ladder is not a race to remove people. It is a way to expand autonomy at the same pace as evidence.
- Shadow mode: agent proposes, human does.
- Human approval: agent prepares, human approves.
- Supervised autonomy: agent acts on proven low-risk cases.
- Full autonomy: only for bounded, proven, survivable slices.
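The ladder can be expressed as an explicit mode that the system checks before anything executes. This is a sketch under assumptions: the returned action labels, the `low_risk` flag, and the use of a clean-run count as the evidence for promotion are hypothetical choices, not a prescribed design.

```python
from enum import IntEnum


class Mode(IntEnum):
    SHADOW = 1
    HUMAN_APPROVAL = 2
    SUPERVISED = 3
    FULL = 4


def route(mode, low_risk):
    """Decide what happens to an agent proposal under the current mode."""
    if mode is Mode.SHADOW:
        return "log_only"  # compare later with what the human actually did
    if mode is Mode.HUMAN_APPROVAL:
        return "queue_for_approval"
    if mode is Mode.SUPERVISED:
        return "execute" if low_risk else "escalate"
    return "execute"  # Mode.FULL: bounded, proven, survivable slices only


def next_mode(mode, clean_runs, required_clean_runs):
    """Move up one rung only when the evidence threshold has been met."""
    if mode is Mode.FULL or clean_runs < required_clean_runs:
        return mode
    return Mode(mode + 1)
```

Promotion moves one rung at a time and never skips a level, which matches the idea of expanding autonomy at the same pace as evidence.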
Launch gates and instrumentation
Before broad rollout, the agent should pass a small set of cases against real tools: normal case, missing-information case, tool-failure case, policy-boundary case, and regression case. These do not have to be complicated, but they do have to be explicit.
The launch bar should be written as numbers and limits, not opinions: a labeled set of real cases the agent must pass, a minimum pass rate for that set, and a cap on unreviewed write actions. Every meaningful run should capture a trace: run ID, request source, model and prompt versions, retrieved sources, tool calls, latency, cost, decision, reviewer, and final outcome. The trace is what turns a mysterious failure into a fixable one.
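A launch bar written as numbers can be checked mechanically. The sketch below assumes results arrive as a mapping from case name to pass or fail; `LaunchBar`, `launch_gate`, and the case labels are hypothetical names, and the thresholds in the usage note are placeholders rather than recommendations.

```python
from dataclasses import dataclass

# Case types every launch set is assumed to cover (labels are illustrative).
REQUIRED_CASES = (
    "normal",
    "missing_information",
    "tool_failure",
    "policy_boundary",
    "regression",
)


@dataclass
class LaunchBar:
    min_pass_rate: float        # fraction of labeled cases that must pass
    max_unreviewed_writes: int  # cap on write actions no human reviewed


def launch_gate(results, unreviewed_writes, bar):
    """Return a list of reasons the launch is blocked; empty means go."""
    failures = []
    missing = [c for c in REQUIRED_CASES if c not in results]
    if missing:
        failures.append(f"missing case types: {missing}")
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    if pass_rate < bar.min_pass_rate:
        failures.append(f"pass rate {pass_rate:.2f} below {bar.min_pass_rate:.2f}")
    if unreviewed_writes > bar.max_unreviewed_writes:
        failures.append(f"{unreviewed_writes} unreviewed writes exceeds cap")
    return failures
```

For example, with `LaunchBar(min_pass_rate=0.9, max_unreviewed_writes=0)`, a run that passes four of the five case types is blocked on pass rate alone, and the returned list states why.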
What changes the day after launch
An agent is not software you ship and forget. It needs an owner, weekly trace review, monthly eval refresh, and an incident format for failures. That incident format should include what happened, which guardrail failed, the business impact, the trace link, the immediate fix, the regression test added, the owner, and the due date.
The deliverable of implementation is not simply a launch. It is the operating model that keeps the agent trustworthy after launch.