The procurement team finished the rollout email on a Friday. The new AI agent handled invoice triage, ran twelve tools, talked to two systems of record, and passed a pre-launch test suite. Champagne emoji in Slack. By Tuesday, the agent had quietly approved a duplicate vendor payment, escalated a routine reconciliation to a finance director three times, and was answering questions about a tax form that had been deprecated in 2024.

Nobody had done anything wrong. They had just confused "deployed" with "done."

This is the most common and most expensive misunderstanding in AI right now. Companies treat agent deployment the way they treated SaaS deployment: buy, configure, train users, roll out, schedule periodic maintenance. That model worked when the software underneath was static. It does not work when the model changes monthly, the tool surface changes weekly, and the workflow itself reveals new edge cases every time a real customer touches it.

The agent is not the finish line. The agent system is an operating system around the work, and operating systems need an operating model.

What Changes the Day After Launch

Traditional software systems have a deployment curve that flattens. You ship, you patch, you eventually leave it alone. Agents do not flatten. They keep moving because four things underneath them keep moving.

Models change. A new version of Claude, GPT, or Gemini does not arrive as a drop-in upgrade. Behavior shifts on edge cases your evals did not cover. Output format drifts. Tool-calling reliability improves in some places and regresses in others. IBM's 2026 AI operating model blueprint calls this out as the single biggest gap between companies that scale agents and those that stall: a discipline for re-evaluating, not just upgrading.

Cost and latency change. A workflow that cost $0.04 per resolution on launch day can cost $0.12 a month later if the agent starts calling one more tool per turn. A 6-second response can become 14 seconds when a retrieval tool gets slower. Nobody notices until somebody does.
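One way to notice sooner is to compare current numbers against the launch baseline on a schedule. The sketch below is a minimal illustration, not a monitoring product: the `Baseline` class, the `drift_alerts` function, and the 25% tolerance are all assumptions made for the example. The figures are the ones from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Baseline:
    cost_per_resolution: float  # dollars, measured at launch
    latency_seconds: float      # typical response time at launch

def drift_alerts(baseline, cost_now, latency_now, tolerance=0.25):
    """Return the names of metrics that grew more than `tolerance` over baseline."""
    alerts = []
    if cost_now > baseline.cost_per_resolution * (1 + tolerance):
        alerts.append("cost_per_resolution")
    if latency_now > baseline.latency_seconds * (1 + tolerance):
        alerts.append("latency_seconds")
    return alerts

# Launch-day numbers versus the drifted numbers from the text.
launch = Baseline(cost_per_resolution=0.04, latency_seconds=6.0)
print(drift_alerts(launch, cost_now=0.12, latency_now=14.0))
# prints ['cost_per_resolution', 'latency_seconds']
```

The tolerance is a judgment call. The point is that somebody picks a number and a job checks it, so drift stops being a surprise.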

Tools change. Your internal APIs ship new fields. Vendor APIs deprecate endpoints. The schema the agent learned at launch is not the schema it is calling six weeks in. Cornerstone's analysis of AI agents in workforce platforms notes that tool stability, not model quality, is the practical bottleneck for most enterprise deployments.
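A cheap guard against this kind of drift is to record the field names each tool returned at launch and diff live responses against them. This is a sketch under assumptions: `schema_drift` is a hypothetical helper, the invoice fields are invented for illustration, and it only checks top-level keys.

```python
def schema_drift(expected_fields, response):
    """Compare a tool response (a dict) against the field names recorded at launch."""
    expected = set(expected_fields)
    actual = set(response)
    return {
        "missing": sorted(expected - actual),     # fields the agent still expects
        "unexpected": sorted(actual - expected),  # fields added since launch
    }

# Hypothetical launch-time schema and a response from six weeks later.
launch_fields = {"invoice_id", "vendor_id", "amount"}
response = {
    "invoice_id": "INV-1",
    "vendor": {"id": "V-9"},  # vendor_id was moved into a nested object
    "amount": 120.0,
    "currency": "USD",
}
print(schema_drift(launch_fields, response))
# prints {'missing': ['vendor_id'], 'unexpected': ['currency', 'vendor']}
```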

Workflows change. Real customers, real tickets, real orders, and real internal systems surface edge cases your test set never imagined. The first month of production is the most valuable dataset you will ever have for that agent. Most teams throw it away because they were not logging the right things.

And the quality bar itself rises. The agent that was "good enough for pilot" is not good enough once it is making decisions on accounts that matter.

The Hard Questions Nobody Asks Before Launch

Before you ship, ask these. After you ship, ask them again every quarter.

  • What should the agent actually own end to end, and what should remain deterministic software?
  • Which decisions require human approval, and how do those approvals route?
  • Which model is accurate enough for the hardest cases in this workflow?
  • Where can a cheaper, faster model do the work safely?
  • Which tools does the agent genuinely need, and which tools are just confusing it?
  • What happens when the agent is uncertain? Does it stop, escalate, ask, or guess?
  • How do you evaluate the path the agent took, not just the final answer?
  • How do production failures get converted into better prompts, better tools, better routing, or better workflow design?

If you cannot answer these on a whiteboard in fifteen minutes, you do not have an operating model. You have a demo with a webhook.

Anthropic's guidance on building effective agents makes the point that the simplest architecture that solves the problem is almost always the right one. The corollary that gets less attention: the operating model has to be just as simple. A complicated agent with a sloppy operating model is worse than a deterministic workflow nobody has to babysit.

The Operating Model in Six Layers

Think of the post-launch system as six layers stacked on top of the agent itself. Each one needs an owner, a cadence, and a metric.

1. Scope and Routing

Define what the agent owns and what it does not. Most production incidents come from agents reaching for cases that should have been routed to a different agent, a deterministic path, or a human. The USAII analysis of AI agent management puts scope discipline at the top of the skills list for 2026. It shows up first because it is the cheapest thing to fix and the most expensive thing to ignore.

A routing decision is also a model selection decision. Use the strongest model to establish a baseline on the hardest cases. Then move easier cases to cheaper, faster models with explicit guardrails. Do not start with the cheap model and try to climb up. You will spend three months chasing edge cases that the better model would have handled on day one.
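In code, that ordering looks like a router that defaults to the strong model and only hands a case to the cheaper one when it has been explicitly cleared. The sketch below assumes a lot: the model names, the case categories, and the amount limit are hypothetical, and "cleared" stands for "the cheap model matched the strong-model baseline on evals for this category."

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Case:
    category: str
    amount: float

# Hypothetical: categories where the cheaper model matched the baseline on evals.
CLEARED_FOR_CHEAP = {"status_lookup", "address_change"}
CHEAP_AMOUNT_LIMIT = 500.0  # explicit guardrail, also hypothetical

def pick_model(case):
    """Default to the strong model; downgrade only for cleared, low-stakes cases."""
    if case.category in CLEARED_FOR_CHEAP and case.amount <= CHEAP_AMOUNT_LIMIT:
        return "cheap-model"
    return "strong-model"

print(pick_model(Case("status_lookup", 40.0)))       # prints cheap-model
print(pick_model(Case("status_lookup", 9000.0)))     # prints strong-model
print(pick_model(Case("duplicate_payment", 40.0)))   # prints strong-model
```

New categories land on the strong model automatically, which is the safe default the paragraph above argues for.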

2. Tools

Small, well-defined tools beat large, ambiguous ones every time. An agent given fifteen tools, all named some version of "search," will pick the wrong one and confidently use it. An agent given three tools with clear contracts and crisp descriptions will outperform it.

Log every tool call. Log the input, the output, the latency, and the model's reasoning for choosing that tool. This is the single highest-leverage piece of telemetry you can collect, and it costs almost nothing to instrument.
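A minimal version of that instrumentation is a wrapper every tool call passes through. This is a sketch, not a prescribed logging design: `call_tool`, `TOOL_LOG`, and the `lookup_invoice` tool are hypothetical, and the in-memory list stands in for whatever log sink you already run.

```python
import json
import time

TOOL_LOG = []  # stand-in for a real log sink

def call_tool(name, fn, args, reasoning):
    """Run a tool and record input, output, latency, and the model's stated reasoning."""
    start = time.perf_counter()
    record = {"tool": name, "input": args, "reasoning": reasoning}
    try:
        record["output"] = fn(**args)
        record["ok"] = True
        return record["output"]
    except Exception as exc:
        # Failed calls are logged too, then re-raised for the caller to handle.
        record["error"] = repr(exc)
        record["ok"] = False
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        TOOL_LOG.append(record)

def lookup_invoice(invoice_id):
    """Hypothetical tool used only for this example."""
    return {"invoice_id": invoice_id, "status": "open"}

result = call_tool(
    "lookup_invoice",
    lookup_invoice,
    {"invoice_id": "INV-1"},
    reasoning="Need the invoice status before triage.",
)
print(json.dumps(TOOL_LOG[-1]))
```

The `finally` block is the important part: the record is written whether the tool succeeds or raises, so failures never go missing from the log.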

3. Human-in-the-Loop

Decide upfront which actions are reversible and which are not. Reversible actions (drafting a reply, tagging a record, suggesting a route) can run fully autonomous with sampled review. Irreversible actions (sending money, closing tickets, modifying contracts, communicating with customers under your brand) need approval queues with clear SLAs.
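That split can be enforced in one place, before any action runs. The sketch below assumes hypothetical action names based on the examples above, and uses in-memory stand-ins for the approval queue and for actually executing an action. Anything not listed as reversible is queued, so the gate fails closed.

```python
from collections import deque

# Hypothetical action names; the split mirrors the examples in the text.
# Irreversible actions (payments, closing tickets, contracts, customer messages)
# are deliberately absent from this set.
REVERSIBLE = {"draft_reply", "tag_record", "suggest_route"}

approval_queue = deque()  # stand-in for a real approval queue with SLAs
executed = []             # stand-in for actually performing the action

def dispatch(action, payload):
    """Execute reversible actions immediately; queue everything else for a human."""
    if action in REVERSIBLE:
        executed.append((action, payload))
        return "executed"
    # Unknown or irreversible actions wait for approval.
    approval_queue.append((action, payload))
    return "queued_for_approval"

print(dispatch("tag_record", {"record": "R-1", "tag": "duplicate"}))  # prints executed
print(dispatch("send_payment", {"vendor": "V-9", "amount": 1200.0}))  # prints queued_for_approval
```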

Google's SRE book, with its chapter on monitoring distributed systems, is about a decade old and still the right mental model here: you are running a service, you have error budgets, you have on-call, and you have humans who own the recovery path when the automation makes a bad call. Treat the agent as a service, not as a product feature.

4. Evaluation

Evals are how you find out the agent is getting worse before your customers do. Most teams build evals for the launch and then never run them again. That is the equivalent of writing tests for one release and never running them against anything you ship afterward.

Evals need to grow with the system. Every production failure becomes a new test case. Every model upgrade triggers a full re-run. Every new tool gets its own targeted eval. Hamel Husain's writing on LLM evaluation lays out the practical mechanics, but the operating point is this: if your eval suite is not larger today than it was at launch, your agent is probably regressing and you cannot see it.
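The re-run rule is easier to keep when something detects the change for you. One option, sketched here with hypothetical function names and placeholder model names, is to fingerprint the model, prompt, and tool schemas together and compare against the fingerprint of the last configuration that passed the full suite.

```python
import hashlib
import json

def config_fingerprint(model, prompt, tool_schemas):
    """Hash everything whose change should trigger a full eval re-run."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "tools": tool_schemas},
        sort_keys=True,  # stable ordering so equal configs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def needs_full_eval(last_evaluated, model, prompt, tool_schemas):
    """True when the live config differs from the last config that was evaluated."""
    return config_fingerprint(model, prompt, tool_schemas) != last_evaluated

# Placeholder values for illustration only.
tools = {"lookup_invoice": {"invoice_id": "string"}}
passed = config_fingerprint("model-v1", "Triage the invoice.", tools)

print(needs_full_eval(passed, "model-v1", "Triage the invoice.", tools))  # prints False
print(needs_full_eval(passed, "model-v2", "Triage the invoice.", tools))  # prints True
```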

You also need to evaluate the path, not just the answer. A correct final answer reached by a hallucinated tool call is not a passing test. It is a failure that has not been caught yet.
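A path-aware check can be small. The sketch below assumes a run is recorded as a final answer plus a trace of tool calls. The record shape, the `evaluate_run` name, and the tool names are hypothetical. It fails a run whose answer is correct but whose trace includes a tool that does not exist.

```python
def evaluate_run(run, expected_answer, known_tools):
    """Return a list of failure reasons; an empty list means the run passes."""
    failures = []
    if run["answer"] != expected_answer:
        failures.append("wrong_answer")
    for step in run["trace"]:
        if step["tool"] not in known_tools:
            failures.append("unknown_tool:" + step["tool"])
        elif not step["ok"]:
            failures.append("failed_tool_call:" + step["tool"])
    return failures

known = {"lookup_invoice", "match_vendor"}
run = {
    "answer": "duplicate",
    "trace": [
        {"tool": "lookup_invoice", "ok": True},
        {"tool": "search_everything", "ok": True},  # not a registered tool
    ],
}
print(evaluate_run(run, "duplicate", known))
# prints ['unknown_tool:search_everything']
```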

5. Observability and Incident Response

Treat the agent like any other production system. Dashboards, alerts, on-call rotation, runbooks. The AgentCenter guide to AI agent management lists the minimum viable telemetry set: tool call success, escalation rate, average decision depth, cost per resolution, latency distribution, and outcome accuracy sampled by a human reviewer.
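Those six numbers can be computed from per-case records if the logging is in place. The sketch below is one possible shape, with assumptions flagged: the record fields are hypothetical, "decision depth" is interpreted as the number of tool calls per case, and the latency distribution is reduced to a median and a maximum for brevity.

```python
from statistics import median

def rate(flags):
    """Fraction of truthy values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

def summarize(cases):
    """Compute the telemetry set from a non-empty list of per-case records."""
    if not cases:
        raise ValueError("need at least one case")
    tool_results = [ok for case in cases for ok in case["tool_calls_ok"]]
    # Only cases a human reviewer sampled carry a verdict.
    verdicts = [c["human_verdict"] for c in cases if c["human_verdict"] is not None]
    latencies = [c["latency_s"] for c in cases]
    return {
        "tool_call_success": rate(tool_results),
        "escalation_rate": rate(c["escalated"] for c in cases),
        "avg_decision_depth": sum(len(c["tool_calls_ok"]) for c in cases) / len(cases),
        "cost_per_resolution": sum(c["cost_usd"] for c in cases) / len(cases),
        "latency_median_s": median(latencies),
        "latency_max_s": max(latencies),
        "sampled_accuracy": rate(verdicts),
    }

cases = [
    {"tool_calls_ok": [True, True], "escalated": False,
     "cost_usd": 0.04, "latency_s": 6.0, "human_verdict": True},
    {"tool_calls_ok": [True, False, True, True], "escalated": True,
     "cost_usd": 0.12, "latency_s": 14.0, "human_verdict": None},
]
summary = summarize(cases)
print(summary)
```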

The agent will fail. The question is whether you find out in five minutes or five days.

6. Identity, Security, and Audit

This layer is underrated, and the exposure is growing. Agents act on behalf of users, services, or themselves, and they often hold credentials that would make a security team uncomfortable if they understood what was happening. The AI Agent Identity Security 2026 Deployment Guide is worth reading even if you only do AI part-time. Scoped credentials, short-lived tokens, audit logs on every action, and a clear identity for every agent are not optional anymore.
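To make those four requirements concrete, here is a toy model of them in plain Python. It is an illustration of the shape, not a security implementation: `AgentToken`, `issue_token`, `authorize`, the scope strings, and the five-minute lifetime are all assumptions, and a real deployment would lean on your identity provider rather than an in-memory list.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentToken:
    agent_id: str        # a clear identity for every agent
    scopes: frozenset    # scoped credentials
    expires_at: float    # short-lived
    value: str

AUDIT_LOG = []  # stand-in for a durable audit log

def issue_token(agent_id, scopes, ttl_seconds=300, now=None):
    """Mint a token that is limited to `scopes` and expires after `ttl_seconds`."""
    now = time.time() if now is None else now
    return AgentToken(agent_id, frozenset(scopes), now + ttl_seconds,
                      secrets.token_urlsafe(32))

def authorize(token, scope, now=None):
    """Check scope and expiry, and write an audit entry for every decision."""
    now = time.time() if now is None else now
    allowed = now < token.expires_at and scope in token.scopes
    AUDIT_LOG.append({"agent": token.agent_id, "scope": scope,
                      "allowed": allowed, "at": now})
    return allowed

token = issue_token("invoice-triage", {"invoices:read"}, ttl_seconds=300, now=1000.0)
print(authorize(token, "invoices:read", now=1100.0))   # prints True
print(authorize(token, "payments:write", now=1100.0))  # prints False (out of scope)
print(authorize(token, "invoices:read", now=1400.0))   # prints False (expired)
```

Denied requests are logged alongside allowed ones, which is what lets the log explain what the agent tried to do as well as what it did.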

Auditability is also a business asset. When a regulator, a customer, or a finance lead asks "why did the agent do that," the answer needs to be a log line, not a shrug.

The Post-Launch Checklist

If you only take one thing from this post, take this. It is the same checklist we use when we audit other teams' agent deployments.

  • Start with one narrow workflow. Resist the urge to ship a "platform."
  • Define success criteria in business terms, not model metrics. Resolution rate, escalation rate, cost per case, customer satisfaction on agent-handled cases.
  • Use the strongest available model for the baseline. Optimize cost only after you know the workflow works.
  • Build evals before scope expansion. Every new use case requires its own eval set.
  • Give the agent small, well-defined tools. Each tool description should fit on a Post-it.
  • Log every tool call and the full decision path. Store it for at least 90 days.
  • Route low-confidence and high-risk cases to humans by default. Loosen the gates as evidence accumulates.
  • Review failures weekly. Categorize each one as a prompt, tool, routing, or workflow issue. Fix the underlying layer, not the symptom.
  • Re-run the full eval suite on every model upgrade, every tool change, and every prompt change. No exceptions.
  • Resist multi-agent complexity until a single agent has clearly failed. Most "multi-agent" architectures are routing problems in disguise.

That last one matters. The fashion in 2026 is to wire five agents together and call it a system. Digital Bricks' analysis of agent adoption finds that the production success rate of multi-agent architectures is meaningfully lower than that of well-designed single-agent systems with deterministic routing. The complexity tax is real and it compounds.

What Continuous Improvement Actually Looks Like

The hardest part of running an agent in production is the feedback loop. Most teams build the agent, ship it, watch the dashboards for a week, and then move on to the next project. Six months later they wonder why quality dropped.

The teams that get this right run a weekly cadence. A senior engineer or operator pulls the last 100 sampled cases, reads the decision paths, and tags each one. The tags feed two backlogs: an evals backlog (cases the test set should cover) and a fix backlog (changes to prompts, tools, routing, or workflow). Both backlogs get worked the same week.
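The bookkeeping for that review can be very light. The sketch below assumes the reviewer tags each sampled case as "ok" or with the layer that failed (prompt, tool, routing, workflow), and that every failure goes to both backlogs. The function name, record shape, and that both-backlogs rule are assumptions for illustration.

```python
from collections import defaultdict

FIX_LAYERS = {"prompt", "tool", "routing", "workflow"}

def triage(reviewed_cases):
    """Split tagged cases into an evals backlog and a per-layer fix backlog."""
    evals_backlog = []
    fix_backlog = defaultdict(list)
    for case in reviewed_cases:
        tag = case["tag"]
        if tag == "ok":
            continue
        if tag not in FIX_LAYERS:
            raise ValueError("unknown tag: " + tag)
        evals_backlog.append(case["id"])   # the test set should cover this case
        fix_backlog[tag].append(case["id"])  # the failing layer needs a change
    return evals_backlog, dict(fix_backlog)

reviewed = [
    {"id": "case-1", "tag": "ok"},
    {"id": "case-2", "tag": "routing"},
    {"id": "case-3", "tag": "tool"},
    {"id": "case-4", "tag": "routing"},
]
evals_backlog, fix_backlog = triage(reviewed)
print(evals_backlog)  # prints ['case-2', 'case-3', 'case-4']
print(fix_backlog)    # prints {'routing': ['case-2', 'case-4'], 'tool': ['case-3']}
```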

This is not glamorous. It is the part nobody puts in the demo video. It is also the only thing that separates an agent that gets better every month from an agent that gets quietly worse until somebody pulls the plug. Kanerika's writeup on building AI agents that scale calls this the "agent ops loop," and the companies that have it are running circles around the ones that do not.

Good agent systems are not just prompts. They are operating systems around work. The agent is one process running on that OS. The operating model is the kernel.

How OpenNash Can Help

If your team is staring at a deployed agent and wondering why it feels harder to run than it was to build, that is the normal experience and a fixable one.

OpenNash builds production AI agents and the operating model around them. The audit phase maps which workflows are actually ready for agents, which should stay deterministic, and where human approval needs to live. The design phase defines tools, routing, escalation paths, and the eval set before any code ships. The build and deploy phases include the observability, audit logging, and weekly review cadence that make the agent improve over time instead of decay.

The deliverable is the agent plus the runbook. Your team owns both at the end. We have written more about the underlying patterns in agent reliability engineering, human-in-the-loop design, LLM routing and model selection, and moving from prototype to production if you want to go deeper before talking to anyone.

If you have an agent in production and the operating model behind it is "we'll figure it out," book a call and we will map the gap to a concrete plan.

The agent that ships on Friday is the easy part. The agent that is still earning its keep six months later is the work.