Aaron Levie, the CEO of Box, has been making a quiet but important argument on X for the past few months: when you sell an AI agent, you are no longer selling software. You are selling the workflow itself, performed by the technology. That sounds like marketing copy until you sit with it. Software is something you install. A workflow being performed is something you operate. Those two things have almost nothing in common from a buying, staffing, or accountability perspective.

This is why so many AI agent pilots stall at month three. Companies bought what they thought was software. They got handed something that behaves more like a new hire who needs onboarding, supervision, system access, performance reviews, and ongoing coaching. Nobody on the team is staffed to do that work, and the pilot quietly dies.

The point of this post is to be honest about what AI agent deployment actually involves in 2026, give you a checklist you can use whether or not you ever hire a professional services firm, and then explain when bringing in a partner like OpenNash is worth it and when it is not. If you leave with the checklist and never call us, that is a fine outcome.

Why "Just Buy the Tool" Keeps Failing

Deploying an AI agent is not deploying a SaaS app. With SaaS, the vendor wrote the workflow once, baked it into the product, and now ten thousand customers run roughly the same flow with light configuration. With an agent, the workflow used to live in someone's head. Your job is to extract it, translate it into something an LLM plus tools can run, decide where humans still need to be involved, and keep the whole thing working as models, data, and business rules change.

That involves at least six things that look nothing like software installation:

  • Domain mapping. Watching how the work is actually done today, including the parts nobody documented.
  • System integration. Giving the agent read and write access to the systems where the work lives, with the right credentials and the right scope.
  • Context engineering. Deciding which data the agent sees on each turn, in what order, and at what cost (see the sketch after this list). Anthropic's guide to building effective agents is blunt about this: most failures are context failures, not model failures.
  • Eval design. Building a test set that reflects real failure modes for your workflow, not generic accuracy metrics. Hamel Husain's evals FAQ makes the case that 60 to 80 percent of your dev time should go here, and most teams spend almost none of it.
  • Change management. Getting the people whose work is changing to actually use the thing.
  • Ongoing tuning. Models update. Data drifts. The agent that worked in May breaks in August.
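
To make the context engineering bullet concrete, here is a minimal sketch of what "decide which data the agent sees on each turn" looks like in code. The retrieve and count_tokens helpers are placeholders for whatever retrieval and tokenizer your stack provides; the budget and the ordering decision are the parts that matter.

```python
# Minimal context assembly sketch: rank candidate chunks, pack them into a
# fixed token budget, and keep the ordering deliberate. retrieve() and
# count_tokens() are hypothetical stand-ins for your own stack.

def build_context(task: str, retrieve, count_tokens, budget_tokens: int = 6000) -> str:
    candidates = retrieve(task, top_k=20)          # hypothetical: returns [(score, text), ...]
    candidates.sort(key=lambda c: c[0], reverse=True)

    parts, used = [], 0
    for _score, text in candidates:
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue                               # skip rather than truncate mid-document
        parts.append(text)
        used += cost

    # A common heuristic: put the most relevant material last, closest to
    # the question, rather than burying it in the middle of the prompt.
    parts.reverse()
    return "\n\n---\n\n".join(parts)
```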

Most internal teams are not staffed to do all six. The teams that are staffed for it are already the bottleneck on every other AI initiative in the company. This is why pilots stall: not because the technology is bad, but because the operating model around the technology is missing.

The New Role Quietly Emerging

Levie has been arguing for a while that every team buying or building agents now needs a new role. He calls it the agent deployer, sometimes the workflow architect. Others call it the forward-deployed engineer, borrowing the term from Palantir, where the role was invented to embed engineers directly with customers rather than handing software over the wall.

Whatever you call it, the day-to-day is the same:

  • Map the structured and unstructured data flows the agent needs to touch
  • Decide what the future state workflow should look like, not just the current one
  • Get the agent the right context at the right moment, which is mostly a tooling and retrieval problem
  • Decide where humans stay in the loop, and design the handoff
  • Run evals after every model upgrade, prompt change, or data source change
  • Track KPIs that the business actually cares about, not token counts

This is not a project. It is an operating role. Once an agent is in production, this work continues forever, the same way you do not stop monitoring a database after you turn it on. Simon Willison's designing agentic loops post is good on the day-to-day reality of this: the loop, the tools, the failure recovery, the cost ceilings. None of it gets simpler over time.
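
A concrete way to picture the role is the loop itself. This is a deliberately stripped-down sketch, not any particular framework's API: call_model, the tools dict, and the message format are placeholders, but the turn limit, cost ceiling, and failure recovery are exactly the things that need a permanent owner.

```python
# A stripped-down agent loop: call the model, run any tool it asks for,
# feed the result back, and stop on an answer, a turn limit, or a cost
# ceiling. call_model() and the tools dict are placeholders for your stack.

def run_agent(task: str, call_model, tools: dict, max_turns: int = 10,
              max_cost_usd: float = 2.00):
    messages = [{"role": "user", "content": task}]
    spent = 0.0

    for _ in range(max_turns):
        reply = call_model(messages)               # assumed shape: {"text", "tool_call", "cost_usd"}
        spent += reply["cost_usd"]
        if spent > max_cost_usd:
            return {"status": "over_budget", "messages": messages}

        call = reply.get("tool_call")
        if call is None:                           # no tool requested: the agent is done
            return {"status": "done", "answer": reply["text"]}

        tool = tools.get(call["name"])
        try:
            result = tool(**call["args"]) if tool else f"unknown tool: {call['name']}"
        except Exception as exc:                   # failure recovery: surface the error, let the model retry
            result = f"tool error: {exc}"

        messages.append({"role": "assistant", "content": reply["text"]})
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})

    return {"status": "max_turns", "messages": messages}
```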

If you have someone in-house who already does this, you do not need a partner. If you do not, you have two choices: hire (slow, hard, the talent market is brutal right now) or bring in a firm that already employs these people.

A Six-Step Deployment Checklist You Can Steal

This is the playbook we use, written generically enough that you can run it yourself. If you do, you will probably get further than 80 percent of the AI pilots in your industry.

1. Pick one workflow with a clear before/after KPI

Not "improve customer support." Pick "reduce average handle time on tier-1 billing tickets from 7 minutes to 4 minutes." If you cannot write the KPI in that form, you do not understand the workflow well enough yet to automate it. Go back and watch the work being done.

2. Map the data and systems the agent needs to touch

List every system the human currently opens to do this job. CRM, billing, ticketing, knowledge base, internal wiki, that one shared spreadsheet someone has not migrated yet. For each one, note read access, write access, auth method, and rate limits. Most agent integrations die here, not at the model.
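
One lightweight way to capture that mapping is a plain inventory the whole team can argue about before anything gets built. A sketch, with made-up systems and limits:

```python
# Illustrative inventory for step 2. Every entry here is made up; the point
# is to force read/write scope, auth, and rate limits onto paper before any
# agent code exists.
SYSTEMS = [
    {"name": "crm", "read": True, "write": True,
     "auth": "oauth2 service account", "rate_limit": "100 req/min"},
    {"name": "billing", "read": True, "write": False,   # refunds stay human-only for now
     "auth": "api key, sandbox first", "rate_limit": "10 req/min"},
    {"name": "knowledge_base", "read": True, "write": False,
     "auth": "read-only token", "rate_limit": "not documented"},
    {"name": "legacy_spreadsheet", "read": True, "write": False,
     "auth": "shared drive link", "rate_limit": "n/a"},
]
```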

3. Write the eval set before you write the agent

Take 30 to 50 real examples of the workflow being done. For each one, write down what a correct outcome looks like. This is your eval set. You will run it after every change. Eugene Yan's writing on evals is the cleanest practical primer if you have not built one before. The teams that build evals first ship; the teams that build agents first churn.
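
Here is roughly what that looks like as a harness. The case format and the substring check are assumptions, and a substring check is the crudest possible grader, but it is one you can trust and run in seconds after every change.

```python
# Minimal eval harness sketch for step 3: a file of real cases, each with a
# check, runnable after every prompt, tool, or model change. run_workflow()
# is a placeholder for your agent entry point.

import json

def run_evals(cases_path: str, run_workflow) -> float:
    with open(cases_path) as f:
        cases = json.load(f)                # assumed: [{"id": ..., "input": ..., "must_contain": ...}]

    failures = []
    for case in cases:
        output = run_workflow(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append(case["id"])

    score = 1 - len(failures) / len(cases)
    print(f"{score:.0%} passing, failed cases: {failures}")
    return score
```

Many teams graduate to LLM-as-judge grading later; starting with deterministic checks keeps the first version honest.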

4. Decide the human-in-the-loop points up front

For every action the agent can take, ask: if this is wrong, what is the cost? Drafting a reply that gets reviewed: low cost, full automation is fine. Issuing a refund: high cost, require human approval. Deleting a record: maybe the agent should not have that capability at all. Bake this in from day one. Retrofitting human review into a fully autonomous agent is much harder than starting with review and removing it later.
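
One way to bake that in is to classify every action up front and consult the policy before anything executes. A sketch, with illustrative action names:

```python
# Sketch of the human-in-the-loop policy from step 4: every action the agent
# can take is classified up front, and the gate is checked before execution.
# Action names and helper functions are illustrative.

APPROVAL_POLICY = {
    "draft_reply":   "auto",        # low cost if wrong: ship it, review async
    "issue_refund":  "human",       # high cost: queue for explicit approval
    "delete_record": "forbidden",   # the agent should not have this capability at all
}

def execute_action(action: str, payload: dict, do_action, queue_for_human):
    policy = APPROVAL_POLICY.get(action, "human")  # unknown actions default to the cautious path
    if policy == "forbidden":
        raise PermissionError(f"{action} is not available to the agent")
    if policy == "human":
        return queue_for_human(action, payload)
    return do_action(action, payload)
```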

5. Ship the smallest working version to one team

Not the whole company. One team, one workflow, one week. Treat the rollout like a hardware launch. The goal of week one is to find out what you got wrong, not to scale.

6. Instrument it and review weekly

You need three things visible from day one: what the agent did (a log), how often it was right (eval scores), and what it cost (tokens, API calls, human review time). Review these numbers weekly with the team using the agent. Without this loop, the agent quietly degrades and nobody notices until a customer complains.
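
The simplest version is one structured log record per agent run, appended to a file or a table, so the weekly review starts from data instead of anecdotes. A sketch, with illustrative field names:

```python
# Sketch of step 6: one log record per agent run, covering what it did,
# whether it was right, and what it cost. Field names are illustrative.

import json, time

def log_run(log_path: str, run_id: str, action: str, outcome: str,
            eval_passed: bool, tokens: int, cost_usd: float,
            human_review_minutes: float) -> None:
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "action": action,                        # what the agent did
        "outcome": outcome,                      # accepted / edited / rejected by the reviewer
        "eval_passed": eval_passed,              # how often it was right
        "tokens": tokens,                        # what it cost, in tokens...
        "cost_usd": cost_usd,                    # ...and in dollars
        "human_review_minutes": human_review_minutes,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```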

That is the playbook. Run it on one workflow before you talk to any vendor or services firm, including us. You will learn more from one honest attempt than from ten sales calls.

When a Partner Actually Pays for Itself

This is the part most consultancies will not write honestly, so let us do it.

You probably do not need a professional services partner if:

  • You have one simple workflow with one system to integrate
  • You already have an in-house engineer who has shipped an LLM application
  • The agent is clearly a nice-to-have, with no deadline attached
  • You are happy to learn by trial and error over six to nine months

A partner is worth the money when:

  • The workflow spans three or more systems with messy auth and rate limits
  • The data is regulated (HIPAA, SOC 2 scope, financial records, PII at scale) and a wrong action is expensive
  • You have no in-house AI engineering and hiring one will take six months you do not have
  • You have a deadline (board commitment, customer contract, fiscal year goal) the internal team cannot hit
  • The downside of a bad agent is worse than the downside of no agent (think anything that touches money, contracts, or customer trust)

If two or more of those apply to your situation, the math on a flat-fee engagement usually wins versus internal headcount cost plus the opportunity cost of the workflow staying manual for another year.

How OpenNash Works

We are a small, technical, flat-fee professional services firm focused on AI agent deployment. The model is intentionally narrow:

  • Flat fee, not hourly. Our incentive is to ship the agent and move on, not to bill more hours. If we underestimate, that is on us.
  • White-glove. We own the integration, the evals, and the rollout, not just a slide deck. The deliverable is a working agent in your production environment, not a recommendation.
  • Platform-flexible. We work across n8n, Claude, MCP, custom Python, and direct API integrations. We are not selling you one stack. The right architecture depends on your data, your systems, and your team.
  • Domain partnership. We sit with your team and learn the workflow before touching code. The first week is mostly watching and asking questions.
  • Ongoing ownership. Models change. Data drifts. We stay engaged after launch under a managed operations agreement so the agent keeps working when the next Claude or GPT version ships.

That is the whole pitch. We are not the right fit for every company. If you read the checklist above and your reaction was "we can do this ourselves in a quarter," you probably can, and you should.

What a Real Engagement Looks Like

For the readers who want to picture it, here is a typical six-week shape:

  • Week 1. Workflow discovery. We sit with the team doing the work. We write the KPI together. We list every system the agent will need to touch.
  • Week 2. System mapping and eval design. We get credentials, sandbox access, and a 30 to 50 example eval set written and agreed on.
  • Weeks 3 and 4. Build and ship the smallest working version to one team. The agent is in production, doing real work, with humans reviewing the outputs that matter.
  • Weeks 5 and 6. Measure, tune, expand. We move from supervised to semi-autonomous on the actions that have proven safe. We hand off the runbook.
  • After week 6. Optional managed operations. We stay on as the team that runs evals, ships prompt and tool updates, and is on call when something breaks.

The deliverable at the end is not a deck. It is a working agent, an eval set your team can run, a runbook, and a person who can either fix things or train your internal hire to fix things.

A Last Honest Thought

The hard part of agent deployment is not the model. The model is the easy part now. The hard part is everything around the model: the workflow understanding, the integration, the evals, the human-in-the-loop calls, the change management, the ongoing tuning. Chip Huyen has written that the journey from zero to sixty in AI engineering is easy and the journey from sixty to one hundred is brutal. That has been our experience too. Most companies underestimate how much sixty-to-one-hundred work there is.

If you want to skip that conversation entirely and just run the checklist yourself, please do. If you read this and thought "we have three of those five 'when a partner pays' boxes checked," book a 30-minute workflow review call with us. No deck, no pitch. We will look at the workflow, tell you whether we think it is a good fit, and if it is not, point you at the cheapest path to running it yourself.

That is the deal.