A mid-market insurance team shipped an AI workflow to triage claims. It demoed perfectly. Three weeks into production, the same claim submitted twice landed in two different queues, and a $4,800 payout got auto-approved because the model decided "under $5,000" applied to a figure it had quietly rounded up from the line items. Nobody changed the rules. The problem was that the rules were never rules. They were a prompt.
This is the failure mode nobody warns you about when you wire an LLM into a business process. The model is fine at the parts that need judgment. The damage comes from the parts that needed no judgment at all, the ones a junior engineer could have written as four lines of code, that someone handed to a probabilistic system because it was already there in the pipeline.
The rule is not a rule if it lives in a prompt
A deterministic step returns the same answer for the same input, every single time. A loan calculator, a tax computation, a "route to tier 2 if priority is high" branch: these are functions. Feed them the same data tomorrow and you get the same result. That property is what makes them testable, auditable, and trustworthy.
An LLM does not have that property, and asking it to fake the property is where teams get hurt. Even at temperature 0, large models are not guaranteed to be deterministic. Floating-point arithmetic, the way requests get batched on a GPU, and silent model updates from your provider all introduce variation. Demian Brecht's "Stop Asking LLMs to Be Deterministic" makes the point bluntly: you are using a tool built to generalize over fuzzy inputs and then demanding it behave like a lookup table. It will mostly comply, right up until the run where it does not.
The cost of getting this wrong is not just correctness. It is everything downstream of correctness. Phillip Carter's writeup of the messy realities of shipping LLM products, "All the Hard Stuff Nobody Talks About," keeps circling back to the same theme: the unpredictability of model output forces you to build a wall of validation and observability around it. If your "under $5,000" check is inside the prompt, you cannot write a unit test for it, you cannot reproduce the bug your auditor found, and you cannot prove to a regulator that the same input always yields the same decision. The number was never the issue. The location of the comparison was.
There is also a quieter tax. A code branch executes in microseconds and costs effectively nothing. An LLM call adds hundreds of milliseconds to a few seconds of latency and a real per-token bill. Routing a simple comparison through a model can make a step several orders of magnitude slower and meaningfully more expensive, for a worse result. You are paying a premium to make a reliable operation unreliable.
Key takeaway: if a step's correct output is a pure function of its input, it should be code. Putting it in a prompt does not make it smarter. It makes it unauditable.
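As a rough illustration of what "it should be code" can look like, here is a minimal sketch of the threshold check as a pure function. The names `AUTO_APPROVE_LIMIT`, `claim_total`, and `can_auto_approve` are hypothetical, and the sketch assumes amounts arrive as decimal strings and that "under $5,000" means strictly less than the limit.

```python
from decimal import Decimal

# Hypothetical limit, taken from the "under $5,000" example in the text.
AUTO_APPROVE_LIMIT = Decimal("5000.00")


def claim_total(line_items: list[str]) -> Decimal:
    """Sum line items exactly. Amounts are decimal strings, so nothing is rounded."""
    return sum((Decimal(item) for item in line_items), Decimal("0"))


def can_auto_approve(total: Decimal, limit: Decimal = AUTO_APPROVE_LIMIT) -> bool:
    """Assumed policy: approve only when the total is strictly under the limit."""
    return total < limit


# The same input gives the same decision on every run, so it can be unit tested.
at_limit = claim_total(["1899.50", "2100.25", "1000.25"])     # exactly 5000.00
under_limit = claim_total(["1899.50", "2100.25", "800.25"])   # exactly 4800.00
```

Because the comparison is an ordinary function, the boundary case (a total of exactly 5000.00) can be pinned down in a test rather than left to interpretation.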
A three-question test before any step becomes an LLM call
Before any step in a workflow gets handed to a model, run it through three questions. This is the cheapest design review you will ever do.
- Is the correct output a function of the input alone? If the answer is fully determined by the data in front of you, with no interpretation required, it is code. A discount calculation, a date comparison, a status lookup: these have one right answer.
- Can you write the rule down without hedging? If you can express it as "if X then Y" without resorting to "it depends on context" or "usually," it is rules or code. The moment you need a paragraph of caveats to describe the logic, you may have found a real LLM candidate.
- Would two careful humans give the same answer? If yes, you want determinism, and code or a traditional model will outperform an LLM on cost and reliability. If two reasonable people would disagree (is this email angry or just terse? is this a refund request or a complaint?), you are in genuine judgment territory, where an LLM can help.
Most steps fail the test for LLM use and pass for code, which is the point. Anthropic's "Building Effective Agents" opens with the same instinct from the other direction: find the simplest thing that works, and only add model-driven complexity when the task demands it. The three questions are how you operationalize "simplest thing that works" step by step rather than for the system as a whole.
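One way to make the three questions a repeatable review is to record the answers per step and derive a recommendation from them. This is only a sketch: `StepReview` and `recommend` are hypothetical names, and treating mixed answers as "review" (a human looks closer) is an assumption of this example, not something the test itself prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepReview:
    name: str
    output_is_function_of_input: bool   # Question 1
    rule_written_without_hedging: bool  # Question 2
    two_humans_would_agree: bool        # Question 3


def recommend(step: StepReview) -> str:
    """Map the three answers to 'code', 'llm', or 'review' (assumed tie-break)."""
    answers = (
        step.output_is_function_of_input,
        step.rule_written_without_hedging,
        step.two_humans_would_agree,
    )
    if all(answers):
        return "code"    # one right answer, expressible as a rule
    if not any(answers):
        return "llm"     # genuine judgment territory
    return "review"      # mixed signals: discuss before choosing


discount = StepReview("discount calculation", True, True, True)
email_tone = StepReview("is this email angry or just terse", False, False, False)
```

Here `recommend(discount)` returns `"code"` and `recommend(email_tone)` returns `"llm"`.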
The work that should almost never touch a model
Some categories of work show up in nearly every business workflow, and almost all of them should stay deterministic. If you find one of these inside a prompt, that is your first refactor.
| Task | Why it should be code or rules | What goes wrong in an LLM |
|---|---|---|
| Exact matching | Identity comparison has one answer | The model "helpfully" treats near-matches as matches |
| Calculations | Arithmetic must be exact and reproducible | Rounding drift, transposed digits, silent unit errors |
| Policy thresholds | Approval limits and cutoffs must be auditable | The threshold becomes a suggestion, not a boundary |
| Routing rules | "If priority is high, go to tier 2" is a switch | Same input routes differently across runs |
| Data validation | Format and range checks are pass or fail | The model accepts malformed data because it reads intent |
| Deduplication and idempotency keys | Must be byte-stable to prevent double-processing | Two phrasings of the same record slip through as distinct |
The insurance example sat squarely in row three. The threshold check, a one-line comparison, had been folded into the prompt that also summarized the claim. Pulling it back out into code did two things at once: it made the decision reproducible, and it made the audit trail real. The model still read the claim and proposed a payout figure. A deterministic step decided whether that figure cleared the approval limit. That division of labor is the whole game.
A useful side benefit: deterministic steps are where your traditional, non-LLM tools shine. A regex, a lookup table, a small classifier trained on your own labels, or a plain database query will beat a frontier model on a structured task in cost, latency, and consistency. You do not need a language model to check whether a string is a valid invoice number.
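To make a few rows of the table concrete, here is a small standard-library sketch of format validation, an idempotency key, and a routing switch. The invoice format (`INV-` plus six digits), the tier names, and all function names are hypothetical; the idempotency key assumes the record is JSON-serializable and that field order should not matter.

```python
import hashlib
import json
import re

# Hypothetical invoice format: "INV-" followed by exactly six ASCII digits.
INVOICE_RE = re.compile(r"INV-[0-9]{6}")


def is_valid_invoice_number(value: str) -> bool:
    """Pass or fail; the whole string must match the pattern."""
    return INVOICE_RE.fullmatch(value) is not None


def idempotency_key(record: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def route(priority: str) -> str:
    """'If priority is high, go to tier 2' written as a plain branch."""
    return "tier2" if priority == "high" else "tier1"


key_a = idempotency_key({"claim_id": "C-1", "amount": "4800.00"})
key_b = idempotency_key({"amount": "4800.00", "claim_id": "C-1"})
```

`key_a` and `key_b` are identical because both records serialize to the same canonical bytes, which is the property a deduplication step depends on.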
Key takeaway: exact matching, math, thresholds, routing, validation, and idempotency are the load-bearing walls of a workflow. Keep them in code where they can be tested and audited.
Where LLMs actually earn their cost
None of this is an argument against LLMs. It is an argument for spending them where they pay off. The model earns its latency and its bill on tasks that are genuinely ambiguous, where no clean rule exists and judgment is the product.
- Classification of fuzzy input. Is this support message a billing question, a bug report, or a churn risk? You could write a hundred keyword rules and still miss half the cases. A model reads intent well.
- Extraction from messy sources. Pulling structured fields out of a PDF contract, an email thread, or an OCR'd form is where models genuinely outperform brittle parsers.
- Summarization. Condensing a long thread into a brief a human can act on is judgment work, not a function.
- Drafting. First-pass replies, proposals, and notes that a person reviews and sends.
- Ambiguous routing. When the routing decision itself requires reading and interpreting unstructured content rather than checking a field.
The distinction between this section and the last is the one Martin Fowler's team draws in "Emerging Patterns in Building GenAI Products": treat the LLM as a component with known failure characteristics, and surround it with evals and guardrails rather than trusting it raw. LangChain's workflows and agents guide makes a parallel structural point: a workflow is a set of predetermined code paths with a model used at specific nodes, while an agent hands control flow itself to the model. Start with the workflow. Let the model fill in the ambiguous nodes. Reach for full agency only when the task genuinely cannot be expressed as a fixed path, which is rarer than vendors imply. We dug into that decision in "when not to build an agent," and the broader framing lives in "AI agent vs workflow."
Even inside these LLM-appropriate tasks, you add controls. A classification step gets a fixed set of allowed labels and a confidence-driven fallback to a human. An extraction step gets a schema. A drafting step gets a human in the loop before anything sends. The model proposes. Deterministic code disposes.
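The classification control described above might look like the following sketch. It assumes the model call has already returned a label and some confidence score; how that score is obtained is outside the sketch. The label set, the `0.80` cutoff, and the `accept_label` name are all hypothetical.

```python
# Hypothetical label set, taken from the support-message example in the text.
ALLOWED_LABELS = frozenset({"billing", "bug_report", "churn_risk"})

# Hypothetical cutoff; a real value would come from evaluating on labeled data.
MIN_CONFIDENCE = 0.80

HUMAN_REVIEW = "human_review"


def accept_label(label: str, confidence: float) -> str:
    """Return the model's label only if code can vouch for it; otherwise escalate."""
    if label not in ALLOWED_LABELS:
        return HUMAN_REVIEW   # the model invented a category
    if confidence < MIN_CONFIDENCE:
        return HUMAN_REVIEW   # the model is unsure, so a person decides
    return label
```

The model only proposes a label; this function decides whether downstream code ever sees it.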
Design the seam, not just the steps
The hard part of mixing deterministic and probabilistic steps is not either kind of step. It is the seam between them. The output of a fuzzy model becomes the input to an exact function, and that boundary is where reliability is won or lost.
Three things make the seam hold:
- Structured output against a schema. Do not let the model return prose that you then parse with hope. Force it to emit JSON that conforms to a defined schema, and use a validation layer like Pydantic or the instructor library to enforce it. If the model returns a category outside your allowed set or a number where you need an integer, you want that to fail loudly at the boundary, not three steps later in production.
- Validation owned by code. The validator, not the model, is what your deterministic steps trust. After the model extracts a payout figure, code confirms it is a number in a plausible range before the threshold check ever runs. After the model classifies a ticket, code confirms the label is one you defined. This is the wall that Carter and Fowler both describe: a thin, boring layer of checks that converts probabilistic output into something a deterministic system can safely consume.
- An explicit fallback path. Decide in advance what happens when validation fails. Retry once with a stricter prompt, then route to a human, then log it. A workflow with no defined failure path will invent one for you at the worst possible moment, usually by passing garbage downstream.
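The three pieces above can be sketched together as follows. To stay runnable anywhere, this uses only the standard library; a validation library such as Pydantic could replace the hand-written checks. Everything named here is an assumption for illustration: the category set, the plausible-range ceiling, the `Extraction` shape, and the `call_model(strict)` hook that stands in for a real model call.

```python
import json
import logging
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Callable, Optional

logger = logging.getLogger("seam")

ALLOWED_CATEGORIES = frozenset({"auto", "property", "health"})  # hypothetical
MAX_PLAUSIBLE_PAYOUT = Decimal("1000000")                       # hypothetical


@dataclass(frozen=True)
class Extraction:
    category: str
    payout: Decimal


def parse_extraction(raw: str) -> Extraction:
    """Raise ValueError unless raw is JSON with an allowed category and sane payout."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    payout_raw = data.get("payout")
    if not isinstance(payout_raw, str):
        raise ValueError("payout must be a decimal string")
    try:
        payout = Decimal(payout_raw)
    except InvalidOperation as exc:
        raise ValueError(f"payout is not a number: {payout_raw!r}") from exc
    if not payout.is_finite() or not (Decimal("0") < payout <= MAX_PLAUSIBLE_PAYOUT):
        raise ValueError(f"payout out of plausible range: {payout_raw!r}")
    return Extraction(category, payout)


def extract_with_fallback(call_model: Callable[[bool], str]) -> Optional[Extraction]:
    """Try once normally, once with strict=True, then give up and return None.

    call_model is a hypothetical hook: it takes a 'strict prompt' flag and
    returns the model's raw text. None means the caller routes to a human.
    """
    for strict in (False, True):
        try:
            return parse_extraction(call_model(strict))
        except ValueError as exc:
            logger.warning("validation failed (strict=%s): %s", strict, exc)
    return None
```

For example, if the first reply is prose and the retry returns `{"category": "auto", "payout": "4800.00"}`, the function returns a validated `Extraction`; if both replies fail, it returns `None` and both failures are already logged.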
When you design the seam first, the rest of the architecture gets simpler. You stop asking "should this whole thing be an agent?" and start asking, step by step, "is this node deterministic or not, and what validates the handoff?" That question scales. It is also exactly the kind of design that survives an audit, because every deterministic decision is reproducible and every probabilistic one is bounded by a check you wrote.
How OpenNash Can Help
Most of the AI workflows we are brought in to fix do not have a model problem. They have a placement problem: deterministic logic buried in prompts, no validation at the seam, and no fallback when the model returns something unexpected. The fix is rarely a better model. It is moving the math, the thresholds, and the routing back into code, and putting a schema and a validator between the LLM and everything it touches.
That is the work we do. We map your process, mark each step as deterministic or genuinely ambiguous, design the guardrails and human approvals before anything ships, and hand you a system you fully own with the audit trail intact. If a platform fits your use case better than a custom build, we will tell you that too. The credible position is not that custom always wins. It is that the deterministic parts of your business should never be left to a dice roll.
If you are staring at a workflow that demos well and behaves strangely in production, book a call and we will map this decision to your actual pipeline.