A mid-market PE firm ran a due diligence agent that read a target company's data room and flagged a "material revenue concentration risk." The deal team spent four days trying to figure out where that conclusion came from. The customer contracts did not support it. The agent's log said one thing: risk_level: high. That was the entire record. No retrieved documents, no tool calls, no reasoning. The team was debugging a black box with a one-word output, and their only investigative tool was to run the whole thing again and pray it failed identically.

It did not. The agent returned risk_level: medium on the rerun, and the deal team trusted that even less.

This is the default state of most AI agents in production: they emit verdicts, not records. When the verdict is right, nobody asks questions. When it is wrong, there is nothing to inspect. The fix is not a better model or a smarter prompt. It is capturing the right things while the agent runs, so that every failure becomes a fixable failure instead of an unreproducible ghost.

The verdict tells you nothing

Picture a flight data recorder that saved only one field: whether the plane landed. After a crash, investigators would have nothing. No altitude, no control inputs, no engine readings. They would be reduced to guessing. That is what a final-answer log is for an AI agent.

The instinct to log only the output comes from traditional software, where a function that returns the wrong value can be traced through deterministic code with a debugger. AI agents break that model in two ways. First, the path from input to output runs through a probabilistic model whose internal reasoning is not in your code. Second, agents take actions - they call tools, retrieve documents, and chain decisions - and any of those steps can go wrong while the final output still looks plausible.

The dangerous case is not the agent that errors out. An error at least announces itself. The dangerous case is the agent that succeeds confidently with the wrong answer. A diligence agent that quotes a revenue figure it invented, a support agent that cites a refund policy that does not exist, a deal-sourcing agent that scores a company on data it never actually retrieved. None of these throw an exception. They all pass a final-answer log without complaint.

Google's site reliability engineers learned this lesson on distributed systems a decade ago: you instrument the path, not just the endpoint. Their guidance on monitoring distributed systems treats latency, traffic, errors, and saturation as signals you collect continuously, not facts you reconstruct after an outage. Agents are distributed systems where one of the nodes is a language model. The same discipline applies, and we have written before about how SRE patterns map onto AI agent reliability.

What a complete trace captures

A useful trace records the full chain of custody for a decision. Here is the set of fields worth capturing on every run, and what each one buys you when something breaks.

| Field | What to capture | Why it matters when debugging |
| --- | --- | --- |
| Model input | The fully assembled prompt: system message, user request, and any injected context | Catches bad prompt assembly and context that was supposed to be there but was not |
| Model output | The raw response, including reasoning or scratchpad if the model exposes it | Shows whether the model reasoned correctly from what it was given |
| Retrieval results | Which documents were fetched, their relevance scores, and what was actually passed to the model | Separates "retrieval found nothing useful" from "the model ignored good context" |
| Tool calls | Tool name, exact arguments, raw result, errors, and retries | The most common point of silent failure (see below) |
| Decision path | The sequence of steps and which branches the agent took | Reconstructs the logic, including loops and early exits |
| Latency | Time per step, not just total | Pinpoints which tool or model call is slow |
| Cost | Tokens and dollars per step | Surfaces runaway loops and expensive retrieval before the bill does |
| Confidence | Logprobs, self-reported certainty, or guardrail triggers | Flags low-confidence runs for review before they reach a human |
| Reviewer decision | Whether a human accepted, edited, or rejected the output, and the edit itself | Your richest source of real evaluation data |
| Business outcome | What happened downstream: was the flagged risk real, did the deal proceed | Closes the loop between agent behavior and actual value |
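As a rough illustration, the fields above could be gathered into one record per run. This is a minimal sketch rather than a prescribed schema: the class names (`Trace`, `Step`, `ToolCall`), the field names, and the `get_revenue` tool are all hypothetical, chosen only to mirror the fields listed.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]          # exactly what the model sent
    raw_result: Any = None
    error: Optional[str] = None
    retries: int = 0


@dataclass
class Step:
    kind: str                          # "model", "retrieval", or "tool"
    model_input: Optional[str] = None  # fully assembled prompt
    model_output: Optional[str] = None
    retrieved_docs: list[dict[str, Any]] = field(default_factory=list)
    tool_call: Optional[ToolCall] = None
    latency_ms: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)  # order is the decision path
    confidence: Optional[float] = None
    reviewer_decision: Optional[str] = None  # "accepted", "edited", "rejected"
    reviewer_edit: Optional[str] = None
    business_outcome: Optional[str] = None   # filled in later, often weeks later

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str)


# Hypothetical run: one tool step that failed with a 404.
trace = Trace(run_id="run-001")
trace.steps.append(
    Step(
        kind="tool",
        tool_call=ToolCall(
            name="get_revenue",
            arguments={"company_id": "ACME??"},
            error="404 Not Found",
        ),
        latency_ms=120.0,
    )
)
record = json.loads(trace.to_json())
```

Keeping the reviewer decision and business outcome as optional fields on the same record means they can be filled in later without a separate join table, although a real deployment may prefer to store them separately.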

Two of these get overlooked constantly, and both are worth dwelling on.

The reviewer's decision is evaluation data you are already generating and almost certainly throwing away. When an analyst edits the agent's diligence summary before it goes into the investment committee deck, that edit is a labeled correction. Capture the before and after, and within a few weeks you have a dataset of exactly how the agent is wrong, written by domain experts, for free. That dataset is what turns vague complaints ("the agent is unreliable") into specific, testable failure modes.
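One way to capture the before and after is to store both texts with a line diff at the moment the analyst saves their changes. The sketch below assumes a hypothetical `record_review` hook in the review tool and uses only the standard library's `difflib`; the record shape and the example sentences are illustrative.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewRecord:
    run_id: str
    decision: str        # "accepted", "edited", or "rejected"
    before: str
    after: Optional[str]
    diff: list[str]


def record_review(run_id: str, agent_output: str,
                  final_text: Optional[str]) -> ReviewRecord:
    """Turn a reviewer action into a labeled correction.

    final_text is None when the reviewer discarded the output entirely.
    """
    if final_text is None:
        return ReviewRecord(run_id, "rejected", agent_output, None, [])
    if final_text == agent_output:
        return ReviewRecord(run_id, "accepted", agent_output, final_text, [])
    diff = list(
        difflib.unified_diff(
            agent_output.splitlines(),
            final_text.splitlines(),
            fromfile="agent",
            tofile="reviewer",
            lineterm="",
        )
    )
    return ReviewRecord(run_id, "edited", agent_output, final_text, diff)


# Made-up example: the analyst corrects a transposed percentage.
review = record_review(
    "run-001",
    "Top customer is 62% of revenue.",
    "Top customer is 26% of revenue.",
)
```

Storing the full before and after alongside the diff keeps the record useful even if the diffing approach changes later.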

The business outcome is the field nobody connects back. An agent flags a risk; six weeks later the diligence team confirms it was real or imagined. If that resolution never gets stitched back to the original trace, you can measure the agent's confidence but never its accuracy. The connection is administratively annoying and analytically priceless.
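A small sketch of that stitching step, assuming each trace carries a `run_id`, a `flagged_risk` boolean, and a `confidence` score, and that resolved outcomes arrive later as a mapping from `run_id` to whether the risk turned out to be real. All of these names and the sample values are hypothetical.

```python
from statistics import mean
from typing import Optional


def flag_precision(traces: list[dict], outcomes: dict[str, bool]) -> Optional[dict]:
    """Compare what the agent claimed against what actually happened.

    Only flagged runs whose outcome has been resolved are counted.
    """
    resolved = [
        t for t in traces
        if t["flagged_risk"] and t["run_id"] in outcomes
    ]
    if not resolved:
        return None
    hits = sum(1 for t in resolved if outcomes[t["run_id"]])
    return {
        "resolved_flags": len(resolved),
        "precision": hits / len(resolved),
        "mean_confidence": mean(t["confidence"] for t in resolved),
    }


# Made-up data: three flags, two of them resolved so far.
traces = [
    {"run_id": "a", "flagged_risk": True, "confidence": 0.9},
    {"run_id": "b", "flagged_risk": True, "confidence": 0.8},
    {"run_id": "c", "flagged_risk": True, "confidence": 0.7},
    {"run_id": "d", "flagged_risk": False, "confidence": 0.6},
]
outcomes = {"a": True, "b": False}
summary = flag_precision(traces, outcomes)
```

A large gap between `mean_confidence` and `precision` is the signal the paragraph describes: an agent that is sure of itself more often than it is right.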

Tool calls are where agents actually break

If you can only afford to instrument one thing well, instrument tool calls. This is where the gap between "the model said something" and "the system did something" lives, and it is the largest single source of silent, confident wrongness.

Return to the diligence agent. Suppose it has a tool that pulls financial data from a vendor API. The model decides to look up the target's revenue, calls the tool with a company identifier, and the tool returns a 404 because the identifier was malformed. A well-behaved agent surfaces the error. A poorly designed one - which is most of them by default - sees the failed call, shrugs, and generates a revenue figure from its pretraining. The final output is a clean, confident number. The final-answer log shows nothing wrong. The trace, if you captured the tool call, shows a 404 followed by a fabricated answer.

To catch this, log four things for every tool invocation:

  • The arguments the model passed. Not the arguments you expected - the ones it actually sent. Wrong tickers, malformed dates, and truncated queries all show up here.
  • The raw result. Including empty results and partial results, which read very differently from a clean success.
  • Errors and retries. A tool that fails and gets retried three times is a latency and reliability problem you will never see from the output.
  • Whether the result was used. Did the next model step reference the tool's output, or did the agent route around a failure? This is the silent-fallback detector.
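The four items above can be sketched as a thin wrapper around each tool. This is an illustrative sketch, not a reference implementation: `call_tool_logged`, `result_was_used`, and the `get_revenue` tool are hypothetical names, and the "was it used" check is a deliberately crude substring heuristic.

```python
import time
from typing import Any, Callable


def call_tool_logged(name: str, fn: Callable[..., Any], arguments: dict,
                     max_retries: int = 2) -> dict:
    """Call a tool and record arguments, raw result, errors, and retries."""
    record = {
        "tool": name,
        "arguments": dict(arguments),  # what the model actually sent
        "raw_result": None,
        "errors": [],
        "retries": 0,
        "succeeded": False,
    }
    start = time.perf_counter()
    for attempt in range(max_retries + 1):
        record["retries"] = attempt
        try:
            record["raw_result"] = fn(**arguments)
            record["succeeded"] = True
            break
        except Exception as exc:  # keep every failure, not just the last one
            record["errors"].append(f"{type(exc).__name__}: {exc}")
    record["latency_ms"] = (time.perf_counter() - start) * 1000
    return record


def result_was_used(record: dict, next_model_output: str) -> bool:
    """Crude check: does the next model step quote any value the tool returned?"""
    result = record["raw_result"]
    if not record["succeeded"] or result in (None, "", [], {}):
        return False
    values = result.values() if isinstance(result, dict) else [result]
    return any(str(value) in next_model_output for value in values)


# Hypothetical tool that fails the way the diligence example describes.
def broken_lookup(company_id: str) -> dict:
    raise LookupError(f"404 Not Found for {company_id!r}")


failed = call_tool_logged("get_revenue", broken_lookup, {"company_id": "ACME??"})
used = result_was_used(failed, "Revenue was $48.2M.")
```

Here `failed["succeeded"]` is false and `used` is false, yet the next model step still produced a revenue figure. That combination is the silent-fallback pattern worth alerting on.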

Latitude's framework for detecting agent failure modes treats tool-level instrumentation as the foundation of observability-driven diagnosis, and the reasoning is simple: tools are the only place an agent touches the real world, so tool telemetry is where most production failures first become visible. The emerging standard for capturing this in a portable way is OpenTelemetry's tracing model, which represents each step as a span with timing and attributes. Treating each tool call and model call as a span gives you a trace structure that existing observability tooling already understands, rather than a bespoke logging format you maintain forever.
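To make the span idea concrete, here is a standard-library sketch of the shape: each step is a named unit with timing, attributes, a status, and a link to its parent. It is not the OpenTelemetry API. `SpanRecorder` and the span and attribute names are hypothetical, and a real deployment would more likely use an OpenTelemetry SDK than hand-roll this.

```python
import time
import uuid
from contextlib import contextmanager


class SpanRecorder:
    """Toy recorder that mimics the span model: nested, timed, attributed steps."""

    def __init__(self):
        self.finished = []  # spans in the order they ended
        self._stack = []    # currently open spans, innermost last

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "name": name,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "attributes": dict(attributes),
            "status": "ok",
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - record["start"]) * 1000
            self._stack.pop()
            self.finished.append(record)


recorder = SpanRecorder()
with recorder.span("agent.run", intent="revenue_concentration"):
    with recorder.span("tool.get_revenue", company_id="ACME??") as tool_span:
        tool_span["attributes"]["http_status"] = 404
        tool_span["status"] = "error"
    with recorder.span("model.generate"):
        pass  # the model call would go here
```

Because every child span records its parent, the run can be reassembled as a tree afterwards, which is what lets existing trace viewers show the failing tool call inside the run that contained it.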

One trace fixes a bug, a thousand traces fix the system

Capturing rich traces solves the single-incident problem. Someone reports a bad output, you pull the trace, you see the malformed tool argument, you fix it. Useful, but reactive. The harder question is the one that actually moves your reliability numbers: what is failing that nobody has reported yet?

You cannot answer that by reading traces. A production agent generates thousands of runs a week, and skimming them is both unscalable and unreliable - the failures hide in the volume. The move that works is clustering by behavior.

Group runs by something meaningful: the user's intent, the sequence of tools called, the type of output produced, or the sentiment of the request. Then look across clusters for the ones that smell wrong. A cluster with a 40% reviewer-rejection rate. A cluster where confidence is consistently low. A cluster where the same tool errors repeatedly. Those clusters are systematic failures, and a single fix to a cluster is worth a hundred individual bug patches.
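A minimal sketch of that grouping, assuming each run has already been tagged with an `intent`, a `tool_sequence`, an optional `reviewer_decision`, and a `tool_error` flag. These field names and the sample runs are hypothetical, and real clustering might use embeddings rather than exact keys.

```python
from collections import defaultdict


def cluster_stats(runs: list[dict]) -> list[dict]:
    """Group runs by (intent, tool sequence) and rank clusters by rejection rate."""
    clusters = defaultdict(list)
    for run in runs:
        key = (run["intent"], tuple(run["tool_sequence"]))
        clusters[key].append(run)

    stats = []
    for key, members in clusters.items():
        reviewed = [r for r in members if r.get("reviewer_decision") is not None]
        rejected = sum(1 for r in reviewed if r["reviewer_decision"] == "rejected")
        errored = sum(1 for r in members if r.get("tool_error"))
        stats.append({
            "cluster": key,
            "runs": len(members),
            # None when nobody has reviewed any run in the cluster yet
            "rejection_rate": rejected / len(reviewed) if reviewed else None,
            "tool_error_rate": errored / len(members),
        })
    # Worst-reviewed clusters first; unreviewed clusters sort last.
    return sorted(stats, key=lambda s: s["rejection_rate"] or 0.0, reverse=True)


def _run(intent, tools, decision, tool_error=False):
    return {"intent": intent, "tool_sequence": tools,
            "reviewer_decision": decision, "tool_error": tool_error}


# Made-up traffic: one cluster is rejected 2 times out of 5.
runs = (
    [_run("revenue_concentration", ["search_contracts", "get_revenue"], "rejected", True)] * 2
    + [_run("revenue_concentration", ["search_contracts", "get_revenue"], "accepted")] * 3
    + [_run("contract_summary", ["search_contracts"], "accepted")] * 2
)
stats = cluster_stats(runs)
worst = stats[0]
```

In this made-up data the worst cluster is the revenue-concentration path, and its tool error rate matches its rejection rate, which points at the tool rather than the model.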

Braintrust's analysis of discovering hidden failure patterns in production traffic makes the case that behavior clustering finds problems that no test suite anticipated, precisely because the failures are emergent properties of real traffic rather than bugs you could have predicted. This is also why generic monitoring dashboards fall short for agents. Monitoring answers questions you knew to ask in advance. Agent failures are usually novel, so you need to ask new questions of your data after the fact - which is the definition of observability that Honeycomb's team has been arguing for in conventional software for years.

The practical implication: design your trace schema so that clustering is cheap. Tag runs with intent, tool sequence, and outcome at capture time. Retrofitting those tags onto a million untagged traces is a project; capturing them on the way in is a field.

What this looks like for a mid-market PE deal team

Private equity is a useful stress test for these ideas because the cost of a confident wrong answer is measured in deal economics, not support tickets. A mid-market firm running diligence at speed has every incentive to deploy agents - reading data rooms, summarizing contracts, scoring management interviews - and every incentive to distrust them, because a fabricated revenue concentration risk can kill a good deal and a missed one can sink a bad investment. Industry guidance on AI in PE due diligence consistently lands on the same conclusion: the value is real, but only if the output is auditable.

Auditability is what traces provide. When the investment committee asks why the agent flagged a risk, the answer is not "the model thought so." It is a trace showing the three contracts it retrieved, the concentration ratio it calculated, and the threshold it applied. If the conclusion is wrong, the trace shows whether it was bad retrieval (it missed two contracts), a bad tool call (the revenue API returned stale figures), or bad reasoning (it had the right data and drew the wrong inference). Each of those has a different fix, and you cannot tell them apart from the verdict alone.
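Once a reviewer has confirmed that a conclusion was wrong, the three cases can be told apart mechanically from the trace. The sketch below is one possible triage order, not an established method. It assumes the reviewer supplies the document IDs that should have been retrieved, and that tool records carry hypothetical `error` and `stale` flags.

```python
def triage(trace: dict, expected_doc_ids: list[str]) -> tuple[str, list[str]]:
    """Classify a known-wrong run as bad retrieval, bad tool call, or bad reasoning.

    Checks run in pipeline order, so the earliest broken stage is reported.
    """
    retrieved = {doc["id"] for doc in trace["retrieved_docs"]}
    missing = sorted(set(expected_doc_ids) - retrieved)
    if missing:
        return ("bad_retrieval", missing)

    failed_tools = [
        call["tool"] for call in trace["tool_calls"]
        if call.get("error") or call.get("stale")
    ]
    if failed_tools:
        return ("bad_tool_call", failed_tools)

    # Inputs were complete and tools behaved: the inference itself was wrong.
    return ("bad_reasoning", [])


# Made-up trace: the agent retrieved one of the three relevant contracts.
trace = {
    "retrieved_docs": [{"id": "contract-1"}],
    "tool_calls": [{"tool": "get_revenue", "error": None, "stale": False}],
}
verdict = triage(trace, ["contract-1", "contract-2", "contract-3"])
```

Checking retrieval first reflects the idea that a later stage cannot be judged fairly until the stage feeding it is known to be sound.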

The other two fields PE deployments tend to undervalue are latency and cost, both of which belong in the trace. Diligence runs against deadlines, so an agent that takes nine minutes because one tool is silently retrying is a process problem you need step-level latency to see. And at the volume of a busy deal pipeline, an agent that loops twice as often as expected doubles a bill that nobody is watching until quarter-end. Per-step cost in the trace turns that into a number you can manage. How a team operationalizes this once the agent is live - who reads the clusters, who acts on the reviewer data - is the operating model that has to exist after launch, not something you bolt on during the first incident.
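A sketch of what step-level latency and cost make visible, assuming each step in a trace records a `name`, `latency_ms`, and `cost_usd`. The field names, the `expected_calls` threshold, and the sample numbers are illustrative only.

```python
from collections import Counter, defaultdict


def summarize_steps(steps: list[dict], expected_calls: int = 1) -> dict:
    """Aggregate latency and cost per step name and flag steps that repeat."""
    latency = defaultdict(float)
    cost = defaultdict(float)
    calls = Counter()
    for step in steps:
        latency[step["name"]] += step["latency_ms"]
        cost[step["name"]] += step["cost_usd"]
        calls[step["name"]] += 1

    return {
        "total_latency_ms": sum(latency.values()),
        "total_cost_usd": sum(cost.values()),
        "slowest_step": max(latency, key=latency.get) if latency else None,
        # Steps that ran more often than expected: retries or loops.
        "repeated_steps": {n: c for n, c in calls.items() if c > expected_calls},
    }


# Made-up run: one tool silently retried three times, about 170 s each.
steps = [
    {"name": "model.plan", "latency_ms": 2_000, "cost_usd": 0.02},
    {"name": "tool.get_revenue", "latency_ms": 170_000, "cost_usd": 0.0},
    {"name": "tool.get_revenue", "latency_ms": 170_000, "cost_usd": 0.0},
    {"name": "tool.get_revenue", "latency_ms": 170_000, "cost_usd": 0.0},
    {"name": "model.generate", "latency_ms": 3_000, "cost_usd": 0.05},
]
summary = summarize_steps(steps)
```

The total alone would only say the run took roughly nine minutes. The per-step view says which tool consumed that time and that it ran three times.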

How OpenNash Can Help

Observability is one of those things that is cheap to build into an agent on day one and painful to retrofit on day ninety. By the time you need the traces, the runs that would have explained your failures are already gone.

OpenNash builds production AI agents with the trace schema defined during design, not after the first incident. In an audit, we map where your workflow's decisions actually happen and which of them carry real downstream cost - the revenue figures, the risk flags, the routing choices that change what humans do next. In the build, we instrument those decision points with full traces: model input and output, retrieval results, tool calls with arguments and errors, latency, cost, and the reviewer's decision. The result ships to you with the observability owned by your team, integrated into your CI/CD, and documented so your analysts can read a cluster without calling us.

If you are deploying agents into work where a confident wrong answer is expensive - diligence, finance ops, contract review - the trace design is not optional plumbing. It is the difference between an agent you can defend to an investment committee and one you have to apologize for. Book a call to map this to your workflow and we will start with the decisions worth instrumenting.

A model that gives the right answer for the wrong reason will eventually give the wrong answer for the same reason. The trace is how you find out which one you have before it costs you a deal.