A customer service agent that worked perfectly in October starts refusing routine refund requests in February. The traces look fine. Each individual run passes your eval set. But refund completion is down 8 percent and nobody can point to the failed test that should have caught it.

This is the failure mode of single-trace debugging. Agentic systems do not break the way deterministic software breaks. They drift. Tool selection shifts a few percent. Retries climb. A specific cohort starts hitting a refusal pattern that did not exist last quarter. By the time a single trace is bad enough to flag in review, the population has already moved.

Macro evals are the answer to this. Instead of grading one run against a known answer, they operate on the population of traces, join labels across thousands of runs, and surface the patterns that single-trace inspection cannot see. The CIO coverage of agent drift puts the problem clearly: failures rarely arrive as outages. They arrive as slow shifts that are invisible at the trace level and obvious at the cohort level.

What a macro eval actually is

A micro eval asks: did this trace produce the right answer? A macro eval asks: across the last 10,000 traces, which behaviors are clustering, which cohorts are degrading, and which failure modes are costing the most?

The OpenAI Cookbook example on macro evals walks through this on an EV order workflow. The system has specialist agents for pricing, compliance, supply, factory routing, scheduling, and release decisions. Each step emits structured events. Some of those events already carry labels from earlier review passes: tool-call correctness, policy adherence, schedule feasibility.

The macro eval pipeline does five things in order:

  1. Normalize raw traces into a consistent schema so they can be joined and queried.
  2. Build a compact trace document per run that summarizes inputs, decisions, tools, and outcomes.
  3. Join the existing step-level labels onto those documents.
  4. Discover recurring patterns across the population, not within a single run.
  5. Rank patterns by impact and route the top suspects into a human inspection queue.

The output is not a dashboard. It is a short, ranked list and a queue. That distinction matters because dashboards encourage browsing. A queue forces decisions.
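As a rough illustration of step 1, here is a minimal normalization sketch. The raw field names (`traceId`, `index`, `type`, `tool`) and the `Event` schema are assumptions made up for this example, not the format of any particular runtime; the point is only that every source gets mapped onto one shape before anything is joined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical shared schema for one step-level event."""
    trace_id: str
    step: int
    kind: str    # e.g. "tool_call" or "model_output"
    name: str    # tool or agent name
    status: str  # "ok" or "error"


def normalize(raw: dict) -> Event:
    """Map one raw event onto the shared schema.

    Accepts two made-up spellings of each field to mimic traces
    arriving from different runtimes or SDK versions.
    """
    return Event(
        trace_id=str(raw.get("trace_id") or raw.get("traceId")),
        step=int(raw.get("step", raw.get("index", 0))),
        kind=raw.get("kind") or raw.get("type", "unknown"),
        name=raw.get("name") or raw.get("tool", "unknown"),
        status="error" if raw.get("error") else "ok",
    )


raw_events = [
    {"traceId": "t1", "index": 0, "type": "tool_call", "tool": "pricing_lookup"},
    {"trace_id": "t1", "step": 1, "kind": "tool_call",
     "name": "compliance_check", "error": "timeout"},
]
events = [normalize(r) for r in raw_events]
for e in events:
    print(e)
```

Once every event has the same fields, the later steps reduce to ordinary grouping and joining on `trace_id`.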

The shape of the data

Most teams already have most of this data. They just have not joined it.

A typical production stack will have:

  • Raw traces from the agent runtime. The OpenAI Agents SDK tracing docs and LangSmith observability both expose step-level spans with tool calls, model inputs, and outputs.
  • Step-level evaluators that ran inline or as a batch job. These produce labels like tool_choice_ok, refusal_appropriate, policy_violation, schedule_feasible.
  • Business outcomes from the surrounding system. Did the order ship? Did the refund get issued? Did the ticket close without a human touch?

The macro eval pipeline joins these three sources on trace_id and produces what the OpenAI Cookbook calls a compact trace document. A compact document is not the full trace. It is a few hundred tokens summarizing what the agent decided, which tools it called, in what order, and what happened next. Compact documents are small enough to embed, cluster, and feed into a discovery model without exploding your costs.
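A minimal sketch of that join, assuming three in-memory dictionaries keyed by trace_id. The field names, the sample data, and the convention that a label value of True means "passed" are all assumptions for illustration; the Cookbook's actual document format may differ.

```python
def compact_document(trace_id, steps, labels, outcome):
    """Summarize one run: tool order, failed labels, business outcome."""
    tools = [s["tool"] for s in steps if s.get("tool")]
    # Assumption: each label maps to True when the check passed.
    failed = sorted(name for name, ok in labels.items() if not ok)
    return {
        "trace_id": trace_id,
        "tool_sequence": " > ".join(tools),
        "failed_labels": failed,
        "outcome": outcome,
    }


# Three hypothetical sources, all keyed by trace_id.
traces = {
    "t1": [{"tool": "lookup_order"}, {"tool": "issue_refund"}],
    "t2": [{"tool": "lookup_order"}, {"tool": None}],  # final step made no tool call
}
labels = {
    "t1": {"tool_choice_ok": True, "refusal_appropriate": True},
    "t2": {"tool_choice_ok": True, "refusal_appropriate": False},
}
outcomes = {"t1": "refund_issued", "t2": "escalated_to_human"}

docs = [
    compact_document(tid, steps, labels.get(tid, {}), outcomes.get(tid, "unknown"))
    for tid, steps in traces.items()
]
for doc in docs:
    print(doc)
```

Using `.get` with defaults keeps traces that have no labels or no recorded outcome in the dataset, which matters because unlabeled traces are exactly where new failure modes tend to hide.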

This compression step is where most internal projects stall. Teams try to run macro analysis on raw traces and hit context limits, cost walls, and signal-to-noise problems. The compact document pattern is the unlock.

From thousands of traces to a suspect leaderboard

Once you have compact documents joined to labels and outcomes, you can ask population-level questions. Three patterns do most of the work.

Pattern 1: Frequency by failure mode. Group traces by labeled failure type and count them. This is the boring one and it is still the most useful. A team running this for the first time on a six-month-old agent will typically find one or two failure modes that account for 60 to 80 percent of bad outcomes. Fix those and the rest of the long tail matters less.
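The count itself is a few lines of code. The sketch below uses made-up failure-mode names and adds a cumulative share column, which is what makes the "one or two modes dominate" effect visible.

```python
from collections import Counter

# Hypothetical failure-mode labels, one per bad trace.
failure_modes = [
    "wrong_tool", "false_refusal", "false_refusal", "stale_retrieval",
    "false_refusal", "wrong_tool", "timeout", "false_refusal",
]

counts = Counter(failure_modes)
total = sum(counts.values())

table = []
running = 0
for mode, n in counts.most_common():
    running += n
    # (mode, count, cumulative share of all bad traces)
    table.append((mode, n, running / total))

for mode, n, cum in table:
    print(f"{mode:16} {n:3d}  cumulative {cum:.0%}")
```

In this toy data the top two modes cover three quarters of the bad traces, so the remaining modes can wait.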

Pattern 2: Cluster discovery on unlabeled traces. Not every failure mode has a label yet. Embed the compact documents, cluster them, and inspect the clusters with high error rates or low downstream conversion. This is how new failure modes get discovered. The Microsoft Research Taxonomy of Failure Mode in Agentic AI Systems is a useful starting taxonomy, but every production system grows its own once you start clustering.
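To show the shape of this step without pulling in an embedding model, the sketch below swaps in a much cruder technique: Jaccard similarity over the tokens of each compact document, with greedy "leader" clustering. This is a stand-in, not the embedding-based approach described above, and the document texts, the `bad` flag, and the 0.5 threshold are assumptions for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # each cluster: {"leader": token set, "members": [doc, ...]}
    for doc in docs:
        tokens = set(doc["text"].split())
        for cluster in clusters:
            if jaccard(tokens, cluster["leader"]) >= threshold:
                cluster["members"].append(doc)
                break
        else:
            # No existing cluster matched, so this doc starts a new one.
            clusters.append({"leader": tokens, "members": [doc]})
    return clusters


def error_rate(cluster):
    members = cluster["members"]
    return sum(d["bad"] for d in members) / len(members)


# Hypothetical compact documents reduced to a tool/decision sequence.
docs = [
    {"text": "lookup_order issue_refund done", "bad": False},
    {"text": "lookup_order refuse policy_cite", "bad": True},
    {"text": "lookup_order issue_refund done", "bad": False},
    {"text": "lookup_order refuse policy_cite escalate", "bad": True},
]

clusters = leader_cluster(docs)
ranked = sorted(clusters, key=error_rate, reverse=True)
for c in ranked:
    print(len(c["members"]), "traces, error rate", error_rate(c))
```

The part that carries over to a real pipeline is the last step: whatever produces the clusters, they get ranked by error rate or downstream conversion, and a human inspects the worst ones first.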

Pattern 3: Release diff. Hold the prompt, tool set, or model version constant on one cohort. Roll the change to another. Compare failure-mode distributions, not aggregate pass rates. Aggregate pass rates hide trade-offs. A new release might raise overall pass rate by 2 percent while cutting refund-flow success by 15 percent.
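A small sketch of that comparison, using invented cohorts in which the candidate release has a higher aggregate pass rate but a worse refusal problem. Each cohort is a list with one entry per trace: `None` for a clean run, or a failure-mode name (the names here are hypothetical).

```python
from collections import Counter


def failure_rates(cohort):
    """Share of the cohort's traces that hit each failure mode."""
    n = len(cohort)
    counts = Counter(mode for mode in cohort if mode is not None)
    return {mode: c / n for mode, c in counts.items()}


def release_diff(baseline, candidate):
    """Per-mode change in failure rate, largest regression first."""
    base, cand = failure_rates(baseline), failure_rates(candidate)
    modes = set(base) | set(cand)
    deltas = {m: cand.get(m, 0.0) - base.get(m, 0.0) for m in modes}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# 100 traces per cohort; pass rate goes from 90% to 92%.
baseline = [None] * 90 + ["wrong_tool"] * 8 + ["false_refusal"] * 2
candidate = [None] * 92 + ["wrong_tool"] * 1 + ["false_refusal"] * 7

diff = release_diff(baseline, candidate)
for mode, delta in diff:
    print(f"{mode:14} {delta:+.2f}")
```

The aggregate number improved, yet false refusals more than tripled. Sorting by delta puts that regression at the top of the list instead of leaving it averaged away.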

The output of all three is the same shape: a ranked leaderboard of suspect behavior patterns, with counts, cohorts, and a rough estimate of business impact. Galileo's writeup on agent evaluation frameworks calls this the move from metrics to rubrics, and the framing is right. Single-number metrics tell you something is off. Rubric-based leaderboards tell you what to fix first.
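One simple way to build such a leaderboard is to multiply each pattern's count by an estimated cost per affected trace and sort on the product. The patterns, cohorts, and cost figures below are invented, and real impact estimates are usually rougher than a single multiplication suggests.

```python
from dataclasses import dataclass


@dataclass
class Suspect:
    pattern: str
    count: int
    cohort: str
    est_cost_per_trace: float  # hypothetical estimate, in dollars

    @property
    def impact(self) -> float:
        """Rough business impact: how often it happens times what it costs."""
        return self.count * self.est_cost_per_trace


suspects = [
    Suspect("false_refusal_on_refund", 120, "returns older than 30 days", 35.0),
    Suspect("retry_loop_on_lookup", 900, "all traffic", 0.4),
    Suspect("wrong_factory_route", 6, "EU orders", 1500.0),
]

leaderboard = sorted(suspects, key=lambda s: s.impact, reverse=True)
for rank, s in enumerate(leaderboard, start=1):
    print(f"{rank}. {s.pattern:26} n={s.count:4d}  impact=${s.impact:,.0f}  [{s.cohort}]")
```

Note that the most frequent pattern lands last here. Ranking by count alone would have sent reviewers to the cheapest problem first.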

Drilling into high-impact failures

A leaderboard is not the end of the work. It is the entry point to root-cause analysis.

The drill-down loop looks like this:

  1. Pick the top suspect pattern by impact.
  2. Sample 20 to 50 traces from that pattern.
  3. Render the full trace, not the compact document, for each sample.
  4. Look for the common cause: a brittle tool description, an ambiguous policy, a model misreading a specific phrasing.
  5. Write a hypothesis and a fix candidate.
  6. Re-run the population check after the fix lands.

This is straightforward, but it is rarely done in practice because step 3 is annoying. Full traces are long, often spanning thousands of tokens. Teams that get serious about this build internal viewers that render traces with collapsible tool calls and inline labels. The arXiv paper on automatically detecting failures in agentic traces goes further and uses a model to flag suspicious sub-spans, which makes the human review faster but does not replace it.
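Step 2 of the loop is worth making reproducible, so that two reviewers looking at the same pattern see the same traces. A minimal sketch, assuming trace IDs are plain strings and using a fixed seed (the function name and defaults are made up for this example):

```python
import random


def sample_for_review(trace_ids, k=30, seed=0):
    """Pick up to k trace IDs, deterministically for a given seed."""
    ids = sorted(trace_ids)    # fix the order so the seed fully determines the sample
    rng = random.Random(seed)  # local generator, does not touch global random state
    return rng.sample(ids, min(k, len(ids)))


pattern_traces = [f"trace-{i:04d}" for i in range(200)]
picked = sample_for_review(pattern_traces, k=30, seed=7)
print(picked[:5])
```

Sorting before sampling means the result does not depend on the order in which the trace IDs came back from storage.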

The most common discovery in this loop is that the failure is not in the model. It is in the tool surface. A retrieval tool that returns slightly stale results 4 percent of the time will produce a recognizable behavior pattern at the population level even though no single trace screams "broken." O'Reilly's piece on the hidden cost of agentic failure makes this point about the compounding nature of small tool-level errors across multi-step workflows. One bad lookup early in a chain has outsized downstream cost.

The human inspection queue

A macro eval pipeline that ends at "here is a chart" has not finished the job. The deliverable is a ranked queue of traces a human will actually open.

A good queue has three properties:

  • Bounded. No more than 50 items per reviewer per day. If your pipeline produces 500 suspect traces, your impact ranking is not selective enough.
  • Ordered by impact, not by recency. A failure that happened yesterday but cost $200 in refunds is less interesting than a failure that happened last week and cost $40,000 in lost contracts.
  • Stateful. Reviewers can mark traces as fixed, false positive, duplicate of pattern X, or new failure mode. That state feeds back into the next run.
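The three properties fit in a short sketch. The state names follow the list above; the `QueueItem` fields, the dollar figures, and the helper names are assumptions for illustration rather than a recommended design.

```python
from dataclasses import dataclass

DAILY_CAP = 50  # bounded: at most this many items per reviewer per day
OPEN = "open"
STATES = {OPEN, "fixed", "false_positive", "duplicate", "new_failure_mode"}


@dataclass
class QueueItem:
    trace_id: str
    impact_usd: float  # hypothetical impact estimate
    state: str = OPEN


def mark(item: QueueItem, state: str) -> None:
    """Stateful: record the reviewer's verdict so the next run can use it."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    item.state = state


def build_queue(items, cap=DAILY_CAP):
    """Ordered by impact, not recency, and cut off at the cap."""
    open_items = [i for i in items if i.state == OPEN]
    return sorted(open_items, key=lambda i: i.impact_usd, reverse=True)[:cap]


items = [
    QueueItem("t-yesterday", 200.0),
    QueueItem("t-last-week", 40_000.0),
    QueueItem("t-already-reviewed", 900.0),
]
mark(items[2], "false_positive")

queue = build_queue(items)
print([i.trace_id for i in queue])
```

Items that a reviewer has already closed drop out of the next build, which is the feedback path the third property describes.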

Tools like Promptfoo handle the eval-running half of this well. The queue half tends to be a custom-built piece because every team's review workflow is different. A spreadsheet works for the first six months. A small internal app works for the next two years. The trap is buying a heavy platform before the workflow is stable.

What this means for governance

Population-level evals are also the only credible answer to the agent identity and accountability questions raised in Strata's 2026 governance research. When auditors or regulators ask whether your agent behaves consistently, single-trace screenshots are not the answer. The answer is a population-level evaluation report showing failure-mode distributions, drift over time, and the inspection-queue close-out rate.

The same data structure serves both engineering and governance. The compact trace document, the joined labels, the suspect leaderboard, and the closed-out queue items become the audit trail. Teams that build this once for engineering tend to find they have already solved most of their compliance story.
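Two of the numbers in such a report, drift over time and the queue close-out rate, fall out of the same data with very little code. The sketch below uses invented monthly figures and assumes each record is a `(period, failed)` pair and each queue item carries a state string.

```python
from collections import defaultdict


def rate_by_period(records):
    """records: iterable of (period, failed) pairs -> {period: failure rate}."""
    totals, fails = defaultdict(int), defaultdict(int)
    for period, failed in records:
        totals[period] += 1
        fails[period] += int(failed)
    return {p: fails[p] / totals[p] for p in sorted(totals)}


def close_out_rate(states):
    """Share of inspection-queue items that are no longer open."""
    if not states:
        return 0.0
    return sum(1 for s in states if s != "open") / len(states)


# Made-up data: 100 traces per month, failure rate climbing from 5% to 13%.
records = (
    [("2025-10", False)] * 95 + [("2025-10", True)] * 5
    + [("2026-02", False)] * 87 + [("2026-02", True)] * 13
)
drift = rate_by_period(records)
closed = close_out_rate(["fixed", "open", "false_positive", "duplicate"])

print(drift)
print(f"close-out rate: {closed:.0%}")
```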

A practical sequencing for teams getting started:

Phase | Goal                                            | Output
------+-------------------------------------------------+-----------------------------------------
1     | Tracing is on for every agent run               | Structured traces with step-level events
2     | Step-level evaluators run on a sample           | Labeled spans for known failure modes
3     | Compact documents joined to labels and outcomes | Queryable trace dataset
4     | Suspect leaderboard runs per release            | Ranked failure-mode list
5     | Inspection queue with stateful review           | Closed-loop fix cadence

Most teams are at phase 1 or 2. The jump to 3 is the one that takes a week of work and pays back for years.

How OpenNash Can Help

If you are running an agent in production and your eval story is "we spot-check failed traces in Slack," that is a phase-2 setup masquerading as a phase-5 one. The gap is not exotic. It is a compact-document join, a suspect-ranking job, and a queue your reviewers will actually open.

OpenNash builds production agent systems with the trace, eval, and review layers wired up from the start. The work usually splits into an audit of the current trace surface, a design pass on the failure-mode taxonomy specific to your workflow, and a build of the macro eval pipeline against your real production traces. Clients keep full ownership of the pipeline, the labels, and the queue tooling.

If you want to map this pattern to your own agent workflow, book a call and we can walk through the trace data you already have and where the leaderboard would land.

Macro evals are not a tooling problem. They are a data-modeling problem with a tooling layer on top. The teams that get this right do not buy a platform first. They get their traces into a shape they can join, write a small pipeline that produces a real ranked queue, and only then decide whether they want a vendor to run it for them.