OpenAI published numbers on a tax agent built with Thrive that are easy to misread as science fiction. Seven thousand returns processed. Up to 97% draft accuracy. Roughly a third less time per return. Field completion above the firm's 75% threshold climbed from about a quarter of returns at launch to 86% in six weeks. Read quickly, that sounds like an AI that taught itself to do taxes. Read the actual writeup and you find something less mystical and far more useful: a feedback loop that humans designed, measured, and kept on a short leash.

That distinction matters because "self-improving AI agents" is becoming a buzzword detached from how the working systems are built. The teams getting real gains are not running recursive self-modification. They are running an eval-backed loop where expert corrections feed structured changes, and every change has to pass a regression gate before it ships. Andrej Karpathy's autoresearch project is the same idea shrunk to a single GPU. If you understand both, you can build the pattern around your own workflows and skip the magical thinking.

What the tax agent actually did

The OpenAI and Thrive writeup is worth reading slowly because the mechanism is the whole story. A tax return has hundreds of fields. An LLM drafting those fields will be wrong in specific, repeatable ways: misreading a particular form, fumbling an edge case in dependent rules, guessing on a field it should have flagged. Left alone, that agent does not get better. It makes the same mistakes on return 7,000 that it made on return 1.

What changed the trajectory was the loop around the model. Tax preparers corrected drafts as part of their normal work. Those corrections were captured as structured traces, not just thrown away. The team turned recurring failure modes into tailored evals, then scoped Codex to engineering tasks aimed at those specific failures. A failure on, say, a particular K-1 schedule became a test case, then a code or prompt change, then a verified fix that did not break the cases that already worked.
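To make "structured traces" concrete, here is a minimal sketch of what capturing and grouping corrections could look like. The `CorrectionTrace` schema, the `recurring_failures` helper, and the form and field values are all hypothetical illustrations, not the schema from the OpenAI and Thrive writeup.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionTrace:
    """One expert correction to one drafted field (hypothetical schema)."""
    return_id: str
    form: str
    field: str
    drafted: str
    corrected: str


def recurring_failures(traces, min_count=2):
    """Count real corrections per (form, field) and keep the ones that recur."""
    counts = Counter(
        (t.form, t.field) for t in traces if t.drafted != t.corrected
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Invented placeholder data, only to show the grouping.
traces = [
    CorrectionTrace("r1", "K-1", "box_20_code", "A", "Z"),
    CorrectionTrace("r2", "K-1", "box_20_code", "A", "Z"),
    CorrectionTrace("r3", "W-2", "box_12", "D", "DD"),
    CorrectionTrace("r4", "K-1", "box_20_code", "B", "Z"),
]

print(recurring_failures(traces))
# [(('K-1', 'box_20_code'), 3)]
```

Each entry that survives the `min_count` filter is a candidate for a tailored eval; the one-off correction stays in the store but does not drive a change yet.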

The numbers fall out of that discipline. Field completion above the 75% threshold went from roughly one quarter of returns at launch to 86% in six weeks because the loop kept converting human judgment into measurable targets. Throughput rose around 50% and accuracy reached up to 97% on drafts because the system stopped repeating known mistakes. None of that requires the agent to be self-aware. It requires the agent's mistakes to be observable and the fixes to be testable.
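A headline metric like this is simple to compute once corrections are captured. The sketch below assumes "field completion" means the fraction of fields in the preparer's final return that the draft already had right; the writeup may define it differently, and the function names are hypothetical.

```python
def field_completion(draft: dict, final: dict) -> float:
    """Fraction of final-return fields the draft matched (assumed definition)."""
    if not final:
        return 0.0
    correct = sum(1 for name, value in final.items() if draft.get(name) == value)
    return correct / len(final)


def share_above_threshold(pairs, threshold=0.75):
    """Share of returns whose field completion is strictly above the threshold."""
    scores = [field_completion(draft, final) for draft, final in pairs]
    if not scores:
        return 0.0
    return sum(score > threshold for score in scores) / len(scores)


final = {"a": 1, "b": 2, "c": 3, "d": 4}
pairs = [
    ({"a": 1, "b": 2, "c": 3, "d": 4}, final),  # 4 of 4 fields correct
    ({"a": 1, "b": 2, "c": 3, "d": 0}, final),  # 3 of 4: exactly 0.75, not above
    ({"a": 1}, final),                          # 1 of 4
    ({"a": 1, "b": 2, "c": 3, "d": 4}, final),  # 4 of 4
]

print(share_above_threshold(pairs))
# 0.5
```

Tracking this one number per week is what turns "the agent seems better" into a curve a team can defend.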

The takeaway for builders: the improvement lived in the harness, not the model weights. The model was the engine. The loop was the gearbox.

Karpathy's loop is the same pattern, stripped down

If the tax agent is the enterprise version, Karpathy's autoresearch is the lab version with everything inessential removed. An AI agent is pointed at the training script for a small nanochat model. It proposes a change to train.py, runs a fixed five-minute experiment on a single GPU, and checks validation bits-per-byte. If the metric improves, the change stays. If it regresses, the change is reverted. Then the agent proposes the next idea and the cycle repeats.

The repo's README keeps the mechanism deliberately plain: three files matter, only train.py is edited by the agent, every training run gets the same five-minute wall-clock budget, and the result is judged by validation bits-per-byte. That constraint is the point. The experiment length is fixed, so cost is bounded. The metric is single and concrete, so there is no ambiguity about whether a change helped. The revert is automatic, so a bad idea cannot compound into ten bad ideas. The value is not that the agent is clever. It is that being wrong is cheap.
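The keep-or-revert cycle can be sketched in a few lines. This is not the autoresearch code: `run_experiment` is a toy stand-in for the fixed-budget training run, `propose` is a random stand-in for the agent's edit, and the `lr` and `width` settings are invented for illustration. The only part meant to mirror the pattern is the comparison against the best score so far.

```python
import random


def run_experiment(config: dict) -> float:
    """Toy stand-in for a fixed-budget run; returns a lower-is-better score."""
    return (config["lr"] - 0.3) ** 2 + (config["width"] - 8) ** 2 / 100


def propose(config: dict, rng: random.Random) -> dict:
    """Toy stand-in for the agent's edit: perturb one setting."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = max(0.0, candidate["lr"] + rng.uniform(-0.1, 0.1))
    else:
        candidate["width"] = max(1, candidate["width"] + rng.choice([-1, 1]))
    return candidate


def improve(config: dict, steps: int, seed: int = 0):
    """Keep a change only if the metric improves; otherwise drop it (revert)."""
    rng = random.Random(seed)
    best = run_experiment(config)
    history = [best]
    for _ in range(steps):
        candidate = propose(config, rng)
        score = run_experiment(candidate)
        if score < best:
            config, best = candidate, score  # keep the change
        # else: the candidate is discarded, which is the automatic revert
        history.append(best)
    return config, best, history


final_config, final_score, history = improve({"lr": 0.9, "width": 3}, steps=200)
print(final_config, final_score)
```

Because a rejected candidate never replaces `config`, the best score recorded in `history` can only stay flat or go down, no matter how poor the proposals are.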

Strip both systems down and you see the identical skeleton:

| Element | Tax agent | Autoresearch |
| --- | --- | --- |
| Signal source | Preparer corrections in production | Validation bits-per-byte |
| Unit of change | Scoped Codex task | Edit to train.py |
| Success gate | Tailored evals + regression tests | Fixed five-minute experiment |
| Bad-change handling | Human review before promote | Automatic revert |
| Cost control | Scoped repo, bounded tasks | Fixed experiment budget |

The domains could not be more different. The pattern is the same. That is the part worth internalizing.

The anatomy of a real improvement loop

Once you see the skeleton, you can name the parts. A self-improvement loop that actually converges has six components, and skipping any one of them turns it into an agent that changes things rather than one that improves.

1. Structured traces. You cannot fix what you cannot see. The tax loop worked because corrections were captured in a form you could query and group, not buried in a chat log. If your agent's failures are not landing in a structured store, start there before anything else.

2. Expert feedback. The loop needs a source of truth about what "better" means, and for most real workflows that source is a human expert, not a generic metric. This is where the popular framing breaks down. Generic similarity scores rarely capture domain correctness, a point Hamel Husain makes at length in his evals FAQ. Tax preparers defined correctness. The loop encoded it.

3. An eval target. Vague goals produce vague agents. "Get better" is not a target. "Raise field completion above 75% on this return type" is. The eval is the contract between the human's judgment and the agent's behavior.

4. A scoped change surface. The agent should be allowed to change a bounded thing: a specific repo, a defined toolset, a prompt template. Codex was scoped to engineering tasks. Autoresearch was scoped to one file. Scope is what keeps a single bad idea from rewriting half your system.

5. A regression gate. Every proposed change runs against the cases that already pass. Karpathy's fixed experiment is a regression gate. The tax loop's tailored evals plus regression tests are a regression gate. No gate, no loop, just drift.

6. Rollback and human review. Promote on green, revert on red, and keep a human on anything consequential. A detailed walkthrough of the autoresearch design emphasizes that the automatic revert is not a convenience feature but the safety mechanism that lets the agent explore aggressively without accumulating damage.

A useful mental model: trace, eval, scoped edit, regression gate, review, promote or rollback. Run that cycle on a real signal and you have a self-improving system. Remove the eval and the gate and you have an expensive way to generate variance.
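The regression gate is the piece most often skipped, so here is a minimal sketch of one. It assumes an agent can be treated as a function from a case input to an output and that correctness is an exact match against an expected value; both are simplifications, and `Agent`, `passing_cases`, and `gate` are hypothetical names.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]  # hypothetical: maps a case input to an output


def passing_cases(agent: Agent, cases: Dict[str, str]) -> set:
    """Inputs for which the agent produces the expected output."""
    return {inp for inp, expected in cases.items() if agent(inp) == expected}


def gate(current: Agent, candidate: Agent,
         regression: Dict[str, str], target: Dict[str, str]) -> dict:
    """Green only if nothing that passed before breaks and a target case is fixed."""
    broken = passing_cases(current, regression) - passing_cases(candidate, regression)
    fixed = passing_cases(candidate, target) - passing_cases(current, target)
    return {
        "promote": not broken and bool(fixed),
        "broken": sorted(broken),
        "fixed": sorted(fixed),
    }


regression = {"a": "A", "b": "B"}  # cases that already pass
target = {"c": "C"}                # the failure mode this change is aimed at

current = lambda x: {"a": "A", "b": "B", "c": "?"}[x]
good = lambda x: {"a": "A", "b": "B", "c": "C"}[x]
bad = lambda x: {"a": "A", "b": "?", "c": "C"}[x]

print(gate(current, good, regression, target))
# {'promote': True, 'broken': [], 'fixed': ['c']}
print(gate(current, bad, regression, target))
# {'promote': False, 'broken': ['b'], 'fixed': ['c']}
```

The second candidate fixes the target case and still fails the gate, which is exactly the situation the gate exists to catch.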

What you should never put inside the loop

The seductive failure is to automate the wrong parts. Three things should stay under human control no matter how mature the loop gets.

Do not let the agent define its own eval target. The moment the system both proposes changes and decides what counts as success, it will optimize the metric it can move rather than the outcome you care about. This is Goodhart's law with a faster clock. The tax loop stayed honest because preparers, not the agent, decided what a correct return looked like.

Do not automate the rollback decision on high-stakes changes. Autoresearch can auto-revert because a bad training run costs five minutes and a few cents. A bad change to a production tax workflow or a customer-facing agent costs trust and possibly money. Bounded, cheap, reversible domains can automate the revert. Consequential ones keep a human on the promote.
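One way to keep that split explicit is to write the routing rule down as code rather than leave it to habit. The sketch below is an assumed policy, not one taken from either system: the `Change` fields and the `decide` function are hypothetical, and where the line between cheap and high-stakes falls is a call for the team's experts.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_PROMOTE = "auto_promote"
    AUTO_REVERT = "auto_revert"
    HUMAN_REVIEW = "human_review"


@dataclass
class Change:
    gate_passed: bool   # did the regression gate come back green?
    reversible: bool    # can the change be undone cheaply?
    high_stakes: bool   # does a bad outcome cost trust or money?


def decide(change: Change) -> Decision:
    """Automate only cheap, reversible changes; route everything else to a person."""
    if change.high_stakes or not change.reversible:
        return Decision.HUMAN_REVIEW
    return Decision.AUTO_PROMOTE if change.gate_passed else Decision.AUTO_REVERT


# A five-minute training experiment that regressed: revert without asking.
print(decide(Change(gate_passed=False, reversible=True, high_stakes=False)))
# Decision.AUTO_REVERT

# A green change to a production workflow: still goes to a human.
print(decide(Change(gate_passed=True, reversible=True, high_stakes=True)))
# Decision.HUMAN_REVIEW
```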

Do not remove human review of what gets promoted. Anthropic's guide to building effective agents makes the broader point that the simplest controllable system usually beats the most autonomous one. A self-improvement loop is not an exception. It is one of the strongest arguments for human-in-the-loop design, because the loop is specifically built to keep changing.

The pattern that fails in production is the one that sounds best in a pitch deck: an agent that watches itself, grades itself, fixes itself, and ships itself. Every working system trims that fantasy down to a loop where humans own correctness and the machine owns iteration speed.

When self-improvement actually pays off

This loop is not free, and it is not always worth building. It pays off when three conditions hold.

First, the task is high-volume and repetitive enough that the same failure modes recur. Seven thousand tax returns surface the same K-1 edge case many times. A workflow you run twice a month will not generate enough signal to converge. If your failures are one-off and never repeat, you do not have a loop problem; you have a one-time-fix problem.

Second, correctness is checkable. Either a human expert can label outputs cleanly, or there is a hard metric like validation loss. Domains where "good" is genuinely subjective and contested are poor candidates, because the eval target will never stabilize.

Third, changes are scopeable and reversible. You need a surface the agent can safely modify and a way to undo a bad change without a fire drill. If every change requires a three-week deploy and a migration, the loop's iteration speed is gone and the economics collapse.

When those hold, the returns compound in a way that one-shot prompting never matches. Chip Huyen has written that the journey from 0 to 60 is easy and 60 to 100 is brutal. An eval-backed improvement loop is the only practical machinery I have seen for grinding out that last forty points, because it turns the long tail of edge cases into a queue you can work down instead of a wall you keep hitting.

How OpenNash Can Help

Most teams already have the raw material for this loop and do not know it. The corrections your staff make to AI drafts, the tickets your support agent gets wrong, the fields a human always has to fix: those are traces waiting to be captured.

OpenNash installs the loop around your real workflow. In the audit, we map where humans are already correcting the agent and where those corrections currently evaporate. In design, we define the eval targets and the regression gates with your domain experts, and we decide explicitly which changes can auto-promote and which stay human-reviewed. In the build, we scope the change surface so the agent can iterate without touching anything it should not, and we wire rollback in from day one. On deploy, you own the whole system: traces, evals, gates, and documentation.

The honest version of this advice includes when not to call us. If your task runs rarely, or correctness is genuinely contested, or you cannot scope a safe change surface, a self-improvement loop is the wrong tool and we will say so. If you already have strong internal ML engineering and a working eval culture, you may not need a partner at all. The case for custom work is when you have the volume and the signal but no machinery to turn one into the other.

If you have an agent that plateaued and a team that keeps fixing the same mistakes by hand, that is exactly the gap this pattern closes. Book a call to map an eval-backed improvement loop to your workflow.

The next time someone sells you a self-improving agent, ask one question: where is the eval, and who owns it? If the answer is "the agent," walk away. If the answer is a human expert with a regression gate behind them, you are looking at the real thing.