What AI Will Automate Next: Closed-Loop vs. Open-Loop Work

Everyone wants to know the same thing: what work will AI automate, what will stay valuable, and where should people invest their time?

The answer is not simply "coding goes away" or "creative work is safe." The better question is: how easy is the work to verify?

AI improves fastest when there is a tight feedback loop. If a model writes code, you can run the tests. If it solves a math problem, you can check the answer. If it uses a browser, you can verify whether the task was completed. If the answer is wrong, the system gets a clear correction signal.

That is closed-loop work. The loop from output to judgment is short, cheap, and repeatable.

Open-loop work is different. The answer is harder to prove. Was the investment memo good? Was the brand positioning right? Did the sales strategy match the buyer's internal politics? Did the clinical decision fit the patient, payer, provider, and hospital workflow? These are not impossible to evaluate, but correctness lives inside context, judgment, taste, proprietary data, or future outcomes.

That distinction helps explain where frontier model labs will win, where open models will commoditize the market, and where domain-specific AI agents can create durable value.

The Simple Matrix

Think of AI work across two axes:

Task maturity: Is the task saturated and common, or still frontier and difficult?
Verification: Is correctness public and easy to check, or private and expensive to establish?

That creates four quadrants:

| | Public or easy to verify | Private or expensive to verify |
| --- | --- | --- |
| Saturated tasks | Commodity AI | Internal workflow AI |
| Frontier tasks | Frontier lab territory | Domain-specific AI agents |

The most important insight: the value does not sit only in generating the answer. It sits in owning the verification loop.

Quadrant 1: Saturated + Public = Commodity AI

These are tasks where the output is common, the answer is easy to judge, and many models can already do the job well.

Examples:

Translating common languages
Summarizing articles, meeting notes, and PDFs
Drafting standard SEO blog posts
Writing simple sales emails
Classifying sentiment
Extracting fields from clean documents
Creating boilerplate CRUD code
Answering basic customer support FAQs
Converting data formats
Writing simple regex or SQL queries
Rephrasing copy for tone

This work still matters, but it will not support much pricing power by itself. When the task is common and the answer is easy to verify, open models and low-cost inference providers push the price toward zero.

The strategic move here is not to build a business around "we summarize documents." The move is to embed the task into a larger workflow where the summary triggers an action, updates a system of record, asks for approval, or feeds a proprietary decision.

Quadrant 2: Frontier + Public = Frontier Labs Win

These are hard problems with public answers or scalable test harnesses. The task may be difficult, but the feedback loop is strong.

Examples:

Competitive programming
Software engineering benchmarks like SWE-bench
Browser and desktop agent tasks with execution-based checks
Math problems with exact answers
Physics simulations with measurable outputs
Cybersecurity capture-the-flag tasks
Formal proof tasks
Spreadsheet tasks with cell-level tests
Scientific benchmarks with public scoring rules
Code migration tasks where tests and linters define success

This is why coding agents have improved so quickly. A compiler, test suite, linter, type checker, browser state, or benchmark score gives the model an objective signal. The model can try, fail, inspect the failure, edit, and try again.

Anthropic's study of how its own engineers use Claude shows this pattern clearly. Engineers reported using Claude most for debugging and code understanding, and they tended to delegate work that was easy to validate, low-stakes, self-contained, or repetitive. They were more cautious with high-level design, strategy, and taste-heavy decisions. Anthropic also found that Claude Code handled longer chains of tool calls over time, moving from roughly 10 consecutive actions before needing human input to around 20 over the period studied.

This quadrant is where frontier labs have a structural advantage. Public verification lets them train, evaluate, and improve models at scale. If a benchmark is public, repeatable, and economically important, the largest labs can pour compute, data, scaffolding, and reinforcement learning into it.

That does not mean every public benchmark maps cleanly to real work. METR's 2025 randomized study of experienced open-source developers found that participants took 19% longer to complete the studied tasks when they were allowed to use AI tools, even though they had expected a speedup. METR also warned that benchmarks can overestimate real-world usefulness because benchmark tasks are often well-scoped and algorithmically scorable. The lesson is not "AI cannot code." The lesson is narrower and more useful: AI performs best when the verification loop is clear, and performance gets harder to transfer when the real work contains tacit context, high quality bars, and implicit requirements.

Quadrant 3: Saturated + Private = Useful, But Usually Not Defensible

These are routine tasks inside a company's private data or internal systems. The data is proprietary, but the work itself is not that hard.

Examples:

Classifying internal support tickets
Routing emails based on company policy
Extracting vendor fields from known invoice formats
Deduplicating CRM contacts
Tagging sales calls by topic
Summarizing internal meeting transcripts
Matching candidates to job descriptions
Drafting standard procurement responses
Generating first-pass compliance checklists
Updating internal knowledge base entries
Creating weekly internal performance reports

This is where a lot of near-term enterprise AI value lives. It saves time. It reduces friction. It helps employees move faster.

But it is rarely the deepest moat. If the task is easy and only the data is private, a company can often solve it with retrieval, workflow automation, fine-tuning, or a small model connected to internal systems. The defensibility comes less from the model and more from integration, permissions, governance, change management, and reliability.

For most companies, this quadrant is still worth doing. It is where teams learn how to use AI safely. It creates quick wins. It builds internal muscle. It also exposes the places where simple automation breaks down, which points toward the more valuable fourth quadrant.

Quadrant 4: Frontier + Private = The Prize

This is the highest-value quadrant: difficult work where correctness is private, expensive, slow, or context-dependent.

Examples:

Hedge fund research using proprietary theses, portfolio constraints, and historical decision records
Insurance underwriting using carrier-specific loss history
Claims adjudication against payer-specific rules and exception patterns
Drug candidate prioritization using failed-trial data and internal assay results
Manufacturing yield optimization using private sensor and defect data
Semiconductor process tuning inside a specific fab
Enterprise sales strategy based on CRM history, call transcripts, buyer politics, and deal outcomes
Legal strategy based on firm precedent, judge behavior, client risk tolerance, and privileged documents
M&A diligence using private financials, internal operating data, and market assumptions
Product roadmap prioritization using support tickets, usage logs, sales objections, and executive strategy
Hospital operations planning across staffing, patient flow, payer rules, and local constraints
Security operations triage using internal logs, architecture, assets, and incident history

This work is hard for frontier labs to fully capture because the training signal is not public. There is no universal benchmark for "make the right underwriting decision for this carrier's book of business" or "write the investment memo this fund would actually trust." The ground truth lives inside the institution.

This is where custom agents become valuable. Not because they are magically smarter than frontier models, but because they can be connected to the private feedback loop:

What did the expert approve?
What did they reject?
Which signals did they trust?
Which recommendations performed well later?
Which internal constraints mattered?
Which workflow step created the bottleneck?
Which exception did the generic model miss?

The more private, longitudinal, and outcome-linked the feedback loop becomes, the more the AI system can adapt to the organization. That is where generic chatbots stop being enough.
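
One way to picture that private feedback loop is as a log of expert decisions that later gets joined to outcomes. The sketch below is only an illustration: the `ExpertReview` record, its field names, the `summarize` helper, and the sample cases are all hypothetical, not a description of any particular product or dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpertReview:
    """One expert judgment on one agent recommendation (hypothetical schema)."""
    case_id: str
    approved: bool
    reason: str                           # short tag for why it was approved or rejected
    outcome_good: Optional[bool] = None   # filled in later, once the result is known


def summarize(reviews: list[ExpertReview]) -> dict:
    """Turn a review log into the signals the questions above ask about."""
    approved = [r for r in reviews if r.approved]
    rejection_reasons = Counter(r.reason for r in reviews if not r.approved)
    # Only approved cases with a known outcome can say whether approval was right.
    scored = [r for r in approved if r.outcome_good is not None]
    return {
        "approval_rate": len(approved) / len(reviews) if reviews else 0.0,
        "top_rejection_reasons": rejection_reasons.most_common(3),
        "approved_hit_rate": (
            sum(r.outcome_good for r in scored) / len(scored) if scored else None
        ),
    }


# Made-up sample log for illustration only.
REVIEWS = [
    ExpertReview("case-1", True, "fits appetite", outcome_good=True),
    ExpertReview("case-2", False, "missed carrier exclusion"),
    ExpertReview("case-3", True, "fits appetite"),  # outcome not yet known
    ExpertReview("case-4", False, "missed carrier exclusion"),
]

summary = summarize(REVIEWS)
print(summary)
```

The design point is the `outcome_good` field: approvals and rejections arrive immediately, while outcomes arrive later, and it is the join between the two that makes the log longitudinal and outcome-linked rather than a simple thumbs-up counter.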

Closed-Loop Work Will Move First

The fastest automation happens when the system can test itself.

Coding is the clean example. The agent can write code, run tests, inspect errors, edit files, run the tests again, and stop when the suite passes. Browser tasks are similar: the agent can click, submit, observe page state, and verify whether the desired outcome happened. Math has exact answers. Data transformation can be checked against a schema. Spreadsheets can be checked cell by cell.

This is why public, closed-loop domains are so important to frontier model labs. They produce scalable training data and measurable progress. If the answer can be verified cheaply, the lab can turn the crank.
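
The write, test, inspect, retry cycle described above can be sketched in a few lines. This is a toy under stated assumptions: the "agent" is just a fixed list of candidate functions (`DRAFTS`), the "test suite" is a list of input and expected-output pairs (`TESTS`), and every name here is hypothetical. A real coding agent would generate each new draft from the failures of the previous one.

```python
from typing import Callable, Iterable, Optional

Candidate = Callable[[int], int]
TestCase = tuple[int, int]  # (input, expected output)


def run_tests(candidate: Candidate, tests: list[TestCase]) -> list[tuple[int, int, int]]:
    """Return (input, expected, actual) for every failing case."""
    failures = []
    for x, expected in tests:
        actual = candidate(x)
        if actual != expected:
            failures.append((x, expected, actual))
    return failures


def closed_loop(
    candidates: Iterable[Candidate],
    tests: list[TestCase],
    max_attempts: int = 5,
) -> Optional[tuple[int, Candidate]]:
    """Try drafts in order and stop at the first one that passes every test."""
    for attempt, candidate in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break
        failures = run_tests(candidate, tests)
        if not failures:
            return attempt, candidate
        # In a real agent, these failures would be fed back to produce the next draft.
        print(f"attempt {attempt}: {len(failures)} failing case(s), first: {failures[0]}")
    return None


# Toy task: implement squaring. The first draft is wrong, the second is right.
TESTS = [(0, 0), (2, 4), (3, 9)]
DRAFTS = [lambda x: x * 2, lambda x: x ** 2]

result = closed_loop(DRAFTS, TESTS)
if result is not None:
    print(f"accepted draft {result[0]}")
```

What makes this loop cheap is that `run_tests` needs no human in it. The same structure stops working the moment the check becomes "would the partner trust this memo?", which is the distinction the rest of this piece turns on.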

OpenAI's GDPval is interesting because it tries to move closer to real work: tasks across 44 occupations, with deliverables like legal briefs, engineering plans, spreadsheets, and nursing care plans. But even OpenAI notes that GDPval's first version is limited because many workplace tasks require iteration, context building, and integration into a larger workflow. That limitation is exactly the point. The closer you get to real work, the more verification becomes social, institutional, and context-specific.

Open-Loop Work Will Not Disappear, But It Will Change

Open-loop does not mean "AI cannot help." It means the human or institution remains part of the evaluation function.

Examples:

Is this brand direction tasteful?
Is this investment thesis convincing?
Is this product feature worth building?
Is this legal argument appropriate for this client?
Is this diagnosis plausible given messy patient context?
Is this sales strategy right for this buyer?
Is this executive memo politically realistic?
Is this design good for our audience?

AI will accelerate drafts, research, alternatives, critique, and simulation. But the final judgment still depends on taste, context, authority, accountability, and lived experience.

In Anthropic's internal study, people described using Claude as a constant collaborator while still actively supervising it, especially in high-stakes work. More than half said they could fully delegate only 0-20% of their work. That is a useful picture of where many professional workflows are heading: not "AI replaces the whole job," but "AI increases the volume of drafts, analyses, prototypes, and options that humans must judge."

That creates a new premium skill: knowing what good looks like.

What This Means For Your Career

If your work is mostly saturated and publicly verifiable, assume the task will be automated or commoditized. The answer is not panic. The answer is to move up the stack.

Invest in:

Defining the problem, not just executing the task
Owning the workflow around the AI output
Learning how to verify AI work quickly
Building taste and domain judgment
Understanding systems, incentives, and edge cases
Working with proprietary data and feedback loops
Communicating decisions to real stakeholders
Turning AI output into accountable action

The people who do well will not necessarily be the people who type the fastest or memorize the most syntax. They will be the people who can supervise systems, spot subtle errors, ask better questions, and connect outputs to outcomes.

In closed-loop fields, learn the tools deeply. Use agents. Build harnesses. Write tests. Create evaluation suites. The person who owns the testing loop owns the leverage.

In open-loop fields, build taste and judgment. Study examples. Create rubrics. Capture decisions. Turn vague expertise into observable feedback. The person who can make tacit judgment legible will be able to train and manage better AI systems.

What This Means For Companies

Most companies should not start with "which model should we use?" They should start with "where is our verification loop?"

Ask:

What decisions do we make repeatedly?
Which outputs are easy to check automatically?
Which outputs require expert review?
Where do we already have historical outcomes?
Where does proprietary context change the answer?
Where do employees spend time translating between systems?
Where do mistakes become expensive?
Where would a 10x increase in draft volume actually help?

Then classify the workflow:

If it is saturated and public, buy it cheap.
If it is frontier and public, expect the frontier labs to improve quickly.
If it is saturated and private, automate it for efficiency.
If it is frontier and private, build a custom feedback loop.

That final category is where AI becomes strategic. The agent is not just producing text. It is learning how your organization decides.
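
The four-way classification above is simple enough to write down as a lookup. The sketch below is a minimal illustration, assuming each workflow has already been judged on the two axes; the `Quadrant` enum and `classify` function are hypothetical names, and in practice both axes are matters of degree rather than clean booleans.

```python
from enum import Enum


class Quadrant(Enum):
    """The four quadrants, each paired with the recommended move."""
    COMMODITY = "buy it cheap"
    FRONTIER_LAB = "expect the frontier labs to improve quickly"
    INTERNAL_WORKFLOW = "automate it for efficiency"
    DOMAIN_AGENT = "build a custom feedback loop"


def classify(saturated: bool, public: bool) -> Quadrant:
    """Map task maturity and verification type to a quadrant."""
    if saturated and public:
        return Quadrant.COMMODITY
    if not saturated and public:
        return Quadrant.FRONTIER_LAB
    if saturated and not public:
        return Quadrant.INTERNAL_WORKFLOW
    return Quadrant.DOMAIN_AGENT


# Example: a hard task whose correctness depends on private context.
quadrant = classify(saturated=False, public=False)
print(f"{quadrant.name}: {quadrant.value}")
```

Treating the axes as booleans is a deliberate simplification. The hard part is the judgment call that feeds the function, not the function itself.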

The Bottom Line

AI automation will not advance evenly across all work. It will follow the verification gradient.

Tasks with public, cheap, repeatable feedback loops will be automated fastest. Frontier labs will dominate the hardest public benchmarks because public, repeatable scoring gives them a signal they can train and evaluate against at scale. Commodity tasks will collapse in price because many models can perform them well enough.

The durable value sits where correctness is private, contextual, and expensive to establish. That is where companies need AI systems connected to their data, experts, workflows, and outcomes.

The future of work is not just about who has the biggest model. It is about who owns the loop.

OpenNash helps teams understand AI, identify high-value automation opportunities, and build domain-specific agents around real business workflows. To explore a project or workshop, contact [email protected].

Sources and Further Reading

Anthropic, How AI is transforming work at Anthropic
Anthropic, The Anthropic Economic Index
SWE-bench, Official leaderboards
OSWorld, Benchmarking multimodal agents for open-ended tasks in real computer environments
METR, Measuring the impact of early-2025 AI on experienced open-source developer productivity
Epoch AI, FrontierMath
OpenAI, Measuring the performance of our models on real-world tasks