What are the Anthropic cookbooks?

The Anthropic cookbooks are a public collection of code recipes for building with Claude, covering simple agent loops, tool use, tool search, managed agents, outcome graders, and production patterns. They live at platform.claude.com/cookbook and on GitHub, and they are designed to be copied into real systems.

Are Anthropic agents the same as LangGraph or OpenAI Assistants?

No. Anthropic's cookbook style emphasizes thin agent loops and explicit runtime primitives - tools, traces, graders - rather than a heavyweight orchestration graph or a hosted assistant abstraction. The patterns are framework-agnostic, which is why teams use them alongside LangGraph and other harnesses.

What is tool search with embeddings and why does it matter?

Tool search retrieves a small relevant subset of tools from a large library using embeddings before the model decides what to call. It matters because once an agent has more than 20-30 tools, tool selection accuracy drops sharply, and pre-filtering recovers most of that lost reliability.

What is an outcome grader in the managed agents cookbook?

An outcome grader is an LLM judge that scores whether an agent's final state actually satisfies the goal, not just whether it ran without errors. The managed agents cookbook pairs this with a verifier loop so the agent can re-try when the grader rejects the outcome.

Should we build our agent stack internally or buy a platform?

Build the workflow logic and tools that encode your business; buy the harness pieces that are commodity - loops, tracing, evaluation, memory primitives. Most internal builds fail because teams try to own the harness and under-invest in tools and evaluation, which is where the real product lives.

What the Anthropic Cookbooks Reveal About Building Agents That Actually Work

A useful thing happens when you read the Anthropic cookbooks back to back instead of one at a time. The recipes stop looking like clever prompt tricks and start looking like a quiet argument about what an agent actually is. The argument goes: an agent is not a model with a personality, it is a runtime. The model is one component. The interesting work is everything around it - tools, environment, state, traces, human handoff, evaluators, resource lifecycle - and the cookbooks teach that work by showing it, not by explaining it.

This is a different bet than the one most agent frameworks made in 2024 and 2025. Those frameworks assumed the hard part was orchestration: nodes, edges, state machines, supervisors. Anthropic's recipes assume the hard part is the runtime shape of the loop and the boring infrastructure that decides whether the loop converges. That shift matters for any team deciding what to build and what to buy in 2026.

The Cookbook's Thesis Is About Runtime, Not Prompts

If you skim the Anthropic cookbook index, the recipes cluster into a small number of categories: simple loops, tool use, tool evaluation, tool search, context engineering, managed agents, outcome grading, and SDK patterns. There is no recipe for "the perfect agent prompt." There is no recipe for personas. There is a recipe for tool evaluation, which scores how well a given tool definition lets the model accomplish a task, and another for tool search with embeddings, which retrieves the right subset of tools when the library is too large to fit in context.

That table of contents is the thesis. Anthropic is telling builders: your tools are the product, your evals are the spec, and your loop is mostly already solved. The fastest way to make an agent better is almost never a better prompt.

Compare this to the way most internal agent projects start. A team picks a framework, writes a system prompt, wires up three or four tools, runs a demo, and ships. When the demo breaks in production - which it does - the team rewrites the prompt. The cookbook patterns suggest a different first move: write a tool evaluation, measure tool-use accuracy on a small fixed set of tasks, and then change the part of the stack that the measurement says is broken. That is closer to how Stripe and Shopify describe their internal agent work in their engineering posts, and closer to what Hamel Husain has been saying about evaluation-driven development for two years.

Tool Design Is the Real Product Surface

The single most underrated cookbook is the tool evaluation recipe. It does something obvious in retrospect: it treats a tool definition like a piece of code that has a test suite. You hand it a tool, a set of inputs the user might phrase in real language, and the expected calls. The recipe runs Claude against the tool, scores how often it called the right tool with the right arguments, and tells you which descriptions are ambiguous and which parameter names are confusing.

This sounds small. It is not. In a typical agent project, the tools are written once, in a hurry, and never measured. Then the team spends three weeks tuning prompts to compensate for a tool description that should have been one sentence longer. Anthropic's cookbook flips the order. You write the tool, you eval the tool, you fix the tool, and the prompt does almost no work.

The companion recipe is tool search. Once an agent has more than 20 or 30 tools, models start making selection mistakes that no amount of prompt engineering recovers. The tool search recipe builds an embedding index of tool descriptions and retrieves the top-K relevant tools per turn before the model sees them. This is the same insight as RAG, applied one level up. The interesting consequence is that you can now ship agents with a thousand tools without their accuracy collapsing, which changes what kinds of products are feasible.

A practical takeaway for teams: if your agent will ever have more than 15 tools, the architecture decision is not "which framework" but "where does tool retrieval live." Building it later is painful because every prompt and trace assumes the old shape.

Context Engineering Is Where Long-Running Work Lives

The building effective agents post and the cookbook recipes for context engineering tell a consistent story about long horizons. As an agent's loop gets longer, the context window fills with prior turns, tool outputs, and partial results. The naive approach is to dump everything in. The cookbook approach is to treat context as a managed resource with three explicit moves: summarize, externalize, and replay.

Summarize means the agent rolls older turns into a compressed state representation when the window pressure rises. Externalize means tool outputs that are large or reusable get written to a file, a vector store, or a scratchpad, and the agent retrieves them on demand. Replay means the agent can restart from a checkpoint without losing the relevant facts, which is how you survive crashes and timeouts in a 40-step task.

These are not new ideas. What is new is the cookbook treating them as first-class primitives instead of footnotes. Lance Martin's LangChain post on long-context agents makes a similar point: teams that ship long-running agents almost always end up rebuilding these three primitives from scratch, and almost always wish they had started with them.

For business leaders, this is the part of the stack to ask vendors about. Any agent doing work that takes more than a few minutes is implicitly doing context engineering, and the team that has not thought about summarize-externalize-replay is shipping a demo, not a system.

Outcome Graders Turn Agents Into Measurable Systems

The managed agents cookbook is the recipe that most internal builds skip and most production teams wish they had read first. The pattern is straightforward: after the agent finishes a task, a separate LLM grader inspects the final state - files changed, messages sent, database rows written - and scores whether the goal was actually met. If the grader rejects, the agent retries with the rejection as feedback.

This sounds like another layer of LLM, and it is. The reason it works anyway is that grading a finished outcome is a much easier task than producing one, in the same way that proofreading is easier than writing. The asymmetry gives you a cheap, automated quality gate that catches the failure mode most agents have: completing the steps without achieving the goal.

The deeper point is that an outcome grader is not just a safety net. It is a reproducible eval substrate. Once you have a grader you trust, you can run it against historical traces, compare model versions, A/B test tool changes, and detect regressions. This is the same pattern Eugene Yan describes in his LLM patterns post and what the OpenAI practical guide to building agents calls "verification loops." The cookbook makes it concrete.

Capability	What it gives you	What you lose without it
Tool evaluation	Per-tool accuracy scores	Prompt-tuning to mask bad tools
Tool search	Scaling past ~30 tools	Selection errors as catalog grows
Context engineering	Long-horizon stability	Window-pressure failures
Outcome grader	Automated quality gate	Manual trace review only
Managed agent SDK	Reproducible runtime	Re-implementing the loop

Build vs Buy: The Harness Is Commodity, the Workflow Is Yours

The most honest read of the cookbooks is that Anthropic has open-sourced the answer to a question many teams are still spending engineering quarters on. The agent loop is solved. The tool calling protocol is solved. The grader pattern is solved. What is not solved, and cannot be solved by anyone but you, is which tools your agent has, what your business considers a successful outcome, and how human handoff works inside your operations.

That suggests a sharp build-vs-buy line for 2026:

Buy or borrow the harness. Use the Claude Agent SDK, LangGraph, or the cookbook patterns directly. Do not reimplement loops, retry policies, tool routing, and trace capture. These are commodity now and the open implementations are better than what most teams will build internally.
Build the tools. Your tools encode your business logic, your data access, your guardrails, and your audit trail. They are the part of the system you cannot outsource and the part that determines whether the agent is useful.
Build the graders. What "success" means for your workflow is your IP. The grader prompt and its rubric are where your domain expertise lives, and they are usually short enough to write and version in your own repo.
Buy the model. Do not finetune unless you have measured that prompting plus tools plus graders is not enough. Chip Huyen's pitfalls post is still right about this.

The teams that get this wrong usually go in two directions. The first overinvests in the harness, rebuilding orchestration that already works, and ships late with a thin tool layer. The second underinvests in graders and evals, ships fast, and then cannot tell whether changes are helping or hurting. The cookbook makes both mistakes harder to justify, because the alternative is sitting in front of you in a notebook.

How OpenNash Can Help

If the cookbook patterns map cleanly to your roadmap, you probably do not need outside help - the recipes are deliberately copyable. The cases where bringing in a partner pays off are different. They tend to involve tool libraries that already exceed what a single team can evaluate, graders that need to encode regulated or auditable decisions, or long-running agents that have to integrate with legacy systems where context engineering is not optional.

OpenNash builds production agents around exactly this runtime shape: tools as first-class artifacts with evaluation suites, graders that match the way your operations actually define success, and lifecycle controls for human handoff and audit. The work follows the audit, design, build, deploy pattern, and ownership of the resulting code, tools, and graders sits with the client at handoff. If you are deciding which parts of the agent stack are worth building internally and which are worth borrowing from the cookbooks, book a call and we can map it to your specific workflow before anyone writes a line of code.

The cookbooks are the clearest signal yet that the agent conversation has moved on from "can the model do it" to "is your runtime shaped to let it." That is a better conversation. It is also one where the teams that have done the boring infrastructure work are about to pull ahead of the teams that are still tuning prompts.