Most teams building "AI agents" are building the wrong thing.
They read about autonomous systems that can browse the web, write code, and manage complex projects. They spin up the OpenAI API, wire up a dozen tools, and watch their agent hallucinate its way into a $400 API bill while accomplishing nothing useful.
The gap between demo and production isn't more autonomy. It's less. The teams shipping reliable AI systems aren't building autonomous agents. They're building carefully constrained workflows that use LLMs as components, not decision-makers.
Anthropic's Building Effective Agents guide codifies what practitioners have learned through painful experience: the most reliable agent architectures are the simplest ones that solve your problem. They outline five patterns, ordered from simplest to most complex, and the guidance is clear: exhaust each level before graduating to the next.
Here's what each pattern actually looks like in production, when to use it, and the failure modes that catch teams off guard.
Pattern 1: Prompt Chaining - The Workhorse You're Probably Underusing
Prompt chaining breaks a task into fixed steps, where each LLM call processes the output of the previous one. There's no autonomy here. The sequence is predetermined. The LLM is a sophisticated text processor, not a decision-maker.
The pattern:
Input → LLM Call 1 → Gate/Transform → LLM Call 2 → Gate/Transform → Output
Real production example: Document processing for legal contracts.
- Extract key terms, dates, parties from the document
- Gate: Verify extraction confidence > 0.9, else flag for human review
- Classify contract type and risk level
- Gate: If high-risk, route to senior review
- Generate summary and action items
- Validate summary against original (catch hallucinations)
Each step has a single, clear purpose. Each gate catches failures before they compound. The "agent" is really a sophisticated pipeline.
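The pipeline above can be sketched in a few lines. Here `call_llm` and `StepResult` are stand-ins for your provider's client (e.g. a wrapper around Anthropic's or OpenAI's API), not real library calls; the point is the fixed sequence with a gate after each step:

```python
# Sketch of a prompt chain with gates. `call_llm` is a placeholder for a
# real API call; swap in your provider's client.
from dataclasses import dataclass

@dataclass
class StepResult:
    text: str
    confidence: float

def call_llm(prompt: str, text: str) -> StepResult:
    # Placeholder: echoes the input with high confidence.
    return StepResult(text=text, confidence=0.95)

def process_contract(document: str) -> dict:
    extracted = call_llm("Extract key terms, dates, parties.", document)
    # Gate 1: low-confidence extractions go to a human, not downstream.
    if extracted.confidence < 0.9:
        return {"status": "human_review", "stage": "extraction"}

    classified = call_llm("Classify contract type and risk level.", extracted.text)
    # Gate 2: high-risk contracts route to senior review.
    if "high-risk" in classified.text.lower():
        return {"status": "senior_review", "stage": "classification"}

    summary = call_llm("Summarize and list action items.", extracted.text)
    # Gate 3: validate the summary against the source to catch hallucinations.
    check = call_llm("Does this summary match the source? Answer yes/no.", summary.text)
    if check.text.strip().lower().startswith("no"):
        return {"status": "human_review", "stage": "validation"}
    return {"status": "ok", "summary": summary.text}
```

Notice that every early return is a routing decision to a human, never a silent failure. That's the property that makes chains debuggable.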
When to use prompt chaining:
- Tasks with clear sequential steps
- When you need verifiable intermediate outputs
- When different steps benefit from different prompts or system instructions
- When you want to catch and handle failures at each stage
The failure mode nobody talks about: Latency accumulation. Five sequential LLM calls at 2 seconds each means 10 seconds minimum response time. Users don't wait 10 seconds. Design for parallelization from the start (Pattern 3), even if you implement sequentially first.
Cost optimization: Not every step needs your most capable model. Extraction and classification often work fine with Claude Haiku or GPT-4o-mini at 1/10th the cost. Reserve Opus/GPT-4 for reasoning-heavy steps.
Pattern 2: Routing - Cut Costs Without Cutting Corners
Routing uses an initial classification to direct inputs to specialized handlers. It's how you serve enterprise-grade quality without enterprise-grade bills.
The pattern:
Input → Classifier → Route A (simple) → Output
→ Route B (complex) → Output
→ Route C (specialist) → Output
Real production example: Customer support automation.
A financial services company processes 50,000 support tickets monthly. Before routing:
- Every ticket hits GPT-4: $0.03 per ticket average
- Monthly cost: $1,500
- Quality issues: Overkill for simple queries, undertrained for specialist topics
After implementing routing:
- Route A (60% of tickets): FAQ-style questions → RAG lookup + Haiku response ($0.002/ticket)
- Route B (30% of tickets): Standard support → GPT-4o-mini with company context ($0.008/ticket)
- Route C (10% of tickets): Complex/sensitive → GPT-4 with full reasoning ($0.05/ticket)
Monthly cost after routing: roughly $430 (30,000 × $0.002 + 15,000 × $0.008 + 5,000 × $0.05). Same quality, about 71% cost reduction.
Implementation detail that matters: Your classifier IS the routing decision. A bad classifier means expensive queries going cheap (quality degradation) or cheap queries going expensive (cost inflation). Invest heavily in classifier accuracy. It's the highest-impact optimization in this pattern.
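A minimal routing layer looks like this. The model names and per-ticket costs mirror the example above; `classify` is a keyword stand-in for what should be its own cheap LLM call or a fine-tuned classifier:

```python
# Sketch of cost-tiered routing. Costs and tiers follow the support example.
ROUTES = {
    "faq":      {"model": "claude-haiku", "cost_per_ticket": 0.002},
    "standard": {"model": "gpt-4o-mini",  "cost_per_ticket": 0.008},
    "complex":  {"model": "gpt-4",        "cost_per_ticket": 0.05},
}

def classify(ticket: str) -> str:
    # Placeholder heuristic. In production this is a dedicated classifier
    # trained on labeled tickets -- the highest-impact piece of the pattern.
    lowered = ticket.lower()
    if "password" in lowered or "reset" in lowered:
        return "faq"
    if any(w in lowered for w in ("merger", "legal", "fraud")):
        return "complex"
    return "standard"

def route(ticket: str) -> dict:
    label = classify(ticket)
    return {"route": label, **ROUTES[label]}

def monthly_cost(ticket_mix: dict) -> float:
    # ticket_mix maps route name -> monthly ticket count.
    return sum(ROUTES[r]["cost_per_ticket"] * n for r, n in ticket_mix.items())

# With the 60/30/10 split over 50,000 tickets:
# 30,000 * 0.002 + 15,000 * 0.008 + 5,000 * 0.05 = $430/month
```

The `monthly_cost` helper is worth keeping around in real systems: route-mix drift is how routing savings quietly evaporate.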
When to use routing:
- High-volume applications with mixed complexity
- When different query types genuinely need different handling
- When cost optimization is a priority
- When you have enough data to build a reliable classifier
The failure mode: Routing based on surface features rather than actual complexity. A short question isn't necessarily simple ("Should I exercise my stock options before the merger closes?"). Build your classifier on task complexity, not input length.
Pattern 3: Parallelization - The Latency Killer
Parallelization runs independent subtasks simultaneously, either processing different inputs (sectioning) or getting multiple perspectives on the same input (voting).
The pattern (sectioning):
Input → Split → [LLM Call A] / [LLM Call B] / [LLM Call C] (in parallel) → Aggregate → Output
The pattern (voting):
Input → [Prompt A] / [Prompt B] / [Prompt C] (same input, in parallel) → Vote/Merge → Output
Real production example (sectioning): Analyzing a 100-page annual report.
Sequential approach: 5 minutes, processing 10 pages at a time. Parallel approach: 45 seconds, processing all 10 chunks simultaneously.
The aggregation step synthesizes findings. You're trading straightforward implementation for dramatic latency improvement.
Real production example (voting): Code review automation.
Three parallel reviews with different focuses:
- Security vulnerabilities (prompt emphasizes OWASP Top 10)
- Performance issues (prompt emphasizes algorithmic complexity)
- Maintainability (prompt emphasizes code clarity, naming, structure)
Merge step combines findings, deduplicates, and ranks by severity. Each reviewer sees the same code but through a different lens.
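The voting pattern maps directly onto `asyncio.gather`. In this sketch `review` is a stand-in for an async call to your provider's client; the three prompts mirror the code-review example above:

```python
# Sketch of the voting pattern: three focused reviews run concurrently,
# so total latency is roughly the slowest single call, not the sum.
import asyncio

PROMPTS = {
    "security": "Review for OWASP Top 10 vulnerabilities.",
    "performance": "Review for algorithmic complexity issues.",
    "maintainability": "Review for clarity, naming, and structure.",
}

async def review(focus: str, prompt: str, code: str) -> dict:
    # Placeholder: a real implementation awaits your provider's async client.
    await asyncio.sleep(0)  # stands in for network I/O
    return {"focus": focus, "findings": []}

async def parallel_review(code: str) -> list:
    tasks = [review(focus, prompt, code) for focus, prompt in PROMPTS.items()]
    results = await asyncio.gather(*tasks)
    # Merge step: deduplication and severity ranking would happen here,
    # typically as one more LLM call over the combined findings.
    return list(results)

findings = asyncio.run(parallel_review("def f(x): return x"))
```

The same skeleton covers sectioning: replace the three prompts with one prompt applied to three document chunks.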
When to use parallelization:
- Latency-sensitive applications where sequential processing is too slow
- Tasks that naturally decompose into independent subtasks
- When multiple perspectives improve output quality
- When you can afford the increased API costs (parallel = more concurrent calls)
The failure mode: Assuming independence when subtasks actually depend on each other. If Chunk 3's analysis depends on understanding established in Chunks 1 and 2, parallelization introduces errors. Map your actual dependencies before parallelizing.
Cost consideration: Voting patterns multiply your API costs by the number of voters. Three parallel reviews = 3x the cost. Ensure the quality improvement justifies the expense: run A/B tests, not assumptions.
Pattern 4: Orchestrator-Workers - When You Actually Need Delegation
The orchestrator-workers pattern introduces real autonomy: a central LLM dynamically breaks down tasks and delegates to specialized workers. This is where "agent" starts meaning something.
The pattern:
Input → Orchestrator → [Analyze task, create subtasks]
→ Delegate to Worker A → Result A
→ Delegate to Worker B → Result B
→ [Synthesize results] → Output
Real production example: Competitive intelligence gathering.
User query: "How is Stripe positioning against Adyen in the European market?"
Orchestrator breaks this into:
- Search Worker: Find recent Stripe announcements about European expansion
- Search Worker: Find recent Adyen European market share data
- Analysis Worker: Compare pricing structures from public documentation
- Synthesis Worker: Identify positioning differences and strategic implications
The orchestrator doesn't know in advance exactly what information exists or what the workers will find. It adapts the synthesis based on what comes back.
Critical implementation detail: The orchestrator needs to know what each worker can do. This isn't magic; it's carefully designed tool descriptions and clear capability boundaries. Vague worker definitions lead to misrouted tasks and compounding errors.
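A skeleton that encodes those boundaries might look like the following. `plan` and `run_worker` are placeholders for LLM calls; the parts worth copying are the explicit capability registry, the hard iteration cap, and the explicit completion criterion:

```python
# Sketch of orchestrator-workers with guardrails against the failure
# modes listed below: capability checks, iteration limits, completion criteria.
WORKERS = {
    "search": "Finds recent public information on a topic.",
    "analysis": "Compares structured data from provided sources.",
    "synthesis": "Combines worker results into a final answer.",
}
MAX_ROUNDS = 3  # hard limit: guards against infinite delegation loops

def plan(query: str, results: list) -> list:
    # Placeholder planner: one search round, then stop. A real planner is
    # an LLM call that sees WORKERS' descriptions and prior results.
    if not results:
        return [("search", query)]
    return []  # empty plan = explicit completion criterion

def run_worker(name: str, task: str) -> str:
    if name not in WORKERS:
        # Orchestrator overreach: delegating to a worker that doesn't exist.
        raise ValueError(f"unknown worker: {name}")
    return f"[{name}] result for: {task}"

def orchestrate(query: str) -> str:
    results = []
    for _ in range(MAX_ROUNDS):
        subtasks = plan(query, results)
        if not subtasks:
            break
        results.extend(run_worker(name, task) for name, task in subtasks)
    return run_worker("synthesis", " | ".join(results))
```

Passing `results` back into `plan` is the minimal version of shared context; without it you get the worker-isolation failure mode below.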
When to use orchestrator-workers:
- Tasks where subtask structure can't be predetermined
- When specialized capabilities genuinely help (code generation worker vs. research worker)
- When you need to scale complexity beyond what a single prompt can manage
- When you have robust error handling and evaluation infrastructure
The failure modes (there are several):
- Orchestrator overreach: The orchestrator tries to delegate tasks the workers can't handle. Solution: Explicit capability descriptions and graceful failure handling.
- Worker isolation: Workers can't share context, leading to redundant work or contradictory outputs. Solution: Shared memory or context passing (adds complexity).
- Infinite loops: The orchestrator keeps delegating without converging on an answer. Solution: Hard limits on iterations and explicit completion criteria.
Honest assessment: Most teams implementing orchestrator-workers would be better served by well-designed prompt chains. The autonomy this pattern provides is seductive but expensive to make reliable. If you can enumerate your subtasks in advance, use prompt chaining instead.
Pattern 5: Evaluator-Optimizer - The Quality Ratchet
The evaluator-optimizer pattern generates output, evaluates it against criteria, and iteratively refines until quality thresholds are met. It's how you get outputs that meet specific standards, not just "pretty good" outputs.
The pattern:
Input → Generator → Draft → Evaluator → [Meets criteria?]
- No → [Generate specific feedback] → back to Generator
- Yes → Final Output
Real production example: Marketing copy generation for regulated industries.
A healthcare company needs marketing copy that's compelling AND compliant. First drafts from LLMs routinely include claims that would trigger FDA review.
The evaluator checks against:
- Prohibited claim patterns (specific regex + semantic matching)
- Required disclosure presence
- Tone guidelines (professional but accessible)
- Brand voice consistency
When evaluation fails, specific feedback goes back to the generator: "Claim 'clinically proven' on line 3 requires citation. Rephrase or add supporting evidence."
The loop continues until all criteria pass or max iterations hit (then human review).
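The loop structure is simple; the value is in the escape hatch. In this sketch `generate` and `evaluate` are placeholders for two different prompts (ideally different models), and the compliance check is reduced to a single substring test standing in for the real regex-plus-semantic matching:

```python
# Sketch of the evaluate-refine loop with a max-iteration escape hatch.
MAX_ITERATIONS = 3

def generate(brief: str, feedback: str = "") -> str:
    # Placeholder generator; a real one would incorporate the feedback.
    draft = f"Draft for: {brief}"
    if feedback:
        draft += " (revised)"
    return draft

def evaluate(draft: str) -> tuple:
    # Placeholder check standing in for prohibited-claim patterns,
    # required disclosures, tone, and brand-voice rules.
    if "clinically proven" in draft.lower():
        return False, "Claim 'clinically proven' requires a citation."
    return True, ""

def generate_with_review(brief: str) -> dict:
    feedback = ""
    draft = ""
    for i in range(MAX_ITERATIONS):
        draft = generate(brief, feedback)
        ok, feedback = evaluate(draft)
        if ok:
            return {"status": "ok", "draft": draft, "iterations": i + 1}
    # Didn't converge: escalate to a human instead of shipping a bad output.
    return {"status": "human_review", "draft": draft, "feedback": feedback}
```

Feeding the evaluator's specific feedback back into the generator, rather than just "try again", is what makes iterations converge.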
When to use evaluator-optimizer:
- Outputs must meet specific, verifiable criteria
- When "good enough" isn't good enough (legal, medical, financial content)
- When you can define clear evaluation rubrics
- When the cost of iteration is less than the cost of bad output
Implementation detail: The evaluator and generator should ideally use different prompts or even different models. If the same model that generated the error evaluates it, blind spots persist. Cross-model evaluation or specialized evaluation prompts catch more issues.
The failure mode: Evaluation criteria that the generator can't actually satisfy. If your evaluator demands perfect factual accuracy but your generator hallucinates, you get infinite loops. Match your evaluation criteria to what's actually achievable, then use external verification for claims that matter.
Cost and latency: This pattern multiplies both. Three iterations = 3x generator cost + 3x evaluator cost. For a 2-second generation + 1-second evaluation, three iterations means 9 seconds minimum. Design your criteria to pass on first attempt most of the time, with iteration as the exception.
Choosing Your Pattern: The Decision Framework
The temptation is to start complex. Resist it.
Start here:
- Can you enumerate the exact steps in advance? → Prompt Chaining
- Do you have high volume with mixed complexity? → Routing
- Are there independent subtasks that can run simultaneously? → Parallelization
- Must the task decomposition happen dynamically? → Orchestrator-Workers
- Must outputs meet specific verifiable criteria? → Evaluator-Optimizer
Combine patterns intentionally: Real production systems layer these. A routed system might use prompt chaining within each route. An orchestrator might spawn workers that use evaluator-optimizer loops. But start with one pattern, prove it works, then add complexity.
The meta-pattern for production:
Route → [Simple route: Prompt Chain]
→ [Complex route: Orchestrator-Workers with Evaluator loop]
This gives you cost efficiency (routing), reliability (chaining for simple cases), capability (orchestration for complex cases), and quality (evaluation for critical outputs).
What Nobody Tells You About Production Agent Systems
1. Evaluation is your actual product.
Your agent is only as good as your ability to measure whether it works. Before building any pattern, define:
- What does success look like for this task?
- How will you measure it automatically?
- What's the human baseline you're comparing against?
Hamel Husain's evaluation guide should be required reading before any agent implementation.
2. Simpler patterns have compounding advantages.
Every additional LLM call is a potential failure point. Prompt chains have N failure points. Orchestrator-workers have N × M failure points (N workers, M potential interactions). The math isn't linear. Complexity compounds.
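The compounding is easy to make concrete. Assuming each call succeeds independently with probability p (0.99 here is illustrative, not a measured figure), chain reliability is p to the power of the call count:

```python
# Reliability of a chain of n independent LLM calls, each succeeding
# with probability p. Illustrates why complexity compounds.
def chain_reliability(p: float, n: int) -> float:
    return p ** n

# A 99%-reliable call looks great in isolation...
print(round(chain_reliability(0.99, 1), 3))   # 0.99
# ...but a five-step chain already fails ~5% of the time...
print(round(chain_reliability(0.99, 5), 3))   # 0.951
# ...and a 20-call orchestrator run fails ~18% of the time.
print(round(chain_reliability(0.99, 20), 3))  # 0.818
```

This is the quantitative case for gates: each checkpoint resets the error budget instead of letting failures propagate.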
3. The "agent" framing is often wrong.
As Simon Willison notes, an agent is "an LLM running tools in a loop to achieve a goal." Most successful production systems don't fit this definition - they're sophisticated workflows with LLMs as components. That's not a limitation; it's a feature.
4. Cost optimization happens at the pattern level, not the prompt level.
Prompt engineering saves you 10-20%. Choosing the right pattern (routing to appropriate model tiers, parallelizing for throughput) saves you 60-80%. Optimize architecture first.
The Honest Path Forward
Here's what I'd tell a team starting today:
Week 1-2: Build the simplest possible version using prompt chaining. No frameworks: just API calls and Python. Get something working end-to-end.
Week 3-4: Instrument everything. Log inputs, outputs, latencies, costs, failure rates. You can't optimize what you can't measure.
Week 5-6: Based on actual data (not assumptions), identify your bottleneck. Is it cost? Add routing. Latency? Add parallelization. Quality? Add evaluation loops.
Week 7+: Only now consider orchestrator-workers or multi-agent architectures - and only if simpler patterns genuinely can't solve your problem.
The teams shipping reliable AI agents aren't the ones with the most sophisticated architectures. They're the ones who chose the simplest pattern that works and invested heavily in evaluation. Everything else is premature optimization - or worse, premature complexity.
The five patterns exist on a spectrum from fully deterministic to fully autonomous. Your job isn't to reach maximum autonomy. It's to find the minimum autonomy that solves your users' problems reliably. Start simple, measure obsessively, and earn your complexity.