Your agent worked fine in staging. It answered questions, called tools, returned structured output. Then you deployed it, and within 48 hours the Slack channel lit up. The search API returned 503s, so the agent retried in a tight loop until it burned through your token budget. A slow embeddings call caused the entire chain to hang for 90 seconds. The LLM provider had a brownout, and instead of failing cleanly, your agent hallucinated tool results and presented them with full confidence.
None of these are AI problems. They are distributed systems problems. And distributed systems engineering solved most of them 15 years ago.
## Agents Are Distributed Systems Now
An AI agent in production is not a single model doing inference. It is a coordinator that calls external APIs, queries databases, invokes LLMs (sometimes multiple), reads from vector stores, and writes to downstream systems. Every one of those calls can fail, slow down, or return garbage.
Michael Nygard laid out the core patterns for this in Release It! back in 2007. Circuit breakers, bulkheads, timeouts, retry budgets - these were solutions to exactly the kinds of cascading failures that agent systems now face. The difference is that agent builders in 2026 are largely rediscovering these problems from scratch, because the AI engineering community grew up on notebooks and batch inference, not distributed services.
Google's SRE book formalized the idea that reliability is a feature you engineer, not a property you hope for. The same thinking applies to agents. You do not make an agent reliable by testing harder. You make it reliable by designing for failure at every integration point.
The agent-reliability-engineering framework on GitHub codifies some of these patterns specifically for agent systems, and the 2026 SRE Report from Catchpoint shows that even traditional SRE teams are now asking how these patterns apply to AI-driven automation.
## Circuit Breakers on Every Tool Call
A circuit breaker is simple: track the failure rate of an external call. When failures cross a threshold, stop making that call for a cooldown period. Try again after the cooldown. If it works, resume normal operation.
For agents, this means wrapping every tool invocation - search APIs, database queries, third-party services - in a circuit breaker. Here is what that looks like in practice:
```python
import time

class ToolCircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.last_failure = 0
        self.state = "closed"  # closed = healthy

    def call(self, tool_fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure > self.cooldown:
                self.state = "half-open"  # allow one probe call through
            else:
                return {"error": "circuit_open", "fallback": True}
        try:
            result = tool_fn(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
            self.failures = 0  # any success resets the failure count
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise
```
The counter-intuitive part: when the circuit opens, the agent should not retry or wait. It should skip that tool and continue with reduced capabilities. An agent that answers "I could not check the latest pricing data, but based on what I know..." is more useful than an agent that hangs for two minutes and then fails entirely.
This is what Chip Huyen calls the gap between 60 and 100 - getting the basic agent working is easy, but making it handle real-world degradation is where most teams stall.
## Timeout Budgets: Cap the Whole Turn, Not Just Each Step
Most agent frameworks let you set timeouts on individual tool calls. That is necessary but not sufficient. The real problem is unbounded agent turns - where the model decides to call six tools sequentially, each taking 3-4 seconds, and the user is staring at a spinner for 25 seconds.
The fix is a timeout budget for the entire agent turn:
| Budget Component | Typical Value | What Happens When Exceeded |
|---|---|---|
| Total turn budget | 15 seconds | Agent returns best partial answer |
| Per-tool timeout | 5 seconds | Tool call skipped, agent continues |
| LLM call timeout | 10 seconds | Fall back to faster model |
| Retry budget | 2 retries max | Fail with cached or static response |
The turn budget is the outer constraint. If the agent has burned 12 of its 15 seconds, it should not start a new tool call that might take 5 seconds. It should synthesize what it has and respond.
This is different from how most teams implement timeouts. They set per-call limits and hope the total stays reasonable. But agents make dynamic decisions about how many calls to make. Without a ceiling on the whole sequence, a "helpful" agent that decides to be thorough will blow past any reasonable response time.
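That outer constraint can be sketched in a few lines. This is a minimal illustration, not a framework API; the names `TurnBudget` and `can_afford` are assumptions for the example:

```python
import time

class TurnBudget:
    """Hard ceiling on one agent turn. All names here are illustrative."""

    def __init__(self, total_seconds=15.0):
        self.deadline = time.monotonic() + total_seconds

    def remaining(self):
        return max(0.0, self.deadline - time.monotonic())

    def can_afford(self, estimated_seconds):
        # Refuse to start work that would overrun the turn deadline.
        return self.remaining() >= estimated_seconds

# With only 3 of the 15 seconds left, a 5-second tool call is out of budget:
budget = TurnBudget(total_seconds=3.0)
print(budget.can_afford(5.0))  # False: synthesize a partial answer instead
```

The agent loop checks `can_afford` with a per-tool estimate before every call, which is exactly the "budget draw" framing: the question is never "can this call succeed" but "can this turn still pay for it."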
Stripe's API design principles are instructive here, even though they are not about AI. Stripe treats every external call as potentially slow and every retry as potentially expensive. The same discipline applies to agent tool calls: each call is a budget draw, and when the budget runs out, the agent must deliver what it has.
## Fallback Chains: Graceful Degradation for LLM Calls
LLM providers have outages. Not "catastrophic, everything is down" outages - more like brownouts where latency spikes, rate limits tighten, or quality degrades subtly. Your agent needs a plan for this.
A fallback chain routes LLM requests through progressively simpler alternatives:
```
Primary:    Claude Opus   (best quality, highest latency)
    ↓ on timeout or 5xx
Secondary:  GPT-4o        (comparable quality, different provider)
    ↓ on timeout or 5xx
Tertiary:   Claude Haiku  (faster, cheaper, good enough for most tasks)
    ↓ on timeout or 5xx
Deterministic: Rule-based response from templates
```
The key insight is that each level trades capability for reliability. Your agent running on Haiku is worse than your agent on Opus, but it is infinitely better than your agent returning an error. Netflix's approach to graceful degradation follows the same principle - when the recommendation engine is slow, show popular titles instead of nothing.
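In code, the chain reduces to an ordered loop over clients. The client callables below are stand-ins for real provider SDK calls, not actual APIs:

```python
def call_with_fallback(prompt, clients):
    """Try each (name, client) pair in order; fall through on any failure."""
    for name, client in clients:
        try:
            return name, client(prompt)  # log `name` for per-tier quality tracking
        except Exception:
            continue  # timeout, 5xx, rate limit: move down the chain
    # Deterministic last resort: a template, never a raw error.
    return "template", "I can't reach my models right now. Please try again shortly."

# Illustrative stand-ins for real SDK calls:
def opus(prompt):
    raise TimeoutError("provider brownout")

def gpt4o(prompt):
    return "answer from secondary provider"

tier, text = call_with_fallback("summarize the incident",
                                [("opus", opus), ("gpt-4o", gpt4o)])
print(tier)  # gpt-4o
```

Returning the tier name alongside the response is what makes the quality tracking below possible: every response carries a record of which model actually served it.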
For multi-model fallback to work, you need two things:
- Prompt compatibility. Your prompts should work across models without modification. This means avoiding model-specific features in your core agent prompts. Keep the fancy stuff in model-specific wrappers.
- Quality tracking per tier. If your agent silently falls back to a weaker model, you need to know how often that happens and what the quality impact is. Log which model served each response.
A common mistake is building fallback chains that are too aggressive. If your primary model has a 200ms latency spike, you do not want to immediately fall back to a weaker model. Set your fallback thresholds based on actual user tolerance, not engineering perfectionism. Most users will wait 8 seconds for a good answer before they will accept a mediocre instant one.
## SLOs for Agent Quality, Not Just Uptime
Traditional SRE defines SLOs for latency and availability. Agent reliability engineering needs SLOs for output quality too.
Here is a starter SLO framework for a production agent:
| SLO | Target | Measurement |
|---|---|---|
| Task completion rate | > 92% | Did the agent accomplish what the user asked? |
| Hallucination rate | < 3% | Did the agent state something verifiably false? |
| Tool call success rate | > 95% | Did external tool calls return valid results? |
| p95 response time | < 8s | End-to-end, from user input to agent response |
| Fallback rate | < 10% | How often did the agent use degraded paths? |
| User correction rate | < 15% | How often did users have to fix or redo agent output? |
The Rootly AI SRE Guide covers how SRE teams are starting to think about these metrics for AI-assisted operations. But the same framework applies to any agent system.
The error budget concept from SRE translates directly. If your task completion SLO is 92% over a 30-day window and you have burned through 80% of your error budget by day 15, that is a signal to freeze deployments and investigate. Maybe a tool API changed its response format. Maybe a prompt regression slipped through. The point is that you catch quality degradation with the same rigor you catch infrastructure degradation.
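The burn-rate arithmetic behind that check is simple enough to inline in a dashboard query; the numbers here just restate the example above:

```python
# Error-budget burn for a 92% task-completion SLO over a 30-day window.
ALLOWED_FAILURE_RATE = 0.08  # 1 minus the 0.92 SLO target

def budget_consumed(failed_tasks, total_tasks):
    """Fraction of the window's error budget spent so far."""
    allowed_failures = total_tasks * ALLOWED_FAILURE_RATE
    return failed_tasks / allowed_failures

# Day 15: 10,000 tasks handled, 640 failed -> 80% of the budget gone
# at only half the window.
print(budget_consumed(640, 10_000))  # 0.8
```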
Honeycomb's approach to observability emphasizes high-cardinality tracing for exactly this reason. You need to slice agent performance by model version, tool combination, user segment, and task type. Aggregate metrics hide the specific failure modes that eat your error budget.
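A simple way to get that slicing is one wide, structured event per agent turn. This is a sketch with illustrative field names, printing JSON as a stand-in for a real telemetry exporter:

```python
import json
import time

def log_agent_turn(task_type, model_version, tools_used, completed, latency_ms):
    """Emit one wide event per turn; high-cardinality fields enable slicing."""
    event = {
        "ts": time.time(),
        "task_type": task_type,
        "model_version": model_version,  # slice by model to catch silent fallbacks
        "tools_used": tools_used,        # slice by tool combination
        "completed": completed,
        "latency_ms": latency_ms,
    }
    print(json.dumps(event))  # stand-in for a real telemetry exporter
    return event

log_agent_turn("billing_lookup", "opus-primary", ["search", "db"], True, 4200)
```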
## What to Build First
If you are running agents in production today and have none of these patterns, here is the priority order:
Week 1: Timeout budgets. Add a hard ceiling on agent turn duration. This is the highest-impact, lowest-effort change. A runaway agent turn that burns tokens and time is the most common production failure, and a simple timer fixes it.
Week 2: Circuit breakers on tool calls. Start with the flakiest tool - you already know which one it is. Wrap it in a circuit breaker with a 60-second cooldown. When the circuit opens, return a cached result or skip the tool.
Week 3: Basic fallback chain. Configure a secondary LLM provider. Even if you never use it, having the routing in place means you can flip a switch during an outage instead of scrambling.
Week 4: Quality SLOs. Pick two metrics - task completion rate and hallucination rate are the best starting pair. Instrument them, set targets, review weekly.
You do not need a platform or a vendor for any of this. These are patterns, not products. A circuit breaker is 30 lines of Python. A timeout budget is a decorator. A fallback chain is a try/except with a list of clients. The Dash0 overview of AI SRE tooling lists platforms that can help at scale, but the patterns themselves are straightforward to implement.
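The "a timeout budget is a decorator" claim is nearly literal. Here is a hedged sketch using a worker thread, with one real caveat in the comments: Python threads cannot be forcibly cancelled, so the wrapped call keeps running in the background after the deadline:

```python
import functools
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)

def with_timeout(seconds):
    """Raise TimeoutError if the wrapped call exceeds `seconds` of wall time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            future = _pool.submit(fn, *args, **kwargs)
            # The thread is not killed on timeout; the caller just stops waiting.
            return future.result(timeout=seconds)
        return wrapper
    return decorator

@with_timeout(5.0)
def search_tool(query):
    ...  # wrap any slow tool call this way
```

For strict cancellation you would need process isolation or an async client with native timeouts, but for capping user-facing latency this is enough.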
The teams that ship reliable agents are not the ones with the best models or the most sophisticated prompts. They are the ones that treat their agent like a distributed service and apply 15 years of operational discipline to a 2-year-old technology. Start with the fundamentals. The fancy stuff can wait.