Your AI agent demo was impressive. It researched prospects, drafted emails, and even scheduled follow-ups. The CEO loved it. Engineering high-fived. Then someone ran the numbers on what it would cost to handle 500 customers a day instead of 5, and the project quietly moved to the "revisit later" column.
This is the most common way AI agents die. Not from technical failure - from cost surprise. The POC ran on the best model with verbose prompts and no budget controls, and nobody did the multiplication until it was too late.
Here's the thing: running a production agent for $50 a day is entirely achievable. It just requires the same discipline you'd apply to any other infrastructure cost - something the AI industry has been strangely quiet about.
What Agents Actually Cost (The Uncomfortable Math)
Before you can optimize, you need to understand where the money goes. Every agent interaction has a cost anatomy:
Input tokens (your prompts, system instructions, context) + Output tokens (the model's response) = per-call cost. Multiply by the number of calls per task, tasks per day, and days per month.
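That multiplication is worth encoding once so nobody has to redo it in a spreadsheet. A minimal sketch (the function and prices here are illustrative; pass in your provider's actual per-MTok rates):

```python
def task_cost(in_tokens, out_tokens, calls_per_task,
              in_price_per_mtok, out_price_per_mtok):
    """Cost of one agent task: per-call token cost times calls per task."""
    per_call = (in_tokens * in_price_per_mtok
                + out_tokens * out_price_per_mtok) / 1_000_000
    return per_call * calls_per_task

# A support flow: 3 calls of 2,000 in / 500 out at Sonnet-class
# pricing ($3 in / $15 out per MTok).
cost = task_cost(2_000, 500, 3, 3.00, 15.00)
print(f"${cost:.4f} per task, ${cost * 200:.2f} for 200 tasks/day")
```

Run it against each row of the table below and the daily numbers fall out directly.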
Here's what common agent patterns actually cost at current API pricing (February 2026):
| Agent Pattern | Calls/Task | Avg Tokens/Call | Model | Cost/Task | 200 Tasks/Day |
|---|---|---|---|---|---|
| Customer support (classify + respond) | 2-3 | 2,000 in / 500 out | Claude Sonnet | $0.04-0.06 | $8-12/day |
| Research agent (search + synthesize) | 5-8 | 4,000 in / 1,000 out | GPT-4o | $0.15-0.30 | $30-60/day |
| Data pipeline (extract + validate) | 3-4 | 1,500 in / 300 out | Claude Haiku | $0.003-0.005 | $0.60-1.00/day |
| Email drafting (context + write + refine) | 3-5 | 3,000 in / 800 out | Claude Sonnet | $0.08-0.15 | $16-30/day |
Notice the 60x cost difference between data extraction on Haiku and research on GPT-4o. That gap is your optimization surface.
The numbers above assume everything works on the first try. In production, it won't. Failed tool calls, malformed outputs that need re-parsing, and agents that loop because they misunderstood the task - these all multiply your costs. A practical analysis from OpenAI's agent guide emphasizes that cost and latency tradeoffs should be designed in from the start, not bolted on later.
The Four Levers of Agent Cost Engineering
Lever 1: Model Routing (The Biggest Win)
Not every step in your agent's workflow needs the smartest model. A customer support agent might handle these steps:
1. Classify the ticket (billing? technical? cancellation?)
2. Retrieve relevant docs (search knowledge base)
3. Draft a response (write the actual reply)
4. Check for policy compliance (verify the reply is safe)
Steps 1, 2, and 4 are classification and retrieval tasks. They don't need Claude Sonnet or GPT-4o. A model 10-20x cheaper handles them fine.
A practical routing setup:
Tier 1 (cheap/fast): Classification, extraction, simple Q&A
→ Claude Haiku ($0.25/$1.25 per MTok) or GPT-4o mini ($0.15/$0.60 per MTok)
Tier 2 (mid-range): Drafting, summarization, moderate reasoning
→ Claude Sonnet ($3/$15 per MTok) or GPT-4o ($2.50/$10 per MTok)
Tier 3 (expensive): Complex reasoning, creative writing, multi-step planning
→ Claude Opus or o1 ($15/$75+ per MTok) - use sparingly
Chip Huyen's analysis of AI engineering patterns documents how compound systems with multiple model tiers consistently outperform single-model approaches on both cost and quality. The key insight: a cheap model making a routing decision adds pennies but saves dollars.
In practice, you implement this with a simple classifier at the start of your pipeline. The classifier itself runs on your cheapest model. If it's wrong 5% of the time, that's still cheaper than running every request through your most expensive model.
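A minimal sketch of that routing layer (the step names and tier assignments are assumptions to tune for your own workflow; in production the `classify` step itself would be one cheap-model call):

```python
# Route each workflow step to the cheapest model that handles it.
# Step names and assignments are illustrative, not a standard.
ROUTES = {
    "classify":   "claude-haiku",    # Tier 1: cheap classification
    "retrieve":   "claude-haiku",    # Tier 1: search / extraction
    "draft":      "claude-sonnet",   # Tier 2: actual writing
    "compliance": "claude-haiku",    # Tier 1: yes/no policy check
}

def pick_model(step: str, default: str = "claude-sonnet") -> str:
    """Unknown steps fall back to the mid-tier model, not the expensive one."""
    return ROUTES.get(step, default)
```

The fallback default matters: unknown steps should land on your mid-tier model, and only explicit escalation should reach Tier 3.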
Lever 2: Prompt Caching (90% Input Savings)
Here's a fact that surprises most teams: your system prompt is your biggest token expense, and you're paying for it on every single API call.
A typical agent system prompt runs 2,000-4,000 tokens. If your agent makes 5 calls per task and handles 200 tasks a day, that's 1,000 API calls - and you're sending the same system prompt in every one. That's 2-4 million tokens per day just in repeated instructions.
Both Anthropic and OpenAI now offer prompt caching that stores your system prompt server-side and charges a fraction of the normal input rate for cache hits. Anthropic's implementation charges 90% less for cached tokens after the first request. OpenAI's approach is similar.
The implementation is straightforward:
# Instead of sending 3,000 tokens of system prompt every call,
# mark it as cacheable. First call pays full price,
# subsequent calls pay ~10% for the cached portion.
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    system=[{
        "type": "text",
        "text": your_long_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=messages
)
For a support agent making 1,000 calls/day with a 3,000-token system prompt, caching cuts the system prompt cost from roughly $9/day to $0.90/day at Sonnet rates. That's real money over a month.
Lever 3: Token Budgets and Circuit Breakers
The scariest cost scenario isn't a busy day - it's a stuck agent. When an agent enters a retry loop (tool returns an error, agent tries again with full context, fails again, tries again), costs compound fast. Each retry includes the entire conversation history, which grows with each attempt.
A three-retry loop doesn't cost four times a single attempt. Because each retry sends all previous attempts as context, the input grows every round:
| Attempt | Input Tokens | Cumulative Cost |
|---|---|---|
| First try | 3,000 | $0.009 |
| Retry 1 | 5,500 (original + first response) | $0.025 |
| Retry 2 | 8,500 (all previous) | $0.050 |
| Retry 3 | 12,000 (all previous) | $0.086 |
That's a nearly 10x cost increase over the first attempt. Google's Site Reliability Engineering handbook describes this as cascading failure - the same pattern that takes down distributed systems applies to agent cost overruns.
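The table above is easy to reproduce. A quick sketch at Sonnet-class input pricing (output tokens omitted for simplicity, as in the table):

```python
# Each retry resends all prior attempts, so input size grows every round.
IN_PRICE = 3.00 / 1_000_000  # $ per input token at Sonnet-class pricing

attempts = [3_000, 5_500, 8_500, 12_000]  # input tokens per attempt
total = 0.0
for i, tokens in enumerate(attempts):
    total += tokens * IN_PRICE
    print(f"attempt {i}: {tokens:>6} tokens in, cumulative ${total:.3f}")
# Final cumulative (~$0.087) is nearly 10x the first attempt ($0.009).
```
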
The fix is straightforward:
- Set a max token budget per agent run (e.g., 50,000 tokens total). When you hit it, stop and return a partial result.
- Limit retry attempts (3 max, with exponential backoff).
- Implement circuit breakers: if an agent fails N times in a row, stop trying and alert a human.
class AgentBudget:
    """Per-run token and retry budget - check before every model call."""

    def __init__(self, max_tokens=50_000, max_retries=3):
        self.max_tokens = max_tokens
        self.max_retries = max_retries
        self.tokens_used = 0
        self.retries = 0

    def can_continue(self):
        return (self.tokens_used < self.max_tokens
                and self.retries < self.max_retries)

    def record_call(self, input_tokens, output_tokens):
        self.tokens_used += input_tokens + output_tokens

    def record_retry(self):
        self.retries += 1
This isn't sophisticated engineering. It's the same pattern you'd use for any API with metered billing. But most agent frameworks don't include it by default, and most tutorials skip it entirely.
Lever 4: Batching and Async Processing
Not every agent task needs a real-time response. If you're processing customer feedback, enriching CRM records, or generating reports, you can batch requests and take advantage of lower pricing tiers.
OpenAI's Batch API offers 50% cost reduction for requests that can tolerate a 24-hour turnaround. Anthropic offers similar batch pricing through their Message Batches API. For workflows where latency doesn't matter - nightly data enrichment, weekly report generation, bulk email drafting - this is free money.
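As a sketch, a nightly enrichment job might build its batch payload like this. The request shape follows Anthropic's Message Batches documentation at the time of writing - verify against your SDK version - and the model name is a placeholder:

```python
# One batch request per ticket; custom_id lets you match results back later.
def build_batch_requests(tickets, model="claude-haiku-placeholder"):
    return [
        {
            "custom_id": f"ticket-{i}",
            "params": {
                "model": model,
                "max_tokens": 200,
                "messages": [{"role": "user",
                              "content": f"Classify this ticket: {text}"}],
            },
        }
        for i, text in enumerate(tickets)
    ]

# Submitting (requires the `anthropic` package and an API key):
#   client = anthropic.Anthropic()
#   batch = client.messages.batches.create(
#       requests=build_batch_requests(tickets))
#   ...then poll client.messages.batches.retrieve(batch.id) until it finishes.
```
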
Even without formal batch APIs, you can reduce costs by:
- Combining similar requests: Instead of 10 separate "classify this ticket" calls, send one call with 10 tickets and ask the model to classify all of them. Most models handle this well up to about 20 items per batch.
- Pre-filtering: Don't send everything to the LLM. Use simple rules (regex, keyword matching, heuristics) to handle the easy 40% before the agent sees anything.
- Response length limits: Always set max_tokens. An agent drafting a support reply doesn't need 4,000 tokens of output. Cap it at 500 and you cut output costs by 80%.
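Combining requests is mostly prompt construction. A sketch of the ten-tickets-in-one-call approach (the label set and output format are assumptions; pair the prompt with a tight max_tokens):

```python
# Build one prompt that classifies many tickets in a single call.
def batch_classify_prompt(tickets):
    """Ask for a JSON array of labels, one per ticket, in order."""
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tickets))
    return (
        "Classify each ticket as billing, technical, or cancellation.\n"
        f"Return a JSON array of exactly {len(tickets)} labels, in order.\n\n"
        + numbered
    )
```

Send the result as the user message with max_tokens around 100 - ten classifications for barely more than the cost of one.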
A Real $50/Day Budget
Let's build an actual budget for a customer support agent handling 300 tickets per day:
| Step | Model | Calls/Ticket | Tokens In/Out | Cost/Ticket | Daily (300) |
|---|---|---|---|---|---|
| Classify intent | Haiku | 1 | 800/50 | $0.0003 | $0.09 |
| Retrieve docs | Haiku | 1 | 1,200/200 | $0.0006 | $0.17 |
| Draft response | Sonnet | 1 | 2,500/400 | $0.014 | $4.10 |
| Compliance check | Haiku | 1 | 1,500/100 | $0.0005 | $0.16 |
| Subtotal | | | | | $4.52 |
| Prompt caching savings | | | | | -$2.80 |
| Retry overhead (8% fail rate) | | | | | +$0.90 |
| Daily total | | | | | $2.62 |
That's $2.62 per day. Not $50. You could run 19 agents like this for $50.
The research agent is more expensive. Let's budget one that handles 50 deep research tasks per day:
| Step | Model | Calls/Task | Tokens In/Out | Cost/Task | Daily (50) |
|---|---|---|---|---|---|
| Plan research | Sonnet | 1 | 1,500/500 | $0.012 | $0.60 |
| Web search (5 queries) | Haiku | 5 | 1,000/200 | $0.003 | $0.15 |
| Synthesize findings | Sonnet | 1 | 8,000/2,000 | $0.054 | $2.70 |
| Format + citations | Haiku | 1 | 3,000/500 | $0.001 | $0.06 |
| Subtotal | | | | | $3.51 |
| Prompt caching savings | | | | | -$1.40 |
| Retry overhead (12% fail rate) | | | | | +$1.20 |
| Daily total | | | | | $3.31 |
Combined: $5.93 per day for both agents. The $50/day budget leaves room for 8-10x growth before you need to worry.
The Hidden Costs Nobody Mentions
API token costs are the obvious line item. But three other costs quietly eat your budget:
Tool call overhead. Every time your agent calls a tool (search API, database query, calculator), it generates a round-trip: the model outputs a tool call, you execute it, and you send the result back. That result goes into the context window for the next call. A study from Databricks' engineering team showed that tool-heavy agents can spend 40-60% of their token budget on tool result context rather than actual reasoning.
The fix: summarize tool results before injecting them. If your search API returns 5,000 tokens of results, condense to 500 before sending to the agent. You can use a cheap model for this summarization step.
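A minimal sketch of that compression step. Here `cheap_summarize` is a stand-in for a Haiku-class API call, and the character threshold and 500-token target are assumptions to tune:

```python
def cheap_summarize(prompt: str) -> str:
    # Stand-in for a cheap-model call; replace with a real client call.
    return prompt[:2_000]

def compress_tool_result(raw: str, max_chars: int = 2_000) -> str:
    """Pass small tool results through; summarize big ones cheaply."""
    if len(raw) <= max_chars:
        return raw  # already small enough to inject directly
    return cheap_summarize(
        f"Summarize the key facts in under 500 tokens:\n\n{raw}"
    )
```

The passthrough branch matters: don't pay for a summarization call when the tool result is already small.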
Observability costs. You need logging and monitoring to debug production agents. Services like Langfuse (open source) or Braintrust let you track every call, but storing and querying traces has its own cost. Budget $20-50/month for observability tooling, or self-host Langfuse on your existing infrastructure.
The "just in case" model. Teams often keep a powerful model as a fallback "just in case" the cheaper one fails. This is fine as an architecture pattern, but watch the fallback rate. If 30% of requests escalate to your expensive model, your routing isn't working - it's just adding an extra cheap call before every expensive one. Track your routing hit rate weekly. Aim for 80%+ of requests resolved by the cheap tier.
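Tracking that hit rate takes little more than a counter. A minimal sketch (the two tier labels are assumptions):

```python
from collections import Counter

class RoutingStats:
    """Weekly check: what fraction of requests stayed on the cheap tier?"""

    def __init__(self):
        self.counts = Counter()

    def record(self, tier: str):
        # tier is "cheap" or "escalated" in this sketch
        self.counts[tier] += 1

    def cheap_hit_rate(self) -> float:
        total = sum(self.counts.values())
        return self.counts["cheap"] / total if total else 0.0

stats = RoutingStats()
for tier in ["cheap"] * 9 + ["escalated"]:
    stats.record(tier)
print(f"{stats.cheap_hit_rate():.0%} resolved by the cheap tier")
# prints "90% resolved by the cheap tier"
```
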
What to Do Monday Morning
If you're running agents in production (or heading there), here's the priority order:
Week 1: Measure. You can't optimize what you don't track. Add token counting to every API call. Log input tokens, output tokens, model used, and whether the call succeeded. A spreadsheet works fine at this stage.
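The week-one logging can be one small function writing a CSV row per call. A sketch - the usage field names in the comment mirror Anthropic's response object, so adjust for your provider:

```python
import csv
import time

def log_call(path, model, input_tokens, output_tokens, ok=True):
    """Append one row per API call: timestamp, model, tokens, success."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [time.time(), model, input_tokens, output_tokens, ok]
        )

# After each API call, something like:
#   usage = response.usage
#   log_call("llm_calls.csv", model, usage.input_tokens, usage.output_tokens)
```

A day of rows in a spreadsheet pivot table tells you exactly where the money goes.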
Week 2: Cache. Enable prompt caching on your highest-volume endpoints. This is typically a one-line change and delivers the best ROI of any optimization.
Week 3: Route. Identify which agent steps actually need your expensive model. Move classification, extraction, and simple validation to your cheapest model. Test quality to make sure it holds.
Week 4: Budget. Implement per-run token budgets and retry limits. Set up alerts for when individual runs exceed 2x the expected cost. This protects you from the runaway agent scenario that turns a $5 day into a $500 day.
The teams that get agent economics right treat LLM calls like database queries - something you monitor, optimize, and budget for. The teams that don't treat them like magic boxes and act surprised when the invoice arrives.
Your agent doesn't need to be cheap. It needs to be predictable. The $50/day budget isn't a constraint - it's a design requirement that forces you to build something that actually scales.