Your AI agent is burning money on arithmetic.

That customer support agent you built? It routes every single request - "what's my order status," "explain your refund policy," "help me debug this complex integration" - through the same $15-per-million-token frontier model. The order status lookup takes the same model that handles the nuanced technical debugging. One of those tasks needs a $75/hour specialist. The other needs a lookup table.

This is where most production AI systems hemorrhage budget. Not on the hard problems, but on the easy ones they're overpaying to solve.

LLM routing fixes this by matching each request to the cheapest model that can handle it well. The pattern is simple in concept, surprisingly tricky in practice, and responsible for the single largest cost reduction most teams achieve after their initial deployment.

The Three-Tier Model Stack

Before you can route, you need to know what you're routing between. Most production systems settle on three tiers:

Tier 1 - Fast and cheap: Small models (Claude Haiku, GPT-4o-mini, Gemini Flash). These handle classification, entity extraction, simple Q&A, format conversion, and intent detection. They respond in under 500ms and cost pennies per thousand calls.

Tier 2 - Workhorse: Mid-range models (Claude Sonnet, GPT-4o). These handle content generation, summarization, code generation, multi-step instructions, and moderate reasoning. Good balance of quality and cost for the bulk of real work.

Tier 3 - Heavy reasoning: Frontier models (Claude Opus, o3, Gemini Pro). These handle complex analysis, ambiguous edge cases, multi-document synthesis, and tasks where being wrong is expensive. You want to call these as rarely as possible - not because they're bad, but because they're overkill for 80% of requests.

The exact models shift every few months. The tier structure doesn't. What matters is that you have a fast/cheap option, a capable default, and an expensive option you use sparingly.

Tier Latency Relative Cost Use When
1 - Fast < 500ms 1x Task has a known structure or expected output format
2 - Workhorse 1-3s 10-15x Task requires generation, synthesis, or following complex instructions
3 - Frontier 3-10s 30-75x Task is ambiguous, high-stakes, or requires multi-step reasoning

Stripe's engineering team documented a similar tiering approach for their internal AI tools, noting that moving classification tasks off their frontier model reduced API costs by 72% with no measurable quality drop.

Four Routing Strategies (From Simple to Sophisticated)

Strategy 1: Task-Based Static Routing

The simplest approach. You look at the task type and route to a predetermined model.

ROUTE_MAP = {
    "classify_intent": "haiku",
    "extract_entities": "haiku",
    "generate_reply": "sonnet",
    "summarize_document": "sonnet",
    "analyze_contract": "opus",
    "debug_code": "opus",
}

def route(task_type: str) -> str:
    return ROUTE_MAP.get(task_type, "sonnet")  # default to workhorse

This works when your tasks are well-defined and you can tag them before they hit the model. Most agent architectures already have this information - if your agent uses tool calls, each tool maps to a complexity tier.

When it breaks: When the same task type has wildly varying complexity. "Summarize this document" is easy for a two-page memo and hard for a 50-page technical spec with contradictory sections.

Strategy 2: Confidence-Based Cascading

Start cheap, escalate if needed. Send every request to Tier 1 first. If the output looks good, use it. If not, escalate to Tier 2 or Tier 3.

The trick is defining "looks good." For structured outputs, this is straightforward:

  • Classification: Does the model return a valid category with high logprob?
  • Extraction: Does the output match the expected schema? Are required fields populated?
  • Generation: Does the output meet a minimum length? Does it contain the required sections?
def cascading_route(prompt, schema=None):
    # Try cheap model first
    result = call_model("haiku", prompt)
    
    if schema and validates(result, schema):
        return result  # Cheap model nailed it
    
    if confidence_score(result) > 0.85:
        return result  # High confidence, good enough
    
    # Escalate to workhorse
    result = call_model("sonnet", prompt)
    
    if confidence_score(result) > 0.70:
        return result
    
    # Last resort: frontier model
    return call_model("opus", prompt)

The LogRocket engineering blog walks through a similar cascading pattern with detailed confidence thresholds for different task types.

The catch: You're now making multiple API calls on escalated requests. If your escalation rate is above 30%, the cascade costs more than just calling the workhorse directly. Track your escalation rate obsessively.

Strategy 3: Classifier-Based Routing

Use a small, fine-tuned model (or even a rule-based system) to predict which tier a request needs before calling any LLM.

This is where routing gets interesting. Your classifier looks at the input and predicts complexity:

  • Short, structured queries with clear intent → Tier 1
  • Open-ended requests with moderate context → Tier 2
  • Ambiguous, multi-part, or high-stakes requests → Tier 3

You can build this classifier with a fine-tuned small model, a few hundred labeled examples, and a weekend. Anyscale's research on LLM routing showed that a simple BERT-based classifier achieved 89% routing accuracy after training on just 500 labeled examples.

The real power comes from the feedback loop: log every routed request, have humans spot-check the cheap-model responses periodically, and retrain your classifier with the misroutes.

Strategy 4: Shadow Routing (The Bootstrap Strategy)

This is how you build the dataset for strategies 2 and 3 when you're starting from scratch.

Run your production traffic through the workhorse model as normal. In parallel (asynchronously, so it doesn't affect latency), send the same requests to the cheap model. Compare outputs. After a few thousand comparisons, you know exactly which request types the cheap model handles well.

async def shadow_route(prompt, task_type):
    # Production path - always runs
    primary = await call_model("sonnet", prompt)
    
    # Shadow path - async, no latency impact
    asyncio.create_task(
        shadow_compare("haiku", prompt, primary, task_type)
    )
    
    return primary

This approach is borrowed directly from how teams roll out new microservices - Cindy Sridharan's "Distributed Systems Observability" describes the same shadow traffic pattern for validating service replacements. The LLM version just swaps "new service" for "cheaper model."

Shadow routing costs more in the short term (you're paying for two model calls). But it's the fastest path to a trustworthy router because you're building your validation dataset from real production traffic, not synthetic benchmarks.

The Biggest Win: Not Calling a Model at All

Routing between models is good. Not calling a model at all is better.

Three patterns that eliminate model calls entirely:

Semantic caching: Hash the input (or compute an embedding) and check if you've seen a similar request recently. Customer support agents get the same 50 questions in different phrasings. Cache the answers. Gptcache and similar tools make this a drop-in optimization - teams report 20-40% cache hit rates on support workloads.

Deterministic short-circuits: If your agent's first step is always "classify this ticket," and you can classify with a regex or keyword match 60% of the time, skip the model call for those cases. Simple if/else logic before the router catches more than you'd expect.

Precomputed responses: For high-frequency, low-variance queries (order status, business hours, return policy), generate the response once and serve it from a template. Your agent doesn't need to reason about your return policy every time someone asks.

At one client deployment, we found that adding a semantic cache with a 0.95 similarity threshold eliminated 34% of model calls. Combined with routing the remaining calls across tiers, total LLM spend dropped 78% with no change in user-reported quality.

Building Your Router: A Practical Sequence

If you're starting from zero, here's the order that minimizes wasted effort:

Week 1-2: Instrument and measure. Before you route anything, log every LLM call with its task type, input length, output length, latency, and cost. You can't optimize what you don't measure. Most teams are shocked by their actual task distribution - it's almost always more skewed toward simple tasks than they assumed.

Week 3-4: Static routing by task type. Build the simple route map. Move your obvious Tier 1 tasks (classification, extraction, intent detection) to the cheap model. This alone typically cuts costs 30-50%.

Week 5-8: Shadow routing. Run shadow comparisons on your Tier 2 tasks to find which ones the cheap model handles acceptably. Expand your Tier 1 routing based on real data.

Week 9+: Confidence-based cascading. For tasks that fall between tiers, implement cascading with quality checks. This is where you squeeze out the remaining savings.

Ongoing: Cache and short-circuit. Layer in semantic caching and deterministic overrides as you identify high-frequency patterns.

The Martian team (YC W24) has published benchmarks showing this staged approach outperforms jumping straight to sophisticated routing. Their data shows simple task-based routing captures 70% of the total possible savings, with each additional strategy layer adding diminishing returns.

Don't skip the measurement step. I've seen teams spend a month building an elaborate classifier-based router only to discover that 90% of their traffic was already in one tier. A simple if/else would have gotten them 95% of the benefit.

What Goes Wrong

Over-routing to cheap models. The first time your Tier 1 model gives a customer a confidently wrong answer to a complex question, you'll understand why quality thresholds exist. Start conservative - route only the tasks you're certain about, then expand.

Ignoring latency. A cascade that calls three models sequentially is slower than calling the expensive model once. If your use case is latency-sensitive (real-time chat, live coding assistance), cascading might not be the right pattern. Task-based routing with no escalation is faster.

Treating routing as "set and forget." Models change. Pricing changes. The cheap model that couldn't handle summarization six months ago might handle it fine now. Re-run your shadow comparisons quarterly.

Building a router before you have enough traffic. If you're making 100 LLM calls a day, the engineering cost of a sophisticated router exceeds your potential savings. Start with static routing, optimize later. Chip Huyen's guidance on premature optimization applies directly here - don't build infrastructure for scale you don't have yet.

The right level of routing complexity is proportional to your LLM spend. If you're spending $500/month, a route map is fine. If you're spending $50,000/month, a classifier-based router with shadow validation pays for itself in days.

The Decision Matrix

Pick your starting strategy based on two factors: how well you can categorize your tasks, and how much you're spending.

Your Situation Start With Expected Savings
Well-defined task types, moderate spend Static routing 30-50%
Mixed task types, high spend Shadow routing → classifier 50-70%
High-volume repetitive queries Semantic cache + static routing 60-80%
Latency-critical, variable complexity Task-based routing (no cascade) 20-40%
Low volume (< 1K calls/day) Don't bother yet Focus on product-market fit

The pattern that works for most teams: start with static routing on day one, add caching on day thirty, and build toward classifier-based routing only when your monthly LLM bill makes the engineering investment worthwhile. Every dollar spent on routing infrastructure should save at least five dollars in model costs, or you're optimizing the wrong thing.