
The Intent Classification Layer Most Agent Routers Skip

Tian Pan · Software Engineer · 11 min read

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.
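To make the pattern concrete, here is a minimal sketch of classify-then-dispatch. The registry layout, the `classify_intent` stub, and the tool names are illustrative assumptions, not any particular framework's API:

```python
# Illustrative tool registry: maps intent -> the tool schemas the LLM may see.
# Tool names and the classify_intent stub are hypothetical.
TOOLS_BY_INTENT: dict[str, list[dict]] = {
    "billing":  [{"name": "refund_payment"}, {"name": "lookup_invoice"}],
    "calendar": [{"name": "create_event"}, {"name": "find_free_slot"}],
    "hr":       [{"name": "pto_balance"}, {"name": "org_chart_lookup"}],
}
ALL_TOOLS = [t for tools in TOOLS_BY_INTENT.values() for t in tools]

def classify_intent(message: str) -> str | None:
    """Stand-in for a real classifier (keyword, embedding, or fine-tuned model)."""
    keywords = {"invoice": "billing", "refund": "billing",
                "meeting": "calendar", "pto": "hr"}
    for word, intent in keywords.items():
        if word in message.lower():
            return intent
    return None

def route(message: str) -> list[dict]:
    """Narrow the tool namespace before the LLM ever sees it."""
    intent = classify_intent(message)
    # The LLM still reasons and picks the final tool -- just over a curated
    # subset instead of the full catalog. Unknown intents fall back to all tools.
    return TOOLS_BY_INTENT.get(intent, ALL_TOOLS)

print(route("I need a refund on last month's invoice"))  # billing tools only
```

The important property is that the classifier only gates visibility; it never calls tools itself, so a misclassification degrades to the fallback behavior rather than a wrong action.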

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Why Nobody Adds This Layer Until They're Bleeding

The path of least resistance for tool dispatch is function calling: give the LLM every tool schema, let it pick. OpenAI, Anthropic, and Google all support this natively. It requires zero architecture. During prototyping with five or ten tools, it works fine. So it ships.

The problem is that tool sets grow. An agent that started with billing tools gains calendar scheduling, then HR lookup, then technical support workflows. By the time you have 50 tools you feel the friction; by 100 it's a real problem; by 200 you're in crisis mode. But by then the architecture decision is baked in.

There's also a philosophical argument against classifying first: LLM function calling is flexible. It handles ambiguous queries, multi-step intents, and novel combinations without a predefined taxonomy. An intent classifier requires you to enumerate categories upfront, which feels like constraint. This argument is valid at small scale. It doesn't survive at large scale.

The structural cause of the degradation is what researchers call "context rot." Transformer attention scales quadratically with context length. At 100K tokens, the model is managing ten billion pairwise relationships. When 400 tool schemas occupy 80K tokens of that context, the tools in the middle of the list become effectively invisible. Research across 18 frontier models found that every one degrades as input length grows, with tools positioned at the 50% context midpoint receiving dramatically less accurate selection than tools at the extremes—even when the correct tool is present in the context.

The "lost in the middle" effect isn't a model bug. It's a physics constraint. You can't solve it by prompting harder or by upgrading model versions. You solve it by not loading irrelevant tools in the first place.

The Token Economics of Skipping Classification

At 10 tools with 500 tokens each, you're loading 5,000 tokens of overhead per request. Manageable.

At 741 tools, the overhead hits 127,315 tokens per request before you've included the user's actual message. With a classification layer that narrows to relevant tools, that drops to 1,084 tokens—a 117x reduction. At 1 million requests per month, the difference is approximately $3.79 million per year in API cost at typical frontier model pricing.

The cost multiplier isn't just token count. Agent sessions accumulate context across turns. Every token you include in turn 1 of a 30-turn conversation is re-included in every subsequent turn, effectively multiplying its cost by 30. Wasting 1,000 tokens on irrelevant tool schemas in turn 1 costs 30,000 tokens over the session.
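A back-of-the-envelope check on these figures. The per-token price here is an assumption chosen for illustration (about $2.50 per million input tokens, which roughly reproduces the annual figure above); substitute your model's actual pricing:

```python
# Back-of-the-envelope check on the numbers above. The input-token price is an
# assumption for illustration; substitute your model's actual pricing.
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000   # assumed ~$2.50 per million input tokens

tool_overhead_naive = 127_315    # tokens of tool schemas at 741 tools
tool_overhead_routed = 1_084     # tokens after intent-based narrowing
requests_per_month = 1_000_000

annual_delta = (tool_overhead_naive - tool_overhead_routed) \
    * requests_per_month * PRICE_PER_INPUT_TOKEN * 12
print(f"~${annual_delta:,.0f}/year spent on tool schemas the LLM never needed")

# Multi-turn amplification: tokens included in turn 1 are re-sent every turn.
wasted_in_turn_1 = 1_000
turns = 30
print(f"{wasted_in_turn_1 * turns:,} tokens wasted over a {turns}-turn session")
```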

Wrong-namespace routing compounds this further. When a billing query routes through a context loaded with calendar and HR tools, the LLM either picks the wrong tool or returns an "I can't help with that" despite being fully capable—because the relevant tools aren't visible. The retry loop then consumes additional tokens and latency. A single misrouted call in an agent workflow can cascade into five or six API round-trips before recovery.

Teams discover this the hard way. A $50 proof-of-concept that works beautifully with a handful of tools becomes a $2.5 million per month production bill once user volume hits and tool sets expand. The economics don't change gradually; they cliff-edge when you cross certain tool count thresholds.

Three Approaches to Classification, and When to Use Each

The right classification approach depends on your tool count, traffic volume, and how well-defined your intent taxonomy is.

Embedding-based routing is the fastest option. Query embeddings are compared against pre-encoded example utterances for each intent category using cosine similarity. Latency runs 16–100ms. In documented production deployments, this approach reduced end-to-end routing latency from 5,000ms to 100ms while achieving 92–96% precision after iterative example refinement. Cost is roughly 65x lower than LLM-based classification: on the order of a penny per 10,000 queries versus ~$0.65. The limitation: embedding routers struggle with out-of-distribution queries and intents requiring compositional reasoning. They work best when your intent set is fixed and well-defined.
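A minimal embedding-router sketch. The sentence-transformers library, the model name, the example utterances, and the 0.5 threshold are all assumptions; any embedding model works the same way:

```python
# Embedding router: route to the intent whose example utterances are most
# similar to the query. Model choice and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

INTENT_EXAMPLES = {
    "billing":  ["I was double charged", "update my credit card", "where is my invoice"],
    "calendar": ["book a meeting with the team", "move my 3pm call"],
    "hr":       ["how much PTO do I have left", "who is my skip-level manager"],
}

# Pre-encode example utterances once, at startup; normalized vectors make
# the dot product equal to cosine similarity.
intent_vecs = {
    intent: model.encode(examples, normalize_embeddings=True)
    for intent, examples in INTENT_EXAMPLES.items()
}

def route(query: str, threshold: float = 0.5) -> str | None:
    q = model.encode([query], normalize_embeddings=True)[0]
    # Score each intent by its best-matching example (max cosine similarity).
    scores = {intent: float(np.max(vecs @ q)) for intent, vecs in intent_vecs.items()}
    best_intent, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_intent if best_score >= threshold else None  # None -> escalate

print(route("why was my card charged twice"))  # -> "billing"
```

The "iterative example refinement" mentioned above lives in `INTENT_EXAMPLES`: when a query routes wrong, you add it (correctly labeled) to the example set, and precision climbs without any retraining.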

Fine-tuned small models (SetFit, DistilBERT, ModernBERT) add training overhead but deliver better accuracy on nuanced intents. SetFit runs 56x faster than a frontier LLM at inference time and achieves F1 scores within 8–10% of the best LLMs at a tiny fraction of the cost. A SetFit model can be trained on 8 labeled examples in 30 seconds on commodity hardware. IBM Research's semantic router for vLLM uses ModernBERT as its intent classifier, achieving a 10.24 percentage point accuracy gain on MMLU-Pro with 47.1% latency reduction. Google's research team showed that a two-stage approach—a small model summarizes each interaction, then a fine-tuned small model extracts intent—matches Gemini Pro accuracy while enabling on-device, privacy-preserving classification.
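For the fine-tuned path, a rough SetFit sketch follows. The API details vary by setfit version (shown here with the classic SetFitTrainer interface), and the checkpoint name, labels, and examples are illustrative assumptions:

```python
# Few-shot intent classifier with SetFit; a handful of labeled examples per
# intent is enough to get a usable model. Label mapping: 0=billing, 1=calendar, 2=hr.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

train_ds = Dataset.from_dict({
    "text": ["I was double charged", "where is my invoice",
             "book a meeting for tomorrow", "move my 3pm call",
             "how much PTO do I have", "who approves my expenses"],
    "label": [0, 0, 1, 1, 2, 2],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

print(model.predict(["why was my card charged twice"]))  # -> [0], i.e. billing
```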

LLM-based classification is the highest-accuracy but highest-cost option. Using the main model (or a cheaper smaller model) as a dedicated classifier adds a full model call before tool dispatch. This is the right choice for genuinely ambiguous intents that require reasoning, but at scale it doubles latency and cost compared to embedding-based approaches. The practical use case is as the catch-all at the bottom of a classification cascade.
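When the LLM is the classifier, the trick is to constrain its output to your label set so the result can feed straight into the tool-namespace lookup. A minimal sketch, assuming the openai Python client and an arbitrary small model; the same idea works with any chat-model provider:

```python
# LLM-as-classifier: one extra model call that returns a label, not a tool call.
from openai import OpenAI

client = OpenAI()
INTENTS = ["billing", "calendar", "hr", "unknown"]

def llm_classify(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheaper model is usually enough for classification
        messages=[
            {"role": "system",
             "content": f"Classify the user's request into exactly one of: {', '.join(INTENTS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in INTENTS else "unknown"
```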

The cascade pattern captures the best of all three:

  1. Keyword filter (sub-millisecond) — handles high-frequency, unambiguous intents
  2. Embedding router (16–100ms) — covers the majority of queries
  3. Fine-tuned classifier (50–200ms) — handles ambiguous cases within the router's scope
  4. LLM catch-all (1–5 seconds) — handles novel or compositional intents the cascade couldn't confidently classify

The threshold for moving up the cascade is classifier confidence. Predictions above 0.8 confidence route automatically. Predictions between 0.5 and 0.8 route but flag for review. Below 0.5, escalate to the next stage or to a human. A hybrid implementation using this pattern reaches within 2% of native LLM accuracy at roughly 50% less latency on in-distribution data.
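A skeleton of that escalation logic is below. The stage functions are placeholders for the pieces sketched earlier (keyword filter, embedding router, fine-tuned classifier, LLM catch-all); only the threshold handling is the point here:

```python
# Cascade escalation under the thresholds above. Stage implementations are
# supplied by the caller; each returns (intent, confidence).
from typing import Callable, Optional

Stage = Callable[[str], tuple[str, float]]

AUTO, REVIEW = 0.8, 0.5  # thresholds described in the paragraph above

def cascade_classify(
    query: str,
    keyword_route: Callable[[str], Optional[str]],   # sub-millisecond exact matches
    stages: list[Stage],                              # e.g. [embedding_route, finetuned_classify]
    llm_fallback: Callable[[str], str],               # 1-5s catch-all
) -> tuple[str, bool]:
    """Returns (intent, needs_review)."""
    if (hit := keyword_route(query)) is not None:
        return hit, False
    for stage in stages:
        intent, confidence = stage(query)
        if confidence >= AUTO:
            return intent, False          # confident: route automatically
        if confidence >= REVIEW:
            return intent, True           # route, but flag for offline review
    return llm_fallback(query), False     # nothing confident enough: ask the LLM
```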

A practical heuristic: fewer than 15 tools, LLM function calling is fine. 15–50 tools, add an embedding router. More than 50 tools, you need a fine-tuned classifier. More than 100 tools, the classification layer is non-negotiable.

What the Classifier Actually Gates
