The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.
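As a minimal sketch of the pattern (the intent names, tool names, and `classify_intent` stub below are hypothetical placeholders, not from any specific framework), the layer sits between the user message and the LLM call:

```python
# Sketch of the classify-then-dispatch pattern. INTENT_TOOLS and
# classify_intent are placeholders; any classifier (embedding router,
# fine-tuned model, keyword rules) can fill the classify step.
INTENT_TOOLS = {
    "billing_inquiry": ["get_invoice", "refund_charge"],
    "scheduling": ["create_event", "list_events"],
}

def classify_intent(query: str) -> str:
    # Toy keyword rule standing in for a real classifier.
    return "billing_inquiry" if "invoice" in query.lower() else "scheduling"

def dispatch(query: str) -> dict:
    intent = classify_intent(query)
    # The LLM would be called here with only this narrowed tool list,
    # instead of every tool the agent owns.
    return {"intent": intent, "tools": INTENT_TOOLS[intent]}
```

The LLM still does all the reasoning; the only change is that its tool list shrank before the call was made.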

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Why Nobody Adds This Layer Until They're Bleeding

The path of least resistance for tool dispatch is function calling: give the LLM every tool schema, let it pick. OpenAI, Anthropic, and Google all support this natively. It requires zero architecture. During prototyping with five or ten tools, it works fine. So it ships.

The problem is that tool sets grow. An agent that started with billing tools gains calendar scheduling, then HR lookup, then technical support workflows. By the time you have 50 tools you feel the friction; by 100 it's a real problem; by 200 you're in crisis mode. But by then the architecture decision is baked in.

There's also a philosophical argument against classifying first: LLM function calling is flexible. It handles ambiguous queries, multi-step intents, and novel combinations without a predefined taxonomy. An intent classifier requires you to enumerate categories upfront, which feels like constraint. This argument is valid at small scale. It doesn't survive at large scale.

The structural cause of the degradation is what researchers call "context rot." Transformer attention scales quadratically with context length. At 100K tokens, the model is managing ten billion pairwise relationships. When 400 tool schemas occupy 80K tokens of that context, the tools in the middle of the list become effectively invisible. Research across 18 frontier models found that every one degrades as input length grows, with tools positioned at the 50% context midpoint receiving dramatically less accurate selection than tools at the extremes—even when the correct tool is present in the context.

The "lost in the middle" effect isn't a model bug. It's a physics constraint. You can't solve it by prompting harder or by upgrading model versions. You solve it by not loading irrelevant tools in the first place.

The Token Economics of Skipping Classification

At 10 tools with 500 tokens each, you're loading 5,000 tokens of overhead per request. Manageable.

At 741 tools, the overhead hits 127,315 tokens per request before you've included the user's actual message. With a classification layer that narrows to relevant tools, that drops to 1,084 tokens—a 117x reduction. At 1 million requests per month, the difference is approximately $3.79 million per year in API cost at typical frontier model pricing.

The cost multiplier isn't just token count. Agent sessions accumulate context across turns. Every token you include in turn 1 of a 30-turn conversation is re-included in every subsequent turn, effectively multiplying its cost by 30. Wasting 1,000 tokens on irrelevant tool schemas in turn 1 costs 30,000 tokens over the session.
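The arithmetic is easy to sanity-check. The sketch below uses the figures from this section (500 tokens per schema, 30-turn accumulation); the functions are illustrative, not from any billing library:

```python
# Reproduce the overhead figures cited above. Tool counts and the
# 500-tokens-per-schema average come from the text.
TOKENS_PER_SCHEMA = 500

def overhead(n_tools: int) -> int:
    # Token overhead of loading every tool schema into the context.
    return n_tools * TOKENS_PER_SCHEMA

def session_tokens(turn1_overhead: int, turns: int) -> int:
    # Context accumulates: turn-1 tokens are re-sent on every later turn.
    return turn1_overhead * turns

print(overhead(10))             # 5,000 tokens per request at 10 tools
print(session_tokens(1000, 30)) # 30,000 tokens wasted over a 30-turn session
```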

Wrong-namespace routing compounds this further. When a billing query routes through a context loaded with calendar and HR tools, the LLM either picks the wrong tool or answers "I can't help with that" despite being fully capable, because the relevant tools aren't visible. The retry loop then consumes additional tokens and latency. A single misrouted call in an agent workflow can cascade into five or six API round-trips before recovery.

Teams discover this the hard way. A $50 proof-of-concept that works beautifully with a handful of tools becomes a $2.5 million per month production bill once user volume hits and tool sets expand. The economics don't change gradually; they cliff-edge when you cross certain tool count thresholds.

Three Approaches to Classification, and When to Use Each

The right classification approach depends on your tool count, traffic volume, and how well-defined your intent taxonomy is.

Embedding-based routing is the fastest option. Query embeddings are compared against pre-encoded example utterances for each intent category using cosine similarity. Latency runs 16–100ms. In documented production deployments, this approach reduced end-to-end routing latency from 5,000ms to 100ms while achieving 92–96% precision after iterative example refinement. The cost works out to roughly $0.01 per 10,000 queries, versus ~$0.65 per 10,000 queries for LLM-based classification: roughly 65x cheaper. The limitation: embedding routers struggle with out-of-distribution queries and intents requiring compositional reasoning. They work best when your intent set is fixed and well-defined.
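A toy version of this router, using bag-of-words vectors in place of a real embedding model (the intents and example utterances are made up for illustration; swap in real sentence embeddings in practice):

```python
import math
from collections import Counter

# Pre-encoded example utterances per intent (hypothetical).
EXAMPLES = {
    "billing_inquiry": ["why was I charged", "refund my last invoice"],
    "scheduling": ["book a meeting tomorrow", "move my 3pm call"],
}

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str) -> tuple[str, float]:
    q = embed(query)
    # Compare against every example; return the best intent and its score.
    return max(
        ((intent, cosine(q, embed(ex)))
         for intent, exs in EXAMPLES.items() for ex in exs),
        key=lambda t: t[1],
    )
```

The returned similarity score is what the confidence thresholds later in this post gate on.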

Fine-tuned small models (SetFit, DistilBERT, ModernBERT) add training overhead but deliver better accuracy on nuanced intents. SetFit runs 56x faster than a frontier LLM at inference time and achieves F1 scores within 8–10% of the best LLMs at a tiny fraction of the cost. A SetFit model can be trained on 8 labeled examples in 30 seconds on commodity hardware. IBM Research's semantic router for vLLM uses ModernBERT as its intent classifier, achieving a 10.24 percentage point accuracy gain on MMLU-Pro with 47.1% latency reduction. Google's research team showed that a two-stage approach—a small model summarizes each interaction, then a fine-tuned small model extracts intent—matches Gemini Pro accuracy while enabling on-device, privacy-preserving classification.

LLM-based classification is the highest-accuracy but highest-cost option. Using the main model (or a cheaper smaller model) as a dedicated classifier adds a full model call before tool dispatch. This is the right choice for genuinely ambiguous intents that require reasoning, but at scale it doubles latency and cost compared to embedding-based approaches. The practical use case is as the catch-all at the bottom of a classification cascade.

The cascade pattern captures the best of all three:

  1. Keyword filter (sub-millisecond) — handles high-frequency, unambiguous intents
  2. Embedding router (16–100ms) — covers the majority of queries
  3. Fine-tuned classifier (50–200ms) — handles ambiguous cases within the router's scope
  4. LLM catch-all (1–5 seconds) — handles novel or compositional intents the cascade couldn't confidently classify

The threshold for moving up the cascade is classifier confidence. Predictions above 0.8 confidence route automatically. Predictions between 0.5 and 0.8 route but flag for review. Below 0.5, escalate to the next stage or to a human. A hybrid implementation using this pattern reaches within 2% of native LLM accuracy at roughly 50% less latency on in-distribution data.
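One way to wire those thresholds into the cascade, with each stage stubbed out (the stage functions and the exact cutoffs are illustrative, not from a specific system):

```python
from typing import Callable, Optional

# Each stage returns (intent, confidence), or None if it abstains.
Stage = Callable[[str], Optional[tuple]]

AUTO_ROUTE = 0.8     # route automatically
ROUTE_FLAGGED = 0.5  # route, but queue for human review

def cascade(query: str, stages: list) -> tuple:
    for stage in stages:
        result = stage(query)
        if result is None:
            continue
        intent, conf = result
        if conf >= AUTO_ROUTE:
            return intent, False      # confident: route silently
        if conf >= ROUTE_FLAGGED:
            return intent, True       # route, but flag for review
        # Below 0.5: fall through to the next (slower, smarter) stage.
    return "escalate_to_human", True  # nothing was confident enough

# Toy stages standing in for the keyword filter and the LLM catch-all.
keyword = lambda q: ("billing_inquiry", 0.95) if "invoice" in q else None
llm_catch_all = lambda q: ("general", 0.6)

print(cascade("invoice question", [keyword, llm_catch_all]))
```

Each stage only pays for the next one when it genuinely can't decide, which is what keeps the average-case latency near the fast end of the cascade.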

A practical heuristic: fewer than 15 tools, LLM function calling is fine. 15–50 tools, add an embedding router. More than 50 tools, you need a fine-tuned classifier. More than 100 tools, the classification layer is non-negotiable.

What the Classifier Actually Gates

A common misconception is that the classifier replaces the LLM's reasoning. It doesn't. The classifier answers a narrower question: which tool namespace is relevant to this query? The LLM then handles all the compositional reasoning, parameter extraction, and multi-step execution—but within a curated, relevant subset of tools.

The classifier output looks something like this:

{
  "intent": "billing_inquiry",
  "confidence": 0.92,
  "extracted_entities": [{ "name": "time_period", "value": "last month" }]
}

That intent field gates three things:

  • Tool namespace filtering: Only billing tools are loaded into the LLM context. Calendar and HR tools never appear, even if the agent technically has access to them.
  • System prompt selection: The LLM receives a billing-specialist system prompt rather than a generic agent prompt. Domain priming improves accuracy on domain-specific reasoning.
  • Agent delegation: In multi-agent architectures, the classified intent routes to the appropriate specialized sub-agent, each with its own tool set, prompt, and memory scope.
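Putting the three gates together in one dispatch step (the namespace, prompt, and agent tables below are hypothetical placeholders):

```python
# The classifier output gates tool namespace, system prompt, and sub-agent.
NAMESPACES = {"billing_inquiry": ["get_invoice", "refund_charge"]}
PROMPTS = {"billing_inquiry": "You are a billing specialist..."}
AGENTS = {"billing_inquiry": "billing_agent"}

def gate(classifier_output: dict) -> dict:
    intent = classifier_output["intent"]
    return {
        "tools": NAMESPACES[intent],       # only billing tools are loaded
        "system_prompt": PROMPTS[intent],  # domain-primed prompt
        "delegate_to": AGENTS[intent],     # specialized sub-agent
        # Pre-extracted entities can be forwarded to the tool call directly,
        # instead of asking the LLM to extract them a second time.
        "entities": classifier_output.get("extracted_entities", []),
    }

out = gate({"intent": "billing_inquiry", "confidence": 0.92,
            "extracted_entities": [{"name": "time_period", "value": "last month"}]})
```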

This is the GeckOpt pattern from Microsoft Research, validated on a real production deployment of 100+ GPT-4-Turbo nodes in a Copilot system. The offline phase builds intent-to-tool-subset mappings; the online phase classifies each request and loads only the relevant subset. Result: 24.6% token reduction with less than 1% accuracy degradation.

The entity extraction in the classifier output is also useful. If the classifier can reliably extract time_period: "last month" from billing queries, that structured value can be passed directly to the tool call rather than relying on the LLM to extract it again, eliminating one more source of hallucination.

The Feedback Loop That Makes Classification a Moat

A rule-based classifier built from enumerated intents is a static artifact. It will drift as user behavior evolves. The teams that build durable classification infrastructure instrument a feedback loop from the start.

The flywheel works like this:

  1. Every production request logs its predicted intent, confidence score, the tools actually invoked, and whether those tool calls succeeded or failed.
  2. Failed tool calls—wrong API called, parameters hallucinated, empty results—surface as candidate misclassifications.
  3. User feedback (thumbs down, corrections, rephrased follow-up queries) provides additional signal on where classification is wrong.
  4. Low-confidence predictions (below 0.7) are queued for human review.
  5. Human-labeled corrections flow back into the training set for fine-tuned classifiers, or into few-shot example pools for LLM-based classifiers.
  6. The classifier retrains on the expanded dataset—or for LLM-based classifiers, the example pool updates at query time via semantic retrieval.
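The minimum instrumentation for step 1 is a per-request log record. A sketch of what that record and the review queue might look like (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RoutingLog:
    # One record per production request, logged at dispatch time.
    query: str
    predicted_intent: str
    confidence: float
    tools_invoked: list
    tool_success: bool
    user_feedback: str = ""  # e.g. "thumbs_down", or a correction

def needs_review(rec: RoutingLog, threshold: float = 0.7) -> bool:
    # Queue low-confidence predictions and failed tool calls as
    # candidate misclassifications for human labeling.
    return rec.confidence < threshold or not rec.tool_success

rec = RoutingLog("refund last month", "billing_inquiry", 0.55,
                 ["refund_charge"], True)
print(needs_review(rec))  # True: below the 0.7 review threshold
```

Everything flagged by `needs_review` becomes the ranked worklist for human labeling, which is exactly the signal the retraining step consumes.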

The compounding effect is significant. Without a feedback loop, error rates in multi-turn agents can exceed 80% over 20 conversation turns as compound misclassification degrades context. With feedback loops operating at 80% coverage of user-corrected errors, that rate drops to under 40%.

More importantly, a fine-tuned classifier trained on your production traffic becomes specialized on your users' actual vocabulary, your domain's edge cases, and the failure modes specific to your tool set. An LLM doing zero-shot routing doesn't accumulate this specialization. The classifier gets better every week; the LLM stays frozen at its training cutoff.

The practical starting point: log classifier confidence and tool call outcomes from day one. You don't need a full labeling pipeline immediately. Even passive signal from tool call success rates gives you a ranked list of the intents where classification is weakest, which tells you exactly where to invest human review effort.

The System-Level Payoff

Intent classification is boring infrastructure. It doesn't appear in demos. It doesn't make your agent smarter. But it makes everything else work better.

When the LLM only sees tools relevant to the current query, it makes fewer errors not because it's more capable, but because the problem is actually easier. The right tools are visible. The distractors are gone. The context is short enough that attention works as designed.

At small tool counts, this doesn't matter much. At production scale—50+ tools, millions of requests per month, users with genuinely varied intents—it's the difference between a system that works and one that's simultaneously too slow, too expensive, and too unreliable to trust.

The classification layer also creates a clean interface between intent understanding and tool execution. That separation has downstream benefits: you can upgrade tools without retraining the router, swap the LLM backend without changing classification, and add new tool namespaces without polluting existing ones.

Most teams add this layer after their first production crisis. The ones who add it before never have the crisis.


The math on this is now well-established. Tool selection accuracy at 417 tools is 20% without classification. With a classification gate, the relevant tool is in the top-3 candidates 94%+ of the time. Token overhead drops by 99% in the 741-tool case. A hybrid classifier achieves within 2% of LLM accuracy at half the latency. The flywheel compounds improvement over time. The cost of adding this layer is a few hundred lines of infrastructure and some labeled examples. The cost of not adding it scales with your traffic.
