11 posts tagged with "model-routing"

Lazy Evaluation in AI Pipelines: Stop Calling the LLM Until You Have To

11 min read
Tian Pan
Software Engineer

Most AI pipelines are written as if every request deserves a full LLM call. The user submits a message, the pipeline passes it to the model, waits for a response, and returns it — every time, unconditionally. This works, but it's expensive, slow, and often unnecessary.

The fraction of requests that actually require a full LLM inference is smaller than most engineers assume. Research on token-level routing shows that only about 11% of tokens differ between a 1.5B and a 32B parameter model, and only 4.9% of tokens are genuinely "divergent" — meaning they alter the reasoning path if handled by the smaller model. Production semantic caches show that 65% of incoming traffic is semantically similar to something the pipeline has already answered. These aren't edge cases. They're the majority of your traffic, and you're paying full price to handle them.

The fix is lazy evaluation: don't invoke the expensive model until you've confirmed that the expensive model is actually needed.
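One way to picture lazy evaluation is as a chain of progressively more expensive tiers, where each tier runs only if the previous one could not answer. The sketch below is illustrative, not the post's implementation: `cache_lookup`, `small_model`, `large_model`, and the `confidence_threshold` of 0.8 are hypothetical stand-ins for whatever cache, models, and confidence signal your pipeline actually has.

```python
from typing import Callable, Optional, Tuple


def answer_lazily(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    small_model: Callable[[str], Tuple[str, float]],
    large_model: Callable[[str], str],
    confidence_threshold: float = 0.8,  # hypothetical cutoff, tune per workload
) -> Tuple[str, str]:
    """Return (answer, tier), using the cheapest tier that can handle the query."""
    # Tier 1: a cache hit costs no inference at all.
    cached = cache_lookup(query)
    if cached is not None:
        return cached, "cache"

    # Tier 2: the small model answers and reports how confident it is.
    draft, confidence = small_model(query)
    if confidence >= confidence_threshold:
        return draft, "small"

    # Tier 3: only now is the expensive model invoked.
    return large_model(query), "large"
```

The returned tier label is there so you can log how often each tier actually handles traffic, which is the number that tells you whether the laziness is paying off.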

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

8 min read
Tian Pan
Software Engineer

Most teams optimize AI inference costs by routing simpler queries to cheaper models. That sounds reasonable, but in practice it runs backwards. The queries that get moved to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.

The 20% Problem in Model Routing: When Cost Optimization Creates Second-Class Users

9 min read
Tian Pan
Software Engineer

Your routing system works exactly as designed. Eighty percent of queries go to the cheap model; twenty percent escalate to the capable one. Latency is down, costs dropped by 60%, and leadership is happy. Then someone pulls the data by user segment, and you see it: users writing in non-native English are escalated at half the rate of native speakers, and their satisfaction scores are 18 points lower. The routing system treated the query complexity signal as neutral, but it wasn't — it was a proxy for language proficiency, and you've been giving a systematically worse product to a specific group of users for months.

This is the 20% problem. It's not a bug in the router. It's an emergent property of any cost-optimized routing system that nobody measures until it's too late.
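Measuring it does not take much. A minimal sketch, assuming your routing logs can be reduced to `(segment, was_escalated)` pairs (the segment labels and the log shape here are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def escalation_rate_by_segment(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Fraction of queries escalated to the capable model, per user segment."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for segment, was_escalated in records:
        totals[segment] += 1
        if was_escalated:
            escalated[segment] += 1
    return {segment: escalated[segment] / totals[segment] for segment in totals}
```

A global escalation rate of 20% can hide very different per-segment rates, so the comparison that matters is between the values in the returned dictionary, not their average.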

The Monday Morning AI Degradation Your Dashboard Treats As Noise

10 min read
Tian Pan
Software Engineer

Pull up your AI feature's latency and quality dashboards and squint. The line is mostly flat with occasional spikes your team has been calling "noise" or "provider weirdness" for months. Now break that same data out by hour-of-day and day-of-week. The noise resolves into a face: every Monday between 9 and 11am Eastern, your p95 latency is 30–60% worse than it is on a Saturday night, your cache hit rate dips 10–20 points, your retry rate doubles, and your token spend per task quietly climbs. The dashboard wasn't lying. It was averaging.

Most teams discover this pattern the way you discover a slow leak: by tracing the cost back from a quarterly bill nobody can explain. The instinct is to call it provider flakiness, file a ticket with the inference vendor, and move on. But the pattern isn't really about your LLM provider. It's about the fact that your AI feature now sits on top of a stack of shared, time-of-day-sensitive systems — the model API, the embedding API, the dependent SaaS tools your agent calls, the customer's own infrastructure on the receiving end of webhooks — and the cyclic load patterns of every one of them compose. You inherited the diurnal curve of an entire dependency chain, and your dashboard is showing you the average of all of them.
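The hour-of-day and day-of-week breakout can be sketched in a few lines. This assumes latency samples arrive as `(timestamp, latency_ms)` pairs with timestamps already converted to the timezone you care about; the nearest-rank p95 used here is one common definition, not necessarily the one your dashboard uses.

```python
import math
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def p95_by_weekday_hour(
    samples: Iterable[Tuple[datetime, float]],
) -> Dict[Tuple[int, int], float]:
    """Group latencies by (weekday, hour) and take p95 within each bucket.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[(timestamp.weekday(), timestamp.hour)].append(latency_ms)
    return {bucket: p95(latencies) for bucket, latencies in buckets.items()}
```

Comparing the `(0, 9)` and `(0, 10)` buckets against a quiet bucket such as `(5, 22)` is the quickest way to see whether the "noise" has a weekly shape.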

The 'Try a Bigger Model' Reflex Is a Refactor Smell

10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

LLM Model Routing Is Market Segmentation Disguised As A Cost Optimization

10 min read
Tian Pan
Software Engineer

The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.

What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why, because internally the change was filed as a FinOps win, not a product release.

Model Routing in Production: When the Router Costs More Than It Saves

10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap: a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not the expected 30%. The 40% it kept on the cheap model had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing; it just wasn't routing well.

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.
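A back-of-the-envelope model makes the mechanism concrete. The sketch below expresses routed cost relative to sending everything to the frontier model. The parameter names are hypothetical, and the example values are for illustration only: they are not the team's real figures and are not meant to reproduce the 12% increase.

```python
def relative_cost(
    escalation_rate: float,
    cheap_to_frontier_ratio: float,
    volume_multiplier: float = 1.0,
    router_overhead: float = 0.0,
) -> float:
    """Routed cost as a fraction of the all-frontier baseline (1.0 = break-even).

    escalation_rate: share of requests sent to the expensive model.
    cheap_to_frontier_ratio: cheap model cost per request / frontier cost.
    volume_multiplier: request volume after retries, relative to baseline.
    router_overhead: router cost per request, in units of frontier cost.
    """
    per_request = escalation_rate + (1.0 - escalation_rate) * cheap_to_frontier_ratio
    return volume_multiplier * (per_request + router_overhead)


# Illustrative inputs only: a cheap model at one fifth of the frontier price.
planned = relative_cost(escalation_rate=0.30, cheap_to_frontier_ratio=0.20)
# Same prices, but escalation doubles and retries add 30% more requests.
actual = relative_cost(
    escalation_rate=0.60, cheap_to_frontier_ratio=0.20, volume_multiplier=1.3
)
```

Under these assumed numbers the planned system costs a bit under half the baseline, while the miscalibrated one gives back most of the savings. If escalated queries also pay for a failed cheap attempt first, the per-request term grows further.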

Quality-Aware Model Routing: Why Optimizing for Cost Alone Wrecks Your AI Product

9 min read
Tian Pan
Software Engineer

Every team that ships LLM routing starts the same way: sort models by price, send easy queries to the cheap one, hard queries to the expensive one, celebrate the 60% cost reduction. Six weeks later, someone notices that contract analysis accuracy dropped from 94% to 79%, the coding assistant started hallucinating API endpoints that don't exist, and customer satisfaction on complex support tickets fell off a cliff — all while the routing dashboard showed "95% quality maintained."

The problem isn't routing itself. Cost-optimized routing treats all quality degradation as equal, when in practice the queries you're downgrading are disproportionately the ones where quality matters most.

Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

10 min read
Tian Pan
Software Engineer

The instinct is always to reach for the biggest model. Pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

10 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per-request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
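The arithmetic behind those figures is easy to check. The sketch below uses the per-million-token prices quoted above as defaults; the token counts per request are hypothetical, and the calculation deliberately ignores router overhead, retries, and any quality loss.

```python
from typing import Tuple

Prices = Tuple[float, float]  # (USD per 1M input tokens, USD per 1M output tokens)


def request_cost(input_tokens: int, output_tokens: int, prices: Prices) -> float:
    """Cost in USD of one request at the given per-million-token prices."""
    return (input_tokens * prices[0] + output_tokens * prices[1]) / 1_000_000


def blended_cost(
    cheap_share: float,
    input_tokens: int,
    output_tokens: int,
    frontier: Prices = (5.0, 25.0),
    efficient: Prices = (1.0, 5.0),
) -> float:
    """Average cost per request when cheap_share of traffic uses the efficient model."""
    cheap = request_cost(input_tokens, output_tokens, efficient)
    expensive = request_cost(input_tokens, output_tokens, frontier)
    return cheap_share * cheap + (1.0 - cheap_share) * expensive


# Hypothetical request shape: 1,000 input tokens and 500 output tokens.
baseline = blended_cost(0.0, 1_000, 500)   # everything on the frontier model
routed = blended_cost(0.85, 1_000, 500)    # 85% routed to the efficient model
savings = 1.0 - routed / baseline
```

Because both prices differ by the same 5x factor, the relative savings do not depend on the token mix: routing 85% of traffic to the efficient model leaves 0.15 + 0.85 × 0.2 = 0.32 of the baseline cost, a 68% reduction, which sits inside the 45–85% range.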