
The 'Try a Bigger Model' Reflex Is a Refactor Smell

10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly triples on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will steer around the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Why the Reflex Is So Sticky

The reflex is sticky because it works on the metric that's loudest in the room. Mean-time-to-recovery on quality issues drops; the team feels productive; the dashboard turns green. Meanwhile the metric that should be loudest — cost-per-correct-answer — lives on a different dashboard owned by FinOps, who sees the invoice monthly, while the eval pass rate updates hourly. The lag between the action and its true cost is exactly long enough for the habit to form before the bill arrives.

It's also sticky because the alternative is harder. Investigating an upstream cause means: pulling traces, running the closed-book test (does the model know the answer without retrieved context?), running the gold-context test (if you hand-craft the perfect context, does it answer correctly?), reading tool descriptions critically, diffing the system prompt against the last known-good version. Each of those takes thirty minutes of senior attention. A model swap takes a config flag flip and a re-run of the eval. Engineering effort is real money too, and the team is implicitly arbitraging compute cost against engineer-time cost in the wrong direction — because compute cost compounds across every future request, while debug time is a one-shot expenditure that pays the principal down.

The third reason it's sticky: the bigger model genuinely is more forgiving. A model two tiers up will tolerate a poorly-ordered prompt that a smaller model can't recover from. It will paper over a tool description that's ambiguous about which parameter is required. It will nod past a retrieval set with one bad chunk in it. The team mistakes that forgiveness for "the model fixed it," when what actually happened is the model absorbed the bug into its own slack, and the bug is still there waiting for the next regression — or, worse, waiting for the next time you try to roll the model back.

The Five Upstream Bugs Bigger Models Are Hiding

In production AI work, when a team reaches for a bigger model, the underlying bug is usually one of five shapes, and the bigger model masks rather than fixes each one.

Bad retrieval. Stanford research and field reports converge on the same finding: at scale, RAG quality issues are about retrieval architecture — chunk size, vector-space partitioning, top-k tuning, embedding-model mismatch — not about the generator's intelligence. A bigger model can sometimes squeeze a coherent answer out of one good chunk surrounded by three irrelevant ones, but the retrieval pipeline returning four irrelevant chunks is the bug, and a smaller model with a fixed retriever beats a bigger model with a broken one on both quality and cost.
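
One way to make that visible is to measure the retriever on its own, with no generator in the loop. A minimal sketch, assuming a hypothetical retriever.search(query, k) that returns chunk ids and a small labelled set of queries paired with their gold chunks; both names are stand-ins for whatever your pipeline exposes:

```python
# Sketch: retrieval recall@k, measured with no model in the loop.
# retriever.search(query, k) and the labelled-set shape are illustrative.

def retrieval_recall_at_k(retriever, labelled_queries, k=4):
    hits = 0
    for item in labelled_queries:
        retrieved = set(retriever.search(item["query"], k=k))
        # Count a hit only if every chunk the answer needs was retrieved.
        if set(item["gold_chunk_ids"]) <= retrieved:
            hits += 1
    return hits / len(labelled_queries)

# If this number is low, no generator tier fixes the route:
# the model never sees the evidence it needs.
```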

Tool descriptions the model misreads. Tool schemas are read by the model on every call; the description IS the prompt. A description that's ambiguous about whether customer_id is the internal UUID or the external billing ID will produce wrong tool calls 30% of the time on a smaller model and 8% of the time on a bigger one. Both rates are bugs. The fix is rewriting the description, not paying 4× to drop the rate from 30% to 8%.
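
The fix is often a one-line change to the description itself. A hedged before/after illustration in a generic function-calling-style schema; the tool names, field names, and id formats here are illustrative, not any particular provider's API:

```python
# Ambiguous: the model has to guess which identifier "customer_id" means.
lookup_tool_before = {
    "name": "lookup_customer",
    "description": "Look up a customer.",
    "parameters": {
        "customer_id": {"type": "string", "description": "The customer id."},
    },
}

# Disambiguated: the description says exactly which id, what it looks like,
# and what not to pass. The schema is the prompt; this is the fix.
lookup_tool_after = {
    "name": "lookup_customer",
    "description": "Look up a customer by their internal account UUID.",
    "parameters": {
        "customer_id": {
            "type": "string",
            "description": (
                "Internal account UUID, e.g. '6f1c...'. Never pass the "
                "external billing ID (format 'BILL-12345'); resolve that "
                "to a UUID first."  # example wording only
            ),
        },
    },
}
```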

System prompts with conflicting instructions. Datadog's recent state-of-AI-engineering analysis found that system prompts now consume around 69% of input tokens in customer traces. As prompts grow past 2K tokens with three teams editing different sections, instruction conflicts compound: a guardrail says "always cite sources" while a brand-voice rule says "keep responses to two sentences." Bigger models pick the right side of the conflict more often, but the conflict is the bug.

Eval sets that don't cover the failing distribution. If your eval pass rate is 92% but your user-perceived quality is 78%, the eval set has drifted from the live distribution. A bigger model will lift the eval pass rate to 96% while doing nothing for the live distribution that the eval doesn't measure. The team will celebrate; the users will keep complaining.

Lost-in-the-middle positioning. Models attend more to the top and bottom of long contexts than the middle. A retrieved chunk that lives in the middle of a 30K-token context window is functionally invisible to a smaller model and only intermittently visible to a bigger one. The fix is reordering or summarizing context, not paying for the model to brute-force its attention through worse positioning.
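
A cheap mitigation, short of shrinking the context, is to reorder retrieved chunks so the strongest evidence sits at the edges of the context rather than the middle. A minimal sketch, assuming chunks arrive from the retriever as (score, text) pairs with higher scores meaning more relevant:

```python
# Sketch: place the highest-scoring chunks at the top and bottom of the
# context, pushing the weakest ones toward the middle, where attention
# is weakest. Input is a list of (score, text) pairs.

def edge_order(scored_chunks):
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # best chunks at both edges, worst in the middle

# Usage (retrieved = list of (score, text) pairs from your retriever):
# context = "\n\n".join(text for _, text in edge_order(retrieved))
```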

The Closed-Book and Gold-Context Tests

Before any tier-up decision, the team should be able to answer two questions, and the answers gate the decision.

Closed-book test: does the current model answer correctly when you remove the retrieved context entirely? If yes, your retrieval is at best redundant and at worst the source of the noise the model is wading through. A bigger model won't change that.

Gold-context test: does the current model answer correctly when you hand-craft the perfect context — the exact chunks, in the right order, no distractors? If yes, the bug is in retrieval. The fix is upstream of the model. If no, the bug is in the prompt or in genuine model capability — only now is "tier up" a candidate hypothesis, and even then it's one hypothesis among "rewrite the prompt," "decompose the task," or "swap to a model with better instruction-following but the same tier."
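
In practice the two tests are the same eval harness run with different context arguments. A minimal sketch, where generate(), retrieve(), and is_correct() are placeholders for your model call, retrieval pipeline, and grader:

```python
# Sketch of the two gating tests. generate(), retrieve(), and is_correct()
# are placeholders for your model call, retrieval pipeline, and eval grader.

def run_gate_tests(eval_set, generate, retrieve, is_correct):
    closed_book = gold_context = as_deployed = 0
    for case in eval_set:
        q, gold_chunks = case["question"], case["gold_chunks"]
        # Closed-book: no retrieved context at all.
        closed_book += is_correct(generate(q, context=""), case)
        # Gold-context: hand-picked perfect chunks, in order, no distractors.
        gold_context += is_correct(generate(q, context="\n\n".join(gold_chunks)), case)
        # As deployed: whatever the live retriever returns today.
        as_deployed += is_correct(generate(q, context="\n\n".join(retrieve(q))), case)
    n = len(eval_set)
    return {"closed_book": closed_book / n,
            "gold_context": gold_context / n,
            "as_deployed": as_deployed / n}

# Reading the result: gold_context high but as_deployed low -> fix retrieval.
# gold_context low -> prompt or capability; only now is tier-up a hypothesis.
```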

Teams that institutionalize these two tests find that 70–80% of regressions resolve at the prompt or retrieval layer. The remaining tier-up cases are now defensible: there's a written hypothesis ("the gold-context test fails on multi-hop reasoning, the smaller model can't sustain the chain"), and the tier-up is a measured response to a specific upstream finding rather than a Pavlovian button-press.

The Tier-Up Gate

The discipline that has to land is a tier-up gate — a lightweight process that turns model-tier changes from a config flag into a decision with a paper trail. The gate isn't bureaucracy; it's the same shape as the deploy gate engineers already accept for production code.

The gate has four parts.

A written hypothesis for why a bigger model would help, naming the failure mode and the closed-book / gold-context test results. "User reports show wrong product recommendations; gold-context test passes 95%; retrieval test passes 60%; tier-up is not the right intervention" should be the conclusion as often as "tier-up is justified."

A cost projection that translates the tier-up into a per-request and monthly delta at current QPS, attributed to the feature team's budget so the trade-off is visible at decision time rather than at invoice time.

A senior-engineer review on every tier-up, not as approval theatre, but because tier-up decisions compound, and a peer reviewer catches the "we already tier-upped this route last quarter for the same kind of issue" pattern that the original engineer can't see in their own ticket history.

A regression postmortem thirty days after the change, asking the same question retroactively: with hindsight, was the upstream issue findable? If yes, what would have surfaced it earlier?
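
The paper trail doesn't need tooling beyond a record the team actually fills in before the config flag flips. A sketch of what that record might look like; the field names are illustrative, not a prescribed schema:

```python
# Sketch: a tier-up decision record. Field names are illustrative only;
# the point is that every field is filled in before the model tier changes.
from dataclasses import dataclass
from datetime import date

@dataclass
class TierUpDecision:
    route: str                    # e.g. "support-agent/returns"
    hypothesis: str               # why a bigger model should help, in one sentence
    closed_book_pass_rate: float  # gate-test results that motivated the change
    gold_context_pass_rate: float
    cost_delta_per_request: float # projected $ increase per request at current QPS
    cost_delta_per_month: float
    reviewer: str                 # senior engineer who signed off
    postmortem_due: date          # 30 days out: was the upstream issue findable?
```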

Without the gate, tier-up is the path of least resistance and becomes the default. With the gate, it's still available — sometimes it's correct — but the team has paid the diagnostic price first, and the upstream bugs get fixed instead of buried.

The FinOps Counter-Reflex

The finance side of this isn't about reading the bill; it's about creating a counter-reflex strong enough to balance the engineering reflex. Three patterns work.

Cost-per-correct-answer on the same dashboard as the eval pass rate. When the dashboard shows "92% pass rate, $0.04 per correct answer" instead of just "92% pass rate," the tier-up that lifts the pass rate to 94% at $0.16 per correct answer is visibly a 4× regression on the cost axis to buy a 2-point quality gain. That trade-off might still be worth it for a high-stakes feature; it's almost never worth it for a high-volume one.
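
The metric itself is a one-liner to compute, which is part of the argument for putting it next to the pass rate. A sketch of the arithmetic, with numbers chosen to match the illustrative example above:

```python
# Cost per correct answer: total spend on a route divided by correct answers.
def cost_per_correct_answer(total_spend, total_requests, pass_rate):
    correct_answers = total_requests * pass_rate
    return total_spend / correct_answers

# Illustrative numbers only, in the spirit of the example above:
before = cost_per_correct_answer(total_spend=3_680, total_requests=100_000, pass_rate=0.92)
after = cost_per_correct_answer(total_spend=15_040, total_requests=100_000, pass_rate=0.94)
# before ~= $0.04, after ~= $0.16: a 2-point quality gain bought at 4x the cost per correct answer.
```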

Tier-up budget as a feature-team line item. When the team that owns the route owns the tier-up cost as part of their quarterly budget, the trade-off lands at planning time, not at invoice time. The same engineers who reflexively tier-upped when the cost was abstract will reflexively look upstream when the cost is concrete and theirs.

A tier-down review. Twice a year, walk the routes that tier-upped and ask: with the prompt and retrieval improvements that landed since, can this route tier-down again? Most teams have never asked the question, and at least 20–30% of routes by volume are over-served at the tier they're on. The exercise is uncomfortable because it admits the original tier-up may have been a workaround, but the cost reclaim is real and so is the discipline-building effect of having to rationalize each tier choice.

The Architectural Realization

Model tier is the most expensive debugging tool in your stack — measured per-request, per-day, and per-quarter. A team that reaches for it first is making a leverage choice they didn't price: trading a one-time engineering investment in upstream fixes for a permanent compute multiplier on every future request through that route. Across a year of growing traffic, the math is rarely close.

The fix isn't a ban on tier-ups; sometimes the bigger model is genuinely the right answer, and the team that refuses to ever tier up will under-serve a class of legitimately hard tasks. The fix is making tier-up a deliberate response to an investigated failure rather than a reflexive response to an uninvestigated one. The closed-book test, the gold-context test, the tier-up gate, the cost-per-correct-answer dashboard, and the tier-down review are the supporting structure. Each of them costs less than one month of unnecessary tier-up on a single route.

The architectural realization, once the discipline lands, is that the team's debugging vocabulary expands. Instead of "the model is bad, try a bigger one," the conversation becomes "the retrieval is returning stale chunks for queries with this shape," or "this tool description has been ambiguous since the v2 change," or "the eval set hasn't been refreshed since the customer cohort shifted." Those are conversations that produce code changes, prompt changes, and infrastructure changes that compound over time. "Bump to the bigger tier" is a conversation that produces an invoice.
