Skip to main content

6 posts tagged with "model-selection"

View all tags

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

· 10 min read
Tian Pan
Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the platform passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More

· 8 min read
Tian Pan
Software Engineer

A finance-led mandate to "switch to the smaller model" is one of the most reliable ways to raise your LLM bill quarter-over-quarter. The dashboard the procurement team is watching — cost per call, average tokens per request — keeps trending down. Meanwhile the invoice keeps trending up. By the time someone reconciles the two, six months of prompt iteration has been spent compensating for a model that's worse at the task, and the team is in too deep to walk it back without admitting the original switch was a mistake.

The mistake isn't about pricing. It's about the unit. Per-token price is a misleading axis when reasoning depth, retry count, and prompt size all vary by model. The right metric is tokens-per-successful-completion, and on that axis the cheaper model often loses.

How to Pick the Right LLM Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams pick an LLM the same way they picked a database ten years ago: they look at a comparison table, pick the one with the highest score in the column they care about, and start building. Six months later, they're either migrating or wondering why their eval results look nothing like what users experience. The benchmark was right. The model was wrong for them.

The mistake isn't picking the wrong model — it's picking a model before you know what your actual production task distribution looks like. A benchmark tests what someone else decided matters. Your production system has a completely different distribution. These two things are not the same.

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 9 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

Reasoning Models in Production: When They Help and When They Hurt

· 9 min read
Tian Pan
Software Engineer

A team building a support triage system switched their classification pipeline from GPT-4o to o3. Accuracy improved by 2%. Costs went up by 900%. The latency jumped from 400ms to 12 seconds. They switched back.

This is the most common story in production AI right now. Reasoning models represent a genuine capability leap — o3 solved 25% of problems on the Frontier Math benchmark when no previous model had exceeded 2%. But that capability comes with a cost and latency profile that makes them wrong for the majority of tasks in the average production system. Knowing the difference is one of the more valuable things an AI engineer can internalize right now.