3 posts tagged with "model-selection"

How to Pick the Right LLM Before You Write a Single Prompt

10 min read
Tian Pan
Software Engineer

Most teams pick an LLM the same way they picked a database ten years ago: they look at a comparison table, pick the one with the highest score in the column they care about, and start building. Six months later, they're either migrating or wondering why their eval results look nothing like what users experience. The benchmark was right. The model was wrong for them.

The mistake isn't picking the wrong model — it's picking a model before you know what your actual production task distribution looks like. A benchmark tests what someone else decided matters. Your production system has a completely different distribution. These two things are not the same.

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

8 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.
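As a back-of-envelope illustration of why that matters, here is a minimal sketch of the blended-cost math. All prices, the request volume, and the 50% routing split are hypothetical assumptions for illustration, not figures from the post:

```python
# Hypothetical blended-cost sketch: route a share of traffic to a cheap tier.
# All prices and the routing split below are illustrative assumptions.

def blended_cost(requests, frontier_price, cheap_price, cheap_share):
    """Monthly cost if `cheap_share` of requests use the cheap model."""
    cheap = requests * cheap_share * cheap_price
    frontier = requests * (1 - cheap_share) * frontier_price
    return cheap + frontier

REQUESTS = 1_000_000   # monthly calls (assumed)
FRONTIER = 0.03        # $/call on the frontier tier (assumed)
CHEAP = 0.003          # $/call on the cheap tier (assumed)

all_frontier = blended_cost(REQUESTS, FRONTIER, CHEAP, 0.0)
routed = blended_cost(REQUESTS, FRONTIER, CHEAP, 0.5)  # route half of traffic

print(all_frontier)  # 30000.0
print(routed)        # 16500.0
```

Even with only half the traffic routed to a 10x-cheaper tier, the monthly bill drops by nearly half; the savings scale with whatever share of your traffic is "competent text processing."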

Reasoning Models in Production: When They Help and When They Hurt

9 min read
Tian Pan
Software Engineer

A team building a support triage system switched their classification pipeline from GPT-4o to o3. Accuracy improved by 2%. Costs went up by 900%. The latency jumped from 400ms to 12 seconds. They switched back.

This is the most common story in production AI right now. Reasoning models represent a genuine capability leap — o3 solved 25% of problems on the FrontierMath benchmark when no previous model had exceeded 2%. But that capability comes with a cost and latency profile that makes them wrong for the majority of tasks in the average production system. Knowing the difference is one of the more valuable things an AI engineer can internalize today.