How to Pick the Right LLM Before You Write a Single Prompt
Most teams pick an LLM the same way they picked a database ten years ago: they look at a comparison table, pick the one with the highest score in the column they care about, and start building. Six months later, they're either migrating or wondering why their eval results look nothing like what users experience. The benchmark was right. The model was wrong for them.
The mistake isn't picking the wrong model — it's picking a model before you know what your actual production task distribution looks like. A benchmark tests what someone else decided matters; your production system faces a completely different distribution of tasks.
Why Public Benchmarks Cannot Make This Decision for You
The most widely cited LLM benchmarks — MMLU, HumanEval, GPQA — have all saturated at the frontier. By mid-2024, every major model family scored above 90% on MMLU. A half-percentage-point difference between Claude and GPT-4o on that benchmark is statistical noise, not signal you can act on. Manual review of MMLU found an estimated 6.5% of questions contain ground-truth errors, which means the metric you're optimizing against is partially wrong.
HumanEval has the opposite problem: it tests self-contained algorithm puzzles, not the messy reality of debugging a 3,000-line TypeScript codebase or writing a migration script for a poorly documented API. Models that top HumanEval regularly fail on the kind of code generation tasks that shipping teams actually care about.
There's also contamination. Every model trained on large internet crawls has a non-trivial probability of having seen benchmark questions in its training data. You cannot tell from a published score whether you're looking at capability or memorization.
This isn't an argument against benchmarks. It's an argument for knowing what they measure. A benchmark only predicts production performance when the benchmark task distribution resembles your production task distribution. The further those distributions are from each other, the less the score tells you.
The Production Dimensions That Don't Appear in Any Leaderboard
These are the variables that routinely decide whether an LLM integration ships reliably or requires constant manual intervention. None of them appear consistently in public benchmark tables.
Function-calling reliability. If your system uses tool calls, structured extraction, or agent loops, this is probably your most important dimension. The Berkeley Function-Calling Leaderboard shows a 4–5 percentage point gap between top and bottom models at function-calling accuracy — from roughly 95% to 99.9% compliance. At 1,000 tool calls per day, that gap is roughly 50 failures vs. 1. The failure modes also differ: some models generate syntactically invalid JSON, others produce JSON that parses but doesn't match the schema, and others simply skip required fields under certain conditions.
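When you run your own evaluation, it pays to count these failure modes separately rather than lumping them into one error rate. A minimal sketch of that triage, assuming a hypothetical `required` field set taken from your own tool schema:

```python
import json

def classify_tool_call(raw: str, required: set) -> str:
    """Sort a model's tool-call output into the failure modes above:
    invalid JSON, wrong shape, or missing required fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"        # syntactically broken output
    if not isinstance(payload, dict):
        return "schema_mismatch"     # parses, but isn't an object
    missing = required - payload.keys()
    if missing:
        return "missing_fields"      # parses, but skips required fields
    return "ok"

# Tally failure modes over a batch of logged tool calls
calls = ['{"city": "Oslo", "unit": "C"}', '{"city": "Oslo"}', '{broken']
counts = {}
for raw in calls:
    mode = classify_tool_call(raw, {"city", "unit"})
    counts[mode] = counts.get(mode, 0) + 1
```

Separate counts matter because the mitigations differ: invalid JSON calls for constrained decoding, while missing fields usually call for prompt or schema changes.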
Structured output compliance. JSON mode and structured output aren't the same thing. Standard JSON mode typically produces 2–5% schema mismatch failures — the output parses, but required fields are missing or wrong types. Constrained decoding (where the model is only allowed to produce valid schema tokens) gets this below 0.1%. The critical question isn't "does this model support JSON mode?" but "which failure mode will I see, and how often?"
Refusal rate on your domain. Safety filters are calibrated against generic harm categories, not your domain. A model with aggressive safety filtering will refuse a benign legal, medical, or financial query at a rate that makes it unusable for specialized applications. The refusal behavior varies significantly across models and providers, and it depends on how you frame requests — not just what you're asking. You cannot predict this without testing on your actual content. One model might refuse 0.2% of medical queries; another might refuse 8%.
Context window behavior at limits. A 200k context window doesn't mean information placed anywhere in that window will be used equally. Research on the "lost in the middle" problem shows LLMs perform 30%+ worse when relevant information sits in the middle of a long context vs. at the beginning or end. The attention mechanism has a U-shaped preference for edge tokens. This means a model with a smaller context window, used thoughtfully with retrieval, may outperform a larger-window model naively stuffed with documents.
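You can test this directly with a needle-placement probe: put the same fact at the start, middle, and end of a long context and compare answer accuracy across positions. A minimal prompt builder for that experiment (the model call itself is left to your client library):

```python
def build_needle_prompt(filler_docs: list, needle: str, position: str) -> str:
    """Place `needle` at the start, middle, or end of a long context.

    Ask the same question against each variant; a large start/end
    vs. middle accuracy gap signals positional brittleness.
    """
    docs = list(filler_docs)  # copy so callers aren't mutated
    idx = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(idx, needle)
    return "\n\n".join(docs)
```

Run this at several total context lengths, not just one: positional degradation typically worsens as the context approaches the window limit.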
Latency at your concurrency level. Benchmark latency measurements are almost always single-request, warm-cache numbers. Your system runs concurrent requests, often with cold caches, at specific context lengths. Time-to-first-token and throughput diverge dramatically under load — the model that feels fast in testing can feel slow in production when there are 50 concurrent requests queuing. Test latency at your expected concurrency, not in isolation.
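A concurrency latency probe doesn't need much machinery. A sketch using a thread pool, with `fake_completion` standing in for your real provider client (the sleep simulates network plus inference time):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_completion(prompt: str) -> str:
    # Stand-in for your provider call; replace with the real client.
    time.sleep(0.01)
    return "ok"

def p95_latency(n_requests: int, concurrency: int) -> float:
    """Fire n_requests at `concurrency` parallel workers; return p95 seconds."""
    def timed(prompt: str) -> float:
        start = time.perf_counter()
        fake_completion(prompt)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, ["probe"] * n_requests))
    # quantiles with n=20 yields 19 cut points; index 18 is the p95
    return statistics.quantiles(latencies, n=20)[18]
```

Sweep `concurrency` up to (and past) your expected production peak; the point where p95 starts climbing steeply is the number that matters, not the single-request figure.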
Instruction adherence across long system prompts. If your system prompt is 1,000+ words with multiple constraints, models differ significantly in how consistently they follow all of them. Some models will "forget" constraints that appear early in the system prompt, especially under context pressure. This failure mode is nearly invisible in benchmarks but becomes the dominant failure mode in complex agentic systems.
The 48-Hour Adversarial Evaluation Sprint
Rather than relying on published benchmarks, the most production-ready teams run a focused evaluation before committing to a model. Forty-eight hours is enough time to get actionable data.
Day one: baseline and stratification. Compile 100–200 representative examples of your actual task distribution. If you have production traffic, sample from it; if not, construct examples that match your expected use cases. Stratify by difficulty: easy (cases where any decent model should succeed), medium (cases where you expect some variance), hard (edge cases and domain-specific queries that are genuinely difficult). Run all candidate models against this dataset. Record accuracy, latency, and cost per successful output.
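The day-one harness can be very small. A sketch that records accuracy, mean latency, and cost per successful output per difficulty stratum, assuming a flat per-call price and exact-match grading (both simplifications you'd replace with your own cost model and grader):

```python
import time
from collections import defaultdict

def run_eval(examples, model_fn, cost_per_call):
    """Run one candidate over (prompt, expected, difficulty) triples and
    report accuracy, mean latency, and cost per successful output per stratum."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "latency": 0.0})
    for prompt, expected, difficulty in examples:
        start = time.perf_counter()
        output = model_fn(prompt)
        elapsed = time.perf_counter() - start
        s = stats[difficulty]
        s["n"] += 1
        s["correct"] += int(output == expected)  # swap in a real grader here
        s["latency"] += elapsed
    report = {}
    for difficulty, s in stats.items():
        successes = max(s["correct"], 1)  # avoid division by zero
        report[difficulty] = {
            "accuracy": s["correct"] / s["n"],
            "mean_latency_s": s["latency"] / s["n"],
            "cost_per_success": cost_per_call * s["n"] / successes,
        }
    return report
```

Reporting per stratum is the point: two models with identical overall accuracy often split very differently between the medium and hard buckets, and the hard bucket is usually what predicts production pain.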
Day two: adversarial testing. Once you know the baseline, stress the models. Introduce realistic perturbations: typos in user input, rephrased versions of the same intent, boundary cases at the edges of your context window, concurrent requests to measure latency degradation. Test your domain-specific refusal risk: does the model handle your legitimate use cases without false refusals? Use PromptBench-style perturbation (small variations like synonyms, rephrasing) — research shows these simple perturbations cause 33% average performance drops across models, revealing brittleness that the baseline hides.
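Simple surface perturbations are cheap to generate programmatically. A sketch covering character typos, case noise, and whitespace changes (rephrasing and synonym swaps need a model or thesaurus and are left out here):

```python
import random

def perturb(text: str, seed: int = 0) -> list:
    """Generate simple PromptBench-style surface perturbations of one input."""
    rng = random.Random(seed)
    variants = []
    # 1. Swap two adjacent characters (a common typo)
    if len(text) > 3:
        i = rng.randrange(len(text) - 1)
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.append("".join(chars))
    # 2. Uppercase one word (case noise)
    words = text.split()
    if words:
        j = rng.randrange(len(words))
        words[j] = words[j].upper()
        variants.append(" ".join(words))
    # 3. Duplicate whitespace
    variants.append(text.replace(" ", "  "))
    return variants
```

Score each variant against the expected output for the original input: a model that is correct on the clean prompt but wrong on most variants is brittle in a way the baseline numbers never show.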
The output of this sprint should be a table comparing candidate models on the dimensions that actually matter for your system: function call accuracy, JSON schema compliance rate, refusal rate on your content, p95 latency at expected concurrency, and cost per successful output. This is your actual decision matrix.
The Decision You're Really Making
After running the evaluation, the decision usually isn't "which model is best" — it's "which combination of models is right for which requests."
No single model wins on all dimensions simultaneously. The model with the best reasoning depth is rarely the fastest. The cheapest model is rarely the most reliable at structured output. The most instruction-following model is rarely the one with the lowest refusal rate on specialized content.
The teams getting the best efficiency in production route requests by type rather than sending everything to one model:
- User-facing, latency-sensitive interactions: prioritize time-to-first-token over reasoning depth
- Complex multi-step reasoning or long-horizon planning: pay for inference compute, use a model optimized for extended reasoning
- High-volume batch processing (summarization, classification, extraction): optimize for cost per successful output, test whether a smaller model meets the quality bar
- Privacy-sensitive or regulated data: the routing decision isn't capability at all — it's whether the data can leave your infrastructure
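The routing layer itself can start as something very small. A minimal sketch of the routes above; the model names and task types are placeholders, and the point is that model selection lives in one table rather than being hardcoded at call sites:

```python
# Placeholder model names; swap in your actual providers.
ROUTES = {
    "interactive": "fast-model",       # prioritize time-to-first-token
    "reasoning": "deep-model",         # pay for extended reasoning
    "batch": "small-cheap-model",      # optimize cost per successful output
    "regulated": "self-hosted-model",  # data must not leave your infra
}

def route(task_type: str, contains_regulated_data: bool) -> str:
    """Pick a model for a request. The compliance constraint overrides
    capability-based routing entirely."""
    if contains_regulated_data:
        return ROUTES["regulated"]
    return ROUTES.get(task_type, ROUTES["interactive"])
```

Because the table is data, swapping a model after a pricing change or a new release is a one-line config edit, not a migration.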
This framing — routing by task type rather than picking a champion — is the architectural decision that matters. The specific model you choose today may not be the model you use in six months. Provider pricing, new model releases, and your own fine-tuning will shift the optimal routes. Build the routing layer first; treat model selection as a parameter, not a commitment.
What Actually Breaks Teams in Production
A review of documented production LLM failures shows a consistent pattern: teams underestimate the gap between "passes evals" and "works for users."
The most common failure modes aren't model quality problems — they're selection mismatches:
- Token cost surprises: A model that's 40% cheaper per token can be 3× more expensive per successful outcome if it has a 30% failure rate requiring retries or human review
- Rate limit mismatches: A team selects a model based on capability, then discovers their production throughput exceeds the tier's TPM limits, causing queue buildup and latency spikes during traffic peaks
- Refusal creep: A model that works in testing starts refusing more queries after a provider-side safety update. No notification, no version bump — the refusal rate just changes
- Structured output drift: JSON schema compliance degrades silently when the model receives inputs slightly outside its training distribution, causing downstream parsing failures that log as application errors, not model errors
These failures share a common root cause: the team evaluated the model under conditions different from production. They tested at low concurrency, with curated inputs, at a point in time before provider updates changed behavior.
The Metric That Should Drive Model Selection
The right unit of measurement isn't tokens per dollar — it's cost per successful outcome.
This includes model API cost, but also: retry overhead (how often does the model fail and require a retry?), error correction time (how often does an engineer need to debug a model failure?), latency cost (how does model speed affect user conversion or satisfaction?), and migration risk (how much does your prompt need to change if you switch models?).
A model that's 30% cheaper per token but produces outputs requiring 20% more engineering attention to handle edge cases may not be cheaper at all. Total cost of ownership for LLM systems is dominated by inference costs for high-volume applications, but dominated by engineering costs for complex applications. The benchmark for "best" model is different depending on which cost is larger for your system.
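The arithmetic behind this is worth making explicit. A simplified cost model, assuming failures are retried until success (geometric expectation: 1/success_rate calls) plus an expected human-review cost; all the numbers below are illustrative:

```python
def cost_per_success(price_per_call, success_rate,
                     review_cost=0.0, review_rate=0.0):
    """Expected cost per successful outcome: API retries until success,
    plus the expected cost of human review per output."""
    return price_per_call / success_rate + review_cost * review_rate

# Cheaper per call, but 30% failure rate and heavy review burden:
cheap = cost_per_success(0.006, 0.70, review_cost=0.50, review_rate=0.30)
# More expensive per call, but reliable:
pricey = cost_per_success(0.010, 0.98, review_cost=0.50, review_rate=0.02)
```

With these assumed numbers, the "cheap" model costs several times more per successful outcome than the "expensive" one, which is the pattern the token-price column on a pricing page will never show you.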
Before You Choose a Model, Choose Your Constraints
Model selection is a constraint satisfaction problem. Before you evaluate any models, write down your hard constraints:
- Data residency: Can inference calls leave your jurisdiction?
- Latency budget: What's your p95 target for end-to-end response time?
- Cost ceiling: What's the maximum cost per request you can absorb at expected volume?
- Quality floor: What accuracy rate makes the feature usable, not just technically functional?
- Compliance requirements: HIPAA, PCI, SOC2 — which apply, and which providers certify against them?
Models that don't satisfy your hard constraints can be eliminated without evaluation. The remaining candidates go through the 48-hour sprint. The result of that sprint, mapped against your decision matrix, gives you a defensible, evidence-based model selection.
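The elimination step is mechanical once the constraints are written down. A sketch with hypothetical candidate records whose fields mirror the constraints above (the thresholds are example values, not recommendations):

```python
# Hypothetical candidates; fields mirror the hard constraints above.
CANDIDATES = [
    {"name": "model-a", "in_region": True,  "p95_ms": 800,  "cost": 0.004, "hipaa": True},
    {"name": "model-b", "in_region": False, "p95_ms": 300,  "cost": 0.002, "hipaa": False},
    {"name": "model-c", "in_region": True,  "p95_ms": 2500, "cost": 0.001, "hipaa": True},
]

def satisfies_hard_constraints(m, p95_budget_ms=1200, cost_ceiling=0.005):
    # Any single failed hard constraint eliminates the model outright;
    # there is no weighting or trade-off at this stage.
    return (m["in_region"] and m["hipaa"]
            and m["p95_ms"] <= p95_budget_ms
            and m["cost"] <= cost_ceiling)

shortlist = [m["name"] for m in CANDIDATES if satisfies_hard_constraints(m)]
```

Only the shortlist proceeds to the 48-hour sprint, which keeps the expensive part of the evaluation focused on models you could actually ship.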
The teams that skip this process aren't wrong to ship faster — they're accepting the risk of a migration later. That migration will cost more than the evaluation sprint. Plan accordingly.
References
- https://myengineeringpath.dev/genai-engineer/llm-benchmarks/
- https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison
- https://gorilla.cs.berkeley.edu/leaderboard.html
- https://agenta.ai/blog/the-guide-to-structured-outputs-and-function-calling-with-llms
- https://aclanthology.org/2024.tacl-1.9/
- https://www.morphllm.com/lost-in-the-middle-llm
- https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/
- https://www.alphanome.ai/post/beyond-the-token-why-the-true-measure-of-llm-value-is-the-total-cost-per-successful-outcome
- https://www.zenml.io/blog/llmops-in-production-457-case-studies-of-what-actually-works
- https://portkey.ai/blog/rate-limiting-for-llm-applications/
- https://arxiv.org/abs/2405.02764
