A Year of Building with LLMs: What the Field Has Actually Learned
Most teams building with LLMs today are repeating mistakes that others made a year ago. The most expensive one is mistaking the model for the product.
After a year of LLM-powered systems shipping into production — codegen tools, document processors, customer-facing assistants, internal knowledge systems — practitioners have accumulated a body of hard-won knowledge that's very different from what the hype cycle suggests. The lessons aren't about which foundation model to choose or whether RAG beats finetuning. They're about the unglamorous work of building reliable systems: how to evaluate output, how to structure workflows, when to invest in infrastructure versus when to keep iterating on prompts, and how to think about differentiation.
This is a synthesis of what that field experience actually shows.
The Model Is the Least Durable Part of Your System
Here's the single most consistent finding from teams that have shipped: the model is not where your competitive advantage lives, and it's the component most likely to be commoditized or disrupted.
The evidence is stark. BloombergGPT — trained from scratch by a 9-person team on 363 billion domain-specific tokens — was outperformed by GPT-3.5-turbo within a year of its release. Teams that finetuned aggressively in 2023, before prompting had hit its ceiling, almost uniformly describe that investment as premature. The model they finetuned on shipped a version update, their evaluations broke, and they were back to baseline.
The cost of a given level of model capability has collapsed since the first commercial APIs launched: on the order of a 100× price reduction in roughly 18 months for comparable capability. The practical implication: any differentiation built on "better model access" evaporates quickly. What doesn't evaporate is infrastructure: your evaluation harness, your guardrails, your caching layer, your data flywheel.
The teams building durable systems have internalized this. They treat model selection as a cost/performance trade-off decision — nearly commodity — and focus engineering effort on the system that wraps it. The moat is in evals, data collection, and domain-specific workflows. Not the weights.
Evaluation Is Engineering Work, Not QA Theater
The most common failure mode in LLM deployments is treating evaluation as an afterthought — something you add when the system "feels about ready." Teams that do this end up with a proliferation of vibes ("it seems to be working better") and no ability to safely iterate.
Proper eval infrastructure is engineering work, and it requires the same rigor as any other system component.
A few concrete approaches that work in practice:
Assertion-based unit tests from production samples. Review actual input/output pairs from production daily. When you find a failure, write a test for it. Aim for at least three distinct assertions per test case. This isn't glamorous, but it's the fastest path to a reliable eval suite that tracks what users actually care about.
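A minimal sketch of what such a test looks like, assuming a hypothetical `summarize_meeting()` pipeline function (the stub here stands in for the real LLM call so the example runs):

```python
def summarize_meeting(transcript: str) -> str:
    # Stand-in for the real LLM pipeline; returns a canned output so
    # this sketch is self-contained and runnable.
    return "DECISIONS: ship v2 on Friday. OWNER: dana."

def test_summary_from_prod_sample_0142():
    # An input/output pair harvested from a real production failure,
    # then frozen as a regression test. Note the three distinct
    # assertions: section presence, content fidelity, length budget.
    transcript = "...we agreed dana ships v2 friday..."
    summary = summarize_meeting(transcript)
    assert "DECISIONS:" in summary           # required section is present
    assert "dana" in summary.lower()         # the named owner survives
    assert len(summary.split()) < 120        # stays within length budget

test_summary_from_prod_sample_0142()
```

Each production failure becomes one such test; the suite grows in exactly the directions users actually stress.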
Pairwise comparisons over Likert scales. When using LLM-as-Judge to automate evaluation at scale, pairwise comparisons ("which of these two responses is better?") are dramatically more reliable than asking models to assign scores on a 1–5 scale. They're also cheaper: one team reduced annotation costs from $25 per unit to $3.50 by switching to pairwise methodology. Pair the comparisons with chain-of-thought reasoning, and swap the order of the two responses across trials to control for positional bias.
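A sketch of the pairwise pattern with order randomization, where `judge()` is a hypothetical stand-in for the real model call (here it just prefers the longer response, purely so the sketch runs):

```python
import random

JUDGE_PROMPT = (
    "Think step by step, then answer with exactly 'A' or 'B'.\n"
    "Which response better answers the question?\n"
    "Question: {q}\nResponse A: {a}\nResponse B: {b}\n"
)

def judge(prompt: str) -> str:
    # Stand-in for a real LLM-as-Judge call; a toy heuristic (prefer the
    # longer response) keeps the sketch self-contained.
    a = prompt.split("Response A: ")[1].split("\nResponse B: ")[0]
    b = prompt.split("Response B: ")[1].strip()
    return "A" if len(a) >= len(b) else "B"

def pairwise_winner(question: str, r1: str, r2: str) -> str:
    # Randomize which response is shown first to control position bias.
    first_is_r1 = random.random() < 0.5
    a, b = (r1, r2) if first_is_r1 else (r2, r1)
    verdict = judge(JUDGE_PROMPT.format(q=question, a=a, b=b))
    if first_is_r1:
        return r1 if verdict == "A" else r2
    return r2 if verdict == "A" else r1
```

Aggregating many such randomized verdicts per pair gives a win rate that is far more stable than averaged 1–5 scores.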
The intern test. Before debugging the model, ask: could a competent new hire succeed at this task with the same context I gave the model? If yes, the failure is a model problem. If no, the failure is a context problem — and adding more context or better retrieval will likely fix it faster than prompt engineering.
Don't over-optimize for a single metric. One team spent weeks improving Needle-in-Haystack scores and ended up with a model that was worse at summarization. Eval suites need to cover the full surface area of what matters, not the metric that's easiest to measure.
Multi-Step Workflows Beat Single Prompts, Consistently
A persistent misconception: that better models make complex prompts unnecessary. In practice, the opposite tends to hold — as task complexity grows, structured multi-step workflows outperform single-shot prompting regardless of model capability.
The AlphaCodium result illustrates this clearly. GPT-4's accuracy on competitive programming benchmarks jumped from 19% to 44% by adding structure: reflect on the problem, reason about test cases, generate a solution, rank alternatives, iterate with synthetic tests. The underlying model was unchanged. The workflow made the difference.
This pattern shows up across domains:
- A meeting summarizer broken into "extract decisions" → "verify consistency" → "generate summary" is more reliable and easier to debug than a single mega-prompt
- Code generation pipelines that reflect on edge cases before producing final output outperform direct generation
- Document classifiers that first identify document type, then apply type-specific extraction logic, significantly reduce errors
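The meeting-summarizer decomposition above can be sketched as three observable stages, with `llm()` standing in for a real model call (a stub here so the example runs):

```python
def llm(prompt: str) -> str:
    # Stub for a real model API call; keeps the sketch self-contained.
    return "stubbed model output"

def extract_decisions(transcript: str) -> str:
    return llm(f"List every decision made in this meeting:\n{transcript}")

def verify_consistency(transcript: str, decisions: str) -> str:
    return llm(
        "Check each decision against the transcript and drop any that "
        f"are unsupported.\nTranscript:\n{transcript}\nDecisions:\n{decisions}"
    )

def generate_summary(decisions: str) -> str:
    return llm(f"Write a short summary organized around:\n{decisions}")

def summarize(transcript: str) -> dict:
    # Returning every intermediate makes each stage testable and
    # debuggable in isolation, unlike a single mega-prompt.
    decisions = extract_decisions(transcript)
    verified = verify_consistency(transcript, decisions)
    summary = generate_summary(verified)
    return {"decisions": decisions, "verified": verified, "summary": summary}
```

When quality drops, you can pinpoint which stage regressed instead of rereading one opaque prompt.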
The engineering implication: invest early in workflow architecture. Structured, deterministic pipelines where each step is observable and testable are fundamentally easier to debug and improve than opaque single prompts. Reserve autonomous agent loops for genuinely open-ended tasks where predefined workflows can't cover the problem space.
RAG First, Finetune Later (If Ever)
A common early mistake: reaching for finetuning when the system doesn't have enough knowledge. In almost all cases, retrieval-augmented generation is the right first move.
The evidence is consistent across benchmark comparisons: RAG outperforms finetuning for injecting new, updateable knowledge. Finetuning bakes knowledge into weights — it's expensive to update, harder to audit, and doesn't give the model the ability to cite sources. RAG makes knowledge explicit, updateable, and traceable.
A few nuances practitioners have learned:
Vector embeddings don't magically solve search. Pure semantic search underperforms hybrid approaches in most production settings. BM25 keyword search should be the baseline — it's fast, interpretable, and handles exact-match queries that embeddings frequently miss. Hybrid retrieval (BM25 + embeddings, combined with reranking) consistently outperforms either alone.
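One common way to combine the two ranked lists is reciprocal rank fusion (RRF); the hybrid recommendation above doesn't mandate RRF specifically, but it is a standard, simple fusion choice (k=60 is the conventional constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores 1/(k + rank) per list it appears in; documents
    # ranked highly by both BM25 and embeddings rise to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc3", "doc1", "doc7"]       # keyword search results
embedding_ranked = ["doc1", "doc3", "doc9"]  # semantic search results
fused = rrf([bm25_ranked, embedding_ranked])
# doc1 and doc3 appear high in both lists, so they lead the fused ranking.
```

A reranker can then be applied to the fused top-k, as the paragraph above suggests.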
Retrieval quality has three axes: relevance (are you retrieving the right things?), information density (how signal-dense is each chunk?), and detail level (does the retrieved content have sufficient specificity for the task?). Most teams optimize only for relevance and ignore the other two.
Long-context models don't eliminate RAG. Even with 128K or 200K context windows, relevance filtering and cost constraints mean retrieval remains valuable. Dumping an entire document corpus into context is expensive and often degrades performance — models lose focus in long contexts.
Finetune only after prompting has hit a clear ceiling — roughly when you're at ~90% of required quality and have exhausted prompt engineering, few-shot examples, and retrieval improvements. Most teams never reach that ceiling before the next model generation resets the baseline.
Operational Lessons That Only Show Up in Production
Some lessons don't appear until a system has been running long enough to accumulate real usage patterns.
Pin your model versions. A platform migration between GPT-3.5 versions caused a 10% performance drop at one company — a silent regression with no code change. Production LLM applications should pin to specific model versions and treat upgrades as migrations that require re-evaluation.
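A tiny CI guard makes the pinning rule enforceable, assuming OpenAI-style identifiers where a trailing date suffix (e.g. "gpt-4-0613") marks a pinned snapshot:

```python
import re

def is_pinned(model: str) -> bool:
    # Treat a trailing 4-8 digit group as a version/date pin; names
    # without one are floating aliases that can silently upgrade.
    return re.search(r"-\d{4,8}$", model) is not None

assert is_pinned("gpt-4-0613")
assert not is_pinned("gpt-4")  # floating alias: upgrades under you
```

A check like this in config validation turns "someone forgot to pin" from a silent regression into a failed build.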
Monitor distribution shift on two axes. Track structural drift (formatting, field names, casing in inputs) and semantic drift (topic or query distribution, detectable via embedding clustering). Most monitoring setups catch the former and miss the latter.
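Semantic drift detection can be sketched as comparing the centroid of recent query embeddings against a frozen baseline centroid; `embed()` here is a toy stand-in for a real embedding model:

```python
import math

def embed(text: str) -> list[float]:
    # Toy 2-d "embedding" (length, vowel count) so the sketch runs;
    # a real system would call an embedding model here.
    return [float(len(text)), float(sum(c in "aeiou" for c in text))]

def centroid(vectors: list[list[float]]) -> list[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drift_alarm(baseline_queries, recent_queries, threshold=0.95) -> bool:
    base = centroid([embed(q) for q in baseline_queries])
    now = centroid([embed(q) for q in recent_queries])
    # Alarm when the recent centroid moves away from the baseline.
    return cosine(base, now) < threshold
```

The threshold is an assumption to tune per application; clustering recent embeddings and watching for new clusters is a natural next step.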
Use the smallest model that works. In a documented case, Haiku with 10-shot prompting outperformed zero-shot Opus and GPT-4 on the same task — at a fraction of the cost. The lesson isn't to always use small models; it's that model selection should be empirical, not aspirational. Bigger isn't automatically better.
Structure outputs from the start. Free-form LLM output that gets parsed downstream is a maintenance burden and a fragility surface. Libraries like Instructor (for API use) and Outlines (for self-hosted models) enforce structured generation and eliminate an entire class of brittle parsing code.
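Libraries like Instructor enforce this at generation time via a schema; the underlying idea can be sketched with the standard library alone, parsing model output into a typed structure and failing fast (the `Invoice` schema is purely illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class Invoice:
    vendor: str
    total_cents: int

def parse_invoice(raw: str) -> Invoice:
    # Fail fast on malformed output instead of regex-scraping free text.
    data = json.loads(raw)  # raises immediately on invalid JSON
    if not isinstance(data.get("vendor"), str):
        raise ValueError("missing or non-string 'vendor'")
    if not isinstance(data.get("total_cents"), int):
        raise ValueError("missing or non-integer 'total_cents'")
    return Invoice(vendor=data["vendor"], total_cents=data["total_cents"])

inv = parse_invoice('{"vendor": "Acme", "total_cents": 1299}')
```

The point is that a bad model output surfaces as one loud exception at the boundary, not as corrupted state three components downstream.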
Cache aggressively. Semantic caching for repeated query patterns, response caching for guardrails and policy-heavy outputs, and multi-turn context caching all deliver meaningful cost and latency reductions at low implementation cost.
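A minimal response cache keyed on a normalized prompt illustrates the accounting; a semantic cache would swap the exact-match key for a nearest-neighbor lookup over embeddings, but the hit/miss structure is the same:

```python
import hashlib

class ResponseCache:
    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Normalize whitespace and case so trivial variants share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_compute(self, prompt: str, compute) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = compute(prompt)  # the expensive model call
        return self._store[key]
```

Tracking the hit rate tells you directly how much latency and spend the cache is saving.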
Strategy: When to Invest, When to Stay Lean
The strategic landscape for LLM products has clarified considerably. A few principles that have held up:
No GPUs before product-market fit. Training foundation models before you know what users actually need is almost always a capital allocation mistake. The resources required are enormous; the performance gains over tuned, prompted commercial APIs are rarely worth it pre-PMF; and the maintenance burden is ongoing. Self-host when confidentiality requirements or scale-driven cost pressure make it necessary — not as a first-principles architectural decision.
Build moats the model providers can't replicate. Domain-specific evals, proprietary data flywheels, high-quality annotation datasets, guardrails tuned to your specific risk surface — these compound over time and can't be easily replicated by a competitor running the same base model. Generic commodity features (text-to-SQL, documentation chatbots, generic knowledge base integrations) don't compound and won't differentiate you.
Team structure follows maturity. Successful LLM products tend to evolve in three phases: Phase 1 is product and prompt engineering — find what works. Phase 2 is instrumentation and data collection — build the feedback loop. Phase 3 is systematic optimization — use evals and data to drive measurable improvements. Jumping to Phase 3 infrastructure before Phase 1 product clarity is a common form of premature optimization.
The Underlying Bet
All of these lessons point to the same underlying bet: that the teams that will win are not the ones with the best model access, but the ones who build the best feedback loops.
The evaluation harness, the data collection infrastructure, the annotation pipeline, the human-in-the-loop UX that generates signal — these are harder to build than a good initial prompt, and they're harder to copy. They're also the only thing that keeps compounding as models improve. When the next generation of models drops and everyone's baseline performance jumps, the teams with robust evaluation infrastructure will be able to measure the improvement and adapt. Everyone else will be guessing.
The model will change. The system that validates it doesn't have to.
