Hard-Won Lessons from Shipping LLM Systems to Production
Most engineers building with LLMs share a common arc: a working demo in two days, production chaos six weeks later. The technology behaves differently under real load, with real users, against real data. The lessons that emerge aren't philosophical—they're operational.
After watching teams across companies ship (and sometimes abandon) LLM-powered products, a handful of patterns appear again and again. These aren't edge cases. They're the default experience.
Evals Aren't Optional Infrastructure—They're the Foundation
The single most common mistake is shipping before you have evaluations. Without evals, every architectural decision—RAG vs. fine-tuning, which model to use, whether chain-of-thought helps—is just vibes. You can't tell if a prompt change improved things or regressed a different capability. You're flying blind.
Start with assertion-based unit tests built from real production samples, not synthetic examples. If you don't have production data yet, collect it manually, even fifty examples. Write specific, measurable criteria (at least three per eval set). The criteria will feel obvious at first. They won't stay that way—as your team interacts with real data, what counts as "good" shifts, and if you're not tracking criteria explicitly, you won't notice the drift.
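A minimal sketch of what assertion-based eval cases can look like. The `run_pipeline` function is a hypothetical stand-in for whatever produces your system's output (stubbed here so the harness runs without an API key); the cases pair real production inputs with specific, measurable checks.

```python
# Stub for illustration; in practice this wraps your LLM pipeline.
def run_pipeline(query: str) -> str:
    return "Refunds are processed within 5 business days."

# Each case pairs a real production input with measurable criteria
# (at least three per eval set, per the advice above).
EVAL_CASES = [
    {
        "input": "How long do refunds take?",
        "checks": [
            lambda out: "refund" in out.lower(),        # stays on topic
            lambda out: any(c.isdigit() for c in out),  # gives a concrete number
            lambda out: len(out) < 500,                 # stays concise
        ],
    },
]

def run_evals() -> float:
    passed = total = 0
    for case in EVAL_CASES:
        out = run_pipeline(case["input"])
        for check in case["checks"]:
            total += 1
            passed += bool(check(out))
    return passed / total

print(f"pass rate: {run_evals():.0%}")
```

Because each check is a plain predicate, a failing criterion points directly at what regressed, which is exactly what "vibes" cannot do.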
For pairwise comparisons, LLM-as-Judge works surprisingly well, but only when you calibrate it. Don't use off-the-shelf judge prompts. Start with manually labeled examples, check where the judge disagrees with your team's judgment, and iterate on the judge prompt until alignment is high. A miscalibrated judge is worse than no judge—it gives you false confidence.
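Calibration here just means measuring how often the judge matches your team's labels. A sketch with illustrative (made-up) labels, where `"A"`/`"B"` record which response won each pairwise comparison:

```python
# Human labels come from manual annotation; judge labels come from the
# LLM judge prompt you're iterating on. Data below is illustrative.
human_labels = ["A", "B", "A", "A", "B", "A", "B", "B"]
judge_labels = ["A", "B", "B", "A", "B", "A", "B", "A"]

def agreement(human: list[str], judge: list[str]) -> float:
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

rate = agreement(human_labels, judge_labels)
print(f"judge-human agreement: {rate:.0%}")
# Iterate on the judge prompt until this clears your bar; also read each
# disagreement individually to catch systematic judge biases.
```

The disagreement cases are more valuable than the aggregate number: they tell you whether the judge is wrong randomly or wrong in a pattern you can prompt around.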
Simplify human annotation tasks to binary judgments or pairwise comparisons. Nuanced rating scales introduce inter-rater variance that drowns signal in noise. Harder tasks also take longer, so annotators rush, introducing more error. Keep it simple.
Hybrid Search Beats Pure Vector Retrieval
The default RAG implementation—embed everything, retrieve by cosine similarity—works for demos and fails quietly in production. The failure mode is subtle: retrieval looks plausible, but it misses the documents that actually matter.
Keyword search (BM25) handles exact terms, product names, SKUs, and proper nouns in ways embeddings don't. Vector search handles semantic similarity and paraphrase. Neither alone is adequate. Teams that switched to hybrid approaches, combining BM25 and embedding retrieval with score fusion, consistently saw retrieval quality improve. That improvement matters because downstream generation cannot recover from retrieval failures.
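One common score-fusion method is reciprocal rank fusion (RRF), which combines rankings by position rather than trying to normalize BM25 and cosine scores onto one scale. A sketch with hypothetical document IDs:

```python
# Reciprocal rank fusion: each ranker contributes 1/(k + rank) per document.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: lists of doc IDs, each ordered best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["sku-123", "faq-7", "doc-2"]    # exact-term matches
vector_ranking = ["doc-2", "doc-9", "faq-7"]    # semantic matches
fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear in both rankings (here `doc-2` and `faq-7`) accumulate score from each list and rise to the top, which is the behavior you want: agreement between the two retrievers is strong evidence of relevance.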
Three factors determine retrieval quality: relevance (are the right documents coming back?), information density (is there signal in what's retrieved, or mostly boilerplate?), and detail level (are the retrieved chunks granular enough to answer the question?). Track these separately. Retrieval failures that look like generation failures are the hardest to debug because you'll spend weeks on the wrong component.
Chunking strategy matters more than most teams expect. Token-count chunking is fast to implement and quietly catastrophic for dense documents. A legal clause that spans two chunks loses its meaning when split mid-sentence. Semantic chunking by section or logical unit improves retrieval at the cost of implementation complexity—usually worth it.
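For documents with explicit structure, semantic chunking can be as simple as splitting on section boundaries so each chunk keeps a logical unit intact. A sketch for markdown-style headings:

```python
import re

def chunk_by_section(text: str) -> list[str]:
    # Split before each heading line ("#", "##", ...) while keeping the
    # heading attached to the body that follows it.
    parts = re.split(r"\n(?=#+ )", text)
    return [p.strip() for p in parts if p.strip()]

doc = """# Refund Policy
Refunds are issued within 5 business days.

## Exceptions
Digital goods are non-refundable."""

for chunk in chunk_by_section(doc):
    print(repr(chunk[:30]))
```

Real documents need more cases (nested lists, tables, clauses without headings), but the principle holds: split where the document's meaning splits, not where a token counter happens to land.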
Multi-Step Workflows Outperform Monolithic Prompts
The reflex to put everything into one prompt is understandable—fewer API calls, simpler code. It's also reliably worse on complex tasks.
Breaking a complex task into focused, single-purpose prompts improves both reliability and debuggability. When a pipeline has six steps, you can isolate which step failed. When you have one mega-prompt, you can't. Code generation research has shown accuracy improvements of 2x or more when replacing single-pass generation with multi-step workflows that include self-review and refinement stages.
The implementation heuristic: if a prompt requires the model to do more than two substantially different things (extract, then classify, then format), split it. The overhead of orchestration code is small compared to the improvement in controllability.
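The extract-classify-format split above can be sketched as three focused calls. The `llm` helper is hypothetical and stubbed with canned responses so the control flow runs standalone; in practice each call would carry its own narrow prompt:

```python
# Stub standing in for a real LLM call; keyed on the step name so the
# pipeline runs without an API key.
def llm(prompt: str) -> str:
    canned = {
        "extract": "order_id=8841; issue=late delivery",
        "classify": "shipping_complaint",
        "format": '{"order_id": "8841", "category": "shipping_complaint"}',
    }
    return canned[prompt.split(":", 1)[0]]

def handle_ticket(ticket: str) -> str:
    # Step 1: pull structured facts out of free text.
    facts = llm(f"extract: pull order ID and issue from: {ticket}")
    # Step 2: classify against a fixed taxonomy.
    label = llm(f"classify: assign a category to: {facts}")
    # Step 3: emit the final structured output.
    return llm(f"format: emit JSON for: {facts} / {label}")

result = handle_ticket("My order 8841 is two weeks late!")
print(result)
```

Each intermediate value (`facts`, `label`) is inspectable and loggable, so when the pipeline misbehaves you can see which step went wrong instead of staring at one opaque mega-prompt.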
For agentic systems—where models are making sequential decisions—prefer deterministic plans over open-ended exploration. Agents that plan then execute (rather than planning and executing simultaneously) are more reliable and easier to monitor. The planning step can be LLM-generated; the execution step should be constrained and predictable.
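The plan-then-execute shape can be sketched as a validated plan run against a whitelist of deterministic tools. Tool names here are hypothetical; the point is that execution refuses anything outside the constrained set:

```python
# A small whitelist of deterministic tools (names are illustrative).
TOOLS = {
    "search_orders": lambda arg: f"found order {arg}",
    "issue_refund": lambda arg: f"refund issued for {arg}",
}

def execute(plan: list[tuple[str, str]]) -> list[str]:
    results = []
    for tool, arg in plan:
        if tool not in TOOLS:  # reject any step outside the whitelist
            raise ValueError(f"unknown tool: {tool}")
        results.append(TOOLS[tool](arg))  # deterministic, loggable steps
    return results

# In production this plan would be LLM-generated, then validated like
# this before any step runs.
plan = [("search_orders", "8841"), ("issue_refund", "8841")]
print(execute(plan))
```

Because the whole plan is validated up front, a bad step fails before anything executes, and every executed step leaves an auditable trail, which is what makes these systems easier to monitor than interleaved plan-and-act loops.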
Model Selection Is a Moving Target
Picking a model at project start and assuming it will stay optimal is a mistake. Models get updated, deprecated, and outclassed. GPT-3.5-turbo was state-of-the-art in early 2023; within a year, significantly better models at lower cost existed.
Pin specific model versions in production. Floating versions ("gpt-4") can silently change behavior under your application. A model update that improves general capability can still regress your specific use case if your prompts relied on particular response patterns.
Prompt migration across models is more painful than expected. Switching from one model family to another typically requires rebuilding prompts from scratch, not porting them. Expect 10%+ performance variation even for semantically identical prompts across models. If you've built an eval suite, this is manageable. If you haven't, you're guessing.
The smallest model that meets your quality bar on evals is usually the right choice. Smaller models with well-engineered workflows frequently outperform larger models with sloppy prompting, and the cost difference compounds at scale. This isn't always true—some tasks genuinely need the biggest models—but treat "use the biggest model" as a fallback position, not a default.
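With an eval suite in place, "smallest model that meets the bar" becomes a mechanical selection. A sketch with made-up model names, costs, and pass rates, not real benchmarks:

```python
# Illustrative placeholders: plug in your own eval results and pricing.
candidates = [
    {"name": "small-model", "cost_per_1k": 0.0002, "eval_pass_rate": 0.87},
    {"name": "mid-model", "cost_per_1k": 0.0010, "eval_pass_rate": 0.93},
    {"name": "large-model", "cost_per_1k": 0.0100, "eval_pass_rate": 0.95},
]

def cheapest_passing(models: list[dict], quality_bar: float = 0.90):
    passing = [m for m in models if m["eval_pass_rate"] >= quality_bar]
    return min(passing, key=lambda m: m["cost_per_1k"]) if passing else None

choice = cheapest_passing(candidates)
print(choice["name"])
```

Here the mid-tier model clears the bar at a tenth of the largest model's cost; without the eval pass rates, that trade-off is invisible and "use the biggest model" wins by default.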
Operational Lessons That Don't Fit a Demo
Monitor inputs and outputs daily. Development data doesn't reflect production distribution. Structural skew—formatting differences, field types—is easy to detect. Semantic skew—the meaning of what users ask changing over time—requires human review. Teams that look at a random sample of production I/O daily catch drift that automated monitoring misses. This is the "vibe check" that sounds informal and is actually irreplaceable.
Cache aggressively. LLM latency is high and cost per call is non-trivial. Exact-match caching handles repeated inputs (common in FAQ-style systems). Semantic caching—where a query within some embedding distance of a cached query returns the cached answer—handles near-duplicates. Both require cache invalidation policies, but the latency and cost savings are substantial enough to justify the implementation work.
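A minimal semantic cache can be sketched as a threshold on cosine similarity between query embeddings. The `embed` function below is a toy stand-in for a real embedding model (it just hashes characters into buckets), so the thresholds are illustrative only:

```python
import math

# Toy embedding for illustration; replace with a real embedding model.
def embed(text: str) -> list[float]:
    vec = [0.0] * 16
    for i, ch in enumerate(text.lower()):
        vec[i % 16] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer  # near-duplicate hit: skip the LLM call
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how long do refunds take?", "Refunds take 5 business days.")
print(cache.get("how long do refunds take?"))
```

A production version needs an approximate-nearest-neighbor index rather than a linear scan, plus the invalidation policy the paragraph mentions, but the hit/miss logic is this simple.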
Guardrails require output monitoring, not just input filtering. LLMs generate text even when they shouldn't—when they're uncertain, when the question is out of scope, when the retrieved context doesn't support an answer. Prompt engineering alone won't prevent confident-sounding incorrect outputs. Build output-side checks: factual consistency verification against retrieved sources, refusal detection, output structure validation. Reference-free evals—which assess output quality without requiring a gold-standard answer—can double as runtime guardrails.
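The output-side checks above can be sketched as simple predicates run on every response before it reaches the user. These are deliberately crude stand-ins; a real grounding check would use an NLI model or calibrated judge, but the shape of the guardrail layer is the same:

```python
import json
import re

# Crude refusal detector: pattern list is illustrative, not exhaustive.
REFUSAL_PATTERNS = re.compile(
    r"(as an ai|i cannot|i'm unable|i am unable)", re.IGNORECASE
)

def is_refusal(output: str) -> bool:
    return bool(REFUSAL_PATTERNS.search(output))

def is_valid_structure(output: str, required_keys: set[str]) -> bool:
    # Output structure validation: parseable JSON with the expected keys.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return required_keys <= set(data)

def is_grounded(output: str, sources: list[str]) -> bool:
    # Toy consistency check: output vocabulary overlaps retrieved sources.
    source_words = set(" ".join(sources).lower().split())
    out_words = set(output.lower().split())
    return len(out_words & source_words) / max(len(out_words), 1) > 0.3

print(is_refusal("I'm unable to help with that."))
```

Any failed check can route the response to a fallback (retry, escalate, or decline) instead of shipping a confident-sounding wrong answer.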
Don't fine-tune until you have to. Early fine-tuning decisions are regularly regretted. Fine-tuning requires a labeled dataset, training infrastructure, and ongoing maintenance as models update. It also makes debugging harder—behavior changes from fine-tuning and behavior changes from prompt changes are difficult to disentangle. RAG with well-engineered prompting handles the majority of cases where teams initially reach for fine-tuning. Invest in fine-tuning only after you have strong eval evidence that prompting has hit its ceiling.
What Actually Creates Long-Term Value
The model itself is not a competitive moat. Models are becoming cheaper and more capable on a predictable curve. What creates durable value is the surrounding system:
- Evals that encode your domain's definition of quality and improve over time
- Data pipelines that capture user feedback and convert it into training signal
- Guardrails and observability that make the system trustworthy enough to scale
- Domain specialization that creates stickiness generic tools can't match
Teams that spent 2023 and 2024 trying to differentiate on model capability found that distinction evaporated. Teams that built deep evaluation infrastructure and domain-specific data flywheels found those investments compounding.
The transition from prototype to production isn't primarily a technical challenge—it's a discipline challenge. The practices that make LLM systems reliable (evals, monitoring, structured retrieval, modular prompting) are all known. They just require committing to them before you can feel the consequences of skipping them.
By the time the consequences are obvious, you're firefighting.
