
Seven Patterns for Building LLM Systems That Actually Work in Production

10 min read
Tian Pan
Software Engineer

The demo always works. Prompt the model with a curated example, get a clean output, ship the screenshot to the stakeholder deck. Six weeks later, the system is in front of real users, and none of the demo examples appear in production traffic.

This is the gap every LLM product team eventually crosses: the jump from "it works on my inputs" to "it works on inputs I didn't anticipate." The patterns that close that gap aren't about model selection or prompt cleverness — they're about system design. Seven patterns account for most of what separates functional prototypes from reliable production systems.

1. Evals First, Not Last

The most common mistake in LLM product development is treating evaluation as something you do after the system exists. Teams spend weeks iterating on prompts and model configurations without a stable measurement baseline, then realize they can't tell if last Tuesday's change made things better or worse.

Evals need to come first — before prompt engineering, before model selection, before deployment. The mechanics aren't complicated: define the task, collect representative examples (even 50-100 well-chosen cases is far better than zero), and write an automated scorer that runs on every change. This is "eval-driven development," and it's the closest thing LLM engineering has to test-driven development in conventional software.
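The loop above can be sketched in a few lines. This is a minimal harness with hypothetical names (`run_system`, `EVAL_SET`) standing in for your real pipeline and dataset; the point is the shape, not the scorer:

```python
# Minimal eval harness sketch: a fixed dataset of (input, expected) cases
# and a scorer that runs on every prompt or model change.
# `run_system` is a hypothetical stand-in for your LLM pipeline.

def run_system(query: str) -> str:
    # Placeholder: in a real system this calls your prompt + model stack.
    return "refund within 30 days" if "refund" in query else "unknown"

EVAL_SET = [
    {"input": "What is the refund window?", "expected": "refund within 30 days"},
    {"input": "Do you ship to Canada?", "expected": "yes, 5-7 business days"},
]

def score(eval_set) -> float:
    """Exact-match accuracy; swap in an LLM judge for open-ended tasks."""
    hits = sum(run_system(case["input"]) == case["expected"] for case in eval_set)
    return hits / len(eval_set)

print(f"accuracy: {score(EVAL_SET):.2f}")  # run this in CI on every change
```

Exact match is deliberately crude; the discipline of running a fixed scorer on every change matters more than the metric you start with.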

Generic benchmarks like MMLU or HELM exist to rank foundation models — not to measure your product. A customer service bot being evaluated on graduate-level science questions tells you nothing. Build task-specific eval datasets from real traffic. Log examples where the system fails, where users regenerate responses, where feedback is negative. These become your eval set.

LLM-as-evaluator has become a practical alternative to human labeling for many intermediate tasks. Using a stronger model to rate outputs with chain-of-thought reasoning produces reasonably calibrated scores at scale. The key constraint: define the rubric precisely. "Is this response helpful?" is too vague. "Does this response answer the user's specific question without hallucinating product details?" is evaluable.

2. Retrieval Before Retraining

For any system that needs access to facts, documents, or knowledge that changes over time, retrieval-augmented generation (RAG) should be your first architectural move — not fine-tuning.

The intuition is simple: RAG changes what the model can see right now. Fine-tuning changes how the model tends to behave every time. Most production knowledge problems are about the former.

The failure mode in early RAG implementations is treating it as a simple vector search problem. You chunk documents, embed them, build a FAISS index, retrieve by cosine similarity, and paste the results into the prompt. This works for demos. In production, it fails on proper nouns, product codes, acronyms, and rare strings — exactly the tokens that matter most in enterprise applications.

The production fix is hybrid retrieval: pair a keyword-based retriever (BM25 or equivalent) with dense embeddings, then combine rankings using Reciprocal Rank Fusion. Keyword search catches exact tokens; semantic search catches meaning. Neither alone handles the full distribution of real queries. Teams that implement hybrid search as one of their first upgrades to a RAG system consistently report large recall improvements over vector-only approaches.
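Reciprocal Rank Fusion itself is small enough to show in full. This sketch assumes you already have two ranked lists of document IDs (the IDs here are illustrative) and uses the common `k = 60` constant:

```python
# Reciprocal Rank Fusion sketch: combine a keyword (BM25) ranking and a
# dense-embedding ranking into one list. Doc IDs below are illustrative.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by the sum of 1/(k + rank) across lists (rank is 1-based)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["SKU-4431", "doc_a", "doc_b"]    # keyword search catches exact tokens
dense_hits = ["doc_a", "doc_c", "SKU-4431"]   # semantic search catches meaning
print(rrf_fuse([bm25_hits, dense_hits]))
```

Documents that appear in both lists accumulate score from both, so agreement between the retrievers is rewarded without needing to calibrate BM25 scores against cosine similarities.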

Beyond retrieval itself, chunking strategy matters more than most teams expect. Naive fixed-size chunking cuts across paragraph boundaries, table rows, and list items in ways that degrade retrieval precision. Sentence-aware or structure-aware chunking — which respects document boundaries — yields cleaner retrieved passages. Adding a cross-encoder reranker as a second pass (scoring retrieved candidates against the query before passing to the LLM) is a high-return, low-risk improvement for most RAG pipelines.
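A minimal version of structure-aware chunking splits on paragraph boundaries first and then packs paragraphs into chunks under a size budget, rather than cutting at an arbitrary character offset. A sketch, with an illustrative budget:

```python
# Paragraph-aware chunking sketch: split on blank lines, then pack whole
# paragraphs into chunks under a size budget instead of cutting mid-paragraph.

def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nPolicy details in a second paragraph.\n\nA third."
print(chunk_by_paragraph(doc, max_chars=40))
```

A production version would split oversized single paragraphs and respect headings and tables too, but the principle is the same: the unit of retrieval should be a unit of meaning.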

3. Fine-Tuning for Behavior, Not Knowledge

If RAG handles knowledge, fine-tuning handles behavior. The distinction is important because it determines when fine-tuning is the right tool.

Fine-tuning works well for:

  • Consistency in tone, format, or structure that's hard to enforce through prompts alone
  • Classification and routing tasks where a smaller specialized model can match a larger general one
  • Output schema adherence (structured JSON, domain-specific formats)
  • Policy and constraint following that needs to be baked in, not prompted in

Fine-tuning works poorly as a knowledge store. Models don't reliably memorize factual data from training — they interpolate. Injecting product catalogs, policy documents, or frequently updated information through fine-tuning produces hallucinations and staleness. That's what RAG is for.

On the practical side, parameter-efficient methods like LoRA and QLoRA have made fine-tuning accessible. QLoRA's 4-bit quantization can reduce memory requirements by an order of magnitude for large models, making single-GPU fine-tuning viable for most teams. For most product use cases, full fine-tuning is overkill — LoRA adapters provide comparable task performance with far less compute and faster iteration.
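The core LoRA idea fits in a few lines of numpy. This is a toy illustration of the math (not a training loop, and not the `peft` library API): the frozen weight gets a scaled low-rank update, and only the two small factors would be trained:

```python
import numpy as np

# Toy illustration of the LoRA idea: instead of updating a full d_out x d_in
# weight matrix, train two small low-rank factors B (d_out x r) and A (r x d_in).
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # B starts at zero: no drift at step 0

def lora_forward(x):
    # Base path plus scaled low-rank update, following the LoRA formulation.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
print(np.allclose(lora_forward(x), W @ x))  # True before any training (B == 0)
trainable = A.size + B.size
print(trainable, W.size)  # far fewer trainable parameters than the full matrix
```

At rank 8 on a 64x64 layer the trainable parameter count is a quarter of the full matrix; at realistic transformer dimensions the ratio is far more lopsided, which is where the compute savings come from.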

One legal note that trips up teams: most frontier model providers prohibit using their API outputs to train competing models. If you're distilling capabilities from a proprietary model, verify the license. Open-source base models (Llama, Mistral, Falcon families) avoid this constraint entirely and provide audit trails that matter in regulated industries.

4. Cache Conservatively

Semantic caching — storing prior LLM responses keyed by embedding similarity — is seductive. It promises to slash latency and cost by serving cached answers to "similar enough" questions. In practice, it requires considerably more care than most implementations apply.

The core risk: semantic similarity is not semantic equivalence. "What's the return policy for orders under $50?" and "What's the return policy for orders over $500?" have high embedding similarity and opposite correct answers. Serving a cached response from the first query to the second is a product defect, potentially a legal one.

Safe caching patterns are narrower than general-purpose semantic search:

  • Item-based caching: Pre-compute and cache summaries of specific known entities (products, articles, user profiles). The input is constrained to a known set.
  • Pair-based caching: For comparison queries between known pairs, cache the result when both entities are recognized.
  • Constrained input caching: When the LLM is wrapping a parameterized query (e.g., "summarize ticket #{id}"), cache by the structured key, not the natural language form.
  • Offline pre-computation: For high-traffic predictable outputs, generate them in batch and serve statically.

General-purpose semantic similarity caching works well for queries with high traffic concentration — where a small number of question variants account for most volume. Measure your cache hit rate against a real traffic sample before committing to it as a latency strategy.

5. Guardrails as Architecture, Not Afterthought

Guardrails are often added to LLM systems after the first production incident — a user elicited an off-topic response, the model hallucinated a number, or the output format broke a downstream parser. The reactive approach means guardrails end up as patches rather than architecture.

There are four layers worth designing explicitly:

Structural guardrails enforce output format. The most reliable technique is constrained decoding — libraries like Guidance or Outlines inject tokens at generation time to force the output into a valid JSON schema, regex pattern, or grammar. This is more reliable than prompting the model to "respond in JSON" because it operates at the token level rather than relying on instruction following.

Syntactic guardrails check correctness of generated code, SQL, or URLs before execution. Run generated SQL through a parser before sending it to the database. Validate URLs before displaying them. These are straightforward and should be non-optional for any system that executes model-generated code.
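For SQL, one cheap way to parse-check without touching the real database is to run `EXPLAIN` against an in-memory SQLite copy of the schema: the statement is parsed and planned but never executed. A sketch, with an illustrative `orders` table:

```python
import sqlite3

# Syntactic guardrail sketch: parse-check generated SQL with EXPLAIN against
# an in-memory schema copy before it ever reaches the real database.
# The `orders` table is illustrative.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, status TEXT)")

def sql_is_valid(sql: str) -> bool:
    try:
        conn.execute(f"EXPLAIN {sql}")   # parses and plans, but never runs the query
        return True
    except sqlite3.Error:
        return False

print(sql_is_valid("SELECT id, total FROM orders WHERE status = 'open'"))  # True
print(sql_is_valid("SELEKT * FORM orders"))                                # False
```

This catches both syntax errors and references to nonexistent tables or columns. If production runs a different SQL dialect, a dedicated parser for that dialect is the more faithful check, but the pattern is the same: validate before execute.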

Semantic guardrails assess whether the response is relevant, accurate, or consistent with source documents. SelfCheckGPT — sampling the model multiple times and checking for consistency — is a practical hallucination detection technique. Retrieval-grounded responses can be checked by verifying that claims are supported by retrieved context.
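The sampling-and-consistency idea can be sketched with a deliberately crude agreement measure. Word-overlap (Jaccard) stands in here for the stronger checks SelfCheckGPT actually uses (NLI, question answering); only the shape of the technique is shown:

```python
from itertools import combinations

# SelfCheckGPT-style consistency sketch: sample the model several times and
# score pairwise agreement. Agreement here is crude word overlap (Jaccard);
# the real technique uses stronger checks -- this shows only the shape.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def consistency_score(samples: list[str]) -> float:
    """Mean pairwise agreement; low scores flag likely hallucination."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

stable = ["The order ships in 3 days", "The order ships in 3 days",
          "The order ships in 3 days"]
unstable = ["Founded in 1987", "Founded in 2003", "It was never founded"]
print(consistency_score(stable) > consistency_score(unstable))  # True
```

The intuition: when the model knows the answer, repeated samples converge; when it's hallucinating, samples scatter. The extra inference cost of multiple samples is the price of the check, so it's usually reserved for high-stakes claims.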

Safety guardrails handle content policy. Whether you use a moderation API or a fine-tuned classifier, this layer needs both input and output coverage. Input moderation catches injection attempts and policy-violating requests before they reach the model. Output moderation catches failures that make it through.

6. Design the UX for Uncertainty

Most LLM product failures aren't model failures — they're expectation failures. Users who expect deterministic behavior from an LLM will be consistently surprised. Designing for that uncertainty is an engineering problem, not just a design problem.

Some principles that hold up in practice:

Be transparent about confidence. Bard, Bing, and Perplexity all display disclaimers and source citations not just for legal reasons but because they set accurate expectations. Users who know a response might be wrong are more likely to verify. Users who assume responses are authoritative don't.

Make dismissal easy. GitHub Copilot's ghost-text interface is instructive — suggestions appear inline but require an explicit accept action. The friction of the accept action is intentional. AI features that require active dismissal rather than active acceptance get evaluated more carefully.

Constrain the input space where possible. Slot-based interfaces (dropdowns, autocomplete, structured forms) are dramatically easier to build reliable systems around than open-ended chat. Chat has higher perceived capability but also much higher failure surface. Chat should earn its place in a product; don't default to it because it's the obvious interface for an LLM.

Provide attribution. Users trust cited, sourced responses more than unsourced ones — even when they don't follow the citations. Citations signal that the system is grounded in something external, not generating from nothing.

7. Build the Feedback Loop Into the Product

The most durable competitive advantage in an LLM product isn't the model or the prompt — it's the data flywheel. Systems that systematically collect signal from real usage can improve continuously; systems that don't are stuck with the baseline.

Feedback collection needs to be designed into product interactions, not bolted on afterward. The patterns that work:

Explicit feedback: Thumbs up/down, regenerate buttons, correction flows. These produce labeled examples directly usable for evals and fine-tuning. The challenge is response rate — most users don't leave explicit feedback. Optimize the interaction to make feedback the path of least resistance.

Implicit signals: Code acceptance rate (Copilot's core metric), conversation continuation, ignored suggestions, session abandonment. These require instrumentation but produce much higher signal volume than explicit feedback.

Failure mode logging: Every error, exception, guardrail trigger, and user-reported problem is a training signal. A rigorous incident taxonomy — what went wrong, why, and what output would have been correct — turns production failures into a dataset.

The flywheel closes when collected data improves evals, which guides fine-tuning and prompt iteration, which improves the product, which generates more usage and more signal. Teams that instrument this loop early have compounding advantages over teams that treat data collection as a post-launch concern.


Where to Start

The seven patterns don't need to be implemented simultaneously. The practical sequence for most teams:

  1. Evals before anything else — you need a measurement baseline to know if changes help
  2. RAG for knowledge-dependent use cases — before fine-tuning, before scaling prompts
  3. Hybrid retrieval as the first RAG upgrade — keyword plus semantic, combined with RRF
  4. Guardrails starting with structural and syntactic, adding semantic as the system matures
  5. Feedback instrumentation from launch — not after you realize you need more data
  6. Fine-tuning once you have a stable eval baseline and enough task-specific data to justify it
  7. Caching only after you understand your traffic distribution and cache safely for your domain

Demos are easy. Products are a system design problem. The patterns that matter aren't about finding the right model — they're about what surrounds it.

Let's stay in touch. Follow me for more thoughts and updates.