Building LLM Applications for Production: What Actually Breaks

· 9 min read
Tian Pan
Software Engineer

Most LLM demos work. Most LLM applications in production don't—at least not reliably. The gap between a compelling prototype and something that survives real user traffic is wider than any other software category I've worked with, and the failures are rarely where you expect them.

This is a guide to the parts that break: cost, consistency, composition, and evaluation. Not theory—the concrete problems that cause teams to quietly shelve projects three months after their first successful demo.

The Core Problem: Natural Language Is Not an API

Traditional software is deterministic. Call a function with the same inputs, get the same outputs. LLMs fundamentally break this contract. You can instruct a model to return JSON with a specific schema, and it will—until it doesn't. You can tell it to limit responses to 200 words, and it will comply—until a user's input triggers a pattern that overrides your instruction.

This isn't a bug you can fix. It's the nature of the interface.

The practical consequence is that everything downstream of an LLM call becomes probabilistic. Your JSON parser will encounter malformed JSON. Your downstream pipeline will receive truncated responses. Your UI will display outputs that violate your schema constraints. None of this happens predictably; it happens at a rate you have to measure.

The engineers who succeed in production internalize this early. They treat every LLM output as user input: validate it, handle failures gracefully, and design for the fraction of cases, small but never zero, where the output is structurally wrong.

Cost and Latency: The Economics Nobody Plans For

The math that surprises most teams: inference costs scale with production traffic in ways that training costs don't.

A few hundred dollars buys you extensive prompt experimentation. But at production scale, the economics shift entirely. If your application makes an LLM call per user action, and each call is 2,000 tokens at current API pricing, a modest 100,000 daily active users can generate bills that exceed your entire infrastructure budget. The models that impressed you in the demo phase are often the expensive ones.
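To make that arithmetic concrete, here is a back-of-envelope estimate. The per-token price below is a hypothetical figure for illustration, not any provider's actual rate; check current pricing before relying on numbers like these.

```python
def monthly_inference_cost(daily_users: int, calls_per_user: int,
                           tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Rough monthly API spend, assuming a flat blended per-token rate."""
    daily_tokens = daily_users * calls_per_user * tokens_per_call
    return daily_tokens / 1_000_000 * usd_per_million_tokens * 30

# 100,000 DAU, 5 calls per user per day, 2,000 tokens per call,
# at a hypothetical $10 per million tokens:
cost = monthly_inference_cost(100_000, 5, 2_000, 10.0)
# => 300000.0, i.e. $300k/month
```

Even halving any one factor only gets you to $150k, which is why the routing and caching strategies below attack several factors at once.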

Latency is the other side of this. Input token length has relatively little impact on latency—models process inputs in parallel. But output generation is sequential, token-by-token, and there's a floor around 500ms for any response of substance. Applications with real-time requirements (chat, voice, autocomplete) hit this wall and can't engineer their way around it without accepting shorter outputs or switching to faster, less capable models.

Practical strategies that survive contact with production:

  • Route by complexity. Use a small, fast model for classification, extraction, and simple transformation. Reserve large models for tasks that actually require reasoning. Most tasks don't.
  • Cache aggressively. Semantic caching—storing embeddings of previous queries and returning cached results for semantically similar inputs—can cut costs 40-60% for workloads with natural repetition patterns.
  • Batch when latency allows. Background jobs, report generation, content moderation—anything not blocking a user action should be batched and processed at lower-priority rates.
  • Audit your prompt length. Teams routinely ship prompts with 500-word system instructions that could be 100 words. Every token counts at scale.
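A minimal sketch of the semantic-caching idea from the list above, assuming `embed` is any function that maps text to a fixed-length vector (e.g. your embedding model client); a production version would use a vector store rather than a linear scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold  # tune against your own traffic
        self.entries = []           # list of (vector, cached_response)

    def get(self, query):
        """Return a cached response if a semantically similar query was seen."""
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller makes the real LLM call

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The threshold is the key knob: too low and users get a cached answer to a different question; too high and the cache never hits. It has to be tuned against real traffic, not guessed.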

The teams that get this right treat cost as a first-class engineering metric from day one, not a finance problem to fix later.

Prompting, RAG, or Finetuning: Making the Right Call

The three approaches exist on a spectrum of effort versus performance, and the wrong choice wastes months.

Prompting is where everything starts. Zero training data required, instant iteration, and good enough for most tasks. The ceiling is the model's existing knowledge and the context window. Prompting alone fails when you need domain-specific knowledge the base model lacks, highly consistent formatting under adversarial inputs, or significant latency/cost reduction.

Retrieval-augmented generation (RAG) solves the knowledge problem without retraining. You embed your documents, store them in a vector database, retrieve relevant chunks at query time, and inject them into the context. For enterprise use cases—internal documentation, customer support, knowledge bases—RAG handles 80% of the cases where prompting alone falls short. The gotchas: retrieval quality is hard to tune, relevant chunks don't always assemble into coherent context, and embedding pipelines need maintenance as documents change.
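The retrieve-then-inject loop can be sketched as follows. `embed` and `generate` are placeholders for your embedding and completion clients, and a real system would query a vector database instead of scanning chunks in memory.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0

def retrieve(query, chunks, embed, k=3):
    """Rank document chunks by embedding similarity to the query."""
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def answer(query, chunks, embed, generate, k=3):
    """Inject the top-k chunks into the prompt and constrain the model to them."""
    context = "\n---\n".join(retrieve(query, chunks, embed, k))
    prompt = (
        "Answer using only the context below. If the context is insufficient, "
        f"say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Note that the prompt explicitly permits "I don't know": without that escape hatch, a bad retrieval tends to produce a confident wrong answer rather than a refusal.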

Finetuning makes sense when you have large amounts of high-quality labeled data, you need the model to internalize patterns that are hard to express in prompts, or you need to reduce inference cost by shortening prompts significantly. A well-finetuned smaller model can outperform a prompted larger model at a fraction of the per-token cost. But finetuning pipelines are expensive to build and maintain, models degrade as the base model updates, and the feedback loop for improvement is slow.

A useful heuristic: try prompting first. Move to RAG when you need domain knowledge. Move to finetuning when you have data and a clear performance gap that prompting can't close. Most applications never need finetuning.

Task Composition: Where Multi-Step Applications Break

Single-prompt applications are the exception in production. Most real applications chain multiple LLM calls: classify the input, retrieve context, generate a response, validate the output, format it for display. Each step can fail, and failures compound.

The failure modes in composed systems are different from single-call failures:

Individual step failures are the obvious case—one call returns malformed output and breaks the pipeline. These are testable.

Composability gaps are harder: each step works correctly in isolation, but the outputs don't assemble into what you intended. The classifier correctly routes the query. The retriever returns relevant documents. The generator correctly synthesizes those documents. But the final output doesn't answer the user's actual question because the routing decision was slightly wrong in a way your unit tests didn't catch.

Runaway loops affect any agentic system where an LLM decides when to stop. Without hard limits on iterations, a model can enter loops that consume tokens until you terminate it externally.

Handling these well requires:

  • Integration tests, not just unit tests on individual prompts. Test the full pipeline with representative inputs.
  • Hard termination conditions on any loop or agent that can self-direct.
  • Explicit error handling at composition boundaries, with fallback behavior for each failure mode.
  • Structured outputs enforced at the API level where possible, not just requested in prompts.
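A hard termination condition for a self-directing loop might look like the sketch below. `step` stands in for one model call that returns either a final answer or a next state, and the field names are illustrative, not a real framework's API.

```python
class BudgetExceeded(Exception):
    pass

def run_agent(step, state, max_iterations=10, max_tokens=50_000):
    """Run a self-directing loop under hard iteration and token budgets."""
    tokens_used = 0
    for _ in range(max_iterations):
        result = step(state)                 # one model call
        tokens_used += result["tokens"]
        if tokens_used > max_tokens:
            raise BudgetExceeded(f"token budget hit at {tokens_used} tokens")
        if result.get("final") is not None:  # the model decided to stop
            return result["final"]
        state = result["state"]              # otherwise keep iterating
    raise BudgetExceeded(f"no answer after {max_iterations} iterations")
```

Raising, rather than returning a partial answer, is deliberate: a budget breach is a signal you want in your error monitoring, not something silently absorbed.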

The shift from single-call to multi-step applications is where most teams discover that "working on my machine" means something different with LLMs than it does with traditional software.

Evaluation: The Part Most Teams Skip Until It's Too Late

The single most reliable predictor of production success is whether a team built an evaluation framework before shipping. Not after things broke—before.

Evaluation for LLMs isn't pass/fail. It's a distribution: how often does the output meet your quality bar, how does that rate change across input types, and how does it shift when you update the model or the prompt?

Building a useful eval setup doesn't require a research team:

  1. Collect representative examples. 50-100 real inputs with expected outputs, covering the distribution you care about. Include adversarial cases—the inputs you know are hard.
  2. Define a quality rubric. What does "correct" mean for your application? Factual accuracy? Format compliance? Tone? Enumerate it explicitly.
  3. Version prompts with git. Treat prompts as code. Track changes, measure performance deltas, and never deploy a prompt change without running evals.
  4. Automate regression testing. Run your eval suite on every prompt change, before deployment. LLM-as-judge has become reliable enough for this: use a strong model to evaluate outputs against your rubric automatically.
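The steps above reduce to a small harness. The per-case `check` predicate here is a stand-in for your rubric; for criteria a string check can't express, it could wrap an LLM-as-judge call instead.

```python
def run_evals(generate, cases):
    """Run every case and return the pass rate plus failures for inspection."""
    failures = []
    for case in cases:
        output = generate(case["input"])
        if not case["check"](output):
            failures.append({"input": case["input"], "output": output})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

# Hypothetical cases: one format-compliance check, one length check.
cases = [
    {"input": "Summarize: the cat sat.", "check": lambda out: len(out.split()) <= 200},
    {"input": "Return user data as JSON.", "check": lambda out: out.strip().startswith("{")},
]
```

Run this in CI on every prompt change and store the pass rate: the number itself matters less than the delta between versions.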

The teams that skip this discover they have no baseline when something degrades. Model providers push updates; your application's performance on specific inputs can silently worsen without detection. Evals are the only way to know.

Observability: Table Stakes, Not Optional

By 2025, the production LLM ecosystem had enough tooling to make observability straightforward. Teams still skip it, because the instrumentation feels like overhead in the early phase.
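A minimal wrapper illustrates the idea; `call_model` and its response fields are assumptions about your client, not a specific SDK.

```python
import json
import time

def logged_call(call_model, prompt, log=print, **kwargs):
    """Wrap a model call so input, output, latency, and tokens are recorded."""
    start = time.monotonic()
    response = call_model(prompt, **kwargs)
    log(json.dumps({
        "prompt": prompt,
        "output": response["text"],
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "tokens": response.get("tokens"),  # feeds cost dashboards
    }))
    return response
```

In practice `log` would write to your logging pipeline rather than stdout, but the discipline is the same: no LLM call reaches production unwrapped.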

It's not. Logging inputs, outputs, latencies, and costs for every LLM call in production is the minimum. You need this to debug user complaints ("the model gave a wrong answer"), to detect silent degradation ("performance on this query class dropped 20%"), and to optimize costs ("we didn't realize 30% of our spend was on this one endpoint").

Beyond basic logging, trace the full request path through multi-step pipelines. When a complex agentic workflow fails, you need to know which step failed and what its input was—not just that the final output was wrong.

Observability data feeds back into eval datasets. Real production failures become test cases. This loop—ship, observe, improve evals, fix, ship—is how production LLM applications actually improve over time.

What Holds Up Under Change

The tools and APIs change continuously. Pricing drops weekly. New models outperform last month's state of the art. New frameworks promise to solve orchestration. Most of what you read will be outdated in six months.

What doesn't change:

  • The need to evaluate before and after any change
  • The economics of routing by task complexity
  • The failure patterns in composed pipelines
  • The value of treating every LLM output as untrusted input

Engineers who build on these fundamentals can absorb the surface-level changes without rebuilding from scratch. The teams that built directly on specific model behaviors or framework APIs had to rewrite significant portions every time something changed.

The core insight is that LLM application engineering is less like traditional software engineering and more like distributed systems work: you're building reliable behavior out of unreliable components, and the discipline comes from the rigor you apply to the seams, not from the quality of any single component.

That discipline is learnable. It just has to be applied deliberately, from the beginning, before the demo becomes a product.