Skip to main content

340 posts tagged with "llm"

View all tags

LLM Observability in Production: The Four Silent Failures Engineers Miss

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.

Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.

LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

· 11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.

This post covers the mechanics, the tradeoffs, and what actually breaks in production.

Token Budget Strategies for Production LLM Applications

· 10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the model returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.

Load Testing LLM Applications: Why k6 and Locust Lie to You

· 11 min read
Tian Pan
Software Engineer

You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?

The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.

The Eval-to-Production Gap: Why 92% on Your Test Suite Means 40% User Satisfaction

· 10 min read
Tian Pan
Software Engineer

Your team spent three weeks building a rigorous eval suite. It covers edge cases. It includes adversarial examples. The LLM-as-judge scores 92% across all dimensions. You ship.

Then the support tickets start. Users say the AI "doesn't understand what they're asking." Session abandonment is up 30%. Satisfaction scores come back at 41%.

This gap — between eval performance and real-world outcomes — is the most common failure mode in production AI systems today. It's not a model problem. It's a measurement problem.

Temporal Reasoning Failures in Production AI Systems

· 10 min read
Tian Pan
Software Engineer

An agent that confidently recommends products that have been out of stock for six months. A customer service bot that tells a user there's no record of the order they placed 20 minutes ago. A coding assistant that generates working code against a library API deprecated two years ago. These aren't hallucinations in the traditional sense — the model is recalling something that was once accurate. That's a different failure mode entirely, and most teams aren't equipped to detect or defend against it.

The distinction matters because the mitigations are fundamentally different. You can't prompt-engineer your way out of staleness. You can't fine-tune your way out of it either — fine-tuning on stale knowledge makes the problem worse, not better, because the model expresses outdated information with greater authority. And as models become more fluent and confident in their delivery, their confidently-wrong stale answers become harder, not easier, for users to catch.

Prompt Versioning and Change Management in Production AI Systems

· 9 min read
Tian Pan
Software Engineer

A team added three words to a customer service prompt to make it "more conversational." Within hours, structured-output error rates spiked and a revenue-generating pipeline stalled. Engineers spent most of a day debugging infrastructure and code before anyone thought to look at the prompt. There was no version history. There was no rollback. The three-word change had been made inline, in a config file, by a product manager who had no reason to think it was risky.

This is the canonical production prompt incident. Variations of it play out at companies of every size, and the root cause is almost always the same: prompts were treated as ephemeral configuration instead of software.

Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

· 10 min read
Tian Pan
Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demo'd it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations across roughly 30% of outputs — a failure mode no one had tested for because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.

LLM API Resilience in Production: Rate Limits, Failover, and the Hidden Costs of Naive Retry Logic

· 10 min read
Tian Pan
Software Engineer

In mid-2025, a team building a multi-agent financial assistant discovered their API spend had climbed from $127/week to $47,000/week. An agent loop — Agent A asked Agent B for clarification, Agent B asked Agent A back, and so on — had been running recursively for eleven days. No circuit breaker caught it. No spend alert fired in time. The retry logic dutifully kept retrying each timeout, compounding the runaway cost at every step.

This is not a story about model quality. It is a story about distributed systems engineering — specifically, about the parts of it that most LLM application developers skip because they assume the provider handles it.

They do not.

LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems

· 11 min read
Tian Pan
Software Engineer

Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.

TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.

Synthetic Data Pipelines for Domain-Specific LLM Fine-Tuning

· 9 min read
Tian Pan
Software Engineer

Your model fine-tuned on synthetic data scores 95% on your internal evals. Then you deploy it, and it confidently invents drug interactions that don't exist, cites legal precedents with wrong case numbers, and hallucinates API endpoints with plausible-sounding names. The model hasn't regressed on fluency — it's gotten worse in a way that fluency metrics completely miss. Researchers call this knowledge collapse: factual accuracy degrades while surface coherence stays intact. It's one of the more insidious failure modes in synthetic data training, and it happens most often when engineers build pipelines without accounting for it.

Synthetic data generation has become unavoidable for teams fine-tuning LLMs on specialized domains. Human annotation at scale is expensive, slow, and impossible for tasks that require expertise. Synthetic data generated by a capable teacher model can fill that gap cheaply. But the pipeline is not as simple as "prompt GPT-4 for examples, train your model." The details determine whether you get a specialized system that outperforms a general model on your domain, or a fluent but factually broken one.

Structured Generation: Making LLM Output Reliable in Production

· 10 min read
Tian Pan
Software Engineer

There is a silent bug lurking in most LLM-powered applications. It doesn't show up in unit tests. It doesn't trigger on the first thousand requests. It waits until a user types something with a quote mark in it, or until the model decides — for no apparent reason — to wrap its JSON response in a markdown code block, or to return the field "count" as the string "three" instead of the integer 3. Then your production pipeline crashes.

The gap between "LLMs are text generators" and "my application needs structured data" is where most reliability problems live. Bridging that gap is not a prompt engineering problem. It's an infrastructure problem, and in 2026 we finally have the tools to solve it correctly.