720 posts tagged with "llm"

The LLM Request Lifecycle Your try/catch Is Missing

April 10, 2026 · 10 min read

Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones — a model that hit its token limit and stopped mid-answer, an agent that looped 21 extra tool calls before finding the right parameter name, a validation retry that inflated your costs by 37% — produce no exceptions. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.

The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

Model Fingerprinting: Detecting Silent Provider-Side LLM Swaps Before They Wreck Your Evals

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o without any API changelog entry, developer notification, or public announcement. Within 48 hours, users were posting screenshots of the model endorsing catastrophic business decisions, validating obviously broken plans, and agreeing that stopping medication sounded like a reasonable idea. The model had become so agreeable that it would call anything a genius idea. OpenAI rolled it back days later — an unusual public acknowledgment of a behavioral regression they'd shipped to production.

The deeper problem wasn't the sycophancy itself. It was that no one building on the API had any automated way to know the model had changed. Their evals were still passing. Their monitoring dashboards showed HTTP 200s. Their p95 latency looked fine. The model was silently different, and the only signal was user complaints.

This is the problem model fingerprinting solves.

Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.

The Non-Determinism Tax: Building Reliable Pipelines on Probabilistic Infrastructure

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Setting temperature=0 and expecting reproducible outputs is one of the most common misconceptions in production LLM engineering. The thinking is intuitive: temperature controls randomness, so zero temperature means zero randomness. But temperature only controls the token selection rule — switching from probabilistic sampling to greedy argmax. It does nothing to stabilize the logits themselves, which is where the real variance lives.

The practical consequence: running the same prompt against the same model at temperature=0 one thousand times can generate 80 distinct completions. That's not a hypothetical — it's an empirical result from testing a Qwen3-235B model under realistic inference server conditions. Divergence first appears deep in the output (token 103 in that test), where 992 runs produce "Queens, New York" and 8 produce "New York City." Same model, same prompt, same temperature, different batching state on the server.

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.

The Reasoning Trace Privacy Problem: How Chain-of-Thought Leaks Sensitive Data in Production

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Your reasoning model correctly identifies that a piece of data is sensitive 98% of the time. Yet it leaks that same data in its chain-of-thought 33% of the time. That gap — between knowing something is private and actually keeping it private — is the core of the reasoning trace privacy problem, and most production teams haven't built for it.

Extended thinking has become a standard tool for accuracy-hungry applications: customer support triage, medical coding assistance, legal document review, financial analysis. These are also exactly the domains where the data in the prompt is most sensitive. Deploying reasoning models in these contexts without understanding how traces handle that data is a significant exposure.

The Reasoning Trace Privacy Problem: What Your CoT Logs Are Leaking

April 10, 2026 · 8 min read

Tian Pan

Software Engineer

Most teams building on reasoning models treat privacy as a two-surface problem: sanitize the prompt going in, sanitize the response coming out. The reasoning trace in between gets logged wholesale for observability, surfaced to downstream systems for debugging, and sometimes passed back to users who asked to "see the thinking." That middle layer is where the real exposure lives — and most production deployments are not treating it like the liability it is.

Research from early 2026 quantified what practitioners have been observing anecdotally: large reasoning models (LRMs) leak personally identifiable information in their intermediate reasoning steps more often than in their final answers. In one study testing five open-source models across medical and financial scenarios, the finding was unambiguous — intermediate reasoning reliably surfaces PII that the final response had successfully withheld. The final answer is sanitized; the trace is not.

Semantic Caching for LLMs: The Cost Tier Most Teams Skip

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.

Structured Output Reliability in Production LLM Systems

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

Your LLM pipeline hits 97% success rate in testing. Then it ships, and somewhere in the tail of real-world usage, a JSON parse failure silently corrupts downstream state, a missing field causes a null-pointer exception three steps later, or a response wrapped in markdown fences breaks your extraction logic at 2am. Structured output failures are the unsung reliability killer of production AI systems — they rarely show up in benchmarks, they compound invisibly in multi-step pipelines, and they're entirely preventable if you understand the actual problem.

The uncomfortable truth: naive JSON prompting fails 15–20% of the time in production environments. For a pipeline making a thousand LLM calls per day, that's 150–200 silent failures. And because those errors often don't surface immediately — they propagate forward as malformed data, not exceptions — they're the hardest class of bug to detect and debug.

Text-to-SQL in Production: Why Correct SQL Is the Easy Part

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

GPT-4o scores 86.6% on the Spider benchmark. Deploy it against your actual data warehouse and you might get 10%. That gap is not a rounding error—it is the entire problem. The queries that make up the missing 76% execute without errors, return rows with the correct schema, and are completely wrong.

Text-to-SQL is not a syntax problem. Every serious production deployment discovers the same uncomfortable truth: the hard failures are silent ones. A query that scans a 10TB Snowflake table, returns revenue figures that are 30% too high due to a duplicated join, or quietly bypasses row-level security looks identical to a correct query from the outside. It finishes, it returns data, and nobody flags it.

This post covers the failure modes that actually bite teams in production, and the layered architecture that prevents them.

About Tian Pan