216 posts tagged with "production"

Structured Output Reliability in Production LLM Systems

April 10, 2026 · 10 min read

Software Engineer

Your LLM pipeline hits 97% success rate in testing. Then it ships, and somewhere in the tail of real-world usage, a JSON parse failure silently corrupts downstream state, a missing field causes a null-pointer exception three steps later, or a response wrapped in markdown fences breaks your extraction logic at 2am. Structured output failures are the unsung reliability killer of production AI systems — they rarely show up in benchmarks, they compound invisibly in multi-step pipelines, and they're entirely preventable if you understand the actual problem.

The uncomfortable truth: naive JSON prompting fails 15–20% of the time in production environments. For a pipeline making a thousand LLM calls per day, that's 150–200 silent failures. And because those errors often don't surface immediately — they propagate forward as malformed data, not exceptions — they're the hardest class of bug to detect and debug.

Text-to-SQL in Production: Why Correct SQL Is the Easy Part

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

GPT-4o scores 86.6% on the Spider benchmark. Deploy it against your actual data warehouse and you might get 10%. That gap is not a rounding error—it is the entire problem. The queries that make up the missing 76% execute without errors, return rows with the correct schema, and are completely wrong.

Text-to-SQL is not a syntax problem. Every serious production deployment discovers the same uncomfortable truth: the hard failures are silent ones. A query that scans a 10TB Snowflake table, returns revenue figures that are 30% too high due to a duplicated join, or quietly bypasses row-level security looks identical to a correct query from the outside. It finishes, it returns data, and nobody flags it.

This post covers the failure modes that actually bite teams in production, and the layered architecture that prevents them.

The Tool Result Validation Gap: Why AI Agents Blindly Trust Every API Response

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent calls a tool, gets a response, and immediately reasons over it as if it were gospel. No schema check. No freshness validation. No sanity test against what the response should look like. This is the default behavior in every major agent framework, and it is silently responsible for an entire class of production failures that traditional monitoring never catches.

The tool result validation gap is the space between "the tool returned something" and "the tool returned something correct." Most teams obsess over getting tool calls right — selecting the right tool, generating valid arguments, handling timeouts. Almost nobody validates what comes back.

Feature Flags for AI: Progressive Delivery of LLM-Powered Features

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams discover the hard way that rolling out a new LLM feature is nothing like rolling out a new UI button. A prompt change that looked great in offline evaluation ships to production and silently degrades quality for 30% of users — but your dashboards show HTTP 200s the whole time. By the time you notice, thousands of users have had bad experiences and you have no fast path back to the working state.

The same progressive delivery toolkit that prevents traditional software failures — feature flags, canary releases, A/B testing — applies directly to LLM-powered features. But the mechanics are different enough that copy-pasting your existing deployment playbook will get you into trouble. Non-determinism, semantic quality metrics, and the multi-layer nature of LLM changes (model, prompt, parameters, retrieval strategy) each create wrinkles that teams routinely underestimate.

Releasing AI Features Without Breaking Production: Shadow Mode, Canary Deployments, and A/B Testing for LLMs

April 9, 2026 · 11 min read

Tian Pan

Software Engineer

A team swaps GPT-4o for a newer model on a Tuesday afternoon. By Thursday, support tickets are up 30%, but nobody can tell why — the new model is slightly shorter with responses, refuses some edge-case requests the old one handled, and formats dates differently in a way that breaks a downstream parser. The team reverts. Two sprints of work, gone.

This story plays out constantly. The problem isn't that the new model was worse — it may have been better on most things. The problem is that the team released it with the same process they'd use to ship a bug fix: merge, deploy, watch. That works for code. It fails for LLMs.

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

April 9, 2026 · 9 min read

Tian Pan

Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.

Why Your Agent Harness Should Be Stateless: Decoupling Brain from Hands in Production

April 9, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams building AI agents treat the harness — the scaffolding that handles tool routing, context management, and the inference loop — as a long-lived, stateful process tied to a single container. When the container fails, the session dies. When you need to swap in a better model, you have to restart everything. When you want to scale horizontally, you hit a wall: each harness instance knows too much about its own state to be interchangeable.

The fix isn't a smarter harness. It's a stateless one.

Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About

April 9, 2026 · 9 min read

Tian Pan

Software Engineer

Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.

This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.

Prompt Versioning in Production: The Engineering Discipline Teams Learn the Hard Way

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

You get paged at 2am. Users are reporting garbage output. You SSH in, check logs, stare at traces — everything looks structurally fine. The model is responding. Latency is normal. But something is wrong with the answers. Then the question lands in your incident channel: "Which prompt version is actually running right now?"

If you can't answer that question in under thirty seconds, you have a prompt versioning problem.

Prompts are treated like configuration in most early-stage LLM projects. A product manager edits a string in a .env file, a developer pastes an updated instruction into a hardcoded constant, someone else pastes a slightly different version into a staging Slack channel. Eventually the versions diverge, and nobody has a complete picture of what's running where. The experimentation-phase casualness that got you to launch becomes a liability the moment you have real users.

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

April 9, 2026 · 9 min read

Tian Pan

Software Engineer

When developers first wire up JSON mode, the response feels like solving a problem. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.

The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy dropping from 43% to just 2% on calendar scheduling tasks when the number of tools expanded from 4 to 51 across multiple domains. That is not a graceful degradation curve.

Voice AI in Production: Engineering the 300ms Latency Budget

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.

About Tian Pan