
177 posts tagged with "production"


Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

· 13 min read
Tian Pan
Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second provider with similar independence, and simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by a factor of roughly 140. Engineering leadership signs off on the 2x budget and the oncall rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.
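
Sketched as a back-of-the-envelope calculation (the uptime figures are the illustrative ones above, and the independence assumption is doing all the work):

```python
# Naive reliability math behind the 2x-budget pitch (illustrative numbers only).
p_down_primary = 1 - 0.993      # 0.7% downtime for the primary provider
p_down_secondary = 1 - 0.993    # assume a second, similarly reliable provider

# If the two providers fail independently, both are down only when each is down.
p_both_down = p_down_primary * p_down_secondary

print(f"Primary alone unavailable: {p_down_primary:.4%}")   # 0.7000%
print(f"Both unavailable (naive):  {p_both_down:.4%}")       # 0.0049%
print(f"Risk reduction factor:     {p_down_primary / p_both_down:.0f}x")  # ~143x
```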

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.
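
For concreteness, a minimal sketch of what such a pin usually looks like, with the dependencies that typically go undocumented called out as comments; the file paths and keys here are hypothetical:

```python
# Illustrative pinned-model config. The pin itself is one line; the context that
# makes it safe to unpin later is the part nobody writes down.
MODEL_CONFIG = {
    "model": "gpt-4o-2024-08-06",   # pinned: evals and prompts were tuned against this version
    "temperature": 0.2,
    # Undocumented dependencies that come due when this version is retired:
    #  - few-shot examples in prompts/support_agent.txt tuned to this model (hypothetical path)
    #  - downstream parser assumes the "citations" key is always a list
    #  - eval thresholds in evals/baseline.json calibrated against this output distribution
}
```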

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.

The Alignment Tax: When Safety Features Make Your AI Product Worse

· 9 min read
Tian Pan
Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.
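
One concrete way to read "tunable parameter", sketched with hypothetical surface names and thresholds: the refusal cutoff on a safety classifier is configuration, and different product surfaces can legitimately sit at different points on it.

```python
# Hypothetical sketch: the safety classifier's refusal threshold treated as a
# per-surface product parameter rather than a global constant.
REFUSAL_THRESHOLDS = {
    "coding_assistant": 0.95,   # "kill the background process" is normal vocabulary here
    "legal_research": 0.90,     # violent case facts are the whole point of the corpus
    "consumer_chat": 0.70,      # broader audience, lower tolerance for misses
}

def should_refuse(surface: str, risk_score: float) -> bool:
    """Refuse only when the classifier's risk score clears this surface's bar."""
    return risk_score >= REFUSAL_THRESHOLDS.get(surface, 0.70)
```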

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

· 10 min read
Tian Pan
Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.
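
A minimal sketch of what that implies for the cache key, using hypothetical field names: everything that can silently change the correct answer either goes into the key or becomes an invalidation signal.

```python
import hashlib
import json

def ai_cache_key(query: str, model_version: str, prompt_template_hash: str,
                 source_doc_versions: dict[str, str]) -> str:
    """Hypothetical cache key for an AI response: capture every input that can
    silently change the 'correct' answer, not just the user's query."""
    payload = {
        "query": query.strip().lower(),                      # crude normalization; real systems vary
        "model": model_version,                              # new model version => old answer may be wrong
        "prompt": prompt_template_hash,                      # edited prompt template => invalidate
        "docs": dict(sorted(source_doc_versions.items())),   # re-ingested source docs => invalidate
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```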

Canary Deploys for LLM Upgrades: Why Model Rollouts Break Differently Than Code Deployments

· 11 min read
Tian Pan
Software Engineer

Your CI passed. Your evals looked fine. You flipped the traffic switch and moved on. Three days later, a customer files a ticket saying every generated report has stopped including the summary field. You dig through logs and find the new model started reliably producing exec_summary instead — a silent key rename that your JSON schema validation never caught because you forgot to add it to the rollout gates. The root cause was a model upgrade. The detection lag was 72 hours.

This is not a hypothetical. It happens in production at companies that have sophisticated deployment pipelines for their application code but treat LLM version upgrades as essentially free — a config swap, not a deployment. That mental model is wrong, and the failure modes that result from it are distinctly hard to catch.
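
A minimal sketch of the kind of rollout gate that would have caught the rename, assuming you keep a sample of control and canary outputs; the key names and threshold are illustrative:

```python
# Hypothetical rollout gate for a model canary: compare structural properties of
# outputs (here, required JSON keys) between the current and candidate model,
# instead of trusting that schema validation alone will catch a silent rename.
REQUIRED_KEYS = {"summary", "findings", "next_steps"}

def key_presence_rate(responses: list[dict], key: str) -> float:
    return sum(key in r for r in responses) / max(len(responses), 1)

def canary_gate(control: list[dict], candidate: list[dict], max_drop: float = 0.02) -> bool:
    """Pass only if no required key's presence rate drops by more than max_drop."""
    for key in REQUIRED_KEYS:
        if key_presence_rate(control, key) - key_presence_rate(candidate, key) > max_drop:
            return False   # e.g. candidate emits "exec_summary" instead of "summary"
    return True
```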

The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

· 11 min read
Tian Pan
Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.
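
The arithmetic fits in a few lines; the step count below is illustrative:

```python
# Per-step success compounds multiplicatively across the pipeline.
per_step_accuracy = 0.95
steps = 10   # a modest agent run: plan, retrieve, several tool calls, synthesize

end_to_end_success = per_step_accuracy ** steps
print(f"{steps} steps at {per_step_accuracy:.0%} each -> {end_to_end_success:.1%} end-to-end")
# 10 steps at 95% each -> 59.9% end-to-end, i.e. the agent fails roughly 40% of the time
```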

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.
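
A minimal sketch of the lineage record that would close that gap if written at generation time; the field names are hypothetical:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ResponseLineage:
    """Hypothetical lineage record persisted alongside each generated response.
    Without something like this, 'which document did the date come from?' is unanswerable."""
    response_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    model_version: str = ""
    prompt_template_version: str = ""
    retrieved_chunks: list[dict] = field(default_factory=list)  # [{"doc_id", "doc_version", "chunk_hash"}]
    context_token_count: int = 0

# Persist this keyed by response_id, so a support ticket about a wrong date can be
# traced back to the exact documents and model that produced it.
```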

Idempotency Is Not Optional in LLM Pipelines

· 10 min read
Tian Pan
Software Engineer

A batch inference job finishes after six minutes. The network hiccups on the response. Your retry logic kicks in. Two minutes later the job finishes again — and your invoice doubles. This is the tamest version of what happens when you apply traditional idempotency thinking to LLM pipelines without adapting it to stochastic systems.

Most production teams discover the problem the hard way: a retry that was supposed to recover from a transient error triggers a second payment, sends a duplicate email, or writes a contradictory record to the database. The fix is not better retry logic — it is a different mental model of what idempotency even means when your core component is probabilistic.
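
A minimal sketch of that mental model, with hypothetical names and an in-memory store standing in for a durable one: the idempotency key is derived from the logical request, and a retry returns the stored result rather than re-running the model.

```python
import hashlib
import json

def idempotency_key(prompt: str, model: str, params: dict) -> str:
    """Hypothetical idempotency key: identical logical requests map to one key."""
    payload = json.dumps({"prompt": prompt, "model": model, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def generate_once(store: dict, prompt: str, model: str, params: dict, call_model) -> str:
    """Check the result store before calling the model; a network retry that re-enters
    here gets the first completion back instead of billing (and emailing) twice."""
    key = idempotency_key(prompt, model, params)
    if key not in store:
        store[key] = call_model(prompt, model, params)   # only the first attempt pays
    return store[key]
```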

LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip

· 9 min read
Tian Pan
Software Engineer

A team ships a support chatbot. In testing, the monthly bill looks manageable—a few hundred dollars across the engineering team's demo sessions. Three weeks into production, the invoice arrives: $47,000. Nobody had lied about the token counts. Nobody had made an arithmetic error. The production workload was simply a different animal than anything they'd simulated.

This pattern repeats constantly. Teams estimate LLM costs the way they estimate database query costs—by measuring a representative request and multiplying by expected volume. That mental model breaks badly for LLMs, because the two biggest cost drivers (output token length and tool-call overhead) are determined at inference time by behavior you cannot fully predict at design time.
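
A rough sketch of forecasting against a distribution instead of a representative request; the rates, distributions, and tool-call weights below are illustrative, not any provider's real pricing:

```python
import random

INPUT_RATE = 3.00 / 1_000_000    # $ per input token (illustrative)
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token (illustrative)

def monthly_cost(requests_per_month: int, samples: int = 100_000) -> float:
    """Monte Carlo over hypothetical request shapes: output length and tool-call
    count vary per request, so the tail matters more than the median."""
    total = 0.0
    for _ in range(samples):
        input_tokens = random.lognormvariate(7.0, 0.5)    # ~1.1k-token median, long tail
        output_tokens = random.lognormvariate(6.0, 0.9)   # output length varies far more
        tool_calls = random.choices([0, 1, 2, 4], weights=[50, 30, 15, 5])[0]
        total += input_tokens * (1 + tool_calls) * INPUT_RATE + output_tokens * OUTPUT_RATE
    return total / samples * requests_per_month

print(f"Estimated monthly bill: ${monthly_cost(500_000):,.0f}")
```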

This post is about how to forecast better before you ship, not how to optimize after the bill arrives.

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

· 10 min read
Tian Pan
Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.
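
A minimal sketch of the dual-running piece, with hypothetical client and comparison hooks:

```python
import random

def dual_run(request, current_client, candidate_client, compare, mirror_rate=0.05):
    """Hypothetical shadow harness: serve every request from the current model, and on a
    sample of traffic also call the candidate and record the diff. Shown synchronously
    for brevity; a real harness keeps the shadow call off the user-facing path."""
    response = current_client.complete(request)          # user-facing path, unchanged
    if random.random() < mirror_rate:
        shadow = candidate_client.complete(request)      # candidate never reaches the user
        compare(request, response, shadow)               # e.g. schema diff, length delta, eval score
    return response
```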

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.
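
The internal benchmark suite does not need to be elaborate to be useful; a skeleton along these lines, with your own cases and your own grader, is the starting point (all names here are placeholders):

```python
def run_internal_benchmark(cases, call_model, score) -> float:
    """Hypothetical harness: cases are drawn from real (sanitized) production traffic,
    and score is your domain-specific grader, not the benchmark the model card used."""
    results = [score(case["expected"], call_model(case["input"])) for case in cases]
    return sum(results) / len(results)

# accuracy = run_internal_benchmark(cases, call_model=candidate, score=exact_match)
```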

Privacy-Preserving Inference in Practice: The Spectrum Between Cloud APIs and On-Prem

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM privacy as a binary: either you send data to the cloud and accept the risk, or you run everything on-prem and accept the cost. Both framings are wrong. In practice, there is a spectrum of approaches with very different risk profiles and engineering budgets — and most teams are operating at the wrong point on that spectrum without realizing it.

Researchers recently demonstrated they could extract authentic PII from 3,912 individuals at a cost of $0.012 per record with a 48.9% success rate. That statistic tends to get dismissed as academic threat modeling until a security audit or compliance review lands on your desk. The question isn't whether to care about LLM privacy; it's which controls actually move the needle and how much each one costs to implement.
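
As one illustration of a midpoint on that spectrum, a minimal redaction pass before a prompt leaves your network; the patterns below are deliberately crude stand-ins for a real PII detector:

```python
import re

# Hypothetical sketch: strip obvious identifiers before calling a cloud API, rather
# than choosing between raw cloud calls and a full on-prem deployment. Real systems
# use NER-based detectors; regexes are only the illustration here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(prompt: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

# redact("Email jane.doe@example.com about the renewal") -> "Email [EMAIL] about the renewal"
```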