639 posts tagged with "llm"

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

April 20, 2026 · 9 min read

Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems — hallucination filtering, output parsing, retry logic. The model is giving confident-sounding wrong answers, the confidence-based routing logic is trusting those wrong answers, and the system is silently misbehaving in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.
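To make "calibration" concrete, here is a minimal sketch of expected calibration error (ECE), a common way to measure how far stated confidence drifts from observed accuracy. It assumes you already have a per-answer confidence score in [0, 1] and a correctness label from your own evals. The function name and the 10-bin default are illustrative choices, not something the post prescribes.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average, over equal-width confidence bins, of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group (confidence, correctness) pairs into equal-width bins over [0, 1].
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.95 confidence on four answers and gets one right scores an ECE of about 0.70, which is exactly the case where confidence-based routing trusts the wrong answers.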

LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A team ships a support chatbot. In testing, the monthly bill looks manageable—a few hundred dollars across the engineering team's demo sessions. Three weeks into production, the invoice arrives: $47,000. Nobody had lied about the token counts. Nobody had made an arithmetic error. The production workload was simply a different animal than anything they'd simulated.

This pattern repeats constantly. Teams estimate LLM costs the way they estimate database query costs—by measuring a representative request and multiplying by expected volume. That mental model breaks badly for LLMs, because the two biggest cost drivers (output token length and tool-call overhead) are determined at inference time by behavior you cannot fully predict at design time.

This post is about how to forecast better before you ship, not how to optimize after the bill arrives.
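As a rough illustration of why a single "representative request" misleads, the sketch below forecasts from a set of sampled requests instead of one. Everything here is an assumption made for illustration: the `RequestSample` fields, the per-million-token prices, and the simplification that each tool call adds a fixed number of extra input tokens.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestSample:
    input_tokens: int
    output_tokens: int
    tool_calls: int


def request_cost(
    sample: RequestSample,
    input_price_per_mtok: float,   # hypothetical price per 1M input tokens
    output_price_per_mtok: float,  # hypothetical price per 1M output tokens
    tool_overhead_tokens: int,     # assumed extra input tokens per tool call
) -> float:
    """Cost of one request, treating tool-call overhead as extra input tokens."""
    input_total = sample.input_tokens + sample.tool_calls * tool_overhead_tokens
    return (
        input_total * input_price_per_mtok
        + sample.output_tokens * output_price_per_mtok
    ) / 1_000_000


def forecast_monthly(
    samples: list[RequestSample],
    monthly_requests: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    tool_overhead_tokens: int,
) -> dict[str, float]:
    """Scale the mean and a high-percentile per-request cost to monthly volume."""
    if not samples:
        raise ValueError("need at least one sample")
    costs = sorted(
        request_cost(s, input_price_per_mtok, output_price_per_mtok, tool_overhead_tokens)
        for s in samples
    )
    # Crude 95th-percentile pick; a stress scenario, not a statistical bound.
    p95 = costs[min(len(costs) - 1, int(0.95 * len(costs)))]
    return {
        "expected": mean(costs) * monthly_requests,
        "pessimistic": p95 * monthly_requests,
    }
```

The spread between the two numbers is the useful part: when it is wide, the workload is dominated by a tail of long outputs or tool-heavy requests that a demo session is unlikely to reproduce.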

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.

LLM-Powered Data Migrations: What Actually Works at Scale

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.

That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card reports a passed red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.

The Model Portability Tax: How to Architect AI Systems You Can Actually Migrate

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

You inherited an AI feature built on GPT-4-turbo. The model is being deprecated. Your manager wants to cut costs by switching to a newer, cheaper model. You run a quick test, metrics look passable, you ship it — and a week later, accuracy on your core use case drops 22%. Support tickets climb. You're now in a crisis migration rather than a planned one.

This is the model portability tax: the hidden engineering cost that accumulates every time you couple your application logic tightly to a specific foundation model. Every team pays it. Most don't realize how large the bill has gotten until the invoice arrives.

The Multilingual Quality Cliff: Why Your LLM Works Great in English and Quietly Fails Everyone Else

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your LLM passes every eval you throw at it. Latency is solid, accuracy looks fine, and the team ships with confidence. Then a user in Cairo files a bug: the structured extraction returns malformed JSON. A developer in Seoul notices the assistant ignores complex instructions after a few turns. A product manager in Mumbai realizes the chatbot's summarization is just wrong—subtly, consistently, wrong.

None of this showed up in your benchmarks because your benchmarks are in English.

This is the multilingual quality cliff: a performance drop that is steep, systematic, and almost universally invisible to teams that ship AI products. The gap isn't marginal. In long multi-turn conversations, Arabic and Korean users see accuracy around 40.8% on tasks where English users are at 54.8%, a 14-point gap that compounds with every additional turn. For structured editing tasks, the gap becomes catastrophic: 32–37% accuracy against acceptable English performance. The users feel this. Your dashboards don't.

The ORM Impedance Mismatch for AI Agents: Why Your Data Layer Is the Real Bottleneck

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams building AI agents spend weeks tuning prompts and evals, benchmarking model choices, and tweaking temperature — while their actual bottleneck sits one layer below: the data access layer that was designed for human developers, not agents.

The mismatch isn't subtle. ORMs like Hibernate, SQLAlchemy, and Prisma, combined with REST APIs that return paginated, single-entity responses, produce data access patterns that are exactly wrong for autonomous AI agents. The result is token waste, rate limit failures, cascading N+1 database queries, and agents that hallucinate simply because they can't afford to load the context they need.

This post is about the structural problem — and what an agent-optimized data layer actually looks like.

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
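One way to see both sides of the filter at once is to score a labeled sample at the threshold you actually ship. The sketch below assumes the filter emits a risk score per message and that you have human labels for a sample of traffic. The function and key names are illustrative.

```python
from typing import Sequence


def filter_metrics(
    scores: Sequence[float],
    is_harmful: Sequence[bool],
    threshold: float,
) -> dict[str, float]:
    """Precision, recall, and false positive rate when blocking scores >= threshold."""
    if len(scores) != len(is_harmful):
        raise ValueError("scores and is_harmful must have the same length")

    tp = fp = fn = tn = 0
    for score, harmful in zip(scores, is_harmful):
        blocked = score >= threshold
        if blocked and harmful:
            tp += 1      # correctly blocked
        elif blocked and not harmful:
            fp += 1      # benign request blocked: the cost users feel
        elif harmful:
            fn += 1      # harmful request let through
        else:
            tn += 1      # benign request allowed

    return {
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Running this across a few thresholds on the same labeled sample shows the tradeoff directly: raising the threshold lowers the false positive rate and gives up recall, and the vendor default is only one point on that curve.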

Privacy-Preserving Inference in Practice: The Spectrum Between Cloud APIs and On-Prem

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams treat LLM privacy as a binary: either you send data to the cloud and accept the risk, or you run everything on-prem and accept the cost. Both framings are wrong. In practice, there is a spectrum of approaches with very different risk profiles and engineering budgets — and most teams are operating at the wrong point on that spectrum without realizing it.

Researchers recently demonstrated they could extract authentic PII from 3,912 individuals at a cost of $0.012 per record with a 48.9% success rate. That statistic tends to get dismissed as academic threat modeling until a security audit or compliance review lands on your desk. The question isn't whether to care about LLM privacy; it's which controls actually move the needle and how much each one costs to implement.

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the evals showed 94% accuracy on the benchmark suite. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen: wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.
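Treating hit rate as a production metric can start very small. The sketch below computes a token-weighted hit rate from per-request usage records and flags a drop against a baseline. The record keys (`input_tokens`, `cached_input_tokens`) are hypothetical and assume `input_tokens` counts all input tokens, cached or not. Map them to whatever usage fields your provider actually returns.

```python
from typing import Iterable, Mapping


def cache_hit_rate(usage_records: Iterable[Mapping[str, int]]) -> float:
    """Share of input tokens served from cache, weighted by token count."""
    cached = 0
    total = 0
    for record in usage_records:
        cached += record["cached_input_tokens"]  # hypothetical field name
        total += record["input_tokens"]          # hypothetical field name
    return cached / total if total else 0.0


def should_alert(
    current_rate: float,
    baseline_rate: float,
    max_relative_drop: float = 0.25,
) -> bool:
    """True when the hit rate has fallen more than max_relative_drop below baseline."""
    return current_rate < baseline_rate * (1.0 - max_relative_drop)
```

With the numbers from the story above, `should_alert(0.18, 0.72)` returns `True`, and it would have fired long before the rate reached 18%. The 25% relative-drop default is an arbitrary starting point, not a recommendation.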
