Skip to main content

702 posts tagged with "llm"

View all tags

The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered

· 10 min read
Tian Pan
Software Engineer

Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.

This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate summaries exist, not summaries that are complete.

The Zero-Shot Wall: Why In-Context Examples Stop Working at Production Scale

· 8 min read
Tian Pan
Software Engineer

Most teams discover the zero-shot wall the same way: a new edge case breaks the model, they add an example to the prompt, it helps. Three months later they've got 40 examples, 6,000 tokens of context, the performance metrics haven't moved in weeks, and the prompt engineer who knows where every example came from just left the company.

Few-shot prompting is seductive because it works quickly. You observe a failure, you add a demonstration, the failure goes away. The feedback loop is tight and the wins feel free. What you don't notice is that each subsequent example is buying less than the last — and at some point you're spending tokens, latency, and cognitive overhead for improvements that round to zero.

This is the zero-shot wall: not a hard limit where performance drops off a cliff, but a zone of sharply diminishing returns where in-context learning has hit the ceiling of what it can accomplish for your task, and the only lever left is fine-tuning.

Multi-Region AI Deployment: Data Residency, Model Parity, and the Latency Tax Nobody Budgets

· 10 min read
Tian Pan
Software Engineer

When engineers budget for multi-region AI deployments, they typically account for two variables: infrastructure cost per region and replication overhead. What they consistently underestimate — sometimes catastrophically — are three costs that only appear once you're live: model parity gaps that make your EU cluster produce different outputs than your US cluster, KV cache isolation penalties that make every token in GDPR territory more expensive to generate, and silent compliance violations that trigger when your retry logic routes a French user's data through Virginia.

A German bank spent 14 months deploying a large open-source model on-premises to satisfy GDPR requirements. That's not unusual. What's unusual is that the engineers who proposed the architecture understood the compliance constraint upfront. Most don't until an incident report forces the conversation.

The 200-Token System Prompt That Beats Your 4000-Token One

· 10 min read
Tian Pan
Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel — a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. It was also forty times cheaper to run, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Your AI Feature Needs a Kill Switch That Isn't a Deploy

· 13 min read
Tian Pan
Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that ships your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." The model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted parquet file — the cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.

Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

· 11 min read
Tian Pan
Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."

Distillation Is a Product Decision, Not a Research Artifact

· 10 min read
Tian Pan
Software Engineer

A frontier-model chat feature is roughly a thirty-cents-per-conversation product. The distilled variant of the same feature is roughly a third-of-a-cent-per-conversation product. These are not two implementations of one product. They are two products, with different free-tier economics, different acquisition costs, different markets, and different competitive moats. The team that ships the distilled version as "the same feature, cheaper" wastes the move.

Most engineering organizations still treat distillation as a research-team optimization that gets applied after a feature is "done" — a tail-end pass to wring inference cost out of something already spec'd against the frontier model. That framing is wrong by an order of magnitude. The choice of teacher, the choice of student, the eval suite the student is graded against, and the product surface the student is deployed to are product decisions. They determine which capabilities you are consenting to lose, which traffic shape you are designing for, and which price floor you are unlocking. Hand them to a research team to optimize against MMLU and you will ship a model that wins benchmarks the product does not care about.

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

· 11 min read
Tian Pan
Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

· 9 min read
Tian Pan
Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.

The LLM-as-Validator Antipattern: Why Your AI Quality Gate Has a Blind Spot

· 8 min read
Tian Pan
Software Engineer

Your AI feature ships with a quality gate: every response runs through a GPT-4 prompt that scores it on helpfulness, accuracy, and tone. Green scores trigger no alerts. The dashboard shows 97% pass rate. Meanwhile, your support tickets double.

The problem is structural. You used the same class of system that generates your outputs to validate those outputs. When the generator hallucinates a plausible-sounding fact, the judge — trained on the same distribution of internet text — reads the hallucination as credible and passes it through. Both models share the blind spot. Your quality gate is measuring confidence, not correctness.

Persona Drift in Long-Running Agent Sessions: Why Your Agent Forgets Who It Is

· 10 min read
Tian Pan
Software Engineer

Most production agent failures look like model errors. The agent starts a session responding correctly to the system prompt — maintaining the right tone, respecting tool constraints, following the defined workflow. Then somewhere around turn 30 or 40, things subtly shift. The agent starts hedging where it should be direct. It calls tools it was told to avoid. It contradicts a decision it made 15 turns earlier. The system prompt hasn't changed, but the agent's behavior has.

This is persona drift: the progressive divergence between an agent's actual behavior and its original system instructions, caused by how transformers attend to increasingly buried context. Research quantifies it precisely — after 8–12 dialogue turns, persona self-consistency metrics degrade by more than 30%. Single-turn agents achieve roughly 90% task accuracy; multi-turn agents running the same tasks fall to around 65%. That 25-point drop isn't a model quality problem you can prompt your way around. It's an architectural property of how attention works over long sequences, and most teams discover it only after they've shipped a feature that degrades silently for hours before a user finally notices.