639 posts tagged with "llm"

The 200-Token System Prompt That Beats Your 4000-Token One

May 2, 2026 · 10 min read

Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel: a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. At one-twentieth the prompt length, it was also far cheaper to run and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Your AI Feature Needs a Kill Switch That Isn't a Deploy

May 2, 2026 · 13 min read

Tian Pan

Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that powers your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." The model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted Parquet file. The cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.
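A minimal sketch of such a primitive, assuming the flag lives somewhere the on-call engineer can change at runtime. Here that is a JSON file at a hypothetical path; in practice it might be a feature-flag service or a config store. The names `FLAG_PATH`, `ai_enabled`, `set_ai_enabled`, and `answer` are illustrative and do not come from any particular library.

```python
import json
import os
import tempfile

# Hypothetical flag location; a real system might use a feature-flag service.
FLAG_PATH = os.path.join(tempfile.gettempdir(), "ai_kill_switch.json")


def ai_enabled(path: str = FLAG_PATH) -> bool:
    """Read the flag on every request so a change needs no deploy."""
    try:
        with open(path, encoding="utf-8") as f:
            return bool(json.load(f).get("ai_enabled", True))
    except FileNotFoundError:
        return True  # no flag file yet: the feature stays on
    except (json.JSONDecodeError, AttributeError):
        return False  # unreadable flag: fail closed (an assumption, not a rule)


def set_ai_enabled(value: bool, path: str = FLAG_PATH) -> None:
    """What the on-call engineer (or a runbook script) would invoke."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"ai_enabled": value}, f)


def answer(question: str, call_model) -> str:
    """Route around the model entirely when the switch is off."""
    if not ai_enabled():
        return "This feature is temporarily unavailable."
    return call_model(question)
```

Because `ai_enabled` is evaluated per request, flipping the flag takes effect on the next call without a merge or a new binary. Whether an unreadable flag should fail open or closed is a product decision; this sketch fails closed.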

Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

May 2, 2026 · 11 min read

Tian Pan

Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."

Distillation Is a Product Decision, Not a Research Artifact

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A frontier-model chat feature is roughly a thirty-cents-per-conversation product. The distilled variant of the same feature is roughly a third-of-a-cent-per-conversation product. These are not two implementations of one product. They are two products, with different free-tier economics, different acquisition costs, different markets, and different competitive moats. The team that ships the distilled version as "the same feature, cheaper" wastes the move.

Most engineering organizations still treat distillation as a research-team optimization that gets applied after a feature is "done" — a tail-end pass to wring inference cost out of something already spec'd against the frontier model. That framing is wrong by an order of magnitude. The choice of teacher, the choice of student, the eval suite the student is graded against, and the product surface the student is deployed to are product decisions. They determine which capabilities you are consenting to lose, which traffic shape you are designing for, and which price floor you are unlocking. Hand them to a research team to optimize against MMLU and you will ship a model that wins benchmarks the product does not care about.

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

May 2, 2026 · 11 min read

Tian Pan

Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

May 2, 2026 · 9 min read

Tian Pan

Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.
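The idea of having something to fall back to can be sketched as an ordered list of handlers, tried in turn until one succeeds. This is a sketch under stated assumptions: the tier names and handlers below (`primary_model`, `smaller_model`, `static_response`) are hypothetical stand-ins, not the specific failure modes the post goes on to describe.

```python
from typing import Callable, Sequence

Handler = Callable[[str], str]


def run_cascade(request: str, tiers: Sequence[tuple[str, Handler]]) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, response) from the first success."""
    errors = []
    for name, handler in tiers:
        try:
            return name, handler(request)
        except Exception as exc:  # broad on purpose: any failure moves to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


# Hypothetical tiers for illustration only.
def primary_model(request: str) -> str:
    raise TimeoutError("primary model unavailable")  # simulate an outage


def smaller_model(request: str) -> str:
    return f"[small model] {request}"


def static_response(request: str) -> str:
    return "We can't answer that right now. Please try again shortly."


TIERS = [
    ("primary_model", primary_model),
    ("smaller_model", smaller_model),
    ("static_response", static_response),
]
```

Returning the tier name alongside the response lets the caller log which tier actually served the request, so degraded operation is visible rather than silent.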

The LLM-as-Validator Antipattern: Why Your AI Quality Gate Has a Blind Spot

May 2, 2026 · 8 min read

Tian Pan

Software Engineer

Your AI feature ships with a quality gate: every response runs through a GPT-4 prompt that scores it on helpfulness, accuracy, and tone. Green scores trigger no alerts. The dashboard shows a 97% pass rate. Meanwhile, your support tickets double.

The problem is structural. You used the same class of system that generates your outputs to validate those outputs. When the generator hallucinates a plausible-sounding fact, the judge — trained on the same distribution of internet text — reads the hallucination as credible and passes it through. Both models share the blind spot. Your quality gate is measuring confidence, not correctness.

Persona Drift in Long-Running Agent Sessions: Why Your Agent Forgets Who It Is

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

Most production agent failures look like model errors. The agent starts a session responding correctly to the system prompt — maintaining the right tone, respecting tool constraints, following the defined workflow. Then somewhere around turn 30 or 40, things subtly shift. The agent starts hedging where it should be direct. It calls tools it was told to avoid. It contradicts a decision it made 15 turns earlier. The system prompt hasn't changed, but the agent's behavior has.

This is persona drift: the progressive divergence between an agent's actual behavior and its original system instructions, caused by how transformers attend to increasingly buried context. Research quantifies it precisely — after 8–12 dialogue turns, persona self-consistency metrics degrade by more than 30%. Single-turn agents achieve roughly 90% task accuracy; multi-turn agents running the same tasks fall to around 65%. That 25-point drop isn't a model quality problem you can prompt your way around. It's an architectural property of how attention works over long sequences, and most teams discover it only after they've shipped a feature that degrades silently for hours before a user finally notices.

Persona Overlays: When One Agent Needs Many Voices for Different Customer Cohorts

May 2, 2026 · 11 min read

Tian Pan

Software Engineer

A Fortune 500 procurement lead opens your support agent and asks why the SOC 2 report references a control your product no longer implements. Your agent answers in the same chipper voice it uses with hobbyists on the free tier — three exclamation points, an emoji, and a cheerful suggestion to "ping our team" with no escalation path or citation. The procurement lead forwards the screenshot to her CISO with one line: "This is who they sent to handle our compliance question." You lose the renewal not because the answer was wrong, but because the voice was wrong for the room.

Most teams ship one agent persona because the org chart has one support team. The customer base, however, is rarely that uniform. Enterprise buyers expect formality, citations, and named escalation paths. Self-serve users want quick answers and zero friction. Developers want code, not paragraphs. The single-persona agent reads as condescending to one cohort and unprofessional to another, and "let users pick a tone" punts a product decision to the user that the user shouldn't have to make.

The PRD for an AI Feature: Why Your Old Template Misses the Cliff

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

The deterministic-software PRD template has aged into a kind of muscle memory. Problem statement, user stories, acceptance criteria, edge cases, success metrics, scope cuts. Engineers know how to read it. PMs know how to fill it in. Designers know which sections to lift wireframes from. It is a well-worn artifact that has shipped a generation of CRUD apps, dashboards, and SaaS workflows.

It also has no field for "what the model gets wrong five percent of the time." No field for "what we accept as a passing eval score." No field for "what the user sees when the model refuses to answer." No field for "which prompt version this PRD locks down, and who is allowed to change it after ship." Every AI feature shipped against that template is shipping with a hidden contract that nobody wrote down. Postmortems keep finding it the hard way.

The 'What Changed' Query Is the RAG Question Your Index Can't Answer

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A user asks your assistant, "what changed about our refund policy this quarter?" The system returns a confident, well-formatted summary of the current refund policy. The user nods, closes the chat, and acts on information that has nothing to do with the question they asked. Nothing in your eval suite caught this. Nothing in your faithfulness metric flagged it. The retrieval looked perfect: it returned highly relevant chunks. The synthesis looked perfect: it cited every chunk it used. The only problem is that the question was about change, and your index has no concept of change.

This is the failure mode that vector-similarity retrieval cannot fix by tuning. Two versions of the same document have nearly identical embeddings. That is what good embeddings do: they collapse semantically equivalent text into the same neighborhood. So when you ask "what changed," the retriever returns one of the versions, the LLM summarizes that version, and the answer silently amounts to a hallucinated "nothing changed." The user cannot tell. Your eval set probably cannot tell either, because your eval set is built around "what is X" questions, not "what's different about X now."
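One way to see the gap is to compare the two versions directly instead of retrieving by similarity. The sketch below assumes the index keeps every version of a document, keyed by a document id and version number. That store (`VERSIONS`), the sample policy text, and the function name `what_changed` are all hypothetical. It uses Python's standard `difflib.ndiff` to surface only the lines that differ.

```python
import difflib

# Hypothetical version store: (doc_id, version) -> full text.
VERSIONS = {
    ("refund-policy", 1): (
        "Refunds are available within 30 days.\n"
        "Refunds go to the original payment method.\n"
    ),
    ("refund-policy", 2): (
        "Refunds are available within 14 days.\n"
        "Refunds go to the original payment method.\n"
    ),
}


def what_changed(doc_id: str, old: int, new: int) -> list[str]:
    """Return removed ('- ') and added ('+ ') lines between two stored versions."""
    old_lines = VERSIONS[(doc_id, old)].splitlines()
    new_lines = VERSIONS[(doc_id, new)].splitlines()
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep only real changes.
    return [
        line
        for line in difflib.ndiff(old_lines, new_lines)
        if line.startswith(("- ", "+ "))
    ]


for line in what_changed("refund-policy", 1, 2):
    print(line)
```

The two versions here differ in a single line, which is exactly the detail an embedding lookup would blur. A line-level diff like this could be handed to the model as context for "what changed" questions, though how to detect such questions in the first place is outside this sketch.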
