
200 posts tagged with "ai-engineering"


Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.
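Back-of-envelope arithmetic makes the gap concrete. A minimal sketch, assuming a hypothetical price of $2.50 per million input tokens and a RAG pipeline that retrieves roughly 800 tokens of context per query (both numbers illustrative, not any provider's actual pricing):

```python
# Compare per-query input cost: full 1M-token corpus in every prompt
# vs. a RAG pipeline that sends only the retrieved chunks.
PRICE_PER_MTOK = 2.50  # hypothetical $/million input tokens

def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Dollar cost of the input tokens for a single request."""
    return tokens / 1_000_000 * price_per_mtok

long_context = input_cost(1_000_000)  # entire knowledge base in the prompt
rag = input_cost(800)                 # top-k retrieved chunks only

print(f"long-context: ${long_context:.4f}/query")
print(f"rag:          ${rag:.4f}/query")
print(f"ratio:        {long_context / rag:.0f}x")
```

At an 800-token retrieval budget, the ratio works out to the roughly 1,250x per-query cost difference cited above; the exact multiple scales with how much context your retriever actually sends.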

Structured Concurrency for AI Pipelines: Why asyncio.gather() Isn't Enough

· 9 min read
Tian Pan
Software Engineer

When an LLM returns three tool calls in a single response, the obvious thing is to run them in parallel. You reach for asyncio.gather(), fan the calls out, collect the results, return them to the model. The code works in testing. It works in staging. Six weeks into production, you start noticing your application holding open HTTP connections it should have released. Token quota is draining faster than usage metrics suggest. Occasionally, a tool that sends an email fires twice.

The underlying issue is not the LLM or the tool — it's the concurrency primitive. asyncio.gather() was not designed for the failure modes that multi-step agent pipelines produce, and using it as the backbone of parallel tool execution creates problems that are invisible until they compound.
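A minimal sketch of the difference, assuming Python 3.11+ for asyncio.TaskGroup; the tool names and timings are illustrative. With gather(), one failing tool call leaves its siblings running, side effects and all; a TaskGroup cancels the survivors so the pipeline fails as a unit:

```python
import asyncio

side_effects: list[str] = []

async def send_email() -> str:
    # Slow tool with an irreversible side effect.
    await asyncio.sleep(0.05)
    side_effects.append("email sent")
    return "ok"

async def broken_tool() -> str:
    await asyncio.sleep(0.01)
    raise RuntimeError("tool failed")

async def with_gather() -> None:
    # gather() raises as soon as broken_tool fails, but send_email is
    # NOT cancelled -- it keeps running and the email still fires.
    await asyncio.gather(send_email(), broken_tool())

async def with_taskgroup() -> None:
    # The TaskGroup cancels the surviving sibling on failure, so the
    # email never goes out when the pipeline fails.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(send_email())
        tg.create_task(broken_tool())
```

The leaked connections and double-fired emails in the scenario above are exactly what the orphaned send_email task looks like at scale.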

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, the problem feels solved. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.
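One of those different solutions is a semantic validation layer after the parse. A minimal stdlib-only sketch; the field names and allowed values are illustrative, not a real schema:

```python
import json

# The values the downstream job actually understands -- "complete"
# (no "d") is valid JSON but still a bug.
ALLOWED_STATUS = {"pending", "in_progress", "completed"}

def validate_task(payload: str) -> dict:
    data = json.loads(payload)  # JSON mode guarantees this succeeds
    status = data.get("status")
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    if data.get("assignee") is None:  # required field must not be null
        raise ValueError("missing or null 'assignee'")
    return data
```

The point is where the failure surfaces: a loud ValueError at the boundary instead of a background job silently doing nothing three weeks later.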

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.
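One mitigation is to normalize at the boundary: tolerate known aliases and fail loudly on anything else, so a silent model update surfaces as one clear error instead of a KeyError deep in a downstream service. A hedged sketch with illustrative field names:

```python
import json

# Field-name drift we have seen before and are willing to absorb.
ALIASES = {"ticket_category": "category"}
REQUIRED = {"category", "intent"}

def normalize(payload: str) -> dict:
    """Parse model output, map known aliases, and enforce required fields."""
    data = {ALIASES.get(k, k): v for k, v in json.loads(payload).items()}
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"schema drift, missing fields: {sorted(missing)}")
    return data
```

The alias table is a deliberate trade-off: it keeps the pipeline running through drift you have already diagnosed, while unknown drift still fails fast at one choke point.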

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.
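Closing that gap starts with recording LLM-specific fields alongside status and latency. A minimal sketch of such a trace record, stdlib only; the field set is illustrative rather than any particular vendor's schema:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass
class LLMTrace:
    model: str
    system_prompt_hash: str        # detects silent prompt changes
    retrieved_doc_ids: list[str]   # detects wrong-version retrieval
    query_type: str                # enables per-segment quality tracking
    latency_ms: float
    http_status: int = 200         # "green" even when the answer is wrong

def trace(model: str, system_prompt: str, doc_ids: list[str],
          query_type: str, latency_ms: float) -> LLMTrace:
    # Hash the prompt rather than logging it verbatim: cheap to compare
    # across deploys, and it avoids leaking prompt contents into logs.
    digest = hashlib.sha256(system_prompt.encode()).hexdigest()[:12]
    return LLMTrace(model, digest, doc_ids, query_type, latency_ms)
```

With the prompt hash and retrieved document IDs on every request, "the system prompt changed after a code change" and "retrieval pulled the wrong product version" become queries against your logs instead of mysteries.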

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
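The arithmetic behind those savings is easy to sketch. A minimal illustration, assuming the $5/$1 per-million-input-token prices above and a stub complexity score standing in for a trained router (RouteLLM-style systems learn this score from preference data):

```python
def route(query: str, complexity_score: float, threshold: float = 0.7) -> str:
    """Send only queries above the threshold to the frontier model.

    A real router scores the query itself; here the score is supplied."""
    return "frontier-model" if complexity_score >= threshold else "efficient-model"

def blended_cost(frontier_frac: float,
                 frontier_cost: float = 5.0,
                 efficient_cost: float = 1.0) -> float:
    """Blended $/Mtok input cost given the fraction routed to frontier."""
    return frontier_frac * frontier_cost + (1 - frontier_frac) * efficient_cost
```

Routing 85% of traffic to the cheaper model yields a blended input cost of $1.60/Mtok against $5.00 all-frontier, a 68% reduction, squarely inside the 45–85% range the research reports.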

Fine-Tuning Is Usually the Wrong Move: A Decision Framework for LLM Customization

· 9 min read
Tian Pan
Software Engineer

Most engineering teams building LLM products follow the same progression: prompt a base model, hit a performance ceiling, and immediately reach for fine-tuning as the solution. This instinct is wrong more often than it's right.

Fine-tuning is a powerful tool. It can unlock real performance gains, cut inference costs at scale, and give you precise control over model behavior. But it carries hidden costs — in data, time, infrastructure, and ongoing maintenance — that teams systematically underestimate. And in many cases, prompt engineering or retrieval augmentation would have gotten them there faster and cheaper.

This post gives you a concrete framework for when each approach wins, grounded in recent benchmarks and production experience.

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Data Flywheels for LLM Applications: Closing the Loop Between Production and Improvement

· 9 min read
Tian Pan
Software Engineer

Most LLM applications launch, observe some failures, patch the prompt, and repeat. That's not a flywheel — it's a treadmill. A real data flywheel is a self-reinforcing loop: production generates feedback, feedback improves the system, the improved system generates better interactions, which generate better feedback. Each revolution compounds the last.

The difference matters because foundation models have erased the traditional moat. Everyone calls the same GPT-4o or Claude endpoint. The new moat is proprietary feedback data from real users doing real tasks — data that's expensive, slow, and impossible to replicate from the outside.

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.
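The fix is a gate, not a default. A minimal sketch where the task types and cost figures are illustrative, using the midpoint of the 5–10x premium above:

```python
# Task types worth the reasoning-model premium; everything else gets
# the standard model by default.
REASONING_TASKS = {"math_proof", "multi_step_planning", "hard_debugging"}

def pick_model(task_type: str) -> str:
    return "reasoning-model" if task_type in REASONING_TASKS else "standard-model"

def est_cost(task_type: str, base_cost: float = 0.01,
             reasoning_multiplier: float = 7.5) -> float:
    """Rough per-request cost, applying the premium only when gated in."""
    mult = reasoning_multiplier if pick_model(task_type) == "reasoning-model" else 1.0
    return base_cost * mult
```

The gate makes the trade explicit: the two genuinely hard problems pay the premium, and customer support and summarization never see the bill.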

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually passes JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.
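One piece of that convergence is a defensive parse for exactly the failure modes above: markdown fences, leading prose, trailing commas. A last-resort repair sketch, not a substitute for constrained decoding:

```python
import json
import re

def parse_llm_json(text: str) -> dict:
    # Strip markdown code fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Keep only the outermost {...} span, dropping surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]
    # Remove trailing commas before } or ].
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)
```

Hallucinated field names are deliberately out of scope here: a repair layer can fix syntax, but schema drift needs validation downstream.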

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team exercise found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.
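The "constrain what an LLM can do based on what it has seen" idea reduces to taint tracking. A minimal sketch; the tool names are illustrative:

```python
# Tools safe to expose even after untrusted content enters the context.
UNTRUSTED_SAFE_TOOLS = {"search_docs", "summarize"}
ALL_TOOLS = UNTRUSTED_SAFE_TOOLS | {"send_email", "delete_record"}

class Session:
    """Tracks whether untrusted data has entered the model's context."""

    def __init__(self) -> None:
        self.tainted = False

    def ingest(self, content: str, trusted: bool) -> None:
        # Any untrusted ingestion permanently taints the session.
        if not trusted:
            self.tainted = True

    def allowed_tools(self) -> set[str]:
        # Once tainted, high-privilege tools are off the table
        # regardless of what the model asks for.
        return UNTRUSTED_SAFE_TOOLS if self.tainted else ALL_TOOLS
```

The enforcement lives outside the model: no system-prompt warning is involved, so no clever injection phrasing can talk the session back into the privileged tool set.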