Skip to main content

780 posts tagged with "ai-engineering"

View all tags

The Eval That Converges, Then Quietly Collapses

· 11 min read
Tian Pan
Software Engineer

Your weekly eval dashboard has gone flat. The line that used to wobble between 0.71 and 0.78 has tightened to a hairline around 0.84 for three release cycles. The team reads it as a ceiling — the model is as good as the rubric allows, and further work needs a harder eval. Someone schedules a planning meeting to "design eval v2."

That reading is plausible, and sometimes correct. But there is a second explanation that produces the same picture and quietly destroys your release-gating signal: your labelers, human or LLM-judge, have homogenized around the same opinions, and the eval is no longer measuring the model. It is measuring how well the model produces the shape of output your labelers have learned to call correct.

The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.

The Kill Switch With a Latency Budget Your Incident Never Met

· 12 min read
Tian Pan
Software Engineer

The runbook said "disable the agent." The on-call followed it. Forty-three minutes later, when the kill switch finally propagated through the config service, the agent had already filed 1,200 incorrect tickets, called the billing API 8,000 times, and sent emails to customers who hadn't signed up for any of it. The runbook was correct. The runbook was also useless, because nobody had ever measured how long "disable the agent" actually takes when an agent is producing damage by the second.

Most AI features ship with a kill switch the same way most buildings ship with a fire extinguisher: someone signed off that it exists, nobody timed how long it takes to reach. The compliance review asks "is there a kill switch?" and the answer is yes. The incident asks "how fast does it stop the bleeding?" and the answer is whatever the underlying plumbing happens to take — a number nobody on the team has ever measured against the rate at which the feature is doing harm.

The mismatch is the whole problem. A feature whose containment time is longer than its blast time has shipped containment theater.

The Legal Review Timeline Your AI Feature Roadmap Never Costed

· 10 min read
Tian Pan
Software Engineer

You sketched a six-quarter AI roadmap. The model swap, the new data source, the multilingual launch, and the prompt that now offers advice each got a single row on the Gantt chart, sized by engineering effort. Then the first launch slipped four weeks, and the post-mortem said the same thing three times in three different sections: "waiting on legal." The roadmap had assumed engineering capacity was the binding constraint. The actual binding constraint was a queue of legal reviews, each running its own three-to-six-week SLA, none of them aware of each other, and all of them landing on the same two product counsels.

The mistake was not in any of the individual reviews. Each one was warranted. The mistake was treating four parallel features as four parallel timelines while their legal dependencies serialized through the same upstream resource. By the second slip the org learns the shape of the problem. By the fourth it learns to plan against it. The teams that ship AI features on a predictable cadence have stopped treating legal throughput as an external surprise and started treating it as a planning input on the same footing as headcount and infra capacity.

The Model Deprecation Notice That Landed During Your Code Freeze

· 8 min read
Tian Pan
Software Engineer

The email arrives on a Tuesday. The checkpoint your two largest features depend on enters a 90-day sunset. Your engineering org is in week two of a coordinated freeze for a different launch. By the time the freeze lifts, you will have under thirty days to revalidate two production features against a new model — and "revalidate" here means rebuilding the eval set, running shadow traffic, getting product sign-off, and shipping behind a flag that nobody is watching because the launch team is still ramping the thing the freeze was for.

This is not a rare collision. Major providers publish deprecation cadences measured in months, and every team running on hosted models has now seen one cycle. What teams have not absorbed is that provider deprecation is not an engineering event the way a library upgrade is — it is a scheduling event that arrives on a clock you do not control, and any roadmap that did not budget for it inherits the cost as a surprise.

The Near-Duplicate Filter That Took Your Only Hard Example With It

· 10 min read
Tian Pan
Software Engineer

Your dedup step reported a corpus shrink of 28% and the training run finished six hours faster. The eval numbers came in flat-to-slightly-better. Nobody opened the diff of what got removed. Three weeks later support starts paging about a class of refund-reversal tickets the model used to handle and now flatly mishandles. There are eleven training rows that touched that exact pattern. Nine of them are gone — collapsed into a single representative that kept the shortest, cleanest phrasing and dropped the messy hostile-tone variants where the model had actually learned to de-escalate. Your dedup pipeline did that, and your evals did not catch it, because by the time the eval set was built, those examples were already gone from the train set the eval was sampled from.

This is the failure mode that bothers me about deduplication as a pipeline step: it presents itself as hygiene and it is actually distribution editing. Removing exact duplicates of boilerplate is hygiene. Removing near-duplicates by a similarity threshold is a sampling decision dressed up as one. The threshold picks which slices of your training distribution survive, and the slices most likely to lose are the ones where you have the fewest examples to begin with — which are also, almost by definition, the ones you were keeping for coverage rather than count.

The PR Description Your Coding Agent Generated That Humans Stopped Reading

· 11 min read
Tian Pan
Software Engineer

A year ago your team adopted a PR description template. It had a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes. Reviewers loved it: every PR had context, every PR had a test plan, every PR had structure. Six months later the coding agent learned to fill it in. Now every PR has a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes — and reviewers no longer read past the title. The format that once focused attention now signals that there is nothing worth focusing on. The structure outlived the signal it carried.

This is not a code-quality problem. The code in those PRs is often fine. The problem is that the act of writing a description has been amputated from the act of thinking about the change, and the description is the artifact reviewers used to triage what to spend their finite attention on. When that artifact becomes uniformly formatted, plausibly worded, and indistinguishable from every other PR, the reviewer's attention triage breaks. The system that used to surface the unusual now flattens everything into the same shape.

The Provider Failover That Swapped Your Safety Policy Mid-Conversation

· 11 min read
Tian Pan
Software Engineer

A user is twelve turns into a careful conversation with your assistant about prescribing patterns for a controlled substance. The model has been measured, asking clarifying questions, citing guidance, declining to extrapolate beyond the literature. On turn thirteen, the user asks a follow-up that should land the same way the prior twelve did. Instead, they get a flat refusal: "I can't help with that." The conversation is over. They write to support furious — they were not asking anything different, the assistant was just helping them, what changed.

Your logs explain what changed. Halfway through turn thirteen, your primary provider returned a 503 in the middle of the stream. Your gateway, doing exactly what it was configured to do, failed over to the secondary provider for the remainder of the request. The secondary provider's refusal threshold for that class of query is calibrated more conservatively than the primary's. The user did not ask anything different — they asked the same question to a different model under the same brand, and the new model said no.

The Refusal Calibration Your Two Separate Evals Keep Undoing

· 12 min read
Tian Pan
Software Engineer

Pull up the dashboards for the last four model upgrades and look at the safety number next to the helpfulness number. One of them moved on every release. It was almost never the same one twice. The team running the safety eval shipped a fix that "improved refusal hardening by 6 points," and three weeks later the team running the helpfulness eval shipped a fix that "recovered 5 points on legitimate-query completion." Then the cycle started over.

This is not two teams making independent progress. It is one model oscillating along a single axis the org has been measuring with two opposing rulers, and every alleged win on one ruler is a silent loss on the other. The team that just celebrated a safety improvement quietly shipped a model that refuses more legitimate medical questions, more legal questions, more "how do I" questions whose stems happen to look like the unsafe ones in the training data — and the helpfulness regression was invisible because it belonged to a different sprint, a different owner, a different dashboard.

The Retention Policy That Erased Context Your Model Was Still Reading

· 12 min read
Tian Pan
Software Engineer

A nightly retention worker deletes any user message older than thirty days. A long-running enterprise support session, opened in early March, is still active in late May. On the request that comes in at turn 41, your prompt assembler reads from the same messages table the retention worker has been quietly pruning. Turns 1 through 28 are gone. The model receives a conversation that starts at turn 29 with no signal that earlier turns ever existed. The user asks "what was the SLA we agreed on earlier?" and the model confidently invents a number, because the actual answer was in turn 4 — which the retention worker erased the night before.

This is not a model failure. The model did exactly what it was supposed to: produce a plausible answer from the context it was handed. The failure happened upstream, in the gap between two teams that each thought they owned the messages table.

The Retry Your Dashboard Counted Three Different Ways

· 11 min read
Tian Pan
Software Engineer

An agent ran. The plan-step crashed. The tool-call step retried twice with a 500, then succeeded on the fourth attempt. The user got their answer.

How many events was that? Ask product, and it's one — the user got a working result, so the funnel counts a conversion. Ask SRE, and it's three failures plus one success, a 75% error rate on the underlying step. Ask finance, and it's four billable inferences, two retried tool calls, and roughly four times the unit cost product is forecasting against. Each team's dashboard is correct. They are also irreconcilable, and the moment someone tries to reconcile them — usually during an incident review — they will discover the team has been operating against three contradictory pictures of reliability for months.

The Self-Correction Loop That Shared Its Verifier's Blind Spot

· 10 min read
Tian Pan
Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.