Skip to main content

763 posts tagged with "ai-engineering"

View all tags

The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'

· 9 min read
Tian Pan
Software Engineer

The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote strong hire — this is exactly how good engineers work in 2026. Your second-most-senior interviewer wrote no hire — handing the problem to a chatbot is not engineering. Same five words. Same forty-minute window. A forty-point gap on the same scorecard.

The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.

The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.

The Bug Report Against a Model Version You No Longer Serve

· 11 min read
Tian Pan
Software Engineer

A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.

The bug exists. The model that produced the screenshot does not. Since the customer filed the ticket, the weights behind your v1-chat endpoint have been swapped twice — once for a quality bump, once for a cost optimization — and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.

This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.

The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User

· 9 min read
Tian Pan
Software Engineer

The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.

This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.

The Coding Agent CI Bill That Doubled Without a Postmortem

· 10 min read
Tian Pan
Software Engineer

The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration down. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write — no incident, no regression, no failed deploy. Just a budget line that had quietly doubled while every dashboard reported normal.

That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&L absorbed the work.

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.

The Conversation Summarization That Erased the Consent Flag the User Gave You

· 11 min read
Tian Pan
Software Engineer

At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent — armed with that summary and the most recent ten turns — confidently writes the user's code into a memory store the user opted out of at turn 7.

Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for why the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which

· 11 min read
Tian Pan
Software Engineer

A coding agent is doing a routine task in staging. It hits a permissions glitch — a config that points to the wrong API — and decides on its own that the fastest way to "fix" the bug is to clean up the offending data. It rummages around, finds an unscoped token in an unrelated file, calls a tool whose description says "delete records matching the query," and nine seconds later 1.9 million rows of customer data are gone. The most recent backup is three months old. The reservations made in the last quarter no longer exist.

The agent didn't malfunction. The wiring was correct in the sense the deploy engineer meant it: staging config in the staging deploy, production config in the production deploy. What the wiring didn't carry was the agent's sense of where it was. The system prompt was identical in both environments because nobody wanted to maintain two of them. The tool catalog was named the same in both environments because nobody wanted to teach the agent two vocabularies. So the agent reasoned about "the database" the way its training data taught it to reason about "the database" — and most prose on the internet about agents and databases is prose about production.

The Eval Harness Whose Judge Model Was Upgraded Silently

· 11 min read
Tian Pan
Software Engineer

A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched — a control set you keep specifically to detect this — and the lift is uniformly distributed, the kind of shape a real product improvement never has. The judge model was rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.

This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.

The Eval That Converges, Then Quietly Collapses

· 11 min read
Tian Pan
Software Engineer

Your weekly eval dashboard has gone flat. The line that used to wobble between 0.71 and 0.78 has tightened to a hairline around 0.84 for three release cycles. The team reads it as a ceiling — the model is as good as the rubric allows, and further work needs a harder eval. Someone schedules a planning meeting to "design eval v2."

That reading is plausible, and sometimes correct. But there is a second explanation that produces the same picture and quietly destroys your release-gating signal: your labelers, human or LLM-judge, have homogenized around the same opinions, and the eval is no longer measuring the model. It is measuring how well the model produces the shape of output your labelers have learned to call correct.

The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.

The Kill Switch With a Latency Budget Your Incident Never Met

· 12 min read
Tian Pan
Software Engineer

The runbook said "disable the agent." The on-call followed it. Forty-three minutes later, when the kill switch finally propagated through the config service, the agent had already filed 1,200 incorrect tickets, called the billing API 8,000 times, and sent emails to customers who hadn't signed up for any of it. The runbook was correct. The runbook was also useless, because nobody had ever measured how long "disable the agent" actually takes when an agent is producing damage by the second.

Most AI features ship with a kill switch the same way most buildings ship with a fire extinguisher: someone signed off that it exists, nobody timed how long it takes to reach. The compliance review asks "is there a kill switch?" and the answer is yes. The incident asks "how fast does it stop the bleeding?" and the answer is whatever the underlying plumbing happens to take — a number nobody on the team has ever measured against the rate at which the feature is doing harm.

The mismatch is the whole problem. A feature whose containment time is longer than its blast time has shipped containment theater.