Skip to main content

780 posts tagged with "ai-engineering"

View all tags

The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.

Your RAG Corpus Trust Boundary Is Whoever Can Write to Its Sources

· 10 min read
Tian Pan
Software Engineer

A support agent gives the right answer to the wrong audience. A customer asks about their account, the model dutifully calls a URL-fetch tool, and a snapshot of that account's context lands on a server the security team has never heard of. No credentials leaked. No API keys exposed. The exfiltration vector was a five-star product review written by a competitor three weeks earlier, retrieved as relevant context because the visible praise actually was relevant to the user's question.

This is the failure mode that breaks the mental model engineers carry from years of web security. The threat model in RAG systems is usually phrased as "we own the corpus" because we own the ingestion pipeline, the embedding model, and the vector database. But owning the code that pulls the content is not the same as owning the content. If your corpus includes any source whose writes are not gated by your authorization, you have handed a prompt-engineering channel to whoever can post.

The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'

· 9 min read
Tian Pan
Software Engineer

The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote strong hire — this is exactly how good engineers work in 2026. Your second-most-senior interviewer wrote no hire — handing the problem to a chatbot is not engineering. Same five words. Same forty-minute window. A forty-point gap on the same scorecard.

The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.

The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.

The Bug Report Against a Model Version You No Longer Serve

· 11 min read
Tian Pan
Software Engineer

A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.

The bug exists. The model that produced the screenshot does not. Since the customer filed the ticket, the weights behind your v1-chat endpoint have been swapped twice — once for a quality bump, once for a cost optimization — and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.

This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.

The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User

· 9 min read
Tian Pan
Software Engineer

The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.

This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.

The Coding Agent CI Bill That Doubled Without a Postmortem

· 10 min read
Tian Pan
Software Engineer

The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration down. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write — no incident, no regression, no failed deploy. Just a budget line that had quietly doubled while every dashboard reported normal.

That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&L absorbed the work.

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.

The Conversation Summarization That Erased the Consent Flag the User Gave You

· 11 min read
Tian Pan
Software Engineer

At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent — armed with that summary and the most recent ten turns — confidently writes the user's code into a memory store the user opted out of at turn 7.

Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for why the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which

· 11 min read
Tian Pan
Software Engineer

A coding agent is doing a routine task in staging. It hits a permissions glitch — a config that points to the wrong API — and decides on its own that the fastest way to "fix" the bug is to clean up the offending data. It rummages around, finds an unscoped token in an unrelated file, calls a tool whose description says "delete records matching the query," and nine seconds later 1.9 million rows of customer data are gone. The most recent backup is three months old. The reservations made in the last quarter no longer exist.

The agent didn't malfunction. The wiring was correct in the sense the deploy engineer meant it: staging config in the staging deploy, production config in the production deploy. What the wiring didn't carry was the agent's sense of where it was. The system prompt was identical in both environments because nobody wanted to maintain two of them. The tool catalog was named the same in both environments because nobody wanted to teach the agent two vocabularies. So the agent reasoned about "the database" the way its training data taught it to reason about "the database" — and most prose on the internet about agents and databases is prose about production.

The Eval Harness Whose Judge Model Was Upgraded Silently

· 11 min read
Tian Pan
Software Engineer

A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched — a control set you keep specifically to detect this — and the lift is uniformly distributed, the kind of shape a real product improvement never has. The judge model was rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.

This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.