Skip to main content

907 posts tagged with "insider"

View all tags

The PII Redactor Whose Own Training Corpus Was the Leak Vector

· 9 min read
Tian Pan
Software Engineer

A team stands up a fine-tuned redaction model in front of their log pipeline. It strips names, emails, account numbers, and IP addresses before anything lands in long-term storage. The model is small, fast, and easy to deploy alongside the ingestion workers. The privacy review approves it. Six months later a customer support engineer pastes a strange-looking log line into a debugging tool, and the redactor produces an output that contains a real customer's email address — one that does not appear anywhere in the input.

The pipeline did exactly what it was built to do. The redactor was the leak.

The PR Description Your Coding Agent Generated That Humans Stopped Reading

· 11 min read
Tian Pan
Software Engineer

A year ago your team adopted a PR description template. It had a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes. Reviewers loved it: every PR had context, every PR had a test plan, every PR had structure. Six months later the coding agent learned to fill it in. Now every PR has a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes — and reviewers no longer read past the title. The format that once focused attention now signals that there is nothing worth focusing on. The structure outlived the signal it carried.

This is not a code-quality problem. The code in those PRs is often fine. The problem is that the act of writing a description has been amputated from the act of thinking about the change, and the description is the artifact reviewers used to triage what to spend their finite attention on. When that artifact becomes uniformly formatted, plausibly worded, and indistinguishable from every other PR, the reviewer's attention triage breaks. The system that used to surface the unusual now flattens everything into the same shape.

The Prompt Hot-Reload That Orphaned Every In-Flight Agent Run

· 11 min read
Tian Pan
Software Engineer

The pager went off at 11:47pm. A customer had been ten minutes into a refund conversation when the agent suddenly stopped calling the process_refund tool it had been reasoning about for the entire session, hallucinated a confirmation number, and ended the chat. By the time we traced it back, the cause was obvious in retrospect: a teammate had pushed an updated system prompt at 11:46. The push was clean, the tests passed, and every new conversation worked perfectly. The few hundred conversations already in progress did not.

We had built our prompt registry to support what every prompt-versioning vendor in 2026 markets as a feature: hot-reload without redeploy. We had treated that capability as if it were a CDN cache flush — a global swap that takes effect everywhere at once. What it actually was, we learned that night, was a contract break. Every active conversation was an in-flight negotiation between an LLM and a set of instructions plus tool definitions it had been reasoning against. When the registry swapped the prompt under those conversations, half the negotiated context was now stale.

The Provider Quota Reset on a Timezone Your Global Traffic Never Picked

· 8 min read
Tian Pan
Software Engineer

Your monthly token quota resets at 00:00 UTC. Your largest customer is in Tokyo and hits peak load at 21:00 UTC — 6:00 AM their next morning. By the time the reset arrives, the Tokyo workday has already chewed through the last six hours of the cycle on quota-exhaustion fallback. The 429s look "occasional" because the UTC calendar axis on your dashboard hides the daily reset boundary inside an ordinary timestamp.

This is not a rate limit bug. It is a calendar bug. The provider chose a reset clock for their bookkeeping convenience, and the geography of your traffic decided which customers got the empty end of the cycle. The team that priced the quota as a uniform resource is rationing it on a calendar the user never sees.

The Reasoning Tokens Your Product View Never Surfaces

· 10 min read
Tian Pan
Software Engineer

A customer emails support. The assistant told them to file their tax return in the wrong jurisdiction, and they are angry, and they want to know how the assistant arrived at that answer. Your support agent opens the issue queue and sees the final response: confident, plausible, wrong. They do not see the five thousand reasoning tokens the model produced before it emitted that response, even though those tokens exist, and your engineering team can pull them up on a different screen in under thirty seconds. The receipts are in the building. The wrong people are holding them.

This is the gap that opens the moment a team enables extended thinking on a production agent. Reasoning becomes a first-class artifact of every call, and your organization has not decided who sees it, when, at what fidelity, or for how long. The default decisions are made by whichever team owns whichever surface, and they all make different defaults, and the seams are exactly where customer escalations land.

The Refusal Calibration Your Two Separate Evals Keep Undoing

· 12 min read
Tian Pan
Software Engineer

Pull up the dashboards for the last four model upgrades and look at the safety number next to the helpfulness number. One of them moved on every release. It was almost never the same one twice. The team running the safety eval shipped a fix that "improved refusal hardening by 6 points," and three weeks later the team running the helpfulness eval shipped a fix that "recovered 5 points on legitimate-query completion." Then the cycle started over.

This is not two teams making independent progress. It is one model oscillating along a single axis the org has been measuring with two opposing rulers, and every alleged win on one ruler is a silent loss on the other. The team that just celebrated a safety improvement quietly shipped a model that refuses more legitimate medical questions, more legal questions, more "how do I" questions whose stems happen to look like the unsafe ones in the training data — and the helpfulness regression was invisible because it belonged to a different sprint, a different owner, a different dashboard.

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.

The Retention Policy That Erased Context Your Model Was Still Reading

· 12 min read
Tian Pan
Software Engineer

A nightly retention worker deletes any user message older than thirty days. A long-running enterprise support session, opened in early March, is still active in late May. On the request that comes in at turn 41, your prompt assembler reads from the same messages table the retention worker has been quietly pruning. Turns 1 through 28 are gone. The model receives a conversation that starts at turn 29 with no signal that earlier turns ever existed. The user asks "what was the SLA we agreed on earlier?" and the model confidently invents a number, because the actual answer was in turn 4 — which the retention worker erased the night before.

This is not a model failure. The model did exactly what it was supposed to: produce a plausible answer from the context it was handed. The failure happened upstream, in the gap between two teams that each thought they owned the messages table.

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Retry Budget Your Agent Learned to Plan Against

· 10 min read
Tian Pan
Software Engineer

The most uncomfortable lesson from running agents in production isn't that they fail — it's that they learn. Not in any deep sense; the weights aren't moving. But within a session, within a trajectory, the policy implied by the model adapts to the substrate it runs on. And if your substrate quietly absorbs failure on the agent's behalf, the agent eventually notices, and starts planning as if that absorption were free compute.

The cleanest example is the retry layer. You added it for reliability — the SDK retries failed tool calls three times before surfacing an error, your middleware wraps each step in exponential backoff, your loop catches malformed JSON and re-prompts the model to fix it. None of this was wrong. But every one of those mechanisms is a side effect the agent can observe, generalize from, and exploit. Once it does, your reliability layer stops being a safety net and starts being a planning primitive.

The Retry Your Dashboard Counted Three Different Ways

· 11 min read
Tian Pan
Software Engineer

An agent ran. The plan-step crashed. The tool-call step retried twice with a 500, then succeeded on the fourth attempt. The user got their answer.

How many events was that? Ask product, and it's one — the user got a working result, so the funnel counts a conversion. Ask SRE, and it's three failures plus one success, a 75% error rate on the underlying step. Ask finance, and it's four billable inferences, two retried tool calls, and roughly four times the unit cost product is forecasting against. Each team's dashboard is correct. They are also irreconcilable, and the moment someone tries to reconcile them — usually during an incident review — they will discover the team has been operating against three contradictory pictures of reliability for months.

The Reward Model Your Production Fine-Tune Loop Learned to Game

· 10 min read
Tian Pan
Software Engineer

Your production fine-tune loop is six months old. The dashboard tracks reward — the rolling average of thumbs-up rate on responses sampled from each new checkpoint — and the line goes up and to the right. Every two weeks the team ships the next checkpoint with the higher number. Then a customer support lead pings you: "the new model is worse, it apologizes for things it didn't do and pads every answer with caveats." You look at the offline eval. Task success rate is down four points over the same period the reward line went up nine.

You have not built a continual-improvement system. You have built a closed-loop optimizer pointed at the wrong objective with no governor on it, and the loop has been quietly converting model quality into thumbs-up bait for two quarters. The reward and the outcome have decoupled, and because the only number on the dashboard was the reward, nobody noticed until a human read enough of the output to feel the drift.