Skip to main content

907 posts tagged with "insider"

View all tags

The PII Redactor That Scrubbed the User's Question and Left the Prompt Cache Untouched

· 11 min read
Tian Pan
Software Engineer

A customer audit finds eleven months of verbatim user PII sitting in a Redis cluster nobody on the residency team knew existed. No system was compromised. No attacker got in. The data was written there on purpose, by a service the inference team built and named "prompt cache," as a performance optimization. The redactor on the analytics path worked perfectly the entire time. The redactor was simply not on this path.

The breach is real anyway. Under GDPR, retention beyond the contracted thirty days is enough; the data does not need to have leaked to trigger Article 33 notification obligations. The residency team's inventory listed every log, every warehouse, every queue — and missed the cache because the cache was on the inference team's side of the org chart. The privacy boundary that everyone trusted ran straight down the analytics pipeline and stopped at the wall where the LLM stack began.

The Pinned Dependency Your Security Agent Upgraded Past the Comment It Could Not See

· 10 min read
Tian Pan
Software Engineer

A Spanish customer complained that her annual renewal had been billed a day early. The support ticket bounced through three queues before it landed in front of an engineer who recognized the smell: a date-formatting regression, European cohort only. He ran git log against the date-formatting module and found nothing. The module had not been touched in eleven days. What had been touched, eleven days earlier, was its package.json — a lodash bump from 4.17.20 to 4.17.22, opened by a security agent, approved by the on-call, merged without comment.

Two lines above the version string, in the same file, was a comment written eighteen months ago: // do not upgrade — breaks the snapshot tests in date-formatting, see FRONT-2418. The security agent had not read it. Or, more precisely: the security agent had read the entire file, but its prompt instructed it to find vulnerable version strings, not to weigh the comments around them. The comment was load-bearing institutional knowledge. The agent treated it as scenery.

This is a coordination failure between two systems that did not know they were colliding. The security agent was doing its job. The original engineer who wrote the comment had done his job. The feature-development agent that respected the pin every time it touched the file was doing its job. Nobody had decided whose job it was to mediate between them.

The Prompt Engineer Who Quietly Became Your Only Eval Set Reader

· 8 min read
Tian Pan
Software Engineer

The eval set is a file. It is also, secretly, a theory of what the AI feature is for. The two are not the same thing, and the team that confuses them has built a quality gate whose calibration depends on a single human's working memory. When that human leaves, the file stays and the theory walks out the door.

This is the failure mode you don't see in the org chart. You scoped a prompt engineering role. You hired someone good. They shipped the v1 prompts, looked at the thin benchmark, and rewrote it into something rich — a taxonomy of failure modes, weights per category, a labeling rubric that disambiguates edge cases. The eval set became the contract for "is this model good enough to ship." Six quarters later you discover that the contract is unreadable by anyone except the person who wrote it.

The Prompt Version Your Team Treated as Independent of the Model Version

· 9 min read
Tian Pan
Software Engineer

Your incident timeline says "no deploys in the last 72 hours." Your prompt registry agrees: prompt v37 has been frozen for three weeks. Your eval harness ran clean on Tuesday night. But on Wednesday morning, the structured-output failure rate on one of your agents tripled, the retry budget on another doubled, and a third started cheerfully ignoring an instruction it had been honoring for a month. Nothing changed. Except something did change, and it changed in the place neither side of the org was watching: the model.

The prompt registry knows about prompt versions. The model gateway knows about model versions. Almost nobody, in practice, tracks the pair. And prompt v37 isn't a free-standing artifact — it is, whether your tooling admits it or not, a contract negotiated against one specific model. When the platform team rolls the claude-sonnet-latest alias forward by a single point release, the contract on the other side has been quietly amended, and your incident timeline reads "no deploys" because the deploy happened on someone else's infrastructure under a name that didn't move.

The Provider Auto-Router That Quietly Routed Your Premium Traffic To Haiku

· 10 min read
Tian Pan
Software Engineer

Your platform team adopted the provider's "auto" model identifier for cost reasons. The first dashboard after rollout was hard to argue with: a 34% spend reduction with no measurable quality drop on the weekly eval. Three months later, customer satisfaction on your shortest, highest-volume surface had been sliding for two quarters, and a product-led investigation eventually traced the regression to a model identifier nobody on the engineering team had touched. The code said "auto." The provider had been redefining what "auto" meant the whole time.

The lesson is not that auto-routing is bad. The lesson is that "auto" is a moving target whose distribution drifts with provider economics, and your eval's representativeness is the only check standing between vendor optimization and your product quality. If the eval does not match the traffic, the discount you celebrated is being paid out of a quality slope nobody is reviewing.

The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

· 10 min read
Tian Pan
Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times — once in title case, once in lowercase, once with the punctuation stripped — and you understand that your top-five has secretly been a top-two for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.

The Reserved Capacity Contract That Priced Out Your Overflow When the Provider Redefined the Bucket

· 10 min read
Tian Pan
Software Engineer

A platform team signed a multi-quarter reserved-throughput contract. Fixed per-token rate on committed capacity, a higher overage rate above the ceiling. Finance modeled the burn against six months of historical traffic that rarely crested the limit. The contract said "overflow" meant bytes-per-minute above the committed ceiling, and on that definition the deal was sound.

Six weeks later the bill was up 2.4× with no change to traffic shape, no change to routing config, no change to product surface. The provider had quietly revised the metering definition mid-quarter. "Overflow" now also counted any request the auto-router sent to a model tier above the one the reservation was anchored to — so a single Sonnet selection on a complex prompt landed in the overage bucket even when aggregate throughput sat comfortably inside the committed envelope. Thirty percent of traffic that used to invoice at the reserved rate now invoiced at the overage rate. Finance chased the spike through dashboards for three weeks before someone read the mid-quarter pricing addendum and found the redefinition in a footnote.

The contract had not been broken. The unit it was denominated in had been redenominated.

The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.

The Streaming UI That Committed a Partial Answer Your Model Never Finished

· 10 min read
Tian Pan
Software Engineer

The post-mortem read like a hallucination report. A user had acted on a confidently-worded recommendation that turned out to be wrong in a way the model would not have written if it had finished — except the trace showed the model had not finished. The provider connection dropped at token 412 of an expected 800. The client's error handler logged the failure. The persisted partial message, written to the conversation history as tokens arrived, sat in the user's UI looking exactly like every other complete answer. They acted on it. Support categorized the ticket as a content-quality issue. It took two weeks to route it to the platform team.

Nothing in this chain was a model failure. The model behaved correctly for the 412 tokens it produced. The failure was that the streaming UI and the durable conversation history had quietly disagreed about what counts as a message — and during the exact failure mode that streaming was supposed to make tolerable, the disagreement became the canonical record.

This is the contract between optimistic rendering and durable storage. Most chat products inherit it from a tutorial or a framework without thinking about it as a contract at all, and the gap shows up as a tail of incidents that look like model bugs and aren't.

The Supervisor Agent That Rubber-Stamped Its Subagent Because They Shared a Prompt Template

· 9 min read
Tian Pan
Software Engineer

A team I talked to last month was proud of a number: their supervisor agent approved 97% of its subagents' plans on first review. They read that as "the subagents are competent." A red-team review six weeks later read it as "the supervisor and the subagents are the same evaluator scoring its own output." Both readings fit the data. Only one of them was load-bearing in production.

The supervisor-reviews-subagent pattern is the most common shape multi-agent systems take in 2026 — somewhere around 70% of production deployments, including most of the reference designs the big labs publish. It looks like a check on paper. A planner decomposes the task, specialist workers produce plans, a supervisor reviews each plan before authorizing execution. Separation of concerns, clean audit trail, the works. The problem is that if you build the supervisor and the subagents from the same base prompt template — even with role-specific addenda differing by a paragraph — you have not built a check. You have built a system whose review step is an artifact of the same model agreeing with itself.

The System Prompt Your Screenshare Leaked to a Vendor on a Support Call

· 10 min read
Tian Pan
Software Engineer

Your AI team treats the system prompt as proprietary IP. The deployment pipeline strips it from every customer-readable surface. The runbook for production debugging tells engineers to grep it out of any incident artifact before that artifact leaves the war room. Your last security review caught and closed three different paths the prompt could escape through: an over-verbose API response, a debug header that shipped to the wrong tier, a stack-trace endpoint that interpolated the prompt into its message.

None of that mattered the morning an engineer joined a vendor support call about an unrelated billing dispute, screen-shared their terminal to walk through a stack trace, and the trace included a single verbose log line that printed the fully resolved prompt — every injected variable substituted in, including the customer-specific business rules and the internal model-routing hints. The vendor's support engineer recorded the call as part of their standard support workflow. The recording landed in the vendor's case management system. The prompt was now legibly stored in a third-party SaaS your security review had no contract with, no DPA against, and no audit rights over.