Skip to main content

780 posts tagged with "ai-engineering"

View all tags

The Codebase Index Your Coding Agent Rebuilt From a Checkout Three Weeks Behind Main

· 10 min read
Tian Pan
Software Engineer

A coding agent on your team opens a pull request that calls parseUserToken() four times across two files. The function does not exist in the repository, has not existed for nineteen days, and was replaced by decodeSessionClaim() in a commit your engineers all remember reviewing. The agent did not invent the name. It read the name from its semantic index — a vector store rebuilt from a working copy that was twenty-one days behind main. The agent's edit step, by contrast, ran git pull at session start and operated on fresh code. Two views of the same repository, three weeks apart, and the agent confidently bridged them with code that does not compile against anything real.

This is the failure mode that doesn't announce itself. The agent ran. The tests appeared to pass. The PR landed. The first reviewer noticed only because a stubbed-out function shared a name with an unrelated helper and tripped the linter. By then the agent had spent a full sprint writing against a phantom version of the codebase, and no one on the team — including the agent — had any signal that something was wrong.

The Data Residency Contract Your Provider Honored at the API Boundary and Broke at the Cache

· 9 min read
Tian Pan
Software Engineer

Your residency audit traced every outbound request from the tenant's traffic, watched it terminate on a hostname in Frankfurt, and signed off. The audit was correct about everything it measured. It was also looking at the wrong layer. The request went to the EU. The bytes that satisfied the request — the cached prefix the provider hashed and pulled from the nearest available node — lived in us-east-1. Your regional endpoint promised you a destination. The cache promised nothing, because the cache was a different product, governed by a different SLA, designed for cost rather than for compliance.

The customer's auditor caught it. Not yours. A different vendor's incident report mentioned that prompt cache placement was decoupled from inference region, and the customer's GRC team asked the obvious follow-up question: where do our prefixes go? The contract amendment to close the gap took ninety days. The renewal got suspended. The team that wrote the integration had done nothing wrong by the documentation they were handed.

The Debug Logger That Put Your System Prompt in a Customer-Readable Audit Feed

· 10 min read
Tian Pan
Software Engineer

A security-conscious customer pulled their tenant's audit export, opened the JSON, and read the verbatim refusal policy, retrieval pipeline structure, and a handful of internal product identifiers from a field called llm.request.system. No exploit. No prompt injection. No jailbreak. Just a log field your platform team added six months earlier so engineers could correlate prompt versions with incidents — surfaced through a feed your enterprise team had separately opened to tenants for SOC 2 reasons.

The disclosure happened during a normal Wednesday afternoon. Your security team got paged by the customer, not by an alert. The incident timeline doesn't show a deploy on the day of the leak — the misconfiguration shipped on the day the audit feed expanded its field allowlist, which was a different team, a different sprint, and a different ticket. Both reviewers signed off on what they were looking at. Neither was looking at the composition.

The Eval Rubric That Weighted Tone Next to Correctness and Quietly Selected Against Being Right

· 10 min read
Tian Pan
Software Engineer

Your judge prompt scored four axes on a 1–5 scale: helpfulness, clarity, empathy, accuracy. You averaged them. Your weekly dashboard trended up for six months. Your support queue trended in the opposite direction the whole time, and nobody connected the two until a customer escalation forced a manual audit and you discovered the model had learned a posture your product could not afford.

The posture was hedged wrongness. A softened wrong answer — "there are a few ways to think about this, one common view is X" where X is incorrect — scored 4.2 on your composite. A blunt correct answer — "no, X is wrong, the answer is Y" — scored 3.8. The judge wasn't broken. The rubric wasn't obviously broken either. Each axis was defensible in isolation. The aggregation was the bug.

The Eval Set Your Prompt Engineers Turned Into Production Few-Shots

· 11 min read
Tian Pan
Software Engineer

The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.

The forensic was quick once someone thought to run it. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples — the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer — and over a few sprints they had copy-pasted the strongest of them as few-shot demonstrations into the production system prompt. The dashboard kept going up because the model was being graded on inputs it had seen verbatim at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.

The Few-Shot Example Your Model Treated as Binding Precedent

· 10 min read
Tian Pan
Software Engineer

A user submits a question. Your model produces an answer that is confidently wrong in a very specific way: the format is perfect, the reasoning is well-structured, and a particular qualifier — one that does not apply to this question at all — appears in exactly the place a similar qualifier appeared in example three of your system prompt. Not a hallucination. Not a prompt injection. The model did precisely what the examples taught it to do, on a question those examples were never meant to cover.

This is the failure mode that few-shot prompting actively encourages and that most eval suites are structurally blind to. Your examples are not neutral demonstrations of "what good looks like." They are case law. The model selects the closest match by surface tokens and applies the precedent — including its constraints — to whatever case is in front of it.

The KV Cache Eviction Your Provider Called Cache Pressure and Your Bill Called a Doubled Prefix Charge

· 11 min read
Tian Pan
Software Engineer

Your application opens a long conversation with a forty-thousand-token system prompt and a full tool inventory. Turn 1 pays the prefix at write rates and the provider's KV cache warms up. Turn 2 arrives ninety seconds later. You assume it's a cache hit. Sometimes it is. Sometimes the same forty thousand tokens land on your invoice again at uncached prices, and nothing in your code changed between turn 1 and turn 2.

The thing that changed was somebody else's traffic. The KV cache is shared infrastructure. Your tenant was colocated on a serving node that, in the ninety seconds between your two turns, took on enough other tenants to evict your prefix from memory. The provider's dashboard will describe this as "cache pressure." Your finance team will describe it as a line item that doubled. Both descriptions are accurate. Neither is in your code.

The OAuth Scope Your Agent Inherited When On-Behalf-Of Quietly Became Act-As

· 10 min read
Tian Pan
Software Engineer

The security review said the agent acts "on behalf of" the user. The OAuth token said something else, and the audit log agreed with the token.

A small distinction in language did a lot of architectural work nobody noticed. "On behalf of" is the language a security review reaches for when it wants to capture an arrangement where the agent is a delegate, recognizable as a delegate, and constrained by being a delegate. "Act as" is the runtime behavior when the agent holds a token indistinguishable from the user's own and is therefore the user as far as every downstream system can tell. These two phrases describe completely different threat models. A typical enterprise OAuth integration ships the second one and prices it as the first.

The Presigned URL That Expired Before Your User Could Verify the Multimodal Model's Claim

· 10 min read
Tian Pan
Software Engineer

A user opens yesterday's conversation. Next to their support agent's reply sits a broken-image placeholder where their uploaded receipt used to be. The reply confidently quotes "the charge of $47.32 dated March 14 at the merchant Coffee Tribunal." The user has no way to check whether that quote is accurate, because the evidence the model worked from is now a 403 from your object store. They file a hallucination ticket. Your eval suite did not catch it because the model was, at the time of the call, exactly right.

This is a story about retention mismatch, not about model quality. Your transcript outlived its grounding. The grounding was a presigned URL with a fifteen-minute clock, and the claim about the grounding is text that will sit in your database for years. When the asset clock and the claim clock run at different speeds, every correctly-grounded multimodal answer eventually looks like fabrication to whoever revisits it.

The Prompt Engineer Who Quietly Became Your Only Eval Set Reader

· 8 min read
Tian Pan
Software Engineer

The eval set is a file. It is also, secretly, a theory of what the AI feature is for. The two are not the same thing, and the team that confuses them has built a quality gate whose calibration depends on a single human's working memory. When that human leaves, the file stays and the theory walks out the door.

This is the failure mode you don't see in the org chart. You scoped a prompt engineering role. You hired someone good. They shipped the v1 prompts, looked at the thin benchmark, and rewrote it into something rich — a taxonomy of failure modes, weights per category, a labeling rubric that disambiguates edge cases. The eval set became the contract for "is this model good enough to ship." Six quarters later you discover that the contract is unreadable by anyone except the person who wrote it.

The Pull Request Your Coding Agent Opened That Closed a Real One

· 11 min read
Tian Pan
Software Engineer

Your coding agent opened a pull request at 3:14 on a Tuesday afternoon. The PR description was clean, the diff was small, the CI was green. It got squash-merged twenty minutes later. The teammate who came back from lunch at 1:20 the next day saw a notification: "PR #1247 was closed." Not merged. Closed. The branch was gone. The seventy-two review comments she'd left on it the previous week were gone too — collapsed under an "outdated" label on a PR that no longer existed in any active list. A senior engineer's design decisions, two rounds of back-and-forth with the security reviewer, and a careful migration plan that took a week to negotiate, all vanished into a footnote on a different PR that nobody had read closely. The squash commit's only trace of what happened was a one-line tag at the bottom: Closed by #1893.

This is the failure mode of trusting a coding agent to write its own pull request metadata. Not the code — the metadata. The diff was fine. The agent did good work. What it could not do was distinguish a fresh discussion from a stale one, and GitHub's auto-close machinery treats every closing keyword the agent writes as a load-bearing instruction. Your agent reads the comments to gather context, infers from a six-month-old reply that its work supersedes an older PR, writes Closes #1247 in the description it generates, and the merge does the rest — silently, mechanically, irrevocably from the perspective of anyone who wasn't watching the diff at the moment of squash.

The Reserved Capacity Contract That Priced Out Your Overflow When the Provider Redefined the Bucket

· 10 min read
Tian Pan
Software Engineer

A platform team signed a multi-quarter reserved-throughput contract. Fixed per-token rate on committed capacity, a higher overage rate above the ceiling. Finance modeled the burn against six months of historical traffic that rarely crested the limit. The contract said "overflow" meant bytes-per-minute above the committed ceiling, and on that definition the deal was sound.

Six weeks later the bill was up 2.4× with no change to traffic shape, no change to routing config, no change to product surface. The provider had quietly revised the metering definition mid-quarter. "Overflow" now also counted any request the auto-router sent to a model tier above the one the reservation was anchored to — so a single Sonnet selection on a complex prompt landed in the overage bucket even when aggregate throughput sat comfortably inside the committed envelope. Thirty percent of traffic that used to invoice at the reserved rate now invoiced at the overage rate. Finance chased the spike through dashboards for three weeks before someone read the mid-quarter pricing addendum and found the redefinition in a footnote.

The contract had not been broken. The unit it was denominated in had been redenominated.