763 posts tagged with "ai-engineering"

The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It

· 11 min read
Tian Pan
Software Engineer

Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.

Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.
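One common way to make that kind of leak detectable is a canary string: an unguessable marker that travels with every eval case, so that finding it in a scraped corpus, a log, or a model's output is evidence the set escaped. The following is a minimal sketch of the idea, not the post's method. The function names, the `canary` field, the secret, and the eval set name are all hypothetical.

```python
import hashlib
import hmac


def canary_token(secret: bytes, eval_set_id: str) -> str:
    """Derive a stable, unguessable marker for one eval set (hypothetical scheme)."""
    digest = hmac.new(secret, eval_set_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"canary-{digest[:32]}"


def tag_case(case: dict, token: str) -> dict:
    """Return a copy of an eval case with the canary stored alongside it."""
    return {**case, "canary": token}


def contains_canary(text: str, token: str) -> bool:
    """True if the marker appears verbatim in scraped text, logs, or model output."""
    return token in text


SECRET = b"replace-with-a-real-secret"  # assumption: stored outside the repo
TOKEN = canary_token(SECRET, "refund-evals-v1")  # hypothetical eval set name
case = tag_case({"prompt": "Can I get a refund after 30 days?"}, TOKEN)
```

A marker in a metadata field only helps when the whole record leaks. A case pasted into a ticket without its metadata would need the marker embedded in the case text itself.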

The Coding Agent That Passes Locally and Fails in CI

· 11 min read
Tian Pan
Software Engineer

The agent's diff was green on your laptop. Tests passed, lint passed, the dev server hot-reloaded clean. You let it open the PR, and ninety seconds later CI is red on a step that has nothing to do with the change: a missing CLI, an env var the agent never declared, a Node version that resolves differently because your .nvmrc resolves through a global shim that the runner does not have. The agent did not write a broken diff. It wrote a diff that depends on your machine, and your machine and the runner are not the same computer.

"Works on my machine" was a human bug. The fix was discipline — pin versions, write Dockerfiles, read the CI logs. Coding agents inherited the bug at scale and removed the discipline that used to compensate, because the agent does not know which of the things it relied on came from the repo and which came from the warm sediment of your shell history. Every developer's laptop is a uniquely configured environment that the agent absorbs without naming. Then the same agent runs in a runner that is none of those things, and the failure mode looks like the agent's fault when it is actually an environmental contract that nobody wrote down.

The Idiom Your Coding Agent Wrote Around Instead Of Using

· 11 min read
Tian Pan
Software Engineer

A senior engineer on a payments team I work with told me a story that I think every team running coding agents will eventually live through. Their codebase has a Result<T, E> wrapper — homegrown, sits in a single core/result.ts file, used in roughly two hundred call sites across the service. New code is expected to thread Result through every function that can fail; throwing is reserved for genuinely unexpected states. It's not enforced by a lint rule. It is the dialect.

Six months into shipping with a coding agent, they audited the diffs the agent had merged. About a third of the new functions ignored Result entirely. The agent had reached for try/catch, returned T | null, thrown Error subclasses with descriptive messages — every one of those choices is correct in some imagined codebase. None of them was correct in this one. The code typechecked. The tests passed. Reviewers approved it because nothing in it looked wrong line by line. But the file the agent touched no longer fit the file it lived next to, and the team had quietly grown a second dialect inside their own service.

This is the failure mode I want to talk about: not bugs, not hallucinations, not lint violations — idiomatic drift. The agent ships code that compiles, runs, and passes tests, in a style your codebase does not speak. Over enough merges, the codebase bifurcates into agent-style zones and human-style zones, and the cost shows up in places no dashboard is watching.

The Inner Loop Your Coding Agent Quietly Broke

· 8 min read
Tian Pan
Software Engineer

The productivity claim around coding agents is that they remove the typing bottleneck. The bottleneck the engineer actually hits in practice is different. The engineer can no longer hold the system in their head, because the agent is editing files faster than the engineer can read them, writing tests faster than the engineer can reason about coverage, and refactoring abstractions faster than the engineer can verify they still type-check at the design level rather than just the compiler level.

The tight inner loop — hypothesize, change, observe, refine — that defines competent engineering quietly collapses into a different loop. The engineer is now reviewing agent output rather than building intuition about the system. A METR randomized controlled trial from mid-2025 found experienced open-source developers were 19% slower on familiar codebases when using AI assistants, while reporting they felt 20% faster. The 39-point gap between perceived and actual productivity is not a measurement error. It is the sound of comprehension being silently traded for throughput.

The Process Your Agent Quietly Owns Without Documentation

· 10 min read
Tian Pan
Software Engineer

Six months ago, your team shipped a support agent that handles refunds. There was a one-page Notion doc describing what it should do. Today, the doc still says what it said, but the agent does not. The prompt has 47 edits in its history. Three tools were added — one of them quietly bypasses a finance check that the doc still asserts exists. The model was swapped twice. A retry policy was hardened after an incident nobody wrote up. And when somebody on the data team asks "what are the actual rules for issuing a refund here," the honest answer is: read the system prompt and the tool registry, because that is the spec now.

This is the quiet failure mode of agentic systems in production: the agent's behavior IS the runbook nobody wrote. The prompt got treated as a configuration value — a string in a YAML file, edited by whoever owned the feature, reviewed like a copy change — when it was actually the most authoritative description of a multi-step business process in the company. The org accumulated process logic the way legacy codebases accumulate behavior: through edits, not design. And the people who would historically own that process — a product manager, a compliance lead, an ops director — never realized they had lost the artifact, because there was never a document to lose.

The Synthetic Eval Your Real Users Never Resemble

· 10 min read
Tian Pan
Software Engineer

There is a class of eval failure that no dashboard catches because it shows up as success. The score climbs week over week. The judge agrees with the answer. The regression tests stay green. Meanwhile, the support team is logging a slow drift in user-reported quality, sales is hearing "it doesn't quite get what I meant," and nobody in engineering can reproduce the complaint because every example anyone tries on the eval set passes. The eval and the users live in different distributions, and the eval is the more polished of the two.

The mechanism is simple, and it hides in plain sight: the model that wrote your eval prompts and the model under test are siblings, and siblings share priors. They smooth the same edges, prefer the same phrasings, leave out the same kinds of malformed input. The eval certifies behavior on a world the generator imagined users have. Your actual users live somewhere else.

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.
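One way to narrow that gap, sketched here under assumptions, is a settle window: the poller skips any page that changed recently and keeps the previously indexed version until the page has been quiet for a while. The `Page` type, the `last_modified` field, and the ten-minute window are hypothetical, and this reduces dirty reads rather than eliminating them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    page_id: str
    last_modified: float  # seconds since epoch, as reported by the wiki (assumed field)


def settled_pages(pages: list[Page], now: float, quiet_seconds: float) -> list[Page]:
    """Keep only pages that have not changed for at least quiet_seconds."""
    return [p for p in pages if now - p.last_modified >= quiet_seconds]


# The ingest fires while an edit is in progress: the onboarding page changed
# 60 seconds ago, the runbook has been untouched for 15 minutes.
pages = [Page("onboarding", 940.0), Page("runbook", 100.0)]
to_ingest = settled_pages(pages, now=1000.0, quiet_seconds=600.0)
```

Pages that are filtered out stay in the index in their last settled state, so a multi-page edit is picked up only after the writer has stopped touching those pages.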

Why Your Agent Works in Dev and Panics in Prod

· 10 min read
Tian Pan
Software Engineer

The agent demo always works. Three customers in the table, one matching record, twelve documents in the vector index, an empty calendar with infinite open slots. The agent picks the right row, retrieves the right document, books the right meeting. Ship it.

Then production hands the same agent ten million customers with three "John Smith"s in the same city, a filter that returns four thousand rows because the agent confidently wrote status != 'closed' when it meant status = 'active', a vector query that returns seven plausible documents the agent has never had to choose between, and a calendar where every slot is a negotiation. The capability that looked correct in dev is qualitatively different in prod — not slightly worse, not flakier, but solving a different problem the dev environment never made it solve.

This is the gap that "it worked locally" hides. For deterministic code, that phrase is already a lie about edge cases. For agents, it is a stronger lie, because the agent's behavior is a function of input distribution, and the input distribution shifts from "trivial" to "ambiguous" the moment you cross the prod boundary.
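The filter mix-up above is easy to reproduce. The sketch below uses an in-memory SQLite table with made-up rows and status values to show how the two predicates diverge once the data contains more than two states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ada", "active"),
        ("Bo", "pending"),
        ("Cy", "suspended"),
        ("Di", "closed"),
        ("Ed", None),  # status never set
    ],
)

# What the agent wrote: everything that is not closed.
not_closed = conn.execute(
    "SELECT name FROM customers WHERE status != 'closed' ORDER BY name"
).fetchall()

# What it meant: only active customers.
active = conn.execute(
    "SELECT name FROM customers WHERE status = 'active' ORDER BY name"
).fetchall()
```

With a dev table that holds only active and closed rows, both queries return the same result, which is why the mistake survives until production data adds the other states. The row with a NULL status matches neither predicate.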

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win over the judge was to sound calibrated, and calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions where users needed an unambiguous answer.

The Marketing Page Your RAG Cited as an Engineering Spec

· 9 min read
Tian Pan
Software Engineer

A support engineer pastes a customer ticket into your internal assistant. The question is sharp: "Does our API support multi-region writes on the free tier?" The assistant comes back instantly, citing a chunk it retrieved with 0.91 cosine similarity. The answer is yes. The chunk is from a landing page written by marketing in 2023 to win a head-to-head against a competitor. Engineering removed multi-region writes from the free tier eighteen months ago and posted a terse internal RFC that nobody linked from a customer-facing page. The RFC is also in the vector store. It scored 0.74.

The assistant didn't hallucinate. It retrieved the highest-scoring document and faithfully grounded its answer in the text. The retriever did its job. The job was wrong.
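One possible mitigation is to rank on more than similarity. The sketch below multiplies each chunk's cosine score by an authority weight for its source type. The similarity values come from the example above, while the `Chunk` type, the source-type labels, and the weights are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights: how much each source type is trusted for technical facts.
AUTHORITY = {"engineering_rfc": 1.0, "marketing_page": 0.5}


@dataclass(frozen=True)
class Chunk:
    source_type: str
    similarity: float  # cosine similarity from the retriever


def rerank(chunks: list[Chunk]) -> list[Chunk]:
    """Order chunks by similarity scaled by the authority of their source type."""
    return sorted(
        chunks,
        key=lambda c: c.similarity * AUTHORITY.get(c.source_type, 0.0),
        reverse=True,
    )


chunks = [Chunk("marketing_page", 0.91), Chunk("engineering_rfc", 0.74)]
ranked = rerank(chunks)
```

With these weights the RFC scores 0.74 against the landing page's 0.455 and ranks first. Unknown source types get a weight of zero here, which is a deliberately conservative choice for a sketch.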

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.
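One way to give that layer a name is to make tool results say which of the two they are. The sketch below is an illustration rather than any framework's API: the `Ref` and `Value` types, the `resolve` helper, and the sample document are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ref:
    """A pointer: names content the agent has not read yet."""
    kind: str    # e.g. "document_id" or "file_path" (hypothetical labels)
    target: str


@dataclass(frozen=True)
class Value:
    """The content itself, safe to reason about."""
    content: str


ToolResult = Union[Ref, Value]


def resolve(result: ToolResult, fetch: Callable[[Ref], str]) -> Value:
    """Dereference a Ref before the model reasons about it; pass Values through."""
    if isinstance(result, Ref):
        return Value(content=fetch(result))
    return result


# Stand-in document store with made-up content.
DOCS = {"doc-17": "Rotate credentials every 90 days."}
resolved = resolve(Ref(kind="document_id", target="doc-17"), lambda r: DOCS[r.target])
```

Because the distinction lives in the type, a trace can record whether the model was handed a `Ref` or a `Value`, which is the information the log was missing.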

The PR-Bot That Never Sleeps: When Your Reviewers Become the Rate Limiter

· 11 min read
Tian Pan
Software Engineer

For two decades the bottleneck in software engineering was writing code. We optimized IDEs, autocompletion, refactoring tools, and frameworks to make typing cheaper. We won. Now the bottleneck has moved one step downstream: writing is cheap, and reading is expensive. The PR-bot can spin up ten implementation attempts in parallel and open ten pull requests against your repo before you finish your morning coffee. Your reviewers cannot.

The rate limiter for AI-assisted software delivery is no longer the model's tokens per second. It is the number of human eyes you can put on a diff per day. And when those eyes get overwhelmed, you do not get a graceful degradation — you get rubber stamps. Code merges with LGTM 🚀 on top of code that nobody actually read. A senior engineer approves an AI-written patch that another AI tool already reviewed, and three weeks later a data-inconsistency bug eats forty hours of someone's life. Surface correctness is not systemic correctness, and a green pipeline is not understanding.