763 posts tagged with "ai-engineering"

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.
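One way to picture "designing it out" is a loop that carries its own answer to "when are you done?" The sketch below is illustrative only: `StopPolicy`, `run_agent`, and the idea of describing each step by the paths it would touch are hypothetical names and simplifications, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class StopPolicy:
    """Explicit stopping rules for an autonomous loop (hypothetical shape)."""
    max_steps: int = 20
    allowed_paths: frozenset = frozenset()


def run_agent(
    proposed_steps: Iterable[Sequence[str]],
    goal_met: Callable[[List[Sequence[str]]], bool],
    policy: StopPolicy,
) -> Tuple[str, List[Sequence[str]]]:
    """Apply steps until the goal is met or a stopping rule fires.

    Each step is modelled as the list of file paths it would modify.
    Returns the stop reason and the steps that were actually applied.
    """
    applied: List[Sequence[str]] = []
    for index, touched in enumerate(proposed_steps):
        if index >= policy.max_steps:
            return "step_budget_exhausted", applied
        # Refuse any step that reaches outside the declared scope.
        if not set(touched) <= policy.allowed_paths:
            return "out_of_scope", applied
        applied.append(touched)
        # Check the success criterion after every step and stop immediately.
        if goal_met(applied):
            return "goal_met", applied
    return "no_more_steps", applied
```

With a scope of one test file, the loop above ends the moment the success check passes, and a step that wants to "improve" a helper outside the scope stops the run instead of widening the diff.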

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
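The compounding part of that cliff can be put in numbers with a deliberately crude model. The sketch below assumes every turn succeeds independently with the same probability, which ignores attention drift and unrevisited assumptions and so likely flatters long sessions. The 98% figure used in the note after it is a made-up illustration, not a measurement.

```python
def session_success_rate(per_turn_accuracy: float, turns: int) -> float:
    """Probability that every turn in a session succeeds.

    Simplifying assumption: turns fail independently and with equal
    probability, so the session rate is per_turn_accuracy ** turns.
    """
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return per_turn_accuracy ** turns
```

Under this model a hypothetical 98% per-turn accuracy gives roughly 90% success over a five-turn demo and roughly 57% over twenty-eight turns. The slide number and the session number describe the same model.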

The Kill Switch Nobody Wired Because the Feature Never Failed

· 10 min read
Tian Pan
Software Engineer

The launch flag worked perfectly. You shipped the AI summarizer behind it, ramped 1% to 10% to 50% to 100% over two weeks, watched the dashboards, saw nothing on fire, and at the end of the quarter the platform team's flag-hygiene bot opened a PR to delete the now-redundant gate. You approved it. The PR merged with the rest of the expired-flag cleanup, and the codebase got 200 lines lighter. Six weeks later at 2am, the provider rolls a fresh model snapshot, your summarizer starts confidently fabricating clauses into legal documents, and your on-call engineer discovers there is no fast lever to turn it off — only a deploy.

The flag did its job. The flag was the wrong artifact to keep. A launch flag answers "should this new code path be reachable?" and once everyone agrees yes, deleting it is the correct hygiene move. A kill switch answers "is the upstream model behaving today?" — and that question never expires, because the upstream model never stops changing. Cleaning them up together is the same category error as treating a smoke detector like a construction permit: the permit gets archived once the building is up, but the detector stays wired forever because the thing it watches for can still happen.
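A minimal sketch of the detector side of that analogy, assuming the switch lives in some external store that can be changed without a deploy. `KillSwitch`, `read_enabled`, and `summarize` are hypothetical names, and the choice to stay off when the store has never been readable is one possible default, not a recommendation from the post.

```python
import time
from typing import Callable, Optional


class KillSwitch:
    """Runtime switch re-read from an external store on a short TTL."""

    def __init__(
        self,
        read_enabled: Callable[[], bool],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read_enabled = read_enabled
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[bool] = None
        self._fetched_at = 0.0

    def enabled(self) -> bool:
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            try:
                self._cached = bool(self._read_enabled())
            except Exception:
                # Store unreachable: keep the last known value, or stay off
                # if no read has ever succeeded (an assumed default).
                if self._cached is None:
                    self._cached = False
            self._fetched_at = now
        return self._cached


def summarize(document, model_call, switch: KillSwitch, fallback):
    """Route around the model whenever the switch is off."""
    if not switch.enabled():
        return fallback(document)
    return model_call(document)
```

The point of the short TTL is that flipping the stored value takes effect within seconds, which is the "fast lever" the on-call engineer was missing.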

What You Deleted Is Invisible to Your Coding Agent

· 10 min read
Tian Pan
Software Engineer

You spent Tuesday afternoon deleting a dead utility module. You cleaned up the imports, ran the type checker, watched CI go green, and merged the PR. Wednesday morning, a fresh agent session looks at the same code, decides the codebase is "missing" a small helper, and writes the dead module back in — same name, same shape, slightly different style. The reviewer who approved the deletion yesterday now has to remember why they killed it, find the conversation that justified it, and explain it again. The agent is not malfunctioning. It is doing exactly what its context says to do.

This is the structural reliability problem of coding agents that nobody is solving with prompt engineering: the agent's context starts from the repository's current state, but not from the history of why that state is what it is. The file you removed leaves no trace the agent can see. The dependency you migrated away from is just another package on npm. The flaky test you intentionally deleted is a coverage gap waiting to be "fixed." Absence — the negative space of decisions you made — is invisible.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.
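The "one bad record stalls everything" failure usually comes from the retry boundary being the batch. A rough sketch of moving that boundary to the record is below. `process_records` and `infer` are hypothetical names, and a real pipeline would add backoff and distinguish retryable errors from permanent ones.

```python
from typing import Any, Callable, Dict, Tuple


def process_records(
    records: Dict[str, Any],
    infer: Callable[[Any], Any],
    max_attempts: int = 3,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run inference per record, retrying only the record that failed.

    Returns (results, failures). A record that keeps failing is set aside
    with its last error instead of forcing the whole batch to retry.
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for record_id, payload in records.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[record_id] = infer(payload)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this record only; the rest still proceed.
                    failures[record_id] = repr(exc)
    return results, failures
```

The same function then serves both shapes of the workload: the nightly run passes thousands of records, and an on-demand request passes one.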

Build vs. Buy Is the Wrong Question for Your AI Feature

· 9 min read
Tian Pan
Software Engineer

Every planning meeting about an AI feature collapses into the same binary. One camp wants to "just wrap an API" and ship next sprint. The other wants to "own the model" so the company controls its destiny. The argument feels strategic. It is actually a category error.

Build vs. buy treats your AI feature as one indivisible thing that you either make or purchase. But an AI feature is not one thing. It is a stack of at least five distinct layers, and each layer has its own answer. The team that frames the decision as a single coin flip will almost always own the wrong layer and rent the wrong layer, because the question they asked could not distinguish between them.

The better question is not "can we build it?" Most things, you can build. The question is: which layer breaks our differentiation if a competitor buys the exact same thing tomorrow? That question sorts the stack for you.

The Carbon Line Item Nobody Puts in the AI Feature Spec

· 10 min read
Tian Pan
Software Engineer

Open any AI feature review and you will hear the same three numbers debated: latency, token cost, and accuracy. Someone pulls up the p95 chart, someone else does the math on cost-per-thousand-requests, and a third person argues the eval score is good enough to ship. Nobody mentions energy. Nobody mentions carbon. And because nobody mentions it, the environmental footprint of the feature still gets decided — implicitly, by whoever wins the argument about the dollar figure.

That is the quiet problem with AI sustainability. It is not that teams choose a high-carbon design on purpose. It is that they never choose at all. The footprint is a side effect of a cost decision, and cost only loosely tracks carbon. A routing rule that looks like a clean win on the spend dashboard can quietly double emissions, and no one in the room would know, because the number that would have told them was never on a dashboard.

This post treats energy and carbon as what they actually are: a measurable, ownable property of an AI system, on the same footing as latency and cost. Not a corporate-values footnote. A line item.

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

Hiring for AI Roles That Have No Career Ladder Yet

· 9 min read
Tian Pan
Software Engineer

You open a requisition for an "eval engineer." A week later your recruiter asks the obvious question: what level is this, and what does a good resume look like? You don't have an answer. The title didn't exist two years ago. There is no leveling rubric, no canonical interview loop, no pool of people with the words "eval engineer" already on their LinkedIn. You are hiring for a job the industry has not agreed exists.

This is the quiet bottleneck in shipping AI systems. The model is available. The infrastructure is rentable. What you cannot buy off the shelf is the person whose actual job is keeping a prompt-driven system honest — and your hiring machinery, built for roles with decades of precedent, has no slot for them.

The instinct is to wait. Wait for the title to standardize, for the bootcamps to mint candidates, for someone else to write the leveling guide you can copy. That instinct is wrong. The work exists now whether or not the title does, and the teams staffing it now are the ones learning what "good" looks like before their competitors even open the req.

The Postmortem Where the Root Cause Was a Prompt Nobody Owned

· 9 min read
Tian Pan
Software Engineer

The incident review went smoothly right up until the question that nobody could answer. Structured-output errors had spiked at 2:14pm, a revenue workflow had stalled for ninety minutes, and the timeline reconstructed cleanly: a system prompt had been edited three weeks earlier, and a few extra words about "conversational tone" had quietly pushed the model off its JSON contract under certain inputs. The fix was a one-line revert. The hard part came next. Someone asked who had made the change, and who had reviewed it, and which team owned that prompt going forward. The room went quiet. There was no pull request. There was no reviewer. The edit had been made in a vendor dashboard at 11pm by someone who no longer remembered doing it.

That silence is the actual incident. The JSON contract breaking was a symptom. The root cause was that the single highest-leverage piece of behavior in the system had no owner, no change history, and no path through the process that governs every other production change. The model didn't fail. The model did exactly what it was told. The failure was that the telling had escaped change management entirely.

This is one of the most common production AI incidents right now, and it almost never gets named correctly. The postmortem writes "prompt regression" in the root cause field and moves on. But "prompt regression" describes what changed, not why nothing caught the change. The real root cause is an org chart with a hole in it.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in a hard-coded keyword argument.
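One way to surface the decision is to derive `top_k` from a named budget rather than typing it in. The sketch below is an assumption-laden illustration: `ContextBudget`, its fields, and the example numbers are hypothetical, and the average chunk size would have to come from your own tokenizer and corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Named split of the context window (hypothetical structure)."""
    context_window: int       # total tokens the model accepts
    reserved_for_output: int  # tokens held back for the response
    retrieval_share: float    # fraction of input tokens given to retrieval

    def input_tokens(self) -> int:
        return self.context_window - self.reserved_for_output

    def retrieval_tokens(self) -> int:
        return int(self.input_tokens() * self.retrieval_share)

    def history_tokens(self) -> int:
        # Whatever retrieval does not take is left for the conversation.
        return self.input_tokens() - self.retrieval_tokens()

    def top_k(self, avg_chunk_tokens: int) -> int:
        """Number of retrieved chunks that fit in the retrieval share."""
        if avg_chunk_tokens <= 0:
            raise ValueError("avg_chunk_tokens must be positive")
        return self.retrieval_tokens() // avg_chunk_tokens
```

For example, `ContextBudget(8000, 1000, 0.5).top_k(400)` gives 8, with 3,500 tokens left for history. The integer is the same, but the trade-off that produced it is now written down where someone outside the team can review it.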

Your Voice Agent Trusts Every Transcription Error as Fact

· 10 min read
Tian Pan
Software Engineer

A user calls your insurance voice agent and asks about their deductible. The speech recognizer hears "the duck tibble." Your language model receives the string "the duck tibble," finds nothing coherent to do with it, and either asks a confused follow-up question or — worse — confabulates an answer about a product that does not exist. The user hangs up. Your logs show a successful turn: audio in, transcript produced, response generated, no error thrown.

That is the quiet failure at the heart of nearly every voice agent in production. The speech-to-text system did its job — it produced its single best guess. The language model did its job — it reasoned over the text it was handed. The bug lives in the gap between them, in a handoff that takes a probabilistic guess and relabels it as a fact.
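A small sketch of what keeping the guess labelled as a guess could look like at the handoff. It assumes the recognizer can return several hypotheses with confidence scores. `Hypothesis`, `route_transcript`, the thresholds, and the scores in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    text: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route_transcript(
    hypotheses: Sequence[Hypothesis],
    accept_threshold: float = 0.85,
    max_options: int = 2,
) -> Tuple[str, List[str]]:
    """Decide how much to trust a transcription before the model sees it.

    Returns ("accept", [text]) when the top guess is confident enough,
    ("confirm", [candidates]) when the caller should check with the user,
    and ("reprompt", []) when the recognizer produced nothing.
    """
    if not hypotheses:
        return ("reprompt", [])
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]
    if best.confidence >= accept_threshold:
        return ("accept", [best.text])
    return ("confirm", [h.text for h in ranked[:max_options]])
```

Given `Hypothesis("the duck tibble", 0.41)` and `Hypothesis("the deductible", 0.38)`, this returns a `"confirm"` action carrying both candidates, so the downstream prompt can ask which one the caller meant instead of reasoning over the first as if it were fact.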