3 posts tagged with "eval"

The Agent Degraded-Mode Spec Is the Document You Didn't Write

· 11 min read
Tian Pan
Software Engineer

When the search index goes stale, the vendor API throttles you, the database read replica falls behind, or a downstream microservice starts returning 503s, your agent has to decide what to do. In most production agent systems, that decision was never made. It was inherited — silently — from whatever the engineer who wrote the tool wrapper happened to type at 4 PM on a Tuesday in week three of the project.

The result is what your customers eventually write for you: a Reddit thread, a support transcript, a quote in a press article. "The assistant told me my balance was $0 when my account was actually fine — turns out their lookup service was down." That paragraph is the degraded-mode spec your team didn't write. It is now public, it is now the customer's, and it is the version your engineering org will spend the next quarter responding to.
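To make the point concrete, here is a minimal sketch of what writing that decision down can look like in code. All names here (`lookup_balance`, `backend`, the status strings) are hypothetical; the point is that the wrapper returns an explicit status alongside the value, so "the lookup service was down" can never silently become "$0":

```python
import time

_CACHE: dict[str, tuple[float, float]] = {}  # account_id -> (balance, fetched_at)

def lookup_balance(account_id: str, backend, max_staleness_s: int = 300):
    """Return (balance, status); the status is part of the agent's contract."""
    try:
        balance = backend(account_id)             # live dependency call
        _CACHE[account_id] = (balance, time.time())
        return balance, "fresh"
    except Exception:
        entry = _CACHE.get(account_id)
        if entry and time.time() - entry[1] <= max_staleness_s:
            return entry[0], "stale"              # serve cached, but labeled
        return None, "unavailable"                # refuse; never report $0
```

The agent's prompt or policy layer can then branch on `"stale"` and `"unavailable"` explicitly, which is the degraded-mode spec in executable form rather than an accident of exception handling.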

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

· 11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
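One way to close that gap, sketched here with hypothetical names (`cached_tool_result`, the field names, the TTL value), is to make the cache wrapper stamp every result with its age before the model ever sees it, so staleness is in the trace and in the context window instead of only in the cache framework's internals:

```python
import json
import time

def cached_tool_result(payload: dict, fetched_at: float, ttl_s: int) -> str:
    """Wrap a cached tool payload with explicit age metadata so the model
    can reason about staleness instead of treating every row as live."""
    age_s = time.time() - fetched_at
    return json.dumps({
        "data": payload,
        "age_seconds": round(age_s),
        "stale": age_s > ttl_s,   # the four-hour decision, now visible
    })
```

With this shape, the four-hour-old inventory row arrives as `{"data": {"available": 142, ...}, "age_seconds": 14400, "stale": true}`, and both the model and the eval set have something to check.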

The AI Feature You Should Not Have Shipped: A Task-Shape Checklist

· 10 min read
Tian Pan
Software Engineer

The demo always works. That is the most expensive sentence in AI product development. The product manager sees the model handle the happy path, the engineer ships the obvious version of the feature, and six weeks later the support queue is full of complaints that the metric did not predict. Nothing in the model regressed. Nothing in the prompt got worse. The feature was simply not the shape the model could do well, and the team did not have a way to say so before the work began.

A meaningful fraction of shipped AI features fail this way — not because the model is bad, but because the task is wrong. The output the product needs is deterministic and the engine is stochastic. The user's tolerance for the tail is one bad answer per thousand and the model's failure distribution is heavier than that. The latency budget the unit economics require is half of what the model can deliver at any tier you can afford. The ground truth required to evaluate quality does not exist and cannot be cheaply created. None of these are model problems. They are task-shape problems, and they should have been screened before the first prompt was written.
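The screening questions above are mechanical enough to write down. A minimal sketch, with every field and threshold hypothetical, of what such a pre-build checklist might look like:

```python
from dataclasses import dataclass

@dataclass
class TaskShape:
    """Hypothetical pre-build screen for the task-shape failure modes."""
    output_is_deterministic: bool   # the engine is stochastic
    tolerated_error_rate: float     # e.g. 0.001 = one bad answer per thousand
    expected_error_rate: float      # measured on a pilot eval, not guessed
    latency_budget_ms: int          # what the unit economics allow
    model_latency_ms: int           # best tier you can actually afford
    ground_truth_exists: bool       # can quality be evaluated at all?

    def blockers(self) -> list[str]:
        issues = []
        if self.output_is_deterministic:
            issues.append("deterministic output demanded of a stochastic engine")
        if self.expected_error_rate > self.tolerated_error_rate:
            issues.append("failure tail heavier than user tolerance")
        if self.model_latency_ms > self.latency_budget_ms:
            issues.append("latency budget unmeetable at any affordable tier")
        if not self.ground_truth_exists:
            issues.append("no ground truth to evaluate quality against")
        return issues
```

If `blockers()` comes back non-empty before the first prompt is written, the conversation about whether to build the feature happens at the right time, instead of six weeks later in the support queue.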