143 posts tagged with "evals"

The 30-Day Prompt Apprenticeship: Onboarding Engineers When 'Read the Code' Doesn't Work

12 min read
Tian Pan
Software Engineer

A senior engineer joins your team on Monday. By Friday they've shipped a TypeScript refactor that touches eleven files and passes review with two nits. The same engineer, two weeks later, opens the system prompt for your routing agent — 240 lines of instructions, three numbered example blocks, four "you must never" clauses, and a paragraph at the bottom that reads like an apology — and stares at it for an hour. They cannot tell you what would happen if you deleted lines 87–94. Neither can the engineer who wrote them six months ago.

This is the gap nobody puts on the onboarding doc. A prompt-heavy codebase looks like a codebase, lives in the same repo, runs through the same CI, and gets reviewed in the same PRs. But its semantics live somewhere else: in the observed behavior of a model that nobody on the team built, against a distribution of inputs nobody fully enumerated, with failure modes that surface as PRs to add a sentence rather than as bug reports. The traditional tools of code reading — types, signatures, tests, naming — do almost no work. A new hire who tries to "read the code" learns nothing about why each line is there, and a team that hands them a Notion doc and a Slack channel is implicitly outsourcing onboarding to the prompt's original author.

Prompt Bisect: Binary-Searching the Edit That Broke Your Eval

10 min read
Tian Pan
Software Engineer

The eval scoreboard dropped two points overnight. The only thing that shipped between the green run and the red run is last week's prompt PR — the one with seventeen edits in it. Two reordered sections. Three new few-shots. A tightened refusal clause. A swapped role description. A handful of word-level rewordings someone called "polish." When the post-mortem starts, somebody says the obvious thing: "It must be one of those." And then they spend the next two days figuring out which.

Those two days are the most expensive way to find a single regression. The methodology that costs minutes instead is borrowed wholesale from a forty-year-old kernel-debugging trick: bisect the patch. Treat the prompt as a sequence of revertible hunks, run the eval suite as the predicate, and let binary search isolate the line that flipped the score. The math is the same math git bisect runs on commits, and the discipline it forces on prompt management is a side benefit worth more than the bisect itself.
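A minimal sketch of that search, under some loud assumptions: the PR has been split into ordered, independently applicable hunks, the baseline with no hunks passes, the full patch fails, and exactly one hunk is responsible. The names `bisect_hunks`, `passes`, and `fake_eval` are hypothetical; in practice `passes` would apply the given hunks to the base prompt and run the real eval suite.

```python
from typing import Callable, Sequence


def bisect_hunks(hunks: Sequence[str],
                 passes: Callable[[Sequence[str]], bool]) -> int:
    """Return the index of the hunk whose inclusion first turns the eval red.

    Assumes passes([]) is True, passes(hunks) is False, and a single culprit.
    """
    lo, hi = 0, len(hunks)  # invariant: hunks[:lo] passes, hunks[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(hunks[:mid]):
            lo = mid   # culprit is at index mid or later
        else:
            hi = mid   # culprit is among the first mid hunks
    return hi - 1


# Stand-in for a seventeen-edit PR; "edit-11" is the planted regression.
hunks = [f"edit-{i}" for i in range(17)]
calls = []


def fake_eval(applied: Sequence[str]) -> bool:
    """Hypothetical predicate: green unless the planted hunk is applied."""
    calls.append(len(applied))
    return "edit-11" not in applied


culprit = bisect_hunks(hunks, fake_eval)
```

With seventeen hunks the search needs at most five eval runs. If hunks interact, or more than one of them regresses the score, the single-culprit assumption breaks and the result should be read as a starting point rather than a verdict.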

The RLAIF Doom Loop: When Your Cheapest Feedback Signal Quietly Poisons Your Fine-Tune

10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped four rounds of preference fine-tuning in eight weeks. Every round, their offline win rate against the previous checkpoint went up. Every round, their LLM-as-judge confirmed the model was getting better. Every round, their retention curve sagged a little harder. By round four, the judge said the model was 71% better than the v0 baseline; users were churning 9% faster than before they started. That's the RLAIF doom loop in one paragraph, and the brutal part is: nothing in the team's pipeline was technically wrong.

Reinforcement Learning from AI Feedback (using a stronger model to generate the preference labels you used to pay humans for) is one of the most economically defensible decisions in modern post-training. AI-generated labels run under a cent each; human labels run a dollar or more, often ten times that for domain-specialized work. At preference-dataset scale (hundreds of thousands of pairs), that's the difference between a six-figure budget and a four-figure one. Published RLAIF benchmarks show win rates statistically indistinguishable from RLHF on summarization and dialogue tasks. The math says swap.

The math is right about the unit cost and wrong about what you're buying. You are not buying preference data. You are buying the judge's preferences, projected onto your data — and over multiple training rounds, that distinction is the difference between alignment with users and alignment with another model's aesthetic.

The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.

Sampling Parameter Inheritance: When Temperature 0.7 Leaks From the Planner Into the Verifier

10 min read
Tian Pan
Software Engineer

A verifier that flips its own answer eight percent of the time is not a flaky model. It is a sampling configuration bug that reached production because the framework defaulted to inheritance. The planner needed temperature=0.7 to brainstorm subtask decompositions. The verifier — the role whose entire job is to give a low-variance yes-or-no on whether the answer satisfies the rubric — was instantiated through the same harness call, and silently picked up the same temperature. Nobody set it that way on purpose. Nobody set it at all.

This is the most expensive parameter in your stack that nobody owns. It compounds across the call tree: the summarizer above the verifier, the structured-output extractor below it, and the retry loop wrapping the whole thing all consume the planner's "be creative" knob as if it were a global. The bill arrives in three places at once — eval flakiness, token spend, and the half-day a senior engineer spends bisecting a regression that turns out to be no regression at all.
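One way to give the parameter an owner is to make every role declare its sampling settings and fail loudly when one is missing, rather than inheriting from a caller. The sketch below is illustrative only: `Sampling`, `ROLE_SAMPLING`, and `sampling_for` are hypothetical names, not any framework's API, and the values other than the planner's 0.7 are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sampling:
    """Explicit sampling settings for one role; temperature has no default."""
    temperature: float
    top_p: float = 1.0


# Hypothetical role table: every role in the call tree is listed on purpose.
ROLE_SAMPLING = {
    "planner": Sampling(temperature=0.7),    # brainstorming wants variance
    "verifier": Sampling(temperature=0.0),   # yes/no rubric check wants none
    "extractor": Sampling(temperature=0.0),  # structured output, same reasoning
}


def sampling_for(role: str) -> Sampling:
    """Return the declared config for a role instead of inheriting one."""
    try:
        return ROLE_SAMPLING[role]
    except KeyError:
        raise KeyError(f"no sampling config declared for role {role!r}") from None
```

A role added later without an entry, say a summarizer, raises at construction time instead of silently picking up the planner's knob.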

Session Stitching: Why Your Conversation-ID Is a Lie

11 min read
Tian Pan
Software Engineer

A user starts negotiating a contract with your agent on her desktop at 9 a.m. She gets a Slack ping, switches to her phone over lunch to ask one clarifying question, and reopens the desktop tab at 4 p.m. to revise the draft. To her, that was one task — three hours of working through one contract. To your system, that was three sessions on two devices, each with its own conversation-id, each with its own memory window, each presenting a fresh greeting and asking her to re-paste the draft she'd already discussed twice.

The bug is not in the model. The bug is that your platform encoded "session" — a transport-layer artifact about a single connection — as the unit of context, while your user encoded "task" — the contract — as the unit of context. Every framework on the market quietly conflates the two, and the gap between them is where half of agent UX disappears.

The Three Tastes of an AI Engineer: Why Prompts, Evals, and Guardrails Don't Live in the Same Head

11 min read
Tian Pan
Software Engineer

The three best AI engineers I have hired this year would all fail each other's interviews. The one who writes prompts that survive a model upgrade has never written a useful eval case in her life. The one who designs eval sets that catch the failures that matter writes prompts that other engineers refuse to extend. The one who designs guardrails that fail closed without choking the happy path has opinions about the other two that I cannot print here.

The job ladder calls all three of them "AI engineer." The calibration committee compares their promo packets as if they had been doing the same job. They have not.

Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.
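Checking whether a catalog has this shape takes only a few lines over a trace export. The sketch below assumes the trace has already been reduced to one tool name per call; the tool names and counts are made up for illustration, and `head_tools` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def head_tools(calls: Iterable[str], coverage: float = 0.9) -> list[str]:
    """Return the fewest most-called tools that together reach `coverage`."""
    counts = Counter(calls)
    total = sum(counts.values())
    head, covered = [], 0
    for tool, n in counts.most_common():
        if covered / total >= coverage:
            break  # the head already accounts for enough of the traffic
        head.append(tool)
        covered += n
    return head


# Illustrative week of traces: three busy tools and ten rarely used ones.
trace = (
    ["search_files"] * 50
    + ["read_file"] * 30
    + ["run_tests"] * 10
    + [f"rare_tool_{i}" for i in range(10)]
)
head = head_tools(trace)
```

The resulting head list is a natural place to start weighting evals by traffic, and everything outside it is a candidate for loading tool descriptions on demand.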

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits higher: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor, and only above it does the model answer on its own. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain, which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."
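The two-cutoff design described above can be sketched as a three-way gate. The threshold values here are placeholders rather than recommendations, and `gate`, `abstain_below`, and `answer_at` are hypothetical names; real cutoffs would be tuned against the cost of a wrong answer versus the cost of a human review.

```python
from enum import Enum


class Action(Enum):
    ABSTAIN = "abstain"
    ESCALATE = "escalate"
    ANSWER = "answer"


def gate(confidence: float,
         abstain_below: float = 0.3,
         answer_at: float = 0.8) -> Action:
    """Map a confidence score to one of three actions using two cutoffs."""
    if not 0.0 <= abstain_below <= answer_at <= 1.0:
        raise ValueError("need 0 <= abstain_below <= answer_at <= 1")
    if confidence < abstain_below:
        return Action.ABSTAIN    # no answer is worth more than silence
    if confidence < answer_at:
        return Action.ESCALATE   # uncertain band goes to a human reviewer
    return Action.ANSWER
```

Setting both cutoffs to the same value recovers the single-threshold behavior, which makes the cost of collapsing them easy to compare on the same traffic.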

The Local-Maximum Trap in Prompt Iteration: How to Tell You're Tweaking the Wrong Thing

10 min read
Tian Pan
Software Engineer

There is a moment, six weeks into a serious LLM project, where the prompt iteration log starts to look like a therapy journal. Each tweak swaps one failure mode for another. Add a stricter "do not" clause and the model becomes evasive on cases it used to handle. Soften the tone and a different category of hallucination returns. The eval scoreboard hovers in a band three or four points wide, refusing to break out. Someone says, "let me try one more reordering," and another half day evaporates.

This is the local-maximum trap. The team is climbing a hill, but the hill does not go higher. The cruel part is that the hill is real — every prompt change does produce a measurable delta on some subset of cases, which is exactly the signal that keeps everyone tweaking. What's missing is the recognition that the ceiling above is not a prompt ceiling at all.

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

Your Accuracy Went Up and Your Calibration Collapsed

10 min read
Tian Pan
Software Engineer

A team ships a prompt refactor. The offline eval shows accuracy up three points. The PM posts the graph in Slack. Two weeks later, support tickets spike with a pattern nobody has a dashboard for: users trusted an answer they should not have, acted on it, and got burned. The model is right more often than it used to be. Trust in the model has gotten worse.

This is the calibration collapse. The model's confidence no longer matches its error rate, but the accuracy number went up, so the team thinks they shipped a win. They did not. They shipped a system that is more confidently wrong, and users — who calibrate trust on the model's voice (hedges, certainty, refusals) rather than on an accuracy number they never see — are now being misled on the exact fraction of queries where being misled matters most.

Accuracy and calibration are independent axes. You can move one without touching the other. You can improve one while destroying the other. Most teams measure only the first axis and ship against it, and most production incidents in LLM systems live on the second.
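To make the second axis concrete, here is a small sketch of expected calibration error (ECE), one common way to score the gap between stated confidence and observed accuracy. It assumes each answer comes with a numeric confidence in [0, 1], which many LLM setups have to approximate; the two toy datasets are invented to show accuracy rising while calibration gets worse.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |confidence - accuracy| over equal-width bins, weighted by size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# Toy "before": 70% accurate and saying so.
before_conf = [0.7] * 10
before_ok = [1] * 7 + [0] * 3

# Toy "after": 80% accurate, but every answer sounds 99% sure.
after_conf = [0.99] * 10
after_ok = [1] * 8 + [0] * 2

acc_before = sum(before_ok) / len(before_ok)
acc_after = sum(after_ok) / len(after_ok)
ece_before = expected_calibration_error(before_conf, before_ok)
ece_after = expected_calibration_error(after_conf, after_ok)
```

In this toy case accuracy moves from 0.7 to 0.8 while ECE goes from roughly zero to roughly 0.19, which is the pattern an accuracy-only dashboard would report as a win.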