Skip to main content

113 posts tagged with "evals"

View all tags

Compliance Reviewer as Eval Author: Why Legal Should Be Writing Your Test Cases

· 13 min read
Tian Pan
Software Engineer

The most useful adversarial prompt I have seen for an enterprise LLM did not come from a red team, a security researcher, or a prompt engineer. It came from a senior compliance attorney who asked the model, in plain English, to "tell me which of the three retirement annuities discussed earlier in this thread is the best one for a 62-year-old approaching their first required minimum distribution." The model produced a confident, thoughtful, beautifully-formatted recommendation. That output, had it been sent to a customer, would have been a textbook FINRA suitability violation — an unsuitable individualized recommendation made without the supervisory infrastructure that securities rules require around personalized advice.

The compliance attorney spotted the failure mode in about four seconds. The engineering eval suite, which had a hundred-plus carefully constructed cases for hallucination, refusal calibration, and tool-use accuracy, had no concept that this particular response shape was illegal. Not low quality. Not a hallucination. Illegal. And the workflow at the company at the time had her reading sample outputs in a Google Doc and writing memos, rather than checking a test case into the regression suite. So her catch lived in a memo, the memo got summarized in a launch-readiness slide, and the next month a refactor of the system prompt regressed the behavior because nobody had a failing test pinned to it.

That is the gap I want to argue we should close: the compliance reviewer should be authoring eval cases directly, and those cases should be the artifact that gates release — not the document review that produced them.

Your Eval Suite Is the Product Spec You Refused to Write

· 10 min read
Tian Pan
Software Engineer

Open the PRD for any AI feature shipping this quarter. Notice the adjectives. The assistant should be helpful. Responses should feel natural. The agent should understand the user's intent. The summary should be accurate and concise. Every one of these words is a place the team gave up. They did not decide what the feature does. They decided how they would describe the feature to each other in a meeting, then handed the actual product definition — quietly, without anyone calling it that — to whoever wrote the eval suite.

This is not a documentation problem. The eval is the spec. The PRD is a press release written before the product exists. The fuzzy adjectives in the doc become unambiguous behavioral assertions in the eval, or they become nothing — the model picks an interpretation, ships it, and the team discovers a quarter later that "concise" meant something different to the reviewer than to the user, and different again to whoever tuned the prompt last sprint. An AI feature whose eval suite is thin is a feature whose product definition is thin. The model didn't fail. The team never decided what success meant.

The Frozen Prompt: When Your Team Is Afraid to Edit a System Prompt That Works

· 13 min read
Tian Pan
Software Engineer

Every mature AI product eventually grows a system prompt that nobody on the current team fully understands. It started as forty tokens of plain English, and twenty months later it is a 4,000-token wall of conditional clauses, refusal templates, formatting rules, persona reinforcements, edge-case warnings, and one peculiar sentence about Tuesdays that nobody can explain. Each line was added in response to a specific failure: a customer complaint, a Slack ping from legal, a regression caught by an eval, a one-off bug that surfaced during an investor demo. The engineer who wrote line 37 has rotated to another team. The engineer who wrote line 112 was a contractor whose Notion doc was archived. The eval suite covers maybe a third of the behaviors the prompt is asserting, and nobody is sure which third.

So the prompt becomes load-bearing in the worst possible way: it works, the team knows it works, and the team has stopped touching it. Engineers who should be iterating on the prompt route their changes around it instead — adding a post-processing filter here, a few-shot wrapper there, a parallel "v2 prompt" feature-flagged off in case anyone ever finds the courage to A/B test the replacement. The prompt has stopped being software and has become a relic. And once that happens, the prompt is no longer the lever you use to improve the product. It's the constraint shaping it.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.

Provider-Side Safety Drift: When Your Product Regresses Without a Deploy

· 9 min read
Tian Pan
Software Engineer

A prompt that worked on Tuesday returns "I can't help with that" on Thursday. The CI eval is green. The model name in your config didn't change. The prompt is byte-identical, hashed and pinned in source control. And yet a customer support thread is forming around the new refusal — the AI team won't see it for two weeks because it has to bubble through tier-one support, get triaged, and finally land on someone who can read the trace.

This is provider-side safety drift, and it is the most underbuilt monitoring gap in production AI today. Frontier providers tune safety filters, refusal thresholds, and content classifiers server-side on a cadence that is not on your release calendar. Your team isn't subscribed to it. There is often no release note. And the regressions are asymmetric in a way that is genuinely hard to detect: refusals creep up for legitimate intents while harmful queries you assumed the provider was filtering quietly start slipping through. The boundary moves on both sides, independently, without warning.

The Refusal Audit: Why a Single Refusal Rate Hides Half the Failure Distribution

· 10 min read
Tian Pan
Software Engineer

Open the safety dashboard for any production LLM feature and you will see refusal rate plotted as a single line, color-coded so that down is bad and up is good. The implicit story: refusals are the system saying no to things it shouldn't do, so a higher number means a safer product. That story is half the picture, and the missing half is where most of the silent quality damage in deployed assistants actually lives.

Refusal rate is a two-sided distribution. The right tail is the one safety teams obsess over: the model agreeing to write malware, fabricate medical dosages, or generate content the policy explicitly forbids. The left tail is the inverse failure — false refusals where the model declines a benign request because some surface feature pattern-matched to a forbidden category. A customer asking how to dispute a charge gets a "I can't give financial advice" boilerplate. A nurse asking about a drug interaction gets routed to "consult a healthcare professional." A developer asking how to parse an email header gets refused because the prompt contained the word "exploit."

The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

· 11 min read
Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat" — which most users never do. Three columns, three semantics, one rolled-up dashboard, three unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak with the same root cause.

The Dependency Bomb in Your Tool Catalog: When Adding One Tool Breaks Five Agents

· 8 min read
Tian Pan
Software Engineer

A team I know shipped a new lookup_customer_v2 tool to their support agent's catalog on a Tuesday. The tool was scoped narrowly, well-tested in isolation, and approved by review. By Thursday, an unrelated workflow — refund processing — was failing on roughly four percent of cases that used to succeed. The refund tool hadn't changed. The refund prompt hadn't changed. The model hadn't changed. What changed was that the planner was now picking lookup_customer_v2 for refund-eligibility queries that had previously routed cleanly to get_account_status, because the new tool's description happened to contain the word "eligibility" and ranked higher under whatever similarity heuristic the model uses internally.

This is the dependency bomb. Teams treat the tool registry as additive — "we're just adding one thing, what could go wrong" — but the planner doesn't see your registry as a list of independent capabilities. It sees a probability distribution over choices, and every entry redistributes the mass. Adding a tool can quietly subtract behavior somewhere else, and your eval suite will probably miss it because nobody wrote a regression test that says "the agent should still pick the old tool for this case."

When LLMs Grade Their Own Homework: The Feedback Loops Breaking AI Evaluation

· 10 min read
Tian Pan
Software Engineer

Here is a finding most AI teams don't want to sit with: in a large-scale study that generated over 150,000 evaluation instances across 22 tasks, roughly 40% of LLM-as-judge comparisons showed measurable bias. That bias wasn't random noise—it was systematic, reproducible, and correlated with how models were trained. When you use a model to generate your eval set and then use the same model (or a close relative) to grade it, you're not measuring quality. You're measuring how well a system agrees with itself.

Synthetic eval data has become standard practice for good reasons. Human annotation is slow, expensive, and hard to scale. LLM-generated test cases let teams spin up thousands of examples overnight. The problem surfaces when the generator and the judge share a common ancestor—which, in 2025, is almost always the case. The result is an eval pipeline that confidently reports high scores while hiding the exact failure modes you built it to catch.

Choosing Eval Metrics Is a Product Decision, Not a Technical One

· 10 min read
Tian Pan
Software Engineer

A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.

This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.

The 200-Token System Prompt That Beats Your 4000-Token One

· 10 min read
Tian Pan
Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel — a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. It was also forty times cheaper to run, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

The AI Bystander Effect: Why Five-Team Launches Ship Eval Suites Nobody Watches

· 10 min read
Tian Pan
Software Engineer

In 1964, thirty-eight people watched Kitty Genovese being attacked outside their apartment building in Queens. None of them called the police until it was too late. Latané and Darley spent the next decade explaining why: the more people who can see a problem, the less likely any single one of them is to act. They called it diffusion of responsibility. In their famous seizure experiment, 85% of participants intervened when they thought they were alone with the victim. When they believed four others could also hear the seizure, only 31% did.

Now picture your last AI feature launch. Product wrote the prompt. Engineering picked the model and wired the gateway. The data team curated the retrieval corpus. Safety bolted on the input and output filters. Customer support drafted the escalation playbook. Five teams in the room. Each one shipped its piece on time. Three months in, the feature's accuracy has quietly slid from 89% to 71%, the eval suite has not been run since launch week, and when you ask who owns the regression, every team can name three other teams that own it more.