907 posts tagged with "insider"

The Eval Rubric Pulled By Two Drift Vectors

· 9 min read
Tian Pan
Software Engineer

Your composite eval score went up two points last quarter. Nobody can tell you whether the system got better, whether the human cohort that scores it got more lenient, or whether the judge model you upgraded in March started weighting verbosity differently. The number moved. The thing the number is supposed to measure did not necessarily move with it.

This is what happens when an eval rubric is read by two populations at once — humans and an LLM judge — and both populations drift on different axes for different reasons. The composite score blends their motion together, and unless you have a measurement protocol that holds one fixed while the other moves, you have shipped a metric whose changes are not attributable to anything.

The Eval Set That Sampled Production Traffic at 3am EST

· 10 min read
Tian Pan
Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.

The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See

· 10 min read
Tian Pan
Software Engineer

The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.

This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.

The Fine-Tune Dataset You Accidentally Built While Debugging

· 9 min read
Tian Pan
Software Engineer

The thumbs-down button on your staging UI was supposed to do one thing: tell the on-call engineer which response looked bad so they could go investigate. Six months later, somebody on the modeling team pulled "all production feedback with corrections attached" into a Parquet file and ran an SFT job against it. The eval set improved on three metrics and regressed quietly on five. Nobody could explain why until somebody scrolled through the labels and found a row that read, in the corrections column, "this is fine but I hate how it phrases it." The model learned that opinion. Then it learned forty thousand more of them.

This is the failure mode where the debugging surface and the curation surface are the same surface. Engineers click "bad" because something is broken, because something looks weird, because they were about to file a ticket, because the formatting offends them, because they were checking whether the button works. The signal that flows out of that click is a mix of "this output is wrong," "this output is right but ugly," "I don't like this," and "I was bored." Treated as a single label, it certifies nothing. Trained against, it teaches the model the union of all those moods.

The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they screenshotted from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the mean as its forecasting input has already lost the argument with the bill.

The Latency-Budget Router That Was a Quality-Loss Router by Another Name

· 10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.

The Model Card Your Procurement Team Treated Like a Datasheet

· 11 min read
Tian Pan
Software Engineer

A model card is a research artifact. A datasheet is a contract. Procurement teams routinely read the first as if it were the second, and the AI vendor that handed it over is now bound to claims its engineering team thought were narrative.

This is the cleanest way to lose a renewal: you forwarded the same PDF you publish on your model index page, the customer's legal team excerpted four sentences into Schedule B, and twelve months later you discover that "intended use: general question answering" has become a contractual representation about scope of service. Your team measured those sentences in BLEU points. Their team is now measuring them in breach.

The Model Deprecation Notice That Landed During Your Code Freeze

· 8 min read
Tian Pan
Software Engineer

The email arrives on a Tuesday. The checkpoint your two largest features depend on enters a 90-day sunset. Your engineering org is in week two of a coordinated freeze for a different launch. By the time the freeze lifts, you will have under thirty days to revalidate two production features against a new model — and "revalidate" here means rebuilding the eval set, running shadow traffic, getting product sign-off, and shipping behind a flag that nobody is watching because the launch team is still ramping the thing the freeze was for.

This is not a rare collision. Major providers publish deprecation cadences measured in months, and every team running on hosted models has now seen one cycle. What teams have not absorbed is that provider deprecation is not an engineering event the way a library upgrade is — it is a scheduling event that arrives on a clock you do not control, and any roadmap that did not budget for it inherits the cost as a surprise.

The OAuth Scope Your Agent Acquired Across Chained Tool Calls

· 10 min read
Tian Pan
Software Engineer

A user clicks "Authorize" on your agent's consent screen once. By the time the session ends, that agent has chained through eleven tool calls, negotiated three step-up authorizations, and now holds the union of scopes across every tool it touched. The user remembers granting one thing. Your audit log shows read-write access to half their account. The OAuth standard says everything is working as designed, and that is exactly the problem.

The classical OAuth consent model was built for a world where one app talks to one API. Agents shattered that assumption two years ago and the standard has not caught up in practice, even where the spec has. The result is a category of silent privilege escalation that no one decides to ship — it accretes, one tool registration at a time, while your security review keeps inspecting the front door.

The On-Call Rotation Your Agent Platform Forgot to Staff

· 11 min read
Tian Pan
Software Engineer

The AI platform team has four engineers. The internal agent they shipped seven months ago is now answering questions for 200 employees a day. For the first month the founding engineer answered every Slack ping personally — Tuesday at 11pm, Sunday morning, the night of the company offsite. Then she got promoted to staff engineer for the impact she had on adoption, and three weeks later she stopped checking the channel after 6pm because that is what staff engineers do. The on-call rotation that was supposed to replace her was never formalized, because the operating model was always going to be figured out "after the pilot."

The day the agent silently degrades for a quarter of users — a retrieval index that quietly fell behind, or a model version flip that shifted refusal behavior, or a tool whose schema rotated and is now returning empty arrays — the complaints do not land on the platform team's pager. They land in the help desk queue, staffed by people who do not have access to the agent's traces, do not know what a system prompt is, and have been told by IT that the agent is "owned by the AI team." Sixteen hours pass between the first user complaint and the first engineer who looks at a trace. Nobody on the platform team is asleep at the wheel; there is no wheel.