
67 posts tagged with "llmops"


The Demo Account Eval Set Your Sales Team Is Running Without You

· 10 min read
Tian Pan
Software Engineer

The most expensive eval set in your company isn't in your repo. It's in a slide deck a sales engineer assembled six months ago, plus three demo accounts named after your top-five logos, plus a half-remembered script that says "click here, ask the agent to summarize last quarter, watch the magic happen." It runs once or twice a week, in front of prospects worth six or seven figures. Nobody on the AI team has ever scored a run.

Then you ship a model migration on a Tuesday. On Thursday at 4 PM, the sales engineer pings the on-call channel: the summary output now starts with "Certainly! Here is a summary…" instead of jumping into the bullet points, the numbers are spelled out instead of digits, and the prospect — a Fortune 500 CFO who scheduled this meeting four weeks ago — just asked whether the product is always this chatty. The release notes called it a 1.2-percentage-point eval lift.

When Marketing Reads Your Eval Cases: The Cross-Functional Visibility Problem

· 11 min read
Tian Pan
Software Engineer

The eval set is the most-read artifact your AI team produces, and you almost certainly don't know who's reading it. The repo is private, the CI job is internal, the file is one directory above prompts/ — and yet a sales engineer scraped six cases for a demo last quarter, a marketing analyst pulled three failure cases into a "look how robust our system is" deck, customer success cited eval pass-rates verbatim in a renewal call, and product treats the file as the hidden spec the AI team won't share. The case files are read by more people than the code that generated them, and nobody on the AI team has noticed.

This isn't a permissions failure. The eval set is on the same Git server as the rest of the codebase, with the same access controls as every other engineering artifact. The problem is that the AI team is the only group that treats the eval set as code. Everyone else treats it as documentation, as marketing material, as a product spec, or as a customer complaint log — and each of those readings extracts a different slice of the same file, packages it for a different audience, and ships it somewhere the AI team isn't watching.

Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

· 12 min read
Tian Pan
Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English, 6% Spanish, and the rest is a long tail, no part of which carries enough weight to move the rollup. The French regression is in your data; it is just sitting three decimal places below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
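A minimal sketch of what stratifying by locale could look like, assuming each eval result is recorded as a `(locale, passed)` pair. The function names, the `make_run` helper, and the 3-point threshold are all hypothetical choices for illustration, not part of any particular eval framework.

```python
from collections import defaultdict


def stratified_pass_rates(results):
    """Return (overall_rate, {locale: rate}) from (locale, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for locale, passed in results:
        totals[locale] += 1
        passes[locale] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_locale = {loc: passes[loc] / totals[loc] for loc in totals}
    return overall, per_locale


def locale_regressions(before, after, threshold=0.03):
    """Locales whose pass rate fell by more than `threshold` between two runs."""
    return sorted(
        loc
        for loc in before
        if loc in after and before[loc] - after[loc] > threshold
    )


def make_run(counts):
    """Expand {locale: (passed, total)} into a flat list of (locale, passed)."""
    return [
        (loc, i < passed)
        for loc, (passed, total) in counts.items()
        for i in range(total)
    ]
```

With an illustrative 88/6/6 split across English, Spanish, and French, a run that gains three English passes and loses two French ones still moves the rollup up, while `locale_regressions` flags French. The point of reporting per-locale rates next to the aggregate is that the small slices get their own denominator.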

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt — "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause turns out to be that the planner now asks more clarifying questions, the executor receives shorter task descriptions on the second turn, the verifier's rubric was implicitly tuned against the previous executor's longer outputs, and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.
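One way to make those edges explicit is to write the graph down and walk it whenever a prompt changes. The sketch below assumes the planner, executor, verifier chain from the example above; `PROMPT_EDGES` and `downstream_prompts` are hypothetical names, and a real agent would likely have more nodes and branching.

```python
from collections import deque

# Hypothetical edge map: each prompt lists the prompts that consume its output.
PROMPT_EDGES = {
    "planner": ["executor"],
    "executor": ["verifier"],
    "verifier": [],
}


def downstream_prompts(edges, changed):
    """Breadth-first list of every prompt reachable from the edited one."""
    seen = []
    queue = deque(edges.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # guard against cycles and repeated paths
        seen.append(node)
        queue.extend(edges.get(node, []))
    return seen
```

Calling `downstream_prompts(PROMPT_EDGES, "planner")` yields the executor and the verifier, which is the list of eval sets a planner edit would need to re-run before merging.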

Repeat-Question Detection: The Session-Level Blind Spot Your Per-Turn Eval Cannot See

· 11 min read
Tian Pan
Software Engineer

A user opens your chat, asks a question, and gets back a response your eval suite would score 4.6 out of 5. Then they ask the same question with different words. Same answer. Same score. They try once more, this time with the kind of hedging language people use when they suspect the machine isn't listening — "what I'm actually trying to do is…" — and then they close the tab. From the model's perspective, three clean Q&A turns. From the dashboard's perspective, an engaged session. From the user's perspective, a product that failed them three times in a row and won't be opened again.

This is the failure mode per-turn evaluation cannot see. Each individual turn looked correct in isolation. The judge gave a thumbs up. The hallucination detector stayed quiet. The relevance score was high. And yet the conversation, as a whole, did not resolve anything — and that's the unit the user was actually evaluating you on.
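A session-level check can be sketched by comparing each user turn against the earlier ones in the same session. The version below uses token-set Jaccard overlap purely as a stand-in; it is an assumption for illustration, and rephrasings that share few words would need something stronger, such as embedding similarity. The function names and the 0.4 threshold are hypothetical.

```python
import re


def _tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def count_repeat_questions(user_turns, threshold=0.4):
    """Count user turns that closely resemble any earlier turn in the session."""
    repeats = 0
    for i in range(1, len(user_turns)):
        if any(jaccard(user_turns[i], prev) >= threshold for prev in user_turns[:i]):
            repeats += 1
    return repeats
```

A session with a nonzero repeat count is a candidate for review even when every individual turn scored well, because the score is computed per turn and the repetition only exists at the session level.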

Shadow Evals: When Private Slices Replace Your Eval Rollup

· 10 min read
Tian Pan
Software Engineer

The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" — and watch them answer yes, all three of them, with three different numbers, against three different slices, on three different laptops, none of which is reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook says you don't have evals. The reality is worse: you have too many evals, each of them privately owned, each of them measuring something real, and none of them rolling up into a single number the org can plan against.

This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.

Stale Few-Shot Examples and the Half-Life Your Prompt Repo Ignores

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that has been in production for more than nine months. Scroll past the role description, past the formatting rules, past the safety guardrails. Stop at the block titled <examples> or ## Examples or whatever your team called it the day someone copied the first three good Slack threads into a code block. Read them. There is a 60% chance at least one of them references a feature that has been renamed, a button that no longer exists, or a workflow the product manager quietly killed two quarters ago.

The decay is not visible from the eval dashboard. The eval scores are green. They have been green for months. They are green because the eval set was authored against the same product surface the few-shots reference, and the two have aged together in lockstep. The model is performing a flawless impression of last year's product, on a test set that grades it for being faithful to last year's product, while real users interact with this year's product and quietly tolerate the resulting confabulations. This is the half-life nobody puts in the LLMOps roadmap.
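A crude first check for this kind of decay is to scan the few-shot block for names the product has retired. The sketch below assumes the team keeps a list of renamed or removed feature names somewhere; `stale_examples` and both example inputs are hypothetical, and plain substring matching will miss paraphrased references.

```python
def stale_examples(examples, retired_terms):
    """Map example index -> retired terms it still mentions (case-insensitive)."""
    stale = {}
    for idx, example in enumerate(examples):
        lowered = example.lower()
        hits = [term for term in retired_terms if term.lower() in lowered]
        if hits:
            stale[idx] = hits
    return stale
```

Running the same scan over the eval set is worth doing at the same time, since the eval cases were authored against the same product surface and are likely to carry the same stale references.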

AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code

· 9 min read
Tian Pan
Software Engineer

The PR-review dashboard has shown green for six weeks. Bot catch rate, comment volume, developer "thumbs up" reactions — all steady. Then a security incident lands in production and the post-mortem points at a missing null-check the bot used to catch and quietly stopped catching about two months ago. Nobody changed the bot. Nobody downgraded the model. The dashboard never moved. The standard moved.

This is the failure mode of automated code review that doesn't show up in any product demo. Teams adopt an LLM reviewer for the consistency win — every PR gets the same checklist, no senior engineer's bad-day variance, fast turnaround for junior contributors — and the consistency is real for about a quarter. Then the system prompt evolves, the model bumps, the few-shot library accumulates, and the bot is reviewing a different codebase against a different rubric using a different model than the one the team validated against. The team's mental model of "what the bot catches" decays into "what the bot caught last week."

AI Feature Dependency Graphs: When a Prompt Edit Is a Silent Breaking Change

· 12 min read
Tian Pan
Software Engineer

A team owns a summarizer. Another team owns the search ranker that ingests those summaries. A third team owns a router that picks between agent personalities based on the ranker's confidence score. None of these teams have a shared on-call rotation, none of them sit in the same standup, and the only contract between them is "the previous feature's output is the next feature's input." On a Tuesday, the summarizer team tightens a prompt to fix a hallucination complaint from a sales demo. The search ranker's quality collapses six hours later. The router starts handing off to the wrong agent personality by Wednesday morning. The post-mortem will record the cause as "prompt change," but the actual cause is that the team's AI features have quietly composed into a directed graph that nobody drew.

This is the most common shape of an AI outage that doesn't trip any of the alerts you built for AI outages. The model isn't down. The eval suite for the changed feature is green. The token cost line is flat. What broke is the interface between two features, which is a thing your dependency tooling treats as plain text because that's all it is at the API boundary — and treats as inert because plain text doesn't carry a version, a schema, or a deprecation policy.
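One possible way to give that interface a version is to wrap each feature's output in a small envelope that the consumer can check. This is a sketch under stated assumptions: `FeatureOutput`, `SUPPORTED_VERSIONS`, and `accept` are hypothetical names, and the version numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureOutput:
    producer: str        # which feature emitted this text
    schema_version: int  # bumped when a prompt edit changes the output's shape
    text: str


# Hypothetical: versions of each upstream output this consumer was validated against.
SUPPORTED_VERSIONS = {"summarizer": {1, 2}}


def accept(output, supported=SUPPORTED_VERSIONS):
    """True if the consumer has been validated against this producer version."""
    return output.schema_version in supported.get(output.producer, set())
```

Under this scheme, a summarizer prompt edit that changes output shape bumps `schema_version`, and the ranker rejects or flags the new version until someone re-runs its evals against it, instead of silently ingesting text it was never tuned for.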

Eval Triage Queues: Why FIFO Misses the Failures That Matter

· 11 min read
Tian Pan
Software Engineer

A healthy eval set is supposed to be a sign of maturity. It is also, on any given Monday, a thousand failed cases sitting in a queue with a human reviewer who has eight hours and a throughput of about fifty cases in that time. The arithmetic is brutal: fifty out of a thousand means roughly one in twenty failures gets read. The other nineteen wait. Which nineteen wait, and which one gets the seat, is decided by whichever order the file happens to load in.

Most teams call this "reviewing failures." It is closer to a lottery weighted by alphabetical order. A failure case that affects two percent of production traffic and lives at the top of the file gets attention. A failure case that affects forty percent of production traffic and lives near the bottom gets a glance on Friday afternoon, if at all. The team ships a fix for the small problem on Tuesday and writes a retro on Thursday wondering why the dashboard hasn't moved.
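The alternative to file order is to rank failures by how much production traffic they represent and spend the reviewer's capacity from the top. The sketch below assumes each failed case can be tagged with an estimated traffic share; the `FailedCase` fields and `triage_order` are hypothetical, and a real queue would likely weigh severity and recency as well.

```python
from dataclasses import dataclass


@dataclass
class FailedCase:
    case_id: str
    traffic_share: float  # estimated fraction of production traffic affected


def triage_order(cases, capacity):
    """Pick the `capacity` cases with the largest traffic share, highest first."""
    ranked = sorted(cases, key=lambda c: c.traffic_share, reverse=True)
    return ranked[:capacity]
```

With a capacity of fifty against a thousand failures, the reviewer still reads one in twenty, but the one that gets read is chosen by impact instead of by where it sits in the file.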

Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.
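A minimal sketch of the compilation idea, assuming a base prompt plus per-industry overlays and an optional tenant-specific rubric. Every name and every prompt string here is hypothetical; the content hash is one possible way to identify the compiled artifact, not a prescribed scheme.

```python
import hashlib

BASE_PROMPT = "You are a helpful assistant."

# Hypothetical overlays keyed by tenant industry.
OVERLAYS = {
    "healthcare": "Frame responses with HIPAA-aware language.",
    "legal": "Cite a source for every factual claim.",
}


def compile_prompt(tenant):
    """Assemble the prompt for one tenant from source fragments."""
    parts = [BASE_PROMPT]
    industry = tenant.get("industry")
    if industry in OVERLAYS:
        parts.append(OVERLAYS[industry])
    if tenant.get("custom_rubric"):
        parts.append(tenant["custom_rubric"])
    return "\n\n".join(parts)


def artifact_id(compiled):
    """Short content hash so each compiled prompt can be logged and diffed."""
    return hashlib.sha256(compiled.encode("utf-8")).hexdigest()[:12]
```

Keeping the fragments as data and the assembly in one function means a free-tier tenant with no overlay gets exactly `BASE_PROMPT`, and any change to a fragment shows up as a new artifact ID for every tenant that includes it.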

Prompt Edits Without PRs: The Velocity Metric Your AI Team Is Failing

· 9 min read
Tian Pan
Software Engineer

A head of engineering opens the velocity dashboard on a Monday morning. PRs merged per week, flat. Story points completed, flat. Lines changed, suspiciously low. The AI team is having a quiet quarter, the chart says. Two floors away, that team has rewritten the system prompt seven times in three weeks, swapped a tool description that doubled tool-call accuracy, added six new few-shot examples, and tuned the rerank instruction until the product feels like a different application. None of that work shows up in the PR graph. None of it is invisible to users.

The asymmetry between what AI teams change and what engineering dashboards measure has become the load-bearing misdiagnosis of 2026. Behavior change in an AI-heavy product is increasingly decoupled from code change, and the metrics that have governed software organizations for fifteen years — PR throughput, commit volume, lines touched — measure code change. A team can be reshaping production response distributions weekly and look idle on every chart leadership trusts.