84 posts tagged with "llmops"

The Eval Set Your Prompt Engineers Turned Into Production Few-Shots

· 11 min read
Tian Pan
Software Engineer

The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.

The forensics were quick once someone thought to run them. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples (the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer), and over a few sprints they had copy-pasted the strongest of them as few-shot demonstrations into the production system prompt. The dashboard kept going up because the model was being graded on inputs it had seen verbatim at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.
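One way to catch this kind of leak is a check that fails when any eval input appears in the production system prompt. The sketch below is a minimal illustration, not the team's actual tooling: the function names are hypothetical, and it only detects verbatim overlap after whitespace and case normalization, so paraphrased few-shots would slip through.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial edits don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def leaked_eval_inputs(eval_inputs: list[str], system_prompt: str) -> list[str]:
    """Return the eval inputs that appear verbatim in the system prompt.

    Hypothetical helper: assumes eval inputs are plain strings and that
    a substring match is a good enough signal of contamination.
    """
    prompt = normalize(system_prompt)
    leaked = []
    for question in eval_inputs:
        needle = normalize(question)
        if needle and needle in prompt:
            leaked.append(question)
    return leaked
```

Run as a CI step on every prompt change, a non-empty result would be a reason to block the merge or to move the leaked examples out of the eval set.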

The Prompt Version Your Team Treated as Independent of the Model Version

· 9 min read
Tian Pan
Software Engineer

Your incident timeline says "no deploys in the last 72 hours." Your prompt registry agrees: prompt v37 has been frozen for three weeks. Your eval harness ran clean on Tuesday night. But on Wednesday morning, the structured-output failure rate on one of your agents tripled, the retry budget on another doubled, and a third started cheerfully ignoring an instruction it had been honoring for a month. Nothing changed. Except something did change, and it changed in the place neither side of the org was watching: the model.

The prompt registry knows about prompt versions. The model gateway knows about model versions. Almost nobody, in practice, tracks the pair. And prompt v37 isn't a free-standing artifact — it is, whether your tooling admits it or not, a contract negotiated against one specific model. When the platform team rolls the claude-sonnet-latest alias forward by a single point release, the contract on the other side has been quietly amended, and your incident timeline reads "no deploys" because the deploy happened on someone else's infrastructure under a name that didn't move.
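Tracking the pair can start as a small record that ties a prompt version to the exact model identifier it was evaluated against. The sketch below is illustrative only: the `Pairing` type, the helper names, and the rule that floating aliases end in `-latest` are all assumptions, and the dated model IDs are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pairing:
    """A prompt version together with the model it was evaluated against."""
    prompt_version: str
    model_id: str


def is_floating_alias(model_id: str) -> bool:
    # Assumption: aliases that can move underneath you end in "-latest".
    return model_id.endswith("-latest")


def pairing_changed(evaluated: Pairing, serving: Pairing) -> bool:
    """True when what is being served is not what was evaluated."""
    return evaluated != serving


evaluated = Pairing("v37", "example-model-2025-01")  # hypothetical model ID
serving = Pairing("v37", "example-model-2025-02")    # same prompt, new model
if pairing_changed(evaluated, serving):
    print("pairing changed: re-run evals before trusting 'no deploys'")
```

Recording the resolved model ID rather than the alias is the design choice that matters here: the alias is exactly the name that does not move when the model does.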

The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.
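The gap between the two dashboards can be made visible by reporting first-attempt success next to post-retry success, along with attempts per call. This is a minimal sketch with hypothetical field names, assuming each logical call records how many attempts it made and whether it eventually succeeded.

```python
from dataclasses import dataclass


@dataclass
class Call:
    attempts: int    # total attempts made, including the first
    succeeded: bool  # whether any attempt eventually succeeded


def summarize(calls: list[Call]) -> dict[str, float]:
    """Report the reliability number and the cost driver side by side."""
    if not calls:
        raise ValueError("need at least one call")
    total = len(calls)
    return {
        # What the weekly slide shows: success after all retries.
        "post_retry_success": sum(c.succeeded for c in calls) / total,
        # What the provider actually delivered on the first try.
        "first_attempt_success": sum(
            c.succeeded and c.attempts == 1 for c in calls
        ) / total,
        # What the invoice scales with.
        "attempts_per_call": sum(c.attempts for c in calls) / total,
    }
```

With four calls that took 1, 3, 5, and 5 attempts, the last of which failed, this reports 75% post-retry success, 25% first-attempt success, and 3.5 attempts per call: three different views of the same traffic.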

The Prompt Hot-Reload That Orphaned Every In-Flight Agent Run

· 11 min read
Tian Pan
Software Engineer

The pager went off at 11:47pm. A customer had been ten minutes into a refund conversation when the agent suddenly stopped calling the process_refund tool it had been reasoning about for the entire session, hallucinated a confirmation number, and ended the chat. By the time we traced it back, the cause was obvious in retrospect: a teammate had pushed an updated system prompt at 11:46. The push was clean, the tests passed, and every new conversation worked perfectly. The few hundred conversations already in progress did not.

We had built our prompt registry to support what every prompt-versioning vendor in 2026 markets as a feature: hot-reload without redeploy. We had treated that capability as if it were a CDN cache flush — a global swap that takes effect everywhere at once. What it actually was, we learned that night, was a contract break. Every active conversation was an in-flight negotiation between an LLM and a set of instructions plus tool definitions it had been reasoning against. When the registry swapped the prompt under those conversations, half the negotiated context was now stale.

The Token Forecast That Mistook a Holiday Trough for the New Baseline

· 10 min read
Tian Pan
Software Engineer

A capacity planner walks into the quarterly budget review with a token forecast built from a clean trailing four-week window. Three of those four weeks happened to span a regional holiday. Daily active sessions were down 40% across that span. The forecast lands 35% under what Q+1 actually consumes, the rate-limit dashboard flatlines red on day one of the new quarter, and the postmortem finds the forecasting model behaved exactly as specified: it averaged the most recent four weeks of demand and projected forward. The model was not wrong. The window was.

This is not a story about a bad forecaster. It is a story about treating LLM token spend as if it were the same shape as the EC2 bill it shares a cost center with. The EC2 bill is governed by infrastructure decisions you control: provisioned instances, reserved capacity, scaling policies that respond to load. The token bill is governed by users who decided to take a long weekend. The first is engineering output. The second is consumer demand. A planner who confuses the two will keep building forecasts on windows the calendar guarantees are non-stationary.
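The effect of the window can be shown with a toy comparison between a plain trailing average and one that drops weeks flagged as holidays. The weekly figures and function names below are made up for illustration, and excluding flagged weeks is only one of several possible corrections.

```python
from statistics import mean


def naive_forecast(weekly_tokens: list[float]) -> float:
    """Average the trailing window as-is."""
    return mean(weekly_tokens)


def holiday_aware_forecast(
    weekly_tokens: list[float], is_holiday_week: list[bool]
) -> float:
    """Average only the weeks not flagged as holiday-affected."""
    normal = [t for t, h in zip(weekly_tokens, is_holiday_week) if not h]
    if not normal:
        raise ValueError("no non-holiday weeks in the window")
    return mean(normal)


# Illustrative units: one normal week, then three weeks at 40% lower demand.
weekly = [100, 60, 60, 60]
flags = [False, True, True, True]
print(naive_forecast(weekly))                 # 70
print(holiday_aware_forecast(weekly, flags))  # 100
```

The naive window bakes the trough into the baseline. The holiday-aware version rests on a single week of data here, which is its own risk, so a longer lookback or a same-period-last-year comparison may be a better fit in practice.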

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens is arriving per second, but they are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
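A client-side metric that would have caught this is tokens per chunk, tracked alongside tokens per second. The sketch below is a hypothetical helper that assumes the client can count the tokens in each streamed chunk and knows how long the stream lasted.

```python
def stream_stats(chunk_sizes: list[int], duration_seconds: float) -> dict[str, float]:
    """Summarize one streamed response.

    chunk_sizes: number of tokens in each chunk, in arrival order.
    duration_seconds: wall-clock time over which the chunks arrived.
    """
    if not chunk_sizes or duration_seconds <= 0:
        raise ValueError("need at least one chunk and a positive duration")
    total_tokens = sum(chunk_sizes)
    return {
        "tokens_per_second": total_tokens / duration_seconds,
        "tokens_per_chunk": total_tokens / len(chunk_sizes),
        "chunks_per_second": len(chunk_sizes) / duration_seconds,
    }


before = stream_stats([4] * 25, duration_seconds=1.0)  # four-token chunks
after = stream_stats([1] * 100, duration_seconds=1.0)  # single-token chunks
```

Both streams report 100 tokens per second, so a throughput SLO passes either way. Only `tokens_per_chunk` (4.0 versus 1.0) and `chunks_per_second` (25.0 versus 100.0) show that the renderer is now handling four times as many updates.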

The Demo You Recorded in March Was the Last Time It Worked

· 8 min read
Tian Pan
Software Engineer

A sales engineer at a Series B AI company recorded a five-minute walkthrough on a Tuesday in March. The agent picked the right tool on the first try, framed the answer in the buyer's vocabulary, and refused a gnarly edge case with a politeness that landed as "thoughtful, not hedging." That recording went into the asset library. Over the next seven weeks it closed five deals.

By the time the sixth prospect watched it on an onboarding call in late May, the model had received a provider point-release that re-tuned its refusal phrasing, the prompt had been edited twice to fix an unrelated regression, the tool catalog had grown by three entries (one of which the model now preferred), and the RAG corpus had been re-indexed against a new chunker. The demo was no longer a recording of the product. It was a recording of a product that no longer existed.

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.
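One way to make the limiter understand the loop is to charge the tenant's bucket at every model call, retries included, instead of once per HTTP request. The token bucket below is a minimal sketch with hypothetical names; the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Simple token bucket; callers pass the current time in seconds."""

    def __init__(self, rate_per_second: float, capacity: float) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, up to capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Charge the bucket per model call, not per click.
bucket = TokenBucket(rate_per_second=100, capacity=100)
calls_per_click = 30  # the fan-out described above
granted = sum(bucket.try_acquire(now=0.0) for _ in range(4 * calls_per_click))
print(granted)  # 100
```

Four simultaneous clicks ask for 120 model calls and only 100 are granted, so the limiter now sees the same load the provider bills for. What the agent should do when a call is denied mid-loop is a separate design question that this sketch does not address.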

The Chatbot That Inherited Your Support Team's Worst Habits

· 10 min read
Tian Pan
Software Engineer

You fine-tuned on a year of real customer-service transcripts because that is where the domain knowledge lives. The model now sounds like your support team. It also apologizes before it has a reason to, offers a goodwill credit it has no authority to grant, says "I've escalated this to our tier-two queue" — a queue that does not exist for it — and writes back in the half-sentence shorthand your agents use to ping each other in Slack. Domain accuracy on your eval set looks great. Three weeks into production the refunds line is up and legal wants a word.

The chatbot did not go rogue. It learned exactly what you trained it on. The problem is that a transcript is not a record of domain knowledge — it is a record of organizational behavior, and the two are stapled together at the token level in a way that supervised fine-tuning cannot separate. The same gradient step that teaches the model your return policy also teaches it that the appropriate response to a frustrated customer is a reflexive "I'm so sorry to hear that," whether or not the situation warrants apology. Your agents had reasons for those reflexes. The model has only the surface.

Your Eval Set Only Has Problems You Already Solved

· 9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

The Postmortem Where the Root Cause Was a Prompt Nobody Owned

· 9 min read
Tian Pan
Software Engineer

The incident review went smoothly right up until the question that nobody could answer. Structured-output errors had spiked at 2:14pm, a revenue workflow had stalled for ninety minutes, and the timeline reconstructed cleanly: a system prompt had been edited three weeks earlier, and a few extra words about "conversational tone" had quietly pushed the model off its JSON contract under certain inputs. The fix was a one-line revert. The hard part came next. Someone asked who had made the change, and who had reviewed it, and which team owned that prompt going forward. The room went quiet. There was no pull request. There was no reviewer. The edit had been made in a vendor dashboard at 11pm by someone who no longer remembered doing it.

That silence is the actual incident. The JSON contract breaking was a symptom. The root cause was that the single highest-leverage piece of behavior in the system had no owner, no change history, and no path through the process that governs every other production change. The model didn't fail. The model did exactly what it was told. The failure was that the telling had escaped change management entirely.

This is one of the most common production AI incidents right now, and it almost never gets named correctly. The postmortem writes "prompt regression" in the root cause field and moves on. But "prompt regression" describes the code. The real root cause is an org chart with a hole in it.