Skip to main content

80 posts tagged with "llmops"

View all tags

The Token Forecast That Mistook a Holiday Trough for the New Baseline

· 10 min read
Tian Pan
Software Engineer

A capacity planner walks into the quarterly budget review with a token forecast built from a clean trailing four-week window. Three of those four weeks happened to span a regional holiday. Daily active sessions were down 40% across that span. The forecast lands 35% under what Q+1 actually consumes, the rate-limit dashboard flatlines red on day one of the new quarter, and the postmortem finds the model behaved exactly as specified — it averaged the most recent four weeks of demand and projected forward. The model was not wrong. The window was.

This is not a story about a bad forecaster. It is a story about treating LLM token spend as if it were the same shape as the EC2 bill it shares a cost center with. The EC2 bill is governed by infrastructure decisions you control: provisioned instances, reserved capacity, scaling policies that respond to load. The token bill is governed by users who decided to take a long weekend. The first is engineering output. The second is consumer demand. A planner who confuses the two will keep building forecasts on windows the calendar guarantees are non-stationary.

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens are arriving per second, but they are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.

The Demo You Recorded in March Was the Last Time It Worked

· 8 min read
Tian Pan
Software Engineer

A sales engineer at a Series B AI company recorded a five-minute walkthrough on a Tuesday in March. The agent picked the right tool on the first try, framed the answer in the buyer's vocabulary, and refused a gnarly edge case with a politeness that landed as "thoughtful, not hedging." That recording went into the asset library. Over the next seven weeks it closed five deals.

By the time the sixth prospect watched it on an onboarding call in late May, the model had received a provider point-release that re-tuned its refusal phrasing, the prompt had been edited twice to fix an unrelated regression, the tool catalog had grown by three entries (one of which the model now preferred), and the RAG corpus had been re-indexed against a new chunker. The demo was no longer a recording of the product. It was a recording of a product that no longer existed.

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.

The Chatbot That Inherited Your Support Team's Worst Habits

· 10 min read
Tian Pan
Software Engineer

You fine-tuned on a year of real customer-service transcripts because that is where the domain knowledge lives. The model now sounds like your support team. It also apologizes before it has a reason to, offers a goodwill credit it has no authority to grant, says "I've escalated this to our tier-two queue" — a queue that does not exist for it — and writes back in the half-sentence shorthand your agents use to ping each other in Slack. Domain accuracy on your eval set looks great. Three weeks into production the refunds line is up and legal wants a word.

The chatbot did not go rogue. It learned exactly what you trained it on. The problem is that a transcript is not a record of domain knowledge — it is a record of organizational behavior, and the two are stapled together at the token level in a way that supervised fine-tuning cannot separate. The same gradient step that teaches the model your return policy also teaches it that the appropriate response to a frustrated customer is a reflexive "I'm so sorry to hear that," whether or not the situation warrants apology. Your agents had reasons for those reflexes. The model has only the surface.

Your Eval Set Only Has Problems You Already Solved

· 9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

The Postmortem Where the Root Cause Was a Prompt Nobody Owned

· 9 min read
Tian Pan
Software Engineer

The incident review went smoothly right up until the question that nobody could answer. Structured-output errors had spiked at 2:14pm, a revenue workflow had stalled for ninety minutes, and the timeline reconstructed cleanly: a system prompt had been edited three weeks earlier, and a few extra words about "conversational tone" had quietly pushed the model off its JSON contract under certain inputs. The fix was a one-line revert. The hard part came next. Someone asked who had made the change, and who had reviewed it, and which team owned that prompt going forward. The room went quiet. There was no pull request. There was no reviewer. The edit had been made in a vendor dashboard at 11pm by someone who no longer remembered doing it.

That silence is the actual incident. The JSON contract breaking was a symptom. The root cause was that the single highest-leverage piece of behavior in the system had no owner, no change history, and no path through the process that governs every other production change. The model didn't fail. The model did exactly what it was told. The failure was that the telling had escaped change management entirely.

This is one of the most common production AI incidents right now, and it almost never gets named correctly. The postmortem writes "prompt regression" in the root cause field and moves on. But "prompt regression" describes the code. The real root cause is an org chart with a hole in it.

The Demo-to-Production Cliff: Why a 90%-Accurate Agent Ships at 0%

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of meeting that happens about six weeks after an impressive agent demo. The prototype booked the trip, refactored the module, reconciled the invoices — live, on the first try, in front of stakeholders. Everyone agreed it was ready. Then someone pulled the production numbers, and the agent that "worked" was generating a support ticket every forty completed tasks, a refund every few hundred, and a quiet trail of half-finished states nobody could explain. The project did not get killed. It got stuck. It is still stuck.

This is the demo-to-production cliff, and it is the single most reliable way for an agent project to fail. The cliff is not caused by a bad model or a sloppy team. It is caused by a measurement mistake: treating a 90% success rate as 90% of the way to shipping. It is not. A 90%-accurate agent is a triumphant demo and, for most real workflows, an unshippable product. The MIT NANDA report that made headlines in 2025 — 95% of enterprise GenAI pilots delivering no measurable P&L impact — is this cliff, counted at scale.

The Eval Set That Got Easier While You Weren't Looking

· 9 min read
Tian Pan
Software Engineer

You wrote the eval set eighteen months ago. Back then it was a useful instrument: the cheap model scored 71%, the better one scored 84%, and when a regression slipped in the number dropped and somebody noticed. The suite earned its place in CI. You stopped thinking about it.

Run it today and every candidate model scores 96, 97, 98. The new release scores the same as the old one. The model you suspect is worse scores the same as the model you suspect is better. The number still renders green in the dashboard, the check still passes, and it tells you exactly nothing. Your eval set didn't break. It got easier — because the models got better underneath it — and nobody was watching the moment it stopped discriminating.

This is eval saturation, and it is not a failure mode you might hit. It is the guaranteed end state of any static suite given a long enough timeline. A test that everything passes has stopped being a test.

The Eval That Quietly Went Stale: When Your Test Suite Measures a World That No Longer Exists

· 9 min read
Tian Pan
Software Engineer

Your eval suite passed. All 240 cases green, same as last week. You ship. Two days later support tickets spike, and when you read the transcripts you find a failure mode your suite has no opinion about at all — not a case that flipped from pass to fail, but a question your users started asking that your suite never thought to ask.

This is the quiet failure of evals. We treat a green run as a statement about the present: "the system works." It is actually a statement about a past — the moment the eval cases were written. An eval authored six months ago encodes three things as they were that day: the product's scope, the model's failure modes, and the way real users phrase their requests. All three move. The feature grew a new surface. The model got upgraded twice. The input distribution drifted as users learned what the product could do. The suite did not move with them, so a green run increasingly certifies a world that no longer exists.

Nobody notices, because nothing breaks. A stale eval does not throw an error. It keeps passing, confidently, while measuring less and less of what matters.

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.