67 posts tagged with "llmops"

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One agent's "queue transferred" metric and the other's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
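
One cheap defense is to make the conversation itself carry a handoff budget, so the routing cycle becomes a loud error instead of a quiet invoice. A minimal sketch, assuming a single `route()` dispatcher; the agent names, the ping-pong check, and the budget of four are all illustrative.

```python
from dataclasses import dataclass, field

MAX_HANDOFFS = 4  # illustrative budget; tune per workflow


@dataclass
class Conversation:
    conversation_id: str
    handoff_path: list[str] = field(default_factory=list)  # agents visited, in order


class EscalationLoopError(Exception):
    pass


def route(conversation: Conversation, target_agent: str) -> str:
    """Record a handoff and refuse to route once the budget is spent
    or an agent pair starts ping-ponging."""
    path = conversation.handoff_path
    # Catch the A -> B -> A pattern before it becomes an eleven-day loop.
    if len(path) >= 2 and path[-2] == target_agent:
        raise EscalationLoopError(
            f"{path[-1]} -> {target_agent} -> {path[-1]} ping-pong on "
            f"conversation {conversation.conversation_id}"
        )
    if len(path) >= MAX_HANDOFFS:
        raise EscalationLoopError(f"handoff budget exhausted: {' -> '.join(path)}")
    path.append(target_agent)
    return target_agent


# Usage: the bounce back to 'market-data' raises instead of billing quietly.
convo = Conversation("c-123", handoff_path=["market-data"])
route(convo, "research")       # ok
# route(convo, "market-data")  # raises EscalationLoopError
```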

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.
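
One mitigation is to move the lazy initializations into a warm-up step that runs before the instance is marked ready, so the first user request never pays for them. A sketch assuming a FastAPI service; the `myagent.*` modules and their async methods are hypothetical stand-ins for whatever your stack loads on first use.

```python
# Sketch: pay the cold-start cost at boot, behind the readiness probe,
# not on the first user request after a deploy or scale-up.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from myagent.tools import ToolRegistry            # hypothetical module
from myagent.retrieval import VectorStoreClient   # hypothetical module
from myagent.prompts import prime_prompt_cache    # hypothetical module


@asynccontextmanager
async def lifespan(app: FastAPI):
    registry = ToolRegistry()
    vectors = VectorStoreClient()
    # Do the expensive work concurrently before the pod reports ready.
    await asyncio.gather(
        registry.load_and_validate_schemas(),  # the 30s registry load + 10s schema validation
        vectors.connect(),                     # the 20s first-connection handshake
        prime_prompt_cache(),                  # the 15s cache prime
    )
    app.state.registry = registry
    app.state.vectors = vectors
    yield
    await vectors.close()


app = FastAPI(lifespan=lifespan)
```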

Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

· 13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.
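
What "evaluation as an engineering problem" might look like in its smallest form: the gate lives in the repo, the judge model and dataset hash are pinned, and the threshold fails the build. Everything in the sketch below (the config shape, the `evals.runner` module, the specific model snapshots) is an assumption about structure, not a prescription.

```python
# Sketch: the release gate as a versioned artifact in the repo, not a notebook.
import hashlib
import json
import pathlib
import sys

from evals.runner import run_eval  # hypothetical in-repo runner calling the pinned judge

GATE = {
    "dataset": "evals/support_flow_v3.jsonl",   # committed alongside this file
    "dataset_sha256": "9f2c0c6b...",            # pinned full hash; placeholder shown here
    "judge_model": "gpt-4o-2024-08-06",         # a dated snapshot, never an alias
    "candidate_model": "gpt-4o-mini-2024-07-18",
    "min_pass_rate": 0.92,
}


def dataset_hash(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()


def main() -> None:
    if dataset_hash(GATE["dataset"]) != GATE["dataset_sha256"]:
        sys.exit("eval dataset changed without updating the pinned hash")
    results = run_eval(GATE)
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(json.dumps({"pass_rate": pass_rate, **GATE}, default=str))
    if pass_rate < GATE["min_pass_rate"]:
        sys.exit(f"gate failed: {pass_rate:.3f} < {GATE['min_pass_rate']}")


if __name__ == "__main__":
    main()
```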

Per-User AI Quotas: The UX Layer Your Cost Dashboard Can't See

· 10 min read
Tian Pan
Software Engineer

A user opens your AI feature at 3pm on a Tuesday. They've been using it lightly for three weeks. This time the request hangs for eight seconds and returns a red banner: "Something went wrong. Please try again later." They try again. Same banner. They close the tab and go back to whatever they were doing before — and they tell their teammate at standup the next morning that "the AI thing is broken."

What actually happened: they crossed an invisible per-user quota that your cost team set six months ago to keep a single power user from blowing through the GPU budget. The quota worked. Spend stayed flat. The dashboard is green. The feature is, by every metric your engineering org tracks, healthy. It's also dead, because the user who got that banner is never coming back, and the three teammates they told at standup will never try it.

This is the gap your cost dashboard cannot see. Per-user AI quotas are a product surface. The team that hides them inside an HTTP 429 is letting their cost-control system silently shape user perception of the product, and they will not find out until churn shows up in a quarterly review with no obvious cause.
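
Treating the quota as a product surface starts with the response body. A sketch of what the API could hand the UI instead of an opaque failure, assuming a FastAPI endpoint; `quota_store`, `run_completion`, and the field names are hypothetical.

```python
# Sketch: surface quota state to the UI instead of hiding it in a bare 429.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@app.post("/ai/complete")
async def complete(request: Request):
    user_id = request.headers["x-user-id"]
    quota = await quota_store.get(user_id)  # hypothetical per-user quota lookup
    if quota.remaining <= 0:
        # A product decision rendered as something the UI can explain,
        # not an opaque "Something went wrong."
        return JSONResponse(
            status_code=429,
            content={
                "error": "quota_exhausted",
                "message": "You've used this month's AI requests.",
                "resets_at": quota.resets_at.isoformat(),
                "upgrade_url": "/settings/ai-plan",  # hypothetical
            },
            headers={"Retry-After": str(quota.seconds_until_reset())},
        )
    return await run_completion(user_id, await request.json())  # hypothetical model call
```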

Runtime Prompt Hot-Reload: Why Your Prompts Shouldn't Be Locked Behind a Build

· 11 min read
Tian Pan
Software Engineer

The first AI incident at most companies follows a script: a prompt engineer notices the model is misclassifying a category that just started showing up in real traffic, opens a PR with a one-line tweak to the system prompt, and watches the build queue for the next 23 minutes while the model continues to misclassify in production. The fix is a string. The deployment is a binary. The mismatch is not a tooling oversight — it is an architectural decision the team made implicitly the day they put the system prompt in a .py file alongside the application code.

Coupling prompt changes to the deploy pipeline is a constraint you imposed on yourself. There is no law of distributed systems that says the model's behavior contract has to ship inside the same artifact as the orchestration code. The runtime prompt hot-reload pattern severs that coupling by treating prompts the way you already treat feature flags, routing rules, and pricing tables — as configuration pulled from a versioned store at request time, with a short-lived local cache and well-defined safety primitives around it. The payoff is incident-response measured in seconds rather than build minutes, and the cost is an honest accounting of a third deployment surface your release process probably ignores.
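
A minimal sketch of that pattern, assuming a generic versioned config store (the `config_store.get` call is a stand-in for whatever you already run), plus a short TTL cache and a last-known-good fallback so a bad fetch cannot take every request down with it:

```python
# Sketch: prompts as runtime configuration with a short-lived local cache.
# `config_store` is a hypothetical client for your versioned config service.
import time

_TTL_SECONDS = 30
_cache: dict[str, tuple[float, str]] = {}   # key -> (fetched_at, prompt_text)
_last_known_good: dict[str, str] = {}       # fallback if the store is unreachable


def get_prompt(key: str) -> str:
    now = time.monotonic()
    cached = _cache.get(key)
    if cached and now - cached[0] < _TTL_SECONDS:
        return cached[1]
    try:
        text = config_store.get(key)         # e.g. "support.system_prompt@v7"
        _cache[key] = (now, text)
        _last_known_good[key] = text
        return text
    except Exception:
        # Fail to the last prompt that worked rather than failing the request.
        if key in _last_known_good:
            return _last_known_good[key]
        raise
```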

Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

· 12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.
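
Even the easy half of those controls, a routing alias that keeps the previous model provisioned so the traffic flip itself is a config change, has to exist before the incident. The sketch below shows only that switch; the model identifiers and config shape are assumptions, and the harder state (prompt versions, caches, partially migrated conversations) is what the rest of the post covers.

```python
# Sketch: model selection behind a runtime alias, so the traffic flip is a
# config change rather than a redeploy. Names and identifiers are illustrative.
ROUTES = {
    "support-assistant": {
        "current": "gpt-4.1-2025-04-14",
        "previous": "gpt-4o-2024-08-06",   # kept provisioned and warm on purpose
        "rollback": False,                  # flipped by the on-call, not by a build
    },
}


def resolve_model(alias: str) -> str:
    route = ROUTES[alias]
    return route["previous"] if route["rollback"] else route["current"]

# The flip reverses traffic; it does not reverse prompt versions, cached
# responses, or partially migrated conversations. That is where the seven hours go.
```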

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

· 9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

· 11 min read
Tian Pan
Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.
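
The missing primitive is small: a per-request dispatch stack the runtime can consult before handing control back to a tool. A sketch of such a reentrancy guard; the depth limit and the dispatch signature are assumptions, not any framework's API.

```python
# Sketch: a per-request call stack for tool dispatch, so the A -> B -> A cycle
# is rejected at dispatch time instead of at the recursion limit.
import contextvars

_tool_stack: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
    "tool_stack", default=()
)

MAX_TOOL_DEPTH = 4  # illustrative


class ReentrancyError(Exception):
    pass


def dispatch(tool_name: str, handler, *args, **kwargs):
    stack = _tool_stack.get()
    if tool_name in stack:
        raise ReentrancyError(f"cycle detected: {' -> '.join(stack + (tool_name,))}")
    if len(stack) >= MAX_TOOL_DEPTH:
        raise ReentrancyError(f"tool nesting too deep: {' -> '.join(stack)}")
    token = _tool_stack.set(stack + (tool_name,))
    try:
        # The handler may call the model, which may dispatch again; the stack
        # survives that re-entry because it lives in the request context.
        return handler(*args, **kwargs)
    finally:
        _tool_stack.reset(token)
```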

The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product

· 11 min read
Tian Pan
Software Engineer

An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.

This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

Prompt Asset Depreciation: The Maintenance Schedule Your AI Team Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Engineering leaders are comfortable with the idea that code rots. Dependencies need updating, infrastructure has lifecycle management, certificates expire on a calendar nobody disputes. Yet the prompt repository gets treated as a write-once-read-many artifact — even though it defines how your product talks to a probabilistic engine that ships behavior changes every six weeks.

The system prompt tuned six months ago against the model that was current then is still in production. The few-shot examples chosen against a tokenizer that has since changed are still being injected on every call. The reranker prompt was tuned against an embedding endpoint the vendor deprecated last quarter. Nobody scheduled a review. Nobody is going to.

This is not a hypothetical failure mode. When one team migrated their prompt suite — meticulously stabilized against GPT-4-32k — to GPT-4.1 and GPT-4.5-preview, only 95.1% and 97.3% of their regression tests passed. A 3-5% silent quality regression is not a rounding error in production; at any non-trivial scale it is a customer-visible degradation that nobody on the team intentionally shipped. And those are the teams that even had a regression test suite. The median team's "regression test" is whatever vibes the on-call engineer formed during the last incident.

The category we are missing is prompt asset depreciation: a maintenance discipline that treats every production prompt as a depreciating asset with a known lifespan, not a constant.
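
A lightweight starting point is a manifest entry per production prompt that records what it was tuned against and when it must next be reviewed, so the missing maintenance schedule becomes a failing check. A sketch; the field names and the 90-day cadence are assumptions.

```python
# Sketch: every production prompt carries the context it was tuned against and
# a review cadence, so "nobody scheduled a review" becomes a failing CI check.
from datetime import date, timedelta

PROMPT_MANIFEST = [
    {
        "key": "support.system_prompt",
        "tuned_against": "gpt-4-32k-0613",   # the model current at tuning time
        "tuned_on": date(2024, 11, 4),
        "review_every_days": 90,
        "owner": "support-ml",
    },
]


def overdue(entry: dict, today: date | None = None) -> bool:
    today = today or date.today()
    return today - entry["tuned_on"] > timedelta(days=entry["review_every_days"])


stale = [e["key"] for e in PROMPT_MANIFEST if overdue(e)]
if stale:
    raise SystemExit(f"prompts past their review date: {stale}")
```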

The Shared-Prompt Flag Day: When One Edit Becomes Thirty Teams' Regression

· 10 min read
Tian Pan
Software Engineer

The first edit to a shared system prompt feels like good engineering. Three teams all paste the same eighteen-line safety preamble at the top of their agents, someone notices, and an internal platform team says the obvious thing: let's centralize it. A prompts.common.safety_preamble@v1 lands in a registry. Thirty teams adopt it within a quarter because it's the path of least resistance — and because security is happy that one team owns the wording. For two quarters, this looks like a clean DRY win.

Then the security team needs a small wording change. Maybe a new compliance regulation tightens what an assistant is allowed to volunteer about a user's account. Maybe a red-team finding requires a one-sentence addition to the refusal clause. The platform team makes the edit, ships v2, and within a day the support queue fills with messages from consumer teams: our eval dropped, our format broke, our tool-call rate halved, our tone changed, our latency went up because the model started reasoning more. Each team wants the edit reverted. The security team needs it shipped. Nobody can roll forward without a re-eval, and nobody owns the re-eval. Welcome to the shared-prompt flag day.
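
One mitigation is to make adoption of a new shared version an explicit per-team act instead of an implicit flag day: consumers pin a version, and the registry can tell security exactly who still owes a re-eval. A sketch with hypothetical names and elided wording:

```python
# Sketch: consumers pin an explicit version of the shared preamble, so a v2
# rollout becomes a tracked migration rather than a flag day.
SAFETY_PREAMBLE = {
    "v1": "You must never volunteer account details ...",                           # elided
    "v2": "You must never volunteer account details ... (tightened refusal clause)",  # elided
}

CONSUMER_PINS = {
    "billing-assistant": "v1",
    "orders-agent": "v2",       # has re-evaluated and opted in
    "search-copilot": "v1",
}


def preamble_for(team: str) -> str:
    return SAFETY_PREAMBLE[CONSUMER_PINS[team]]


def teams_pending_migration(target: str = "v2") -> list[str]:
    """What the platform team reports to security: who still needs a re-eval."""
    return [team for team, pin in CONSUMER_PINS.items() if pin != target]
```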