84 posts tagged with "llmops"

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

April 28, 2026 · 11 min read

Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.

The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.

This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

Prompt Asset Depreciation: The Maintenance Schedule Your AI Team Doesn't Keep

April 27, 2026 · 9 min read

Tian Pan

Software Engineer

Engineering leaders are comfortable with the idea that code rots. Dependencies need updating, infrastructure has lifecycle management, certificates expire on a calendar nobody disputes. Yet the prompt repository gets treated as a write-once-read-many artifact — even though it defines how your product talks to a probabilistic engine that ships behavior changes every six weeks.

The system prompt tuned six months ago against the model that was current then is still in production. The few-shot examples chosen against a tokenizer that has since changed are still being injected on every call. The reranker prompt was tuned against an embedding endpoint the vendor deprecated last quarter. Nobody scheduled a review. Nobody is going to.

This is not a hypothetical failure mode. When one team migrated their prompt suite — meticulously stabilized against GPT-4-32k — to GPT-4.1 and GPT-4.5-preview, only 95.1% and 97.3% of their regression tests passed. A 3-5% silent quality regression is not a rounding error in production; at any non-trivial scale it is a customer-visible degradation that nobody on the team intentionally shipped. And those are the teams that even had a regression test suite. The median team's "regression test" is whatever vibes the on-call engineer formed during the last incident.

The category we are missing is prompt asset depreciation: a maintenance discipline that treats every production prompt as a depreciating asset with a known lifespan, not a constant.

The Shared-Prompt Flag Day: When One Edit Becomes Thirty Teams' Regression

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The first edit to a shared system prompt feels like good engineering. Three teams all paste the same eighteen-line safety preamble at the top of their agents, someone notices, and an internal platform team says the obvious thing: let's centralize it. A prompts.common.safety_preamble@v1 lands in a registry. Thirty teams adopt it within a quarter because it's the path of least resistance — and because security is happy that one team owns the wording. For two quarters, this looks like a clean DRY win.

Then the security team needs a small wording change. Maybe a new compliance regulation tightens what an assistant is allowed to volunteer about a user's account. Maybe a red-team finding requires a one-sentence addition to the refusal clause. The platform team makes the edit, ships v2, and within a day the support queue fills with messages from consumer teams: our eval dropped, our format broke, our tool-call rate halved, our tone changed, our latency went up because the model started reasoning more. Each team wants the edit reverted. The security team needs it shipped. Nobody can roll forward without a re-eval, and nobody owns the re-eval. Welcome to the shared-prompt flag day.

Token Budgets Are the New Internal IAM

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The first time your AI bill clears seven figures in a month, the budget meeting changes shape. Until then, the question is "can we afford this." After that, the question is "who gets how much" — and most engineering orgs discover, in real time, that they have no policy framework for answering it. The team that shipped the loudest demo holds the highest quota by accident. Finance pushes for flat per-headcount caps that starve the team doing the highest-leverage work. Security gets cut out of the conversation entirely until somebody notices that the eval team has been pulling production traffic through their personal token allowance for six months.

The reason this conversation always feels like a cloud-cost argument is that it almost is one — but not quite. With cloud, the unit of waste is a forgotten EC2 instance and the worst case is a 3x bill. With token quotas, the unit of waste is a runaway agent loop, and the unit of access is a user-facing capability: whoever holds the budget can ship the feature. That second property is what makes token allocation rhyme with capability-based security instead of with cloud FinOps. The quota is not just a spending cap. It is the right to make a class of inferences happen.

User-Side Concept Drift: When Your Prompt Held but Your Users Moved

April 26, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams set up drift monitoring on the wrong side of the contract. They watch the model — capability shifts when a vendor pushes a new checkpoint, output distribution changes after a prompt rewrite, refusal-rate spikes that signal a safety filter retune. The dashboards are detailed, the alerts are wired into PagerDuty, and the team has a runbook for "the model moved." None of that helps when the model didn't move and the dashboard still goes red, because the thing that shifted was your users.

User-side concept drift is the version of this problem that almost every eval pipeline misses. Your prompt, your model, and your tools are byte-identical to the day you launched. Your golden test set still passes 91%. But the prompt that hit 91% in week one is now serving 78% in week thirty, because the input distribution has moved underneath it — users learned the product and changed how they ask, vocabulary mutated, seasonal task types appeared, a competitor reframed the category, a viral thread taught a new way to phrase the same intent. The model and prompt held. The contract held. The world the contract was negotiated against did not.

You Accidentally Built a Feature-Flag System for Prompts — Without the Governance

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

Pull up the config repo your team uses to ship prompt changes. Look at the last thirty commits. How many had a code review? How many had an eval gate in CI? How many can you attribute — with certainty — to a measurable change in production behavior for the users who saw them? If your answer is "most," you are an outlier. For everyone else, those commits are running in production right now, and the system reading them is doing exactly what a feature-flag service does: hot-reload a value, fan it out to users, change product behavior. The difference is that your feature-flag service has audit logs, exposure tracking, kill switches, and per-cohort targeting. Your prompt deploy pipeline has git push.

This is not a metaphor. It is an accurate description of the production system your team is running. The prompt config repo, the S3 bucket your workers poll, the "prompts" collection in your database, the LangSmith/PromptLayer/Braintrust asset that your app fetches on boot — these are all feature-flag services. They have the same runtime shape: a value lives outside the binary, the binary reads it on a hot path, changing the value changes behavior for real users without a deploy. The only thing missing is every control your SRE team demanded before they would approve the actual feature-flag service.

The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now you have an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?), and they collide.

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not raise all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).

Your AI Product Needs an SRE Before It Needs Another Model

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

"Done!" Is Not a Return Code: Why Agent Completion Needs a Structured Signal

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

An agent ends its turn with "All done — let me know if you want any changes!" and your orchestrator has to decide whether to mark the ticket resolved, kick off the next handoff, or retry. That sentence is not a return code. It is a polite closing line trained to sound reassuring at the end of a chat, and every line of automation downstream of it inherits the ambiguity. The teams that treat this as a parsing problem write regexes that catch \b(done|complete|finished)\b and call it a day. The teams whose agents run in production eventually learn that completion is an event, not a mood.

The failure mode is bimodal and boring. Either the agent announces done when it isn't — premature termination — and the orchestrator happily advances the workflow on a half-finished artifact. Or the agent is actually done, but phrases it in a way that doesn't match the detector ("I went ahead and landed the change, though the test for the edge case is still flaky"), and the orchestrator spins up a retry that re-does the work, duplicates the side effect, and sometimes contradicts the successful first pass. Both modes degrade silently. Neither shows up in a dashboard until someone reads a trace and notices that the agent said "I think that covers it" and the billing system treated that as a commit.

The fix is not smarter parsing. It is giving the agent a structured way to terminate — a done-tool with an enumerated status, a reason code, and a handle your pipeline can route on — and changing the orchestrator to wait for that event instead of listening to the chat stream.

About Tian Pan