Skip to main content

84 posts tagged with "llmops"

View all tags

AI Ops Is Not Platform Engineering: How Running LLM Services Breaks Your SRE Playbook

· 10 min read
Tian Pan
Software Engineer

Your SRE team is excellent at running microservices. They've mastered blue-green deployments, canary rollouts, distributed tracing, SLO burn-rate alerts, and postmortem culture. Then someone ships an LLM-powered feature, and within a week an incident happens that none of those practices were designed to handle: the model starts generating plausible-sounding but structurally wrong outputs, no error is logged, no health check fails, and users have been silently getting garbage for four hours before anyone noticed.

This isn't a skills gap. It's an architectural gap. Running LLM services is a distinct operational discipline from running microservices, and the practices that don't transfer will burn your team if you don't identify them explicitly.

The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

· 11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.

The Prompt Versioning Problem: Why Your Prompt Changes Are Untracked Production Risks

· 11 min read
Tian Pan
Software Engineer

Most teams treat a prompt change the way they treated a config file change in 2008: edit the string, redeploy, hope for the best. No version tag, no test suite, no rollback plan. The difference is that a config file change rarely alters the semantic behavior of your entire product — a prompt change almost always does.

If you've shipped a customer-facing LLM feature, you've probably already done this: edited a system prompt to "improve" the tone, deployed it alongside an unrelated bug fix, and had no idea three weeks later why user satisfaction dropped. The prompt was the culprit. You had no way to know.

The AI Bystander Effect: Why Five-Team Launches Ship Eval Suites Nobody Watches

· 10 min read
Tian Pan
Software Engineer

In 1964, thirty-eight people watched Kitty Genovese being attacked outside their apartment building in Queens. None of them called the police until it was too late. Latané and Darley spent the next decade explaining why: the more people who can see a problem, the less likely any single one of them is to act. They called it diffusion of responsibility. In their famous seizure experiment, 85% of participants intervened when they thought they were alone with the victim. When they believed four others could also hear the seizure, only 31% did.

Now picture your last AI feature launch. Product wrote the prompt. Engineering picked the model and wired the gateway. The data team curated the retrieval corpus. Safety bolted on the input and output filters. Customer support drafted the escalation playbook. Five teams in the room. Each one shipped its piece on time. Three months in, the feature's accuracy has quietly slid from 89% to 71%, the eval suite has not been run since launch week, and when you ask who owns the regression, every team can name three other teams that own it more.

AI Feature Soak Windows: Why a Two-Week Canary Misses What Actually Matters

· 13 min read
Tian Pan
Software Engineer

The two-week canary is one of those practices that sounds disciplined enough to skip the harder question. Engineering imported it from microservices — ramp 1% for a few days, watch error rate, ramp to 100%, declare done — and grafted it onto AI features without asking whether the failure modes that matter for AI even surface in two weeks. They don't. The bill that kills the feature lands in week six. The customer cohort that exposes the long-tail intent onboards in week five. The eval drift that scored +3% on launch day starts costing real money in week four because the new prompt's chattier outputs have been compounding token spend the whole time, and nobody was watching for that because the dashboard was watching for crashes.

A canary built around p95 latency and HTTP 500s will tell you the LLM is up. It will not tell you the feature is working. AI features fail in shapes the deploy ceremony was never designed to catch — slow shape changes in user behavior, gradual cache erosion, retrieval quality collapse, refusal-rate creep, cost trajectories that bend the wrong way — and almost all of them take longer than two weeks to declare themselves. The team that ships by the microservice clock is shipping by a clock the failures don't run on.

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. The bot's "queue transferred" metric and the other bot's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.

Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

· 13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.

Per-User AI Quotas: The UX Layer Your Cost Dashboard Can't See

· 10 min read
Tian Pan
Software Engineer

A user opens your AI feature at 3pm on a Tuesday. They've been using it lightly for three weeks. This time the request hangs for eight seconds and returns a red banner: "Something went wrong. Please try again later." They try again. Same banner. They close the tab and go back to whatever they were doing before — and they tell their teammate at standup the next morning that "the AI thing is broken."

What actually happened: they crossed an invisible per-user quota that your cost team set six months ago to keep a single power user from blowing through the GPU budget. The quota worked. Spend stayed flat. The dashboard is green. The feature is, by every metric your engineering org tracks, healthy. It's also dead, because the user who got that banner is never coming back, and the three teammates they told at standup will never try it.

This is the gap your cost dashboard cannot see. Per-user AI quotas are a product surface. The team that hides them inside an HTTP 429 is letting their cost-control system silently shape user perception of the product, and they will not find out until churn shows up in a quarterly review with no obvious cause.

Runtime Prompt Hot-Reload: Why Your Prompts Shouldn't Be Locked Behind a Build

· 11 min read
Tian Pan
Software Engineer

The first AI incident at most companies follows a script: a prompt-engineer notices the model is misclassifying a category that just started showing up in real traffic, opens a PR with a one-line tweak to the system prompt, and watches the build queue for the next 23 minutes while the model continues to misclassify in production. The fix is a string. The deployment is a binary. The mismatch is not a tooling oversight — it is an architectural decision the team made implicitly the day they put the system prompt in a .py file alongside the application code.

Coupling prompt changes to the deploy pipeline is a constraint you imposed on yourself. There is no law of distributed systems that says the model's behavior contract has to ship inside the same artifact as the orchestration code. The runtime prompt hot-reload pattern severs that coupling by treating prompts the way you already treat feature flags, routing rules, and pricing tables — as configuration pulled from a versioned store at request time, with a short-lived local cache and well-defined safety primitives around it. The payoff is incident-response measured in seconds rather than build minutes, and the cost is an honest accounting of a third deployment surface your release process probably ignores.

Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

· 12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

· 9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.