Skip to main content

108 posts tagged with "llm-ops"

View all tags

The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory

· 11 min read
Tian Pan
Software Engineer

You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.

Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. Users in variant A wrote a memory like "this customer prefers blunt summaries" and the next day, when the same user happened to be on variant B, the variant B agent loaded that memory and read it into its prompt. The reverse happened in the other direction. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch was a regression to a different point on the same surface.

Fourth-Party Risk: When Your Vendor's Vendor Owns Your Customer's Incident

· 11 min read
Tian Pan
Software Engineer

Your contract is with the model provider. Your runbook handles the case where that provider is degraded. Your status page subscription pages you when their dashboard turns yellow. You feel covered. Then one Wednesday afternoon the underlying cloud region your provider runs in starts brownouts, your provider's failover region is also affected because they consolidated capacity to control unit economics, and your product is half-down for ninety minutes because of a vendor decision two layers upstream from any contract you signed.

The customer postmortem request lands in your inbox the next morning. They want a root cause. The root cause lives in a layer your status page cannot see and your contract does not let you compel. That layer is what fourth-party risk actually is — not a procurement checkbox, but a silent dependency tier that propagates failures upward with attenuation but not absorption.

The Internal Search Box You Replaced With an Agent Just Became Your SLO

· 11 min read
Tian Pan
Software Engineer

You retired the search bar on the company portal because the agent did the same job better. Type a question in plain English, get an answer with citations, refine with a follow-up. The pilot crushed the satisfaction metric. The rollout email said "deprecating legacy search, full cutover in two weeks." Two weeks passed. The old index was decommissioned. The query box was replaced with a chat input.

Six months later, on a Tuesday morning, three things happen at once. Your inference provider rate-limits the corporate account because somebody's batch job spiked the shared quota. The embedding service has a regional brownout. A config push clears the prompt cache. Every engineer in the company who used to type "vpn setup" or "expense policy" into a search bar instead watches a spinner for forty seconds, then gets a refusal that does not understand their question, or worse, a confident citation to a wiki page that does not exist. The Slack channel where employees ask each other things lights up. The IT inbox fills with "is search broken?"

The search bar you replaced had three nines of availability over a decade of small incremental improvements. The agent that replaced it has a different shape of failure — slow not down, wrong not empty, expensive not cached — and your SRE culture was not calibrated for it.

The Provider Failover That Multiplied Your Incident Surface

· 10 min read
Tian Pan
Software Engineer

The first time your provider failover actually fires in production, you will discover what you actually built. The gateway flips the traffic over in seconds — that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that never saw a refusal before, latencies that destroy your downstream timeouts, customer-facing outputs that read like a different product. The primary came back ninety minutes later. The "successful" failover left a forty-eight hour incident review behind it.

This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.

The Prompt Log Is the Product Roadmap You Threw Away

· 9 min read
Tian Pan
Software Engineer

Somewhere in your observability stack is a table that holds every prompt a user typed into your AI feature last quarter. If your team is like most, that table is used for three things: cost attribution, abuse detection, and the occasional debugging session when a customer reports a bad answer. Nobody on the product team has ever opened it. Nobody on the research team has clustered it. The PM running the AI roadmap has never read a single row.

This is the most expensive oversight in your product organization. The prompts your users typed — especially the ones your feature handled badly — are the highest-resolution form of "what users wish this product did" you will ever collect. You are paying inference costs to generate this signal in real time, and you are throwing it away because nobody decided whose job it was to read it.

The Typo Your Agent Learned to Honor

· 10 min read
Tian Pan
Software Engineer

An insurance carrier fine-tuned a support model on a year of chat transcripts. Within a week of launch, a compliance reviewer flagged something odd: the bot kept writing "deductable" instead of "deductible." Not occasionally — consistently, in roughly the same one-in-eight messages where the word appeared. The model had not invented the misspelling. It had inherited it. A handful of tier-1 reps had been typing it that way for two years, and the corpus reflected what they typed, not what the dictionary said.

This is the unsettling thing about supervised fine-tuning on operational data: the model is not learning your domain. It is learning your corpus. Those two things overlap, but they are not the same, and the gap is where every preventable behavioral defect lives. Frequency in your training data is not a signal of correctness. It is a signal of what your team happened to do enough times for the model to mimic it.

The misspelling is the easy case to spot. The hard cases are the ones nobody bothered to write down as rules, because everyone assumed the model would learn the "professional" version of the work rather than the actual work as performed.

The Kill Switch Nobody Wired Because the Feature Never Failed

· 10 min read
Tian Pan
Software Engineer

The launch flag worked perfectly. You shipped the AI summarizer behind it, ramped 1% to 10% to 50% to 100% over two weeks, watched the dashboards, saw nothing on fire, and at the end of the quarter the platform team's flag-hygiene bot opened a PR to delete the now-redundant gate. You approved it. The PR merged with the rest of the expired-flag cleanup, and the codebase got 200 lines lighter. Six weeks later at 2am, the provider rolls a fresh model snapshot, your summarizer starts confidently fabricating clauses into legal documents, and your on-call engineer discovers there is no fast lever to turn it off — only a deploy.

The flag did its job. The flag was the wrong artifact to keep. A launch flag answers "should this new code path be reachable?" and once everyone agrees yes, deleting it is the correct hygiene move. A kill switch answers "is the upstream model behaving today?" — and that question never expires, because the upstream model never stops changing. Cleaning them up together is the same category error as treating a smoke detector like a construction permit: the permit gets archived once the building is up, but the detector stays wired forever because the thing it watches for can still happen.

The Agent That Refuses to Fail Loud: How Over-Eager Fallbacks Hide Production Regressions

· 11 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your p95 latency looks slightly better than last week. And quietly, eval-on-traffic dropped four points last Tuesday and nobody knows why for nine days, because by the time the regression rolled past the alerting threshold there were four interleaved root causes layered on top of each other and the team couldn't tell which one started the slide.

This is the dominant failure mode of mature agentic systems in 2026, and it's not a bug in any single component. It's the cumulative effect of a defensive stack the team built deliberately, one well-intentioned safety net at a time. The primary model returns garbage; the retry succeeds. The retry fails; the cheaper fallback model answers. The fallback's output is malformed; the wrapper rewrites it into a plausible shape. The wrapper logs a soft warning. Nobody alerts on the soft warning. The user receives an answer that's correct-looking, smoothly delivered, and quietly worse than the system was designed to produce.

The robustness layer worked. The quality story collapsed. And the alerting was built for the world before the robustness layer existed.

The Customer-Facing AI Postmortem When Nothing Crashed

· 12 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your uptime dashboard reads 100% for the seventh consecutive month. And yet at 9:14 AM on a Tuesday, your account team is forwarding you a message from a Fortune 500 customer that says, "Our team noticed the assistant has been worse this week. Can you tell us what changed?" Twelve more like it land before lunch. None of them will be answered by the existing incident-comms playbook, because that playbook was built for outages, and nothing has crashed.

This is the customer-facing AI postmortem problem, and it is the single most consistent gap I see across teams shipping LLM features into enterprise contracts. The reliability surface has shifted from "is it up" to "is it as good as it was last week," and almost none of the comms infrastructure has caught up. Status pages don't have a tile for it. Severity rubrics don't grade it. Support macros default to "we identified an issue and resolved it," which reads as either dismissive or alarming depending on the customer's mood that day.

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The free-tier weird traces — the ones with no template, the ones where someone is genuinely trying to figure out what the product does — never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.

Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices

· 10 min read
Tian Pan
Software Engineer

Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the model actually scoring against?

This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.

Per-Customer Prompt Forks: Why Your Next Model Migration Is 47 Migrations

· 12 min read
Tian Pan
Software Engineer

The CTO of an AI startup I talked to last month opened her laptop and showed me a number: 47. That was the count of distinct system prompts running in production, one per enterprise customer or per logical group of them. The base prompt had been forked once in month four for a healthcare customer that needed a softer refusal posture. Then once more for a legal customer that wanted citations. Then for a financial-services customer whose compliance team had a list of forbidden phrases. None of these felt like a big deal at the time. Each was a small ask, approved in isolation, that the account team could close the deal on.

Two years later, the model provider announced the cutover deadline for the version her prompts were tuned against. Her engineering team's first instinct was to run the eval suite against the new model. The eval suite was scoped to the base prompt. The base prompt was still serving customer zero, which had no overrides, and which represented roughly 9% of revenue.