
58 posts tagged with "llm-ops"


The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor

· 11 min read
Tian Pan
Software Engineer

There is a quiet myth in AI engineering that goes like this: a "minor" model bump — claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date — should be a drop-in upgrade. The provider's release notes talk about improved reasoning, lower latency, and better tool use. The version number ticks gently. Nothing about the change reads as breaking.

Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping Unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.
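A minimal sketch of the posture the myth argues against, under assumed names: pin the exact dated snapshot production runs, and when the alias wants to move, replay a fixed prompt set against both snapshots and diff the outputs before promoting. The snapshot names, the call_model helper, and the fixture file are placeholders, and exact-string diffing is the crudest possible comparison; a real suite would compare structured fields or judge scores across several samples per prompt.

```python
# Sketch: treat a "minor" snapshot bump as a behavior change until proven otherwise.
# Snapshot names, call_model, and the fixture file are hypothetical placeholders.
import difflib
import json

PINNED = "claude-x-7-20250815"      # what production actually runs
CANDIDATE = "claude-x-7-20251019"   # what the alias wants to become

def call_model(model: str, prompt: str) -> str:
    """Placeholder for your provider client."""
    raise NotImplementedError

def behavior_diff(prompts: list[str]) -> list[dict]:
    """Replay the same prompts against both snapshots and report divergences."""
    report = []
    for prompt in prompts:
        old = call_model(PINNED, prompt)
        new = call_model(CANDIDATE, prompt)
        if old != new:
            delta = "\n".join(difflib.unified_diff(
                old.splitlines(), new.splitlines(), lineterm=""))
            report.append({"prompt": prompt, "diff": delta})
    return report

if __name__ == "__main__":
    with open("replay_prompts.json") as f:   # placeholder fixture
        prompts = json.load(f)
    changed = behavior_diff(prompts)
    print(f"{len(changed)} of {len(prompts)} prompts changed behavior")
```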

Your Agent Release Notes List Files. Your Integrators Need Behavior Diffs.

· 13 min read
Tian Pan
Software Engineer

A platform team ships their weekly agent release on a Wednesday afternoon. The internal changelog is dutiful: three system-prompt commits, a model-alias bump from a -0815 snapshot to -1019, four edits to tool descriptions, a new eval-rubric weighting, and a refreshed retriever index. By Friday, the support queue has eighteen tickets that nobody on the platform team can pattern-match. Tickets two and seven say "the bot is suddenly refusing to summarize private repos." Ticket eleven says "every code block in the output now starts with a language tag, and our downstream parser breaks on it." Ticket fifteen says "tool X is being called twice as often on long inputs and we're hitting our rate limit."

None of these tickets reference any of the lines in the changelog. The platform team's release notes are a list of files moved. The integrator tickets are a list of behaviors changed. The two documents do not meet in the middle, and that gap is where the trust leaks out.
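One way to picture the missing document, as a sketch: a release-note entry keyed on the behavior an integrator observes and what they should do about it, rather than the file that moved. The schema and the causal attribution in the example are illustrative, not a format the post prescribes.

```python
# Sketch: a behavior-diff entry an integrator can act on. Field names are
# illustrative; the causal attribution in the example is hypothetical.
from dataclasses import dataclass

@dataclass
class BehaviorDiff:
    behavior: str          # what the integrator observes, in their words
    suspected_cause: str   # which changelog line most plausibly produced it
    affected_traffic: str  # which inputs or tenants see it
    action: str            # what, if anything, the integrator must change

release_notes = [
    BehaviorDiff(
        behavior="Code blocks in output now begin with a language tag",
        suspected_cause="one of the four tool-description edits (hypothetical)",
        affected_traffic="any response containing fenced code",
        action="parsers keying on a bare fence should also accept a language tag",
    ),
]
```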

Your AI Feature Ramp Is Rolling Out on the Wrong Axis

· 11 min read
Tian Pan
Software Engineer

A team I talked to last month ramped a new agentic feature from 1% to 50% of users over four weeks. Aggregate quality metrics held within noise. Latency stayed within SLA. They were preparing the 100% memo when the support queue caught fire — a customer with a six-tool research workflow had been getting silently corrupted outputs since the 10% step. The hard queries had been there the whole time, evenly sprinkled across every cohort, averaging into the noise floor. Nobody saw them until a single high-volume user happened to hit them at scale.

This is not a monitoring failure. It is a ramp-axis failure. Feature flag tooling — the entire LaunchDarkly / Flagsmith / Unleash / Cloudflare-Flagship category — assumes blast radius scales with the number of humans exposed. For deterministic software that is mostly true: a NullPointerException hits everyone or nobody, and showing it to 1% of users limits the user-visible blast to 1%. For AI features, blast radius does not scale on the human axis. It scales on the input axis. And the input axis is where almost no one is ramping.
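A sketch of ramping on the input axis, with invented thresholds: bucket each request by the properties that actually drive failure modes, and give each bucket its own ramp percentage instead of one user-hash rollout. The field names and cohort cutoffs are assumptions.

```python
# Sketch: ramp an AI feature by input characteristics, not user percentage.
# Thresholds and field names are illustrative.
import hashlib

def input_cohort(request: dict) -> str:
    """Bucket a request by the properties that drive failure modes."""
    if request.get("tool_count", 0) >= 5:
        return "long-workflow"
    if request.get("input_tokens", 0) > 20_000:
        return "long-context"
    return "median"

# Ramp schedule keyed on input cohort: risky cohorts stay small until they
# have their own eval evidence, regardless of how many users are exposed.
RAMP = {"median": 0.50, "long-context": 0.05, "long-workflow": 0.01}

def use_new_path(request: dict, request_id: str) -> bool:
    cohort = input_cohort(request)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < RAMP[cohort] * 100
```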

The Eval Bottleneck: Your Eval Engineer Is Now the Roadmap

· 11 min read
Tian Pan
Software Engineer

The constraint on your AI roadmap isn't GPU capacity, model availability, or prompt-engineering taste. It's the calendar of one or two engineers who actually know how to build an eval that catches a regression. Every PM with a feature is in their queue. Every model upgrade is in their queue. Every cohort drift, every prompt revision, every "is this judge still calibrated" question lands in the same inbox. And the engineer in question said "no, this isn't ready" three times this quarter, got overruled twice, watched the regression compound in production, and is now updating their LinkedIn.

This is the eval bottleneck, and most orgs don't see it until it bites. Through 2025 the visible scaling story was AI engineers — hire AI engineers, ship AI features, iterate on prompts, swap models. By Q1 2026 the throughput problem moved one layer down. The team that doubled its AI headcount discovered that adding more feature engineers didn't make features ship faster, because every feature still needed an eval, and the eval engineer was the same person.

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months chasing a regression that wasn't there. Every release passed the eval. Every release shipped. Every quarter, NPS on the AI-served cohort drifted down a point. Eventually, an intern doing a routine audit of the gold dataset noticed that one labeler — long since rotated off the contract — had graded 11% of the items, and that those items were systematically more lenient on a specific failure mode the team had been racing to fix. The eval said the model was getting better. The model was not getting better. The eval had been quietly tilted by one human's calibration drift, and nobody had been watching the labelers because nobody believed the labelers were a threat surface.

This is the eval-set poison pill. Most teams treat their eval set as a trusted artifact: the labels were graded by humans, the data came from production, and the regression dashboard is the one thing the org agrees to defer to when shipping. But the labeling pipeline is a human supply chain, and human supply chains are gameable. Treating an eval as ground truth without applying supply-chain hygiene to its inputs is trusting a number whose provenance you cannot defend.
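A first-pass audit in the spirit of what the intern did, as a sketch: group gold-set grades by labeler and failure mode, and flag labelers whose pass rate diverges from everyone else's on the same mode. The column names and the pandas dependency are assumptions about how the gold set is stored.

```python
# Sketch: per-labeler leniency audit on a gold eval set.
# Assumes a table with columns: item_id, labeler, failure_mode, passed (bool).
import pandas as pd

def labeler_leniency(gold: pd.DataFrame, min_items: int = 30) -> pd.DataFrame:
    """Flag labelers whose pass rate on a failure mode diverges from the rest."""
    rows = []
    for (mode, labeler), group in gold.groupby(["failure_mode", "labeler"]):
        if len(group) < min_items:
            continue
        others = gold[(gold.failure_mode == mode) & (gold.labeler != labeler)]
        rows.append({
            "failure_mode": mode,
            "labeler": labeler,
            "items": len(group),
            "pass_rate": group.passed.mean(),
            "others_pass_rate": others.passed.mean(),
            "gap": group.passed.mean() - others.passed.mean(),
        })
    return pd.DataFrame(rows).sort_values("gap", ascending=False)
```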

Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap

· 11 min read
Tian Pan
Software Engineer

There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.

This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets — endpoint, status_code, region, service — and the cost of an additional label is at most a few new time series. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, the eval bucket the request fell into. Every one of those is high-cardinality, and any subset of them is enough to detonate the metrics store the moment you tag a span with it.
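A minimal sketch of the split this implies, using the OpenTelemetry Python API: the high-cardinality facts ride on the span, where they stay searchable per trace, while the metric keeps a small bounded label set. The attribute names are illustrative rather than any official semantic convention, and run_llm_call is a stand-in for the real provider call.

```python
# Sketch: high-cardinality LLM context goes on traces, not on metric labels.
# Attribute names are illustrative; run_llm_call is a placeholder.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("llm.ops")
meter = metrics.get_meter("llm.ops")
call_counter = meter.create_counter("llm.calls")

def run_llm_call(prompt: str) -> dict:
    """Placeholder for the real provider call."""
    return {"status": "ok", "text": ""}

def handle_request(prompt: str, tenant: str, template_version: str, model_rev: str) -> dict:
    with tracer.start_as_current_span("llm.call") as span:
        # High-cardinality: fine as span attributes, fatal as metric labels.
        span.set_attribute("llm.prompt_template_version", template_version)
        span.set_attribute("llm.model_revision", model_rev)
        span.set_attribute("llm.tenant", tenant)
        span.set_attribute("llm.prompt_chars", len(prompt))
        result = run_llm_call(prompt)
        # Bounded cardinality only on the metrics side.
        call_counter.add(1, {"model_revision": model_rev, "status": result["status"]})
        return result
```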

Pagination Is a Tool-Catalog Discipline: Why Agents Burn Context on List Returns

· 11 min read
Tian Pan
Software Engineer

Every well-designed HTTP API in your stack returns paginated results. Nobody loads a million rows into memory and hopes for the best. Yet the tools your agent calls return the entire list, and the agent dutifully reads it, because the function signature says list_orders() -> Order[] and the agent has no protocol for "give me the next page" the way a human user has scroll-and-load-more.

The agent burns tokens on rows it could have skipped. The long-tail customer with 50K records hits context-window failures the median customer never sees. The tool author cannot tell from the trace whether the agent needed all those rows or simply could not ask for fewer. And somewhere in your eval suite, the regression that would have flagged this never runs because every test fixture has fewer than 100 records.

Pagination is not a UI affordance. It is a load-shedding primitive — and the agent that consumes a tool without it is reimplementing every SELECT * FROM orders mistake the API designers in your company spent a decade learning to avoid.
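A sketch of the shape the post is arguing for: the tool takes a cursor and a page limit, and returns a next_cursor the agent can choose to follow or not. The Order fields, the cursor encoding, and the backing query are invented for illustration.

```python
# Sketch: a cursor-paginated tool instead of list_orders() -> Order[].
# Order fields, the cursor encoding, and _fetch_orders are illustrative.
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    total_cents: int

@dataclass
class OrderPage:
    orders: list[Order]
    next_cursor: str | None   # None means the agent has seen everything

def _fetch_orders(customer_id: str) -> list[Order]:
    """Placeholder for the real datastore query."""
    return [Order(order_id=f"o{i}", total_cents=100 * i) for i in range(50_000)]

def list_orders(customer_id: str, cursor: str | None = None,
                limit: int = 50) -> OrderPage:
    """Tool exposed to the agent: returns at most `limit` rows per call."""
    all_orders = _fetch_orders(customer_id)
    start = int(cursor) if cursor else 0
    page = all_orders[start:start + limit]
    next_cursor = str(start + limit) if start + limit < len(all_orders) else None
    return OrderPage(orders=page, next_cursor=next_cursor)
```

With a signature like this, the trace also answers the question the tool author could not: how many pages the agent actually asked for before it stopped.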

Prompt Linting Is the Missing Layer Between Eval and Production

· 11 min read
Tian Pan
Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.
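A minimal sketch of that structural layer as a lint, run on the prompt text itself with no model call: required sections present, template variables resolvable, length budget held, deprecated identifiers absent. The section markers, variable syntax, and budgets are assumptions about how the template is organized.

```python
# Sketch: structural checks on a prompt template, no model call involved.
# Section names, the {variable} syntax, and budgets are illustrative.
import re

REQUIRED_SECTIONS = ["## Safety", "## Output format"]
DECLARED_VARIABLES = {"user_query", "retrieved_context"}
MAX_CHARS = 12_000
DEPRECATED_IDS = ["legacy_tool_v1"]

def lint_prompt(template: str) -> list[str]:
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in template:
            errors.append(f"missing required section: {section}")
    for var in re.findall(r"\{(\w+)\}", template):
        if var not in DECLARED_VARIABLES:
            errors.append(f"unresolvable variable: {{{var}}}")
    if len(template) > MAX_CHARS:
        errors.append(f"template exceeds length budget ({len(template)} > {MAX_CHARS})")
    for ident in DEPRECATED_IDS:
        if ident in template:
            errors.append(f"deprecated identifier present: {ident}")
    return errors
```

Run in CI on every prompt PR, a check like this fails the build on the deleted safety clause instead of waiting for the right edge case in production.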

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
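One practical consequence, sketched under assumed field names: stamp every eval call with the conditions it actually ran under, and report the suite per time-of-day slice instead of as one number.

```python
# Sketch: record the conditions of each eval call so a 2 AM run and an 8 AM
# run can be compared as different operating points. Field names are illustrative.
import statistics
from collections import defaultdict
from datetime import datetime, timezone

def record_eval_call(results: list[dict], latency_ms: float, judge_score: float,
                     provider_status: int, model_revision: str) -> None:
    results.append({
        "utc_hour": datetime.now(timezone.utc).hour,
        "latency_ms": latency_ms,
        "judge_score": judge_score,
        "provider_status": provider_status,
        "model_revision": model_revision,
    })

def by_hour(results: list[dict]) -> dict[int, dict]:
    """p95 latency and median judge score per UTC hour."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for r in results:
        buckets[r["utc_hour"]].append(r)
    return {
        hour: {
            "p95_latency_ms": sorted(x["latency_ms"] for x in rows)[int(0.95 * (len(rows) - 1))],
            "median_judge": statistics.median(x["judge_score"] for x in rows),
            "n": len(rows),
        }
        for hour, rows in buckets.items()
    }
```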

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The AI Feature OKR Mismatch: Why Quarterly Cadence Breaks AI Roadmaps

· 10 min read
Tian Pan
Software Engineer

The team commits to "ship the AI summarizer this quarter," gets it past the technical bar by week ten, takes a victory lap at the all-hands, and ships. Six weeks later the telemetry curve starts bending the wrong way — quietly, slowly, in a way nobody dashboards because nobody owns the shape. The OKR is already marked green. The next quarter's OKRs are already drafted around new launches. The summarizer is now somebody's second-priority maintenance job, and by quarter-end review the team is wondering why customer satisfaction on the feature dropped fifteen points without anything obvious changing.

This is not a bug in the team. It's a bug in the operating model. Quarterly OKRs were calibrated for software where a feature can be scoped, built, shipped, and then largely left alone until the next major rev. AI features don't have that shape. They have a launch curve and a sustain curve, and the sustain curve is where most of the value — and most of the risk — actually lives. The OKR template that treats them as deliverables with launch dates quietly produces a portfolio of demos that decay before the next planning cycle.

The Five-Surface Triage Tree: An AI On-Call Playbook for Pages That Don't Fit Your Runbook

· 12 min read
Tian Pan
Software Engineer

The page fires at 2:47 AM. The agent is sending wrong-tone replies to customer support tickets, the latency dashboard is flat, the error rate is normal, and there is nothing to roll back because nothing was deployed in the last twelve hours. The on-call engineer opens the runbook, scrolls past "restart the worker pool" and "scale the queue," reaches the bottom, and finds nothing that maps to the page in front of them. They start reading the system prompt at 3:04 AM. They are still reading it at 3:31 AM.

This is the new failure shape, and the rotation that was designed for "high latency means restart the pod, elevated 5xx means roll back the deploy, queue depth growing means scale the worker pool" is not equipped to handle it. The first instinct — roll back the deploy — is wrong because nothing was deployed: the model upgraded silently behind a versioned alias, a third-party tool's response shape drifted, the prompt version skewed across regions, or the eval set went stale weeks ago and the regression has been compounding the whole time. The page is real. The runbook is silent. AI on-call is its own discipline now, and trying to retrofit it into the existing rotation produces playbooks whose first step is silence on the call while everyone reads the prompt for the first time.
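For flavor, a sketch of a first pass over just the drift sources this excerpt names, run before anyone starts reading the prompt. It is not the article's five-surface tree; the observation keys are hypothetical hooks into whatever your own control plane can answer.

```python
# Sketch: a first-pass triage over the drift sources named above.
# The observation keys are hypothetical; wire them to your own systems.
def triage(observations: dict) -> list[str]:
    findings = []
    if observations.get("alias_snapshot_changed_last_24h"):
        findings.append("model alias silently resolves to a new snapshot")
    if observations.get("tool_schema_changed_last_24h"):
        findings.append("third-party tool response shape drifted")
    if len(set(observations.get("prompt_version_by_region", {}).values())) > 1:
        findings.append("prompt version skew across regions")
    if observations.get("eval_set_age_days", 0) > 30:
        findings.append("eval set stale; regression may predate this page")
    return findings or ["no drift on the usual surfaces; escalate past the runbook"]

# Example: the 2:47 AM page where "nothing was deployed".
print(triage({
    "alias_snapshot_changed_last_24h": True,
    "prompt_version_by_region": {"us-east": "v41", "eu-west": "v41"},
    "eval_set_age_days": 45,
}))
```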