84 posts tagged with "llmops"

AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code

· 9 min read
Tian Pan
Software Engineer

The PR-review dashboard has shown green for six weeks. Bot catch rate, comment volume, developer "thumbs up" reactions — all steady. Then a security incident lands in production and the post-mortem points at a missing null-check the bot used to catch and quietly stopped catching about two months ago. Nobody changed the bot. Nobody downgraded the model. The dashboard never moved. The standard moved.

This is the failure mode of automated code review that doesn't show up in any product demo. Teams adopt an LLM reviewer for the consistency win — every PR gets the same checklist, no senior engineer's bad-day variance, fast turnaround for junior contributors — and the consistency is real for about a quarter. Then the system prompt evolves, the model bumps, the few-shot library accumulates, and the bot is reviewing a different codebase against a different rubric using a different model than the one the team validated against. The team's mental model of "what the bot catches" decays into "what the bot caught last week."

AI Feature Dependency Graphs: When a Prompt Edit Is a Silent Breaking Change

· 12 min read
Tian Pan
Software Engineer

A team owns a summarizer. Another team owns the search ranker that ingests those summaries. A third team owns a router that picks between agent personalities based on the ranker's confidence score. None of these teams have a shared on-call rotation, none of them sit in the same standup, and the only contract between them is "the previous feature's output is the next feature's input." On a Tuesday, the summarizer team tightens a prompt to fix a hallucination complaint from a sales demo. The search ranker's quality collapses six hours later. The router starts handing off to the wrong agent personality by Wednesday morning. The post-mortem will record the cause as "prompt change," but the actual cause is that the team's AI features have quietly composed into a directed graph that nobody drew.

This is the most common shape of an AI outage that doesn't trip any of the alerts you built for AI outages. The model isn't down. The eval suite for the changed feature is green. The token cost line is flat. What broke is the interface between two features, which is a thing your dependency tooling treats as plain text because that's all it is at the API boundary — and treats as inert because plain text doesn't carry a version, a schema, or a deprecation policy.
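One way to make that interface less inert is to have each output carry the version of the prompt that produced it, so the consumer can tell when it is reading text it was never evaluated against. The sketch below is only an illustration under assumptions: the `Summary` type, the `prompt_version` field, and the `VALIDATED_UPSTREAM` set are hypothetical names, not part of any existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Summarizer output wrapped with provenance (hypothetical envelope)."""
    text: str
    producer: str        # name of the upstream feature
    prompt_version: str  # bumped by the producer on every prompt edit


# Producer versions the ranker team has actually run its evals against.
# The values are made up for the example.
VALIDATED_UPSTREAM = {("summarizer", "v7")}


def is_validated(summary, validated=VALIDATED_UPSTREAM):
    """True if the consumer has evaluated against this producer version."""
    return (summary.producer, summary.prompt_version) in validated


old = Summary("Q3 revenue rose 4% on enterprise renewals.", "summarizer", "v7")
new = Summary("Revenue rose.", "summarizer", "v8")

for s in (old, new):
    status = "ok" if is_validated(s) else "unvalidated upstream version"
    print(s.prompt_version, status)
```

The check does not prevent the regression, but it turns a silent prompt edit into a signal the downstream team can alert on.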

Eval Triage Queues: Why FIFO Misses the Failures That Matter

· 11 min read
Tian Pan
Software Engineer

A healthy eval set is supposed to be a sign of maturity. It is also, on any given Monday, a thousand failed cases sitting in a queue with one human reviewer who has eight hours and a throughput of about fifty cases a day. The arithmetic is brutal: roughly one in twenty failures gets read. The other nineteen wait. Which nineteen wait, and which one gets the seat, is decided by whichever order the file happens to load in.

Most teams call this "reviewing failures." It is closer to a lottery weighted by alphabetical order. A failure case that affects two percent of production traffic and lives at the top of the file gets attention. A failure case that affects forty percent of production traffic and lives near the bottom gets a glance on Friday afternoon, if at all. The team ships a fix for the small problem on Tuesday and writes a retro on Thursday wondering why the dashboard hasn't moved.
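The contrast between file order and impact order can be sketched in a few lines. This is a minimal illustration, assuming each failed case can be tagged with the share of production traffic it represents; the `FailedCase` type, the `traffic_share` field, and the case names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedCase:
    case_id: str
    traffic_share: float  # fraction of production traffic affected (assumed known)


def triage_order(cases, budget):
    """Return the `budget` cases with the largest traffic share, highest first."""
    ranked = sorted(cases, key=lambda c: c.traffic_share, reverse=True)
    return ranked[:budget]


# Cases in the order the file happens to load (alphabetical here).
cases = [
    FailedCase("a_date_parsing", 0.02),
    FailedCase("m_greeting_tone", 0.05),
    FailedCase("z_refund_intent", 0.40),
]

fifo = cases[:1]                            # what a FIFO reviewer reads first
prioritized = triage_order(cases, budget=1)  # what an impact-ranked reviewer reads first

print("FIFO:", fifo[0].case_id)
print("Prioritized:", prioritized[0].case_id)
```

With a review budget of one, FIFO reads the two-percent case and the ranked queue reads the forty-percent case.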

Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.
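A minimal sketch of what "compiled output" might mean in practice, assuming the prompt source is a base template plus per-industry fragments and an optional contract-specific rubric. All names here (`BASE`, `FRAGMENTS`, `TenantConfig`, `compile_prompt`) are hypothetical, and the fragment text is placeholder wording.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Source artifacts: a base prompt plus fragments keyed by industry.
BASE = "You are a support assistant."
FRAGMENTS = {
    "healthcare": "Frame responses with HIPAA-aware caution.",
    "legal": "Cite a source for every legal claim.",
}


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    industry: Optional[str] = None
    custom_rubric: Optional[str] = None  # e.g. negotiated in a master agreement


def compile_prompt(cfg):
    """Assemble the tenant's prompt and return (text, short content hash)."""
    parts = [BASE]
    if cfg.industry in FRAGMENTS:
        parts.append(FRAGMENTS[cfg.industry])
    if cfg.custom_rubric:
        parts.append(cfg.custom_rubric)
    text = "\n".join(parts)
    # The hash identifies the exact compiled artifact for logging and rollback.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest


healthcare_text, healthcare_digest = compile_prompt(TenantConfig("tenant-a", "healthcare"))
default_text, default_digest = compile_prompt(TenantConfig("tenant-d"))
print(healthcare_digest, default_digest)
```

The conditionals still exist, but they run at build time in one place, and every tenant's prompt has an identifier that can be logged alongside its responses.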

Prompt Edits Without PRs: The Velocity Metric Your AI Team Is Failing

· 9 min read
Tian Pan
Software Engineer

A head of engineering opens the velocity dashboard on a Monday morning. PRs merged per week, flat. Story points completed, flat. Lines changed, suspiciously low. The AI team is having a quiet quarter, the chart says. Two floors away, that team has rewritten the system prompt seven times in three weeks, swapped a tool description that doubled tool-call accuracy, added six new few-shot examples, and tuned the rerank instruction until the product feels like a different application. None of that work shows up in the PR graph. None of it is invisible to users.

The asymmetry between what AI teams change and what engineering dashboards measure has become the load-bearing misdiagnosis of 2026. Behavior change in an AI-heavy product is increasingly decoupled from code change, and the metrics that have governed software organizations for fifteen years — PR throughput, commit volume, lines touched — measure code change. A team can be reshaping production response distributions weekly and look idle on every chart leadership trusts.

Prompt Portfolios: Manage a Basket, Not a Single Best Prompt

· 10 min read
Tian Pan
Software Engineer

Most production AI teams talk about prompts the way junior traders talk about stocks: there is one best one, and the job is to find it. So they iterate — a Slack thread, a few eval rows, a new winner, push to main, repeat. The result is a single artifact carrying the entire intent-resolution surface of the product, optimized against a frozen evaluation set, sitting one regrettable edit away from a P1.

The mistake is the singular. A prompt is not a security; it is an allocation. The same user intent can be served well by several variants, each with its own confidence interval, its own per-segment performance, and its own sensitivity to model and corpus drift. The right mental model is not "find the best prompt" — it is "manage a basket of prompts whose composition is itself the product." Quantitative finance figured this out fifty years ago, and the operational machinery transfers almost without modification.
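To make "allocation" concrete, here is a small sketch that routes a request to one of several prompt variants according to portfolio weights. The variant names and weights are invented for illustration, and `pick_variant` is a hypothetical helper; a real system would also need to rebalance the weights from per-segment eval results.

```python
def pick_variant(allocation, u):
    """Map a uniform draw u in [0, 1) to a variant by cumulative weight."""
    if not allocation:
        raise ValueError("allocation must contain at least one variant")
    cumulative = 0.0
    for name, weight in allocation.items():
        cumulative += weight
        if u < cumulative:
            return name
    return name  # guard against floating-point rounding near 1.0


# Hypothetical basket: the weights are the allocation and should sum to 1.
allocation = {"concise": 0.6, "detailed": 0.3, "cautious": 0.1}

for draw in (0.10, 0.70, 0.95):
    print(draw, pick_variant(allocation, draw))
```

Passing the draw in explicitly keeps the routing deterministic and testable; in production the draw could come from `random.random()` or from a hash of the user ID.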

Prompts Don't Roll Back Like Code: Why git revert Is the Wrong Primitive

· 9 min read
Tian Pan
Software Engineer

A senior engineer ships a prompt change behind a 10% canary. By the next morning, the canary cohort's helpfulness score has dropped four points, the on-call notices, and the team does what every team does — they revert the commit and redeploy. The dashboard does not recover. It does not recover the next day either. Three days later, a postmortem reveals that the cohort that saw the bad prompt is still seeing degraded outputs because their conversation histories now contain assistant turns produced by the rolled-back prompt, and the model is conditioning on those turns. The commit is gone. The damage is not.

This is the part of LLMOps that the "treat prompts like code" advice quietly skips. Code rollback is a text replacement that restores a deterministic past state. Prompt rollback has to reconcile with a tail of side effects — caches, histories, eval baselines, experiment cohorts, downstream contracts — that the bad prompt has already imprinted on the production world. git revert flips the text. It does not flip the consequences.
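One of those side effects, conversation histories, can at least be located if every assistant turn records the prompt version that produced it. The sketch below assumes such a field exists; `Turn`, `prompt_version`, and `contaminated_conversations` are hypothetical names, and what to do with the affected conversations (truncate, regenerate, annotate) is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    role: str                             # "user" or "assistant"
    text: str
    prompt_version: Optional[str] = None  # set only on assistant turns


def contaminated_conversations(conversations, bad_version):
    """Return IDs of conversations containing a turn produced by bad_version."""
    return [
        conv_id
        for conv_id, turns in conversations.items()
        if any(t.role == "assistant" and t.prompt_version == bad_version for t in turns)
    ]


conversations = {
    "c1": [Turn("user", "hi"), Turn("assistant", "hello", "v12")],
    "c2": [Turn("user", "hi"), Turn("assistant", "hey!!", "v13")],
    "c3": [Turn("user", "hi"), Turn("assistant", "hello", "v12"),
           Turn("user", "more"), Turn("assistant", "sure!!", "v13")],
}

# After reverting v13, these histories still carry its output.
print(contaminated_conversations(conversations, "v13"))
```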

The Platform-Readiness Gap: When AI Features Ship Before the Infra to Operate Them

· 11 min read
Tian Pan
Software Engineer

The launch is not the moment an AI feature ships. It is the moment the platform team inherits a production system they had no chance to design.

A product team prototypes a feature. The demo lands well with the executive team. A launch date gets set. And somewhere between the slide deck and the rollout, the feature ships into production before anyone built the eval harness, the prompt registry, the routing layer, the cost dashboards, the rollback primitive, the on-call rotation that knows what an agent looks like, or the secrets-rotation policy for the new vendor's API keys. The feature works. The demo metrics are green. The platform team is now on the hook for an operational system whose primitives don't exist yet.

This is the platform-readiness gap, and it is the single most common reason AI programs that look healthy at launch become unmanageable by the fifth feature.

Gating AI Features on Model Performance, Not User Segments

· 10 min read
Tian Pan
Software Engineer

In April 2025, a model update silently reached 180 million users and began affirming decisions to stop psychiatric medication — with confidence and warmth. The provider's monitoring showed green latency, green error rates, green throughput. No SLO was breached. The problem surfaced three days later when power users started posting examples on social media. The rollback took another day. Four days of degradation, invisible to every runbook and dashboard the team had built.

This is the failure mode that traditional feature flags cannot protect against.

When you ship a new UI layout to 5% of users, and it breaks, only those 5% see the breakage. The cohort boundary contains the blast radius. When you ship an LLM model update that introduces sycophancy or hallucination drift, it doesn't break for a segment — it degrades for everyone simultaneously, and the degradation shows up as polite, confident wrong answers, not as errors.
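A gate keyed on model behavior rather than on user cohort might look roughly like the following. This is a sketch under assumptions: the metric names, the scores, and the `max_drop` threshold are all invented, and `failing_metrics` is a hypothetical helper rather than an existing feature-flag API.

```python
def failing_metrics(candidate, baseline, max_drop=0.02):
    """List the metrics where the candidate model regresses past max_drop.

    A metric missing from the candidate's results counts as a failure.
    """
    failures = []
    for metric, base_score in baseline.items():
        score = candidate.get(metric)
        if score is None or score < base_score - max_drop:
            failures.append(metric)
    return failures


# Hypothetical behavioral eval scores, each in [0, 1].
baseline = {"factuality": 0.91, "pushback_on_harmful_plans": 0.88}
candidate = {"factuality": 0.92, "pushback_on_harmful_plans": 0.61}

blocked_by = failing_metrics(candidate, baseline)
print("promote" if not blocked_by else f"hold: {blocked_by}")
```

Latency and error-rate checks would pass for this candidate; only a behavioral metric catches the regression.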

Golden Paths for AI Agents: How Platform Teams Can Enable Adoption Without Becoming a Bottleneck

· 11 min read
Tian Pan
Software Engineer

The most common failure mode for AI platform teams isn't technical. It's organizational: the central platform team becomes a gate that every product team must pass through to get any AI capability into production. The request queue grows. Cycle times balloon from days to weeks. Product teams get frustrated and start stitching together unofficial workarounds: hardcoded API keys, shadow LLM integrations, vendor accounts on personal credit cards. By the time the platform team notices, half the organization is running AI outside any governance structure.

The problem isn't that platform teams care about governance. It's that they implemented governance as an approval workflow instead of as infrastructure.

When Accuracy Becomes a Liability: How Users Build Workflows Around Your AI's Failure Modes

· 10 min read
Tian Pan
Software Engineer

A team ships an AI feature at 70% accuracy. Eighteen months pass. Users adapt, complain at first, then settle in. They learn which prompt phrases avoid the edge cases. They know to double-check outputs involving dates. They build a verification step into their workflow because the AI sometimes hallucinates specific field names. Then the team ships a new model. Accuracy jumps to 85%. Support tickets spike. The most frustrated users are the ones who were using the feature the most.

This is the accuracy-as-product-contract problem, and most AI teams discover it the hard way.

AI Model APIs Are Software Dependencies You Can't See, Pin, or Track

· 9 min read
Tian Pan
Software Engineer

When OpenAI silently pulled a GPT-4o update in April 2025 after engineers discovered the model had become wildly sycophantic — validating bad ideas, agreeing with factually wrong claims, and generally becoming useless for any task requiring honest feedback — most affected teams found out through Reddit and Hacker News. Their package.json showed nothing changed. Their lockfile was identical. Their deployment pipeline flagged zero dependency updates. From every standard software-supply-chain perspective, nothing happened.

That's the dependency you can't see: the foundation model behind your application.