67 posts tagged with "llmops"

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One agent's "queue transferred" metric and the other's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
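
One cheap defense is to make the conversation itself carry a handoff budget, so the routing cycle becomes a loud error instead of a quiet invoice. A minimal sketch, assuming a single `route()` dispatcher; the agent names, the ping-pong check, and the budget of four are all illustrative.

```python
from dataclasses import dataclass, field

MAX_HANDOFFS = 4  # illustrative budget; tune per workflow


@dataclass
class Conversation:
    conversation_id: str
    handoff_path: list[str] = field(default_factory=list)  # agents visited, in order


class EscalationLoopError(Exception):
    pass


def route(conversation: Conversation, target_agent: str) -> str:
    """Record a handoff and refuse to route once the budget is spent
    or an agent pair starts ping-ponging."""
    path = conversation.handoff_path
    # Catch the A -> B -> A pattern before it becomes an eleven-day loop.
    if len(path) >= 2 and path[-2] == target_agent:
        raise EscalationLoopError(
            f"{path[-1]} -> {target_agent} -> {path[-1]} ping-pong on "
            f"conversation {conversation.conversation_id}"
        )
    if len(path) >= MAX_HANDOFFS:
        raise EscalationLoopError(f"handoff budget exhausted: {' -> '.join(path)}")
    path.append(target_agent)
    return target_agent


# Usage: the bounce back to 'market-data' raises instead of billing quietly.
convo = Conversation("c-123", handoff_path=["market-data"])
route(convo, "research")       # ok
# route(convo, "market-data")  # raises EscalationLoopError
```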

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.
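
One mitigation is to move the lazy initializations into a warm-up step that runs before the instance is marked ready, so the first user request never pays for them. A sketch assuming a FastAPI service; the `myagent.*` modules and their async methods are hypothetical stand-ins for whatever your stack loads on first use.

```python
# Sketch: pay the cold-start cost at boot, behind the readiness probe,
# not on the first user request after a deploy or scale-up.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from myagent.tools import ToolRegistry            # hypothetical module
from myagent.retrieval import VectorStoreClient   # hypothetical module
from myagent.prompts import prime_prompt_cache    # hypothetical module


@asynccontextmanager
async def lifespan(app: FastAPI):
    registry = ToolRegistry()
    vectors = VectorStoreClient()
    # Do the expensive work concurrently before the pod reports ready.
    await asyncio.gather(
        registry.load_and_validate_schemas(),  # the 30s registry load + 10s schema validation
        vectors.connect(),                     # the 20s first-connection handshake
        prime_prompt_cache(),                  # the 15s cache prime
    )
    app.state.registry = registry
    app.state.vectors = vectors
    yield
    await vectors.close()


app = FastAPI(lifespan=lifespan)
```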

Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

· 13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.
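
What "evaluation as an engineering problem" might look like in its smallest form: the gate lives in the repo, the judge model and dataset hash are pinned, and the threshold fails the build. Everything in the sketch below (the config shape, the `evals.runner` module, the specific model snapshots) is an assumption about structure, not a prescription.

```python
# Sketch: the release gate as a versioned artifact in the repo, not a notebook.
import hashlib
import json
import pathlib
import sys

from evals.runner import run_eval  # hypothetical in-repo runner calling the pinned judge

GATE = {
    "dataset": "evals/support_flow_v3.jsonl",   # committed alongside this file
    "dataset_sha256": "9f2c0c6b...",            # pinned full hash; placeholder shown here
    "judge_model": "gpt-4o-2024-08-06",         # a dated snapshot, never an alias
    "candidate_model": "gpt-4o-mini-2024-07-18",
    "min_pass_rate": 0.92,
}


def dataset_hash(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()


def main() -> None:
    if dataset_hash(GATE["dataset"]) != GATE["dataset_sha256"]:
        sys.exit("eval dataset changed without updating the pinned hash")
    results = run_eval(GATE)
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(json.dumps({"pass_rate": pass_rate, **GATE}, default=str))
    if pass_rate < GATE["min_pass_rate"]:
        sys.exit(f"gate failed: {pass_rate:.3f} < {GATE['min_pass_rate']}")


if __name__ == "__main__":
    main()
```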

Per-User AI Quotas: The UX Layer Your Cost Dashboard Can't See

· 10 min read
Tian Pan
Software Engineer

A user opens your AI feature at 3pm on a Tuesday. They've been using it lightly for three weeks. This time the request hangs for eight seconds and returns a red banner: "Something went wrong. Please try again later." They try again. Same banner. They close the tab and go back to whatever they were doing before — and they tell their teammate at standup the next morning that "the AI thing is broken."

What actually happened: they crossed an invisible per-user quota that your cost team set six months ago to keep a single power user from blowing through the GPU budget. The quota worked. Spend stayed flat. The dashboard is green. The feature is, by every metric your engineering org tracks, healthy. It's also dead, because the user who got that banner is never coming back, and the three teammates they told at standup will never try it.

This is the gap your cost dashboard cannot see. Per-user AI quotas are a product surface. The team that hides them inside an HTTP 429 is letting their cost-control system silently shape user perception of the product, and they will not find out until churn shows up in a quarterly review with no obvious cause.
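
Treating the quota as a product surface starts with the response body. A sketch of what the API could hand the UI instead of an opaque failure, assuming a FastAPI endpoint; `quota_store`, `run_completion`, and the field names are hypothetical.

```python
# Sketch: surface quota state to the UI instead of hiding it in a bare 429.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@app.post("/ai/complete")
async def complete(request: Request):
    user_id = request.headers["x-user-id"]
    quota = await quota_store.get(user_id)  # hypothetical per-user quota lookup
    if quota.remaining <= 0:
        # A product decision rendered as something the UI can explain,
        # not an opaque "Something went wrong."
        return JSONResponse(
            status_code=429,
            content={
                "error": "quota_exhausted",
                "message": "You've used this month's AI requests.",
                "resets_at": quota.resets_at.isoformat(),
                "upgrade_url": "/settings/ai-plan",  # hypothetical
            },
            headers={"Retry-After": str(quota.seconds_until_reset())},
        )
    return await run_completion(user_id, await request.json())  # hypothetical model call
```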

Runtime Prompt Hot-Reload: Why Your Prompts Shouldn't Be Locked Behind a Build

· 11 min read
Tian Pan
Software Engineer

The first AI incident at most companies follows a script: a prompt engineer notices the model is misclassifying a category that just started showing up in real traffic, opens a PR with a one-line tweak to the system prompt, and watches the build queue for the next 23 minutes while the model continues to misclassify in production. The fix is a string. The deployment is a binary. The mismatch is not a tooling oversight — it is an architectural decision the team made implicitly the day they put the system prompt in a .py file alongside the application code.

Coupling prompt changes to the deploy pipeline is a constraint you imposed on yourself. There is no law of distributed systems that says the model's behavior contract has to ship inside the same artifact as the orchestration code. The runtime prompt hot-reload pattern severs that coupling by treating prompts the way you already treat feature flags, routing rules, and pricing tables — as configuration pulled from a versioned store at request time, with a short-lived local cache and well-defined safety primitives around it. The payoff is incident-response measured in seconds rather than build minutes, and the cost is an honest accounting of a third deployment surface your release process probably ignores.
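
A minimal sketch of that pattern, assuming a generic versioned config store (the `config_store.get` call is a stand-in for whatever you already run), plus a short TTL cache and a last-known-good fallback so a bad fetch cannot take every request down with it:

```python
# Sketch: prompts as runtime configuration with a short-lived local cache.
# `config_store` is a hypothetical client for your versioned config service.
import time

_TTL_SECONDS = 30
_cache: dict[str, tuple[float, str]] = {}   # key -> (fetched_at, prompt_text)
_last_known_good: dict[str, str] = {}       # fallback if the store is unreachable


def get_prompt(key: str) -> str:
    now = time.monotonic()
    cached = _cache.get(key)
    if cached and now - cached[0] < _TTL_SECONDS:
        return cached[1]
    try:
        text = config_store.get(key)         # e.g. "support.system_prompt@v7"
        _cache[key] = (now, text)
        _last_known_good[key] = text
        return text
    except Exception:
        # Fail to the last prompt that worked rather than failing the request.
        if key in _last_known_good:
            return _last_known_good[key]
        raise
```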

Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

· 12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.
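
Even the easy half of those controls, a routing alias that keeps the previous model provisioned so the traffic flip itself is a config change, has to exist before the incident. The sketch below shows only that switch; the model identifiers and config shape are assumptions, and the harder state (prompt versions, caches, partially migrated conversations) is what the rest of the post covers.

```python
# Sketch: model selection behind a runtime alias, so the traffic flip is a
# config change rather than a redeploy. Names and identifiers are illustrative.
ROUTES = {
    "support-assistant": {
        "current": "gpt-4.1-2025-04-14",
        "previous": "gpt-4o-2024-08-06",   # kept provisioned and warm on purpose
        "rollback": False,                  # flipped by the on-call, not by a build
    },
}


def resolve_model(alias: str) -> str:
    route = ROUTES[alias]
    return route["previous"] if route["rollback"] else route["current"]

# The flip reverses traffic; it does not reverse prompt versions, cached
# responses, or partially migrated conversations. That is where the seven hours go.
```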

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

· 9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

· 11 min read
Tian Pan
Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.
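
The missing primitive is small: a per-request dispatch stack the runtime can consult before handing control back to a tool. A sketch of such a reentrancy guard; the depth limit and the dispatch signature are assumptions, not any framework's API.

```python
# Sketch: a per-request call stack for tool dispatch, so the A -> B -> A cycle
# is rejected at dispatch time instead of at the recursion limit.
import contextvars

_tool_stack: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
    "tool_stack", default=()
)

MAX_TOOL_DEPTH = 4  # illustrative


class ReentrancyError(Exception):
    pass


def dispatch(tool_name: str, handler, *args, **kwargs):
    stack = _tool_stack.get()
    if tool_name in stack:
        raise ReentrancyError(f"cycle detected: {' -> '.join(stack + (tool_name,))}")
    if len(stack) >= MAX_TOOL_DEPTH:
        raise ReentrancyError(f"tool nesting too deep: {' -> '.join(stack)}")
    token = _tool_stack.set(stack + (tool_name,))
    try:
        # The handler may call the model, which may dispatch again; the stack
        # survives that re-entry because it lives in the request context.
        return handler(*args, **kwargs)
    finally:
        _tool_stack.reset(token)
```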

The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product

· 11 min read
Tian Pan
Software Engineer

An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.

This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

Prompt Asset Depreciation: The Maintenance Schedule Your AI Team Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Engineering leaders are comfortable with the idea that code rots. Dependencies need updating, infrastructure has lifecycle management, certificates expire on a calendar nobody disputes. Yet the prompt repository gets treated as a write-once-read-many artifact — even though it defines how your product talks to a probabilistic engine that ships behavior changes every six weeks.

The system prompt tuned six months ago against the model that was current then is still in production. The few-shot examples chosen against a tokenizer that has since changed are still being injected on every call. The reranker prompt was tuned against an embedding endpoint the vendor deprecated last quarter. Nobody scheduled a review. Nobody is going to.

This is not a hypothetical failure mode. When one team migrated their prompt suite — meticulously stabilized against GPT-4-32k — to GPT-4.1 and GPT-4.5-preview, only 95.1% and 97.3% of their regression tests passed. A 3-5% silent quality regression is not a rounding error in production; at any non-trivial scale it is a customer-visible degradation that nobody on the team intentionally shipped. And those are the teams that even had a regression test suite. The median team's "regression test" is whatever vibes the on-call engineer formed during the last incident.

The category we are missing is prompt asset depreciation: a maintenance discipline that treats every production prompt as a depreciating asset with a known lifespan, not a constant.
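
A lightweight starting point is a manifest entry per production prompt that records what it was tuned against and when it must next be reviewed, so the missing maintenance schedule becomes a failing check. A sketch; the field names and the 90-day cadence are assumptions.

```python
# Sketch: every production prompt carries the context it was tuned against and
# a review cadence, so "nobody scheduled a review" becomes a failing CI check.
from datetime import date, timedelta

PROMPT_MANIFEST = [
    {
        "key": "support.system_prompt",
        "tuned_against": "gpt-4-32k-0613",   # the model current at tuning time
        "tuned_on": date(2024, 11, 4),
        "review_every_days": 90,
        "owner": "support-ml",
    },
]


def overdue(entry: dict, today: date | None = None) -> bool:
    today = today or date.today()
    return today - entry["tuned_on"] > timedelta(days=entry["review_every_days"])


stale = [e["key"] for e in PROMPT_MANIFEST if overdue(e)]
if stale:
    raise SystemExit(f"prompts past their review date: {stale}")
```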

The Shared-Prompt Flag Day: When One Edit Becomes Thirty Teams' Regression

· 10 min read
Tian Pan
Software Engineer

The first edit to a shared system prompt feels like good engineering. Three teams all paste the same eighteen-line safety preamble at the top of their agents, someone notices, and an internal platform team says the obvious thing: let's centralize it. A prompts.common.safety_preamble@v1 lands in a registry. Thirty teams adopt it within a quarter because it's the path of least resistance — and because security is happy that one team owns the wording. For two quarters, this looks like a clean DRY win.

Then the security team needs a small wording change. Maybe a new compliance regulation tightens what an assistant is allowed to volunteer about a user's account. Maybe a red-team finding requires a one-sentence addition to the refusal clause. The platform team makes the edit, ships v2, and within a day the support queue fills with messages from consumer teams: our eval dropped, our format broke, our tool-call rate halved, our tone changed, our latency went up because the model started reasoning more. Each team wants the edit reverted. The security team needs it shipped. Nobody can roll forward without a re-eval, and nobody owns the re-eval. Welcome to the shared-prompt flag day.
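
One mitigation is to make adoption of a new shared version an explicit per-team act instead of an implicit flag day: consumers pin a version, and the registry can tell security exactly who still owes a re-eval. A sketch with hypothetical names and elided wording:

```python
# Sketch: consumers pin an explicit version of the shared preamble, so a v2
# rollout becomes a tracked migration rather than a flag day.
SAFETY_PREAMBLE = {
    "v1": "You must never volunteer account details ...",                           # elided
    "v2": "You must never volunteer account details ... (tightened refusal clause)",  # elided
}

CONSUMER_PINS = {
    "billing-assistant": "v1",
    "orders-agent": "v2",       # has re-evaluated and opted in
    "search-copilot": "v1",
}


def preamble_for(team: str) -> str:
    return SAFETY_PREAMBLE[CONSUMER_PINS[team]]


def teams_pending_migration(target: str = "v2") -> list[str]:
    """What the platform team reports to security: who still needs a re-eval."""
    return [team for team, pin in CONSUMER_PINS.items() if pin != target]
```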