
5 posts tagged with "feature-flags"


Your AI Feature Ramp Is Rolling Out on the Wrong Axis

11 min read
Tian Pan
Software Engineer

A team I talked to last month ramped a new agentic feature from 1% to 50% of users over four weeks. Aggregate quality metrics held within noise. Latency stayed within SLA. They were preparing the 100% memo when the support queue caught fire — a customer with a six-tool research workflow had been getting silently corrupted outputs since the 10% step. The hard queries had been there the whole time, evenly sprinkled across every cohort, averaging into the noise floor. Nobody saw them until a single high-volume user happened to hit them at scale.

This is not a monitoring failure. It is a ramp-axis failure. Feature flag tooling — the entire LaunchDarkly / Flagsmith / Unleash / Cloudflare-Flagship category — assumes blast radius scales with the number of humans exposed. For deterministic software that is mostly true: a NullPointerException hits everyone or nobody, and showing it to 1% of users limits the user-visible blast to 1%. For AI features, blast radius does not scale on the human axis. It scales on the input axis. And the input axis is where almost no one is ramping.
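Here is a minimal sketch of what ramping on the input axis might look like, as opposed to hashing user IDs into percentage buckets. Everything in it is illustrative: `ToolPlan`, the complexity tiers, and `RAMP_STAGE` are assumptions made for the example, not part of any flag SDK.

```python
# Illustrative sketch only: ramping the new agentic path by input
# complexity instead of by percentage of users. ToolPlan and RAMP_STAGE
# are hypothetical names invented for this example.
from dataclasses import dataclass


@dataclass
class ToolPlan:
    tool_count: int     # how many tools the agent expects to chain
    has_file_ops: bool  # whether the workflow touches user files


# Each ramp stage admits a harder slice of the input space.
# Stage 0: single-tool, read-only queries. Stage 3: everything.
RAMP_STAGE = 1


def input_complexity(plan: ToolPlan) -> int:
    if plan.tool_count <= 1 and not plan.has_file_ops:
        return 0
    if plan.tool_count <= 3:
        return 1
    if not plan.has_file_ops:
        return 2
    return 3


def use_new_agentic_path(plan: ToolPlan) -> bool:
    # The six-tool research workflow from the anecdote stays on the old
    # path until the ramp has explicitly admitted inputs that hard,
    # regardless of which user sent it.
    return input_complexity(plan) <= RAMP_STAGE
```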

You Accidentally Built a Feature-Flag System for Prompts — Without the Governance

10 min read
Tian Pan
Software Engineer

Pull up the config repo your team uses to ship prompt changes. Look at the last thirty commits. How many had a code review? How many had an eval gate in CI? How many can you attribute — with certainty — to a measurable change in production behavior for the users who saw them? If your answer is "most," you are an outlier. For everyone else, those commits are running in production right now, and the system reading them is doing exactly what a feature-flag service does: hot-reload a value, fan it out to users, change product behavior. The difference is that your feature-flag service has audit logs, exposure tracking, kill switches, and per-cohort targeting. Your prompt deploy pipeline has git push.

This is not a metaphor. It is an accurate description of the production system your team is running. The prompt config repo, the S3 bucket your workers poll, the "prompts" collection in your database, the LangSmith/PromptLayer/Braintrust asset that your app fetches on boot — these are all feature-flag services. They have the same runtime shape: a value lives outside the binary, the binary reads it on a hot path, changing the value changes behavior for real users without a deploy. The only thing missing is every control your SRE team demanded before they would approve the actual feature-flag service.
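As a concrete illustration of that runtime shape, here is a minimal sketch; the URL, TTL, and JSON field are assumptions for the example, not a recommended setup.

```python
# Illustrative sketch of the runtime shape described above. The URL, TTL,
# and field name are assumptions for the example, not a recommended setup.
import json
import time
import urllib.request

PROMPT_URL = "https://config.example.com/prompts/support-agent.json"  # hypothetical
_cache: dict = {"value": None, "fetched_at": 0.0}


def current_prompt(ttl_seconds: float = 30.0) -> str:
    # Hot-reload: any write to the config endpoint reaches production
    # within one TTL. Nothing in this path asks for review, an eval
    # gate, or an audit trail.
    now = time.time()
    if _cache["value"] is None or now - _cache["fetched_at"] > ttl_seconds:
        with urllib.request.urlopen(PROMPT_URL) as resp:
            _cache["value"] = json.load(resp)["system_prompt"]
        _cache["fetched_at"] = now
    return _cache["value"]


def handle_request(user_message: str) -> list[dict]:
    # Behaviorally identical to evaluating a feature flag per request:
    # the value lives outside the binary and changes behavior instantly.
    return [
        {"role": "system", "content": current_prompt()},
        {"role": "user", "content": user_message},
    ]
```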

Why AI Feature Flags Are Not Regular Feature Flags

11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.
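To make "passes superficial validation" concrete, here is a hypothetical response check of the kind many pipelines run; a degraded model passes every line of it. The field names and limits are stand-ins.

```python
# Hypothetical sketch of "passes superficial validation": structural checks
# of the kind response pipelines typically run. Field names are illustrative.
import json


def superficially_valid(raw: str) -> bool:
    # Parseable, non-empty, within budget. That is all this can see.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    answer = payload.get("answer", "")
    return isinstance(answer, str) and bool(answer.strip()) and len(answer) < 4000


# A response that hallucinates a citation, or has drifted to a terse
# "I can't answer that", still returns True here. Correctness and
# helpfulness are simply not visible to this function.
```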

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.
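That self-grading loop, sketched in rough form; the thresholds, observation window, and metrics call are placeholders, not a recommendation.

```python
# Rough sketch of the self-grading canary loop described above. Thresholds,
# window, and the metrics source are placeholders, not a recommendation.
import time

CANARY_TRAFFIC_PERCENT = 1       # share of traffic routed to the new version
OBSERVATION_WINDOW_S = 30 * 60   # how long to watch before deciding


def read_canary_metrics() -> tuple[float, float]:
    # Stand-in for querying a metrics backend: (error_rate, p99_latency_ms).
    return 0.002, 310.0


def canary_decision() -> str:
    deadline = time.time() + OBSERVATION_WINDOW_S
    while time.time() < deadline:
        error_rate, p99_ms = read_canary_metrics()
        if error_rate > 0.01 or p99_ms > 800:
            return "rollback"  # a bad deploy announces itself
        time.sleep(60)
    return "proceed"  # thirty quiet minutes is treated as proof of health
```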

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.

Feature Flags for AI: Progressive Delivery of LLM-Powered Features

10 min read
Tian Pan
Software Engineer

Most teams discover the hard way that rolling out a new LLM feature is nothing like rolling out a new UI button. A prompt change that looked great in offline evaluation ships to production and silently degrades quality for 30% of users — but your dashboards show HTTP 200s the whole time. By the time you notice, thousands of users have had bad experiences and you have no fast path back to the working state.

The same progressive delivery toolkit that prevents traditional software failures — feature flags, canary releases, A/B testing — applies directly to LLM-powered features. But the mechanics are different enough that copy-pasting your existing deployment playbook will get you into trouble. Non-determinism, semantic quality metrics, and the multi-layer nature of LLM changes (model, prompt, parameters, retrieval strategy) each create wrinkles that teams routinely underestimate.
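One way to picture that multi-layer surface is as a single flag variant that pins every layer at once, so a rollback reverts them as one unit. The sketch below is illustrative; the model name, prompt versions, and parameters are stand-ins.

```python
# Illustrative only: a flag variant that pins every layer of an LLM change
# at once. Model names, prompt versions, and parameters are stand-ins.
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMVariant:
    model: str
    prompt_version: str
    temperature: float
    retrieval: str  # e.g. "bm25", "hybrid", "none"


CONTROL = LLMVariant(
    model="model-2024-08",
    prompt_version="support-v12",
    temperature=0.2,
    retrieval="bm25",
)

TREATMENT = LLMVariant(
    model="model-2024-08",
    prompt_version="support-v13",  # the intended change
    temperature=0.2,
    retrieval="hybrid",            # a second layer that changed with it
)


def variant_for(flag_on: bool) -> LLMVariant:
    # Flipping the flag moves every layer together, and rolling back
    # reverts them as one unit instead of layer by layer.
    return TREATMENT if flag_on else CONTROL
```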