
5 posts tagged with "feature-flags"


Your AI Feature Ramp Is Rolling Out on the Wrong Axis

11 min read
Tian Pan
Software Engineer

A team I talked to last month ramped a new agentic feature from 1% to 50% of users over four weeks. Aggregate quality metrics held within noise. Latency stayed within SLA. They were preparing the 100% memo when the support queue caught fire — a customer with a six-tool research workflow had been getting silently corrupted outputs since the 10% step. The hard queries had been there the whole time, evenly sprinkled across every cohort, averaging into the noise floor. Nobody saw them until a single high-volume user happened to hit them at scale.

This is not a monitoring failure. It is a ramp-axis failure. Feature flag tooling — the entire LaunchDarkly / Flagsmith / Unleash / Cloudflare-Flagship category — assumes blast radius scales with the number of humans exposed. For deterministic software that is mostly true: a NullPointerException hits everyone or nobody, and showing it to 1% of users limits the user-visible blast to 1%. For AI features, blast radius does not scale on the human axis. It scales on the input axis. And the input axis is where almost no one is ramping.
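Here is a minimal sketch of what ramping on the input axis might look like, as opposed to hashing user IDs into percentage buckets. Everything in it is illustrative: `ToolPlan`, the complexity tiers, and `RAMP_STAGE` are assumptions made for the example, not part of any flag SDK.

```python
# Illustrative sketch only: ramping the new agentic path by input
# complexity instead of by percentage of users. ToolPlan and RAMP_STAGE
# are hypothetical names invented for this example.
from dataclasses import dataclass


@dataclass
class ToolPlan:
    tool_count: int     # how many tools the agent expects to chain
    has_file_ops: bool  # whether the workflow touches user files


# Each ramp stage admits a harder slice of the input space.
# Stage 0: single-tool, read-only queries. Stage 3: everything.
RAMP_STAGE = 1


def input_complexity(plan: ToolPlan) -> int:
    if plan.tool_count <= 1 and not plan.has_file_ops:
        return 0
    if plan.tool_count <= 3:
        return 1
    if not plan.has_file_ops:
        return 2
    return 3


def use_new_agentic_path(plan: ToolPlan) -> bool:
    # The six-tool research workflow from the anecdote stays on the old
    # path until the ramp has explicitly admitted inputs that hard,
    # regardless of which user sent it.
    return input_complexity(plan) <= RAMP_STAGE
```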

You Accidentally Built a Feature-Flag System for Prompts — Without the Governance

10 min read
Tian Pan
Software Engineer

Pull up the config repo your team uses to ship prompt changes. Look at the last thirty commits. How many had a code review? How many had an eval gate in CI? How many can you attribute — with certainty — to a measurable change in production behavior for the users who saw them? If your answer is "most," you are an outlier. For everyone else, those commits are running in production right now, and the system reading them is doing exactly what a feature-flag service does: hot-reload a value, fan it out to users, change product behavior. The difference is that your feature-flag service has audit logs, exposure tracking, kill switches, and per-cohort targeting. Your prompt deploy pipeline has git push.

This is not a metaphor. It is an accurate description of the production system your team is running. The prompt config repo, the S3 bucket your workers poll, the "prompts" collection in your database, the LangSmith/PromptLayer/Braintrust asset that your app fetches on boot — these are all feature-flag services. They have the same runtime shape: a value lives outside the binary, the binary reads it on a hot path, changing the value changes behavior for real users without a deploy. The only thing missing is every control your SRE team demanded before they would approve the actual feature-flag service.
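As a concrete illustration of that runtime shape, here is a minimal sketch; the URL, TTL, and JSON field are assumptions for the example, not a recommended setup.

```python
# Illustrative sketch of the runtime shape described above. The URL, TTL,
# and field name are assumptions for the example, not a recommended setup.
import json
import time
import urllib.request

PROMPT_URL = "https://config.example.com/prompts/support-agent.json"  # hypothetical
_cache: dict = {"value": None, "fetched_at": 0.0}


def current_prompt(ttl_seconds: float = 30.0) -> str:
    # Hot-reload: any write to the config endpoint reaches production
    # within one TTL. Nothing in this path asks for review, an eval
    # gate, or an audit trail.
    now = time.time()
    if _cache["value"] is None or now - _cache["fetched_at"] > ttl_seconds:
        with urllib.request.urlopen(PROMPT_URL) as resp:
            _cache["value"] = json.load(resp)["system_prompt"]
        _cache["fetched_at"] = now
    return _cache["value"]


def handle_request(user_message: str) -> list[dict]:
    # Behaviorally identical to evaluating a feature flag per request:
    # the value lives outside the binary and changes behavior instantly.
    return [
        {"role": "system", "content": current_prompt()},
        {"role": "user", "content": user_message},
    ]
```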

Why AI Feature Flags Are Not Regular Feature Flags

11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.
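To make "passes superficial validation" concrete, here is a hypothetical response check of the kind many pipelines run; a degraded model passes every line of it. The field names and limits are stand-ins.

```python
# Hypothetical sketch of "passes superficial validation": structural checks
# of the kind response pipelines typically run. Field names are illustrative.
import json


def superficially_valid(raw: str) -> bool:
    # Parseable, non-empty, within budget. That is all this can see.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    answer = payload.get("answer", "")
    return isinstance(answer, str) and bool(answer.strip()) and len(answer) < 4000


# A response that hallucinates a citation, or has drifted to a terse
# "I can't answer that", still returns True here. Correctness and
# helpfulness are simply not visible to this function.
```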

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.
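That self-grading loop, sketched in rough form; the thresholds, observation window, and metrics call are placeholders, not a recommendation.

```python
# Rough sketch of the self-grading canary loop described above. Thresholds,
# window, and the metrics source are placeholders, not a recommendation.
import time

CANARY_TRAFFIC_PERCENT = 1       # share of traffic routed to the new version
OBSERVATION_WINDOW_S = 30 * 60   # how long to watch before deciding


def read_canary_metrics() -> tuple[float, float]:
    # Stand-in for querying a metrics backend: (error_rate, p99_latency_ms).
    return 0.002, 310.0


def canary_decision() -> str:
    deadline = time.time() + OBSERVATION_WINDOW_S
    while time.time() < deadline:
        error_rate, p99_ms = read_canary_metrics()
        if error_rate > 0.01 or p99_ms > 800:
            return "rollback"  # a bad deploy announces itself
        time.sleep(60)
    return "proceed"  # thirty quiet minutes is treated as proof of health
```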

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.

Feature Flags for AI: Progressive Delivery of LLM-Powered Features

10 min read
Tian Pan
Software Engineer

Most teams discover the hard way that rolling out a new LLM feature is nothing like rolling out a new UI button. A prompt change that looked great in offline evaluation ships to production and silently degrades quality for 30% of users — but your dashboards show HTTP 200s the whole time. By the time you notice, thousands of users have had bad experiences and you have no fast path back to the working state.

The same progressive delivery toolkit that prevents traditional software failures — feature flags, canary releases, A/B testing — applies directly to LLM-powered features. But the mechanics are different enough that copy-pasting your existing deployment playbook will get you into trouble. Non-determinism, semantic quality metrics, and the multi-layer nature of LLM changes (model, prompt, parameters, retrieval strategy) each create wrinkles that teams routinely underestimate.
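One way to picture that multi-layer surface is as a single flag variant that pins every layer at once, so a rollback reverts them as one unit. The sketch below is illustrative; the model name, prompt versions, and parameters are stand-ins.

```python
# Illustrative only: a flag variant that pins every layer of an LLM change
# at once. Model names, prompt versions, and parameters are stand-ins.
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMVariant:
    model: str
    prompt_version: str
    temperature: float
    retrieval: str  # e.g. "bm25", "hybrid", "none"


CONTROL = LLMVariant(
    model="model-2024-08",
    prompt_version="support-v12",
    temperature=0.2,
    retrieval="bm25",
)

TREATMENT = LLMVariant(
    model="model-2024-08",
    prompt_version="support-v13",  # the intended change
    temperature=0.2,
    retrieval="hybrid",            # a second layer that changed with it
)


def variant_for(flag_on: bool) -> LLMVariant:
    # Flipping the flag moves every layer together, and rolling back
    # reverts them as one unit instead of layer by layer.
    return TREATMENT if flag_on else CONTROL
```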