What 'Done' Means for AI-Powered Features: Engineering the Perpetual Beta

10 min read
Tian Pan
Software Engineer

Shipping a feature in traditional software ends with a merge. The unit tests pass. The integration tests pass. QA signs off. You flip the flag, and unless a bug surfaces in production, you move on. The feature is done. For AI-powered features, that moment doesn't exist — and if you're pretending it does, you're accumulating a stability debt that will eventually show up as a user trust problem.

The reason is straightforward but rarely designed around: deterministic software produces the same output from the same input every time. AI features do not. Not because of a bug, but because the behavior is defined by a model that lives outside your codebase, trained on data that reflects a world that keeps changing, consumed by users whose expectations evolve as they see what's possible.

This isn't a reason to panic or to avoid shipping AI features. It's a reason to rethink what "done" means — and to build the organizational and technical infrastructure that makes "stable but evolving" feel like quality rather than incompleteness.

The Three Forces That Make AI Features Drift

AI feature behavior degrades along three independent axes, and understanding each separately matters because they require different mitigations.

Model drift happens when the underlying model changes. Your provider — whether OpenAI, Anthropic, Google, or anyone else — will periodically update their models. Sometimes these are announced with detailed release notes. Often the behavioral changes are subtle: different response lengths, slightly different formatting, shifted tone, different thresholds for refusing requests. If you're not pinning model versions, your feature's behavior changes the moment your provider ships an update. If you are pinning versions, you face a different problem: eventually the version gets deprecated and you have to migrate, discovering the behavioral delta at the worst possible time.

World drift happens when the world the model was trained on diverges from the world your users live in. A credit risk model trained in 2021 may have had 95% accuracy then and 87% accuracy by late 2024 — not because the model changed, but because the economic context, consumer behavior, and underlying risk patterns all shifted. For knowledge-dependent AI features (anything that relies on facts about current events, company information, product details, regulations), this drift is continuous and invisible without monitoring.

Expectation drift happens when your users' mental model of what AI can do evolves faster than your feature does. The first time a user sees a reasonable AI writing assistant, they're delighted by anything coherent. Six months later, after using competitors and newer tools, they're frustrated by the same quality that impressed them initially. This isn't irrational — it's how users form expectations in any rapidly improving product category. But it means a feature that shipped at the 80th percentile of the market can quietly fall to the 50th without a single line of code changing.

Evals Are Specifications, Not Tests

The most fundamental shift in building AI features that stay "done" is treating evals as specifications, not tests. In deterministic software, tests verify that code matches spec. In AI systems, evals are the spec.

This reframing matters because it changes when you write them and what you do with them. Tests get written after code exists, to verify behavior. Evals need to exist before a feature ships, because they're the only way to state what the feature is supposed to do in machine-verifiable terms.

A robust eval suite for an AI feature includes:

  • Golden inputs and expected outputs: A curated dataset of 200–500 representative prompts with ground-truth responses or labeled quality ratings. Production-grade systems eventually track thousands of these, adding new cases from every production failure.
  • Behavioral invariants: Properties that must hold regardless of model version — for example, "responses must not include PII from other users," "summaries must not exceed 200 tokens," "structured output must parse as valid JSON."
  • Regression baselines: A snapshot of output quality at each pinned model version, so that when you upgrade, you can measure the behavioral delta rather than discovering it through user complaints.
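
Behavioral invariants are the easiest piece to automate. Here is a minimal sketch of an invariant checker for a hypothetical summarizer feature; the function names, the JSON output shape, and the whitespace tokenizer are all illustrative assumptions, not anything from a specific product.

```python
import json

MAX_SUMMARY_TOKENS = 200  # illustrative budget from the invariant example above

def count_tokens(text: str) -> int:
    # Crude whitespace split as a stand-in for the model's real tokenizer.
    return len(text.split())

def check_invariants(output: str) -> list[str]:
    """Return a list of violated invariants; an empty list means all hold."""
    # Invariant: structured output must parse as valid JSON.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    violations = []
    # Invariant: summaries must not exceed the token budget.
    summary = parsed.get("summary", "") if isinstance(parsed, dict) else ""
    if count_tokens(summary) > MAX_SUMMARY_TOKENS:
        violations.append("summary exceeds 200 tokens")
    return violations

good = json.dumps({"summary": "Quarterly revenue grew 12%."})
assert check_invariants(good) == []
assert check_invariants("not json") == ["output is not valid JSON"]
```

Checks like these run on every eval case and every model version, which is what makes them invariants rather than golden outputs: they must hold even when the wording of the response changes.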

When a model provider notifies you of an upcoming version change, this eval suite becomes your migration checklist. You run it against the new model, diff the results, and decide whether the changes are improvements, neutral, or regressions that need prompt adjustment before you flip the version.
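
The "diff the results" step can be as simple as classifying each golden case by its score delta. The sketch below assumes per-case scores in a dict keyed by case ID; the scoring scale, tolerance, and `diff_eval_runs` helper are hypothetical.

```python
def diff_eval_runs(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.02) -> dict[str, list[str]]:
    """Classify each eval case as improved, neutral, or regressed
    between a pinned model version and a candidate version."""
    delta = {"improved": [], "neutral": [], "regressed": []}
    for case_id, base_score in baseline.items():
        cand_score = candidate.get(case_id, 0.0)  # missing case counts as a failure
        if cand_score > base_score + tolerance:
            delta["improved"].append(case_id)
        elif cand_score < base_score - tolerance:
            delta["regressed"].append(case_id)
        else:
            delta["neutral"].append(case_id)
    return delta

baseline = {"golden-001": 0.90, "golden-002": 0.80, "golden-003": 0.85}
candidate = {"golden-001": 0.95, "golden-002": 0.79, "golden-003": 0.70}
d = diff_eval_runs(baseline, candidate)
assert d["improved"] == ["golden-001"]
assert d["regressed"] == ["golden-003"]
```

The "regressed" bucket is the migration checklist: each entry either gets a prompt adjustment or an explicit sign-off that the change is acceptable before the version flips.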

Without this infrastructure, model upgrades are events you discover in your support queue.

Declaring 1.0 Without Lying

The instinct of every product team under pressure is to declare "General Availability" and move on. For AI features, this instinct is partially right and partially dangerous.

You should declare 1.0. The alternative — indefinite "beta" labels — trains users to expect instability and creates a convenient excuse for not prioritizing quality. Users make different commitments to beta software than to production software, and if you want them to integrate your AI feature into their core workflows, you need to give them a signal that it's ready.

But what "ready" means needs to be communicated honestly. GitHub's pattern here is instructive. When GitHub Copilot moved from technical preview to general availability, the announcement was explicit that the feature would continue evolving — new model integrations, expanded capabilities, changed behaviors. The GA declaration meant "production-stable infrastructure, committed to not breaking the core experience, will give advance notice of major behavioral changes." It didn't mean "frozen."

This is the distinction worth internalizing: stability is a commitment about the interface, not the behavior. What you're guaranteeing when you declare 1.0 on an AI feature:

  • The API contract won't change without a versioned migration path.
  • You'll communicate model version changes before they affect production.
  • You'll maintain the eval suite and monitor for quality regressions.
  • You'll provide a mechanism to report behavioral issues and a timeline to respond.

What you're not guaranteeing: that the output will be identical tomorrow to what it was today.

Version Your Behavioral Expectations, Not Just Your API

Most teams version their API endpoints. Fewer version their behavioral expectations. This gap is where AI feature trust erodes quietly.

Behavioral versioning means treating your eval golden set as a versioned artifact with the same discipline you apply to your API schema. When you change the system prompt — even a small tweak — you create a new behavioral version. You run the full eval suite against it, record the results, and commit the diff alongside the prompt change. This creates an auditable history of how your feature's behavior has changed, which matters when a user asks why the output is different than it was six months ago.

The tooling here is still maturing, but the pattern is stable: eval snapshots in version control, model version pinning in configuration, semantic versioning for behavioral contracts. When you're ready to upgrade your model version, you run the new model against historical eval snapshots and characterize the delta before any user sees it.
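
One concrete way to make prompt changes show up as behavioral versions is to fingerprint the system prompt and store it in a versioned contract file. The field names, the placeholder model ID, and the `fingerprint_prompt` helper below are illustrative assumptions about what such an artifact might look like.

```python
import hashlib

# Hypothetical behavioral contract, committed alongside the code.
BEHAVIORAL_CONTRACT = {
    "contract_version": "2.3.0",        # bump on any behavior-affecting change
    "model": "provider-model-2024-06",  # pinned model version (placeholder ID)
    "eval_snapshot": "evals/v2.3.0.json",  # results recorded at this version
}

def fingerprint_prompt(prompt: str) -> str:
    """Hash the system prompt so any edit surfaces as a contract change."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

prompt_v1 = "You are a concise summarizer."
prompt_v2 = "You are a concise summarizer. Prefer bullet points."
# Even a small prompt tweak produces a new fingerprint, forcing a new
# behavioral version and a fresh eval run before it ships.
assert fingerprint_prompt(prompt_v1) != fingerprint_prompt(prompt_v2)
```

A CI check that compares the committed fingerprint against the deployed prompt catches the common failure mode of "someone tweaked the prompt in production without rerunning evals."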

Some teams are also experimenting with what might be called "behavioral SLOs" — explicit, machine-verifiable statements about output properties that are monitored in production, not just in CI. "Response latency under 2 seconds at p95" is a familiar operational SLO. "Structured output validates against schema in 99.9% of requests" is a behavioral SLO. "Summary quality score above 0.75 on held-out eval set" is a behavioral SLO. Treating these the same as infrastructure SLOs forces the question: what are you actually promising, and are you measuring it?
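
The behavioral SLOs above can be checked with a few lines over a window of production requests. This is a minimal, dependency-free sketch; the nearest-rank percentile, the request records, and the SLO names are assumptions for illustration.

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a non-empty list (simple, dependency-free)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p * (len(ordered) - 1))))
    return ordered[int(k)]

def check_slos(latencies_s: list[float], schema_valid: list[bool]) -> dict[str, bool]:
    """Evaluate the two example SLOs from the text over one window."""
    return {
        "latency_p95_under_2s": percentile(latencies_s, 0.95) < 2.0,
        "schema_valid_ratio_999": sum(schema_valid) / len(schema_valid) >= 0.999,
    }

latencies = [0.4, 0.6, 0.8, 1.1, 1.9] * 20   # 100 illustrative requests
valid = [True] * 100
result = check_slos(latencies, valid)
assert result["latency_p95_under_2s"] is True
assert result["schema_valid_ratio_999"] is True
```

The third example SLO, "summary quality score above 0.75 on held-out eval set," plugs into the same shape: a scalar computed per window, compared against a committed threshold, alerting when it dips.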

The Governance Question Nobody Is Answering

Here's the conversation most engineering teams are avoiding: who decides when an AI feature has regressed enough to roll back?

In traditional software, rollback decisions are usually clear — a bug causes incorrect behavior, a rollback fixes it. In AI systems, the boundary between "behavioral change" and "regression" is fuzzy. The model is more verbose now. Is that a regression? The model is refusing more requests. Is that a safety improvement or a quality degradation? The response format shifted subtly. Users haven't complained yet, but would they if they noticed?

Without explicit governance, these questions get resolved by whoever happens to be looking at the dashboard when the issue surfaces. That's not a process — it's luck.

A minimal governance model for AI features includes:

  • A designated owner for the behavioral contract — the person who can approve prompt changes, model version upgrades, and eval threshold adjustments.
  • A change review process for the system prompt and model configuration, separate from code review, that forces explicit acknowledgment that behavior may change.
  • A behavioral incident process — what happens when an eval score drops below threshold, who gets paged, what the rollback options are, and what the communication template looks like for users who noticed the regression.
  • A scheduled review cadence — at least quarterly, explicitly asking: does our feature's behavior still match what we promised in the product description?
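
The behavioral incident process in the third bullet reduces, at its trigger point, to a threshold check over the latest eval scores. This sketch is a hypothetical illustration; the threshold value and the paging hook it implies are placeholders a team would set for itself.

```python
INCIDENT_THRESHOLD = 0.75  # illustrative floor agreed by the contract owner

def evaluate_release(eval_scores: dict[str, float]) -> list[str]:
    """Return the eval suites whose score fell below the incident threshold.
    A non-empty result is what pages the behavioral-contract owner and
    starts the documented rollback decision process."""
    return [suite for suite, score in eval_scores.items()
            if score < INCIDENT_THRESHOLD]

scores = {"summaries": 0.82, "extraction": 0.71}
assert evaluate_release(scores) == ["extraction"]
```

The point is not the code, which is trivial, but the fact that the threshold, the owner, and the response are written down before the score drops.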

This sounds bureaucratic until the first time your AI feature starts behaving strangely after a provider update and you have no documented process for deciding what to do.

Communicating "This Feature Is Evolving" Without Eroding Trust

The risk of being honest about AI feature evolution is that users hear "unstable" and make less investment. The risk of not being honest is that behavioral changes feel like regressions, and users make less investment anyway — but with resentment attached.

The framing that works is one you've already seen from the best-run AI products: evolution as improvement, not instability.

OpenAI's release notes for model updates explicitly call out behavioral changes. Anthropic's release notes describe training changes with enough technical detail that sophisticated users understand what shifted and why. Both treat their users as technical partners who benefit from knowing what changed, rather than customers who need to be protected from complexity.

For product-facing AI features, this translates to:

  • Release notes for behavioral changes, not just for new capabilities. If you adjusted the system prompt and output length changed, that's worth a changelog entry.
  • Version indicators that give users something to reference when reporting issues ("I'm on version 2.3 of the summarizer and it started doing X").
  • Feedback mechanisms that are specific enough to be actionable — not "was this response helpful?" but "did this response answer your question?" and "did the format match what you expected?"

The users who will notice behavioral drift are your power users — the ones who've integrated your AI feature most deeply into their workflows. These are exactly the users whose trust you can least afford to lose. Treating them as partners in the feature's evolution, rather than as customers who need to be shielded from it, is both the honest approach and the strategically correct one.

What "Done" Actually Means

An AI feature is done when you have:

  1. A golden eval set that specifies the feature's intended behavior in machine-verifiable terms.
  2. A monitoring setup that runs evals in CI and produces behavioral SLO dashboards in production.
  3. A versioning policy for the system prompt, model version, and behavioral contract.
  4. A governance process for behavioral changes and a rollback plan when things go wrong.
  5. A communication infrastructure — changelog, version indicators, feedback loops — that keeps users informed as the feature evolves.

This is more than shipping a feature. It's closer to shipping a product with an ops runbook. The upside is that features built this way can actually evolve — with each model upgrade absorbed deliberately, each behavioral change understood and communicated, each regression caught before users do.

The alternative is declaring done and discovering, six months later, that the feature you shipped isn't the one running in production anymore. That's not an exotic failure mode. For AI-powered features, it's the default outcome if you don't engineer against it.
