
AI Feature Soak Windows: Why a Two-Week Canary Misses What Actually Matters

Tian Pan · Software Engineer · 13 min read

The two-week canary is one of those practices that sounds disciplined enough to excuse skipping the harder question. Engineering imported it from microservices — ramp 1% for a few days, watch error rate, ramp to 100%, declare done — and grafted it onto AI features without asking whether the failure modes that matter for AI even surface in two weeks. They don't. The bill that kills the feature lands in week six. The customer cohort that exposes the long-tail intent onboards in week five. The prompt change that scored +3% on launch day starts costing real money in week four because its chattier outputs have been compounding token spend the whole time, and nobody was watching for that because the dashboard was watching for crashes.

A canary built around p95 latency and HTTP 500s will tell you the LLM is up. It will not tell you the feature is working. AI features fail in shapes the deploy ceremony was never designed to catch — slow shape changes in user behavior, gradual cache erosion, retrieval quality collapse, refusal-rate creep, cost trajectories that bend the wrong way — and almost all of them take longer than two weeks to declare themselves. The team that ships by the microservice clock is shipping by a clock the failures don't run on.

The Failures That Show Up After Week Two

Most failure modes that destroy AI features are slow. They're not crashes; they're trends. A canary that watches request-level metrics is built for fast, sharp failures. AI failures arrive sideways.

The first slow failure is long-tail intent. The eval set you launched against was built from the intents you knew about. A few weeks in, a customer cohort onboards that pushes the feature into a domain you never tested — say, an HR chatbot that scored 99% on payroll questions starts getting hammered with questions about a newly announced equity plan and starts hallucinating vesting cliffs. The feature wasn't broken on launch. The launch eval set just didn't represent the world the feature now lives in. For high-traffic systems with more than 10,000 queries a day, the first significant drift signals tend to appear in weeks two to four. For medium-traffic systems, six to eight weeks. A two-week canary catches none of the medium-traffic ones and only the earliest signal of the high-traffic ones.

The second slow failure is cache hit rate erosion. Prompt caching makes the launch economics look fantastic — first invoice comes in under budget, finance is happy, the feature looks profitable. Then someone adds a helpful detail in the middle of the prompt to fix a quality issue, and the cache hit rate drops 40% over a week. The cost per request climbs without anyone noticing because it's still smaller than the gross spend trend, and the trend itself looks like usage growth. Three months later finance asks why LLM spend is trending the wrong direction, and the team discovers a prompt edit from week three is the cause. The signal was visible the whole time on a metric nobody had wired up: cache hit rate as a first-class quantity, not an aggregate cost.
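
A minimal sketch of the mechanism, using hypothetical templates: prefix caches key on an exact prefix match, so a line added above the stable sections of the prompt turns everything after the edit point into a miss on every subsequent request.

```python
# Why a mid-prompt edit erodes the hit rate. Both templates are hypothetical.

OLD_TEMPLATE = (
    "You are a payroll assistant. Answer concisely.\n"
    "Company policies:\n{policies}\n"
    "User question: {question}"
)

NEW_TEMPLATE = (
    "You are a payroll assistant. Answer concisely.\n"
    "Always cite the policy section you relied on.\n"  # the week-three "helpful detail"
    "Company policies:\n{policies}\n"
    "User question: {question}"
)

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix, the only span a prefix
    cache can reuse across the old and new prompt shapes."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

n = shared_prefix_len(OLD_TEMPLATE, NEW_TEMPLATE)
print(f"cacheable prefix: {n}/{len(OLD_TEMPLATE)} chars ({n / len(OLD_TEMPLATE):.0%})")
# Everything below the edit point is now a cache miss on every request,
# which is the hit-rate cliff the aggregate-cost dashboard hides.
```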

The third slow failure is token-spend compounding. A prompt change that scored +3% on the eval also added 200 tokens of output on average. On launch day, the cost delta is invisible — traffic is small, cohort is small. By week four, traffic has tripled and the per-request cost increase has compounded into a five-figure monthly delta. The kicker: the eval set wasn't measuring output length. The team optimized for a quality metric that wasn't Pareto-aware, and the cost regression hid behind the quality win for a month. Some teams now track output length distribution alongside latency and refusal rate as a default canary signal, but most still don't.
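
The compounding is easy to see in back-of-envelope form. The price and traffic figures below are illustrative assumptions, not anyone's real rates:

```python
# Sketch: how a "small" +200-token output regression compounds with
# traffic growth. Price and request volumes are assumed for illustration.

PRICE_PER_1K_OUTPUT_TOKENS = 0.03   # assumed $/1K output tokens
EXTRA_OUTPUT_TOKENS = 200           # average added by the chattier prompt

def extra_monthly_spend(requests_per_day: int) -> float:
    """Extra monthly spend attributable to the longer outputs alone."""
    per_request = (EXTRA_OUTPUT_TOKENS / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    return per_request * requests_per_day * 30

for stage, rpd in [("launch", 20_000), ("week 4", 60_000)]:  # traffic tripled
    print(f"{stage}: ~${extra_monthly_spend(rpd):,.0f}/month extra")
# launch: ~$3,600/month extra, lost in the noise of gross spend
# week 4: ~$10,800/month extra, the five-figure delta hiding behind the eval win
```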

The fourth slow failure is the cohort-time interaction. Your seasonal traffic patterns don't compress into two weeks. The B2B customer who runs end-of-quarter reports doesn't onboard until the quarter ends. The retail product that gets used during back-to-school hits a spike in late August. Any canary window narrower than the longest seasonal cycle relevant to the feature is making a bet the team usually didn't realize they were making.

Soak Window vs. Canary Window — They're Different Things

The right mental model is to separate them. The canary window is when you're protecting against acute failures — the model returns gibberish, the gateway is broken, the feature melts the latency budget. Two weeks is plenty for that. The soak window is when you're protecting against the slow failures — drift, cost compounding, cohort-time effects, cache erosion. That window needs to be measured in months, not weeks.

A reasonable shape:

  • Canary: 1% to 25% over 1–2 weeks, automated rollback wired to acute thresholds (latency p99, refusal rate, cost-per-request hard cap, parse-error rate on structured outputs; gates sketched after this list). Goal: catch the obvious.
  • Soak: 25% to 100%, then a 4–8 week observation period at full traffic before declaring rollout success. Goal: surface the slow stuff. The previous version stays warm and rollback-eligible the whole time.
  • Exit: a soak exit checklist that names the cohorts and the time-cycles that had to be observed before the team is allowed to mark the launch done.
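
What the automated-rollback wiring in the canary bullet can look like, as a sketch; the threshold values are placeholders to tune per feature, not recommendations:

```python
# Acute-failure gates for the canary phase. All thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class CanaryGates:
    latency_p99_ms: float = 4_000        # acute latency budget
    refusal_rate: float = 0.05           # fraction of requests refused
    cost_per_request_usd: float = 0.02   # hard cap per request
    parse_error_rate: float = 0.01       # strict parses on structured outputs

def should_rollback(window: dict, gates: CanaryGates = CanaryGates()) -> bool:
    """window: the canary window's observed metrics, keyed by the same
    names as the gate fields. Any single breach trips the rollback."""
    return any(window[name] > limit for name, limit in vars(gates).items())

# should_rollback({"latency_p99_ms": 5_200, "refusal_rate": 0.01,
#                  "cost_per_request_usd": 0.012, "parse_error_rate": 0.0})
# -> True: the latency breach alone trips the automated rollback.
```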

The naming matters because it changes what leadership thinks "complete" means. If a launch is "complete" when the canary closes, the team is being asked to certify a system against failures it can't see yet. Renaming the post-canary period explicitly — soak, observation, post-launch — buys the team permission to keep the rollback path warm and the eval cadence dense for as long as the failure modes need.

The Metrics a Soak Window Watches That a Canary Doesn't

A canary dashboard is good at the present tense — what is broken right now. A soak dashboard has to be good at the past-imperfect — what is becoming worse, and at what rate. The metric set is different.

Cost-per-task should be tracked, not just cost-in-aggregate. Aggregate cost grows with traffic and tells you nothing. Cost-per-task, broken out by intent class or feature surface, is what catches the chattier-prompt regression and the retry-storm regression. A weekly graph of p50 and p95 cost-per-task over the soak window shows you the trend; a monthly invoice does not.
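
A sketch of that weekly rollup, assuming request logs that record a week, an intent class, and a per-request cost (the field names are illustrative):

```python
# Weekly p50/p95 cost-per-task per (week, intent class). Record fields
# ("week", "intent", "cost_usd") are assumptions; adapt to your log schema.

from collections import defaultdict
from statistics import quantiles

def weekly_cost_per_task(records):
    """Needs at least two records per bucket for quantiles()."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["week"], r["intent"])].append(r["cost_usd"])
    rollup = {}
    for key, costs in buckets.items():
        cuts = quantiles(costs, n=20)  # 19 cut points: cuts[9] ~ p50, cuts[18] ~ p95
        rollup[key] = {"p50": cuts[9], "p95": cuts[18], "n": len(costs)}
    return rollup
```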

Cache hit rate as a top-line metric. Not a debug stat — a metric that has its own SLO and a SEV declared when it drops below threshold. The hit rate is the leading indicator that a prompt edit broke the prefix structure, and it surfaces the regression before the cost does.

Reviewer-queue depth and human-in-the-loop labor cost. If part of the AI feature's value proposition is that it reduces human review, then the reviewer queue is part of the feature's success metric. A drift that increases reviewer escalation by 8% week-over-week is invisible to the latency dashboard and devastating to the unit economics. Track it from day one of the soak.
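
Why a single-digit weekly growth rate is devastating is just arithmetic: at a steady rate, the doubling time is short. The sketch below uses the 8% figure from above:

```python
# Doubling time of reviewer load at a steady weekly growth rate.

import math

def weeks_to_double(weekly_growth: float) -> float:
    return math.log(2) / math.log(1 + weekly_growth)

print(f"{weeks_to_double(0.08):.1f} weeks")  # ~9.0 weeks to double:
# the trend is obvious by week three of a soak, and invisible to a
# two-week canary that never looked at the reviewer queue.
```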

Output-shape drift on structured outputs. JSON parse-error rate, schema-conformance rate, field-presence rate. These are silent. A model update or an upstream provider quantization can shift the distribution of valid-but-different outputs in ways that don't trip the strict-parse gates but do cause downstream consumers to behave differently. A weekly comparison against a frozen reference distribution catches it; a per-request error rate does not.
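
A sketch of that weekly comparison; the per-field presence rates and the tolerance are illustrative choices, not a standard:

```python
# Compare this week's field-presence rates against a frozen launch-week
# reference. Catches valid-but-different outputs that strict parsing passes.

def field_presence(outputs, fields):
    """Fraction of parsed outputs (dicts) in which each field is present."""
    n = len(outputs)
    return {f: sum(f in o for o in outputs) / n for f in fields}

def shape_drift(reference, current, tolerance=0.05):
    """Fields whose presence rate moved more than `tolerance`
    from the frozen reference distribution."""
    return {
        f: (reference[f], current[f])
        for f in reference
        if abs(reference[f] - current[f]) > tolerance
    }
```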

Refusal-rate creep. Refusals tend to migrate. A safety-tuned model rolls forward, the refusal rate ticks up half a point a week, and after eight weeks the feature is refusing a meaningful fraction of legitimate requests. The week-to-week delta is below noise. The eight-week trajectory is not.
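
Recovering the trajectory takes nothing fancier than a least-squares slope over the weekly rates. The numbers below are illustrative:

```python
# Fit a weekly slope to refusal rates. Each weekly delta is within noise;
# the fitted trend over the soak is not.

def weekly_slope(rates):
    """Least-squares slope, in refusal-rate units per week."""
    n = len(rates)
    mx, my = (n - 1) / 2, sum(rates) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(rates))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

rates = [0.020, 0.024, 0.023, 0.029, 0.031, 0.034, 0.038, 0.041]  # illustrative
slope = weekly_slope(rates)
print(f"{slope:+.4f}/week, {slope * 8:+.3f} over eight weeks")
# +0.0030/week looks like noise in any single week; over eight weeks it is
# ~2.4 points of legitimate requests being refused.
```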

Retrieval quality, if RAG is involved. Embedding-retrieval relevance at top-k is a metric that decays as the corpus grows and as the query distribution shifts. The launch-day retrieval quality is not the steady-state retrieval quality, and the gap between them is the soak window.
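
The number worth trending is relevance at top-k against a labeled query set, re-run weekly through the soak. A minimal sketch:

```python
# Recall@k for one query: fraction of labeled-relevant docs retrieved in
# the top k. Run the same labeled query set weekly and plot the average.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 1.0

# Launch-day recall@5 of 0.9 sliding toward 0.7 as the corpus grows is
# exactly the gap between launch-day and steady-state retrieval quality.
```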

Eval Refresh Cadence Is Part of the Soak

A static eval set, frozen at launch, is rotting from the moment the soak starts. Real users probe shapes the eval-set authors didn't think of, and those new shapes are the ones that matter. If the soak's pass-rate is being measured against the launch eval, the team is grading itself against a baseline that no longer reflects the world.

The discipline is to refresh the eval set on a weekly cadence during the soak. Pull a sample of production traffic, label it (with humans or with a well-calibrated LLM-as-judge that's been spot-checked), and add it to the eval. The pass-rate is then re-baselined weekly against the up-to-date set. A slow regression that the static eval hides — because the static eval doesn't contain the new intent class — shows up immediately when the eval refresh includes it.
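
A sketch of the loop; the labeling step and the log plumbing are stand-ins for your own pipeline, not any real library's API:

```python
import random

def label(record):
    """Stand-in: attach a human label or a spot-checked LLM-judge verdict."""
    return {"input": record, "expected": None}  # None = awaiting its label

def refresh_eval_set(eval_set, production_logs, sample_size=200, seed=0):
    """Fold a labeled sample of this week's traffic into the eval set,
    so the pass-rate is re-baselined against the world as it is now."""
    rng = random.Random(seed)
    sample = rng.sample(production_logs, min(sample_size, len(production_logs)))
    return eval_set + [label(r) for r in sample]

# Each week of the soak:
#   eval_set = refresh_eval_set(eval_set, this_weeks_logs)
#   then re-run the eval harness against the refreshed set and re-baseline.
```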

The cost of weekly eval refresh during a soak is modest if the pipeline exists. The cost of not doing it is that the launch metrics look stable while the user-facing quality drifts, and the team only notices when the support tickets correlate. Tickets are a lagging indicator of an eval-staleness problem.

Keep the Rollback Path Warm Through the Whole Soak

The most common mistake teams make is decommissioning the previous version too early. The canary closes, the new version is at 100%, and someone files a cleanup ticket to delete the old prompt template, the old retrieval index, the old endpoint. By week three the rollback option is gone. By week six, when the cost or quality problem actually surfaces, there's nothing to roll back to without a rebuild.

The blue-green pattern from microservices ports cleanly here, with one twist. Keep the previous version warm and validated for the full soak window — not just the canary window. Periodically send shadow traffic to it so you know it still works against the current corpus and the current upstream APIs. A rollback option that hasn't been validated against the current data patterns is theatre; the previous prompt or retriever may have rotted while you were watching the new one.
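
A sketch of the shadow path; call_current and call_previous stand in for the two deployments, and the mirrored fraction is an assumption:

```python
import random

SHADOW_FRACTION = 0.01  # assumed: mirror 1% of live traffic to the old version

def handle(request, call_current, call_previous, record_shadow):
    """Serve from the new version; occasionally exercise the old one so
    the rollback target stays validated against current traffic."""
    response = call_current(request)              # the user sees only this
    if random.random() < SHADOW_FRACTION:
        shadow = call_previous(request)           # async/fire-and-forget in practice
        record_shadow(request, response, shadow)  # diffed offline, weekly
    return response
```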

The hold rule has to be explicit because the gravity of the deploy ceremony pulls toward cleanup. "Do not deprecate previous version until soak exit" needs to be a written policy that survives the developer's natural urge to tidy up after a launch. Treat the previous version as production infrastructure for as long as the soak runs.

The Org Pressure That Defeats Soak Windows

The technical case for a long soak is straightforward. The organizational case is hard. Leadership wants launches declared, marketing wants press, the team wants to free up cognitive load and start the next thing. Two weeks fits cleanly into a sprint. Eight weeks does not.

The org pressure shows up as language. "Are we done?" gets asked at the two-week mark. "Are we still in soak?" doesn't get asked at the six-week mark, because nobody has language for an eight-week thing. The fix is to make the soak a first-class artifact in the project plan: named, tracked, with its own exit criteria and its own dashboard. A soak that lives only in the team's heads is a soak that gets declared over the moment leadership stops asking.

The other org fix is to treat soak metrics as part of the launch narrative, not as a separate post-launch discipline. The launch announcement names the soak window. The launch retrospective happens at soak exit, not at full ramp. The team's incentive structure rewards soak discipline, not just speed-to-100%. Otherwise the next team copying the playbook will copy the canary and skip the soak — because the canary was the visible part.

A Soak Exit Checklist Worth Stealing

The exit criteria, written down, are what convert "the soak is done because the calendar says so" into "the soak is done because the failure modes had time to surface." A reasonable checklist for an AI feature soak:

  • All seasonal cohorts the feature was supposed to serve have been observed at full traffic for at least one full cycle of their cadence (weekly, monthly, quarterly as relevant).
  • Cost-per-task trend over the last four weeks is flat or declining; if it is rising, the cause is identified and the trend is forecasted.
  • Cache hit rate has been stable within ±10% for at least three consecutive weeks.
  • Eval pass-rate has been re-baselined weekly and the pass-rate against the most recent eval set is at or above the launch baseline.
  • Reviewer queue depth and escalation rate are within 10% of the pre-launch baseline, or the deviation is intentional and budgeted.
  • Refusal rate has not drifted by more than half a percentage point over the soak window.
  • For structured-output features: JSON parse-error rate and schema-conformance rate are within tolerance.
  • The previous version is still warm, recently shadow-validated, and has a tested rollback runbook.
  • A cleanup ticket exists for the previous version, scheduled for after soak exit, not before.
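
Written as an explicit gate, the same checklist stops being a calendar artifact. A sketch; the metric names mirror the list above and are assumptions about what your dashboard exports:

```python
def soak_exit_ready(m: dict) -> list[str]:
    """Return the unmet exit criteria; an empty list means exit-eligible."""
    checks = {
        "seasonal cohorts observed one full cycle": m["seasonal_cycles_observed"],
        "cost-per-task flat or explained":  m["cost_per_task_slope"] <= 0
                                            or m["cost_rise_explained"],
        "cache hit rate stable 3 weeks":    m["cache_hit_stable_weeks"] >= 3,
        "pass-rate at/above launch baseline": m["pass_rate"] >= m["launch_pass_rate"],
        "reviewer load within 10% or budgeted": m["reviewer_delta"] <= 0.10
                                                or m["reviewer_delta_budgeted"],
        "refusal drift <= 0.5 pt":          m["refusal_drift_pts"] <= 0.5,
        "previous version warm and validated": m["rollback_validated"],
    }
    return [name for name, ok in checks.items() if not ok]
```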

The checklist is not the soak — the observation is the soak. The checklist is the thing that prevents leadership pressure from converting "the calendar reached two weeks" into "the launch is done."

What This Means for the Deploy Ceremony

The honest version of the realization: AI feature releases need a deploy ceremony scaled to the timescale of the failure modes, not the timescale of code deploys. Code deploys can be declared done in a sprint because code failures surface fast and look like errors. AI feature failures surface slow and look like trends, and the team that conflates the two ceremonies is shipping on the wrong clock.

The 4–8 week soak is not a tax on velocity. It is the difference between a launch that holds and a launch that gets rolled back six weeks later when the bill arrives, the cohort onboards, or the eval drift compounds past the threshold. Most teams pay that tax once, in the form of an avoidable incident, and then institute the soak. The teams that institute the soak first are the ones that get to skip the lesson.
