AI Feature Soak Windows: Why a Two-Week Canary Misses What Actually Matters
The two-week canary is one of those practices that sounds disciplined enough to skip the harder question. Engineering imported it from microservices — ramp 1% for a few days, watch error rate, ramp to 100%, declare done — and grafted it onto AI features without asking whether the failure modes that matter for AI even surface in two weeks. They don't. The bill that kills the feature lands in week six. The customer cohort that exposes the long-tail intent onboards in week five. The eval drift that scored +3% on launch day starts costing real money in week four because the new prompt's chattier outputs have been compounding token spend the whole time, and nobody was watching for that because the dashboard was watching for crashes.
A canary built around p95 latency and HTTP 500s will tell you the LLM is up. It will not tell you the feature is working. AI features fail in shapes the deploy ceremony was never designed to catch — slow shape changes in user behavior, gradual cache erosion, retrieval quality collapse, refusal-rate creep, cost trajectories that bend the wrong way — and almost all of them take longer than two weeks to declare themselves. The team that ships by the microservice clock is shipping by a clock the failures don't run on.
