The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features
There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash; those are easy to catch. Something subtler: the model confidently generates wrong information, costs run to three times the projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."
The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.
LLM features fail in production in ways that traditional software doesn't. They're non-deterministic, so behavior changes without code changes. They have fat-tailed latency distributions, so average performance looks fine while the worst 5% of users experience timeouts. They consume resources proportional to user behavior, so a single chatty user can spike your API bill in ways a load test never reveals. And they can fail silently — returning HTTP 200 with confidently hallucinated content — while every infrastructure metric shows green.
None of these failure modes appear in a demo. A demo is the best-case scenario: a controlled input, a well-prepared prompt, a forgiving evaluator. Production is the opposite.
This is the checklist your demo skipped.
Gate 1: Evaluation Coverage
The most common failure in AI feature launches is shipping without knowing what the failure rate actually is. Teams run a small, curated test set of happy-path scenarios, achieve 95% accuracy on it, and ship. Then they discover that production users interact with the system in ways they never anticipated, and accuracy drops to 70%.
This isn't bad luck. It's a structural problem with how AI features are evaluated before launch.
Production-ready evaluation requires three things that demo-ready evaluation usually skips:
Statistical coverage, not cherry-picked examples. You need a test set that approximates real user behavior, including edge cases, adversarial inputs, and the long tail of queries your product will receive. For RAG systems, that means evaluating across the full retrieval-answer-relevance pipeline using frameworks like RAGAS, which measures context precision, faithfulness, and answer relevance without requiring ground truth annotations. For chatbots, it means including multi-turn conversations where the model has to track context over many exchanges.
Automated evaluation with human calibration. An AI judge (using a stronger or equivalent model to evaluate outputs) can scale evaluation beyond what human review allows, but only if it's properly calibrated. An AI judge needs to achieve at least 85-90% agreement with human reviewers before you trust it to gate deployment. Below that threshold, you're automating your blind spots.
Known failure rate, not just accuracy. Before shipping, you should be able to state your hallucination rate on production-realistic inputs. Industry data shows 15-38% hallucination rates in typical LLM deployments. Whether a given rate is acceptable depends on your domain: it is not for medical or legal applications, though it might be for creative tools. Either way, the number should be known and deliberate, not discovered post-launch.
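The calibration and failure-rate checks above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function names, the boolean label format, and the choice of a Wilson score interval to express uncertainty around the measured rate are all assumptions made for the example. The 0.85 floor is the lower end of the agreement range mentioned above.

```python
from math import sqrt

# Lower end of the 85-90% judge/human agreement range discussed above.
JUDGE_AGREEMENT_FLOOR = 0.85


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the AI judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def judge_is_calibrated(judge_labels, human_labels, floor=JUDGE_AGREEMENT_FLOOR):
    """True if the judge agrees with humans often enough to gate deployment."""
    return agreement_rate(judge_labels, human_labels) >= floor


def wilson_interval(failures, total, z=1.96):
    """Approximate 95% Wilson score interval for an observed failure rate.

    Assumption for this sketch: each evaluated example is an independent
    pass/fail trial, which is only roughly true for real eval sets.
    """
    if total <= 0 or not 0 <= failures <= total:
        raise ValueError("need 0 <= failures <= total and total > 0")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

Reporting the interval alongside the point estimate keeps a rate measured on a few hundred examples from being read as more precise than it is.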
Gate 2: Latency Budget
Latency in LLM features isn't a number. It's a distribution, and the tail is what matters.
A feature where median latency is 800ms and P95 latency is 4 seconds will feel broken on one in twenty requests. Under typical load, that might be hundreds of users per hour. And tail latency in LLM systems is far worse than in traditional services because every variable (prompt length, retrieval complexity, output length, API provider load) varies independently and the delays compound. The slow retrieval day, the long-context request, and the API congestion can all hit at once.
Before launch, you need a defined latency budget across the full stack. For user-facing AI features, the human tolerance threshold is roughly 3 seconds before users abandon the interaction. For chatbots, time-to-first-token matters most: under 500ms feels responsive, over 1 second starts to feel broken. For code assistants and autocomplete, that threshold drops to 100-200ms, because users expect suggestions before they've finished typing.
The specific thresholds matter less than the process of defining them explicitly and then testing against them at the P95 level, not the median. The engineering question "did our latency budget pass?" must have a concrete, falsifiable answer before any launch checklist gets signed off.
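One way to make the budget question falsifiable is a check that computes P95 from measured samples and compares it with the budget. The sketch below uses the nearest-rank percentile definition; the function names and the sample data are hypothetical, and the 3000ms budget simply reuses the rough abandonment threshold mentioned above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_budget_passes(samples_ms, budget_ms, pct=95):
    """True if the pct-th percentile latency is within budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Hypothetical load-test result: most requests are fast, the tail is not.
samples_ms = [800] * 18 + [4000] * 2

median_ok = latency_budget_passes(samples_ms, budget_ms=3000, pct=50)
p95_ok = latency_budget_passes(samples_ms, budget_ms=3000, pct=95)
```

With these samples the median check passes and the P95 check fails, which is the gap between "looks fine on average" and "one in twenty requests is broken."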
If your P95 latency is out of budget, there are real levers: caching frequent retrieval results, streaming output to reduce perceived latency, using faster models for the initial response and slower models for follow-ups, and reducing prompt size by cutting system prompt bloat. But those levers need to be pulled before launch, not identified as post-incident action items.
Gate 3: Graceful Degradation Path
Every LLM feature will fail. The question is whether you've decided what happens when it does.
Traditional software degradation is straightforward: the database is down, show an error page. LLM degradation is more complex because there's a spectrum of failure: the primary model is slow but available, the primary model has a high error rate, the API is fully down, the output is available but quality has degraded below acceptable thresholds. Each of these requires a different response.
A production-ready fallback chain typically looks like: primary model → cheaper/faster model from the same provider → equivalent model from a different provider → rule-based or template response → graceful error message that doesn't lose user context. The key word is "typically" — your specific chain depends on your quality/cost tradeoffs and which failure modes matter most for your users.
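A fallback chain of this shape can be sketched as an ordered list of handlers tried in turn. Everything named here is hypothetical: the handlers are stubs standing in for real model calls, and catching every exception is a simplification that a real system would likely narrow to specific error types.

```python
from typing import Callable, Sequence, Tuple

Handler = Callable[[str], str]


def run_fallback_chain(
    prompt: str,
    tiers: Sequence[Tuple[str, Handler]],
    final_message: str,
) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, response) from the first that succeeds."""
    for name, handler in tiers:
        try:
            return name, handler(prompt)
        except Exception:
            # A real system would log the tier name and the error before moving on.
            continue
    # Last resort: a graceful message. Preserving user context is left to the caller.
    return "error_message", final_message


# Stub handlers standing in for real provider calls (assumptions for this sketch).
def primary_model(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def secondary_provider(prompt: str) -> str:
    raise ConnectionError("secondary provider unavailable")


def template_response(prompt: str) -> str:
    return "We can't answer that right now. Your question has been saved."


tiers = [
    ("primary", primary_model),
    ("secondary_provider", secondary_provider),
    ("template", template_response),
]
tier_used, reply = run_fallback_chain("How do I reset my password?", tiers, "Please try again later.")
```

Returning the tier name with the response makes it possible to count how often each fallback level actually serves traffic.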
The circuit breaker pattern applies directly to LLM features. Define a threshold (5 consecutive failures, or error rate above 5% over a rolling window), implement a state machine that routes to fallback when that threshold is hit, and build a probe mechanism that periodically tests recovery. The crucial operational decision is: what does "failure" mean? For LLM features, it should include quality failures (outputs below your eval threshold), not just infrastructure failures (API timeouts). An LLM confidently hallucinating at high rates is a failure even if the API is returning 200 OK.
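A minimal sketch of that state machine follows, using the consecutive-failure threshold rather than the rolling error rate. The class and method names are illustrative, not taken from any library, and the half-open state here lets every request through until one outcome is recorded, which a production breaker would probably restrict to a single probe. The caller decides what counts as a failure, so a response scoring below the eval threshold can be reported through `record_failure` just like a timeout.

```python
import time


class CircuitBreaker:
    """Closed: use primary. Open: use fallback. Half-open: probing for recovery."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, probe_interval_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.probe_interval_s = probe_interval_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self.state = self.CLOSED
        self._consecutive_failures = 0
        self._opened_at = 0.0

    def allow_request(self):
        """True if the primary path should be tried; False means route to fallback."""
        if self.state == self.OPEN:
            if self._clock() - self._opened_at >= self.probe_interval_s:
                self.state = self.HALF_OPEN  # let a probe request through
                return True
            return False
        return True

    def record_success(self):
        """A successful call (infrastructure and quality) closes the breaker."""
        self._consecutive_failures = 0
        self.state = self.CLOSED

    def record_failure(self):
        """Report a timeout, an API error, or an output below the quality threshold."""
        self._consecutive_failures += 1
        failed_probe = self.state == self.HALF_OPEN
        if failed_probe or self._consecutive_failures >= self.failure_threshold:
            self.state = self.OPEN
            self._opened_at = self._clock()
```

Injecting the clock is a deliberate choice: it lets the staging tests described in this section exercise the open, half-open, and closed transitions deterministically.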
Fallback paths must be tested before launch. It's not enough to have them implemented — you need to deliberately trigger them in staging to verify that they activate correctly, that user context is preserved, and that the fallback quality is actually acceptable (not just "better than nothing").
Gate 4: Monitoring Baseline
Only 15% of LLM deployments have adequate observability in place. This is the highest-leverage problem to fix, because without it, you're flying blind for all the other gates — you can't verify your eval coverage is holding in production, you can't catch latency regressions, and you can't see when your fallback chain activates.
LLM monitoring requires a different approach than traditional application monitoring because the interesting failures are invisible to infrastructure metrics. When an LLM returns a hallucinated answer, your infrastructure logs see: HTTP 200, latency within budget, tokens consumed, response returned. Everything looks fine. The failure only exists in the content of the response, which infrastructure monitoring doesn't read.
A minimal production monitoring stack for an LLM feature needs:
Tracing with content capture. Every request should produce a trace that includes the prompt, the response, latency at each stage (retrieval, model inference, post-processing), and token counts. Without this, you can't diagnose quality regressions.
Output quality sampling. Sample some percentage of live traffic for evaluation — either LLM-as-judge scoring or targeted human review. This is the only way to catch hallucination rate changes or quality degradations before users start complaining.
Cost attribution per user or feature. Tag every API call with a user or feature identifier. Set alerts on per-user spend thresholds. LLM costs spike in ways that are predictable in retrospect but invisible without attribution: a single user with long conversation histories, a feature that's unexpectedly popular, a context bloat bug that's inflating token consumption.
Baseline before launch, not after. The reason to establish monitoring before launch — not in the first sprint after — is that you need a baseline to compare against. A hallucination rate of 20% is alarming if your baseline was 5%. It's expected if that was always your production rate. You can't distinguish regression from steady state without a pre-launch baseline.
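The pieces of that minimal stack can be sketched as a trace record, a deterministic sampler, and a per-user spend rollup. All names, fields, and prices below are hypothetical; a real deployment would more likely emit these records through an existing tracing or observability tool than hand-roll them.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """One request: content, per-stage latency, token counts, and attribution tags."""
    trace_id: str
    user_id: str
    feature: str
    prompt: str
    response: str
    stage_latency_ms: dict  # e.g. {"retrieval": 120, "inference": 900, "post": 15}
    prompt_tokens: int
    completion_tokens: int


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically pick a fixed fraction of traces for quality evaluation."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def spend_by_user(traces, usd_per_1k_prompt, usd_per_1k_completion):
    """Total estimated spend per user; the prices are caller-supplied assumptions."""
    totals = defaultdict(float)
    for t in traces:
        totals[t.user_id] += (
            t.prompt_tokens / 1000 * usd_per_1k_prompt
            + t.completion_tokens / 1000 * usd_per_1k_completion
        )
    return dict(totals)


def users_over_threshold(spend, threshold_usd):
    """User ids whose spend exceeds the alert threshold, sorted for stable output."""
    return sorted(user for user, usd in spend.items() if usd > threshold_usd)
```

Hashing the trace id, rather than drawing a random number, means the same trace is always in or out of the sample, which makes sampled quality scores reproducible when a trace is re-evaluated later.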
Gate 5: Rollback Procedure
Rolling back an LLM feature is harder than rolling back traditional software, and the reasons are worth understanding explicitly.
Traditional software rollback is usually: deploy the previous version, done. LLM features have four independent versioning axes that can each cause production failures: the model version (including sampling parameters), the prompt and system instructions, the tool and API schemas your agent uses, and the underlying agent logic. Any of these can change and cause a production regression. Industry data attributes 60% of production AI failures to tool versioning changes and 40% to model drift.
Before launch, you need a clear answer to: if we need to roll back, what exactly does that mean, and can it be done in under five minutes?
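One way to give "roll back" a concrete meaning is to pin all four axes in a single immutable release record, so that a rollback restores them together instead of one at a time. The sketch below is an assumption about how such a record could look; the class names, field names, and version strings are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Pins the four versioning axes so they are deployed and rolled back as a unit."""
    model_version: str
    temperature: float          # sampling parameters travel with the model version
    prompt_version: str
    tool_schema_version: str
    agent_logic_version: str


class ReleaseRegistry:
    """Keeps deployment history; the most recent manifest is the active one."""

    def __init__(self):
        self._history = []

    def deploy(self, manifest: ReleaseManifest) -> None:
        self._history.append(manifest)

    @property
    def active(self) -> ReleaseManifest:
        if not self._history:
            raise LookupError("nothing has been deployed")
        return self._history[-1]

    def rollback(self) -> ReleaseManifest:
        """Discard the active release and return the previous one."""
        if len(self._history) < 2:
            raise LookupError("no previous release to roll back to")
        self._history.pop()
        return self.active


# Hypothetical releases: only the prompt changed between v1 and v2.
v1 = ReleaseManifest("model-2024-01", 0.2, "prompt-v7", "tools-v3", "agent-v12")
v2 = ReleaseManifest("model-2024-01", 0.2, "prompt-v8", "tools-v3", "agent-v12")

registry = ReleaseRegistry()
registry.deploy(v1)
registry.deploy(v2)
restored = registry.rollback()
```

Because the manifest is frozen and compared by value, diffing two releases shows exactly which axis changed, which helps when deciding whether a regression came from the prompt, the model, the tools, or the agent logic.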
- https://arxiv.org/abs/2603.27355
- https://www.anthropic.com/research/building-effective-agents
- https://blog.langchain.com/agent-evaluation-readiness-checklist/
- https://www.helicone.ai/blog/llm-observability
- https://arxiv.org/abs/2309.15217
- https://medium.com/@nraman.n6/versioning-rollback-lifecycle-management-of-ai-agents-treating-intelligence-as-deployable-deac757e4dea
- https://medium.com/@mota_ai/building-ai-that-never-goes-down-the-graceful-degradation-playbook-d7428dc34ca3
- https://medium.com/@sparknp1/milliseconds-that-matter-your-llm-latency-budget-43d6fdc19890
- https://www.brookings.edu/articles/the-art-and-science-of-pre-mortems/
