
The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

· 12 min read
Tian Pan
Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.

LLM features fail in production in ways that traditional software doesn't. They're non-deterministic, so behavior changes without code changes. They have fat-tailed latency distributions, so average performance looks fine while the worst 5% of users experience timeouts. They consume resources proportional to user behavior, so a single chatty user can spike your API bill in ways a load test never reveals. And they can fail silently — returning HTTP 200 with confidently hallucinated content — while every infrastructure metric shows green.

None of these failure modes appear in a demo. A demo is the best-case scenario: a controlled input, a well-prepared prompt, a forgiving evaluator. Production is the opposite.

This is the checklist your demo skipped.

Gate 1: Evaluation Coverage

The most common failure in AI feature launches is shipping without knowing how bad the failure rate actually is. Teams run a curated test set against a small set of happy-path scenarios, achieve 95% accuracy on those, and ship. Then they discover that production users interact with the system in ways they never anticipated, and accuracy drops to 70%.

This isn't bad luck. It's a structural problem with how AI features are evaluated before launch.

Production-ready evaluation requires three things that demo-ready evaluation usually skips:

Statistical coverage, not cherry-picked examples. You need a test set that approximates real user behavior, including edge cases, adversarial inputs, and the long tail of queries your product will receive. For RAG systems, that means evaluating across the full retrieval-answer-relevance pipeline using frameworks like RAGAS, which measures context precision, faithfulness, and answer relevance without requiring ground truth annotations. For chatbots, it means including multi-turn conversations where the model has to track context over many exchanges.
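
To make the "statistical coverage" point concrete: a single accuracy number hides exactly the gap this gate exists to catch. Breaking eval results out by input category makes the happy-path/adversarial split visible. A minimal sketch (the category names and results are hypothetical):

```python
from collections import defaultdict

def stratified_accuracy(results):
    """Report accuracy per input category, not just one overall number.

    `results` is a list of dicts with hypothetical keys:
    {"category": "happy_path" | "edge_case" | "adversarial", "correct": bool}
    """
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["correct"])
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

results = [
    {"category": "happy_path", "correct": True},
    {"category": "happy_path", "correct": True},
    {"category": "edge_case", "correct": True},
    {"category": "edge_case", "correct": False},
    {"category": "adversarial", "correct": False},
    {"category": "adversarial", "correct": False},
]
print(stratified_accuracy(results))
```

A perfect happy-path score can coexist with a near-zero adversarial score; the stratified report is what forces that tradeoff into the open before launch.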

Automated evaluation with human calibration. An AI judge (using a stronger or equivalent model to evaluate outputs) can scale evaluation beyond what human review allows, but only if it's properly calibrated. An AI judge needs to achieve at least 85-90% agreement with human reviewers before you trust it to gate deployment. Below that threshold, you're automating your blind spots.
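
The calibration check itself is simple to compute, assuming you have paired pass/fail verdicts from the AI judge and from human reviewers on the same items. A minimal sketch with illustrative labels:

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of items where the AI judge agrees with the human reviewer."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Hypothetical paired verdicts on ten sampled outputs:
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

agreement = judge_agreement(judge, human)
if agreement < 0.85:  # the calibration bar discussed above
    print(f"Judge not calibrated ({agreement:.0%} agreement) — do not gate deployment on it")
```

In practice you would run this over a few hundred labeled items, not ten, and re-check it whenever the judge prompt or judge model changes.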

Known failure rate, not just accuracy. Before shipping, you should be able to state your hallucination rate on production-realistic inputs. Industry data shows 15-38% hallucination rates in typical LLM deployments. Whether that's acceptable depends on your domain — it's not for medical or legal applications, and it might be acceptable for creative tools — but the number should be known and deliberate, not discovered post-launch.

Gate 2: Latency Budget

Latency in LLM features isn't a number. It's a distribution, and the tail is what matters.

A feature where median latency is 800ms and P95 latency is 4 seconds will feel broken to one in twenty users. Under typical load, that might be hundreds of users per hour. And tail latency in LLM systems is far worse than in traditional services because every variable — prompt length, retrieval complexity, output length, API provider load — compounds independently. The slow retrieval day and the long-context request and the API congestion can all hit at once.

Before launch, you need a defined latency budget across the full stack. For user-facing AI features, the human tolerance threshold is roughly 3 seconds before users abandon the interaction. For chatbots, time-to-first-token matters most: under 500ms feels responsive, over 1 second starts to feel broken. For code assistants and autocomplete, that threshold drops to 100-200ms, because users expect suggestions before they've finished typing.

The specific thresholds matter less than the process of defining them explicitly and then testing against them at the P95 level, not the median. The engineering question "did our latency budget pass?" must have a concrete, falsifiable answer before any launch checklist gets signed off.
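
One way to make that answer falsifiable is a small gate that computes P50 and P95 from measured latency samples and compares only the P95 against the budget. A sketch using nearest-rank percentiles and made-up sample data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p fraction of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered))
    return ordered[rank - 1]

def latency_gate(samples_ms, p95_budget_ms):
    """Falsifiable launch check: the P95, not the median, must fit the budget."""
    p50 = percentile(samples_ms, 0.50)
    p95 = percentile(samples_ms, 0.95)
    return {"p50_ms": p50, "p95_ms": p95, "passed": p95 <= p95_budget_ms}

# 20 samples: the median looks healthy, the tail does not.
samples = [700] * 18 + [3500, 4200]
print(latency_gate(samples, p95_budget_ms=3000))
```

On this data the median is 700ms (comfortably "fine") while the P95 is 3500ms and the gate fails — which is precisely the situation a median-only check would sign off on.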

If your P95 latency is out of budget, there are real levers: caching frequent retrieval results, streaming output to reduce perceived latency, using faster models for the initial response and slower models for follow-ups, and reducing prompt size by cutting system-prompt bloat. But those levers need to be pulled before launch, not identified as post-incident action items.

Gate 3: Graceful Degradation Path

Every LLM feature will fail. The question is whether you've decided what happens when it does.

Traditional software degradation is straightforward: the database is down, show an error page. LLM degradation is more complex because there's a spectrum of failure: the primary model is slow but available, the primary model has a high error rate, the API is fully down, the output is available but quality has degraded below acceptable thresholds. Each of these requires a different response.

A production-ready fallback chain typically looks like: primary model → cheaper/faster model from the same provider → equivalent model from a different provider → rule-based or template response → graceful error message that doesn't lose user context. The key word is "typically" — your specific chain depends on your quality/cost tradeoffs and which failure modes matter most for your users.

The circuit breaker pattern applies directly to LLM features. Define a threshold (5 consecutive failures, or error rate above 5% over a rolling window), implement a state machine that routes to fallback when that threshold is hit, and build a probe mechanism that periodically tests recovery. The crucial operational decision is: what does "failure" mean? For LLM features, it should include quality failures (outputs below your eval threshold), not just infrastructure failures (API timeouts). An LLM confidently hallucinating at high rates is a failure even if the API is returning 200 OK.
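
A minimal sketch of that state machine, assuming the caller records each outcome (whether the failure was an API timeout or an eval score below threshold). The threshold and probe window are illustrative:

```python
import time

class LLMCircuitBreaker:
    """Minimal circuit breaker; 'failure' includes quality failures, not just timeouts."""

    def __init__(self, failure_threshold=5, recovery_probe_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_probe_after_s = recovery_probe_after_s
        self.consecutive_failures = 0
        self.opened_at = None  # None means closed: primary model in use

    def record(self, ok: bool):
        if ok:
            self.consecutive_failures = 0
            self.opened_at = None
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    def use_fallback(self) -> bool:
        if self.opened_at is None:
            return False
        # Half-open: after the probe window, let a request try the primary again.
        # One more recorded failure re-opens the breaker immediately.
        return time.monotonic() - self.opened_at < self.recovery_probe_after_s

breaker = LLMCircuitBreaker(failure_threshold=5)
for _ in range(5):
    breaker.record(ok=False)  # e.g. eval score below threshold, or API timeout
print(breaker.use_fallback())  # True: route to the fallback chain
```

A production version would also use a rolling error-rate window rather than only consecutive failures, but the core decision — what counts as a failure, and when to route away from the primary — is the same.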

Fallback paths must be tested before launch. It's not enough to have them implemented — you need to deliberately trigger them in staging to verify that they activate correctly, that user context is preserved, and that the fallback quality is actually acceptable (not just "better than nothing").

Gate 4: Monitoring Baseline

Only 15% of LLM deployments have adequate observability in place. This is the highest-leverage problem to fix, because without it, you're flying blind for all the other gates — you can't verify your eval coverage is holding in production, you can't catch latency regressions, and you can't see when your fallback chain activates.

LLM monitoring requires a different approach than traditional application monitoring because the interesting failures are invisible to infrastructure metrics. When an LLM returns a hallucinated answer, your infrastructure logs see: HTTP 200, latency within budget, tokens consumed, response returned. Everything looks fine. The failure only exists in the content of the response, which infrastructure monitoring doesn't read.

A minimal production monitoring stack for an LLM feature needs:

Tracing with content capture. Every request should produce a trace that includes the prompt, the response, latency at each stage (retrieval, model inference, post-processing), and token counts. Without this, you can't diagnose quality regressions.
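
As an illustration of what one such trace record might carry (the field names are hypothetical, not taken from any particular tracing library):

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class LLMTrace:
    """One trace per request: content plus per-stage latency and token counts."""
    prompt: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    response: str = ""
    stage_latency_ms: dict = field(default_factory=dict)  # retrieval, inference, post
    tokens: dict = field(default_factory=dict)            # prompt_tokens, completion_tokens

trace = LLMTrace(prompt="What is our refund policy?")
start = time.monotonic()
# ... retrieval would happen here ...
trace.stage_latency_ms["retrieval"] = (time.monotonic() - start) * 1000
trace.response = "Refunds are available within 30 days."
trace.tokens = {"prompt_tokens": 412, "completion_tokens": 18}
```

The important property is that content and timing live in the same record: when a quality regression shows up in sampled outputs, you can ask which stage (retrieval vs. inference) changed for those exact requests.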

Output quality sampling. Sample some percentage of live traffic for evaluation — either LLM-as-judge scoring or targeted human review. This is the only way to catch hallucination rate changes or quality degradations before users start complaining.

Cost attribution per user or feature. Tag every API call with a user or feature identifier. Set alerts on per-user spend thresholds. LLM costs spike in ways that are predictable in retrospect but invisible without attribution: a single user with long conversation histories, a feature that's unexpectedly popular, a context bloat bug that's inflating token consumption.
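
A sketch of per-user attribution with a spend alert; the per-token prices and the $1 threshold are placeholder assumptions, not real provider rates:

```python
from collections import defaultdict

# Assumed per-1K-token pricing; substitute your provider's actual rates.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

class CostTracker:
    """Attribute every call's cost to a user id and alert on a spend threshold."""

    def __init__(self, per_user_alert_usd=1.00):
        self.per_user_alert_usd = per_user_alert_usd
        self.spend = defaultdict(float)

    def record_call(self, user_id, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
             + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
        self.spend[user_id] += cost
        return self.spend[user_id] > self.per_user_alert_usd  # True = alert

tracker = CostTracker(per_user_alert_usd=1.00)
# One chatty user whose long conversation history inflates input tokens:
for _ in range(40):
    alert = tracker.record_call("user_42", input_tokens=6000, output_tokens=500)
print(alert, round(tracker.spend["user_42"], 2))
```

Note that in this scenario no single call is expensive; it's the attribution across calls that surfaces the pattern, which is exactly what unattributed aggregate billing hides.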

Baseline before launch, not after. The reason to establish monitoring before launch — not in the first sprint after — is that you need a baseline to compare against. A hallucination rate of 20% is alarming if your baseline was 5%. It's expected if that was always your production rate. You can't distinguish regression from steady state without a pre-launch baseline.

Gate 5: Rollback Procedure

Rolling back an LLM feature is harder than rolling back traditional software, and the reasons are worth understanding explicitly.

Traditional software rollback is usually: deploy the previous version, done. LLM features have four independent versioning axes that can each cause production failures: the model version (including sampling parameters), the prompt and system instructions, the tool and API schemas your agent uses, and the underlying agent logic. Any of these can change and cause a production regression. Industry data attributes 60% of production AI failures to tool versioning changes and 40% to model drift.

Before launch, you need a clear answer to: if we need to roll back, what exactly does that mean, and can it be done in under five minutes?

For stateless LLM features, rollback is closest to traditional deployment: point to the previous prompt version and model pin, done. For stateful agents that maintain conversation history or build on prior outputs, rollback is more complex — like database migrations — because rolling back the logic doesn't roll back the state that prior logic generated.

The practical rollback requirements:

  • Model versions should be pinned explicitly, not floating. If you're using gpt-4o-latest, you don't control when the model changes.
  • Prompt changes should be versioned and incrementally deployed, not atomically replaced.
  • Rollback should be a flag toggle or a configuration change, not a code redeployment. The time from "we have a problem" to "we're on the old version" should be measured in seconds.
  • Rollback procedures should be tested in staging before launch, not designed in the middle of an incident.

Canary deployments are the right deployment strategy for LLM features: route 1-10% of traffic to the new version, set hard thresholds (error rate >5% triggers automatic rollback), and only promote to full traffic when the canary has proven stable.
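
A sketch of both halves of that strategy — stable-hash routing so each user consistently sees one version, and a hard error-rate gate — with illustrative numbers:

```python
import hashlib

def route(user_id: str, canary_fraction: float = 0.05) -> str:
    """Stable hash routing: the same user always lands on the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "stable"

def canary_gate(canary_errors, canary_requests, max_error_rate=0.05):
    """Hard threshold: a breach triggers automatic rollback to the stable version."""
    if canary_requests == 0:
        return "keep_canary"
    error_rate = canary_errors / canary_requests
    return "rollback" if error_rate > max_error_rate else "keep_canary"

print(canary_gate(canary_errors=8, canary_requests=100))  # 8% > 5% -> rollback
```

Stable hashing (rather than random routing per request) matters for stateful features: a user who starts a conversation on the canary should stay on it, or context-dependent behavior becomes impossible to debug. For LLM features, "errors" fed into the gate should include sampled quality failures, per Gate 3.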

The Pre-Mortem: Running the Checklist in Reverse

Each of the five gates above catches a class of failures. The pre-mortem catches the failures that don't fit into any category.

The methodology: before shipping, gather the team and assume the feature has already failed in production. Not "what could go wrong" — "it went wrong, what happened?" Prospective hindsight has been shown to increase the identification of failure causes by about 30%, because it bypasses the optimism bias that affects forward-looking risk assessment.

For LLM features specifically, the pre-mortem should cover technical failures (hallucinations, tool call failures, context limit overflows), operational failures (cost spikes, latency regressions, provider outages), and behavioral failures (outputs that are technically correct but cause user harm or violate policy). The last category is the one that traditional risk assessment misses entirely.

The output of a pre-mortem isn't just a risk register. It's a set of prevention criteria: specific, falsifiable conditions that must be true before launch. "The eval coverage is sufficient" is not a prevention criterion. "The eval pipeline shows less than 15% hallucination rate on the adversarial test set and the AI judge has 90% agreement with human reviewers" is a prevention criterion.
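
Prevention criteria of that shape can be encoded directly as checks against the eval pipeline's metrics, so that "launch-ready" is a computed value rather than an opinion. The metric names and thresholds below are hypothetical:

```python
# Hypothetical pre-launch metrics pulled from the eval pipeline.
metrics = {
    "hallucination_rate_adversarial": 0.12,
    "judge_human_agreement": 0.91,
    "p95_latency_ms": 2600,
}

# Each prevention criterion is a specific, falsifiable condition.
criteria = {
    "hallucination_rate_adversarial < 0.15":
        metrics["hallucination_rate_adversarial"] < 0.15,
    "judge_human_agreement >= 0.90":
        metrics["judge_human_agreement"] >= 0.90,
    "p95_latency_ms <= 3000":
        metrics["p95_latency_ms"] <= 3000,
}

launch_ready = all(criteria.values())
for name, passed in criteria.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("launch-ready:", launch_ready)
```

The point isn't the specific thresholds; it's that every criterion here can fail, and a failed criterion blocks launch without a debate about what "sufficient" means.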

Why Demos Pass All Five Gates and Production Fails Them

Demos are optimized to be convincing. The prompts are tuned for the demo inputs. The inputs are chosen to elicit impressive outputs. The evaluator is a person who wants to see success. The environment has none of the load, adversarial inputs, or edge cases of production.

This creates a systematic gap: demo performance predicts exactly one thing — demo performance. It has almost no predictive validity for production behavior, because the conditions that determine demo success are entirely unlike the conditions that determine production success.

The five gates above are specific precisely because "does it work?" is not a useful question. Eval coverage asks: does it work on inputs that represent real users? Latency budget asks: does it work at the P95 level, not the median? Graceful degradation asks: does it work when components fail? Monitoring baseline asks: will you know when it stops working? Rollback procedure asks: can you recover quickly when it fails?

Each gate has concrete, falsifiable acceptance criteria. The demo has none. Ship when the gates pass, not when the demo looks great.

The Practical Checklist

Before marking any LLM feature launch-ready:

Eval coverage

  • Test set covers adversarial inputs and edge cases, not just happy paths
  • AI judge calibrated to >85% agreement with human reviewers
  • Hallucination rate on realistic inputs is known and explicitly accepted

Latency budget

  • P95 latency is within budget for the use case (not P50)
  • Latency tested under realistic concurrent load
  • Streaming or TTFT measured separately for interactive features

Graceful degradation

  • Fallback chain defined and implemented
  • Circuit breaker thresholds set (error rate + latency)
  • Fallback paths tested by deliberately triggering failures in staging

Monitoring baseline

  • End-to-end request tracing active before launch day
  • Output quality sampling implemented
  • Cost attribution per user or feature, with spend alerts

Rollback procedure

  • Model versions pinned explicitly
  • Rollback is a flag toggle, not a redeployment
  • Canary deployment with hard thresholds on error rate and quality
  • Rollback tested in staging, not designed during incidents

Pre-mortem

  • Failure scenarios documented across technical, operational, and behavioral dimensions
  • Prevention criteria defined as specific, falsifiable conditions
  • Launch criteria sign-off required from someone who ran the pre-mortem

The demo is over. The checklist starts now.
