Continuous Production Eval: Statistical Quality Monitoring for Live LLM Traffic
Most teams treat LLM quality evaluation as a pre-deployment gate: run your eval suite, check the scores, ship. That approach catches only a fraction of the failures your users will actually see. The rest slip through because production traffic looks nothing like your eval set — different query distributions, different session lengths, different upstream data, different model behavior under concurrent load. By the time a user complaint surfaces, the problem has been happening for days.
The fix is not more evals before deployment. It is continuous evaluation against live traffic, designed around the reality that you have no ground truth labels at inference time and need actionable signal within minutes, not weeks.
What Reference-Free Quality Signals Actually Measure
The phrase "reference-free evaluation" sounds like a compromise — you'd prefer ground truth, you just can't have it. In practice, reference-free signals measure a genuinely different failure class than accuracy benchmarks do.
Accuracy benchmarks check whether the model gets the right answer on curated questions. Reference-free production signals check whether the model is behaving consistently with how it behaved last Tuesday. Behavioral consistency is often what breaks first in production, and it breaks in ways accuracy tests were never designed to catch: refusal rate spikes when a prompt edge case starts triggering safety filters, output length distributions shift when a model update changes verbosity defaults, schema conformance drops when a tool description drifts out of sync with the model version.
The most useful reference-free signals by category:
Structural signals (synchronous, cheap, 100% coverage): schema conformance rate, JSON parse success rate, output length in tokens, presence of required fields, response format adherence. These run on every request in the hot path at near-zero cost. A schema conformance drop from 99.8% to 97.1% over six hours is a concrete incident, not statistical noise. (A minimal validation sketch follows this list.)
Behavioral signals (asynchronous, sampled, 5-10% of traffic): refusal rate, hedging language frequency, topic drift from expected domain, self-consistency under re-query. These require a second model call or regex pattern matching and run off the critical path.
Uncertainty signals (sampled): response entropy, logprob variance on output tokens (where available), self-consistency score across two independent generations of the same input. Lower entropy and higher self-consistency correlate with higher confidence — though not necessarily correctness. (A self-consistency sketch appears below.)
LLM-as-judge signals (sampled, high-latency): rubric-based quality scoring by a separate judge model. This is your highest-fidelity signal but also your most expensive. Run it on 1-5% of traffic, not on everything.
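As a concrete example of the structural tier, here is a minimal sketch assuming Pydantic for schema validation; the `SupportAnswer` schema and field names are hypothetical stand-ins for your feature's actual response contract:

```python
import json

from pydantic import BaseModel, ValidationError

# Hypothetical schema for one feature; substitute your actual response contract.
class SupportAnswer(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

def structural_signals(raw_output: str) -> dict:
    """Synchronous checks cheap enough to run on every request."""
    signals = {
        "json_parse_ok": False,
        "schema_ok": False,
        "length_chars": len(raw_output),
    }
    try:
        payload = json.loads(raw_output)
        signals["json_parse_ok"] = True
    except json.JSONDecodeError:
        return signals  # unparseable output fails everything downstream
    try:
        SupportAnswer.model_validate(payload)
        signals["schema_ok"] = True
    except ValidationError:
        pass
    return signals
```

Emit each field as a counter to your metrics pipeline and the conformance rate falls out of a simple ratio query.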
The mistake most teams make is trying to build all four tiers on day one and shipping nothing. Start with structural signals across 100% of traffic. Add behavioral signals at the 30% milestone. Add uncertainty and judge signals at the 60% milestone.
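When you do reach the uncertainty tier, self-consistency is the cheapest signal to implement: generate twice on a sampled input and measure agreement. A minimal sketch; `call_model` is a stand-in for your inference client, and the lexical ratio is a crude proxy you might replace with embedding cosine similarity:

```python
import difflib
from typing import Callable

def self_consistency(prompt: str, call_model: Callable[[str], str]) -> float:
    """Score agreement between two independent generations (0.0 to 1.0).

    Run at production temperature; a sustained drop in the average score
    across sampled traffic signals unstable model behavior.
    """
    first = call_model(prompt)   # hypothetical inference client
    second = call_model(prompt)
    return difflib.SequenceMatcher(None, first, second).ratio()
```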
Applying Statistical Process Control to Quality Metrics
Statistical process control was developed for manufacturing processes where you need to distinguish random variation (normal) from assignable causes (problems that require intervention). The core insight applies directly to LLM quality metrics: every metric has natural variance, and you need a principled way to know when observed variance exceeds that baseline.
The three most useful SPC constructs for LLM monitoring:
X-bar control charts plot rolling averages with control limits set at 3 standard deviations from the process mean. For LLM metrics, use a 7-day rolling baseline to establish the mean and standard deviation. A metric crossing the 3σ boundary should trigger an investigation. This is not a threshold you set once — it recalibrates as your traffic patterns evolve.
CUSUM (Cumulative Sum) charts accumulate the deviations from a target value over time. Where X-bar charts detect large sudden shifts, CUSUM is sensitive to small sustained drifts that would stay within 3σ in isolation but compound into a detectable signal over 12-24 hours. Model update regressions often look like this: a 2% quality drop per release that escapes the X-bar threshold but registers in CUSUM within six hours.
EWMA (Exponentially Weighted Moving Average) charts give recent observations more weight than older ones. This is useful for metrics with seasonality (query volume varies by time of day, user behavior varies by weekday) where a flat 7-day average would conflate unrelated variance. EWMA adapts faster to legitimate baseline shifts while still detecting sustained anomalies.
A practical starting point: apply X-bar charts to your refusal rate and schema conformance rate (the two metrics where sudden shifts have the clearest interpretation), and apply CUSUM to your judge quality score (where slow drift is the more common failure mode).
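All three constructs reduce to a few lines of plain Python once each metric is aggregated into fixed buckets (hourly, say). A minimal sketch with illustrative, untuned constants:

```python
import statistics
from typing import Iterable

def xbar_limits(baseline: list[float], sigma: float = 3.0) -> tuple[float, float]:
    """Control limits from a rolling baseline, e.g. 7 days of hourly buckets."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - sigma * sd, mean + sigma * sd

def cusum(values: Iterable[float], target: float, slack: float) -> list[tuple[float, float]]:
    """Two one-sided cumulative sums of deviations beyond `slack` from `target`.

    `slack` (often 0.5 sigma) absorbs noise; a small sustained drift
    accumulates until one side crosses your decision threshold.
    """
    hi = lo = 0.0
    paths = []
    for x in values:
        hi = max(0.0, hi + (x - target - slack))
        lo = max(0.0, lo + (target - x - slack))
        paths.append((hi, lo))
    return paths

def ewma(values: Iterable[float], alpha: float = 0.2) -> list[float]:
    """Exponentially weighted average; higher alpha tracks recent shifts faster."""
    out: list[float] = []
    avg = None
    for x in values:
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        out.append(avg)
    return out
```

For CUSUM, a common convention is to alert when either accumulated side crosses a decision threshold of four to five standard deviations.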
Alert Threshold Design
Fixed thresholds do not work for LLM quality metrics. "Alert when judge score drops below 4.0" sounds concrete but ignores that your baseline judge score may be 4.3 on Tuesday and 3.8 on Sunday because weekend traffic skews toward lower-quality use cases. A Sunday score of 3.9 is normal; a Tuesday score of 3.9 is an incident.
SLO burn rate alerting is more reliable. Define your quality SLO as a minimum acceptable value over a window (e.g., "judge score ≥ 3.5 on 95% of sampled traces per 24-hour period"). Then alert when you are burning through that error budget faster than sustainable:
- Short window (30 minutes): burning >2% of weekly budget → page immediately
- Long window (6 hours): burning >5% of weekly budget → create ticket for morning review
This structure gives you high-sensitivity paging for acute incidents (refusal rate suddenly at 30%) and quiet accumulation tracking for slow drift (quality score declining 0.1 per week).
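A sketch of the two-window check, assuming each sampled trace has already been scored pass/fail against the SLO target and traffic is roughly uniform across the week; the window sizes and burn thresholds mirror the list above:

```python
from dataclasses import dataclass

WEEK_HOURS = 7 * 24.0

@dataclass
class Window:
    failures: int   # sampled traces that missed the SLO target
    total: int      # sampled traces in the window
    hours: float    # window length

def budget_burned(w: Window, slo: float = 0.95) -> float:
    """Fraction of the weekly error budget consumed in this window.

    A burn rate of 1.0 means failing at exactly the budgeted rate.
    """
    if w.total == 0:
        return 0.0
    burn_rate = (w.failures / w.total) / (1.0 - slo)
    return burn_rate * (w.hours / WEEK_HOURS)

def triage(short: Window, long: Window) -> str | None:
    if budget_burned(short) > 0.02:   # >2% of weekly budget in 30 minutes
        return "page"
    if budget_burned(long) > 0.05:    # >5% of weekly budget in 6 hours
        return "ticket"
    return None
```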
The other threshold design failure is triggering alerts without enough context for diagnosis. An alert that says "judge score degraded" is nearly useless. An alert that includes the feature name, affected query categories, current score vs. 7-day baseline, model version, and most recent prompt deployment gives an engineer enough context to identify the cause in minutes rather than hours.
The Feedback Loop That Actually Matters
Continuous monitoring is only valuable if it changes what you do. The loop needs three components:
Trace logging with query categorization. Every sampled trace should be tagged with a feature identifier, query type (extracted by a lightweight classifier), session context, and the quality scores from every evaluation tier that ran on it. This is how you turn "judge score dropped" into "judge score dropped specifically for queries in the 'multi-step calculation' category, starting at 14:00 UTC." (A trace-record sketch follows this list.)
Automated triage to the right owner. Quality degradation in a specific feature should route to that feature's prompt engineer or owner, not to a general alert channel. Most organizations have this routing defined for infrastructure alerts; almost none have it defined for AI quality alerts.
Discovered failures become permanent tests. Every time live monitoring surfaces a failure class that your eval suite did not catch, a new test should be added to the eval suite. This is the mechanism by which your eval suite actually improves over time rather than drifting further from production reality. Teams that skip this step watch their eval suite become less representative every month.
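A sketch of what a tagged trace and its routing can look like; every field name here is hypothetical, but each one is a dimension you will want when localizing a regression:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvalTrace:
    trace_id: str
    feature: str                    # e.g. "invoice-extraction"; drives routing
    query_category: str             # from a lightweight classifier
    model_version: str
    prompt_version: str
    scores: dict[str, float] = field(default_factory=dict)  # tier -> score
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical ownership map; mirror however your org routes infra alerts.
OWNERS = {"invoice-extraction": "#team-billing-ai"}

def route(trace: EvalTrace, summary: str) -> tuple[str, str]:
    """Send quality alerts to the owning team, not a catch-all channel."""
    channel = OWNERS.get(trace.feature, "#ai-quality-unrouted")
    return channel, f"[{trace.feature}/{trace.query_category}] {summary}"
```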
The tooling overhead here is lower than it used to be. Platforms like Langfuse, LangSmith, and LangWatch integrate trace logging, quality scoring, and alerting into a unified interface. The harder problem is organizational: someone needs to own the feedback loop from monitoring to prompt iteration, and that ownership is rarely assigned explicitly.
Failure Modes When Teams Try This
Monitoring the wrong layer. Teams typically start by monitoring latency and error rate — useful, but these are infrastructure metrics, not AI quality metrics. A system that reliably returns a low-quality response in 200ms is not healthy. Add quality-layer metrics first, infrastructure metrics second.
Sampling too aggressively. Running LLM-as-judge on 0.1% of traffic to save cost means you see roughly 100 evaluated traces per day for a mid-traffic system. That sample size is too small to detect a 3% quality drop with statistical confidence (the sample-size sketch after this list makes the arithmetic concrete). A 5% sample gives you enough volume for reliable anomaly detection while keeping judge costs at roughly 5% of inference spend.
Using the same model as judge and subject. If GPT-4o is your production model and your judge, the judge will systematically miss the failure modes that GPT-4o is prone to. Use a different model family for the judge, or use a specialized quality model trained for this purpose.
No baseline period. SPC control limits only make sense if you have a stable baseline to measure against. New systems should run in monitoring-only mode for 7-14 days before any alerts are configured, building a baseline that reflects actual traffic patterns across a full weekday/weekend cycle.
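To make the sampling arithmetic concrete, a standard two-proportion power calculation shows how many judged traces per comparison window a given drop requires; the pass rates below are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def n_per_window(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Traces needed per window to detect a pass-rate shift from p1 to p2
    with a two-sided two-proportion z-test (standard textbook formula)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return int((numerator / (p1 - p2)) ** 2) + 1

# Detecting a pass-rate drop from 95% to 92% needs roughly 1,100 judged
# traces per window; a 0.1% sample yielding ~100 traces/day cannot get
# there, while a 5% sample clears it within a day.
print(n_per_window(0.95, 0.92))
```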
A Minimal Starting Stack
You do not need to build all of this at once. A minimal production-quality monitoring stack looks like:
- Synchronous schema validation on every request, tracking conformance rate per feature. Near-zero cost, immediate incident detection for structural failures.
- Asynchronous refusal rate tracking on 100% of traffic via pattern matching (a sketch follows this list). Cheap, high signal.
- CUSUM chart on judge quality score on 5% of traffic, with a 7-day rolling baseline. Catches slow drift.
- SLO burn rate alert on the judge quality metric, with short and long windows.
- Trace logging with feature tagging for all sampled traces, routed to feature owners.
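A sketch of the refusal tracker from the second bullet; the pattern list is illustrative and should be grown from real refusals observed in your own traces:

```python
import re

# Illustrative refusal markers; extend from refusals seen in your traffic.
REFUSAL = re.compile(
    r"(i can(?:'|no)t (?:help|assist) with"
    r"|i'?m sorry, but"
    r"|i'?m (?:unable|not able) to"
    r"|as an ai(?: language model)?"
    r"|i must decline)",
    re.IGNORECASE,
)

def is_refusal(response: str) -> bool:
    """Cheap enough to run asynchronously on 100% of traffic."""
    return bool(REFUSAL.search(response))
```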
Those five components catch 80% of production quality failures before users report them. The remaining 20% — distribution shift in edge cases, subtle behavioral drift across model versions, failure modes specific to rare query types — requires the more sophisticated signals described above. But those are the problems you have at six months, not the ones you have at launch.
The teams that get this right treat continuous production eval as infrastructure, not as a post-launch nice-to-have. Every week without it is a week of degradation accumulating silently.
