Multi-Session Eval Design: Catching the AI Feature That Gets Worse Over Time
Your AI feature passed every eval at launch. Six weeks in, churn in the cohort that talks to it most has doubled, and your CSAT dashboard shows a flat line that no one can explain. The prompts haven't changed, the model hasn't been swapped, the retrieval index has grown but nobody thinks it's broken. What shipped was fine on turn one. What rots is what happens on turn four hundred, in session seventeen, three weeks after signup.
Most teams' eval suites can't see this failure. They test single-turn accuracy on a fixed dataset, maybe single-session multi-turn if they're ambitious, and then declare the feature shippable. The failure mode that matters — quality that degrades as the system accumulates state about a user — lives in a temporal dimension the eval harness was never built to cover. Researchers call it "self-degradation" in the memory literature: a clear, sustained performance decline after the initial phase, driven by memory inflation and the accumulation of flawed memories. Production engineers call it the reason their retention cohort silently bleeds.
This post is about what a multi-session eval actually needs to look like to catch that class of failure before users churn. The short version: you need synthetic user journeys that span sessions, quality decay curves instead of single-score dashboards, and a regression methodology that compares distributions over time rather than point accuracy on a frozen dataset.
Why Single-Session Evals Systematically Miss This
The default eval stack — fixed prompts, fixed inputs, judge model scores — optimizes for reproducibility. Every eval run starts from a clean context window, runs one turn or one short conversation, and emits a number. That design is load-bearing for comparing model versions, but it encodes an assumption that breaks for any feature with memory: that quality is stationary in time.
It isn't. AI features that accumulate user state have three moving parts that stationary evals can't see:
- The memory store changes. Every session adds, updates, or decays facts. By session thirty, the context retrieved for a query has almost nothing in common with what the session-one eval saw.
- The user changes. Explicit preferences become implicit, correction patterns build up, the user adapts their phrasing to what the model has previously understood.
- The model's effective context changes. The same base model, fed accumulated history, behaves differently — often worse, per the "lost in the middle" result and its successors showing performance falling as context fills.
Benchmarks like LoCoMo and LongMemEval were built partly to surface this. LongMemEval reports long-context LLMs dropping 30–60% on multi-session recall, and even frontier commercial memory systems scoring 30–70% on a simplified cut of the benchmark. Those numbers are from controlled academic settings — production state is messier. If your eval suite doesn't have a temporal axis, you are shipping into that decay and reading a flat dashboard.
The failure is also not just "retrieval gets worse." It's behavioral. Personalization drifts, abstention rates drop (the model stops saying "I don't know" because memory now has a fact-shaped blob that looks relevant), and prior corrections get silently reverted when a later memory overwrites them. None of that shows up in a single-turn BLEU score.
What a Temporal Eval Harness Looks Like
A multi-session eval harness replays a simulated user journey across many sessions and measures how the system's behavior evolves over that journey. It has three components that a standard harness doesn't.
Persona-grounded session generation. Instead of one-off prompts, you define a user: background, goals, constraints, and an event timeline that unfolds over simulated weeks. This is how LoCoMo built its benchmark — LLM-based agents grounded on personas and temporal event graphs, generating dialogues averaging 300 turns across up to 35 sessions. For product evals, you don't need 35 sessions; you need enough to cross the threshold where memory inflation bites, which in practice is somewhere between 8 and 20 sessions for most assistant-style features.
A simulator running the user side. A second LLM plays the user, following the persona and goals, reacting to what your system says. The real discipline here is making the simulator realistic — not a cooperative oracle that answers every clarifying question but a user who sometimes forgets what they told you, restates preferences in contradictory ways, and tests the "knowledge update" behavior that LongMemEval explicitly probes. Confident AI, Langfuse, DeepEval, and most major eval frameworks shipped multi-turn simulators in 2025–2026; the infrastructure is not the hard part anymore.
Checkpoints, not just final scores. The harness must score the system's behavior at multiple points along the journey, not just at the end. You want to measure turn-1 quality, session-5 quality, session-15 quality, on comparable tasks. This is what turns a pass/fail eval into a decay curve.
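Wired together, the three components form a loop like the following minimal sketch. The `system`, `simulate_user`, and `judge` arguments are hypothetical stand-ins for your product surface, the user-side LLM, and the scoring call — nothing here is a real framework API:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    background: str
    goals: list[str]
    # Anchor questions keyed by the checkpoint session at which to ask them.
    anchors: dict[int, list[str]] = field(default_factory=dict)

def run_journey(persona, system, simulate_user, judge, n_sessions=15,
                checkpoints=(1, 5, 15), turns_per_session=6):
    """Replay one persona through n_sessions, scoring at each checkpoint."""
    scores = {}
    for session in range(1, n_sessions + 1):
        for _ in range(turns_per_session):
            user_msg = simulate_user(persona, system.history)
            system.respond(user_msg)  # the system accumulates memory here
        if session in checkpoints:
            # Ask the pre-written anchor questions for this checkpoint and
            # judge the answers; one judged score = one point on the curve.
            answers = [system.respond(q) for q in persona.anchors.get(session, [])]
            scores[session] = judge(persona, answers)
    return scores  # checkpoint index -> quality score: the decay curve
```

Each persona yields one curve; the persona set yields a distribution per checkpoint, which is what the release-comparison step below consumes.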
One tempting shortcut is worth naming and rejecting. Replaying real production conversations to score new prompts sounds appealing but has a fundamental flaw: the conversation trajectory in that log was produced by the old system, not the new one. You can't replay a multi-turn user against a modified model without the simulator, because the user's next message depends on what the system said in the previous turn. Log replay works for single-turn golden traces (the Google/ADK pattern of replaying recorded tool inputs and outputs); it does not work for simulating counterfactual multi-session journeys. Treat logs as fixture sources, not as substitutes for simulation.
Quality Decay Curves Instead of Single Scores
The core output of a multi-session eval is a curve, not a number. On the x-axis: session index, or turn count, or cumulative memory size. On the y-axis: whatever quality metric matters for the task — answer correctness, personalization fit, abstention correctness, contradiction rate. The shape of the curve is the diagnostic.
- A healthy feature's curve is roughly flat or rising — the system gets better as it learns the user.
- A feature with memory inflation shows the characteristic "self-degradation" profile: a plateau or peak in the first few sessions, then sustained decline.
- A feature with catastrophic forgetting drops when something specific is displaced — often visible as a cliff after a knowledge-update turn.
- A feature with memory-poisoning susceptibility shows divergence between cohorts: the cohort whose simulated journey includes a malicious or ambiguous input has a curve that splits from the control cohort and never recovers.
The specific metrics you track on that curve should include:
- Task accuracy on anchor questions. Pre-write questions whose correct answers depend on facts established in early sessions. Score them at multiple later checkpoints. This is your direct measurement of long-range retention.
- Abstention correctness. Fraction of questions the system correctly refuses versus hallucinates an answer to, scored separately for questions where the answer is in memory versus genuinely absent. Abstention is the highest-trust behavior and the first to erode.
- Contradiction rate. Fraction of responses that contradict something the system said to the same user in an earlier session. This is the cheapest failure mode to automate scoring for (two model-generated statements, one judge call) and it correlates strongly with user-reported "the AI keeps forgetting" complaints.
- Personalization improvement rate. The RealPref benchmark's IR metric is a regression coefficient measuring whether alignment with stated preferences improves across the conversation; a flat or declining IR over sessions is a direct red flag.
These metrics are not a scorecard to optimize in aggregate. They are a diagnostic. A flat task accuracy curve with a sharply rising contradiction rate tells you the retrieval is still working but the memory store has developed conflicting entries. A declining accuracy curve with flat contradiction rate tells you memory is falling out, not being corrupted. The curve shapes let you localize.
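One way to mechanize that localization, assuming the checkpoint scores are already collected, is to compare the slopes of the accuracy and contradiction curves. The thresholds and labels below are illustrative placeholders, not recommendations:

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def localize(checkpoints, accuracy, contradiction_rate, eps=0.005):
    """Heuristic mapping from curve shapes to the failure classes above."""
    acc_s = slope(checkpoints, accuracy)
    con_s = slope(checkpoints, contradiction_rate)
    if acc_s >= -eps and con_s > eps:
        # Retrieval still works, but the store holds conflicting entries.
        return "conflicting memory entries"
    if acc_s < -eps and con_s <= eps:
        # Facts are falling out of memory, not being corrupted.
        return "memory loss"
    if acc_s < -eps and con_s > eps:
        return "compounding loss and corruption"
    return "healthy or inconclusive"
```

This is deliberately crude — a shape classifier over two slopes — but it is enough to route a failing nightly run to the right owner.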
Regression Detection Across Releases
Once you have decay curves, the regression question becomes: did this prompt change or model swap make the curve worse? This is harder than comparing two numbers because curves can be noisy and the eval is expensive. A few patterns work in practice.
Fix the persona set and the simulator's seed. Use the same set of simulated users across releases, with deterministic simulator responses (fixed sampling temperature and seed, pre-recorded user utterances where possible). This removes the single biggest source of run-to-run variance. Without it, two runs of the same system will produce different curves and you'll chase ghosts.
Compare distributions, not points. For each checkpoint (session 5, 15, etc.), your eval produces a distribution of scores across the persona set. Compare distributions release-to-release with a nonparametric test — something like a two-sample Kolmogorov-Smirnov or a bootstrap confidence interval on the mean. A mean drop of 0.03 with high variance is noise; a distribution that has shifted left across all personas is a regression.
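A sketch of that comparison, assuming SciPy is available and each checkpoint yields one score per persona:

```python
import numpy as np
from scipy.stats import ks_2samp

def checkpoint_regressed(baseline, candidate, alpha=0.05,
                         n_boot=10_000, seed=0):
    """Flag a regression at one checkpoint by comparing score distributions
    across the persona set, not the two means alone."""
    # Nonparametric shift test: has the whole distribution moved?
    _, p = ks_2samp(baseline, candidate)
    # Bootstrap confidence interval on the mean difference (candidate - baseline).
    rng = np.random.default_rng(seed)
    diffs = (rng.choice(candidate, (n_boot, len(candidate))).mean(axis=1)
             - rng.choice(baseline, (n_boot, len(baseline))).mean(axis=1))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    # Regression = distributions statistically distinguishable AND the entire
    # CI on the mean difference sits below zero.
    return bool(p < alpha and hi < 0)
```

A mean drop whose bootstrap interval straddles zero never fires the gate, which is exactly the "0.03 with high variance is noise" behavior you want.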
Gate on the worst checkpoint, not the average. The most common failure mode multi-session evals catch is a good average score masking a bad late-session tail. If session 1 improves by 5% and session 15 drops by 10%, your aggregate is a wash and you will ship the regression. Set per-checkpoint thresholds, and fail the release if any checkpoint regresses beyond its threshold.
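The gate itself can be a per-checkpoint threshold table; the numbers below are illustrative, not recommendations:

```python
# Maximum tolerated drop in mean score at each checkpoint (session index).
THRESHOLDS = {1: -0.02, 5: -0.03, 15: -0.03}

def release_gate(baseline_means, candidate_means):
    """Fail the release if ANY checkpoint regresses past its own threshold,
    regardless of how the aggregate looks."""
    failures = [
        cp for cp, limit in THRESHOLDS.items()
        if candidate_means[cp] - baseline_means[cp] < limit
    ]
    # A 5% gain at session 1 cannot buy back a 10% loss at session 15.
    return (len(failures) == 0, failures)
```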
Run it on a schedule, not just pre-merge. Multi-session evals are slow (dozens of sessions × many personas × multiple checkpoints × judge calls) and expensive. Running the full suite on every PR is rarely affordable. Most teams end up with a two-tier setup: a fast single-session smoke eval gates merges, and the full multi-session decay suite runs nightly or weekly on main, with regressions triaged the way you'd triage a flaky integration test rather than a commit-blocker. The cost model needs to accept that this eval is capital expenditure against future churn, not a CI gate you run on every change.
Keep the eval's roles on separate models. If the same model generates the personas, drives the simulator, and judges the outputs, you get judge-model collusion: the judge rewards its own family's stylistic fingerprints. The simplest mitigation is using different model families for persona generation, the user simulator, and the judge — even if the product model is a fourth system. This is the same principle as not letting a compiler test itself with its own output, applied to evals.
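A lightweight guard for that separation might look like the following. The model names are placeholders, and the family check assumes a `family-model` naming convention you would adapt to your own registry:

```python
# Role separation: a different model family for each eval role, so no model
# ends up scoring its own stylistic fingerprints. Names are illustrative.
EVAL_ROLES = {
    "persona_generator": "family_a-large",
    "user_simulator":    "family_b-chat",
    "judge":             "family_c-large",
    # The product model under test is a fourth, independent choice.
}

def assert_role_separation(roles):
    """Raise if two eval roles are served by the same model family."""
    families = [model.split("-")[0] for model in roles.values()]
    assert len(set(families)) == len(families), "two eval roles share a family"
```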
What This Catches That Nothing Else Does
A multi-session eval suite earns its cost when it catches a specific class of bug that no other test can see: regressions that only emerge after state accumulates. Some examples of what it surfaces in practice:
- A memory compression change that summarizes too aggressively, so by session 20 the system has lost the user's original preferences and responds as if it's session 1 again.
- A retrieval threshold tune that improves recall on the public benchmark but starts retrieving contradictory entries once the user's memory store exceeds a few hundred items, producing the "my assistant keeps correcting itself" failure.
- A safety-layer change that makes the model more willing to commit to answers, which looks good on abstention benchmarks but catastrophic on the contradiction-rate curve, because every question the system previously declined now gets a confident claim that later sessions can contradict.
- A persona adjustment that improves first-impression CSAT but erodes personalization-over-time — users like turn one more, leave before turn fifty.
None of these get caught by a single-turn eval. All of them correlate with the kind of delayed churn that product teams blame on "market fit" or "onboarding" months later. The multi-session eval gives you the causal thread.
Where to Start
Building this harness from zero is a several-week project. You do not need to do that on day one. The minimum viable temporal eval is five synthetic personas, each played through eight sessions, with anchor questions scored at sessions 1, 4, and 8, and a contradiction-rate check between session 8 and the earlier transcript. That fits in a weekend's work against any mature eval framework, and it will tell you within a week whether your feature has a decay problem worth caring about.
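That minimum viable setup fits in a small config object. The values mirror the text above; `anchors_per_checkpoint` is an assumed knob added here for budgeting judge calls:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MinimalTemporalEval:
    """Minimum viable temporal eval: five personas, eight sessions,
    anchors at sessions 1, 4, and 8, plus one contradiction pass
    comparing session 8 against the earlier transcript."""
    n_personas: int = 5
    n_sessions: int = 8
    anchor_checkpoints: tuple = (1, 4, 8)

    def judge_calls_estimate(self, anchors_per_checkpoint=4):
        # Rough judge-call budget: anchor scoring at each checkpoint,
        # plus one contradiction pass per persona.
        return self.n_personas * (
            len(self.anchor_checkpoints) * anchors_per_checkpoint + 1)
```

At four anchors per checkpoint this is on the order of sixty-five judge calls per run — cheap enough to run nightly from day one.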
If the curve is flat, you've ruled out a class of failure and earned the right to ship with confidence. If the curve bends, you now have a reproducible signal against which every prompt, retrieval, and memory change can be measured. The teams that keep shipping AI features without this are debugging churn data six months later, trying to reconstruct which release broke what. The teams that have it are catching those regressions before the affected cohort finishes its second week.
Multi-session quality is the product, eventually. Build the eval that can see it.
- https://snap-research.github.io/locomo/
- https://arxiv.org/pdf/2410.10813
- https://arxiv.org/abs/2402.17753
- https://arxiv.org/html/2510.27246v1
- https://arxiv.org/html/2510.17281v4
- https://www.producttalk.org/context-rot/
- https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
- https://www.microsoft.com/en-us/security/blog/2026/02/10/ai-recommendation-poisoning/
- https://christian-schneider.net/blog/persistent-memory-poisoning-in-ai-agents/
- https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026
- https://langfuse.com/blog/2025-10-09-evaluating-multi-turn-conversations
- https://deepeval.com/guides/guides-multi-turn-simulation
- https://cloud.google.com/blog/topics/developers-practitioners/a-methodical-approach-to-agent-evaluation
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://arxiv.org/html/2603.07670v1
- https://arxiv.org/html/2603.29194
- https://arxiv.org/html/2511.17208v1
- https://arxiv.org/html/2509.18868v1
- https://arxiv.org/html/2602.14038
