The Demo Account Eval Set Your Sales Team Is Running Without You

Tian Pan · Software Engineer · 10 min read

The most expensive eval set in your company isn't in your repo. It's in a slide deck a sales engineer assembled six months ago, plus three demo accounts named after your top-five logos, plus a half-remembered script that says "click here, ask the agent to summarize last quarter, watch the magic happen." It runs once or twice a week, in front of prospects worth six or seven figures. Nobody on the AI team has ever scored a run.

Then you ship a model migration on a Tuesday. On Thursday at 4 PM, the sales engineer pings the on-call channel: the summary output now starts with "Certainly! Here is a summary…" instead of jumping into the bullet points, the numbers are spelled out instead of digits, and the prospect — a Fortune 500 CFO who scheduled this meeting four weeks ago — just asked whether the product is always this chatty. The release notes called it a 1.2-percentage-point eval lift.

The lift was real. The eval set that measured it just didn't include the most commercially important interaction the product handles all quarter.

Sales demos are an eval set, whether you treat them as one

An eval set is, structurally, a frozen list of inputs with expected behavior, run on demand, and graded. A sales demo is also a frozen list of inputs with expected behavior, run on demand, and graded — except the inputs are walkthrough steps, the expected behavior is "the prospect nods," and the grading happens in a live room where the cost of a regression is a deal that doesn't close.
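
To make the parallel concrete, here is roughly what a demo walkthrough looks like once you write it down as an eval case. This is a minimal sketch; the field names are illustrative, not from any particular eval framework.

```python
from dataclasses import dataclass, field

@dataclass
class WalkthroughStep:
    """One scripted step of a demo, expressed as an eval input."""
    prompt: str                                                  # what the SE types or asks
    must_contain: list[str] = field(default_factory=list)       # hard constraints on the output
    must_not_contain: list[str] = field(default_factory=list)   # e.g. a chatty preamble
    max_chars: int | None = None                                 # "lands in one breath of narration"

@dataclass
class DemoWalkthrough:
    """A frozen list of inputs with expected behavior -- i.e., an eval case."""
    name: str                                  # which prospect or motion it belongs to
    account: str                               # which demo account it runs against
    steps: list[WalkthroughStep] = field(default_factory=list)
```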

Three properties make demo accounts dangerously eval-shaped:

  • They are stable. The same five or six demo accounts get used across every prospect call. The data inside them barely changes. A walkthrough that worked last quarter is expected to work this quarter, because the SE built the entire conversation rhythm around the assumption that it will.
  • They are stylized. Demo data is curated — clean, balanced, picked specifically to show the AI feature off. This is exactly the kind of low-noise, easy-mode input where small behavioral shifts (verbosity, formatting, refusal style) are most visible.
  • They are tested by humans who notice everything. A prospect on a Zoom call is not a customer scrolling past output. They're staring at the screen for thirty seconds at a time. A trailing period, an extra preamble, a hallucinated company name in an otherwise-correct paragraph — these are loud failures in a demo and silent failures in production.

A model that scores well on your offline benchmarks but has slightly drifted on these axes will pass your release gate and fail your sales motion. The aggregate eval score won't catch it because the demo accounts are a vanishingly small fraction of overall traffic. The slice that matters most to the business has zero weight in the rollout decision.

Why your existing golden set doesn't cover this

Most AI teams that have been in production for a year have built one or two golden trace datasets — sampled production traffic, frozen, scored by an LLM judge or a programmatic check. These are real evals and they're necessary. They're also systematically blind to the demo distribution.

The sampling is the problem. Production traffic is dominated by the median user doing the median thing. If a quarter of your daily volume is the same boring summarization request from a thousand low-value accounts, a uniformly sampled golden set is mostly that. The walkthrough your AE runs through with an enterprise prospect — a six-step orchestration involving a CRM lookup, a multi-document summary, a chart generation, a follow-up question, and a final answer that has to land in under fifteen seconds — does not look like the median request. It looks like a stylized choreography that the model has to nail in sequence.

Two more sampling gaps make it worse:

  • The data is different. Demo accounts hold curated, balanced data. Production accounts hold messy, partial, often-empty data. A model that has gotten worse at hallucinating into empty fields might score better on production traffic (where empty fields are common and an "I don't have enough information" response is correct) while scoring worse on demo (where every field is populated and a hedge response feels weirdly evasive).
  • The expectation is different. Production has SLOs around p95 latency, accuracy, refusal rate. Demos have aesthetic expectations: the response should land in one breath of screen-share narration, the formatting should look polished, the tone should be confident without being sycophantic. None of those properties get scored by your existing evals.

Sampling more aggressively doesn't fix this — the demo-shaped traffic isn't underrepresented because of low sample size. It's underrepresented because it's a different distribution that you're not sampling from at all.

The Thursday-afternoon failure mode

The way most teams discover this gap is identical across every company I've seen ship AI features at GTM scale. The sequence:

  1. Model migration ships Tuesday. Offline evals pass with a small lift. Canary on production traffic looks clean over twenty-four hours.
  2. Wednesday: nothing notable.
  3. Thursday afternoon: an SE running a high-stakes walkthrough notices the assistant is now adding a preamble before the structured response, or wrapping JSON in triple backticks that the demo UI doesn't strip, or refusing a step it used to handle, or producing a slightly different chart title that breaks the screenshot in their leave-behind deck.
  4. The SE pings the AI on-call channel. The AI team is now triaging a prod incident with the head of sales CC'd and a $1.4M ARR deal on the line.
  5. The fix is either (a) revert the migration entirely, taking back the 1.2-point lift, or (b) hand-patch the specific walkthrough behavior, taking on a quiet special case that nobody documents.

Either path is expensive. The deeper cost is that the AI team now has a credibility debt with the sales org — every future migration is suspect, every future eval improvement gets met with "but did you test it against the Acme walkthrough?", and each release cycle faces tighter scrutiny than the last.

The thing that would have caught this was a single deterministic run of the demo walkthrough — a five-minute job that the migration should have gated on. Nobody had it, because nobody on either side of the org owned it.

What "owning" the demo eval set actually looks like

The fix is structural, not heroic. It requires turning the demo suite from an oral tradition into a versioned artifact the AI team can grade against. The pattern that holds up across companies has four pieces.

A demo-trace ingestion path that the SE team can run themselves. The sales engineer recording a new demo walkthrough should be one button-press away from contributing the trace to the eval suite. This means a low-friction capture (browser extension, in-product "save this session" affordance, or a Slack bot that takes a session ID), not a Jira ticket to the AI team. If contributing is harder than ten seconds, the SE will not do it, and the suite will stay stale. The contributed trace gets a slice tag: which prospect type, which industry, which feature being showcased.
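
As a rough sketch of the capture side, here is what the ingestion step could look like once that button-press lands somewhere: a session payload written into the suite with its slice tags attached. The function and field names are assumptions, not a prescription.

```python
import json
import time
from pathlib import Path

def ingest_demo_trace(session_id: str, steps: list[dict],
                      prospect_type: str, industry: str, feature: str,
                      suite_dir: str = "demo_eval_suite") -> Path:
    """Save a captured demo session as a versioned, slice-tagged eval trace.

    `steps` is whatever the capture tool recorded: one dict per walkthrough
    step, carrying the prompt sent and the output the SE signed off on.
    """
    trace = {
        "session_id": session_id,
        "captured_at": int(time.time()),
        "slice": {
            "prospect_type": prospect_type,   # which prospect type
            "industry": industry,             # which industry
            "feature": feature,               # which feature being showcased
        },
        "steps": steps,
    }
    out = Path(suite_dir) / f"{session_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(trace, indent=2))
    return out
```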

A pre-release demo regression gate. Before any model or prompt change rolls past canary, the demo suite runs end-to-end with deterministic seeds and the outputs get diffed — both programmatically (regex for format, length thresholds, structural checks on the JSON/schema) and via an LLM judge graded against the previously approved output. If any walkthrough in the suite degrades on a hard-constraint check, the release blocks until the team explicitly accepts the regression or fixes it. This is the same shadow-testing pattern teams already use for general production traffic, sliced by demo cohort instead of user cohort.
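
A minimal sketch of the programmatic half of that gate, assuming each walkthrough trace stores the previously approved output alongside the constraints the SE cares about (field names like `must_match` and `expects_json` are illustrative). The LLM-judge comparison against the approved output is left to whatever judge harness the team already runs.

```python
import json
import re

def check_step(new_output: str, approved: dict) -> list[str]:
    """Hard-constraint checks for one walkthrough step against its approved baseline."""
    failures = []

    # Format drift: e.g. a new preamble pushed in front of the structured answer.
    pattern = approved.get("must_match")          # e.g. r"^\s*[-•]" (starts with a bullet)
    if pattern and not re.search(pattern, new_output):
        failures.append(f"format: output no longer matches {pattern!r}")

    # Verbosity drift: the response should still land in one breath of narration.
    max_chars = approved.get("max_chars")
    if max_chars and len(new_output) > max_chars:
        failures.append(f"length: {len(new_output)} chars > {max_chars}")

    # Structural drift: JSON the demo UI parses must still parse.
    if approved.get("expects_json"):
        try:
            json.loads(new_output)
        except ValueError:
            failures.append("structure: output is no longer valid JSON")

    return failures

def gate_release(walkthrough_results: dict[str, list[str]]) -> bool:
    """Block the rollout if any walkthrough in the suite has a hard-constraint failure."""
    blocked = {name: fails for name, fails in walkthrough_results.items() if fails}
    for name, fails in blocked.items():
        print(f"BLOCKED by {name}: {fails}")
    return not blocked
```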

Slice-aware reporting that surfaces demo health to the business. When the head of sales asks "is the new model safe for the Acme call on Friday?", the answer cannot be "the aggregate eval is up 1.2 points." It has to be "the Acme walkthrough trace passed seven of seven steps with zero hard-constraint violations, and the LLM judge scored it 4.6 vs. 4.7 on tone." This is slice-level reporting, not aggregate-level. The dashboard has to name the deals and the prospects, not the percentile.
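
A sketch of what that report could look like in code, with hypothetical field names; the point is that every line names a walkthrough, not a percentile.

```python
def format_demo_health(results: list[dict]) -> str:
    """Render a slice-level report that names walkthroughs, not aggregates.

    Each result dict is assumed to carry: the walkthrough name, steps passed,
    total steps, hard-constraint violations, and judge scores new vs. previous.
    """
    lines = []
    for r in results:
        lines.append(
            f"{r['walkthrough']}: {r['steps_passed']}/{r['steps_total']} steps, "
            f"{r['hard_violations']} hard-constraint violations, "
            f"judge {r['judge_new']:.1f} vs. {r['judge_prev']:.1f} on tone"
        )
    return "\n".join(lines)

print(format_demo_health([{
    "walkthrough": "acme-cfo-quarterly-summary",
    "steps_passed": 7, "steps_total": 7,
    "hard_violations": 0,
    "judge_new": 4.6, "judge_prev": 4.7,
}]))
```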

Pre-meeting smoke runs as a standing job. The night before any high-stakes demo, the suite runs against the current production model with the prospect's specific walkthrough. If anything diffs from the SE's last sign-off, the AI on-call sees a notification before the meeting, not during. This costs cents per run and saves the kind of Thursday afternoon nobody wants to have.
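
A sketch of that standing job, assuming the traces captured above and whatever replay and alerting hooks the team already has (both injected as callables here, since they are team-specific).

```python
import json
from pathlib import Path
from typing import Callable

def nightly_smoke_run(
    suite_dir: str,
    upcoming_demos: list[str],
    replay: Callable[[str], str],     # sends one step's prompt to the production model
    notify: Callable[[str], None],    # pages the AI on-call channel
) -> None:
    """Replay tomorrow's walkthroughs against production the night before.

    Any divergence from the SE's last signed-off output surfaces before the
    meeting, not during it.
    """
    for name in upcoming_demos:
        trace = json.loads((Path(suite_dir) / f"{name}.json").read_text())
        for step in trace["steps"]:
            new_output = replay(step["prompt"])
            if new_output.strip() != step["approved_output"].strip():
                notify(f"{name}: output diverged from last SE sign-off ahead of tomorrow's demo")
                break
```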

The operational shift behind the technical one

The technical pattern is straightforward. The hard part is the org chart. The demo suite has to have a named owner on the AI team — usually a forward-deployed engineer or a prompt engineer with a foot in both worlds — who is responsible for keeping it fresh, who attends sales kickoffs, and who has a standing weekly with the SE leads. Without that role, the suite drifts: the SE team adds new walkthroughs that don't get ingested, the AI team ships migrations that don't get gated, and the gap reopens within a quarter.

The sales org also has to give up something: the freedom to invent a walkthrough on the fly in front of a prospect. Every walkthrough that matters has to exist as a versioned artifact in the eval suite. SEs who improvise are, in effect, running an untested A/B against a customer. This is a hard cultural ask for a function that prizes adaptability — but the alternative is the Thursday afternoon, recurring once per migration cycle.

The eval team has to give something up too: the comfort of an aggregate score that goes up. Slice-aware reporting means some migrations that look like wins in aggregate get held back because the demo slice didn't budge or got slightly worse. That trade-off is correct — a 1.2-point lift that costs you the Acme renewal is not a lift — but it requires the AI lead to explain a "delayed" launch to leadership in slice terms, which is a muscle most teams haven't built.

The pattern beyond demos

Once you build this discipline for demos, the same shape applies to every other unowned eval set inside the company. The CEO's favorite query that they personally check after every model update. The investor demo that has to look polished on a fundraise call. The partner integration where one specific request pattern dominates the relationship. The internal exec dashboard whose summary feature gets quoted in the weekly business review.

These are all eval slices the AI team didn't write down, doesn't grade against, and finds out about only when someone important is angry. The mature pattern is the same in every case: name the slice, capture the inputs, set an explicit pass criterion, run it before every release, report on it separately. The demo suite is just the most expensive of these to ignore — and the easiest one to start with, because the sales team already has a list of accounts and walkthroughs they're using every week. They've been running your eval set for you. The only thing missing is that you've never asked for the data.
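
One way to start is deliberately simple: a registry that names each slice, points at its captured inputs, and states its pass criterion. The entries below are illustrative, not a real schema.

```python
# A minimal registry of the "unowned" eval slices named above.
# Each entry records where the inputs live, what counts as passing,
# when it runs, and who sees the result.
EVAL_SLICES = {
    "sales-demo-walkthroughs": {
        "traces": "demo_eval_suite/",
        "pass": "zero hard-constraint violations per walkthrough",
        "run": "before every model or prompt release",
        "report_to": "SE leads + AI on-call",
    },
    "ceo-favorite-query": {
        "traces": "exec_eval_suite/ceo_query.json",
        "pass": "format and content match last approved output",
        "run": "before every model update",
        "report_to": "AI lead",
    },
}
```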

If you ship AI features and you have a sales motion, audit it this quarter. Ask your top five SEs which demo accounts they use, which walkthroughs they run, and which model output changes they've quietly worked around. You will find an eval suite hiding in plain sight, and the cost of formalizing it will be smaller than the cost of the next Thursday afternoon.
