
Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

10 min read
Tian Pan
Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

This post is the stack I would build if I had to ship next month, with the weights I would assign on day one and the traps I would watch for.

Why "wait for real traces" is the wrong answer

The steelman of "wait until you have production data" is that synthetic evaluation is biased by whoever wrote it, and nothing beats the messy reality of actual users. The steelman is correct about bias and wrong about the conclusion. Real traces are not free either — they come from users who are often paying you, whose complaints are public, and whose churn you cannot undo. The cost of discovering a failure mode from a production trace is the user experiencing that failure plus all the downstream ones you haven't caught yet. Synthetic evaluation is noisy; production discovery is expensive. You are trading precision for blast radius.

The second problem with waiting is that the first few weeks of production traces are almost always skewed by self-selection. Your early adopters are not your median user. They will find clever ways to use the feature, forgive rough edges, and write prose feedback in your Slack channel that feels like signal but is actually one power user's preferences amplified. Treating their traces as the ground truth bakes that skew into every iteration that follows. By the time you realize the eval loop has been optimizing for a narrow cohort, you have a model tuned for ten users and confusing to the other ten thousand.

The third problem is the least visible and the worst. Teams that ship without an eval stack tend to build one retroactively, on top of a model they've already optimized against internal intuition. The evals that emerge are downstream of the model's current behavior — they encode "don't regress from here" rather than "reach the bar we actually cared about." Cold-start evaluation is your one chance to write the rubric before the artifact distorts it.

The four layers of a cold-start eval stack

The stack has four layers, and each one answers a different question. Confusing them — or relying on just one — is how teams end up with evals that look thorough and still miss the failures that matter.

Employee dogfooding with structured tasks. Dogfooding is almost always present and almost always run badly. The fix is structure. A structured dogfooding cohort does not mean "use the feature and tell us what you think." It means a fixed list of tasks, written in advance, that every participant attempts and reports against a rubric. The tasks should cover the golden path and at least two categories of edge cases you suspect will trip the model. The outputs go into a shared spreadsheet with binary pass/fail labels and a one-line note on the failure mode. You are not trying to score the model — you are trying to surface the failure taxonomy, which is the input to the next three layers.
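A minimal sketch of what "structured" can mean in practice, assuming a plain CSV stands in for the shared spreadsheet; the task list and field names are illustrative, not a prescribed schema — the load-bearing parts are the pre-written tasks, the binary label, and the one-line failure note.

```python
# structured_dogfood.py -- a minimal harness for structured dogfooding.
# Assumptions: tasks live in a small in-code list; results append to a CSV
# that plays the role of the "shared spreadsheet" described above.
import csv
from dataclasses import dataclass, asdict

@dataclass
class DogfoodTask:
    task_id: str
    instruction: str   # what the participant attempts, written in advance
    category: str      # "golden_path" or a named edge-case category

@dataclass
class DogfoodResult:
    task_id: str
    participant: str
    passed: bool       # binary, no partial credit
    failure_mode: str  # one-line note; empty when passed

TASKS = [
    DogfoodTask("T01", "Ask the assistant to summarize last week's activity", "golden_path"),
    DogfoodTask("T02", "Ask about a week with no data at all", "missing_data"),
    DogfoodTask("T03", "Paste a 5,000-word document and ask for one sentence", "long_input"),
]

def record_results(results: list[DogfoodResult], path: str = "dogfood_results.csv") -> None:
    """Append results so the failure taxonomy accumulates across sessions."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task_id", "participant", "passed", "failure_mode"])
        if f.tell() == 0:
            writer.writeheader()
        for r in results:
            writer.writerow(asdict(r))

if __name__ == "__main__":
    record_results([
        DogfoodResult("T02", "alice", False, "hallucinated a summary for the empty week"),
    ])
```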

Scenario-based simulation with persona variation. Once you know what kinds of failures to look for, you need volume. This is where synthetic personas earn their keep. A persona is a reproducible profile — user attributes, history, goals, prior context — that you can instantiate on demand and feed through the feature. The WHOOP team shipped a memory agent using personas like "an IN_THE_GREEN member with 15 Green Recoveries above 80%"; Algolia generates synthetic queries from customer records when a tenant has no search logs. Both patterns work because a persona is a cheap way to hold user state constant while varying the prompt, or vice versa. A simulator with 20 personas and 50 scenarios per persona produces 1,000 eval runs you can regenerate every time the prompt changes — which is the primary thing dogfooding cannot give you.
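A sketch of the persona × scenario cross product, assuming `call_feature` is a placeholder for whatever entry point your feature exposes; the persona fields and scenario strings are illustrative, not the WHOOP or Algolia schemas. The point is that the full run set is cheap to regenerate on every prompt change.

```python
# persona_sim.py -- regenerate the full eval run set whenever the prompt changes.
# `call_feature` is a stub for your feature's entry point; everything else is plain Python.
import itertools
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Persona:
    persona_id: str
    attributes: dict = field(default_factory=dict)  # user state held constant across a run
    history: list = field(default_factory=list)     # prior context fed to the feature

def call_feature(prompt: str, persona: Persona) -> str:
    raise NotImplementedError("wire this to your model / agent endpoint")

def build_runs(personas: list[Persona], scenarios: list[str]) -> list[dict]:
    """20 personas x 50 scenarios -> 1,000 reproducible eval runs."""
    runs = []
    for persona, scenario in itertools.product(personas, scenarios):
        runs.append({
            "persona": asdict(persona),
            "scenario": scenario,
            # filled in when the run executes:
            # "output": call_feature(scenario, persona),
        })
    return runs

if __name__ == "__main__":
    personas = [Persona(f"p{i:02d}", {"tier": "standard"}) for i in range(20)]
    scenarios = [f"scenario {j:02d}" for j in range(50)]
    runs = build_runs(personas, scenarios)
    print(len(runs), "runs")  # 1000
    with open("eval_runs.json", "w") as f:
        json.dump(runs, f, indent=2)
```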

Expert annotators on a small but diverse seed set. Dogfooding and simulation both rely on your team's intuition about what "good" looks like. That intuition is wrong in specific ways that only show up when a domain expert reads the outputs. A seed set of 50-100 carefully chosen inputs, reviewed by one or two experts against an explicit rubric, gives you a high-signal anchor for every other layer. The seed set is where the rubric itself gets refined — you will find that your pass/fail criteria don't survive contact with real edge cases, and you will fix them. Eugene Yan's three-step guide converges on this number for a reason: it's small enough to review thoroughly, large enough to catch most systematic failures, and cheap enough to redo when the rubric evolves.
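One way to keep the seed cases and the rubric they were judged against in the same artifact, so a rubric change forces a re-review; the criteria names below are placeholders for whatever your experts actually write, and the JSONL format is an assumption, not a requirement.

```python
# seed_set.py -- 50-100 expert-reviewed cases stored alongside the rubric version
# they were judged against.
import json
from dataclasses import dataclass, asdict, field

RUBRIC_VERSION = "v2"  # bump whenever a criterion changes; stale cases get re-reviewed

@dataclass
class SeedCase:
    case_id: str
    input_text: str
    expert_label: str                  # "pass" / "fail"
    rubric_version: str = RUBRIC_VERSION
    criteria: dict = field(default_factory=lambda: {
        # illustrative criteria -- replace with your experts' rubric
        "factually_grounded": None,
        "follows_user_constraint": None,
        "tone_appropriate": None,
    })
    expert_note: str = ""

def save_seed_set(cases: list[SeedCase], path: str = "seed_set.jsonl") -> None:
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(asdict(case)) + "\n")

if __name__ == "__main__":
    save_seed_set([
        SeedCase("S001", "Summarize a week with conflicting data sources", "fail",
                 expert_note="picked one source silently instead of flagging the conflict"),
    ])
```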

Adversarial probe library borrowed from public benchmarks. Your users are not adversarial on day one, but your product surface is adversarial from the moment it has an input box. Prompt injection, jailbreaks, policy-violation requests, context leakage across tenants — none of these require a malicious user to matter, because benign inputs routinely trigger them by accident. The good news is you do not have to write these probes from scratch. HarmBench, JailbreakBench, AdvBench, and the Promptfoo red-team corpus are public, curated, and maintained. Pull the subset that matches your deployment surface, run it as a non-negotiable gate, and re-run it on every prompt change. The cost is low and the floor it enforces is real.
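A sketch of the gate itself, assuming you have already pulled the relevant probe subset from one of those corpora into a local JSONL (each corpus ships its own format, so the field names here are assumptions) and that `is_safe_response` is whatever refusal or policy check your team trusts.

```python
# adversarial_gate.py -- run the probe subset as a hard pre-launch gate.
# Assumptions: probes were exported to a local JSONL, one probe per line, with a
# "prompt" field; call_feature and is_safe_response are stubs for your own system.
import json
import sys

def call_feature(prompt: str) -> str:
    raise NotImplementedError("wire this to your model / agent endpoint")

def is_safe_response(probe: dict, response: str) -> bool:
    raise NotImplementedError("refusal / policy-violation check of your choosing")

def run_gate(probe_path: str = "probes.jsonl", pass_threshold: float = 1.0) -> bool:
    probes = [json.loads(line) for line in open(probe_path)]
    assert probes, "no probes loaded"
    failures = []
    for probe in probes:
        response = call_feature(probe["prompt"])
        if not is_safe_response(probe, response):
            failures.append(probe.get("id", probe["prompt"][:60]))
    pass_rate = 1 - len(failures) / len(probes)
    print(f"adversarial gate: {pass_rate:.1%} pass ({len(failures)} failures)")
    return pass_rate >= pass_threshold

if __name__ == "__main__":
    sys.exit(0 if run_gate() else 1)  # non-zero exit blocks the launch pipeline
```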

Weighting day-1 decisions so the loudest user doesn't win

Here is the trap almost every team walks into. The CEO tries the feature, finds one thing that annoys them, sends a message in Slack, and suddenly the top of the prompt has a workaround for that specific case — wedged in with no context for anyone who wasn't in the conversation. The eval score ticks up on the dogfooding layer and down on the seed set, but nobody reads the seed set that week because the CEO is happy. Repeat this for three weeks and your prompt is scar tissue of executive annoyances and your scenario sim has silently drifted from the product vision.

The mechanism that prevents this is explicit weighting and a written rule about which layer wins when they disagree. One workable set of defaults: expert seed set is the ground truth for correctness, scenario sim is the ground truth for coverage, adversarial probes are a hard gate with a pass threshold, dogfooding is qualitative and feeds the failure taxonomy but does not set the numeric score. When the dogfooding layer and the expert seed set disagree, the seed set wins and the dogfooding observation becomes a candidate addition to the seed set — pending expert review. When expert seed and scenario sim disagree, the sim is regenerated with persona adjustments that reflect the expert's concern.
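Here is what a checked-in version of those defaults might look like; the numbers and the tie-break notes simply encode the paragraph above and are a starting point to argue with, not a recommendation from any of the teams cited earlier.

```python
# eval_weights.py -- lives in the same repo as the eval harness, reviewed at the
# same cadence as the prompts. Numbers are day-1 defaults, not measurements.
LAYER_WEIGHTS = {
    "expert_seed_set": 0.6,  # ground truth for correctness
    "scenario_sim":    0.4,  # ground truth for coverage
    "dogfooding":      0.0,  # qualitative; feeds the failure taxonomy, not the score
    # adversarial probes are a hard gate, not a weighted term
}
ADVERSARIAL_PASS_THRESHOLD = 1.0

def overall_score(layer_scores: dict, adversarial_pass_rate: float) -> float:
    """Hard-gate on adversarial probes, then combine the weighted layers."""
    if adversarial_pass_rate < ADVERSARIAL_PASS_THRESHOLD:
        return 0.0
    total = sum(LAYER_WEIGHTS.values())
    return sum(w * layer_scores.get(k, 0.0) for k, w in LAYER_WEIGHTS.items()) / total

# The written rule for disagreements, so nobody has to reconstruct it from Slack:
DISAGREEMENT_RULES = {
    ("dogfooding", "expert_seed_set"):
        "seed set wins; dogfooding observation becomes a candidate seed case pending expert review",
    ("expert_seed_set", "scenario_sim"):
        "regenerate the sim with persona adjustments that reflect the expert's concern",
}
```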

The second mechanism is to publish the weights. Put them in the same repo as the eval harness, review them at the same cadence as the prompts. Teams that skip this step end up with informal weights that track whoever complained most recently. Teams that do it force the actual conversation about what "good" means, and that conversation is the real product of a cold-start eval program.

When the first real traces arrive, don't throw the synthetic stack away

Once production traffic starts flowing, the instinct is to pivot the entire eval program onto real traces because "finally, real data." Resist. The traces you get in week one are biased toward early adopters and biased against users who tried the feature once, hit a rough edge, and never came back. The synthetic stack is still your best defense against that bias, because it was explicitly designed to cover scenarios your traffic distribution underweights.

The integration pattern that works is additive. Mine traces for new personas and new scenarios, feed them into the simulator, keep the simulator's size growing. Sample traces for the seed set, have experts label them, grow the seed set. Cross-check adversarial probes against actual malformed inputs you've observed, add the novel ones to the probe library. The synthetic layers get richer over time, they do not get replaced. The day you cut them is the day your eval quality quietly starts tracking your traffic mix instead of your product intent.
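A rough sketch of that additive loop, assuming traces arrive as dicts with whatever fields your logging already captures; the mining heuristics are placeholders, and the only real rule encoded here is that the synthetic sets grow rather than get replaced.

```python
# trace_integration.py -- grow the synthetic layers from production traces.
# The trace schema and selection heuristics below are illustrative only.
import json

def mine_new_scenarios(traces: list[dict], existing_scenarios: set[str]) -> list[str]:
    """Keep only prompts the simulator hasn't seen; the scenario set only grows."""
    return [t["prompt"] for t in traces if t["prompt"] not in existing_scenarios]

def sample_for_seed_set(traces: list[dict], k: int = 20) -> list[dict]:
    """Prioritize traces with negative feedback, then backfill; experts label them, never auto-label."""
    flagged = [t for t in traces if t.get("user_feedback") == "thumbs_down"]
    rest = [t for t in traces if t not in flagged]
    return (flagged + rest)[:k]

if __name__ == "__main__":
    traces = [json.loads(line) for line in open("week1_traces.jsonl")]
    existing = {"scenario 00", "scenario 01"}  # in practice, load from the simulator's run set
    print(len(mine_new_scenarios(traces, existing)), "new scenarios to add to the simulator")
    print(len(sample_for_seed_set(traces)), "traces queued for expert labeling")
```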

The forward-looking version of this: treat your eval stack as a data asset with roughly the same weight as your prompts or your retrieval index. The stack compounds, the stack gates launches, the stack is what lets you answer "is the new model better" without relitigating the whole evaluation every time. Cold-start is just the first day of owning it.

What to ship first if you only have a week

If the timeline is compressed and you can only build one layer before launch, build the expert seed set with 50 cases and a written rubric. It's the cheapest, highest-signal artifact, and it anchors everything you add later. If you have two weeks, add the adversarial probe library — pull a subset of HarmBench or the Promptfoo red-team corpus, run it as a pre-launch gate, call it done. If you have three weeks, add structured dogfooding with a written task list. The scenario simulator with personas is the layer that pays off most over time but takes the longest to build correctly, so it comes last in a compressed timeline and first in a proper one.

The meta-point is that cold-start evaluation is not a scaled-down version of post-launch evaluation. It is the thing that determines whether post-launch evaluation has anything useful to compare against. Build the stack before you need it, weight the layers explicitly, and refuse to let the first week of customer escalations be your eval dataset. The teams that get this right ship faster, not slower, because every iteration after launch has a measuring stick that was built for the product rather than for whatever traffic happened to show up.
