Sampling Drift: When Temperature and Top-P Become Tribal Knowledge
Open the production config of any AI feature that has been live for more than a year and you will find an archaeological dig site. temperature: 0.7 because someone needed the demo to feel less robotic. top_p: 0.85 because a customer once complained about a response that wandered off into the weeds. frequency_penalty: 0.4 because there was a bad week in 2024 where a now-retired model kept repeating itself. None of these decisions are documented. None of them have been re-tested against the current foundation model. They run on every request, in every eval, in every A/B, shaping behavior nobody has consciously chosen since the original ticket got closed.
This is sampling drift. It is the slow accumulation of expedient sampler tweaks whose original justifications evaporate while their effects compound. The values in your config are not "tuned" — they are a fossil record of past incidents, scaled to the volume of your current traffic.
The reason it is invisible is structural. Every eval you run scores against the current sampling config, so the headline number always looks fine. There is no alarm that fires when a temperature value is two foundation-model versions out of date. There is no calendar invite that says "re-grid sampling parameters this quarter." The decay is silent until somebody runs a clean experiment and finds a quality lift, a token reduction, or both, sitting in plain sight at no engineering cost.
The fossil record in your config
To see how sampling drift forms, watch the lifecycle of a single value. A senior engineer demos an early prototype. The output reads as stiff. They bump temperature from the default 1.0 to 0.7 and the demo lands. The PR title is "make demo less robotic." The value ships.
Three months later, a customer reports a response that wandered into odd, low-probability phrasing. Someone reaches for top_p, lowers it from 1.0 to 0.85 to clip the unreliable tail of the distribution, ships the change, closes the ticket. Nobody re-runs the eval against the new value because eval scores were not the complaint.
A quarter after that, an incident postmortem identifies repetition in a long-form output. Someone adds frequency_penalty: 0.4. The model that produced the repetition is deprecated six weeks later. The penalty stays.
Eighteen months in, the team upgrades the foundation model. The inherited sampling config — temperature 0.7, top_p 0.85, frequency_penalty 0.4 — was tuned against the previous model's quirks. The new model has different logit dynamics, different default repetition behavior, often a different recommended sampler entirely. As recent reference guides note, provider-recommended defaults vary substantially across the current generation, and a value that was good for one model is unlikely to be optimal for the next.
Nothing in this story is reckless. Each decision was rational at the time. The drift is what happens when rational decisions accumulate without a process for revisiting them.
Why eval suites do not catch it
The seductive thing about a sampling config is that the eval suite never complains about it. Evals score against whatever sampling parameters you happen to be running, so any drift bakes itself into the baseline. The same trajectory holds whether the eval runs daily or quarterly: the score reflects the current config, the current config reflects accreted history, and the history has no advocate.
This is structurally different from a prompt regression, which surfaces the moment somebody changes the prompt and re-runs the eval. A sampling parameter that was chosen for a model two versions ago is invisible to the eval because the eval has no memory of what the parameter was when it was chosen, what it was supposed to fix, or whether the thing it was fixing is still a problem.
There is one experiment that breaks the spell: hold the prompt and the model fixed, vary the sampler, re-grid temperature and top-p across a small grid (say, four temperature values between 0.3 and 1.0 crossed with four top-p values between 0.8 and 1.0), and re-score. Most teams who run this experiment for the first time discover the inherited point is not a local optimum. The same experiment also frequently shows that one or both of the older penalty parameters can be removed entirely with no quality loss, because recent frontier models have largely solved the repetition problems the penalties were originally introduced to mitigate, and the penalties now occasionally suppress useful repetition, like the second occurrence of a precise technical term.
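In code, the whole experiment is a nested loop. A minimal sketch, assuming an OpenAI-style client and a score() function that wraps your existing eval; the grid values, model identifier, and function names are illustrative, not recommendations:

```python
import itertools

from openai import OpenAI

client = OpenAI()

# Illustrative grid values; use whatever range brackets your inherited point.
TEMPERATURES = [0.3, 0.5, 0.7, 1.0]
TOP_PS = [0.8, 0.9, 0.95, 1.0]

def sample(prompt: str, temperature: float, top_p: float) -> str:
    # Hold the model and the prompt fixed; only the sampler varies.
    resp = client.chat.completions.create(
        model="gpt-4o",  # pin whatever model production actually runs
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return resp.choices[0].message.content

def regrid(prompts: list[str], score) -> dict:
    # score() is assumed: it runs your existing eval over the outputs
    # and returns a single number, unchanged from your normal pipeline.
    results = {}
    for t, p in itertools.product(TEMPERATURES, TOP_PS):
        outputs = [sample(q, t, p) for q in prompts]
        results[(t, p)] = score(outputs)
    return results
```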
The experiment is cheap. It is not run because nobody owns it.
What disciplined sampling looks like
The pattern that scales is to treat the sampling config as a first-class artifact governed by the same discipline as the prompt. That means four things, none of them exotic.
First, rationale alongside values. Every non-default sampling parameter ships with a structured comment that names the model version it was tuned against, the eval delta it produced, and the failure it was fixing. "temperature: 0.7 // chosen 2024-09 against gpt-4o-2024-08-06; +3pp on conciseness eval; replaces default 1.0" is a rationale. "temperature: 0.7" is folklore. The rationale is the artifact future engineers use to decide whether to keep the value.
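In a config that is code, the structured comment can be literal structure. A minimal sketch as a Python dict; the field names are illustrative, not a published schema:

```python
# Every non-default value carries its own provenance: the model it was
# tuned against, the eval delta it produced, and the failure it was fixing.
SAMPLING = {
    "temperature": {
        "value": 0.7,
        "tuned_against": "gpt-4o-2024-08-06",
        "tuned_on": "2024-09",
        "eval_delta": "+3pp on conciseness eval vs default 1.0",
        "fixes": "outputs read as stiff in the demo flow",
    },
}
```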
Second, versioning the triple, not the prompt alone. The unit of behavior in an LLM application is the model-plus-prompt-plus-sampling triple. A team that versions only the prompt is shipping behavior they cannot reproduce. Production prompt platforms increasingly bundle the sampling config into the version artifact so that a rollback rolls everything back, and so that an eval result is bound to a specific triple rather than to "the prompt as of yesterday."
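A sketch of the triple as a single artifact, under an illustrative schema rather than any particular platform's:

```python
import hashlib
import json
from dataclasses import dataclass

# One artifact pins model, prompt, and sampler together, so a rollback
# rolls all three back and an eval result binds to a specific fingerprint.
@dataclass(frozen=True)
class BehaviorVersion:
    model: str      # pinned identifier, e.g. "gpt-4o-2024-08-06"
    prompt: str     # the full prompt template text
    sampling: dict  # temperature, top_p, penalties, each with its rationale

    @property
    def fingerprint(self) -> str:
        payload = json.dumps(
            {"model": self.model, "prompt": self.prompt, "sampling": self.sampling},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Eval results are then recorded against a fingerprint, not against "the prompt as of yesterday."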
Third, a sampling audit on a calendar. Quarterly is fine. The audit is a re-grid against the current model and the current eval, with a written conclusion: which parameters are still optimal, which can be removed, which need re-tuning. The output is two PRs — one removing dead parameters, one updating live ones — and a one-page note that becomes the rationale for the next audit. If the foundation model changes more often than the calendar does, run the audit at every model change as well, because that is the moment the inherited sampling config is least likely to still be correct.
Fourth, a regression guard that pages on undocumented changes. If a sampling parameter changes without an accompanying rationale update and an eval delta in the same PR, the CI check fails. This sounds bureaucratic and is the cheapest mechanism in the bundle, because the sampling config is small enough that the false-positive rate is essentially zero.
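A sketch of the guard, assuming the config lives in a sampling.json whose entries carry the rationale fields from above; the file name and diff mechanics are illustrative, and a fuller version would also check for an eval delta in the same PR:

```python
import json
import subprocess
import sys

def load_versions(path: str) -> tuple[dict, dict] | None:
    """Return (main, branch) copies of the config, or None if it is new."""
    old = subprocess.run(
        ["git", "show", f"origin/main:{path}"], capture_output=True, text=True
    )
    if old.returncode != 0:
        return None  # file does not exist on main; a separate rule covers that
    with open(path) as f:
        new = json.load(f)
    return json.loads(old.stdout), new

def rationale(entry: dict) -> dict:
    # Everything except the value counts as rationale: tuned_against,
    # eval_delta, fixes, and so on, as in the structured comment above.
    return {k: v for k, v in entry.items() if k != "value"}

def main() -> int:
    versions = load_versions("sampling.json")
    if versions is None:
        return 0
    old, new = versions
    failures = 0
    for key, entry in new.items():
        prev = old.get(key, {})
        if entry.get("value") != prev.get("value") and rationale(entry) == rationale(prev):
            print(f"{key}: value changed without a rationale update")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```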
The model upgrade trap
The single highest-leverage moment to catch sampling drift is a foundation model upgrade. It is also the moment most teams handle worst.
The dominant pattern is to swap the model identifier, run the eval, see the score did not regress, and ship. The sampling config — tuned against the previous model — comes along for the ride untouched. For the next six months, the application runs on a sampler optimized for behavior the underlying model no longer exhibits.
This is rarely catastrophic. It is mostly a slow leak. A temperature value that was right for the previous model produces outputs slightly more cautious or more formulaic than the new model would otherwise give. A frequency penalty that was right for the previous model occasionally suppresses a useful repetition. A top-p value that was right for the previous model clips a tail that the new model uses well. None of these effects show up in the eval because the eval is running against the same sampler.
The treatment is to make the sampler re-tune part of the model upgrade checklist, not a separate project. When the model identifier changes in the config, the PR template should require either a re-gridded sampler or an explicit "kept inherited values, see audit log" note with a link to the audit that justifies it. Provider documentation makes the underlying point explicit: Azure OpenAI's reproducibility guidance notes that the system fingerprint changes with backend updates, and a backend update belongs to the same class of behavioral event as a model upgrade.
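Extending the CI guard from earlier, the upgrade-time rule might look like the following; the artifact path and field names are illustrative:

```python
import os

# When the pinned model changes, require either a fresh re-grid artifact or
# an explicit link to the audit that justifies keeping the inherited sampler.
def check_model_upgrade(old: dict, new: dict) -> int:
    if old.get("model") == new.get("model"):
        return 0
    regridded = os.path.exists("evals/regrid_report.json")
    audited = bool(new.get("inherited_sampler_audit"))  # URL of the audit note
    if regridded or audited:
        return 0
    print("model changed: re-grid the sampler or link the audit that keeps it")
    return 1
```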
A related discipline: stop tuning two knobs that interact. The provider-recommended pattern is to tune temperature or top-p, not both. The interaction is hard to reason about, and a config that adjusts both is a config that no successor engineer can debug six months later. Pick one. Document the pick in the rationale.
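The pick-one rule is small enough to lint. A sketch, assuming the structured config from above and defaults of 1.0 for both parameters, which is common across providers but worth confirming against your provider's documentation:

```python
# Reject configs that move both temperature and top_p off their defaults.
def check_pick_one(sampling: dict) -> int:
    temperature = sampling.get("temperature", {}).get("value", 1.0)
    top_p = sampling.get("top_p", {}).get("value", 1.0)
    if temperature != 1.0 and top_p != 1.0:
        print("both temperature and top_p are non-default; pick one to tune")
        return 1
    return 0
```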
Sampling parameters are part of the contract
The architectural realization is that sampling parameters are not "just hyperparameters." They are part of the behavioral contract your application makes with its users. The model determines what is possible. The prompt determines what is requested. The sampler determines which of the model's possible responses, given the prompt, the user actually sees.
Three knobs. Three independent contributors to behavior. A team that versions one of them with rigor (the prompt), one of them with discipline (the model identifier, usually pinned), and one of them with shrugs (the sampler) is shipping behavior they only partially control. When something looks off in production and the team rolls back the prompt and the regression persists, the sampler is the silent third axis that nobody thought to inspect.
The fix is not glamorous. It is a rationale comment, a versioned triple, a quarterly audit, a CI guard, and a model-upgrade checklist that includes the sampler. None of these are research problems. They are operational problems that compound silently until somebody decides to stop letting them.
The teams that do this consistently are not the ones with the cleanest sampler configs. They are the ones with the shortest distance between "this value exists" and "here is why it exists, against which model, with what eval delta." Sampling drift is what fills that distance. Closing it is the work.
- https://sureprompts.com/blog/llm-temperature-sampling-complete-guide-2026
- https://amitray.com/llm-parameters-temperature-top-p-top-k-guide/
- https://www.ibm.com/think/topics/llm-temperature
- https://www.promptingguide.ai/introduction/settings
- https://medium.com/@wasowski.jarek/temperature-0-0-generates-48x-more-repetition-loops-than-1-0-sampling-strategies-f0b8d7a3c850
- https://www.vellum.ai/llm-parameters/temperature
- https://muxup.com/2025q2/recommended-llm-parameter-quick-reference
- https://www.getmaxim.ai/articles/prompt-evaluation-frameworks-measuring-quality-consistency-and-cost-at-scale/
- https://lakefs.io/blog/toggle-openai-model-determinism/
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/reproducible-output
- https://www.keywordsai.co/blog/llm_consistency_2025
- https://mbrenndoerfer.com/writing/repetition-penalties-language-model-generation
- https://arxiv.org/html/2504.20131v2
- https://www.vellum.ai/llm-parameters/frequency-penalty
- https://docs.vllm.ai/en/v0.6.0/dev/sampling_params.html
