Sampling Parameters in Production: The Tuning Decisions Nobody Explains
Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults shape every response your model produces, and tuning them wrong causes output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.
This isn't an introductory explainer. If you're running LLMs in production—for extraction pipelines, code generation, summarization, or any output that feeds into real systems—these are the mechanics and tradeoffs you need to understand before you can tune intelligently.
What These Parameters Actually Do
Temperature, top-p, and top-k each operate at a different point in the same sampling pipeline: raw logits come out of the model, get filtered and reshaped, and then a token is drawn from the resulting probability distribution.
Temperature scales the logits before converting them to probabilities via softmax: P(token) = softmax(logits / temperature). A value below 1.0 sharpens the distribution—likely tokens become more likely, unlikely tokens become nearly impossible. A value above 1.0 flattens it—probabilities become more uniform, and the model explores a broader vocabulary. Temperature controls how peaked vs. diffuse the distribution is.
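The scaling is easy to see on a toy distribution. A minimal sketch (the logits below are invented for illustration, not taken from any real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for three tokens

p_sharp = softmax(logits / 0.5)  # temperature < 1: distribution sharpens
p_base  = softmax(logits / 1.0)  # temperature = 1: raw softmax
p_flat  = softmax(logits / 1.5)  # temperature > 1: distribution flattens
```

With these logits the top token's probability moves from roughly 0.66 at temperature 1.0 to about 0.86 at 0.5 and about 0.56 at 1.5. Note the ranking never changes; only the shape does.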
Top-p (nucleus sampling) works after temperature is applied. It sorts tokens by probability, walks down the ranked list, and keeps only the smallest set of tokens whose cumulative probability reaches the threshold p. Everything else is discarded. If p=0.95, you're sampling from whichever tokens collectively account for 95% of the probability mass. The key behavior: when the model is confident, the nucleus is small (a few tokens cover 95%). When it's uncertain, the nucleus expands (many tokens share probability mass).
Top-k is simpler and less adaptive: keep exactly the k highest-probability tokens, regardless of their actual probabilities. If k=50, you always sample from 50 candidates. The limitation is that a flat distribution at rank 50 still gets included even if it represents noise.
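The adaptivity difference is concrete. A small sketch with two made-up distributions (a nucleus helper for illustration, not any provider's implementation):

```python
import numpy as np

def nucleus(probs, p):
    # smallest set of top-ranked tokens whose cumulative probability reaches p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    return order[: np.searchsorted(cum, p) + 1]

confident = np.array([0.90, 0.05, 0.02, 0.02, 0.01])  # peaked distribution
uncertain = np.array([0.22, 0.21, 0.20, 0.19, 0.18])  # nearly flat distribution

small = len(nucleus(confident, 0.95))  # nucleus shrinks when confident
large = len(nucleus(uncertain, 0.95))  # nucleus expands when uncertain
```

Top-p keeps 2 tokens in the confident case and all 5 in the uncertain one; top-k with k=3 would keep exactly 3 in both, blind to the difference.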
Repetition/frequency/presence penalties are applied before the truncation step. Repetition penalty divides positive logit values for previously-seen tokens by the penalty factor (common implementations multiply negative logits by it instead, so both directions push the token down). Frequency penalty subtracts proportionally to how often a token has appeared. Presence penalty applies a flat deduction to any token that appeared at all. The penalty type matters more than the value: presence penalty is aggressive and binary (saw it once = penalized forever), which breaks natural language that depends on pronouns, articles, and common words.
The order of operations—penalties, then truncation, then sampling—is not arbitrary. Applying penalties after truncation would mean some tokens get penalized but then included anyway because the truncation set was determined before penalty adjustments.
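Put together, one sampling step looks like the following sketch (NumPy, with a frequency-style penalty; all parameter values are illustrative, not recommendations):

```python
import numpy as np

def sample_step(logits, seen_counts, temperature=0.7, top_p=0.9,
                freq_penalty=0.3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)        # seeded for repeatability
    logits = logits - freq_penalty * seen_counts   # 1. penalties first
    probs = np.exp(logits / temperature)           # 2. temperature, then softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # 3. truncate to the nucleus
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(rng.choice(len(filtered), p=filtered))  # 4. sample
```

Swapping steps 1 and 3 would let a heavily penalized token stay in the nucleus (it was selected before its logit dropped), which is exactly the inconsistency the correct ordering avoids.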
Why Temperature=0 Is Not Deterministic in Production
The belief that setting temperature=0 produces repeatable, stable outputs is one of the most expensive misconceptions in production AI. Temperature=0 means greedy decoding: always select the single highest-probability token. But in a real inference environment, the "highest-probability token" is not a stable quantity.
Floating-point arithmetic on GPUs is non-associative. The same mathematical operation—say, (a + b) + c—produces different results than a + (b + c) when values are represented as floating-point numbers, because rounding errors accumulate differently depending on computation order. Across the hundreds of matrix multiplications in a transformer forward pass, these rounding differences compound. The resulting logit values are slightly different than what pure mathematics would produce, and "slightly different" at the logit level can flip which token has the highest probability.
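You do not need a GPU to see non-associativity; ordinary float64 already exhibits it:

```python
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x == y)  # False: the two groupings round differently
print(x, y)    # 0.6000000000000001 vs 0.6
```

A GPU reduction sums thousands of such terms in an order that depends on thread scheduling, so logit-level drift of this magnitude is routine rather than exceptional.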
Modern inference servers use continuous batching to maximize throughput. Requests that arrive at the same time get batched together, and batch composition affects computation. Running your prompt alone produces different attention patterns—and different logits—than running it batched with 31 other requests. As server load fluctuates, so does batch composition, which means the "deterministic" model produces different outputs during peak hours than off-peak. A paper on this (arXiv:2408.04667v5) describes it directly: "For a given exact batch, the forward pass may be deterministic, but from the user's point of view, the system is still nondeterministic, because the batch itself is not stable from run to run."
GPU memory layout, quantization variants, and KV-cache optimization can all introduce additional drift. The practical takeaway: temperature=0 significantly reduces randomness, but it doesn't eliminate it. If your architecture requires reproducibility—audit logs, deterministic regression tests, bit-for-bit identical outputs—you need fixed batch sizes, consistent hardware configuration, and platform-level seed support where available. Temperature alone won't get you there.
The Top-P and Temperature Interaction Trap
The most common tuning mistake happens when engineers want "more creative" outputs and adjust both parameters simultaneously. The reasoning is intuitive: temperature controls creativity, top-p controls diversity, raising both should produce diverse and creative output. The actual result is often incoherent garbage.
Here's why. When temperature increases from 0.5 to 1.5, the probability distribution flattens. At temperature=0.5, your top token might have 80% probability—top-p=0.95 only needs a handful of tokens to reach that threshold. At temperature=1.5, the same model might assign 35% to the top token—top-p=0.95 now requires dozens of tokens to accumulate to 95%, including low-quality candidates that would have been discarded at lower temperatures. Raising temperature expands the nucleus automatically, even if you don't touch top-p.
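A sketch that makes the expansion measurable, using synthetic logits (one strong candidate plus a long flat tail; the shape is invented for illustration):

```python
import numpy as np

def nucleus_size(logits, temperature, p):
    # how many tokens the top-p nucleus contains after temperature scaling
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    cum = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cum, p)) + 1

# toy 100-token vocabulary: one dominant token, one runner-up, 98 noise tokens
logits = np.concatenate([[5.0, 3.0], np.zeros(98)])

small = nucleus_size(logits, 0.5, 0.95)  # confident: a 1-token nucleus
large = nucleus_size(logits, 1.5, 0.95)  # flattened: dozens of tokens qualify
```

Same top-p, same logits; only temperature changed, and the nucleus went from a single token to most of the vocabulary.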
A team that changed from (temperature=0.8, top-p=0.95) to (temperature=1.2, top-p=0.95) found that incoherent outputs appeared in 15-20% of responses. The fix wasn't reducing temperature—they wanted more variety. It was tightening top-p to 0.80, which compensated for the expanded nucleus that high temperature created. Incoherence dropped below 2%.
The general rule: treat temperature and top-p as coupled, not independent. If you raise temperature, lower top-p. If you raise top-p, consider lowering temperature. Changing both in the same direction amplifies their interaction in ways that are hard to predict without testing.
Recent research has formalized this problem. Min-p sampling (Nguyen et al., ICLR 2025) addresses it by making the threshold relative to the model's own confidence. Instead of a fixed cumulative probability, min-p keeps tokens whose probability exceeds a fraction of the peak token's probability. When the model is confident (peak=0.95), the threshold scales up automatically. When it's uncertain (peak=0.1), the threshold scales down. This eliminates the flat-distribution problem because the nucleus is always calibrated relative to model confidence. Min-p values of 0.05-0.1 work across most tasks. It's not yet universally available in commercial APIs, but vLLM supports it and adoption is growing.
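The filter itself is a few lines; a sketch of the idea (not the paper's reference implementation):

```python
import numpy as np

def min_p_filter(probs, min_p=0.05):
    # keep tokens whose probability is at least min_p * the peak probability
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

confident = np.array([0.95, 0.03, 0.01, 0.01])  # high peak -> strict threshold
uncertain = np.array([0.30, 0.25, 0.25, 0.20])  # low peak -> permissive threshold
```

With min_p=0.1, the confident case keeps only the peak token (threshold 0.095), while the uncertain case keeps all four (threshold 0.03): the cutoff tracks the model's own confidence automatically.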
Task-Specific Tuning Recommendations
There is no universal "good" temperature. The right value depends on what you need the model to optimize for.
Structured extraction and JSON generation: Use temperature 0.0-0.2. The goal is consistency—you want the same input to produce the same schema-valid output every time. Higher temperatures introduce unexpected characters, formatting deviations, and parsing failures. If you have access to constrained decoding (logit masking against a JSON schema), use it. It mathematically eliminates schema-invalid tokens before sampling, making temperature almost irrelevant for structural validity. Teams that moved from temperature=0.7 to temperature=0.1 with constrained decoding report structured output success rates improving from ~85% to over 99%.
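At its core, constrained decoding is logit masking. A minimal sketch (in a real system the allowed-token set would come from a JSON-schema state machine, which is elided here):

```python
import numpy as np

def mask_to_allowed(logits, allowed_ids):
    # schema-invalid tokens get -inf, i.e. probability exactly zero after softmax
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

logits = np.array([1.0, 4.0, 2.0])        # token 1 is the model's favorite...
masked = mask_to_allowed(logits, [0, 2])  # ...but the grammar only allows 0 and 2
```

Even greedy decoding now picks token 2, because the invalid favorite can no longer win. This is why temperature matters so little once structure is enforced at the logit level.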
Code generation: Use temperature 0.2-0.5. Code requires syntax precision, but full greedy decoding can miss valid solutions for complex problems. Temperature=0.3-0.5 allows exploration during logic synthesis without sacrificing syntax correctness. Keep repetition penalty at or near 1.0—code legitimately repeats patterns (variable names, common idioms, boilerplate), and penalizing repetition breaks this. Code quality degrades meaningfully above temperature=0.8 due to invalid syntax.
Multi-step reasoning: Counterintuitively, temperature=0 hurts complex reasoning tasks. Research shows accuracy on multi-step math problems drops roughly 17 percentage points when moving from temperature=0.5 to temperature=0. When the model encounters a decision point in a reasoning chain, greedy decoding forces it down one path. Slight temperature (0.5-0.7) allows exploration of alternative approaches. Note that reasoning-specific models (o1, o3) lock temperature at 1.0 and don't expose the parameter—trust their defaults rather than assuming you can override.
Summarization and QA: Temperature 0.5-0.7 strikes the right balance between factual accuracy and readable phrasing diversity. Below 0.3, summaries tend to be compressed and formulaic; above 0.8, factual slippage and hallucination increase. Top-p=0.9 is a reasonable default here.
Creative writing: Temperature 0.8-1.2, top-p 0.95. Creative quality benefits from distribution exploration, but quality collapses above roughly 1.2. This is a non-monotonic relationship: more temperature helps up to a point, then hurts. Test your specific model at 1.0, 1.1, and 1.2 before committing.
Provider Differences Matter More Than You Think
If you're moving a prompt between providers or running the same prompt on multiple models, sampling parameters are not portable. A temperature of 0.7 on OpenAI's GPT-4o is not equivalent to 0.7 on Anthropic's Claude. The scales differ (OpenAI supports 0-2.0, Anthropic caps at 1.0), and the underlying distributions differ. Identical temperature values produce different output variance.
The parameter support matrix differs significantly across providers:
- OpenAI does not support top_k. This is deliberate—they've argued that top-p is strictly better for most use cases.
- Anthropic supports top_k but does not expose frequency_penalty or presence_penalty. Their view is that these parameters introduce enough complexity to cause user errors.
- Google Gemini supports top_k and top_p simultaneously, plus temperature up to 2.0.
If you rely on frequency_penalty for conversational outputs and need to switch from OpenAI to Anthropic, you have no direct equivalent. You can approximate it by tightening top-k to restrict token repetition, but the behavior isn't the same. Similarly, migrating from Anthropic (where you set top_k explicitly) to OpenAI requires shifting to top-p-based control.
A subtle but important point: frequency_penalty and presence_penalty have different semantics even within OpenAI's API. Frequency penalty subtracts proportionally to token occurrence count (logits -= penalty * count), so a token used five times gets penalized five times as much as one used once. Presence penalty applies a flat deduction to any token that appeared at all, regardless of frequency. If you want to discourage repetition gently while preserving natural language flow, frequency penalty is the right tool. If you want to aggressively prevent any token reuse, presence penalty—but it will break pronouns and common words that naturally recur.
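The semantic difference in one sketch, using a hypothetical three-token history and the logits -= penalty * count form described above:

```python
import numpy as np

def penalize(logits, counts, freq_penalty=0.0, pres_penalty=0.0):
    logits = logits - freq_penalty * counts        # scales with occurrence count
    logits = logits - pres_penalty * (counts > 0)  # flat hit if it appeared at all
    return logits

counts = np.array([5, 1, 0])  # token 0 used five times, token 1 once, token 2 never

freq = penalize(np.zeros(3), counts, freq_penalty=0.1)  # [-0.5, -0.1, 0.0]
pres = penalize(np.zeros(3), counts, pres_penalty=0.5)  # [-0.5, -0.5, 0.0]
```

Frequency penalty scales with usage (the five-time token takes five times the hit); presence penalty punishes both used tokens identically, regardless of count.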
A Systematic Approach to Production Tuning
Most teams set temperature=0.7 because they saw it in a tutorial and never revisited it. A smaller number tweak values ad hoc, with no baseline and no measurement. Neither approach finds the actual optimal configuration.
The right methodology has four phases:
Build a baseline test set. Curate 100-500 examples that cover your actual input distribution—including edge cases, ambiguous inputs, and adversarial examples that exercise the failure modes you care about. Generate reference outputs either from human review or from the model at your current settings. This set is your regression anchor.
Sweep one parameter at a time. Fix everything else, vary temperature across [0.0, 0.3, 0.5, 0.7, 0.9, 1.2] and measure against your baseline with task-appropriate metrics (syntax validity for code, entailment scores for summarization, LLM-as-judge for coherence). Identify the inflection points—where quality starts improving, where it plateaus, where it degrades.
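The sweep harness itself is short. A sketch in which generate and score are placeholders for your model client and task metric (both names are assumptions, not a real API):

```python
def sweep_temperature(generate, score, test_set,
                      temps=(0.0, 0.3, 0.5, 0.7, 0.9, 1.2)):
    """Vary temperature only; every other parameter stays fixed."""
    results = {}
    for t in temps:
        scores = [score(generate(ex["input"], temperature=t), ex["reference"])
                  for ex in test_set]
        results[t] = sum(scores) / len(scores)  # mean metric per temperature
    return results
```

Plot the results against temps and look for the plateau and the cliff: those inflection points, not a tutorial default, should pick your production setting.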
Coordinate coupled parameters. If your single-variable sweep shows a benefit to higher temperature, tighten top-p accordingly and re-run. If you're adding repetition penalty, verify it doesn't break expected patterns in your outputs.
A/B in production. Deploy your new configuration to 5-10% of traffic. Track quality metrics in production rather than just in evaluation—batch dynamics, real input distribution, and actual user patterns reveal issues that evaluation harnesses miss. Monitor continuously because model behavior can drift when providers update model weights or change batching behavior.
The anti-pattern to avoid: optimizing sampling parameters for a tuning set that doesn't represent production traffic. If your test cases are clean, well-formed prompts and production users send ambiguous, messy inputs, the optimal parameters won't transfer. The baseline must reflect what the model actually sees.
Checking the Config Before Blaming the Model
The practical takeaway is this: when output quality regresses or behaves erratically, check sampling parameters before re-engineering the prompt or switching models. Temperature, top-p, and top-k are easy to overlook because they sit in infrastructure config rather than in prompt templates, and because their effects are gradual and stochastic rather than immediate and obvious.
The teams that get this right treat sampling configuration as a first-class tuning surface—versioned alongside prompts, monitored in production, and tested systematically. The teams that don't spend months reworking prompts for a model that was actually misconfigured from the start.
The parameters aren't magic. But wrong defaults are one of the most cost-effective failure modes to fix once you know to look for them.
- https://letsdatascience.com/blog/llm-sampling-temperature-top-k-top-p-and-min-p-explained
- https://www.zansara.dev/posts/2026-03-24-temp-0-llm/
- https://mbrenndoerfer.com/writing/why-llms-are-not-deterministic
- https://arxiv.org/html/2408.04667v5
- https://huyenchip.com/2024/01/16/sampling.html
- https://arxiv.org/pdf/2407.01082
- https://www.statsig.com/perspectives/top-vs-top-sampling
- https://www.kenmuse.com/blog/how-temp-topk-topp-minp-control-llm-output/
- https://arxiv.org/html/2402.05201v1
- https://smcleod.net/2025/04/llm-sampling-parameters-guide/
- https://optyxstack.com/llm-regression-testing
