Sampling Parameter Inheritance: When Temperature 0.7 Leaks From the Planner Into the Verifier
A verifier that flips its own answer eight percent of the time is not a flaky model. It is a sampling configuration bug that reached production because the framework defaulted to inheritance. The planner needed temperature=0.7 to brainstorm subtask decompositions. The verifier — the role whose entire job is to give a low-variance yes-or-no on whether the answer satisfies the rubric — was instantiated through the same harness call, and silently picked up the same temperature. Nobody set it that way on purpose. Nobody set it at all.
This is the most expensive parameter in your stack that nobody owns. It compounds across the call tree: the summarizer above the verifier, the structured-output extractor below it, and the retry loop wrapping the whole thing all consume the planner's "be creative" knob as if it were a global. The bill arrives in three places at once — eval flakiness, token spend, and the half-day a senior engineer spends bisecting a regression that turns out to be no regression at all.
The Inheritance Pattern That Looks Reasonable Until It Isn't
Most agent frameworks expose model configuration as constructor arguments to a single client object, then pass that client through the call tree. The planner agent, the verifier agent, and the structured-output extractor each get a reference to the same client, which means they each get the same temperature, top_p, and seed defaults. This pattern is convenient — you configure the model once at the edge of your application — and convenience is the reason it stays in place long after it has stopped serving you.
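The shape of the problem, as a minimal sketch — ModelClient and the role classes here are illustrative, not any particular framework's API:

```python
# Anti-pattern: one client, configured once, threaded through every role.
class ModelClient:
    def __init__(self, model: str, temperature: float = 0.7, top_p: float = 0.95):
        self.model = model
        self.temperature = temperature
        self.top_p = top_p

    def complete(self, prompt: str) -> str:
        # A real client would call an inference API with self.temperature here.
        raise NotImplementedError

# Configured once, at the edge, for the planner's needs.
client = ModelClient(model="some-model", temperature=0.7)

class Planner:
    def __init__(self, client: ModelClient):
        self.client = client  # temperature=0.7: chosen deliberately

class Verifier:
    def __init__(self, client: ModelClient):
        self.client = client  # temperature=0.7: inherited, never chosen

planner = Planner(client)
verifier = Verifier(client)  # silently runs at the planner's temperature
```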
The problem is that "the model" is not a single role. A modern agent is at least four different speakers, and each one has different sampling needs:
- A planner that proposes alternative decompositions benefits from diversity. Temperature in the 0.6–0.9 range produces the variety that lets a verifier pick a winner from multiple candidates.
- A verifier that scores candidates against a rubric must be near-deterministic. Recent work shows that LLM-as-judge self-consistency peaks around temperature=0.1, not 0.0, but anything above 0.3 introduces flip-rates that destroy the value of the verification step.
- A summarizer that compresses tool output for the next step needs a middle path: enough determinism to be reproducible, enough flexibility to handle unusual phrasing without truncating to a stock template.
- A structured-output extractor that produces a JSON object for downstream code wants temperature=0 with grammar-constrained decoding. Anything else creates parse failures that the retry loop converts from a 1% parse-failure rate into a 1% tail-latency spike.
When inheritance is the default, all four roles get the planner's setting. The harness implementer's mental model — "I configured the model" — does not match the runtime reality. The verifier has not been configured for verification; it has been configured for whatever the most recent caller needed, and the most recent caller usually needed creativity.
Why Temperature 0 in the Verifier Is Not the Answer Either
The instinct, after the first incident, is to override the verifier to temperature=0 and call the bug fixed. This is half-right. Temperature zero is the correct intent, but the recent research on LLM judges complicates the implementation in two specific ways.
First, temperature=0 does not produce deterministic outputs in modern inference systems. Batch composition on the server side changes the floating-point order of operations, which changes the logits, which can flip the highest-probability token even when sampling is greedy. The "1000 runs, 1000 different answers" effect is well-documented, and recent work on batch-invariant kernels has shown the workaround costs about 60% of throughput. For a verifier that runs at scale, this is the difference between an SLO you can hit and one you cannot.
Second, even when sampling is mathematically greedy, single-sample evaluation is brittle. A judge that gives one yes/no per call captures the mode of the distribution but tells you nothing about the spread. The papers that took this seriously — the ones that ran the same comparison through multiple temperatures and counted flips — found that the optimal temperature for a judge is closer to 0.1 than to 0, because slightly noised sampling exposes the cases where the model's belief is genuinely 51/49, and those cases should be flagged, not silently committed to one side.
The right verifier configuration is not "temperature zero." It is "low temperature, sampled multiple times, with a flip-rate metric watched in production." If the flip-rate exceeds a threshold (say, 5%), the verifier escalates to a stronger model or a human reviewer. This turns sampling parameters from a knob you set once into a signal you observe.
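A sketch of that loop, assuming `judge` is whatever callable invokes your verifier role at low temperature and returns a boolean verdict; the function name and per-call threshold are illustrative:

```python
from collections import Counter
from typing import Callable

def verify_with_flip_rate(
    judge: Callable[[str, str], bool],
    answer: str,
    rubric: str,
    samples: int = 5,
    flip_threshold: float = 0.05,
) -> bool | None:
    # Sample the verdict several times instead of trusting a single call.
    verdicts = [judge(answer, rubric) for _ in range(samples)]
    majority, majority_count = Counter(verdicts).most_common(1)[0]
    flip_rate = 1.0 - majority_count / samples
    # At 5 samples, any single flip exceeds a 5% threshold and escalates:
    # the model's belief is genuinely near 51/49 and should be flagged.
    if flip_rate > flip_threshold:
        return None  # caller routes to a stronger model or a human reviewer
    return majority
```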
Per-Role Sampling Profiles as a Framework Primitive
The architectural fix is to stop treating sampling parameters as model-client configuration and start treating them as role configuration. A role profile bundles temperature, top_p, presence_penalty, frequency_penalty, max_tokens, optional seed, and the grammar or schema constraints that go with the structured output. When a call site invokes the planner, it asks the framework for the planner profile, not the model client.
The discipline at the framework boundary is default-deny inheritance. A subagent invocation does not propagate sampling settings unless the call site explicitly opts in. If the planner spawns a verifier subagent and forgets to specify the verifier profile, the call should fail loudly — not silently fall back to the planner's settings. This is the same principle as least-privilege in security: the default should be the safe choice, and overrides should be visible.
In practice, this looks like a registry of named profiles — planner, verifier, summarizer, extractor, tool_use — and a wrapper around the model client that requires a profile name on every call. The model client itself becomes a low-level primitive that nobody calls directly. Application code requests profiles by role, and the framework enforces the per-role parameters. The win is that "what was the verifier's temperature on Tuesday at 4pm" becomes a config-lookup question, not a stack-trace archaeology question.
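One way the registry and the fail-closed wrapper might look; SamplingProfile, ProfileRegistry, and RoleClient are illustrative names, not a real framework's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingProfile:
    temperature: float
    top_p: float
    max_tokens: int
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    seed: int | None = None
    json_schema: dict | None = None  # grammar/schema constraint, if any

class ProfileRegistry:
    def __init__(self) -> None:
        self._profiles: dict[str, SamplingProfile] = {}

    def register(self, role: str, profile: SamplingProfile) -> None:
        self._profiles[role] = profile

    def get(self, role: str) -> SamplingProfile:
        if role not in self._profiles:
            # Default-deny: no silent fallback to the caller's settings.
            raise KeyError(f"no sampling profile registered for role {role!r}")
        return self._profiles[role]

class RoleClient:
    """Wrapper that requires a role name on every call; the raw client
    becomes a low-level primitive nobody calls directly."""

    def __init__(self, raw_client, registry: ProfileRegistry) -> None:
        self._client = raw_client
        self._registry = registry

    def complete(self, role: str, prompt: str) -> str:
        profile = self._registry.get(role)  # fails loudly if unregistered
        return self._client.complete(
            prompt,
            temperature=profile.temperature,
            top_p=profile.top_p,
            max_tokens=profile.max_tokens,
        )
```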
A second-order benefit shows up in the eval harness. If the eval suite runs the verifier through the same profile registry as production, it cannot accidentally test the verifier at a temperature production never sees. This is the same failure mode the prompt-version problem causes, applied to sampling: an eval suite that runs against a different config than production is measuring a different system.
The Eval That Catches Inheritance Bugs
Catching these bugs requires an eval that nobody writes by default: verifier disagreement-rate under repeated calls. Take a fixed input — a candidate answer and a rubric — and run the verifier against it ten times. If the verifier disagrees with itself more than 5% of the time, the sampling configuration is the bug, not the model. The number is tunable, but the principle is universal: a verifier that flips on its own input is not verifying anything; it is sampling.
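A sketch of the eval, assuming `run_verifier` is however your harness invokes the verifier role against a fixed candidate/rubric pair:

```python
from collections import Counter

def disagreement_rate(run_verifier, candidate: str, rubric: str, runs: int = 10) -> float:
    # Fixed input, repeated calls: a verifier should agree with itself.
    verdicts = [run_verifier(candidate, rubric) for _ in range(runs)]
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / runs

def test_verifier_self_consistency(run_verifier, candidate: str, rubric: str):
    rate = disagreement_rate(run_verifier, candidate, rubric)
    assert rate <= 0.05, (
        f"verifier disagreed with itself {rate:.0%} of the time; "
        "check for inherited sampling config before blaming the model"
    )
```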
This eval is cheap to implement and surprisingly effective at finding inheritance bugs because it catches both the mechanical case (temperature was inherited as 0.7) and the subtler case (temperature was set to 0.1 but the seed was not pinned, and the underlying inference system batches non-deterministically). When the disagreement-rate spikes, the cause is almost always one of three things: a config inheritance bug, a model-version change that altered the underlying distribution, or a prompt change that pushed the verifier's belief into a near-50/50 zone.
The same eval template applies to the structured-output extractor, where the metric becomes "schema-validation failure rate." If your extractor is producing valid JSON 99.2% of the time and failing 0.8%, the question is whether the failures correlate with batch size, time-of-day, or model load — all signals of inherited non-determinism that grammar-constrained decoding would have eliminated. The 0.8% failure rate is a token-spend story too: every parse failure triggers a retry, and retries have higher input-token cost than first-shot successes because they include the previous failed output as context.
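The extractor variant might look like this sketch, using the jsonschema validation library; `extract` and the input shape are stand-ins for your own harness:

```python
import json
import jsonschema  # pip install jsonschema

def schema_failure_rate(extract, inputs: list, schema: dict) -> float:
    failures = 0
    for item in inputs:
        raw = extract(item)  # the extractor role's raw string output
        try:
            jsonschema.validate(json.loads(raw), schema)
        except (json.JSONDecodeError, jsonschema.ValidationError):
            # Log item, timestamp, and batch context here: failures that
            # correlate with load point at inherited non-determinism.
            failures += 1
    return failures / len(inputs)
```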
The Cost Surprise You Find in the Bill, Not the Logs
The bill is where sampling-inheritance bugs become visible to people who do not read tracing dashboards. High-temperature subagents produce longer, more wandering outputs because the sampling distribution rewards low-probability tokens that, once chosen, drag the generation into the long tail of the response space. A verifier running at temperature=0.7 does not just flip its answer; it explains its (now incorrect) answer at length, because the same temperature that made it pick an unlikely token makes it pick an unlikely next token, repeatedly.
The token-cost trace looks like this: a verifier that should output 30 tokens ("Yes, the answer satisfies the rubric because…") outputs 180 tokens of meandering hedging. The 6× output multiplier hits the per-call cost. The retry loop, triggered by the inconsistency, doubles or triples the call count. The summarizer above, fed the verifier's wandering output, also runs longer because its input is longer. The total cost amplification from one mis-set sampling parameter, traced through the call graph, is rarely under 4× and not infrequently over 10× — and it shows up in the monthly bill before it shows up in any latency dashboard, because each individual call still completes in under a second.
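Back-of-envelope, using the 30- and 180-token figures above; the prompt size and retry factor are assumptions for illustration:

```python
# Token-unit math for one verification step.
input_tokens = 500                 # candidate + rubric prompt (assumed)
good_output, hot_output = 30, 180  # terse verdict vs. meandering verdict
retries = 2.5                      # average call count once retries kick in (assumed)

baseline = input_tokens + good_output             # 530 token-units per verification
degraded = retries * (input_tokens + hot_output)  # 1,700 token-units
print(f"{degraded / baseline:.1f}x")              # ~3.2x at the verifier alone,
# before the summarizer above pays for the longer input it must now compress
```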
The fix is not a cost dashboard. The fix is recognizing that sampling parameters have a cost effect that compounds through the agent graph, and that "sensible defaults" set at the model client cannot be sensible for every role downstream. A planner whose creativity costs 50% more on output tokens is a planner doing its job. A verifier costing 50% more on output tokens is a verifier doing the planner's job badly.
What Sensible Defaults Look Like
The framework-design lesson is that sampling defaults should be role-specific and explicit, not pipeline-wide and implicit. A reasonable starting registry, sketched in code after the list:
- Planner: temperature=0.7, top_p=0.95, no seed pinning. Output diversity is the point.
- Verifier: temperature=0.1, top_p=0.5, sampled three times with majority vote when stakes are high. Pin a seed for reproducibility in evals; do not pin in production unless your inference stack is batch-invariant.
- Summarizer: temperature=0.3, top_p=0.8. The middle path.
- Extractor: temperature=0, top_p=1.0, grammar-constrained decoding to a JSON schema. Bypass the schema only for known-trusted prefixes.
- Tool-call generator: temperature=0, structured output for the tool name and arguments. Free-form reasoning happens in a separate, hotter call.
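Expressed with the illustrative SamplingProfile and ProfileRegistry sketch from earlier, that registry might look like:

```python
registry = ProfileRegistry()
registry.register("planner", SamplingProfile(temperature=0.7, top_p=0.95, max_tokens=1024))
registry.register("verifier", SamplingProfile(temperature=0.1, top_p=0.5, max_tokens=256,
                                              seed=1234))  # pin the seed in evals only
registry.register("summarizer", SamplingProfile(temperature=0.3, top_p=0.8, max_tokens=512))
registry.register("extractor", SamplingProfile(temperature=0.0, top_p=1.0, max_tokens=512,
                                               json_schema={"type": "object"}))  # placeholder schema
registry.register("tool_use", SamplingProfile(temperature=0.0, top_p=1.0, max_tokens=256))
```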
These numbers are starting points, not destinations. The point is that the values differ by role, that the framework enforces them at the call site, and that nobody sets a default at the model-client layer that silently propagates everywhere. The most important parameter in your agent stack is the one that decides where parameters come from — and the right answer is "the role profile, every time, no inheritance, fail closed."
The next time a verifier flip-rate spikes or a JSON parse failure rate creeps up, the first question is not "did the model change?" It is "does the call site know which profile it asked for?" If the answer is "the model client picks it up from the constructor," the bug is the framework, not the model.
- https://arxiv.org/html/2603.28304v1
- https://arxiv.org/html/2510.27106
- https://arxiv.org/html/2412.12509v2
- https://arxiv.org/html/2306.05685v4
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://eugeneyan.com/writing/llm-evaluators/
- https://cameronrwolfe.substack.com/p/llm-as-a-judge
- https://www.aidancooper.co.uk/constrained-decoding/
- https://github.com/guidance-ai/llguidance
- https://www.lmsys.org/blog/2025-09-22-sglang-deterministic/
- https://mbrenndoerfer.com/writing/why-llms-are-not-deterministic
