Sampling Parameter Inheritance: When Temperature 0.7 Leaks From the Planner Into the Verifier
A verifier that flips its own answer eight percent of the time is not a flaky model. It is a sampling configuration bug that reached production because the framework defaulted to inheritance. The planner needed temperature=0.7 to brainstorm subtask decompositions. The verifier — the role whose entire job is to give a low-variance yes-or-no on whether the answer satisfies the rubric — was instantiated through the same harness call, and silently picked up the same temperature. Nobody set it that way on purpose. Nobody set it at all.
This is the most expensive parameter in your stack that nobody owns. It compounds across the call tree: the summarizer above the verifier, the structured-output extractor below it, and the retry loop wrapping the whole thing all consume the planner's "be creative" knob as if it were a global. The bill arrives in three places at once — eval flakiness, token spend, and the half-day a senior engineer spends bisecting a regression that turns out to be no regression at all.
The Inheritance Pattern That Looks Reasonable Until It Isn't
Most agent frameworks expose model configuration as constructor arguments to a single client object, then pass that client through the call tree. The planner agent, the verifier agent, and the structured-output extractor each get a reference to the same client, which means they each get the same temperature, top_p, and seed defaults. This pattern is convenient — you configure the model once at the edge of your application — and convenience is the reason it stays in place long after it has stopped serving you.
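A minimal sketch of the inheritance pattern described above. The `ModelClient` and `Agent` classes are hypothetical stand-ins, not any real framework's API; the point is only that every role holding the same client reference sends the same sampling settings.

```python
from dataclasses import dataclass


@dataclass
class ModelClient:
    """Hypothetical client: sampling settings are fixed at construction."""
    model: str
    temperature: float = 1.0
    top_p: float = 1.0

    def request_params(self) -> dict:
        # Every caller that shares this client sends these same settings.
        return {"model": self.model, "temperature": self.temperature, "top_p": self.top_p}


@dataclass
class Agent:
    """Hypothetical role wrapper that only holds a reference to the shared client."""
    role: str
    client: ModelClient

    def request_params(self) -> dict:
        return {"role": self.role, **self.client.request_params()}


# Configured once at the edge of the application, with the planner in mind.
shared = ModelClient(model="example-model", temperature=0.7)

planner = Agent("planner", shared)
verifier = Agent("verifier", shared)    # nobody chose 0.7 for this role
extractor = Agent("extractor", shared)  # nor for this one

for agent in (planner, verifier, extractor):
    print(agent.request_params())
```

All three printed parameter sets carry `temperature=0.7`, although only the planner asked for it.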
The problem is that "the model" is not a single role. A modern agent is at least four different speakers, and each one has different sampling needs:
- A planner that proposes alternative decompositions benefits from diversity. Temperature in the 0.6–0.9 range produces the variety that lets a verifier pick a winner from multiple candidates.
- A verifier that scores candidates against a rubric must be near-deterministic. Recent work shows that LLM-as-judge self-consistency peaks around temperature=0.1, not 0.0, but anything above 0.3 introduces flip-rates that destroy the value of the verification step.
- A summarizer that compresses tool output for the next step needs a middle path: enough determinism to be reproducible, enough flexibility to handle unusual phrasing without truncating to a stock template.
- A structured-output extractor that produces a JSON object for downstream code wants temperature=0 with grammar-constrained decoding. Anything else creates parse failures, and the retry loop converts a 1% failure rate into a tail-latency spike on that same 1% of requests.
When inheritance is the default, all four roles get the planner's setting. The harness implementer's mental model — "I configured the model" — does not match the runtime reality. The verifier has not been configured for verification; it has been configured for whatever the most recent caller needed, and the most recent caller usually needed creativity.
Why Temperature 0 in the Verifier Is Not the Answer Either
The instinct, after the first incident, is to override the verifier to temperature=0 and call the bug fixed. This is half-right. Temperature zero is the correct intent, but the recent research on LLM judges complicates the implementation in two specific ways.
First, temperature=0 does not produce deterministic outputs in modern inference systems. Batch composition on the server side changes the floating-point order of operations, which changes the logits, which can flip the highest-probability token even when sampling is greedy. The "1000 runs, 1000 different answers" effect is well-documented, and recent work on batch-invariant kernels has shown the workaround costs about 60% of throughput. For a verifier that runs at scale, this is the difference between an SLO you can hit and one you cannot.
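The floating-point effect can be reproduced without a model. The sketch below is a toy illustration with made-up logits, not a trace of any real inference kernel: the same three terms are summed in two different orders, standing in for two different batch compositions, and the greedy pick changes.

```python
# Floating-point addition is not associative, so the order in which partial
# sums are reduced changes the result in the last bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False
print(a, b)    # 0.6000000000000001 0.6


def greedy(logits: dict) -> str:
    # Greedy decoding: pick the highest logit. On an exact tie, max() keeps
    # the first key, which here is "no".
    return max(logits, key=logits.get)


# Assumed numbers: the rival token "no" sits at exactly 0.6, and the logit
# for "yes" is the same sum reduced in two different orders.
print(greedy({"no": 0.6, "yes": (0.1 + 0.2) + 0.3}))  # yes
print(greedy({"no": 0.6, "yes": 0.1 + (0.2 + 0.3)}))  # no
```

Real logits are rarely this close, which is why the effect shows up as an occasional flip rather than constant noise.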
Second, even when sampling is mathematically greedy, single-sample evaluation is brittle. A judge that gives one yes/no per call captures the mode of the distribution but tells you nothing about the spread. The papers that took this seriously — the ones that ran the same comparison through multiple temperatures and counted flips — found that the optimal temperature for a judge is closer to 0.1 than to 0, because slightly noised sampling exposes the cases where the model's belief is genuinely 51/49, and those cases should be flagged, not silently committed to one side.
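To see why a little noise is informative, consider a two-way choice under temperature scaling. This sketch assumes that the stated belief (51/49 or 90/10) is the model's probability at temperature 1.0 and that only "yes" and "no" carry probability mass; the `sharpen` helper is illustrative, not taken from any of the papers.

```python
import math


def sharpen(p_yes: float, temperature: float) -> float:
    """Probability of sampling 'yes' after temperature scaling.

    p_yes is the assumed probability at temperature 1.0, strictly between 0 and 1.
    """
    if temperature == 0:
        # Greedy decoding always returns the mode (ties go to 'yes' here).
        return 1.0 if p_yes >= 0.5 else 0.0
    logit_gap = math.log(p_yes / (1.0 - p_yes))
    # Dividing the logit gap by the temperature sharpens the distribution.
    return 1.0 / (1.0 + math.exp(-logit_gap / temperature))


for t in (1.0, 0.3, 0.1, 0.0):
    print(t, round(sharpen(0.51, t), 3), round(sharpen(0.90, t), 3))
```

Under these assumptions, at temperature 0.1 the 51/49 case still answers "no" in roughly four samples out of ten, while the 90/10 case almost never flips. At temperature 0 both cases look identical, which is exactly the information a single greedy sample throws away.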
The right verifier configuration is not "temperature zero." It is "low temperature, sampled multiple times, with a flip-rate metric watched in production." If the flip-rate exceeds a threshold (say, 5%), the verifier escalates to a stronger model or a human reviewer. This turns sampling parameters from a knob you set once into a signal you observe.
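One way the sample-and-watch loop could look, as a sketch. The `judge` callable, the `Verdict` type, and the default of five samples are assumptions for illustration; in a real system `judge` would wrap a model call made with the verifier's own low-temperature settings.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import cycle
from typing import Callable


@dataclass
class Verdict:
    answer: str       # majority answer across samples
    flip_rate: float  # share of samples that disagreed with the majority
    escalate: bool    # True when flip_rate exceeds the threshold


def verify(judge: Callable[[str], str], candidate: str,
           samples: int = 5, flip_threshold: float = 0.05) -> Verdict:
    """Call the judge several times and measure how often it disagrees with itself."""
    votes = Counter(judge(candidate) for _ in range(samples))
    answer, majority = votes.most_common(1)[0]
    flip_rate = 1.0 - majority / samples
    return Verdict(answer, flip_rate, escalate=flip_rate > flip_threshold)


# Fake judges standing in for model calls.
def stable(_candidate: str) -> str:
    return "yes"


_wobbly_answers = cycle(["yes", "yes", "no", "yes", "yes"])


def wobbly(_candidate: str) -> str:
    return next(_wobbly_answers)


print(verify(stable, "candidate A"))  # no disagreement, no escalation
print(verify(wobbly, "candidate B"))  # one dissenting sample, escalates
```

With only five samples per call, a single dissenting vote already exceeds a 5% threshold, so per call this rule means "escalate on any disagreement." A 5% figure is more meaningful as an aggregate: emit `flip_rate` as a metric and alert on its average across many calls.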
Per-Role Sampling Profiles as a Framework Primitive
The architectural fix is to stop treating sampling parameters as model-client configuration and start treating them as role configuration. A role profile bundles temperature, top_p, presence_penalty, frequency_penalty, max_tokens, optional seed, and the grammar or schema constraints that go with the structured output. When a call site invokes the planner, it asks the framework for the planner profile, not the model client.
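A role profile could be sketched as a frozen dataclass plus a read-only registry. The names `RoleProfile`, `PROFILES`, and `profile_for` are hypothetical, and the numeric values are illustrative picks within the ranges discussed earlier (the summarizer's 0.3 and the `max_tokens` values are assumptions, not recommendations from the text).

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class RoleProfile:
    """Sampling settings owned by a role, not by the model client."""
    temperature: float                          # required: no silent default
    top_p: float = 1.0
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    max_tokens: int = 1024
    seed: Optional[int] = None
    schema: Optional[Mapping[str, Any]] = None  # JSON schema or grammar, if any


# Illustrative values only; the registry is read-only so a caller cannot
# mutate another role's settings at runtime.
PROFILES: Mapping[str, RoleProfile] = MappingProxyType({
    "planner": RoleProfile(temperature=0.7),
    "verifier": RoleProfile(temperature=0.1, max_tokens=16),
    "summarizer": RoleProfile(temperature=0.3),
    "extractor": RoleProfile(temperature=0.0, schema={"type": "object"}),
})


def profile_for(role: str) -> RoleProfile:
    """Return the profile for a role; fail loudly rather than inherit a default."""
    try:
        return PROFILES[role]
    except KeyError:
        raise LookupError(f"no sampling profile registered for role {role!r}") from None


print(profile_for("verifier").temperature)  # 0.1
```

Two choices carry the weight here: `temperature` has no default, so a profile cannot be created without someone deciding it, and an unregistered role raises instead of falling back to whatever the last caller used.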
