Skip to main content

The Prompt Hot-Reload That Orphaned Every In-Flight Agent Run

· 11 min read
Tian Pan
Software Engineer

The pager went off at 11:47pm. A customer had been ten minutes into a refund conversation when the agent suddenly stopped calling the process_refund tool it had been reasoning about for the entire session, hallucinated a confirmation number, and ended the chat. By the time we traced it back, the cause was obvious in retrospect: a teammate had pushed an updated system prompt at 11:46. The push was clean, the tests passed, and every new conversation worked perfectly. The few hundred conversations already in progress did not.

We had built our prompt registry to support what every prompt-versioning vendor in 2026 markets as a feature: hot-reload without redeploy. We had treated that capability as if it were a CDN cache flush — a global swap that takes effect everywhere at once. What it actually was, we learned that night, was a contract break. Every active conversation was an in-flight negotiation between an LLM and a set of instructions plus tool definitions it had been reasoning against. When the registry swapped the prompt under those conversations, half the negotiated context was now stale.

The fix, eventually, was to stop thinking of prompt versions as a deploy-time global and start thinking of them as a per-conversation contract. That is a much smaller word change than it is an architecture change.

Why "Hot-Reload" Looks Safe Until It Isn't

Prompt registries are sold as a way to decouple prompt edits from application deploys. The pitch is irresistible: your prompt engineer fixes a wording issue, presses publish, and the next request picks it up. No CI run, no container restart, no rollback ceremony. For one-shot completions — classification, summarization, single-turn Q&A — the pitch is also basically true. The request reads the latest prompt from the registry, calls the model, returns. The only window of risk is between the read and the model response, and that window is short.

Agents break this model in two ways. First, an agent run is not a single LLM call; it is a loop. The system prompt and the tool schemas get re-sent to the model on every turn of that loop, because that is how stateless chat completion APIs work. Second, the LLM's reasoning within a session is path-dependent. It has already committed to a plan based on what tools it saw three turns ago, what the system prompt told it about its persona five turns ago, and which output format it locked in at the start. Swap any of those out mid-loop and you are not refreshing context — you are gaslighting the model.

The registry abstraction quietly assumes that prompts are pure inputs. In an agent loop, prompts are closer to a schema. Changing a schema while writes are streaming against it is exactly the kind of thing distributed databases spent two decades figuring out how to make safe.

The Specific Failure Modes

The "stopped mid-refund" incident was the loudest, but it was not the only one. Once we started looking, we found a family of failure modes that all traced to the same root cause.

Tool definitions go stale relative to the system prompt. Our prompt push had renamed process_refund to issue_refund and updated the system prompt's worked example accordingly. The tool registry, deployed separately, still exposed process_refund to in-flight conversations. The model, faithfully following the new prompt, called issue_refund — which did not exist — got a tool-not-found error, and rationalized its way to a hallucinated success.

Output schemas drift mid-session. Another push tightened a JSON output format from "a list of strings" to "a list of objects with a reason field." Conversations in flight had several earlier turns where the model produced the old shape and the parser accepted it. After the swap, the model — now seeing the new instruction — produced the new shape, and the downstream consumer crashed because it had been written assuming a session's output schema was stable.

Persona shifts confuse multi-turn reasoning. A wording change from "you are a concise assistant" to "you are a thorough assistant" landed mid-conversation for a sales chat. The first half of the conversation had short responses; the second half had multi-paragraph responses. Users noticed. They thought the agent was malfunctioning. It was just doing exactly what each successive prompt told it to do.

Guardrails invalidate prior approvals. A safety policy change moved a class of operations from "allowed" to "requires confirmation." Conversations where the operation had already been pre-approved in turn three got blocked when the same operation was attempted in turn seven. The user saw a tool that had worked moments earlier suddenly refuse. From their perspective, the agent had developed amnesia about its own permissions.

The thread connecting all four: nothing crashed. The LLM is too forgiving for that. It absorbs the contradiction and produces a confident, plausible response that happens to be wrong. The agent registry team's playbook for hot-reload assumed the failure mode would be a clean error somewhere. The actual failure mode was silent semantic corruption, which is the worst category of bug to have in a system whose entire output is plausible-sounding text.

The Mental Model: A Session Is a Contract, Not a Cache

The bug here is not in the registry implementation. It is in the abstraction the registry exposes. Treating a prompt as a cacheable value implies that the cost of a stale read is bounded — at worst you get last week's prompt for a few hundred milliseconds. Treating a prompt as a contract implies that the cost of breaking it is unbounded, because the consumer (the LLM) has built up state — chain-of-thought, intermediate plans, tool-call history — that depends on the contract holding for the duration of the interaction.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates