Skip to main content

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.

The cost surface tells the story before the correctness surface does. A simple agent loop costs roughly 3x more than a single inference for the same outcome at five steps, more than 30x at fifty steps, and crosses 100x past two hundred. Most of that is context, re-sent on every tool call, growing monotonically until something kicks the loop out. A model that does not know when to stop is not just doing more work — it is doing more expensive work each turn, with each new turn paying for everything that came before it.

Helpfulness is the failure mode

The intuition we inherited from chat is that a more thorough answer is a better answer. Reach further, consider edge cases, leave the code a little cleaner than you found it. None of that is wrong in a code review. All of it is wrong when the agent is the one deciding the scope.

The agent does not know which files in the repo are owned by another team. It does not know that the helper it "improved" is depended on by a serialization format you cannot change this quarter. It does not know that the renamed parameter is part of a public API. It will not see the Slack thread three weeks from now where someone asks why a deploy started failing and the answer turns out to be a refactor that was never asked for. From inside the loop, the agent saw an opportunity to be helpful and took it. From outside the loop, you got an unbounded blast radius.

This is what scope expansion looks like as a runtime failure: it does not throw, it does not return an error, the tests stay green, and the output is plausibly better than what you asked for. There is no log line that says "the agent decided the task was bigger than you described." The only signal is the size of the diff and the size of the bill, both of which arrive after the damage is done.

"Done" is a thing your prompt has to say

The model is not going to figure out where the task ends by reading the task. "Fix the failing test in auth/session_test.py" reads, to a capable agent, as the start of a session, not the entirety of it. Once the test is green, the agent still has tools, still has context window, still has a system prompt encouraging quality and thoroughness. Of course it keeps going.

The fix is to write the stopping condition into the task the same way you write the success condition. Not "fix the bug" but "fix the bug, then stop — do not modify any file other than the one containing the failing test and its direct fixtures, do not run any tool other than the test runner once the test passes." Yes, that is verbose. It is verbose because the natural-language version of "stop when you're done" is doing a lot of unspecified work, and the agent is going to fill that vacuum with its own judgement.

A useful pattern: every agent task should answer three questions explicitly in the prompt — what counts as success, what counts as out of scope, and what the agent must do the moment success is observed. The third one is the one teams forget. If you do not say "emit DONE and exit," the agent will keep being a good citizen of a job that is already complete.

There is a deeper version of this for production: design success itself as a signal the agent emits rather than a thing the agent infers. A final_answer() tool, a done() action, a structured output schema that includes a complete: true field. The model is then trained to recognise "I should call this tool now" the same way it recognises any other tool call. Termination becomes a thing the model decides to do, not a thing that happens to it when nothing else fires.

Step budgets and diff caps are containment, not optimisation

The temptation is to treat per-step token usage as a performance problem. Trim the prompt, prune the context, route to a smaller model. All of that is real, and all of that misses the harder issue: an agent that does not stop will spend any budget you give it. Optimising per-step cost without bounding step count is paying less per turn for the same unbounded number of turns.

Step budgets — a hard ceiling on how many iterations the loop is allowed — are not a tuning knob. They are a circuit breaker. A reasonable production default for code-editing agents is in the 15–25 step range, sometimes lower. The point of the ceiling is not that the task always fits inside it. The point is that when the task doesn't fit, something else gets to make the decision about what happens next — a retry with a tighter scope, a handoff to a human, a graceful failure. Without that ceiling, "what happens next" is "the agent keeps spending."

Diff-size caps are the same idea applied to write surface. If the task is "fix the bug in this function," and the agent is proposing a 600-line change across nine files, that diff is, by itself, a stop signal. It does not matter that each individual edit is plausible. The aggregate is evidence the agent has redefined the task. Either the diff gets rejected and the agent gets a tighter brief, or a human takes over. What you cannot do is keep approving large diffs from small tasks and pretend you have not changed who is in charge of scope.

The pattern that holds these together: every containment mechanism — step budget, diff cap, wall-clock timeout, tool-call quota — is a defense against the same root cause, which is that the agent has no internal sense of "enough." You are not picking one of these to deploy. You are layering them, because any single one of them can be defeated by an agent that gets creative about how to keep going.

The productized version: a "done" signal the user can trust

In a research tool or a one-shot script, you can probably get away with hard ceilings and call it done. In a product, the stopping problem becomes a UX problem. The user does not see your step budget. They see a thing churning, and they have no way to tell whether it is making progress, stuck in a loop, or finished but unsure how to exit.

The productized version of solving the stopping problem is making "done" a thing the user can see and trust, separately from the agent's enthusiasm to keep going. A clear completion signal in the UI. A summary of what was done and what was deliberately not touched. A list of the things the agent considered doing and decided were out of scope, surfaced as suggestions rather than acted on. The agent says: I fixed the test, I noticed three other places where the same pattern might be a problem, I did not touch them, here they are if you want me to.

That shape — "I stopped, here is what I chose not to do" — is the inverse of the failure mode at the top of this post. The agent that won't stop hides its scope expansion in the diff. The agent that stops well surfaces every option it didn't take. The user gets to be the one who decides whether the workaround turns into a refactor. The agent gets to be useful in two directions: by doing the bounded thing it was asked, and by reporting the unbounded thing it noticed.

What this changes about how you brief an agent

You can probably tell where this lands. The team that has had this incident, even once, stops writing agent tasks the way they used to. Tasks acquire a "do not" section. System prompts grow an explicit instruction to emit a structured done signal the moment the success criterion is observed, and to never modify files outside an enumerated set unless the prompt grants that scope. Every agent task gets a step budget and a diff cap, configured per task type rather than per agent. The platform team starts logging "scope expansion" as a metric — diff size relative to brief size, files touched outside the named set, tool calls after success was detected — and reviewing it the same way they review error rates.

None of this is exotic. It is the same containment discipline you would apply to any process with the authority to act on shared systems. The only thing that is new is that the process in question is a probabilistic one with no native concept of "I should stop now," and the default behaviour of helpfulness is, in this context, the bug. The agent that does exactly what you asked, and then stops, is the agent that is safe to give more authority to. The agent that always does a little more than you asked is the one that quietly trains your team to ask for less.

References:Let's stay in touch and Follow me for more thoughts and updates