
First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

11 min read
Tian Pan
Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.

The training reward that rewrites your bill

Coding agents are trained on trajectories where "reads first, writes second" was the correct strategy. When a benchmark question is ambiguous, the model that asks for more context, lists the directory, grounds itself in the actual code, and then acts tends to beat the model that dives straight into an edit. The trajectory scores well on the offline eval. Reviewers rate it as "careful." It rarely produces the kind of confidently-wrong edit that gets flagged.

So the training process, over many iterations, nudges the model toward more reads before acting. This is the right policy when you don't know what environment you're in. It's the wrong policy when the task description already specified the environment and the file path is in the prompt.

The mismatch is structural. Training loops optimize a reward that treats exploration as free, because in an offline eval the token budget is essentially infinite and latency is measured as "did it eventually converge to a correct answer?" Production optimizes a different reward entirely: did this finish before the user closed the tab, and did it fit in the context window, and did the bill for a trivial edit stay under the threshold where someone in finance starts asking questions?

Research on reward design for tool-use confirms this asymmetry directly. When exploratory tool calls are only mildly penalized, models learn to front-load them as a cheap hedge against uncertainty. The same training discipline that prevents the agent from collapsing into a trivial always-edit policy also teaches it that "when in doubt, read another file" is low-cost insurance. The cost isn't low — it's just invisible to the training loop.

The production cost the training loop can't see

One developer reported that a coding agent read twenty-five files to answer a question about three functions, not because it couldn't code but because it had no idea which three functions mattered. Sessions that do meaningful work routinely burn through tens of thousands of input tokens per task, and a large fraction of that is exploration that never influenced the output.

Three concrete costs compound:

Latency that breaks the feel of the product. A task that needs five seconds of model time plus forty tool calls at a second each is a forty-five-second interaction. Users perceive agents as slow not because model inference is slow but because the tool-call fan-out hides the latency in a place no single metric catches. Your p95 chat latency looks fine. Your p95 end-to-end task completion is a disaster.

Context window collapse before the actual task starts. Every tool result goes into the context. Read twelve files, even with clipping, and you've pushed the task description thirty thousand tokens upstream of where the model is currently "thinking." Research on context bloat is blunt: irrelevant context actively degrades performance, and the model has to work harder to separate signal from the junk it just loaded. The agent that reads more to be careful ends up less accurate, not more.

Abandoned work that you still pay for. If the agent explores for sixty seconds and the user hits stop, you still get billed for everything it generated. Token cost per completed task quietly inflates with every additional read that didn't ship an edit. Finance sees a cost-per-active-user curve bending north and the engineering team can't explain why — the per-task headline metric looks fine, but the ratio of paid tokens to delivered tokens is drifting up.

Observability dashboards designed for the "LLM is a function" era don't catch any of this. They measure requests, not trajectories. The signal is hidden in the gap between "tokens consumed" and "tokens that changed an output."

Task-class routing: not every task is a research problem

The fix starts with admitting that most tasks aren't ambiguous. "Change the log level in this function to warn" does not require an investigation. "Add a null check to line 47 of this file" does not require listing the directory. Agents that treat these as discovery problems are burning budget on uncertainty that doesn't exist.

Task-class routing means the orchestration layer classifies incoming work before handing it to the agent. The classifier doesn't have to be sophisticated — a heuristic that looks at whether the prompt contains concrete identifiers (file paths, function names, line numbers) and a specific verb (rename, delete, add, replace) is enough to separate "just do it" tasks from "figure it out" tasks. The two classes get different tool access, different system prompts, and different budgets.
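A minimal sketch of that heuristic, assuming nothing about your framework; the verb list, the regexes, and the name `classify_task` are all illustrative:

```python
import re

# Verbs that signal a concrete, pre-scoped edit rather than an investigation.
DIRECT_VERBS = {"rename", "delete", "add", "replace", "change", "remove", "set"}

# Rough anchors: a file path, a "line N" reference, or a function-ish identifier.
FILE_PATH = re.compile(r"\b[\w./-]+\.(?:py|ts|js|go|rs|java)\b")
LINE_REF = re.compile(r"\bline\s+\d+\b", re.IGNORECASE)
FUNC_NAME = re.compile(r"\b\w+\(\)")

def classify_task(prompt: str) -> str:
    """Return 'direct' for just-do-it tasks, 'research' for everything else."""
    words = set(prompt.lower().split())
    has_anchor = any(p.search(prompt) for p in (FILE_PATH, LINE_REF, FUNC_NAME))
    has_verb = bool(words & DIRECT_VERBS)
    return "direct" if (has_anchor and has_verb) else "research"
```

On "Add a null check to line 47 of utils/parser.py" this returns "direct"; on "figure out why login is flaky" it returns "research". Misclassifications are cheap in one direction: a "direct" task routed to "research" merely wastes some budget, which is the status quo anyway.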

For the "just do it" class, the agent should have read and edit tools disabled except for the exact file in the prompt. No grep. No list_files. No test runner. The only way to explore is to ask for clarification and stop the run, which is the correct behavior when a task description claims to be concrete but isn't.

For the "figure it out" class, the full exploration toolkit is available, but the budget caps are tighter and more visible. The agent is told, in the system prompt, how many reads remain and what will happen when the budget expires. Research out of Google and UC Santa Barbara showed that adding explicit budget tracking to agents reduced search calls by forty percent and total cost by around thirty percent while preserving accuracy. The budget itself isn't the interesting lever — making the agent aware of it is.

Exploration budget caps and plan-then-act gating

Budget caps need to be enforced, not just communicated. A system prompt that says "try not to read too many files" is a wish, not a policy. The enforceable version lives in the tool layer: after N reads on a single task, the read tool starts returning a synthetic response that says "exploration budget exhausted — declare what you're doing next before you can read another file."
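A minimal sketch of that enforcement, assuming the orchestrator owns the read tool as a callable; the cap of eight and the refusal wording are placeholders:

```python
class BudgetedReader:
    """Wraps a read tool; after `cap` reads it returns a synthetic refusal
    instead of file contents, forcing the agent to declare its next step."""

    def __init__(self, read_tool, cap: int = 8):
        self._read = read_tool
        self._cap = cap
        self._used = 0

    def __call__(self, path: str) -> str:
        if self._used >= self._cap:
            return (
                "exploration budget exhausted -- declare what you are doing "
                "next before you can read another file"
            )
        self._used += 1
        return self._read(path)

    def grant(self, extra: int) -> None:
        """Called by the orchestrator once the agent commits to a plan."""
        self._cap += extra
```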

This forcing function maps cleanly to the plan-then-act pattern. Several coding-agent products have adopted an explicit two-mode separation where the agent must produce a plan — a written hypothesis about what it intends to change and why — before it can execute edits. The plan itself is the gate. Once committed, the agent's remaining tool calls are interpreted against the plan: reads that support the plan pass, reads that look like the agent is still fishing get throttled.
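A crude expression of the gate, reusing the budgeted reader above; matching the path against the plan text stands in for whatever relevance check you would actually run:

```python
def gate_read(path: str, plan: str | None, reader: BudgetedReader) -> str:
    """Interpret a read request against the committed plan.

    No plan yet: every read is refused until the agent states one.
    Plan names the file: the read passes through to the budgeted reader.
    Plan doesn't: the read looks like fishing and gets a nudge instead.
    """
    if plan is None:
        return "no plan committed -- state what you intend to change, and why"
    if path not in plan:
        return f"{path} is not in your plan; revise the plan before reading it"
    return reader(path)
```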

Plan-then-act is sometimes criticized as friction, and for simple tasks it is friction — which is why task-class routing comes first. The plan gate is not for "rename this variable." It's for the long tail of tasks where exploration is legitimate but tends to metastasize. The plan bounds the search space by making the agent commit before it has read enough to be confidently wrong.

The second enforceable mechanism is output-over-input pressure. Instruct the agent that its score on this task is "tokens in the diff divided by tokens read." This metric is degenerate in isolation (an agent that guesses the edit without reading anything scores arbitrarily high), but combined with a correctness check it pushes strongly against exploration bloat. Practitioners who have tuned agents this way report the model starts asking sharper questions up front instead of doing ten speculative reads and hoping one of them resolves the ambiguity.
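A guarded version of that score, with the correctness check folded in; the +1 in the denominator is there so the zero-read guess can't blow up the ratio:

```python
def efficiency_score(diff_tokens: int, read_tokens: int, passed_checks: bool) -> float:
    """Tokens shipped per token read, zeroed when the edit fails its checks."""
    if not passed_checks:
        return 0.0
    return diff_tokens / (read_tokens + 1)
```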

The eval-set anti-pattern that produces this behavior

The reason teams inherit over-exploratory agents from their training process is that the eval sets used during post-training are themselves selection-biased. Most coding benchmarks are built from tasks where exploration was the right answer: unfamiliar codebases, ambiguous requirements, bugs that require chasing through several files. The benchmarks reward "read twelve files before editing" because in those tasks, the twelfth file was load-bearing.

What the benchmarks almost never include is the balancing class: well-scoped tasks where exploration was waste. A trajectory that did nineteen reads to accomplish a two-line edit is not graded against a trajectory that did zero reads and got the same answer, because the test harness doesn't know that zero reads was sufficient. Both trajectories produced the correct diff, so both score a one.

This is how over-exploration gets locked in. The model isn't being rewarded for burning tokens; it's being rewarded for correctness on a distribution where token efficiency was never part of the grade. Fix the eval, and the behavior starts shifting. Concretely, that means adding task pairs where the same correct output can be produced by either a minimal trajectory or a bloated one, and grading the bloated trajectory as worse. Teams that have done this work report the effect size is larger than they expected — models will trade a few points of raw accuracy for a large reduction in exploration once the signal is in the loss.
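The harness change is small once paired trajectories exist. A hypothetical grading function (field names invented for the sketch) that makes correctness necessary but no longer sufficient:

```python
def grade_pair(minimal: dict, bloated: dict) -> dict:
    """Grade two trajectories that produced the same diff.

    Each trajectory: {"correct": bool, "read_tokens": int, "diff_tokens": int}.
    Among correct trajectories, fewer exploration tokens scores strictly higher.
    """
    def score(t: dict) -> float:
        if not t["correct"]:
            return 0.0
        return t["diff_tokens"] / (t["read_tokens"] + 1)

    return {"minimal": score(minimal), "bloated": score(bloated)}
```

A trajectory with nineteen reads and one with zero reads that ship the same two-line edit no longer tie, which is the whole point.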

If you're consuming a model rather than training one, you can approximate this by running both trajectories in your own eval harness and promoting the efficient one during prompt engineering. The exploration budget cap from the previous section is essentially a runtime substitute for the eval fix — it's how you compensate for a model that was trained against a benchmark that didn't care about cost.

The on-call signal to watch

The slow version of this failure doesn't show up as an incident. It shows up as a drift: the same task takes three times as many tokens this quarter as it did last quarter, and the completion quality didn't move. If you only track "did the agent finish" and "did users like it," you'll miss this entirely. If you track per-task-class token consumption over time, you'll see the drift weeks before it becomes a budget conversation.

The dashboard to put in front of on-call is this: for each task class, plot the ratio of tokens-in-output-diff to tokens-consumed-during-task over a rolling ninety-day window. When that ratio is trending down, your agent is learning — either from subtle prompt changes, upstream model updates, or retrieval drift — to explore more per unit of work produced. The fix is rarely a retrain. Usually it's that an abstraction leaked (a tool description got vaguer, a retrieval index got noisier, a router got disabled) and the agent is compensating by reading around it.
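A back-of-the-envelope version of that series, assuming you can export one event per completed task with a day index; the event schema is invented for the sketch:

```python
from collections import defaultdict
from statistics import mean

def efficiency_series(events: list[dict], window: int = 90) -> dict[str, list[float]]:
    """Per task class: rolling mean of diff-tokens / consumed-tokens per task.

    Each event: {"day": int, "task_class": str,
                 "diff_tokens": int, "consumed_tokens": int}.
    A downward trend means more exploration per unit of shipped work.
    """
    daily = defaultdict(lambda: defaultdict(list))
    for e in events:
        ratio = e["diff_tokens"] / max(e["consumed_tokens"], 1)
        daily[e["task_class"]][e["day"]].append(ratio)

    series = {}
    for cls, by_day in daily.items():
        per_day = [mean(by_day[d]) for d in sorted(by_day)]
        series[cls] = [mean(per_day[max(0, i - window + 1): i + 1])
                       for i in range(len(per_day))]
    return series
```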

The underlying architectural move is to treat exploration as a first-class resource with a budget, not an implicit side effect of how the agent happens to start a task. Teams that do this find that first-touch tool burn drops, latency improves, and the "why is this agent so expensive for what it does" question stops arriving on every quarterly business review. The agent doesn't get worse at its job when you put it on a diet. It gets faster, because the calories it was burning weren't nourishing the work.
