Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap
A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.
The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.
This isn't a research curiosity anymore. Serving costs have dropped dramatically over the past two years, and the compute overhead of tree search has become economically viable for an expanding class of production workloads. What was a specialized technique for competitive programming is now a practical reliability primitive for engineering teams building AI code agents.
Why Greedy Generation Breaks on Hard Problems
When a language model generates code greedily, it selects the highest-probability token at each position, building the output from left to right. This works well when the correct solution is also the statistically most-likely solution — which is true for most common, well-represented coding patterns.
The problem is that hard problems often require solutions that look locally improbable. A dynamic programming approach to a graph problem might require counter-intuitive data structure choices early in the function body. A correct off-by-one fix might require the model to commit to a loop bound that looks "wrong" from the prior distribution. Greedy generation makes these commitments once and never revises them.
The benchmark data makes this concrete. On HumanEval — a collection of relatively straightforward Python function implementations — modern frontier models achieve 65–90% pass@1 with single-pass generation. On CodeContests — competitive programming problems where a single correct submission is required — the same models drop to 19–32% with greedy generation. The gap between easy and hard isn't model capability; it's search depth.
The Search Taxonomy
Not all search strategies are equal, and the right choice depends on your problem complexity and latency budget.
Greedy decoding is the baseline: one forward pass, deterministic, fast, and frequently wrong on hard problems.
Best-of-N sampling generates N independent candidates and ranks them. It's simple to implement and surprisingly effective — the key insight being that the best of ten mediocre candidates often beats any single one. The problem is waste: every sample gets the same compute budget regardless of whether it's clearly wrong by line two.
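As a sketch, best-of-N reduces to sample-then-rank. Here `generate` and `score` are hypothetical stand-ins for a temperature-sampled model call and an execution-based scorer; the toy run uses random floats as "candidates":

```python
import random

def best_of_n(generate, score, n=10):
    """Draw n independent candidates and return the top-ranked one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: candidates are random floats, higher score is better.
random.seed(0)
best = best_of_n(generate=lambda: random.random(), score=lambda c: c)
```

Note that every candidate costs the same to produce regardless of quality — exactly the waste described above.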
Beam search maintains K active hypotheses in parallel throughout generation, pruning at each step to keep only the most probable sequences. It's more compute-efficient than best-of-N for structured outputs, but it has a diversity problem. Beam search tends to produce sequences that are highly similar to each other — when you ask for ten beams, you often get minor token-level variations on the same approach. If the dominant strategy is wrong, all ten beams are wrong.
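A minimal beam-search sketch over a toy two-token vocabulary makes both the pruning step and the diversity problem concrete (`expand` stands in for the model's next-token distribution):

```python
import math

def beam_search(expand, start, width=3, steps=3):
    """Keep `width` hypotheses; expand all of them, then prune to the
    top-`width` by cumulative log-probability at every step."""
    beams = [(0.0, start)]
    for _ in range(steps):
        candidates = [(logp + tok_logp, seq + [tok])
                      for logp, seq in beams
                      for tok, tok_logp in expand(seq)]
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:width]
    return beams

# Toy vocabulary: "a" is always slightly likelier than "b".
expand = lambda seq: [("a", math.log(0.6)), ("b", math.log(0.4))]
beams = beam_search(expand, [], width=3, steps=3)
# Every survivor is a one-token variation on the dominant all-"a" path.
```

If the all-"a" strategy is wrong, all three beams are wrong — the failure mode described above.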
Monte Carlo Tree Search (MCTS) borrows from game-playing AI. Rather than maintaining a fixed beam, MCTS explores the solution space stochastically — selecting branches to expand based on a combination of estimated value and exploration bonus, then backpropagating outcomes to update those estimates. It's better at exploring diverse strategies than beam search, but it requires reliable value estimation and is sensitive to hyperparameter tuning.
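In miniature, the select–simulate–backpropagate loop looks like this. It uses UCB1 selection over a one-level tree; `rollout` stands in for generating and scoring a full candidate program, and the success probabilities are invented for illustration:

```python
import math
import random

def ucb_select(children, c=1.4):
    """Selection: favor branches with a high mean value plus an
    exploration bonus for under-visited branches (UCB1)."""
    total = sum(ch["visits"] for ch in children)
    return max(children, key=lambda ch: ch["value"] / ch["visits"]
               + c * math.sqrt(math.log(total) / ch["visits"]))

def mcts(actions, rollout, budget=20):
    """Tiny MCTS over a one-level tree of candidate strategies."""
    children = [{"action": a, "visits": 1, "value": rollout(a)}
                for a in actions]              # expand each branch once
    for _ in range(budget - len(children)):
        ch = ucb_select(children)              # selection
        ch["value"] += rollout(ch["action"])   # simulation
        ch["visits"] += 1                      # backpropagation
    return max(children, key=lambda ch: ch["visits"])["action"]

# Invented setup: the "dp" strategy passes tests 80% of the time,
# "brute" only 20%. MCTS concentrates its budget on "dp".
random.seed(1)
best = mcts(["dp", "brute"],
            lambda a: float(random.random() < (0.8 if a == "dp" else 0.2)))
```

The exploration bonus is what keeps the weaker branch from being abandoned prematurely — and tuning `c` is exactly the hyperparameter sensitivity noted above.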
Structured tree search with role-specialized agents is the current state of the art. Instead of treating code generation as a single sequence to optimize, systems like CodeTree decompose the task into distinct roles: a Thinker that explores different algorithmic strategies, a Solver that implements each strategy, a Debugger that refines implementations based on test results, and a Critic that scores nodes and guides expansion. The tree structure allows exploration of multiple approaches in parallel, not just multiple implementations of the same approach.
Scoring Partial Programs: What Signal Actually Works
The value of any tree search depends entirely on the quality of the signal used to score intermediate states. With bad scoring, you're just doing expensive random sampling.
Execution-based feedback is the strongest signal available. Running a partially implemented function against test cases — even if only a subset pass — provides direct evidence of semantic correctness that no static analysis can match. The typical reward schema is simple: penalize failing any test, reward passing all tests, with intermediate states scored proportionally. The critical requirement is a sandbox execution environment that can run untrusted code safely and return results in milliseconds.
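This reward schema fits in a few lines. In a real system the candidate would run in a sandboxed subprocess with a timeout, not in-process as it does in this sketch:

```python
def execution_reward(candidate_fn, tests):
    """Score a candidate against unit tests: -1 if nothing passes (or
    the code crashes on every input), +1 if everything passes, and the
    pass fraction for intermediate states."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failed tests
    if passed == len(tests):
        return 1.0
    if passed == 0:
        return -1.0
    return passed / len(tests)

# Toy suite for an addition function.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
full = execution_reward(lambda a, b: a + b, tests)   # 1.0
part = execution_reward(lambda a, b: a - b, tests)   # 1/3: one test passes
none = execution_reward(lambda a, b: a // 0, tests)  # -1.0: always crashes
```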
Syntax constraints are cheap and surprisingly useful. Filtering out syntactically invalid partial programs early — using something as simple as Python's codeop module — eliminates dead-end branches before they consume full-context inference budget. In the TreeCoder framework, combining syntax constraints with unit-test validation lifted pass rates from 32% to 61% on a standard benchmark, before adding more sophisticated search.
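Python's `codeop` module gives this check almost for free: `compile_command` returns a code object for complete input, `None` for input that is incomplete but still completable, and raises `SyntaxError` for input that can never become valid — exactly the distinction a branch filter needs:

```python
import codeop

def syntactically_viable(partial_source: str) -> bool:
    """Cheap branch filter: prune partial programs that are already
    irrecoverably invalid, before spending more inference budget."""
    try:
        codeop.compile_command(partial_source)
        return True   # complete, or incomplete but still completable
    except SyntaxError:
        return False  # dead branch: prune it
```

For example, `syntactically_viable("def f(x):")` is `True` (the body can still arrive), while `syntactically_viable("def f(:")` is `False`.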
LLM-as-judge scoring fills the gap when you don't have ground-truth test cases. A stronger model evaluating the reasoning quality of intermediate solutions provides a useful approximation — but it's noisier than execution feedback and adds inference cost per node evaluation. For production systems without comprehensive test suites, it's often worth investing in test generation (co-generating tests alongside code) rather than relying on LLM scoring alone.
Process reward models (PRMs) provide dense feedback at intermediate generation steps rather than only at completion. Instead of waiting until a full function is generated to evaluate quality, a PRM scores each reasoning step or code block, enabling early pruning of paths that have already diverged from correct reasoning. The empirical finding from SRA-MCTS is that self-generated reasoning traces with PRM guidance outperform distillation from much larger models — an 8B model with MCTS-guided reasoning improved by 5.52 points on average across code benchmarks, compared to 4.97 points for standard chain-of-thought.
BFS Beats DFS: The Counterintuitive Finding
One of the most practically important results from recent research is that breadth-first exploration consistently outperforms depth-first iterative refinement on hard code generation tasks.
The intuition behind iterative refinement is appealing: generate a solution, run the tests, identify what's broken, fix it, repeat. This matches how human engineers debug code, and it's the basis of most "self-healing" agent patterns. The problem is that when the initial solution strategy is fundamentally wrong, no amount of line-level debugging will fix it. You can't patch a recursive brute-force attempt into a dynamic programming solution one fix at a time.
BFS exploration, by contrast, commits to exploring multiple strategies at the first branch point. If one strategy is wrong, the search doesn't waste compute trying to salvage it — it has other branches to develop. The CodeTree paper quantifies this: on CodeContests, BFS-based tree exploration reached 43.0% pass@1, compared to 32.7% for resampling approaches that refine a single initial strategy. The gap widens further on problems that require non-obvious algorithmic choices.
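In skeleton form, strategy-first BFS looks like the following. `implement` and `score` are hypothetical stand-ins for a model call and execution feedback, with plain numbers standing in for candidate programs:

```python
def bfs_code_search(strategies, implement, score, beam=3, depth=2):
    """Branch across distinct strategies first, then refine level by
    level. Pruning happens across ALL branches, so a wrong strategy is
    abandoned rather than endlessly patched."""
    frontier = list(strategies)
    best = max(frontier, key=score)
    for _ in range(depth):
        children = [child for node in frontier for child in implement(node)]
        frontier = sorted(children, key=score, reverse=True)[:beam]
        best = max(best, frontier[0], key=score)
    return best

# Toy run: two starting "strategies"; refinements nudge scores up or down.
best = bfs_code_search([0.2, 0.5],
                       implement=lambda x: [x + 0.1, x - 0.05],
                       score=lambda x: x)
```

The weak starting strategy (0.2) is pruned away as soon as the strong one's children outscore it — no budget is spent salvaging it.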
The practical implication: if you're implementing retry logic for a code agent and it keeps failing, the problem is probably strategy-level, not implementation-level. Adding more refinement iterations without exploring different approaches is unlikely to help.
When the Compute Overhead Pays Off
Tree search multiplies your inference costs. A beam width of 10 requires roughly 10× the token generation budget. MCTS with a budget of 20 evaluations costs 20 model calls plus evaluation overhead. You need to be deliberate about when this trade is worth making.
The case for search is strongest when:
The problem has a high cost of wrong answers. If your code agent is generating production code that gets committed, reviewed, and deployed, the cost of a subtle bug significantly exceeds the cost of 10× more inference at generation time. The compute budget per task is fixed; the downstream engineering cost of debugging a wrong answer is not.
Execution feedback is available. Search without reliable scoring is expensive sampling. If you have unit tests or a sandbox environment that can validate candidate solutions quickly, tree search compounds the value of that infrastructure. Without it, you're scoring with noise, and the returns diminish rapidly.
The success probability improvement is large relative to the compute multiplier. The relevant metric is cost per successful output: cost per attempt divided by pass rate. If search boosts your pass rate from 30% to 45% at a 10× compute multiplier, each success costs roughly 6.7× more — an overhead that only makes sense when the downstream cost of a failure is high. If it boosts from 30% to 33%, you're paying nearly the full 10× for almost nothing.
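The arithmetic is one line — expected cost per successful output is cost per attempt divided by pass rate. Running the numbers for a hypothetical 10× search budget:

```python
def cost_per_success(cost_per_attempt, pass_rate):
    """Expected spend to obtain one passing solution."""
    return cost_per_attempt / pass_rate

baseline = cost_per_success(1.0, 0.30)    # ~3.3 units per success
searched = cost_per_success(10.0, 0.45)   # ~22.2 units per success
overhead = searched / baseline            # ~6.7x more per success
# That extra ~6.7x only pays off when a failure's downstream cost
# (debugging, review churn, incidents) exceeds the inference premium.
```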
You can tolerate the latency. Each exploration step adds latency. For interactive code completion where users expect sub-second response, tree search is a non-starter. For autonomous agents running batch tasks — issue resolution, test generation, code review — adding several seconds of exploration is usually fine.
Conversely, skip the search overhead when:
- Problems are simple and your baseline pass rate is already above 80%
- You're doing real-time completion and latency dominates
- You don't have reliable execution feedback and can't build it cheaply
- The downstream consequence of a wrong answer is low (user can trivially retry)
Production Deployment Patterns
The compute cost of tree search is falling rapidly. Inference pricing has dropped by roughly 100× in two years, and hardware improvements (quantization, continuous batching, speculative decoding) reduce effective per-token costs by 10–16× compared to naive deployment. What was prohibitively expensive in 2023 is economically viable for production agent workloads in 2026.
A few practical patterns for deploying search-augmented code agents:
Adaptive search budgets. Don't run full tree search on every task. Classify incoming problems by estimated difficulty — using a lightweight heuristic model or problem description analysis — and allocate search budget accordingly. Easy problems get greedy generation. Medium problems get best-of-5. Hard problems get full tree search with execution feedback. This keeps your average cost manageable while improving reliability where it matters.
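A router along these lines might look like the following sketch; the thresholds and the `estimate_difficulty` classifier are placeholders to tune against your own traffic:

```python
def route(problem, estimate_difficulty, strategies):
    """Tiered budget allocation: cheap decoding for easy problems,
    full tree search only where the difficulty estimate warrants it."""
    difficulty = estimate_difficulty(problem)   # e.g. a small classifier
    if difficulty < 0.3:
        return strategies["greedy"](problem)
    if difficulty < 0.7:
        return strategies["best_of_5"](problem)
    return strategies["tree_search"](problem)

# Stub strategies that just report which tier handled the problem.
strategies = {name: (lambda n: (lambda p: n))(name)
              for name in ("greedy", "best_of_5", "tree_search")}
tier = route("reverse a string", lambda p: 0.1, strategies)  # "greedy"
```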
Modular constraint composition. Hardcoding beam width or MCTS parameters into your agent loop creates brittleness. Instead, treat decoding strategies and scoring constraints as composable components: a syntax checker, a test runner, an LLM scorer, a beam width setting. This lets you A/B test configurations and adapt to new problem types without reimplementing the search logic. Automated Bayesian optimization over these configurations can find combinations (like SMC + unit-tests) that outperform manually tuned MCTS variants.
Invest in test co-generation. The limiting factor for execution-based search is usually test quality, not generation quality. If your problems don't come with test cases, co-generate tests alongside code — have the agent produce both, then use execution against the tests as the scoring signal. Research on dual code-test generation shows this approach produces more reliable feedback loops than using LLM-as-judge as the primary scoring mechanism.
Log failed branches, not just final outputs. The standard agent instrumentation pattern captures inputs and outputs. For search-based agents, the failed branches are equally valuable — they tell you which strategies the model tried, where pruning happened, and why. This data is essential for diagnosing cases where search itself failed (e.g., all branches explored a wrong algorithmic family) versus cases where the scoring function gave bad signal.
The Reliability Primitive You're Missing
The benchmark evidence is now extensive enough to treat tree search as a reliability primitive rather than a research technique. On hard code generation tasks, execution-feedback-guided tree exploration delivers 30–130% improvement in pass rates over greedy single-pass generation, with no model changes required.
The architectural principle is straightforward: greedy generation optimizes for local probability. Search optimizes for global correctness. When local probability correlates well with global correctness — which it does for simple, common problems — greedy is fine. When they diverge — which they do for complex, novel problems — search is the only reliable path to correct outputs.
As autonomous coding agents take on longer-horizon tasks (multi-file changes, cross-repository refactors, test suite generation), the class of problems where greedy fails expands. The teams that treat search as a core component of their inference stack, not an optional enhancement, will end up with agents that are genuinely reliable on the work that matters — not just on the benchmarks designed to be solvable in a single shot.
- https://arxiv.org/abs/2401.08500
- https://arxiv.org/html/2411.04329v2
- https://aclanthology.org/2025.naacl-long.189/
- https://arxiv.org/html/2411.11053v1
- https://arxiv.org/html/2511.22277v1
- https://arxiv.org/abs/2407.01476
- https://arxiv.org/abs/2409.09584
- https://arxiv.org/html/2412.13464
- https://arxiv.org/html/2412.00154v1
- https://dev.to/ag2ai/reasoningagent-update-beam-search-mcts-and-lats-for-llm-reasoning-3im1
