Skip to main content

Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap

· 11 min read
Tian Pan
Software Engineer

A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.

The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.

This isn't a research curiosity anymore. Serving costs have dropped dramatically over the past two years, and the compute overhead of tree search has become economically viable for an expanding class of production workloads. What was a specialized technique for competitive programming is now a practical reliability primitive for engineering teams building AI code agents.

Why Greedy Generation Breaks on Hard Problems

When a language model generates code greedily, it selects the highest-probability token at each position, building the output from left to right. This works well when the correct solution is also the statistically most-likely solution — which is true for most common, well-represented coding patterns.

The problem is that hard problems often require solutions that look locally improbable. A dynamic programming approach to a graph problem might require counter-intuitive data structure choices early in the function body. A correct off-by-one fix might require the model to commit to a loop bound that looks "wrong" from the prior distribution. Greedy generation makes these commitments once and never revises them.

The benchmark data makes this concrete. On HumanEval — a collection of relatively straightforward Python function implementations — modern frontier models achieve 65–90% pass@1 with single-pass generation. On CodeContests — competitive programming problems where a single correct submission is required — the same models drop to 19–32% with greedy generation. The gap between easy and hard isn't model capability; it's search depth.

The Search Taxonomy

Not all search strategies are equal, and the right choice depends on your problem complexity and latency budget.

Greedy decoding is the baseline: one forward pass, deterministic, fast, and frequently wrong on hard problems.

Best-of-N sampling generates N independent candidates and ranks them. It's simple to implement and surprisingly effective — the key insight being that the best of ten mediocre candidates often beats any single one. The problem is waste: every sample gets the same compute budget regardless of whether it's clearly wrong by line two.

Beam search maintains K active hypotheses in parallel throughout generation, pruning at each step to keep only the most probable sequences. It's more compute-efficient than best-of-N for structured outputs, but it has a diversity problem. Beam search tends to produce sequences that are highly similar to each other — when you ask for ten beams, you often get minor token-level variations on the same approach. If the dominant strategy is wrong, all ten beams are wrong.

Monte Carlo Tree Search (MCTS) borrows from game-playing AI. Rather than maintaining a fixed beam, MCTS explores the solution space stochastically — selecting branches to expand based on a combination of estimated value and exploration bonus, then backpropagating outcomes to update those estimates. It's better at exploring diverse strategies than beam search, but it requires reliable value estimation and is sensitive to hyperparameter tuning.

Structured tree search with role-specialized agents is the current state of the art. Instead of treating code generation as a single sequence to optimize, systems like CodeTree decompose the task into distinct roles: a Thinker that explores different algorithmic strategies, a Solver that implements each strategy, a Debugger that refines implementations based on test results, and a Critic that scores nodes and guides expansion. The tree structure allows exploration of multiple approaches in parallel, not just multiple implementations of the same approach.

Scoring Partial Programs: What Signal Actually Works

The value of any tree search depends entirely on the quality of the signal used to score intermediate states. With bad scoring, you're just doing expensive random sampling.

Execution-based feedback is the strongest signal available. Running a partially implemented function against test cases — even if only a subset pass — provides direct evidence of semantic correctness that no static analysis can match. The typical reward schema is simple: penalize failing any test, reward passing all tests, with intermediate states scored proportionally. The critical requirement is a sandbox execution environment that can run untrusted code safely and return results in milliseconds.

Syntax constraints are cheap and surprisingly useful. Filtering out syntactically invalid partial programs early — using something as simple as Python's codeop module — eliminates dead-end branches before they consume full-context inference budget. In the TreeCoder framework, combining syntax constraints with unit-test validation lifted pass rates from 32% to 61% on a standard benchmark, before adding more sophisticated search.

LLM-as-judge scoring fills the gap when you don't have ground-truth test cases. A stronger model evaluating the reasoning quality of intermediate solutions provides a useful approximation — but it's noisier than execution feedback and adds inference cost per node evaluation. For production systems without comprehensive test suites, it's often worth investing in test generation (co-generating tests alongside code) rather than relying on LLM scoring alone.

Process reward models (PRMs) provide dense feedback at intermediate generation steps rather than only at completion. Instead of waiting until a full function is generated to evaluate quality, a PRM scores each reasoning step or code block, enabling early pruning of paths that have already diverged from correct reasoning. The empirical finding from SRA-MCTS is that self-generated reasoning traces with PRM guidance outperform distillation from much larger models — an 8B model with MCTS-guided reasoning improved by 5.52 points on average across code benchmarks, compared to 4.97 points for standard chain-of-thought.

BFS Beats DFS: The Counterintuitive Finding

One of the most practically important results from recent research is that breadth-first exploration consistently outperforms depth-first iterative refinement on hard code generation tasks.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates