Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap
A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.
The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.
