How Agents Teach Themselves: The Closed-Loop Self-Improvement Architecture
The most expensive part of training an agent isn't GPU time. It's the human annotators who label whether a multi-step task succeeded or failed. A single expert annotation of a long-horizon agentic trajectory (verifying that an agent correctly booked a flight, wrote a functional program, or filled out a legal form) can cost more than running thousands of inference calls. Closed-loop self-improvement is the architectural pattern that eliminates this bottleneck: replace human judgment with an automated verifier, then use that verifier to run the generate-attempt-verify-train cycle with no human in the loop. When done correctly, it works: a recent NeurIPS paper showed the pattern nearly doubled average task success rates across multi-turn tool-use environments, from 12% to 23.5%, without a single human annotation.
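The generate-attempt-verify-train cycle can be sketched in a few lines. This is an illustrative skeleton, not any particular paper's implementation: the `model`, `verifier`, `attempt`, and `train` interfaces are hypothetical stand-ins, and the filtering step shown here is rejection sampling (train only on verified successes), one of several ways to consume the verifier's signal.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One attempt at a task, labeled by the automated verifier."""
    task: str
    trajectory: list[str]  # the agent's actions / tool calls
    reward: float          # 1.0 = verified pass, 0.0 = fail


def closed_loop(model, verifier, tasks, iterations=3):
    """Run generate -> attempt -> verify -> train with no human labels.

    `model` needs .attempt(task) -> list[str] and .train(episodes);
    `verifier` maps (task, trajectory) -> bool. Both are hypothetical
    interfaces for the sketch.
    """
    for _ in range(iterations):
        episodes = []
        for task in tasks:
            trajectory = model.attempt(task)        # generate + attempt
            passed = verifier(task, trajectory)     # automated pass/fail
            episodes.append(Episode(task, trajectory, 1.0 if passed else 0.0))
        # Rejection sampling: fine-tune only on verified successes.
        # An RL variant would instead weight all episodes by reward.
        model.train([e for e in episodes if e.reward == 1.0])
    return model
```

The loop never consults a human: the verifier's pass/fail bit is the only training signal, which is exactly what makes the cycle cheap enough to run thousands of times.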
The key insight isn't that the model improves itself — it's that the verifier is free. Code execution returns a pass/fail signal deterministically, in milliseconds, at near-zero marginal cost. When your tasks have checkable outcomes, you can run thousands of training episodes per hour with ground-truth labels the model cannot fake (assuming your sandbox is designed correctly). That assumption is doing a lot of work, and we'll come back to it.
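A minimal version of such a verifier runs the candidate code plus a test snippet in a subprocess and maps the exit code to pass/fail. This sketch shows the shape of the signal only; a production sandbox needs real isolation (containers, syscall filtering, resource limits), and the timeout below is the bare minimum defense against runaway or adversarial code.

```python
import os
import subprocess
import sys
import tempfile


def verify(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Return True iff candidate_code passes test_code's assertions.

    Writes candidate + tests to a temp file and executes it in a fresh
    Python subprocess; a failed assertion or crash yields a nonzero
    exit code, which we read as a deterministic fail signal.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # kill hung or hostile code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # timeouts count as failures
    finally:
        os.unlink(path)
```

Because the signal comes from execution rather than a learned judge, the model cannot drift toward outputs that merely look correct; it can only game the verifier through sandbox escapes, which is the design assumption flagged above.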
