
2 posts tagged with "tool-registry"


The Sandbox Your Agent Didn't Notice Was Real

· 10 min read
Tian Pan
Software Engineer

A team I know has a textbook staging setup. Read-only replicas of the production database. A mock Stripe account that pretends to charge cards. Synthetic users with fake email addresses on a domain nobody owns. The agent is asked to walk through an "account delinquent" escalation flow in staging, end to end, as part of a release rehearsal. The trace looks clean. The agent does what it is supposed to do.

Three minutes later, a real customer (a paying one, who churned six months ago and was still in a dormant export the developer had used to seed a test fixture) replies to a politely worded payment-overdue email. The "send_email" tool, registered next to a dozen other tools that all terminate in mocks, was wired to the production Mailgun key. The developer who set it up two sprints earlier had been iterating fast on email templates, and the sandbox tier capped them at five emails an hour, which broke the inner loop. So they swapped in the real key "just for the afternoon" and forgot. Nobody re-checked. The agent had no way to know.
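One way to catch this kind of drift is a startup check that refuses to run a staging agent if any registered tool points at a live backend. The sketch below is a minimal illustration, not the team's actual setup: the `Tool` record, the `backend` label, and the tool names other than `send_email` are hypothetical, and it assumes each tool declares honestly where it terminates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    backend: str  # hypothetical label: "mock", "sandbox", or "production"


def live_tools(registry, allowed=frozenset({"mock", "sandbox"})):
    """Return the names of tools whose backend is not allowed in staging."""
    return [tool.name for tool in registry if tool.backend not in allowed]


def assert_staging_safe(registry):
    """Raise before the agent runs if any tool terminates in a live system."""
    offenders = live_tools(registry)
    if offenders:
        raise RuntimeError(f"staging registry has live tools: {offenders}")


# Hypothetical registry mirroring the story: everything is mocked except email.
registry = [
    Tool("lookup_account", "mock"),
    Tool("charge_card", "sandbox"),
    Tool("send_email", "production"),
]
```

Calling `assert_staging_safe(registry)` on this registry raises and names `send_email`. The check is only as good as the labels, so in practice the label would need to be derived from the credential or endpoint actually in use rather than typed by hand.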

The Dead Tool Nobody Can Remove From the Registry

· 10 min read
Tian Pan
Software Engineer

A tool has been sitting in your shared agent catalog for fourteen months. It was wired up by an engineer who has since left, for a workflow that was sunset two reorgs ago, against a backend service whose owners are no longer sure who they are. The tool definition is 380 tokens. It ships in every system prompt for every agent in the org, on every turn, because nobody can prove it is unused, and the cost of being wrong about that proof is higher than the cost of carrying it forever.

That tool is the database column nobody dares drop. It is the cron job whose log file rotated out years ago. It is the dead code path you can grep for and find zero references to, except eval() exists and you cannot be sure. The agentic version of this problem is worse, because the carrying cost is not merely a few bytes on disk. It is paid in tokens, in selection accuracy, and in security surface, on every single inference your platform runs.
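To put a rough number on the token part of that cost, here is a back-of-the-envelope sketch. The 380-token definition comes from the example above; the traffic figure is a made-up assumption for illustration, not a measurement.

```python
def carrying_cost_tokens(definition_tokens: int, turns: int) -> int:
    """Tokens spent re-sending one tool definition across a number of turns."""
    return definition_tokens * turns


TOOL_TOKENS = 380        # size of the dead tool's definition, from the example
TURNS_PER_DAY = 100_000  # hypothetical org-wide agent turns per day

per_day = carrying_cost_tokens(TOOL_TOKENS, TURNS_PER_DAY)
per_year = per_day * 365

print(f"{per_day:,} tokens/day, {per_year:,} tokens/year")
```

Under that assumed traffic, one unused definition costs 38 million input tokens a day. The selection-accuracy and security costs do not reduce to a single multiplication like this, which is part of why they are easier to ignore.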