2 posts tagged with "ops"

The On-Call Rotation Your Agent Platform Forgot to Staff

11 min read
Tian Pan
Software Engineer

The AI platform team has four engineers. The internal agent they shipped seven months ago is now answering questions for 200 employees a day. For the first month the founding engineer answered every Slack ping personally — Tuesday at 11pm, Sunday morning, the night of the company offsite. Then she got promoted to staff engineer for the impact she had on adoption, and three weeks later she stopped checking the channel after 6pm because that is what staff engineers do. The on-call rotation that was supposed to replace her was never formalized, because the operating model was always going to be figured out "after the pilot."

The day the agent silently degrades for a quarter of users — a retrieval index that quietly fell behind, or a model version flip that shifted refusal behavior, or a tool whose schema rotated and is now returning empty arrays — the complaints do not land on the platform team's pager. They land in the help desk queue, staffed by people who do not have access to the agent's traces, do not know what a system prompt is, and have been told by IT that the agent is "owned by the AI team." Sixteen hours pass between the first user complaint and the first engineer who looks at a trace. Nobody on the platform team is asleep at the wheel; there is no wheel.

The Recurring Task Your Agent Scheduled With Nobody To Inherit

9 min read
Tian Pan
Software Engineer

A user types "remind me every Tuesday to check that integration." The agent creates a cron entry, returns a polite confirmation, and the session closes. Six months later the user has changed teams. The integration was deprecated last quarter. The cron is still firing: it authenticates with an API key that was rotated in April, posts into a Slack channel that was archived in May, and bills a project budget that nobody reviews. The agent did exactly what was asked. The asking is what aged badly.

This is not a bug in any particular agent. It is the shape of a category. The moment we gave agents the ability to schedule durable side effects — cron jobs, webhooks, polling loops, workflow triggers, calendar invites, recurring queries — we created a class of infrastructure that is born without a lifecycle. The create primitive is loud and easy. The delete primitive, the audit primitive, the inheritance primitive — they don't exist on equal footing, so they don't get used.

The cost is invisible until you go looking, which is exactly when nobody is looking.
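
To make the missing primitives concrete, here is a minimal sketch of what putting create, audit, and inheritance on equal footing might look like. Everything in it is hypothetical and for illustration only: the `ScheduledTask` record, the 90-day default lease, and the `create_task`, `audit`, and `inherit` helpers are assumptions, not the API of any particular agent framework or scheduler.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass
class ScheduledTask:
    """Hypothetical record an agent writes whenever it schedules a durable side effect."""
    task_id: str
    owner: str
    description: str
    created_at: datetime
    expires_at: datetime  # assumption: every task gets a finite lease, never "forever"

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at


def create_task(task_id: str, owner: str, description: str,
                now: datetime, lease_days: int = 90) -> ScheduledTask:
    """Create primitive: a task cannot exist without an owner and an expiry."""
    return ScheduledTask(task_id, owner, description, now,
                         now + timedelta(days=lease_days))


def audit(tasks: Iterable[ScheduledTask], active_owners: Set[str],
          now: datetime) -> List[ScheduledTask]:
    """Audit primitive: return tasks whose lease lapsed or whose owner is no longer active."""
    return [t for t in tasks if t.is_expired(now) or t.owner not in active_owners]


def inherit(task: ScheduledTask, new_owner: str,
            now: datetime, lease_days: int = 90) -> ScheduledTask:
    """Inheritance primitive: hand the task to a new owner and restart its lease."""
    task.owner = new_owner
    task.expires_at = now + timedelta(days=lease_days)
    return task
```

In this sketch, the Tuesday reminder from the opening would surface in `audit` either when its lease lapsed or when its owner dropped out of the set of active owners, and someone would have to call `inherit` or delete it. The design choice worth noting is that expiry is the default and renewal is the explicit act, rather than the reverse.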