
The CI Agent With Merge Rights at 3 AM

Tian Pan
Software Engineer

A flaky test gets quarantined at 3:17 AM. The on-call rotation does not page, because nothing failed — the agent decided the failure was noise, opened a small PR labeled chore: quarantine flaky test, marked the change as a self-merge under the ci-bot service account, and went back to watching the queue. Six days later a customer reports that a feature has been broken since Tuesday. The test was not flaky. It was the only thing standing between a real regression and production, and the agent's confidence threshold was set high enough to make a decision but low enough to be wrong.

This is the part of agentic CI that the marketing decks skip. Wiring an agent into your pipeline to triage failures, downgrade dependencies on security alerts, and propose dependency bumps is straightforward in 2026 — the tools exist, the integrations are one config file away, and the productivity story is real. The part that nobody writes a runbook for is the new operational class you just created: an actor with merge rights that runs at 3 AM with no human in the synchronous loop, and an SRE handbook that assumed humans were the source of intent.

The SRE design pattern that built every modern on-call rotation is "humans approve, services execute." A human decides to deploy, a human acknowledges the page, a human writes the postmortem; the service is the thing that does what it is told. Drop an agent into CI with the authority to land changes and that pattern dissolves quietly, because the agent is neither a service (it has intent) nor a human (it does not sleep, get paged, or feel pressure). It is something the runbook has no slot for, and the absence of that slot is where the failures hide.

The runbook never named this actor

Open any mature SRE handbook and you will find two well-developed lifecycles. The service lifecycle covers what gets deployed, who owns it, what its SLO is, what page fires when it breaks, and who is on the rotation. The human-triggered job lifecycle covers what an engineer can do under approval, what requires a second pair of eyes, and what is forbidden outside a change window. Both lifecycles assume that the entity initiating an action is either a deterministic process or a human with accountability.

An unattended CI agent satisfies neither. It originates intent the way a human does — it decides which test is flaky, which dependency bump is safe, which alert is real — but it does so at the cadence and silence of a service. The runbook conventions for both classes break against it. Service runbooks assume the action is mechanical and can be inferred from inputs; agent actions cannot, because the input was a noisy log and the output was a judgment. Human-job runbooks assume an approver who can be paged for ambiguity; the agent has nobody to ask at 3 AM, so it picks.

The failure mode is not that the agent is reckless. The failure mode is that there is no place in the org chart, the IAM model, or the incident taxonomy where this actor sits. When something goes wrong, the question "who owned the action" produces an awkward pause. The platform team owns the CI infrastructure. The AI team configured the agent. Neither runbook says what to do when the agent did something wrong at 3 AM, and the lack of a designated owner is the first sign that you have built a tier-zero system without writing its reliability story.

A typed authorization layer is the missing primitive

The fix is not "add more guardrails." It is to formalize the actor. The agent needs an authorization layer that types its actions the way an IAM system types a service account's permissions, with the additional dimension that "type" is not just resource but reversibility and blast radius.

A workable shape looks like this. Each action class the agent can perform (quarantine a test, open a PR, merge a PR, downgrade a dependency, post a comment, run a workflow) gets classified into a tier. Tier one is read-only, so there is nothing to roll back; the agent runs autonomously. Tier two is mutable but cheap to roll back, like opening a PR or posting a comment; the agent acts but the change is staged and labeled. Tier three is anything that lands on main, modifies CI configuration, touches secrets, or affects deployment surfaces; the agent produces an artifact awaiting human approval, even at 3 AM, with the approval window matched to the action's urgency.

The point of the typed layer is that "what the agent can do unattended" is no longer encoded across a dozen scripts, GitHub Actions YAML files, and a service account's PAT scopes. It is one policy, reviewed like security policy, owned by the team that owns the agent. Adding a new tool to the agent's hand means choosing its tier explicitly, with an approver who can name the blast radius and who signs off whenever an action is moved below the default-deny tier. The default for any new action is the highest tier, not the lowest.
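As a rough illustration of "one policy, default to the highest tier," here is a minimal Python sketch. The action names and their tier assignments are assumptions made for the example (including treating a test quarantine as tier three because it lands on main); a real policy would list the agent's actual tools.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTONOMOUS = 1      # read-only; the agent runs unattended
    STAGED = 2          # mutable but cheap to roll back; staged and labeled
    HUMAN_APPROVAL = 3  # lands on main, CI config, secrets, deploy surfaces


# Hypothetical action names and assignments, for illustration only.
POLICY: dict[str, Tier] = {
    "read_logs": Tier.AUTONOMOUS,
    "open_pr": Tier.STAGED,
    "post_comment": Tier.STAGED,
    "merge_pr": Tier.HUMAN_APPROVAL,
    "quarantine_test": Tier.HUMAN_APPROVAL,
    "downgrade_dependency": Tier.HUMAN_APPROVAL,
}


def tier_for(action: str) -> Tier:
    """Return the tier for an action; anything not listed gets the highest tier."""
    return POLICY.get(action, Tier.HUMAN_APPROVAL)
```

The lookup is the whole trick: a tool nobody classified falls through to `Tier.HUMAN_APPROVAL`, so forgetting to update the policy fails closed rather than open.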

This sounds heavyweight. It is not — it is the same discipline that production IAM has had for a decade, applied to the new actor. The reason it feels new is that nobody on the platform team has IAM authority over the AI team's agents, and nobody on the AI team has the SRE muscle to write tiered policies. The discipline lives in the gap between them, and the gap is where the agent runs unsupervised.

Dry-run as a first-class mode, not a flag

The second primitive the agent needs is a dry-run mode that is structurally different from "run with a flag set." Dry-run for a tier-three action should produce a queued artifact — a PR description, a diff, a "I would have merged this" entry in the action ledger — that a morning review can approve, reject, or edit. The agent should not be the one deciding whether it is in dry-run; the tier of the action should decide, and the tier policy should be enforced at the tool layer, not by trusting the agent to respect a flag.

Why this matters: a flag-based dry-run trusts the agent to read its own configuration correctly. The same agent that just decided a test was flaky is now being asked to decide whether it is allowed to act on that decision. Treat dry-run as a property of the action class, not the agent's mental state, and the failure modes collapse into something testable. The agent always tries to do the thing; the tool layer intercepts based on the tier and either executes or queues. The audit trail is identical in both cases, which means morning review sees the same evidence that 3 AM execution would have produced.
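One way to picture tool-layer enforcement is a small gateway that every tool call passes through. This is a sketch under assumptions, not a reference design: `ToolGateway`, its fields, and the integer tiers are hypothetical names invented here. The agent always calls the tool; the gateway decides whether the call executes or is queued, and writes the same audit record in both cases.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolGateway:
    """Sits between the agent and its tools; the agent never sees a dry-run flag."""
    tiers: dict[str, int]                       # action name -> tier (1, 2, or 3)
    queue: list[dict] = field(default_factory=list)
    audit: list[dict] = field(default_factory=list)

    def call(self, action: str, run: Callable[[], str], evidence: str) -> str:
        tier = self.tiers.get(action, 3)        # unclassified actions are tier 3
        outcome = "queued" if tier >= 3 else "executed"
        # Same record shape either way, so morning review sees the evidence
        # that unattended execution would have produced.
        record = {"action": action, "tier": tier,
                  "evidence": evidence, "outcome": outcome}
        self.audit.append(record)
        if outcome == "queued":
            self.queue.append(record)           # held for human review
            return "queued"
        return run()                            # tiers 1 and 2 execute now
```

Because the decision lives in `call` rather than in the agent's prompt or configuration, it can be unit-tested like any other access-control check.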

The morning review is the other half. If queued artifacts pile up without a named reviewer, you have rebuilt the on-call problem inside a Jira board. Pair the dry-run queue with a daily standup item or a paged review slot, and make the review SLA part of the policy: an action queued at 3 AM gets a decision by 10 AM, or it expires. The expiry matters — without it, the agent gets blocked on a slow human, and the team's productivity story regresses to manual.
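The expiry rule is simple enough to write down. The sketch below assumes a 10:00 cutoff as in the example above and assumes that `queued_at` and `now` share the same timezone handling; `review_deadline` and `status` are illustrative names, not part of any existing tool.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def review_deadline(queued_at: datetime, cutoff: time = time(10, 0)) -> datetime:
    """First occurrence of the cutoff time strictly after the action was queued."""
    deadline = datetime.combine(queued_at.date(), cutoff, tzinfo=queued_at.tzinfo)
    if deadline <= queued_at:
        deadline += timedelta(days=1)   # queued after today's cutoff: use tomorrow's
    return deadline


def status(queued_at: datetime, now: datetime, decision: Optional[str] = None) -> str:
    """A recorded human decision wins; otherwise the action is pending or expired."""
    if decision is not None:
        return decision
    return "expired" if now >= review_deadline(queued_at) else "pending"
```

An action queued at 3:17 AM stays `"pending"` until 10:00 the same day and reads as `"expired"` from then on unless a reviewer has recorded a decision.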

The action ledger has to distinguish who acted

The third primitive is an action ledger that separates agent-originated changes from human-originated ones, cleanly enough that a postmortem can scope the blast radius without ambiguity. This is not a nice-to-have. When a regression ships and you are trying to figure out what to revert, "who committed this" is the first question, and "ci-bot under service account X" is not an answer — it is a deferral. You need to know which agent session, with which model version, acting on which prompt, with which input context, produced the change.

The pragmatic version of this is an append-only log with a row per agent action, capturing the prompt hash, the model identifier, the tool calls made, the tier of each tool call, the human approver (or "unattended"), and the resulting artifact's identifier. SIEM integration is useful but secondary; what matters first is that the postmortem can answer the scoping question in one query rather than three days of forensics.

The attribution discipline matters for a less obvious reason: when the agent acts under a human's identity — using a developer's PAT, a stored OAuth token, a delegated credential — the audit trail records the human as the actor, and accountability collapses. The CI agent should have its own identity, scoped narrowly, and every action should record the agent identity plus the human who authorized the agent to act, never one masquerading as the other. The day a developer is paged because the audit log says they pushed something they never saw is the day the trust in the system breaks.
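A ledger row along these lines could be sketched as a frozen dataclass appended to a JSON Lines file. The field names, the file format, and the helper functions are all assumptions for illustration; opening a file in append mode is not a real append-only guarantee, which would need storage-level controls.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LedgerRow:
    agent_id: str        # the agent's own identity, never a developer's
    authorized_by: str   # the human who authorized the agent to act
    session_id: str
    model: str           # model identifier and version
    prompt_sha256: str
    tool_calls: tuple    # (tool name, tier) pairs
    approver: str        # a human identity, or "unattended"
    artifact_id: str     # e.g. the PR or commit the action produced
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def append_row(path: str, row: LedgerRow) -> None:
    # One JSON object per line; rows are only ever added, never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(row)) + "\n")


def rows_for_artifact(path: str, artifact_id: str) -> list:
    """The postmortem scoping question as a single query."""
    with open(path, encoding="utf-8") as f:
        return [r for r in map(json.loads, f) if r["artifact_id"] == artifact_id]
```

Keeping `agent_id` and `authorized_by` as separate required fields means a row cannot be written with one identity standing in for the other.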

The kill switch nobody remembered to wire

Every agent runbook needs a freeze. Not "open a PR to disable the workflow," which assumes someone has 20 minutes and a clear head at 3 AM. A literal switch — a CLI command, a button, an authenticated endpoint — that revokes the agent's tool permissions, halts queued actions, locks the deployment surface, and pages whoever owns the agent, in under five minutes, executable by an SRE who did not build the agent and does not know its internals.
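The shape of that switch can be sketched in a few lines. Everything here is hypothetical: `AgentControl` stands in for whatever actually holds the agent's permissions and queue, and `page` is a placeholder for a real paging integration. The point of the sketch is that the freeze is one idempotent call that needs no knowledge of the agent's internals.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentControl:
    page: Callable[[str], None]                   # hypothetical pager hook
    permissions: set = field(default_factory=set)
    queued: list = field(default_factory=list)
    halted: list = field(default_factory=list)    # kept for later review
    deploy_locked: bool = False
    frozen: bool = False

    def freeze(self, invoked_by: str, reason: str) -> None:
        """Idempotent: safe to run twice during a confusing incident."""
        if self.frozen:
            return
        self.permissions.clear()                  # revoke tool permissions
        self.halted.extend(self.queued)           # halt queued actions
        self.queued.clear()
        self.deploy_locked = True                 # lock the deployment surface
        self.frozen = True
        self.page(f"agent frozen by {invoked_by}: {reason}")

    def allowed(self, tool: str) -> bool:
        return not self.frozen and tool in self.permissions
```

Halted actions are moved aside rather than deleted, so the postmortem can still see what the agent was about to do.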

The hard part of the kill switch is not building it. The hard part is that nobody remembers it exists when the incident is happening, because the incident does not look like an agent incident — it looks like a regression, a flaky test situation, a dependency problem. The pattern most teams discover the wrong way is that the agent was contributing for three days before anyone connected the symptoms to the actor. By then the kill switch is theoretical; the actions that needed stopping have already shipped.

Two practices help. First, every agent runbook includes the freeze procedure in the same place service runbooks include the rollback procedure: same template, same drill cadence, same SRE on the rotation knowing where to find it. Second, the agent's actions are tagged in incident channels and dashboards loudly enough that the connection is one click, not a forensic trail. When an SRE sees "ci-bot quarantined 3 tests in the last hour" in the same view as "test failure rate up 12%," the diagnosis happens before the incident does.

The quiet failure mode is months long

The dramatic version of agent failure — the database deletion, the production outage — is easy to talk about because it is loud, and the loud failures will get the runbook treatment first. The quieter failure mode is more dangerous and more common: the agent has been suppressing a real bug class for months, quarantining tests that flagged it, marking dependency security alerts as low-impact, classifying a recurring failure as "transient infra noise" because the embeddings looked like prior transient infra noise. The team finds out when the bug class produces a customer-visible incident, and the lookback is six months of agent decisions, each individually defensible.

This failure mode is invisible to most monitoring because each decision is well-formed. The dashboards show that the agent is operating within bounds, the eval set says its judgment is accurate, the action ledger shows actions that look reasonable. The thing the dashboards do not show is the cumulative weight of decisions that were each within bounds but together suppressed signal.

The defense is to instrument the agent's decision distribution and watch it as a signal in its own right. If the agent quarantined 30 tests last quarter and 90 this quarter, that is a question, not a number on a dashboard. If the dependency-bump-rejection rate doubled, ask why. The agent is one of the inputs to your reliability story now, and treating it as a black box that operates within its policy is how you discover the suppression pattern in a customer ticket instead of a metric.
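A first pass at watching the decision distribution does not need anything sophisticated. The sketch below compares per-kind decision counts between two periods and flags any kind that grew by a chosen ratio; the function name, the decision labels, and the default ratio of 2.0 are assumptions for illustration, not a recommended threshold.

```python
from collections import Counter


def decision_drift(previous: list, current: list, ratio: float = 2.0) -> dict:
    """Map each decision kind that grew by at least `ratio` to (previous, current) counts.

    A kind absent from the previous period is compared against a floor of 1.
    """
    before, after = Counter(previous), Counter(current)
    flagged = {}
    for kind, now in after.items():
        prior = before.get(kind, 0)
        if now >= ratio * max(prior, 1):
            flagged[kind] = (prior, now)
    return flagged


last_quarter = ["quarantine_test"] * 30 + ["reject_bump"] * 10
this_quarter = ["quarantine_test"] * 90 + ["reject_bump"] * 12
print(decision_drift(last_quarter, this_quarter))
# {'quarantine_test': (30, 90)}
```

The output is a prompt for a human question ("why did quarantines triple?"), not a verdict; the counts alone cannot say whether the agent or the codebase changed.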

The architectural realization is uncomfortable

The moment an agent has commit-or-deploy authority without a human in the synchronous loop, you have built a tier-zero system whose reliability story has to be written from scratch. The runbook for an unattended actor with intent is not a small extension of the runbook for a service; it is a new category that needs its own ownership, its own policies, its own paging tier, its own postmortem template, and its own retirement process.

The teams that survive this are the ones that name the actor, type its actions, queue what cannot be unattended, log every action with attribution, and rehearse the kill switch on a cadence that matches the runbook drills they already do for production services. The teams that bolt the agent onto existing CI and trust that the service-account permissions will hold are the ones who will have a bad night, and the bad night will look like a regular regression for the first hour, which is exactly how long it takes for the actual damage to compound.

Wiring the agent in took an afternoon. Writing the reliability story for the actor it introduced is the work the afternoon implied — and the team that has not started writing it is one quiet decision away from finding out that "the agent did the right thing within its policy" is not the same as "the right thing happened."
