3 posts tagged with "release-engineering"

The Cost of Reversal: Why Pulling Back an AI Feature Is Harder Than Shipping One

· 10 min read
Tian Pan
Software Engineer

The release process you have was designed for a world where shipping is reversible and rollback is free. AI flips that. Once a feature has been live for a quarter, the disruption cost of pulling it back exceeds the disruption cost of launching it, and the loudest customer feedback you will ever get on that feature comes on the day you take it away, not the day it shipped.

The team builds a kill switch for every AI launch. Nobody ever pulls it. Not because the feature is flawless, but because by the time anyone wants to, the cost of doing so has compounded past anything the launch criteria considered. Feature flags assume the world is symmetric: the system before the flip and the system after the flip are equally valid resting points, and you can move between them as you please. AI features break that assumption silently, and the team's release process — built around reversible flags — quietly assumes the asymmetry away.

The first time the team notices is when somebody proposes deprecating the feature.

Your Agent Has Two Release Pipelines, Not One

· 10 min read
Tian Pan
Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.
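One way to picture the two pipelines is a small check that classifies a release by what it touches and routes it to different gates. This is a minimal sketch, not the team's actual tooling: `AgentRelease`, `classify_change`, and the gate names are all hypothetical, and it assumes a release can be summarized as a prompt string plus a set of tool names.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    BEHAVIOR = "behavior"    # a nudge to what the model says or decides
    AUTHORITY = "authority"  # a new capability the agent is allowed to use


@dataclass(frozen=True)
class AgentRelease:
    # Hypothetical summary of a release: prompt text plus registered tool names.
    prompt: str
    tools: frozenset


def classify_change(old: AgentRelease, new: AgentRelease) -> set:
    """Return every kind of change present in the diff between two releases."""
    kinds = set()
    if old.prompt != new.prompt:
        kinds.add(ChangeKind.BEHAVIOR)
    # Only newly added tools count here; this sketch ignores tool removals.
    if new.tools - old.tools:
        kinds.add(ChangeKind.AUTHORITY)
    return kinds


def required_gates(kinds: set) -> list:
    """Map change kinds to review gates. Gate names are illustrative only."""
    gates = []
    if ChangeKind.BEHAVIOR in kinds:
        gates += ["eval_suite", "canary"]
    if ChangeKind.AUTHORITY in kinds:
        gates += ["permission_review", "credential_scope_check"]
    return gates
```

Under these assumptions, the Wednesday PR described above would classify as both kinds at once, so the eval suite and canary alone would not be enough to clear it.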

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

· 9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.
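A consumer-driven contract for a prompt might look roughly like the sketch below. Everything here is an assumption made for illustration: `PromptContract`, `check_contract`, the label names, and the stand-in classifier are hypothetical, and a real setup would call the actual model behind the intent prompt rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptContract:
    # Published by the consuming team; names and fields are hypothetical.
    consumer: str
    allowed_labels: frozenset
    pinned_examples: tuple  # pairs of (input text, expected label)


def check_contract(classify: Callable[[str], str], contract: PromptContract) -> list:
    """Run the provider's classifier against one consumer's pinned examples."""
    violations = []
    for text, expected in contract.pinned_examples:
        got = classify(text)
        if got not in contract.allowed_labels:
            violations.append(f"{contract.consumer}: unknown label {got!r} for {text!r}")
        elif got != expected:
            violations.append(
                f"{contract.consumer}: expected {expected!r}, got {got!r} for {text!r}"
            )
    return violations


# Example contract a routing team might publish (illustrative values).
routing_contract = PromptContract(
    consumer="routing-agent",
    allowed_labels=frozenset({"refund", "shipping", "other"}),
    pinned_examples=(
        ("I want my money back", "refund"),
        ("Where is my package?", "shipping"),
    ),
)


def reworded_classifier(text: str) -> str:
    # Stand-in for the edited prompt; it misroutes refund requests.
    return "shipping"


violations = check_contract(reworded_classifier, routing_contract)
```

The provider would run every published contract before merging a prompt edit, so a change like the one above fails in review instead of paging three teams two hours later.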