Skip to main content

The Internal Search Box You Replaced With an Agent Just Became Your SLO

· 11 min read
Tian Pan
Software Engineer

You retired the search bar on the company portal because the agent did the same job better. Type a question in plain English, get an answer with citations, refine with a follow-up. The pilot crushed the satisfaction metric. The rollout email said "deprecating legacy search, full cutover in two weeks." Two weeks passed. The old index was decommissioned. The query box was replaced with a chat input.

Six months later, on a Tuesday morning, three things happen at once. Your inference provider rate-limits the corporate account because somebody's batch job spiked the shared quota. The embedding service has a regional brownout. A config push clears the prompt cache. Every engineer in the company who used to type "vpn setup" or "expense policy" into a search bar instead watches a spinner for forty seconds, then gets a refusal that does not understand their question, or worse, a confident citation to a wiki page that does not exist. The Slack channel where employees ask each other things lights up. The IT inbox fills with "is search broken?"

The search bar you replaced had three nines of availability over a decade of small incremental improvements. The agent that replaced it has a different shape of failure — slow not down, wrong not empty, expensive not cached — and your SRE culture was not calibrated for it.

The Predecessor's SLO Migrates With the User

When you retire a feature, you do not retire the user expectation it set. A search bar that returned results in under 200 milliseconds for ten years did not just deliver speed; it taught every employee what "search" feels like inside your company. The model in their head — type, see results, move on — does not get a release note. It survives the cutover as a quiet expectation that the replacement is at least as good.

The replacement is not at least as good on the axes nobody measured. The pilot was graded on satisfaction. Satisfaction in a controlled pilot, with the inference provider healthy and the embedding cache warm, looks fantastic. The old search was graded on availability and latency, decade-old SLOs that lived in a runbook nobody opened because nothing ever broke. The new agent has no equivalent runbook because nothing has broken yet in the way SRE thinks about breakage.

This is the inheritance problem. The user's expectation is the contract, and the user does not care whether the replacement is technically a different system. They typed a question into a box and the box did not answer. That is the same failure as the old search going down, even though the underlying cause is a rate limit on a third-party API your SRE team has no authority to fix.

A practical move is an explicit SLO carry-over review whenever an agent replaces a deterministic feature. Walk the predecessor's SLI/SLO sheet, line by line, and decide for each one: does the agent meet it, can the agent be made to meet it, or do we publish a different commitment to the user? The third option is honest. The default — silently inheriting the contract without renegotiating it — is the one that bites.

The Failure Modes Have No Cell in Your Runbook

Classic search has a short failure taxonomy. The index is up or down. The query took too long or returned too few results. The crawler missed a page. Each row has a paging policy, a dashboard, an owner.

The agent fails in shapes that do not map. A response can be slow because the model is overloaded — not down, just degraded into a forty-second tail that exceeds the user's patience. A response can be wrong because the retrieval surfaced the right document but the generator paraphrased a clause backward. A response can be expensive because the user asked a follow-up that ballooned the context and the cost-per-query jumped tenfold without anyone noticing until the monthly bill arrived. A response can be a refusal because the safety classifier flagged a benign question about employment law. A response can be a hallucinated citation to a wiki page that was deleted last year and lived only in the embedding store the team forgot to re-index.

None of these have a row in the runbook the SRE team inherited from the search engineers. The classifications are different, the alerts are different, the mitigations are different. A spike in p99 latency for the old search meant the index needed more replicas. A spike in p99 latency for the agent might mean the provider is throttling you, or the prompt got longer because someone added a system instruction, or the model rolled to a new version that is slower on long contexts. The dashboards that worked for the old shape do not see the new one.

The work to do is not just a new dashboard. It is a new taxonomy, and someone has to write it. Take the agent's actual failure observations from the first thirty days in production, cluster them, name them, and give each cluster a default response — automatic mitigation if possible, paging policy if not. Without this taxonomy you have an on-call rotation responding to incidents whose categories do not yet exist, which is indistinguishable from no on-call.

The Fallback the Migration Plan Forgot

The right shape of a replacement is not "old thing off, new thing on." It is "new thing on, old thing degraded into a fallback path the user does not see until they need it." The teams who do this well keep the predecessor warm for longer than the migration plan said to. They wire the agent's confidence, latency, and cost into a routing layer that quietly punts to the previous search index when any of those exceed a threshold.

Three primitives carry most of this work — a token bucket per identity to prevent any one user or background job from exhausting the shared inference quota, a circuit breaker per failure pattern that flips to fallback when the agent's error rate or tail latency crosses a budget, and a declarative fallback chain that routes from primary model to cheaper model to semantic cache to the legacy index. None of these primitives are exotic. They are standard in any production system that depends on an external API. The team that ships an agent without them is operating as if the inference provider's status page is their SRE.

The fallback to the predecessor is the unusual move. It costs ongoing maintenance — the old index has to be kept fresh, the routing layer has to be tested, the dashboards have to show both paths. It is also the move that means a provider outage is invisible to most users instead of a company-wide IT incident. The math is straightforward. The cost of keeping the predecessor on standby for a year is much smaller than the cost of one half-day outage that produces a CEO email asking why nobody can find anything.

If keeping the predecessor warm is genuinely infeasible — the old infrastructure is decommissioned, the contract is gone, the team that knew how to operate it has moved on — at least build a stub fallback. A flat keyword index over the same corpus, served from a static endpoint, returning ten results and a banner that says "advanced search is temporarily unavailable." The bar is "users can still find the VPN setup page during an inference outage," not "feature parity in degraded mode."

Capacity Planning Has a New Multiplier

The old search served every employee, every day, at sub-cent unit cost because the index was yours and the compute amortized over a decade of optimization. The agent serves the same users at unit costs measured in cents to dollars per query, against a vendor whose quota was sized for your pilot, not your full rollout.

If the migration proposal did not include the math — the predecessor's queries per second, times the agent's per-query cost, times a multiplier for follow-up turns the conversational interface invites — the team has signed leadership up for a bill that grows with adoption. The first time this surfaces is usually a finance meeting where someone asks why the AI line item is up 8x quarter-over-quarter, and the answer is "rollout completed."

The honest capacity plan starts from the predecessor's traffic profile, not the pilot's. Sample a week of the old search logs. Project conversational expansion (users ask follow-up questions when given the chance — count on at least a 2x amplification in turns per task). Multiply by the per-turn cost at current provider pricing. Add a buffer for cache misses, retries, and the inevitable system prompt growth as product iterates. The number that falls out is the contract you need with your inference provider — not the pilot quota, not the friendly initial tier, but a committed-throughput agreement that survives a Tuesday morning when somebody else on the platform is burning their quota too.

The same math drives the fallback decision. If the per-query cost difference between the agent and the legacy index is two orders of magnitude, the routing layer should be biased toward serving short, common queries from the cheaper path. "vpn setup" does not need a generative answer. The user wants the page. The agent is the right answer for the long tail, not the head.

The Organizational Gap Is the Real Outage Surface

Product owns the migration metric. SRE owns the SLO. The gap between them is the agent's outage profile, and neither team budgeted for it.

This is how it goes wrong. Product runs the pilot, measures satisfaction, declares success, drives the cutover. SRE was not in the room for the pilot review because the pilot did not page anyone. The first incident hits and SRE is paged for a service whose dependency map they have never seen, whose failure modes are not in their on-call training, whose vendor relationship is owned by a procurement team they barely know. The post-incident review names "improve agent reliability" as an action item with no owner and a six-month timeline.

The fix is procedural, not technical. Before a feature retirement that hands user-facing reliability to a new system, the team owning the new system has to sign a deprecation parity milestone — agreed in advance, in writing — that the replacement meets or beats the predecessor on availability, latency, refusal rate, and unit cost for a stated window before the predecessor goes away. SRE signs it. Product signs it. Finance signs it. The signatures are the forcing function that converts the migration from a satisfaction project into a reliability project.

The window matters. A two-week parity check during a quiet release period does not catch the provider's monthly maintenance window, the quarterly index drift, or the seasonal traffic spike. Ninety days is the floor for any system whose dependencies you do not control. Six months is better. The cost of the predecessor running in parallel for two extra quarters is dwarfed by the cost of a botched cutover that the company remembers for years.

Agents Do Not Add Capability — They Replace Systems

The architectural realization is simple and uncomfortable. An agent that replaces a deterministic feature is not a new capability bolted onto the company's tech stack. It is a successor system, and a successor inherits the contract whether the team renegotiates it or not.

This reframes a lot of decisions. The pilot is not the end of due diligence, it is the start. The success metric is not satisfaction, it is the union of every metric the predecessor was graded on. The runbook is not the old runbook, it is a new one written from scratch against the agent's actual failure modes. The vendor relationship is not a procurement detail, it is the single largest determinant of your user-facing availability. The fallback path is not a future optimization, it is a launch requirement.

The teams that internalize this ship agents that survive their first bad Tuesday. The teams that do not ship agents that earn a postmortem, a finance meeting, and an executive who never quite trusts the next AI rollout. The gap between the two is not model quality. It is whether someone, before the predecessor was decommissioned, wrote down the contract the new system would have to honor — and then made the new system honor it.

References:Let's stay in touch and Follow me for more thoughts and updates