Your AI Product Needs an SRE Before It Needs Another Model
The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.
The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.
The "ship a better prompt" loop is a trust fall with production
AI product teams tend to organize around the generative surface: prompts, evals, retrieval quality, tool schemas. This is reasonable work. It is also the work that makes every production incident feel novel. A prompt change from last Thursday is suddenly the leading explanation for a Monday latency regression. Nobody is sure, because nobody logged it. Nobody logged it because the team that edits prompts is not the team that owns incident response, and there is no incident response team at all.
One of the most consistent findings from large-scale LLMOps surveys is that prompt drift — the unchecked accumulation of small edits — is the single most common source of production regressions, exceeding both model failures and infrastructure issues. This is a purely operational failure. The prompts themselves are not the problem. The problem is that prompts are being treated as code without any of code's discipline: no review, no rollback, no change log, no linkage to telemetry. A traditional software team would consider this an obvious missing system. An AI team often considers it a feature: "we can iterate quickly."
They can iterate quickly until they can't. The moment the first customer complaint lands on a behavior that nobody can reproduce, the team discovers they have no way to correlate a prompt edit, a deployment, and a user-facing regression. Every incident becomes a multi-hour archaeology dig.
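The missing system is not exotic. Here is a minimal sketch of a prompt change log, assuming an append-only JSONL file as the store; the `PromptChange` fields and `record_change` helper are illustrative names, not any particular framework's API:

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical change record: just enough metadata to correlate a prompt
# edit with a deployment and, later, with a telemetry anomaly.
@dataclass
class PromptChange:
    prompt_id: str      # e.g. "support-agent/system"
    content_hash: str   # short sha256 of the prompt text
    author: str
    reviewed_by: str    # an empty reviewer is rejected, like unreviewed code
    deployed_at: float  # unix timestamp, filled at deploy time
    note: str           # one line on why this edit exists

def record_change(log_path: str, prompt_id: str, text: str,
                  author: str, reviewed_by: str, note: str) -> PromptChange:
    """Append a reviewed prompt edit to an append-only JSONL log."""
    if not reviewed_by:
        raise ValueError("prompt edits need a reviewer, like any code change")
    change = PromptChange(
        prompt_id=prompt_id,
        content_hash=hashlib.sha256(text.encode()).hexdigest()[:12],
        author=author,
        reviewed_by=reviewed_by,
        deployed_at=time.time(),
        note=note,
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(change)) + "\n")
    return change
```

Tag every completion request with the active `content_hash`, and "which prompt was live when the Monday regression started?" becomes a log query instead of a dig.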
What an SRE actually brings that another model cannot
An SRE is not a DevOps engineer with a different title. The role exists specifically to treat reliability as a product property, not a side effect of individual engineering diligence. The practices that come with the role — SLOs, error budgets, blameless postmortems, runbook discipline, alert hygiene — are a toolkit that AI teams almost never reinvent on their own, because the incentive to ship features always wins the argument on a Tuesday morning.
Consider what changes when an SRE enters an AI org:
- Reliability becomes measurable. "The agent feels broken" becomes "p99 latency on the completion endpoint breached our 8-second SLO at 14:02 and the retrieval service's error rate is elevated." The second sentence is actionable; a sketch of the check behind it follows this list. The first one starts a group therapy session.
- Incidents leave a trail. A real incident timeline with a postmortem captures what changed, who noticed, how long it took, and what the class of failure was. After twelve of those, patterns emerge that no individual engineer can see.
- Alerts mean something. A recent industry study found that 44% of organizations had an outage in the past year directly linked to suppressed or ignored alerts. On AI teams, the rate is almost certainly higher, because the baseline signal-to-noise on a non-deterministic system is already bad. An SRE's first three months are usually a quiet war on alert noise.
- Runbooks exist. Not wiki pages labeled "runbook" that nobody has read since Q2. Actual runbooks: the prompt change rollback procedure, the fallback model activation path, the retrieval-index rebuild script, with owners and expected outcomes documented. The fallback path is sketched as code below.
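The first bullet is the easiest one to make concrete. Below is a minimal sketch of an error-budget burn check, assuming latency samples already land somewhere queryable; the 8-second threshold echoes the example above, and the target and burn-rate cutoff are illustrative, not recommendations:

```python
SLO_TARGET = 0.99          # 99% of completions finish under the threshold
LATENCY_THRESHOLD_S = 8.0  # mirrors the 8-second SLO in the first bullet

def burn_rate(latencies_s: list[float]) -> float:
    """How fast this window eats the error budget (1.0 = exactly on target)."""
    if not latencies_s:
        return 0.0
    slow = sum(1 for s in latencies_s if s > LATENCY_THRESHOLD_S)
    bad_fraction = slow / len(latencies_s)
    return bad_fraction / (1.0 - SLO_TARGET)  # budget = the 1% allowed to breach

def should_page(latencies_s: list[float], page_threshold: float = 2.0) -> bool:
    """Page a human only when budget burns at twice the sustainable rate;
    anything slower becomes a ticket, not a 3 AM page."""
    return burn_rate(latencies_s) >= page_threshold
```

The split is the part worth copying: fast burn pages a human, slow burn files a ticket. That one distinction is most of what an SRE means by alert hygiene.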
None of this requires machine learning expertise. It requires someone whose job is reliability, not whose job is features that happen to involve reliability as a sometimes-thing.
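To make the runbook bullet above concrete as well, here is the fallback-model activation path reduced to runnable form. The model names, the `ConfigStore` stand-in, and the audit line are assumptions for the sketch, not anyone's real routing layer:

```python
import time

PRIMARY_MODEL = "frontier-large"    # hypothetical identifiers
FALLBACK_MODEL = "workhorse-medium"

class ConfigStore:
    """Stand-in for whatever routing reads: etcd, a DB row, a feature flag."""
    def __init__(self) -> None:
        self._active = PRIMARY_MODEL

    def get(self) -> str:
        return self._active

    def set(self, model: str) -> None:
        self._active = model

def activate_fallback(store: ConfigStore, operator: str, reason: str) -> None:
    """The runbook as code: flip routing, verify it took, leave a trail."""
    store.set(FALLBACK_MODEL)                        # 1. flip routing
    assert store.get() == FALLBACK_MODEL             # 2. verify, don't assume
    print(f"{time.strftime('%H:%M')} {operator}: "   # 3. timestamped trail for
          f"fallback active, reason: {reason}")      #    the postmortem timeline
```

The value is not the dozen lines. It is that the 3 AM responder runs one documented, verified procedure instead of improvising against production.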
The leading indicators you've crossed the "small team can wing it" line
There is a period in every startup where a team of five can keep the whole system in their heads and winging incident response is fine. The interesting question is when that period ends. For AI products, it ends earlier than most teams realize, because the non-determinism expands the failure surface faster than the team can grow.
Watch for these signals:
- Incident volume is rising but time-to-detect is not improving. If you're responding faster but incidents are more frequent, you're getting better at firefighting, not better at fire prevention. An SRE will redirect energy from the former to the latter.
- The same postmortem themes repeat across unrelated incidents. "We didn't notice for forty minutes because nobody watches that dashboard." "The rollback took two hours because the deployment pipeline doesn't support partial reverts." "Nobody knew whose alert that was." When the same root causes show up in three different postmortems, they are system-level gaps, not individual mistakes.
- On-call is informal and concentrated on one or two heroes. You have a Slack channel where people post when something breaks, and two senior engineers do 80% of the work because they happen to know the system best. This is both a reliability risk and a retention risk. Heroes burn out. Then you have no on-call at all.
- Engineers negotiate prompt edits in DM. When the change-management process for a load-bearing prompt is "ping Sarah, see if she has time to look at it," you have no change-management process.
- Your customers tell you about regressions before your telemetry does. This is the clearest sign. The gap between user-observable failure and internal detection is the gap an SRE exists to close; one way to start measuring it is sketched below.
Any one of these alone is survivable. Three or more simultaneously means the informal operational model has run out of runway and is quietly costing the business more than anyone is tracking.
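Closing that last gap does not take a platform team. A minimal sketch of a post-deploy regression check, assuming error counts are available for a pre-deploy baseline window and a post-deploy window; the 3x ratio and the sample minimum are illustrative:

```python
def regression_detected(baseline_errors: int, baseline_total: int,
                        current_errors: int, current_total: int,
                        ratio_threshold: float = 3.0,
                        min_samples: int = 50) -> bool:
    """True when the post-deploy error rate is several times the pre-deploy
    baseline and there is enough traffic to trust the signal."""
    if baseline_total < min_samples or current_total < min_samples:
        return False  # too little traffic to separate noise from regression
    baseline_rate = max(baseline_errors / baseline_total, 1e-6)  # avoid div by zero
    current_rate = current_errors / current_total
    return current_rate / baseline_rate >= ratio_threshold
```

Wire that to every prompt and model deploy event, and "the customer told us first" turns into a time-to-detect number the team can actually drive down.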
Why another applied scientist is the most common version of the wrong move
When the pager starts going off at 3 AM, the instinct on an AI team is almost never "we need an SRE." It is "we need to make the model better so this doesn't happen." So the next hire is another applied scientist.
This is usually the wrong move, for three reasons.
First, applied scientists are expensive, and very little of their work lands on the reliability axis. They are trained to push eval scores, not to design on-call rotations. You end up paying frontier-model-tier compensation for someone who will, by training and preference, spend most of their time on things that do not reduce production pain.
Second, adding modeling capacity to a team with a weak operational foundation produces more shipped features riding on the same broken foundation. The incident rate goes up, not down, because there is more surface to fail on. Six months later the team is in the same place with more debt.
Third — and this is the hardest one for founders to hear — the "we need model expertise" framing is often a way of avoiding the less glamorous truth: the team doesn't have a reliability problem because the model isn't good enough. The team has a reliability problem because nobody's job is reliability. The model is a scapegoat for an organizational gap.
The counterargument is usually "we don't have the budget for a full SRE hire." Fair, but the actual question is: what are you currently paying in incident response, customer trust, and engineering morale, and is the marginal dollar better spent on a fifth applied scientist or a first SRE? For most teams that have shipped past the prototype stage, the SRE is the better dollar.
A transition framework: embedded SRE vs. training ML engineers in operations
There are two reasonable paths once a team decides to take reliability seriously, and the right choice depends on team size and growth trajectory.
Embed an SRE early (team of 8–20). Hire one full-time SRE and embed them in the AI engineering team, not in a separate reliability org. Their first ninety days go to establishing SLOs for the three most user-visible metrics, instrumenting the gap between prompt edit and deployment, replacing the "#incidents" channel with a real incident management tool, and writing the first three runbooks with the engineers who own the corresponding systems. By day 180 they are running postmortem reviews on a schedule and the team's on-call rotation has formal handoffs.
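What the day-one SLO deliverable might look like, written as data an alerting pipeline can consume rather than as a doc; the metric names, targets, and windows are placeholders, and the quality objective assumes some sampled offline eval suite already exists:

```python
# Placeholder SLOs for the three most user-visible metrics of a
# hypothetical AI product; swap names and targets for your own telemetry.
SLOS = [
    {"name": "completion-latency",
     "objective": "99% of completions finish under 8s",
     "target": 0.99, "window_days": 28},
    {"name": "completion-availability",
     "objective": "99.9% of requests return a non-error response",
     "target": 0.999, "window_days": 28},
    {"name": "answer-quality",
     "objective": "95% of sampled responses pass the eval suite",
     "target": 0.95, "window_days": 28},
]
```

The third entry is where AI products diverge from classic services: quality is an SLO too, measured by sampled evals rather than status codes.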
Train ML engineers in operational discipline (team of 3–8). At this size a dedicated SRE hire is hard to justify. Instead, designate one ML engineer as the operational owner for a quarter — not permanently, but long enough to stand up the minimum viable version of the practices above. Send them to work with SRE-cultured teams if you can borrow time from another part of the company. Treat the output — SLOs, runbooks, incident process — as their deliverable, not a side project.
The failure mode in both cases is the same: treating operations as volunteer work. SRE practices compound over years. They do not survive a team where nobody's performance review mentions them.
The model is not the bottleneck anymore
A few years ago, the quality of the underlying model was the binding constraint for most AI products. It often isn't now. The binding constraint for teams that have shipped past the prototype stage is whether they can operate a non-deterministic system under production load without the whole org being on fire every week. That is a solved problem, but the solution does not live in the model catalog. It lives in a set of practices that predate the current AI cycle by two decades, and the people who know how to install those practices are sitting in traditional infrastructure orgs while AI teams page each other at 2 AM.
The move that actually changes the trajectory is not the next model upgrade. It is the hire that treats your production system like the production system it already is.
- https://rootly.com/ai-sre-guide
- https://incident.io/blog/sre-tools-reliability-practices-2026
- https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://deepchecks.com/llm-production-challenges-prompt-update-incidents/
- https://www.businesswire.com/news/home/20260406439955/en/New-Study-Finds-Alert-Fatigue-Has-Become-a-Production-Reliability-Risk-and-Incident-Response-Alone-Is-No-Longer-Enough
- https://www.srepath.com/starting-sre-at-startups-and-smaller-organizations/
- https://sre.google/sre-book/introduction/
