27 posts tagged with "sre"

AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

· 12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your oncall" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes carry consequences a customer-service chatbot hallucination never does.

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

· 9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.
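The "green dashboard, wrong answers" shift can be made concrete with a minimal sketch. Assume some sampled grader scores AI outputs in [0, 1] (the grader, class name, and thresholds here are illustrative, not from the posts); the alert fires on a rolling quality score rather than on error rate, which stays flat throughout:

```python
from collections import deque


class QualityScoreAlert:
    """Rolling-window alert on sampled output quality, not error rate.

    Assumed setup: an offline grader scores a sample of AI responses
    in [0, 1]; we page when the rolling mean drops below a threshold,
    even though every request returned HTTP 200.
    """

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Record one graded sample; return True if the alert should fire."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.threshold


alert = QualityScoreAlert(window=5, threshold=0.9)
for s in [1.0, 1.0, 0.2, 0.3, 0.1]:  # hallucinations start mid-window
    firing = alert.record(s)
print(firing)  # True: quality drifted while HTTP metrics stayed green
```

The point of the sketch is the signal source, not the math: the pager input is a sampled judgment about answer quality, something no log line or stack trace emits on its own.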

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.
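The disconnect above can be sketched as two SLIs evaluated on the same request: a traditional availability SLI that passes, and a hypothetical correctness SLI that catches the wrong answer. All names, fields, and thresholds here are assumptions for illustration; in practice the correctness check would be a sampled human or model-graded evaluation, not an exact-match against a reference:

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency
    body: dict         # parsed JSON payload


def availability_sli(r: Response) -> bool:
    # Traditional SLI: HTTP 200, within latency budget, valid JSON.
    return r.status == 200 and r.latency_ms < 500 and r.body is not None


def correctness_sli(r: Response, reference: dict) -> bool:
    # Hypothetical quality SLI: compare against a known-good reference.
    # (Stand-in for a sampled grading pipeline; exact match is too strict
    # for real free-form outputs.)
    return r.body.get("answer") == reference.get("answer")


# The failure mode described above: execution succeeds, answer is wrong.
r = Response(status=200, latency_ms=180, body={"answer": "hallucinated spec"})
ref = {"answer": "actual spec"}
print(availability_sli(r))      # True  -> dashboard is green
print(correctness_sli(r, ref))  # False -> the user sees a wrong answer
```

Splitting the two makes the error-budget question explicit: a non-deterministic system needs a budget for "executed correctly, answered wrongly" that is tracked separately from availability.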